Predicting Customer Churn with Logistic Regression in Python

Ujang Riswanto
14 min read · Jan 12, 2025


Photo by Hitesh Choudhary on Unsplash

Let’s start with the basics — what exactly is customer churn? It’s when a customer stops doing business with a company. Whether it’s a streaming service subscription that wasn’t renewed or a once-loyal shopper who’s now ghosting their favorite store, churn is bad news for businesses.

Why does it matter? Simple: keeping existing customers is way cheaper (and usually easier) than finding new ones. By predicting which customers are likely to leave, businesses can step in with retention strategies — think discounts, personalized offers, or a quick nudge to remind customers why they signed up in the first place.

Now, let’s talk about the star of the show: logistic regression. If you’ve heard the phrase “binary classification” tossed around in data science, this is one of the go-to tools for the job. Logistic regression helps us answer “yes or no” questions, like: Is this customer likely to churn? It’s straightforward, effective, and perfect for tackling problems like this.

In this article, we’ll walk you through how to use logistic regression in Python to predict customer churn. Whether you’re a beginner or just brushing up your skills, you’re in the right place. Ready? Let’s dive in!

Understanding the Dataset

Photo by Anastassia Anufrieva on Unsplash

Before we can predict anything, we need a good dataset. So, what does a typical churn dataset look like? It usually includes things like:

  • Customer demographics: Age, gender, location, etc.
  • Account info: How long they’ve been a customer, their subscription type, or their payment method.
  • Usage patterns: How often they use the service, the number of purchases, or time spent interacting with the product.
  • Support interactions: How many times they’ve contacted support and how those interactions went.

Think of these features as puzzle pieces. Together, they help us understand the bigger picture: why some customers stay and why others leave.

Where Do These Datasets Come From?

If you’re following along, you can grab publicly available datasets like the Telco Customer Churn Dataset on Kaggle. Businesses, on the other hand, usually pull this data from their customer relationship management (CRM) systems, support logs, and product analytics.

Prepping the Data for Action

Raw data is rarely perfect. Here’s what we’ll do to clean things up:

  1. Fill in the blanks: Missing values are common, but we’ll handle them by filling them with the median, mode, or something else that makes sense.
  2. Tidy up the categories: Features like “Payment Method” might be words (e.g., “Credit Card,” “PayPal”), but models need numbers. We’ll use one-hot encoding to fix that.
  3. Scale it down: Some columns, like monthly charges, might have values ranging from $10 to $500. Scaling these to a consistent range ensures all features get fair treatment when building the model.

By the end of this step, we’ll have a clean, structured dataset that’s ready to feed into our logistic regression model. Think of it as setting the stage for a smooth prediction process.
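To make that concrete, here’s a minimal sketch of what the prep can look like once the environment from the next section is set up. It assumes the Kaggle Telco dataset saved locally as telco_churn.csv; the file name and column names (customerID, TotalCharges, MonthlyCharges, tenure, Churn) come from that dataset, so adjust them to your own data:

import pandas as pd
from sklearn.preprocessing import StandardScaler

# Load the raw data (the file name is an assumption -- point it at your own CSV)
data = pd.read_csv('telco_churn.csv')
data = data.drop(columns=['customerID'])  # an ID column carries no predictive signal

# 1. Fill in the blanks: coerce TotalCharges to numeric and impute missing values with the median
data['TotalCharges'] = pd.to_numeric(data['TotalCharges'], errors='coerce')
data['TotalCharges'] = data['TotalCharges'].fillna(data['TotalCharges'].median())

# Turn the target into 0/1 so the model and metrics work with numbers
data['Churn'] = data['Churn'].map({'No': 0, 'Yes': 1})

# 2. Tidy up the categories: one-hot encode every remaining text column
categorical_cols = data.select_dtypes(include='object').columns
data = pd.get_dummies(data, columns=categorical_cols, drop_first=True)

# 3. Scale it down: put wide-ranging numeric columns on a comparable scale
scaler = StandardScaler()
num_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
data[num_cols] = scaler.fit_transform(data[num_cols])

(In a stricter workflow you’d fit the scaler on the training split only to avoid leaking information from the test set, but this keeps the walkthrough simple.)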

Setting Up the Python Environment

Photo by Mohammad Rahmani on Unsplash

Alright, let’s get our hands dirty with some Python! To predict customer churn, we’ll need a few essential tools in our coding toolkit. Think of these libraries as your trusty sidekicks — they make the whole process a breeze.

Libraries You’ll Need

Here’s the lineup:

  • pandas: For organizing and analyzing your data (think Excel but way cooler).
  • numpy: For crunching numbers like a pro.
  • matplotlib & seaborn: To make your data look good with charts and graphs.
  • scikit-learn: The all-in-one library for building and evaluating our logistic regression model.

Getting Everything Installed

If you don’t already have these libraries installed, no worries — it’s super easy. Just open up your terminal or command prompt and run:

pip install pandas numpy matplotlib seaborn scikit-learn

If you’re using Jupyter Notebook, make sure to restart your kernel after installing these packages so everything works smoothly.

Why These Tools?

Here’s the deal:

  • pandas helps you clean and explore your dataset.
  • seaborn and matplotlib make it easier to spot trends and patterns.
  • And scikit-learn? It’s where the magic happens. From training our model to evaluating its performance, this library has got you covered.

With this setup, we’re ready to dive into the fun part: analyzing the data and building a model. Buckle up — it’s going to be an exciting ride!

Exploratory Data Analysis (EDA)

Photo by Scott Graham on Unsplash

Before we jump into modeling, let’s get to know our data a little better. Think of EDA as the “first date” with your dataset — it’s where we uncover patterns, spot weird stuff, and figure out what’s really going on.

Step 1: Visualizing Churn Rates

First up, let’s check out how many customers are actually churning. Are we dealing with a 50/50 split, or is churn a rare event? A simple bar chart or pie chart can give us the answer.

import matplotlib.pyplot as plt

# Example: Churn distribution
churn_counts = data['Churn'].value_counts()
churn_counts.plot(kind='bar', color=['skyblue', 'salmon'])
plt.title('Churn vs. Retained Customers')
plt.show()

This helps us see if our dataset is balanced or if we’re dealing with an imbalance (hint: most churn datasets are imbalanced).
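If you’d rather see the exact split than eyeball a chart, a quick one-liner prints the proportions:

# Share of retained vs. churned customers (proportions rather than raw counts)
print(data['Churn'].value_counts(normalize=True))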

Step 2: Digging Into Features

Now, let’s see how individual features relate to churn. For example:

  • Do higher monthly charges lead to more churn?
  • Are customers with shorter contracts more likely to leave?

Heatmaps and pairplots are your best friends here. They help visualize correlations and relationships between features.

import seaborn as sns

# Example: Correlation heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(data.corr(numeric_only=True), annot=True, cmap='coolwarm')
plt.title('Feature Correlations')
plt.show()

Step 3: Spotting Insights

Here’s where you’ll start to notice interesting trends. For instance:

  • Customers with high monthly charges might churn more.
  • People on month-to-month plans could be more likely to leave compared to those with annual contracts.

These insights will guide us in building a model that knows what to focus on.

Step 4: Cleaning Up the Weird Stuff

Sometimes, EDA reveals oddities like outliers or missing data. Use this time to clean things up:

  • Outliers: Consider removing or capping extreme values if they don’t make sense (one common approach is sketched just after this list).
  • Missing Data: Fill in gaps using techniques like median imputation or dropping rows with too many missing values.
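For the outlier piece, one common rule of thumb is to cap values that fall outside 1.5 times the interquartile range. Here’s a quick sketch, assuming a numeric column like MonthlyCharges:

# Cap extreme values using the 1.5 * IQR rule of thumb
q1 = data['MonthlyCharges'].quantile(0.25)
q3 = data['MonthlyCharges'].quantile(0.75)
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
data['MonthlyCharges'] = data['MonthlyCharges'].clip(lower=lower, upper=upper)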

By the end of this step, you’ll have a much better understanding of your dataset and a list of hypotheses to test with your logistic regression model. It’s like piecing together a puzzle before seeing the full picture!

Building the Logistic Regression Model

Photo by Markus Spiske on Unsplash

Alright, the stage is set, and now it’s time for the main event: building our logistic regression model. This is where we’ll train the model to recognize patterns in the data and predict which customers might churn. Let’s break it down step by step.

Step 1: Splitting the Data

First things first, we need to split our data into two parts:

  • Training data: This is where the model learns.
  • Testing data: This is where we see how well it performs on unseen data.

We’ll use an 80/20 split — 80% for training and 20% for testing. Here’s how:

from sklearn.model_selection import train_test_split

# Define features (X) and target (y)
X = data.drop('Churn', axis=1)
y = data['Churn']

# Split the data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Step 2: Training the Model

Now, let’s bring in logistic regression and train it on the data. This part is as simple as it gets with scikit-learn:

from sklearn.linear_model import LogisticRegression

# Initialize and train the model
model = LogisticRegression(max_iter=1000) # Increase max_iter if the solver doesn't converge
model.fit(X_train, y_train)

Boom! You’ve trained your first logistic regression model.

Step 3: Making Predictions

Once the model is trained, we can use it to predict churn for the test data.

# Predict churn for test data
y_pred = model.predict(X_test)

Step 4: Evaluating Performance

Let’s see how well our model did. We’ll use a mix of metrics to get a full picture:

  • Accuracy: How often the model gets it right.
  • Precision & Recall: How well it identifies churners without false alarms.
  • F1 Score: The sweet spot between precision and recall.
  • ROC-AUC: How well the model separates churners from non-churners.

Here’s the code for evaluation:

from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score

# Evaluate the model
print("Accuracy:", accuracy_score(y_test, y_pred))
print("Precision:", precision_score(y_test, y_pred))
print("Recall:", recall_score(y_test, y_pred))
print("F1 Score:", f1_score(y_test, y_pred))
print("ROC-AUC:", roc_auc_score(y_test, model.predict_proba(X_test)[:, 1]))

What Did We Learn?

This step gives us the first glimpse of how well the model works. If the scores look good — awesome! If not, don’t worry. We can tweak things like hyperparameters or features to improve performance (we’ll get into that in the next section).

At this point, you’ve got a trained model that can predict churn. Not bad, right?

Fine-Tuning the Model

Photo by charlesdeluvio on Unsplash

So, you’ve built your logistic regression model, and it’s making decent predictions — but “decent” isn’t always enough. Let’s level up by fine-tuning the model and squeezing out the best possible performance.

Step 1: Hyperparameter Optimization

Every model has settings called hyperparameters that control how it learns. For logistic regression, the two big ones are:

  • Regularization strength (C): This balances how closely the model fits the training data against how well it generalizes; smaller values of C mean stronger regularization.
  • Penalty Type (L1 or L2): This controls how the model handles features. L1 can shrink less important coefficients all the way to zero, which also works as a kind of feature selection.

To find the best settings, we can use Grid Search or Random Search. Here’s an example with Grid Search:

from sklearn.model_selection import GridSearchCV

# Define hyperparameters to test
param_grid = {
    'C': [0.01, 0.1, 1, 10],
    'penalty': ['l1', 'l2'],
    'solver': ['liblinear']  # liblinear supports both l1 and l2 penalties
}

# Set up Grid Search
grid_search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='roc_auc')
grid_search.fit(X_train, y_train)

# Best hyperparameters
print("Best Parameters:", grid_search.best_params_)

Once you’ve found the sweet spot, retrain the model with these optimized settings.
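Conveniently, GridSearchCV does that retraining for you: with the default refit=True it refits the model on the full training set using the winning parameters, so you can grab the tuned model straight away:

# GridSearchCV refits on the whole training set with the best parameters (refit=True by default)
best_model = grid_search.best_estimator_
y_pred_tuned = best_model.predict(X_test)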

Step 2: Feature Selection

Not all features are created equal. Some are super useful, while others just add noise. We can use Recursive Feature Elimination (RFE) to figure out which features matter most.

from sklearn.feature_selection import RFE

# Set up RFE
selector = RFE(LogisticRegression(max_iter=1000), n_features_to_select=5) # Adjust number of features
selector.fit(X_train, y_train)

# Get the most important features
important_features = X_train.columns[selector.support_]
print("Selected Features:", important_features)

By focusing only on the most important features, we can improve the model’s performance and reduce complexity.
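Once you know which columns made the cut, retraining on just those is quick (a sketch, reusing the important_features from above):

# Retrain using only the features RFE kept
model_selected = LogisticRegression(max_iter=1000)
model_selected.fit(X_train[important_features], y_train)
print("Accuracy on selected features:", model_selected.score(X_test[important_features], y_test))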

Step 3: Addressing Class Imbalance

If churners make up only a small fraction of your data (which is common), the model might favor predicting “no churn” because it’s the majority class. To fix this:

  • Use Class Weights: Add class_weight='balanced' when initializing the logistic regression model.
  • Oversample the Minority Class: Use techniques like SMOTE (from the imbalanced-learn package) to balance the training data, as shown below.

from imblearn.over_sampling import SMOTE

# Apply SMOTE to training data
smote = SMOTE(random_state=42)
X_train_sm, y_train_sm = smote.fit_resample(X_train, y_train)
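The class-weight route needs even less code: there’s no resampling step, you just tell logistic regression to weigh mistakes on the rare churn class more heavily.

# Same model, but errors on the minority (churn) class count more during training
weighted_model = LogisticRegression(max_iter=1000, class_weight='balanced')
weighted_model.fit(X_train, y_train)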

Step 4: Evaluating Improvements

After fine-tuning, rerun the evaluation metrics (accuracy, precision, recall, etc.) to see how much your tweaks improved the model. Don’t forget to check the ROC-AUC — it’s great for understanding overall performance.
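A compact way to do that comparison is scikit-learn’s classification_report, which prints precision, recall, and F1 for both classes in one table, here applied to the tuned best_model from the grid search:

from sklearn.metrics import classification_report

# Precision, recall, and F1 for both classes of the tuned model
print(classification_report(y_test, best_model.predict(X_test)))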

Why Fine-Tuning Matters

Fine-tuning isn’t just about squeezing out better numbers. It helps create a model that’s more reliable and better aligned with real-world needs. With a well-tuned logistic regression model, you’ll be catching potential churners left and right like a pro!

Interpreting Model Results

Photo by Campaign Creators on Unsplash

So, your model is up and running, and it’s spitting out predictions. That’s awesome! But numbers alone aren’t enough — we need to dig deeper to understand why the model is making those predictions. Let’s break it down.

Step 1: Understanding Coefficients

In logistic regression, each feature gets a coefficient that tells us how strongly it influences the outcome (churn or not). Positive coefficients increase the likelihood of churn, while negative ones reduce it.

Here’s how to check the coefficients:

import numpy as np
import pandas as pd

# Get feature names and coefficients
coefficients = pd.DataFrame({
    'Feature': X_train.columns,
    'Coefficient': model.coef_[0]
})

# Sort by impact
coefficients = coefficients.sort_values(by='Coefficient', ascending=False)
print(coefficients)

For example:

  • If MonthlyCharges has a positive coefficient, higher charges are linked to more churn.
  • If Tenure has a negative coefficient, longer customer relationships reduce churn risk.

You can also convert coefficients into odds ratios for a more intuitive understanding:

coefficients['Odds Ratio'] = np.exp(coefficients['Coefficient'])

Step 2: Visualizing Results

Data is easier to digest with visuals. Here are a couple of key plots:

  • Confusion Matrix: This shows how many churners were correctly predicted versus missed.

from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Generate and plot confusion matrix
cm = confusion_matrix(y_test, y_pred)
ConfusionMatrixDisplay(cm).plot(cmap='Blues')
plt.show()
  • ROC Curve: This tells you how well the model separates churners from non-churners.

from sklearn.metrics import roc_curve, auc

# Generate ROC curve
y_pred_prob = model.predict_proba(X_test)[:, 1]
fpr, tpr, _ = roc_curve(y_test, y_pred_prob)
roc_auc = auc(fpr, tpr)

# Plot ROC curve
plt.plot(fpr, tpr, label=f'ROC Curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], 'r--') # Random guess line
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('ROC Curve')
plt.legend()
plt.show()

Step 3: Explaining Predictions

For real-world use, it’s often important to explain individual predictions. Tools like SHAP (SHapley Additive exPlanations) help show the contribution of each feature to a specific prediction.

# Install with: pip install shap
import shap

# Initialize SHAP explainer
explainer = shap.Explainer(model, X_test)
shap_values = explainer(X_test)

# Visualize for one customer
shap.plots.waterfall(shap_values[0])

Why Interpretation Matters

Understanding the “why” behind predictions isn’t just for fun — it helps build trust in the model. Plus, these insights can guide business decisions. For instance, if high MonthlyCharges is a major churn driver, the company could introduce discounts or loyalty perks to retain customers.

By interpreting the results, you’re not just building a predictive model — you’re creating actionable insights that can make a real difference.

Deploying the Model

Photo by Lavi Perchik on Unsplash

Alright, you’ve built and fine-tuned your churn prediction model, but now it’s time to make it live. Deploying the model means putting it into the real world where it can start making predictions for actual users. Let’s talk about how to get your model out of the Jupyter notebook and into an application or a website.

Step 1: Pick Your Deployment Strategy

There are a few different ways you can deploy a model, depending on your needs:

  • Batch Processing: If churn predictions are needed once a day or week, you could schedule the model to run in batch mode, generating predictions for a group of customers at a time.
  • Real-Time Predictions: If you need predictions in real-time (e.g., as customers interact with your platform), you’ll need a live API. This is where frameworks like Flask or FastAPI come in handy.

Let’s go with the real-time API for this example.

Step 2: Save Your Model

Before deploying, you’ll need to save your trained model so it can be loaded in the deployment environment. The best way to do this is with joblib or pickle. Here’s how:

import joblib

# Save the model
joblib.dump(model, 'churn_predictor.pkl')

Now you’ve got a file that contains the trained model, ready for action.

Step 3: Set Up the API

Now let’s build a simple API using Flask. This API will take customer data, run it through the churn prediction model, and return the prediction (yes, the customer will churn or no, they won’t).

  1. Install Flask:

pip install flask

  2. Create the API: Here’s a super basic Flask app to get you started. It listens for POST requests and uses the trained model to predict churn.

from flask import Flask, request, jsonify
import joblib
import numpy as np

# Load the trained model saved earlier
model = joblib.load('churn_predictor.pkl')

# Initialize Flask app
app = Flask(__name__)

# Define a route for making predictions
@app.route('/predict', methods=['POST'])
def predict():
    # Get data from the POST request
    data = request.get_json()

    # Convert the data into the shape the model expects (in practice, apply the same
    # encoding and scaling used during training before calling predict)
    customer_data = np.array(data['features']).reshape(1, -1)

    # Make prediction
    prediction = model.predict(customer_data)

    # Return prediction (cast to int so it's JSON-serializable)
    result = {'churn': int(prediction[0])}
    return jsonify(result)

# Run the app
if __name__ == '__main__':
    app.run(debug=True)

In this code, when a POST request is made to the /predict endpoint, the API takes the features from the request, passes them through the model, and returns whether the customer will churn.

Step 4: Test the API Locally

Run the Flask app:

python app.py

You can now send a POST request to http://127.0.0.1:5000/predict with customer data to get predictions. Here’s an example of what the request might look like:

{
  "features": [25, 1, 5, 0, 30, 60]
}

Here the feature values are just example numbers. The response might look like this, where 1 means the customer is likely to churn:

{
  "churn": 1
}
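To fire that request from Python, the requests library works nicely (a sketch; the feature values are placeholders and need to match the columns and preprocessing your model was trained with):

import requests

# Send one customer's feature vector to the local API
payload = {"features": [25, 1, 5, 0, 30, 60]}
response = requests.post("http://127.0.0.1:5000/predict", json=payload)
print(response.json())  # e.g. {'churn': 1}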

Step 5: Deploy to the Cloud

To make your model accessible to the world (or at least your business), you’ll need to host it somewhere. Popular cloud platforms like Heroku, AWS, or Google Cloud make it easy to deploy APIs. For instance, with Heroku, you can deploy in just a few steps:

  1. Install the Heroku CLI.
  2. Create a Procfile in your project directory (it tells Heroku how to run your app). Heroku assigns the port itself, so use a production WSGI server like gunicorn rather than running python app.py directly, and list your dependencies (including flask and gunicorn) in a requirements.txt:

web: gunicorn app:app

3. Deploy using Git:

git init
git add .
git commit -m "Initial commit"
heroku create
git push heroku master

Once deployed, your model will be accessible via a URL like https://your-app-name.herokuapp.com/predict.

Step 6: Automating Retraining

Now that your model is live, don’t forget to monitor its performance. Over time, customer behavior might change, and your model’s predictions could become less accurate. To avoid this, you can set up automatic retraining (e.g., monthly or quarterly) using new customer data to keep the model fresh. Tools like Airflow or Kubeflow can help automate this process.

Why Deployment Matters

Deploying your model brings all the hard work to life. Instead of just having cool predictions sitting in a notebook, you’re now making real-time, data-driven decisions that can help reduce churn and boost customer retention.

And there you have it! Your churn prediction model is ready for action in the real world. Time to show the business how predictive analytics can make a real difference! 🚀

Wrapping It All Up

You’ve made it to the finish line! 🎉 By now, you’ve not only built a logistic regression model to predict customer churn, but you’ve also fine-tuned it, evaluated its performance, and dug deep into what’s driving those predictions. Let’s recap the journey and talk about the next steps.

What We Did

  1. Explored the Dataset: We got to know the data inside out — churn trends, correlations, and those sneaky missing values.
  2. Built the Model: Using Python and scikit-learn, we trained a logistic regression model to predict churn.
  3. Fine-Tuned for Success: Tweaked hyperparameters, balanced the data, and focused on important features to boost performance.
  4. Interpreted the Results: Made sense of the coefficients and used tools like confusion matrices and SHAP to explain predictions.

Why It Matters

Predicting churn isn’t just a cool data science project — it’s a game-changer for businesses. By identifying at-risk customers early, companies can step in with retention strategies like discounts, personalized offers, or better support. This isn’t just about saving revenue — it’s about building stronger, long-term relationships with customers.

What’s Next?

Want to take it further? Here are a few ideas:

  • Deploy the Model: Turn this into a real-world application using tools like Flask or FastAPI. Imagine a dashboard where team members can see churn predictions in real-time.
  • Automate the Workflow: Use pipelines to automate data cleaning, model training, and evaluation (see the sketch after this list). It’s a huge time-saver.
  • Test Other Models: Logistic regression is great, but you could try more complex models like Random Forests, Gradient Boosting, or Neural Networks to see if they improve accuracy.
  • Combine with Business Strategy: Work with your company’s marketing or customer success teams to act on the insights from your model.
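For the pipeline idea, here’s a minimal sketch using scikit-learn’s Pipeline and ColumnTransformer; the column names are hypothetical stand-ins for your own numeric and categorical features:

from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Hypothetical column groups -- swap in the ones from your own dataset
numeric_cols = ['tenure', 'MonthlyCharges', 'TotalCharges']
categorical_cols = ['Contract', 'PaymentMethod']

preprocess = ColumnTransformer([
    ('num', Pipeline([('impute', SimpleImputer(strategy='median')),
                      ('scale', StandardScaler())]), numeric_cols),
    ('cat', OneHotEncoder(handle_unknown='ignore'), categorical_cols),
])

# One object that imputes, scales, encodes, and trains in a single fit call
churn_pipeline = Pipeline([
    ('prep', preprocess),
    ('clf', LogisticRegression(max_iter=1000, class_weight='balanced')),
])
# churn_pipeline.fit(X_train_raw, y_train)  # X_train_raw: the raw, un-encoded DataFrame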

Final Thoughts

Customer churn is a tough challenge, but with the right data and tools, it’s totally manageable. Logistic regression might seem simple compared to flashier machine learning models, but it’s powerful, interpretable, and gets the job done.

Congrats on building a solid foundation! Now go out there and show churn who’s boss. 🚀
