Analyzing Health Data with Logistic Regression in Python
When it comes to healthcare, data is a big deal. Whether it’s predicting whether someone has diabetes or spotting which patients are likely to be readmitted to the hospital, health data is at the heart of making better decisions. But here’s the thing: data alone doesn’t do much unless you know how to make sense of it. That’s where tools like logistic regression come in.
So, what’s the big deal about logistic regression? It’s like a superpower for answering yes-or-no questions with data. Think about it: does this patient have a disease or not? Will this treatment work or won’t it? Logistic regression helps us connect the dots between the numbers and real-world outcomes, which is crucial for improving patient care.
In this article, we’re going to walk you through how to use Python and logistic regression to analyze health data. Whether you’re a total beginner or just brushing up your skills, we’ll break things down step by step — no overly technical jargon, promise. By the end, you’ll be ready to dive into your own health data projects like a pro.
Understanding Logistic Regression
Let’s start with the basics: what is logistic regression, anyway? Think of it as the trusty sidekick to linear regression, but instead of predicting continuous numbers (like someone’s height), it helps us answer questions that have just two possible outcomes. For example:
- Does this patient have a certain disease (yes or no)?
- Will this medication work (effective or not)?
The magic of logistic regression lies in how it handles these binary decisions. Instead of giving you a straight-up “yes” or “no,” it predicts probabilities. For instance, it might tell you there’s a 75% chance a patient has a disease. You can then set a threshold (like 50%) to decide whether to classify it as “yes” or “no.”
Why Use Logistic Regression for Health Data?
Health data is full of these binary situations: positive vs. negative test results, high risk vs. low risk, admitted vs. discharged. Logistic regression is great because:
- It’s straightforward and easy to interpret.
- It works well when your data isn’t too complicated, that is, when each feature has a fairly direct relationship with the outcome.
- It gives meaningful insights about which factors (like age, blood pressure, or cholesterol) are driving the outcomes.
How Does It Work?
Imagine you’re plotting patient data on a graph — age on the x-axis and disease status (yes or no) on the y-axis. The data points might look scattered all over, so drawing a straight line (like you would with linear regression) doesn’t make sense. Instead, logistic regression draws an “S-shaped curve” that better fits the data and gives you probabilities for each outcome.
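If you want to see that curve for yourself, here’s a minimal sketch of the logistic (sigmoid) function behind it. The intercept and slope below are made up purely for illustration:
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    # Squash any real number into a probability between 0 and 1
    return 1 / (1 + np.exp(-z))

# Hypothetical model: log-odds of disease rise linearly with age
ages = np.linspace(20, 80, 100)
log_odds = -6 + 0.1 * ages  # made-up intercept and slope
probabilities = sigmoid(log_odds)

plt.plot(ages, probabilities)
plt.xlabel('Age')
plt.ylabel('Probability of disease')
plt.title('The S-shaped logistic curve')
plt.show()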
This simple yet powerful approach is why logistic regression is one of the go-to tools for healthcare professionals and data scientists alike.
By the end of this guide, you’ll not only understand how it works but also know how to build and use a logistic regression model to tackle real-world health problems. Let’s keep going!
Setting Up Your Python Environment
Alright, now that we’ve got the basics of logistic regression down, it’s time to get our hands dirty with Python. Don’t worry if you’re not a coding whiz — this part is super manageable, and I’ll walk you through it step by step.
What You’ll Need
Before diving into the code, make sure you have Python installed. If you don’t, grab it from python.org. You’ll also need some key libraries that do all the heavy lifting for us:
- numpy: For crunching numbers.
- pandas: For organizing and exploring your data.
- matplotlib and seaborn: For making cool charts.
- scikit-learn: The superstar library that handles logistic regression.
To install these, just run this command in your terminal or Jupyter Notebook:
pip install numpy pandas matplotlib seaborn scikit-learn
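To double-check that everything installed correctly, a quick sanity check that imports each library and prints its version doesn’t hurt:
# Sanity check: import each library and print its version
import numpy, pandas, matplotlib, seaborn, sklearn

for lib in (numpy, pandas, matplotlib, seaborn, sklearn):
    print(lib.__name__, lib.__version__)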
Where Do You Get the Data?
If you’ve already got a health dataset, great! If not, no worries — there are tons of free ones available online. For this tutorial, you could use something like the Pima Indians Diabetes Dataset from Kaggle. It’s a classic in health data analysis.
Here’s how you can load your data once you have it saved as a CSV file:
import pandas as pd
# Load the dataset
data = pd.read_csv('path_to_your_dataset.csv')
print(data.head()) # Check out the first few rows
Setting Up Your Workspace
You’ll want to work in a Python-friendly environment. Here are a couple of great options:
- Jupyter Notebook: Perfect for writing and running Python code in chunks. Install it with pip install notebook.
- VS Code or PyCharm: Great if you prefer a more traditional coding setup.
Once everything’s set up, you’re ready to start exploring your data and building your logistic regression model. In the next section, we’ll dive into cleaning and prepping the data so it’s ready for action. Let’s go!
Preprocessing Health Data
Before we jump into building a logistic regression model, we need to prep our data. Think of this as tidying up your kitchen before cooking — it makes everything run smoother and helps avoid disasters.
1. Data Cleaning
Health data can be messy. There might be missing values, duplicates, or outliers that could throw off your analysis. Here’s how you can clean things up:
- Check for Missing Values: Missing data is super common. You can either fill in the gaps (imputation) or drop the rows/columns if the missing data isn’t too much.
# Check for missing values
print(data.isnull().sum())
# Option 1: Drop rows with missing values
data = data.dropna()
# Option 2: Fill missing values with the mean
data['column_name'] = data['column_name'].fillna(data['column_name'].mean())
- Remove Duplicates: Double-check your data for duplicate entries and get rid of them.
data = data.drop_duplicates()
2. Feature Engineering
Sometimes the raw data isn’t enough, and you’ll need to create new features that add more value to your analysis. For instance, if you’re working with health data, you might calculate a BMI score from weight and height or categorize ages into groups.
# Example: Adding a BMI column (assumes Weight in kg and Height in cm)
data['BMI'] = data['Weight'] / (data['Height'] / 100) ** 2
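And for the age-grouping idea, pandas’ cut function does the trick. The bin edges below are arbitrary, so pick whatever ranges make sense for your data (this assumes your dataset has an Age column):
# Example: Binning ages into ordinal groups (bin edges are arbitrary)
data['AgeGroup'] = pd.cut(data['Age'], bins=[0, 30, 50, 120], labels=[0, 1, 2]).astype(int)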
Don’t forget to encode categorical variables (like “Male” and “Female”) into numbers, because logistic regression doesn’t play well with strings.
# Encode categorical variables
data['Gender'] = data['Gender'].map({'Male': 0, 'Female': 1})
3. Scaling the Data
Logistic regression works best when your features are on a similar scale. For example, if one feature is in the range of 0–1 and another is in the thousands, it can mess things up. Use StandardScaler to normalize your data:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
data[['Age', 'BMI', 'BloodPressure']] = scaler.fit_transform(data[['Age', 'BMI', 'BloodPressure']])
4. Splitting the Data
Finally, split your dataset into a training set (to build the model) and a test set (to evaluate it). A good rule of thumb is 80% for training and 20% for testing.
from sklearn.model_selection import train_test_split
X = data.drop('Target', axis=1) # Features
y = data['Target'] # Target variable
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Now your data is all cleaned, scaled, and ready to go. In the next section, we’ll finally build our logistic regression model and see some results. Let’s do this!
Building a Logistic Regression Model
Alright, the fun part is here — let’s build that logistic regression model! By now, your data is clean, prepped, and ready to roll. We’ll train the model, see how it works, and make some predictions.
Step 1: Import the Model
We’re using LogisticRegression from scikit-learn because it’s simple, powerful, and perfect for this task.
from sklearn.linear_model import LogisticRegression
# Initialize the model
model = LogisticRegression()
Step 2: Train the Model
Training the model is basically telling it, “Hey, here’s the data — figure out the patterns!” You’ll feed it your training set and let it do its thing.
# Train the model
model.fit(X_train, y_train)
print("Model training complete!")
Step 3: Make Predictions
Once the model is trained, it’s time to test it out on the test set. This is where we see how well it learned from the training data.
# Make predictions
y_pred = model.predict(X_test)
y_pred_proba = model.predict_proba(X_test)[:, 1] # Get probabilities
The predict method gives you hard classifications (like 0 or 1), while predict_proba tells you the probabilities (like 0.75 or 0.32).
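By default, predict uses a 0.5 cutoff, but you don’t have to stick with that. In a health setting where missing a true case is costlier than a false alarm, you might lower the threshold. Here’s a sketch with a hypothetical cutoff of 0.3:
# Classify as positive whenever the predicted probability clears a custom threshold
threshold = 0.3  # hypothetical value; tune it for your use case
y_pred_custom = (y_pred_proba >= threshold).astype(int)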
Step 4: Interpret the Coefficients
The beauty of logistic regression is that it’s not a black box — you can peek under the hood and see which features are influencing the outcome.
# Coefficients and feature importance
coefficients = model.coef_[0]
features = X_train.columns
for feature, coef in zip(features, coefficients):
    print(f"{feature}: {coef:.4f}")
Positive coefficients mean the feature increases the likelihood of the outcome, while negative ones mean it decreases the likelihood.
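One handy trick: since the coefficients live on the log-odds scale, exponentiating them gives you odds ratios, which are often easier to explain to a non-technical audience. A quick sketch:
import numpy as np

# Exponentiate each coefficient to get an odds ratio
for feature, coef in zip(features, coefficients):
    # Note: with standardized features, "one unit" means one standard deviation
    print(f"{feature}: odds ratio = {np.exp(coef):.2f}")
An odds ratio above 1 means the feature pushes the odds of the outcome up; below 1 means it pushes them down.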
Step 5: Test the Model
How well is your model doing? Let’s evaluate it with the test set. We’ll look at metrics like accuracy, precision, recall, and the F1 score.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score
# Evaluate the model
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
Bonus: Visualizing Results
Sometimes, seeing is believing. Let’s plot a ROC curve to visualize how well the model distinguishes between outcomes.
from sklearn.metrics import roc_curve, auc
import matplotlib.pyplot as plt
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, color='blue', lw=2, label=f'ROC curve (AUC = {roc_auc:.2f})')
plt.plot([0, 1], [0, 1], color='gray', linestyle='--')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.title('Receiver Operating Characteristic')
plt.legend(loc="lower right")
plt.show()
Boom! You’ve just built and tested your first logistic regression model. In the next section, we’ll dig into how to fine-tune and interpret these results even further. But for now, give yourself a high-five — you’re crushing it!
Evaluating Model Performance
Okay, so your logistic regression model is up and running — nice work! But how do you know if it’s any good? That’s where model evaluation comes in. Let’s break it down step by step and figure out how well your model is predicting outcomes.
1. Key Metrics to Know
When it comes to evaluating a model, there’s more to life than just accuracy. Here are the big players:
- Accuracy: The percentage of correct predictions. Great for balanced datasets but not so much for imbalanced ones.
- Precision: Out of all the positive predictions, how many were actually correct?
- Recall: Out of all the actual positives, how many did the model catch?
- F1 Score: The balance between precision and recall (useful when they’re at odds).
- ROC-AUC: A score that shows how well the model separates the classes (closer to 1 is better).
Here’s how you calculate them:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, roc_auc_score
# Calculate metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
roc_auc = roc_auc_score(y_test, y_pred_proba)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
print(f"ROC-AUC: {roc_auc:.2f}")
2. The Confusion Matrix
Think of the confusion matrix as a scorecard that breaks down your predictions:
- True Positives (TP): You said “yes,” and it was actually “yes.”
- True Negatives (TN): You said “no,” and it was actually “no.”
- False Positives (FP): You said “yes,” but it was actually “no.”
- False Negatives (FN): You said “no,” but it was actually “yes.”
Here’s how to visualize it:
from sklearn.metrics import confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["No", "Yes"], yticklabels=["No", "Yes"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
This matrix helps you see where your model is getting it right and where it’s tripping up.
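To connect the matrix back to the metrics from earlier, you can compute precision and recall from it by hand:
# Derive precision and recall straight from the confusion matrix
tn, fp, fn, tp = cm.ravel()  # scikit-learn lays out the 2x2 matrix as [[TN, FP], [FN, TP]]
print(f"Precision: {tp / (tp + fp):.2f}")  # of all predicted positives, how many were right
print(f"Recall: {tp / (tp + fn):.2f}")  # of all actual positives, how many we caught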
3. Visualizing the ROC Curve
The ROC curve is a super handy way to see how well your model separates the two classes. It plots the true positive rate (TPR) against the false positive rate (FPR) at different thresholds.
from sklearn.metrics import roc_curve
# Plot the ROC curve
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
plt.figure()
plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {roc_auc:.2f})", color="blue")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")  # Random guess line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.show()
A curve that hugs the top-left corner is what you’re aiming for — it means your model is doing a great job separating the classes.
4. What to Do If Your Model’s Not Performing Well
Sometimes, the results aren’t what you’d hoped for. That’s okay — it happens! Here are some tips to improve:
- Check for Overfitting: If your model does great on training data but bombs on test data, it might be overfitting. Try regularization or simplifying the model.
- Handle Imbalanced Data: If one class dominates the dataset, use techniques like oversampling, undersampling, or weighted classes (a quick sketch of both fixes follows this list).
- Feature Engineering: Maybe your features aren’t capturing enough information. Try creating new ones or removing irrelevant ones.
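For regularization and class weighting, scikit-learn has built-in options. Here’s a minimal sketch (C is the inverse of the regularization strength, so smaller means stronger; class_weight='balanced' reweights classes inversely to how often they appear):
from sklearn.linear_model import LogisticRegression

# Stronger regularization to fight overfitting (smaller C = stronger penalty)
regularized_model = LogisticRegression(C=0.1)
regularized_model.fit(X_train, y_train)

# Reweight classes automatically to counter imbalance
balanced_model = LogisticRegression(class_weight='balanced')
balanced_model.fit(X_train, y_train)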
And that’s it! You’ve learned how to measure your model’s performance and pinpoint areas for improvement. Next up, we’ll dive into a real-world case study to put all this theory into action. Let’s keep going!
Practical Case Study: Predicting Diabetes
Let’s put everything we’ve learned into practice! In this case study, we’ll use the Pima Indians Diabetes Dataset to predict whether a person is likely to have diabetes based on their health metrics. This dataset includes features like age, BMI, and blood pressure — perfect for logistic regression.
Step 1: Load the Data
First things first, let’s load the dataset and take a peek at what we’re working with.
import pandas as pd
# Load the dataset
data = pd.read_csv('diabetes.csv')
print(data.head()) # Show the first few rows
Step 2: Preprocess the Data
Next, clean up the data just like we talked about earlier: check for missing values, scale the numeric features, and split the data into training and test sets. One heads-up: in the Pima dataset, missing measurements are often recorded as 0 in columns like Glucose and BMI, so isnull() alone won’t flag them.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
# Check for missing values
print(data.isnull().sum())
# Feature scaling
scaler = StandardScaler()
scaled_features = scaler.fit_transform(data.drop('Outcome', axis=1))
# Split the data
X = pd.DataFrame(scaled_features, columns=data.columns[:-1])  # assumes 'Outcome' is the last column
y = data['Outcome']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)
Step 3: Train the Logistic Regression Model
Now, let’s train the model on the training set.
from sklearn.linear_model import LogisticRegression
# Initialize and train the model
model = LogisticRegression()
model.fit(X_train, y_train)
print("Model training complete!")
Step 4: Evaluate the Model
Once the model is trained, it’s time to see how well it performs. We’ll calculate the accuracy, precision, recall, F1 score, and plot a confusion matrix.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, confusion_matrix
import seaborn as sns
import matplotlib.pyplot as plt
# Make predictions
y_pred = model.predict(X_test)
# Metrics
accuracy = accuracy_score(y_test, y_pred)
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)
print(f"Accuracy: {accuracy:.2f}")
print(f"Precision: {precision:.2f}")
print(f"Recall: {recall:.2f}")
print(f"F1 Score: {f1:.2f}")
# Confusion matrix
cm = confusion_matrix(y_test, y_pred)
sns.heatmap(cm, annot=True, fmt="d", cmap="Blues", xticklabels=["No Diabetes", "Diabetes"], yticklabels=["No Diabetes", "Diabetes"])
plt.xlabel("Predicted")
plt.ylabel("Actual")
plt.title("Confusion Matrix")
plt.show()
Step 5: Interpret the Results
Once you’ve got the evaluation metrics and confusion matrix, it’s time to interpret what they mean. For example:
- Accuracy: How often is the model correct overall?
- Precision: When the model predicts diabetes, how often is it right?
- Recall: How many of the actual diabetes cases did the model catch?
If precision and recall are at odds, use the F1 score to find a balance.
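For reference, the F1 score is just the harmonic mean of precision and recall, which you can verify by hand:
# F1 is the harmonic mean of precision and recall
f1_by_hand = 2 * (precision * recall) / (precision + recall)
print(f"F1 (by hand): {f1_by_hand:.2f}")  # should match the f1_score output above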
Step 6: Visualize the ROC Curve
Finally, let’s plot the ROC curve to visualize the model’s performance across different thresholds.
from sklearn.metrics import roc_curve, auc
y_pred_proba = model.predict_proba(X_test)[:, 1] # Probabilities for the positive class
fpr, tpr, _ = roc_curve(y_test, y_pred_proba)
roc_auc = auc(fpr, tpr)
plt.figure()
plt.plot(fpr, tpr, label=f"ROC Curve (AUC = {roc_auc:.2f})", color="blue")
plt.plot([0, 1], [0, 1], linestyle="--", color="gray")  # Random guess line
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.title("ROC Curve")
plt.legend(loc="lower right")
plt.show()
What Did We Learn?
By the end of this case study, you’ve:
- Loaded and cleaned real-world health data.
- Built a logistic regression model.
- Evaluated the model’s performance using metrics and visualizations.
- Interpreted the results to make meaningful conclusions.
With this workflow, you’re now ready to tackle your own health data projects. Go ahead — test your skills, and see what insights you can uncover!
Wrapping It All Up
Phew, what a ride! Let’s take a moment to recap everything we’ve covered. By now, you’ve gotten a solid understanding of logistic regression and how to use it to analyze health data in Python. Let’s summarize the journey:
What We Learned
1. The Basics of Logistic Regression
- You now know it’s a classification tool, not a straight-line regression buddy.
- It predicts probabilities and helps answer “yes” or “no” questions, perfect for health data like predicting diabetes or heart disease.
2. Prepping the Data
- Cleaning messy data, scaling features, and splitting it into training and testing sets are essential steps.
- You’ve also explored the magic of feature engineering: turning raw data into something meaningful.
3. Building and Evaluating the Model
- With just a few lines of Python, you trained a logistic regression model that predicts outcomes based on health metrics.
- You learned how to evaluate its performance using metrics like accuracy, precision, recall, and even the F1 score.
4. Putting It Into Practice
- In the case study, we applied everything step by step, analyzing real-world health data and gaining actionable insights.
Why Logistic Regression Rocks
Logistic regression is like the Swiss Army knife of data analysis. It’s:
- Simple yet powerful: Great for small to medium-sized datasets.
- Interpretable: You can see exactly which features are driving predictions.
- Widely applicable: From healthcare to marketing, it’s useful in almost any field.
What’s Next?
Now that you’re comfortable with logistic regression, why stop there? Here are some ideas to level up:
- Try other datasets: Explore more health data, like predicting heart disease or hospital readmissions.
- Compare with other models: Test out decision trees, random forests, or neural networks to see how they stack up.
- Dive deeper into feature selection: Learn techniques like Lasso regression to zero in on the most important features.
- Experiment with hyperparameter tuning: Adjust settings like the regularization strength to optimize your model’s performance (a minimal sketch follows below).
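To get you started on that last idea, here’s a minimal tuning sketch using scikit-learn’s GridSearchCV; the parameter grid is just an example, so adjust it to your problem:
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Try several regularization strengths with 5-fold cross-validation
param_grid = {'C': [0.01, 0.1, 1, 10]}
grid = GridSearchCV(LogisticRegression(max_iter=1000), param_grid, cv=5, scoring='f1')
grid.fit(X_train, y_train)

print("Best C:", grid.best_params_['C'])
print("Best cross-validated F1:", round(grid.best_score_, 2))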
Final Thoughts
Logistic regression might not be the flashiest tool in the machine learning toolkit, but it’s reliable, versatile, and gets the job done — kind of like your favorite pair of sneakers. Whether you’re working on health data, customer behavior, or any other binary classification problem, this technique is a great place to start.
So, go ahead, put your newfound skills to work, and keep exploring the fascinating world of data analysis. Who knows? You might just uncover the next big insight! 🚀