blog.pierrehenry.be

How to Build Your First GLM in Python Without Getting Lost

Photo by Magdalena Grabowska on Unsplash

Alright, let’s dive right in. I want to walk you through a little data science learning project I’ve been working on—a GLM, or generalized linear model. If you’re just getting started with regression in Python, this is a great way to get your hands dirty without getting overwhelmed. We’ll keep it simple, practical, and I’ll show you exactly how I set things up, step by step.

What’s a GLM, Anyway?

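So, GLM stands for generalized linear model. It extends ordinary linear regression by letting you pick a distribution for the target (the family) and a link function that connects the linear predictor to the target’s mean. In this post we stick with the Gaussian family, so we’re basically talking about plain linear regression. The idea is to fit a model to some sample data, so think of it as a learning experiment. Nothing too fancy, but it’s a solid foundation for more complex stuff down the road.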

Loading and Exploring the Data

First things first, we need some data. I’m using a sample CSV file. Here’s how I load it up in Python using pandas:

import pandas as pd

# Load the CSV file
data = pd.read_csv('sample_data.csv')
print(data.head())

That’s it. If you’re familiar with pandas, this is super straightforward. We just read the CSV and take a quick look at the data.
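If you want to look a bit further than `head()`, a few more pandas calls cover most of the exploring. This is only a sketch: the inline CSV text below is a made-up stand-in for `sample_data.csv`, and the column names `feature1`, `feature2`, and `target` are assumed to match the ones used later in the post.

```python
import io

import pandas as pd

# Stand-in for sample_data.csv; the values are invented for illustration
csv_text = """feature1,feature2,target
1.0,0.5,2.1
2.0,,3.9
3.0,1.5,6.2
4.0,2.0,8.1
"""
data = pd.read_csv(io.StringIO(csv_text))

print(data.shape)         # number of rows and columns
print(data.dtypes)        # column types
print(data.isna().sum())  # missing values per column
print(data.describe())    # basic summary statistics
```

The missing-value count is the one I would check first, since rows with gaps usually need to be dropped or filled before fitting.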

Fitting the GLM

Now, let’s get to the core of it: fitting the model. The script is really small and quick—just a few lines to get the regression going. Here’s the basic flow:

  1. Load the data
  2. Process it (if needed)
  3. Fit the GLM
  4. Print the summary

Here’s what that looks like in code:

import statsmodels.api as sm

# Assume 'X' is your feature matrix and 'y' is your target variable
X = data[['feature1', 'feature2']]
y = data['target']

# Add a constant to the model (intercept)
X = sm.add_constant(X)

# Fit the GLM
model = sm.GLM(y, X, family=sm.families.Gaussian())
results = model.fit()

# Print the summary
print(results.summary())

And that’s pretty much it! You don’t need much to get a GLM regression up and running. It’s very straightforward if you’re even a little bit familiar with Python and pandas.

Visualizing the Results

For visualization, I like to use matplotlib. It’s a really nice library for plotting results. Here’s a quick example:

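import matplotlib.pyplot as plt

# Sort by feature1 so the fitted values are drawn left to right
order = data['feature1'].to_numpy().argsort()

plt.scatter(data['feature1'], y, label='Data')
# With two features in the model, this is not a straight line in feature1 alone
plt.plot(data['feature1'].iloc[order], results.fittedvalues.iloc[order], color='red', label='Fitted values')
plt.xlabel('Feature 1')
plt.ylabel('Target')
plt.legend()
plt.show()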

Taking It Up a Notch: A More Complex Pipeline

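Now, if you want to go a bit further, I’ve also built a slightly more complex project. It’s similar to the first one, but with a few extra steps: saving the fitted model with joblib, explaining it with SHAP, and generating a PDF report with ReportLab.

Here’s a quick rundown of the pipeline: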

import joblib
import shap
import statsmodels.api as sm
from reportlab.pdfgen import canvas

# Load data and build X and y as before...

# Fit the model
model = sm.GLM(y, X, family=sm.families.Gaussian())
results = model.fit()

# Save the fitted results
joblib.dump(results, 'glm_model.pkl')

# SHAP values for explainability
# Pass the fitted predict function; the unfitted GLM object is not callable
explainer = shap.Explainer(results.predict, X)
shap_values = explainer(X)
shap.summary_plot(shap_values, X)

# Generate a PDF report
c = canvas.Canvas("report.pdf")
c.drawString(100, 750, "GLM Regression Report")
c.save()

You can see the pipeline is a bit more involved, but still manageable. Using joblib is great for saving your models, and SHAP is super useful for understanding what’s going on under the hood. ReportLab lets you create a nice-looking PDF report, which is handy if you need to share results.

Building Robust Models

Whenever you’re working with linear or logistic regression, or really any generalized linear model, it’s important to know exactly what you need to build. Think through your requirements before you start writing code. Understand your data (which columns you have, their types, and where values are missing), and make sure your model is robust. If you don’t really know what data you have to process, you’re going to run into trouble.
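One way to turn "understand your data" into something you can run is a small check before fitting. The helper below is hypothetical (`check_model_inputs` is a name I made up, not part of pandas or statsmodels), and the checks are only a starting point, assuming numeric features and a numeric target.

```python
import pandas as pd


def check_model_inputs(data, feature_cols, target_col):
    """Return a list of human-readable problems; an empty list means OK."""
    problems = []
    required = list(feature_cols) + [target_col]

    missing_cols = [c for c in required if c not in data.columns]
    if missing_cols:
        problems.append(f"missing columns: {missing_cols}")
        return problems  # the remaining checks need every column present

    for col in required:
        if not pd.api.types.is_numeric_dtype(data[col]):
            problems.append(f"column {col!r} is not numeric")
        n_missing = int(data[col].isna().sum())
        if n_missing:
            problems.append(f"column {col!r} has {n_missing} missing value(s)")

    # One coefficient per feature plus the intercept
    if len(data) <= len(feature_cols) + 1:
        problems.append("too few rows to estimate every coefficient")
    return problems


# Invented example frames: one clean, one with a gap and a text column
good = pd.DataFrame({
    "feature1": [1.0, 2.0, 3.0, 4.0],
    "feature2": [0.1, 0.4, 0.2, 0.9],
    "target": [1.0, 2.1, 2.9, 4.2],
})
bad = pd.DataFrame({
    "feature1": [1.0, None, 3.0, 4.0],
    "feature2": ["a", "b", "c", "d"],
    "target": [1.0, 2.1, 2.9, 4.2],
})

print(check_model_inputs(good, ["feature1", "feature2"], "target"))
print(check_model_inputs(bad, ["feature1", "feature2"], "target"))
```

Returning a list of problems instead of raising on the first one means you see everything that is wrong with a file in one pass.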

“You don’t need much for doing a GLM regression, and it’s very straightforward.”

“It’s very important to know what you need to build, and then to build this in your Python script.”


Pierre-Henry Soria


#Beginner Tutorial #Data Science #Glm #Machine Learning #Python #Tech