There are many ways to check that the variance of a regression model's errors is constant across values of the X variables. If you're looking for a statistical test, you can use White's test or the Breusch-Pagan test, both of which are implemented in statsmodels. This post covers a visual way to check for heteroskedasticity using residual plots, after you've built your linear regression model.
You can access the full code in the canvas below, or read on for an in-depth, line-by-line explanation.
Set up: fit a linear regression model
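The model-fitting code below assumes a DataFrame `df` that already holds the iris data. If you need to build it yourself, one option is to load iris via scikit-learn and rename the columns to the snake_case names used here (this loading step is an assumption, not shown in the original post):

```python
import pandas as pd
from sklearn.datasets import load_iris

# Load iris as a DataFrame and rename columns to match petal_width / petal_length
iris = load_iris(as_frame=True)
df = iris.frame.rename(
    columns={
        "petal width (cm)": "petal_width",
        "petal length (cm)": "petal_length",
    }
)
print(df[["petal_width", "petal_length"]].head())
```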
```python
import statsmodels.api as sm

# Create X and y DataFrames
X = df[["petal_width"]]
y = df[["petal_length"]]

# Add a constant, per the statsmodels documentation
X_sm = sm.add_constant(X)

# Create the model, fit it, and print the results
mod_sm = sm.OLS(y, X_sm)
res_sm = mod_sm.fit()
print(res_sm.summary())
```
Now that the results are saved as
res_sm, we can plot the fitted values against the residuals.
Residual plot: fitted values vs. residuals using matplotlib
```python
import matplotlib.pyplot as plt

# Plot fitted values vs. residuals to check for heteroskedasticity
plt.scatter(res_sm.fittedvalues, res_sm.resid)
plt.xlabel('Fitted Values')
plt.ylabel('Residuals')
plt.axhline(y=0, color='r')
plt.show()
```
From the plot, we can see that the residuals seem evenly distributed across the fitted values. When examining this kind of plot, we're looking for any distinct, observable pattern among the residuals. For example, if the residuals systematically increase or decrease as the fitted values increase, the model may be missing an important linear or nonlinear relationship in the data.
A cone-like shape, as in the plot on the left, shows that the variance of the residuals increases as our X variable increases, indicating non-constant variance, or heteroskedasticity. The random scattering of points in the plot on the right shows that the variance of the residuals is constant across values of the X variable. Since the residual plot we generated earlier resembles the plot on the right (a random cloud of points with no discernible pattern), we can move forward with our regression analysis.
Alternative plotting functions
If you want to generate a few regression plots at once, including the one we created manually above, you can use the plot_regress_exog function from statsmodels:
```python
import matplotlib.pyplot as plt

fig = plt.figure(figsize=(8, 6))

# Create regression plots for the specified X variable
sm.graphics.plot_regress_exog(res_sm, 'petal_width', fig=fig)
plt.show()
```
The function takes in a fitted linear regression model, the name of an X variable (e.g. 'petal_width'), and a figure object, and produces 4 plots. From the top left, going clockwise:
- Fitted values vs. chosen X variable, including confidence intervals for each prediction
- Residuals vs. chosen X variable, which helps detect heteroskedasticity
- Component-Component Plus Residual (CCPR) plot, which accounts for the effects of the other X variables in the model when examining the relationship between the y variable and the chosen X variable
- Partial regression plot, which examines the relationship between the y variable and the chosen X variable while all other X variables are held constant
This function is even more useful for multiple linear regression models involving several X variables, in which you want to isolate the effects of one variable at a time.
Einblick is an agile data science platform that provides data scientists with a collaborative workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick customers include Cisco, DARPA, Fuji, NetApp and USDA. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.