Heteroscedasticity occurs when the variance of the residuals (errors) of a model is not constant. This can cause problems in the interpretation of the model's coefficients, as the ordinary least squares (OLS) estimator assumes homoscedasticity (constant variance of the residuals). If the variance is not constant, the OLS estimator may have unreliable results.
If your simple linear regression model exhibits heteroscedasticity, you can adjust the model to account for it in several ways. One way is to use weighted least squares (WLS) regression, which allows you to specify a weight for each data point. The weights can be chosen to downweight the influence of points with large residuals, which can help to stabilize the variance of the residuals and improve the model's accuracy.
In this example, we’ll calculate the weights based on the fitted values and residuals of an OLS model exhibiting heteroskedasticity. The full code can be found in the canvas below. Open and fork it to get started, or read on for the code explanation.
1. Install packages and create data
!pip install statsmodels import numpy as np import matplotlib.pyplot as plt import statsmodels.api as sm
Skip to step 2 if you already have an existing dataset. In our case, we generated fake data to match a particular pattern.
# Generate heteroskedastic data following the equation y = mx + b + e # where m is the slope, b is the intercept, and e is the error term np.random.seed(1) m = 1.5 b = 5 x = list(range(100)) * 3 var = [val**1.6 for val in x] e = np.random.normal(loc = 0, scale = [val**0.5 for val in var]) y = b + m * np.array(x) + e
In the above code snippet, we used NumPy to generate data according to the formula:
In the generation process, we have ensured that the variance, simulated by the error term,
e, increases as
X increases, resulting in heteroskedastic data. We can plot the data to confirm.
# Create scatter plot of data plt.scatter(x = x, y = y) plt.xlabel("x") plt.ylabel("y") plt.title("Heteroskedastic Data") plt.show()
From the plot, we can see that as x gets larger, the variance of y seems to increase as well. This is because of how we intentionally generated the data.
2. Build ordinary least squares (OLS) regression model
# Add constant according to statsmodels documentation X_sm = sm.add_constant(x) # Create model, fit, and print results mod_sm = sm.OLS(y, X_sm) res_sm = mod_sm.fit() res_sm.summary()
From the results, we can see that the OLS model accounts for 82.7% of the variance in y. But, if we create a residual plot, we can see that there is a clear pattern of heteroskedasticity, which will lead us to explore a weighted least squares model.
# Plot fitted values vs. residuals to test for heteroskedasticity plt.scatter(res_sm.fittedvalues, res_sm.resid) plt.xlabel('Fitted Values') plt.ylabel('Residuals') plt.axhline(y = 0, color = 'r') plt.show()
The distinctive cone shape is a classic indicator of heteroskedasticity, violating one of the key assumptions of OLS regression.
3. Build weighted least squares (WLS) model
There are several ways to determine the weights to use in weighted least squares (WLS) regression. One common approach is to use the inverse of the variance of the residuals as the weights.
To estimate the variance of the residuals, you can regress the absolute value of the residuals from the OLS regression onto the fitted values. The fitted values of the new OLS regression is an estimate of the standard deviation. Thus you can calculate the weights by taking the inverse of the squared fitted values.
# Regress absolute values of residuals on fitted values # Save the absolute values of the residuals of the OLS model y_resid = [abs(resid) for resid in res_sm.resid] # Add constant according to statsmodels documentation to the fitted values # of the OLS model X_resid = sm.add_constant(res_sm.fittedvalues) # Create OLS model, fit, and print results mod_resid = sm.OLS(y_resid, X_resid) res_resid = mod_resid.fit() # Estimate of std. dev. (sigma) mod_fv = res_resid.fittedvalues # Calculate weights weights = 1 / (mod_fv**2) weights
array([0.03579928, 0.03326449, 0.03098969, 0.02894049, 0.02708803, 0.0254079 , 0.02387937, 0.02248474, 0.02120881, 0.02003849, 0.01896242, 0.01797077, 0.01705491, 0.01620733, 0.01542139, 0.01469127, 0.0140118 , 0.01337839, 0.01278699, 0.01223395,...])
Another way to determine the weights is to use subject-matter knowledge to assign higher weights to data points that are more important or more reliable. For example, if you are analyzing data from a survey and some respondents are more reliable than others, you may want to assign higher weights to the responses from the more reliable respondents.
Building WLS model
# Fit the weighted least squares model model = sm.WLS(y, X_sm, weights = weights) results = model.fit() # Print the results print(results.summary())
From the results, we can see that the weighted least squares model explains 87.9% of the variance in y, about 5% more than the OLS model.
Another way to adjust the model for heteroscedasticity is to transform the dependent variable using a function that reduces the variance of the residuals. For example, you could use a log transformation if the variance increases with the mean of the dependent variable.
Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.