Ordinary Least Squares (OLS) in statsmodels

Einblick Content Team - January 5th, 2023

In this post, we’ll go over two ways to perform linear regression with ordinary least squares (OLS) estimation using the statsmodels library in Python. Conveniently, the two functions are named OLS() and ols(). The output from statsmodels provides a number of useful diagnostic statistics and information about the model.

All of the code is available in the canvas embedded below. Otherwise, read on for a step-by-step walkthrough of how to use statsmodels to run a simple linear regression model.

Note: OLS is the standard method for fitting linear regression models. The statsmodels library provides a number of different regression models. Its two main entry points, statsmodels.api and statsmodels.formula.api, are usually imported with the aliases sm and smf, respectively.

`statsmodels.api` `sm.OLS()` example

import statsmodels.api as sm

# Create X and y dataframes
X = df[["petal_width"]]
y = df[["petal_length"]]

# Add constant according to statsmodels documentation
X_sm = sm.add_constant(X)

# Create model, fit, and print results
mod_sm = sm.OLS(y, X_sm)
res_sm = mod_sm.fit()
res_sm.summary()

`statsmodels.formula.api` `smf.ols()` example

import statsmodels.formula.api as smf

# Use formula to build model, fit, and print results
mod_smf = smf.ols(formula='petal_length ~ petal_width', data=df)
res_smf = mod_smf.fit()
res_smf.summary()

Setup: import packages, load data

1. Install and import seaborn, pandas, and statsmodels

2. Load and prepare your data

# Import and install packages (we'll import statsmodels later)
%pip install statsmodels
import seaborn as sns
import pandas as pd

# Load iris dataset from seaborn
iris = sns.load_dataset("iris")
iris.head()

# Subset for just one species of flower
versicolor = iris[iris["species"] == "versicolor"]

# Write df to Dataframes menu
einblick.write_df("versicolor", versicolor)

We subset the data for a particular flower species, versicolor, and used einblick.write_df() to write the dataframe to the left-hand Dataframes menu. This is an optional step, but it gives access to Einblick's other operators and functionality.

`statsmodels.api` `sm.OLS()` function in-depth

Now that our data is stored as a DataFrame, df, we're ready to fit our model.

Import statsmodels.api under the alias sm. This contains our first function, sm.OLS().

import statsmodels.api as sm

# Create X and y dataframes
X = df[["petal_width"]]
y = df[["petal_length"]]

# Add constant according to statsmodels documentation
X_sm = sm.add_constant(X)

# Create model, fit, and print results
mod_sm = sm.OLS(y, X_sm)
res_sm = mod_sm.fit()
res_sm.summary()

3. Fit the model to your data: pass the independent and dependent variables to the OLS(y, X) function and fit the model using the fit() method.

  • NOTE on df: because we saved the dataframe earlier to Einblick's Dataframes menu, we can now use a variety of operators on the dataframe, and refer to the dataframe as df, as we do in this instance.
  • NOTE on sm.add_constant(X): the OLS() function from statsmodels.api does not include an intercept by default. Before we fit the model, we use sm.add_constant(X) to add a column of ones to the X dataframe, and then pass the result into OLS(). Without this column, the model has no intercept term and the regression line is forced through the origin.
  • NOTE on sm.OLS() syntax: in the case of the OLS() function, note that the y argument comes before the X argument.

Now that the model is built and fitted, you can get the results.

4. Use the summary() method for model evaluation: the fitted OLS results object, res_sm, includes a summary() method that provides a detailed summary of the model fit, including the coefficient estimates, standard errors, and p-values for each feature. Based on the results, we can say that a 1-unit increase in petal width is associated with a 1.87-unit increase in petal length, with a reported p-value of 0.000 (the summary rounds to three decimal places, so the p-value is below 0.0005), and a 95% confidence interval of (1.44, 2.30).

BONUS: `statsmodels.api` `sm.OLS()` with Generative AI

If you want to speed up your regression analysis even more, check out our AI agent, Einblick Prompt, which can create data workflows from as little as one sentence. In the below canvas, we used generative AI to build a regression model, and show the results. Check out how we did it below:

Using Generative AI in Einblick

  1. Open and fork the canvas
  2. Connect the iris dataset
  3. Right-click anywhere in the canvas > Prompt
  4. Type in: "Use OLS to predict petal length, display regression results."
  5. Run the code in Einblick's data notebook immediately

Try out Prompt, and let us know what you think!

`statsmodels.formula.api` `smf.ols()` function in-depth

The next method we're using comes from statsmodels.formula.api under the alias smf. This contains our second function, smf.ols().

import statsmodels.formula.api as smf

# Use formula to build model, fit, and print results
mod_smf = smf.ols(formula='petal_length ~ petal_width', data=df)
res_smf = mod_smf.fit()
res_smf.summary()

  • NOTE: we need a new import statement, and the function name is now lowercase: ols() rather than OLS().
  • NOTE: in this case, we do not need to separate our X and y variables into separate dataframes, and we also do not need to add a column of constants to our X variables. The smf.ols() function builds the model from the column names in the formula and adds an intercept automatically.
  • NOTE on smf.ols() syntax: smf.ols() takes a formula argument, which may seem familiar if you've used the programming language R before. The argument is a string that represents the linear regression formula for your model. The formula starts with the y variable name, also known as your response, outcome, or dependent variable, followed by a tilde, ~, and then your X variable names, also known as your predictor, input, or independent variables. In this case, we only have 1 X variable, so the formula looks like y ~ X. If there were multiple X variables, the formula would look like y ~ X1 + X2 + X3 + ... + Xn, where a + separates each X variable. The smf.ols() function also takes a data argument, which is our dataframe, df in this case.
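
As a quick illustration of the multi-variable formula syntax described above, here is a minimal sketch with two predictors. The dataframe toy and the column names x1, x2, and y are made up for this example and are not part of the iris walkthrough.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data with two predictors: y = 1 + 2*x1 - 0.5*x2 + noise
rng = np.random.default_rng(2)
n = 100
toy = pd.DataFrame({"x1": rng.normal(size=n), "x2": rng.normal(size=n)})
toy["y"] = 1.0 + 2.0 * toy["x1"] - 0.5 * toy["x2"] + rng.normal(0, 0.1, size=n)

# Each predictor is separated by "+"; the intercept is added automatically
res_multi = smf.ols(formula="y ~ x1 + x2", data=toy).fit()

# One coefficient per term, labeled by the names used in the formula
print(res_multi.params)
```

The names in the formula must match column names in the dataframe passed to data.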

Just as above, we use the fit() and summary() methods to fit the model and get summary statistics. These methods yield the same results as above. The difference between the two functions is purely syntactic: which one you use may depend on whether you prefer to build X and y yourself or to write the model out as a formula string.
