In this post, we’ll go over two ways to perform linear regression with ordinary least squares (OLS) estimation using the `statsmodels` library in Python. Conveniently, the two functions are named `OLS()` and `ols()`. The output from `statsmodels` provides a number of useful diagnostic statistics and information about the model.

All of the code is available in the canvas embedded below. Otherwise, read on for a step-by-step walkthrough of how to use `statsmodels` to run a simple linear regression model.

**Note:** OLS is the standard method for fitting linear regression models. The `statsmodels` library, whose `statsmodels.api` and `statsmodels.formula.api` modules are usually imported under the aliases `sm` and `smf`, provides a number of different regression models.
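For reference, here is a sketch of what OLS estimates in the one-predictor case used in this post. The symbols are notation introduced here for illustration: $y_i$ stands for petal length, $x_i$ for petal width, and $\bar{x}$, $\bar{y}$ for their sample means.

$$
(\hat{\beta}_0, \hat{\beta}_1) = \arg\min_{\beta_0, \beta_1} \sum_{i=1}^{n} \left(y_i - \beta_0 - \beta_1 x_i\right)^2,
\qquad
\hat{\beta}_1 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})(y_i - \bar{y})}{\sum_{i=1}^{n} (x_i - \bar{x})^2},
\qquad
\hat{\beta}_0 = \bar{y} - \hat{\beta}_1 \bar{x}
$$

The slope $\hat{\beta}_1$ and intercept $\hat{\beta}_0$ correspond to the two coefficient rows in the regression summary.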

## `statsmodels.api` `sm.OLS()` example

```
import statsmodels.api as sm
# Create X and y dataframes
X = df[["petal_width"]]
y = df[["petal_length"]]
# Add constant according to statsmodels documentation
X_sm = sm.add_constant(X)
# Create model, fit, and print results
mod_sm = sm.OLS(y,X_sm)
res_sm = mod_sm.fit()
res_sm.summary()
```

## `statsmodels.formula.api` `smf.ols()` example

```
import statsmodels.formula.api as smf
# Use formula to build model, fit, and print results
mod_smf = smf.ols(formula='petal_length ~ petal_width', data=df)
res_smf = mod_smf.fit()
res_smf.summary()
```

## Setup: import packages, load data

1. Install and import `seaborn`, `pandas`, and `statsmodels`
2. Load and prepare your data

```
# Import and install packages (we'll import statsmodels later)
%pip install statsmodels
import seaborn as sns
import pandas as pd
# Load iris dataset from seaborn
iris = sns.load_dataset("iris")
iris.head()
```

**Output:** (table showing the first five rows of the iris dataset)

```
# Subset for just one species of flower
versicolor = iris[iris["species"] == "versicolor"]
# Write df to Dataframes menu
einblick.write_df("versicolor", versicolor)
```

We subset the data for a particular flower species, versicolor, and used `einblick.write_df()` to write the dataframe to the left-hand Dataframes menu. This is an optional step, but it gives access to Einblick's other operators and functionality.

## `statsmodels.api` `sm.OLS()` function in-depth

Now we're ready to fit our model, with our data stored as a DataFrame, `df`.

Import `statsmodels.api` under the alias `sm`. This contains our first function, `sm.OLS()`.

```
import statsmodels.api as sm
# Create X and y dataframes
X = df[["petal_width"]]
y = df[["petal_length"]]
# Add constant according to statsmodels documentation
X_sm = sm.add_constant(X)
# Create model, fit, and print results
mod_sm = sm.OLS(y,X_sm)
res_sm = mod_sm.fit()
res_sm.summary()
```

**Output:** (OLS regression results summary table)

3. **Fit the model to your data:** pass the dependent and independent variables to the `OLS(y, X)` function and fit the model using the `fit()` method.

- **NOTE on `df`:** because we saved the dataframe earlier to Einblick's Dataframes menu, we can now use a variety of operators on the dataframe, and refer to the dataframe as `df`, as we do in this instance.
- **NOTE on `sm.add_constant(X)`:** in the case of the `OLS()` function from `statsmodels.api`, before we fit the model, we need to use the `sm.add_constant(X)` function, which adds a column of constants to the `X` dataframe, before passing that into the `OLS()` function. This is required because `sm.OLS()` does not add an intercept on its own; without the constant column, the regression line would be forced through the origin.
- **NOTE on `sm.OLS()` syntax:** in the case of the `OLS()` function, note that the y argument comes before the X argument.

Now that the model is built and fitted, you can get the results.

4. **Use the summary function for model evaluation:** the fitted `OLS` model, `res_sm`, includes a `summary()` method that provides a detailed summary of the model fit, including the coefficient estimates, standard errors, and p-values for each feature. Based on the results, we can say that a 1-unit increase in petal width is associated with a 1.87-unit increase in petal length, with a p-value reported as 0.000 (that is, below 0.0005) and a 95% confidence interval of (1.44, 2.30).

## BONUS: `statsmodels.api` `sm.OLS()` with Generative AI

If you want to speed up your regression analysis even more, check out our AI agent, Einblick Prompt, which can create data workflows from as little as one sentence. In the below canvas, we used generative AI to build a regression model, and show the results. Check out how we did it below:

### Using Generative AI in Einblick

- Open and fork the canvas
- Connect the iris dataset
- Right-click anywhere in the canvas > Prompt
- Type in: "Use OLS to predict petal length, display regression results."
- Run the code in Einblick's data notebook immediately

Try out Prompt, and let us know what you think!

## `statsmodels.formula.api` `smf.ols()` function in-depth

The next method we're using comes from `statsmodels.formula.api`, which we import under the alias `smf`. This contains our second function, `smf.ols()`.

```
import statsmodels.formula.api as smf
# Use formula to build model, fit, and print results
mod_smf = smf.ols(formula='petal_length ~ petal_width', data=df)
res_smf = mod_smf.fit()
res_smf.summary()
```

- **NOTE:** we need a new import statement, and the designated OLS function is now lowercase.
- **NOTE:** in this case, we do not need to separate our X and y variables into separate dataframes, and we also do not need to add a column of constants to our X variables. This is due to how the `smf.ols()` function works: it builds the X matrix, intercept included, from the formula.
- **NOTE on `smf.ols()` syntax:** in the case of `smf.ols()`, the function takes in a `formula` argument, which may seem familiar if you've used the programming language R before. The argument takes in a string that represents the linear regression formula for your model. The formula starts with the y variable name, also known as your response, outcome, or dependent variable, followed by a tilde, `~`, and then your X variable names, also known as your predictors, inputs, or independent variables. In this case, we only have 1 X variable, so the formula looks like `y ~ X`. If there were multiple X variables, the formula would look like `y ~ X1 + X2 + X3 + ... + Xn`, where a `+` separates each X variable. The `smf.ols()` function also takes in a `data` argument, which is our dataframe, `df`, in this case.
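As a sketch of the multi-variable formula syntax described above, the example below fits a model with two predictors. The data is synthetic and the column names `x1`, `x2`, and `y` are hypothetical stand-ins, not columns from the iris dataset.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in data with hypothetical column names x1, x2, and y
rng = np.random.default_rng(2)
toy = pd.DataFrame({"x1": rng.normal(size=100), "x2": rng.normal(size=100)})
toy["y"] = 1.0 + 2.0 * toy["x1"] - 0.5 * toy["x2"] + rng.normal(0.0, 0.1, size=100)

# Each "+" adds one predictor; the intercept is included automatically
res_multi = smf.ols(formula="y ~ x1 + x2", data=toy).fit()
print(res_multi.params)
```

The names in the formula string must match column names in the dataframe passed as `data`, and the automatically added intercept shows up in the results under the label `Intercept`.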

Just as above, we use the `fit()` and `summary()` methods to fit the model and get summary statistics. These methods yield the same results as above. The difference between the two functions is purely syntactic for you as the user, and the choice may depend on whether you prefer to write out your variables in a formula string.
