Multicollinearity is a statistical phenomenon that occurs when two or more features in a regression model are highly correlated. This can lead to unreliable model coefficients, as the model may be unable to distinguish the unique contribution of each feature to the target variable. In this post, we'll show three ways to test for multicollinearity, using functions from the `statsmodels`, `seaborn`, and `pandas` packages.

See the example and code snippets below for a dataset on car lease transactions, stored in a `pandas` DataFrame:

## Set up: install packages, and prep data

```
# Import packages
%pip install statsmodels
import pandas as pd
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant
# Subset potential independent variables
X = df[["LeasePrice", "PurchasePrice", "RemainingLength"]]
# Add column of constants for VIF function
X_const = add_constant(X)
```

## Option 1: calculate Variance Inflation Factors (VIF)

```
# Compute the variance inflation factor (VIF) for each feature
# Skip index 0, which is the constant column added by add_constant()
vif = pd.Series([variance_inflation_factor(X_const.values, i + 1) for i in range(X.shape[1])],
                index=X.columns)
print("Variance Inflation Factors:")
print(vif)
```

**Output:**

```
Variance Inflation Factors:
LeasePrice         36.920037
PurchasePrice      11.649848
RemainingLength    13.693264
dtype: float64
```

The VIF measures how much the variance of a feature's coefficient estimate is inflated by that feature's correlation with the other features. For each feature, it is computed as 1 / (1 − R²), where R² comes from regressing that feature on all of the others. Generally a VIF greater than 10 is considered high; the higher the VIF, the more linearly related that variable is to the other variables.
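To make the VIF concrete, here is a minimal sketch of the computation by hand, using synthetic data and plain `numpy` rather than the car lease dataset: regress one feature on the rest and take 1 / (1 − R²).

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic features: x2 is strongly collinear with x1, x3 is independent
x1 = rng.normal(size=200)
x2 = 0.9 * x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
X = np.column_stack([x1, x2, x3])

def vif(X, i):
    """VIF for column i: regress it on the remaining columns
    (plus an intercept) and return 1 / (1 - R^2)."""
    y = X[:, i]
    others = np.delete(X, i, axis=1)
    A = np.column_stack([np.ones(len(y)), others])      # design matrix
    beta, *_ = np.linalg.lstsq(A, y, rcond=None)
    resid = y - A @ beta
    r2 = 1 - resid.var() / y.var()
    return 1.0 / (1.0 - r2)

print([round(vif(X, i), 2) for i in range(X.shape[1])])
```

With this setup, the two collinear columns produce VIFs far above the rule-of-thumb cutoff of 10, while the independent column stays near 1.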

From the output, we can see that multicollinearity is present: all three VIFs exceed 10. We should consider removing one or more of the variables from the feature set.
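A useful shortcut worth knowing: when an intercept is included, the VIFs equal the diagonal of the inverse of the features' correlation matrix, with no loop over regressions needed. A sketch on synthetic data (not the post's dataset):

```python
import numpy as np
import pandas as pd

# Synthetic features: "a" and "b" share a common component, "c" is independent
rng = np.random.default_rng(2)
base = rng.normal(size=150)
X = pd.DataFrame({
    "a": base + 0.2 * rng.normal(size=150),
    "b": base + 0.2 * rng.normal(size=150),
    "c": rng.normal(size=150),
})

# VIFs via the diagonal of the inverse correlation matrix
vifs = pd.Series(np.diag(np.linalg.inv(X.corr().values)), index=X.columns)
print(vifs.round(2))
```

Here `a` and `b` come out with large VIFs while `c` stays near 1, matching what the regression-based formula would give.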

## Option 2: calculate correlation coefficients

```
# Compute the pairwise correlations between features of a pandas DataFrame
corr = X.corr()
print("Correlation Coefficients")
print(corr)
```

**Output:**

```
Correlation Coefficients
                 LeasePrice  PurchasePrice  RemainingLength
LeasePrice          1.00000       0.875000         0.017140
PurchasePrice       0.87500       1.000000        -0.386624
RemainingLength     0.01714      -0.386624         1.000000
```

A correlation coefficient measures how linearly related two variables are, and ranges from -1 to 1. For example, someone's age and their years of work experience are highly correlated, so we would expect a correlation coefficient close to 1.

Typically, we say that two variables are correlated if their correlation coefficient is greater than 0.5, or less than -0.5.

From the results, we can see that `PurchasePrice` and `LeasePrice` are highly correlated, with a coefficient of 0.875. We should consider removing at least one of these variables from the resulting model.

## Option 3: create scatterplots

Since multicollinearity is about the presence of linear relationships between variables, you can use your visualization package of choice (`matplotlib`, `seaborn`, `plotly`, or something else) and look for pairwise linear relationships. A great function to use is seaborn's `pairplot()`, which creates a scatterplot for each pair of continuous variables in your dataset. Here we've isolated just the X variables.

```
import seaborn as sns
sns.pairplot(X) # X is a dataframe with your independent variables
```

**Output:**

As you can see, there appears to be a linear relationship between `PurchasePrice` and `LeasePrice`. This reflects our other tests of multicollinearity.

### About

Einblick is an AI-native data science platform that provides data teams with an agile workflow to swiftly explore data, build predictive models, and deploy data apps. Founded in 2020, Einblick was developed based on six years of research at MIT and Brown University. Einblick is funded by Amplify Partners, Flybridge, Samsung Next, Dell Technologies Capital, and Intel Capital. For more information, please visit www.einblick.ai and follow us on LinkedIn and Twitter.