Python: Why Does VIF Return Inf Values?

Renesh Bedre    3 minute read

Background

Variance Inflation Factor (VIF) is used for detecting multicollinearity in regression models. It measures how much the variance of a regression coefficient is inflated due to multicollinearity with other independent variables in the model.

In Python, VIF can be calculated using the variance_inflation_factor() function from the statsmodels package. However, you may encounter a situation where you get Inf (infinity) values as the VIF for some of the independent variables.

Identifying and removing the multicollinearity issues is essential for robust predictive modeling in machine learning.

This article explains the reasons behind Inf values for the VIF with an example analysis.

Why inf values for VIF?

You can get inf values for VIF due to perfect multicollinearity. This happens when two or more independent variables in a model are perfectly linearly dependent. That is, one independent variable in the model can be predicted exactly from another independent variable (or from a linear combination of the other variables).

If you have multiple identical columns in the input dataset, there will be perfect multicollinearity.

In addition, high correlation (correlation coefficients close to 1 or -1) between the independent variables gives very high VIF values, and when the linear dependence is effectively exact the computed VIF can likewise become inf.
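To see where the inf comes from, recall that the VIF for an independent variable is 1 / (1 - R²), where R² is the R-squared obtained by regressing that variable on all the other independent variables (this is essentially what variance_inflation_factor() computes). With perfect multicollinearity, R² equals 1 and the denominator becomes zero. A minimal sketch using NumPy floating-point division (the R² values below are hypothetical, for illustration only) shows how the VIF grows and finally becomes inf as R² reaches 1,

# import package
import numpy as np

# hypothetical R-squared values from regressing one independent
# variable on all the other independent variables
r_squared = np.array([0.50, 0.90, 0.99, 1.00])

# VIF = 1 / (1 - R^2); floating-point division by zero gives inf
with np.errstate(divide='ignore'):
    vif = 1.0 / (1.0 - r_squared)

print(vif)
# [  2.  10. 100.  inf]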

VIF calculation example

The following example explains how you get the inf values for the VIF.

Create an example dataset,

# import package
import pandas as pd

# load example dataset
df = pd.read_csv("https://reneshbedre.github.io/assets/posts/reg/bp.csv")

# view
df.head()
    BP  Age  Weight   BSA  Dur  Pulse  Stress
0  105   47    85.4  1.75  5.1     63      33
1  115   49    94.2  2.10  3.8     70      14
2  116   49    95.3  1.98  8.2     72      10
3  117   50    94.7  2.01  5.8     73      99
4  112   51    89.4  1.89  7.0     72      95

Create separate datasets for the independent variables (Age, Weight, BSA, Dur, Pulse, Stress) and the dependent variable (BP),

# independent variables (use .copy() to avoid a SettingWithCopyWarning
# when a column is added later)
X = df[['Age', 'Weight', 'BSA', 'Dur', 'Pulse', 'Stress']].copy()

# dependent variable
y = df['BP']

Add a duplicate independent variable to create perfect multicollinearity,

# duplicate the Age column as a new independent variable
X['Age_dup'] = df['Age']

# view X
X.head()

  Age  Weight   BSA  Dur  Pulse  Stress  Age_dup
0   47    85.4  1.75  5.1     63      33       47
1   49    94.2  2.10  3.8     70      14       49
2   49    95.3  1.98  8.2     72      10       49
3   50    94.7  2.01  5.8     73      99       50
4   51    89.4  1.89  7.0     72      95       51
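Before calculating the VIF, you can already confirm that the predictor matrix is rank deficient. The following sketch (an addition for illustration, using NumPy) compares the matrix rank with the number of columns; with the duplicated Age column, the rank is one less than the number of columns,

# import package
import numpy as np

# rank of the predictor matrix vs. number of columns
print(np.linalg.matrix_rank(X.values), X.shape[1])
# expected: 6 7 -> rank is less than the number of columns,
# i.e. perfect multicollinearity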

Calculate the VIF for each independent variable using the variance_inflation_factor() function from the statsmodels package,

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# add the intercept term
X = sm.add_constant(X)
# fit the regression model
reg = sm.OLS(y, X).fit()
# get the Variance Inflation Factor (VIF) for each independent variable
# (skip the constant column at index 0)
pd.DataFrame({'variables': X.columns[1:],
              'VIF': [variance_inflation_factor(X.values, i + 1) for i in range(len(X.columns) - 1)]})
  variables       VIF
0       Age       inf
1    Weight  8.417035
2       BSA  5.328751
3       Dur  1.237309
4     Pulse  4.413575
5    Stress  1.834845
6   Age_dup       inf

You can see that the VIF values for the Age and Age_dup variables are inf. This is because Age and Age_dup are perfectly linearly dependent (the two columns contain exactly the same values). It means that Age and Age_dup have perfect multicollinearity.

The multicollinearity issue in regression models can be identified by checking for duplicate columns and by computing pairwise correlations between the independent variables. It can be resolved by removing one of the variables causing the multicollinearity, as shown in the sketch below.
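For example, the following sketch (reusing the X built above; the name X_fixed is just for illustration) flags the duplicated column, shows the perfect pairwise correlation, and drops Age_dup so the VIF can be recalculated without inf values,

# drop the constant column added earlier to look only at the predictors
X_pred = X.drop(columns=['const'])

# check for duplicate columns (transpose so duplicated() compares columns)
print(X_pred.T.duplicated())
# Age_dup is reported as True (a duplicate of Age)

# pairwise correlation between the independent variables
print(X_pred.corr()['Age'])
# the correlation between Age and Age_dup is exactly 1.0

# remove one of the collinear variables, then refit the model and
# recompute the VIF on the remaining columns
X_fixed = X.drop(columns=['Age_dup'])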



This work is licensed under a Creative Commons Attribution 4.0 International License
