Python: Why Does VIF Return Inf Values?
Background
Variance Inflation Factor (VIF) is used for detecting multicollinearity in regression models. It measures how much the variance of a regression coefficient is inflated due to multicollinearity with other independent variables in the model.
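For the i-th independent variable, the VIF is computed from the coefficient of determination (R²) obtained by regressing that variable on all of the other independent variables:

VIF_i = 1 / (1 - R_i²)

If R_i² is exactly 1, the denominator is 0 and the VIF becomes infinite.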
In Python, VIF can be calculated using the variance_inflation_factor() function from the statsmodels package. However, you may encounter a situation where you get inf (infinity) values as the VIF for some of the independent variables.
Identifying and removing the multicollinearity issues is essential for robust predictive modeling in machine learning.
This article explains, with an example analysis, why you get inf values for the VIF.
Why inf values for VIF?
You can get inf values for VIF due to perfect multicollinearity. This happens when two or more independent variables in a model are perfectly linearly dependent, that is, when one independent variable can be predicted exactly from the other independent variables.
If you have multiple identical columns in the input dataset, there will be perfect multicollinearity.
In addition, a very high correlation (correlation coefficients close to 1 or -1) between independent variables gives very high VIF values, and a correlation of exactly 1 or -1 produces inf values.
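As a minimal sketch with a small made-up DataFrame (the column names x1, x2, x3 are hypothetical), you can reproduce this behavior directly: x2 is an exact multiple of x1, so its VIF blows up.

import pandas as pd
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# small made-up dataset where x2 is an exact multiple of x1
demo = pd.DataFrame({
    'x1': [1.0, 2.0, 3.0, 4.0, 5.0],
    'x2': [2.0, 4.0, 6.0, 8.0, 10.0],  # x2 = 2 * x1 -> perfect collinearity
    'x3': [5.0, 3.0, 6.0, 2.0, 7.0],
})

# add the intercept column, then compute the VIF of each predictor
demo_c = sm.add_constant(demo)
for i, col in enumerate(demo.columns, start=1):
    print(col, variance_inflation_factor(demo_c.values, i))

Here x1 and x2 should come out as inf (possibly with a divide-by-zero RuntimeWarning), while x3 stays finite.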
VIF Calculation Example
The following example shows how the inf values for the VIF arise.
Create an example dataset,
# import package
import pandas as pd
# load example dataset
df = pd.read_csv("https://reneshbedre.github.io/assets/posts/reg/bp.csv")
# view
df.head()
BP Age Weight BSA Dur Pulse Stress
0 105 47 85.4 1.75 5.1 63 33
1 115 49 94.2 2.10 3.8 70 14
2 116 49 95.3 1.98 8.2 72 10
3 117 50 94.7 2.01 5.8 73 99
4 112 51 89.4 1.89 7.0 72 95
Create separate datasets for the independent variables (Age, Weight, BSA, Dur, Pulse, Stress) and the dependent variable (BP),
# independent variables
X = df[['Age', 'Weight', 'BSA', 'Dur', 'Pulse', 'Stress']]
# dependent variable
y = df['BP']
Add a duplicate independent variable to introduce perfect multicollinearity,
# copy X first to avoid a pandas SettingWithCopyWarning, then duplicate Age
X = X.copy()
X['Age_dup'] = df['Age']
# view X
X.head()
Age Weight BSA Dur Pulse Stress Age_dup
0 47 85.4 1.75 5.1 63 33 47
1 49 94.2 2.10 3.8 70 14 49
2 49 95.3 1.98 8.2 72 10 49
3 50 94.7 2.01 5.8 73 99 50
4 51 89.4 1.89 7.0 72 95 51
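Before computing the VIF, a quick pairwise correlation check already flags the problem; the correlation between Age and Age_dup is exactly 1:

# pairwise correlation of the independent variables; an off-diagonal
# coefficient of exactly 1 signals perfect collinearity
print(X.corr().round(2))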
Calculate the VIF using the variance_inflation_factor() function from the statsmodels package,
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# add the intercept column (required for a centered VIF)
X = sm.add_constant(X)
# fit the regression model
reg = sm.OLS(y, X).fit()
# get the Variance Inflation Factor (VIF) for each predictor,
# skipping the constant column at index 0
pd.DataFrame({
    'variables': X.columns[1:],
    'VIF': [variance_inflation_factor(X.values, i + 1)
            for i in range(len(X.columns) - 1)]
})
variables VIF
0 Age inf
1 Weight 8.417035
2 BSA 5.328751
3 Dur 1.237309
4 Pulse 4.413575
5 Stress 1.834845
6 Age_dup inf
You can see that the VIF values for the Age and Age_dup variables are inf. This is because Age and Age_dup are perfectly linearly dependent (the two columns contain identical values), which means that Age and Age_dup have perfect multicollinearity.
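As a cross-check of the definition given above (a sketch reusing the fitted X, which still contains the constant column), you can reproduce the VIF reported for Weight by regressing Weight on the remaining columns:

# VIF for Weight from its definition: 1 / (1 - R^2), where R^2 comes
# from regressing Weight on all of the other columns (including const)
others = X.drop(columns=['Weight'])
r2 = sm.OLS(X['Weight'], others).fit().rsquared
print(1 / (1 - r2))  # should print ~8.417, matching the table above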
The multicollinearity issue in regression models can be identified by checking for duplicate columns and performing a pairwise correlation analysis. It can be resolved by removing one of the variables causing the multicollinearity, as sketched below.
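For example, dropping the duplicated column and recomputing the VIF (continuing from the code above) should give finite values for all remaining variables, including Age:

# drop the duplicated variable and recompute the VIF
X_fixed = X.drop(columns=['Age_dup'])
pd.DataFrame({
    'variables': X_fixed.columns[1:],
    'VIF': [variance_inflation_factor(X_fixed.values, i + 1)
            for i in range(len(X_fixed.columns) - 1)]
})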