Checking and handling missing values (NaN
) in pandas
- In pandas dataframe the NULL or missing values (missing data) are denoted as
NaN
. Sometimes, PythonNone
can also be considered as missing values. - Missing values in datasets can cause the complication in data handling and analysis, loss of information and efficiency, and can produce biased results.
- It is necessary to check the missing data in datasets for feature engineering such as imputation of missing data, dropping the records with missing data, etc.
Check NULL or missing (NaN) values in pandas series
Create the pandas series with missing (NaN) values,
import pandas as pd
import numpy as np
data = {'a': 1, 'b': 2, 'c': 3, 'd': np.nan, 'e':4, 'f':np.nan}
ser = pd.Series(data=data)
ser
# output
a 1.0
b 2.0
c 3.0
d NaN
e 4.0
f NaN
dtype: float64
Check missing values in pandas series with isnull()
function,
ser.isna()
# output
a False
b False
c False
d True
e False
f True
dtype: bool
# missing values are indicated by True
Count the missing values in pandas series using the sum()
function
ser.isna().sum()
# output
2
Check NULL or missing (NaN) values in pandas dataframe
Create the pandas dataframe with missing (NaN) values,
import pandas as pd
import numpy as np
df = pd.DataFrame({'col1':['A', 'B', 'C', 'D', 'E'], 'col2':[1,2,3,4,np.nan], 'col3':[0.1,np.nan,0.3,0.4,0.5]})
df
# output
col1 col2 col3
0 A 1.0 0.1
1 B 2.0 NaN
2 C 3.0 0.3
3 D 4.0 0.4
4 E NaN 0.5
Check the missing values in pandas dataframe using isnull()
function
df.isna()
# output
col1 col2 col3
0 False False False
1 False False True
2 False False False
3 False False False
4 False True False
# missing values are indicated by True
# check if overall dataframe has any missing values
df.isna().any().any()
# output
True
Count the missing values in each column in the pandas dataframe using the sum()
function
df.isna().sum()
# output
col1 0
col2 1
col3 1
dtype: int64
Handling missing values in pandas dataframe
Drop the missing values in pandas DataFrame using the dropna()
function. The dropna()
drop all rows where missing
values are present in DataFrame.
df.dropna()
# output
col1 col2 col3
0 A 1.0 0.1
2 C 3.0 0.3
3 D 4.0 0.4
# it drops a complete row where missing value is present in any column
Drop the rows from DataFrame where only specific column contains missing value (NaN
),
# drop row if col2 contains NaN value
df[df['col2'].notna()]
# output
col1 col2 col3
0 A 1.0 0.1
1 B 2.0 NaN
2 C 3.0 0.3
3 D 4.0 0.4
Fill the missing values with average or median values. This is also called the imputation of missing values.
# fill each column missing values with average value for that column
df.fillna(df.mean())
# output
col1 col2 col3
0 A 1.0 0.10
1 B 2.0 0.35
2 C 3.0 0.30
3 D 4.0 0.40
4 E 2.5 0.50
# fill each column missing values with median value for that column
df.fillna(df.median())
col1 col2 col3
0 A 1.0 0.10
1 B 2.0 0.35
2 C 3.0 0.30
3 D 4.0 0.40
4 E 2.5 0.50
For a categorical variable, the mode (most frequent value) can be used for filling the missing values
# create dataframe with a categorical variable
df1 = pd.DataFrame({'col1':['A', 'B', 'C', 'D', 'E'], 'col2':[1,2,3,4,np.nan], 'col3':[0,np.nan,0,0,1]})
# make col3 as a categorical variable
df1['col3'] = df1.col3.astype('category')
# impute missing value in col3 with mode
df1['col3'].fillna(value=df1['col3'].mode()[0], inplace=True)
df1
# output
col1 col2 col3
0 A 1.0 0.0
1 B 2.0 0.0
2 C 3.0 0.0
3 D 4.0 0.0
4 E NaN 1.0
Fill the missing values with any constant values
# fill col2 missing values with 0
df['col2'].fillna(value=0, inplace=True)
df
# output
col1 col2 col3
0 A 1.0 0.1
1 B 2.0 NaN
2 C 3.0 0.3
3 D 4.0 0.4
4 E 0.0 0.5
Fill the missing value with the non-missing value that appears before the missing value
df.fillna(method='ffill', inplace=True)
df
# output
col1 col2 col3
0 A 1.0 0.1
1 B 2.0 0.1
2 C 3.0 0.3
3 D 4.0 0.4
4 E 4.0 0.5
Fill the missing value with the non-missing value that appear after the missing value
df.fillna(method='bfill', inplace=True)
df
# output
col1 col2 col3
0 A 1.0 0.1
1 B 2.0 0.3
2 C 3.0 0.3
3 D 4.0 0.4
4 E NaN 0.5
See more parameters at pandas fillna
usage
Enhance your skills with courses on Python and pandas
- Python for Data Analysis: Pandas & NumPy
- Mastering Data Analysis with Pandas: Learning Path Part 1
- Data Analysis Using Python
- Python for Everybody Specialization
References
- Barnard J, Meng XL. Applications of multiple imputation in medical studies: from AIDS to NHANES. Statistical methods in medical research. 1999 Feb;8(1):17-36.
Learn more about Python
This work is licensed under a Creative Commons Attribution 4.0 International License
Some of the links on this page may be affiliate links, which means we may get an affiliate commission on a valid purchase. The retailer will pay the commission at no additional cost to you.