How to Calculate Descriptive Statistics in Python with Pandas DataFrame
The purpose of descriptive statistics is to summarize the characteristics of a dataset in a meaningful way, without drawing inferences beyond the data itself.
Descriptive statistics summarize the central tendency (mean, median, mode), the spread (range, standard deviation, and variance), the shape, and the frequency distribution of the data.
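As a rough sketch of these four groups of measures, the snippet below computes each one with standard pandas Series methods. The sample values are hypothetical and chosen only so that the results are easy to check by hand.

```python
import pandas as pd

# Hypothetical sample values, used only for illustration
s = pd.Series([2, 4, 4, 5, 10])

# Central tendency
central = {"mean": s.mean(), "median": s.median(), "mode": s.mode().tolist()}

# Spread (std and var use the sample formulas, ddof=1, by default)
spread = {"range": s.max() - s.min(), "std": s.std(), "var": s.var()}

# Shape
shape = {"skewness": s.skew(), "kurtosis": s.kurt()}

# Frequency of each distinct value
frequency = s.value_counts().to_dict()

print(central, spread, shape, frequency, sep="\n")
```

Note that the mode, skewness, and kurtosis are not part of the describe() output and have to be requested separately.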
The describe() function from pandas calculates the descriptive statistics for a DataFrame.
The basic syntax for the describe() function is,
# for all columns
df.describe()
# for specific column
df["column_name"].describe()
Where df is a pandas DataFrame.
The describe() function calculates the following descriptive statistics for numeric data from a pandas DataFrame,
- Count
- Mean
- Standard deviation (std)
- Minimum (min)
- 25% (25th percentile or first quartile)
- 50% (50th percentile or second quartile or median)
- 75% (75th percentile or third quartile)
- Maximum (max)
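Each entry in this list corresponds to an individual pandas method. The sketch below, which borrows the Age values from the numerical example in this article, rebuilds the summary one statistic at a time and compares it with the describe() output. The variable names are illustrative only.

```python
import pandas as pd

# Age values borrowed from the numerical example in this article
age = pd.Series([25, 30, 20, 35, 38], name="Age")

summary = age.describe()

# The same statistics computed one at a time
manual = pd.Series(
    {
        "count": age.count(),
        "mean": age.mean(),
        "std": age.std(),  # sample standard deviation (ddof=1)
        "min": age.min(),
        "25%": age.quantile(0.25),
        "50%": age.quantile(0.50),
        "75%": age.quantile(0.75),
        "max": age.max(),
    },
    dtype="float64",
)

# Largest absolute difference between the two summaries
print((summary - manual).abs().max())
```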
The describe() function calculates the following descriptive statistics for categorical data (e.g. strings) from a pandas DataFrame,
- Count
- Unique (number of unique values)
- Top (most common value)
- Frequency of the most common value (freq)
Note: By default, the describe() function returns descriptive statistics only for numerical columns if a DataFrame contains both numerical and categorical columns.
The following examples explain how to use the describe() function to get descriptive statistics from a pandas DataFrame.
Calculate descriptive statistics for numerical pandas DataFrame
Create a numerical pandas DataFrame,
# load package
import pandas as pd
# create a DataFrame
df = pd.DataFrame({'Age':[25, 30, 20, 35, 38], 'Height':[5.5, 6.2, 5, 4.9, 5.9]})
# view DataFrame
df
   Age  Height
0   25     5.5
1   30     6.2
2   20     5.0
3   35     4.9
4   38     5.9
Calculate descriptive statistics,
df.describe()
             Age    Height
count   5.000000  5.000000
mean   29.600000  5.500000
std     7.300685  0.561249
min    20.000000  4.900000
25%    25.000000  5.000000
50%    30.000000  5.500000
75%    35.000000  5.900000
max    38.000000  6.200000
The describe() function outputs the count, mean, standard deviation (std), minimum, first quartile (25%), median (50%), third quartile (75%), and maximum for each numerical column.
Calculate the variance using the var() function of the pandas DataFrame,
df.var()
Age       53.300
Height     0.315
dtype: float64
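By default, var() and std() in pandas use the sample formulas (ddof=1, dividing by n - 1), so the variance above is the square of the std row reported by describe(). A minimal sketch on the same data, with illustrative variable names, shows the sample and population versions side by side:

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 20, 35, 38], "Height": [5.5, 6.2, 5, 4.9, 5.9]})

sample_var = df.var()            # default ddof=1: divide by n - 1
population_var = df.var(ddof=0)  # divide by n

# The variance is the square of the standard deviation
std_squared = df.std() ** 2

print(sample_var, population_var, std_squared, sep="\n\n")
```

Pass ddof=0 only when the data cover the whole population rather than a sample from it.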
Calculate the range (difference between max and min values) from pandas DataFrame,
# for Age variable
df.Age.max() - df.Age.min()
18
# for Height column
df.Height.max() - df.Height.min()
1.2999999999999998
# i.e. 1.3, up to floating-point rounding
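The range can also be computed for every column in one step by subtracting the column-wise minimum from the column-wise maximum. This is a small sketch on the same data; the result is rounded for display because floating-point subtraction may not return exactly 1.3 for Height.

```python
import pandas as pd

df = pd.DataFrame({"Age": [25, 30, 20, 35, 38], "Height": [5.5, 6.2, 5, 4.9, 5.9]})

# Column-wise range: max minus min for each column
value_range = df.max() - df.min()

# Round only for display
print(value_range.round(2))
```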
Calculate descriptive statistics for categorical pandas DataFrame
Create a categorical pandas DataFrame,
# load package
import pandas as pd
# create a DataFrame
df = pd.DataFrame({'school':['A', 'B', 'C', 'D', 'E'], 'state':["TX", "TX", "CA", "CA", "CA"],
'temp':["hot", "hot", "mild", "mild", "mild"]})
# view DataFrame
df
  school state  temp
0      A    TX   hot
1      B    TX   hot
2      C    CA  mild
3      D    CA  mild
4      E    CA  mild
Calculate descriptive statistics,
df.describe()
       school state  temp
count       5     5     5
unique      5     2     2
top         A    CA  mild
freq        1     3     3
For categorical columns, the describe() function outputs the count, the number of unique values, the most common value (top), and the frequency (freq) of the most common value.
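To see where these four values come from, the sketch below derives them for the state column with value_counts() and related Series methods. The manual dictionary is only an illustration, not how pandas computes the summary internally.

```python
import pandas as pd

# The state column from the categorical example
state = pd.Series(["TX", "TX", "CA", "CA", "CA"], name="state")

# Number of occurrences of each distinct value
counts = state.value_counts()

manual = {
    "count": state.count(),    # non-missing values
    "unique": state.nunique(), # distinct values
    "top": counts.idxmax(),    # most common value
    "freq": counts.max(),      # how often the most common value occurs
}
print(manual)
```

When several values are tied for most common, as in the school column, only one of them is reported as top.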
Calculate descriptive statistics for mixed pandas DataFrame
By default, the describe() function returns descriptive statistics only for the numerical columns if the DataFrame has mixed data types (numerical and categorical).
You can pass the include='all' parameter to the describe() function to get descriptive statistics for every column, whatever its data type.
Create a mixed data type pandas DataFrame,
# load package
import pandas as pd
# create a DataFrame
df = pd.DataFrame({'name':['A', 'B', 'C', 'D', 'E'], 'Age':[25, 30, 20, 35, 38], 'Height':[5.5, 6.2, 5, 4.9, 5.9]})
# view DataFrame
df
  name  Age  Height
0    A   25     5.5
1    B   30     6.2
2    C   20     5.0
3    D   35     4.9
4    E   38     5.9
Calculate descriptive statistics for both numerical and categorical variables,
df.describe(include = "all")
       name        Age    Height
count     5   5.000000  5.000000
unique    5        NaN       NaN
top       A        NaN       NaN
freq      1        NaN       NaN
mean    NaN  29.600000  5.500000
std     NaN   7.300685  0.561249
min     NaN  20.000000  4.900000
25%     NaN  25.000000  5.000000
50%     NaN  30.000000  5.500000
75%     NaN  35.000000  5.900000
max     NaN  38.000000  6.200000
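The NaN entries appear because numerical statistics do not apply to the name column and categorical statistics do not apply to the numerical columns. If separate tables are easier to read, one possible approach is to use the include and exclude parameters of describe() to summarize each group of columns on its own, as in this sketch on the same data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "name": ["A", "B", "C", "D", "E"],
    "Age": [25, 30, 20, 35, 38],
    "Height": [5.5, 6.2, 5, 4.9, 5.9],
})

# Only the numerical columns (same result as the default here)
numeric_summary = df.describe(include=[np.number])

# Everything except the numerical columns, i.e. the categorical ones
categorical_summary = df.describe(exclude=[np.number])

print(numeric_summary)
print(categorical_summary)
```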
This work is licensed under a Creative Commons Attribution 4.0 International License