Generate dataset with normal distribution in Python

Renesh Bedre 2 minute read

A random dataset that follows a normal (also known as Gaussian) distribution can be generated using the numpy.random.normal function. With a mean of 0 and a standard deviation of 1, i.e., N(µ = 0, σ² = 1), this is the standard normal distribution.

A normal distribution is a continuous probability distribution, and its probability density function (PDF) for a random variable X with mean µ and standard deviation σ is given as,

$$f(x) = \frac{1}{\sigma\sqrt{2\pi}} \, e^{-\frac{(x-\mu)^2}{2\sigma^2}}$$
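
As a quick check of the density formula for a normal distribution, the sketch below evaluates it by hand and compares the result with scipy.stats.norm.pdf. The helper name normal_pdf is hypothetical and introduced only for illustration.

```python
import numpy as np
from scipy.stats import norm


def normal_pdf(x, mu=0.0, sigma=1.0):
    """Normal density: exp(-(x - mu)^2 / (2 sigma^2)) / (sigma * sqrt(2 pi)).

    Hypothetical helper for illustration only.
    """
    x = np.asarray(x, dtype=float)
    return np.exp(-((x - mu) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))


# Seven evenly spaced points from -3 to 3 (the middle one is exactly 0)
x = np.linspace(-3, 3, 7)

manual = normal_pdf(x)                    # formula evaluated by hand
reference = norm.pdf(x, loc=0, scale=1)   # SciPy's implementation

print(np.allclose(manual, reference))  # True
```

The density of the standard normal distribution peaks at x = 0, where it equals 1/√(2π), roughly 0.3989.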

Generate a random dataset with normal distribution

Let’s generate a random dataset with a standard normal distribution using the numpy.random.normal function.

The numpy.random.normal function takes the following arguments,

loc: Mean value (center) of the random sample
scale: Standard deviation (spread) of the random sample. This should be a non-negative value
size: Number of values to draw (an integer, or a tuple giving the shape of the output array)
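
The three arguments can be tried out in a short sketch. It uses NumPy's newer Generator interface (np.random.default_rng), which is a swapped-in alternative to the np.random.normal call used in this post; its normal method accepts the same loc, scale, and size arguments. The seed value is arbitrary and only there to make the draws repeatable.

```python
import numpy as np

# Assumption: a seeded Generator is used so that results are reproducible;
# the seed value 42 is arbitrary.
rng = np.random.default_rng(seed=42)

# size as an integer gives a 1-D array
sample_1d = rng.normal(loc=0, scale=1, size=500)

# size as a tuple gives an array of that shape
sample_2d = rng.normal(loc=10, scale=2, size=(3, 4))

print(sample_1d.shape)  # (500,)
print(sample_2d.shape)  # (3, 4)

# A fresh Generator with the same seed reproduces the same first draw
repeat = np.random.default_rng(seed=42).normal(loc=0, scale=1, size=500)
print(np.array_equal(sample_1d, repeat))  # True
```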

For example, generate a random sample of size 500 with a mean of 0 and standard deviation of 1 (dataset with a standard normal distribution),

import numpy as np

rand_data = np.random.normal(loc=0, scale=1, size=500)
rand_data
# output
array([-0.15736721,  1.07037315,  0.77830583, ...,  0.08949697,
        0.61544847, -1.16288735])

rand_data.shape
# output
(500,)

Verify the mean and standard deviation of the created random dataset,

import statistics as st
# mean
st.mean(rand_data)
# output
-0.004

# standard deviation
st.stdev(rand_data)
# output
0.9542

The mean and standard deviation above are approximately equal to the values passed to the numpy.random.normal function. They do not match exactly because the sample is finite.
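
How close should the sample mean be to the requested mean? One rough guide is the standard error of the mean, σ/√n. The sketch below assumes a seeded generator (the seed is arbitrary) and uses NumPy's own mean and std methods instead of the statistics module; ddof=1 gives the sample standard deviation, which is what statistics.stdev computes.

```python
import numpy as np

# Assumption: seeded Generator for repeatability; the seed value is arbitrary.
rng = np.random.default_rng(seed=1)

n = 500
data = rng.normal(loc=0, scale=1, size=n)

sample_mean = data.mean()
sample_sd = data.std(ddof=1)  # ddof=1: sample standard deviation

# Standard error of the mean for sigma = 1 and n = 500
standard_error = 1 / np.sqrt(n)

print(round(standard_error, 4))  # 0.0447
print(abs(sample_mean) < 5 * standard_error)
```

With n = 500, a sample mean within a few hundredths of 0 is therefore expected, which is consistent with the value of -0.004 reported above.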

Visualize the distribution of the random dataset as a histogram, overlaid with the normal PDF computed from the sample mean and standard deviation,

import matplotlib.pyplot as plt
from scipy.stats import norm

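# density=True normalizes the histogram so it is comparable to the PDF
h = plt.hist(rand_data, bins=30, histtype='bar', density=True, ec='k')
# h[1] holds the bin edges; evaluate the normal PDF at those points
plt.plot(h[1], norm.pdf(h[1], st.mean(rand_data), st.stdev(rand_data)), color='r', linewidth=2)
plt.title(r'Mean=%.4f, Std Dev=%.4f' %(st.mean(rand_data), st.stdev(rand_data)))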
plt.show()

Standard normal distribution plot in Python

Check whether the randomly generated data follows a standard normal distribution using a quantile-quantile (QQ) plot (also known as a normal probability plot),

import statsmodels.api as sm

# line='45' draws a 45-degree reference line, which suits standard normal data
sm.qqplot(rand_data, line='45')
plt.xlabel("Theoretical Quantiles")
plt.ylabel("Sample Quantiles")
plt.show()

Standard normal distribution QQ plot in Python

The QQ plot suggests that the generated random data is normally distributed, as the points fall along a straight line. If the points deviate substantially from the straight line, the data may not follow a normal distribution.
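
The "points on a straight line" reading of a QQ plot can also be summarized as a number. The sketch below is one possible approach, not part of the workflow above: scipy.stats.probplot computes the QQ points and fits a least-squares line, returning the correlation coefficient r of that fit. An exponential sample is included purely as an assumed non-normal contrast; the seed is arbitrary.

```python
import numpy as np
from scipy import stats

# Assumption: seeded Generator for repeatability; the seed value is arbitrary.
rng = np.random.default_rng(seed=0)

normal_sample = rng.normal(loc=0, scale=1, size=500)
skewed_sample = rng.exponential(scale=1, size=500)  # deliberately non-normal

# probplot returns ((theoretical quantiles, ordered data), (slope, intercept, r))
(_, _), (slope_n, intercept_n, r_normal) = stats.probplot(normal_sample, dist="norm")
(_, _), (_, _, r_skewed) = stats.probplot(skewed_sample, dist="norm")

print("normal sample r:", round(r_normal, 4))
print("skewed sample r:", round(r_skewed, 4))
```

An r close to 1 corresponds to points lying near a straight line. The skewed sample should give a visibly lower r, although no fixed cutoff is implied here.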

Generate a random DataFrame with normal distribution

You can also generate a random DataFrame with multiple columns, where each column follows a normal distribution,

import numpy as np
import pandas as pd

df = pd.DataFrame({'col1': np.random.normal(loc=0, scale=1, size=500), 
                   'col2': np.random.normal(loc=2, scale=1, size=500)})

df.head(2)
# output
       col1      col2
0  1.394381  1.049982
1 -0.237809  1.556581

df.shape
# output
(500, 2)

Compute descriptive statistics for the generated DataFrame,

df.describe()
# output
             col1        col2
count  500.000000  500.000000
mean     0.047423    1.994156
std      1.035954    0.959236
min     -2.479637   -0.693878
25%     -0.673236    1.360467
50%      0.035872    2.001211
75%      0.768002    2.626911
max      3.237509    4.724744

The mean and standard deviation of each column are approximately equal to the values provided to the np.random.normal() function.
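
When there are many columns, one call per column gets repetitive. As an alternative sketch, loc and scale can be given as lists that NumPy broadcasts across the columns of a (rows, columns) output. The names df_alt, means, and sds are hypothetical, and the seed is arbitrary.

```python
import numpy as np
import pandas as pd

# Assumption: seeded Generator for repeatability; the seed value is arbitrary.
rng = np.random.default_rng(seed=7)

means = [0, 2]  # one mean per column
sds = [1, 1]    # one standard deviation per column

# loc and scale of length 2 broadcast across the 2 columns of a (500, 2) array
values = rng.normal(loc=means, scale=sds, size=(500, 2))
df_alt = pd.DataFrame(values, columns=["col1", "col2"])

print(df_alt.shape)  # (500, 2)
print(df_alt.agg(["mean", "std"]).round(2))
```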

References

numpy.random.normal

If you have any questions, comments, corrections, or recommendations, please email me at reneshbe@gmail.com

This work is licensed under a Creative Commons Attribution 4.0 International License
