How to identify and drop duplicates based on single and multiple columns in a pandas DataFrame
In this article, you will learn how to identify and drop duplicates in a pandas DataFrame and Series. Let’s first create a sample pandas DataFrame (you can also load a DataFrame from a file).
import pandas as pd

# Create DataFrame
df = pd.DataFrame({'col1': [10, 11, 25, 10],
                   'col2': [2, 3, 25, 2],
                   'col3': ['A', 'B', 'B', 'A']})
df
# output
col1 col2 col3
0 10 2 A
1 11 3 B
2 25 25 B
3 10 2 A
Drop duplicate rows based on all columns
You can use the pandas drop_duplicates() function to drop duplicate rows based on all columns. You can either keep the first or last occurrence of the duplicate rows, or drop all occurrences entirely. For example, drop duplicate rows and keep the first occurrence,
df.drop_duplicates() # same as df.drop_duplicates(keep='first')
# output
col1 col2 col3
0 10 2 A
1 11 3 B
2 25 25 B
If you want to keep the last occurrence, pass the keep='last' parameter,
df.drop_duplicates(keep='last')
# output
col1 col2 col3
1 11 3 B
2 25 25 B
3 10 2 A
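By default, the surviving rows keep their original index labels. A minimal sketch, assuming pandas 1.0 or later (where drop_duplicates() also accepts an ignore_index parameter), that keeps the last occurrence and renumbers the result from 0,
# requires pandas >= 1.0 for the ignore_index parameter
df.drop_duplicates(keep='last', ignore_index=True)
# output
col1 col2 col3
0 11 3 B
1 25 25 B
2 10 2 A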
If you want to drop duplicate rows entirely (without keeping any occurrence), you can pass the keep=False parameter (note that it is the boolean False, not the string 'False'),
df.drop_duplicates(keep=False)
# output
col1 col2 col3
1 11 3 B
2 25 25 B
As you can see in the above example, the duplicated rows were dropped entirely, without keeping any occurrence.
Drop duplicate rows based on specific columns
By default, the drop_duplicates() function drops duplicate rows based on all columns. If you want to drop duplicate rows based on specific columns, pass the subset=['column_names'] parameter.
For example, drop duplicate rows based on col3 (you can also pass the keep parameter to keep the preferred row),
df.drop_duplicates(subset='col3')
# output
col1 col2 col3
0 10 2 A
1 11 3 B
Drop duplicate rows based on multiple columns, e.g. col1 and col3 (you can also pass the keep parameter to keep the preferred row, as shown in the sketch after the output below),
df.drop_duplicates(subset=['col1', 'col3'])
# output
col1 col2 col3
0 10 2 A
1 11 3 B
2 25 25 B
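For instance, a short sketch combining subset with keep='last' on the same df, which keeps the last occurrence of each (col1, col3) pair,
df.drop_duplicates(subset=['col1', 'col3'], keep='last')
# output
col1 col2 col3
1 11 3 B
2 25 25 B
3 10 2 A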
Drop duplicate rows and update DataFrame
If you want to update the current DataFrame by dropping the duplicate rows in place, you can pass the inplace=True parameter. For example, drop the duplicated rows and update the DataFrame at the same time,
df.drop_duplicates(inplace=True)
df
# output
col1 col2 col3
0 10 2 A
1 11 3 B
2 25 25 B
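If you prefer not to modify the DataFrame in place, an equivalent approach is to assign the result back to the same variable; a quick sketch,
# equivalent without inplace: reassign the deduplicated result to df
df = df.drop_duplicates()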
Drop duplicates from a pandas Series
You can also use the drop_duplicates() function to drop duplicated values from a pandas Series. The other parameters of drop_duplicates(), such as keep and inplace, also apply to a pandas Series.
# create pandas series
dfs = pd.Series(['A', 'B', 'B', 'C', 'D'])
dfs.drop_duplicates()
# output
0 A
1 B
3 C
4 D
dtype: object
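For example, a short sketch of keep='last' on the same Series, which keeps the last occurrence of the duplicated value 'B',
dfs.drop_duplicates(keep='last')
# output
0 A
2 B
3 C
4 D
dtype: object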
Identifying duplicate rows
You can use the pandas duplicated() function to identify duplicated rows as a boolean Series (True if the row is a duplicate). The examples below assume the original four-row DataFrame created at the beginning of the article (re-create it if you ran the inplace example above).
For example, identify duplicated rows based on all columns in the DataFrame,
df.duplicated()
# output
0 False
1 False
2 False
3 True
dtype: bool
The last row of the DataFrame is a duplicate, as it returns True.
Now, identify duplicated rows based on specific columns in the DataFrame,
df.duplicated(subset='col3')
# output
0 False
1 False
2 True
3 True
dtype: bool
# get duplicated rows
df.loc[df.duplicated(subset='col3')]
# output
col1 col2 col3
2 25 25 B
3 10 2 A
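Conversely, you can invert the boolean Series with the ~ operator to select only the non-duplicated rows (this gives the same result as drop_duplicates(subset='col3')); a minimal sketch,
# get non-duplicated (unique) rows by inverting the boolean mask
df.loc[~df.duplicated(subset='col3')]
# output
col1 col2 col3
0 10 2 A
1 11 3 B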
In addition, you can also pass the keep parameter to control which occurrences of the duplicated rows are flagged. For example, pass keep='last' to flag the first occurrence as duplicated (the last occurrence is treated as the kept row),
df.duplicated(keep='last')
# output
0 True
1 False
2 False
3 False
dtype: bool
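Similar to drop_duplicates(), duplicated() also accepts keep=False, which flags every occurrence of a duplicated row as True; a short sketch on the same df,
df.duplicated(keep=False)
# output
0 True
1 False
2 False
3 True
dtype: bool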
Counting duplicated rows in pandas DataFrame
You can also count the duplicated rows in a pandas DataFrame based on duplicated values in single or multiple columns.
Count duplicated rows based on all columns,
df.duplicated().sum()
# output
1
One row is duplicated based on all columns in the DataFrame.
Count duplicated rows based on specific columns,
df.duplicated(subset='col3').sum()
# output
2
Two rows are duplicated based on the col3 values in the DataFrame.
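Likewise, a brief sketch counting duplicated rows based on multiple columns (here col1 and col3, using the same four-row df) by passing a list to subset,
df.duplicated(subset=['col1', 'col3']).sum()
# output
1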
Summary
In this article, you learned how to use the pandas drop_duplicates() and duplicated() functions to identify and drop duplicated rows in a DataFrame and Series. In addition, you also learned how to identify and count the duplicated rows in a DataFrame.
Enhance your skills with courses on Python and pandas
- Python for Data Analysis: Pandas & NumPy
- Mastering Data Analysis with Pandas: Learning Path Part 1
- Data Analysis Using Python
- Python for Everybody Specialization
- Python for Data Science, AI & Development
- Python 3 Programming Specialization
If you have any questions, comments or recommendations, please email me at reneshbe@gmail.com
This work is licensed under a Creative Commons Attribution 4.0 International License