One of the annoying things you have to deal with in a large data set is duplicate rows. But this become very easy and simple if you use Pandas. For those of you who are not familiar with Pandas, it is an open source Python library that provides functions and data structure for data analysis. The data structure in Pandas that is used to store tabular data is called a Data Frame. In this article you will learn how to identify and remove duplicate rows from a Pandas data frame.
To use Pandas library in your Python program, you need to import it first.
import pandas as pd
Now we create a dataframe from a dictionary.
data = {"Name": ["Tom", "Jack", "Ella", "Jeff", "Tom"], "Age": [30, 28, 24, 45, 30"], "Height":[165, 160, 160, 170, 165], "Weight":[70, 60, 60, 82, 70] } df = pd.DataFrame(data) print(df.head())
The output will be:
Name | Age | Height | Weight | |
0 | Tom | 30 | 165 | 70 |
1 | Jack | 28 | 160 | 60 |
2 | Ella | 24 | 160 | 60 |
3 | Jeff | 45 | 170 | 82 |
4 | Tom | 30 | 165 | 70 |
Finding Duplicate Rows
In the sample dataframe that we have created, you might have noticed that rows 0 and 4 are exactly the same. You can identify such duplicate rows in a Pandas dataframe by calling the duplicated function. The duplicated
function returns a Boolean series with value True
indicating a duplicate row.
print(df.duplicated())
Output:
0 False 1 False 2 False 3 False 4 True dtype: bool
By default the first row in a duplicated set is marked as False
and all others marked as True
. In the above example rows 0 and 4 are duplicates but only row 4 is marked as True
. You can change this behaviour by setting the keep parameter to one of the following values.
first
: First occurance in the duplicate set is marked as False and others marked True. This is the defaultlast
: Last occurance among the duplicates is marked as False and others marked True.False
: All duplicates are marked as True.
For example to mark all rows in a duplicate set as True:
print(df.duplicated(keep="last"))
Outputs
0 True 1 False 2 False 3 False 4 False dtype: bool
When finding duplicates instead of matching all columns in a row, you can specify a subset of columns to be checked. For example, to find rows that have the sane values in height and weight columns:
print(df.duplicated(["Height","Weight"]))
Output
0 False 1 False 2 True 3 False 4 True dtype: bool
Displaying duplicate rows
To display duplicated rows only, you can filter the dataframe like this:
print(df[df.duplicated(keep=False)])
Output:
Name Age Height Weight 0 Tom 30 165 70 4 Tom 30 165 70
Removing Duplicate Rows
You can remove duplicate rows from a Pandas dataframe using the drop_duplicates function. drop_duplicates
function returns a dataframe after removing duplicated rows. By default, the first occurance among the duplicates is retained and others removed. You can change this default behavior by setting the keep
parameter. The following values are possible.
first
: First occurance among the duplicates is retained and others removed.last
: Keep the last occurance of the duplicated row and remove others in the set.False
: Drop all duplicate rows.
Example
df.drop_duplicates()
Output
Name Age Height Weight 0 Tom 30 165 70 1 Jack 28 160 60 2 Ella 24 160 60 3 Jeff 45 170 82
df.drop_duplicates(keep="last")
Output
Name Age Height Weight 1 Jack 28 160 60 2 Ella 24 160 60 3 Jeff 45 170 82 4 Tom 30 165 70
Same as in duplicated method, you can also specify a subset of columns like below.
df.drop_duplicates(["Height","Weight"])
Output
Name Age Height Weight 0 Tom 30 165 70 1 Jack 28 160 60 3 Jeff 45 170 82