How to Drop Duplicates in Python Pandas (2023)

In this tutorial, we will learn how to drop duplicates in Python Pandas using the DataFrame.drop_duplicates() method. Duplicate rows often creep into datasets, for example through repeated imports or merges, and removing them is an essential part of data cleaning. drop_duplicates() removes duplicate rows, where "duplicate" is judged either on all columns or only on the columns we specify.

Syntax of drop_duplicates()

The drop_duplicates() method in Python Pandas has the following syntax, shown here with its default values:

df.drop_duplicates(subset=None, keep='first', inplace=False, ignore_index=False)

Here is a breakdown of the different parameters:

  • subset: Specifies the column(s) to consider when identifying duplicates. It can be a single column name or a list of column names. The default, None, compares all columns.
  • keep: Determines which of the duplicate rows to keep. It can be set to 'first' (the default), 'last', or False.
    • 'first': Keeps the first occurrence of each duplicated row and removes the rest.
    • 'last': Keeps the last occurrence of each duplicated row and removes the rest.
    • False: Removes every row that has a duplicate, keeping none of them.
  • inplace: If False (the default), returns a new DataFrame and leaves the original unchanged. If True, modifies the DataFrame in place and returns None.
  • ignore_index: If False (the default), the surviving rows keep their original index labels. If True, the result is relabeled 0 to n-1.
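The effect of inplace and ignore_index is easiest to see on a small example. The following is a minimal sketch; the DataFrame and its column names (name, city) are made up for illustration and are not part of the original tutorial.

```python
import pandas as pd

# Hypothetical sample data: row 2 is an exact copy of row 0.
df = pd.DataFrame({
    "name": ["Ann", "Bob", "Ann", "Cid"],
    "city": ["Oslo", "Rome", "Oslo", "Lima"],
})

# Default call: returns a new DataFrame; surviving rows keep their index labels.
deduped = df.drop_duplicates()
print(deduped.index.tolist())     # [0, 1, 3]

# ignore_index=True relabels the result from 0 to n-1.
renumbered = df.drop_duplicates(ignore_index=True)
print(renumbered.index.tolist())  # [0, 1, 2]

# inplace=True changes df itself and returns None.
result = df.drop_duplicates(inplace=True)
print(result)                     # None
print(len(df))                    # 3
```

Because inplace=True returns None, assigning its result back to df (df = df.drop_duplicates(inplace=True)) would replace the DataFrame with None, which is a common mistake.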

Now, let's explore each aspect of dropping duplicates in Python Pandas in more detail.

Dropping Duplicates from the Entire Dataset

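To remove duplicates from the entire dataset, we can simply call drop_duplicates() without specifying any parameters. Two rows count as duplicates only if they have identical values in every column. For each group of identical rows, the first one is kept and the others are removed. The call returns a new DataFrame, so assign the result to a variable to keep it.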

df.drop_duplicates()
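Before dropping anything, it can help to see which rows pandas regards as duplicates. The sketch below uses the related duplicated() method for that; the order data is hypothetical and only serves as an illustration.

```python
import pandas as pd

# Hypothetical sample data: row 2 repeats row 0 in every column.
df = pd.DataFrame({
    "order_id": [101, 102, 101, 103],
    "item": ["pen", "ink", "pen", "pen"],
})

# duplicated() marks each row that repeats an earlier row.
flags = df.duplicated()
print(flags.tolist())    # [False, False, True, False]

# drop_duplicates() removes exactly the rows flagged above.
unique_rows = df.drop_duplicates()
print(len(unique_rows))  # 3
```

Row 3 shares its item value with row 0 but has a different order_id, so it is not a duplicate when all columns are compared.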

Dropping Duplicates Based on a Column

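To drop duplicates based on a specific column, we can use the subset parameter and pass the column name as its value. Rows are then compared on that column only: for each repeated value, the first row is kept and later rows with the same value are removed, even if they differ in other columns.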

df.drop_duplicates(subset='column_name')
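As a minimal sketch of the difference between comparing all columns and comparing one column, consider the following. The email and plan columns are hypothetical names chosen for the example.

```python
import pandas as pd

# Hypothetical sample data: rows 0 and 2 share an email but differ in plan.
df = pd.DataFrame({
    "email": ["a@x.com", "b@x.com", "a@x.com"],
    "plan": ["free", "pro", "pro"],
})

# Comparing all columns finds no duplicates, so all three rows survive.
print(len(df.drop_duplicates()))  # 3

# Comparing only "email" drops row 2, the second row for a@x.com.
by_email = df.drop_duplicates(subset="email")
print(by_email["plan"].tolist())  # ['free', 'pro']
```

Note that the dropped row carried different data in the plan column, so deduplicating on a subset can discard information that the kept row does not contain.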

Keeping the Last Occurrence of Duplicates

By default, drop_duplicates() keeps the first occurrence of each duplicated row and removes the rest. We can change this behavior by setting the keep parameter to 'last', which keeps the last occurrence instead. "First" and "last" refer to the current row order of the DataFrame, so sort the data beforehand if the order matters.

df.drop_duplicates(subset='column_name', keep='last')
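One plausible use of keep='last' is retaining the most recent record per key when rows are already in chronological order. The sensor data below is hypothetical and assumes that ordering.

```python
import pandas as pd

# Hypothetical sample data, assumed to be ordered from oldest to newest.
df = pd.DataFrame({
    "sensor": ["s1", "s2", "s1", "s2"],
    "reading": [10, 20, 11, 21],
})

# Keep only the last row seen for each sensor.
latest = df.drop_duplicates(subset="sensor", keep="last")
print(latest["reading"].tolist())  # [11, 21]
print(latest.index.tolist())       # [2, 3]
```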

Dropping Duplicates from Multiple Columns

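To drop duplicates based on multiple columns, we can pass a list of column names to the subset parameter. A row is treated as a duplicate only when its combination of values across all the listed columns matches that of an earlier row. Matching in just one of the listed columns is not enough.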

df.drop_duplicates(subset=['column1', 'column2', 'column3'])
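The following sketch illustrates that the listed columns are compared as a combination. The first, last, and score columns are hypothetical names used only for this example.

```python
import pandas as pd

# Hypothetical sample data: only row 3 repeats an earlier (first, last) pair.
df = pd.DataFrame({
    "first": ["Ann", "Ann", "Bob", "Ann"],
    "last": ["Lee", "Kim", "Lee", "Lee"],
    "score": [1, 2, 3, 4],
})

# Rows 1 and 2 each share one value with row 0, but not the full pair,
# so they are kept. Row 3 matches row 0 on both columns and is dropped.
pairs = df.drop_duplicates(subset=["first", "last"])
print(pairs["score"].tolist())  # [1, 2, 3]
```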

Dropping Duplicates and Keeping None

If we want to remove every row that has a duplicate, including the first and last occurrences, we can set the keep parameter to False. Only rows whose values appear exactly once remain.

df.drop_duplicates(subset='column_name', keep=False)
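A minimal sketch of keep=False follows, using hypothetical user and amount columns. It shows that neither copy of a repeated value survives.

```python
import pandas as pd

# Hypothetical sample data: user "u1" appears twice.
df = pd.DataFrame({
    "user": ["u1", "u2", "u1", "u3"],
    "amount": [5, 6, 7, 8],
})

# keep=False drops both "u1" rows and leaves only users seen exactly once.
only_unique = df.drop_duplicates(subset="user", keep=False)
print(only_unique["user"].tolist())  # ['u2', 'u3']
```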

Conclusion

In this tutorial, we have learned how to drop duplicates in Python Pandas using the drop_duplicates() method. We covered dropping rows that are duplicated across all columns, dropping duplicates based on one column or a combination of columns, keeping the last occurrence instead of the first, and removing every occurrence of a duplicated row. With these techniques, you can remove duplicate entries as part of cleaning your datasets.

Remember, dropping duplicates is just one aspect of data cleaning. Depending on your specific requirements, you may need to perform additional data preprocessing steps.

Article information

Author: Ouida Strosin DO

Last Updated: 18/12/2023