How to Find and Drop Duplicate Columns in a Pandas DataFrame (2024)

In this article, we will discuss how to find duplicate columns in a Pandas DataFrame and drop them. While Pandas provides direct methods for finding duplicate rows, such as DataFrame.duplicated() and DataFrame.drop_duplicates(), it has no dedicated method for finding columns with identical contents. Therefore, we need to write our own helper function for that.
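For comparison, here is a minimal sketch of the row-level methods mentioned above. The frame rows_df and its values are made up for illustration and are not part of the article's data.

```python
import pandas as pd

# Small illustrative frame (hypothetical data); rows 0 and 2 are identical
rows_df = pd.DataFrame({"Name": ["Riti", "Aadi", "Riti"],
                        "Age": [30, 16, 30]})

# True for every row that repeats an earlier row
row_mask = rows_df.duplicated()
print(row_mask.tolist())  # [False, False, True]

# Keep only the first occurrence of each row
unique_rows = rows_df.drop_duplicates()
print(unique_rows)
```

Both methods compare whole rows by default, which is why they cannot be applied to columns directly.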

First, let's create a DataFrame with duplicate columns. We will use the following data:

import pandas as pd

# List of Tuples
students = [('jack', 34, 'Sydney', 34, 'Sydney', 34),
            ('Riti', 30, 'Delhi', 30, 'Delhi', 30),
            ('Aadi', 16, 'New York', 16, 'New York', 16),
            ('Riti', 30, 'Delhi', 30, 'Delhi', 30),
            ('Riti', 30, 'Delhi', 30, 'Delhi', 30),
            ('Riti', 30, 'Mumbai', 30, 'Mumbai', 30),
            ('Aadi', 40, 'London', 40, 'London', 40),
            ('Sachin', 30, 'Delhi', 30, 'Delhi', 30)]

# Create a DataFrame object
dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Marks', 'Address', 'Pin'])
print("Original DataFrame:")
print(dfObj)

The DataFrame created looks like this:

     Name  Age      City  Marks   Address  Pin
0    jack   34    Sydney     34    Sydney   34
1    Riti   30     Delhi     30     Delhi   30
2    Aadi   16  New York     16  New York   16
3    Riti   30     Delhi     30     Delhi   30
4    Riti   30     Delhi     30     Delhi   30
5    Riti   30    Mumbai     30    Mumbai   30
6    Aadi   40    London     40    London   40
7  Sachin   30     Delhi     30     Delhi   30

As we can observe, this DataFrame has 3 duplicate columns: Marks and Pin repeat the contents of Age, and Address repeats the contents of City. Let's see how to find them.

Finding Duplicate Columns

To find the duplicate columns, we need to iterate over the DataFrame column-wise and check whether any later column has the same contents. If it does, the name of that later column is stored in a set of duplicate column names, so the first occurrence of each column is never marked as a duplicate.

Here's the code to find duplicate columns:

def getDuplicateColumns(df):
    ''' Get a list of duplicate columns.
    It will iterate over all the columns in the dataframe and find the columns whose contents are duplicate.
    :param df: Dataframe object
    :return: List of columns whose contents are duplicates.
    '''
    duplicateColumnNames = set()

    # Iterate over all the columns in the dataframe
    for x in range(df.shape[1]):
        # Select column at xth index
        col = df.iloc[:, x]

        # Iterate over all the columns in DataFrame from (x+1)th index till end
        for y in range(x + 1, df.shape[1]):
            # Select column at yth index
            otherCol = df.iloc[:, y]

            # Check if two columns at x and y index are equal
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])

    return list(duplicateColumnNames)
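Because the function above collects names in a set, the order of the returned list is not guaranteed. The following is a sketch of one possible variant that returns the names in column order and skips columns already known to be copies. The name get_duplicate_columns_ordered and the small demo_df frame are hypothetical and used only for illustration.

```python
import pandas as pd

def get_duplicate_columns_ordered(df):
    """Return names of columns that repeat an earlier column, in column order."""
    duplicate_positions = set()
    for x in range(df.shape[1]):
        if x in duplicate_positions:
            continue  # already known to be a copy of an earlier column
        col = df.iloc[:, x]
        for y in range(x + 1, df.shape[1]):
            # Compare only columns not yet flagged as duplicates
            if y not in duplicate_positions and col.equals(df.iloc[:, y]):
                duplicate_positions.add(y)
    return [df.columns[i] for i in sorted(duplicate_positions)]

# Hypothetical demo data with the same column layout as the article's example
demo_df = pd.DataFrame({
    "Age": [34, 30],
    "City": ["Sydney", "Delhi"],
    "Marks": [34, 30],
    "Address": ["Sydney", "Delhi"],
    "Pin": [34, 30],
})
print(get_duplicate_columns_ordered(demo_df))  # ['Marks', 'Address', 'Pin']
```

Skipping flagged columns also avoids repeating comparisons, which can matter when a frame has many columns.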

Now, let's use this function to find the duplicate columns in the dfObj DataFrame:

# Get list of duplicate columns
duplicateColumnNames = getDuplicateColumns(dfObj)

print('Duplicate Columns are as follows:')
for col in duplicateColumnNames:
    print('Column name:', col)

The output will be similar to the following (the order of the names may vary, because they are collected in a set):

Duplicate Columns are as follows:
Column name: Address
Column name: Marks
Column name: Pin

We have successfully identified the duplicate columns in the DataFrame.
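A shorter alternative, shown here only as a sketch and not as the method used in this article, is to transpose the DataFrame so that columns become rows and then reuse the row-level DataFrame.duplicated() method. The demo_df frame is hypothetical. Transposing a frame with mixed column types produces object data and can be slow on large frames, and values are then compared without regard to the original column dtype, so results may differ from Series.equals() in edge cases.

```python
import pandas as pd

# Hypothetical demo data: Marks copies Age, Address copies City
demo_df = pd.DataFrame({
    "Age": [34, 30, 16],
    "City": ["Sydney", "Delhi", "New York"],
    "Marks": [34, 30, 16],
    "Address": ["Sydney", "Delhi", "New York"],
})

# After transposing, each original column is a row, so duplicated()
# marks every column that repeats an earlier one
dup_mask = demo_df.T.duplicated()

# Use the mask as a plain boolean array to pick the duplicate names
dup_names = demo_df.columns[dup_mask.to_numpy()].tolist()
print(dup_names)  # ['Marks', 'Address']
```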

Dropping Duplicate Columns

To remove the duplicate columns, we can pass the list of duplicate column names returned by our function to the DataFrame.drop() function. drop() returns a new DataFrame and leaves the original unchanged.

Here's the code to drop the duplicate columns:

# Delete duplicate columns
newDf = dfObj.drop(columns=getDuplicateColumns(dfObj))

print("Modified DataFrame:")
print(newDf)

The output will be:

Modified DataFrame:
     Name  Age      City
0    jack   34    Sydney
1    Riti   30     Delhi
2    Aadi   16  New York
3    Riti   30     Delhi
4    Riti   30     Delhi
5    Riti   30    Mumbai
6    Aadi   40    London
7  Sachin   30     Delhi

We have successfully dropped the duplicate columns from the DataFrame.
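Dropping can also be done in a single step by selecting the columns to keep with a boolean mask instead of passing names to drop(). This is a sketch under the assumption that a transpose-based comparison is acceptable for your data; demo_df is hypothetical.

```python
import pandas as pd

# Hypothetical demo data: Marks copies Age, Address copies City
demo_df = pd.DataFrame({
    "Age": [34, 30, 16],
    "City": ["Sydney", "Delhi", "New York"],
    "Marks": [34, 30, 16],
    "Address": ["Sydney", "Delhi", "New York"],
})

# True for the first occurrence of each distinct column
keep_mask = ~demo_df.T.duplicated().to_numpy()

# Select from the original frame so the column dtypes are preserved
deduped = demo_df.loc[:, keep_mask]
print(deduped.columns.tolist())  # ['Age', 'City']
```

Selecting from the original frame, rather than transposing back after dropping duplicate rows, keeps Age as an integer column instead of converting it to object.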

Conclusion

In this article, we discussed how to find and drop duplicate columns in a Pandas DataFrame. We wrote a custom function that compares columns pairwise with Series.equals() to identify the duplicates, and used the DataFrame.drop() function to remove them.

Removing duplicate columns reduces memory use and keeps the same values from being counted twice in later analysis and processing.


Author: Mr. See Jast
