In this article, we will discuss how to find duplicate columns in a Pandas DataFrame and drop them. While Pandas provides a direct API for finding duplicate rows, there is no equivalent for duplicate columns, so we need to write our own helper function.
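For context, here is a minimal sketch of the row-oriented API that Pandas does provide, namely DataFrame.duplicated() and DataFrame.drop_duplicates(); the tiny DataFrame used here is purely illustrative and is not the data used in the rest of the article.
import pandas as pd
# A tiny illustrative DataFrame with one repeated row
df = pd.DataFrame({'name': ['jack', 'Riti', 'Riti'],
                   'city': ['Sydney', 'Delhi', 'Delhi']})
# duplicated() flags every row that repeats an earlier row
print(df.duplicated())        # row 2 is flagged as True
# drop_duplicates() removes those repeated rows in one call
print(df.drop_duplicates())
There is no comparable one-call API that works column-wise, which is what the rest of this article addresses.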
First, let's create a DataFrame with duplicate columns. We will use the following data:
import pandas as pd

# List of Tuples
students = [('jack', 34, 'Sydney', 34, 'Sydney', 34),
            ('Riti', 30, 'Delhi', 30, 'Delhi', 30),
            ('Aadi', 16, 'New York', 16, 'New York', 16),
            ('Riti', 30, 'Delhi', 30, 'Delhi', 30),
            ('Riti', 30, 'Delhi', 30, 'Delhi', 30),
            ('Riti', 30, 'Mumbai', 30, 'Mumbai', 30),
            ('Aadi', 40, 'London', 40, 'London', 40),
            ('Sachin', 30, 'Delhi', 30, 'Delhi', 30)]
# Create a DataFrame object
dfObj = pd.DataFrame(students, columns=['Name', 'Age', 'City', 'Marks', 'Address', 'Pin'])
print("Original DataFrame:")
print(dfObj)
print("Original DataFrame:")
print(dfObj)
The DataFrame created looks like this:
| | Name | Age | City | Marks | Address | Pin |
|---|---|---|---|---|---|---|
| 0 | jack | 34 | Sydney | 34 | Sydney | 34 |
| 1 | Riti | 30 | Delhi | 30 | Delhi | 30 |
| 2 | Aadi | 16 | New York | 16 | New York | 16 |
| 3 | Riti | 30 | Delhi | 30 | Delhi | 30 |
| 4 | Riti | 30 | Delhi | 30 | Delhi | 30 |
| 5 | Riti | 30 | Mumbai | 30 | Mumbai | 30 |
| 6 | Aadi | 40 | London | 40 | London | 40 |
| 7 | Sachin | 30 | Delhi | 30 | Delhi | 30 |
As we can see, this DataFrame has three duplicate columns: Marks and Pin duplicate the Age column, and Address duplicates the City column. Let's see how to find them.
Finding Duplicate Columns
To find the duplicate columns, we iterate over the DataFrame column by column and check whether any later column has the same contents. If it does, that later column's name is added to a set of duplicate column names.
Here's the code to find duplicate columns:
def getDuplicateColumns(df):
    ''' Get a list of duplicate columns.
    It will iterate over all the columns in the dataframe and find the columns whose contents are duplicate.
    :param df: Dataframe object
    :return: List of columns whose contents are duplicates.
    '''
    duplicateColumnNames = set()
    # Iterate over all the columns in the dataframe
    for x in range(df.shape[1]):
        # Select column at xth index
        col = df.iloc[:, x]
        # Iterate over all the columns in DataFrame from (x+1)th index till end
        for y in range(x + 1, df.shape[1]):
            # Select column at yth index
            otherCol = df.iloc[:, y]
            # Check if two columns at x and y index are equal
            if col.equals(otherCol):
                duplicateColumnNames.add(df.columns.values[y])
    return list(duplicateColumnNames)
Now, let's use this function to find the duplicate columns in the dfObj DataFrame:
# Get list of duplicate columns
duplicateColumnNames = getDuplicateColumns(dfObj)
print('Duplicate Columns are as follows:')
for col in duplicateColumnNames:
print('Column name:', col)
The output will be (the order may vary, since the names are collected in a set):
Duplicate Columns are as follows:
Column name: Address
Column name: Marks
Column name: Pin
We have successfully identified the duplicate columns in the DataFrame.
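As an aside, the same list can be obtained more concisely by transposing the DataFrame and reusing the row-oriented duplicated() API. This is only a sketch: it assumes the column values are hashable and the DataFrame is small enough that the transpose (which copies all the data) is acceptable.
# Sketch: find duplicate columns via transpose + duplicated()
# duplicated() on the transposed frame flags every column that repeats an earlier one
dupMask = dfObj.T.duplicated()
duplicateColumnNames = dupMask[dupMask].index.tolist()
print(duplicateColumnNames)  # expected: ['Marks', 'Address', 'Pin']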
Dropping Duplicate Columns
To remove the duplicate columns, we can pass the list of duplicate column names returned by our helper function to the DataFrame.drop() function.
Here's the code to drop the duplicate columns:
# Delete duplicate columns
newDf = dfObj.drop(columns=getDuplicateColumns(dfObj))
print("Modified DataFrame:")
print(newDf)
The output will be:
Modified DataFrame:
Name Age City
0 jack 34 Sydney
1 Riti 30 Delhi
2 Aadi 16 New York
3 Riti 30 Delhi
4 Riti 30 Delhi
5 Riti 30 Mumbai
6 Aadi 40 London
7 Sachin 30 Delhi
We have successfully dropped the duplicate columns from the DataFrame.
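If you prefer a one-liner, the transpose trick sketched earlier can also drop the duplicates directly by keeping only the columns that are not flagged as repeats; again, this is just a sketch with the same assumptions about hashable values and DataFrame size.
# Sketch: keep only the first occurrence of each distinct column
newDf = dfObj.loc[:, ~dfObj.T.duplicated()]
print(newDf.columns.tolist())  # expected: ['Name', 'Age', 'City']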
Conclusion
In this article, we discussed how to find and drop duplicate columns in a Pandas DataFrame. We wrote a custom helper function to identify the duplicate columns and used the DataFrame.drop() function to remove them. By following these steps, you can effectively manage duplicate columns in your DataFrame.
Remember, removing duplicate columns can help improve the quality and efficiency of your data analysis and processing.