How to compare two dataframes in Python

Learn how to compare two DataFrames in Python. Discover different methods, tips, real-world applications, and how to debug common errors.

Published on: Mon, Apr 6, 2026
Updated on: Wed, Apr 8, 2026
The Replit Team

Comparing dataframes is a common task in data analysis with Python. It helps you validate data, track changes, and ensure consistency between datasets. Python offers several powerful methods for this purpose.

In this article, you'll learn several techniques for comparing dataframes. We cover everything from simple equality checks to advanced methods, along with practical tips, real-world applications, and debugging advice for your projects.

Using the == operator for element-wise comparison

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 30], 'B': [4, 5, 6]})
result = df1 == df2
print(result)

Output:

       A     B
0   True  True
1   True  True
2  False  True

The == operator offers a direct, element-wise comparison between two dataframes. It produces a new dataframe of the same dimensions, but filled with boolean values. A True value means the elements at that position match, whereas False highlights a difference.

This technique is particularly useful for pinpointing the exact location of discrepancies. In the example, the output shows False only for the cell containing 30 in df2 because it doesn't match the corresponding 3 in df1. It’s a quick and visual way to spot inconsistencies.
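If you want the exact (row, column) coordinates of every mismatch rather than a full boolean grid, you can stack the inverse mask. Here's a small sketch building on the frames above:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 30], 'B': [4, 5, 6]})

# Boolean mask of mismatches, then stack() to get (row, column) pairs
mismatch = df1 != df2
locations = mismatch.stack()
print(locations[locations].index.tolist())  # [(2, 'A')]
```

This is handy when the dataframes are large and scanning a grid of booleans by eye isn't practical.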

Basic dataframe comparison methods

Moving beyond the granular, boolean output of the == operator, pandas also equips you with methods for checking overall equality and isolating specific discrepancies.

Using the equals() method for exact equality

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
are_equal = df1.equals(df2)
print(f"Dataframes are equal: {are_equal}")

Output:

Dataframes are equal: True

For a simple, definitive answer on whether two dataframes are identical, you can use the equals() method. It returns a single boolean—True if the dataframes are perfect duplicates and False otherwise. This is different from the == operator, which gives you a detailed, element-by-element breakdown.

  • A True result means everything matches: the values, the column and row order, and the data types.
  • Even a minor difference, like a float instead of an integer (e.g., 3.0 vs 3), will cause equals() to return False.

Using the compare() method to spot differences

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 30], 'B': [4, 5, 60]})
diff = df1.compare(df2)
print(diff)

Output:

     A          B
  self other self other
2    3    30    6    60

When you need to see exactly what changed, the compare() method is your best bet. It returns a new dataframe that only shows the specific cells where the two dataframes differ. This is much cleaner than the boolean output from the == operator, as it filters out all the matching data.

  • The resulting dataframe uses a multi-level column index, with self and other to show the differing values side by side.
  • self refers to the original dataframe (df1), and other refers to the one it was compared against (df2).
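By default, compare() drops every matching row and column. If you're on pandas 1.1 or later, two optional flags give you more context. Here's a quick sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 30], 'B': [4, 5, 60]})

# keep_shape=True preserves all rows and columns; unchanged cells become NaN
diff_full = df1.compare(df2, keep_shape=True)
print(diff_full)

# keep_equal=True additionally shows the matching values instead of NaN
diff_all = df1.compare(df2, keep_shape=True, keep_equal=True)
print(diff_all)
```

keep_shape is useful when downstream code expects the original dimensions; keep_equal gives you a full side-by-side view with the differences still easy to spot.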

Comparing specific columns between dataframes

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, 30], 'B': [4, 5, 6]})
column_equal = df1['B'].equals(df2['B'])
print(f"Column B is equal: {column_equal}")

Output:

Column B is equal: True

Sometimes you only need to verify a single column. You can apply the equals() method directly to columns by selecting them from each dataframe, like with df1['B'] and df2['B']. This lets you perform a focused comparison, ignoring any differences elsewhere in the dataframes.

  • The expression df1['B'].equals(df2['B']) checks if column B is identical in both dataframes, returning a single True or False.
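The same idea extends to several columns at once: select the same list of columns from both frames and call equals() on the result. A short sketch (the extra C column here is added just for illustration):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6], 'C': [7, 8, 9]})
df2 = pd.DataFrame({'A': [1, 2, 30], 'B': [4, 5, 6], 'C': [7, 8, 9]})

# Compare only the columns you care about, ignoring differences in A
cols = ['B', 'C']
subset_equal = df1[cols].equals(df2[cols])
print(subset_equal)  # True
```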

Advanced dataframe comparison techniques

When you need more than a simple equality check, advanced methods offer powerful ways to handle structural differences and compare data with custom precision.

Finding rows present in one dataframe but not the other

import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [1, 2, 4]})
only_in_df1 = df1[~df1['key'].isin(df2['key'])]
print(only_in_df1)

Output:

  key  value
2   C      3

To find rows unique to one dataframe, you can combine boolean indexing with the isin() method. This technique is perfect for identifying records that exist in df1 but are missing from df2, based on a shared key column.

  • The isin() method first checks which values in df1['key'] are also present in df2['key'], returning a boolean Series.
  • The ~ operator then inverts this result, effectively selecting keys that are unique to df1.
  • This final boolean Series is used to filter df1, showing only the rows that don't have a match.
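The same pattern works in both directions, giving you the full symmetric difference. Here's a short sketch reusing the key column from above:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [1, 2, 4]})

# Rows unique to each side, based on the shared key column
only_in_df1 = df1[~df1['key'].isin(df2['key'])]
only_in_df2 = df2[~df2['key'].isin(df1['key'])]
print(only_in_df1)  # key 'C'
print(only_in_df2)  # key 'D'
```

Running both directions is a common pattern in reconciliation tasks, where records missing from either source are equally interesting.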

Using merge to identify differences

import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [1, 20, 4]})
merged = pd.merge(df1, df2, on='key', how='outer', indicator=True, suffixes=('_1', '_2'))
print(merged)

Output:

  key  value_1  value_2      _merge
0   A      1.0      1.0        both
1   B      2.0     20.0        both
2   C      3.0      NaN   left_only
3   D      NaN      4.0  right_only

Using the merge() function with an outer join is a powerful method for a full comparison. By setting how='outer', you create a combined dataframe that includes all rows from both the original and the comparison dataframe, ensuring no records are lost.

The key to this technique is the indicator=True argument. It adds a special _merge column that explicitly tells you the source of each row.

  • left_only means the row exists only in the first dataframe.
  • right_only means it's unique to the second.
  • both indicates the key exists in both dataframes, allowing you to compare their values side-by-side.
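Once the merged frame exists, the _merge column makes follow-up filtering straightforward. Here's a sketch that splits the result into unmatched rows and changed values (the value_1 and value_2 names come from the suffixes argument above):

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [1, 20, 4]})

merged = pd.merge(df1, df2, on='key', how='outer',
                  indicator=True, suffixes=('_1', '_2'))

# Rows that exist on only one side
unmatched = merged[merged['_merge'] != 'both']

# Rows present on both sides but with different values
changed = merged[(merged['_merge'] == 'both') &
                 (merged['value_1'] != merged['value_2'])]
print(unmatched)
print(changed)
```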

Using assert_frame_equal with custom tolerances

import pandas as pd
import numpy as np
from pandas.testing import assert_frame_equal

df1 = pd.DataFrame({'A': [1.0001, 2.0002, 3.0003]})
df2 = pd.DataFrame({'A': [1.0002, 2.0001, 3.0004]})
try:
    assert_frame_equal(df1, df2, atol=0.001)
    print("Dataframes are equal within tolerance")
except AssertionError:
    print("Dataframes are different beyond tolerance")

Output:

Dataframes are equal within tolerance

When working with floating-point numbers, small precision errors can cause exact comparisons to fail unexpectedly. The assert_frame_equal function, found in the pandas.testing module, is built to handle this. It's especially useful in automated tests where you need to verify that two dataframes are "close enough."

  • You can set an acceptable margin of error using parameters like atol for absolute tolerance.
  • If all differences fall within this tolerance, the assertion passes silently.
  • If any difference exceeds the tolerance, the function raises an AssertionError, which is why it's often wrapped in a try...except block.
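assert_frame_equal also takes structural flags. For instance, check_dtype=False lets an integer column match a float column holding the same values, which comes up often when one dataframe has been through a merge. A small sketch:

```python
import pandas as pd
from pandas.testing import assert_frame_equal

df1 = pd.DataFrame({'A': [1, 2, 3]})        # int64 column
df2 = pd.DataFrame({'A': [1.0, 2.0, 3.0]})  # float64 column

# check_dtype=False ignores the int-vs-float mismatch
assert_frame_equal(df1, df2, check_dtype=False)
print("Equal when dtypes are ignored")
```

A related flag, check_like=True, ignores the order of rows and columns, which pairs well with the column-order fix discussed later in this article.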

Move faster with Replit

Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. While mastering individual methods like equals() and merge() is essential, you can use Agent 4 to move from piecing together techniques to building complete applications.

Instead of just comparing dataframes, describe the entire tool you want to build. Agent handles everything from writing the code and connecting to databases to deploying your app. You can create practical tools like:

  • A data validation utility that compares new datasets against a master file, flagging any discrepancies or missing rows.
  • A change-tracking dashboard that ingests two versions of a report and visualizes what was added, removed, or modified.
  • A reconciliation tool that matches records from two different sources and isolates the unmatched entries for review.

Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.

Common errors and challenges

Comparing dataframes can sometimes throw a curveball, especially with tricky NaN values, mismatched column orders, or different data types.

Handling NaN values when comparing dataframes

A common trip-up involves NaN (Not a Number) values. In pandas, a NaN value is never considered equal to another NaN. This means using the == operator will return False for cells that both contain NaN, which can be misleading. Fortunately, the equals() method is designed to handle this; it correctly treats NaNs in the same position as a match.

Dealing with column order issues in equals() method

The equals() method is also strict about structure. If two dataframes contain the exact same data but have their columns in a different order, the method will return False. To compare dataframes regardless of column sequence, you can sort the columns alphabetically in both dataframes before making the comparison. This ensures you're only checking the data itself, not the presentation.

Resolving data type mismatches in comparisons

Similarly, data type mismatches can cause comparisons to fail. For instance, a column of integers (like 1, 2) won't be considered equal to a column of floats (1.0, 2.0) by the equals() method. This is a frequent issue when data comes from mixed sources. You can resolve it by using the astype() method to standardize the data types across your dataframes before you run the comparison.

Handling NaN values when comparing dataframes

Comparing dataframes with missing data can be tricky. Since NaN values aren't considered equal to each other, an element-wise check with the == operator can produce a surprising result, even when the dataframes appear identical. The following code demonstrates this behavior.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, 5, 6]})

# This will not work as expected with NaN values
are_equal = (df1 == df2).all().all()
print(f"Dataframes are equal: {are_equal}")

Because NaN == NaN evaluates to False, the .all() method finds a False value in the comparison dataframe and reports that the dataframes are not equal. The code below shows how to get around this.

import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, 5, 6]})

# Use equals() which handles NaN values correctly
are_equal = df1.equals(df2)
print(f"Dataframes are equal: {are_equal}")

The equals() method is the right tool for this job. It's designed to know that two NaN values in the same spot are a match, which is why it returns True. You'll want to reach for this method whenever you're comparing dataframes with missing data, which often happens after cleaning or merging files. It gives you an accurate check where the == operator would fail.
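If you do need an element-wise view rather than a single boolean, you can build a NaN-aware mask yourself by combining the equality check with a joint-missingness check. Here's a sketch:

```python
import pandas as pd
import numpy as np

df1 = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': [1, 2, np.nan], 'B': [4, 5, 6]})

# A cell matches if the values are equal OR both are missing
mask = (df1 == df2) | (df1.isna() & df2.isna())
print(mask.all().all())  # True
```

This gives you the per-cell detail of the == operator while still treating paired NaNs as a match, like equals() does.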

Dealing with column order issues in equals() method

The equals() method’s strictness extends to column order. Even if two dataframes contain identical information, they won't be considered equal if their columns are arranged differently. This is a common source of confusion. The code below demonstrates this exact scenario.

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'B': [4, 5, 6], 'A': [1, 2, 3]}) # Different column order

# Direct comparison affected by column order
are_equal = df1.equals(df2)
print(f"Dataframes are equal: {are_equal}")

The equals() method returns False because the columns in df2 are in a different order. This structural mismatch causes the comparison to fail, even with identical data. The next example demonstrates how to resolve this.

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'B': [4, 5, 6], 'A': [1, 2, 3]}) # Different column order

# Sort columns before comparison
df1_sorted = df1.reindex(sorted(df1.columns), axis=1)
df2_sorted = df2.reindex(sorted(df2.columns), axis=1)
are_equal = df1_sorted.equals(df2_sorted)
print(f"Dataframes are equal: {are_equal}")

To bypass the strict column order requirement, you can reindex both dataframes to sort their columns alphabetically. The code achieves this with reindex(sorted(df.columns), axis=1). This forces the equals() method to focus only on the data, ignoring structural differences. It's a crucial step when you're comparing dataframes from different files or sources, as their column order isn't always guaranteed to be consistent.
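Row order can trip up equals() the same way. If two dataframes hold identical rows in a different sequence, sorting both by a stable key and resetting the index first makes the comparison order-independent. A sketch, assuming a key column exists to sort on:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['C', 'A', 'B'], 'value': [3, 1, 2]})  # Same rows, shuffled

# Sort by the key and reset the index so row positions line up
df1_sorted = df1.sort_values('key').reset_index(drop=True)
df2_sorted = df2.sort_values('key').reset_index(drop=True)
print(df1_sorted.equals(df2_sorted))  # True
```

The reset_index(drop=True) step matters: without it, the original index labels travel with the sorted rows and equals() would still report a mismatch.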

Resolving data type mismatches in comparisons

Data type mismatches are another common hurdle. The equals() method is strict, so it won't consider an integer column equal to a string column, even if the values appear identical. The following code demonstrates how this can lead to unexpected results.

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': ['1', '2', '3'], 'B': [4, 5, 6]}) # String type in column A

# Direct comparison with different types
are_equal = df1.equals(df2)
print(f"Dataframes are equal: {are_equal}")

The comparison returns False because column A holds integers in one dataframe and strings in the other. The equals() method requires identical data types for a match. The following example shows how to handle this.

import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})
df2 = pd.DataFrame({'A': ['1', '2', '3'], 'B': [4, 5, 6]}) # String type in column A

# Convert types before comparison
df2['A'] = df2['A'].astype(int)
are_equal = df1.equals(df2)
print(f"Dataframes are equal: {are_equal}")

To resolve this, you can standardize the data types before the comparison. The astype() method lets you convert a column to a specific type, such as changing strings to integers with df2['A'].astype(int). This ensures that equals() compares the actual numeric values instead of getting tripped up by type differences. It's a crucial step when working with data from different files, where one might store numbers as text.
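One caveat: astype(int) raises an error if the column contains anything it can't parse. When the data might be dirty, pd.to_numeric with errors='coerce' is a more forgiving option. Here's a sketch with a deliberately unparseable entry:

```python
import pandas as pd

df = pd.DataFrame({'A': ['1', '2', 'oops']})  # One value can't be parsed

# errors='coerce' turns unparseable entries into NaN instead of raising
df['A'] = pd.to_numeric(df['A'], errors='coerce')
print(df)
```

The resulting NaNs can then be inspected or filled before you run the comparison, instead of the whole conversion failing.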

Real-world applications

Now that you can navigate common pitfalls, you can confidently apply these comparison techniques to solve practical problems in data analysis.

Validating data cleaning with compare()

Imagine you’ve just run a script to clean a raw dataset. You now have two dataframes: the original and the cleaned version. You need to confirm your script worked as intended—without manually scanning thousands of rows.

This is a perfect use case for the compare() method. It provides a concise report showing only the cells that were modified. The output, with its clear self and other columns, acts as a precise log of your cleaning operations, showing the "before" and "after" for each change.

This allows you to quickly verify that typos were fixed, formats were standardized, or missing values were imputed correctly. It’s an essential quality assurance step that ensures your cleaning process improved the data without introducing new errors.

Detecting anomalies between monthly sales reports with merge

When comparing data over time, such as monthly sales reports, you often need to see the full picture. This includes identifying new or discontinued products and spotting significant changes in performance.

Using pd.merge() with an outer join is the ideal approach here. By setting how='outer' and indicator=True, you create a unified dataframe that tells a complete story. The special _merge column categorizes each row for you.

Rows marked as left_only represent products from the first report that are missing in the second, while right_only highlights new entries. For rows marked both, you can compare values side-by-side to detect anomalies, like a sudden sales drop, that warrant a closer look.

Validating data cleaning with compare()

For example, you can use compare() to verify that missing values and invalid entries in a raw dataset were correctly replaced during cleaning.

import pandas as pd
import numpy as np

# Dataset before and after cleaning
original_df = pd.DataFrame({'value': [1, np.nan, 3, -999, 5]})
cleaned_df = pd.DataFrame({'value': [1, 0, 3, 0, 5]})

# Compare to identify what was changed during cleaning
changes = original_df.compare(cleaned_df)
print(changes)

This code shows how the compare() method isolates changes between two dataframes. The original_df contains a NaN value and a -999 placeholder, which are both replaced with 0 in the cleaned_df.

  • The compare() method generates a new dataframe that displays only these modified rows.
  • Its output uses self and other columns to place the original and new values side-by-side.

This makes it easy to audit specific adjustments without having to scan the entire dataset.

Detecting anomalies between monthly sales reports with merge

By performing calculations on the merged data, you can quantify these changes and automatically filter for the most significant ones.

import pandas as pd

# Sales data from two consecutive months
april = pd.DataFrame({
    'product': ['A', 'B', 'C', 'D'],
    'sales': [120, 85, 90, 110]
})
may = pd.DataFrame({
    'product': ['A', 'B', 'C', 'E'],
    'sales': [125, 210, 95, 60]
})

# Identify significant changes (>50% increase or decrease)
comparison = pd.merge(april, may, on='product', how='outer', suffixes=('_apr', '_may'))
comparison['pct_change'] = (comparison['sales_may'] / comparison['sales_apr'] - 1) * 100
anomalies = comparison[abs(comparison['pct_change']) > 50].dropna()
print(anomalies)

This code snippet identifies significant sales fluctuations between two monthly reports. It uses pd.merge() with an outer join to combine the april and may dataframes, ensuring all products are included for a complete comparison.

  • A new pct_change column is created to calculate the sales difference for products present in both months.
  • The code then filters these results to isolate anomalies—products where the sales change was greater than 50%.
  • Finally, dropna() is used to remove any entries with missing values, cleaning up the final output.

Get started with Replit

Put your knowledge into practice by building a real tool. Just tell Replit Agent what you need: "Create a utility that compares two CSVs and flags discrepancies" or "Build a dashboard that visualizes changes between two reports."

Replit Agent writes the code, tests for errors, and deploys the app for you. Skip the manual work and start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
