How to merge dataframes in Python

Learn how to merge dataframes in Python. This guide covers different methods, tips, real-world applications, and debugging common errors.

Published on: Tue, Mar 3, 2026
Updated on: Wed, Mar 4, 2026
The Replit Team

To combine datasets in Python, you merge dataframes—a core skill for data analysis. The pandas library provides powerful functions like merge() that simplify this process and ensure data integrity.

Here, you'll discover different merge techniques, practical tips, and real-world applications. You will also get clear debugging advice to resolve common errors and confidently handle any dataframe combination task you face.

Using pd.merge() to join dataframes

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})
merged_df = pd.merge(df1, df2, on='key')
print(merged_df)

Output:
  key  value1  value2
0   A       1       4
1   B       2       5

The pd.merge() function combines dataframes by matching values in one or more common columns. Here, on='key' explicitly tells pandas to use the 'key' column as the join criterion for df1 and df2.

The resulting dataframe demonstrates the function's default behavior, which is an inner join. This means it only keeps rows where the key exists in both dataframes.

  • Keys 'A' and 'B' are included because they are shared.
  • Key 'C' from df1 and 'D' from df2 are dropped because they lack a match in the other dataframe.

Common methods for dataframe joining

While pd.merge() is a great starting point, pandas offers more specialized tools like DataFrame.join() and pd.concat(), plus more advanced SQL-style join options.

Using DataFrame.join() for index-based merging

import pandas as pd

df1 = pd.DataFrame({'value1': [1, 2, 3]}, index=['A', 'B', 'C'])
df2 = pd.DataFrame({'value2': [4, 5, 6]}, index=['A', 'B', 'D'])
joined_df = df1.join(df2, how='inner')
print(joined_df)

Output:
   value1  value2
A       1       4
B       2       5

The DataFrame.join() method is your go-to for merging dataframes based on their index. It’s a convenient alternative to pd.merge() when you're working with row labels instead of columns.

  • In this example, df1.join(df2) aligns the dataframes using their common index values.
  • The how='inner' argument ensures that only rows with matching indices—'A' and 'B'—are kept in the final output.
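Note that join() defaults to how='left', so omitting the argument keeps every row from the calling dataframe. A minimal sketch using the same frames:

```python
import pandas as pd

df1 = pd.DataFrame({'value1': [1, 2, 3]}, index=['A', 'B', 'C'])
df2 = pd.DataFrame({'value2': [4, 5, 6]}, index=['A', 'B', 'D'])

# join() defaults to a left join: every label from df1 survives,
# and unmatched labels ('C') get NaN in df2's columns
left_joined = df1.join(df2)
print(left_joined)
```

Because 'C' has no match, its value2 is NaN, which also converts the column to float.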

Using pd.concat() to combine dataframes

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['C', 'D'], 'value': [3, 4]})
concat_rows = pd.concat([df1, df2])
concat_cols = pd.concat([df1, df2], axis=1)
print("Rows:", concat_rows.shape[0], "Columns:", concat_cols.shape[1])
print(concat_rows)

Output:
Rows: 4 Columns: 4
  key  value
0   A      1
1   B      2
0   C      3
1   D      4

Unlike merging, pd.concat() is perfect for stacking dataframes. It simply pieces them together along an axis—either vertically or horizontally—without looking for matching values. Think of it as gluing datasets together.

  • By default, it stacks rows (axis=0), appending df2 below df1.
  • Setting axis=1 changes the behavior to stack columns, placing the dataframes side by side.
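One detail worth noting: when stacking rows, the original row labels (0, 1, 0, 1) are preserved, which can cause surprises in later lookups. A quick sketch of the ignore_index=True option, which renumbers the result:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['C', 'D'], 'value': [3, 4]})

# ignore_index=True discards the original row labels and
# assigns a fresh 0..n-1 index to the stacked result
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked)
```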

Using SQL-style join operations with pd.merge()

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'D', 'E'], 'value2': [4, 5, 6]})
inner_join = pd.merge(df1, df2, on='key', how='inner')
left_join = pd.merge(df1, df2, on='key', how='left')
right_join = pd.merge(df1, df2, on='key', how='right')
print(f"Inner: {len(inner_join)}, Left: {len(left_join)}, Right: {len(right_join)}")

Output:
Inner: 1, Left: 3, Right: 3

The pd.merge() function becomes even more powerful when you use the how parameter to specify the join type, just like in SQL. This gives you precise control over how dataframes are combined based on their matching keys.

  • inner: This is the default. It keeps only rows with keys that exist in both dataframes. In the example, only key 'A' is common.
  • left: Keeps all rows from the left dataframe (df1) and merges matching rows from the right (df2).
  • right: Does the opposite, keeping all rows from the right dataframe (df2) and filling in matches from the left.
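There is a fourth option, how='outer', which keeps every key from both sides and fills the gaps with NaN. A quick sketch with the same frames:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'D', 'E'], 'value2': [4, 5, 6]})

# An outer join is the union of keys: A (matched), B and C
# (left only), D and E (right only), with NaN where data is missing
outer_join = pd.merge(df1, df2, on='key', how='outer')
print(len(outer_join))  # 5 rows
```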

Advanced dataframe merging techniques

Building on the basics, you can now handle more complex merges involving multiple columns, duplicate column names, and data validation using advanced pd.merge() parameters.

Merging on multiple columns with compound keys

import pandas as pd

df1 = pd.DataFrame({
    'key1': ['A', 'A', 'B'],
    'key2': [1, 2, 1],
    'value1': [100, 200, 300]
})
df2 = pd.DataFrame({
    'key1': ['A', 'A', 'B'],
    'key2': [1, 2, 2],
    'value2': [400, 500, 600]
})
merged_df = pd.merge(df1, df2, on=['key1', 'key2'])
print(merged_df)

Output:
  key1  key2  value1  value2
0    A     1     100     400
1    A     2     200     500

Sometimes, a single column isn't enough to uniquely link your data. You can merge on multiple columns by passing a list of column names to the on parameter. This tells pandas to match rows only when the values in all specified key columns are identical across both dataframes.

  • In the example, on=['key1', 'key2'] creates a compound key.
  • The row with ('B', 1) from df1 is dropped because it doesn't have an exact match for that pair in df2.
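A related situation is when the matching columns exist in both frames but under different names; the left_on and right_on parameters handle that. A minimal sketch with hypothetical frames:

```python
import pandas as pd

# Hypothetical example: the same key lives in differently named columns
orders = pd.DataFrame({'cust': ['A', 'B'], 'amount': [10, 20]})
customers = pd.DataFrame({'customer_id': ['A', 'C'], 'name': ['Alice', 'Carol']})

# left_on/right_on pair up columns with different names;
# both key columns are kept in the result
merged = pd.merge(orders, customers, left_on='cust', right_on='customer_id')
print(merged)
```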

Handling duplicate column names with suffixes

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2], 'data': [10, 20]})
df2 = pd.DataFrame({'key': ['A', 'C'], 'value': [3, 4], 'data': [30, 40]})
merged_df = pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))
print(merged_df)

Output:
  key  value_left  data_left  value_right  data_right
0   A           1         10            3          30

When you merge dataframes with identically named columns—other than the merge key—pandas renames the duplicates to keep them distinct. By default it appends '_x' and '_y', but the suffixes parameter lets you choose clearer labels by providing a tuple like ('_left', '_right').

  • The first suffix is appended to overlapping column names from the left dataframe (df1).
  • The second suffix is applied to those from the right dataframe (df2).

This simple step prevents name collisions and ensures all your data is preserved in the final output.

Using indicator and validate parameters for advanced merging

import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [4, 5, 6]})
merged_df = pd.merge(
    df1, df2, on='key', how='outer',
    indicator=True, validate='1:1'
)
print(merged_df)

Output:
  key  value_x  value_y      _merge
0   A      1.0      4.0        both
1   B      2.0      5.0        both
2   C      3.0      NaN   left_only
3   D      NaN      6.0  right_only

You can gain deeper insight and control over your merges with the indicator and validate parameters. They're great for debugging and ensuring data integrity.

  • Setting indicator=True adds a special _merge column. This column explicitly shows whether a row's key came from the left dataframe, the right, or both.
  • Using validate='1:1' checks that merge keys are unique in both dataframes. If a key is duplicated, pandas raises an error, preventing incorrect results from silent data issues.
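To see validate in action, here is a sketch where a duplicated key makes the '1:1' check fail; pandas raises pandas.errors.MergeError before any incorrect rows are produced:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'A', 'B'], 'value': [1, 2, 3]})  # 'A' is duplicated
df2 = pd.DataFrame({'key': ['A', 'B'], 'data': [4, 5]})

# validate='1:1' requires unique keys on both sides, so the
# duplicated 'A' in df1 triggers a MergeError instead of a silent fan-out
try:
    pd.merge(df1, df2, on='key', validate='1:1')
except pd.errors.MergeError as err:
    print("Merge rejected:", err)
```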

Move faster with Replit

Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.

For the dataframe merging techniques we've explored, Replit Agent can turn them into production-ready tools:

  • Build a customer data unifier that merges user profiles from marketing and sales databases using pd.merge() to create a single view.
  • Create an e-commerce dashboard that joins product details with sales records using a left join to track inventory and performance.
  • Deploy a log file aggregator that uses pd.concat() to stack and analyze server logs from multiple sources into one comprehensive dataset.

Try building your next data tool by describing it to Replit Agent, and watch as it writes, tests, and deploys the code for you.

Common errors and challenges

Merging dataframes can sometimes throw errors or produce unexpected results, but most issues trace back to a few common data preparation oversights.


Troubleshooting merge on object errors with mismatched data types

When key columns have mismatched data types, like integers and strings, the merge fails. Depending on your pandas version, you either get a ValueError complaining about merging int64 and object columns, or a silently empty dataframe because no values match. The code below shows this exact scenario.

import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3], 'value': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': ['1', '2', '4'], 'data': ['d', 'e', 'f']})

merged_df = pd.merge(df1, df2, on='key')  # Empty result or ValueError, depending on pandas version
print(merged_df)

The key column in df1 holds integers while df2's contains strings. Pandas can't match the integer 1 with the string '1', so the merge either errors out or returns nothing. The following code demonstrates the correction.

import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3], 'value': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': ['1', '2', '4'], 'data': ['d', 'e', 'f']})

df2['key'] = df2['key'].astype(int)
merged_df = pd.merge(df1, df2, on='key')
print(merged_df)

The fix is to align the data types before merging. By applying df2['key'].astype(int), you convert the string keys to integers, allowing pd.merge() to correctly match the rows. This kind of mismatch is common when you're combining data from different sources—like a CSV file and a database—where numbers can be easily misinterpreted as text. A quick check with df.info() can save you a lot of trouble.
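As a quick guard, comparing the dtype of each key column before merging surfaces the mismatch immediately. A small sketch:

```python
import pandas as pd

df1 = pd.DataFrame({'key': [1, 2, 3], 'value': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': ['1', '2', '4'], 'data': ['d', 'e', 'f']})

# int64 vs object: a clear signal the keys won't line up
print(df1['key'].dtype, df2['key'].dtype)
```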

Debugging missing data after merging with case-sensitive keys

If your merge results are missing data, check for case sensitivity in your keys. Pandas treats values like 'C001' and 'c001' as completely different, so they won't match. The following code shows this common pitfall in action.

import pandas as pd

customers = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C003'],
    'name': ['Alice', 'Bob', 'Charlie']
})

orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': ['c001', 'c002', 'c004'],  # Note lowercase
    'amount': [100, 200, 150]
})

merged_df = pd.merge(customers, orders, on='customer_id')
print(f"Records after merge: {len(merged_df)}") # 0 records

The customers dataframe uses uppercase IDs, but the orders dataframe uses lowercase. Because pd.merge() treats 'C001' and 'c001' as different, it finds no matches, returning an empty dataframe. The following code demonstrates the fix.

import pandas as pd

customers = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C003'],
    'name': ['Alice', 'Bob', 'Charlie']
})

orders = pd.DataFrame({
    'order_id': [1, 2, 3],
    'customer_id': ['c001', 'c002', 'c004'],  # Note lowercase
    'amount': [100, 200, 150]
})

customers['customer_id'] = customers['customer_id'].str.upper()
orders['customer_id'] = orders['customer_id'].str.upper()

merged_df = pd.merge(customers, orders, on='customer_id')
print(f"Records after merge: {len(merged_df)}") # 2 records

The fix is to standardize the case before merging. By applying the .str.upper() method to both customer_id columns, you ensure that values like 'C001' and 'c001' are treated as identical, allowing pd.merge() to correctly match the records. This kind of mismatch is common when combining data from different systems or sources with inconsistent data entry standards, so it's a good habit to check for it to prevent silent data loss.

Fixing duplicate values in merge keys with drop_duplicates()

Duplicate keys in your source data can cause a merge to generate far more rows than you expect, a blow-up sometimes called a Cartesian product. It happens because pandas creates a new row for every matching pair of keys. The following code shows how a single duplicate can skew your results.

import pandas as pd

products = pd.DataFrame({
    'product_id': [101, 102, 101],  # Duplicate product_id
    'product_name': ['Laptop', 'Phone', 'Tablet']
})

inventory = pd.DataFrame({
    'product_id': [101, 102, 103],
    'quantity': [5, 10, 8]
})

result = pd.merge(products, inventory, on='product_id')
print(result) # Shows duplicate rows for product_id 101

The products dataframe has two different entries for product_id 101. The merge pairs the single inventory record with both of these, creating an extra row and skewing the final result. The code below shows the fix.

import pandas as pd

products = pd.DataFrame({
    'product_id': [101, 102, 101],  # Duplicate product_id
    'product_name': ['Laptop', 'Phone', 'Tablet']
})

inventory = pd.DataFrame({
    'product_id': [101, 102, 103],
    'quantity': [5, 10, 8]
})

products_unique = products.drop_duplicates(subset=['product_id'])
result = pd.merge(products_unique, inventory, on='product_id')
print(result) # Clean result without duplicates

The fix is to clean your data before the merge. By using products.drop_duplicates(subset=['product_id']), you create a new dataframe where each product ID is unique. This prevents the merge from generating extra rows for each duplicate key. It's a crucial data cleaning step to ensure a clean, one-to-one match and avoid skewed results, especially when combining datasets where one should have unique identifiers.

Real-world applications

Navigating common pitfalls prepares you to apply dataframe merging to real-world business and research applications.

Analyzing customer purchases with pd.merge()

By merging sales transactions with customer details using pd.merge(), you can analyze purchasing behavior across different customer segments.

import pandas as pd

sales_df = pd.DataFrame({
    'customer_id': [101, 102, 101, 103, 104],
    'amount': [150.50, 200.75, 50.25, 300.00, 120.00]
})

customer_df = pd.DataFrame({
    'customer_id': [101, 102, 103, 105],
    'segment': ['Premium', 'Standard', 'Premium', 'Standard']
})

# Join sales with customer information
result = pd.merge(sales_df, customer_df, on='customer_id')
print(result.groupby('segment')['amount'].sum())

This code combines sales data with customer details using pd.merge(), linking them on the shared customer_id column. This step creates a single, enriched dataframe that contains both purchase amounts and customer segments for matched records.

  • The groupby('segment') method then organizes this merged data based on customer type.
  • Finally, sum() calculates the total purchase amount for each segment, making it easy to compare the spending habits of different customer groups.

Enriching geographic data with how='left' join for analysis

A left join is perfect for enriching a primary dataset with supplementary data, such as adding state-level economic metrics to city information, while ensuring no original city records are lost in the process.

import pandas as pd

cities_df = pd.DataFrame({
    'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
    'state': ['NY', 'CA', 'IL', 'TX', 'AZ'],
    'population': [8.4, 4.0, 2.7, 2.3, 1.7]  # in millions
})

metrics_df = pd.DataFrame({
    'state': ['NY', 'CA', 'IL', 'TX', 'WA'],
    'gdp_growth': [2.1, 3.5, 1.8, 4.2, 3.9],
    'unemployment': [4.2, 4.1, 4.7, 3.8, 3.6]
})

# Enrich city data with state metrics using left join
enriched_df = pd.merge(cities_df, metrics_df, on='state', how='left')
enriched_df['econ_index'] = enriched_df['gdp_growth'] - enriched_df['unemployment']
print(enriched_df[['city', 'state', 'econ_index']].sort_values('econ_index', ascending=False))

This code combines city data with state-level economic metrics using a left join. The how='left' argument ensures every city from the original cities_df is kept in the final dataframe, linking the two datasets on the shared state column.

  • A new econ_index column is calculated by subtracting unemployment from GDP growth, creating a simple economic health score.
  • The final output then sorts the cities by this new index, ranking them based on their state's economic performance.

Get started with Replit

Turn your new skills into a real tool. Describe your idea to Replit Agent, like "unify customer and sales data from two CSVs" or "build a dashboard joining product details with live inventory."

The agent writes the code, tests for errors, and deploys your app, handling the entire development cycle. Turn your data-merging idea into a finished product. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
