How to merge dataframes in Python
Learn how to merge dataframes in Python. This guide covers different methods, tips, real-world applications, and debugging common errors.

To combine datasets in Python, you merge dataframes—a core skill for data analysis. The pandas library provides powerful functions like merge() that simplify this process and ensure data integrity.
Here, you'll discover different merge techniques, practical tips, and real-world applications. You will also get clear debugging advice to resolve common errors and confidently handle any dataframe combination task you face.
Using pd.merge() to join dataframes
import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value2': [4, 5, 6]})
merged_df = pd.merge(df1, df2, on='key')
print(merged_df)

Output:

  key  value1  value2
0   A       1       4
1   B       2       5
The pd.merge() function combines dataframes by matching values in one or more common columns. Here, on='key' explicitly tells pandas to use the 'key' column as the join criterion for df1 and df2.
The resulting dataframe demonstrates the function's default behavior, which is an inner join. This means it only keeps rows where the key exists in both dataframes.
- Keys 'A' and 'B' are included because they are shared.
- Key 'C' from df1 and key 'D' from df2 are dropped because they lack a match in the other dataframe.
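When the key column is named differently in each dataframe, pd.merge() accepts left_on and right_on in place of on. A minimal sketch (the users and logins dataframes here are made up for illustration):

```python
import pandas as pd

# Hypothetical dataframes whose key columns have different names
users = pd.DataFrame({'user_id': ['A', 'B', 'C'], 'name': ['Ann', 'Ben', 'Cal']})
logins = pd.DataFrame({'uid': ['A', 'B', 'D'], 'count': [10, 20, 30]})

# left_on/right_on map the two differently named key columns onto each other
merged = pd.merge(users, logins, left_on='user_id', right_on='uid')
print(merged)
```

Note that both key columns ('user_id' and 'uid') survive into the result; you can drop one afterward if it's redundant.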
Common methods for dataframe joining
While pd.merge() is a great starting point, pandas offers more specialized tools like DataFrame.join() and pd.concat(), plus more advanced SQL-style join options.
Using DataFrame.join() for index-based merging
import pandas as pd
df1 = pd.DataFrame({'value1': [1, 2, 3]}, index=['A', 'B', 'C'])
df2 = pd.DataFrame({'value2': [4, 5, 6]}, index=['A', 'B', 'D'])
joined_df = df1.join(df2, how='inner')
print(joined_df)

Output:

   value1  value2
A       1       4
B       2       5
The DataFrame.join() method is your go-to for merging dataframes based on their index. It’s a convenient alternative to pd.merge() when you're working with row labels instead of columns.
- In this example, df1.join(df2) aligns the dataframes using their common index values.
- The how='inner' argument ensures that only rows with matching indices ('A' and 'B') are kept in the final output.
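Worth noting: if you omit how='inner', join() defaults to a left join, keeping every row of the calling dataframe. A minimal sketch with the same data:

```python
import pandas as pd

df1 = pd.DataFrame({'value1': [1, 2, 3]}, index=['A', 'B', 'C'])
df2 = pd.DataFrame({'value2': [4, 5, 6]}, index=['A', 'B', 'D'])

# Without how='inner', join defaults to a left join: every row of df1
# is kept, and index 'C' gets NaN for value2 since df2 has no match
left_joined = df1.join(df2)
print(left_joined)
```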
Using pd.concat() to combine dataframes
import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['C', 'D'], 'value': [3, 4]})
concat_rows = pd.concat([df1, df2])
concat_cols = pd.concat([df1, df2], axis=1)
print("Rows:", concat_rows.shape[0], "Columns:", concat_cols.shape[1])
print(concat_rows)

Output:

Rows: 4 Columns: 4
  key  value
0   A      1
1   B      2
0   C      3
1   D      4
Unlike merging, pd.concat() is perfect for stacking dataframes. It simply pieces them together along an axis—either vertically or horizontally—without looking for matching values. Think of it as gluing datasets together.
- By default, it stacks rows (axis=0), appending df2 below df1.
- Setting axis=1 changes the behavior to stack columns, placing the dataframes side by side.
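One detail worth knowing: stacking rows preserves each dataframe's original index, which is why the output above repeats 0 and 1. A short sketch of the common fix, ignore_index=True:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['C', 'D'], 'value': [3, 4]})

# ignore_index=True renumbers the rows 0..3 instead of
# carrying over the original indices 0, 1, 0, 1
stacked = pd.concat([df1, df2], ignore_index=True)
print(stacked.index.tolist())  # [0, 1, 2, 3]
```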
Using SQL-style join operations with pd.merge()
import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'D', 'E'], 'value2': [4, 5, 6]})
inner_join = pd.merge(df1, df2, on='key', how='inner')
left_join = pd.merge(df1, df2, on='key', how='left')
right_join = pd.merge(df1, df2, on='key', how='right')
print(f"Inner: {len(inner_join)}, Left: {len(left_join)}, Right: {len(right_join)}")

Output:

Inner: 1, Left: 3, Right: 3
The pd.merge() function becomes even more powerful when you use the how parameter to specify the join type, just like in SQL. This gives you precise control over how dataframes are combined based on their matching keys.
- inner: This is the default. It keeps only rows with keys that exist in both dataframes. In the example, only key 'A' is common.
- left: Keeps all rows from the left dataframe (df1) and merges matching rows from the right (df2).
- right: Does the opposite, keeping all rows from the right dataframe (df2) and filling in matches from the left.
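Beyond inner, left, and right, pd.merge() also supports how='outer', which keeps every key from both dataframes and fills the gaps with NaN. A minimal sketch using the same df1 and df2:

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value1': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'D', 'E'], 'value2': [4, 5, 6]})

# how='outer' keeps the union of keys from both sides: A, B, C, D, E.
# Keys present on only one side get NaN in the other side's columns.
outer_join = pd.merge(df1, df2, on='key', how='outer')
print(len(outer_join))  # 5
```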
Advanced dataframe merging techniques
Building on the basics, you can now handle more complex merges involving multiple columns, duplicate column names, and data validation using advanced pd.merge() parameters.
Merging on multiple columns with compound keys
import pandas as pd
df1 = pd.DataFrame({
'key1': ['A', 'A', 'B'],
'key2': [1, 2, 1],
'value1': [100, 200, 300]
})
df2 = pd.DataFrame({
'key1': ['A', 'A', 'B'],
'key2': [1, 2, 2],
'value2': [400, 500, 600]
})
merged_df = pd.merge(df1, df2, on=['key1', 'key2'])
print(merged_df)

Output:

  key1  key2  value1  value2
0    A     1     100     400
1    A     2     200     500
Sometimes, a single column isn't enough to uniquely link your data. You can merge on multiple columns by passing a list of column names to the on parameter. This tells pandas to match rows only when the values in all specified key columns are identical across both dataframes.
- In the example, on=['key1', 'key2'] creates a compound key.
- The row with ('B', 1) from df1 is dropped because it doesn't have an exact match for that pair in df2.
Handling duplicate column names with suffixes
import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2], 'data': [10, 20]})
df2 = pd.DataFrame({'key': ['A', 'C'], 'value': [3, 4], 'data': [30, 40]})
merged_df = pd.merge(df1, df2, on='key', suffixes=('_left', '_right'))
print(merged_df)

Output:

  key  value_left  data_left  value_right  data_right
0   A           1         10            3          30
When you merge dataframes with identically named columns—other than the merge key—pandas must rename them to avoid ambiguity. By default it appends '_x' and '_y', but the suffixes parameter lets you choose clearer labels by providing a tuple like ('_left', '_right').
- The first suffix is appended to overlapping column names from the left dataframe (df1).
- The second suffix is applied to those from the right dataframe (df2).
This simple step prevents name collisions and ensures all your data is preserved in the final output.
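For comparison, here is a quick sketch of what happens when you omit the suffixes argument: pandas falls back to its defaults of '_x' and '_y'.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'B'], 'value': [1, 2]})
df2 = pd.DataFrame({'key': ['A', 'C'], 'value': [3, 4]})

# No suffixes argument: the overlapping 'value' columns
# become 'value_x' (left) and 'value_y' (right) by default
merged = pd.merge(df1, df2, on='key')
print(merged.columns.tolist())  # ['key', 'value_x', 'value_y']
```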
Using indicator and validate parameters for advanced merging
import pandas as pd
df1 = pd.DataFrame({'key': ['A', 'B', 'C'], 'value': [1, 2, 3]})
df2 = pd.DataFrame({'key': ['A', 'B', 'D'], 'value': [4, 5, 6]})
merged_df = pd.merge(
df1, df2, on='key', how='outer',
indicator=True, validate='1:1'
)
print(merged_df)

Output:

  key  value_x  value_y      _merge
0   A      1.0      4.0        both
1   B      2.0      5.0        both
2   C      3.0      NaN   left_only
3   D      NaN      6.0  right_only
You can gain deeper insight and control over your merges with the indicator and validate parameters. They're great for debugging and ensuring data integrity.
- Setting indicator=True adds a special _merge column. This column explicitly shows whether a row's key came from the left dataframe, the right, or both.
- Using validate='1:1' checks that merge keys are unique in both dataframes. If a key is duplicated, pandas raises an error, preventing incorrect results from silent data issues.
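To see what a failed validation looks like, here is a minimal sketch: with a duplicated key in the left dataframe, validate='1:1' makes pandas raise a pd.errors.MergeError instead of quietly multiplying rows.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['A', 'A', 'B'], 'value': [1, 2, 3]})  # 'A' is duplicated
df2 = pd.DataFrame({'key': ['A', 'B'], 'data': [4, 5]})

# validate='1:1' requires unique keys on both sides; the
# duplicated 'A' in df1 violates that contract
try:
    pd.merge(df1, df2, on='key', validate='1:1')
except pd.errors.MergeError as err:
    print('Validation failed:', err)
```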
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
For the dataframe merging techniques we've explored, Replit Agent can turn them into production-ready tools:
- Build a customer data unifier that merges user profiles from marketing and sales databases using pd.merge() to create a single view.
- Create an e-commerce dashboard that joins product details with sales records using a left join to track inventory and performance.
- Deploy a log file aggregator that uses pd.concat() to stack and analyze server logs from multiple sources into one comprehensive dataset.
Try building your next data tool by describing it to Replit Agent, and watch as it writes, tests, and deploys the code for you.
Common errors and challenges
Merging dataframes can sometimes throw errors or produce unexpected results, but most issues trace back to a few common data preparation oversights.
Troubleshooting merge on object errors with mismatched data types
A merge on object error often signals that your key columns have different data types. You might be trying to merge a column of numbers with a column of text that just happens to contain numbers. Pandas can't match 123 with '123' by default, so the merge fails.
To fix this, you need to ensure the key columns share the same data type before merging. You can easily convert a column's type using the astype() method to create consistency across your dataframes.
Debugging missing data after merging with case-sensitive keys
If your merged dataframe is missing rows you expected to see, the cause is often case sensitivity in your merge keys. Pandas treats 'Apple' and 'apple' as completely different values, so if one dataframe uses uppercase and the other uses lowercase, those rows won't match.
The solution is to standardize the case in both key columns before the merge. You can convert both columns to a consistent format using a string method like .str.lower(), ensuring that keys are treated as identical.
Fixing duplicate values in merge keys with drop_duplicates()
Duplicate values in a merge key can create a much larger dataframe than you intended—a phenomenon known as a Cartesian product. This happens when a key appears multiple times in both dataframes, causing pandas to create a new row for every possible combination.
Before merging, it's good practice to clean your data by removing unintentional duplicates. You can use the drop_duplicates() method on your dataframe, specifying the key column, to ensure each key is unique and prevent the merge from creating unwanted extra rows.
Troubleshooting merge on object errors with mismatched data types
When key columns have mismatched data types, like integers and strings, the merge fails. Recent pandas versions raise a ValueError ("You are trying to merge on int64 and object columns"), while older versions silently returned an empty dataframe because no values matched. The code below shows this exact scenario in action.
import pandas as pd
df1 = pd.DataFrame({'key': [1, 2, 3], 'value': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': ['1', '2', '4'], 'data': ['d', 'e', 'f']})
merged_df = pd.merge(df1, df2, on='key')  # Raises ValueError in recent pandas
print(merged_df)  # Older pandas: prints an empty dataframe
The key column in df1 holds integers while df2's contains strings. Pandas can't match the integer 1 with the string '1', so the merge fails with a type mismatch error (or, in older versions, returned nothing). The following code demonstrates the correction.
import pandas as pd
df1 = pd.DataFrame({'key': [1, 2, 3], 'value': ['a', 'b', 'c']})
df2 = pd.DataFrame({'key': ['1', '2', '4'], 'data': ['d', 'e', 'f']})
df2['key'] = df2['key'].astype(int)
merged_df = pd.merge(df1, df2, on='key')
print(merged_df)
The fix is to align the data types before merging. By applying df2['key'].astype(int), you convert the string keys to integers, allowing pd.merge() to correctly match the rows. This kind of mismatch is common when you're combining data from different sources—like a CSV file and a database—where numbers can be easily misinterpreted as text. A quick check with df.info() can save you a lot of trouble.
Debugging missing data after merging with case-sensitive keys
If your merge results are missing data, check for case sensitivity in your keys. Pandas treats values like 'C001' and 'c001' as completely different, so they won't match. The following code shows this common pitfall in action.
import pandas as pd
customers = pd.DataFrame({
'customer_id': ['C001', 'C002', 'C003'],
'name': ['Alice', 'Bob', 'Charlie']
})
orders = pd.DataFrame({
'order_id': [1, 2, 3],
'customer_id': ['c001', 'c002', 'c004'], # Note lowercase
'amount': [100, 200, 150]
})
merged_df = pd.merge(customers, orders, on='customer_id')
print(f"Records after merge: {len(merged_df)}") # 0 records
The customers dataframe uses uppercase IDs, but the orders dataframe uses lowercase. Because pd.merge() treats 'C001' and 'c001' as different, it finds no matches, returning an empty dataframe. The following code demonstrates the fix.
import pandas as pd
customers = pd.DataFrame({
'customer_id': ['C001', 'C002', 'C003'],
'name': ['Alice', 'Bob', 'Charlie']
})
orders = pd.DataFrame({
'order_id': [1, 2, 3],
'customer_id': ['c001', 'c002', 'c004'], # Note lowercase
'amount': [100, 200, 150]
})
customers['customer_id'] = customers['customer_id'].str.upper()
orders['customer_id'] = orders['customer_id'].str.upper()
merged_df = pd.merge(customers, orders, on='customer_id')
print(f"Records after merge: {len(merged_df)}") # 2 records
The fix is to standardize the case before merging. By applying the .str.upper() method to both customer_id columns, you ensure that values like 'C001' and 'c001' are treated as identical, allowing pd.merge() to correctly match the records. This kind of mismatch is common when combining data from different systems or sources with inconsistent data entry standards, so it's a good habit to check for it to prevent silent data loss.
Fixing duplicate values in merge keys with drop_duplicates()
Duplicate keys in your source data can cause a merge to generate more rows than you expect. This happens when a key appears multiple times, creating a new row for each match. The following code shows how a single duplicate can skew your results.
import pandas as pd
products = pd.DataFrame({
'product_id': [101, 102, 101], # Duplicate product_id
'product_name': ['Laptop', 'Phone', 'Tablet']
})
inventory = pd.DataFrame({
'product_id': [101, 102, 103],
'quantity': [5, 10, 8]
})
result = pd.merge(products, inventory, on='product_id')
print(result) # Shows duplicate rows for product_id 101
The products dataframe has two different entries for product_id 101. The merge pairs the single inventory record with both of these, creating an extra row and skewing the final result. The code below shows the fix.
import pandas as pd
products = pd.DataFrame({
'product_id': [101, 102, 101], # Duplicate product_id
'product_name': ['Laptop', 'Phone', 'Tablet']
})
inventory = pd.DataFrame({
'product_id': [101, 102, 103],
'quantity': [5, 10, 8]
})
products_unique = products.drop_duplicates(subset=['product_id'])
result = pd.merge(products_unique, inventory, on='product_id')
print(result) # Clean result without duplicates
The fix is to clean your data before the merge. By using products.drop_duplicates(subset=['product_id']), you create a new dataframe where each product ID is unique. This prevents the merge from generating extra rows for each duplicate key. It's a crucial data cleaning step to ensure a clean, one-to-one match and avoid skewed results, especially when combining datasets where one should have unique identifiers.
Real-world applications
Navigating common pitfalls prepares you to apply dataframe merging to real-world business and research applications.
Analyzing customer purchases with pd.merge()
By merging sales transactions with customer details using pd.merge(), you can analyze purchasing behavior across different customer segments.
import pandas as pd
sales_df = pd.DataFrame({
'customer_id': [101, 102, 101, 103, 104],
'amount': [150.50, 200.75, 50.25, 300.00, 120.00]
})
customer_df = pd.DataFrame({
'customer_id': [101, 102, 103, 105],
'segment': ['Premium', 'Standard', 'Premium', 'Standard']
})
# Join sales with customer information
result = pd.merge(sales_df, customer_df, on='customer_id')
print(result.groupby('segment')['amount'].sum())
This code combines sales data with customer details using pd.merge(), linking them on the shared customer_id column. This step creates a single, enriched dataframe that contains both purchase amounts and customer segments for matched records.
- The groupby('segment') method then organizes this merged data based on customer type.
- Finally, sum() calculates the total purchase amount for each segment, making it easy to compare the spending habits of different customer groups.
Enriching geographic data with how='left' join for analysis
A left join is perfect for enriching a primary dataset with supplementary data, such as adding state-level economic metrics to city information, while ensuring no original city records are lost in the process.
import pandas as pd
cities_df = pd.DataFrame({
'city': ['New York', 'Los Angeles', 'Chicago', 'Houston', 'Phoenix'],
'state': ['NY', 'CA', 'IL', 'TX', 'AZ'],
'population': [8.4, 4.0, 2.7, 2.3, 1.7] # in millions
})
metrics_df = pd.DataFrame({
'state': ['NY', 'CA', 'IL', 'TX', 'WA'],
'gdp_growth': [2.1, 3.5, 1.8, 4.2, 3.9],
'unemployment': [4.2, 4.1, 4.7, 3.8, 3.6]
})
# Enrich city data with state metrics using left join
enriched_df = pd.merge(cities_df, metrics_df, on='state', how='left')
enriched_df['econ_index'] = enriched_df['gdp_growth'] - enriched_df['unemployment']
print(enriched_df[['city', 'state', 'econ_index']].sort_values('econ_index', ascending=False))
This code combines city data with state-level economic metrics using a left join. The how='left' argument ensures every city from the original cities_df is kept in the final dataframe, linking the two datasets on the shared state column.
- A new econ_index column is calculated by subtracting unemployment from GDP growth, creating a simple economic health score.
- The final output then sorts the cities by this new index, ranking them based on their state's economic performance.
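One follow-up worth noting: a left join leaves NaN in the metric columns for any city whose state has no match (Phoenix/AZ above, since metrics_df has no 'AZ' row). A minimal sketch of one common way to handle that, using fillna() with an assumed neutral default of 0.0:

```python
import pandas as pd

# Trimmed-down version of the example above: AZ has no metrics row
cities_df = pd.DataFrame({'city': ['New York', 'Phoenix'], 'state': ['NY', 'AZ']})
metrics_df = pd.DataFrame({'state': ['NY'], 'gdp_growth': [2.1]})

enriched = pd.merge(cities_df, metrics_df, on='state', how='left')
# Phoenix picked up NaN for gdp_growth; fill a neutral default
enriched['gdp_growth'] = enriched['gdp_growth'].fillna(0.0)
print(enriched)
```

Whether 0.0 is the right fill value depends on your analysis; sometimes keeping NaN (and excluding those rows from downstream calculations) is the more honest choice.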
Get started with Replit
Turn your new skills into a real tool. Describe your idea to Replit Agent, like "unify customer and sales data from two CSVs" or "build a dashboard joining product details with live inventory."
The agent writes the code, tests for errors, and deploys your app, handling the entire development cycle. Turn your data-merging idea into a finished product. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.