How to concatenate two dataframes in Python
Learn how to concatenate two dataframes in Python. Explore methods, tips, real-world applications, and how to debug common errors.

You often need to combine dataframes in Python for data manipulation and analysis. The pandas library offers powerful tools, like the concat() function, to merge datasets efficiently and accurately.
In this article, we'll explore various concatenation techniques and share practical tips for common scenarios. You'll also find real-world applications and debugging advice to help you master dataframe combination for any project.
Basic concatenation with pd.concat()
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2])
print(result)

Output:
   A  B
0  1  3
1  2  4
0  5  7
1  6  8
The pd.concat() function is your go-to for stacking dataframes. It takes an iterable—in this case, a list of dataframes [df1, df2]—and joins them along an axis. By default, it stacks them vertically (axis=0), aligning the data by matching column labels. This type of data manipulation is where AI-powered coding with Python really shines.
Pay close attention to the index in the output. The function preserves the original index from each dataframe, which is why you see the labels 0 and 1 repeated. This default behavior can cause issues later, so it's a common practice to reset the index after concatenation.
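For instance, if you've already concatenated without a clean index, a quick sketch of tidying it up afterwards with reset_index(drop=True):

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Concatenate, then discard the duplicated 0, 1 labels
result = pd.concat([df1, df2]).reset_index(drop=True)
print(result.index.tolist())  # [0, 1, 2, 3]
```

Passing drop=True prevents the old index from being kept as a new column.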
Basic concatenation methods
Beyond the default vertical stacking, you can also use the older append() method, join columns with axis=1, or reset the index using ignore_index=True.
Using the append() method
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = df1.append(df2) # Note: append is deprecated in newer pandas versions
print(result)

Output:
   A  B
0  1  3
1  2  4
0  5  7
1  6  8
The append() method offers another way to stack dataframes. Unlike pd.concat(), which is a top-level function, you call append() directly on a dataframe and pass the one you want to add. This method achieves the same vertical stacking as the default pd.concat(), also preserving the original indices. For more details on appending dataframes in Python, check out our comprehensive guide.
However, it's important to note:
- The append() method has been deprecated in recent pandas versions and removed entirely in pandas 2.0.
- For modern code, you should always prefer pd.concat() for better performance and consistency.
Concatenating along columns with axis=1
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]})
result = pd.concat([df1, df2], axis=1)
print(result)

Output:
   A  B  C  D
0  1  3  5  7
1  2  4  6  8
To join dataframes horizontally, you can set the axis parameter to 1. This tells pd.concat() to align the dataframes side-by-side, which adds new columns instead of stacking rows.
- The function uses the index to match and align rows from each dataframe.
- This method is ideal when you're combining datasets that share the same index but have different columns, creating a wider table.
Concatenating with ignore_index=True
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2], ignore_index=True)
print(result)

Output:
   A  B
0  1  3
1  2  4
2  5  7
3  6  8
To avoid the duplicate index labels we saw earlier, you can set the ignore_index parameter to True. This tells pd.concat() to discard the original indices from each dataframe and create a new one from scratch.
This is a common and highly recommended practice because:
- It ensures your resulting dataframe has a clean, unique index ranging from 0 to n-1.
- It prevents potential errors in later operations that rely on unique index labels for slicing or joining.
Advanced concatenation techniques
Beyond basic stacking, you can handle mismatched columns with different join options, create a hierarchical index with keys, or fill in gaps using combine_first().
Using different join options
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})
inner_join = pd.concat([df1, df2], join='inner')
outer_join = pd.concat([df1, df2], join='outer')
print("Inner join:\n", inner_join, "\n\nOuter join:\n", outer_join)

Output:
Inner join:
   B
0  3
1  4
0  5
1  6

Outer join:
     A    B    C
0  1.0  3.0  NaN
1  2.0  4.0  NaN
0  NaN  5.0  7.0
1  NaN  6.0  8.0
When your dataframes don't share the exact same columns, the join parameter in pd.concat() lets you decide how to handle the mismatch. This is crucial for managing data that doesn't align perfectly. Here’s how the two main options work:
- An 'inner' join keeps only the columns that appear in all dataframes. In the example, since only column 'B' is shared, it's the only one in the final result.
- An 'outer' join, the default behavior, keeps all columns from every dataframe. It fills any missing values with NaN, which is why you see gaps for columns 'A' and 'C'.
Adding hierarchical index with keys
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2], keys=['df1', 'df2'])
print(result)
print("\nAccessing first dataframe:")
print(result.loc['df1'])

Output:
       A  B
df1 0  1  3
    1  2  4
df2 0  5  7
    1  6  8

Accessing first dataframe:
   A  B
0  1  3
1  2  4
The keys parameter lets you create a hierarchical index, which is a great way to keep track of your data's origin after concatenation. By passing a list of labels like ['df1', 'df2'], you add an outer level to the index that identifies which original dataframe each row came from. This is a powerful alternative to resetting the index.
- This creates a MultiIndex, where the outer keys group the original indices.
- It allows for easy data retrieval. For example, you can select all rows from the first dataframe using result.loc['df1'].
Using combine_first() for merging with priority
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [None, 6], 'B': [7, 8], 'C': [9, 10]})
result = df1.combine_first(df2)
print(result)

Output:
     A    B     C
0  1.0  7.0   9.0
1  2.0  4.0  10.0
The combine_first() method is a smart way to patch missing data by merging two dataframes. It prioritizes the calling dataframe—in this case, df1—and uses the second dataframe (df2) to fill in any gaps. This technique complements other approaches for merging dataframes in Python.
- It fills None values in df1 with corresponding values from df2. Notice how the None in column 'B' is replaced with 7.
- Non-missing values in df1 are always kept. That's why the 4 in column 'B' remains unchanged.
- It also incorporates new columns, like 'C', from the second dataframe.
Move faster with Replit
Replit is an AI-powered development platform that lets you start coding Python instantly. It comes with all Python dependencies pre-installed, so you can skip setup and focus on building.
While mastering functions like pd.concat() is essential, building a full application requires more than just piecing techniques together. This is where Agent 4 comes in. It helps you go from an idea to a working product by handling the code, databases, APIs, and even deployment, all from a simple description.
Instead of piecing together individual functions, you can describe the complete application you want to build. Agent 4 can take it from there, creating tools such as:
- A sales dashboard that uses pd.concat() to merge daily reports from different regions into a single, unified view.
- A data cleaning utility that uses combine_first() to patch missing customer information in one dataset with data from a secondary source.
- A feature engineering script that joins multiple data sources horizontally with axis=1 to create a wider dataset for a machine learning model.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Even with a powerful tool like pd.concat(), you might encounter issues with mismatched columns, duplicate indices, or high memory usage.
Fixing errors when concatenating DataFrames with mismatched columns
When you combine DataFrames that don't share the same columns, you'll often see NaN (Not a Number) values appear. This is the default behavior of join='outer', which keeps all columns from all DataFrames and fills the gaps. If you only want to keep columns that are common to every DataFrame, you can switch to join='inner' to create a cleaner, intersectional result. Understanding the basics of creating dataframes in Python can help you avoid these structural mismatches from the start.
Resolving duplicate indices when using pd.concat()
A frequent challenge is the creation of duplicate index labels, since pd.concat() preserves the original indices by default. This can cause errors when you try to select data using methods like .loc[]. The most straightforward solution is to set the ignore_index=True parameter, which discards the old indices and generates a new, clean one from 0 to n-1.
Optimizing memory usage when concatenating large DataFrames
Concatenating large DataFrames can consume significant memory because pandas typically creates a new data object. To manage this, you can take several steps to optimize performance and prevent your system from slowing down or running out of memory.
- Process your data in smaller chunks instead of concatenating a large list of DataFrames all at once.
- Downcast numeric data types where possible, for example from float64 to float32, if the lower precision is acceptable for your analysis.
- Explicitly delete the original DataFrames after concatenation using del to free up memory immediately.
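As a rough sketch of the downcasting idea (the column name here is made up, and the actual savings depend on your data):

```python
import numpy as np
import pandas as pd

# A million float64 values: roughly 8 MB of data
df = pd.DataFrame({'x': np.arange(1_000_000, dtype='float64')})
before = df['x'].memory_usage(deep=True)

# float32 halves the per-value footprint when the precision loss is acceptable
df['x'] = df['x'].astype('float32')
after = df['x'].memory_usage(deep=True)

print(f"{before:,} bytes -> {after:,} bytes")
```

Downcasting before concatenation means the combined result starts smaller, which matters most when you're merging many large pieces.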
Fixing errors when concatenating DataFrames with mismatched columns
A common snag is concatenating DataFrames that don't have any overlapping columns. By default, pd.concat() will still join them, creating a new DataFrame that includes all columns from both and fills the gaps with NaN values. The code below shows this in action.
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]})
result = pd.concat([df1, df2])
print(result)
Because the DataFrames are stacked vertically without any shared columns, pd.concat() produces a sparse result with many NaN values. The code below demonstrates how to achieve a more useful outcome.
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]})
result = pd.concat([df1, df2], axis=1)
print(result)
If your goal is to combine dataframes with completely different columns, stacking them vertically creates a sparse result filled with NaN values. The solution is to join them horizontally by setting axis=1. This aligns the dataframes side-by-side based on their index, creating a wider, more coherent dataset. This approach is perfect when you're adding new features or measurements that correspond to the same set of rows, effectively expanding your dataset's width instead of its length.
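One caveat worth remembering: with axis=1, alignment happens on index labels, not row positions. A small sketch of what happens when the indices only partially overlap:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'B': [3, 4]}, index=[1, 2])

# Outer (the default): union of index labels, with NaN where a row is missing
wide = pd.concat([df1, df2], axis=1)

# Inner: only the index labels present in both dataframes
overlap = pd.concat([df1, df2], axis=1, join='inner')
print(wide)
print(overlap)
```

If your dataframes use a default RangeIndex of different lengths, the outer result will quietly grow NaN rows, so it's worth checking the indices before joining horizontally.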
Resolving duplicate indices when using pd.concat()
By default, pd.concat() keeps the original indices from each DataFrame. This often leads to duplicate labels, which can cause ambiguity when you try to select data. The code below demonstrates how this can cause unexpected results when using .loc[].
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2])
print(result.loc[0]) # This will be ambiguous
Because both original dataframes contain an index of 0, the call to result.loc[0] is ambiguous: instead of a single row, pandas returns both rows carrying that label, which can silently break code that expects one result. The following example shows how to avoid this issue.
import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2], ignore_index=True)
print(result.loc[0]) # Now we get only the first row
The fix is to set the ignore_index parameter to True. This tells pd.concat() to create a fresh, continuous index for the new dataframe, resolving the ambiguity. As a result, result.loc[0] now correctly selects the first row. You should always consider this option when your original indices aren't meaningful and you need to prevent selection errors down the line.
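If you'd rather have pandas fail loudly than silently produce duplicates, pd.concat() also accepts a verify_integrity=True flag, which raises a ValueError whenever the result would contain duplicate index labels:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]})
df2 = pd.DataFrame({'A': [5, 6]})

try:
    result = pd.concat([df1, df2], verify_integrity=True)
    failed = False
except ValueError:
    # pandas refuses to build a result with duplicate index labels
    failed = True

print("Raised ValueError:", failed)
```

This is a useful guardrail in pipelines where a duplicate index indicates a data problem you want to catch early rather than paper over.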
Optimizing memory usage when concatenating large DataFrames
Combining many DataFrames, especially in a loop, can quickly eat up your memory. Each time you call pd.concat(), pandas may create a new copy, leading to poor performance. The following code demonstrates this common but inefficient pattern in action.
import pandas as pd
big_df = pd.DataFrame()
for i in range(100):
    temp_df = pd.DataFrame({'A': [i], 'B': [i*2]})
    big_df = pd.concat([big_df, temp_df])  # Inefficient approach
This loop is inefficient because it rebuilds the entire big_df from scratch with each pass. As the dataframe grows, each pd.concat() call becomes progressively slower. The code below shows a much better way to handle this.
import pandas as pd
df_list = []
for i in range(100):
    df_list.append(pd.DataFrame({'A': [i], 'B': [i*2]}))
big_df = pd.concat(df_list) # More efficient approach
A much better approach is to collect all your dataframes in a list first. After the loop finishes, you call pd.concat() just once on that list. It's far more memory-efficient because it avoids creating a new, larger dataframe with every iteration. This technique is crucial when you're combining many dataframes, as it significantly improves performance and prevents your program from slowing down, especially when working with large datasets.
Real-world applications
With the mechanics of pd.concat() covered, you can now use it for real-world tasks like combining time series or building multi-level datasets. These patterns are also perfect for vibe coding approaches.
Combining time series data for trend analysis
You can use pd.concat() to stitch together separate time-based datasets, like monthly sales reports, creating a single continuous timeline for trend analysis.
import pandas as pd
# Creating sample monthly data
jan_data = pd.DataFrame({
    'date': pd.date_range('2023-01-01', '2023-01-05'),
    'sales': [120, 135, 110, 140, 125]
})
feb_data = pd.DataFrame({
    'date': pd.date_range('2023-02-01', '2023-02-05'),
    'sales': [130, 145, 115, 150, 135]
})
# Combine monthly data for trend analysis
combined_data = pd.concat([jan_data, feb_data], ignore_index=True)
combined_data['month'] = combined_data['date'].dt.month_name()
print(combined_data)
This code merges two separate monthly sales DataFrames into a single, unified dataset. It’s a common task when you have data split across multiple files or tables.
- The pd.concat() function stacks the January and February data vertically.
- Using ignore_index=True creates a fresh, continuous index, which prevents conflicts from the original DataFrames.
- Finally, a new month column is added by extracting the month name from the date column, making the dataset easier to interpret and analyze. This pattern is especially useful when merging CSV files in Python from different time periods.
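As a sketch of that CSV-merging pattern (the file names and temporary directory here are made up for the example):

```python
import glob
import os
import tempfile

import pandas as pd

# Create two small CSVs to stand in for monthly exports (hypothetical names)
tmp = tempfile.mkdtemp()
pd.DataFrame({'date': ['2023-01-01'], 'sales': [120]}).to_csv(
    os.path.join(tmp, 'sales_2023_01.csv'), index=False)
pd.DataFrame({'date': ['2023-02-01'], 'sales': [130]}).to_csv(
    os.path.join(tmp, 'sales_2023_02.csv'), index=False)

# The actual pattern: glob the files, read each one, then concat exactly once
files = sorted(glob.glob(os.path.join(tmp, 'sales_2023_*.csv')))
frames = [pd.read_csv(f, parse_dates=['date']) for f in files]
combined = pd.concat(frames, ignore_index=True)
print(combined)
```

Sorting the file list keeps the rows in chronological order, and building the list first means pd.concat() runs only once rather than once per file.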
Building a multi-level analysis dataset with keys
By using the keys parameter in pd.concat(), you can build a multi-level dataset that keeps your data organized by its source, making it easy to analyze segments like regional sales separately.
import pandas as pd
# Sales data from two different regions
us_sales = pd.DataFrame({
    'product_id': [101, 102, 103],
    'units_sold': [150, 200, 75],
    'revenue': [4500, 3800, 2925]
})
eu_sales = pd.DataFrame({
    'product_id': [101, 102, 104],
    'units_sold': [90, 120, 60],
    'revenue': [2970, 2280, 2400]
})
# Combine with hierarchical index
combined_sales = pd.concat([us_sales, eu_sales], keys=['US', 'EU'])
print(combined_sales)
print("\nUS sales summary:")
print(combined_sales.loc['US'].sum())
This code combines sales data from two regions, the US and EU, into a single DataFrame. By using the keys parameter, it creates a hierarchical index that labels which rows belong to which original dataset. This is a powerful way to keep your data organized after merging.
- This multi-level index makes it easy to select data from a specific source. For example, combined_sales.loc['US'] isolates only the data from the US sales DataFrame.
- You can then perform calculations like sum() on just that subset, allowing for targeted analysis within the combined dataset.
Get started with Replit
Put your knowledge into practice. Tell Replit Agent to build "a dashboard that combines sales data from US and EU CSVs" or "a utility that patches a customer list with data from a secondary file".
Replit Agent writes the code, tests for errors, and deploys your app. Start building with Replit.
Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.