How to concatenate two dataframes in Python

Learn how to concatenate two dataframes in Python. Explore methods, tips, real-world applications, and how to debug common errors.

Published on: Fri, Feb 20, 2026
Updated on: Mon, Apr 6, 2026
The Replit Team

You often need to combine dataframes in Python for data manipulation and analysis. The pandas library offers powerful tools, like the concat() function, to merge datasets efficiently and accurately.

In this article, we'll explore various concatenation techniques and share practical tips for common scenarios. You'll also find real-world applications and debugging advice to help you master dataframe combination for any project.

Basic concatenation with pd.concat()

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2])
print(result)

Output:

   A  B
0  1  3
1  2  4
0  5  7
1  6  8

The pd.concat() function is your go-to for stacking dataframes. It takes an iterable—in this case, a list of dataframes [df1, df2]—and joins them along an axis. By default, it stacks them vertically (axis=0), aligning the data by matching column labels. This type of data manipulation is where AI-powered coding with Python really shines.

Pay close attention to the index in the output. The function preserves the original index from each dataframe, which is why you see the labels 0 and 1 repeated. This default behavior can cause issues later, so it's a common practice to reset the index after concatenation.
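As a quick sketch of that practice, reusing the same df1 and df2 from the snippet above, reset_index(drop=True) rebuilds a clean 0 to n-1 index after the concat:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# drop=True discards the old labels instead of keeping them as a new column
result = pd.concat([df1, df2]).reset_index(drop=True)
print(result.index.tolist())  # [0, 1, 2, 3]
```

This is equivalent to passing ignore_index=True to pd.concat() directly, which we cover below.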

Basic concatenation methods

Beyond the default vertical stacking, you can also use the legacy append() method (no longer available in modern pandas), join columns with axis=1, or reset the index using ignore_index=True.

Using the append() method

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = df1.append(df2)  # append() was deprecated in pandas 1.4 and removed in pandas 2.0
print(result)

Output:

   A  B
0  1  3
1  2  4
0  5  7
1  6  8

The append() method offers another way to stack dataframes. Unlike pd.concat(), which is a top-level function, you call append() directly on a dataframe and pass the one you want to add. This method achieves the same vertical stacking as the default pd.concat(), also preserving the original indices. For more details on appending dataframes in Python, check out our comprehensive guide.

However, it's important to note:

  • The append() method was deprecated in pandas 1.4 and removed entirely in pandas 2.0, where calling it raises an AttributeError.
  • In modern code, prefer pd.concat() for better performance and consistency.

Concatenating along columns with axis=1

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]})
result = pd.concat([df1, df2], axis=1)
print(result)

Output:

   A  B  C  D
0  1  3  5  7
1  2  4  6  8

To join dataframes horizontally, you can set the axis parameter to 1. This tells pd.concat() to align the dataframes side-by-side, which adds new columns instead of stacking rows.

  • The function uses the index to match and align rows from each dataframe.
  • This method is ideal when you're combining datasets that share the same index but have different columns, creating a wider table.
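To make the index-alignment rule concrete, here is a small sketch with toy data (not from the examples above) where the indices only partially overlap; unmatched labels produce NaN:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2]}, index=[0, 1])
df2 = pd.DataFrame({'B': [3, 4]}, index=[1, 2])

# Rows are matched by index label; labels missing from one side become NaN
result = pd.concat([df1, df2], axis=1)
print(result)
```

Only index label 1 appears in both dataframes, so rows 0 and 2 each carry a NaN, and the integer columns are upcast to float to hold it.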

Concatenating with ignore_index=True

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2], ignore_index=True)
print(result)

Output:

   A  B
0  1  3
1  2  4
2  5  7
3  6  8

To avoid the duplicate index labels we saw earlier, you can set the ignore_index parameter to True. This tells pd.concat() to discard the original indices from each dataframe and create a new one from scratch.

This is a common and highly recommended practice because:

  • It ensures your resulting dataframe has a clean, unique index ranging from 0 to n-1.
  • It prevents potential errors in later operations that rely on unique index labels for slicing or joining.

Advanced concatenation techniques

Beyond basic stacking, you can handle mismatched columns with different join options, create a hierarchical index with keys, or fill in gaps using combine_first().

Using different join options

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'B': [5, 6], 'C': [7, 8]})
inner_join = pd.concat([df1, df2], join='inner')
outer_join = pd.concat([df1, df2], join='outer')
print("Inner join:\n", inner_join, "\n\nOuter join:\n", outer_join)

Output:

Inner join:
    B
0  3
1  4
0  5
1  6

Outer join:
      A    B    C
0  1.0  3.0  NaN
1  2.0  4.0  NaN
0  NaN  5.0  7.0
1  NaN  6.0  8.0

When your dataframes don't share the exact same columns, the join parameter in pd.concat() lets you decide how to handle the mismatch. This is crucial for managing data that doesn't align perfectly. Here’s how the two main options work:

  • An 'inner' join keeps only the columns that appear in all dataframes. In the example, since only column 'B' is shared, it's the only one in the final result.
  • An 'outer' join—the default behavior—keeps all columns from every dataframe. It fills any missing values with NaN, which is why you see gaps for columns 'A' and 'C'.
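The join parameter also combines with axis=1, in which case it intersects row labels instead of columns. A small sketch with toy data:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2, 3]}, index=[0, 1, 2])
df2 = pd.DataFrame({'B': [4, 5, 6]}, index=[1, 2, 3])

# join='inner' with axis=1 keeps only index labels present in both dataframes
result = pd.concat([df1, df2], axis=1, join='inner')
print(result.index.tolist())  # [1, 2]
```

Labels 0 and 3 are dropped because each exists in only one of the two dataframes.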

Adding hierarchical index with keys

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2], keys=['df1', 'df2'])
print(result)
print("\nAccessing first dataframe:")
print(result.loc['df1'])

Output:

       A  B
df1 0  1  3
    1  2  4
df2 0  5  7
    1  6  8

Accessing first dataframe:
   A  B
0  1  3
1  2  4

The keys parameter lets you create a hierarchical index, which is a great way to keep track of your data's origin after concatenation. By passing a list of labels like ['df1', 'df2'], you add an outer level to the index that identifies which original dataframe each row came from. This is a powerful alternative to resetting the index.

  • This creates a MultiIndex, where the outer keys group the original indices.
  • It allows for easy data retrieval. For example, you can select all rows from the first dataframe using result.loc['df1'].
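As one further refinement, pd.concat() accepts a names parameter that labels the MultiIndex levels, which makes later groupby calls read more clearly. A sketch with the same df1 and df2:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})

# Name the outer level 'source' and the inner level 'row'
result = pd.concat([df1, df2], keys=['df1', 'df2'], names=['source', 'row'])

# The named level can now be referenced directly in a groupby
print(result.groupby(level='source')['A'].sum())
```

Grouping by the named level sums column 'A' per original dataframe: 3 for df1 and 11 for df2.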

Using combine_first() for merging with priority

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [None, 6], 'B': [7, 8], 'C': [9, 10]})
result = df1.combine_first(df2)
print(result)

Output:

     A    B     C
0  1.0  7.0   9.0
1  2.0  4.0  10.0

The combine_first() method is a smart way to patch missing data by merging two dataframes. It prioritizes the calling dataframe—in this case, df1—and uses the second dataframe (df2) to fill in any gaps. This technique complements other approaches for merging dataframes in Python.

  • It fills None values in df1 with corresponding values from df2. Notice how the None in column 'B' is replaced with 7.
  • Non-missing values in df1 are always kept. That's why the 4 in column 'B' remains unchanged.
  • It also incorporates new columns, like 'C', from the second dataframe.
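Because the calling dataframe always wins, the order of the call matters. Swapping it reverses the priority, as this sketch with the same df1 and df2 shows:

```python
import pandas as pd

df1 = pd.DataFrame({'A': [1, 2], 'B': [None, 4]})
df2 = pd.DataFrame({'A': [None, 6], 'B': [7, 8], 'C': [9, 10]})

# Now df2 has priority; df1 only fills df2's missing values
result = df2.combine_first(df1)
print(result.loc[1, 'A'])  # 6.0, df2's value wins over df1's 2
```

Here df2's missing A value in row 0 is patched with df1's 1, while everywhere else df2's values are kept.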

Move faster with Replit

Replit is an AI-powered development platform that lets you start coding Python instantly. It comes with all Python dependencies pre-installed, so you can skip setup and focus on building.

While mastering functions like pd.concat() is essential, building a full application requires more than just piecing techniques together. This is where Agent 4 comes in. It helps you go from an idea to a working product by handling the code, databases, APIs, and even deployment, all from a simple description.

Instead of piecing together individual functions, you can describe the complete application you want to build. Agent 4 can take it from there, creating tools such as:

  • A sales dashboard that uses pd.concat() to merge daily reports from different regions into a single, unified view.
  • A data cleaning utility that uses combine_first() to patch missing customer information in one dataset with data from a secondary source.
  • A feature engineering script that joins multiple data sources horizontally with axis=1 to create a wider dataset for a machine learning model.

Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.

Common errors and challenges

Even with a powerful tool like pd.concat(), you might encounter issues with mismatched columns, duplicate indices, or high memory usage.

Fixing errors when concatenating DataFrames with mismatched columns

When you combine DataFrames that don't share the same columns, you'll often see NaN (Not a Number) values appear. This is the default behavior of join='outer', which keeps all columns from all DataFrames and fills the gaps. If you only want to keep columns that are common to every DataFrame, you can switch to join='inner' to create a cleaner, intersectional result. Understanding the basics of creating dataframes in Python can help you avoid these structural mismatches from the start.

Resolving duplicate indices when using pd.concat()

A frequent challenge is the creation of duplicate index labels, since pd.concat() preserves the original indices by default. Selections such as .loc[0] then return every row carrying that label rather than a single row, and operations that require a unique index can fail outright. The most straightforward solution is to set the ignore_index=True parameter, which discards the old indices and generates a new, clean one from 0 to n-1.

Optimizing memory usage when concatenating large DataFrames

Concatenating large DataFrames can consume significant memory because pandas typically creates a new data object. To manage this, you can take several steps to optimize performance and prevent your system from slowing down or experiencing memory leaks.

  • Process your data in smaller chunks instead of concatenating a large list of DataFrames all at once.
  • Downcast numeric data types where possible—for example, from float64 to float32—if the lower precision is acceptable for your analysis.
  • Explicitly delete the original DataFrames after concatenation using del to free up memory immediately.
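The downcasting step can be sketched like this (toy data; float32 halves per-value storage, but check that the lost precision is acceptable for your analysis):

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': np.arange(100_000, dtype='float64')})

before = df['x'].memory_usage(deep=True)
df['x'] = df['x'].astype('float32')  # 4 bytes per value instead of 8
after = df['x'].memory_usage(deep=True)

print(after < before)  # True
```

The same astype() pattern applies to integers, for example int64 down to int32 or int16 when the value range allows it.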

Fixing errors when concatenating DataFrames with mismatched columns

A common snag is concatenating DataFrames that don't have any overlapping columns. By default, pd.concat() will still join them, creating a new DataFrame that includes all columns from both and fills the gaps with NaN values. The code below shows this in action.

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]})
result = pd.concat([df1, df2])
print(result)

Because the DataFrames are stacked vertically without any shared columns, pd.concat() produces a sparse result with many NaN values. The code below demonstrates how to achieve a more useful outcome.

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'C': [5, 6], 'D': [7, 8]})
result = pd.concat([df1, df2], axis=1)
print(result)

If your goal is to combine dataframes with completely different columns, stacking them vertically creates a sparse result filled with NaN values. The solution is to join them horizontally by setting axis=1. This aligns the dataframes side-by-side based on their index, creating a wider, more coherent dataset. This approach is perfect when you're adding new features or measurements that correspond to the same set of rows, effectively expanding your dataset's width instead of its length.

Resolving duplicate indices when using pd.concat()

By default, pd.concat() keeps the original indices from each DataFrame. This often leads to duplicate labels, which can cause ambiguity when you try to select data. The code below demonstrates how this can cause unexpected results when using .loc[].

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2])
print(result.loc[0]) # Returns both rows labeled 0, not a single row

Because both original dataframes contain an index label of 0, the call to result.loc[0] no longer selects a single row: it silently returns both matching rows as a DataFrame, which is rarely what you intend. The following example shows how to avoid this issue.

import pandas as pd
df1 = pd.DataFrame({'A': [1, 2], 'B': [3, 4]})
df2 = pd.DataFrame({'A': [5, 6], 'B': [7, 8]})
result = pd.concat([df1, df2], ignore_index=True)
print(result.loc[0]) # Now we get only the first row

The fix is to set the ignore_index parameter to True. This tells pd.concat() to create a fresh, continuous index for the new dataframe, resolving the ambiguity. As a result, result.loc[0] now correctly selects the first row. You should always consider this option when your original indices aren't meaningful and you need to prevent selection errors down the line.

Optimizing memory usage when concatenating large DataFrames

Combining many DataFrames, especially in a loop, can quickly eat up your memory. Each time you call pd.concat(), pandas may create a new copy, leading to poor performance. The following code demonstrates this common but inefficient pattern in action.

import pandas as pd
big_df = pd.DataFrame()
for i in range(100):
    temp_df = pd.DataFrame({'A': [i], 'B': [i*2]})
    big_df = pd.concat([big_df, temp_df]) # Inefficient approach

This loop is inefficient because it rebuilds the entire big_df from scratch with each pass. As the dataframe grows, each pd.concat() call becomes progressively slower. The code below shows a much better way to handle this.

import pandas as pd
df_list = []
for i in range(100):
    df_list.append(pd.DataFrame({'A': [i], 'B': [i*2]}))
big_df = pd.concat(df_list) # More efficient approach

A much better approach is to collect all your dataframes in a list first. After the loop finishes, you call pd.concat() just once on that list. It's far more memory-efficient because it avoids creating a new, larger dataframe with every iteration. This technique is crucial when you're combining many dataframes, as it significantly improves performance and prevents your program from slowing down, especially when working with large datasets.

Real-world applications

With the mechanics of pd.concat() covered, you can now use it for real-world tasks like combining time series or building multi-level datasets. These patterns are also perfect for vibe coding approaches.

Combining time series data for trend analysis

You can use pd.concat() to stitch together separate time-based datasets, like monthly sales reports, creating a single continuous timeline for trend analysis.

import pandas as pd

# Creating sample monthly data
jan_data = pd.DataFrame({
    'date': pd.date_range('2023-01-01', '2023-01-05'),
    'sales': [120, 135, 110, 140, 125]
})

feb_data = pd.DataFrame({
    'date': pd.date_range('2023-02-01', '2023-02-05'),
    'sales': [130, 145, 115, 150, 135]
})

# Combine monthly data for trend analysis
combined_data = pd.concat([jan_data, feb_data], ignore_index=True)
combined_data['month'] = combined_data['date'].dt.month_name()
print(combined_data)

This code merges two separate monthly sales DataFrames into a single, unified dataset. It’s a common task when you have data split across multiple files or tables.

  • The pd.concat() function stacks the January and February data vertically.
  • Using ignore_index=True creates a fresh, continuous index, which prevents conflicts from the original DataFrames.
  • Finally, a new month column is added by extracting the month name from the date column, making the dataset easier to interpret and analyze. This pattern is especially useful when merging CSV files in Python from different time periods.
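One extra step worth considering: if the monthly files arrive out of order, sort by date after the concat so the timeline stays continuous. A sketch of that pattern:

```python
import pandas as pd

feb = pd.DataFrame({'date': pd.to_datetime(['2023-02-01', '2023-02-02']),
                    'sales': [130, 145]})
jan = pd.DataFrame({'date': pd.to_datetime(['2023-01-01', '2023-01-02']),
                    'sales': [120, 135]})

# Concatenate in arbitrary order, then restore chronological order
combined = (pd.concat([feb, jan], ignore_index=True)
            .sort_values('date')
            .reset_index(drop=True))
print(combined['date'].is_monotonic_increasing)  # True
```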

Building a multi-level analysis dataset with keys

By using the keys parameter in pd.concat(), you can build a multi-level dataset that keeps your data organized by its source, making it easy to analyze segments like regional sales separately.

import pandas as pd

# Sales data from two different regions
us_sales = pd.DataFrame({
    'product_id': [101, 102, 103],
    'units_sold': [150, 200, 75],
    'revenue': [4500, 3800, 2925]
})

eu_sales = pd.DataFrame({
    'product_id': [101, 102, 104],
    'units_sold': [90, 120, 60],
    'revenue': [2970, 2280, 2400]
})

# Combine with hierarchical index
combined_sales = pd.concat([us_sales, eu_sales], keys=['US', 'EU'])
print(combined_sales)
print("\nUS sales summary:")
print(combined_sales.loc['US'].sum())

This code combines sales data from two regions, the US and EU, into a single DataFrame. By using the keys parameter, it creates a hierarchical index that labels which rows belong to which original dataset. This is a powerful way to keep your data organized after merging.

  • This multi-level index makes it easy to select data from a specific source. For example, combined_sales.loc['US'] isolates only the data from the US sales DataFrame.
  • You can then perform calculations like sum() on just that subset, allowing for targeted analysis within the combined dataset.
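To summarize every region at once instead of selecting them one at a time, you can group on the outer index level. A sketch reusing the revenue figures from above:

```python
import pandas as pd

us_sales = pd.DataFrame({'revenue': [4500, 3800, 2925]})
eu_sales = pd.DataFrame({'revenue': [2970, 2280, 2400]})
combined = pd.concat([us_sales, eu_sales], keys=['US', 'EU'])

# level=0 refers to the outer key ('US' / 'EU') of the MultiIndex
totals = combined.groupby(level=0)['revenue'].sum()
print(totals['US'], totals['EU'])  # 11225 7650
```

Grouping on level 0 collapses each region's rows into a single total, giving you a per-region summary in one call.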

Get started with Replit

Put your knowledge into practice. Tell Replit Agent to build "a dashboard that combines sales data from US and EU CSVs" or "a utility that patches a customer list with data from a secondary file".

Replit Agent writes the code, tests for errors, and deploys your app. Start building with Replit.

Build your first app today

Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.
