How to merge CSV files in Python
Learn how to merge CSV files in Python. Explore various methods, tips and tricks, real-world applications, and how to debug common errors.

You often need to merge CSV files in Python for data analysis and management. Python's libraries give you powerful tools to combine multiple files into one unified dataset efficiently.
In this article, you'll explore several techniques to merge your data. You'll find practical tips, real-world applications, and debugging advice to help you choose the best method for your project.
Using pandas to concatenate CSV files
import pandas as pd
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
merged_df = pd.concat([df1, df2])
merged_df.to_csv('merged_file.csv', index=False)
print(f"Merged CSV contains {len(merged_df)} rows")
# Output: Merged CSV contains 300 rows
The pandas library treats each CSV as a DataFrame, which is essentially a table in memory. The key function here is pd.concat(). It stacks the two DataFrames vertically, an efficient way to combine files that share the same column layout. If you want a continuous row index in the result rather than each file's original numbering, pass ignore_index=True to pd.concat().
When saving the result with to_csv(), including index=False is a crucial step. This argument prevents pandas from writing the DataFrame's index as an extra column in your final CSV file, which keeps your data clean and avoids potential confusion later.
Basic CSV merging techniques
If you prefer to work without external libraries, Python’s built-in csv and glob modules offer powerful, dependency-free alternatives for combining your data.
Merging CSV files with the csv module
import csv

with open('merged_file.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for filename in ['file1.csv', 'file2.csv']:
        with open(filename, 'r', newline='') as infile:
            reader = csv.reader(infile)
            if filename == 'file1.csv':  # Write header only once
                writer.writerows(reader)
            else:
                next(reader)  # Skip header in subsequent files
                writer.writerows(reader)
# No console output, but creates a merged CSV file
This approach uses Python's built-in csv module to read from your source files and write to a single output file. The script iterates through each CSV, ensuring the final file is correctly formatted without duplicate headers.
- It writes the entire content of the first file, including its header, to the new merged file.
- For every subsequent file, it uses next(reader) to skip the header row before appending the remaining data.
Using glob to merge multiple CSV files
import pandas as pd
import glob
csv_files = glob.glob('data/*.csv')
merged_df = pd.concat([pd.read_csv(file) for file in csv_files])
merged_df.to_csv('merged_file.csv', index=False)
print(f"Merged {len(csv_files)} CSV files")
# Output: Merged 5 CSV files
The glob module is your go-to for finding files that match a specific pattern. The expression glob.glob('data/*.csv') gathers all file paths ending in .csv from the data directory, making it easy to work with dozens of files at once.
- A list comprehension then reads each file into a DataFrame: [pd.read_csv(file) for file in csv_files].
- Finally, pd.concat() takes this list of DataFrames and merges them into one, scaling up the simple concatenation you saw earlier.
Merging with standard file I/O operations
with open('file1.csv', 'r') as f1:
    header = f1.readline()
    file1_content = f1.read()

with open('merged_file.csv', 'w') as output:
    output.write(header)
    output.write(file1_content)
    with open('file2.csv', 'r') as f2:
        next(f2)  # Skip header line
        output.write(f2.read())
# No console output, but creates a merged CSV file
This method offers a lightweight way to merge files using only Python's built-in file operations. It's a manual but memory-efficient approach, perfect for very large datasets where loading everything into a DataFrame isn't practical. The script handles headers carefully to avoid duplicates.
- It reads the header from file1.csv using readline() and writes it to the new file.
- For any other files, it uses next() to skip their headers before appending the remaining content.
Advanced CSV merging methods
When simple stacking isn't enough, you can use more advanced methods to join data on key columns, process huge files efficiently, and accelerate the entire merge.
Joining CSV files with merge() on key columns
import pandas as pd
customers = pd.read_csv('customers.csv')
orders = pd.read_csv('orders.csv')
# Join customers with their orders
merged_data = pd.merge(customers, orders, on='customer_id', how='inner')
print(f"Merged data has {merged_data.shape[0]} rows and {merged_data.shape[1]} columns")
merged_data.to_csv('customer_orders.csv', index=False)
# Output: Merged data has 120 rows and 8 columns
The pd.merge() function is your tool for more sophisticated joins. Unlike concat(), which stacks data vertically, merge() combines DataFrames horizontally based on a shared column—similar to a SQL join. It's ideal for linking related datasets, like customer details with their purchase history.
- The on='customer_id' argument specifies the common column to match rows between the two files.
- Setting how='inner' creates an inner join. This means the final output will only contain rows where the customer_id exists in both the customers and orders DataFrames.
Processing large CSV files with chunksize
import pandas as pd

chunk_size = 10000
total_rows = 0
write_header = True
for filename in ['large_file1.csv', 'large_file2.csv']:
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        # Append each chunk straight to the output file so only one
        # chunk is ever held in memory
        chunk.to_csv('merged_large_files.csv',
                     mode='w' if write_header else 'a',
                     header=write_header, index=False)
        total_rows += len(chunk)
        write_header = False
print(f"Processed data in chunks, total rows: {total_rows}")
# Output: Processed data in chunks, total rows: 1250000
When you're working with CSV files too large to fit into memory, the chunksize parameter in pd.read_csv() is your solution. Instead of returning one DataFrame, it gives you an iterator that yields the file in smaller, manageable pieces, which prevents your system from running out of RAM.
- The code iterates through each file, reading it in segments of a specified size, in this case 10,000 rows at a time.
- Each chunk is appended directly to the output file with to_csv() in append mode, writing the header only once. Because no chunk is kept around after it's written, memory use stays flat no matter how large the inputs are.
Using multiprocessing for faster CSV merging
import pandas as pd
from multiprocessing import Pool
import glob

def process_file(filename):
    return pd.read_csv(filename)

if __name__ == '__main__':
    # The guard keeps worker processes from re-executing this block on
    # platforms that spawn fresh interpreters
    csv_files = glob.glob('data/*.csv')
    with Pool(processes=4) as pool:
        dfs = pool.map(process_file, csv_files)
    merged_df = pd.concat(dfs, ignore_index=True)
    print(f"Parallel processing complete. Merged {len(csv_files)} files with {len(merged_df)} total rows")
    merged_df.to_csv('parallel_merged.csv', index=False)
# Output: Parallel processing complete. Merged 5 files with 500000 total rows
For a significant speed boost, you can use Python's multiprocessing module to read multiple files simultaneously. This approach leverages multiple CPU cores to parse files in parallel, which is especially effective because CSV parsing is CPU-intensive rather than purely disk-bound.
- A Pool of worker processes is created, in this case four.
- The pool.map() function distributes the list of file paths, assigning each file to a different process to be read into a DataFrame concurrently.
- Finally, pd.concat() merges the resulting list of DataFrames into one.
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. You can describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
The CSV merging techniques in this article can be the foundation for powerful, real-world tools. With Replit Agent, you can turn these concepts into production applications.
- Build a dashboard that automatically combines daily report CSVs from a folder into a single, unified view.
- Create a data enrichment utility that joins new user sign-ups with an existing customer database using pd.merge() on a common ID.
- Deploy a log analysis pipeline that processes and merges massive datasets by reading files in chunks, just like the chunksize method.
You can take these concepts from idea to deployed application without getting bogged down in boilerplate. Describe your app, and Replit Agent writes the code, tests it, and handles fixes automatically. Start building your next project with Replit Agent.
Common errors and challenges
Merging CSVs is powerful, but you might run into a few common roadblocks like mismatched columns or encoding errors.
When you use pd.concat(), you might get an error if your files don't have the exact same columns. This happens when column names or their order differs across files. To fix this, you can either standardize the column names before merging or use the function's built-in flexibility.
- Setting join='inner' tells pandas to only keep columns that are present in every single file.
- The default, join='outer', keeps all columns from all files and fills any missing spots with NaN (Not a Number) values, which is useful but can create sparse data.
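To make the difference concrete, here's a minimal sketch using two small in-memory DataFrames (the column names are invented for illustration):

```python
import pandas as pd

# Two frames with overlapping but non-identical columns
df1 = pd.DataFrame({'id': [1, 2], 'amount': [10.0, 20.0]})
df2 = pd.DataFrame({'id': [3, 4], 'region': ['EU', 'US']})

# outer (the default) keeps every column and fills the gaps with NaN
outer = pd.concat([df1, df2], join='outer', ignore_index=True)
print(outer.columns.tolist())  # ['id', 'amount', 'region']

# inner keeps only the columns shared by every frame
inner = pd.concat([df1, df2], join='inner', ignore_index=True)
print(inner.columns.tolist())  # ['id']
```

With join='outer', the amount and region columns each end up half NaN; with join='inner', only the shared id column survives.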
Another common hiccup is when the same column has different data types in different files. For example, a 'user_id' column might be a number in one CSV and text in another. When merged, pandas often defaults to the generic object type, which can slow down your analysis and lead to unexpected behavior.
The best way to prevent this is to enforce consistency when you read the files. Use the dtype parameter in pd.read_csv() to explicitly tell pandas what data type each column should be, ensuring uniformity from the start.
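As a small sketch of that approach, the snippet below simulates two files with io.StringIO so it runs on its own; the user_id column name is just an example. Forcing the column to str at read time keeps it consistent in both frames and even preserves leading zeros:

```python
import io
import pandas as pd

# Simulate two files where user_id is text in one and numeric in the other
file_a = io.StringIO("user_id,score\n001,10\n002,20\n")
file_b = io.StringIO("user_id,score\n3,30\n4,40\n")

# dtype forces a consistent type in both frames at read time
df_a = pd.read_csv(file_a, dtype={'user_id': str})
df_b = pd.read_csv(file_b, dtype={'user_id': str})

merged = pd.concat([df_a, df_b], ignore_index=True)
print(merged['user_id'].tolist())  # ['001', '002', '3', '4']
```

Without the dtype argument, the first frame would parse user_id as integers and silently drop the leading zeros.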
Sometimes, you'll encounter a UnicodeDecodeError. This error usually means your CSV files were saved with different character encodings, like UTF-8 versus latin-1. Python gets confused when it tries to read a file with the wrong "alphabet."
To resolve this, you need to specify the correct encoding for each file. You can often find a file's encoding by opening it in a good text editor. Once you know it, pass it to the encoding parameter in pd.read_csv() to ensure the data is read correctly without garbled characters.
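If you'd rather not inspect files by hand, one pragmatic option is to try a short list of likely encodings until one decodes cleanly. This sketch writes a latin-1 sample file purely for demonstration, and the helper name read_csv_any is our own invention:

```python
import tempfile
import pandas as pd

# Create a sample CSV saved in latin-1 to stand in for a file of
# unknown encoding
with tempfile.NamedTemporaryFile('w', suffix='.csv', encoding='latin-1',
                                 delete=False) as f:
    f.write("name,city\nJosé,Málaga\n")
    path = f.name

def read_csv_any(path, encodings=('utf-8', 'latin-1', 'cp1252')):
    """Try a list of likely encodings until one decodes cleanly."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of {encodings} could decode {path}")

df = read_csv_any(path)
print(df.loc[0, 'name'])  # José
```

Here the utf-8 attempt fails on the latin-1 bytes, so the helper falls through to latin-1 and reads the accented characters correctly.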
Handling mismatched column errors when using pd.concat()
Using pd.concat() on files with different column structures won't cause an error. Instead, pandas keeps every column from all files and fills missing data with NaN values. This can result in a sparse DataFrame that is difficult to manage. See it in action below.
import pandas as pd
# Files with different columns
df1 = pd.read_csv('sales.csv') # Columns: date, product, amount
df2 = pd.read_csv('inventory.csv') # Columns: product, quantity, location
# This will concatenate but keep all columns, creating NaN values
merged_df = pd.concat([df1, df2])
print(merged_df.isnull().sum())
The pd.concat() function combines the sales and inventory DataFrames. Because their columns don't match, pandas fills the gaps with NaN values. The code below shows how you can manage this mismatched data.
import pandas as pd
# Files with different columns
df1 = pd.read_csv('sales.csv') # Columns: date, product, amount
df2 = pd.read_csv('inventory.csv') # Columns: product, quantity, location
# Use only columns that are in both dataframes
common_columns = list(set(df1.columns).intersection(set(df2.columns)))
merged_df = pd.concat([df1[common_columns], df2[common_columns]])
print(f"Merged on common columns: {common_columns}")
This solution proactively finds common ground between your files. It uses set().intersection() to create a list of only the column names that appear in both DataFrames. By selecting just these common_columns before calling pd.concat(), you create a clean, merged file that contains only shared data. This approach is perfect when you need to combine datasets with overlapping but non-identical structures and want to avoid a final file filled with empty values.
Dealing with inconsistent data types between CSV files
Inconsistent data types create tricky issues. A column like 'revenue' might be text in one file and a number in another. When you merge them, pandas often converts the entire column to text, making calculations like sums impossible. The code below demonstrates this problem.
import pandas as pd
# Files with same columns but different data types
df1 = pd.read_csv('sales_2021.csv') # 'revenue' stored as string: "1,000.50"
df2 = pd.read_csv('sales_2022.csv') # 'revenue' stored as float: 1000.50
# This will concatenate but may cause issues with operations later
merged_df = pd.concat([df1, df2])
# This will raise an error due to inconsistent types
total_revenue = merged_df['revenue'].sum()
The sum() operation fails because the merged revenue column contains strings from sales_2021.csv. Python can't add text and numbers together. The following code demonstrates how to handle this during the import process.
import pandas as pd
# Files with same columns but different data types
df1 = pd.read_csv('sales_2021.csv') # 'revenue' stored as string: "1,000.50"
df2 = pd.read_csv('sales_2022.csv') # 'revenue' stored as float: 1000.50
# Convert string to float in the first dataframe
df1['revenue'] = df1['revenue'].str.replace(',', '').astype(float)
# Now concatenation will work properly
merged_df = pd.concat([df1, df2])
total_revenue = merged_df['revenue'].sum()
print(f"Total revenue: ${total_revenue:.2f}")
This solution cleans the data before merging. It uses .str.replace() to remove commas from the string-based revenue column, then converts it to a number with .astype(float).
By standardizing the data type in each DataFrame first, you ensure pd.concat() creates a unified column ready for analysis. This is crucial when you plan to perform calculations, as it prevents errors and ensures your results are accurate.
Resolving encoding issues when merging CSV files
A UnicodeDecodeError is a common roadblock that appears when your CSV files don't share the same character encoding. For example, one file might be saved as UTF-8 while another uses latin-1, causing Python to fail when reading the data.
The code below shows what happens when you try to merge files with conflicting encodings without specifying how to read them.
import pandas as pd

# Trying to merge files with different encodings
try:
    df1 = pd.read_csv('european_sales.csv')  # UTF-8 encoding
    df2 = pd.read_csv('asian_sales.csv')  # May have a different encoding
    merged_df = pd.concat([df1, df2])
    print(f"Merged dataframe has {len(merged_df)} rows")
except UnicodeDecodeError:
    print("Error: Could not decode file with the default encoding")
The pd.read_csv() function defaults to UTF-8, so it fails when it tries to read a file saved with a different character set. The following code demonstrates how to handle this during the import process.
import pandas as pd
# Explicitly specify encodings for each file
df1 = pd.read_csv('european_sales.csv', encoding='utf-8')
df2 = pd.read_csv('asian_sales.csv', encoding='iso-8859-1')
# Now concatenation works with the proper encodings
merged_df = pd.concat([df1, df2])
print(f"Successfully merged {len(df1)} and {len(df2)} rows")
This solution directly addresses the encoding mismatch by telling pandas how to interpret each file. You can prevent a UnicodeDecodeError by specifying the correct character set in pd.read_csv() using the encoding parameter.
- This ensures all characters, especially those in international datasets, are read correctly.
- For example, you might use encoding='utf-8' for one file and encoding='iso-8859-1' for another, allowing a clean merge.
Real-world applications
Now that you can navigate common merging errors, you can use these skills to build practical data tools and applications.
Cleaning data while merging CSV files
Merging your files creates a single, unified dataset where you can efficiently handle duplicates, fix data types, and remove missing values all at once.
import pandas as pd
# Read files and merge
df1 = pd.read_csv('sales_q1.csv')
df2 = pd.read_csv('sales_q2.csv')
merged_df = pd.concat([df1, df2])
# Clean the merged data
merged_df = merged_df.drop_duplicates()
merged_df = merged_df.dropna(subset=['order_id', 'amount'])
merged_df['amount'] = merged_df['amount'].astype(float)
print(f"Clean merged data: {len(merged_df)} rows")
This script first merges two quarterly sales files using pd.concat(). After combining the data, it performs a multi-step cleanup to ensure data quality before any analysis.
- It removes any identical rows with drop_duplicates().
- Then, dropna() discards records missing an order_id or amount.
- Finally, it standardizes the amount column to a numeric type using astype(float), which is essential for accurate calculations.
Creating a dashboard from multiple CSV data sources
You can merge data from different sources, like sales and inventory reports, to build a visual dashboard that reveals key business insights.
import pandas as pd
import matplotlib.pyplot as plt
# Load data from different departments
sales = pd.read_csv('sales.csv')
inventory = pd.read_csv('inventory.csv')
# Merge on product_id
combined = pd.merge(sales, inventory, on='product_id')
# Create a simple dashboard
top_products = combined.groupby('product_name')['revenue'].sum().nlargest(5)
top_products.plot(kind='bar')
plt.title('Top 5 Products by Revenue')
plt.savefig('sales_dashboard.png')
print(f"Dashboard created with data from {len(sales)} sales and {len(inventory)} inventory records")
This script shows how to create a visual report from separate datasets. It joins sales and inventory data using pd.merge() on their shared product_id, linking sales figures with product details.
- The code then aggregates this combined data with groupby('product_name') to calculate total revenue.
- It isolates the top five performers using nlargest(5).
- Finally, matplotlib generates a bar chart from this data and saves it as a PNG, turning raw numbers into a shareable insight.
Get started with Replit
Turn what you've learned into a real tool. Describe your idea to Replit Agent: "Build a utility that merges all CSVs in a directory" or "Create an app that joins user data with purchase history on a common ID."
Replit Agent writes the code, tests for errors, and deploys your app, turning your idea into a live tool. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.