How to merge CSV files in Python

Learn how to merge CSV files in Python. Explore various methods, tips and tricks, real-world applications, and how to debug common errors.

Published on: Tue, Mar 3, 2026
Updated on: Thu, Mar 5, 2026
The Replit Team

You often need to merge CSV files in Python for data analysis and management. Python's libraries give you powerful tools to combine multiple files into one unified dataset efficiently.

In this article, you'll explore several techniques to merge your data. You'll find practical tips, real-world applications, and debugging advice to help you choose the best method for your project.

Using pandas to concatenate CSV files

import pandas as pd

df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
merged_df = pd.concat([df1, df2])
merged_df.to_csv('merged_file.csv', index=False)
print(f"Merged CSV contains {len(merged_df)} rows")

Output: Merged CSV contains 300 rows

The pandas library treats each CSV as a DataFrame, which is essentially a table in memory. The key function here is pd.concat(). It stacks the two DataFrames vertically, an efficient way to combine files that share the same column layout.

When saving the result with to_csv(), including index=False is a crucial step. This argument prevents pandas from writing the DataFrame's index as an extra column in your final CSV file, which keeps your data clean and avoids potential confusion later.
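One related detail: pd.concat() preserves each DataFrame's original row index by default, so the merged frame can contain duplicate index labels. A minimal sketch, using two hypothetical in-memory frames in place of the CSVs:

```python
import pandas as pd

# Hypothetical stand-ins for two CSVs with the same columns
df1 = pd.DataFrame({'id': [1, 2], 'value': [10, 20]})
df2 = pd.DataFrame({'id': [3, 4], 'value': [30, 40]})

# Default: each frame keeps its own 0-based index, so labels repeat
stacked = pd.concat([df1, df2])
print(list(stacked.index))  # [0, 1, 0, 1]

# ignore_index=True renumbers the rows into one continuous index
merged = pd.concat([df1, df2], ignore_index=True)
print(list(merged.index))  # [0, 1, 2, 3]
```

Passing ignore_index=True (or writing with index=False, as above) keeps those duplicate labels out of your final file.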

Basic CSV merging techniques

If you prefer to work without external libraries, Python’s built-in csv and glob modules offer powerful, dependency-free alternatives for combining your data.

Merging CSV files with the csv module

import csv

with open('merged_file.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for filename in ['file1.csv', 'file2.csv']:
        with open(filename, 'r', newline='') as infile:
            reader = csv.reader(infile)
            if filename == 'file1.csv':  # Write header only once
                writer.writerows(reader)
            else:
                next(reader)  # Skip header in subsequent files
                writer.writerows(reader)

Output: No console output, but creates a merged CSV file

This approach uses Python's built-in csv module to read from your source files and write to a single output file. The script iterates through each CSV, ensuring the final file is correctly formatted without duplicate headers.

  • It writes the entire content of the first file, including its header, to the new merged file.
  • For every subsequent file, it uses next(reader) to skip the header row before appending the remaining data.
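The hard-coded filename check works for two files, but enumerate() generalizes the same header-skipping logic to any list. A sketch using in-memory strings (hypothetical data) in place of files on disk:

```python
import csv
import io

# Hypothetical file contents keyed by filename
files = {
    'file1.csv': 'id,name\n1,alpha\n2,beta\n',
    'file2.csv': 'id,name\n3,gamma\n',
}

out = io.StringIO()
writer = csv.writer(out)
for i, content in enumerate(files.values()):
    reader = csv.reader(io.StringIO(content))
    if i > 0:
        next(reader)  # skip the header in every file after the first
    writer.writerows(reader)

print(out.getvalue())
```

With real files, you would open each path inside the loop instead of wrapping strings in io.StringIO.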

Using glob to merge multiple CSV files

import pandas as pd
import glob

csv_files = glob.glob('data/*.csv')
merged_df = pd.concat([pd.read_csv(file) for file in csv_files])
merged_df.to_csv('merged_file.csv', index=False)
print(f"Merged {len(csv_files)} CSV files")

Output: Merged 5 CSV files

The glob module is your go-to for finding files that match a specific pattern. The expression glob.glob('data/*.csv') gathers all file paths ending in .csv from the data directory, making it easy to work with dozens of files at once.

  • A list comprehension then reads each file into a DataFrame: [pd.read_csv(file) for file in csv_files].
  • Finally, pd.concat() takes this list of DataFrames and merges them into one, scaling up the simple concatenation you saw earlier.
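Two small refinements help in practice: glob returns paths in arbitrary order, so sorted() makes the row order deterministic, and a source column records which file each row came from. A sketch with hypothetical in-memory files standing in for the glob results:

```python
import io
import pandas as pd

# Hypothetical file contents in place of glob.glob('data/*.csv') results
files = {'q2.csv': 'id\n3\n', 'q1.csv': 'id\n1\n2\n'}

frames = [
    pd.read_csv(io.StringIO(files[name])).assign(source=name)
    for name in sorted(files)  # deterministic file order
]
merged = pd.concat(frames, ignore_index=True)
print(merged['source'].tolist())  # ['q1.csv', 'q1.csv', 'q2.csv']
```

The provenance column makes it easy to trace a bad row back to the file it came from.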

Merging with standard file I/O operations

with open('file1.csv', 'r') as f1:
    header = f1.readline()
    file1_content = f1.read()

with open('merged_file.csv', 'w') as output:
    output.write(header)
    output.write(file1_content)

    with open('file2.csv', 'r') as f2:
        next(f2)  # Skip header line
        output.write(f2.read())

Output: No console output, but creates a merged CSV file

This method offers a lightweight way to merge files using only Python's built-in file operations, with none of pandas' DataFrame overhead. Keep in mind that read() still pulls each file's full contents into memory as a string, so it suits files that fit comfortably in RAM. The script handles headers carefully to avoid duplicates.

  • It reads the header from file1.csv using readline() and writes it to the new file.
  • For any other files, it uses next() to skip their headers before appending the remaining content.
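When individual files are too large to slurp with read(), shutil.copyfileobj streams them in fixed-size buffers instead. A sketch using a temporary directory and hypothetical sample files:

```python
import shutil
import tempfile
from pathlib import Path

# Create two small sample CSVs in a temp directory (hypothetical data)
tmp = tempfile.mkdtemp()
paths = [Path(tmp, 'a.csv'), Path(tmp, 'b.csv')]
paths[0].write_text('id,val\n1,x\n')
paths[1].write_text('id,val\n2,y\n')

merged_path = Path(tmp, 'merged.csv')
with open(merged_path, 'w') as out:
    for i, path in enumerate(paths):
        with open(path) as src:
            if i > 0:
                src.readline()  # skip the header after the first file
            # Copies in buffered blocks; never loads a whole file at once
            shutil.copyfileobj(src, out)

print(merged_path.read_text())
```

This keeps memory use constant no matter how large the input files grow.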

Advanced CSV merging methods

When simple stacking isn't enough, you can use more advanced methods to join data on key columns, process huge files efficiently, and accelerate the entire merge.

Joining CSV files with merge() on key columns

import pandas as pd

customers = pd.read_csv('customers.csv')
orders = pd.read_csv('orders.csv')

# Join customers with their orders
merged_data = pd.merge(customers, orders, on='customer_id', how='inner')
print(f"Merged data has {merged_data.shape[0]} rows and {merged_data.shape[1]} columns")
merged_data.to_csv('customer_orders.csv', index=False)

Output: Merged data has 120 rows and 8 columns

The pd.merge() function is your tool for more sophisticated joins. Unlike concat(), which stacks data vertically, merge() combines DataFrames horizontally based on a shared column—similar to a SQL join. It's ideal for linking related datasets, like customer details with their purchase history.

  • The on='customer_id' argument specifies the common column to match rows between the two files.
  • Setting how='inner' creates an inner join. This means the final output will only contain rows where the customer_id exists in both the customers and orders DataFrames.
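The other join types are worth knowing too. A left join keeps every row from the first frame even when there's no match, and indicator=True adds a _merge column showing where each row came from. A sketch with hypothetical in-memory frames:

```python
import pandas as pd

# Hypothetical stand-ins for customers.csv and orders.csv
customers = pd.DataFrame({'customer_id': [1, 2, 3], 'name': ['Ann', 'Bo', 'Cy']})
orders = pd.DataFrame({'customer_id': [1, 1, 3], 'total': [50, 25, 10]})

# how='left' keeps every customer, even those with no orders;
# indicator=True flags each row as 'both' or 'left_only'
left = pd.merge(customers, orders, on='customer_id', how='left', indicator=True)
print(left['_merge'].tolist())  # ['both', 'both', 'left_only', 'both']
```

Filtering on _merge == 'left_only' is a quick way to find customers with no orders at all.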

Processing large CSV files with chunksize

import pandas as pd

chunk_size = 10000
total_rows = 0
first_chunk = True

for filename in ['large_file1.csv', 'large_file2.csv']:
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        # Write the header once, then append each chunk to the output file
        chunk.to_csv('merged_large_files.csv',
                     mode='w' if first_chunk else 'a',
                     header=first_chunk, index=False)
        first_chunk = False
        total_rows += len(chunk)

print(f"Processed data in chunks, total rows: {total_rows}")

Output: Processed data in chunks, total rows: 1250000

When you're working with CSV files too large to fit into memory, the chunksize parameter in pd.read_csv() is your solution. It allows you to process a massive file in smaller, manageable pieces instead of loading it all at once, which prevents your system from running out of RAM.

  • The code iterates through each file, reading it in segments of a specified size—in this case, 10,000 rows at a time.
  • Each chunk is appended straight to the output file, so memory use never exceeds one chunk. Avoid growing a DataFrame with repeated pd.concat() calls inside the loop: that keeps the entire dataset in RAM and defeats the point of chunking.
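Chunking is not only for merging; you can also compute aggregates over a file of any size while holding just one chunk in memory. A minimal sketch, using an in-memory string as a stand-in for a real large file:

```python
import io
import pandas as pd

# Small hypothetical stand-in for a large CSV
big_csv = 'id,amount\n' + '\n'.join(f'{i},{i * 10}' for i in range(10))

total = 0
# Only chunk_size rows are ever in memory at once
for chunk in pd.read_csv(io.StringIO(big_csv), chunksize=4):
    total += chunk['amount'].sum()

print(total)  # 450
```

The same pattern works for counts, filters, or per-group partial sums over files far bigger than RAM.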

Using multiprocessing for faster CSV merging

import pandas as pd
from multiprocessing import Pool
import glob

def process_file(filename):
   return pd.read_csv(filename)

csv_files = glob.glob('data/*.csv')

# The __main__ guard is required on platforms that spawn workers (Windows, macOS)
if __name__ == '__main__':
    with Pool(processes=4) as pool:
        dfs = pool.map(process_file, csv_files)

    merged_df = pd.concat(dfs)
    print(f"Parallel processing complete. Merged {len(csv_files)} files with {len(merged_df)} total rows")
    merged_df.to_csv('parallel_merged.csv', index=False)

Output: Parallel processing complete. Merged 5 files with 500000 total rows

For a significant speed boost, you can use Python's multiprocessing module to read multiple files simultaneously. Parsing CSV text is CPU-intensive, so spreading the work across several cores sidesteps the global interpreter lock and pays off most when you have many separate files to load.

  • A Pool of worker processes is created—in this case, four.
  • The pool.map() function distributes the list of file paths, assigning each file to a different process to be read into a DataFrame concurrently.
  • Finally, pd.concat() merges the resulting list of DataFrames into one.

Move faster with Replit

Replit is an AI-powered development platform that transforms natural language into working applications. You can describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.

The CSV merging techniques in this article can be the foundation for powerful, real-world tools. With Replit Agent, you can turn these concepts into production applications.

  • Build a dashboard that automatically combines daily report CSVs from a folder into a single, unified view.
  • Create a data enrichment utility that joins new user sign-ups with an existing customer database using pd.merge() on a common ID.
  • Deploy a log analysis pipeline that processes and merges massive datasets by reading files in chunks, just like the chunksize method.

You can take these concepts from idea to deployed application without getting bogged down in boilerplate. Describe your app, and Replit Agent writes the code, tests it, and handles fixes automatically. Start building your next project with Replit Agent.

Common errors and challenges

Merging CSVs is powerful, but you might run into a few common roadblocks like mismatched columns or encoding errors.

When you use pd.concat() on files that don't share the exact same columns, pandas won't raise an error; it silently aligns columns by name and fills the gaps, so a typo in one file's header can quietly split a single column into two. To control this, you can either standardize the column names before merging or use the function's join parameter.

  • Setting join='inner' tells pandas to only keep columns that are present in every single file.
  • The default, join='outer', keeps all columns from all files and fills any missing spots with NaN (Not a Number) values, which is useful but can create sparse data.

Another common hiccup is when the same column has different data types in different files. For example, a 'user_id' column might be a number in one CSV and text in another. When merged, pandas often defaults to the generic object type, which can slow down your analysis and lead to unexpected behavior.

The best way to prevent this is to enforce consistency when you read the files. Use the dtype parameter in pd.read_csv() to explicitly tell pandas what data type each column should be, ensuring uniformity from the start.
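For instance, forcing an ID column to str keeps leading zeros intact and guarantees the same type in every file. A sketch with hypothetical in-memory CSV text:

```python
import io
import pandas as pd

# Hypothetical CSV contents; '001' would otherwise parse as the integer 1
csv_a = 'user_id,score\n001,10\n'
csv_b = 'user_id,score\n002,20\n'

# dtype pins user_id to str in both files, so the types always match
dfs = [pd.read_csv(io.StringIO(s), dtype={'user_id': str}) for s in (csv_a, csv_b)]
merged = pd.concat(dfs, ignore_index=True)
print(merged['user_id'].tolist())  # ['001', '002']
```

Without the dtype argument, one file's IDs could load as integers and another's as strings, and equality checks across the merged column would silently fail.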

Sometimes, you'll encounter a UnicodeDecodeError. This error usually means your CSV files were saved with different character encodings, like UTF-8 versus latin-1. Python gets confused when it tries to read a file with the wrong "alphabet."

To resolve this, you need to specify the correct encoding for each file. You can often find a file's encoding by opening it in a good text editor. Once you know it, pass it to the encoding parameter in pd.read_csv() to ensure the data is read correctly without garbled characters.
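If you can't inspect every file by hand, a fallback loop over candidate encodings is a pragmatic sketch (the helper name read_csv_any is hypothetical; latin-1 decodes any byte sequence, so it works as a last resort):

```python
import io
import pandas as pd

def read_csv_any(raw_bytes, encodings=('utf-8', 'latin-1')):
    # Try each candidate in turn; latin-1 never fails, so it is a catch-all
    for enc in encodings:
        try:
            return pd.read_csv(io.BytesIO(raw_bytes), encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError('no candidate encoding worked')

# 'café' encoded as latin-1 is invalid as UTF-8, so the fallback kicks in
data = 'name\ncafé\n'.encode('latin-1')
df = read_csv_any(data)
print(df['name'].tolist())  # ['café']
```

For production pipelines, a detection library is more robust, but this keeps a batch merge from dying on one oddly encoded file.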

Handling mismatched column errors when using pd.concat()

Using pd.concat() on files with different column structures won't cause an error. Instead, pandas keeps every column from all files and fills missing data with NaN values. This can result in a sparse DataFrame that is difficult to manage. See it in action below.

import pandas as pd

# Files with different columns
df1 = pd.read_csv('sales.csv')  # Columns: date, product, amount
df2 = pd.read_csv('inventory.csv')  # Columns: product, quantity, location

# This will concatenate but keep all columns, creating NaN values
merged_df = pd.concat([df1, df2])
print(merged_df.isnull().sum())

The pd.concat() function combines the sales and inventory DataFrames. Because their columns don't match, pandas fills the gaps with NaN values. The code below shows how you can manage this mismatched data.

import pandas as pd

# Files with different columns
df1 = pd.read_csv('sales.csv')  # Columns: date, product, amount
df2 = pd.read_csv('inventory.csv')  # Columns: product, quantity, location

# Use only columns that are in both dataframes
common_columns = list(set(df1.columns).intersection(set(df2.columns)))
merged_df = pd.concat([df1[common_columns], df2[common_columns]])
print(f"Merged on common columns: {common_columns}")

This solution proactively finds common ground between your files. It uses set().intersection() to create a list of only the column names that appear in both DataFrames. By selecting just these common_columns before calling pd.concat(), you create a clean, merged file that contains only shared data. This approach is perfect when you need to combine datasets with overlapping but non-identical structures and want to avoid a final file filled with empty values.
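pandas can also do this in one step: join='inner' tells pd.concat() to keep only the columns shared by every input. A sketch with hypothetical frames matching the sales and inventory layouts above:

```python
import pandas as pd

# Hypothetical stand-ins for sales.csv and inventory.csv
sales = pd.DataFrame({'date': ['2024-01-01'], 'product': ['A'], 'amount': [100]})
inventory = pd.DataFrame({'product': ['B'], 'quantity': [5], 'location': ['NY']})

# join='inner' keeps only columns present in every frame: just 'product'
merged = pd.concat([sales, inventory], join='inner', ignore_index=True)
print(list(merged.columns))  # ['product']
```

This avoids computing the column intersection yourself, at the cost of a little explicitness.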

Dealing with inconsistent data types between CSV files

Inconsistent data types create tricky issues. A column like 'revenue' might be text in one file and a number in another. When you merge them, pandas falls back to the generic object dtype, leaving strings and numbers mixed in a single column and making calculations like sums fail. The code below demonstrates this problem.

import pandas as pd

# Files with same columns but different data types
df1 = pd.read_csv('sales_2021.csv')  # 'revenue' stored as string: "1,000.50"
df2 = pd.read_csv('sales_2022.csv')  # 'revenue' stored as float: 1000.50

# This will concatenate but may cause issues with operations later
merged_df = pd.concat([df1, df2])
# This will raise an error due to inconsistent types
total_revenue = merged_df['revenue'].sum()

The sum() operation fails because the merged revenue column contains strings from sales_2021.csv. Python can't add text and numbers together. The following code demonstrates how to standardize the column before merging.

import pandas as pd

# Files with same columns but different data types
df1 = pd.read_csv('sales_2021.csv')  # 'revenue' stored as string: "1,000.50"
df2 = pd.read_csv('sales_2022.csv')  # 'revenue' stored as float: 1000.50

# Convert string to float in the first dataframe
df1['revenue'] = df1['revenue'].str.replace(',', '').astype(float)

# Now concatenation will work properly
merged_df = pd.concat([df1, df2])
total_revenue = merged_df['revenue'].sum()
print(f"Total revenue: ${total_revenue:.2f}")

This solution cleans the data before merging. It uses .str.replace() to remove commas from the string-based revenue column, then converts it to a number with .astype(float).

By standardizing the data type in each DataFrame first, you ensure pd.concat() creates a unified column ready for analysis. This is crucial when you plan to perform calculations, as it prevents errors and ensures your results are accurate.
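When a column may also contain outright junk values, pd.to_numeric with errors='coerce' converts what it can and turns the rest into NaN instead of raising. A sketch with a hypothetical revenue column:

```python
import pandas as pd

# Hypothetical revenue values: formatted string, clean number, junk
df = pd.DataFrame({'revenue': ['1,000.50', '250', 'N/A']})

# Strip thousands separators, then coerce; 'N/A' becomes NaN, not an error
cleaned = pd.to_numeric(df['revenue'].str.replace(',', ''), errors='coerce')
print(cleaned.sum())  # 1250.5 (sum() skips NaN by default)
```

Checking cleaned.isna().sum() afterward tells you how many values failed to parse, which is a useful data-quality signal.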

Resolving encoding issues when merging CSV files

A UnicodeDecodeError is a common roadblock that appears when your CSV files don't share the same character encoding. For example, one file might be saved as UTF-8 while another uses latin-1, causing Python to fail when reading the data.

The code below shows what happens when you try to merge files with conflicting encodings without specifying how to read them.

import pandas as pd

# Trying to merge files with different encodings
try:
    df1 = pd.read_csv('european_sales.csv')  # UTF-8 encoding
    df2 = pd.read_csv('asian_sales.csv')  # May have a different encoding

    merged_df = pd.concat([df1, df2])
    print(f"Merged dataframe has {len(merged_df)} rows")
except UnicodeDecodeError:
    print("Error: Could not decode file with the default encoding")

The pd.read_csv() function defaults to UTF-8, so it fails when it tries to read a file saved with a different character set. The following code demonstrates how to handle this during the import process.

import pandas as pd

# Explicitly specify encodings for each file
df1 = pd.read_csv('european_sales.csv', encoding='utf-8')
df2 = pd.read_csv('asian_sales.csv', encoding='iso-8859-1')

# Now concatenation works with the proper encodings
merged_df = pd.concat([df1, df2])
print(f"Successfully merged {len(df1)} and {len(df2)} rows")

This solution directly addresses the encoding mismatch by telling pandas how to interpret each file. You can prevent a UnicodeDecodeError by specifying the correct character set in pd.read_csv() using the encoding parameter.

  • This ensures all characters, especially those in international datasets, are read correctly.
  • For example, you might use encoding='utf-8' for one file and encoding='iso-8859-1' for another, allowing a clean merge.

Real-world applications

Now that you can navigate common merging errors, you can use these skills to build practical data tools and applications.

Cleaning data while merging CSV files

Merging your files creates a single, unified dataset where you can efficiently handle duplicates, fix data types, and remove missing values all at once.

import pandas as pd

# Read files and merge
df1 = pd.read_csv('sales_q1.csv')
df2 = pd.read_csv('sales_q2.csv')
merged_df = pd.concat([df1, df2])

# Clean the merged data
merged_df = merged_df.drop_duplicates()
merged_df = merged_df.dropna(subset=['order_id', 'amount'])
merged_df['amount'] = merged_df['amount'].astype(float)

print(f"Clean merged data: {len(merged_df)} rows")

This script first merges two quarterly sales files using pd.concat(). After combining the data, it performs a multi-step cleanup to ensure data quality before any analysis.

  • It removes any identical rows with drop_duplicates().
  • Then, dropna() discards records missing an order_id or amount.
  • Finally, it standardizes the amount column to a numeric type using astype(float), which is essential for accurate calculations.
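The same cleanup pipeline can be verified on a few rows of hypothetical data:

```python
import pandas as pd

# Hypothetical rows: an exact duplicate, a missing amount, a missing order_id
df = pd.DataFrame({
    'order_id': [1.0, 1.0, 2.0, None],
    'amount': ['10.5', '10.5', None, '7.0'],
})

# Deduplicate, drop incomplete records, then standardize the amount type
cleaned = (df.drop_duplicates()
             .dropna(subset=['order_id', 'amount'])
             .assign(amount=lambda d: d['amount'].astype(float)))
print(len(cleaned))  # 1
```

Chaining the steps like this keeps the cleanup order explicit: dedupe first, then drop incomplete rows, then convert types.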

Creating a dashboard from multiple CSV data sources

You can merge data from different sources, like sales and inventory reports, to build a visual dashboard that reveals key business insights.

import pandas as pd
import matplotlib.pyplot as plt

# Load data from different departments
sales = pd.read_csv('sales.csv')
inventory = pd.read_csv('inventory.csv')

# Merge on product_id
combined = pd.merge(sales, inventory, on='product_id')

# Create a simple dashboard
top_products = combined.groupby('product_name')['revenue'].sum().nlargest(5)
top_products.plot(kind='bar')
plt.title('Top 5 Products by Revenue')
plt.savefig('sales_dashboard.png')
print(f"Dashboard created with data from {len(sales)} sales and {len(inventory)} inventory records")

This script shows how to create a visual report from separate datasets. It joins sales and inventory data using pd.merge() on their shared product_id, linking sales figures with product details.

  • The code then aggregates this combined data with groupby('product_name') to calculate total revenue.
  • It isolates the top five performers using nlargest(5).
  • Finally, matplotlib generates a bar chart from this data and saves it as a PNG, turning raw numbers into a shareable insight.

Get started with Replit

Turn what you've learned into a real tool. Describe your idea to Replit Agent: "Build a utility that merges all CSVs in a directory" or "Create an app that joins user data with purchase history on a common ID."

Replit Agent writes the code, tests for errors, and deploys your app, turning your idea into a live tool. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
