How to merge CSV files in Python
Learn how to merge CSV files in Python. Explore various methods, tips and tricks, real-world applications, and how to debug common errors.

You often need to merge CSV files in Python for data analysis and management. Python's libraries give you powerful tools to combine multiple files into one unified dataset efficiently.
In this article, you'll explore several techniques to merge your data. You'll find practical tips, real-world applications, and debugging advice to help you choose the best method for your project.
Using pandas to concatenate CSV files
import pandas as pd
df1 = pd.read_csv('file1.csv')
df2 = pd.read_csv('file2.csv')
merged_df = pd.concat([df1, df2])
merged_df.to_csv('merged_file.csv', index=False)
print(f"Merged CSV contains {len(merged_df)} rows")
# Output: Merged CSV contains 300 rows
The pandas library treats each CSV as a DataFrame, which is essentially a table in memory. The key function here is pd.concat(). It stacks the two DataFrames vertically, an efficient way to combine files that share the same column layout. If you want a continuous row index in the result rather than each file's original numbering, pass ignore_index=True to pd.concat().
When saving the result with to_csv(), including index=False is a crucial step. This argument prevents pandas from writing the DataFrame's index as an extra column in your final CSV file, which keeps your data clean and avoids potential confusion later.
Basic CSV merging techniques
If you prefer to work without external libraries, Python’s built-in csv and glob modules offer powerful, dependency-free alternatives for combining your data.
Merging CSV files with the csv module
import csv

with open('merged_file.csv', 'w', newline='') as outfile:
    writer = csv.writer(outfile)
    for filename in ['file1.csv', 'file2.csv']:
        with open(filename, 'r', newline='') as infile:
            reader = csv.reader(infile)
            if filename == 'file1.csv':  # Write header only once
                writer.writerows(reader)
            else:
                next(reader)  # Skip header in subsequent files
                writer.writerows(reader)
# No console output, but creates a merged CSV file
This approach uses Python's built-in csv module to read from your source files and write to a single output file. The script iterates through each CSV, ensuring the final file is correctly formatted without duplicate headers.
- It writes the entire content of the first file, including its header, to the new merged file.
- For every subsequent file, it uses next(reader) to skip the header row before appending the remaining data.
Using glob to merge multiple CSV files
import pandas as pd
import glob
csv_files = glob.glob('data/*.csv')
merged_df = pd.concat([pd.read_csv(file) for file in csv_files])
merged_df.to_csv('merged_file.csv', index=False)
print(f"Merged {len(csv_files)} CSV files")
# Output: Merged 5 CSV files
The glob module is your go-to for finding files that match a specific pattern. The expression glob.glob('data/*.csv') gathers all file paths ending in .csv from the data directory, making it easy to work with dozens of files at once.
- A list comprehension then reads each file into a DataFrame: [pd.read_csv(file) for file in csv_files].
- Finally, pd.concat() takes this list of DataFrames and merges them into one, scaling up the simple concatenation you saw earlier.
Merging with standard file I/O operations
with open('file1.csv', 'r') as f1:
    header = f1.readline()
    file1_content = f1.read()

with open('merged_file.csv', 'w') as output:
    output.write(header)
    output.write(file1_content)
    with open('file2.csv', 'r') as f2:
        next(f2)  # Skip header line
        output.write(f2.read())
# No console output, but creates a merged CSV file
This method offers a lightweight way to merge files using only Python's built-in file operations. It's a manual but memory-efficient approach, perfect for very large datasets where loading everything into a DataFrame isn't practical. The script handles headers carefully to avoid duplicates.
- It reads the header from file1.csv using readline() and writes it to the new file.
- For any other files, it uses next() to skip their headers before appending the remaining content.
Advanced CSV merging methods
When simple stacking isn't enough, you can use more advanced methods to join data on key columns, process huge files efficiently, and accelerate the entire merge.
Joining CSV files with merge() on key columns
import pandas as pd
customers = pd.read_csv('customers.csv')
orders = pd.read_csv('orders.csv')
# Join customers with their orders
merged_data = pd.merge(customers, orders, on='customer_id', how='inner')
print(f"Merged data has {merged_data.shape[0]} rows and {merged_data.shape[1]} columns")
merged_data.to_csv('customer_orders.csv', index=False)
# Output: Merged data has 120 rows and 8 columns
The pd.merge() function is your tool for more sophisticated joins. Unlike concat(), which stacks data vertically, merge() combines DataFrames horizontally based on a shared column—similar to a SQL join. It's ideal for linking related datasets, like customer details with their purchase history.
- The on='customer_id' argument specifies the common column to match rows between the two files.
- Setting how='inner' creates an inner join. This means the final output will only contain rows where the customer_id exists in both the customers and orders DataFrames.
Processing large CSV files with chunksize
import pandas as pd

chunk_size = 10000
total_rows = 0
write_header = True
for filename in ['large_file1.csv', 'large_file2.csv']:
    for chunk in pd.read_csv(filename, chunksize=chunk_size):
        # Append each chunk straight to the output file so only one
        # chunk is ever held in memory
        chunk.to_csv('merged_large_files.csv',
                     mode='w' if write_header else 'a',
                     header=write_header, index=False)
        total_rows += len(chunk)
        write_header = False
print(f"Processed data in chunks, total rows: {total_rows}")
# Output: Processed data in chunks, total rows: 1250000
When you're working with CSV files too large to fit into memory, the chunksize parameter in pd.read_csv() is your solution. Instead of returning one DataFrame, it gives you an iterator that yields the file in smaller, manageable pieces, which prevents your system from running out of RAM.
- The code iterates through each file, reading it in segments of a specified size, in this case 10,000 rows at a time.
- Each chunk is appended directly to the output file with to_csv() in append mode, writing the header only once. Because no chunk is kept around after it's written, memory use stays flat no matter how large the inputs are.
Using multiprocessing for faster CSV merging
import pandas as pd
from multiprocessing import Pool
import glob

def process_file(filename):
    return pd.read_csv(filename)

if __name__ == '__main__':
    # The guard keeps worker processes from re-executing this block on
    # platforms that spawn fresh interpreters
    csv_files = glob.glob('data/*.csv')
    with Pool(processes=4) as pool:
        dfs = pool.map(process_file, csv_files)
    merged_df = pd.concat(dfs, ignore_index=True)
    print(f"Parallel processing complete. Merged {len(csv_files)} files with {len(merged_df)} total rows")
    merged_df.to_csv('parallel_merged.csv', index=False)
# Output: Parallel processing complete. Merged 5 files with 500000 total rows
For a significant speed boost, you can use Python's multiprocessing module to read multiple files simultaneously. This approach leverages multiple CPU cores to parse files in parallel, which is especially effective because CSV parsing is CPU-intensive rather than purely disk-bound.
- A Pool of worker processes is created, in this case four.
- The pool.map() function distributes the list of file paths, assigning each file to a different process to be read into a DataFrame concurrently.
- Finally, pd.concat() merges the resulting list of DataFrames into one.
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. You can describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
The CSV merging techniques in this article can be the foundation for powerful, real-world tools. With Replit Agent, you can turn these concepts into production applications.
- Build a dashboard that automatically combines daily report CSVs from a folder into a single, unified view.
- Create a data enrichment utility that joins new user sign-ups with an existing customer database using pd.merge() on a common ID.
- Deploy a log analysis pipeline that processes and merges massive datasets by reading files in chunks, just like the chunksize method.
You can take these concepts from idea to deployed application without getting bogged down in boilerplate. Describe your app, and Replit Agent writes the code, tests it, and handles fixes automatically. Start building your next project with Replit Agent.
Common errors and challenges
Merging CSVs is powerful, but you might run into a few common roadblocks like mismatched columns or encoding errors.
When you use pd.concat(), you might get an error if your files don't have the exact same columns. This happens when column names or their order differs across files. To fix this, you can either standardize the column names before merging or use the function's built-in flexibility.
- Setting join='inner' tells pandas to only keep columns that are present in every single file.
- The default, join='outer', keeps all columns from all files and fills any missing spots with NaN (Not a Number) values, which is useful but can create sparse data.
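To make the difference concrete, here's a minimal sketch using two small in-memory DataFrames (the column names are invented for illustration):

```python
import pandas as pd

# Two frames with overlapping but non-identical columns
df1 = pd.DataFrame({'id': [1, 2], 'amount': [10.0, 20.0]})
df2 = pd.DataFrame({'id': [3, 4], 'region': ['EU', 'US']})

# outer (the default) keeps every column and fills the gaps with NaN
outer = pd.concat([df1, df2], join='outer', ignore_index=True)
print(outer.columns.tolist())  # ['id', 'amount', 'region']

# inner keeps only the columns shared by every frame
inner = pd.concat([df1, df2], join='inner', ignore_index=True)
print(inner.columns.tolist())  # ['id']
```

With join='outer', the amount and region columns each end up half NaN; with join='inner', only the shared id column survives.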
Another common hiccup is when the same column has different data types in different files. For example, a 'user_id' column might be a number in one CSV and text in another. When merged, pandas often defaults to the generic object type, which can slow down your analysis and lead to unexpected behavior.
The best way to prevent this is to enforce consistency when you read the files. Use the dtype parameter in pd.read_csv() to explicitly tell pandas what data type each column should be, ensuring uniformity from the start.
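As a small sketch of that approach, the snippet below simulates two files with io.StringIO so it runs on its own; the user_id column name is just an example. Forcing the column to str at read time keeps it consistent in both frames and even preserves leading zeros:

```python
import io
import pandas as pd

# Simulate two files where user_id is text in one and numeric in the other
file_a = io.StringIO("user_id,score\n001,10\n002,20\n")
file_b = io.StringIO("user_id,score\n3,30\n4,40\n")

# dtype forces a consistent type in both frames at read time
df_a = pd.read_csv(file_a, dtype={'user_id': str})
df_b = pd.read_csv(file_b, dtype={'user_id': str})

merged = pd.concat([df_a, df_b], ignore_index=True)
print(merged['user_id'].tolist())  # ['001', '002', '3', '4']
```

Without the dtype argument, the first frame would parse user_id as integers and silently drop the leading zeros.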
Sometimes, you'll encounter a UnicodeDecodeError. This error usually means your CSV files were saved with different character encodings, like UTF-8 versus latin-1. Python gets confused when it tries to read a file with the wrong "alphabet."
To resolve this, you need to specify the correct encoding for each file. You can often find a file's encoding by opening it in a good text editor. Once you know it, pass it to the encoding parameter in pd.read_csv() to ensure the data is read correctly without garbled characters.
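If you'd rather not inspect files by hand, one pragmatic option is to try a short list of likely encodings until one decodes cleanly. This sketch writes a latin-1 sample file purely for demonstration, and the helper name read_csv_any is our own invention:

```python
import tempfile
import pandas as pd

# Create a sample CSV saved in latin-1 to stand in for a file of
# unknown encoding
with tempfile.NamedTemporaryFile('w', suffix='.csv', encoding='latin-1',
                                 delete=False) as f:
    f.write("name,city\nJosé,Málaga\n")
    path = f.name

def read_csv_any(path, encodings=('utf-8', 'latin-1', 'cp1252')):
    """Try a list of likely encodings until one decodes cleanly."""
    for enc in encodings:
        try:
            return pd.read_csv(path, encoding=enc)
        except UnicodeDecodeError:
            continue
    raise ValueError(f"None of {encodings} could decode {path}")

df = read_csv_any(path)
print(df.loc[0, 'name'])  # José
```

Here the utf-8 attempt fails on the latin-1 bytes, so the helper falls through to latin-1 and reads the accented characters correctly.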
Handling mismatched column errors when using pd.concat()
Using pd.concat() on files with different column structures won't cause an error. Instead, pandas keeps every column from all files and fills missing data with NaN values. This can result in a sparse DataFrame that is difficult to manage. See it in action below.
import pandas as pd
# Files with different columns
df1 = pd.read_csv('sales.csv') # Columns: date, product, amount
df2 = pd.read_csv('inventory.csv') # Columns: product, quantity, location
# This will concatenate but keep all columns, creating NaN values
merged_df = pd.concat([df1, df2])
print(merged_df.isnull().sum())
The pd.concat() function combines the sales and inventory DataFrames. Because their columns don't match, pandas fills the gaps with NaN values. The code below shows how you can manage this mismatched data.
import pandas as pd
# Files with different columns
df1 = pd.read_csv('sales.csv') # Columns: date, product, amount
df2 = pd.read_csv('inventory.csv') # Columns: product, quantity, location
# Use only columns that are in both dataframes
common_columns = list(set(df1.columns).intersection(set(df2.columns)))
merged_df = pd.concat([df1[common_columns], df2[common_columns]])
print(f"Merged on common columns: {common_columns}")
This solution proactively finds common ground between your files. It uses set().intersection() to create a list of only the column names that appear in both DataFrames. By selecting just these common_columns before calling pd.concat(), you create a clean, merged file that contains only shared data. This approach is perfect when you need to combine datasets with overlapping but non-identical structures and want to avoid a final file filled with empty values.
Dealing with inconsistent data types between CSV files
Inconsistent data types create tricky issues. A column like 'revenue' might be text in one file and a number in another. When you merge them, pandas often converts the entire column to text, making calculations like sums impossible. The code below demonstrates this problem.
import pandas as pd
# Files with same columns but different data types
df1 = pd.read_csv('sales_2021.csv') # 'revenue' stored as string: "1,000.50"
df2 = pd.read_csv('sales_2022.csv') # 'revenue' stored as float: 1000.50
# This will concatenate but may cause issues with operations later
merged_df = pd.concat([df1, df2])
# This will raise an error due to inconsistent types
total_revenue = merged_df['revenue'].sum()
The sum() operation fails because the merged revenue column contains strings from sales_2021.csv. Python can't add text and numbers together. The following code demonstrates how to handle this during the import process.
import pandas as pd
# Files with same columns but different data types
df1 = pd.read_csv('sales_2021.csv') # 'revenue' stored as string: "1,000.50"
df2 = pd.read_csv('sales_2022.csv') # 'revenue' stored as float: 1000.50
# Convert string to float in the first dataframe
df1['revenue'] = df1['revenue'].str.replace(',', '').astype(float)
# Now concatenation will work properly
merged_df = pd.concat([df1, df2])
total_revenue = merged_df['revenue'].sum()
print(f"Total revenue: ${total_revenue:.2f}")
This solution cleans the data before merging. It uses .str.replace() to remove commas from the string-based revenue column, then converts it to a number with .astype(float).
By standardizing the data type in each DataFrame first, you ensure pd.concat() creates a unified column ready for analysis. This is crucial when you plan to perform calculations, as it prevents errors and ensures your results are accurate.
Resolving encoding issues when merging CSV files
A UnicodeDecodeError is a common roadblock that appears when your CSV files don't share the same character encoding. For example, one file might be saved as UTF-8 while another uses latin-1, causing Python to fail when reading the data.
The code below shows what happens when you try to merge files with conflicting encodings without specifying how to read them.
import pandas as pd

# Trying to merge files with different encodings
try:
    df1 = pd.read_csv('european_sales.csv')  # UTF-8 encoding
    df2 = pd.read_csv('asian_sales.csv')  # May have a different encoding
    merged_df = pd.concat([df1, df2])
    print(f"Merged dataframe has {len(merged_df)} rows")
except UnicodeDecodeError:
    print("Error: Could not decode file with the default encoding")
The pd.read_csv() function defaults to UTF-8, so it fails when it tries to read a file saved with a different character set. The following code demonstrates how to handle this during the import process.
import pandas as pd
# Explicitly specify encodings for each file
df1 = pd.read_csv('european_sales.csv', encoding='utf-8')
df2 = pd.read_csv('asian_sales.csv', encoding='iso-8859-1')
# Now concatenation works with the proper encodings
merged_df = pd.concat([df1, df2])
print(f"Successfully merged {len(df1)} and {len(df2)} rows")
This solution directly addresses the encoding mismatch by telling pandas how to interpret each file. You can prevent a UnicodeDecodeError by specifying the correct character set in pd.read_csv() using the encoding parameter.
- This ensures all characters, especially those in international datasets, are read correctly.
- For example, you might use encoding='utf-8' for one file and encoding='iso-8859-1' for another, allowing a clean merge.
Real-world applications
Now that you can navigate common merging errors, you can use these skills to build practical data tools and applications.
Cleaning data while merging CSV files
Merging your files creates a single, unified dataset where you can efficiently handle duplicates, fix data types, and remove missing values all at once.
import pandas as pd
# Read files and merge
df1 = pd.read_csv('sales_q1.csv')
df2 = pd.read_csv('sales_q2.csv')
merged_df = pd.concat([df1, df2])
# Clean the merged data
merged_df = merged_df.drop_duplicates()
merged_df = merged_df.dropna(subset=['order_id', 'amount'])
merged_df['amount'] = merged_df['amount'].astype(float)
print(f"Clean merged data: {len(merged_df)} rows")
This script first merges two quarterly sales files using pd.concat(). After combining the data, it performs a multi-step cleanup to ensure data quality before any analysis.
- It removes any identical rows with drop_duplicates().
- Then, dropna() discards records missing an order_id or amount.
- Finally, it standardizes the amount column to a numeric type using astype(float), which is essential for accurate calculations.
Creating a dashboard from multiple CSV data sources
You can merge data from different sources, like sales and inventory reports, to build a visual dashboard that reveals key business insights.
import pandas as pd
import matplotlib.pyplot as plt
# Load data from different departments
sales = pd.read_csv('sales.csv')
inventory = pd.read_csv('inventory.csv')
# Merge on product_id
combined = pd.merge(sales, inventory, on='product_id')
# Create a simple dashboard
top_products = combined.groupby('product_name')['revenue'].sum().nlargest(5)
top_products.plot(kind='bar')
plt.title('Top 5 Products by Revenue')
plt.savefig('sales_dashboard.png')
print(f"Dashboard created with data from {len(sales)} sales and {len(inventory)} inventory records")
This script shows how to create a visual report from separate datasets. It joins sales and inventory data using pd.merge() on their shared product_id, linking sales figures with product details.
- The code then aggregates this combined data with groupby('product_name') to calculate total revenue.
- It isolates the top five performers using nlargest(5).
- Finally, matplotlib generates a bar chart from this data and saves it as a PNG, turning raw numbers into a shareable insight.
Get started with Replit
Turn what you've learned into a real tool. Describe your idea to Replit Agent: "Build a utility that merges all CSVs in a directory" or "Create an app that joins user data with purchase history on a common ID."
Replit Agent writes the code, tests for errors, and deploys your app, turning your idea into a live tool. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.