How to clean data in Python
Ready to clean data in Python? Learn different methods, tips and tricks, see real-world applications, and find solutions to common errors.

Data cleaning in Python is a fundamental skill for anyone who works with data. It transforms raw, messy datasets into a reliable foundation for analysis and machine learning models.
In this article, we'll explore key techniques and practical tips for your own projects. You'll see real-world applications and learn how to debug common data issues, so you can tackle any dataset with confidence.
Basic data cleaning with pandas
import pandas as pd
# Create a sample dataframe with some issues
df = pd.DataFrame({
'Name': ['John', 'Anna', 'Peter', 'Linda'],
'Age': [28, None, 42, 35],
'Salary': ['$5000', '$6000', '$4500', '$7000']
})
print(df)

--OUTPUT--
    Name   Age Salary
0   John  28.0  $5000
1   Anna   NaN  $6000
2  Peter  42.0  $4500
3  Linda  35.0  $7000
We're using the pandas library to create a sample DataFrame, the standard structure for handling tabular data in Python. This example intentionally mimics a messy, real-world dataset by including a couple of common problems that require cleaning.
You'll notice two specific issues:
- The Age column contains a None value, which pandas reads as NaN (Not a Number), a common placeholder for missing data.
- The Salary column is stored as text with dollar signs, which prevents you from performing any mathematical calculations on it.
Handling common data issues
With those issues identified, you can start cleaning the data by handling missing values, removing duplicates, and converting columns to their correct data types.
Filling missing values
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})
df_filled = df.fillna(df.mean())
print(df_filled)

--OUTPUT--
          A    B
0  1.000000  5.0
1  2.000000  6.5
2  2.333333  6.5
3  4.000000  8.0
One common way to handle missing data is to replace it with a calculated value, a technique called imputation. Alternatively, you can remove rows containing NaN values entirely. The fillna() method in pandas is perfect for imputation; in this example, we're using the mean of each column to fill in the gaps.
- The df.mean() call first calculates the average for column A and column B individually, ignoring the NaN values.
- Then, fillna() replaces each NaN with the calculated mean for its respective column, resulting in a complete dataset.
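If removal fits your case better, or your data has outliers that would skew the mean, two common variations are worth knowing: dropna() and median imputation. A quick sketch (the column names are just illustrative):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})

# Option 1: drop every row that contains at least one NaN
df_dropped = df.dropna()

# Option 2: impute with the median, which is more robust to outliers than the mean
df_median = df.fillna(df.median())

print(len(df_dropped))             # only 2 complete rows survive
print(df_median['B'].tolist())     # [5.0, 6.5, 6.5, 8.0]
```

Dropping is simplest but can discard a lot of data when missing values are scattered across columns, so imputation is often the safer default.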
Removing duplicate rows with drop_duplicates()
import pandas as pd
df = pd.DataFrame({'A': [1, 2, 2, 3], 'B': ['a', 'b', 'b', 'c']})
df_unique = df.drop_duplicates()
print(df_unique)

--OUTPUT--
   A  B
0  1  a
1  2  b
3  3  c
Duplicate data can skew your analysis, so it's crucial to remove it. This concept also applies to removing duplicates from a list in general Python programming. The drop_duplicates() method simplifies this process by identifying and deleting rows that are exact copies of others.
- By default, the function considers all columns to define a duplicate.
- In the example, the row at index 2 is removed because its values (2, 'b') are identical to the row at index 1.
- The method keeps the first occurrence of the duplicate row and discards the subsequent ones.
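When only certain columns should count toward duplicate detection, drop_duplicates() accepts subset and keep parameters. A small sketch with made-up order data:

```python
import pandas as pd

df = pd.DataFrame({
    'order_id': [1, 2, 3, 4],
    'customer': ['Anna', 'Anna', 'Peter', 'Anna'],
    'amount': [100, 100, 250, 100]
})

# Rows count as duplicates when customer AND amount match, ignoring order_id;
# keep='last' retains the most recent occurrence instead of the first
deduped = df.drop_duplicates(subset=['customer', 'amount'], keep='last')
print(deduped['order_id'].tolist())  # [3, 4]
```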
Converting data types
import pandas as pd
df = pd.DataFrame({
'date': ['2023-01-01', '2023-01-02', '2023-01-03'],
'numeric': ['100', '200', '300']
})
df['date'] = pd.to_datetime(df['date'])
df['numeric'] = pd.to_numeric(df['numeric'])
print(df.dtypes)

--OUTPUT--
date       datetime64[ns]
numeric             int64
dtype: object
Data often gets imported as text, which can be a problem when columns contain numbers or dates. To perform calculations or time-series analysis, you'll need to convert these columns to their correct types. Pandas provides specialized functions for this.
- The pd.to_datetime() function intelligently parses date-like strings into a proper datetime format.
- Similarly, pd.to_numeric() converts strings that represent numbers into a numeric type like int64.
This step ensures your data is structured correctly for accurate analysis.
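Real files are rarely this tidy: a value like "n/a" would make pd.to_numeric() raise an error. Passing errors='coerce' converts unparseable entries to NaN instead, so you can handle them with the missing-data techniques above. A quick sketch (the "n/a" entry is illustrative):

```python
import pandas as pd

raw = pd.Series(['100', 'n/a', '300'])

# errors='coerce' turns unparseable entries into NaN instead of raising
nums = pd.to_numeric(raw, errors='coerce')
print(nums.isna().sum())  # 1
```

The same errors='coerce' option works with pd.to_datetime() for malformed date strings.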
Advanced data cleaning techniques
With your data's structure in place, you can now focus on refining its content by cleaning text, removing statistical outliers, and scaling numerical features.
Cleaning and normalizing text data
import pandas as pd
df = pd.DataFrame({
'text': [' Lower CASE ', 'UPPER case', ' Mixed CASE ', 'special@#$']
})
df['text_clean'] = df['text'].str.strip().str.lower()
print(df)

--OUTPUT--
           text  text_clean
0   Lower CASE   lower case
1    UPPER case  upper case
2   Mixed CASE   mixed case
3    special@#$  special@#$
Text data often contains inconsistencies like extra spaces or mixed capitalization, which can interfere with analysis. You can standardize your text using pandas' string methods. The .str accessor lets you apply these functions to an entire column at once.
- .str.strip() removes leading and trailing whitespace.
- .str.lower() converts all characters to lowercase.
This normalization process ensures that different variations of the same text are treated as identical, making your dataset much more reliable.
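If you also need to strip out special characters like the @#$ left untouched above, you can chain a regex-based .str.replace(). A sketch (the sample strings are made up):

```python
import pandas as pd

df = pd.DataFrame({'text': ['  Hello, World! ', 'DATA@cleaning#101']})

# After stripping and lowercasing, a regex removes anything that
# isn't a lowercase letter, digit, or space
df['text_clean'] = (df['text'].str.strip()
                              .str.lower()
                              .str.replace(r'[^a-z0-9 ]', '', regex=True))
print(df['text_clean'].tolist())  # ['hello world', 'datacleaning101']
```

Be deliberate about what the regex removes; punctuation can be meaningful in some datasets (e.g., decimal points in prices).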
Detecting and removing outliers with z-scores
import pandas as pd
import numpy as np
np.random.seed(0)
df = pd.DataFrame({'values': np.random.normal(0, 1, 100)})
df.loc[0:1, 'values'] = [10, -10] # Add outliers
z_scores = (df['values'] - df['values'].mean()) / df['values'].std()
df_no_outliers = df[abs(z_scores) <= 3]
print(f"Original size: {len(df)}, After removing outliers: {len(df_no_outliers)}")

--OUTPUT--
Original size: 100, After removing outliers: 98
Outliers are extreme data points that can skew statistical analysis. The z-score is a great tool for identifying them by measuring how far a value is from the average.
- First, the code calculates the z-score for each point in the 'values' column. A z-score tells you how many standard deviations a data point is from the mean.
- We then filter the DataFrame, keeping only rows where the absolute z-score is 3 or less, a common threshold for outlier detection.
This process removes the two extreme values we added, cleaning the dataset for more accurate results.
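The z-score approach works best on roughly normal data, since extreme values inflate the very mean and standard deviation used to detect them. An alternative you'll often see is the interquartile range (IQR) rule. A sketch using the same simulated data:

```python
import pandas as pd
import numpy as np

np.random.seed(0)
df = pd.DataFrame({'values': np.random.normal(0, 1, 100)})
df.loc[0:1, 'values'] = [10, -10]  # Add outliers

# IQR rule: flag anything beyond 1.5 * IQR from the quartiles
q1, q3 = df['values'].quantile([0.25, 0.75])
iqr = q3 - q1
mask = df['values'].between(q1 - 1.5 * iqr, q3 + 1.5 * iqr)
df_iqr = df[mask]
print(f"Rows kept: {len(df_iqr)}")
```

The 1.5 multiplier is a convention, not a law; widen it to 3 for a more conservative filter.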
Scaling features with sklearn
import pandas as pd
from sklearn.preprocessing import MinMaxScaler, StandardScaler
df = pd.DataFrame({
'feature1': [10, 20, 30, 40, 50],
'feature2': [100, 200, 300, 400, 500]
})
scaler = MinMaxScaler()
df_normalized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_normalized)

--OUTPUT--
   feature1  feature2
0      0.00      0.00
1      0.25      0.25
2      0.50      0.50
3      0.75      0.75
4      1.00      1.00
When features in your dataset have vastly different scales, it can throw off machine learning algorithms. Feature scaling fixes this, and sklearn's MinMaxScaler is a great tool for the job. This is part of the broader topic of normalizing data in Python. It transforms your data by squishing or stretching each feature's values into a consistent range, typically between 0 and 1.
- The code creates an instance of the MinMaxScaler.
- You then use the fit_transform() method, which learns the scaling parameters from your data and applies the transformation in one step.
- This process puts all features on a level playing field for your model.
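The snippet above also imports StandardScaler, which rescales each feature to a mean of 0 and a standard deviation of 1 instead of a fixed range; it's often preferred for algorithms that assume centered data. A sketch using the same sample features:

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

df = pd.DataFrame({
    'feature1': [10, 20, 30, 40, 50],
    'feature2': [100, 200, 300, 400, 500]
})

# StandardScaler centers each feature at 0 with unit variance
scaler = StandardScaler()
df_standardized = pd.DataFrame(scaler.fit_transform(df), columns=df.columns)
print(df_standardized.mean().round(6).tolist())  # [0.0, 0.0]
```

Unlike MinMaxScaler, the output isn't bounded, so StandardScaler is less distorted by a single extreme value.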
Move faster with Replit
Replit is an AI-powered development platform where all Python dependencies come pre-installed, so you can skip setup and start coding instantly. This lets you move directly from learning individual techniques to applying them in a full development environment.
Instead of piecing together functions, you can build complete applications with Agent 4. It takes your project description and builds a working product by handling the code, database connections, and deployment.
- A data-cleaning utility that ingests raw CSV files, removes duplicate entries with drop_duplicates(), and standardizes text fields using .str.lower().
- A financial dashboard that converts currency strings to numbers for calculations and uses z-scores to automatically flag transactions that are statistical outliers.
- A machine learning preprocessor that prepares a dataset by filling missing values using fillna() and scaling numerical features with MinMaxScaler.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Even with powerful tools, data cleaning can throw tricky errors your way, but most have straightforward solutions.
The apply() function is incredibly flexible for running custom logic on your data, but it's often a source of performance bottlenecks. A common mistake is using it for operations that pandas can already perform much faster with built-in, vectorized functions. For instance, instead of writing a custom function with apply() to clean text, you can often get the same result more efficiently with methods like .str.lower() or .str.replace().
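A quick sketch of that difference (the sample strings are made up): both approaches produce identical results, but the vectorized version avoids a Python-level function call per element.

```python
import pandas as pd

df = pd.DataFrame({'text': ['  Alice ', ' BOB', 'carol  ']})

# Custom logic via apply(): one Python function call per element
via_apply = df['text'].apply(lambda s: s.strip().lower())

# The same cleanup with vectorized string methods
via_str = df['text'].str.strip().str.lower()

print(via_apply.equals(via_str))  # True
```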
When you use groupby(), you're splitting your data into segments to run calculations on each part. Errors often pop up when the column you're grouping by isn't clean. For example, if a 'Country' column contains both "USA" and "U.S.A.", groupby() will treat them as two separate groups, skewing your results. This is why it's crucial to standardize categorical data before you start aggregating it.
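A sketch of that 'Country' scenario, standardizing the key before aggregating (the sales figures are made up):

```python
import pandas as pd

df = pd.DataFrame({
    'Country': ['USA', 'U.S.A.', 'usa', 'Canada'],
    'sales': [10, 20, 30, 40]
})

# Normalize the grouping key first: strip the periods and upper-case everything
df['Country'] = (df['Country'].str.replace('.', '', regex=False)
                              .str.upper())

totals = df.groupby('Country')['sales'].sum()
print(totals.to_dict())  # {'CANADA': 40, 'USA': 60}
```

Without the normalization step, the same aggregation would report three separate "USA" variants.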
Boolean indexing is a powerful way to filter your DataFrame, but it can lead to a frustrating ValueError. This error typically means the boolean series you're using to filter doesn't have the same number of rows as the DataFrame. It often happens when you create a filter based on a subset of your data but then try to apply it to the original, larger DataFrame. Always make sure your filter's length matches the data you're trying to index.
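A minimal sketch of the mismatch and its fix, using a plain NumPy mask, which is one common way to trigger the ValueError:

```python
import pandas as pd

df = pd.DataFrame({'age': [25, 30, 35, 40]})

# A mask built from a 2-row subset can't index the 4-row DataFrame
short_mask = (df.head(2)['age'] > 25).to_numpy()

caught = False
try:
    df[short_mask]
except ValueError:
    caught = True  # "Item wrong length" - the mask has 2 values, df has 4 rows

# Fix: build the mask from the same DataFrame you index
full_mask = df['age'] > 25
print(len(df[full_mask]))  # 3
```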
Troubleshooting issues with the apply() function
The apply() function can be tricky. A common mistake is using it to perform an action, like printing a result, without actually returning a value. When your function doesn't return anything, pandas fills the new column with None.
See what happens in the code below, where a missing return value fills the entire new column with unwanted None values.
import pandas as pd
df = pd.DataFrame({'value': [1, 2, 3, 4, 5]})
# Bug: the lambda prints instead of returning, so every value comes back as None
df['doubled'] = df.apply(lambda row: print(row['value'] * 2), axis=1)
print(df)
The lambda function only prints the result instead of returning it. Since apply() receives no value, it defaults to filling the new column with None. The corrected code below shows how to get the intended output.
import pandas as pd
df = pd.DataFrame({'value': [1, 2, 3, 4, 5]})
# Fix: return the value instead of printing it
df['doubled'] = df.apply(lambda row: row['value'] * 2, axis=1)
print(df)
The fix is simple: ensure your function includes a return statement. The corrected lambda function now returns the calculated value instead of just printing it. We also specify axis=1 to instruct pandas to apply the function to each row, not each column. This allows apply() to populate the new column correctly. You'll want to watch for this whenever you're creating a new column based on row-wise calculations; it's easy to forget the return.
Fixing errors when using groupby() operations
A frequent stumble with groupby() isn't about the grouping itself, but what comes after. This function creates an intermediate object that requires an aggregation, like sum() or mean(), to produce a meaningful result. Without it, you don't get a DataFrame.
The code below shows what happens when you call groupby() without chaining an aggregation method. Instead of a clean summary, you get an object that isn't directly useful for analysis.
import pandas as pd
df = pd.DataFrame({
'group': ['A', 'A', 'B', 'B'],
'value': [1, 2, 3, 4]
})
# This doesn't aggregate - result is just a GroupBy object
result = df.groupby('group')
print(result)
The groupby() call only creates the groups without performing any calculation. That's why printing the result shows a GroupBy object and its memory address instead of a summarized table. The code below shows how to complete the operation.
import pandas as pd
df = pd.DataFrame({
'group': ['A', 'A', 'B', 'B'],
'value': [1, 2, 3, 4]
})
# Fix: apply an aggregation function after groupby
result = df.groupby('group')['value'].sum()
print(result)
The fix is to chain an aggregation function like sum() after your groupby() call. The groupby('group') method only prepares the data by creating groups; it doesn't perform any calculations on its own. By adding ['value'].sum(), you're telling pandas to sum the value column for each distinct group. This is a crucial step anytime you want to summarize data. Without an aggregation, you're left with an intermediate object instead of a final result.
Debugging issues with boolean indexing
A common issue with boolean indexing is applying a filter and seeing no change. This happens because filtering creates a new view of the data—it doesn't modify the original DataFrame unless you explicitly reassign it. The code below shows this oversight.
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35]
})
# The filtered result is created here but immediately discarded
df[df['age'] > 30]
print(df) # Still prints the original dataframe
The expression df[df['age'] > 30] creates a filtered view, but the code discards it by not assigning it to a variable. You're left printing the original DataFrame. The following example shows how to fix this.
import pandas as pd
df = pd.DataFrame({
'name': ['Alice', 'Bob', 'Charlie'],
'age': [25, 30, 35]
})
# Fix: assign the filtered result to a new variable
filtered_df = df[df['age'] > 30]
print(filtered_df) # Now shows only rows where age > 30
The fix is to reassign the result. The filtering operation df[df['age'] > 30] creates a new, filtered view of your data but doesn't change the original DataFrame. To save the result, you must assign it to a variable, like filtered_df = df[df['age'] > 30]. This is a crucial step whenever you select a subset of data based on a condition, as pandas operations often return new objects instead of modifying them in place.
Real-world applications
With a handle on troubleshooting, you can see how these cleaning techniques power everything from e-commerce analysis to financial forecasting.
Cleaning e-commerce data for sales analysis
In a typical e-commerce dataset, often loaded from a CSV file, you'll find messy price formats, missing quantities, and duplicate orders that must be cleaned before you can accurately analyze sales performance.
import pandas as pd
# Sample e-commerce data with issues
sales = pd.DataFrame({
'order_date': ['2023-01-15', '2023-02-20', '2023-02-20', '2023-03-05'],
'product_id': [101, 102, 102, 103],
'price': ['$120.50', '$85.99', '$85.99', '$210.75'],
'quantity': [1, 2, 2, None]
})
sales['order_date'] = pd.to_datetime(sales['order_date'])
sales['price'] = sales['price'].str.replace('$', '', regex=False).astype(float)
sales['quantity'] = sales['quantity'].fillna(1)
sales_clean = sales.drop_duplicates().copy()  # copy() avoids a SettingWithCopyWarning
sales_clean['total'] = sales_clean['price'] * sales_clean['quantity']
print(sales_clean)
This example demonstrates how to clean a messy sales dataset for accurate analysis. The code tackles several common issues in sequence, transforming the raw data into a reliable format for calculations.
- First, it converts date strings to a proper datetime format and cleans the price column by removing dollar signs and changing the type to a float.
- Next, it assumes a default quantity of 1 for any missing values and removes duplicate order entries using drop_duplicates().
- Finally, a new total column is calculated from the cleaned price and quantity data.
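Once the data is clean, aggregations become trustworthy. A sketch of a monthly sales summary, rebuilt here with the already-cleaned values so it runs standalone:

```python
import pandas as pd

sales_clean = pd.DataFrame({
    'order_date': pd.to_datetime(['2023-01-15', '2023-02-20', '2023-03-05']),
    'total': [120.50, 171.98, 210.75]
})

# Group cleaned totals by calendar month for a simple sales summary
monthly = sales_clean.groupby(sales_clean['order_date'].dt.to_period('M'))['total'].sum()
print(monthly.round(2).to_dict())
```

Running the same groupby on the raw data would have double-counted the duplicated February order and failed on the string-typed prices.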
Preparing time series data for forecasting with interpolate()
The interpolate() function is a powerful tool for time-series forecasting, as it intelligently fills gaps by estimating values from their neighbors, which preserves the data's sequential integrity.
import pandas as pd
import numpy as np
# Sample time series data with missing values
data = pd.DataFrame({
'date': pd.date_range(start='2023-01-01', periods=6, freq='D'),
'value': [100, np.nan, 120, 115, np.nan, 135]
})
data['value_interpolated'] = data['value'].interpolate(method='linear')
data['day_of_week'] = data['date'].dt.dayofweek
data['is_weekend'] = data['day_of_week'].apply(lambda x: 1 if x >= 5 else 0)
data['rolling_mean_3d'] = data['value_interpolated'].rolling(window=3, min_periods=1).mean()
print(data[['date', 'value', 'value_interpolated', 'is_weekend', 'rolling_mean_3d']])
This code prepares time-series data for analysis by filling gaps and engineering new features. It uses several key pandas functions to transform the raw data into a format suitable for modeling.
- The interpolate() method fills missing NaN values with a straight-line estimate between the nearest valid points.
- New features like is_weekend are created from the date column to add useful context.
- A 3-day rolling average is calculated with rolling().mean() to smooth out short-term fluctuations and reveal underlying trends.
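One detail worth knowing: the default method='linear' treats rows as evenly spaced. With a DatetimeIndex, method='time' weights the estimate by the actual gap between timestamps instead. A sketch with an uneven gap (the values are illustrative):

```python
import pandas as pd
import numpy as np

# A gap in the index: Jan 2 is 1 day after Jan 1, but Jan 4 is 2 more days out
ts = pd.Series(
    [100.0, np.nan, 130.0],
    index=pd.to_datetime(['2023-01-01', '2023-01-02', '2023-01-04'])
)

# method='time' places the estimate 1/3 of the way from 100 to 130
filled = ts.interpolate(method='time')
print(filled.iloc[1])
```

The default linear method would have returned the midpoint (115) regardless of the timestamps, so method='time' matters whenever observations are irregularly spaced.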
Get started with Replit
Turn these techniques into a working application. Just tell Replit Agent what you need: "Build a CSV cleaner that removes duplicates and converts price columns to numbers" or "Create a utility that fills missing data with the column average."
The Agent writes the code, tests for errors, and deploys the app directly from your browser. Start building with Replit.
Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.