How to preprocess data in Python

Learn to preprocess data in Python. This guide covers methods, tips, real-world applications, and how to debug common errors.

Published on: Fri, Feb 20, 2026
Updated on: Mon, Apr 6, 2026
The Replit Team

To get reliable results from data analysis or machine learning, you must first preprocess your data. This step transforms raw information into a clean, understandable format to ensure model accuracy.

In this article, you'll learn key techniques, practical tips, and real-world applications. You'll also get debugging advice to help you handle common data challenges and build more robust machine learning models.

Basic data cleaning with pandas

import pandas as pd

# Load sample data
data = pd.DataFrame({'A': [1, 2, None, 4], 'B': ['x', 'y', 'z', None]})
cleaned_data = data.dropna()
print(cleaned_data)

Output:

     A  B
0  1.0  x
1  2.0  y

This example tackles a common data cleaning task: handling missing values. Using the pandas library, the code creates a sample DataFrame with incomplete data. The key function, dropna(), then removes any rows that contain None values, which represent missing information.

This step is crucial because most machine learning algorithms can't process datasets with missing entries and will often crash or produce unreliable results. Dropping these rows ensures your dataset is complete, providing a solid foundation for your model.
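If dropping every incomplete row is too aggressive, dropna() also accepts parameters to target specific columns or require a minimum number of non-missing values. A quick sketch using the same sample data:

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, None, 4], 'B': ['x', 'y', 'z', None]})

# Only drop rows where column 'A' is missing; the row with a missing 'B' survives
by_subset = data.dropna(subset=['A'])

# Keep rows that have at least 2 non-missing values
by_thresh = data.dropna(thresh=2)

print(by_subset)
print(by_thresh)
```

This gives you finer control over how much data you sacrifice to get a complete dataset.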

Data transformation techniques

Instead of just removing incomplete data, you can use transformation techniques to fill in gaps, standardize scales, and convert text into a machine-readable format. These techniques are essential when working with various data sources, including reading CSV files in Python.

Handling missing values with fillna()

import pandas as pd

df = pd.DataFrame({'Age': [25, None, 30, None], 'Income': [50000, 60000, None, 70000]})
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Income'] = df['Income'].fillna(df['Income'].mean())
print(df)

Output:

    Age   Income
0  25.0  50000.0
1  27.5  60000.0
2  30.0  60000.0
3  27.5  70000.0

Dropping data isn't always the best option, especially in smaller datasets. The fillna() method offers a more nuanced approach by letting you replace missing values instead. In this example, it fills the gaps in the Age and Income columns.

  • The code first calculates the mean(), or average, of each column containing missing data.
  • It then uses this average to replace any None values in the respective columns.
  • Assigning the result back to each column, rather than calling fillna() with inplace=True on a column selection, avoids pandas' chained-assignment warnings and works reliably across pandas versions.

Normalizing data with MinMaxScaler

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.DataFrame({'Height': [170, 185, 165, 190], 'Weight': [65, 90, 60, 95]})
scaler = MinMaxScaler()
data_normalized = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
print(data_normalized)

Output:

     Height    Weight
0  0.200000  0.142857
1  0.800000  0.857143
2  0.000000  0.000000
3  1.000000  1.000000

When features have vastly different scales, like Height and Weight in this example, it can negatively impact model training. Normalization brings all your data onto a common scale without distorting the differences in the ranges of values.

  • The MinMaxScaler from scikit-learn rescales each feature to a default range of 0 to 1.
  • The fit_transform() method learns the minimum and maximum values of your data and then applies the transformation, ensuring all features are treated equally by the model.
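MinMaxScaler also accepts a custom feature_range, and its inverse_transform() maps scaled values back to the original units. A brief sketch on the same sample data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.DataFrame({'Height': [170, 185, 165, 190], 'Weight': [65, 90, 60, 95]})

# Rescale each feature into a custom range instead of the default (0, 1)
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(data)
print(scaled)

# inverse_transform recovers the original, unscaled values
print(scaler.inverse_transform(scaled))
```

Being able to invert the scaling is handy when you need to report predictions in real-world units after training on normalized data.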

Encoding categorical variables with OneHotEncoder

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})
encoder = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
encoded = encoder.fit_transform(df[['Color']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Color']))
print(encoded_df)

Output:

   Color_Blue  Color_Green  Color_Red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          0.0        1.0

Machine learning models don't work with text, so categorical data like 'Red' or 'Blue' needs to be converted into numbers. One-hot encoding is a common technique that transforms these categories into a format models can understand without creating false relationships between them. For more detailed techniques on one hot encoding in Python, you can explore additional approaches.

  • The OneHotEncoder from scikit-learn creates a new binary column for each unique category.
  • A 1.0 in a column indicates the presence of that category for a given row, while a 0.0 indicates its absence.
  • This ensures the model treats each color as a distinct entity without any implied order.

Advanced preprocessing techniques

With the fundamentals covered, you can now apply advanced techniques to standardize features, simplify your dataset, and prepare unstructured text for analysis.

Standardizing features using StandardScaler

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Height': [170, 185, 165, 190], 'Weight': [65, 90, 60, 95]})
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
print(data_scaled)

Output:

     Height    Weight
0 -0.727607 -0.821995
1  0.727607  0.821995
2 -1.212678 -1.150793
3  1.212678  1.150793

Standardization is another powerful scaling technique. Unlike normalization, which fits data into a 0-1 range, the StandardScaler rescales features to have a mean of 0 and a standard deviation of 1. This is especially effective for algorithms sensitive to feature scale. For a deeper dive into different scaling approaches, see our guide on how to normalize data in Python.

  • The fit_transform() method first calculates the mean and standard deviation for each feature, like Height and Weight.
  • It then transforms each value by subtracting the mean and dividing by the standard deviation, centering the data around zero.
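You can sanity-check the result yourself: after scaling, each column should have a mean of approximately 0 and a population standard deviation of 1.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Height': [170, 185, 165, 190], 'Weight': [65, 90, 60, 95]})
scaled = StandardScaler().fit_transform(data)

# Each column should now have mean ~0 and (population) standard deviation ~1
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

Quick checks like this catch scaling mistakes early, before they quietly degrade model performance.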

Reducing dimensions with PCA

import pandas as pd
from sklearn.decomposition import PCA

data = pd.DataFrame({'Feature1': [1, 2, 3, 4], 'Feature2': [4, 5, 6, 7],
                     'Feature3': [7, 8, 9, 10]})
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(data)
print(pd.DataFrame(reduced_data, columns=['PC1', 'PC2']))

Output (PC2 is numerically zero; component signs can vary between scikit-learn versions):

        PC1  PC2
0 -2.598076  0.0
1 -0.866025  0.0
2  0.866025  0.0
3  2.598076  0.0

When your dataset has many features, especially correlated ones, it can make your model unnecessarily complex. Principal Component Analysis (PCA) simplifies your data by reducing the number of features while retaining the most important information.

  • The PCA class from scikit-learn is set up to reduce the original three features to two by using n_components=2.
  • These new features, or principal components, are combinations of the original ones that capture the maximum variance.
  • The fit_transform() method then creates and applies this transformation, simplifying the dataset for your model.
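To see how much information survives the reduction, inspect explained_variance_ratio_ after fitting. In this toy dataset the three features are perfectly correlated, so the first component should capture essentially all of the variance:

```python
import pandas as pd
from sklearn.decomposition import PCA

data = pd.DataFrame({'Feature1': [1, 2, 3, 4], 'Feature2': [4, 5, 6, 7],
                     'Feature3': [7, 8, 9, 10]})
pca = PCA(n_components=2)
pca.fit(data)

# Fraction of the total variance captured by each principal component;
# with perfectly correlated features, PC1 carries essentially all of it
print(pca.explained_variance_ratio_)
```

In practice, this ratio is how you decide how many components to keep: a common rule of thumb is to retain enough components to explain 90-95% of the variance.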

Processing text data with NLTK

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by newer NLTK releases
nltk.download('stopwords', quiet=True)

text = "Natural language processing is an exciting field of computer science."
tokens = word_tokenize(text.lower())
filtered = [word for word in tokens if word not in stopwords.words('english')]
print(filtered)

Output:

['natural', 'language', 'processing', 'exciting', 'field', 'computer', 'science', '.']

Before a model can analyze text, you need to break it down into its core components. This example uses the Natural Language Toolkit (NLTK) to prepare a sentence for analysis by tokenizing it and removing words that don't add much meaning. For rapid prototyping of text processing workflows, vibe coding can help you iterate quickly on different preprocessing approaches.

  • The word_tokenize() function splits the sentence into a list of individual words—or tokens—after converting the text to lowercase for consistency.
  • It then filters out common "stopwords" like is and an, which are removed to help the model focus on the most significant terms.

Move faster with Replit

Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. This allows you to move from piecing together individual techniques to building complete apps with Agent 4.

Instead of just learning methods, you can describe the app you want to build, and Agent 4 will take it from idea to a working product. For example, you could build:

  • A data cleaning utility that automatically fills or removes missing values from an uploaded dataset.
  • A feature scaling dashboard that normalizes numerical columns and one-hot encodes categorical data to prepare it for model training.
  • A keyword extraction tool that tokenizes raw text, removes stopwords, and identifies the most significant terms for SEO analysis.

Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.

Common errors and challenges

Even with powerful tools, you'll run into tricky situations, but a few key practices can help you navigate common data preprocessing challenges.

Avoiding chained assignment warnings with loc

You might have seen the SettingWithCopyWarning in pandas. It’s not an error, but a heads-up that you might be modifying a copy of your data instead of the original DataFrame. This often happens when you use chained indexing, like df['column'][row_filter], which can have unpredictable results.

To avoid this, you should use the .loc accessor for assignments. It guarantees that you’re working directly on the DataFrame, ensuring your changes are applied correctly. The proper syntax, df.loc[row_filter, 'column'], is more explicit and reliable.

Debugging data type mismatches in merge operations

When you combine DataFrames with a merge operation, a common pitfall is a mismatch in data types between the key columns. If one DataFrame stores a user ID as an integer and the other stores it as a string, the merge won't find any matches, leading to an empty or incomplete result.

Before merging, it’s a good practice to inspect the data types using the .dtypes attribute on both DataFrames. If the key columns don't align, convert one of them to match the other. This simple check can save you a lot of debugging time.

Handling NaN values in groupby operations

By default, when you use the groupby() function, it excludes rows where the grouping key is a NaN value. This behavior can silently skew your analysis because you might not realize that a portion of your data is being ignored. For example, if you're grouping by country and some entries are missing this information, those rows will be dropped from your aggregations.

If you need to include these missing values in your calculations—perhaps to count them or treat them as a separate category—you can set the dropna=False argument within your groupby() call. This tells pandas to treat NaN values as a valid group.

Avoiding chained assignment warnings with loc

When you see a SettingWithCopyWarning, pandas is telling you that an operation might not have worked as expected. This usually happens when you try to modify a filtered DataFrame, a practice known as chained indexing. See it in action below.

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# This will produce a SettingWithCopyWarning
df['A'][df['A'] > 1] = 10
print(df)

This chained operation, df['A'][df['A'] > 1], attempts to set a value on a filtered selection. Pandas warns you because this might be happening on a temporary copy, not the original DataFrame. The code below shows a safer way.

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Correct way using loc
df.loc[df['A'] > 1, 'A'] = 10
print(df)

The solution uses the .loc accessor to directly target the rows and column for modification. By specifying the row filter df['A'] > 1 and the column 'A' inside .loc, you ensure the assignment happens on the original DataFrame, not a temporary copy.

This is the most reliable way to prevent the SettingWithCopyWarning and ensure your changes stick. You should always use this method when filtering and assigning values at the same time.

Debugging data type mismatches in merge operations

A merge operation can fail silently if the key columns have different data types, like one being an integer and the other a string. This common mistake leads to empty or incomplete results, leaving you wondering where your data went.

The code below shows how this mismatch can cause pandas to miss valid matches between two DataFrames.

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
df2 = pd.DataFrame({'id': ['1', '2', '4'], 'name': ['a', 'b', 'c']})

# Will miss matching records due to type difference
merged = pd.merge(df1, df2, on='id', how='inner')
print(merged)

The merge operation can't find any matches because the id in df1 is an integer, while the id in df2 is a string. Since 1 and '1' aren't the same to pandas, the result is empty. Check the code below for the fix.

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
df2 = pd.DataFrame({'id': ['1', '2', '4'], 'name': ['a', 'b', 'c']})

# Convert id to same type before merging
df1['id'] = df1['id'].astype(str)
merged = pd.merge(df1, df2, on='id', how='inner')
print(merged)

The fix is to make the key columns match. By using astype(str), you convert the id column in one DataFrame to a string. Now that both id columns use the same data type, the merge operation can find the correct matches. It's a good habit to check column types with .dtypes before merging, especially when combining data from different files or databases, as this is where mismatches often occur. For comprehensive techniques on merging DataFrames in Python, explore advanced joining methods.
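To make the mismatch visible before committing to a merge, you can print the key dtypes and pass indicator=True, which adds a _merge column recording which DataFrame each row came from. A quick diagnostic sketch using the same sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
df2 = pd.DataFrame({'id': ['1', '2', '4'], 'name': ['a', 'b', 'c']})

# Compare the key dtypes up front: int64 vs. object signals trouble
print(df1['id'].dtype, df2['id'].dtype)

# After aligning the types, indicator=True adds a _merge column that
# records whether each row matched in both frames or only one of them
df1['id'] = df1['id'].astype(str)
diagnosed = pd.merge(df1, df2, on='id', how='outer', indicator=True)
print(diagnosed[['id', '_merge']])
```

Rows flagged left_only or right_only point you straight at the keys that failed to match, which is far faster than eyeballing two DataFrames.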

Handling NaN values in groupby operations

By default, the groupby() function in pandas drops rows where the grouping key is NaN. This can silently skew your analysis because you might not realize a portion of your data is being ignored. The code below shows this in action.

import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', None, 'B'],
    'value': [1, 2, 3, 4, 5]
})

# NaN groups are silently dropped
result = df.groupby('group').mean()
print(result)

The resulting aggregation only shows groups 'A' and 'B', as groupby() automatically discards the row where the group is None. This can skew your results. Check the code below for the proper way to handle this.

import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', None, 'B'],
    'value': [1, 2, 3, 4, 5]
})

# Fill NaN with a placeholder before grouping
df['group'] = df['group'].fillna('Unknown')
result = df.groupby('group').mean()
print(result)

The fix is to replace NaN values before you group the data. By using fillna('Unknown'), you convert any None entries into a placeholder string. This ensures the groupby() function treats them as a valid category instead of dropping them. Now, your aggregation includes the previously ignored data under the 'Unknown' group. This is crucial when analyzing incomplete datasets, as it prevents you from accidentally losing information during your analysis.
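If you'd rather keep the missing keys as their own group without inventing a placeholder label, pandas (1.1 and later) also supports passing dropna=False to groupby(), as mentioned earlier:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', None, 'B'],
    'value': [1, 2, 3, 4, 5]
})

# dropna=False keeps NaN as its own group instead of silently discarding it
result = df.groupby('group', dropna=False).mean()
print(result)
```

The aggregation now includes a NaN-labeled group, so no rows are lost and you can decide later how to report it.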

Real-world applications

These techniques are the foundation for real-world applications, from predicting customer churn to forecasting sales with time series data. When working with production-scale data, you'll also need strategies for handling large datasets in Python.

Preparing customer data for churn analysis using OneHotEncoder

Before you can build a model to predict customer churn, you need to prepare the data by filling in missing values and using OneHotEncoder to convert categorical features into a numerical format. This preprocessing workflow exemplifies how AI coding with Python streamlines machine learning development.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample customer data for churn analysis
customers = pd.DataFrame({
    'Age': [35, 45, None, 28],
    'Subscription': ['Premium', 'Basic', 'Premium', 'Basic'],
    'Churned': [0, 1, 0, 1]
})

# Handle missing values and encode categorical feature
customers['Age'] = customers['Age'].fillna(customers['Age'].mean())
encoder = OneHotEncoder(sparse_output=False)
subscription_encoded = pd.DataFrame(
    encoder.fit_transform(customers[['Subscription']]),
    columns=encoder.get_feature_names_out(['Subscription'])
)
result = pd.concat([customers, subscription_encoded], axis=1)
print(result)

This code gets customer data ready for a machine learning model. It starts by tackling missing information in the Age column, using the fillna() method to replace any empty spots with the average age. This ensures the dataset is complete.

Next, it addresses the categorical Subscription column:

  • The OneHotEncoder transforms text values like 'Premium' and 'Basic' into separate numerical columns.
  • These new columns are then merged back into the original DataFrame with pd.concat(), creating a fully numerical dataset suitable for model training.

Processing time series data for sales forecasting

To build an accurate sales forecast, you'll need to process your time series data by filling in missing values, creating new features from the dates, and normalizing the sales figures.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample sales time series data
dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
sales = pd.DataFrame({
    'Date': dates,
    'Sales': [100, 120, None, 115, 125]
})

# Handle missing values and create time features
sales.set_index('Date', inplace=True)
sales['Sales'] = sales['Sales'].fillna(sales['Sales'].mean())
sales['DayOfWeek'] = sales.index.dayofweek
scaler = MinMaxScaler()
sales['Sales_Normalized'] = scaler.fit_transform(sales[['Sales']])
print(sales)

This code prepares sales data for a time series model. It sets the Date column as the index, a standard step for time-based analysis, and uses fillna() to replace a missing sales entry with the column’s average.

  • A new feature, DayOfWeek, is created from the date index. This helps the model identify weekly patterns in the data.
  • The MinMaxScaler normalizes the Sales figures into a 0-1 range, which prevents large values from disproportionately influencing the model during training.
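Beyond the day of the week, lag and rolling-window features are common additions for forecasting, since they give the model access to recent history. A minimal sketch on a small sample series (values are illustrative):

```python
import pandas as pd

dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
sales = pd.DataFrame({'Sales': [100, 120, 110, 115, 125]}, index=dates)

# Lag feature: the previous day's sales (NaN for the first row)
sales['Sales_Lag1'] = sales['Sales'].shift(1)

# Rolling feature: 3-day moving average (NaN until the window fills)
sales['Sales_Rolling3'] = sales['Sales'].rolling(window=3).mean()
print(sales)
```

Note that both features introduce NaN values at the start of the series, so you'll typically drop or fill those rows before training.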

Get started with Replit

Turn these techniques into a working tool. Tell Replit Agent to “build a utility that cleans a CSV by filling missing values” or “create a dashboard that applies StandardScaler to uploaded data.”

Replit Agent writes the code, tests for errors, and deploys your app from a simple prompt. Start building with Replit.

Build your first app today

Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.
