How to preprocess data in Python

Learn to preprocess data in Python. This guide covers methods, tips, real-world applications, and how to debug common errors.

Published on: Fri, Feb 20, 2026
Updated on: Mon, Apr 6, 2026
The Replit Team

To get reliable results from data analysis or machine learning, you must first preprocess your data. This step transforms raw information into a clean, understandable format to ensure model accuracy.

In this article, you'll learn key techniques, practical tips, and real-world applications. You'll also get debugging advice to help you handle common data challenges and build more robust machine learning models.

Basic data cleaning with pandas

import pandas as pd

# Load sample data
data = pd.DataFrame({'A': [1, 2, None, 4], 'B': ['x', 'y', 'z', None]})
cleaned_data = data.dropna()
print(cleaned_data)

Output:

     A  B
0  1.0  x
1  2.0  y

This example tackles a common data cleaning task: handling missing values. Using the pandas library, the code creates a sample DataFrame with incomplete data. The key function, dropna(), then removes any rows that contain None values, which represent missing information.

This step is crucial because most machine learning algorithms can't process datasets with missing entries and will often crash or produce unreliable results. Dropping these rows ensures your dataset is complete, providing a solid foundation for your model.
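If dropping every incomplete row is too aggressive, dropna() also accepts parameters to target specific columns or require a minimum number of non-missing values. A quick sketch using the same sample data:

```python
import pandas as pd

data = pd.DataFrame({'A': [1, 2, None, 4], 'B': ['x', 'y', 'z', None]})

# Only drop rows where column 'A' is missing; the row with a missing 'B' survives
by_subset = data.dropna(subset=['A'])

# Keep rows that have at least 2 non-missing values
by_thresh = data.dropna(thresh=2)

print(by_subset)
print(by_thresh)
```

This gives you finer control over how much data you sacrifice to get a complete dataset.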

Data transformation techniques

Instead of just removing incomplete data, you can use transformation techniques to fill in gaps, standardize scales, and convert text into a machine-readable format. These techniques are essential when working with various data sources, including reading CSV files in Python.

Handling missing values with fillna()

import pandas as pd

df = pd.DataFrame({'Age': [25, None, 30, None], 'Income': [50000, 60000, None, 70000]})
df['Age'] = df['Age'].fillna(df['Age'].mean())
df['Income'] = df['Income'].fillna(df['Income'].mean())
print(df)

Output:

    Age   Income
0  25.0  50000.0
1  27.5  60000.0
2  30.0  60000.0
3  27.5  70000.0

Dropping data isn't always the best option, especially in smaller datasets. The fillna() method offers a more nuanced approach by letting you replace missing values instead. In this example, it fills the gaps in the Age and Income columns.

  • The code first calculates the mean(), or average, of each column containing missing data.
  • It then uses this average to replace any None values in the respective columns.
  • Assigning the result back to each column, rather than calling fillna() with inplace=True on a column selection, avoids pandas' chained-assignment warnings and works reliably across pandas versions.

Normalizing data with MinMaxScaler

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.DataFrame({'Height': [170, 185, 165, 190], 'Weight': [65, 90, 60, 95]})
scaler = MinMaxScaler()
data_normalized = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
print(data_normalized)

Output:

     Height    Weight
0  0.200000  0.142857
1  0.800000  0.857143
2  0.000000  0.000000
3  1.000000  1.000000

When features have vastly different scales, like Height and Weight in this example, it can negatively impact model training. Normalization brings all your data onto a common scale without distorting the differences in the ranges of values.

  • The MinMaxScaler from scikit-learn rescales each feature to a default range of 0 to 1.
  • The fit_transform() method learns the minimum and maximum values of your data and then applies the transformation, ensuring all features are treated equally by the model.
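MinMaxScaler also accepts a custom feature_range, and its inverse_transform() maps scaled values back to the original units. A brief sketch on the same sample data:

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

data = pd.DataFrame({'Height': [170, 185, 165, 190], 'Weight': [65, 90, 60, 95]})

# Rescale each feature into a custom range instead of the default (0, 1)
scaler = MinMaxScaler(feature_range=(-1, 1))
scaled = scaler.fit_transform(data)
print(scaled)

# inverse_transform recovers the original, unscaled values
print(scaler.inverse_transform(scaled))
```

Being able to invert the scaling is handy when you need to report predictions in real-world units after training on normalized data.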

Encoding categorical variables with OneHotEncoder

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

df = pd.DataFrame({'Color': ['Red', 'Blue', 'Green', 'Red']})
encoder = OneHotEncoder(sparse_output=False)  # 'sparse' was renamed to 'sparse_output' in scikit-learn 1.2
encoded = encoder.fit_transform(df[['Color']])
encoded_df = pd.DataFrame(encoded, columns=encoder.get_feature_names_out(['Color']))
print(encoded_df)

Output:

   Color_Blue  Color_Green  Color_Red
0         0.0          0.0        1.0
1         1.0          0.0        0.0
2         0.0          1.0        0.0
3         0.0          0.0        1.0

Machine learning models don't work with text, so categorical data like 'Red' or 'Blue' needs to be converted into numbers. One-hot encoding is a common technique that transforms these categories into a format models can understand without creating false relationships between them. For more detailed techniques on one hot encoding in Python, you can explore additional approaches.

  • The OneHotEncoder from scikit-learn creates a new binary column for each unique category.
  • A 1.0 in a column indicates the presence of that category for a given row, while a 0.0 indicates its absence.
  • This ensures the model treats each color as a distinct entity without any implied order.

Advanced preprocessing techniques

With the fundamentals covered, you can now apply advanced techniques to standardize features, simplify your dataset, and prepare unstructured text for analysis.

Standardizing features using StandardScaler

import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Height': [170, 185, 165, 190], 'Weight': [65, 90, 60, 95]})
scaler = StandardScaler()
data_scaled = pd.DataFrame(scaler.fit_transform(data), columns=data.columns)
print(data_scaled)

Output:

     Height    Weight
0 -0.727607 -0.821995
1  0.727607  0.821995
2 -1.212678 -1.150793
3  1.212678  1.150793

Standardization is another powerful scaling technique. Unlike normalization, which fits data into a 0-1 range, the StandardScaler rescales features to have a mean of 0 and a standard deviation of 1. This is especially effective for algorithms sensitive to feature scale. For a deeper dive into different scaling approaches, see our guide on how to normalize data in Python.

  • The fit_transform() method first calculates the mean and standard deviation for each feature, like Height and Weight.
  • It then transforms each value by subtracting the mean and dividing by the standard deviation, centering the data around zero.
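You can sanity-check the result yourself: after scaling, each column should have a mean of approximately 0 and a population standard deviation of 1.

```python
import pandas as pd
from sklearn.preprocessing import StandardScaler

data = pd.DataFrame({'Height': [170, 185, 165, 190], 'Weight': [65, 90, 60, 95]})
scaled = StandardScaler().fit_transform(data)

# Each column should now have mean ~0 and (population) standard deviation ~1
print(scaled.mean(axis=0).round(6))
print(scaled.std(axis=0).round(6))
```

Quick checks like this catch scaling mistakes early, before they quietly degrade model performance.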

Reducing dimensions with PCA

import pandas as pd
from sklearn.decomposition import PCA

data = pd.DataFrame({'Feature1': [1, 2, 3, 4], 'Feature2': [4, 5, 6, 7],
                     'Feature3': [7, 8, 9, 10]})
pca = PCA(n_components=2)
reduced_data = pca.fit_transform(data)
print(pd.DataFrame(reduced_data, columns=['PC1', 'PC2']))

Output (PC2 is numerically zero; component signs can vary between scikit-learn versions):

        PC1  PC2
0 -2.598076  0.0
1 -0.866025  0.0
2  0.866025  0.0
3  2.598076  0.0

When your dataset has many features, especially correlated ones, it can make your model unnecessarily complex. Principal Component Analysis (PCA) simplifies your data by reducing the number of features while retaining the most important information.

  • The PCA class from scikit-learn is set up to reduce the original three features to two by using n_components=2.
  • These new features, or principal components, are combinations of the original ones that capture the maximum variance.
  • The fit_transform() method then creates and applies this transformation, simplifying the dataset for your model.
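To see how much information survives the reduction, inspect explained_variance_ratio_ after fitting. In this toy dataset the three features are perfectly correlated, so the first component should capture essentially all of the variance:

```python
import pandas as pd
from sklearn.decomposition import PCA

data = pd.DataFrame({'Feature1': [1, 2, 3, 4], 'Feature2': [4, 5, 6, 7],
                     'Feature3': [7, 8, 9, 10]})
pca = PCA(n_components=2)
pca.fit(data)

# Fraction of the total variance captured by each principal component;
# with perfectly correlated features, PC1 carries essentially all of it
print(pca.explained_variance_ratio_)
```

In practice, this ratio is how you decide how many components to keep: a common rule of thumb is to retain enough components to explain 90-95% of the variance.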

Processing text data with NLTK

import nltk
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
nltk.download('punkt', quiet=True)
nltk.download('punkt_tab', quiet=True)  # required by newer NLTK releases
nltk.download('stopwords', quiet=True)

text = "Natural language processing is an exciting field of computer science."
tokens = word_tokenize(text.lower())
filtered = [word for word in tokens if word not in stopwords.words('english')]
print(filtered)

Output:

['natural', 'language', 'processing', 'exciting', 'field', 'computer', 'science', '.']

Before a model can analyze text, you need to break it down into its core components. This example uses the Natural Language Toolkit (NLTK) to prepare a sentence for analysis by tokenizing it and removing words that don't add much meaning. For rapid prototyping of text processing workflows, vibe coding can help you iterate quickly on different preprocessing approaches.

  • The word_tokenize() function splits the sentence into a list of individual words—or tokens—after converting the text to lowercase for consistency.
  • It then filters out common "stopwords" like is and an, which are removed to help the model focus on the most significant terms.

Move faster with Replit

Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. This allows you to move from piecing together individual techniques to building complete apps with Agent 4.

Instead of just learning methods, you can describe the app you want to build, and Agent 4 will take it from idea to a working product. For example, you could build:

  • A data cleaning utility that automatically fills or removes missing values from an uploaded dataset.
  • A feature scaling dashboard that normalizes numerical columns and one-hot encodes categorical data to prepare it for model training.
  • A keyword extraction tool that tokenizes raw text, removes stopwords, and identifies the most significant terms for SEO analysis.

Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.

Common errors and challenges

Even with powerful tools, you'll run into tricky situations, but a few key practices can help you navigate common data preprocessing challenges.

Avoiding chained assignment warnings with loc

You might have seen the SettingWithCopyWarning in pandas. It’s not an error, but a heads-up that you might be modifying a copy of your data instead of the original DataFrame. This often happens when you use chained indexing, like df['column'][row_filter], which can have unpredictable results.

To avoid this, you should use the .loc accessor for assignments. It guarantees that you’re working directly on the DataFrame, ensuring your changes are applied correctly. The proper syntax, df.loc[row_filter, 'column'], is more explicit and reliable.

Debugging data type mismatches in merge operations

When you combine DataFrames with a merge operation, a common pitfall is a mismatch in data types between the key columns. If one DataFrame stores a user ID as an integer and the other stores it as a string, the merge won't find any matches, leading to an empty or incomplete result.

Before merging, it’s a good practice to inspect the data types using the .dtypes attribute on both DataFrames. If the key columns don't align, convert one of them to match the other. This simple check can save you a lot of debugging time.

Handling NaN values in groupby operations

By default, when you use the groupby() function, it excludes rows where the grouping key is a NaN value. This behavior can silently skew your analysis because you might not realize that a portion of your data is being ignored. For example, if you're grouping by country and some entries are missing this information, those rows will be dropped from your aggregations.

If you need to include these missing values in your calculations—perhaps to count them or treat them as a separate category—you can set the dropna=False argument within your groupby() call. This tells pandas to treat NaN values as a valid group.

Avoiding chained assignment warnings with loc

When you see a SettingWithCopyWarning, pandas is telling you that an operation might not have worked as expected. This usually happens when you try to modify a filtered DataFrame, a practice known as chained indexing. See it in action below.

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# This will produce a SettingWithCopyWarning
df['A'][df['A'] > 1] = 10
print(df)

This chained operation, df['A'][df['A'] > 1], attempts to set a value on a filtered selection. Pandas warns you because this might be happening on a temporary copy, not the original DataFrame. The code below shows a safer way.

import pandas as pd

df = pd.DataFrame({'A': [1, 2, 3], 'B': [4, 5, 6]})

# Correct way using loc
df.loc[df['A'] > 1, 'A'] = 10
print(df)

The solution uses the .loc accessor to directly target the rows and column for modification. By specifying the row filter df['A'] > 1 and the column 'A' inside .loc, you ensure the assignment happens on the original DataFrame, not a temporary copy.

This is the most reliable way to prevent the SettingWithCopyWarning and ensure your changes stick. You should always use this method when filtering and assigning values at the same time.

Debugging data type mismatches in merge operations

A merge operation can fail silently if the key columns have different data types, like one being an integer and the other a string. This common mistake leads to empty or incomplete results, leaving you wondering where your data went.

The code below shows how this mismatch can cause pandas to miss valid matches between two DataFrames.

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
df2 = pd.DataFrame({'id': ['1', '2', '4'], 'name': ['a', 'b', 'c']})

# Will miss matching records due to type difference
merged = pd.merge(df1, df2, on='id', how='inner')
print(merged)

The merge operation can't find any matches because the id in df1 is an integer, while the id in df2 is a string. Since 1 and '1' aren't the same to pandas, the result is empty. Check the code below for the fix.

import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
df2 = pd.DataFrame({'id': ['1', '2', '4'], 'name': ['a', 'b', 'c']})

# Convert id to same type before merging
df1['id'] = df1['id'].astype(str)
merged = pd.merge(df1, df2, on='id', how='inner')
print(merged)

The fix is to make the key columns match. By using astype(str), you convert the id column in one DataFrame to a string. Now that both id columns use the same data type, the merge operation can find the correct matches. It's a good habit to check column types with .dtypes before merging, especially when combining data from different files or databases, as this is where mismatches often occur. For comprehensive techniques on merging DataFrames in Python, explore advanced joining methods.
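To make the mismatch visible before committing to a merge, you can print the key dtypes and pass indicator=True, which adds a _merge column recording which DataFrame each row came from. A quick diagnostic sketch using the same sample data:

```python
import pandas as pd

df1 = pd.DataFrame({'id': [1, 2, 3], 'value': [10, 20, 30]})
df2 = pd.DataFrame({'id': ['1', '2', '4'], 'name': ['a', 'b', 'c']})

# Compare the key dtypes up front: int64 vs. object signals trouble
print(df1['id'].dtype, df2['id'].dtype)

# After aligning the types, indicator=True adds a _merge column that
# records whether each row matched in both frames or only one of them
df1['id'] = df1['id'].astype(str)
diagnosed = pd.merge(df1, df2, on='id', how='outer', indicator=True)
print(diagnosed[['id', '_merge']])
```

Rows flagged left_only or right_only point you straight at the keys that failed to match, which is far faster than eyeballing two DataFrames.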

Handling NaN values in groupby operations

By default, the groupby() function in pandas drops rows where the grouping key is NaN. This can silently skew your analysis because you might not realize a portion of your data is being ignored. The code below shows this in action.

import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', None, 'B'],
    'value': [1, 2, 3, 4, 5]
})

# NaN groups are silently dropped
result = df.groupby('group').mean()
print(result)

The resulting aggregation only shows groups 'A' and 'B', as groupby() automatically discards the row where the group is None. This can skew your results. Check the code below for the proper way to handle this.

import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', None, 'B'],
    'value': [1, 2, 3, 4, 5]
})

# Fill NaN with a placeholder before grouping
df['group'] = df['group'].fillna('Unknown')
result = df.groupby('group').mean()
print(result)

The fix is to replace NaN values before you group the data. By using fillna('Unknown'), you convert any None entries into a placeholder string. This ensures the groupby() function treats them as a valid category instead of dropping them. Now, your aggregation includes the previously ignored data under the 'Unknown' group. This is crucial when analyzing incomplete datasets, as it prevents you from accidentally losing information during your analysis.
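If you'd rather keep the missing keys as their own group without inventing a placeholder label, pandas (1.1 and later) also supports passing dropna=False to groupby(), as mentioned earlier:

```python
import pandas as pd

df = pd.DataFrame({
    'group': ['A', 'A', 'B', None, 'B'],
    'value': [1, 2, 3, 4, 5]
})

# dropna=False keeps NaN as its own group instead of silently discarding it
result = df.groupby('group', dropna=False).mean()
print(result)
```

The aggregation now includes a NaN-labeled group, so no rows are lost and you can decide later how to report it.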

Real-world applications

These techniques are the foundation for real-world applications, from predicting customer churn to forecasting sales with time series data. When working with production-scale data, you'll also need strategies for handling large datasets in Python.

Preparing customer data for churn analysis using OneHotEncoder

Before you can build a model to predict customer churn, you need to prepare the data by filling in missing values and using OneHotEncoder to convert categorical features into a numerical format. This preprocessing workflow exemplifies how AI coding with Python streamlines machine learning development.

import pandas as pd
from sklearn.preprocessing import OneHotEncoder

# Sample customer data for churn analysis
customers = pd.DataFrame({
    'Age': [35, 45, None, 28],
    'Subscription': ['Premium', 'Basic', 'Premium', 'Basic'],
    'Churned': [0, 1, 0, 1]
})

# Handle missing values and encode categorical feature
customers['Age'] = customers['Age'].fillna(customers['Age'].mean())
encoder = OneHotEncoder(sparse_output=False)
subscription_encoded = pd.DataFrame(
    encoder.fit_transform(customers[['Subscription']]),
    columns=encoder.get_feature_names_out(['Subscription'])
)
result = pd.concat([customers, subscription_encoded], axis=1)
print(result)

This code gets customer data ready for a machine learning model. It starts by tackling missing information in the Age column, using the fillna() method to replace any empty spots with the average age. This ensures the dataset is complete.

Next, it addresses the categorical Subscription column:

  • The OneHotEncoder transforms text values like 'Premium' and 'Basic' into separate numerical columns.
  • These new columns are then merged back into the original DataFrame with pd.concat(), creating a fully numerical dataset suitable for model training.

Processing time series data for sales forecasting

To build an accurate sales forecast, you'll need to process your time series data by filling in missing values, creating new features from the dates, and normalizing the sales figures.

import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Sample sales time series data
dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
sales = pd.DataFrame({
    'Date': dates,
    'Sales': [100, 120, None, 115, 125]
})

# Handle missing values and create time features
sales.set_index('Date', inplace=True)
sales['Sales'] = sales['Sales'].fillna(sales['Sales'].mean())
sales['DayOfWeek'] = sales.index.dayofweek
scaler = MinMaxScaler()
sales['Sales_Normalized'] = scaler.fit_transform(sales[['Sales']])
print(sales)

This code prepares sales data for a time series model. It sets the Date column as the index, a standard step for time-based analysis, and uses fillna() to replace a missing sales entry with the column’s average.

  • A new feature, DayOfWeek, is created from the date index. This helps the model identify weekly patterns in the data.
  • The MinMaxScaler normalizes the Sales figures into a 0-1 range, which prevents large values from disproportionately influencing the model during training.
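Beyond the day of the week, lag and rolling-window features are common additions for forecasting, since they give the model access to recent history. A minimal sketch on a small sample series (values are illustrative):

```python
import pandas as pd

dates = pd.date_range(start='2023-01-01', periods=5, freq='D')
sales = pd.DataFrame({'Sales': [100, 120, 110, 115, 125]}, index=dates)

# Lag feature: the previous day's sales (NaN for the first row)
sales['Sales_Lag1'] = sales['Sales'].shift(1)

# Rolling feature: 3-day moving average (NaN until the window fills)
sales['Sales_Rolling3'] = sales['Sales'].rolling(window=3).mean()
print(sales)
```

Note that both features introduce NaN values at the start of the series, so you'll typically drop or fill those rows before training.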

Get started with Replit

Turn these techniques into a working tool. Tell Replit Agent to “build a utility that cleans a CSV by filling missing values” or “create a dashboard that applies StandardScaler to uploaded data.”

Replit Agent writes the code, tests for errors, and deploys your app from a simple prompt. Start building with Replit.

Build your first app today

Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.
