How to impute missing values in Python
Learn to impute missing values in Python. Explore methods, tips, real-world applications, and how to debug common errors.

Missing data can compromise your analysis and machine learning models. Python offers robust techniques to impute these values, which ensures your dataset remains complete and reliable for accurate results.
Here, you'll explore key imputation techniques with practical tips for implementation. We also cover real-world applications and debugging advice to help you confidently manage missing data in your projects.
Using fillna() to replace missing values with a constant
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})
df_filled = df.fillna(0)
print(df_filled)
--OUTPUT--
     A    B
0  1.0  5.0
1  2.0  0.0
2  0.0  0.0
3  4.0  8.0
The fillna() method offers a straightforward approach to imputation by replacing missing values with a constant. In this case, df.fillna(0) scans the entire DataFrame and substitutes every instance of np.nan with 0, creating a new, complete DataFrame.
This strategy is effective when zero is a logical replacement for missing data—for instance, if NaN implies zero sales or no events. While simple, this method can alter the statistical properties of your data, like its mean and variance, so it's best used when you're confident it won't introduce significant bias.
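To see that bias in action, compare a column's mean before and after a constant fill. This is a minimal sketch with made-up numbers:

```python
import pandas as pd
import numpy as np

# A tiny series to show how a constant fill shifts the statistics
s = pd.Series([10.0, np.nan, 30.0])

mean_before = s.mean()            # NaN is skipped, so the mean is 20.0
mean_after = s.fillna(0).mean()   # the constant drags the mean down

print(mean_before, mean_after)
```

Filling with 0 here pulls the mean from 20.0 down to roughly 13.3, which is exactly the kind of distortion to watch for.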
Basic imputation methods
To avoid the potential bias of a constant value, you can use more sophisticated techniques that leverage statistical measures or the existing data's structure.
Using statistical measures for imputation
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})
df_mean = df.fillna(df.mean())
print(df_mean)
--OUTPUT--
          A    B
0  1.000000  5.0
1  2.000000  6.5
2  2.333333  6.5
3  4.000000  8.0
Instead of a fixed number, you can use a column's mean to fill in the gaps. The code df.fillna(df.mean()) calculates the average for each numeric column and uses that value to replace any missing entries. This preserves each column's mean, though it can shrink the variance, since every filled cell sits exactly at the average.
Other statistical measures also work well:
- df.median(): Ideal for data with significant outliers, as it's less sensitive to extreme values.
- df.mode()[0]: The best option for filling in missing categorical data.
Choosing the right measure depends on your data's characteristics, and the choice matters even more when handling large datasets in Python.
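As a quick sketch of both measures at work (the income and color columns here are made up, with one large outlier in income):

```python
import pandas as pd
import numpy as np

# Hypothetical data: a numeric column with an outlier and a categorical column
df = pd.DataFrame({'income': [30000, 32000, np.nan, 31000, 900000],
                   'color': ['red', 'blue', np.nan, 'red', 'red']})

# The median resists the 900000 outlier; the mean would not
df['income'] = df['income'].fillna(df['income'].median())

# The mode (most frequent value) suits categorical data
df['color'] = df['color'].fillna(df['color'].mode()[0])

print(df)
```

Here the missing income becomes 31500 (the median), while a mean fill would have produced an implausible value near 250000 because of the outlier.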
Forward and backward filling with ffill() and bfill()
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})
# Forward fill then backward fill for remaining NAs
df_filled = df.ffill().bfill()
print(df_filled)
--OUTPUT--
     A    B
0  1.0  5.0
1  2.0  5.0
2  2.0  5.0
3  4.0  8.0
Forward and backward filling are ideal for sequential or time-series data where the order of values is meaningful. This method uses existing data points to fill in the gaps.
- ffill(), or forward fill, propagates the last valid observation forward to the next.
- bfill(), or backward fill, works in reverse, filling missing values with the next valid observation.
Chaining them, as in df.ffill().bfill(), creates a powerful combination. The forward fill handles most gaps, and the backward fill then takes care of any remaining missing values, such as those at the very beginning of a dataset.
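A quick sketch of why the chain matters: a forward fill alone cannot repair a gap at the very start of a series.

```python
import pandas as pd
import numpy as np

s = pd.Series([np.nan, 1.0, np.nan, 3.0])

filled_forward = s.ffill()       # the leading NaN survives: nothing precedes it
filled_both = s.ffill().bfill()  # the backward fill then cleans it up

print(filled_forward)
print(filled_both)
```

After ffill() alone, the first element is still NaN; chaining bfill() fills it with the next valid value.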
Using SimpleImputer from scikit-learn
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})
imputer = SimpleImputer(strategy='mean')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_imputed)
--OUTPUT--
          A    B
0  1.000000  5.0
1  2.000000  6.5
2  2.333333  6.5
3  4.000000  8.0
For a more structured approach, especially within a machine learning pipeline, you can use SimpleImputer from scikit-learn. This class standardizes the process. You initialize it with a chosen strategy—in this case, strategy='mean'.
- The fit_transform() method first learns the mean from each column and then fills the missing values in one step.
- It also supports other strategies, including 'median', 'most_frequent', and 'constant'.
This makes SimpleImputer a versatile tool for preprocessing data before feeding it into a model.
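For instance, the 'most_frequent' strategy also works on string columns, which 'mean' and 'median' do not. A small sketch with a hypothetical city column:

```python
import pandas as pd
import numpy as np
from sklearn.impute import SimpleImputer

# Hypothetical categorical column with one gap
df = pd.DataFrame({'city': ['Paris', 'London', np.nan, 'Paris']})

# 'most_frequent' handles object (string) data, unlike 'mean' or 'median'
imputer = SimpleImputer(strategy='most_frequent')
df_imputed = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)

print(df_imputed)
```

The gap is filled with 'Paris', the value that appears most often in the column.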
Advanced imputation techniques
When basic methods aren't enough, you can use predictive models like KNNImputer and IterativeImputer or specialized interpolation for more nuanced imputation.
Using KNNImputer for pattern-based filling
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                   'B': [5, 6, np.nan, 8, 9],
                   'C': [10, np.nan, 12, 13, 14]})
imputer = KNNImputer(n_neighbors=2)
df_knn = pd.DataFrame(imputer.fit_transform(df), columns=df.columns)
print(df_knn)
--OUTPUT--
     A    B     C
0  1.0  5.0  10.0
1  2.0  6.0  11.5
2  3.0  7.0  12.0
3  4.0  8.0  13.0
4  5.0  9.0  14.0
The KNNImputer offers a sophisticated way to handle missing data by using a k-nearest neighbors approach. It identifies the closest data points, or neighbors, to a sample with a missing value and uses their values to perform the imputation.
- In this example, n_neighbors=2 tells the imputer to find the two most similar rows for each missing entry.
- The missing value is then replaced with the average of the values from these two neighbors, preserving relationships within the data more effectively than simple statistical measures.
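One practical caveat: because KNN relies on distances, columns on very different scales can dominate the neighbor search. A common pattern, sketched here with hypothetical age and salary columns, is to scale first, impute, then invert the scaling:

```python
import pandas as pd
import numpy as np
from sklearn.impute import KNNImputer
from sklearn.preprocessing import StandardScaler

# Hypothetical data where one column dwarfs the other in scale
df = pd.DataFrame({'age': [25, 30, np.nan, 40],
                   'salary': [40000, 52000, 50000, 75000]})

# Scale so 'salary' doesn't dominate the distance calculation
scaler = StandardScaler()
scaled = scaler.fit_transform(df)

# Impute in the scaled space, then map back to the original units
imputed_scaled = KNNImputer(n_neighbors=2).fit_transform(scaled)
df_imputed = pd.DataFrame(scaler.inverse_transform(imputed_scaled),
                          columns=df.columns)
print(df_imputed)
```

StandardScaler ignores NaNs when fitting, so the pipeline works even with gaps in the data; the missing age is filled with the average age of the two salary-nearest rows.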
Iterative imputation with IterativeImputer
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
df = pd.DataFrame({'A': [1, 2, np.nan, 4, 5],
                   'B': [5, 6, np.nan, 8, np.nan]})
mice_imputer = IterativeImputer(max_iter=10, random_state=0)
df_mice = pd.DataFrame(mice_imputer.fit_transform(df), columns=df.columns)
print(df_mice)
--OUTPUT--
          A         B
0  1.000000  5.000000
1  2.000000  6.000000
2  2.999991  6.999974
3  4.000000  8.000000
4  5.000000  8.999983
The IterativeImputer is a powerful tool that models each feature with missing values as a function of other features. It essentially treats imputation as a regression problem, making it highly effective when values are missing across multiple columns that are correlated.
- It works in rounds. The max_iter=10 parameter sets the number of cycles it runs to refine its predictions for the missing values.
- Since this is an experimental feature in scikit-learn, you must first import enable_iterative_imputer to use it.
Time series interpolation methods
import pandas as pd
import numpy as np
date_rng = pd.date_range(start='2023-01-01', end='2023-01-10', freq='D')
df = pd.DataFrame(date_rng, columns=['date'])
df['value'] = [10, 11, np.nan, np.nan, 14, 15, np.nan, 17, 18, 19]
df.set_index('date', inplace=True)
df_interp = df.interpolate(method='cubic')
print(df_interp)
--OUTPUT--
            value
date
2023-01-01  10.00
2023-01-02  11.00
2023-01-03  11.87
2023-01-04  12.93
2023-01-05  14.00
2023-01-06  15.00
2023-01-07  16.00
2023-01-08  17.00
2023-01-09  18.00
2023-01-10  19.00
For time-series data, interpolation offers a more refined approach than simple forward or backward filling. The interpolate() method treats your data points as part of a continuous function, allowing it to estimate missing values based on the overall trend.
- In this example, method='cubic' fits a smooth curve through the known data points to fill the gaps.
- This creates more natural-looking imputations, especially when the data has a clear curve or pattern. It's a step up from simpler methods like 'linear', which just connects points with a straight line.
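For comparison, here is a minimal sketch of the default 'linear' method, which simply draws a straight line across each gap:

```python
import pandas as pd
import numpy as np

s = pd.Series([10.0, np.nan, np.nan, 16.0])

# 'linear' connects the surrounding known points with a straight line
s_linear = s.interpolate(method='linear')
print(s_linear)  # the gap becomes 12.0 and 14.0
```

The two missing values land exactly on the line between 10.0 and 16.0, which is fine for roughly linear trends but flattens out any curvature that 'cubic' would capture.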
Move faster with Replit
Replit is an AI-powered development platform where all Python dependencies come pre-installed, so you can skip setup and start coding instantly. This lets you move from learning individual techniques to building complete applications faster.
Instead of just piecing together methods, you can use Agent 4 to build a full application from a simple description. Agent handles the coding, database connections, APIs, and deployment for you, taking your idea to a working product. For example:
- A financial dashboard that uses time-series interpolation to fill missing stock prices for smooth, continuous charts.
- A data cleaning utility that takes a CSV and applies KNNImputer to intelligently fill gaps based on surrounding data points.
- A sales forecasting tool that uses ffill() and bfill() to handle missing daily sales figures before generating a predictive report.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Even with powerful tools, you might run into a few common pitfalls when imputing data, but they're easy to navigate with the right approach.
Forgetting to set inplace=True when using fillna()
A frequent mistake is calling a method like df.fillna(0) and assuming your original DataFrame has changed. By default, most pandas operations return a new DataFrame, leaving the original untouched. To modify the DataFrame directly, you must specify inplace=True.
Alternatively, and often more safely, you can assign the modified DataFrame to a new variable, like df_filled = df.fillna(0). This preserves your original data while you work with the imputed version.
Handling dtype conversion after imputation
When a column of integers contains missing values, pandas converts its data type (dtype) to float to accommodate the NaN values. After imputation, the column may remain a float even if it now only contains whole numbers, which can cause problems for models expecting integers. You can easily fix this by converting the column back to its intended type using the astype() method, such as df['column'] = df['column'].astype(int).
Imputing missing values in mixed numeric and categorical data
Applying a single imputation strategy across a DataFrame with both numeric and categorical columns often leads to errors. You can't calculate the mean of a text column, and using the mode might not be appropriate for continuous numerical data. The best practice is to handle different column types separately. You can either impute them one by one or use scikit-learn's ColumnTransformer to apply different strategies—like 'mean' for numbers and 'most_frequent' for categories—in a single, clean step.
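As a sketch of the ColumnTransformer approach (the column names here are hypothetical), you might wire up one SimpleImputer per column type:

```python
import pandas as pd
import numpy as np
from sklearn.compose import ColumnTransformer
from sklearn.impute import SimpleImputer

# Hypothetical mixed-type DataFrame
df = pd.DataFrame({'age': [25, np.nan, 40, 31],
                   'city': ['Paris', 'London', np.nan, 'Paris']})

# Apply a different imputation strategy to each column type in one step
preprocessor = ColumnTransformer([
    ('num', SimpleImputer(strategy='mean'), ['age']),
    ('cat', SimpleImputer(strategy='most_frequent'), ['city']),
])
imputed = pd.DataFrame(preprocessor.fit_transform(df),
                       columns=['age', 'city'])
print(imputed)
```

The missing age gets the column mean (32.0) and the missing city gets the most frequent value ('Paris'), all in a single fit_transform() call that slots cleanly into a larger pipeline.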
Forgetting to set inplace=True when using fillna()
It's a classic pandas "gotcha": you run fillna() to fix missing data, but the NaN values stubbornly remain. This is because the method returns a new DataFrame by default, leaving the original unchanged. The code below illustrates this common pitfall.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})
# This doesn't modify the original dataframe
df.fillna(0)
print(df) # Still contains NaN values
Because the result of df.fillna(0) isn't assigned to a variable or applied in place, the print() function shows the original, unchanged DataFrame. The code below shows how to make the changes stick.
import pandas as pd
import numpy as np
df = pd.DataFrame({'A': [1, 2, np.nan, 4], 'B': [5, np.nan, np.nan, 8]})
# Either use inplace=True or assign the result back
df.fillna(0, inplace=True)
print(df) # NaN values are replaced
The fix is to either modify the DataFrame directly or reassign it. The code shows the first option by setting inplace=True, which tells pandas to apply the changes to the original DataFrame instead of creating a new one. This ensures the NaN values are permanently replaced. The alternative, reassigning the result to a variable, is also a common and safe practice that preserves your original data for comparison or other operations.
Handling dtype conversion after imputation
To handle missing values, pandas often converts integer columns to floats. After imputation, this data type might not revert, potentially causing issues with models that expect integers. The code below demonstrates how a column remains a float even after filling NaNs.
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, np.nan, 40]})
df['value'] = df['value'].fillna(0)
print(df.dtypes) # value is float64, not int64
The value column is now complete, but its data type remains float64 to accommodate the original NaN value. This can cause issues for models expecting integers. The following code shows how to convert it back.
import pandas as pd
import numpy as np
df = pd.DataFrame({'id': [1, 2, 3, 4],
                   'value': [10, 20, np.nan, 40]})
df['value'] = df['value'].fillna(0).astype('int64')
print(df.dtypes) # value is now int64
The fix is simple: just chain the astype('int64') method after filling the missing values. This command tells pandas to convert the column back to an integer type right after the imputation is complete. It's a crucial step because many machine learning models are strict about their input data types. Always double-check your column dtypes after imputation to ensure they're what you expect.
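An alternative worth knowing: pandas' nullable 'Int64' dtype (note the capital I) can hold missing values without ever converting to float, so no round trip is needed. A minimal sketch:

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({'value': [10, 20, np.nan, 40]})

# Convert to the nullable integer dtype; NaN becomes pd.NA
df['value'] = df['value'].astype('Int64')

# Impute as usual; the column stays an integer type throughout
df['value'] = df['value'].fillna(0)
print(df['value'].dtype)  # Int64
```

One caveat: some libraries expect plain numpy dtypes, so check that your downstream tools accept the nullable 'Int64' type before adopting it.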
Imputing missing values in mixed numeric and categorical data
Applying a single imputation strategy across a DataFrame with both numbers and text is a common mistake. You can't calculate the mean of a categorical column, which often leads to incomplete imputation and leaves some NaN values behind.
The following code shows what happens when you fill a mixed-type DataFrame using only its numeric column means. Notice how the numeric column is filled, but the categorical column remains untouched.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'number': [1, 2, np.nan, 4],
    'category': ['red', 'blue', np.nan, 'green']
})
df = df.fillna(df.mean(numeric_only=True))
print(df) # category column will still have NaN
Since df.mean(numeric_only=True) only produces values for numeric columns, fillna() has nothing to substitute into the category column and leaves its NaN behind. (In recent pandas versions, calling df.mean() without numeric_only=True on mixed data raises a TypeError outright.) The following code demonstrates how to correctly impute each column type.
import pandas as pd
import numpy as np
df = pd.DataFrame({
    'number': [1, 2, np.nan, 4],
    'category': ['red', 'blue', np.nan, 'green']
})
df['number'] = df['number'].fillna(df['number'].mean())
df['category'] = df['category'].fillna(df['category'].mode()[0])
print(df) # All NaN values are replaced
The fix is to handle each column type individually. For the number column, you can fill missing values with the mean(). For the category column, the mode()—the most frequent value—is the right choice. This targeted approach ensures all NaN values are correctly replaced without causing errors. It's a crucial step whenever your dataset contains a mix of data types, a common scenario in real-world data analysis.
Real-world applications
Beyond just fixing errors, these imputation techniques are essential for preparing real-world datasets, from customer transactions to medical records.
Handling missing values in customer transaction data
When working with customer transaction data loaded from CSV files in Python, missing values in columns like purchase_amount or items_purchased can be reliably filled using statistical methods appropriate for each data type.
import pandas as pd
import numpy as np
# Sample customer transaction data
transactions = pd.DataFrame({
    'customer_id': [101, 102, 103, 101, 104, 105],
    'purchase_amount': [250, np.nan, 120, 300, np.nan, 180],
    'items_purchased': [3, 2, np.nan, 4, 1, 2]
})
# Fill missing purchase amounts with median (more robust for financial data)
transactions['purchase_amount'] = transactions['purchase_amount'].fillna(transactions['purchase_amount'].median())
transactions['items_purchased'] = transactions['items_purchased'].fillna(transactions['items_purchased'].mode()[0])
print(transactions)
This example shows how to selectively impute missing data in a pandas DataFrame. Since a single strategy rarely fits all columns, the code applies different methods to each.
- For the purchase_amount column, it uses fillna() with the column's median().
- For items_purchased, it fills missing entries with the mode(), which is the most frequent value.
This targeted approach ensures that the imputation method is suitable for the specific characteristics of each data column, leading to a more reliable dataset.
Preprocessing medical records with IterativeImputer
For complex datasets like medical records, you can enhance IterativeImputer by pairing it with a powerful model like a random forest to fill in missing values with greater precision.
import pandas as pd
import numpy as np
from sklearn.experimental import enable_iterative_imputer
from sklearn.impute import IterativeImputer
from sklearn.ensemble import RandomForestRegressor
# Sample medical dataset with missing values
patient_data = pd.DataFrame({
    'age': [45, 52, np.nan, 36, 61, 42],
    'blood_pressure': [120, 135, 118, np.nan, 142, 131],
    'cholesterol': [200, np.nan, 240, 180, np.nan, 210],
    'glucose': [np.nan, 110, 125, 95, 130, 105]
})
# Use advanced iterative imputation with Random Forest
imputer = IterativeImputer(estimator=RandomForestRegressor(n_estimators=10),
                           random_state=42)
patient_data_imputed = pd.DataFrame(imputer.fit_transform(patient_data),
                                    columns=patient_data.columns)
print(patient_data_imputed)
This code tackles missing medical data by using IterativeImputer with a custom estimator. Instead of a default model, it's configured to use a RandomForestRegressor. This approach is powerful because it models complex, non-linear relationships between features to predict missing values with greater accuracy.
- The estimator=RandomForestRegressor(...) argument tells the imputer to use a random forest model for its predictions.
- The fit_transform() method then runs this iterative process, filling in the gaps based on what the model learns from the complete data in other columns.
Get started with Replit
Now, turn these techniques into a real tool. Describe what you want to build to Replit Agent, like "a data cleaning utility that uses KNNImputer" or "a dashboard that interpolates missing time-series data."
Replit Agent writes the code, tests for errors, and deploys your application directly from your browser. Start building with Replit.
Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.