How to convert categorical data to numerical data in Python
Learn how to convert categorical data to numerical data in Python. Explore different methods, tips, real-world applications, and common errors.

The conversion of categorical data to a numerical format is a fundamental task in data science. Machine learning models require numerical inputs, which makes this preprocessing step essential for building effective algorithms.
In this article, we'll walk you through several conversion techniques. We'll also provide practical tips, discuss real-world applications, and offer debugging advice to help you select the right approach for your specific needs.
Using LabelEncoder for basic categorical conversion
from sklearn.preprocessing import LabelEncoder
categories = ['red', 'green', 'blue', 'red', 'green']
encoder = LabelEncoder()
encoded_data = encoder.fit_transform(categories)
print(f"Original: {categories}\nEncoded: {encoded_data}")
--OUTPUT--
Original: ['red', 'green', 'blue', 'red', 'green']
Encoded: [2 1 0 2 1]
The LabelEncoder from scikit-learn offers a simple way to convert categorical labels into integers. The fit_transform() method first identifies all unique categories in your data and then assigns a distinct number to each one. This single step both learns the mapping and applies it.
In this case, the encoder maps the colors to numbers alphabetically:
- 'blue' becomes 0
- 'green' becomes 1
- 'red' becomes 2
This creates an ordinal relationship (0 < 1 < 2), which can imply a ranking that doesn't exist. It's a key detail to remember, as this can influence how some machine learning models interpret the data.
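You can inspect this learned mapping directly through the encoder's classes_ attribute, which stores the unique categories in the sorted order that determines their codes:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['red', 'green', 'blue'])

# classes_ holds the learned categories in sorted (alphabetical) order;
# each category's encoded value is simply its index in this array.
print(encoder.classes_)            # ['blue' 'green' 'red']
print(encoder.transform(['red']))  # [2]
```

Checking classes_ is also a quick way to confirm which integer a category received before feeding the encoded column to a model.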
Basic encoding methods
If the simple integer mapping from LabelEncoder doesn't fit your needs, other methods offer more control for different kinds of categorical data.
Using pandas get_dummies() for one-hot encoding
import pandas as pd
data = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})
one_hot = pd.get_dummies(data['color'], prefix='color')
print(one_hot)
--OUTPUT--
   color_blue  color_green  color_red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1
4           0            1          0
The pandas get_dummies() function implements a technique called one-hot encoding in Python. It converts each category into a new column and uses a 1 or 0 to show whether the category is present for a given row. This method is perfect for nominal data where no category is ranked higher than another.
This approach avoids the false ranking created by LabelEncoder. Instead, it creates a simple binary flag:
- A 1 indicates the category is present.
- A 0 indicates it is not.
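One practical refinement: the one-hot columns are redundant for linear models, since any one column can be inferred from the others. The get_dummies() function accepts a drop_first=True parameter that removes the first (alphabetical) category's column:

```python
import pandas as pd

data = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})

# drop_first=True drops the 'blue' column; a row of all zeros
# then implicitly represents 'blue'.
reduced = pd.get_dummies(data['color'], prefix='color', drop_first=True)
print(reduced.columns.tolist())  # ['color_green', 'color_red']
```

This keeps the encoding lossless while avoiding perfectly collinear features.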
Mapping with dictionaries for manual encoding
import pandas as pd
data = pd.DataFrame({'size': ['small', 'medium', 'large', 'small', 'large']})
size_mapping = {'small': 1, 'medium': 2, 'large': 3}
data['size_encoded'] = data['size'].map(size_mapping)
print(data)
--OUTPUT--
     size  size_encoded
0   small             1
1  medium             2
2   large             3
3   small             1
4   large             3
For ordinal data where a clear order exists, manual encoding with a dictionary gives you complete control. You define the numerical value for each category yourself, which is perfect for preserving a meaningful sequence.
- First, create a dictionary that maps each category string to a specific integer.
- Then, use the pandas map() function on your data column to apply this custom mapping.
This approach is ideal when you need to preserve an inherent ranking, like small, medium, and large.
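One caveat with this approach: map() silently returns NaN for any value missing from the dictionary rather than raising an error. A small sketch of the pitfall, using a hypothetical unmapped 'xl' value:

```python
import pandas as pd

data = pd.DataFrame({'size': ['small', 'medium', 'xl']})
size_mapping = {'small': 1, 'medium': 2, 'large': 3}

# 'xl' is not in the mapping, so map() silently produces NaN for it.
data['size_encoded'] = data['size'].map(size_mapping)
print(data['size_encoded'].isna().sum())  # 1

# A sentinel value makes the gap explicit instead of silent.
data['size_encoded'] = data['size'].map(size_mapping).fillna(-1)
print(data['size_encoded'].tolist())  # [1.0, 2.0, -1.0]
```

Checking for NaN after mapping is a cheap safeguard against typos or unexpected categories in new data.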
Using OrdinalEncoder for ordered categories
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
data = np.array([['low'], ['medium'], ['high'], ['low'], ['high']])
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
encoded_data = encoder.fit_transform(data)
print(f"Original:\n{data}\nEncoded:\n{encoded_data}")
--OUTPUT--
Original:
[['low']
['medium']
['high']
['low']
['high']]
Encoded:
[[0.]
[1.]
[2.]
[0.]
[2.]]
Scikit-learn's OrdinalEncoder is built for data with a clear ranking. Unlike LabelEncoder, it lets you define the order yourself. You pass a list of categories in your desired sequence to the categories parameter, which ensures the resulting numbers reflect the intended hierarchy.
- 'low' is mapped to 0.
- 'medium' is mapped to 1.
- 'high' is mapped to 2.
This method is more robust than manual mapping and integrates seamlessly into scikit-learn pipelines.
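To illustrate that pipeline integration, here's a minimal sketch that chains the encoder with a classifier; the tiny labels are invented purely for demonstration:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression

X = np.array([['low'], ['medium'], ['high'], ['low'], ['high']])
y = np.array([0, 0, 1, 0, 1])  # made-up target for illustration

# The encoder runs automatically before the model, at both fit and predict time.
pipe = Pipeline([
    ('encode', OrdinalEncoder(categories=[['low', 'medium', 'high']])),
    ('model', LogisticRegression()),
])
pipe.fit(X, y)
print(pipe.predict([['high']]))
```

Because the encoding lives inside the pipeline, you can't accidentally apply a different mapping to training and inference data.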
Advanced encoding techniques
When basic methods fall short, these advanced techniques can help you handle high-cardinality features and extract more predictive information from your categorical data.
Target encoding for predictive power
import category_encoders as ce
import pandas as pd
X = pd.DataFrame({'category': ['A', 'B', 'A', 'C', 'B']})
y = pd.Series([1, 0, 1, 1, 0]) # Target variable
encoder = ce.TargetEncoder()
encoded = encoder.fit_transform(X, y)
print(encoded)
--OUTPUT--
   category
0  1.000000
1  0.000000
2  1.000000
3  1.000000
4  0.000000
Target encoding replaces each category with the average value of the target variable. This technique, implemented with TargetEncoder from the category_encoders library, is powerful because it directly uses the predictive signal from your target y. The fit_transform method calculates this mean for each category in your feature set X.
- Category A is replaced with 1.0, the average of its corresponding target values.
- Category B becomes 0.0.
- Category C becomes 1.0.
This method is effective but be careful—it can lead to overfitting since it uses the target data during training.
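A common mitigation is smoothing: blend each category's mean with the global target mean so that rare categories don't simply memorize their few target values. Here's a minimal pandas-only sketch of that idea, with a hypothetical prior weight m (the category_encoders TargetEncoder exposes a similar smoothing parameter):

```python
import pandas as pd

X = pd.DataFrame({'category': ['A', 'B', 'A', 'C', 'B']})
y = pd.Series([1, 0, 1, 1, 0])

# Smoothed encoding: weight each category's mean by its count, and the
# global mean (0.6 here) by the prior strength m.
m = 2.0
global_mean = y.mean()
stats = y.groupby(X['category']).agg(['mean', 'count'])
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
print(X['category'].map(smoothed))
```

With m = 2, category A moves from 1.0 to 0.8 and the single-row category C from 1.0 to about 0.73, so categories with little evidence are pulled toward the global average.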
Binary encoding for high-cardinality features
import category_encoders as ce
import pandas as pd
data = pd.DataFrame({'product_id': ['P001', 'P002', 'P003', 'P004', 'P005']})
encoder = ce.BinaryEncoder(cols=['product_id'])
binary_encoded = encoder.fit_transform(data)
print(binary_encoded)
--OUTPUT--
   product_id_0  product_id_1  product_id_2
0             1             0             0
1             0             1             0
2             1             1             0
3             0             0             1
4             1             0             1
Binary encoding offers a memory-efficient way to handle high-cardinality features. The BinaryEncoder from the category_encoders library is a great middle ground, especially for columns with many unique values like product_id.
- First, it converts each category into an integer.
- Next, it represents that integer in binary format.
- Finally, it splits the binary digits into new columns.
This approach creates far fewer columns than one-hot encoding while avoiding the artificial ranking of simple integer mapping.
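To see why this matters at scale, compare the column counts. Binary encoding needs only about log2(n) columns for n categories; the exact count can vary slightly by implementation, so treat these figures as approximate:

```python
import math

# Approximate columns needed to give each of n categories a distinct code.
# The n + 1 leaves one spare pattern (e.g. for unknown categories).
for n in [10, 100, 1000, 10000]:
    binary_cols = math.ceil(math.log2(n + 1))
    print(f"{n} categories -> one-hot: {n} columns, binary: {binary_cols} columns")
```

A feature with 10,000 unique values needs 10,000 one-hot columns but only about 14 binary ones, which is why this method suits IDs and other high-cardinality fields.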
Embedding categorical features with neural networks
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten
from tensorflow.keras.models import Model
vocab_size = 5 # Number of unique categories
embedding_dim = 2 # Dimension of embedding space
input_layer = Input(shape=(1,))
embedding = Embedding(vocab_size, embedding_dim)(input_layer)
flatten = Flatten()(embedding)
model = Model(inputs=input_layer, outputs=flatten)
# Example of using the model for category indices [0, 1, 2]
result = model.predict([[0], [1], [2]])
print(result)
--OUTPUT--
[[-0.04277068  0.03056544]
[ 0.00158245 -0.00495145]
[ 0.01470431 -0.01129332]]
For complex categorical data, neural network embeddings offer a powerful solution. This technique learns a dense, multi-dimensional vector representation for each category, which can capture nuanced relationships that other methods miss.
- The Keras Embedding layer is the core of this approach. It maps each category's integer index to a vector of a size you define with embedding_dim.
- During training, the model learns the optimal values for these vectors, effectively discovering the semantic similarities between categories on its own.
Move faster with Replit
Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. You don't have to worry about configuring environments or installing packages to get started.
While the techniques in this article are powerful building blocks, Agent 4 helps you move from learning individual methods to building complete applications. It's a tool that takes your app description and handles the coding, database connections, API integrations, and deployment for you.
Instead of manually piecing together different encoding methods, you can describe the final product you want and let Agent 4 build it:
- A survey analysis tool that automatically converts text responses like 'low', 'medium', and 'high' into a numerical scale for charting.
- A sales forecasting utility that prepares product data for machine learning by converting categories like 'electronics' or 'apparel' into a one-hot encoded format.
- A data preprocessing app that applies target encoding to features like city or zip code to improve a model's predictive power.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Even with the right tools, you can run into tricky situations that affect your model's performance and reliability.
Handling unseen categories with LabelEncoder
The LabelEncoder is straightforward, but it has a blind spot: new categories. It learns a mapping from the data it's trained on, so if a category appears in your test data that wasn't in your training data, the encoder will raise an error because it doesn't know what to do. To prevent this, you can either ensure all possible categories are present in your training set or use a more robust method like one-hot encoding, which handles this scenario more gracefully.
Dealing with NaN values in categorical encoding
Missing values, often represented as NaN, can trip up many encoding functions. By default, encoders might ignore them or raise an error. Pandas' get_dummies() function, for example, will ignore NaN values unless you set the dummy_na=True parameter. Enabling this creates a dedicated column for missing values, which can be a useful feature for your model to learn from.
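A quick sketch of the difference the flag makes, using a small made-up series:

```python
import pandas as pd

colors = pd.Series(['red', 'green', None, 'red'])

# Default behavior: the missing row becomes all zeros, with no column of its own.
default_dummies = pd.get_dummies(colors)

# dummy_na=True adds a dedicated indicator column for missing values.
na_dummies = pd.get_dummies(colors, dummy_na=True)
print(default_dummies.shape, na_dummies.shape)  # (4, 2) (4, 3)
```

With the default, missingness is only encoded implicitly as an all-zero row; the extra column makes it an explicit signal the model can learn from.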
Avoiding data leakage in get_dummies() encoding
Data leakage is a subtle but serious issue where information from your test set accidentally influences your training process, giving you a false sense of your model's accuracy. With get_dummies(), this can happen if you encode the entire dataset before splitting it. The correct approach is to split your data first, then fit the encoder only on the training data and use it to transform both the training and testing sets.
Handling unseen categories with LabelEncoder
This is a common pitfall with LabelEncoder. Because the encoder is fitted only on the training set, it's unprepared for new values in your test data. The code below triggers the exact ValueError you'll get when this happens.
from sklearn.preprocessing import LabelEncoder
# Training data
train_colors = ['red', 'green', 'blue']
encoder = LabelEncoder()
encoder.fit(train_colors)
# Testing with new category
test_colors = ['yellow', 'red', 'blue']
try:
    encoded_test = encoder.transform(test_colors)
    print(encoded_test)
except ValueError as e:
    print(f"Error: {e}")
The encoder is fitted only on the training data, so when transform() encounters the new category 'yellow' in the test set, it raises a ValueError. The following code demonstrates a way to handle this scenario.
from sklearn.preprocessing import LabelEncoder
import numpy as np
# Training data
train_colors = ['red', 'green', 'blue']
encoder = LabelEncoder()
encoder.fit(train_colors)
# Testing with new category - with error handling
test_colors = ['yellow', 'red', 'blue']
def transform_with_unknown(encoder, data):
    known_categories = set(encoder.classes_)
    result = []
    for item in data:
        if item in known_categories:
            result.append(encoder.transform([item])[0])
        else:
            result.append(-1)  # Use -1 for unknown categories
    return np.array(result)
encoded_test = transform_with_unknown(encoder, test_colors)
print(encoded_test)
This solution manually handles new categories by checking if an item exists in the encoder's known classes (encoder.classes_) before transformation. If a category is unknown, the function assigns a placeholder value like -1 instead of crashing. It's a crucial workaround when deploying models, as real-world data often introduces new values that weren't present during training. This ensures your application remains stable when encountering unexpected inputs.
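If your feature fits scikit-learn's 2D column format, OrdinalEncoder offers a built-in version of this workaround: its handle_unknown='use_encoded_value' option maps any unseen category to a sentinel you choose. A brief sketch:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train_colors = np.array([['red'], ['green'], ['blue']])
test_colors = np.array([['yellow'], ['red']])

# Any category not seen during fit is mapped to unknown_value
# instead of raising a ValueError.
safe_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
safe_encoder.fit(train_colors)
print(safe_encoder.transform(test_colors))  # yellow -> -1.0, red -> 2.0
```

This achieves the same stability as the manual helper without custom code.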
Dealing with NaN values in categorical encoding
Missing values are a common headache in data preprocessing. Encoders like scikit-learn's LabelEncoder aren't designed to handle None or NaN values out of the box, which can halt your workflow with an unexpected error. The following code demonstrates this exact problem.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Dataset with missing values
data = pd.DataFrame({'category': ['A', 'B', None, 'C', 'B']})
encoder = LabelEncoder()
try:
    data['encoded'] = encoder.fit_transform(data['category'])
    print(data)
except TypeError as e:
    print(f"Error: {e}")
The fit_transform() method expects a consistent data type, but the presence of None alongside strings in the column triggers a TypeError. The following code shows how to properly prepare the data to prevent this error.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Dataset with missing values
data = pd.DataFrame({'category': ['A', 'B', None, 'C', 'B']})
# Fill NaN values before encoding
data['category_filled'] = data['category'].fillna('MISSING')
encoder = LabelEncoder()
data['encoded'] = encoder.fit_transform(data['category_filled'])
print(data)
The solution is to preprocess your data before encoding. By using the pandas fillna('MISSING') method, you replace any NaN values with a placeholder string. This ensures the column contains only strings, allowing LabelEncoder to run without a TypeError. The encoder then treats 'MISSING' as a distinct category and assigns it its own numerical value. This is a crucial step when your dataset might contain incomplete entries.
Avoiding data leakage in get_dummies() encoding
Using get_dummies() separately on your training and test sets can lead to a column mismatch, a subtle form of data leakage. This happens when one dataset contains categories not present in the other, breaking your model's ability to make predictions. The following code demonstrates how this misalignment occurs.
import pandas as pd
# Training data
train_df = pd.DataFrame({'color': ['red', 'green', 'blue']})
train_encoded = pd.get_dummies(train_df['color'], prefix='color')
# Test data with different categories
test_df = pd.DataFrame({'color': ['red', 'yellow', 'orange']})
test_encoded = pd.get_dummies(test_df['color'], prefix='color')
# Problem: Different columns in train and test
print("Train columns:", train_encoded.columns.tolist())
print("Test columns:", test_encoded.columns.tolist())
The training data's columns don't match the test data's because get_dummies() was applied independently. This structural difference will break your model during prediction. The following code demonstrates the correct way to handle this.
import pandas as pd
# Training data
train_df = pd.DataFrame({'color': ['red', 'green', 'blue']})
train_encoded = pd.get_dummies(train_df['color'], prefix='color')
# Test data with different categories
test_df = pd.DataFrame({'color': ['red', 'yellow', 'orange']})
test_encoded = pd.get_dummies(test_df['color'], prefix='color')
# Align test data with training data columns
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print("Train columns:", train_encoded.columns.tolist())
print("Test aligned columns:", test_aligned.columns.tolist())
The solution is to align the test data's columns with the training data's. After applying get_dummies() to your test set, use the reindex() method. Pass the training set's columns to reindex() and set fill_value=0 to ensure both DataFrames have the same structure. This is crucial when deploying a model, as it guarantees your feature sets are consistent and prevents errors when the model encounters data with slightly different categories than it was trained on.
Real-world applications
Beyond the technical methods and common errors, these encoding techniques are the backbone of many practical machine learning applications. When building these applications with vibe coding, you can quickly prototype and iterate on your data preprocessing pipeline.
Using LabelEncoder for customer segmentation
In marketing, LabelEncoder is a straightforward way to convert descriptive customer data, like education level, into numbers that a clustering algorithm can use to identify segments.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
customer_data = pd.DataFrame({
    'education': ['high school', 'bachelor', 'master', 'bachelor', 'phd'],
    'marital_status': ['single', 'married', 'divorced', 'single', 'married']
})
encoder = LabelEncoder()
customer_data['edu_encoded'] = encoder.fit_transform(customer_data['education'])
customer_data['marital_encoded'] = encoder.fit_transform(customer_data['marital_status'])
kmeans = KMeans(n_clusters=2, random_state=0)
customer_data['segment'] = kmeans.fit_predict(customer_data[['edu_encoded', 'marital_encoded']])
print(customer_data)
This code snippet demonstrates a two-step process for grouping data using k-means clustering in Python. It prepares categorical features for a clustering algorithm by first converting them into a numerical format.
- The LabelEncoder transforms the string values in the education and marital_status columns into integers.
- Next, the KMeans algorithm uses these new numerical columns to sort each customer into one of two groups, or clusters.
The final output adds a segment column that contains the cluster assignment for each customer.
Encoding text data for sentiment analysis
To classify text reviews, you first need to convert the words into numerical features and the sentiment labels into integers that a model can process.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
reviews = pd.DataFrame({
    'text': ['love this product', 'terrible experience', 'good value', 'disappointing'],
    'sentiment': ['positive', 'negative', 'positive', 'negative']
})
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews['text'])
encoder = LabelEncoder()
y = encoder.fit_transform(reviews['sentiment'])
clf = MultinomialNB().fit(X, y)
new_review = ["pretty good product"]
prediction = clf.predict(vectorizer.transform(new_review))
print(f"Review: {new_review[0]}")
print(f"Predicted sentiment: {encoder.inverse_transform(prediction)[0]}")
This code builds a basic sentiment analysis model in Python. It uses CountVectorizer to transform the raw text of reviews into a numerical format based on word counts. At the same time, LabelEncoder converts the text labels, like 'positive' and 'negative', into integers that the model can understand.
- A MultinomialNB classifier is trained using the numerical text data and the encoded labels.
- The trained model then predicts the sentiment of a new review by first converting it with the same vectorizer.
- Finally, inverse_transform converts the numerical prediction back into a readable label.
Get started with Replit
Turn these techniques into a working tool. Describe what you want to build to Replit Agent, like “a tool to one-hot encode a CSV column” or “an app that converts survey ratings to a numerical scale.”
Replit Agent takes your description and writes the code, tests for errors, and deploys the app. It builds the complete tool for you. Start building with Replit.
Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.



