How to convert categorical data to numerical data in Python
Learn how to convert categorical data to numerical data in Python. Explore different methods, tips, real-world applications, and common errors.

The conversion of categorical data to a numerical format is a fundamental task in data science. Machine learning models require numerical inputs, which makes this preprocessing step essential for building effective algorithms.
In this article, we'll walk you through several conversion techniques. We'll also provide practical tips, discuss real-world applications, and offer debugging advice to help you select the right approach for your specific needs.
Using LabelEncoder for basic categorical conversion
from sklearn.preprocessing import LabelEncoder
categories = ['red', 'green', 'blue', 'red', 'green']
encoder = LabelEncoder()
encoded_data = encoder.fit_transform(categories)
print(f"Original: {categories}\nEncoded: {encoded_data}")
--OUTPUT--
Original: ['red', 'green', 'blue', 'red', 'green']
Encoded: [2 1 0 2 1]
The LabelEncoder from scikit-learn offers a simple way to convert categorical labels into integers. The fit_transform() method first identifies all unique categories in your data and then assigns a distinct number to each one. This single step both learns the mapping and applies it.
In this case, the encoder maps the colors to numbers alphabetically:
- 'blue' becomes 0
- 'green' becomes 1
- 'red' becomes 2
This creates an ordinal relationship (0 < 1 < 2), which can imply a ranking that doesn't exist. It's a key detail to remember, as this can influence how some machine learning models interpret the data.
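You can inspect this learned mapping directly through the encoder's classes_ attribute, which stores the unique categories in the sorted order that determines their codes:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
encoder.fit(['red', 'green', 'blue'])

# classes_ holds the learned categories in sorted (alphabetical) order;
# each category's encoded value is simply its index in this array.
print(encoder.classes_)            # ['blue' 'green' 'red']
print(encoder.transform(['red']))  # [2]
```

Checking classes_ is also a quick way to confirm which integer a category received before feeding the encoded column to a model.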
Basic encoding methods
If the simple integer mapping from LabelEncoder doesn't fit your needs, other methods offer more control for different kinds of categorical data.
Using pandas get_dummies() for one-hot encoding
import pandas as pd
data = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})
one_hot = pd.get_dummies(data['color'], prefix='color')
print(one_hot)
--OUTPUT--
   color_blue  color_green  color_red
0           0            0          1
1           0            1          0
2           1            0          0
3           0            0          1
4           0            1          0
The pandas get_dummies() function implements a technique called one-hot encoding in Python. It converts each category into a new column and uses a 1 or 0 to show whether the category is present for a given row. This method is perfect for nominal data where no category is ranked higher than another.
This approach avoids the false ranking created by LabelEncoder. Instead, it creates a simple binary flag:
- A 1 indicates the category is present.
- A 0 indicates it is not.
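One practical refinement: the one-hot columns are redundant for linear models, since any one column can be inferred from the others. The get_dummies() function accepts a drop_first=True parameter that removes the first (alphabetical) category's column:

```python
import pandas as pd

data = pd.DataFrame({'color': ['red', 'green', 'blue', 'red', 'green']})

# drop_first=True drops the 'blue' column; a row of all zeros
# then implicitly represents 'blue'.
reduced = pd.get_dummies(data['color'], prefix='color', drop_first=True)
print(reduced.columns.tolist())  # ['color_green', 'color_red']
```

This keeps the encoding lossless while avoiding perfectly collinear features.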
Mapping with dictionaries for manual encoding
import pandas as pd
data = pd.DataFrame({'size': ['small', 'medium', 'large', 'small', 'large']})
size_mapping = {'small': 1, 'medium': 2, 'large': 3}
data['size_encoded'] = data['size'].map(size_mapping)
print(data)
--OUTPUT--
     size  size_encoded
0   small             1
1  medium             2
2   large             3
3   small             1
4   large             3
For ordinal data where a clear order exists, manual encoding with a dictionary gives you complete control. You define the numerical value for each category yourself, which is perfect for preserving a meaningful sequence.
- First, create a dictionary that maps each category string to a specific integer.
- Then, use the pandas map() function on your data column to apply this custom mapping.
This approach is ideal when you need to preserve an inherent ranking, like small, medium, and large.
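One caveat with this approach: map() silently returns NaN for any value missing from the dictionary rather than raising an error. A small sketch of the pitfall, using a hypothetical unmapped 'xl' value:

```python
import pandas as pd

data = pd.DataFrame({'size': ['small', 'medium', 'xl']})
size_mapping = {'small': 1, 'medium': 2, 'large': 3}

# 'xl' is not in the mapping, so map() silently produces NaN for it.
data['size_encoded'] = data['size'].map(size_mapping)
print(data['size_encoded'].isna().sum())  # 1

# A sentinel value makes the gap explicit instead of silent.
data['size_encoded'] = data['size'].map(size_mapping).fillna(-1)
print(data['size_encoded'].tolist())  # [1.0, 2.0, -1.0]
```

Checking for NaN after mapping is a cheap safeguard against typos or unexpected categories in new data.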
Using OrdinalEncoder for ordered categories
from sklearn.preprocessing import OrdinalEncoder
import numpy as np
data = np.array([['low'], ['medium'], ['high'], ['low'], ['high']])
encoder = OrdinalEncoder(categories=[['low', 'medium', 'high']])
encoded_data = encoder.fit_transform(data)
print(f"Original:\n{data}\nEncoded:\n{encoded_data}")
--OUTPUT--
Original:
[['low']
['medium']
['high']
['low']
['high']]
Encoded:
[[0.]
[1.]
[2.]
[0.]
[2.]]
Scikit-learn's OrdinalEncoder is built for data with a clear ranking. Unlike LabelEncoder, it lets you define the order yourself. You pass a list of categories in your desired sequence to the categories parameter, which ensures the resulting numbers reflect the intended hierarchy.
- 'low' is mapped to 0.
- 'medium' is mapped to 1.
- 'high' is mapped to 2.
This method is more robust than manual mapping and integrates seamlessly into scikit-learn pipelines.
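To illustrate that pipeline integration, here's a minimal sketch that chains the encoder with a classifier; the tiny labels are invented purely for demonstration:

```python
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LogisticRegression

X = np.array([['low'], ['medium'], ['high'], ['low'], ['high']])
y = np.array([0, 0, 1, 0, 1])  # made-up target for illustration

# The encoder runs automatically before the model, at both fit and predict time.
pipe = Pipeline([
    ('encode', OrdinalEncoder(categories=[['low', 'medium', 'high']])),
    ('model', LogisticRegression()),
])
pipe.fit(X, y)
print(pipe.predict([['high']]))
```

Because the encoding lives inside the pipeline, you can't accidentally apply a different mapping to training and inference data.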
Advanced encoding techniques
When basic methods fall short, these advanced techniques can help you handle high-cardinality features and extract more predictive information from your categorical data.
Target encoding for predictive power
import category_encoders as ce
import pandas as pd
X = pd.DataFrame({'category': ['A', 'B', 'A', 'C', 'B']})
y = pd.Series([1, 0, 1, 1, 0]) # Target variable
encoder = ce.TargetEncoder()
encoded = encoder.fit_transform(X, y)
print(encoded)
--OUTPUT--
   category
0  1.000000
1  0.000000
2  1.000000
3  1.000000
4  0.000000
Target encoding replaces each category with the average value of the target variable. This technique, implemented with TargetEncoder from the category_encoders library, is powerful because it directly uses the predictive signal from your target y. The fit_transform method calculates this mean for each category in your feature set X.
- Category A is replaced with 1.0, the average of its corresponding target values.
- Category B becomes 0.0.
- Category C becomes 1.0.
This method is effective but be careful—it can lead to overfitting since it uses the target data during training.
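A common mitigation is smoothing: blend each category's mean with the global target mean so that rare categories don't simply memorize their few target values. Here's a minimal pandas-only sketch of that idea, with a hypothetical prior weight m (the category_encoders TargetEncoder exposes a similar smoothing parameter):

```python
import pandas as pd

X = pd.DataFrame({'category': ['A', 'B', 'A', 'C', 'B']})
y = pd.Series([1, 0, 1, 1, 0])

# Smoothed encoding: weight each category's mean by its count, and the
# global mean (0.6 here) by the prior strength m.
m = 2.0
global_mean = y.mean()
stats = y.groupby(X['category']).agg(['mean', 'count'])
smoothed = (stats['count'] * stats['mean'] + m * global_mean) / (stats['count'] + m)
print(X['category'].map(smoothed))
```

With m = 2, category A moves from 1.0 to 0.8 and the single-row category C from 1.0 to about 0.73, so categories with little evidence are pulled toward the global average.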
Binary encoding for high-cardinality features
import category_encoders as ce
import pandas as pd
data = pd.DataFrame({'product_id': ['P001', 'P002', 'P003', 'P004', 'P005']})
encoder = ce.BinaryEncoder(cols=['product_id'])
binary_encoded = encoder.fit_transform(data)
print(binary_encoded)
--OUTPUT--
   product_id_0  product_id_1  product_id_2
0             1             0             0
1             0             1             0
2             1             1             0
3             0             0             1
4             1             0             1
Binary encoding offers a memory-efficient way to handle high-cardinality features. The BinaryEncoder from the category_encoders library is a great middle ground, especially for columns with many unique values like product_id.
- First, it converts each category into an integer.
- Next, it represents that integer in binary format.
- Finally, it splits the binary digits into new columns.
This approach creates far fewer columns than one-hot encoding while avoiding the artificial ranking of simple integer mapping.
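To see why this matters at scale, compare the column counts. Binary encoding needs only about log2(n) columns for n categories; the exact count can vary slightly by implementation, so treat these figures as approximate:

```python
import math

# Approximate columns needed to give each of n categories a distinct code.
# The n + 1 leaves one spare pattern (e.g. for unknown categories).
for n in [10, 100, 1000, 10000]:
    binary_cols = math.ceil(math.log2(n + 1))
    print(f"{n} categories -> one-hot: {n} columns, binary: {binary_cols} columns")
```

A feature with 10,000 unique values needs 10,000 one-hot columns but only about 14 binary ones, which is why this method suits IDs and other high-cardinality fields.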
Embedding categorical features with neural networks
import tensorflow as tf
from tensorflow.keras.layers import Input, Embedding, Flatten
from tensorflow.keras.models import Model
vocab_size = 5 # Number of unique categories
embedding_dim = 2 # Dimension of embedding space
input_layer = Input(shape=(1,))
embedding = Embedding(vocab_size, embedding_dim)(input_layer)
flatten = Flatten()(embedding)
model = Model(inputs=input_layer, outputs=flatten)
# Example of using the model for category indices [0, 1, 2]
result = model.predict([[0], [1], [2]])
print(result)
--OUTPUT--
[[-0.04277068  0.03056544]
[ 0.00158245 -0.00495145]
[ 0.01470431 -0.01129332]]
For complex categorical data, neural network embeddings offer a powerful solution. This technique learns a dense, multi-dimensional vector representation for each category, which can capture nuanced relationships that other methods miss.
- The Keras Embedding layer is the core of this approach. It maps each category's integer index to a vector of a size you define with embedding_dim.
- During training, the model learns the optimal values for these vectors, effectively discovering the semantic similarities between categories on its own.
Move faster with Replit
Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. You don't have to worry about configuring environments or installing packages to get started.
While the techniques in this article are powerful building blocks, Agent 4 helps you move from learning individual methods to building complete applications. It's a tool that takes your app description and handles the coding, database connections, API integrations, and deployment for you.
Instead of manually piecing together different encoding methods, you can describe the final product you want and let Agent 4 build it:
- A survey analysis tool that automatically converts text responses like 'low', 'medium', and 'high' into a numerical scale for charting.
- A sales forecasting utility that prepares product data for machine learning by converting categories like 'electronics' or 'apparel' into a one-hot encoded format.
- A data preprocessing app that applies target encoding to features like city or zip code to improve a model's predictive power.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Even with the right tools, you can run into tricky situations that affect your model's performance and reliability.
Handling unseen categories with LabelEncoder
The LabelEncoder is straightforward, but it has a blind spot: new categories. It learns a mapping from the data it's trained on, so if a category appears in your test data that wasn't in your training data, the encoder will raise an error because it doesn't know what to do. To prevent this, you can either ensure all possible categories are present in your training set or use a more robust method like one-hot encoding, which handles this scenario more gracefully.
Dealing with NaN values in categorical encoding
Missing values, often represented as NaN, can trip up many encoding functions. By default, encoders might ignore them or raise an error. Pandas' get_dummies() function, for example, will ignore NaN values unless you set the dummy_na=True parameter. Enabling this creates a dedicated column for missing values, which can be a useful feature for your model to learn from.
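A quick sketch of the difference the flag makes, using a small made-up series:

```python
import pandas as pd

colors = pd.Series(['red', 'green', None, 'red'])

# Default behavior: the missing row becomes all zeros, with no column of its own.
default_dummies = pd.get_dummies(colors)

# dummy_na=True adds a dedicated indicator column for missing values.
na_dummies = pd.get_dummies(colors, dummy_na=True)
print(default_dummies.shape, na_dummies.shape)  # (4, 2) (4, 3)
```

With the default, missingness is only encoded implicitly as an all-zero row; the extra column makes it an explicit signal the model can learn from.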
Avoiding data leakage in get_dummies() encoding
Data leakage is a subtle but serious issue where information from your test set accidentally influences your training process, giving you a false sense of your model's accuracy. With get_dummies(), this can happen if you encode the entire dataset before splitting it. The correct approach is to split your data first, then fit the encoder only on the training data and use it to transform both the training and testing sets.
Handling unseen categories with LabelEncoder
This is a common pitfall with LabelEncoder. Because the encoder is fitted only on the training set, it's unprepared for new values in your test data. The code below triggers the exact ValueError you'll get when this happens.
from sklearn.preprocessing import LabelEncoder
# Training data
train_colors = ['red', 'green', 'blue']
encoder = LabelEncoder()
encoder.fit(train_colors)
# Testing with new category
test_colors = ['yellow', 'red', 'blue']
try:
    encoded_test = encoder.transform(test_colors)
    print(encoded_test)
except ValueError as e:
    print(f"Error: {e}")
The encoder is fitted only on the training data, so when transform() encounters the new category 'yellow' in the test set, it raises a ValueError. The following code demonstrates a way to handle this scenario.
from sklearn.preprocessing import LabelEncoder
import numpy as np
# Training data
train_colors = ['red', 'green', 'blue']
encoder = LabelEncoder()
encoder.fit(train_colors)
# Testing with new category - with error handling
test_colors = ['yellow', 'red', 'blue']
def transform_with_unknown(encoder, data):
    known_categories = set(encoder.classes_)
    result = []
    for item in data:
        if item in known_categories:
            result.append(encoder.transform([item])[0])
        else:
            result.append(-1)  # Use -1 for unknown categories
    return np.array(result)
encoded_test = transform_with_unknown(encoder, test_colors)
print(encoded_test)
This solution manually handles new categories by checking if an item exists in the encoder's known classes (encoder.classes_) before transformation. If a category is unknown, the function assigns a placeholder value like -1 instead of crashing. It's a crucial workaround when deploying models, as real-world data often introduces new values that weren't present during training. This ensures your application remains stable when encountering unexpected inputs.
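If your feature fits scikit-learn's 2D column format, OrdinalEncoder offers a built-in version of this workaround: its handle_unknown='use_encoded_value' option maps any unseen category to a sentinel you choose. A brief sketch:

```python
import numpy as np
from sklearn.preprocessing import OrdinalEncoder

train_colors = np.array([['red'], ['green'], ['blue']])
test_colors = np.array([['yellow'], ['red']])

# Any category not seen during fit is mapped to unknown_value
# instead of raising a ValueError.
safe_encoder = OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)
safe_encoder.fit(train_colors)
print(safe_encoder.transform(test_colors))  # yellow -> -1.0, red -> 2.0
```

This achieves the same stability as the manual helper without custom code.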
Dealing with NaN values in categorical encoding
Missing values are a common headache in data preprocessing. Encoders like scikit-learn's LabelEncoder aren't designed to handle None or NaN values out of the box, which can halt your workflow with an unexpected error. The following code demonstrates this exact problem.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Dataset with missing values
data = pd.DataFrame({'category': ['A', 'B', None, 'C', 'B']})
encoder = LabelEncoder()
try:
    data['encoded'] = encoder.fit_transform(data['category'])
    print(data)
except TypeError as e:
    print(f"Error: {e}")
The fit_transform() method expects a consistent data type, but the presence of None alongside strings in the column triggers a TypeError. The following code shows how to properly prepare the data to prevent this error.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
# Dataset with missing values
data = pd.DataFrame({'category': ['A', 'B', None, 'C', 'B']})
# Fill NaN values before encoding
data['category_filled'] = data['category'].fillna('MISSING')
encoder = LabelEncoder()
data['encoded'] = encoder.fit_transform(data['category_filled'])
print(data)
The solution is to preprocess your data before encoding. By using the pandas fillna('MISSING') method, you replace any NaN values with a placeholder string. This ensures the column contains only strings, allowing LabelEncoder to run without a TypeError. The encoder then treats 'MISSING' as a distinct category and assigns it its own numerical value. This is a crucial step when your dataset might contain incomplete entries.
Avoiding data leakage in get_dummies() encoding
Using get_dummies() separately on your training and test sets can lead to a column mismatch, a subtle form of data leakage. This happens when one dataset contains categories not present in the other, breaking your model's ability to make predictions. The following code demonstrates how this misalignment occurs.
import pandas as pd
# Training data
train_df = pd.DataFrame({'color': ['red', 'green', 'blue']})
train_encoded = pd.get_dummies(train_df['color'], prefix='color')
# Test data with different categories
test_df = pd.DataFrame({'color': ['red', 'yellow', 'orange']})
test_encoded = pd.get_dummies(test_df['color'], prefix='color')
# Problem: Different columns in train and test
print("Train columns:", train_encoded.columns.tolist())
print("Test columns:", test_encoded.columns.tolist())
The training data's columns don't match the test data's because get_dummies() was applied independently. This structural difference will break your model during prediction. The following code demonstrates the correct way to handle this.
import pandas as pd
# Training data
train_df = pd.DataFrame({'color': ['red', 'green', 'blue']})
train_encoded = pd.get_dummies(train_df['color'], prefix='color')
# Test data with different categories
test_df = pd.DataFrame({'color': ['red', 'yellow', 'orange']})
test_encoded = pd.get_dummies(test_df['color'], prefix='color')
# Align test data with training data columns
test_aligned = test_encoded.reindex(columns=train_encoded.columns, fill_value=0)
print("Train columns:", train_encoded.columns.tolist())
print("Test aligned columns:", test_aligned.columns.tolist())
The solution is to align the test data's columns with the training data's. After applying get_dummies() to your test set, use the reindex() method. Pass the training set's columns to reindex() and set fill_value=0 to ensure both DataFrames have the same structure. This is crucial when deploying a model, as it guarantees your feature sets are consistent and prevents errors when the model encounters data with slightly different categories than it was trained on.
Real-world applications
Beyond the technical methods and common errors, these encoding techniques are the backbone of many practical machine learning applications. When building these applications with vibe coding, you can quickly prototype and iterate on your data preprocessing pipeline.
Using LabelEncoder for customer segmentation
In marketing, LabelEncoder is a straightforward way to convert descriptive customer data, like education level, into numbers that a clustering algorithm can use to identify segments.
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.cluster import KMeans
customer_data = pd.DataFrame({
    'education': ['high school', 'bachelor', 'master', 'bachelor', 'phd'],
    'marital_status': ['single', 'married', 'divorced', 'single', 'married']
})
encoder = LabelEncoder()
customer_data['edu_encoded'] = encoder.fit_transform(customer_data['education'])
customer_data['marital_encoded'] = encoder.fit_transform(customer_data['marital_status'])
kmeans = KMeans(n_clusters=2, random_state=0)
customer_data['segment'] = kmeans.fit_predict(customer_data[['edu_encoded', 'marital_encoded']])
print(customer_data)
This code snippet demonstrates a two-step process for grouping data using k-means clustering in Python. It prepares categorical features for a clustering algorithm by first converting them into a numerical format.
- The LabelEncoder transforms the string values in the education and marital_status columns into integers.
- Next, the KMeans algorithm uses these new numerical columns to sort each customer into one of two groups, or clusters.
The final output adds a segment column that contains the cluster assignment for each customer.
Encoding text data for sentiment analysis
To classify text reviews, you first need to convert the words into numerical features and the sentiment labels into integers that a model can process.
import pandas as pd
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.preprocessing import LabelEncoder
from sklearn.naive_bayes import MultinomialNB
reviews = pd.DataFrame({
    'text': ['love this product', 'terrible experience', 'good value', 'disappointing'],
    'sentiment': ['positive', 'negative', 'positive', 'negative']
})
vectorizer = CountVectorizer()
X = vectorizer.fit_transform(reviews['text'])
encoder = LabelEncoder()
y = encoder.fit_transform(reviews['sentiment'])
clf = MultinomialNB().fit(X, y)
new_review = ["pretty good product"]
prediction = clf.predict(vectorizer.transform(new_review))
print(f"Review: {new_review[0]}")
print(f"Predicted sentiment: {encoder.inverse_transform(prediction)[0]}")
This code builds a basic sentiment analysis model in Python. It uses CountVectorizer to transform the raw text of reviews into a numerical format based on word counts. At the same time, LabelEncoder converts the text labels, like 'positive' and 'negative', into integers that the model can understand.
- A MultinomialNB classifier is trained using the numerical text data and the encoded labels.
- The trained model then predicts the sentiment of a new review by first converting it with the same vectorizer.
- Finally, inverse_transform converts the numerical prediction back into a readable label.
Get started with Replit
Turn these techniques into a working tool. Describe what you want to build to Replit Agent, like “a tool to one-hot encode a CSV column” or “an app that converts survey ratings to a numerical scale.”
Replit Agent takes your description and writes the code, tests for errors, and deploys the app. It builds the complete tool for you. Start building with Replit.
Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.



