How to remove Unicode characters in Python

This guide shows you how to remove Unicode characters in Python. Learn different methods, tips, real-world uses, and common error fixes.

Published on: Wed, Mar 25, 2026
Updated on: Thu, Mar 26, 2026
The Replit Team

Unicode characters often disrupt data processes and text manipulation in Python. Their removal is essential for clean data, which ensures your scripts run smoothly and produce predictable results.

In this article, you'll learn several techniques to handle these characters. You'll find practical tips, real-world applications, and debugging advice to help you manage Unicode in your projects.

Using encode() and decode() to remove Unicode characters

text = "Hello, 世界! This contains Unicode: ❤️ 😊 🐍"
ascii_text = text.encode("ascii", "ignore").decode("ascii")
print(f"Original: {text}")
print(f"ASCII only: {ascii_text}")

--OUTPUT--

Original: Hello, 世界! This contains Unicode: ❤️ 😊 🐍
ASCII only: Hello, ! This contains Unicode:

This technique hinges on encode("ascii", "ignore"). The function attempts to convert the string to ASCII, and when it encounters a character it can't process—like "世" or "❤️"—the "ignore" argument tells Python to simply drop it instead of raising an error.

Because encode() returns a bytes object, you need to call decode("ascii") to convert it back into a usable string. This two-step process is a quick and effective way to strip out any non-ASCII characters from your text.
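The "ignore" argument is just one of the error handlers encode() accepts. If you'd rather see where characters were dropped than lose them silently, the built-in "replace" and "backslashreplace" handlers are worth knowing — a quick sketch:

```python
text = "Café résumé 😊"

# "ignore" silently drops anything outside ASCII
print(text.encode("ascii", "ignore").decode("ascii"))   # Caf rsum

# "replace" swaps each unencodable character for "?"
print(text.encode("ascii", "replace").decode("ascii"))  # Caf? r?sum? ?

# "backslashreplace" keeps an escaped form, useful for debugging
print(text.encode("ascii", "backslashreplace").decode("ascii"))
```

Which handler to choose depends on whether you need evidence of what was removed: "ignore" for clean output, "replace" for visible placeholders, "backslashreplace" when you want to inspect the exact code points later.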

Basic techniques for removing Unicode characters

While the encode() and decode() method is a great starting point, you can achieve more granular control using regular expressions, character translations, or simple filtering.

Using the re module to remove non-ASCII characters

import re
text = "Hello, 世界! Unicode symbols: ❤️ 😊 🐍"
ascii_only = re.sub(r'[^\x00-\x7F]+', '', text)
print(f"Original: {text}")
print(f"ASCII only: {ascii_only}")

--OUTPUT--

Original: Hello, 世界! Unicode symbols: ❤️ 😊 🐍
ASCII only: Hello, ! Unicode symbols:

Regular expressions give you precise control over which characters to remove. The re.sub() function finds all substrings matching a pattern and replaces them. Here, the pattern r'[^\x00-\x7F]+' is the key to how it works.

  • The range \x00-\x7F represents all standard ASCII characters.
  • The caret ^ inside the brackets inverts the selection, so it matches anything not in that ASCII range.
  • The plus sign + tells the function to match one or more of these non-ASCII characters in a row.

By replacing the matches with an empty string (''), you effectively delete all non-ASCII characters from the text.
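One caveat worth knowing: deleting a run of non-ASCII characters outright can fuse the words on either side of it. If that matters for your data, a small variation on the same pattern substitutes a single space and then collapses any repeated whitespace:

```python
import re

text = "data科学pipeline"

# Deleting the non-ASCII run fuses the surrounding words
fused = re.sub(r'[^\x00-\x7F]+', '', text)
print(fused)  # datapipeline

# Replacing with a space keeps the word boundary, then collapse whitespace
spaced = re.sub(r'\s+', ' ', re.sub(r'[^\x00-\x7F]+', ' ', text)).strip()
print(spaced)  # data pipeline
```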

Using str.translate() with character mapping

text = "Hello, 世界! Unicode symbols: ❤️ 😊 🐍"
# Create a translation table that maps non-ASCII chars to None
translation_table = {ord(char): None for char in text if ord(char) > 127}
ascii_only = text.translate(translation_table)
print(f"Original: {text}")
print(f"ASCII only: {ascii_only}")

--OUTPUT--

Original: Hello, 世界! Unicode symbols: ❤️ 😊 🐍
ASCII only: Hello, ! Unicode symbols:

The str.translate() method is a highly efficient way to perform character-by-character removal. It operates using a translation table—a dictionary that maps the characters you want to delete to None.

  • The code builds a translation_table using a dictionary comprehension.
  • It checks each character's numerical value with ord(char) > 127 to identify everything outside the standard ASCII range.
  • Each non-ASCII character's code point is then used as a key that maps to None, marking it for deletion.

When you call text.translate(), it applies this table to the string, creating a new version with the unwanted characters stripped out.
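Note that the dictionary comprehension above only maps the non-ASCII characters that happen to appear in that one string, so the table isn't reusable for other text. Because str.translate() accepts any object that supports code-point lookup, one reusable option (a sketch of ours, not code from this article) is a small dict subclass that handles every code point lazily:

```python
class AsciiOnly(dict):
    """Translation table that deletes every code point above 127."""
    def __missing__(self, code):
        if code > 127:
            return None  # None tells str.translate() to delete the character
        raise LookupError(code)  # LookupError leaves the character unchanged

def strip_non_ascii(text):
    return text.translate(AsciiOnly())

print(strip_non_ascii("Hello, 世界! ❤️"))  # Hello, !
```

This avoids rebuilding a table for every input string and never materializes the full Unicode range in memory.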

Using list comprehension to filter characters

text = "Hello, 世界! Unicode symbols: ❤️ 😊 🐍"
ascii_only = ''.join(char for char in text if ord(char) < 128)
print(f"Original: {text}")
print(f"ASCII only: {ascii_only}")

--OUTPUT--

Original: Hello, 世界! Unicode symbols: ❤️ 😊 🐍
ASCII only: Hello, ! Unicode symbols:

This method uses a generator expression to build a new string containing only the characters you want to keep. It's a straightforward and highly readable approach. The code iterates through the original text and filters each character based on its numerical value.

  • The condition ord(char) < 128 checks if a character's code point falls within the standard ASCII range.
  • Finally, ''.join() takes all the characters that pass the filter and combines them into a single, clean string.
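Since Python 3.7, str.isascii() performs the same check as ord(char) < 128, and reads a little more clearly:

```python
text = "Hello, 世界! Unicode symbols: ❤️ 😊 🐍"

# char.isascii() is True exactly when ord(char) < 128
ascii_only = ''.join(char for char in text if char.isascii())
print(f"ASCII only: {ascii_only}")
```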

Advanced techniques for Unicode character handling

Moving beyond simple removal, these advanced techniques offer surgical control for handling specific Unicode ranges, normalizing characters with unicodedata, and building custom filters.

Removing specific Unicode ranges

def remove_unicode_range(text, start, end):
    return ''.join(char for char in text if ord(char) < start or ord(char) > end)

text = "Hello, 世界! Emoji: 😊 Latin: é ñ Greek: α β"
# Remove only emoji characters (Unicode range 0x1F300-0x1F9FF)
no_emoji = remove_unicode_range(text, 0x1F300, 0x1F9FF)
print(f"Original: {text}")
print(f"No emoji: {no_emoji}")

--OUTPUT--

Original: Hello, 世界! Emoji: 😊 Latin: é ñ Greek: α β
No emoji: Hello, 世界! Emoji:  Latin: é ñ Greek: α β

This approach offers surgical precision, letting you remove specific types of characters instead of all non-ASCII content. The custom remove_unicode_range function filters characters based on their numerical code points.

  • It uses the condition ord(char) < start or ord(char) > end to identify which characters to keep.
  • Anything with a code point falling between the start and end values is excluded from the final string.

In the example, it removes emojis by targeting their specific Unicode block, leaving other non-ASCII characters like "é" and "α" untouched.
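Keep in mind that emoji are not confined to a single block. A generalized sketch accepts several ranges at once (the ranges below are illustrative, not a complete emoji inventory):

```python
def remove_unicode_ranges(text, ranges):
    """Drop characters whose code point falls inside any (start, end) range."""
    return ''.join(
        char for char in text
        if not any(start <= ord(char) <= end for start, end in ranges)
    )

# A few common emoji-related blocks (not exhaustive)
EMOJI_RANGES = [
    (0x1F300, 0x1F9FF),  # symbols and pictographs
    (0x2600, 0x27BF),    # miscellaneous symbols and dingbats
    (0xFE00, 0xFE0F),    # variation selectors (e.g. the invisible ️ after ❤)
]

print(remove_unicode_ranges("OK ✅ done 🎉", EMOJI_RANGES))
```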

Using Unicode normalization with unicodedata

import unicodedata
text = "Café Piñata résumé"
# Normalize and then remove combining marks (accents, etc.)
normalized = ''.join(c for c in unicodedata.normalize('NFKD', text)
                     if not unicodedata.combining(c))
print(f"Original: {text}")
print(f"Normalized without accents: {normalized}")

--OUTPUT--

Original: Café Piñata résumé
Normalized without accents: Cafe Pinata resume

This method is perfect when you want to keep the base characters but remove accents or other diacritics. It's not about deleting characters but transforming them. The unicodedata.normalize() function with the 'NFKD' form is the key here. It cleverly separates characters like "é" into a base letter ("e") and a separate combining mark.

  • The code then iterates through this decomposed string.
  • unicodedata.combining(c) checks if a character is a combining mark, like an accent.
  • By keeping only the characters that aren't combining marks, you effectively strip the accents, leaving a clean string.

Building a customizable Unicode filter function

import unicodedata

def filter_unicode(text, allowed_categories=('Ll', 'Lu', 'Nd', 'Pc', 'Zs')):
    return ''.join(c for c in text if unicodedata.category(c) in allowed_categories)

text = "Hello, 世界! @#$% 123 ❤️ 😊"
filtered = filter_unicode(text)
print(f"Original: {text}")
print(f"Filtered (letters, numbers, spaces only): {filtered}")

--OUTPUT--

Original: Hello, 世界! @#$% 123 ❤️ 😊
Filtered (letters, numbers, spaces only): Hello   123

This function offers fine-grained control by filtering characters based on their official Unicode category. The unicodedata.category() method determines if a character is a letter, number, or symbol, letting you create a precise whitelist of character types to keep.

  • The function uses a default list of allowed_categories to keep common characters like letters ('Ll', 'Lu'), numbers ('Nd'), and spaces ('Zs').
  • It's highly customizable—you can change the list to allow any character type defined in the Unicode standard, giving you surgical precision over your text.
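When choosing categories, it helps to inspect what unicodedata.category() actually returns for your data. Note in particular that CJK characters like "世" are category 'Lo' (letter, other), which the default whitelist above omits — add 'Lo' to allowed_categories if you want to keep them:

```python
import unicodedata

# Print the Unicode category of a few sample characters
for ch in ['A', 'a', '1', '_', '世', '❤']:
    print(repr(ch), unicodedata.category(ch))
# 'A' Lu   'a' Ll   '1' Nd   '_' Pc   '世' Lo   '❤' So
```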

Move faster with Replit

Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.

For the Unicode removal techniques we've explored, Replit Agent can turn them into production tools:

  • Build a data sanitization tool that cleans text files by removing emojis and special symbols before analysis.
  • Create a username validation API that ensures all user inputs are ASCII-only for system compatibility.
  • Deploy a content migration script that normalizes text from an old database, converting characters like "é" to "e" for a new system.

Describe your app idea, and Replit Agent writes the code, tests it, and fixes issues automatically, all in your browser.

Common errors and challenges

Handling Unicode isn't always straightforward; here are some common pitfalls you might encounter and how to navigate them.

  • Forgetting to specify encoding when reading files. If you don't explicitly set an encoding like encoding="utf-8" when opening a file, Python might use a system-default encoding that can't handle the characters in your file. This often results in a UnicodeDecodeError and can cause scripts to fail when moved between different operating systems.
  • Mixing bytes and str types. This mistake typically triggers a TypeError. It happens when you use a method like encode(), which returns a bytes object, and then try to combine it with a regular str without first converting it back using decode().
  • Comparing strings with different Unicode representations. Two strings can look identical but fail a comparison check with ==. For example, a character like "é" can be represented as a single precomposed character or as a base letter "e" followed by a combining accent mark. Normalizing both strings before comparing them ensures you're evaluating their true content.

Forgetting to specify encoding when reading files

This error is a classic gotcha. When you use the open() function without specifying an encoding, Python defaults to a system-specific setting. If that setting doesn't match your file's content, your script will likely crash. See what happens in this example.

def read_unicode_file(filename):
    with open(filename, 'r') as f:  # No encoding specified
        return f.read()

# This will likely raise UnicodeDecodeError with non-ASCII text files

The open() function defaults to a system-specific encoding. If the file contains characters that this encoding can't interpret, the script crashes. The following example demonstrates the correct way to open files with Unicode content.

def read_unicode_file(filename):
    with open(filename, 'r', encoding='utf-8') as f:  # UTF-8 encoding specified
        return f.read()

# Now correctly handles UTF-8 encoded text files

The fix is to always specify the encoding when reading files. By adding encoding="utf-8" to the open() function, you tell Python exactly how to interpret the file's contents, which prevents a UnicodeDecodeError.

This is especially important when your script handles data from external sources, like user uploads or web APIs, where non-ASCII characters are common. It makes your code portable and reliable across different environments.
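When you can't guarantee a file really is UTF-8 — user uploads, for example — open() also accepts an errors argument. A forgiving variation (our sketch, not code from this article) substitutes the replacement character U+FFFD for undecodable bytes instead of crashing:

```python
def read_text_forgiving(filename):
    # Assume UTF-8, but replace undecodable bytes with U+FFFD (�)
    # rather than raising UnicodeDecodeError
    with open(filename, 'r', encoding='utf-8', errors='replace') as f:
        return f.read()
```

The resulting � characters can then be removed or flagged with any of the techniques shown earlier.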

Mixing bytes and str types

Python enforces a strict separation between text (str) and binary data (bytes). You can't concatenate them directly with an operator like +. This mismatch triggers a TypeError because the operation is undefined. The following code illustrates this common pitfall.

binary_data = b'Hello, \xe4\xb8\x96\xe7\x95\x8c' # UTF-8 encoded bytes
text = "Additional text"
result = binary_data + text # TypeError: can't concat str to bytes

The error arises because binary_data and text are fundamentally different types. Python can't guess how to combine raw binary data with a text string. The correct approach requires an explicit conversion, as shown in the code below.

binary_data = b'Hello, \xe4\xb8\x96\xe7\x95\x8c' # UTF-8 encoded bytes
text = "Additional text"
result = binary_data.decode('utf-8') + text # Decode bytes to str first

The solution is to explicitly convert the bytes object into a string before combining them. By calling binary_data.decode('utf-8'), you transform the binary data back into a text string. Now that both variables are of the str type, you can safely concatenate them using the + operator. This error often appears when you're working with data from network requests or files opened in binary mode, so always check your types before combining.

Comparing strings with different Unicode representations

Two strings can appear identical but fail a comparison check. This happens because Unicode can represent the same visual character in different ways, such as a precomposed character versus a base letter plus an accent. This subtle difference can cause unexpected bugs.

The code below demonstrates how this can lead to a == comparison returning False for strings that look the same.

str1 = "café" # 'é' as a single character (composed)
str2 = "cafe\u0301" # 'e' followed by combining accent (decomposed)
print(f"Visually identical: {str1} and {str2}")
print(f"Equal? {str1 == str2}") # False, despite looking the same

The == operator performs a literal comparison of the strings' underlying data. Because the two strings are stored differently despite looking the same, the check returns False. The following code demonstrates the proper way to compare them.

import unicodedata
str1 = "café" # 'é' as a single character (composed)
str2 = "cafe\u0301" # 'e' followed by combining accent (decomposed)
norm1 = unicodedata.normalize('NFC', str1)
norm2 = unicodedata.normalize('NFC', str2)
print(f"Equal after normalization? {norm1 == norm2}") # True

The solution is to standardize both strings before you compare them. By using unicodedata.normalize('NFC', ...), you convert each string to a consistent, composed form. This ensures their underlying data becomes identical, so the == operator correctly returns True. Keep an eye out for this issue when working with text from different sources, like user input or web APIs, where character representations can vary even when they look the same.
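If you also want comparisons to ignore letter case, you can fold normalization and casefold() into one helper — a small convenience sketch (canonical_equal is a name of our own, not a standard function):

```python
import unicodedata

def canonical_equal(a, b):
    """True when two strings match after NFC normalization and case folding."""
    norm = lambda s: unicodedata.normalize('NFC', s).casefold()
    return norm(a) == norm(b)

print(canonical_equal("Café", "cafe\u0301"))  # True
```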

Real-world applications

These techniques, particularly with unicodedata, are essential for practical tasks like creating clean URLs and normalizing text across different languages.

Creating URL-friendly slugs with unicodedata

You can combine the unicodedata module with regular expressions to transform a title containing special characters and emojis into a clean, URL-friendly slug.

import re
import unicodedata

def create_slug(title):
    # Normalize, remove accents, convert to lowercase
    text = ''.join(c for c in unicodedata.normalize('NFKD', title.lower())
                   if not unicodedata.combining(c) and ord(c) < 128)
    # Replace spaces with hyphens, remove non-alphanumeric
    return re.sub(r'[^a-z0-9\-]', '', re.sub(r'\s+', '-', text))

original = "Café Review: The Best Croissants in São Paulo! 🥐"
slug = create_slug(original)
print(f"Original title: {original}")
print(f"URL slug: {slug}")

The create_slug function converts a title into a clean, URL-ready string. It’s a multi-step process that ensures the output is simple and predictable.

  • First, it uses unicodedata.normalize('NFKD') to separate base characters from accents and converts the text to lowercase.
  • Next, it filters out the accents and any remaining non-ASCII characters, keeping only basic ASCII.
  • Finally, it uses re.sub() twice—first to replace spaces with hyphens, and then to remove any character that isn't a letter, number, or hyphen.

Using unicodedata for multi-language text normalization

When preparing text from multiple languages for data analysis, you can use the unicodedata module to standardize characters and ensure consistency.

import unicodedata

def clean_for_analysis(text, preserve_case=False):
    # Normalize and remove accents
    normalized = unicodedata.normalize('NFKD', text)
    cleaned = ''.join(c for c in normalized if not unicodedata.combining(c))

    # Optionally convert to lowercase
    if not preserve_case:
        cleaned = cleaned.lower()

    return cleaned

text = "Análisis de datos de München y São Paulo"
result = clean_for_analysis(text)
print(f"Original: {text}")
print(f"Cleaned for analysis: {result}")

The clean_for_analysis function prepares text for tasks like machine learning by creating a uniform representation. It's a two-step process for standardizing characters from different languages.

  • First, unicodedata.normalize('NFKD', text) decomposes complex characters, like "ü", into a base letter and its diacritic.
  • A generator expression then rebuilds the string, keeping only characters where unicodedata.combining(c) returns False.

By default, it also converts the text to lowercase, which helps ensure different capitalizations of the same word are treated identically during analysis.

Get started with Replit

Turn these techniques into a real tool. Tell Replit Agent to "build a script to clean emojis from a CSV" or "create an API to validate ASCII-only usernames".

The agent writes the code, tests for errors, and deploys your app from a single prompt. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
