How to find duplicates in Python

Learn how to find duplicates in Python with various methods. Discover tips, real-world applications, and how to debug common errors.

Published on: Tue, Feb 24, 2026
Updated on: Mon, Apr 6, 2026
By the Replit Team

The search for duplicates in Python is a common task for data analysis. When you identify and remove duplicate entries, you ensure data integrity and improve your application's performance.

In this article, you'll explore several techniques to handle duplicates. We'll cover practical tips, real-world applications, and advice to debug your code. You will learn to select the right approach for your specific use case.

Finding duplicates with a loop and counter

numbers = [1, 2, 3, 4, 2, 1, 5, 6, 3]
duplicates = []
for item in numbers:
    if numbers.count(item) > 1 and item not in duplicates:
        duplicates.append(item)
print("Duplicates:", duplicates)

Output:
Duplicates: [1, 2, 3]

This method iterates through each element and uses the list's count() method to determine if it appears more than once. It's a direct approach that's easy to read and understand.

The conditional check if numbers.count(item) > 1 and item not in duplicates: is the core of this technique. The second part of this condition, item not in duplicates, prevents the same duplicate value from being added to the results list multiple times. While clear, this method can be inefficient on large lists since count() re-scans the list for every element.

Using built-in data structures

To improve on the looping method's performance, Python offers specialized built-in data structures that can find duplicates much more efficiently.

Finding duplicates with collections.Counter

from collections import Counter

numbers = [1, 2, 3, 4, 2, 1, 5, 6, 3]
counter = Counter(numbers)
duplicates = [item for item, count in counter.items() if count > 1]
print("Duplicates:", duplicates)

Output:
Duplicates: [1, 2, 3]

The collections.Counter class offers a more Pythonic and efficient solution. It's a specialized dictionary subclass that’s purpose-built for counting hashable objects. When you pass a list to Counter, it returns a dictionary-like object where keys are the elements and values are their frequencies.

A list comprehension then filters this object:

  • It iterates through each item and its count.
  • The condition if count > 1 ensures only duplicates are collected.

This method is significantly faster than a manual loop, especially for large datasets, as it processes the entire list in one optimized pass.

Using a set to identify duplicates

numbers = [1, 2, 3, 4, 2, 1, 5, 6, 3]
seen = set()
duplicates = set()
for item in numbers:
    if item in seen:
        duplicates.add(item)
    else:
        seen.add(item)
print("Duplicates:", list(duplicates))

Output:
Duplicates: [1, 2, 3]

This method leverages the high-speed lookup capabilities of sets. It’s an elegant way to find duplicates in a single pass through your list.

  • A seen set tracks the unique items you've encountered.
  • A duplicates set stores any item that's already in seen.

To learn more about creating sets in Python, you can explore different initialization methods and use cases.

Because checking for an item's presence in a set—using item in seen—is much faster than scanning a list, this approach offers a significant performance boost over a simple loop, especially with large datasets. You can also use similar set-based techniques for removing duplicates from a list.

Using dictionary to count occurrences

numbers = [1, 2, 3, 4, 2, 1, 5, 6, 3]
count_dict = {}
for item in numbers:
    count_dict[item] = count_dict.get(item, 0) + 1
duplicates = [item for item, count in count_dict.items() if count > 1]
print("Duplicates:", duplicates)

Output:
Duplicates: [1, 2, 3]

This method builds a frequency map from scratch using a dictionary. As you loop through the list, each number becomes a key in the count_dict. The get(item, 0) method is a clever way to handle items the first time they appear; it safely retrieves the current count or defaults to zero before incrementing. For more details on creating dictionaries in Python, you can explore various initialization methods.

  • The loop populates count_dict with each item's frequency.
  • A final list comprehension filters for items with a count greater than one, creating your list of duplicates.
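An equivalent way to build the same frequency map, shown here for comparison, is collections.defaultdict(int), which removes the need for the get() call:

```python
from collections import defaultdict

numbers = [1, 2, 3, 4, 2, 1, 5, 6, 3]

# Missing keys default to 0, so we can increment directly
count_dict = defaultdict(int)
for item in numbers:
    count_dict[item] += 1

duplicates = [item for item, count in count_dict.items() if count > 1]
print("Duplicates:", duplicates)  # Duplicates: [1, 2, 3]
```

Both versions produce the same result; defaultdict simply moves the "start at zero" logic into the container itself.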

Advanced techniques

If the built-in methods aren't quite enough, you can use advanced techniques like filter() with lambdas or external libraries for the heavy lifting.

Using filter() and lambda functions

numbers = [1, 2, 3, 4, 2, 1, 5, 6, 3]
duplicates = list(set(filter(lambda x: numbers.count(x) > 1, numbers)))
print("Duplicates:", duplicates)

Output:
Duplicates: [1, 2, 3]

This approach offers a compact, functional-style one-liner. It uses the filter() function with a lambda to build an iterator containing every element whose count is greater than one. While elegant, this method isn't the most efficient for large lists because it repeatedly calls count().

  • The lambda function tests each item's frequency.
  • set() is used to collect only the unique duplicate values from the filtered results.
  • Finally, list() converts the set into a list.

To understand more about using lambda functions in Python, you can explore their syntax and various applications.

Finding duplicates with pandas

import pandas as pd

numbers = [1, 2, 3, 4, 2, 1, 5, 6, 3]
series = pd.Series(numbers)
duplicates = series[series.duplicated()].unique()
print("Duplicates:", list(duplicates))

Output:
Duplicates: [2, 1, 3]

When you're working with large datasets, the pandas library is an excellent choice. The first step is to convert your list into a pandas Series. This structure is optimized for data analysis and gives you access to a suite of powerful methods.

  • The duplicated() method returns a boolean Series that flags every repeated occurrence of an element, leaving the first occurrence unflagged.
  • You then use these boolean results to filter the Series, keeping only the duplicate values. Finally, the unique() method ensures you get a clean list of the distinct duplicate numbers.
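If you want every row involved in a duplication rather than only the repeats, duplicated() also accepts keep=False, which flags all occurrences of a repeated value, including the first:

```python
import pandas as pd

numbers = [1, 2, 3, 4, 2, 1, 5, 6, 3]
series = pd.Series(numbers)

# keep=False marks every occurrence of a duplicated value, not just repeats
all_dupes = series[series.duplicated(keep=False)]
print(all_dupes.tolist())  # [1, 2, 3, 2, 1, 3]
```

This variant is useful when you need to inspect or drop every copy of a repeated record, not just the later ones.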

Using numpy for duplicate detection

import numpy as np

numbers = [1, 2, 3, 4, 2, 1, 5, 6, 3]
arr = np.array(numbers)
unique, counts = np.unique(arr, return_counts=True)
duplicates = unique[counts > 1]
print("Duplicates:", duplicates)

Output:
Duplicates: [1 2 3]

The numpy library is a powerhouse for numerical computing, making it a great choice for large datasets. It streamlines duplicate detection into a few highly optimized steps.

  • The np.unique() function is the core of this method. When you set return_counts=True, it returns two arrays: one with unique values and another with their frequencies.
  • The final step uses boolean indexing. The expression counts > 1 creates a filter that selects only the unique items whose count is greater than one, giving you the final list of duplicates.

Move faster with Replit

The techniques in this article are powerful building blocks, but you can move from piecing them together to building complete apps with Agent 4 on Replit, an AI-powered development platform with all Python dependencies pre-installed so you can start coding instantly. Agent 4 builds working software, handling code, databases, APIs, and deployment, directly from your description.

Instead of writing boilerplate code, you can describe the tool you want to build:

  • A data-cleaning utility that ingests a list of contacts and automatically removes duplicate entries.
  • An inventory checker that scans product codes from a file and flags any duplicates to prevent stock errors.
  • A simple log analyzer that processes event data, filters out repeated error messages, and summarizes the unique issues.

Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.

Common errors and challenges

While finding duplicates seems straightforward, you can run into subtle bugs or performance bottlenecks, especially when your data gets more complex.

Debugging performance issues with count() in large lists

If your script slows to a crawl on large datasets, the count() method is a likely culprit. Because it scans the entire list for each element, its runtime grows quadratically with the list's length, making it inefficient for thousands of items or more.

When you suspect a performance lag, using a profiler can confirm that excessive time is spent on count(). The fix is to switch to a more optimized approach, like using collections.Counter or a set, which process the list in a single, much faster pass.
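As a rough sanity check before reaching for a full profiler, the standard timeit module can confirm the gap between the two approaches (the list sizes here are illustrative):

```python
import timeit

# A list with known duplicates: 0..999 appear twice (sizes are illustrative)
data = list(range(2000)) + list(range(1000))

def with_count():
    duplicates = []
    for item in data:
        if data.count(item) > 1 and item not in duplicates:
            duplicates.append(item)
    return duplicates

def with_set():
    seen, duplicates = set(), set()
    for item in data:
        if item in seen:
            duplicates.add(item)
        else:
            seen.add(item)
    return duplicates

slow = timeit.timeit(with_count, number=1)
fast = timeit.timeit(with_set, number=1)
print(f"count() loop: {slow:.4f}s, set-based: {fast:.4f}s")
```

The exact numbers depend on your machine, but the gap widens rapidly as the list grows, because the count() version re-scans the list for every element.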

Handling case sensitivity in duplicate string detection

By default, Python's duplicate detection is case-sensitive, meaning "Apple" and "apple" are considered unique. This can lead to missed duplicates in textual data from user input or different sources.

The standard solution is to normalize your data before checking. You can create a consistent format by converting all strings to a single case, usually lowercase, with the lower() method. Once normalized, you can use any of the efficient methods, like set or Counter, to accurately identify duplicates.

Resolving errors when comparing complex objects for duplicates

Finding duplicates in a list of custom objects, such as class instances, brings two related problems. By default, Python hashes and compares instances by identity, so two separately created objects with identical data are never treated as duplicates by a set or collections.Counter. And if you define __eq__ without also defining __hash__, Python marks the class unhashable, which raises TypeError: unhashable type as soon as you put an instance in a set.

To fix this, you need to teach Python how to compare your objects. You can do this by implementing two special methods in your class:

  • __eq__(self, other): This method defines what makes two of your objects equal. For example, two User objects might be considered the same if their email addresses match.
  • __hash__(self): This method returns an integer derived from the same properties you used in __eq__, so that equal objects always produce equal hashes. It allows your objects to be stored in sets and used as dictionary keys.

By defining these methods, you give Python a clear blueprint for identifying duplicates among your custom objects, resolving the error.
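As a sketch (the User class and its fields here are hypothetical, not from any library), this is how the two methods work together:

```python
class User:
    """Hypothetical class: two users count as duplicates if their emails match."""
    def __init__(self, name, email):
        self.name = name
        self.email = email

    def __eq__(self, other):
        # Equality is based on the email address only
        return isinstance(other, User) and self.email == other.email

    def __hash__(self):
        # Hash must be built from the same fields as __eq__
        return hash(self.email)

users = [
    User("Alice", "a@example.com"),
    User("Bob", "b@example.com"),
    User("Alicia", "a@example.com"),
]

seen = set()
duplicates = []
for user in users:
    if user in seen:
        duplicates.append(user)
    else:
        seen.add(user)

print("Duplicates:", [u.name for u in duplicates])  # Duplicates: ['Alicia']
```

Because both methods key off the email field, "Alicia" is flagged as a duplicate of "Alice" even though the name differs.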

Debugging performance issues with count() in large lists

The count() method is simple to use, but it’s not built for speed on large datasets. For every item, it has to scan the entire list again, which can seriously slow down your script. The code below puts this performance penalty on display.

large_list = list(range(10000)) + list(range(5000))
duplicates = []
for item in large_list:
    if large_list.count(item) > 1 and item not in duplicates:
        duplicates.append(item)
print(f"Found {len(duplicates)} duplicates")

This code’s inefficiency comes from calling count() inside the loop. For a 15,000-item list, this means Python re-scans the entire list thousands of times. Check out a more optimized approach below.

large_list = list(range(10000)) + list(range(5000))
seen = set()
duplicates = set()
for item in large_list:
    if item in seen:
        duplicates.add(item)
    else:
        seen.add(item)
print(f"Found {len(duplicates)} duplicates")

This improved version uses two sets, seen and duplicates, to find duplicates in a single pass. As the code iterates through the list, it adds each new item to seen. If an item is already in seen, it’s a duplicate and gets added to the duplicates set. This approach is much faster because checking for an item in a set is nearly instant, which avoids the costly re-scans required by the count() method on large lists.

Handling case sensitivity in duplicate string detection

When working with text, case sensitivity can cause your duplicate detection logic to fail. Python sees 'jane' and 'JANE' as different strings, so it won't flag them as duplicates even though they represent the same name.

The code below demonstrates this problem. Notice how it only identifies 'John' as a duplicate, completely missing the case variants 'jane' and 'JANE'.

names = ["John", "jane", "John", "JANE", "Bob"]
seen = set()
duplicates = set()
for name in names:
    if name in seen:
        duplicates.add(name)
    else:
        seen.add(name)
print("Duplicates:", list(duplicates))

Because the if name in seen comparison is case-sensitive, it incorrectly treats 'jane' and 'JANE' as distinct entries, resulting in an incomplete list of duplicates. The corrected approach below solves this issue.

names = ["John", "jane", "John", "JANE", "Bob"]
seen = set()
duplicates = set()
for name in names:
    name_lower = name.lower()
    if name_lower in seen:
        duplicates.add(name)
    else:
        seen.add(name_lower)
print("Duplicates:", list(duplicates))

The fix is to normalize the data before comparison. By converting each name to a consistent format with name.lower(), the check if name_lower in seen can correctly identify variants like 'jane' and 'JANE' as the same. (For international text, str.casefold() performs a more aggressive, Unicode-aware normalization than lower().) This is a common and necessary step when processing textual data from user input or multiple sources.

The code stores the lowercase versions in the seen set for comparison. However, it adds the original, case-sensitive string to the duplicates set, preserving the original data.

Resolving errors when comparing complex objects for duplicates

Finding duplicates in a list of dictionaries isn't as simple as with numbers or strings. Dictionaries are mutable, meaning their contents can change, so they can't be stored in a set. This limitation causes a TypeError, as the following code demonstrates.

user_records = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 1, "name": "Alice"}
]
duplicates = set()
seen = set()
for record in user_records:
    if record in seen:
        duplicates.add(record)
    else:
        seen.add(record)
print("Duplicate records:", list(duplicates))

The TypeError is raised as soon as the code tries to hash a dictionary: both the membership test record in seen and the call seen.add(record) require hashing, and dictionaries are mutable and therefore "unhashable." A set can only store hashable items. The corrected code below shows how to work around this limitation when checking for duplicates.

user_records = [
    {"id": 1, "name": "Alice"},
    {"id": 2, "name": "Bob"},
    {"id": 1, "name": "Alice"}
]
duplicates = []
seen = set()
for record in user_records:
    record_key = frozenset(record.items())
    if record_key in seen:
        duplicates.append(record)
    else:
        seen.add(record_key)
print("Duplicate records:", duplicates)

The fix is to convert each dictionary into an immutable type before checking for duplicates. The code creates a hashable representation using frozenset(record.items()), which turns the dictionary's contents into something a set can store. This lets you add the unique identifier to seen and correctly flag duplicates. This technique is essential when processing data like JSON objects or database records, where you need to identify identical entries.
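One caveat: frozenset(record.items()) only works when every value in the dictionary is itself hashable. For nested records, such as JSON objects containing lists or sub-dictionaries, a common workaround (sketched here with illustrative data) is to serialize each record to a canonical string with json.dumps and sort_keys=True:

```python
import json

records = [
    {"id": 1, "tags": ["a", "b"]},
    {"id": 2, "tags": ["c"]},
    {"id": 1, "tags": ["a", "b"]},
]

seen = set()
duplicates = []
for record in records:
    # sort_keys=True makes key order irrelevant, so equal records
    # always serialize to the same string
    key = json.dumps(record, sort_keys=True)
    if key in seen:
        duplicates.append(record)
    else:
        seen.add(key)

print("Duplicate records:", duplicates)
```

This handles arbitrary JSON-serializable nesting, at the cost of serializing each record; it treats records as duplicates only when their full contents match.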

Real-world applications

Beyond debugging common errors, these techniques are essential for practical tasks like cleaning customer data and finding duplicate files.

Finding duplicate files with hashlib

To find duplicate files, you can use Python's hashlib library to generate a unique digital fingerprint for each file's content, allowing you to identify identical files regardless of their names.

import os
import hashlib

directory = "sample_files"
file_hashes = {}
duplicate_files = []

for filename in os.listdir(directory):
    filepath = os.path.join(directory, filename)
    if os.path.isfile(filepath):
        with open(filepath, 'rb') as f:
            filehash = hashlib.md5(f.read()).hexdigest()

        if filehash in file_hashes:
            duplicate_files.append((filename, file_hashes[filehash]))
        else:
            file_hashes[filehash] = filename

print("Duplicate files:", duplicate_files)

This script scans the sample_files directory to find files with identical content, not just similar names. It uses the os module to loop through files and hashlib to analyze their contents. For each file, it generates an MD5 hash from its binary data, which serves as a content fingerprint.

  • A dictionary, file_hashes, stores each unique hash and the name of the first file that produced it.
  • If a new file generates a hash that’s already in the dictionary, the script identifies the current file as a duplicate.
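For large files, reading the whole file into memory with f.read() can be wasteful. A common refinement (sketched here; the 64 KB chunk size is an arbitrary choice) hashes the file incrementally:

```python
import hashlib

def hash_file(filepath, chunk_size=65536):
    """Hash a file's contents in fixed-size chunks to keep memory use flat."""
    hasher = hashlib.md5()
    with open(filepath, "rb") as f:
        # iter() with a sentinel calls f.read(chunk_size) until it returns b""
        for chunk in iter(lambda: f.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()
```

The resulting digest is identical to hashing the full contents at once, so this function can drop into the script above in place of the inline hashlib.md5(f.read()) call.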

Identifying duplicate customer records using set

When you need to clean up customer data, a set provides an efficient way to identify duplicate records based on a unique field like an email address.

customers = [
    {"id": 101, "email": "john@example.com", "name": "John Doe"},
    {"id": 102, "email": "jane@example.com", "name": "Jane Smith"},
    {"id": 103, "email": "john@example.com", "name": "Johnny Doe"},
    {"id": 104, "email": "sarah@example.com", "name": "Sarah Johnson"},
    {"id": 105, "email": "jane@example.com", "name": "Jane Smith"}
]

seen_emails = set()
duplicate_emails = set()

for customer in customers:
    email = customer["email"]
    if email in seen_emails:
        duplicate_emails.add(email)
    else:
        seen_emails.add(email)

print("Duplicate email addresses:", list(duplicate_emails))

This script processes a list of customer dictionaries to find which email addresses appear more than once. It uses two sets to keep track of the data as it iterates through the list.

  • The seen_emails set stores every unique email it encounters.
  • If the script finds an email that's already in seen_emails, it adds that email to the duplicate_emails set.

This approach isolates the duplicate entries by using the email field as the key for comparison, ensuring each customer is identified correctly based on their email address.

Get started with Replit

Turn these techniques into a working tool with Replit Agent. Describe what you want to build, like “a utility to find and delete duplicate files in a folder” or “an app that cleans a CSV of duplicate customer emails.”

Replit Agent will write the code, test for errors, and deploy your application. Start building with Replit.

Build your first app today

Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.
