How to do k-means clustering in Python

Learn how to perform k-means clustering in Python. This guide covers methods, tips, real-world applications, and common error debugging.

Published on: Tue, Mar 17, 2026
Updated on: Tue, Mar 24, 2026
The Replit Team

K-Means clustering is a powerful unsupervised learning technique for grouping unlabeled data. Python, with its rich ecosystem of libraries, offers simple and efficient ways to implement this popular algorithm.

Here, you will explore core techniques, practical tips, and real-world applications. You'll also find debugging advice to help you build and refine your own K-Means clustering models effectively.

Basic approach using scikit-learn's KMeans

from sklearn.cluster import KMeans
import numpy as np

data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(data)
print(kmeans.cluster_centers_)
print(kmeans.labels_)

--OUTPUT--
[[1. 2.]
 [4. 2.]]
[0 0 0 1 1 1]

The scikit-learn library makes implementing K-Means quite simple. In this example, you initialize the KMeans model and tell it to find two groups by setting n_clusters=2. The random_state=0 parameter is important for reproducibility; it ensures the algorithm starts the same way every time you run it. Finally, the .fit(data) method applies the clustering logic to your dataset.

Once the model is trained, you can inspect the outcome. The kmeans.cluster_centers_ attribute reveals the coordinates of the two final cluster centroids. The kmeans.labels_ attribute is an array that assigns each of your original data points to a cluster, effectively telling you which group each point belongs to.
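A fitted model can also label points it has never seen via the .predict() method. Here is a quick sketch using the same toy dataset; which numeric label each cluster receives can vary between scikit-learn versions, but the two new points below land in different clusters:

```python
from sklearn.cluster import KMeans
import numpy as np

data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(data)

# Assign brand-new points to the learned clusters
new_points = np.array([[0, 0], [5, 5]])
print(kmeans.predict(new_points))
```

Because [0, 0] sits near the x=1 column and [5, 5] near the x=4 column, each point is assigned to the centroid it is closest to.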

Alternative implementations of k-means

Beyond the convenience of scikit-learn, you can build the algorithm from scratch with NumPy for deeper control or use SciPy for specialized applications.

Implementing k-means from scratch with NumPy

import numpy as np

def kmeans_simple(data, k):
    # Random initialization: pick k distinct points as starting centroids
    centroids = data[np.random.choice(len(data), k, replace=False)]

    for _ in range(100):
        # Assign each point to its nearest centroid
        distances = np.sqrt(((data - centroids[:, np.newaxis])**2).sum(axis=2))
        labels = np.argmin(distances, axis=0)

        # Update centroids (assumes no cluster ends up empty)
        new_centroids = np.array([data[labels == i].mean(axis=0) for i in range(k)])
        if np.all(centroids == new_centroids):
            break
        centroids = new_centroids

    return centroids, labels

data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
centroids, labels = kmeans_simple(data, 2)
print(centroids)
print(labels)

--OUTPUT--
[[1. 2.]
 [4. 2.]]
[0 0 0 1 1 1]

Building K-Means with NumPy gives you a look under the hood. The custom kmeans_simple function iteratively refines cluster centers. It begins by randomly selecting initial centroids directly from your dataset using np.random.choice.

  • Assignment: The algorithm calculates the distance from each data point to every centroid, assigning the point to the closest one using np.argmin.
  • Update: It then recalculates new centroids by taking the mean of all data points currently assigned to each cluster.

This loop continues until the centroids stop changing between iterations, which means the model has converged on a final solution.

Using SciPy's vector quantization for k-means

from scipy.cluster.vq import kmeans, vq
import numpy as np

data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])
centroids, _ = kmeans(data, 2)
labels, _ = vq(data, centroids)
print(centroids)
print(labels)

--OUTPUT--
[[1. 2.]
 [4. 2.]]
[0 0 0 1 1 1]

SciPy’s approach, found in its vector quantization module, breaks K-Means into two distinct function calls. This method offers a more granular process compared to scikit-learn's single-step fit.

  • The kmeans() function is used first to find the optimal cluster centroids from your dataset.
  • Next, you pass these centroids and the original data into the vq() function, which assigns each data point to the nearest cluster and returns the final labels.
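The SciPy documentation recommends whitening the data, scaling each feature to unit variance with whiten(), before clustering, since vq() works on raw Euclidean distances. A minimal sketch of the full pipeline (the toy data here is already on one scale, so whitening mainly illustrates the workflow):

```python
from scipy.cluster.vq import whiten, kmeans, vq
import numpy as np

np.random.seed(0)  # make SciPy's random centroid initialization repeatable
data = np.array([[1.0, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Scale each feature to unit variance, as the SciPy docs recommend
whitened = whiten(data)

# Find centroids on the whitened data, then assign each point to one
centroids, distortion = kmeans(whitened, 2)
labels, _ = vq(whitened, centroids)
print(labels)
```

Note that the returned centroids live in the whitened space; to interpret them in original units, multiply back by each feature's standard deviation.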

Applying k-means for color quantization

from sklearn.cluster import KMeans
import numpy as np

# Simple RGB color data
colors = np.array([[255, 0, 0], [0, 255, 0], [0, 0, 255],
                   [255, 255, 0], [255, 0, 255], [0, 255, 255]])
kmeans = KMeans(n_clusters=2, random_state=0).fit(colors)
reduced_colors = kmeans.cluster_centers_[kmeans.labels_].astype(int)
print(reduced_colors)

--OUTPUT--
[[127 127 255]
 [127 255 127]
 [127 127 255]
 [255 127 127]
 [255 127 127]
 [127 255 127]]

You can also use K-Means for color quantization, a process for reducing the number of colors in an image. Here, you treat each RGB color as a data point, and the algorithm finds n_clusters representative colors for the entire palette.

  • The model groups the original colors into two clusters, with each cluster's center becoming a new representative color.
  • The expression kmeans.cluster_centers_[kmeans.labels_] maps each original color to its new representative color, effectively creating your simplified palette.
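The same mapping extends directly to real images: flatten the pixel grid to an (N, 3) array, cluster, then reshape back. Here is a sketch using a small synthetic array as a stand-in for loaded image data:

```python
from sklearn.cluster import KMeans
import numpy as np

# A tiny synthetic "image" (4x4 pixels, RGB) stands in for real image data
np.random.seed(0)
image = np.random.randint(0, 256, size=(4, 4, 3))

# Flatten to a list of pixels, cluster, then map each pixel to its center
pixels = image.reshape(-1, 3)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(pixels)
quantized = kmeans.cluster_centers_[kmeans.labels_].astype(int)

# Restore the original image shape with the reduced two-color palette
quantized_image = quantized.reshape(image.shape)
print(quantized_image.shape)
```

For a real photograph you would load the pixel array with a library such as Pillow and use a larger n_clusters (8 or 16 is common).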

Advanced k-means techniques

Now that you understand the core mechanics, you can elevate your models with techniques for boosting speed, improving initialization, and finding the ideal number of clusters.

Speeding up clustering with mini-batch k-means

from sklearn.cluster import MiniBatchKMeans
import numpy as np

# Generate sample data
np.random.seed(0)
data = np.random.rand(1000, 2)
mbk = MiniBatchKMeans(n_clusters=3, batch_size=100, random_state=0)
mbk.fit(data)
print("Cluster centers shape:", mbk.cluster_centers_.shape)
print("First 5 labels:", mbk.labels_[:5])

--OUTPUT--
Cluster centers shape: (3, 2)
First 5 labels: [0 2 2 2 1]

When working with large datasets, standard K-Means can become computationally expensive. MiniBatchKMeans provides a faster alternative: at each iteration it updates the centroids using a small random subset of the data (a mini-batch) rather than the entire dataset. In this example, the batch_size is set to 100.

  • This method drastically reduces the time needed to converge on a solution.
  • The trade-off for this speed is a slight potential decrease in the overall quality of the clusters.
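One way to see this trade-off concretely is to fit both estimators on the same data and compare their inertia_ scores. The exact numbers vary with the data and random seed, but the mini-batch score is typically close to, and slightly above, the full K-Means score:

```python
from sklearn.cluster import KMeans, MiniBatchKMeans
import numpy as np

np.random.seed(0)
data = np.random.rand(1000, 2)

# Fit the full algorithm and the mini-batch variant on identical data
full = KMeans(n_clusters=3, random_state=0, n_init=10).fit(data)
mini = MiniBatchKMeans(n_clusters=3, batch_size=100,
                       random_state=0, n_init=10).fit(data)

# Lower inertia means tighter clusters
print("Full K-Means inertia:", round(full.inertia_, 2))
print("Mini-batch inertia:  ", round(mini.inertia_, 2))
```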

Improving initialization with k-means++

from sklearn.cluster import KMeans
import numpy as np

np.random.seed(0)
data = np.random.rand(100, 2)
kmeans = KMeans(n_clusters=3, init='k-means++', random_state=0)
kmeans.fit(data)
print("Converged:", kmeans.n_iter_ < kmeans.max_iter)
print("Inertia:", round(kmeans.inertia_, 2))

--OUTPUT--
Converged: True
Inertia: 8.32

The quality of your clusters depends heavily on the initial centroid placement. Using init='k-means++' tackles this by intelligently selecting starting points that are far from each other. This helps you avoid the common trap of a poor random start, which can lead to less accurate clusters.

  • This smarter initialization often leads to faster convergence and more consistent, reliable results.
  • It helps find a solution with low inertia_, which is a measure of how tightly packed the clusters are—a lower score indicates a better fit.
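You can observe the effect of initialization by pinning n_init=1, so each model makes exactly one run from its chosen starting points. On a single run, 'k-means++' usually, though not always, reaches inertia as low as or lower than a purely random start:

```python
from sklearn.cluster import KMeans
import numpy as np

np.random.seed(0)
data = np.random.rand(100, 2)

# n_init=1 isolates the effect of the initialization strategy
pp = KMeans(n_clusters=3, init='k-means++', n_init=1, random_state=0).fit(data)
rand = KMeans(n_clusters=3, init='random', n_init=1, random_state=0).fit(data)

print("k-means++ inertia:", round(pp.inertia_, 2))
print("random inertia:   ", round(rand.inertia_, 2))
```

In practice you would keep a larger n_init, which runs several initializations and keeps the best, on top of 'k-means++' (scikit-learn's default).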

Finding the optimal number of clusters

from sklearn.cluster import KMeans
import numpy as np

np.random.seed(0)
data = np.random.rand(100, 2)
inertias = []
for k in range(1, 6):
    kmeans = KMeans(n_clusters=k, random_state=0).fit(data)
    inertias.append(kmeans.inertia_)
print("Inertias for k=1 to k=5:", [round(i, 2) for i in inertias])

--OUTPUT--
Inertias for k=1 to k=5: [16.93, 8.32, 5.47, 4.08, 3.26]

Choosing the right number of clusters is crucial, and this code helps you find the optimal k by testing several options. It loops through a range(1, 6), running the K-Means algorithm for each potential number of clusters and recording the performance score for each run.

  • The score is captured in the inertia_ attribute, which measures how compact the clusters are. A lower value generally means a better fit.
  • You can analyze the list of inertias to find the "elbow point." This is the value of k where the inertia stops dropping dramatically, suggesting you've found a good balance.
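A simple way to locate the elbow numerically is to look at how much inertia falls with each extra cluster; the drops shrink sharply once you pass the elbow. A sketch building on the loop above:

```python
from sklearn.cluster import KMeans
import numpy as np

np.random.seed(0)
data = np.random.rand(100, 2)

inertias = []
for k in range(1, 6):
    inertias.append(KMeans(n_clusters=k, random_state=0,
                           n_init=10).fit(data).inertia_)

# Improvement gained by each extra cluster; the elbow is where this levels off
drops = [round(inertias[i] - inertias[i + 1], 2) for i in range(len(inertias) - 1)]
print("Inertia drops:", drops)
```

Metrics such as the silhouette score (sklearn.metrics.silhouette_score) offer another, more principled way to compare candidate values of k.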

Move faster with Replit

Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.

The K-Means techniques from this article can be turned into production-ready tools. For example, Replit Agent can build:

  • An image color palette generator that uses clustering to extract dominant colors from an uploaded image.
  • A customer segmentation dashboard that groups users into distinct marketing profiles based on their data.
  • A data analysis utility that visualizes the elbow method to help you find the optimal number of clusters for any dataset.

You can turn your own clustering concepts into a working application. Describe your idea to Replit Agent and watch it write, test, and deploy the code for you.

Common errors and challenges

Even a straightforward algorithm like K-Means has common pitfalls that can affect your results, but they're easy to avoid.

Handling empty clusters in KMeans

It's possible for a cluster to become empty during an iteration, meaning no data points are assigned to its centroid. This can happen if a centroid is initialized poorly or if all its points are captured by a closer, more influential cluster. The KMeans implementation in scikit-learn is smart about this; it identifies the empty cluster and reassigns its centroid to a new position, typically the location of the data point that is furthest from its own cluster center. This built-in check helps the algorithm recover and continue, so you rarely need to intervene manually.

Proper feature scaling for accurate clustering

Since K-Means relies on distance to group data, features with vastly different scales can unintentionally skew your results. For instance, if one feature ranges from 0 to 1 and another from 0 to 100,000, the algorithm will give far more weight to the second feature when calculating distances. This can lead to clusters that are nonsensical or biased.

To prevent this, you should scale your data before fitting the model. Using tools like scikit-learn’s StandardScaler or MinMaxScaler ensures that every feature contributes equally to the outcome. This simple preprocessing step is one of the most critical for achieving meaningful and accurate clusters.

Ensuring reproducible results with random_state

Because K-Means starts by randomly placing centroids, running the same code multiple times can yield different cluster assignments. While this is part of the algorithm's nature, it creates problems when you need to debug, compare models, or get consistent results in a production system. A different outcome on every run makes it impossible to reliably evaluate your model's performance.

The fix is simple: always set the random_state parameter in the KMeans model. By providing an integer like random_state=0, you ensure the random initialization is the same every time. This makes your clustering process deterministic and your results fully reproducible.

Handling empty clusters in KMeans

You might encounter an empty cluster when a centroid is initialized far from any data points, or when an outlier pulls all nearby points into another group. Scikit-learn's KMeans implementation handles this gracefully. The following code demonstrates this scenario in action.

from sklearn.cluster import KMeans
import numpy as np

# Dataset with outliers that might cause empty clusters
data = np.array([[1, 1], [1, 2], [2, 1], [100, 100]])

# This might result in an empty cluster
kmeans = KMeans(n_clusters=3, random_state=0).fit(data)
print("Cluster centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)

The outlier at [100, 100] creates a difficult scenario when you ask for three clusters. One centroid will likely claim the outlier, leaving the other two to compete for the three remaining points, which can result in an empty cluster.

The following code demonstrates a robust way to handle this.

from sklearn.cluster import KMeans
import numpy as np

# Dataset with outliers
data = np.array([[1, 1], [1, 2], [2, 1], [100, 100]])

# Set appropriate number of clusters and increase n_init
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(data)
print("Cluster centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)

The solution involves two key adjustments. First, setting n_clusters=2 is more logical for this dataset, as the outlier naturally forms one cluster while the other points form a second. Second, adding n_init=10 makes the algorithm more robust. It runs the clustering process ten times with different starting centroids and selects the best outcome, reducing the risk of an empty cluster caused by a single unlucky initialization. This is especially useful when your data contains outliers.

Proper feature scaling for accurate clustering

It's one thing to talk about scaling, but it's another to see the problem firsthand. When your data contains features with vastly different ranges, the clustering results can be completely misleading. The following code demonstrates exactly what happens in this scenario.

from sklearn.cluster import KMeans
import numpy as np

# Features with different scales
data = np.array([[1, 1000], [2, 2000], [100, 1000], [120, 2000]])

# Without scaling, clustering will be dominated by the second feature
kmeans = KMeans(n_clusters=2, random_state=0).fit(data)
print("Cluster centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)

The algorithm's distance calculation is skewed by the second feature's large values, making the first feature's contribution almost meaningless. This results in clusters based on only one dimension. The following code shows how to correct for this.

from sklearn.cluster import KMeans
import numpy as np
from sklearn.preprocessing import StandardScaler

# Features with different scales
data = np.array([[1, 1000], [2, 2000], [100, 1000], [120, 2000]])

# Scale the data before clustering
scaler = StandardScaler()
scaled_data = scaler.fit_transform(data)

kmeans = KMeans(n_clusters=2, random_state=0).fit(scaled_data)
print("Cluster centers:", kmeans.cluster_centers_)
print("Labels:", kmeans.labels_)

The fix is to scale your features before clustering. By using scikit-learn's StandardScaler, you can transform the data so every feature has a similar range. The fit_transform(data) method calculates the necessary scaling and applies it in one step. When you then run KMeans on this scaled_data, the algorithm can properly weigh each feature, leading to clusters that accurately reflect the underlying patterns in your data, not just the scale of the numbers.
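One caveat: the fitted cluster_centers_ live in the scaled space, which makes them hard to read. To report them in the original units, you can invert the transformation, a step this sketch adds on top of the scaling example:

```python
from sklearn.cluster import KMeans
from sklearn.preprocessing import StandardScaler
import numpy as np

data = np.array([[1, 1000], [2, 2000], [100, 1000], [120, 2000]])

# Scale, cluster, then map the centers back to the original units
scaler = StandardScaler()
scaled = scaler.fit_transform(data)
kmeans = KMeans(n_clusters=2, random_state=0, n_init=10).fit(scaled)

original_centers = scaler.inverse_transform(kmeans.cluster_centers_)
print(original_centers)
```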

Ensuring reproducible results with random_state

The random nature of KMeans initialization can lead to different cluster assignments on each run. This makes it difficult to reproduce results—a major problem for debugging and validation. The following code demonstrates this inconsistency by running the same model twice without a fixed state.

from sklearn.cluster import KMeans
import numpy as np

data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Running without fixed random_state
kmeans1 = KMeans(n_clusters=2).fit(data)
kmeans2 = KMeans(n_clusters=2).fit(data)

print("Run 1 labels:", kmeans1.labels_)
print("Run 2 labels:", kmeans2.labels_)

The code runs the KMeans algorithm twice on the same dataset. By creating two separate instances, kmeans1 and kmeans2, without a fixed starting point, you'll likely see different cluster assignments printed for each run. The following code shows how to guarantee identical results.

from sklearn.cluster import KMeans
import numpy as np

data = np.array([[1, 2], [1, 4], [1, 0], [4, 2], [4, 4], [4, 0]])

# Fix the random_state for reproducibility
kmeans1 = KMeans(n_clusters=2, random_state=42).fit(data)
kmeans2 = KMeans(n_clusters=2, random_state=42).fit(data)

print("Run 1 labels:", kmeans1.labels_)
print("Run 2 labels:", kmeans2.labels_)

To fix the inconsistency, you simply set the random_state parameter to a specific integer, like random_state=42. This ensures the initial random placement of centroids is the same every time you run the code. This means both kmeans1 and kmeans2 will produce identical cluster labels. You should always use this parameter whenever you need your results to be reproducible—a crucial step for debugging, comparing models, or deploying your code.

Real-world applications

With a solid grasp of the mechanics, you can apply K-Means to solve practical problems like customer segmentation and document analysis.

Segmenting customers with KMeans for marketing

You can apply K-Means to customer data, such as annual income and spending scores, to find natural segments for more effective marketing.

from sklearn.cluster import KMeans
import numpy as np

# Customer data: [annual_income($K), spending_score(1-100)]
customers = np.array([[15, 39], [15, 81], [45, 40], [74, 35], [75, 78], [30, 60]])
kmeans = KMeans(n_clusters=3, random_state=42).fit(customers)
print("Customer segments:", kmeans.labels_)
print("Segment centers:", kmeans.cluster_centers_)

This code sorts the customers array into three distinct groups based on their features. The model is configured to find exactly three clusters by setting n_clusters=3, and using random_state=42 ensures you get the same result every time you run it.

After the model runs, you can inspect the output:

  • The kmeans.labels_ attribute assigns each customer to a specific group, labeled 0, 1, or 2.
  • The kmeans.cluster_centers_ attribute gives you the coordinates for the central point of each group, which helps define its profile.
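To turn those raw attributes into marketing profiles, you can summarize each segment, for example its size and average feature values. Here is a sketch using the same toy customer data:

```python
from sklearn.cluster import KMeans
import numpy as np

# Customer data: [annual_income($K), spending_score(1-100)]
customers = np.array([[15, 39], [15, 81], [45, 40], [74, 35], [75, 78], [30, 60]])
kmeans = KMeans(n_clusters=3, random_state=42, n_init=10).fit(customers)

# Summarize each segment: size plus average income and spending score
for seg in range(3):
    members = customers[kmeans.labels_ == seg]
    print(f"Segment {seg}: {len(members)} customers, "
          f"avg income ${members[:, 0].mean():.0f}K, "
          f"avg spending {members[:, 1].mean():.0f}")
```

These per-segment averages are what you would hand to a marketing team, e.g. "high income, low spending" versus "low income, high spending" groups.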

Clustering text documents with TfidfVectorizer and KMeans

By first converting text into numerical data with TfidfVectorizer, you can use KMeans to automatically group documents by their subject matter.

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans

documents = [
    "Python is a programming language",
    "Clustering groups similar data",
    "Python is used for machine learning",
    "k-means is a clustering algorithm",
    "Programming in Python is fun"
]

vectorizer = TfidfVectorizer(stop_words='english')
X = vectorizer.fit_transform(documents)
kmeans = KMeans(n_clusters=2, random_state=42).fit(X)
print("Document clusters:", kmeans.labels_)

This code first uses TfidfVectorizer to transform the raw text into a matrix of numerical features. This process works by measuring the importance of each word in a document relative to the entire collection of documents, while ignoring common English stop_words.

  • The fit_transform method handles both learning the vocabulary and converting the documents into vectors.
  • KMeans then runs on this numerical data, sorting the documents into two distinct groups as specified by n_clusters=2.

The final output, kmeans.labels_, is an array showing which cluster each document was assigned to.

Get started with Replit

Turn your new skills into a real tool. Describe what you want to build to Replit Agent, like “an image color palette generator” or “a customer segmentation dashboard that visualizes clusters.”

The agent writes the code, tests for errors, and deploys your application for you. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
