How to do PCA in Python

Learn how to perform PCA in Python. This guide covers different methods, tips, real-world applications, and common error debugging.

Published on: Tue, Feb 24, 2026
Updated on: Mon, Apr 6, 2026
The Replit Team

Principal Component Analysis (PCA) is a core data science technique for dimensionality reduction. It simplifies complex data through the identification of key patterns, which makes large datasets easier to analyze in Python.

In this article, you'll learn to implement PCA with practical examples and essential tips. You'll also explore its diverse real-world applications and find clear advice to debug common issues you may face.

Using scikit-learn for basic PCA

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
data = load_iris().data
pca = PCA(n_components=2)
transformed_data = pca.fit_transform(data)
print(transformed_data.shape)

Output:
(150, 2)

This snippet uses scikit-learn to simplify the well-known Iris dataset. The process boils down to two key lines that do the heavy lifting.

  • First, PCA(n_components=2) creates the PCA model. By setting n_components to 2, you're instructing it to find the two principal components that best represent the variance in the original data.
  • Next, the fit_transform() method learns these components from the data and immediately applies the transformation, all in one step.

The final shape of (150, 2) confirms the dimensionality reduction. The dataset's original four features are now represented by just two, making it far easier to work with.
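Because those two components capture most of the variance, you can also map the reduced data back to the original feature space with inverse_transform() to see how little information was lost. A minimal sketch:

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import numpy as np

data = load_iris().data
pca = PCA(n_components=2)
transformed = pca.fit_transform(data)

# Map the 2-D representation back into the original 4-D feature space
reconstructed = pca.inverse_transform(transformed)

# The reconstruction is close to, but not identical to, the original data;
# the small error is the variance the discarded components held
error = np.mean((data - reconstructed) ** 2)
print(f"Mean squared reconstruction error: {error:.4f}")
```

A small error here is another way of confirming that two components were enough for this dataset.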

Basic PCA implementation approaches

Beyond scikit-learn's convenience, you can implement PCA with numpy, visualize the results, and measure each component's impact using explained_variance_ratio_.

Performing PCA with numpy

import numpy as np
from sklearn.datasets import load_iris
data = load_iris().data
data_centered = data - np.mean(data, axis=0)
cov_matrix = np.cov(data_centered, rowvar=False)
eigenvalues, eigenvectors = np.linalg.eigh(cov_matrix)
sorted_indices = np.argsort(eigenvalues)[::-1]
transformed_data = data_centered @ eigenvectors[:, sorted_indices[:2]]
print(transformed_data.shape)

Output:
(150, 2)

Implementing PCA with numpy gives you more control over the process. It starts by centering the data by subtracting the mean from each feature. Then, you calculate the covariance matrix with np.cov() to see how the features vary together.

  • The core of the operation is np.linalg.eigh(), which computes the eigenvalues and eigenvectors from the covariance matrix. These eigenvectors are your principal components.
  • Finally, you project the data onto the top two eigenvectors to get the transformed, lower-dimensional dataset.
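As a quick sanity check (a sketch, not part of the original workflow), you can confirm that the manual numpy result agrees with scikit-learn's, up to the arbitrary sign of each component:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

data = load_iris().data

# Manual PCA via the covariance matrix's eigendecomposition
centered = data - data.mean(axis=0)
eigvals, eigvecs = np.linalg.eigh(np.cov(centered, rowvar=False))
order = np.argsort(eigvals)[::-1]
manual = centered @ eigvecs[:, order[:2]]

# scikit-learn's SVD-based result should match elementwise
# once the per-column sign ambiguity is removed
sklearn_result = PCA(n_components=2).fit_transform(data)
match = np.allclose(np.abs(manual), np.abs(sklearn_result), atol=1e-6)
print(match)
```

The sign flip is expected: eigenvectors are only defined up to direction, so either library may negate a component without changing what it represents.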

Visualizing PCA results with matplotlib

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import matplotlib.pyplot as plt
data = load_iris().data
target = load_iris().target
pca = PCA(n_components=2)
transformed_data = pca.fit_transform(data)
plt.scatter(transformed_data[:, 0], transformed_data[:, 1], c=target)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.show()

Output: a scatter plot of 150 points in 2D, colored by iris species (3 distinct clusters)

Visualizing your results helps confirm if the reduction was effective. The code uses matplotlib's plt.scatter() function to plot the two new principal components against each other.

  • The first component becomes the x-axis, and the second becomes the y-axis.
  • Crucially, the c=target argument colors each point by its original Iris species. This reveals whether the components successfully separate the data.

The distinct clusters in the output plot show that PCA has worked well, capturing the essential differences between the species in just two dimensions.

Using explained_variance_ratio_ to understand components

from sklearn.decomposition import PCA
import numpy as np
from sklearn.datasets import load_iris
data = load_iris().data
pca = PCA(n_components=4)
pca.fit(data)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Cumulative variance:", np.cumsum(pca.explained_variance_ratio_))

Output:
Explained variance ratio: [0.92461872 0.05306648 0.01710261 0.00521218]
Cumulative variance: [0.92461872 0.97768521 0.99478782 1.        ]

The explained_variance_ratio_ attribute is your guide to understanding each component's impact. It shows the percentage of the dataset's total variance that each principal component captures, which helps you decide how many components you actually need when doing data analysis in Python.

  • The output shows the first component alone accounts for over 92% of the data's variance.
  • By looking at the cumulative sum from np.cumsum(), you can see the first two components together capture almost 98% of the original information. This confirms that a two-component reduction is a great trade-off.
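You can also let scikit-learn make this decision for you: passing a float between 0 and 1 as n_components keeps just enough components to reach that fraction of the total variance. A short sketch:

```python
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

data = load_iris().data

# A float n_components asks PCA to keep the smallest number of
# components whose cumulative variance reaches that fraction
pca = PCA(n_components=0.95)
pca.fit(data)
print(f"Components kept: {pca.n_components_}")
```

For the Iris data, the first component alone falls just short of 95%, so the model keeps two, matching the conclusion from the cumulative sum above.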

Advanced PCA techniques

While basic PCA is powerful, advanced techniques like IncrementalPCA, KernelPCA, and Pipeline integration handle more complex, real-world data challenges.

Using IncrementalPCA for large datasets

from sklearn.decomposition import IncrementalPCA
import numpy as np
# Simulate a large dataset
data = np.random.randn(1000, 50)
ipca = IncrementalPCA(n_components=2, batch_size=100)
for i in range(10):
    ipca.partial_fit(data[i*100:(i+1)*100])
transformed_data = ipca.transform(data)
print(f"Original: {data.shape}, Transformed: {transformed_data.shape}")

Output:
Original: (1000, 50), Transformed: (1000, 2)

When a dataset is too large to fit into memory, IncrementalPCA is your solution. It processes data in smaller chunks, or batches, making it a memory-efficient alternative to standard PCA, which requires loading the entire dataset at once.

  • You initialize the model with IncrementalPCA(), setting the batch_size to define how much data to process at a time.
  • The partial_fit() method is then called in a loop, allowing the model to learn from each batch sequentially without overwhelming your system's memory.
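Worth noting: when the full array does happen to fit in memory, calling fit() on an IncrementalPCA instance chunks it by batch_size internally, so the explicit partial_fit() loop is mainly useful when the data itself arrives in pieces (for example, streamed from disk). A sketch:

```python
from sklearn.decomposition import IncrementalPCA
import numpy as np

data = np.random.randn(1000, 50)

# fit() splits the array into batch_size chunks internally,
# so no manual loop is needed for an in-memory array
ipca = IncrementalPCA(n_components=2, batch_size=100)
ipca.fit(data)
transformed = ipca.transform(data)
print(transformed.shape)
```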

Applying KernelPCA for nonlinear dimensionality reduction

from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_moons
X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15)
X_kpca = kpca.fit_transform(X)
print(f"Original shape: {X.shape}, Transformed shape: {X_kpca.shape}")
print(f"First 3 samples: {X_kpca[:3]}")

Output:
Original shape: (200, 2), Transformed shape: (200, 2)
First 3 samples: [[-0.0705281  -0.07167132] [-0.07048331 -0.07165496] [-0.07043867 -0.07162293]]

KernelPCA is your tool for data that isn't linearly separable—meaning you can't draw a straight line to divide its groups. The make_moons dataset is a classic example of this, as its two clusters are intertwined. Standard PCA would fail to find meaningful components here.

  • The magic happens with the "kernel trick." By setting kernel='rbf', you're using the Radial Basis Function to project the data into a higher-dimensional space where it becomes separable.
  • The gamma parameter then controls the influence of the kernel. After this transformation, PCA can effectively identify the principal components in the new space.
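One caveat worth knowing: unlike standard PCA, KernelPCA only supports inverse_transform() if you ask it to learn an approximate pre-image mapping by setting fit_inverse_transform=True. A minimal sketch:

```python
from sklearn.decomposition import KernelPCA
from sklearn.datasets import make_moons

X, _ = make_moons(n_samples=200, noise=0.05, random_state=42)

# Without fit_inverse_transform=True, calling inverse_transform
# on a KernelPCA model raises an error
kpca = KernelPCA(n_components=2, kernel='rbf', gamma=15,
                 fit_inverse_transform=True)
X_kpca = kpca.fit_transform(X)
X_back = kpca.inverse_transform(X_kpca)
print(X_back.shape)
```

The result is only an approximate reconstruction, since mapping back from the kernel-induced space to the original space has no exact solution in general.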

Creating a PCA Pipeline with preprocessing

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.datasets import load_wine
data = load_wine().data
pca_pipeline = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=3))
])
transformed_data = pca_pipeline.fit_transform(data)
print(f"Explained variance: {pca_pipeline.named_steps['pca'].explained_variance_ratio_}")

Output:
Explained variance: [0.36198848 0.19087401 0.10844073]

A Pipeline is a fantastic way to bundle preprocessing and modeling steps into a single object. This is crucial for PCA because the algorithm is sensitive to feature scales. Without scaling, features with larger ranges can unfairly dominate the results, which is why scaling your data first is essential.

  • The first step, StandardScaler(), standardizes your data so each feature has a mean of 0 and a standard deviation of 1.
  • The second step, PCA(), then performs dimensionality reduction on this properly scaled data.

Calling fit_transform() on the pipeline runs the entire sequence automatically, giving you a clean and repeatable workflow.
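The same pattern extends naturally to modeling. As one plausible extension (LogisticRegression is just an illustrative choice here), appending a classifier keeps scaling and PCA inside each cross-validation fold, so no information from the test fold leaks into preprocessing:

```python
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.decomposition import PCA
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.datasets import load_wine

X, y = load_wine(return_X_y=True)

# Scaling and PCA are re-fit on each fold's training split,
# then applied to that fold's test split
clf = Pipeline([
    ('scaler', StandardScaler()),
    ('pca', PCA(n_components=3)),
    ('model', LogisticRegression(max_iter=1000))
])
scores = cross_val_score(clf, X, y, cv=5)
print(f"Mean accuracy: {scores.mean():.3f}")
```

If you instead scaled or fit PCA on the full dataset before cross-validating, the reported accuracy would be optimistically biased.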

Move faster with Replit

Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. Instead of piecing together techniques, you can use Agent 4 to build complete applications. It takes your description and handles everything from writing code to connecting APIs and deploying the final product.

You can go from learning about methods like PCA() to building tools that use them:

  • A data visualization tool that takes a high-dimensional dataset and plots its two most important principal components.
  • A feature analysis dashboard that calculates and displays the explained_variance_ratio_ for each component to help you select the optimal number.
  • A data compression utility that uses techniques like IncrementalPCA to reduce the dimensionality of large files for efficient storage.

Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.

Common errors and challenges

Applying PCA in Python can be tricky; common challenges include missing values, feature scaling, and component interpretation.

Handling missing values before applying PCA

The PCA algorithm in scikit-learn can't handle missing data, so it will throw an error if your dataset contains any NaN values. You must handle these gaps during preprocessing, either by filling them in or by removing them, before you can proceed with dimensionality reduction.

  • Imputation: The most common approach is to fill, or impute, the missing values. You can use simple strategies like replacing them with the mean, median, or mode of their respective columns.
  • Removal: Alternatively, you can remove the rows or columns that contain missing data. This is a simpler fix but can be risky, as you might discard valuable information, especially if you have a small dataset.

Fixing standardization issues with PCA

A frequent mistake is applying PCA directly to unscaled data. Because PCA works by finding directions of maximum variance, features with larger numerical ranges will naturally dominate the principal components, even if they aren't more important.

This skews the results, causing them to reflect arbitrary measurement units rather than true data patterns. To prevent this, you should always scale your data first. Using StandardScaler, often within a Pipeline, ensures that each feature is on a level playing field and contributes fairly to the analysis.

Troubleshooting incorrect component interpretation

Principal components are not your original features; they are new, abstract variables that can be difficult to understand. A common misstep is to look at the first component and assume it's just your most "important" original feature. In reality, it's a weighted combination of all original features.

To correctly interpret a component, you need to inspect its loadings, which are available in the components_ attribute after fitting the model. These loadings are the weights that show how much each original feature contributes to the new component. While explained_variance_ratio_ tells you how much information a component captures, the loadings tell you what that information is actually about.

Handling missing values before applying PCA

The PCA algorithm in scikit-learn can't process datasets with missing values, often marked as NaN. Attempting to run the analysis on incomplete data will stop the process cold and raise an error. The following code demonstrates this exact scenario.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

data = load_iris().data.copy()
data[10:15, 0] = np.nan

pca = PCA(n_components=2)
transformed_data = pca.fit_transform(data)
print(transformed_data.shape)

The error occurs because data[10:15, 0] = np.nan injects missing values into the dataset. The PCA model can't compute with these NaN entries, which stops the analysis. The next example demonstrates the correct approach.

import numpy as np
from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
from sklearn.impute import SimpleImputer

data = load_iris().data.copy()
data[10:15, 0] = np.nan

imputer = SimpleImputer(strategy='mean')
data_imputed = imputer.fit_transform(data)

pca = PCA(n_components=2)
transformed_data = pca.fit_transform(data_imputed)
print(transformed_data.shape)

The fix is to preprocess the data before running PCA. The code uses SimpleImputer from scikit-learn to handle the NaN values. By setting strategy='mean', it replaces each missing value with the average of its column. This imputed data is then passed to PCA, which can now perform the dimensionality reduction successfully. This is a crucial step whenever your dataset might contain incomplete information.

Fixing standardization issues with PCA

Applying PCA to unscaled data is a common pitfall. Features with vastly different ranges can skew the analysis, as the algorithm prioritizes variance. The code below illustrates this issue, showing how results can be misleading without proper standardization.

from sklearn.decomposition import PCA
from sklearn.datasets import load_wine

data = load_wine().data
print(f"Feature means: {data.mean(axis=0)[:3]}")

pca = PCA(n_components=2)
transformed_data = pca.fit_transform(data)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")

The output from data.mean(axis=0) reveals large differences in feature scales. This skews the analysis: the explained_variance_ratio_ ends up dominated by whichever features happen to have the largest numeric ranges, rather than by genuine structure in the data. The following code shows how to properly prepare the data before applying PCA.

from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.datasets import load_wine

data = load_wine().data
print(f"Feature means before scaling: {data.mean(axis=0)[:3]}")

scaler = StandardScaler()
data_scaled = scaler.fit_transform(data)

pca = PCA(n_components=2)
transformed_data = pca.fit_transform(data_scaled)
print(f"Explained variance ratio: {pca.explained_variance_ratio_}")

The solution is to scale the data before analysis. The code first creates a StandardScaler instance and then uses its fit_transform() method on the data. This step standardizes each feature, giving them equal weight. Only then is the scaled data passed to PCA. This ensures the resulting explained_variance_ratio_ reflects genuine data patterns, not just differences in measurement units. This is a vital preprocessing step whenever your features have varying scales.

Troubleshooting incorrect component interpretation

A frequent mistake is treating principal components as if they're just your original features. The first component isn't your most important column; it's a new, abstract variable created from a mix of all of them. The following code demonstrates this common misunderstanding.

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris

data = load_iris().data
feature_names = load_iris().feature_names
pca = PCA(n_components=2)
pca.fit(data)

print(f"Most important features: {feature_names}")
print(f"PC1 loadings: {pca.components_[0]}")

The code presents the original feature_names and the component's weights from pca.components_ without connecting them, making it easy to misinterpret the results. The next example shows how to correctly analyze these component loadings.

from sklearn.decomposition import PCA
from sklearn.datasets import load_iris
import numpy as np

data = load_iris().data
feature_names = load_iris().feature_names
pca = PCA(n_components=2)
pca.fit(data)

for i, component in enumerate(pca.components_):
    sorted_indices = np.argsort(np.abs(component))[::-1]
    print(f"PC{i+1} most important features:")
    for idx in sorted_indices[:2]:
        print(f"  {feature_names[idx]}: {component[idx]:.3f}")

To correctly interpret a component, you must analyze its loadings. This code shows you how:

  • It iterates through each component in pca.components_ to access its weights.
  • It then uses np.argsort() to rank original features by the magnitude of their contribution.

This process reveals which features most strongly influence each new component, clarifying what the abstract variables actually measure. This is essential for explaining your model's results to others.

Real-world applications

Beyond debugging, PCA shines in real-world scenarios, from analyzing feature importance in datasets to detecting anomalies in complex systems, and it's often combined with techniques like k-means clustering in Python for comprehensive analysis.

Using PCA for feature importance in wine dataset

You can use PCA to identify the most important features in the wine dataset by examining which chemical properties contribute most to its principal components.

from sklearn.decomposition import PCA
from sklearn.datasets import load_wine
import numpy as np

# Load wine dataset and apply PCA
wine = load_wine()
pca = PCA(n_components=2)
pca.fit(wine.data)

# Extract top contributing features to first component
feature_importance = np.abs(pca.components_[0])
top_indices = np.argsort(feature_importance)[::-1][:3]
top_features = [wine.feature_names[i] for i in top_indices]

print(f"Variance explained: {pca.explained_variance_ratio_}")
print(f"Top 3 features in PC1: {top_features}")

This code pinpoints which chemical properties most define the wine dataset's primary patterns. It's a practical way to see what PCA has learned about your features.

  • First, it isolates the first principal component's loadings from pca.components_[0]. These loadings are the weights showing each original feature's contribution.
  • Then, it uses np.argsort() to rank features by the magnitude of their weights, identifying the most influential ones.

Finally, the code retrieves the names of the top three features, revealing what drives the most variance in the data.

Implementing anomaly detection with PCA

PCA can identify anomalies by measuring the reconstruction error, which quantifies how poorly a data point is rebuilt after being compressed and decompressed.

from sklearn.decomposition import PCA
import numpy as np
from sklearn.datasets import make_blobs

# Create dataset with normal points and outliers
X_normal, _ = make_blobs(n_samples=300, centers=1, random_state=42)
X_outliers = np.random.uniform(low=-8, high=8, size=(10, 2))
X = np.vstack([X_normal, X_outliers])

# Apply PCA for reconstruction
pca = PCA(n_components=1)
X_reconstructed = pca.inverse_transform(pca.fit_transform(X))
reconstruction_error = np.sum((X - X_reconstructed)**2, axis=1)
anomalies = np.where(reconstruction_error > np.percentile(reconstruction_error, 95))[0]

print(f"Total points: {X.shape[0]}, Anomalies detected: {len(anomalies)}")
print(f"Anomaly indices (first 5): {anomalies[:5]}")

This code shows how PCA can find outliers. It first creates a dataset with a main cluster of points using make_blobs and adds some random outliers. The core idea is to use PCA to learn the primary structure of the data, which is defined by the normal points.

  • The data is reduced to one dimension with fit_transform() and then restored to its original shape using inverse_transform().
  • Outliers don't fit the main pattern, so they are restored poorly. The code finds points where this restoration difference is in the top 5% using np.percentile() and flags them as anomalies.
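In practice you would typically fit PCA on data assumed to be normal, fix the error threshold from that data, and then reuse it to score new observations. One way to package this, with reconstruction_error as a hypothetical helper written for illustration:

```python
from sklearn.decomposition import PCA
from sklearn.datasets import make_blobs
import numpy as np

# Learn the normal structure from (presumed) clean data
X_normal, _ = make_blobs(n_samples=300, centers=1, random_state=42)
pca = PCA(n_components=1).fit(X_normal)

def reconstruction_error(model, points):
    # Squared distance between each point and its PCA reconstruction
    restored = model.inverse_transform(model.transform(points))
    return np.sum((points - restored) ** 2, axis=1)

# Fix the threshold from the training data, then apply it to any points
threshold = np.percentile(reconstruction_error(pca, X_normal), 95)
flags = reconstruction_error(pca, X_normal) > threshold
print(f"Flagged {flags.sum()} of {len(X_normal)} points")
```

Separating the threshold from the scoring step means new data can be checked as it arrives, without refitting the model each time.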

Get started with Replit

Now, turn your knowledge into a real application. Describe what you want to build to Replit Agent, like "a tool that visualizes the top two principal components from a CSV" or "a dashboard that calculates explained_variance_ratio_."

Replit Agent writes the code, tests for errors, and deploys your application directly from your browser. Start building with Replit.

Build your first app today

Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.
