How to find correlation in Python

Learn how to find correlation in Python. Discover different methods, tips, real-world applications, and how to debug common errors.

Published on: Tue, Mar 17, 2026
Updated on: Tue, Mar 24, 2026
The Replit Team

Correlation analysis in Python uncovers relationships within your data. Python's libraries offer powerful tools to measure how variables move together, a fundamental concept in statistics and data science.

You'll learn several techniques to calculate correlation, with practical tips and real-world applications. You'll also find common debugging advice to help you confidently apply these methods in your projects.

Using NumPy's corrcoef() function

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])

correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation coefficient: {correlation:.4f}")

Output:
Correlation coefficient: 1.0000

The corrcoef() function is a quick way to get the job done, but it’s important to know it returns a correlation matrix, not a single value. This matrix shows how each input array correlates with every other array passed to the function.

That’s why you need to access the specific coefficient you want using an index like [0, 1]. This selects the correlation between the first array (at index 0) and the second array (at index 1). The diagonal elements of the matrix, such as at [0, 0] or [1, 1], will always be 1 because they represent a variable's perfect correlation with itself.
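To see this concretely, here is a short sketch that prints the full matrix instead of a single coefficient:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([5, 3, 4, 1, 2])

matrix = np.corrcoef(x, y)

# Two input arrays produce a 2x2 matrix
print(matrix.shape)  # (2, 2)

# The diagonal is always 1: each variable correlated with itself
print(matrix[0, 0], matrix[1, 1])

# The matrix is symmetric: [0, 1] and [1, 0] hold the same coefficient
print(np.allclose(matrix, matrix.T))
```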

Common correlation techniques

While corrcoef() is a solid starting point, Python's data science stack offers more robust tools for handling complex datasets and visualizing relationships.

Using Pandas corr() method

import pandas as pd

data = {'x': [1, 2, 3, 4, 5],
        'y': [2, 4, 6, 8, 10],
        'z': [5, 4, 3, 2, 1]}
df = pd.DataFrame(data)

correlation_matrix = df.corr()
print(correlation_matrix)

Output:
          x         y         z
x  1.000000  1.000000 -1.000000
y  1.000000  1.000000 -1.000000
z -1.000000 -1.000000  1.000000

When your data is in a Pandas DataFrame, using the built-in corr() method is often more convenient. You can call it directly on your DataFrame, and it automatically computes the correlation between all numeric columns, returning a new DataFrame that serves as a correlation matrix.

  • This approach is great for quickly exploring relationships in tabular data without needing to isolate individual columns.
  • As shown in the output, x and y have a perfect positive correlation (1.0), while x and z have a perfect negative correlation (-1.0).
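The corr() method also accepts a method parameter ('pearson' is the default; 'spearman' and 'kendall' are the alternatives), which is handy when a relationship is monotonic but not linear. A quick sketch with made-up data:

```python
import pandas as pd

# y grows monotonically with x, but along a curve, not a straight line
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [1, 4, 9, 16, 25]})

pearson = df.corr(method='pearson').loc['x', 'y']
spearman = df.corr(method='spearman').loc['x', 'y']

# Spearman ranks the values first, so it reports a perfect 1.0 here,
# while Pearson falls short of 1 because the trend is curved
print(f"Pearson: {pearson:.4f}, Spearman: {spearman:.4f}")
```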

Using SciPy for different correlation methods

from scipy import stats

x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]

pearson, p_value = stats.pearsonr(x, y)
spearman, _ = stats.spearmanr(x, y)
kendall, _ = stats.kendalltau(x, y)

print(f"Pearson: {pearson:.4f}, Spearman: {spearman:.4f}, Kendall: {kendall:.4f}")

Output:
Pearson: 1.0000, Spearman: 1.0000, Kendall: 1.0000

SciPy’s stats module gives you more statistical firepower by providing functions for different types of correlation. This is useful when you need to go beyond the standard linear correlation and choose a method that better suits your data.

  • stats.pearsonr is for linear relationships and conveniently returns a p-value, which helps determine if the correlation is statistically significant.
  • stats.spearmanr and stats.kendalltau measure monotonic relationships. They check if one variable tends to move in the same direction as another, even if the relationship isn't linear.
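The p-value from pearsonr deserves a closer look. As a sketch with made-up, perfectly linear data: a tiny p-value says the observed correlation would be extremely unlikely if the variables were actually uncorrelated.

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2 * v + 1 for v in x]  # exactly linear, so r is 1

r, p = stats.pearsonr(x, y)

# A common convention treats p < 0.05 as statistically significant
print(f"r = {r:.4f}, p = {p:.2e}, significant: {p < 0.05}")
```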

Visualizing correlation with Seaborn

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 4, 5, 4, 5],
    'C': [10, 8, 6, 4, 2]
})

sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

Output:
[Heatmap visualization of correlation matrix between variables A, B, and C]

Visualizing data makes complex relationships much easier to grasp. Seaborn's heatmap() function transforms a correlation matrix into a color-coded grid, offering an intuitive way to spot patterns instantly.

  • The annot=True argument is essential, as it overlays the numerical correlation values directly onto the chart.
  • A colormap like 'coolwarm' helps you distinguish between positive (warm colors) and negative (cool colors) correlations at a glance.

Advanced correlation analysis

When simple correlations don't tell the whole story, advanced techniques can help you untangle confounding variables and capture complex non-linear relationships.

Computing partial correlation

import pingouin as pg
import pandas as pd

data = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6, 7, 8],
    'y': [3, 5, 7, 9, 8, 7, 5, 4],
    'z': [5, 4, 3, 2, 3, 4, 5, 6]
})

partial_corr = pg.partial_corr(data=data, x='x', y='y', covar='z')
print(partial_corr)

Output:
      r  dof       pval        CI95%   method
0  -0.5    5  0.2532357  [-0.9, 0.4]  pearson

Partial correlation measures the relationship between two variables, like x and y, after removing the influence of a third variable, z. It’s useful when you suspect a confounding variable is distorting the true connection you want to analyze.

The pingouin library simplifies this with its partial_corr() function.

  • You specify the two variables of interest (x and y) and the variable to control for using the covar parameter.
  • The function returns a detailed statistical summary, giving you a clearer picture of the direct relationship.
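Under the hood, partial correlation is equivalent to correlating the residuals left over after regressing each variable on the covariate. This NumPy-only sketch shows the idea on made-up numbers without requiring pingouin; residuals() is a helper defined here, not a library function:

```python
import numpy as np

x = np.array([2.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0, 9.0])
y = np.array([1.0, 3.5, 2.0, 5.0, 4.5, 7.0, 5.5, 8.0])
z = np.array([1.0, 2.0, 2.0, 3.0, 3.0, 4.0, 4.0, 5.0])

def residuals(a, control):
    # fit a line predicting `a` from `control`, keep what the line misses
    slope, intercept = np.polyfit(control, a, 1)
    return a - (slope * control + intercept)

# correlate what remains of x and y once z's linear influence is removed
partial_r = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]
print(f"Partial correlation of x and y controlling for z: {partial_r:.4f}")
```

The result matches the textbook closed-form formula built from the three pairwise correlations, which is exactly what a partial-correlation routine computes.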

Using distance correlation for non-linear relationships

import numpy as np
from dcor import distance_correlation

np.random.seed(0)
x = np.random.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.normal(0, 1, 30)

pearson = np.corrcoef(x, y)[0, 1]
dcor = distance_correlation(x, y)

print(f"Pearson: {pearson:.4f}, Distance correlation: {dcor:.4f}")

Output:
Pearson: 0.0704, Distance correlation: 0.5371

Distance correlation is a powerful tool for detecting both linear and non-linear relationships between variables. Standard Pearson correlation can be misleading when variables follow a complex pattern—like the sine wave created in the code—because it only looks for straight-line trends.

  • Notice how the Pearson coefficient is close to zero, incorrectly suggesting no link between x and y.
  • In contrast, distance_correlation from the dcor library successfully identifies the connection, yielding a significantly higher value. This makes it essential when you suspect your data's relationship isn't linear.
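If you'd rather not add a dependency, distance correlation is compact enough to sketch directly in NumPy. This follows the standard definition (pairwise distance matrices, double-centering, then a normalized product); the function name is my own, not part of any library:

```python
import numpy as np

def distance_corr(x, y):
    # pairwise absolute-distance matrices for each variable
    a = np.abs(x[:, None] - x[None, :])
    b = np.abs(y[:, None] - y[None, :])
    # double-center: subtract row and column means, add back the grand mean
    A = a - a.mean(axis=0) - a.mean(axis=1)[:, None] + a.mean()
    B = b - b.mean(axis=0) - b.mean(axis=1)[:, None] + b.mean()
    # distance covariance and variances are means of elementwise products
    dcov2 = (A * B).mean()
    dvar_x = (A * A).mean()
    dvar_y = (B * B).mean()
    return np.sqrt(dcov2 / np.sqrt(dvar_x * dvar_y))

x = np.linspace(0.0, 1.0, 30)
print(distance_corr(x, x))       # identical variables: 1, up to rounding
print(distance_corr(x, -2 * x))  # perfectly linear relation: also 1
```

Unlike Pearson, the result lives in [0, 1], and 0 implies independence rather than just the absence of a linear trend.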

Measuring correlation with mutual information

from sklearn.feature_selection import mutual_info_regression
import numpy as np
import pandas as pd

np.random.seed(0)
X = np.random.normal(0, 1, (100, 3))
y = X[:, 0] + np.square(X[:, 1]) + np.random.normal(0, 0.5, 100)

mi = mutual_info_regression(X, y)
features = pd.DataFrame({'Feature': [f'X{i}' for i in range(X.shape[1])],
                         'Mutual Info': mi})
print(features.sort_values(by='Mutual Info', ascending=False))

Output:
  Feature  Mutual Info
1      X1       0.7882
0      X0       0.5445
2      X2       0.0323

Mutual information measures the dependency between two variables, capturing any kind of relationship—not just linear ones. It's a concept from information theory that quantifies how much knowing one variable reduces uncertainty about another, making it a powerful tool for feature selection.

  • The mutual_info_regression function from scikit-learn is ideal for this, especially when you suspect complex interactions in your data.
  • In the example, the target y depends on X0 linearly and X1 non-linearly. The output correctly shows that X1 and X0 share the most information with y, while X2 is unrelated.
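To see why this matters, compare Pearson correlation with mutual information on a purely quadratic relationship, sketched here with synthetic data:

```python
import numpy as np
from sklearn.feature_selection import mutual_info_regression

rng = np.random.RandomState(1)
x = rng.uniform(-2, 2, 300)
y = x ** 2  # symmetric around zero, so the linear trend cancels out

pearson = np.corrcoef(x, y)[0, 1]
mi = mutual_info_regression(x.reshape(-1, 1), y, random_state=0)[0]

# Pearson hovers near zero despite the exact dependence;
# mutual information is clearly positive
print(f"Pearson: {pearson:.4f}, Mutual information: {mi:.4f}")
```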

Move faster with Replit

Replit is an AI-powered development platform that transforms natural language into working applications. You describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.

The correlation methods in this article are powerful, and Replit Agent can help you turn them into production-ready tools. Instead of just running scripts, you can build full applications that leverage these statistical concepts.

  • Build a financial dashboard that visualizes stock price relationships using a Seaborn heatmap().
  • Create a market research tool that calculates the partial_corr() between ad spend and sales while controlling for seasonal trends.
  • Deploy a feature selection utility that uses mutual_info_regression to identify the most predictive variables in a machine learning dataset.

Take your data analysis projects from concept to completion. Describe your application, and let Replit Agent handle the coding, testing, and deployment.

Common errors and challenges

Even with powerful tools, you might run into a few common roadblocks when calculating correlation in Python, but they're all straightforward to solve.

  • Handling missing values with np.corrcoef(): The function is strict and will return nan if it encounters any missing data. You’ll need to filter out or fill in missing values before passing your arrays to the function to get a valid result.
  • Fixing shape issues with np.corrcoef(): The function also requires that all input arrays have the exact same length. If they don’t, you’ll get a ValueError, so it’s a good practice to check the shape of your arrays first.
  • Correlating categorical data with pandas: The Pandas corr() method only works on numeric columns. Depending on your pandas version, text columns are either silently dropped or raise an error. To include categorical variables in your analysis, you must first encode them into numbers.

Handling missing values with np.corrcoef()

NumPy's corrcoef() function is strict about missing data. If your arrays contain even one np.nan (Not a Number) value, the function can't perform the calculation and will return nan instead of a numeric correlation. See what happens in the code below.

import numpy as np

# Dataset with missing values
x = np.array([1, 2, np.nan, 4, 5])
y = np.array([2, 4, 6, np.nan, 10])

correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation coefficient: {correlation}")

The resulting coefficient is nan because the calculation fails when it encounters the np.nan values. To get a valid number, you first need to address the missing data, as the following code demonstrates.

import numpy as np

# Dataset with missing values
x = np.array([1, 2, np.nan, 4, 5])
y = np.array([2, 4, 6, np.nan, 10])

mask = ~(np.isnan(x) | np.isnan(y))
correlation = np.corrcoef(x[mask], y[mask])[0, 1]
print(f"Correlation coefficient: {correlation:.4f}")

The solution is to filter out the missing data before the calculation. You can create a boolean mask using np.isnan() to find where np.nan values exist in either array. By combining these checks with the | operator and inverting the result with ~, you get a mask that keeps only the complete data pairs. Applying this mask to your arrays ensures corrcoef() receives clean data, giving you a valid correlation coefficient.
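If your data already lives in pandas, an alternative worth knowing is that Series.corr() handles missing values for you:

```python
import numpy as np
import pandas as pd

x = pd.Series([1, 2, np.nan, 4, 5])
y = pd.Series([2, 4, 6, np.nan, 10])

# Series.corr keeps only the rows where both values are present,
# so no manual masking is needed
print(f"Correlation coefficient: {x.corr(y):.4f}")
```

Here the three complete pairs are perfectly linear, so the result is 1.0.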

Fixing shape issues with np.corrcoef()

The np.corrcoef() function is also strict about array shapes. It treats each input as the set of observations for one variable, so every array must have the same length. Passing arrays of different lengths will cause an error. See what happens below.

import numpy as np

x = np.array([1, 2, 3, 4, 5]) # 5 observations
y = np.array([7, 8, 9]) # only 3 observations

correlation = np.corrcoef(x, y)[0, 1]
print(correlation)

This code triggers a ValueError because np.corrcoef() tries to stack x and y as rows of a single dataset, and rows of different lengths can't be combined. Check the code below for one way to handle this.

import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([7, 8, 9])

# Align the arrays to a common length before correlating
n = min(len(x), len(y))
correlation = np.corrcoef(x[:n], y[:n])[0, 1]
print(f"Correlation over {n} paired observations: {correlation:.4f}")

To fix this, make sure every array contributes the same number of observations, so it's good practice to check your arrays' shapes before calling the function. Here the arrays are truncated to their common length, which is only appropriate when the extra values genuinely have no matching pair; otherwise, repair the data so each observation in one array lines up with one in the other. Once your arrays share a length, you can also stack several of them with np.vstack() and read any pairwise coefficient from the resulting correlation matrix.

Correlating categorical data with pandas

The Pandas corr() method is designed for numbers. Older pandas versions silently ignored non-numeric columns like text categories, and pandas 2.0 and later raise an error unless you pass numeric_only=True. Either way, you can't directly measure the relationship between categorical data and your other variables. The following code shows what happens when you try.

import pandas as pd

data = {'category': ['A', 'B', 'A', 'C', 'B'],
        'values': [10, 15, 12, 18, 20]}
df = pd.DataFrame(data)

correlation = df.corr(numeric_only=True)
print(correlation)

The resulting correlation matrix only includes the values column, because numeric_only=True tells corr() to skip the non-numeric 'category' data (older pandas versions skipped it silently by default). This leaves out a key part of your analysis. See how to include it in the next example.

import pandas as pd

data = {'category': ['A', 'B', 'A', 'C', 'B'],
        'values': [10, 15, 12, 18, 20]}
df = pd.DataFrame(data)

df_dummies = pd.get_dummies(df['category'], dtype=int)
df_combined = pd.concat([df_dummies, df['values']], axis=1)

correlation = df_combined.corr()
print(correlation)

To solve this, you must convert text categories into numbers—a process known as one-hot encoding. This is essential whenever your analysis includes non-numeric data. The pd.get_dummies() function handles this transformation for you.

  • It creates new columns for each category, marking rows with a 1 or 0.
  • You then combine these new columns with your original data, allowing corr() to analyze the relationships between the categories and your numeric values.

Real-world applications

Beyond the code and common errors, correlation analysis shines in real-world applications like finance and bioinformatics.

Analyzing stock market correlations with corrcoef()

Understanding how different stocks move in relation to one another is a cornerstone of portfolio management, and NumPy's corrcoef() function offers a straightforward way to quantify this relationship.

import numpy as np
import pandas as pd

# Sample daily returns for two stocks (percentages)
stock_a = np.array([0.5, -0.2, 1.1, -0.9, 0.3, 0.7, -0.5, 1.2])
stock_b = np.array([0.3, -0.1, 0.8, -0.7, 0.2, 0.5, -0.3, 0.9])

# Calculate correlation between daily returns
correlation = np.corrcoef(stock_a, stock_b)[0, 1]
print(f"Correlation between Stock A and Stock B returns: {correlation:.4f}")

This example demonstrates how to measure the relationship between the daily returns of two different stocks. The returns for stock_a and stock_b are stored in NumPy arrays, representing their day-to-day performance changes.

  • The np.corrcoef() function is then applied to these arrays.
  • It calculates a coefficient that quantifies whether the stocks tend to rise and fall together.

A high positive value suggests they move in sync, which is a key insight for financial analysis. The final output is formatted to four decimal places for precision.
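In practice, market data usually arrives as price levels rather than returns. Correlating raw prices tends to overstate relationships, because most prices trend together over time; converting to returns first is the standard move. A sketch with hypothetical closing prices:

```python
import numpy as np

# hypothetical closing prices for two stocks
prices_a = np.array([100.0, 100.5, 100.3, 101.4, 100.5, 101.2])
prices_b = np.array([50.0, 50.2, 50.1, 50.5, 50.2, 50.6])

# simple daily returns: (today - yesterday) / yesterday
returns_a = np.diff(prices_a) / prices_a[:-1]
returns_b = np.diff(prices_b) / prices_b[:-1]

correlation = np.corrcoef(returns_a, returns_b)[0, 1]
print(f"Correlation of daily returns: {correlation:.4f}")
```

Note that np.diff() shortens each series by one element, so six prices yield five returns.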

Finding correlated genes with pd.DataFrame.corr()

In bioinformatics, the Pandas corr() method helps you sift through complex gene expression data to find genes that might be functionally related by measuring how their activity levels move together.

import pandas as pd
import numpy as np

# Sample gene expression data (rows=samples, columns=genes)
np.random.seed(42)
gene_data = pd.DataFrame(np.random.normal(0, 1, (20, 5)),
columns=['Gene_A', 'Gene_B', 'Gene_C', 'Gene_D', 'Gene_E'])

# Add correlated gene
gene_data['Gene_F'] = gene_data['Gene_A'] * 0.8 + np.random.normal(0, 0.3, 20)

# Find the most correlated gene pair
corr_matrix = gene_data.corr().abs()
np.fill_diagonal(corr_matrix.values, 0)
max_corr = corr_matrix.max().max()
max_corr_genes = np.where(corr_matrix == max_corr)
gene1, gene2 = corr_matrix.columns[max_corr_genes[0][0]], corr_matrix.columns[max_corr_genes[1][0]]

print(f"Most correlated genes: {gene1} and {gene2} (r={max_corr:.4f})")

This code simulates gene expression data to identify the most strongly related pair. It starts by creating a DataFrame of random data, then intentionally adds a new gene, Gene_F, whose values are derived from Gene_A. This setup guarantees a high correlation for the script to find.

  • The corr().abs() method computes the correlation matrix and uses the absolute value to find the strongest relationship, positive or negative.
  • np.fill_diagonal() zeroes out the diagonal to ensure a gene isn’t matched with itself, since self-correlation is always perfect.
  • Finally, it locates the maximum value to pinpoint the two most correlated genes.
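A more compact way to locate the strongest pair is to flatten the matrix with stack(), which pandas indexes by (row, column) label pairs. This sketch reuses the same idea on smaller synthetic data:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
genes = pd.DataFrame(np.random.normal(0, 1, (20, 3)),
                     columns=['G1', 'G2', 'G3'])
# G4 is deliberately derived from G1, so they should be the top pair
genes['G4'] = genes['G1'] * 0.8 + np.random.normal(0, 0.3, 20)

corr = genes.corr().abs()
np.fill_diagonal(corr.values, 0)

# stack() turns the matrix into a Series indexed by (row, col) pairs,
# so idxmax() hands back the labels of the strongest pair directly
gene1, gene2 = corr.stack().idxmax()
print(f"Most correlated pair: {gene1} and {gene2}")
```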

Get started with Replit

Turn your knowledge into a real tool. Tell Replit Agent to "build a dashboard that visualizes stock correlations with a Seaborn heatmap" or "create a tool that calculates partial correlation for a CSV upload."

Replit Agent writes the code, tests for errors, and deploys your application. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
