How to find correlation in Python
Learn how to find correlation in Python. Discover different methods, tips, real-world applications, and how to debug common errors.

Correlation analysis in Python uncovers relationships within your data. Python's libraries offer powerful tools to measure how variables move together, a fundamental concept in statistics and data science.
You'll learn several techniques to calculate correlation, with practical tips and real-world applications. You'll also find common debugging advice to help you confidently apply these methods in your projects.
Using NumPy's corrcoef() function
import numpy as np
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 6, 8, 10])
correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation coefficient: {correlation:.4f}")

--OUTPUT--

Correlation coefficient: 1.0000
The corrcoef() function is a quick way to get the job done, but it’s important to know it returns a correlation matrix, not a single value. This matrix shows how each input array correlates with every other array passed to the function.
That’s why you need to access the specific coefficient you want using an index like [0, 1]. This selects the correlation between the first array (at index 0) and the second array (at index 1). The diagonal elements of the matrix, such as at [0, 0] or [1, 1], will always be 1 because they represent a variable's perfect correlation with itself.
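To see the full matrix, here is a small sketch. It uses a second array that is not perfectly correlated with the first, so the off-diagonal value differs from the diagonal:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 1, 4, 3, 5])  # correlated with x, but not perfectly

# corrcoef returns the full 2x2 matrix: [0, 1] and [1, 0] hold the
# cross-correlation, while the diagonal entries are always 1.0
matrix = np.corrcoef(x, y)
print(matrix)
```

The matrix is symmetric, so `matrix[0, 1]` and `matrix[1, 0]` are the same number, which is why indexing either one works.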
Common correlation techniques
While corrcoef() is a solid starting point, Python's data science stack offers more robust tools for handling complex datasets and visualizing relationships.
Using Pandas corr() method
import pandas as pd
data = {'x': [1, 2, 3, 4, 5],
        'y': [2, 4, 6, 8, 10],
        'z': [5, 4, 3, 2, 1]}
df = pd.DataFrame(data)
correlation_matrix = df.corr()
print(correlation_matrix)

--OUTPUT--

          x         y         z
x  1.000000  1.000000 -1.000000
y  1.000000  1.000000 -1.000000
z -1.000000 -1.000000  1.000000
When your data is in a Pandas DataFrame, using the built-in corr() method is often more convenient. You can call it directly on your DataFrame, and it automatically computes the correlation between all numeric columns, returning a new DataFrame that serves as a correlation matrix.
- This approach is great for quickly exploring relationships in tabular data without needing to isolate individual columns.
- As shown in the output, x and y have a perfect positive correlation (1.0), while x and z have a perfect negative correlation (-1.0).
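The corr() method also accepts a method argument ('pearson' by default, plus 'spearman' and 'kendall'), which matters when a relationship is monotonic but not linear. A quick sketch:

```python
import pandas as pd

# y = x**2: strictly increasing with x, but not a straight line
df = pd.DataFrame({'x': [1, 2, 3, 4, 5],
                   'y': [1, 4, 9, 16, 25]})

pearson = df.corr(method='pearson').loc['x', 'y']
spearman = df.corr(method='spearman').loc['x', 'y']
print(f"Pearson: {pearson:.4f}, Spearman: {spearman:.4f}")
```

Spearman works on ranks, so it reports a perfect 1.0 for this curved relationship while Pearson falls slightly short of 1.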
Using SciPy for different correlation methods
from scipy import stats
x = [1, 2, 3, 4, 5]
y = [2, 4, 6, 8, 10]
pearson, p_value = stats.pearsonr(x, y)
spearman, _ = stats.spearmanr(x, y)
kendall, _ = stats.kendalltau(x, y)
print(f"Pearson: {pearson:.4f}, Spearman: {spearman:.4f}, Kendall: {kendall:.4f}")

--OUTPUT--

Pearson: 1.0000, Spearman: 1.0000, Kendall: 1.0000
SciPy’s stats module gives you more statistical firepower by providing functions for different types of correlation. This is useful when you need to go beyond the standard linear correlation and choose a method that better suits your data.
- stats.pearsonr is for linear relationships and conveniently returns a p-value, which helps determine if the correlation is statistically significant.
- stats.spearmanr and stats.kendalltau measure monotonic relationships. They check if one variable tends to move in the same direction as another, even if the relationship isn't linear.
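The p-value tells you how likely a correlation this strong would be if the variables were actually unrelated. A sketch with a small, noisy sample:

```python
from scipy import stats

x = [1, 2, 3, 4, 5, 6, 7, 8]
y = [2, 1, 4, 3, 6, 5, 8, 7]  # noisy, but clearly increasing with x

r, p = stats.pearsonr(x, y)
# A small p-value (conventionally below 0.05) suggests the correlation
# is unlikely to be a fluke of this particular sample
print(f"r = {r:.3f}, p = {p:.4f}")
```

With only eight observations, even a strong r needs a small p-value before you should trust it; larger samples make the p-value more forgiving for the same r.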
Visualizing correlation with Seaborn
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
data = pd.DataFrame({
    'A': [1, 2, 3, 4, 5],
    'B': [2, 4, 5, 4, 5],
    'C': [10, 8, 6, 4, 2]
})
sns.heatmap(data.corr(), annot=True, cmap='coolwarm')
plt.title('Correlation Matrix')
plt.show()

--OUTPUT--

[Heatmap visualization of correlation matrix between variables A, B, and C]
Visualizing data makes complex relationships much easier to grasp. Seaborn's heatmap() function transforms a correlation matrix into a color-coded grid, offering an intuitive way to spot patterns instantly.
- The annot=True argument is essential, as it overlays the numerical correlation values directly onto the chart.
- A colormap like 'coolwarm' helps you distinguish between positive (warm colors) and negative (cool colors) correlations at a glance.
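A correlation matrix is symmetric, so every pair appears twice. One common refinement, sketched here with the same data, is to mask the redundant upper triangle:

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.DataFrame({'A': [1, 2, 3, 4, 5],
                     'B': [2, 4, 5, 4, 5],
                     'C': [10, 8, 6, 4, 2]})

# heatmap() hides cells where the mask is True; np.triu marks the upper
# triangle, including the uninformative all-1.0 diagonal
mask = np.triu(np.ones_like(data.corr(), dtype=bool))
sns.heatmap(data.corr(), mask=mask, annot=True, cmap='coolwarm')
plt.title('Correlation Matrix (lower triangle)')
plt.show()
```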
Advanced correlation analysis
When simple correlations don't tell the whole story, advanced techniques can help you untangle confounding variables and capture complex non-linear relationships.
Computing partial correlation
import pingouin as pg
import pandas as pd
data = pd.DataFrame({
    'x': [1, 2, 3, 4, 5, 6, 7, 8],
    'y': [3, 5, 7, 9, 8, 7, 5, 4],
    'z': [5, 4, 3, 2, 3, 4, 5, 6]
})
partial_corr = pg.partial_corr(data=data, x='x', y='y', covar='z')
print(partial_corr)

--OUTPUT--

         n     r        CI95%   p-val
pearson  8  0.97  [0.82, 1.0]  0.0002
Partial correlation measures the relationship between two variables, like x and y, after removing the influence of a third variable, z. It’s useful when you suspect a confounding variable is distorting the true connection you want to analyze.
The pingouin library simplifies this with its partial_corr() function.
- You specify the two variables of interest (x and y) and the variable to control for using the covar parameter.
- The function returns a detailed statistical summary, giving you a clearer picture of the direct relationship.
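Under the hood, a Pearson partial correlation is just the plain correlation of residuals: regress x on z, regress y on z, and correlate what's left over. A NumPy-only sketch of that definition, using the same data:

```python
import numpy as np

x = np.array([1, 2, 3, 4, 5, 6, 7, 8], dtype=float)
y = np.array([3, 5, 7, 9, 8, 7, 5, 4], dtype=float)
z = np.array([5, 4, 3, 2, 3, 4, 5, 6], dtype=float)

def residuals(a, b):
    # What remains of a after removing the best straight-line fit on b
    slope, intercept = np.polyfit(b, a, 1)
    return a - (slope * b + intercept)

partial_r = np.corrcoef(residuals(x, z), residuals(y, z))[0, 1]
print(f"Partial correlation of x and y given z: {partial_r:.4f}")
```

Notice how the plain correlation of x and y on this data is weak, but once z's influence is stripped out the direct relationship is strong: z was masking it.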
Using distance correlation for non-linear relationships
import numpy as np
from dcor import distance_correlation
np.random.seed(0)
x = np.random.uniform(0, 1, 30)
y = np.sin(2 * np.pi * x) + 0.1 * np.random.normal(0, 1, 30)
pearson = np.corrcoef(x, y)[0, 1]
dcor = distance_correlation(x, y)
print(f"Pearson: {pearson:.4f}, Distance correlation: {dcor:.4f}")

--OUTPUT--

Pearson: 0.0704, Distance correlation: 0.5371
Distance correlation is a powerful tool for detecting both linear and non-linear relationships between variables. Standard Pearson correlation can be misleading when variables follow a complex pattern—like the sine wave created in the code—because it only looks for straight-line trends.
- Notice how the Pearson coefficient is close to zero, incorrectly suggesting no link between x and y.
- In contrast, distance_correlation from the dcor library successfully identifies the connection, yielding a significantly higher value. This makes it essential when you suspect your data's relationship isn't linear.
Measuring correlation with mutual information
from sklearn.feature_selection import mutual_info_regression
import numpy as np
import pandas as pd
np.random.seed(0)
X = np.random.normal(0, 1, (100, 3))
y = X[:, 0] + np.square(X[:, 1]) + np.random.normal(0, 0.5, 100)
mi = mutual_info_regression(X, y)
features = pd.DataFrame({'Feature': [f'X{i}' for i in range(X.shape[1])],
                         'Mutual Info': mi})
print(features.sort_values(by='Mutual Info', ascending=False))

--OUTPUT--

  Feature  Mutual Info
1      X1       0.7882
0      X0       0.5445
2      X2       0.0323
Mutual information measures the dependency between two variables, capturing any kind of relationship—not just linear ones. It's a concept from information theory that quantifies how much knowing one variable reduces uncertainty about another, making it a powerful tool for feature selection.
- The mutual_info_regression function from scikit-learn is ideal for this, especially when you suspect complex interactions in your data.
- In the example, the target y depends on X0 linearly and X1 non-linearly. The output correctly shows that X1 and X0 share the most information with y, while X2 is unrelated.
Move faster with Replit
Replit is an AI-powered development platform that transforms natural language into working applications. You describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.
The correlation methods in this article are powerful, and Replit Agent can help you turn them into production-ready tools. Instead of just running scripts, you can build full applications that leverage these statistical concepts.
- Build a financial dashboard that visualizes stock price relationships using a Seaborn heatmap().
- Create a market research tool that calculates the partial_corr() between ad spend and sales while controlling for seasonal trends.
- Deploy a feature selection utility that uses mutual_info_regression to identify the most predictive variables in a machine learning dataset.
Take your data analysis projects from concept to completion. Describe your application, and let Replit Agent handle the coding, testing, and deployment.
Common errors and challenges
Even with powerful tools, you might run into a few common roadblocks when calculating correlation in Python, but they're all straightforward to solve.
- Handling missing values with np.corrcoef(): The function is strict and will return nan if it encounters any missing data. You'll need to filter out or fill in missing values before passing your arrays to the function to get a valid result.
- Fixing shape issues with np.corrcoef(): The function also requires that all input arrays have the exact same length. If they don't, you'll get a ValueError, so it's a good practice to check the shape of your arrays first.
- Correlating categorical data with pandas: The Pandas corr() method only works on numeric columns and will ignore text-based data by default. To include categorical variables in your analysis, you must first encode them into numbers.
Handling missing values with np.corrcoef()
NumPy's corrcoef() function is strict about missing data. If your arrays contain even one np.nan (Not a Number) value, the function can't perform the calculation and will return nan instead of a numeric correlation. See what happens in the code below.
import numpy as np
# Dataset with missing values
x = np.array([1, 2, np.nan, 4, 5])
y = np.array([2, 4, 6, np.nan, 10])
correlation = np.corrcoef(x, y)[0, 1]
print(f"Correlation coefficient: {correlation}")
The resulting coefficient is nan because the calculation fails when it encounters the np.nan values. To get a valid number, you first need to address the missing data, as the following code demonstrates.
import numpy as np
# Dataset with missing values
x = np.array([1, 2, np.nan, 4, 5])
y = np.array([2, 4, 6, np.nan, 10])
mask = ~(np.isnan(x) | np.isnan(y))
correlation = np.corrcoef(x[mask], y[mask])[0, 1]
print(f"Correlation coefficient: {correlation:.4f}")
The solution is to filter out the missing data before the calculation. You can create a boolean mask using np.isnan() to find where np.nan values exist in either array. By combining these checks with the | operator and inverting the result with ~, you get a mask that keeps only the complete data pairs. Applying this mask to your arrays ensures corrcoef() receives clean data, giving you a valid correlation coefficient.
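If your data is already in pandas, there is a convenient alternative: Series.corr() drops incomplete pairs for you.

```python
import numpy as np
import pandas as pd

x = np.array([1, 2, np.nan, 4, 5])
y = np.array([2, 4, 6, np.nan, 10])

# Series.corr() silently excludes any pair where either value is NaN,
# leaving the three complete pairs (1, 2), (2, 4), and (5, 10)
correlation = pd.Series(x).corr(pd.Series(y))
print(f"Correlation coefficient: {correlation:.4f}")
```

This is equivalent to the mask approach above, just with the bookkeeping handled by pandas.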
Fixing shape issues with np.corrcoef()
The np.corrcoef() function is also strict about shapes: every variable you pass in must contain the same number of observations. Mixing arrays whose lengths don't line up, like a 2D array of three-observation rows alongside a four-element 1D array, will cause an error. See what happens below.
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6]])  # 2D array, 3 observations per row
y = np.array([7, 8, 9, 10])  # 1D array with 4 observations
correlation = np.corrcoef(x, y)[0, 1]
print(correlation)
This code triggers a ValueError because np.corrcoef() tries to stack x and y as rows of one dataset, and it can't combine rows with different numbers of observations. Check the code below for the correct way to handle this.
import numpy as np
x = np.array([[1, 2, 3], [4, 5, 6]])  # 2D array
y = np.array([7, 8, 9])  # 1D array, same length as each row of x
data = np.vstack((x[0], x[1], y))
correlation_matrix = np.corrcoef(data)
print("Correlation between x[0] and y:", correlation_matrix[0, 2])
print("Correlation between x[1] and y:", correlation_matrix[1, 2])
To fix this, combine your arrays into a single 2D array before calculating the correlation, making sure every row has the same number of observations. The np.vstack() function is perfect for this: it stacks the arrays vertically, treating each one as a row in a new dataset. You can then call np.corrcoef() on this unified array and access the specific correlations you need from the resulting matrix, like the correlation between x[0] and y at index [0, 2].
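A related detail worth knowing: corrcoef() treats each row of a 2D input as a variable. If your variables live in columns instead, pass rowvar=False rather than transposing by hand. A quick sketch:

```python
import numpy as np

# 4 observations (rows) of 2 variables (columns); the second column
# is exactly twice the first, so the variables correlate perfectly
data = np.array([[1, 2],
                 [2, 4],
                 [3, 6],
                 [4, 8]])

matrix = np.corrcoef(data, rowvar=False)
print(matrix)  # a 2x2 matrix (one entry per variable pair), not 4x4
```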
Correlating categorical data with pandas
The Pandas corr() method is designed for numbers. Older pandas versions silently ignore non-numeric columns like text categories, while pandas 2.0 and later raise an error unless you pass numeric_only=True. Either way, you can't directly measure the relationship between categorical data and your other variables. The following code shows what happens when you try.
import pandas as pd
data = {'category': ['A', 'B', 'A', 'C', 'B'],
        'values': [10, 15, 12, 18, 20]}
df = pd.DataFrame(data)
correlation = df.corr(numeric_only=True)  # required in pandas 2.0+ to skip text columns
print(correlation)
The resulting correlation matrix only includes the values column, since the non-numeric category data is skipped. This leaves out a key part of your analysis. See how to include it in the next example.
import pandas as pd
data = {'category': ['A', 'B', 'A', 'C', 'B'],
        'values': [10, 15, 12, 18, 20]}
df = pd.DataFrame(data)
df_dummies = pd.get_dummies(df['category'], dtype=int)  # dtype=int gives explicit 0/1 columns
df_combined = pd.concat([df_dummies, df['values']], axis=1)
correlation = df_combined.corr()
print(correlation)
To solve this, you must convert text categories into numbers—a process known as one-hot encoding. This is essential whenever your analysis includes non-numeric data. The pd.get_dummies() function handles this transformation for you.
- It creates new columns for each category, marking rows with a 1 or 0.
- You then combine these new columns with your original data, allowing corr() to analyze the relationships between the categories and your numeric values.
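When a category has only two levels, you can skip the dummy matrix entirely: a single 0/1 column is enough, and the resulting Pearson r is known as the point-biserial correlation. A sketch with hypothetical two-group data:

```python
import pandas as pd

df = pd.DataFrame({'group': ['A', 'B', 'A', 'B', 'B'],
                   'values': [10, 15, 12, 18, 20]})

# Encode the two-level category as 0/1, then correlate as usual
df['group_code'] = (df['group'] == 'B').astype(int)
r = df['group_code'].corr(df['values'])
print(f"Point-biserial correlation: {r:.4f}")
```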
Real-world applications
Beyond the code and common errors, correlation analysis shines in real-world applications like finance and bioinformatics.
Analyzing stock market correlations with corrcoef()
Understanding how different stocks move in relation to one another is a cornerstone of portfolio management, and NumPy's corrcoef() function offers a straightforward way to quantify this relationship.
import numpy as np
# Sample daily returns for two stocks (percentages)
stock_a = np.array([0.5, -0.2, 1.1, -0.9, 0.3, 0.7, -0.5, 1.2])
stock_b = np.array([0.3, -0.1, 0.8, -0.7, 0.2, 0.5, -0.3, 0.9])
# Calculate correlation between daily returns
correlation = np.corrcoef(stock_a, stock_b)[0, 1]
print(f"Correlation between Stock A and Stock B returns: {correlation:.4f}")
This example demonstrates how to measure the relationship between the daily returns of two different stocks. The returns for stock_a and stock_b are stored in NumPy arrays, representing their day-to-day performance changes.
- The np.corrcoef() function is then applied to these arrays.
- It calculates a coefficient that quantifies whether the stocks tend to rise and fall together.
A high positive value suggests they move in sync, which is a key insight for financial analysis. The final output is formatted to four decimal places for precision.
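In practice, stock relationships drift over time, so analysts often track a rolling correlation rather than a single number. A sketch using the same return series and a hypothetical 4-day window:

```python
import pandas as pd

returns = pd.DataFrame({
    'stock_a': [0.5, -0.2, 1.1, -0.9, 0.3, 0.7, -0.5, 1.2],
    'stock_b': [0.3, -0.1, 0.8, -0.7, 0.2, 0.5, -0.3, 0.9],
})

# Correlation over each trailing 4-day window; the first three entries
# are NaN because a full window is not yet available
rolling_corr = returns['stock_a'].rolling(window=4).corr(returns['stock_b'])
print(rolling_corr.round(4))
```

A sudden drop in the rolling series would flag a regime change that a single full-sample coefficient hides.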
Finding correlated genes with pd.DataFrame.corr()
In bioinformatics, the Pandas corr() method helps you sift through complex gene expression data to find genes that might be functionally related by measuring how their activity levels move together.
import pandas as pd
import numpy as np
# Sample gene expression data (rows=samples, columns=genes)
np.random.seed(42)
gene_data = pd.DataFrame(np.random.normal(0, 1, (20, 5)),
                         columns=['Gene_A', 'Gene_B', 'Gene_C', 'Gene_D', 'Gene_E'])
# Add correlated gene
gene_data['Gene_F'] = gene_data['Gene_A'] * 0.8 + np.random.normal(0, 0.3, 20)
# Find the most correlated gene pair
corr_matrix = gene_data.corr().abs()
np.fill_diagonal(corr_matrix.values, 0)
max_corr = corr_matrix.max().max()
max_corr_genes = np.where(corr_matrix == max_corr)
gene1, gene2 = corr_matrix.columns[max_corr_genes[0][0]], corr_matrix.columns[max_corr_genes[1][0]]
print(f"Most correlated genes: {gene1} and {gene2} (r={max_corr:.4f})")
This code simulates gene expression data to identify the most strongly related pair. It starts by creating a DataFrame of random data, then intentionally adds a new gene, Gene_F, whose values are derived from Gene_A. This setup guarantees a high correlation for the script to find.
- The corr().abs() method computes the correlation matrix and uses the absolute value to find the strongest relationship, positive or negative.
- np.fill_diagonal() zeroes out the diagonal to ensure a gene isn't matched with itself, since self-correlation is always perfect.
- Finally, it locates the maximum value to pinpoint the two most correlated genes.
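An alternative idiom for the same search: keep only the upper triangle of the matrix, then stack() it into a Series you can sort, which also makes it easy to list the top several pairs at once. A sketch with the same simulated data:

```python
import numpy as np
import pandas as pd

np.random.seed(42)
gene_data = pd.DataFrame(np.random.normal(0, 1, (20, 5)),
                         columns=['Gene_A', 'Gene_B', 'Gene_C', 'Gene_D', 'Gene_E'])
gene_data['Gene_F'] = gene_data['Gene_A'] * 0.8 + np.random.normal(0, 0.3, 20)

corr = gene_data.corr().abs()
# where() keeps only the strict upper triangle (k=1 excludes the diagonal),
# so each gene pair appears exactly once; stack() drops the NaN cells
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack().sort_values(ascending=False)
print(pairs.head(3))
```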
Get started with Replit
Turn your knowledge into a real tool. Tell Replit Agent to "build a dashboard that visualizes stock correlations with a Seaborn heatmap" or "create a tool that calculates partial correlation for a CSV upload."
Replit Agent writes the code, tests for errors, and deploys your application. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.