How to plot a linear regression in Python

Learn how to plot linear regression in Python. Discover different methods, tips, real-world applications, and how to debug common errors.

Published on: Tue, Mar 3, 2026
Updated on: Thu, Mar 5, 2026
The Replit Team

A linear regression plot visualizes the relationship between variables, a key step in data analysis. Python offers powerful libraries to create these plots with clarity and precision for any dataset.

Here, you'll explore different techniques to create these plots effectively. The article covers practical tips, shows real-world applications, and provides advice to help you debug common errors and refine your visualizations.

Basic linear regression plot with NumPy and Matplotlib

import numpy as np
import matplotlib.pyplot as plt

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3.5, 5, 6.2, 7.5])
m, b = np.polyfit(x, y, 1)
plt.scatter(x, y)
plt.plot(x, m*x + b, color='red')
plt.show()

Output: a scatter plot with blue dots representing the data points and a red line showing the linear regression fit.

This approach combines NumPy’s calculation power with Matplotlib’s visualization tools. The core of the regression is the np.polyfit(x, y, 1) function. It computes the slope (m) and intercept (b) for a line of best fit. The final argument, 1, specifies a first-degree polynomial, which is simply a straight line.

Once you have the slope and intercept, you can visualize the results. First, plt.scatter() plots your original data points. Then, plt.plot(x, m*x + b) draws the regression line by applying the calculated coefficients to the x-values, effectively overlaying the trend on your data.

Common libraries for regression visualization

While the NumPy and Matplotlib approach is fundamental, libraries like pandas, seaborn, and scikit-learn offer more direct and powerful methods for regression plotting.

Using pandas for linear regression plots

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 3.5, 5, 6.2, 7.5]})
m, b = np.polyfit(df.x, df.y, 1)
plt.scatter(df.x, df.y)
plt.plot(df.x, m * df.x + b, color='green')
plt.xlabel('X values')
plt.ylabel('Y values')
plt.show()

Output: a scatter plot with data points and a green regression line, with labeled X and Y axes.

Using pandas organizes your data into a DataFrame, a common practice that simplifies data handling. From there, you can plot columns directly with matplotlib, as seen with plt.scatter(df.x, df.y).

  • np.polyfit() accepts DataFrame columns just as it does plain arrays, so the fit is computed from the data rather than hardcoded.
  • This example also adds clarity by labeling the axes using plt.xlabel() and plt.ylabel().

Creating regression plots with seaborn

import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3.5, 5, 6.2, 7.5])
sns.regplot(x=x, y=y, line_kws={"color": "purple"})
plt.title("Linear Regression with Seaborn")
plt.show()

Output: a scatter plot with data points, a purple regression line, and a shaded confidence interval region.

Seaborn streamlines regression plotting with its regplot() function. It’s a high-level tool that combines the scatter plot and regression line fitting into a single command, so you don't need to calculate the slope and intercept yourself.

  • The function automatically draws both the data points and the line of best fit.
  • A key feature is the shaded confidence interval it adds around the regression line, which visualizes the uncertainty in the model's fit.
  • You can easily style the line using the line_kws parameter to pass a dictionary of keyword arguments, such as color.
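Beyond line_kws, regplot() accepts a few other useful knobs: ci controls the confidence band and scatter_kws styles the points. A minimal sketch (the specific style values here are arbitrary choices, not recommendations):

```python
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3.5, 5, 6.2, 7.5])

# ci=None hides the confidence band; scatter_kws and line_kws
# pass styling through to the underlying matplotlib calls
ax = sns.regplot(x=x, y=y, ci=None,
                 scatter_kws={"s": 60, "alpha": 0.7},
                 line_kws={"color": "purple", "linestyle": "--"})
ax.set_title("Styled regplot without confidence band")
plt.show()
```

Turning the confidence band off is handy when the shading distracts from the trend itself, for example on dense datasets.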

Using scikit-learn for regression visualization

from sklearn.linear_model import LinearRegression
import numpy as np
import matplotlib.pyplot as plt

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 3.5, 5, 6.2, 7.5])
model = LinearRegression().fit(X, y)
plt.scatter(X, y)
plt.plot(X, model.predict(X), color='orange')
plt.text(1, 7, f'R² = {model.score(X, y):.3f}')
plt.show()

Output: a scatter plot with data points, an orange regression line, and an R-squared value displayed.

scikit-learn frames regression as a machine learning task. It’s a powerful library where you first create and train a LinearRegression model using the .fit(X, y) method. This prepares the model to make predictions from your data.

  • Your feature data X must be reshaped with .reshape(-1, 1), as scikit-learn expects a 2D array.
  • The regression line is drawn using model.predict(X), which applies the trained model to generate the line's points.
  • The .score() method conveniently calculates the R-squared value—a metric showing how well the line fits the data—which is then displayed on the plot.
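After fitting, the slope and intercept are available directly as model attributes, which is handy for annotating plots or reporting the regression equation. A sketch using the same data as above:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 3.5, 5, 6.2, 7.5])
model = LinearRegression().fit(X, y)

slope = model.coef_[0]        # one entry per feature; here just one
intercept = model.intercept_
r_squared = model.score(X, y)
print(f"y = {slope:.2f}x + {intercept:.2f}, R² = {r_squared:.3f}")
```

These attributes are what you would embed in a plt.text() annotation if you want the full equation, not just R², shown on the plot.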

Advanced regression plotting techniques

With the fundamentals covered, you're ready to tackle more complex visualizations, such as plotting multiple variables, showing uncertainty, and building interactive regression plots.

Visualizing multiple regression with 3D plots

import numpy as np
import matplotlib.pyplot as plt
from mpl_toolkits.mplot3d import Axes3D
from sklearn.linear_model import LinearRegression

x1 = np.random.rand(100)
x2 = np.random.rand(100)
y = 2*x1 + 3*x2 + np.random.randn(100)*0.5
X = np.column_stack((x1, x2))
model = LinearRegression().fit(X, y)
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
ax.scatter(x1, x2, y)
x1_range = np.linspace(0, 1, 10)
x2_range = np.linspace(0, 1, 10)
X1, X2 = np.meshgrid(x1_range, x2_range)
Z = model.predict(np.column_stack((X1.ravel(), X2.ravel()))).reshape(X1.shape)
ax.plot_surface(X1, X2, Z, alpha=0.3)
plt.show()

Output: a 3D scatter plot with data points and a semi-transparent surface representing the multiple regression plane.

When your outcome depends on two variables, you move from a regression line to a regression plane. This code visualizes that relationship in 3D using matplotlib and scikit-learn. A LinearRegression model is trained on two independent variables (x1, x2) to predict a dependent one (y).

  • First, ax.scatter() plots the raw data points in 3D space.
  • Then, np.meshgrid() creates a coordinate grid, and model.predict() calculates the corresponding Z-values to define the regression plane.
  • Finally, ax.plot_surface() draws this plane over the data.

Adding confidence intervals to regression plots

import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3.5, 5, 6.2, 7.5])
slope, intercept, r_value, p_value, std_err = stats.linregress(x, y)
y_pred = intercept + slope * x
plt.scatter(x, y)
plt.plot(x, y_pred, 'r-')
plt.fill_between(x, y_pred - std_err*2, y_pred + std_err*2, alpha=0.2)
plt.show()

Output: a scatter plot with data points, a red regression line, and a light red shaded region representing the confidence interval.

This approach uses SciPy to visualize the uncertainty in your regression model. It's a quick way to show roughly how much the fitted line might vary.

  • The stats.linregress() function is the workhorse here. It returns several statistical values, including std_err, the standard error of the estimated slope.
  • You then use plt.fill_between() to draw a shaded band around the regression line. Keep in mind that y_pred ± 2*std_err is only a rough visual approximation: an exact confidence interval also accounts for the intercept's uncertainty and widens as you move away from the mean of x.

Creating interactive regression plots with plotly

import plotly.express as px
import pandas as pd
import numpy as np

# Create sample data
np.random.seed(42)
x = np.arange(1, 101)
y = 2*x + 10*np.random.randn(100)
df = pd.DataFrame({'x': x, 'y': y})

# Create interactive regression plot
fig = px.scatter(df, x='x', y='y', trendline='ols',
               trendline_color_override='red')
fig.update_layout(title='Interactive Linear Regression')
fig.show()

Output: an interactive scatter plot with data points and a red regression line, with hover capabilities showing point values.

Plotly Express makes creating interactive visualizations incredibly straightforward. The px.scatter() function handles both the scatter plot and the regression line in a single command, which is a significant time saver.

  • The key is the trendline='ols' argument. It tells Plotly to automatically compute and draw an Ordinary Least Squares regression line.
  • The resulting plot is fully interactive. You can hover over data points to see their values, making it perfect for data exploration.

Move faster with Replit

Replit is an AI-powered development platform that transforms natural language into working applications. Describe what you want to build, and Replit Agent creates it—complete with databases, APIs, and deployment.

For the regression techniques covered in this article, Replit Agent can turn them into production-ready tools.

  • Build a financial forecasting tool that uses linear regression to predict stock prices from historical data and visualizes the trendline.
  • Create a sales dashboard that plots monthly revenue and displays a regression line with confidence intervals to project future growth.
  • Deploy a scientific analysis utility that generates interactive 3D regression plots for researchers to explore relationships between multiple experimental variables.

You can turn any of these concepts into a working application. Try Replit Agent by describing your idea, and it will write, test, and deploy the code for you.

Common errors and challenges

Plotting regression models can be tricky, but most errors have straightforward fixes you can master quickly.

Dealing with NaN values in regression data

Missing data, often represented as NaN (Not a Number), can stop a regression analysis in its tracks. Functions like np.polyfit() or LinearRegression().fit() can't operate on incomplete datasets, which will usually raise an error.

  • You can remove rows with missing values using the dropna() method in pandas.
  • Alternatively, you could fill them with a calculated value, such as the column’s mean or median, using fillna().
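Both options are one-liners in pandas. A quick sketch on a small frame with gaps:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    'x': [1, 2, np.nan, 4, 5],
    'y': [2, np.nan, 5, 6.2, 7.5]
})

# Option 1: drop any row containing a missing value
dropped = data.dropna()

# Option 2: fill gaps with each column's mean
filled = data.fillna(data.mean())

print(len(dropped))                # rows that survived dropna
print(filled.isna().sum().sum())   # remaining NaNs after fillna
```

Dropping rows discards information but keeps the data honest; mean-filling keeps every row but can flatten the trend, so choose based on how much data you can afford to lose.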

Correcting array shapes for sklearn regression models

When using scikit-learn, you might encounter a ValueError because your data has the wrong shape. The library’s models expect the feature data, X, to be a 2D array, even if you only have one feature. A 1D array or pandas Series won't work on its own.

The fix is to reshape your data. Calling .reshape(-1, 1) on your feature array converts it into a single-column 2D array, satisfying scikit-learn's input requirements and allowing the model to train correctly.

Fixing axis limits for proper regression visualization

Sometimes your plot's axes might not adjust properly, cutting off data points or extending the regression line awkwardly beyond your data's range. This can make the visualization confusing or misleading. You can take control by setting the axis boundaries yourself.

Use matplotlib functions like plt.xlim() and plt.ylim() after creating your plot. This lets you define the exact visual window, ensuring your data and trendline are framed clearly and effectively.

Dealing with NaN values in regression data

Missing data, or NaN values, are a common roadblock in regression analysis. Most libraries can't perform calculations on incomplete datasets, which will cause the code to fail. The example below shows what happens when np.polyfit() encounters NaN values.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Dataset with missing values
data = pd.DataFrame({
   'x': [1, 2, np.nan, 4, 5],
   'y': [2, np.nan, 5, 6.2, 7.5]
})

# Will fail with missing values
plt.scatter(data['x'], data['y'])
m, b = np.polyfit(data['x'], data['y'], 1)
plt.plot(data['x'], m*data['x'] + b, color='red')
plt.show()

The calculation fails because np.polyfit() receives columns containing np.nan values, which are mathematically undefined. The corrected code below demonstrates how to prepare the data before plotting.

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Dataset with missing values
data = pd.DataFrame({
   'x': [1, 2, np.nan, 4, 5],
   'y': [2, np.nan, 5, 6.2, 7.5]
})

# Fix: drop missing values before plotting
clean_data = data.dropna()
plt.scatter(clean_data['x'], clean_data['y'])
m, b = np.polyfit(clean_data['x'], clean_data['y'], 1)
plt.plot(clean_data['x'], m*clean_data['x'] + b, color='red')
plt.show()

The fix is to clean the data before analysis. By calling data.dropna(), you create a new DataFrame that excludes any rows with missing values. This clean dataset can then be used by np.polyfit() without causing an error.

  • Always check for and handle NaN values before performing calculations, especially when working with data from external sources, as it's a common source of errors.

Correcting array shapes for sklearn regression models

You'll often hit a ValueError with scikit-learn if your data isn't shaped correctly. The library expects a 2D array for your features, but it's easy to accidentally pass a 1D array. The following code demonstrates this common mistake.

import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Incorrect shape for sklearn
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3.5, 5, 6.2, 7.5])

# This will raise an error
model = LinearRegression()
model.fit(x, y)  # x needs to be 2D
plt.scatter(x, y)
plt.plot(x, model.predict(x), color='red')
plt.show()

The error occurs because the model.fit() method receives the x data as a simple, one-dimensional array. It's expecting a columnar format instead. The following code demonstrates the necessary adjustment.

import numpy as np
from sklearn.linear_model import LinearRegression
import matplotlib.pyplot as plt

# Correcting shape for sklearn
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3.5, 5, 6.2, 7.5])

# Fix: reshape x to be 2D
x_2d = x.reshape(-1, 1)
model = LinearRegression()
model.fit(x_2d, y)
plt.scatter(x, y)
plt.plot(x, model.predict(x_2d), color='red')
plt.show()

The fix is to reshape your feature data before passing it to the fit() method. scikit-learn requires a 2D array for features, even when you only have one. This is because the library is designed to handle multiple features by default.

  • The code x.reshape(-1, 1) converts your 1D array into a 2D array with a single column.

This simple change aligns the data with the library's input requirements, allowing the model to train without error.
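reshape(-1, 1) is the most common fix, but a couple of equivalent alternatives are worth knowing, especially when your data starts in a DataFrame. A sketch:

```python
import numpy as np
import pandas as pd

x = np.array([1, 2, 3, 4, 5])
df = pd.DataFrame({'x': x})

a = x.reshape(-1, 1)       # explicit reshape to one column
b = x[:, np.newaxis]       # insert a new axis; same result
c = df[['x']].to_numpy()   # double brackets select a 2D sub-frame

print(a.shape, b.shape, c.shape)  # all (5, 1)
```

With pandas, df[['x']] (a one-column DataFrame) can be passed to fit() directly, while df['x'] (a 1D Series) triggers the same ValueError as a 1D array.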

Fixing axis limits for proper regression visualization

By default, a regression line in matplotlib only spans the range of your data points. This can make the trend look abrupt or incomplete. The code below illustrates this common visualization issue, where the line stops short at the first and last points.

import numpy as np
import matplotlib.pyplot as plt

# Data points
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3.5, 5, 6.2, 7.5])

# Linear regression
m, b = np.polyfit(x, y, 1)
plt.scatter(x, y)
plt.plot(x, m*x + b, color='red')
# Line only spans the x range of data points
plt.show()

The issue arises because plt.plot() is only given the original x values to draw upon. As a result, the line doesn't extend beyond your data's boundaries. The following code shows how to correct this.

import numpy as np
import matplotlib.pyplot as plt

# Data points
x = np.array([1, 2, 3, 4, 5])
y = np.array([2, 3.5, 5, 6.2, 7.5])

# Linear regression
m, b = np.polyfit(x, y, 1)
plt.scatter(x, y)

# Fix: extend the line beyond data points
x_line = np.array([0, 6])  # Extended range
plt.plot(x_line, m*x_line + b, color='red')
plt.xlim(0, 6)  # Set explicit axis limits
plt.show()

The fix is to manually extend the line's range. Instead of plotting with your original x values, you create a new array, x_line, that spans a wider area. This new array is then used to draw the regression line with plt.plot().

  • To ensure the entire line is visible, you can adjust the plot's boundaries using plt.xlim().
  • This makes your trendline look more complete and is helpful for visualizing extrapolations beyond your dataset.

Real-world applications

With these common errors solved, you can apply regression plotting to practical scenarios like real estate analysis and model diagnostics.

Using numpy to predict real estate prices

This practical example shows how to model the relationship between house size and price, allowing you to visualize and predict property values with a simple regression line.

import numpy as np
import matplotlib.pyplot as plt

sizes = np.array([750, 850, 950, 1050, 1150, 1250])
prices = np.array([150, 170, 195, 215, 235, 260])
m, b = np.polyfit(sizes, prices, 1)
plt.scatter(sizes, prices)
plt.plot(sizes, m*sizes + b, 'r-')
plt.xlabel('House Size (sq ft)')
plt.ylabel('Price ($1000s)')
plt.show()

This snippet uses NumPy to perform the core math for a linear regression. After defining the sizes and prices arrays, it calls np.polyfit() to compute the slope (m) and intercept (b) of the trendline. Matplotlib then handles the visualization, and labeling the axes with plt.xlabel() and plt.ylabel() makes the final plot easy to interpret.

  • The plt.scatter() function displays the original data as individual points.
  • The plt.plot() function overlays the calculated regression line, styled as a solid red line with 'r-'.
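Once the coefficients are computed, the same equation estimates prices for sizes you haven't observed (the 1000 sq ft query below is an arbitrary example value):

```python
import numpy as np

sizes = np.array([750, 850, 950, 1050, 1150, 1250])
prices = np.array([150, 170, 195, 215, 235, 260])
m, b = np.polyfit(sizes, prices, 1)

# Apply the fitted line to a size not in the dataset
new_size = 1000
predicted = m * new_size + b
print(f"Estimated price for {new_size} sq ft: ${predicted:.0f}k")
```

This is interpolation within the observed range, where the linear fit is most trustworthy; extrapolating far beyond 1250 sq ft would be much less reliable.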

Creating residual plots to diagnose model fit

A residual plot visualizes the errors in your model’s predictions, which is a great way to diagnose how well the regression line fits your data.

import numpy as np
import matplotlib.pyplot as plt
from sklearn.linear_model import LinearRegression

np.random.seed(0)  # make the noise, and therefore the plot, reproducible
x = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
y = np.array([2, 4, 5, 4, 6, 8, 7, 10, 11, 14]) + np.random.randn(10)
model = LinearRegression().fit(x.reshape(-1, 1), y)
residuals = y - model.predict(x.reshape(-1, 1))
plt.scatter(x, residuals)
plt.axhline(y=0, color='red')
plt.title('Residual Plot')
plt.show()

This code trains a LinearRegression model and calculates the residuals—the difference between the actual y values and the model's predictions. It then creates a scatter plot to visualize how these residuals are distributed.

  • The residuals are calculated by subtracting the output of model.predict() from the original y array.
  • plt.scatter() plots these residuals against the original x values.
  • A horizontal line at zero is added with plt.axhline(), representing where the model's prediction perfectly matches the actual data.

Get started with Replit

Turn these techniques into a real tool with Replit Agent. Describe what you want, like “a web app that predicts housing prices from square footage and plots the regression line,” or “a dashboard visualizing sales data with a trendline.”

The agent writes the code, tests for errors, and deploys the app, turning your prompt into a finished product. Start building with Replit.

Get started free

Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
