How to do linear regression in Python
Learn how to perform linear regression in Python. Explore different methods, tips, real-world applications, and how to debug common errors.

Linear regression is a core statistical technique in Python. It allows you to model relationships between variables and predict outcomes, a fundamental skill for data analysis and machine learning projects.
Here, you'll explore several techniques and practical tips for implementation. We'll also cover real-world applications and common debugging advice, so you can confidently build and troubleshoot your own linear regression models.
Using sklearn for basic linear regression
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])
model = LinearRegression().fit(X, y)
print(f"Coefficient: {model.coef_[0]:.4f}, Intercept: {model.intercept_:.4f}")

Output:
Coefficient: 0.6000, Intercept: 2.2000
The sklearn library simplifies finding a line of best fit. The main work happens in the fit(X, y) method, which trains the model by analyzing the relationship between your feature data (X) and target values (y).
Once trained, the model holds the two key components of the linear equation:
- model.coef_: The coefficient, or slope, which quantifies the relationship between the variables.
- model.intercept_: The value where the regression line crosses the y-axis.
These attributes define the predictive formula the model has learned from your data.
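Once fitted, the same model can score new inputs with predict(), which applies the learned formula coef_ * x + intercept_ to each row. A minimal sketch, refitting the toy data from above:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

# Same toy data as above
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])
model = LinearRegression().fit(X, y)

# predict() expects the same 2D shape that was used for fitting
new_X = np.array([[6], [7]])
predictions = model.predict(new_X)
print([f"{p:.1f}" for p in predictions])  # points on the fitted line y = 0.6x + 2.2
```

Each prediction is simply the fitted line evaluated at the new x value, so x=6 yields 5.8 and x=7 yields 6.4.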
Alternative implementation methods
While sklearn offers a straightforward path, libraries like numpy, statsmodels, and pandas provide alternative ways to perform linear regression for more control or deeper analysis.
Using numpy for manual calculation
import numpy as np
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
X_mean, y_mean = np.mean(X), np.mean(y)
slope = np.sum((X - X_mean) * (y - y_mean)) / np.sum((X - X_mean)**2)
intercept = y_mean - slope * X_mean
print(f"Slope: {slope:.4f}, Intercept: {intercept:.4f}")

Output:
Slope: 0.6000, Intercept: 2.2000
With numpy, you can calculate the regression line manually by applying the core mathematical formulas directly. This gives you a look under the hood of what libraries like sklearn automate for you. The process involves a few key steps:
- Calculating the mean of both your X and y datasets with np.mean().
- Using these means to compute the slope based on the standard formula for ordinary least squares.
- Determining the intercept to ensure the regression line correctly passes through the central point of your data.
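As a sanity check, the hand-computed slope and intercept can be compared against np.polyfit, which solves the same least-squares problem:

```python
import numpy as np

X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])

# Manual ordinary-least-squares formulas
X_mean, y_mean = np.mean(X), np.mean(y)
slope = np.sum((X - X_mean) * (y - y_mean)) / np.sum((X - X_mean) ** 2)
intercept = y_mean - slope * X_mean

# np.polyfit with degree 1 fits the same straight line
np_slope, np_intercept = np.polyfit(X, y, 1)
print(np.isclose(slope, np_slope), np.isclose(intercept, np_intercept))  # True True
```

Agreement between the two confirms the manual formulas are implemented correctly.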
Using statsmodels for detailed statistics
import statsmodels.api as sm
import numpy as np
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
X = sm.add_constant(X)
model = sm.OLS(y, X).fit()
print(model.summary().tables[1])

Output:
==============================================================================
                 coef    std err          t      P>|t|      [0.025      0.975]
------------------------------------------------------------------------------
const          2.2000      0.938      2.345      0.101      -0.785       5.185
x1             0.6000      0.283      2.121      0.124      -0.300       1.500
==============================================================================
The statsmodels library is perfect when you need more than just the basics. It's built for rigorous statistical analysis and provides a much richer output than other methods.
- A key difference is that you'll need to manually add a constant to your feature data with sm.add_constant(X) to calculate the intercept.
- The model itself is created using sm.OLS(y, X).fit(), which fits an Ordinary Least Squares regression.
- The real power lies in the model.summary() method, which gives you a comprehensive statistical report.
Using pandas with visualization
import pandas as pd
import matplotlib.pyplot as plt
import numpy as np
data = pd.DataFrame({'x': [1, 2, 3, 4, 5], 'y': [2, 4, 5, 4, 5]})
coeffs = np.polyfit(data['x'], data['y'], 1)
plt.scatter(data['x'], data['y'])
plt.plot(data['x'], np.polyval(coeffs, data['x']), 'r')
print(f"Polynomial coefficients: {coeffs}")

Output:
Polynomial coefficients: [0.6 2.2]
When you combine pandas with matplotlib, it’s easy to both calculate and visualize your regression line. The data is first structured in a DataFrame, which works seamlessly with plotting libraries.
- The np.polyfit(x, y, 1) function does the heavy lifting, fitting a first-degree polynomial (a straight line) to your data and returning the coefficients.
- You can then use plt.scatter() to plot the original data points.
- Finally, plt.plot() draws the calculated regression line, giving you an immediate visual confirmation of the model's fit.
Advanced regression techniques
Moving beyond the simple line, you can build more powerful models that handle multiple features, reduce overfitting, and capture complex, non-linear trends.
Working with multiple features using sklearn
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1, 1], [2, 3], [3, 2], [4, 5], [5, 4]])
y = np.array([2, 4, 5, 4, 5])
model = LinearRegression().fit(X, y)
print(f"Coefficients: {model.coef_}, Intercept: {model.intercept_:.4f}")
prediction = model.predict([[6, 5]])
print(f"Prediction for [6, 5]: {prediction[0]:.4f}")

Output:
Coefficients: [ 0.77777778 -0.22222222], Intercept: 2.3333
Prediction for [6, 5]: 5.8889
Multiple linear regression lets you use several input features to predict an outcome. Notice the X variable now contains pairs of values instead of single numbers. The fit method works just the same, but it's now finding the best-fit plane for your multi-dimensional data.
- The model.coef_ attribute returns an array of coefficients, one for each feature, showing how much each one influences the result.
- You can make new predictions with model.predict() by passing it an array with the same feature structure, like [[6, 5]].
Applying regularization with Ridge and Lasso
from sklearn.linear_model import Ridge, Lasso
import numpy as np
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])
ridge = Ridge(alpha=1.0).fit(X, y)
lasso = Lasso(alpha=0.1).fit(X, y)
print(f"Ridge coefficient: {ridge.coef_[0]:.4f}, intercept: {ridge.intercept_:.4f}")
print(f"Lasso coefficient: {lasso.coef_[0]:.4f}, intercept: {lasso.intercept_:.4f}")

Output:
Ridge coefficient: 0.5455, intercept: 2.3636
Lasso coefficient: 0.5500, intercept: 2.3500
Regularization is a technique to prevent overfitting, where a model learns the training data too well and performs poorly on new data. It works by penalizing large coefficients. Both Ridge and Lasso are regularized models where the strength of this penalty is controlled by the alpha parameter.
- Ridge regression (L2) shrinks coefficients toward zero, making the model more stable.
- Lasso regression (L1) can shrink coefficients completely to zero, which is useful for feature selection as it can remove irrelevant variables from the model.
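To see the alpha parameter at work, you can sweep it and watch the Lasso coefficient shrink until it reaches exactly zero. A small sketch with the same toy data:

```python
from sklearn.linear_model import Lasso
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Larger penalties shrink the coefficient; a big enough one zeroes it out
for alpha in [0.01, 0.5, 2.0]:
    coef = Lasso(alpha=alpha).fit(X, y).coef_[0]
    print(f"alpha={alpha}: coefficient={coef:.4f}")
```

This shrink-to-zero behavior is exactly why Lasso works as a feature selector: features whose signal is weaker than the penalty drop out of the model entirely.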
Creating polynomial features for non-linear relationships
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 4, 9, 16, 25]) # y = x²
poly = PolynomialFeatures(degree=2)
X_poly = poly.fit_transform(X)
model = LinearRegression().fit(X_poly, y)
print(f"Coefficients: {model.coef_}")
print(f"Predicted value for x=6: {model.predict(poly.transform([[6]]))[0]:.1f}")

Output:
Coefficients: [0. 0. 1.]
Predicted value for x=6: 36.0
When your data has a curve, a straight line won't fit. Polynomial regression is the solution. It lets a linear model capture non-linear trends by transforming your input features into polynomials, like x².
- The PolynomialFeatures(degree=2) class generates new features from your existing ones. Here, it creates an x² term to match the data's pattern.
- You then fit a standard LinearRegression model on this transformed data. This allows the model to find the perfect curve, as seen when it correctly predicts 36 for an input of 6.
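In practice, the transform and the regression are often chained with sklearn's make_pipeline, so new inputs are transformed automatically on every fit and predict call:

```python
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import PolynomialFeatures
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [3], [4], [5]])
y = np.array([1, 4, 9, 16, 25])  # y = x^2

# The pipeline applies PolynomialFeatures before every fit and predict
model = make_pipeline(PolynomialFeatures(degree=2), LinearRegression())
model.fit(X, y)
print(f"{model.predict([[6]])[0]:.1f}")  # the quadratic trend gives 36.0 at x=6
```

The pipeline removes the risk of forgetting to transform new data before predicting, a common source of shape mismatches.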
Move faster with Replit
Replit is an AI-powered development platform where you can start coding Python instantly. It comes with all Python dependencies pre-installed, so you can skip the tedious setup and get straight to building.
Mastering individual techniques is one thing, but building a complete application is another. Replit Agent helps you bridge that gap, taking your project from a simple description to a finished product. You can build practical tools that use the regression methods covered here, like:
- A property value estimator that predicts housing prices from features like square footage and age.
- A sales forecasting dashboard that visualizes historical data to project future revenue.
- A resource demand calculator that uses past usage patterns to predict future server load.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Even with powerful libraries, you'll likely encounter issues with data shape, missing values, or inconsistent feature scales when building regression models.
Fixing data shape errors with LinearRegression
sklearn's LinearRegression is particular about the shape of your input data. A common mistake is passing a 1D array for a single feature, which triggers a ValueError. The model expects a 2D array (a column vector, or list of lists) even when you only have one feature. You can fix this by calling .reshape(-1, 1) on the array, which tells NumPy to infer the number of rows automatically while creating a single column.
Handling missing values in regression data
Regression models don't work with missing data, often represented as NaN values in your dataset. Trying to fit a model with them will usually result in an error. You have a few options to handle this:
- Imputation: You can fill in the gaps. Common strategies include replacing missing values with the column's mean, median, or a specific constant like zero.
- Deletion: If you have a large dataset, you might simply remove the rows or columns that contain missing values.
The right choice depends on how much data is missing and the context of your analysis. For more comprehensive strategies, see handling missing values in Python.
Improving results with feature scaling
When your model uses multiple features with different units, like house size in square feet and the number of bedrooms, their scales can be vastly different. This can cause features with larger values to unfairly dominate the model's learning process. Feature scaling solves this by putting all features on a common scale. Techniques like standardization (using StandardScaler) or normalization (using MinMaxScaler) ensure that each feature contributes more equally to the final prediction. This is especially important for regularized models like Ridge and Lasso.
Fixing data shape errors with LinearRegression
A frequent stumbling block with LinearRegression is the input data's shape. The model expects a 2D array for your features, so passing a 1D array—even for a single feature—will raise a ValueError. See this common error in action below.
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([1, 2, 3, 4, 5])
y = np.array([2, 4, 5, 4, 5])
model = LinearRegression().fit(X, y) # This will raise an error
print(f"Coefficient: {model.coef_}, Intercept: {model.intercept_}")
The X array is defined as a simple list of numbers, but the model's fit() method expects each feature value to be enclosed in its own list. See how to correct this below.
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([1, 2, 3, 4, 5]).reshape(-1, 1)
y = np.array([2, 4, 5, 4, 5])
model = LinearRegression().fit(X, y)
print(f"Coefficient: {model.coef_[0]:.4f}, Intercept: {model.intercept_:.4f}")
The fix is simple: use .reshape(-1, 1) on your feature array. This method transforms the 1D array [1, 2, 3, 4, 5] into the 2D format [[1], [2], [3], [4], [5]] that sklearn's fit() method requires. You'll most often run into this when working with a single feature. Reshaping your data into a column vector ensures it meets the model's input structure, preventing the common ValueError.
Handling missing values in regression data
Regression models can't work with incomplete data. If your dataset contains missing values, often represented as NaN (Not a Number), trying to train a model will trigger an error because the mathematical operations can't handle empty inputs.
The code below shows what happens when you attempt to fit a model with data that includes a NaN value.
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1], [2], [np.nan], [4], [5]])
y = np.array([2, 4, 5, 4, 5])
model = LinearRegression().fit(X, y) # Will raise an error
print(f"Coefficient: {model.coef_[0]}, Intercept: {model.intercept_}")
The fit() method fails because the np.nan value in the X array isn't a number, which breaks the mathematical operations needed for regression. The code below shows one way to prepare your data to avoid this error.
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.impute import SimpleImputer
X = np.array([[1], [2], [np.nan], [4], [5]])
y = np.array([2, 4, 5, 4, 5])
imputer = SimpleImputer(strategy='mean')
X_imputed = imputer.fit_transform(X)
model = LinearRegression().fit(X_imputed, y)
print(f"Coefficient: {model.coef_[0]:.4f}, Intercept: {model.intercept_:.4f}")
The fix involves using SimpleImputer from sklearn to handle the missing data before training. This class acts as a transformer that fills in NaN values based on a chosen strategy. In this case, strategy='mean' calculates the average of the column and uses it to replace the missing entry. The fit_transform() method then applies this logic, returning a clean dataset ready for your model. This is a crucial preprocessing step whenever your data might be incomplete.
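Deletion, the other option mentioned earlier, can be done with a boolean mask before fitting. A minimal sketch on the same data:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

X = np.array([[1], [2], [np.nan], [4], [5]])
y = np.array([2, 4, 5, 4, 5])

# Keep only the rows where every feature is present
mask = ~np.isnan(X).any(axis=1)
model = LinearRegression().fit(X[mask], y[mask])
print(f"Coefficient after dropping NaN rows: {model.coef_[0]:.4f}")
```

Note that the mask must be applied to both X and y so the remaining rows stay aligned. Deletion is simplest when missing rows are rare; imputation preserves more data when they are not.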
Improving results with feature scaling
When your features have wildly different scales, it can throw off your model's learning process. A feature with large numerical values might be given too much importance, even if it's not the most predictive. See how this imbalance affects the model's coefficients below.
from sklearn.linear_model import LinearRegression
import numpy as np
X = np.array([[1, 1000], [2, 2000], [3, 3000], [4, 4000], [5, 5000]])
y = np.array([2, 4, 5, 4, 5])
model = LinearRegression().fit(X, y)
print(f"Coefficients: {model.coef_}") # Coefficients have very different magnitudes
The resulting model.coef_ values are hard to compare. Because the second feature's values are a thousand times larger, its coefficient comes out correspondingly tiny, which obscures how much that feature actually matters. The corrected approach is shown in the next example.
from sklearn.linear_model import LinearRegression
import numpy as np
from sklearn.preprocessing import StandardScaler
X = np.array([[1, 1000], [2, 2000], [3, 3000], [4, 4000], [5, 5000]])
y = np.array([2, 4, 5, 4, 5])
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)
model = LinearRegression().fit(X_scaled, y)
print(f"Coefficients with scaled features: {model.coef_}")
The fix is to use StandardScaler to put all features on a level playing field. This ensures that no single feature dominates the model just because its numbers are bigger.
- The scaler.fit_transform(X) method rescales your data.
- Fitting the model to this new X_scaled data produces more reliable coefficients.
This gives you a truer sense of each feature's impact, which is crucial whenever your inputs have different units, like age and income. For more comprehensive approaches, see scaling data in Python.
Real-world applications
Moving beyond troubleshooting, these regression techniques solve tangible problems, from predicting house prices to detecting anomalies in time series data.
Predicting house prices with sklearn
For example, you can train a LinearRegression model to predict house prices using a single feature like square footage.
from sklearn.linear_model import LinearRegression
import numpy as np
# House sizes in square feet
sizes = np.array([[1400], [1600], [1700], [1875], [1100], [1550], [2350], [2450]])
# House prices in thousands of dollars
prices = np.array([245, 312, 279, 308, 199, 219, 405, 324])
model = LinearRegression().fit(sizes, prices)
predicted_price = model.predict([[2000]])[0]
print(f"Coefficient: {model.coef_[0]:.4f}, Intercept: {model.intercept_:.2f}")
print(f"Predicted price for a 2000 sq ft house: ${predicted_price:.2f}k")
This example demonstrates a practical use of linear regression. You're feeding the model two numpy arrays: house sizes as the input feature and their corresponding prices as the target outcome. For handling larger datasets, you might want to organize this data by creating DataFrames in Python.
- The fit(sizes, prices) method is where the model learns the connection between the two variables.
- After training, predict([[2000]]) applies this learned relationship to estimate the price for a new, unseen house size.
The output gives you a concrete prediction in thousands of dollars, showing how the model generalizes from the training data.
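To judge how well square footage alone explains prices, the model's score() method returns R-squared, the fraction of price variance the fitted line accounts for. Sketched on the same housing data:

```python
from sklearn.linear_model import LinearRegression
import numpy as np

sizes = np.array([[1400], [1600], [1700], [1875], [1100], [1550], [2350], [2450]])
prices = np.array([245, 312, 279, 308, 199, 219, 405, 324])
model = LinearRegression().fit(sizes, prices)

# R-squared: 1.0 means a perfect fit, values near 0 mean little explanatory power
r2 = model.score(sizes, prices)
print(f"R-squared: {r2:.3f}")
```

A low R-squared here would suggest adding more features, such as age or location, rather than relying on size alone.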
Detecting anomalies in time series data
You can also use linear regression to spot unusual data points in a time series by identifying values that stray significantly from the predicted trend.
import numpy as np
from sklearn.linear_model import LinearRegression
# Time series data (e.g., server response times)
days = np.array(range(1, 31)).reshape(-1, 1)
response_times = np.array([110, 108, 112, 115, 118, 120, 125, 130, 129, 133,
135, 134, 138, 140, 145, 150, 148, 152, 155, 160,
158, 162, 165, 170, 210, 172, 175, 178, 180, 183])
model = LinearRegression().fit(days, response_times)
predictions = model.predict(days)
residuals = response_times - predictions
# Find anomalies (points with large residuals)
threshold = 2 * np.std(residuals)
anomalies = days[abs(residuals) > threshold]
print(f"Anomaly detected on day(s): {anomalies.flatten()}")
This code spots unusual events in a time series dataset. A linear model first learns the expected pattern from the response_times data, establishing a baseline trend. The script then measures how far each actual data point strays from this predicted trend.
- A threshold is set to define what counts as a significant deviation.
- The code then filters for any days where the response time was far outside this normal range, printing them as anomalies.
This technique is useful for monitoring system health or finding errors in data.
Get started with Replit
Turn your knowledge into a working tool. Describe what you want to build to Replit Agent, like “a tool to predict sales from ad spend” or “a house price estimator based on square footage.”
Replit Agent writes the code, tests for errors, and deploys your app, handling the entire development cycle for you. Start building with Replit.
Describe what you want to build, and Replit Agent writes the code, handles the infrastructure, and ships it live. Go from idea to real product, all in your browser.