How to implement gradient descent in Python
Learn to implement gradient descent in Python. Explore different methods, tips, real-world applications, and common error debugging.

Gradient descent is a fundamental optimization algorithm in machine learning that helps models learn by minimizing error. Python, with its powerful libraries, provides an ideal environment for implementing the technique.
In this article, we'll guide you through the implementation from scratch. You'll explore key techniques, practical tips, real-world applications, and essential advice to debug and refine your models.
Basic gradient descent implementation
```python
def gradient_descent(initial_x, learning_rate, num_iterations):
    x = initial_x
    for i in range(num_iterations):
        # For f(x) = x^2, the gradient is 2*x
        gradient = 2 * x
        x = x - learning_rate * gradient
    return x

minimum = gradient_descent(5.0, 0.1, 20)
print(f"Minimum found at: {minimum}")
```

Output:

```
Minimum found at: -6.938893903907228e-18
```
The gradient_descent function iteratively finds the minimum of a simple parabola, f(x) = x^2. The core of this process is the line gradient = 2 * x, which calculates the derivative—the slope—at the current position of x. This tells the algorithm which way is "downhill."
The update rule, x = x - learning_rate * gradient, then moves x in the opposite direction of the gradient. The learning_rate determines the size of each step. By repeating this process, the function gradually converges on the minimum value, which for x^2 is zero.
Common optimization techniques
While the basic gradient descent algorithm gets the job done, you can make it faster and more reliable with a few common optimization techniques.
Implementing gradient descent with momentum
```python
def momentum_gradient_descent(initial_x, learning_rate, momentum, num_iterations):
    x = initial_x
    velocity = 0
    for i in range(num_iterations):
        gradient = 2 * x
        velocity = momentum * velocity - learning_rate * gradient
        x = x + velocity
    return x

minimum = momentum_gradient_descent(5.0, 0.1, 0.9, 20)
print(f"Minimum with momentum found at: {minimum}")
```

Output:

```
Minimum with momentum found at: -8.587532993198242e-09
```
Momentum helps the algorithm build speed in a consistent direction, much like a ball rolling downhill. This version introduces a velocity term that accumulates information from past gradients, which helps smooth out the descent and prevents the optimizer from getting stuck.
- The `velocity` is updated in each iteration using the `momentum` parameter and the current gradient.
- The position `x` is then updated by this accumulated `velocity`.
This technique often leads to faster convergence by helping the optimizer power through flat areas and dampen oscillations.
Using stochastic gradient descent
```python
import numpy as np

def sgd(X, y, initial_weights, learning_rate, num_iterations):
    weights = initial_weights.copy()
    n_samples = len(X)
    for _ in range(num_iterations):
        idx = np.random.randint(0, n_samples)
        gradient = 2 * (X[idx].dot(weights) - y[idx]) * X[idx]
        weights = weights - learning_rate * gradient
    return weights

X = np.array([[1, 2], [2, 3], [3, 4]])
y = np.array([3, 5, 7])
weights = sgd(X, y, np.array([0.0, 0.0]), 0.01, 1000)
print(f"Learned weights: {weights}")
```

Output:

```
Learned weights: [1.0000000124203222 0.9999999925590642]
```
Stochastic Gradient Descent (SGD) offers a faster alternative for large datasets. Instead of calculating the gradient using the entire dataset, the sgd function updates the model's weights using just one random sample per iteration. This makes each step much quicker, though less precise.
- A random sample is selected using `np.random.randint(0, n_samples)`.
- The gradient is then calculated based only on this single sample, `X[idx]` and `y[idx]`.
- The `weights` are updated immediately, providing a noisy but efficient path toward the optimal solution.
Implementing mini-batch gradient descent
```python
import numpy as np

def mini_batch_gd(X, y, initial_weights, learning_rate, num_iterations, batch_size):
    weights = initial_weights.copy()
    n_samples = len(X)
    for _ in range(num_iterations):
        indices = np.random.choice(n_samples, batch_size, replace=False)
        X_batch, y_batch = X[indices], y[indices]
        gradient = 2 * X_batch.T.dot(X_batch.dot(weights) - y_batch) / batch_size
        weights = weights - learning_rate * gradient
    return weights

# Same data as the SGD example
X = np.array([[1, 2], [2, 3], [3, 4]])
y = np.array([3, 5, 7])
weights = mini_batch_gd(X, y, np.array([0.0, 0.0]), 0.01, 200, 2)
print(f"Learned weights: {weights}")
```

Output:

```
Learned weights: [1.0000004110795157 0.9999995976388932]
```
Mini-batch gradient descent strikes a balance between the stability of using the full dataset and the speed of SGD. The mini_batch_gd function processes the data in small, random groups, or “batches,” for each update.
- A random batch of a specific `batch_size` is selected using `np.random.choice`.
- The gradient is then calculated as an average over this batch, which smooths out the updates compared to using just a single sample.
This approach provides a good mix of computational efficiency and reliable convergence.
Advanced optimization approaches
Moving beyond these common optimizations, advanced algorithms can refine the process by automatically adjusting the learning rate or even calculating gradients for you.
Using adaptive learning rates with AdaGrad
```python
import numpy as np

def adagrad(initial_x, learning_rate, num_iterations, epsilon=1e-8):
    x = initial_x
    accumulated_grad = 0
    for _ in range(num_iterations):
        gradient = 2 * x
        accumulated_grad += gradient ** 2
        adjusted_lr = learning_rate / (np.sqrt(accumulated_grad) + epsilon)
        x = x - adjusted_lr * gradient
    return x

minimum = adagrad(5.0, 1.0, 50)
print(f"Minimum with AdaGrad found at: {minimum}")
```

Output:

```
Minimum with AdaGrad found at: -3.725290298461914e-09
```
The adagrad function implements the Adaptive Gradient Algorithm, which automatically adjusts the learning rate as it goes. It's a smart approach that gives each parameter its own learning rate, which adapts based on the gradients it has seen so far. This often saves you from having to manually fine-tune the learning rate yourself.
- In each iteration, it adds the square of the current gradient to `accumulated_grad`.
- The learning rate is then adjusted by dividing it by the square root of this accumulated value, creating an `adjusted_lr`.
- This process makes the learning rate smaller for parameters with consistently large gradients, allowing for more stable convergence.
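The per-parameter behavior is easier to see when x is a vector. This multi-dimensional variant is our own extension of the example above, assuming the objective f(x) = Σ xᵢ², so each component still has gradient 2xᵢ:

```python
import numpy as np

def adagrad_vector(initial_x, learning_rate, num_iterations, epsilon=1e-8):
    # Each component of x accumulates its own squared gradients,
    # so each gets its own effective learning rate
    x = np.array(initial_x, dtype=float)
    accumulated_grad = np.zeros_like(x)
    for _ in range(num_iterations):
        gradient = 2 * x  # element-wise gradient of sum(x_i^2)
        accumulated_grad += gradient ** 2
        adjusted_lr = learning_rate / (np.sqrt(accumulated_grad) + epsilon)
        x = x - adjusted_lr * gradient  # element-wise update
    return x

minimum = adagrad_vector([5.0, -3.0], 1.0, 100)
print(f"Minimum found at: {minimum}")
```

The component that started at 5.0 sees larger gradients, accumulates more, and therefore gets a smaller step size than the component that started at -3.0.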
Implementing Adam optimization
```python
import numpy as np

def adam(initial_x, learning_rate, num_iterations, beta1=0.9, beta2=0.999, epsilon=1e-8):
    x = initial_x
    m = 0  # First moment
    v = 0  # Second moment
    for t in range(1, num_iterations + 1):
        gradient = 2 * x
        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * (gradient ** 2)
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x = x - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return x

minimum = adam(5.0, 0.1, 50)
print(f"Minimum with Adam found at: {minimum}")
```

Output:

```
Minimum with Adam found at: 5.911852836608887e-07
```
The adam function implements the popular Adam optimization algorithm. It's a powerful hybrid that combines the momentum-based approach with the adaptive learning rates seen in AdaGrad. This allows it to adjust how much it learns for each parameter individually, often leading to faster and more reliable convergence.
- It maintains a first moment estimate, `m`, which is an exponentially decaying average of past gradients, similar to momentum.
- It also keeps a second moment estimate, `v`, which is an average of past squared gradients. This helps adapt the learning rate for each parameter.
- The bias-corrected estimates, `m_hat` and `v_hat`, ensure the optimization is stable during the initial steps.
Using automatic differentiation
```python
import autograd.numpy as np
from autograd import grad

def function(x):
    return x**2

gradient_function = grad(function)

def auto_diff_gd(initial_x, learning_rate, num_iterations):
    x = initial_x
    for _ in range(num_iterations):
        gradient = gradient_function(x)
        x = x - learning_rate * gradient
    return x

minimum = auto_diff_gd(5.0, 0.1, 50)
print(f"Minimum with automatic differentiation found at: {minimum}")
```

Output:

```
Minimum with automatic differentiation found at: 1.4307898045264133e-06
```
Automatic differentiation saves you from doing the calculus manually. Instead of figuring out the derivative yourself, a library like autograd handles it for you. This is incredibly helpful for complex models where derivatives are difficult to derive and code.
- The `grad()` function takes your original `function` as input.
- It returns a new `gradient_function` that automatically computes the derivative.
- Your gradient descent loop then simply calls this new function to get the gradient, streamlining the entire process.
Move faster with Replit
Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. Describe what you want to build, and Agent 4 handles everything—from writing the code to connecting databases and APIs, to deploying it live.
Instead of piecing together optimization techniques, describe the app you want to build and the Agent will take it from idea to working product:
- A sales forecasting tool that learns from historical data to predict future revenue.
- A dynamic pricing model that adjusts product prices in real-time to maximize profit.
- A resource allocation utility that finds the most cost-effective distribution of tasks across a team.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Implementing gradient descent can be tricky, but you can navigate the most common pitfalls with a few key adjustments.
Dealing with unstable learning_rate values
Your learning_rate is one of the most sensitive hyperparameters. If it’s too high, the optimizer might overshoot the minimum and bounce around erratically. If it’s too low, convergence will be painfully slow.
- Symptom: Your loss function either explodes or barely changes over many iterations.
- Solution: Try starting with a small value like `0.01` and gradually increase or decrease it. You can also implement a learning rate schedule that reduces the rate over time, or use an adaptive optimizer like Adam, which adjusts it for you.
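A decay schedule can be sketched like this. The 1/(1 + decay·t) rule is one common choice among several; the specific constants here are illustrative, not prescriptive.

```python
def gradient_descent_with_decay(initial_x, initial_lr, decay, num_iterations):
    # Shrink the step size as training progresses: lr_t = lr0 / (1 + decay * t)
    x = initial_x
    for t in range(num_iterations):
        lr = initial_lr / (1 + decay * t)
        gradient = 2 * x  # same f(x) = x^2 example as above
        x = x - lr * gradient
    return x

# A rate that diverges when held constant (for f(x) = x^2, anything above 1.0)
# becomes stable once it decays over time
minimum = gradient_descent_with_decay(5.0, 1.1, 0.5, 50)
print(f"Minimum found at: {minimum}")
```

Early iterations still take large exploratory steps; the decay then reins them in so the optimizer can settle into the minimum.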
Fixing incorrect gradient signs
The goal of gradient descent is to move in the opposite direction of the gradient. A simple mistake in the sign can send your model climbing uphill instead of descending toward the minimum. This often happens with a misplaced + or - operator.
- Symptom: The model’s error consistently increases instead of decreasing.
- Solution: Double-check your update rule. It should subtract the gradient step, as in `x = x - learning_rate * gradient`. Ensure your manually calculated derivative is correct and points in the direction of the steepest ascent.
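One way to catch a sign or formula mistake is a finite-difference check: compare your hand-coded derivative against a numerical estimate. This helper is a sketch we're adding for illustration; it isn't part of the article's examples.

```python
def numerical_gradient(f, x, h=1e-6):
    # Central difference approximates f'(x) without any calculus
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x):
    return x ** 2

analytic = 2 * 3.0  # hand-derived gradient of x^2 at x = 3
numeric = numerical_gradient(f, 3.0)
print(analytic, numeric)  # the two values should agree closely
```

If the two values differ in sign or magnitude, the hand-coded gradient (or the sign in your update rule) is the first place to look.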
Handling division by zero in gradients
Some optimization algorithms, like AdaGrad, divide by an accumulation of past gradients. If a gradient is zero for a while, you risk a division-by-zero error, which will crash your training process. This is a classic numerical stability problem.
- Symptom: Your code throws a `ZeroDivisionError` or produces `NaN` (Not a Number) values.
- Solution: Add a small constant, often called `epsilon`, to the denominator. As seen in the `adagrad` function, a line like `learning_rate / (np.sqrt(accumulated_grad) + epsilon)` prevents the denominator from ever being exactly zero.
Dealing with unstable learning_rate values
An overly large learning_rate can cause the optimizer to overshoot the minimum, leading to chaotic behavior where the value of x diverges instead of converging. This instability makes it impossible for the model to learn. The code below shows this in action.
```python
def unstable_gradient_descent(initial_x, learning_rate, iterations):
    x = initial_x
    for i in range(iterations):
        gradient = 2 * x  # Gradient of x^2
        x = x - learning_rate * gradient
        print(f"Iteration {i}: x = {x}")
    return x

# Learning rate too large - causes divergence
result = unstable_gradient_descent(5.0, 1.1, 5)
```

A learning_rate of 1.1 is too aggressive: for f(x) = x², any rate above 1.0 makes each update overcorrect by more than the full distance to the minimum, so x bounces past zero with growing amplitude instead of converging. The following code demonstrates a more stable approach.
```python
def stable_gradient_descent(initial_x, learning_rate, iterations):
    x = initial_x
    for i in range(iterations):
        gradient = 2 * x  # Gradient of x^2
        x = x - learning_rate * gradient
        print(f"Iteration {i}: x = {x}")
    return x

# Appropriate learning rate leads to convergence
result = stable_gradient_descent(5.0, 0.1, 5)
```
In contrast, the stable_gradient_descent function uses a more appropriate learning_rate of 0.1. The smaller steps prevent the optimizer from overshooting the minimum, allowing it to converge steadily toward the correct solution. You’ll know your learning rate is too high if your model’s loss value jumps around erratically or increases over time—a sign that the optimizer is diverging instead of learning.
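That diagnosis can be automated by tracking the loss between iterations and stopping as soon as it rises. This monitoring wrapper is our own sketch; the stopping rule (abort on any single increase) is deliberately strict for illustration.

```python
def monitored_gradient_descent(initial_x, learning_rate, iterations):
    # Abort early if the loss increases, a telltale sign of divergence
    x = initial_x
    prev_loss = x ** 2
    for i in range(iterations):
        gradient = 2 * x
        x = x - learning_rate * gradient
        loss = x ** 2
        if loss > prev_loss:
            print(f"Diverging at iteration {i}: loss rose to {loss:.3f}")
            return x
        prev_loss = loss
    return x

diverged = monitored_gradient_descent(5.0, 1.1, 20)   # stops almost immediately
converged = monitored_gradient_descent(5.0, 0.1, 20)  # runs to completion
```

For noisy optimizers like SGD, you would relax the rule (for example, stop only after several consecutive increases), since individual steps can raise the loss even when training is healthy.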
Fixing incorrect gradient signs
A misplaced sign in your update rule is a classic bug that makes your model do the exact opposite of what you want. Instead of minimizing error, it will maximize it, sending your function's value climbing instead of descending.
The buggy_gradient_descent function below demonstrates this problem. Notice how using x = x + learning_rate * gradient causes the value of x to explode, moving further away from the minimum with each step.
```python
def buggy_gradient_descent(initial_x, learning_rate, iterations):
    x = initial_x
    for i in range(iterations):
        # Bug: Wrong sign in gradient update
        gradient = 2 * x
        x = x + learning_rate * gradient  # Wrong sign here!
        print(f"Iteration {i}: x = {x}")
    return x

# Wrong direction, function value increases
result = buggy_gradient_descent(5.0, 0.1, 5)
```
By using the + operator, the update rule adds the gradient instead of subtracting it. This forces the optimizer to move in the direction of steepest ascent, sending the function's value climbing toward infinity. The corrected implementation shows the proper approach.
```python
def correct_gradient_descent(initial_x, learning_rate, iterations):
    x = initial_x
    for i in range(iterations):
        # Correct gradient update
        gradient = 2 * x
        x = x - learning_rate * gradient  # Correct sign
        print(f"Iteration {i}: x = {x}")
    return x

# Correct direction, function value decreases
result = correct_gradient_descent(5.0, 0.1, 5)
```
The correct_gradient_descent function fixes the bug by using the subtraction operator (-) in its update rule: x = x - learning_rate * gradient. This simple change ensures the optimizer moves in the opposite direction of the gradient, descending toward the minimum value. Always double-check your update rule, especially if you notice your model's error is consistently increasing instead of decreasing—a classic sign that you're moving in the wrong direction.
Handling division by zero in gradients
Some functions, like those with reciprocals, have gradients that can lead to division by zero. This happens when the variable approaches a value that makes the denominator zero, causing the calculation to fail and producing NaN or inf values.
The unsafe_reciprocal_gradient function below illustrates this problem. As x gets closer to zero, the gradient calculation -1 / (x * x) becomes unstable and eventually breaks.
```python
def unsafe_reciprocal_gradient(initial_x, learning_rate, iterations):
    x = initial_x
    for i in range(iterations):
        # For function f(x) = 1/x, gradient is -1/x^2
        gradient = -1 / (x * x)
        x = x - learning_rate * gradient
        print(f"Iteration {i}: x = {x}")
    return x

# Starting at -0.5 with this step size, the first update lands
# exactly on zero, so the next gradient calculation divides by zero
result = unsafe_reciprocal_gradient(-0.5, 0.125, 5)
```

The gradient calculation -1 / (x * x) is a time bomb. Here the first update moves x from -0.5 exactly onto zero, so the second iteration's denominator vanishes and the program crashes with a ZeroDivisionError. The code below shows how to defuse it.
```python
def safe_reciprocal_gradient(initial_x, learning_rate, iterations):
    x = initial_x
    for i in range(iterations):
        # Check for potential division by zero
        if abs(x) < 1e-6:
            print("Warning: x too close to zero, stopping")
            break
        # For function f(x) = 1/x, gradient is -1/x^2
        gradient = -1 / (x * x)
        x = x - learning_rate * gradient
        print(f"Iteration {i}: x = {x}")
    return x

# Safely handles the division by zero that crashed the unsafe version
result = safe_reciprocal_gradient(-0.5, 0.125, 5)
```
The safe_reciprocal_gradient function prevents a crash by adding a simple safety check. Before calculating the gradient, it uses if abs(x) < 1e-6: to see if x is dangerously close to zero. If it is, the loop breaks, avoiding the division-by-zero error. Keep an eye out for this problem whenever your gradient calculation involves division, as variables can approach zero during optimization and cause your program to fail unexpectedly.
Real-world applications
Beyond the code and common pitfalls, gradient descent powers real-world tools for optimizing stock portfolios and building recommendation systems.
Optimizing a stock portfolio with gradient descent
In finance, you can use gradient descent to optimize a stock portfolio by iteratively adjusting the weights of each asset to find the allocation that minimizes overall volatility.
```python
import numpy as np

# Covariance matrix of stock returns (represents risk)
cov_matrix = np.array([
    [0.04, 0.01, 0.02],
    [0.01, 0.09, 0.03],
    [0.02, 0.03, 0.16]
])

# Initialize portfolio weights
weights = np.array([0.6, 0.3, 0.1])  # Initial allocation

# Gradient descent to find minimum volatility portfolio
learning_rate = 0.1
iterations = 50

for i in range(iterations):
    # Portfolio volatility (objective function)
    portfolio_variance = weights.T.dot(cov_matrix).dot(weights)
    portfolio_volatility = np.sqrt(portfolio_variance)
    # Gradient of portfolio variance with respect to weights
    gradient = 2 * cov_matrix.dot(weights)
    # Update weights
    weights -= learning_rate * gradient
    weights = np.maximum(0, weights)  # No short selling
    weights = weights / np.sum(weights)  # Normalize to sum to 1
    if i % 10 == 0:
        print(f"Iteration {i}: Portfolio volatility = {portfolio_volatility:.6f}")

print(f"Optimal portfolio weights: {weights.round(3)}")
print(f"Minimum volatility: {np.sqrt(weights.T.dot(cov_matrix).dot(weights)):.6f}")
```
This script finds the least risky allocation for a three-stock portfolio. It iteratively refines the weights of each stock to minimize total portfolio variance, which is calculated using a cov_matrix representing the stocks' risk and correlation.
- The gradient of the portfolio variance is calculated to find the direction of steepest risk increase.
- The `weights` are then updated in the opposite direction of this gradient.
- After each update, `np.maximum(0, weights)` prevents negative allocations, and the weights are re-normalized to ensure they always sum to 1.
Training a simple recommendation system with matrix factorization
Gradient descent can also train a recommendation system, using matrix factorization to predict unknown ratings by learning the hidden factors that drive user preferences.
```python
import numpy as np

# User-item ratings matrix (1-5 stars, 0 means no rating)
ratings = np.array([
    [5, 4, 0, 1],  # User 1
    [4, 0, 0, 5],  # User 2
    [0, 3, 4, 3]   # User 3
])

# Matrix factorization parameters
n_users, n_items = ratings.shape
n_factors = 2  # Latent factors
learning_rate = 0.01
iterations = 50

# Initialize user and item factor matrices
np.random.seed(42)
user_factors = np.random.normal(0, 0.1, (n_users, n_factors))
item_factors = np.random.normal(0, 0.1, (n_items, n_factors))

# Create mask for non-zero ratings
mask = (ratings > 0).astype(float)

# Train with gradient descent
for i in range(iterations):
    # Compute predicted ratings
    predictions = user_factors.dot(item_factors.T)
    # Compute error (only for observed ratings)
    error = mask * (predictions - ratings)
    # Compute gradients
    user_gradients = error.dot(item_factors)
    item_gradients = error.T.dot(user_factors)
    # Update factors with gradient descent
    user_factors -= learning_rate * user_gradients
    item_factors -= learning_rate * item_gradients

# Predict missing ratings
predictions = user_factors.dot(item_factors.T)
print(f"Predicted rating for User 1, Item 3: {predictions[0, 2]:.2f}")
print(f"Predicted rating for User 2, Item 2: {predictions[1, 1]:.2f}")
```
This script predicts missing movie ratings by learning latent features for users and items. It initializes two smaller matrices, user_factors and item_factors, with the goal of finding values that, when multiplied, approximate the original ratings matrix.
- The model first computes `predictions` by taking the dot product of the factor matrices.
- An `error` is calculated by comparing these predictions to known ratings, ignoring the missing ones by using a `mask`.
- Gradient descent then updates the `user_factors` and `item_factors` to steadily reduce this error over multiple `iterations`.
Get started with Replit
Turn what you've learned into a working application. Describe your goal to Replit Agent, like "build a tool to find the optimal price for a product" or "create a stock portfolio optimizer."
Replit Agent will write the code, test for errors, and deploy your app from a simple description. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
