How to implement gradient descent in Python
Learn to implement gradient descent in Python. Explore different methods, tips, real-world applications, and common error debugging.

Gradient descent is a fundamental optimization algorithm in machine learning that helps models learn by minimizing error. Python, with its powerful libraries, provides an ideal environment for implementing the technique.
In this article, we'll guide you through the implementation from scratch. You'll explore key techniques, practical tips, real-world applications, and essential advice to debug and refine your models.
Basic gradient descent implementation
```python
def gradient_descent(initial_x, learning_rate, num_iterations):
    x = initial_x
    for i in range(num_iterations):
        # For f(x) = x^2, the gradient is 2*x
        gradient = 2 * x
        x = x - learning_rate * gradient
    return x

minimum = gradient_descent(5.0, 0.1, 20)
print(f"Minimum found at: {minimum}")
```

Output:

```
Minimum found at: -6.938893903907228e-18
```
The gradient_descent function iteratively finds the minimum of a simple parabola, f(x) = x^2. The core of this process is the line gradient = 2 * x, which calculates the derivative—the slope—at the current position of x. This tells the algorithm which way is "downhill."
The update rule, x = x - learning_rate * gradient, then moves x in the opposite direction of the gradient. The learning_rate determines the size of each step. By repeating this process, the function gradually converges on the minimum value, which for x^2 is zero.
Common optimization techniques
While the basic gradient descent algorithm gets the job done, you can make it faster and more reliable with a few common optimization techniques.
Implementing gradient descent with momentum
```python
def momentum_gradient_descent(initial_x, learning_rate, momentum, num_iterations):
    x = initial_x
    velocity = 0
    for i in range(num_iterations):
        gradient = 2 * x
        velocity = momentum * velocity - learning_rate * gradient
        x = x + velocity
    return x

minimum = momentum_gradient_descent(5.0, 0.1, 0.9, 20)
print(f"Minimum with momentum found at: {minimum}")
```

Output:

```
Minimum with momentum found at: -8.587532993198242e-09
```
Momentum helps the algorithm build speed in a consistent direction, much like a ball rolling downhill. This version introduces a velocity term that accumulates information from past gradients, which helps smooth out the descent and prevents the optimizer from getting stuck.
- The `velocity` is updated in each iteration using the `momentum` parameter and the current gradient.
- The position `x` is then updated by this accumulated `velocity`.
This technique often leads to faster convergence by helping the optimizer power through flat areas and dampen oscillations.
Using stochastic gradient descent
```python
import numpy as np

def sgd(X, y, initial_weights, learning_rate, num_iterations):
    weights = initial_weights.copy()
    n_samples = len(X)
    for _ in range(num_iterations):
        idx = np.random.randint(0, n_samples)
        gradient = 2 * (X[idx].dot(weights) - y[idx]) * X[idx]
        weights = weights - learning_rate * gradient
    return weights

X = np.array([[1, 2], [2, 3], [3, 4]])
y = np.array([3, 5, 7])
weights = sgd(X, y, np.array([0.0, 0.0]), 0.01, 1000)
print(f"Learned weights: {weights}")
```

Output:

```
Learned weights: [1.0000000124203222 0.9999999925590642]
```
Stochastic Gradient Descent (SGD) offers a faster alternative for large datasets. Instead of calculating the gradient using the entire dataset, the sgd function updates the model's weights using just one random sample per iteration. This makes each step much quicker, though less precise.
- A random sample is selected using `np.random.randint(0, n_samples)`.
- The gradient is then calculated based only on this single sample, `X[idx]` and `y[idx]`.
- The `weights` are updated immediately, providing a noisy but efficient path toward the optimal solution.
Implementing mini-batch gradient descent
```python
import numpy as np

def mini_batch_gd(X, y, initial_weights, learning_rate, num_iterations, batch_size):
    weights = initial_weights.copy()
    n_samples = len(X)
    for _ in range(num_iterations):
        indices = np.random.choice(n_samples, batch_size, replace=False)
        X_batch, y_batch = X[indices], y[indices]
        gradient = 2 * X_batch.T.dot(X_batch.dot(weights) - y_batch) / batch_size
        weights = weights - learning_rate * gradient
    return weights

# Same data as the SGD example
X = np.array([[1, 2], [2, 3], [3, 4]])
y = np.array([3, 5, 7])
weights = mini_batch_gd(X, y, np.array([0.0, 0.0]), 0.01, 200, 2)
print(f"Learned weights: {weights}")
```

Output:

```
Learned weights: [1.0000004110795157 0.9999995976388932]
```
Mini-batch gradient descent strikes a balance between the stability of using the full dataset and the speed of SGD. The mini_batch_gd function processes the data in small, random groups, or “batches,” for each update.
- A random batch of a specific `batch_size` is selected using `np.random.choice`.
- The gradient is then calculated as an average over this batch, which smooths out the updates compared to using just a single sample.
This approach provides a good mix of computational efficiency and reliable convergence.
Advanced optimization approaches
Moving beyond these common optimizations, advanced algorithms can refine the process by automatically adjusting the learning rate or even calculating gradients for you.
Using adaptive learning rates with AdaGrad
```python
import numpy as np

def adagrad(initial_x, learning_rate, num_iterations, epsilon=1e-8):
    x = initial_x
    accumulated_grad = 0
    for _ in range(num_iterations):
        gradient = 2 * x
        accumulated_grad += gradient ** 2
        adjusted_lr = learning_rate / (np.sqrt(accumulated_grad) + epsilon)
        x = x - adjusted_lr * gradient
    return x

minimum = adagrad(5.0, 1.0, 50)
print(f"Minimum with AdaGrad found at: {minimum}")
```

Output:

```
Minimum with AdaGrad found at: -3.725290298461914e-09
```
The adagrad function implements the Adaptive Gradient Algorithm, which automatically adjusts the learning rate as it goes. It's a smart approach that gives each parameter its own learning rate, which adapts based on the gradients it has seen so far. This often saves you from having to manually fine-tune the learning rate yourself.
- In each iteration, it adds the square of the current gradient to `accumulated_grad`.
- The learning rate is then adjusted by dividing it by the square root of this accumulated value, creating an `adjusted_lr`.
- This process makes the learning rate smaller for parameters with consistently large gradients, allowing for more stable convergence.
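The per-parameter behavior is easier to see when x is a vector. This multi-dimensional variant is our own extension of the example above, assuming the objective f(x) = Σ xᵢ², so each component still has gradient 2xᵢ:

```python
import numpy as np

def adagrad_vector(initial_x, learning_rate, num_iterations, epsilon=1e-8):
    # Each component of x accumulates its own squared gradients,
    # so each gets its own effective learning rate
    x = np.array(initial_x, dtype=float)
    accumulated_grad = np.zeros_like(x)
    for _ in range(num_iterations):
        gradient = 2 * x  # element-wise gradient of sum(x_i^2)
        accumulated_grad += gradient ** 2
        adjusted_lr = learning_rate / (np.sqrt(accumulated_grad) + epsilon)
        x = x - adjusted_lr * gradient  # element-wise update
    return x

minimum = adagrad_vector([5.0, -3.0], 1.0, 100)
print(f"Minimum found at: {minimum}")
```

The component that started at 5.0 sees larger gradients, accumulates more, and therefore gets a smaller step size than the component that started at -3.0.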
Implementing Adam optimization
```python
import numpy as np

def adam(initial_x, learning_rate, num_iterations, beta1=0.9, beta2=0.999, epsilon=1e-8):
    x = initial_x
    m = 0  # First moment
    v = 0  # Second moment
    for t in range(1, num_iterations + 1):
        gradient = 2 * x
        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * (gradient ** 2)
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        x = x - learning_rate * m_hat / (np.sqrt(v_hat) + epsilon)
    return x

minimum = adam(5.0, 0.1, 50)
print(f"Minimum with Adam found at: {minimum}")
```

Output:

```
Minimum with Adam found at: 5.911852836608887e-07
```
The adam function implements the popular Adam optimization algorithm. It's a powerful hybrid that combines the momentum-based approach with the adaptive learning rates seen in AdaGrad. This allows it to adjust how much it learns for each parameter individually, often leading to faster and more reliable convergence.
- It maintains a first moment estimate, `m`, which is an exponentially decaying average of past gradients, similar to momentum.
- It also keeps a second moment estimate, `v`, which is an average of past squared gradients. This helps adapt the learning rate for each parameter.
- The bias-corrected estimates, `m_hat` and `v_hat`, ensure the optimization is stable during the initial steps.
Using automatic differentiation
```python
import autograd.numpy as np
from autograd import grad

def function(x):
    return x**2

gradient_function = grad(function)

def auto_diff_gd(initial_x, learning_rate, num_iterations):
    x = initial_x
    for _ in range(num_iterations):
        gradient = gradient_function(x)
        x = x - learning_rate * gradient
    return x

minimum = auto_diff_gd(5.0, 0.1, 50)
print(f"Minimum with automatic differentiation found at: {minimum}")
```

Output:

```
Minimum with automatic differentiation found at: 1.4307898045264133e-06
```
Automatic differentiation saves you from doing the calculus manually. Instead of figuring out the derivative yourself, a library like autograd handles it for you. This is incredibly helpful for complex models where derivatives are difficult to derive and code.
- The `grad()` function takes your original `function` as input.
- It returns a new `gradient_function` that automatically computes the derivative.
- Your gradient descent loop then simply calls this new function to get the gradient, streamlining the entire process.
Move faster with Replit
Replit is an AI-powered development platform that comes with all Python dependencies pre-installed, so you can skip setup and start coding instantly. Describe what you want to build, and Agent 4 handles everything—from writing the code to connecting databases and APIs, to deploying it live.
Instead of piecing together optimization techniques, describe the app you want to build and the Agent will take it from idea to working product:
- A sales forecasting tool that learns from historical data to predict future revenue.
- A dynamic pricing model that adjusts product prices in real-time to maximize profit.
- A resource allocation utility that finds the most cost-effective distribution of tasks across a team.
Simply describe your app, and Replit will write the code, test it, and fix issues automatically, all within your browser.
Common errors and challenges
Implementing gradient descent can be tricky, but you can navigate the most common pitfalls with a few key adjustments.
Dealing with unstable learning_rate values
Your learning_rate is one of the most sensitive hyperparameters. If it’s too high, the optimizer might overshoot the minimum and bounce around erratically. If it’s too low, convergence will be painfully slow.
- Symptom: Your loss function either explodes or barely changes over many iterations.
- Solution: Try starting with a small value like `0.01` and gradually increase or decrease it. You can also implement a learning rate schedule that reduces the rate over time, or use an adaptive optimizer like Adam, which adjusts it for you.
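A decay schedule can be sketched like this. The 1/(1 + decay·t) rule is one common choice among several; the specific constants here are illustrative, not prescriptive.

```python
def gradient_descent_with_decay(initial_x, initial_lr, decay, num_iterations):
    # Shrink the step size as training progresses: lr_t = lr0 / (1 + decay * t)
    x = initial_x
    for t in range(num_iterations):
        lr = initial_lr / (1 + decay * t)
        gradient = 2 * x  # same f(x) = x^2 example as above
        x = x - lr * gradient
    return x

# A rate that diverges when held constant (for f(x) = x^2, anything above 1.0)
# becomes stable once it decays over time
minimum = gradient_descent_with_decay(5.0, 1.1, 0.5, 50)
print(f"Minimum found at: {minimum}")
```

Early iterations still take large exploratory steps; the decay then reins them in so the optimizer can settle into the minimum.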
Fixing incorrect gradient signs
The goal of gradient descent is to move in the opposite direction of the gradient. A simple mistake in the sign can send your model climbing uphill instead of descending toward the minimum. This often happens with a misplaced + or - operator.
- Symptom: The model’s error consistently increases instead of decreasing.
- Solution: Double-check your update rule. It should subtract the gradient step, as in `x = x - learning_rate * gradient`. Ensure your manually calculated derivative is correct and points in the direction of the steepest ascent.
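One way to catch a sign or formula mistake is a finite-difference check: compare your hand-coded derivative against a numerical estimate. This helper is a sketch we're adding for illustration; it isn't part of the article's examples.

```python
def numerical_gradient(f, x, h=1e-6):
    # Central difference approximates f'(x) without any calculus
    return (f(x + h) - f(x - h)) / (2 * h)

def f(x):
    return x ** 2

analytic = 2 * 3.0  # hand-derived gradient of x^2 at x = 3
numeric = numerical_gradient(f, 3.0)
print(analytic, numeric)  # the two values should agree closely
```

If the two values differ in sign or magnitude, the hand-coded gradient (or the sign in your update rule) is the first place to look.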
Handling division by zero in gradients
Some optimization algorithms, like AdaGrad, divide by an accumulation of past gradients. If a gradient is zero for a while, you risk a division-by-zero error, which will crash your training process. This is a classic numerical stability problem.
- Symptom: Your code throws a `ZeroDivisionError` or produces `NaN` (Not a Number) values.
- Solution: Add a small constant, often called `epsilon`, to the denominator. As seen in the `adagrad` function, a line like `learning_rate / (np.sqrt(accumulated_grad) + epsilon)` prevents the denominator from ever being exactly zero.
Dealing with unstable learning_rate values
An overly large learning_rate can cause the optimizer to overshoot the minimum, leading to chaotic behavior where the value of x diverges instead of converging. This instability makes it impossible for the model to learn. The code below shows this in action.
```python
def unstable_gradient_descent(initial_x, learning_rate, iterations):
    x = initial_x
    for i in range(iterations):
        gradient = 2 * x  # Gradient of x^2
        x = x - learning_rate * gradient
        print(f"Iteration {i}: x = {x}")
    return x

# Learning rate too large - causes divergence
result = unstable_gradient_descent(5.0, 1.1, 5)
```

A learning_rate of 1.1 is too aggressive: for f(x) = x², any rate above 1.0 makes each update overcorrect by more than the full distance to the minimum, so x bounces past zero with growing amplitude instead of converging. The following code demonstrates a more stable approach.
```python
def stable_gradient_descent(initial_x, learning_rate, iterations):
    x = initial_x
    for i in range(iterations):
        gradient = 2 * x  # Gradient of x^2
        x = x - learning_rate * gradient
        print(f"Iteration {i}: x = {x}")
    return x

# Appropriate learning rate leads to convergence
result = stable_gradient_descent(5.0, 0.1, 5)
```
In contrast, the stable_gradient_descent function uses a more appropriate learning_rate of 0.1. The smaller steps prevent the optimizer from overshooting the minimum, allowing it to converge steadily toward the correct solution. You’ll know your learning rate is too high if your model’s loss value jumps around erratically or increases over time—a sign that the optimizer is diverging instead of learning.
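That diagnosis can be automated by tracking the loss between iterations and stopping as soon as it rises. This monitoring wrapper is our own sketch; the stopping rule (abort on any single increase) is deliberately strict for illustration.

```python
def monitored_gradient_descent(initial_x, learning_rate, iterations):
    # Abort early if the loss increases, a telltale sign of divergence
    x = initial_x
    prev_loss = x ** 2
    for i in range(iterations):
        gradient = 2 * x
        x = x - learning_rate * gradient
        loss = x ** 2
        if loss > prev_loss:
            print(f"Diverging at iteration {i}: loss rose to {loss:.3f}")
            return x
        prev_loss = loss
    return x

diverged = monitored_gradient_descent(5.0, 1.1, 20)   # stops almost immediately
converged = monitored_gradient_descent(5.0, 0.1, 20)  # runs to completion
```

For noisy optimizers like SGD, you would relax the rule (for example, stop only after several consecutive increases), since individual steps can raise the loss even when training is healthy.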
Fixing incorrect gradient signs
A misplaced sign in your update rule is a classic bug that makes your model do the exact opposite of what you want. Instead of minimizing error, it will maximize it, sending your function's value climbing instead of descending.
The buggy_gradient_descent function below demonstrates this problem. Notice how using x = x + learning_rate * gradient causes the value of x to explode, moving further away from the minimum with each step.
```python
def buggy_gradient_descent(initial_x, learning_rate, iterations):
    x = initial_x
    for i in range(iterations):
        # Bug: Wrong sign in gradient update
        gradient = 2 * x
        x = x + learning_rate * gradient  # Wrong sign here!
        print(f"Iteration {i}: x = {x}")
    return x

# Wrong direction, function value increases
result = buggy_gradient_descent(5.0, 0.1, 5)
```
By using the + operator, the update rule adds the gradient instead of subtracting it. This forces the optimizer to move in the direction of steepest ascent, sending the function's value climbing toward infinity. The corrected implementation shows the proper approach.
```python
def correct_gradient_descent(initial_x, learning_rate, iterations):
    x = initial_x
    for i in range(iterations):
        # Correct gradient update
        gradient = 2 * x
        x = x - learning_rate * gradient  # Correct sign
        print(f"Iteration {i}: x = {x}")
    return x

# Correct direction, function value decreases
result = correct_gradient_descent(5.0, 0.1, 5)
```
The correct_gradient_descent function fixes the bug by using the subtraction operator (-) in its update rule: x = x - learning_rate * gradient. This simple change ensures the optimizer moves in the opposite direction of the gradient, descending toward the minimum value. Always double-check your update rule, especially if you notice your model's error is consistently increasing instead of decreasing—a classic sign that you're moving in the wrong direction.
Handling division by zero in gradients
Some functions, like those with reciprocals, have gradients that can lead to division by zero. This happens when the variable approaches a value that makes the denominator zero, causing the calculation to fail and producing NaN or inf values.
The unsafe_reciprocal_gradient function below illustrates this problem. As x gets closer to zero, the gradient calculation -1 / (x * x) becomes unstable and eventually breaks.
```python
def unsafe_reciprocal_gradient(initial_x, learning_rate, iterations):
    x = initial_x
    for i in range(iterations):
        # For function f(x) = 1/x, gradient is -1/x^2
        gradient = -1 / (x * x)
        x = x - learning_rate * gradient
        print(f"Iteration {i}: x = {x}")
    return x

# Starting at -0.5 with this step size, the first update lands
# exactly on zero, so the next gradient calculation divides by zero
result = unsafe_reciprocal_gradient(-0.5, 0.125, 5)
```

The gradient calculation -1 / (x * x) is a time bomb. Here the first update moves x from -0.5 exactly onto zero, so the second iteration's denominator vanishes and the program crashes with a ZeroDivisionError. The code below shows how to defuse it.
```python
def safe_reciprocal_gradient(initial_x, learning_rate, iterations):
    x = initial_x
    for i in range(iterations):
        # Check for potential division by zero
        if abs(x) < 1e-6:
            print("Warning: x too close to zero, stopping")
            break
        # For function f(x) = 1/x, gradient is -1/x^2
        gradient = -1 / (x * x)
        x = x - learning_rate * gradient
        print(f"Iteration {i}: x = {x}")
    return x

# Safely handles the division by zero that crashed the unsafe version
result = safe_reciprocal_gradient(-0.5, 0.125, 5)
```
The safe_reciprocal_gradient function prevents a crash by adding a simple safety check. Before calculating the gradient, it uses if abs(x) < 1e-6: to see if x is dangerously close to zero. If it is, the loop breaks, avoiding the division-by-zero error. Keep an eye out for this problem whenever your gradient calculation involves division, as variables can approach zero during optimization and cause your program to fail unexpectedly.
Real-world applications
Beyond the code and common pitfalls, gradient descent powers real-world tools for optimizing stock portfolios and building recommendation systems.
Optimizing a stock portfolio with gradient descent
In finance, you can use gradient descent to optimize a stock portfolio by iteratively adjusting the weights of each asset to find the allocation that minimizes overall volatility.
```python
import numpy as np

# Covariance matrix of stock returns (represents risk)
cov_matrix = np.array([
    [0.04, 0.01, 0.02],
    [0.01, 0.09, 0.03],
    [0.02, 0.03, 0.16]
])

# Initialize portfolio weights
weights = np.array([0.6, 0.3, 0.1])  # Initial allocation

# Gradient descent to find minimum volatility portfolio
learning_rate = 0.1
iterations = 50

for i in range(iterations):
    # Portfolio volatility (objective function)
    portfolio_variance = weights.T.dot(cov_matrix).dot(weights)
    portfolio_volatility = np.sqrt(portfolio_variance)
    # Gradient of portfolio variance with respect to weights
    gradient = 2 * cov_matrix.dot(weights)
    # Update weights
    weights -= learning_rate * gradient
    weights = np.maximum(0, weights)  # No short selling
    weights = weights / np.sum(weights)  # Normalize to sum to 1
    if i % 10 == 0:
        print(f"Iteration {i}: Portfolio volatility = {portfolio_volatility:.6f}")

print(f"Optimal portfolio weights: {weights.round(3)}")
print(f"Minimum volatility: {np.sqrt(weights.T.dot(cov_matrix).dot(weights)):.6f}")
```
This script finds the least risky allocation for a three-stock portfolio. It iteratively refines the weights of each stock to minimize total portfolio variance, which is calculated using a cov_matrix representing the stocks' risk and correlation.
- The gradient of the portfolio variance is calculated to find the direction of steepest risk increase.
- The `weights` are then updated in the opposite direction of this gradient.
- After each update, `np.maximum(0, weights)` prevents negative allocations, and the weights are re-normalized to ensure they always sum to 1.
Training a simple recommendation system with matrix factorization
Gradient descent can also train a recommendation system, using matrix factorization to predict unknown ratings by learning the hidden factors that drive user preferences.
```python
import numpy as np

# User-item ratings matrix (1-5 stars, 0 means no rating)
ratings = np.array([
    [5, 4, 0, 1],  # User 1
    [4, 0, 0, 5],  # User 2
    [0, 3, 4, 3]   # User 3
])

# Matrix factorization parameters
n_users, n_items = ratings.shape
n_factors = 2  # Latent factors
learning_rate = 0.01
iterations = 50

# Initialize user and item factor matrices
np.random.seed(42)
user_factors = np.random.normal(0, 0.1, (n_users, n_factors))
item_factors = np.random.normal(0, 0.1, (n_items, n_factors))

# Create mask for non-zero ratings
mask = (ratings > 0).astype(float)

# Train with gradient descent
for i in range(iterations):
    # Compute predicted ratings
    predictions = user_factors.dot(item_factors.T)
    # Compute error (only for observed ratings)
    error = mask * (predictions - ratings)
    # Compute gradients
    user_gradients = error.dot(item_factors)
    item_gradients = error.T.dot(user_factors)
    # Update factors with gradient descent
    user_factors -= learning_rate * user_gradients
    item_factors -= learning_rate * item_gradients

# Predict missing ratings
predictions = user_factors.dot(item_factors.T)
print(f"Predicted rating for User 1, Item 3: {predictions[0, 2]:.2f}")
print(f"Predicted rating for User 2, Item 2: {predictions[1, 1]:.2f}")
```
This script predicts missing movie ratings by learning latent features for users and items. It initializes two smaller matrices, user_factors and item_factors, with the goal of finding values that, when multiplied, approximate the original ratings matrix.
- The model first computes `predictions` by taking the dot product of the factor matrices.
- An `error` is calculated by comparing these predictions to known ratings, ignoring the missing ones by using a `mask`.
- Gradient descent then updates the `user_factors` and `item_factors` to steadily reduce this error over multiple `iterations`.
Get started with Replit
Turn what you've learned into a working application. Describe your goal to Replit Agent, like "build a tool to find the optimal price for a product" or "create a stock portfolio optimizer."
Replit Agent will write the code, test for errors, and deploy your app from a simple description. Start building with Replit.
Create and deploy websites, automations, internal tools, data pipelines and more in any programming language without setup, downloads or extra tools. All in a single cloud workspace with AI built in.
