Watch optimizers race on interactive loss surfaces. Click to set start, adjust learning rate, compare convergence.
import torch

# Rosenbrock function: a curved valley with its global minimum at (1, 1)
def loss_fn(params):
    x, y = params
    return (1 - x)**2 + 100 * (y - x**2)**2

# Initialize at the starting point
params = torch.tensor([-1.5, 1.5], dtype=torch.float64, requires_grad=True)
optimizer = torch.optim.SGD([params], lr=0.001)

# Run gradient descent
history = []
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(params)
    loss.backward()
    optimizer.step()
    history.append((params[0].item(), params[1].item(), loss.item()))
    if step % 20 == 0:
        print(f"Step {step:>4d}: loss = {loss.item():.6f}, x = {params[0].item():.4f}, y = {params[1].item():.4f}")

print(f"Final: x = {params[0].item():.4f}, y = {params[1].item():.4f}")

Gradient descent finds the minimum of a function by repeatedly stepping in the direction opposite to the gradient, which points toward steepest ascent. The learning rate controls the step size: too large and the optimizer overshoots, too small and it crawls.
SGD: w = w - lr * grad. Simple, but can oscillate on elongated surfaces.
Momentum: v = 0.9*v + grad; w -= lr*v. Accumulates velocity, dampening oscillations.
AdaGrad: w -= lr*g / sqrt(sum(g^2)). Adapts the learning rate per parameter, but can slow to a halt over time.
RMSProp: cache = 0.9*cache + 0.1*g^2; w -= lr*g / sqrt(cache). Fixes AdaGrad's decay with an exponential moving average.
Adam: combines Momentum and RMSProp with bias correction. The default choice for most deep learning; adapts per parameter with momentum.
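The update rules above can be sketched directly in NumPy on a toy quadratic bowl f(w) = sum(w^2); the starting point, learning rates, and step counts here are illustrative choices, not values from the demo:

```python
import numpy as np

# Toy objective f(w) = sum(w^2) with gradient g(w) = 2w
def grad(w):
    return 2 * w

def sgd_step(w, lr):
    # Plain gradient descent: w = w - lr * grad
    return w - lr * grad(w)

def momentum_step(w, v, lr, beta=0.9):
    # v = beta*v + grad; w -= lr*v (velocity dampens oscillations)
    v = beta * v + grad(w)
    return w - lr * v, v

def adagrad_step(w, hist, lr, eps=1e-8):
    # hist accumulates ALL squared gradients, so steps shrink over time
    g = grad(w)
    hist = hist + g**2
    return w - lr * g / (np.sqrt(hist) + eps), hist

def rmsprop_step(w, cache, lr, decay=0.9, eps=1e-8):
    # cache = 0.9*cache + 0.1*g^2: a moving average instead of a running sum
    g = grad(w)
    cache = decay * cache + (1 - decay) * g**2
    return w - lr * g / (np.sqrt(cache) + eps), cache

# All four start from the same point on the same bowl
w0 = np.array([2.0, -3.0])
w_sgd = w0.copy()
w_mom, v = w0.copy(), np.zeros(2)
w_ada, hist = w0.copy(), np.zeros(2)
w_rms, cache = w0.copy(), np.zeros(2)
for _ in range(100):
    w_sgd = sgd_step(w_sgd, lr=0.1)
    w_mom, v = momentum_step(w_mom, v, lr=0.01)
    w_ada, hist = adagrad_step(w_ada, hist, lr=0.5)
    w_rms, cache = rmsprop_step(w_rms, cache, lr=0.1)
```

On this symmetric bowl all four head toward the origin; the differences between them only become dramatic on elongated or curved surfaces like Rosenbrock.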
Gradient descent is an optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease of the loss function. It is the backbone of training neural networks and many other ML models.
SGD (Stochastic Gradient Descent) uses a fixed learning rate. Momentum adds velocity to accelerate through flat regions. RMSProp adapts the learning rate per parameter using a moving average of squared gradients. Adam combines momentum and RMSProp with bias correction, making it the most popular default optimizer.
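As a sketch, the optimizers can be raced on the Rosenbrock function using torch.optim; the learning rates and step count below are illustrative, not tuned:

```python
import torch

def rosenbrock(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def run(opt_cls, lr, steps=5000, **kwargs):
    # Every optimizer starts from the same point (-1.5, 1.5)
    p = torch.tensor([-1.5, 1.5], dtype=torch.float64, requires_grad=True)
    opt = opt_cls([p], lr=lr, **kwargs)
    for _ in range(steps):
        opt.zero_grad()
        loss = rosenbrock(p)
        loss.backward()
        opt.step()
    return rosenbrock(p).item()

results = {
    "SGD": run(torch.optim.SGD, lr=1e-3),
    "SGD+momentum": run(torch.optim.SGD, lr=1e-4, momentum=0.9),
    "RMSprop": run(torch.optim.RMSprop, lr=1e-2),
    "Adam": run(torch.optim.Adam, lr=1e-2),
}
for name, final_loss in results.items():
    print(f"{name:>12s}: final loss = {final_loss:.6f}")
```

Note that the adaptive optimizers take roughly learning-rate-sized steps regardless of gradient magnitude, which is why they traverse the flat valley floor much faster than plain SGD.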
Start with common defaults like 0.001 for Adam or 0.01 for SGD. If the loss diverges, reduce it. If training is too slow, increase it. Learning rate schedulers can also decay the rate over time for better convergence.
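For example, PyTorch's StepLR scheduler multiplies the learning rate by a constant factor at a fixed interval; the interval and factor below are arbitrary choices for illustration:

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)
# Multiply the learning rate by 0.1 every 50 optimizer steps
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.1)

lrs = []
for step in range(150):
    opt.zero_grad()
    loss = (w ** 2).sum()
    loss.backward()
    opt.step()
    sched.step()  # advance the schedule once per optimizer step
    lrs.append(opt.param_groups[0]["lr"])

print(lrs[0], lrs[60], lrs[-1])  # 0.1 at the start, then decayed by 10x per interval
```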
Divergence happens when the learning rate is too large, causing the optimizer to overshoot the minimum and bounce to increasingly higher loss values. Reducing the learning rate or using adaptive optimizers like Adam can fix this.
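Divergence is easy to reproduce on the one-dimensional bowl f(w) = w^2, where any SGD learning rate above 1.0 overshoots so badly that each step lands farther from the minimum (the specific rates below are illustrative):

```python
import torch

def losses_for_lr(lr, steps=20):
    # Minimize f(w) = w^2 from w = 1 and record the loss at every step
    w = torch.tensor([1.0], requires_grad=True)
    opt = torch.optim.SGD([w], lr=lr)
    out = []
    for _ in range(steps):
        opt.zero_grad()
        loss = (w ** 2).sum()
        loss.backward()
        opt.step()
        out.append(loss.item())
    return out

stable = losses_for_lr(0.1)     # w shrinks by 20% per step: loss decays
diverging = losses_for_lr(1.5)  # each step overshoots past zero: loss explodes
```

With lr = 1.5 the update w -= 1.5 * 2w sends w to -2w, doubling its magnitude every step, which is exactly the bouncing-to-higher-loss behavior described above.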
A loss surface maps parameter values to loss values. Its shape determines how easy optimization is. Smooth, convex surfaces have a single minimum, while surfaces with narrow curved valleys (like Rosenbrock), saddle points, or local minima are harder to optimize.
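One way to probe a surface's shape is to evaluate it on a grid; for Rosenbrock, the lowest sampled point lands on the global minimum at (1, 1) at the bottom of the curved valley (the grid ranges below are arbitrary):

```python
import numpy as np

# Sample the Rosenbrock surface on a 401x401 grid
x = np.linspace(-2.0, 2.0, 401)  # step 0.01, so x = 1.0 lies on the grid
y = np.linspace(-1.0, 3.0, 401)
X, Y = np.meshgrid(x, y)
Z = (1 - X)**2 + 100 * (Y - X**2)**2

# Locate the lowest sampled point
i, j = np.unravel_index(np.argmin(Z), Z.shape)
print(f"min at ({X[i, j]:.2f}, {Y[i, j]:.2f}), loss = {Z[i, j]:.3g}")
```

The 100x coefficient on the (y - x^2)^2 term is what makes the valley walls so much steeper than the valley floor, and that anisotropy is precisely what punishes a single global learning rate.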