Watch optimizers race on interactive loss surfaces. Click to set start, adjust learning rate, compare convergence.
import torch

# Rosenbrock function: a curved valley with its global minimum at (1, 1)
def loss_fn(params):
    x, y = params
    return (1 - x)**2 + 100 * (y - x**2)**2

# Initialize at the starting point
params = torch.tensor([-1.5, 1.5], dtype=torch.float64, requires_grad=True)
optimizer = torch.optim.SGD([params], lr=0.001)

# Run gradient descent
history = []
for step in range(200):
    optimizer.zero_grad()
    loss = loss_fn(params)
    loss.backward()
    optimizer.step()
    history.append((params[0].item(), params[1].item(), loss.item()))
    if step % 20 == 0:
        print(f"Step {step:>4d}: loss = {loss.item():.6f}, x = {params[0].item():.4f}, y = {params[1].item():.4f}")

print(f"Final: x = {params[0].item():.4f}, y = {params[1].item():.4f}")

Gradient descent finds the minimum of a function by repeatedly stepping in the direction opposite to the gradient, which points toward steepest ascent. The learning rate controls the step size: too large and the optimizer overshoots, too small and it crawls.
SGD: w = w - lr * grad. Simple, but can oscillate on elongated surfaces.
Momentum: v = 0.9*v + grad; w -= lr*v. Accumulates velocity, dampening oscillations.
AdaGrad: w -= lr*g / sqrt(sum(g^2)). Adapts the learning rate per parameter, but can slow to a halt over time.
RMSProp: cache = 0.9*cache + 0.1*g^2; w -= lr*g / sqrt(cache). Fixes AdaGrad's decay with an exponential moving average.
Adam: combines Momentum and RMSProp with bias correction. The default choice for most deep learning; adapts per parameter with momentum.
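The update rules above can be sketched directly in NumPy on a toy quadratic bowl f(w) = sum(w^2); the starting point, learning rates, and step counts here are illustrative choices, not values from the demo:

```python
import numpy as np

# Toy objective f(w) = sum(w^2) with gradient g(w) = 2w
def grad(w):
    return 2 * w

def sgd_step(w, lr):
    # Plain gradient descent: w = w - lr * grad
    return w - lr * grad(w)

def momentum_step(w, v, lr, beta=0.9):
    # v = beta*v + grad; w -= lr*v (velocity dampens oscillations)
    v = beta * v + grad(w)
    return w - lr * v, v

def adagrad_step(w, hist, lr, eps=1e-8):
    # hist accumulates ALL squared gradients, so steps shrink over time
    g = grad(w)
    hist = hist + g**2
    return w - lr * g / (np.sqrt(hist) + eps), hist

def rmsprop_step(w, cache, lr, decay=0.9, eps=1e-8):
    # cache = 0.9*cache + 0.1*g^2: a moving average instead of a running sum
    g = grad(w)
    cache = decay * cache + (1 - decay) * g**2
    return w - lr * g / (np.sqrt(cache) + eps), cache

# All four start from the same point on the same bowl
w0 = np.array([2.0, -3.0])
w_sgd = w0.copy()
w_mom, v = w0.copy(), np.zeros(2)
w_ada, hist = w0.copy(), np.zeros(2)
w_rms, cache = w0.copy(), np.zeros(2)
for _ in range(100):
    w_sgd = sgd_step(w_sgd, lr=0.1)
    w_mom, v = momentum_step(w_mom, v, lr=0.01)
    w_ada, hist = adagrad_step(w_ada, hist, lr=0.5)
    w_rms, cache = rmsprop_step(w_rms, cache, lr=0.1)
```

On this symmetric bowl all four head toward the origin; the differences between them only become dramatic on elongated or curved surfaces like Rosenbrock.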
Gradient descent is an optimization algorithm that iteratively adjusts model parameters by moving in the direction of steepest decrease of the loss function. It is the backbone of training neural networks and many other ML models.
SGD (Stochastic Gradient Descent) uses a fixed learning rate. Momentum adds velocity to accelerate through flat regions. RMSProp adapts the learning rate per parameter using a moving average of squared gradients. Adam combines momentum and RMSProp with bias correction, making it the most popular default optimizer.
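As a sketch, the optimizers can be raced on the Rosenbrock function using torch.optim; the learning rates and step count below are illustrative, not tuned:

```python
import torch

def rosenbrock(p):
    x, y = p
    return (1 - x)**2 + 100 * (y - x**2)**2

def run(opt_cls, lr, steps=5000, **kwargs):
    # Every optimizer starts from the same point (-1.5, 1.5)
    p = torch.tensor([-1.5, 1.5], dtype=torch.float64, requires_grad=True)
    opt = opt_cls([p], lr=lr, **kwargs)
    for _ in range(steps):
        opt.zero_grad()
        loss = rosenbrock(p)
        loss.backward()
        opt.step()
    return rosenbrock(p).item()

results = {
    "SGD": run(torch.optim.SGD, lr=1e-3),
    "SGD+momentum": run(torch.optim.SGD, lr=1e-4, momentum=0.9),
    "RMSprop": run(torch.optim.RMSprop, lr=1e-2),
    "Adam": run(torch.optim.Adam, lr=1e-2),
}
for name, final_loss in results.items():
    print(f"{name:>12s}: final loss = {final_loss:.6f}")
```

Note that the adaptive optimizers take roughly learning-rate-sized steps regardless of gradient magnitude, which is why they traverse the flat valley floor much faster than plain SGD.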
Start with common defaults like 0.001 for Adam or 0.01 for SGD. If the loss diverges, reduce it. If training is too slow, increase it. Learning rate schedulers can also decay the rate over time for better convergence.
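For example, PyTorch's StepLR scheduler multiplies the learning rate by a constant factor at a fixed interval; the interval and factor below are arbitrary choices for illustration:

```python
import torch

w = torch.tensor([2.0], requires_grad=True)
opt = torch.optim.SGD([w], lr=0.1)
# Multiply the learning rate by 0.1 every 50 optimizer steps
sched = torch.optim.lr_scheduler.StepLR(opt, step_size=50, gamma=0.1)

lrs = []
for step in range(150):
    opt.zero_grad()
    loss = (w ** 2).sum()
    loss.backward()
    opt.step()
    sched.step()  # advance the schedule once per optimizer step
    lrs.append(opt.param_groups[0]["lr"])

print(lrs[0], lrs[60], lrs[-1])  # 0.1 at the start, then decayed by 10x per interval
```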
Divergence happens when the learning rate is too large, causing the optimizer to overshoot the minimum and bounce to increasingly higher loss values. Reducing the learning rate or using adaptive optimizers like Adam can fix this.
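Divergence is easy to reproduce on the one-dimensional bowl f(w) = w^2, where any SGD learning rate above 1.0 overshoots so badly that each step lands farther from the minimum (the specific rates below are illustrative):

```python
import torch

def losses_for_lr(lr, steps=20):
    # Minimize f(w) = w^2 from w = 1 and record the loss at every step
    w = torch.tensor([1.0], requires_grad=True)
    opt = torch.optim.SGD([w], lr=lr)
    out = []
    for _ in range(steps):
        opt.zero_grad()
        loss = (w ** 2).sum()
        loss.backward()
        opt.step()
        out.append(loss.item())
    return out

stable = losses_for_lr(0.1)     # w shrinks by 20% per step: loss decays
diverging = losses_for_lr(1.5)  # each step overshoots past zero: loss explodes
```

With lr = 1.5 the update w -= 1.5 * 2w sends w to -2w, doubling its magnitude every step, which is exactly the bouncing-to-higher-loss behavior described above.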
A loss surface maps parameter values to loss values. Its shape determines how easy optimization is. Smooth, convex surfaces have a single minimum, while surfaces with narrow curved valleys (like Rosenbrock), saddle points, or local minima are harder to optimize.
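One way to probe a surface's shape is to evaluate it on a grid; for Rosenbrock, the lowest sampled point lands on the global minimum at (1, 1) at the bottom of the curved valley (the grid ranges below are arbitrary):

```python
import numpy as np

# Sample the Rosenbrock surface on a 401x401 grid
x = np.linspace(-2.0, 2.0, 401)  # step 0.01, so x = 1.0 lies on the grid
y = np.linspace(-1.0, 3.0, 401)
X, Y = np.meshgrid(x, y)
Z = (1 - X)**2 + 100 * (Y - X**2)**2

# Locate the lowest sampled point
i, j = np.unravel_index(np.argmin(Z), Z.shape)
print(f"min at ({X[i, j]:.2f}, {Y[i, j]:.2f}), loss = {Z[i, j]:.3g}")
```

The 100x coefficient on the (y - x^2)^2 term is what makes the valley walls so much steeper than the valley floor, and that anisotropy is precisely what punishes a single global learning rate.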