4 Optimization for Machine Learning

4.1 Overview

Training a machine learning model is an optimization problem. Our goal is to find the set of model parameters that minimizes a loss function. The loss function measures how poorly the model is performing on the training data. Gradient-based optimization algorithms are the workhorses that allow us to navigate the high-dimensional “loss landscape” to find this minimum.

For econometricians, the key connection is that this is the same basic problem as nonlinear least squares or maximum likelihood estimation: choose parameters to minimize an objective function. What changes in machine learning is the scale of the parameter space, the non-convexity of the objective, and the reliance on iterative gradient-based updates.

This chapter explains how these optimization methods work at a general level. The next chapter then applies this machinery to feed-forward neural networks and backpropagation.

4.2 Roadmap

  1. We begin with the basic intuition behind gradient descent.
  2. We then compare batch, stochastic, and mini-batch updates, since this distinction is central in machine learning practice.
  3. Next, we discuss the main weaknesses of vanilla gradient descent and motivate more advanced optimizers.
  4. We then study Momentum, AdaGrad, RMSprop, and Adam.
  5. Finally, we compare their behavior visually and summarize practical lessons for modern machine-learning estimation.

4.3 The Core Idea for Optimization: Gradient Descent

Imagine you are standing on a foggy mountain and want to get to the lowest valley. The simplest strategy is to look at the ground at your feet and take a step in the steepest downhill direction. You repeat this process until you can go no further down.

This is the core intuition behind Gradient Descent.

  • The “mountain” is the loss landscape.
  • Your “position” is the current set of model parameters, \(\boldsymbol \theta\).
  • The “steepest downhill direction” is the negative of the gradient of the loss function \(l\) with respect to the parameters, \(-\nabla_{\boldsymbol \theta} l(\boldsymbol \theta)\).

The update rule for the parameters is:

\[ \boldsymbol \theta_{\text{new}} = \boldsymbol \theta_{\text{old}} - \eta \nabla_{\boldsymbol \theta} l(\boldsymbol \theta_{\text{old}}) \]

where \(\eta\) is the learning rate, a hyperparameter that controls the size of the steps we take.

The main challenge is how to compute the gradient \(\nabla_{\boldsymbol \theta} l(\boldsymbol \theta)\). The way we use our training data to compute it leads to three main variants of gradient descent.
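As a minimal illustration (a toy sketch, not code from later in this chapter), we can run the update rule on a one-dimensional quadratic loss whose gradient we know in closed form:

```python
import numpy as np

# Minimal gradient descent sketch: minimize l(theta) = (theta - 3)^2,
# whose gradient is 2 * (theta - 3). The minimizer is theta* = 3.
def grad(theta):
    return 2.0 * (theta - 3.0)

theta = 0.0           # starting value theta^(0)
eta = 0.1             # learning rate
for t in range(100):  # iteration steps t = 1, ..., T
    theta = theta - eta * grad(theta)

print(theta)  # very close to the minimizer 3
```

Each iteration multiplies the distance to the minimizer by \(1 - 2\eta\), so with \(\eta = 0.1\) the error shrinks geometrically.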

4.4 Notation

To maintain notational consistency across all algorithms, we first establish our mathematical framework:

Setup: Consider a training dataset with \(N\) samples \(y_1, \dots, y_N\) and a parameter vector \(\boldsymbol{\theta} \in \mathbb{R}^m\).

Key definitions:

  • \(l_i(\boldsymbol{\theta}) = l(y_i, \hat{y}_i)\): loss for individual sample \(i\)
  • \(l(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^N l_i(\boldsymbol{\theta})\): average loss across all samples
  • \(y_i\): observed values, \(\hat{y}_i\): predicted/fitted values (depending on \(\boldsymbol{\theta}\))
  • \(Y = \{y_i | i = 1, \dots, N\}\): the complete dataset
Notation Convention

For brevity, we write \(\hat{y}_i\) instead of \(\hat{y}_i(\boldsymbol{\theta})\), but remember that predictions always depend on the current parameters.

4.5 Batch, Stochastic, and Mini-Batch Gradient Descent

Batch Gradient Descent

In Batch Gradient Descent, we calculate the gradient of the loss function using the entire training dataset in one go. With learning rate \(\eta > 0\) and starting value \(\boldsymbol{\theta}^{(0)}\), the gradient for iteration \(t=1,...,T\) is:

\[ \nabla_{\boldsymbol \theta} l(\boldsymbol{\theta}^{(t-1)}) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\boldsymbol \theta} l_i(\boldsymbol{\theta}^{(t-1)}) \]

which leads us to the update rule

\[\boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} - \eta \frac{1}{N} \sum_{i=1}^N \nabla_{\boldsymbol{\theta}} l_i(\boldsymbol{\theta}^{(t-1)})\]

Note
  • Pros: The gradient is a very accurate estimate of the true gradient, leading to a stable and direct convergence path.
  • Cons: It is computationally very expensive and memory-intensive, especially for large datasets. It is infeasible to load millions of samples into memory at once.

Stochastic Gradient Descent (SGD)

In Stochastic Gradient Descent, we calculate the gradient and update the parameters using just one training sample at a time. The gradient approximation reads as follows:

\[ \nabla_{\boldsymbol \theta} l(\boldsymbol{\theta}^{(t-1)}) \approx \nabla_{\boldsymbol \theta} l_i(\boldsymbol{\theta}^{(t-1)}) \quad \text{for a single random sample } i \]

Of course, using only one data point to update the parameters instead of all \(i = 1, \dots, N\) is a huge simplification.

Note
  • Pros: It is much faster per update and requires very little memory. The noisy updates can also help the model escape shallow local minima.
  • Cons: The path to the minimum is very noisy and erratic. The model never truly “converges” but rather bounces around the minimum.

Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is the happy medium and the most common approach used in deep learning. We calculate the gradient using a small, random subset of the data called a mini-batch.

More precisely: We first define \(J\) batches \(B_j\) such that \(\bigcup_{j=1}^J B_j = Y\) and \(B_j \cap B_{j'} = \emptyset \quad \text{ for all } j,j' = 1,\ldots,J:\ j \neq j'\). With learning rate \(\eta > 0\) and starting value \(\boldsymbol{\theta}^{(0)}\), we get the following update gradient approximation

\[ \nabla_{\boldsymbol \theta} l(\boldsymbol{\theta}^{(t-1)}) \approx \frac{1}{\left|B_j\right|} \sum_{y_i \in B_j} \nabla_{\boldsymbol \theta} l_i(\boldsymbol{\theta}^{(t-1)})\]

and update rule

\[ \boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} - \eta \frac{1}{\left|B_j\right|} \sum_{y_i \in B_j}\nabla_{\boldsymbol{\theta}}l_i(\boldsymbol{\theta}^{(t-1)}). \]

A typical mini-batch size \(\left|B_j\right|\) is 32, 64, or 128. The total number of iteration steps \(T\) does not need to coincide with the number of batches \(J\). Typically, we have \(J \ll T\), so that after going through all batches once, we draw new batches of the same size and continue updating the parameters.

Note
  • Pros: Mini-Batch Gradient Descent balances the stability of Batch Gradient Descent with the speed of SGD. It also allows for efficient computation using vectorized operations on modern hardware (GPUs).
  • Cons: It introduces an additional hyperparameter (the batch size) to (potentially) tune.

If you take

  • \(B_j \subseteq Y\) with \(\left|B_j\right| = 1\) for all \(j=1,\ldots,J\), then we are back at SGD, and
  • \(B_j = Y\) for each \(j\), we recover Batch Gradient Descent.

Question for Reflection

If observations are ordered over time or clustered by firm, what econometric concern is raised by randomly sampled mini-batches, and what kind of batch construction would better preserve the dependence structure?

Random mini-batches can break the local dependence structure and mix observations from different firms or time periods. This may still be useful computationally, but the resulting gradient noise no longer reflects the dependence pattern in the data. For time-series or panel problems, batches that keep time blocks or firm clusters together can be more appropriate when dependence or real-time information sets matter.
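One hypothetical construction (the helper name `time_block_batches` is ours, not from the chapter) keeps observations in contiguous time blocks, so each mini-batch preserves local temporal dependence:

```python
import numpy as np

# Hypothetical sketch: build mini-batches as contiguous time blocks instead of
# random subsets, so each batch preserves local temporal dependence.
def time_block_batches(n_obs, block_size):
    """Return index arrays covering 0 .. n_obs-1 in contiguous blocks."""
    starts = range(0, n_obs, block_size)
    return [np.arange(s, min(s + block_size, n_obs)) for s in starts]

batches = time_block_batches(n_obs=10, block_size=4)
# -> [array([0, 1, 2, 3]), array([4, 5, 6, 7]), array([8, 9])]
```

The same idea applies to panel data: replace contiguous index ranges with the index sets of individual firms or clusters.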

Definition: Epoch

An epoch refers to one full pass through the training data. With this notion introduced, we get the following quasi-code for mini-batch optimization in supervised learning:

Define \(n_e, n_{e,b} \in \mathbb{N}\setminus\{0\}\) the number of epochs for training and the number of batches for each epoch, respectively.

Algorithm: Mini-Batch Gradient Descent Training

Input: Training data \(\{(x_i, y_i)\}_{i=1}^N\), learning rate \(\eta\), batch size \(|B|\)

Initialize: \(\boldsymbol{\theta}^{(0)}\), set \(t = 0\)

Repeat for \(e = 1, \ldots, n_e\) (epochs):

  1. Shuffle training data \(\{(x_i, y_i)\}_{i=1}^N\)
  2. For each batch \(B_j\) where \(j = 1, \ldots, n_{e,b}\):
    • Compute fitted values: \(\hat{y}_i = f(x_i; \boldsymbol{\theta}^{(t)})\) for all observations in \(B_j\)
    • Compute batch loss: \(L_j = \frac{1}{|B_j|} \sum_{i \in B_j} \ell(y_i, \hat{y}_i)\)
    • Compute batch gradient: \(\boldsymbol{g}^{(t)} = \frac{1}{|B_j|} \sum_{i \in B_j} \nabla_{\boldsymbol{\theta}} \ell(y_i, f(x_i; \boldsymbol{\theta}^{(t)}))\)
    • Update parameters: \(\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \boldsymbol{g}^{(t)}\)
    • Increment counter: \(t \leftarrow t + 1\)

Stopping rule: Stop if a convergence criterion is reached or the maximum number of iterations \(T\) is exceeded

Output: Optimized parameters \(\boldsymbol{\theta}^{(T)}\)
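The quasi-code above can be sketched in Python for a linear model \(f(x;\boldsymbol\theta) = x^\top\boldsymbol\theta\) with squared-error loss. The data, hyperparameters, and variable names are illustrative, not a definitive implementation:

```python
import numpy as np

# Mini-batch gradient descent sketch for linear regression with squared loss.
rng = np.random.default_rng(0)
N, m = 200, 3
X = rng.normal(size=(N, m))
theta_true = np.array([1.0, -2.0, 0.5])
y = X @ theta_true + 0.01 * rng.normal(size=N)

theta = np.zeros(m)                      # theta^(0)
eta, batch_size, n_epochs = 0.1, 32, 50

for epoch in range(n_epochs):
    perm = rng.permutation(N)            # 1. shuffle the training data
    for start in range(0, N, batch_size):
        idx = perm[start:start + batch_size]    # batch B_j
        residual = X[idx] @ theta - y[idx]      # fitted minus observed values
        grad = X[idx].T @ residual / len(idx)   # batch gradient
        theta -= eta * grad                     # parameter update

print(theta)  # approximately theta_true
```

Each epoch reshuffles the data, so the batches differ across epochs; this is the standard practice the epoch definition above formalizes.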

4.6 Challenges and Advanced Optimizers

Challenges with Vanilla Gradient Descent
  • Choosing the right learning rate \(\eta\) is difficult
  • The same learning rate applies to all parameters, which may not be ideal
  • It can get stuck in saddle points, which are common in high-dimensional landscapes

To address these issues, researchers have developed adaptive optimization algorithms that adjust the learning rate for each parameter automatically.

In all of the following algorithms, we choose a learning rate \(\eta > 0\) and a starting value \(\boldsymbol{\theta}^{(0)}\). Consider iteration steps \(t = 1,\ldots,T\), and define \(\boldsymbol g_t = \frac{1}{N} \sum_{i = 1}^N \nabla_{\boldsymbol{\theta}}l_i(\boldsymbol{\theta}^{(t-1)})\) as the gradient at step \(t\). All methods below can be combined with mini-batches, but we leave this out for brevity of notation.

Momentum

Momentum helps accelerate SGD in the relevant direction and dampens oscillations. It achieves this by adding a fraction of the previous update vector to the current gradient, creating a “velocity” term that accumulates past gradients.

\[ \boldsymbol v_t = \delta \boldsymbol v_{t-1} + (1 - \delta) \boldsymbol g_t \]

\[ \boldsymbol \theta^{(t)} = \boldsymbol \theta^{(t-1)} - \eta \boldsymbol v_t \]

Here, \(\delta\) is the momentum coefficient, and the velocity \(\boldsymbol v_t\) (initialized at \(\boldsymbol v_0 = \boldsymbol 0\)) accumulates an exponentially weighted average of past gradients. This creates a “ball rolling downhill” effect that helps overcome small local minima and accelerates convergence in consistent directions.
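A one-dimensional toy sketch of the two momentum equations (illustrative values, not from the chapter's plotting code):

```python
import numpy as np

# Momentum sketch on l(theta) = 0.5 * theta^2, whose gradient is theta.
# The velocity v is the exponentially weighted average from the text.
theta, v = 5.0, 0.0
eta, delta = 0.5, 0.9
for t in range(200):
    g = theta                        # gradient at the current parameter
    v = delta * v + (1 - delta) * g  # velocity update
    theta -= eta * v                 # parameter update

print(theta)  # near the minimizer 0
```

Because the velocity averages past gradients, the iterate can overshoot the minimum and spiral in, which is exactly the oscillating-then-correcting behavior visible in the plots later in the chapter.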

AdaGrad (Adaptive Gradient)

AdaGrad adapts the learning rate for each parameter individually, performing larger updates for infrequent parameters and smaller updates for frequent ones. It achieves this by dividing the learning rate by the square root of the sum of all past squared gradients for that parameter.

For that, let us define \(\boldsymbol G_t = \sum_{\tau = 1}^t \boldsymbol g_\tau \boldsymbol g_\tau^\top + \varepsilon \boldsymbol I_m\) as the accumulated outer product of gradients with a small adjustment \(\varepsilon > 0\), and let \(\boldsymbol I_m\) denote the identity matrix of appropriate dimension. The standard AdaGrad time step is:

\[ \boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} - \eta\, \boldsymbol G_t^{-\frac{1}{2}}\boldsymbol g_t,\]

or a simplified form

\[\boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} - \eta \ \mathrm{diag}(\boldsymbol G_t)^{-\frac{1}{2}}\boldsymbol g_t\]

For illustration, we can consider the case of a one-dimensional parameter vector. Then, the simplified AdaGrad reads as follows:

\[ \theta^{(t)} = \theta^{(t - 1)} - \frac{\eta}{\sqrt{G_{t} + \varepsilon}}\, g_{t} \quad \text{with } G_{t} = \sum_{\tau=1}^{t} g_{\tau}^2.\]

Here it becomes clear why we needed to add the small \(\varepsilon\) component: it is a small smoothing constant that prevents division by zero.

Note
  • Pros: Excellent for sparse data (like in NLP), where some features appear infrequently.
  • Cons: The denominator term \(G_t\) grows monotonically. This causes the learning rate to eventually become infinitesimally small, effectively stopping the training process.
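The decaying effective step can be seen directly in a one-dimensional toy sketch matching the simplified formula above (illustrative values only):

```python
import numpy as np

# One-dimensional AdaGrad sketch: the accumulated squared gradients G_t
# shrink the effective step size monotonically over time.
theta, G = 5.0, 0.0
eta, eps = 0.5, 1e-8
steps = []
for t in range(50):
    g = theta                        # gradient of 0.5 * theta^2
    G += g**2                        # accumulate squared gradients
    step = eta / np.sqrt(G + eps)    # effective learning rate
    theta -= step * g
    steps.append(step)

# The effective learning rate only decreases -- AdaGrad's key limitation.
print(steps[0], steps[-1])
```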

RMSprop (Root Mean Square Propagation)

RMSprop modifies AdaGrad to resolve its main issue of a learning rate that aggressively and monotonically decreases. Instead of accumulating all past squared gradients, RMSprop uses an exponentially decaying average, preventing the denominator from growing endlessly. Define \(\boldsymbol g_t\) as above and \(\boldsymbol g_t^2\) as the elementwise product of \(\boldsymbol g_t\) with itself; that is, \(\boldsymbol g_t^2 = \boldsymbol g_t \odot \boldsymbol g_t\).

The update rules are (with \(\boldsymbol v_0 = \boldsymbol 0\)):

  1. Compute decaying average of squared gradients: \[ \boldsymbol v_t = \delta \boldsymbol v_{t-1} + (1 - \delta) \boldsymbol g_t^2 \]
  2. Update parameters: \[ \boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} - \eta \frac{\boldsymbol g_t}{\sqrt{\boldsymbol v_t} + \varepsilon} \] Here, \(\boldsymbol v_t\) is the moving average of squared gradients, \(\delta\) is the decay rate (e.g., 0.9), and the square root and division are applied elementwise. By “forgetting” the distant past, RMSprop keeps the learning rate adaptive and responsive throughout training.
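The same one-dimensional toy problem shows the contrast with AdaGrad: the decaying average keeps the effective step from shrinking to zero (illustrative values only):

```python
import numpy as np

# RMSprop sketch on l(theta) = 0.5 * theta^2. Unlike AdaGrad's running sum,
# the decaying average v keeps the effective step responsive throughout.
theta, v = 5.0, 0.0
eta, delta, eps = 0.1, 0.9, 1e-8
for t in range(300):
    g = theta                            # gradient at the current parameter
    v = delta * v + (1 - delta) * g**2   # decaying average of squared grads
    theta -= eta * g / (np.sqrt(v) + eps)

print(theta)  # hovers near the minimizer 0
```

With a constant \(\eta\), the iterate does not converge exactly but settles into small oscillations of roughly size \(\eta\) around the minimum, which is why learning-rate schedules are often combined with RMSprop in practice.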

Adam (Adaptive Moment Estimation)

Adam is arguably the most popular and effective optimization algorithm today. It combines the ideas of Momentum (using the first moment, or mean, of the gradients) and RMSprop (using the second moment, or uncentered variance, of the gradients).

The algorithm maintains two moving averages:

  1. First moment estimate (momentum): \[ \boldsymbol m_t = \beta \boldsymbol m_{t-1} + (1 - \beta) \boldsymbol g_{t} \]
  2. Second moment estimate (uncentered variance): \[ \boldsymbol v_t = \delta \boldsymbol v_{t-1} + (1 - \delta) \boldsymbol g_{t}^2 \]

Since \(\boldsymbol m_t\) and \(\boldsymbol v_t\) are initialized as zero vectors, they are biased toward zero, especially during the initial time steps. Adam corrects for this bias:

  1. Bias-corrected estimates:

    \[ \hat{\boldsymbol m}_t = \frac{1}{1 - \beta^t} \ \boldsymbol m_t \]

    \[ \hat{\boldsymbol v}_t = \frac{1}{1 - \delta^t} \ \boldsymbol v_t \]

  2. Parameter update:

    \[ \boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} - \eta \frac{\hat{\boldsymbol m}_t}{\sqrt{\hat{\boldsymbol v}_t} + \varepsilon}. \]

Adam is generally a good default choice for most deep learning problems due to its robustness and adaptive nature.
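Putting the four Adam equations together on the one-dimensional toy problem (illustrative hyperparameters, close to common defaults):

```python
import numpy as np

# Adam sketch on l(theta) = 0.5 * theta^2, including the bias corrections
# (1 - beta^t) and (1 - delta^t) for the two moment estimates.
theta, m, v = 5.0, 0.0, 0.0
eta, beta, delta, eps = 0.05, 0.9, 0.999, 1e-8
for t in range(1, 501):
    g = theta                            # gradient at the current parameter
    m = beta * m + (1 - beta) * g        # first moment (momentum)
    v = delta * v + (1 - delta) * g**2   # second moment (uncentered variance)
    m_hat = m / (1 - beta**t)            # bias-corrected estimates
    v_hat = v / (1 - delta**t)
    theta -= eta * m_hat / (np.sqrt(v_hat) + eps)

print(theta)  # ends near the minimizer 0
```

Note that without the bias correction, the first steps would be far too small, since \(\boldsymbol m_t\) and \(\boldsymbol v_t\) both start at zero.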

4.7 Optimizer Comparison

| Algorithm | Memory | Hyperparameters | Key Advantage | Key Limitation |
|---|---|---|---|---|
| Batch Gradient Descent | High | \(\eta\) | Stable, smooth convergence | Very slow, memory intensive |
| SGD | Low | \(\eta\) | Fast per iteration, low memory | Noisy, sensitive to \(\eta\) |
| Mini-Batch Gradient Descent | Medium | \(\eta\), batch size | Balances speed and stability | Additional hyperparameter to tune |
| Momentum | Low | \(\eta\), \(\delta\) | Accelerates convergence, dampens oscillations | Can overshoot, needs tuning |
| AdaGrad | Medium | \(\eta\), \(\varepsilon\) | Adapts per-parameter rates | Learning rate decays to zero |
| RMSprop | Medium | \(\eta\), \(\delta\), \(\varepsilon\) | Fixes AdaGrad’s decay problem | No momentum component |
| Adam | High | \(\eta\), \(\beta\), \(\delta\), \(\varepsilon\) | Combines momentum + adaptation | Can converge to suboptimal points |

4.8 Visualizing Optimizers

The following plot shows the paths taken by different optimizers on a contour plot of the loss function \(l(x,y) = 1 + 0.1\,x^2 + y^2\).

All optimizers employ a learning rate of 0.3 (Momentum uses a smaller rate for stability), and each algorithm stops after at most 100 iterations. The stopping criterion, checked at the end of each iteration, is whether \(l(x,y) < 1.01\). More details can be found if you unfold the code via “Show the code”. Note that the SGD optimization is illustrated by adding noise to the gradient update; in this 2D example, we only have one data point at each gradient update.

Notice how SGD is noisier than Batch Gradient Descent, Momentum overshoots but corrects, AdaGrad stalls, RMSprop corrects the stalling, and Adam looks like a combination of Momentum and RMSprop.

Show the code
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Define loss function (less elongated for clearer visualization)
def f(x, y):
    return 1 + 0.1 * x**2 + 1 * y**2  # Less extreme ratio

def df(x, y):
    return 0.2 * x, 2 * y

# Create contour plot
x = np.linspace(-12, 12, 100)
y = np.linspace(-8, 8, 100)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

def run_batch_gd(x0, y0, lr=0.3, n_steps=100):
    """Pure batch gradient descent - smooth path"""
    path = [(x0, y0)]
    x, y = x0, y0
    for _ in range(n_steps):
        gx, gy = df(x, y)
        x -= lr * gx
        y -= lr * gy
        path.append((x, y))
        if f(x, y) < 1.01:
            break
    return np.array(path)

def run_sgd(x0, y0, lr=0.3, n_steps=100):
    """Realistic SGD - simulate mini-batch variance"""
    path = [(x0, y0)]
    x, y = x0, y0
    
    # Generate synthetic dataset that creates natural variance
    np.random.seed(123)  # Different seed for data generation
    data_variance = 0.3  # Controls how much mini-batches differ
    
    for step in range(n_steps):
        # Each mini-batch has slightly different gradient estimate
        gx_base, gy_base = df(x, y)
        
        # Add realistic mini-batch variance (not just random noise)
        gx_noise = np.random.normal(0, data_variance * abs(gx_base * 1))
        gy_noise = np.random.normal(0, data_variance * abs(gy_base * 1))
        
        gx = gx_base + gx_noise
        gy = gy_base + gy_noise
        
        x -= lr * gx
        y -= lr * gy
        path.append((x, y))
        
        if f(x, y) < 1.01:
            break
    
    return np.array(path)

def run_momentum(x0, y0, lr=0.05, beta=0.9, n_steps=100):
    """Momentum optimizer - shows acceleration and overshooting"""
    path = [(x0, y0)]
    x, y = x0, y0
    vx, vy = 0, 0  # Initialize velocity
    
    for _ in range(n_steps):
        gx, gy = df(x, y)
        
        # Update velocity (momentum)
        vx = beta * vx + gx
        vy = beta * vy + gy
        
        # Update parameters
        x -= lr * vx
        y -= lr * vy
        path.append((x, y))
        
        if f(x, y) < 1.01:
            break
    
    return np.array(path)

def run_adagrad(x0, y0, lr=0.3, eps=1e-8, n_steps=100):
    """AdaGrad optimizer - shows stalling behavior"""
    path = [(x0, y0)]
    x, y = x0, y0
    gx_sum_sq, gy_sum_sq = 0, 0
    
    for step in range(n_steps):
        gx, gy = df(x, y)
        
        # Accumulate squared gradients
        gx_sum_sq += gx**2
        gy_sum_sq += gy**2
        
        # AdaGrad update - learning rate decreases over time
        effective_lr_x = lr / (np.sqrt(gx_sum_sq) + eps)
        effective_lr_y = lr / (np.sqrt(gy_sum_sq) + eps)
        
        x -= effective_lr_x * gx
        y -= effective_lr_y * gy
        path.append((x, y))
        
        # More lenient stopping criterion to see stalling
        if f(x, y) < 1.01 and step > 20:
            break
            
        # Stop if steps become too small (stalling)
        if step > 10 and effective_lr_x < 1e-4 and effective_lr_y < 1e-4:
            break
    
    return np.array(path)

def run_rmsprop(x0, y0, lr=0.3, gamma=0.9, eps=1e-8, n_steps=100):
    """RMSprop optimizer - maintains learning rate"""
    path = [(x0, y0)]
    x, y = x0, y0
    vx, vy = 0, 0
    
    for step in range(n_steps):
        gx, gy = df(x, y)
        
        # Update moving average of squared gradients
        vx = gamma * vx + (1 - gamma) * gx**2
        vy = gamma * vy + (1 - gamma) * gy**2
        
        # RMSprop update - learning rate stays more stable
        effective_lr_x = lr / (np.sqrt(vx) + eps)
        effective_lr_y = lr / (np.sqrt(vy) + eps)
        
        x -= effective_lr_x * gx
        y -= effective_lr_y * gy
        path.append((x, y))
        
        # Same stopping criterion
        if f(x, y) < 1.01 and step > 20:
            break
    
    return np.array(path)

def run_adam(x0, y0, lr=0.3, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=100):
    """Adam optimizer"""
    path = [(x0, y0)]
    x, y = x0, y0
    mx, my = 0, 0
    vx, vy = 0, 0
    
    for t in range(1, n_steps + 1):
        gx, gy = df(x, y)
        
        # Update moments
        mx = beta1 * mx + (1 - beta1) * gx
        my = beta1 * my + (1 - beta1) * gy
        vx = beta2 * vx + (1 - beta2) * gx**2
        vy = beta2 * vy + (1 - beta2) * gy**2
        
        # Bias correction
        mx_hat = mx / (1 - beta1**t)
        my_hat = my / (1 - beta1**t)
        vx_hat = vx / (1 - beta2**t)
        vy_hat = vy / (1 - beta2**t)
        
        # Update parameters
        x -= lr * mx_hat / (np.sqrt(vx_hat) + eps)
        y -= lr * my_hat / (np.sqrt(vy_hat) + eps)
        path.append((x, y))
        
        if f(x, y) < 1.01:
            break
    
    return np.array(path)

# Run all optimizers from the same starting point
start_x, start_y = -10, 5
path_batch = run_batch_gd(start_x, start_y)
path_sgd = run_sgd(start_x, start_y)
path_momentum = run_momentum(start_x, start_y)
path_adagrad = run_adagrad(start_x, start_y)
path_rmsprop = run_rmsprop(start_x, start_y)
path_adam = run_adam(start_x, start_y)

# Plot - 3x2 grid of subplots to accommodate all 6 optimizers
fig, axes = plt.subplots(3, 2, figsize=(10, 10))
axes = axes.flatten()

paths_and_titles = [
    (path_batch, 'Batch Gradient Descent'),
    (path_sgd, 'Stochastic Gradient Descent'),
    (path_momentum, 'Momentum'),
    (path_adagrad, 'AdaGrad'),
    (path_rmsprop, 'RMSprop'),
    (path_adam, 'Adam')
]

for i, (path, title) in enumerate(paths_and_titles):
    ax = axes[i]
    
    # Create contour background
    levels = np.logspace(0, 2, 15)  # Better level spacing
    ax.contour(X, Y, Z, levels=levels, alpha=0.6, colors='gray', linewidths=0.5)
    ax.contourf(X, Y, Z, levels=levels, alpha=0.3, cmap='viridis')
    
    # Plot optimization path
    ax.plot(path[:, 0], path[:, 1], 'r-', linewidth=2, alpha=0.8)
    ax.plot(path[0, 0], path[0, 1], 'go', markersize=10, label='Start', zorder=5)
    ax.plot(path[-1, 0], path[-1, 1], 'ro', markersize=8, label='End', zorder=5)
    ax.plot(0, 0, 'k*', markersize=15, label='Optimum', zorder=5)
    
    # Add arrows to show direction (fewer arrows for clarity)
    if len(path) > 5:
        arrow_indices = np.linspace(0, len(path)-2, min(6, len(path)-1), dtype=int)
        for j in arrow_indices:
            dx = path[j+1, 0] - path[j, 0]
            dy = path[j+1, 1] - path[j, 1]
            # Only draw arrow if movement is significant
            if np.sqrt(dx**2 + dy**2) > 0.1:
                ax.arrow(path[j, 0], path[j, 1], dx, dy, 
                        head_width=0.3, head_length=0.2, fc='red', ec='red', alpha=0.7)
    
    ax.set_aspect('equal', adjustable='box')

    ax.set_title(f'{title}\n({len(path)-1} steps)')
    ax.set_xlabel('θ₁')
    ax.set_ylabel('θ₂')
    ax.legend(loc='upper right')
    ax.grid(True, alpha=0.3)
    ax.set_xlim(-12, 3)
    ax.set_ylim(-2, 6)

plt.tight_layout()
plt.show()
Figure 4.1: Comparison of different optimization algorithms on a 2D loss landscape.
Note on Algorithm Performance

While Adam might not appear dramatically superior to other algorithms in this visualization, keep in mind that we’re using a very well-behaved quadratic function for demonstration purposes. In practice, neural network loss landscapes are:

  • Highly non-convex with many local minima and saddle points
  • High-dimensional (often millions of parameters)
  • Ill-conditioned with vastly different scales across parameters
  • Noisy due to mini-batch sampling

In such challenging real-world scenarios, Adam’s adaptive learning rates and momentum typically provide significant advantages over simpler methods. This simple 2D example serves to illustrate the core behavioral differences between optimizers rather than their relative performance on complex problems.

We can make the minimization problem more challenging by having multiple local minima and one global minimum. Of course, depending on the hyperparameters chosen, all algorithms may find the global minimum. The goal is to find an optimizer that is largely robust to different choices of hyperparameters.

Show the code
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Define a more challenging non-convex function with multiple local minima
def f_nonconvex(x, y):
    return 2/3 * ((x**2 + y - 11)**2 + (x + y**2 - 7)**2 + (x - 4)**2 * (y + 2)**2)

def df_nonconvex(x, y):
    # Partial derivative with respect to x
    dx = (2/3) * (4*x*(x**2 + y - 11) + 2*(x + y**2 - 7) + 2*(x - 4)*(y + 2)**2)
    
    # Partial derivative with respect to y
    dy = (2/3) * (2*(x**2 + y - 11) + 4*y*(x + y**2 - 7) + 2*(x - 4)**2*(y + 2))
    
    return dx, dy

# Create contour plot for non-convex function
x_nc = np.linspace(-6, 6, 600)
y_nc = np.linspace(-6, 6, 600)
X_nc, Y_nc = np.meshgrid(x_nc, y_nc)
Z_nc = f_nonconvex(X_nc, Y_nc)

def run_optimizer_nonconvex(optimizer_func, x0, y0, **kwargs):
    """Generic wrapper for running optimizers on non-convex function"""
    path = [(x0, y0)]
    x, y = x0, y0
    
    if 'momentum' in optimizer_func.__name__:
        vx, vy = 0, 0
        lr = kwargs.get('lr', 0.3)
        beta = kwargs.get('beta', 0.9)
        n_steps = kwargs.get('n_steps', 200)
        
        for _ in range(n_steps):
            gx, gy = df_nonconvex(x, y)
            vx = beta * vx + gx
            vy = beta * vy + gy
            x -= lr * vx
            y -= lr * vy
            path.append((x, y))
            if f_nonconvex(x, y) < 1 or np.isnan(x) or np.isnan(y):
                break
                
    elif 'adagrad' in optimizer_func.__name__:
        gx_sum_sq, gy_sum_sq = 0, 0
        lr = kwargs.get('lr', 0.3)
        eps = kwargs.get('eps', 1e-8)
        n_steps = kwargs.get('n_steps', 200)
        
        for step in range(n_steps):
            gx, gy = df_nonconvex(x, y)
            gx_sum_sq += gx**2
            gy_sum_sq += gy**2
            
            effective_lr_x = lr / (np.sqrt(gx_sum_sq) + eps)
            effective_lr_y = lr / (np.sqrt(gy_sum_sq) + eps)
            
            x -= effective_lr_x * gx
            y -= effective_lr_y * gy
            path.append((x, y))
            
            if f_nonconvex(x, y) < 1 or np.isnan(x) or np.isnan(y):
                break
            if step > 10 and effective_lr_x < 1e-6 and effective_lr_y < 1e-6:
                break
                
    elif 'rmsprop' in optimizer_func.__name__:
        vx, vy = 0, 0
        lr = kwargs.get('lr', 0.3)
        gamma = kwargs.get('gamma', 0.9)
        eps = kwargs.get('eps', 1e-8)
        n_steps = kwargs.get('n_steps', 200)
        
        for _ in range(n_steps):
            gx, gy = df_nonconvex(x, y)
            vx = gamma * vx + (1 - gamma) * gx**2
            vy = gamma * vy + (1 - gamma) * gy**2
            
            x -= lr * gx / (np.sqrt(vx) + eps)
            y -= lr * gy / (np.sqrt(vy) + eps)
            path.append((x, y))
            
            if f_nonconvex(x, y) < 1 or np.isnan(x) or np.isnan(y):
                break
                
    elif 'adam' in optimizer_func.__name__:
        mx, my = 0, 0
        vx, vy = 0, 0
        lr = kwargs.get('lr', 0.3)
        beta1 = kwargs.get('beta1', 0.9)
        beta2 = kwargs.get('beta2', 0.99)
        eps = kwargs.get('eps', 1e-8)
        n_steps = kwargs.get('n_steps', 200)
        
        for t in range(1, n_steps + 1):
            gx, gy = df_nonconvex(x, y)
            
            mx = beta1 * mx + (1 - beta1) * gx
            my = beta1 * my + (1 - beta1) * gy
            vx = beta2 * vx + (1 - beta2) * gx**2
            vy = beta2 * vy + (1 - beta2) * gy**2
            
            mx_hat = mx / (1 - beta1**t)
            my_hat = my / (1 - beta1**t)
            vx_hat = vx / (1 - beta2**t)
            vy_hat = vy / (1 - beta2**t)
            
            x -= lr * mx_hat / (np.sqrt(vx_hat) + eps)
            y -= lr * my_hat / (np.sqrt(vy_hat) + eps)
            path.append((x, y))
            
            if f_nonconvex(x, y) < 1 or np.isnan(x) or np.isnan(y):
                break
                
    elif 'sgd' in optimizer_func.__name__:
        lr = kwargs.get('lr', 0.3)
        n_steps = kwargs.get('n_steps', 200)
        np.random.seed(456)  # Different seed for SGD noise
        data_variance = 0.2
        
        for _ in range(n_steps):
            gx_base, gy_base = df_nonconvex(x, y)
            
            gx_noise = np.random.normal(0, data_variance * abs(gx_base * 5.3))
            gy_noise = np.random.normal(0, data_variance * abs(gy_base * 5.3))
            
            gx = gx_base + gx_noise
            gy = gy_base + gy_noise
            
            x -= lr * gx
            y -= lr * gy
            path.append((x, y))
            
            if f_nonconvex(x, y) < 1 or np.isnan(x) or np.isnan(y):
                break
                
    else:  # Batch GD
        lr = kwargs.get('lr', 0.3)
        n_steps = kwargs.get('n_steps', 200)
        
        for _ in range(n_steps):
            gx, gy = df_nonconvex(x, y)
            x -= lr * gx
            y -= lr * gy
            path.append((x, y))
            
            if f_nonconvex(x, y) < 1 or np.isnan(x) or np.isnan(y):
                break
    
    return np.array(path)

# Dummy functions: the wrapper dispatches on each function's __name__
def run_batch_gd_nc(): pass
def run_sgd_nc(): pass
def run_momentum_nc(): pass
def run_adagrad_nc(): pass
def run_rmsprop_nc(): pass
def run_adam_nc(): pass

# Run all optimizers from the same challenging starting point
start_x_nc, start_y_nc = -4, 3  # Start far from any minimum

path_batch_nc = run_optimizer_nonconvex(run_batch_gd_nc, start_x_nc, start_y_nc, lr=0.01)
path_sgd_nc = run_optimizer_nonconvex(run_sgd_nc, start_x_nc, start_y_nc, lr=0.01)
path_momentum_nc = run_optimizer_nonconvex(run_momentum_nc, start_x_nc, start_y_nc, lr=0.001, beta=0.9)
path_adagrad_nc = run_optimizer_nonconvex(run_adagrad_nc, start_x_nc, start_y_nc, lr=0.3)
path_rmsprop_nc = run_optimizer_nonconvex(run_rmsprop_nc, start_x_nc, start_y_nc, lr=0.3)
path_adam_nc = run_optimizer_nonconvex(run_adam_nc, start_x_nc, start_y_nc, lr=0.3)

# Plot - 3x2 grid of subplots for all 6 optimizers
fig, axes = plt.subplots(3, 2, figsize=(10, 10))
axes = axes.flatten()

paths_and_titles_nc = [
    (path_batch_nc, 'Batch Gradient Descent'),
    (path_sgd_nc, 'Stochastic Gradient Descent'),
    (path_momentum_nc, 'Momentum'),
    (path_adagrad_nc, 'AdaGrad'),
    (path_rmsprop_nc, 'RMSprop'),
    (path_adam_nc, 'Adam')
]

# Approximate global minimum of this modified Himmelblau-style function
global_minima = [(3.584428, -1.848126)]

for i, (path, title) in enumerate(paths_and_titles_nc):
    ax = axes[i]
    
    # Create contour background - use log scale for better visualization
    levels = np.logspace(0, 3.4, 20)
    ax.contour(X_nc, Y_nc, Z_nc, levels=levels, alpha=0.6, colors='gray', linewidths=0.5)
    ax.contourf(X_nc, Y_nc, Z_nc, levels=levels, alpha=0.3, cmap='viridis')
    
    # Plot the known global minimum once per panel
    gm_x, gm_y = global_minima[0]
    ax.plot(gm_x, gm_y, 'k*', markersize=12, alpha=0.8, label='Global minimum')
    
    # Plot optimization path
    if len(path) > 1:
        ax.plot(path[:, 0], path[:, 1], 'r-', linewidth=2, alpha=0.8)
        ax.plot(path[0, 0], path[0, 1], 'go', markersize=10, label='Start', zorder=5)
        ax.plot(path[-1, 0], path[-1, 1], 'ro', markersize=8, label='End', zorder=5)
        
        # Add arrows for direction (fewer for clarity)
        if len(path) > 5:
            arrow_indices = np.linspace(0, len(path)-2, min(5, len(path)-1), dtype=int)
            for j in arrow_indices:
                dx = path[j+1, 0] - path[j, 0]
                dy = path[j+1, 1] - path[j, 1]
                if np.sqrt(dx**2 + dy**2) > 0.05:
                    ax.arrow(path[j, 0], path[j, 1], dx, dy, 
                            head_width=0.1, head_length=0.08, fc='red', ec='red', alpha=0.7)
    
    ax.set_aspect('equal', adjustable='box')

    ax.set_title(f'{title}\n({len(path)-1} steps)')
    ax.set_xlabel('θ₁')
    ax.set_ylabel('θ₂')
    ax.legend(loc='upper right')
    ax.grid(True, alpha=0.3)
    ax.set_xlim(-6.1, 6.1)
    ax.set_ylim(-6.1, 6.1)

plt.tight_layout()
plt.show()
Figure 4.2: Comparison of optimization algorithms on a non-convex loss landscape with multiple local minima.

4.9 Summary

Key Takeaways
  1. Gradient descent is the basic optimization idea underlying modern neural-network training: update parameters in the direction of steepest descent.
  2. Mini-batch gradient descent is the practical standard because it balances computational efficiency with relatively stable gradient information.
  3. Learning-rate choice is central: too small leads to painfully slow progress, while too large leads to instability.
  4. Advanced optimizers such as Momentum, AdaGrad, RMSprop, and Adam modify vanilla gradient descent to improve stability, speed, or parameter-specific adaptation.
  5. Adam is often a strong default in practice, but optimizer choice matters more when the loss surface is non-convex, noisy, and high-dimensional.
Common Pitfalls

Learning Rate Issues

  • Too high: Training becomes unstable, loss explodes
  • Too low: Training is painfully slow, may get stuck in poor local minima
  • Solution: Use adaptive optimizers like Adam, or learning rate schedules
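The learning-rate pitfall can be reproduced in a few lines. The sketch below (the function name and constants are illustrative, not from the chapter's code) runs gradient descent on the quadratic \(L(\theta)=\tfrac{a}{2}\theta^2\), where the stability bound \(\eta<2/a\) is known exactly:

```python
def gd_final(eta, a=10.0, theta0=1.0, steps=50):
    """Run gradient descent on L(theta) = 0.5 * a * theta**2, return last iterate."""
    theta = theta0
    for _ in range(steps):
        theta -= eta * a * theta  # gradient of L is a * theta
    return theta

# The stability threshold here is eta = 2/a = 0.2
print(gd_final(eta=0.30))   # too high: |1 - eta*a| = 2, iterates explode
print(gd_final(eta=0.001))  # too low: contraction factor 0.99, barely moves
print(gd_final(eta=0.15))   # well chosen: converges rapidly toward 0
```

The three printed values make the bullet points concrete: divergence, near-stagnation, and fast convergence, all from the same algorithm with different \(\eta\).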

Overfitting to Simple Examples

  • Don’t judge optimizers solely on toy 2D functions
  • Real neural networks have millions of parameters and complex loss landscapes
  • What works on simple problems may not scale to deep learning

Prerequisite Boundary

  • This chapter teaches generic optimization ideas
  • Network architectures and backpropagation come in the next chapter
  • Exercises here should be solvable without prior neural-network notation

4.10 Exercises

Exercise 4.1: Learning Rates in an Ill-Conditioned Quadratic Problem

Consider the objective

\[ L(\theta_1,\theta_2)=\frac{1}{2}\left(a\theta_1^2+b\theta_2^2\right), \qquad a>b>0. \]

Suppose we apply gradient descent with constant learning rate \(\eta>0\):

\[ \boldsymbol{\theta}^{(t+1)}=\boldsymbol{\theta}^{(t)}-\eta \nabla L(\boldsymbol{\theta}^{(t)}). \]

  1. Derive the gradient and show that the updates satisfy \[ \theta_1^{(t+1)}=(1-\eta a)\theta_1^{(t)}, \qquad \theta_2^{(t+1)}=(1-\eta b)\theta_2^{(t)}. \]
  2. Show that convergence to \((0,0)\) requires \[ 0<\eta<\frac{2}{a}. \]
  3. Explain why, when \(a\gg b\), a single global learning rate creates a tension between fast progress in the \(\theta_2\) direction and stability in the \(\theta_1\) direction. Relate this to the motivation for adaptive or preconditioned optimization methods.

Exam level. The exercise formalizes why optimization becomes difficult in ill-conditioned problems and why a single learning rate can be inadequate.

Differentiate the quadratic objective coordinate by coordinate.

A scalar recursion \(x_{t+1}=cx_t\) converges to zero if and only if \(|c|<1\).

Compare the learning rate that would be natural for the flat direction, \(1/b\), with the stability restriction coming from the steep direction.

Part 1: Coordinate-Wise Gradient Descent Updates

The gradient is

\[ \nabla L(\theta_1,\theta_2)= \begin{pmatrix} a\theta_1\\ b\theta_2 \end{pmatrix}. \]

Hence gradient descent gives

\[ \theta_1^{(t+1)}=\theta_1^{(t)}-\eta a\theta_1^{(t)} =(1-\eta a)\theta_1^{(t)}, \]

and

\[ \theta_2^{(t+1)}=\theta_2^{(t)}-\eta b\theta_2^{(t)} =(1-\eta b)\theta_2^{(t)}. \]

Part 2: Deriving the Convergence Restriction

The iterates converge to zero if and only if both scalar recursions contract:

\[ |1-\eta a|<1 \qquad\text{and}\qquad |1-\eta b|<1. \]

These are equivalent to

\[ 0<\eta<\frac{2}{a} \qquad\text{and}\qquad 0<\eta<\frac{2}{b}. \]

Since \(a>b\), we have \(\frac{2}{a}<\frac{2}{b}\), so the binding restriction is

\[ 0<\eta<\frac{2}{a}. \]

Part 3: Why Ill-Conditioning Slows Optimization

If \(a\gg b\), then the objective is much steeper in the \(\theta_1\) direction than in the \(\theta_2\) direction. Stability forces

\[ \eta<\frac{2}{a}, \]

which can be very small. But then the update factor in the flat direction is

\[ 1-\eta b, \]

which stays close to 1, so convergence along \(\theta_2\) is very slow.

Thus a single global learning rate cannot be simultaneously well-tuned for both directions: a larger \(\eta\) would help progress in the flat direction but would destabilize the steep direction. This is one of the main motivations for adaptive or preconditioned methods, which scale updates differently across coordinates.
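To make the tension concrete, here is a minimal numerical sketch (constants chosen purely for illustration): with \(a=100\) and \(b=1\), stability forces \(\eta<2/a=0.02\), so the flat direction contracts by only about \(0.98\) per step:

```python
import numpy as np

# Illustrative check of the ill-conditioned quadratic from Exercise 4.1
a, b = 100.0, 1.0            # a >> b: steep and flat directions
eta = 0.019                  # just inside the stability bound 2/a = 0.02
theta = np.array([1.0, 1.0])
for _ in range(100):
    grad = np.array([a * theta[0], b * theta[1]])
    theta = theta - eta * grad

# Steep direction contracts by |1 - eta*a| = 0.9 per step,
# flat direction by |1 - eta*b| = 0.981 per step.
print(theta)  # first coordinate is near zero, second is still noticeably away
```

After 100 iterations the steep coordinate is essentially at the optimum while the flat coordinate has covered less than 90% of its distance, exactly the asymmetry that adaptive methods target.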

Exercise 4.2: Mini-Batch Gradients as Noisy Gradient Estimators

Let the full-sample objective be

\[ L(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^N \ell_i(\boldsymbol{\theta}), \]

and define the corresponding full-sample gradient

\[ g(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^N \nabla_{\boldsymbol{\theta}} \ell_i(\boldsymbol{\theta}). \]

Now draw a mini-batch \(B\) of size \(m\) uniformly at random from \(\{1,\dots,N\}\) without replacement, and define the mini-batch gradient estimator

\[ \hat g_B(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i\in B}\nabla_{\boldsymbol{\theta}} \ell_i(\boldsymbol{\theta}). \]

  1. Show that \[ \mathbb{E}\left[\hat g_B(\boldsymbol{\theta})\right] = g(\boldsymbol{\theta}), \] so the mini-batch gradient is an unbiased estimator of the full gradient.
  2. Consider the scalar case where \(\nabla_{\boldsymbol{\theta}} \ell_i(\boldsymbol{\theta})\) is replaced by \(z_i \in \mathbb{R}\). Show that \[ \mathrm{Var}\left(\frac{1}{m}\sum_{i\in B} z_i\right) = \frac{N-m}{m(N-1)} \cdot \frac{1}{N}\sum_{i=1}^N (z_i-\bar z)^2, \] where \(\bar z = \frac{1}{N}\sum_{i=1}^N z_i\).
  3. Use Part 2 to explain why smaller mini-batches produce noisier updates. Then explain one reason why such noise may nevertheless be useful in optimization.

Exam level. The exercise formalizes why mini-batch updates are noisy but still useful approximations to the full gradient.

Write the mini-batch estimator as an average of terms multiplied by inclusion indicators.

Use the standard variance formula for sampling without replacement, or derive it from inclusion probabilities.

Inspect the factor

\[ \frac{N-m}{m(N-1)}. \]

Part 1: Unbiasedness of the Mini-Batch Gradient

Let

\[ \mathbb{1}\{i \in B\} \]

denote the inclusion indicator for observation \(i\). Then

\[ \hat g_B(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i=1}^N \mathbb{1}\{i \in B\}\nabla_{\boldsymbol{\theta}}\ell_i(\boldsymbol{\theta}). \]

Taking expectations and using

\[ \mathbb{E}[\mathbb{1}\{i \in B\}] = \frac{m}{N}, \]

\[ \mathbb{E}[\hat g_B(\boldsymbol{\theta})] = \frac{1}{m}\sum_{i=1}^N \mathbb{E}[\mathbb{1}\{i \in B\}] \nabla_{\boldsymbol{\theta}}\ell_i(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i=1}^N \frac{m}{N}\nabla_{\boldsymbol{\theta}}\ell_i(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^N \nabla_{\boldsymbol{\theta}}\ell_i(\boldsymbol{\theta}) = g(\boldsymbol{\theta}). \]

So the mini-batch gradient is an unbiased estimator of the full gradient.

Part 2: Variance Under Sampling Without Replacement

For scalar values \(z_1,\dots,z_N\), this is the classical finite-population variance result: writing the sample mean with inclusion indicators and using \(\mathbb{P}(i\in B)=\frac{m}{N}\) and, for \(i\neq j\), \(\mathbb{P}(i\in B,\, j\in B)=\frac{m(m-1)}{N(N-1)}\), a direct second-moment computation shows that the sample mean over a simple random sample without replacement satisfies

\[ \mathrm{Var}\left(\frac{1}{m}\sum_{i\in B} z_i\right) = \frac{N-m}{m(N-1)}\,S_N^2, \]

where

\[ S_N^2 = \frac{1}{N}\sum_{i=1}^N (z_i-\bar z)^2. \]

Hence

\[ \mathrm{Var}\left(\frac{1}{m}\sum_{i\in B} z_i\right) = \frac{N-m}{m(N-1)} \cdot \frac{1}{N}\sum_{i=1}^N (z_i-\bar z)^2. \]

This is exactly the stated formula.

Part 3: Why Gradient Noise Can Help

The variance factor

\[ \frac{N-m}{m(N-1)} \]

falls as \(m\) increases. Therefore smaller mini-batches produce noisier gradient estimates and hence noisier parameter updates.

That noise can nevertheless be useful because it can prevent the algorithm from following a rigid deterministic path. In non-convex problems, moderate noise may help the optimizer move away from poor local geometries such as narrow basins or saddle regions.
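Both claims from Exercise 4.2 can be checked by simulation. In the sketch below (variable names are ad hoc), random scalars \(z_i\) stand in for per-observation gradients, and Monte Carlo moments of the mini-batch mean are compared with the formulas above:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m = 50, 10
z = rng.normal(size=N)               # stand-ins for per-observation gradients
z_bar = z.mean()
S2 = ((z - z_bar) ** 2).mean()       # (1/N) * sum of squared deviations

# Draw many mini-batches of size m without replacement
draws = np.array([rng.choice(z, size=m, replace=False).mean()
                  for _ in range(100_000)])

theory_var = (N - m) / (m * (N - 1)) * S2
print(draws.mean(), z_bar)           # unbiasedness: these should nearly agree
print(draws.var(), theory_var)       # variance: should match the formula
```

Rerunning with a smaller \(m\) inflates both the empirical and theoretical variance through the factor \(\frac{N-m}{m(N-1)}\), which is the noise-versus-cost trade-off discussed above.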

Note

Before moving on to neural networks, note the chapter boundary: so far the objects being optimized were generic objective functions. In the next chapter, the same optimization ideas are applied to feed-forward networks, where gradients are computed by backpropagation.

Exercise 4.3: Momentum on a Quadratic Objective

Consider the one-dimensional quadratic objective

\[ L(\theta)=\frac{a}{2}\theta^2, \qquad a>0, \]

and the momentum updates

\[ v_t=\delta v_{t-1}+(1-\delta)a\theta_t, \qquad \theta_{t+1}=\theta_t-\eta v_t, \]

where \(\eta>0\) is the learning rate and \(\delta\in[0,1)\) is the momentum parameter.

  1. Eliminate \(v_t\) and show that \(\theta_t\) satisfies the second-order recursion \[ \theta_{t+1} = \big(1+\delta-\eta(1-\delta)a\big)\theta_t -\delta \theta_{t-1}. \]
  2. Show that when \(\delta=0\) this reduces to the gradient descent recursion from Exercise 4.1. For the general case, use the following fact: a second-order linear recursion \(x_{t+1}=cx_t+dx_{t-1}\) with characteristic polynomial \(p(\lambda)=\lambda^2-c\lambda-d\) converges to zero if and only if \(p(1)>0\), \(p(-1)>0\), and \(|d|<1\). Apply this to show that, for general \(\delta\in[0,1)\), the recursion converges to zero if and only if \[ \eta<\frac{2(1+\delta)}{(1-\delta)\,a}. \]
  3. Compare this stability bound with the vanilla gradient descent bound \(\eta<2/a\). Explain why momentum widens the range of stable learning rates and why, despite this, momentum can still produce oscillatory transient behavior.

Exam level. The exercise turns momentum into a concrete dynamic system, derives its stability region, and compares it with vanilla gradient descent.

Use the update equation at times \(t\) and \(t-1\) to write

\[ v_{t-1}=\frac{\theta_{t-1}-\theta_t}{\eta}. \]

When \(\delta=0\) the lagged term vanishes. For the general case, identify \(c\) and \(d\) in the given stability criterion by matching the recursion to the form \(x_{t+1}=cx_t+dx_{t-1}\). Then evaluate \(p(1)\), \(p(-1)\), and \(|d|\) separately.

Compare the bound at \(\delta=0\) and \(\delta>0\). For oscillation, consider when the discriminant of the characteristic polynomial is negative, so the roots are complex.

Part 1: Deriving the Second-Order Recursion

From the parameter update,

\[ \theta_t=\theta_{t-1}-\eta v_{t-1}, \]

so

\[ v_{t-1}=\frac{\theta_{t-1}-\theta_t}{\eta}. \]

Now substitute this into the momentum update:

\[ v_t=\delta \frac{\theta_{t-1}-\theta_t}{\eta}+(1-\delta)a\theta_t. \]

Using \(\theta_{t+1}=\theta_t-\eta v_t\), we obtain \[\begin{align*} \theta_{t+1} &= \theta_t-\eta\left[\delta \frac{\theta_{t-1}-\theta_t}{\eta}+(1-\delta)a\theta_t\right] \\ &= \theta_t-\delta(\theta_{t-1}-\theta_t)-\eta(1-\delta)a\theta_t \\ &= \big(1+\delta-\eta(1-\delta)a\big)\theta_t-\delta\theta_{t-1}. \end{align*}\]

Part 2: Stability Analysis

When \(\delta=0\), the recursion becomes \(\theta_{t+1}=(1-\eta a)\theta_t\), which is the gradient descent recursion from Exercise 4.1. This converges iff \(|1-\eta a|<1\), i.e., \(0<\eta<2/a\).

For general \(\delta\in[0,1)\), write \(c=1+\delta-\eta(1-\delta)a\). The recursion has characteristic polynomial

\[ p(\lambda)=\lambda^2-c\lambda+\delta. \]

By the stability criterion given in the exercise, with \(d=-\delta\) so that \(|d|=\delta\), convergence requires \(p(1)>0\), \(p(-1)>0\), and \(|\delta|<1\).

  1. \(p(1)=1-c+\delta=\eta(1-\delta)a>0\), which always holds.

  2. \(p(-1)=1+c+\delta=2(1+\delta)-\eta(1-\delta)a>0\), giving

    \[ \eta<\frac{2(1+\delta)}{(1-\delta)\,a}. \]

  3. \(|\delta|<1\) holds by assumption.

Therefore convergence requires \(\eta<\frac{2(1+\delta)}{(1-\delta)\,a}\).

Part 3: Wider Stability but Possible Oscillation

At \(\delta=0\) the bound is \(\eta<2/a\). For \(\delta>0\) the bound becomes \(2(1+\delta)/((1-\delta)a)>2/a\), so momentum widens the range of stable learning rates.

However, the characteristic roots are complex when the discriminant \(c^2-4\delta<0\). Complex roots mean the iterates oscillate even while converging: the algorithm overshoots and corrects, cycling around the optimum before settling down. This explains why momentum can accelerate convergence in directions of persistent gradient but produces oscillatory transient behavior when the effective step size is aggressive.
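A quick simulation (the helper name is illustrative) confirms both conclusions. With \(a=1\) and \(\delta=0.9\), the momentum bound is \(\eta<2(1+\delta)/((1-\delta)a)=38\), so a learning rate far above the vanilla bound \(2/a=2\) still converges, but the iterates visibly oscillate:

```python
import numpy as np

def momentum_path(eta, delta, a=1.0, theta0=1.0, steps=200):
    """Iterate the momentum updates of Exercise 4.3 on L(theta) = 0.5*a*theta**2."""
    theta, v = theta0, 0.0
    path = [theta]
    for _ in range(steps):
        v = delta * v + (1 - delta) * a * theta
        theta = theta - eta * v
        path.append(theta)
    return np.array(path)

# Vanilla bound: eta < 2/a = 2.  Momentum bound: eta < 2(1+delta)/((1-delta)a) = 38.
path = momentum_path(eta=10.0, delta=0.9)
print(abs(path[-1]))                              # small: converges despite eta >> 2/a
print(int(np.sum(np.diff(np.sign(path)) != 0)))   # many sign changes: oscillation
```

The sign changes trace the complex characteristic roots: the iterates spiral toward zero with modulus \(\sqrt{\delta}\) per step rather than approaching it monotonically.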