4 Optimization for Machine Learning

4.1 Overview

Training a machine learning model is an optimization problem. The goal is to find the set of model parameters that minimizes a loss function, which measures how poorly the model fits the training data. Gradient-based methods are the standard tools for navigating the high-dimensional “loss landscape” to find this minimum. Stochastic-gradient methods trace back to stochastic approximation ideas such as Robbins and Monro (1951).

For econometricians, the key connection is that this is the same basic problem as nonlinear least squares or maximum likelihood estimation: choose parameters to minimize an objective function. What changes in machine learning is the scale of the parameter space, the non-convexity of the objective, and the reliance on iterative gradient-based updates.

This chapter explains how these optimization methods work at a general level. The next chapter then applies this machinery to feed-forward neural networks and backpropagation.

4.2 Roadmap

  1. We begin with the basic intuition behind gradient descent.
  2. We then compare batch, stochastic, and mini-batch updates, since this distinction is central in machine learning practice.
  3. Next, we discuss the main weaknesses of vanilla gradient descent and motivate more advanced optimizers.
  4. We then study Momentum, AdaGrad, RMSprop, and Adam.
  5. Finally, we compare their behavior visually and summarize practical lessons for modern machine-learning estimation.

4.3 The Core Idea for Optimization: Gradient Descent

Imagine you are standing on a foggy mountain and want to get to the lowest valley. The simplest strategy is to look at the ground at your feet and take a step in the steepest downhill direction. You repeat this process until further steps no longer reduce the loss.

This is the core intuition behind Gradient Descent.

  • The “mountain” is the loss landscape.
  • Your “position” is the current set of model parameters, \(\boldsymbol \theta\).
  • The “steepest downhill direction” is the negative of the gradient of the loss function \(l\) with respect to the parameters, \(-\nabla_{\boldsymbol \theta} l(\boldsymbol \theta)\).

The update rule for the parameters is:

\[ \boldsymbol \theta_{\text{new}} = \boldsymbol \theta_{\text{old}} - \eta \nabla_{\boldsymbol \theta} L(\boldsymbol\theta_{\text{old}}) \]

where \(\eta\) is the learning rate, a hyperparameter that controls the size of the steps we take.

The main challenge is how to compute the gradient \(\nabla_{\boldsymbol \theta} L(\boldsymbol \theta)\). The way we use our training data to compute it leads to three main variants of gradient descent.

4.4 Notation

The notation below is used throughout the chapter:

Setup: Consider a supervised training dataset \(\mathcal{D}=\{(x_i,y_i)\}_{i=1}^N\) and parameter vectors \(\boldsymbol{\theta} \in \mathbb{R}^m\).

Key definitions:

  • \(\ell_i(\boldsymbol{\theta}) = \ell(y_i, \hat{y}_i)\): loss for individual sample \(i\)
  • \(L(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^N \ell_i(\boldsymbol{\theta})\): average training loss
  • \(x_i\): predictors, \(y_i\): observed value, \(\hat{y}_i\): predicted/fitted value depending on \(\boldsymbol{\theta}\)
  • \(\mathcal{D} = \{(x_i,y_i): i = 1, \dots, N\}\): the complete training dataset
Notation Convention

For brevity, we write \(\hat{y}_i\) instead of \(\hat{y}_i(\boldsymbol{\theta})\), but remember that predictions always depend on the current parameters.

4.5 Batch, Stochastic, and Mini-Batch Gradient Descent

Batch Gradient Descent

In Batch Gradient Descent, we calculate the gradient of the loss function using the entire training dataset in one go. With learning rate \(\eta > 0\) and starting value \(\boldsymbol{\theta}^{(0)}\), the gradient for iteration \(t=1,\ldots,T\) is:

\[ \nabla_{\boldsymbol \theta} L(\boldsymbol \theta^{(t-1)}) = \frac{1}{N} \sum_{i=1}^{N} \nabla_{\boldsymbol \theta} \ell_i(\boldsymbol{\theta}^{(t-1)}) \]

which leads us to the update rule

\[\boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} - \eta \frac{1}{N} \sum_{i=1}^N \nabla_{\boldsymbol{\theta}}\ell_i(\boldsymbol{\theta}^{(t-1)})\]

Note
  • Pros: The gradient is the exact training-set gradient, giving a stable and direct convergence path.
  • Cons: Computationally expensive and memory-intensive on large datasets; infeasible to load millions of samples into memory at once.

Stochastic Gradient Descent (SGD)

In Stochastic Gradient Descent, we calculate the gradient and update the parameters using just one training sample at a time. The gradient approximation reads as follows:

\[ \nabla_{\boldsymbol \theta} L(\boldsymbol{\theta}^{(t-1)}) \approx \nabla_{\boldsymbol \theta} \ell_i(\boldsymbol{\theta}^{(t-1)}) \quad \text{for a single random sample } i \]

Using only one data point per update instead of all \(i = 1, \dots, N\) is a coarse approximation to the full gradient.

Note
  • Pros: Fast per update and requires little memory; the noisy updates can also help the model escape shallow local minima.
  • Cons: The path to the minimum is noisy and erratic; the model does not converge to a fixed point but bounces around the minimum.

Mini-Batch Gradient Descent

Mini-Batch Gradient Descent is the standard approach in deep learning, balancing batch and SGD. The gradient is computed on a small, random subset of the data called a mini-batch.

More precisely: We first define \(J\) batches \(B_j\) such that \(\bigcup_{j=1}^J B_j = \{1,\ldots,N\}\) and \(B_j \cap B_{j'} = \emptyset \quad \text{ for all } j,j' = 1,\ldots,J:\ j \neq j'\). With learning rate \(\eta > 0\) and starting value \(\boldsymbol{\theta}^{(0)}\), we get the following gradient approximation for the selected batch \(B_j\):

\[ \nabla_{\boldsymbol \theta} L(\boldsymbol \theta^{(t-1)}) \approx \frac{1}{\left|B_j\right|} \sum_{i \in B_j} \nabla_{\boldsymbol \theta} \ell_i(\boldsymbol \theta^{(t-1)})\]

and update rule

\[ \boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} - \eta \frac{1}{\left|B_j\right|} \sum_{i \in B_j}\nabla_{\boldsymbol{\theta}}\ell_i(\boldsymbol{\theta}^{(t-1)}). \]

A typical mini-batch size \(m\) is 32, 64, or 128. The total number of iteration steps for the algorithm \(T\) doesn’t need to coincide with the number of batches \(J\). Typically, we have \(J \ll T\) such that, after going through all batches for updating the parameters, we draw new batches of the same size to continue updating our parameter.

Note
  • Pros: Mini-Batch Gradient Descent balances the stability of Batch Gradient Descent with the speed of SGD. It also allows for efficient computation using vectorized operations on modern hardware (GPUs).
  • Cons: It introduces an additional hyperparameter (the batch size) to (potentially) tune.

If you take - \(|B_j| = 1\) for each batch used in the update, then we are back at SGD, and - \(J = 1\) with \(B_1 = \{1,\ldots,N\}\) reduces to Batch Gradient Descent.

Question for Reflection

If observations are ordered over time or clustered by firm, what econometric concern is raised by randomly sampled mini-batches, and what kind of batch construction would better preserve the dependence structure?

Random mini-batches can break the local dependence structure and mix observations from different firms or time periods. This may still be useful computationally, but the resulting gradient noise no longer reflects the dependence pattern in the data. For time-series or panel problems, batches that keep time blocks or firm clusters together can be more appropriate when dependence or real-time information sets matter.

Definition: Epoch

An epoch refers to one full pass through the training data. With this notion introduced, we get the following pseudocode for mini-batch optimization in supervised learning:

Define \(n_e, n_{e,b} \in \mathbb{N}\setminus\{0\}\) the number of epochs for training and the number of batches for each epoch, respectively.

Algorithm: Mini-Batch Gradient Descent Training

Input: Training data \(\{(x_i, y_i)\}_{i=1}^N\), learning rate \(\eta\), batch size \(|B_j|\)

Initialize: \(\boldsymbol{\theta}^{(0)}\), set \(t = 0\)

Repeat for \(e = 1, \ldots, n_e\) (epochs):

  1. Shuffle training data \(\{(x_i, y_i)\}_{i=1}^N\)
  2. For each batch \(B_j\) where \(j = 1, \ldots, n_{e,b}\):
    • Compute fitted values: \(\hat{y}_i = f(x_i; \boldsymbol{\theta}^{(t)})\) for all observations in \(B_j\)
    • Compute batch loss: \(L_j = \frac{1}{|B_j|} \sum_{i \in B_j} \ell(y_i, \hat{y}_i)\)
    • Compute batch gradient: \(\boldsymbol{g}^{(t)} = \frac{1}{|B_j|} \sum_{i \in B_j} \nabla_{\boldsymbol{\theta}} \ell(y_i, f(x_i; \boldsymbol{\theta}^{(t)}))\)
    • Update parameters: \(\boldsymbol{\theta}^{(t+1)} = \boldsymbol{\theta}^{(t)} - \eta \boldsymbol{g}^{(t)}\)
    • Increment counter: \(t \leftarrow t + 1\)

Stopping rule: Stop if a convergence criterion is reached or the maximum number of iterations \(T\) is reached

Output: Final parameters \(\boldsymbol{\theta}^{(t^\ast)}\), where \(t^\ast\) is the realized stopping iteration

4.6 Challenges and Advanced Optimizers

Challenges with Vanilla Gradient Descent
  • Choosing the right learning rate \(\eta\) is difficult
  • The same learning rate applies to all parameters, which may not be ideal
  • It can get stuck in saddle points, which are common in high-dimensional landscapes (Dauphin et al. 2014)

To address these issues, researchers have developed adaptive optimization algorithms that adjust the learning rate for each parameter automatically.

In all of the following algorithms, we choose a learning rate \(\eta > 0\) and a starting value \(\boldsymbol{\theta}^{(0)}\). Consider iteration steps \(t = 1,\ldots,T\), and define \(\boldsymbol g_t = \frac{1}{N} \sum_{i = 1}^N \nabla_{\boldsymbol{\theta}}\ell_i(\boldsymbol{\theta}^{(t-1)})\) as the gradient used for the \(t\)th update. We can combine all methods below with mini-batches, but leave this out for brevity in notation.

Momentum

The momentum idea traces to Polyak (1964), who introduced the “heavy ball” method to accelerate iterative optimization, and Nesterov (1983), whose accelerated-gradient variant achieves a provably better convergence rate on convex objectives. Momentum accelerates SGD in directions of persistent gradient and dampens oscillations. It does so by adding a fraction of the previous update vector to the current gradient, creating a “velocity” term that accumulates past gradients.

\[ \boldsymbol v_t = \delta \boldsymbol v_{t-1} + (1 - \delta) \boldsymbol g_t \]

\[ \boldsymbol \theta^{(t+1)} = \boldsymbol \theta^{(t)} - \eta \boldsymbol v_t \]

Here, \(\delta\) is the momentum coefficient, and the velocity \(v_t\) accumulates a weighted average of past gradients. This creates a “ball rolling downhill” effect that helps overcome small local minima and accelerates convergence in consistent directions.

AdaGrad (Adaptive Gradient)

AdaGrad adapts the learning rate for each parameter individually, performing larger updates for infrequent parameters and smaller updates for frequent ones (Duchi, Hazan, and Singer 2011). It achieves this by dividing the learning rate by the square root of the sum of all past squared gradients for that parameter.

Two versions appear in the literature. The full-matrix version accumulates the outer products

\[ \boldsymbol{\mathcal{G}}_t = \sum_{\tau=1}^{t} \boldsymbol g_\tau \boldsymbol g_\tau^\top + \varepsilon \boldsymbol I_m \]

and updates

\[ \boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} - \eta \, \boldsymbol{\mathcal{G}}_t^{-\frac{1}{2}}\boldsymbol g_t. \]

This form is mathematically clean but computationally infeasible in modern networks: storing and inverting an \(m \times m\) matrix with \(m\) in the millions is impractical. The version actually used in software (and called AdaGrad below) replaces the full matrix with its diagonal: define the elementwise gradient accumulator

\[ \boldsymbol G_t = \sum_{\tau=1}^{t} \boldsymbol g_\tau \odot \boldsymbol g_\tau \in \mathbb{R}^m, \]

and update each coordinate by

\[ \boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} - \eta \, \frac{\boldsymbol g_t}{\sqrt{\boldsymbol G_t + \varepsilon}}, \]

where the division and square root are elementwise. For illustration, the one-dimensional case reads

\[ \theta^{(t)} = \theta^{(t - 1)} - \frac{\eta}{\sqrt{G_{t} + \varepsilon}}\, g_{t}, \qquad G_{t} = \sum_{\tau=1}^{t} g_{\tau}^2.\]

The small constant \(\varepsilon > 0\) is a smoothing term that prevents division by zero. Throughout this book, AdaGrad refers to the diagonal version unless stated otherwise.

Note
  • Pros: Excellent for sparse data (like in NLP), where some features appear infrequently.
  • Cons: The denominator \(\sqrt{\boldsymbol G_t + \varepsilon}\) grows monotonically. This causes the effective learning rate to eventually become infinitesimally small, effectively stopping the training process.

RMSprop (Root Mean Square Propagation)

RMSprop modifies AdaGrad to resolve its main issue of a learning rate that aggressively and monotonically decreases. Instead of accumulating all past squared gradients, RMSprop uses an exponentially decaying average, preventing the denominator from growing endlessly. Define \(g_t\) as above and \(g_t^2\) as the elementwise product of \(g_t\) times \(g_t\); that is, \(g_t^2 = g_t \odot g_t\).

The update rules are (with \(v_0 = 0\)):

  1. Compute decaying average of squared gradients: \[ \boldsymbol v_t = \delta \boldsymbol v_{t-1} + (1 - \delta) \boldsymbol g_t^2 \]
  2. Update parameters: \[ \boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} - \eta \frac{1}{\sqrt{v_t} + \varepsilon}g_t \] Here, \(v_t\) is the moving average of squared gradients, and \(\delta\) is the decay rate (e.g., 0.9). By “forgetting” the distant past, RMSprop keeps the learning rate adaptive and responsive throughout training.

Adam (Adaptive Moment Estimation)

Adam is a widely used default optimizer for neural networks (Kingma and Ba 2015). It combines the ideas of Momentum (using the first moment, or mean, of the gradients) and RMSprop (using the second moment, or uncentered variance, of the gradients).

The algorithm maintains two moving averages:

  1. First moment estimate (momentum): \[ \boldsymbol m_t = \beta \boldsymbol m_{t-1} + (1 - \beta) \boldsymbol g_{t} \]
  2. Second moment estimate (uncentered variance): \[ \boldsymbol v_t = \delta \boldsymbol v_{t-1} + (1 - \delta) \boldsymbol g_{t}^2 \]

Since \(\boldsymbol m_t\) and \(\boldsymbol v_t\) are initialized as zero vectors, they are biased toward zero, especially during the initial time steps. Unrolling the recursion gives \(\boldsymbol m_t = (1-\beta) \sum_{\tau=1}^{t} \beta^{t-\tau} \boldsymbol g_\tau\), so under a stationary gradient process with \(\mathbb{E}[\boldsymbol g_\tau] = \boldsymbol \mu\) we have \(\mathbb{E}[\boldsymbol m_t] = (1-\beta^t)\boldsymbol \mu\). Adam corrects this bias by rescaling:

  1. Bias-corrected estimates: \[ \hat{\boldsymbol m}_t = \frac{\boldsymbol m_t}{1 - \beta^t} \] \[ \hat{\boldsymbol v}_t = \frac{\boldsymbol v_t}{1 - \delta^t} \]

The correction restores unbiasedness under stationarity; during actual training, gradient means and second moments drift across iterations, so the correction is approximate but still removes most of the initial zero-init bias.

  1. Parameter update: \[ \boldsymbol{\theta}^{(t)} = \boldsymbol{\theta}^{(t-1)} - \eta \frac{\hat{\boldsymbol m}_t}{\sqrt{\hat{\boldsymbol v}_t} + \varepsilon}. \]

Adam is generally a good default choice for most deep learning problems due to its robustness and adaptive nature. A modern variant is AdamW (Loshchilov and Hutter 2019), which decouples weight decay from the adaptive gradient update; this makes the regularization effect easier to tune than in vanilla Adam with an ordinary \(L_2\) penalty.

4.7 Optimizer Comparison

The column Optimizer state reports the number of running state tensors the optimizer stores per parameter (on top of the parameter vector itself). It reflects the memory overhead of the update rule and should not be confused with dataset memory, which is controlled separately by the batch size.

Algorithm Optimizer state Hyperparameters Key Advantage Key Limitation
Batch Gradient Descent 0 \(\eta\) Stable, smooth convergence Very slow; loads the full dataset per step
SGD 0 \(\eta\) Fast per iteration Noisy, sensitive to \(\eta\)
Mini-Batch Gradient Descent 0 \(\eta\), batch size Balances speed and stability Additional hyperparameter to tune
Momentum 1 (velocity) \(\eta\), \(\delta\) Accelerates convergence, dampens oscillations Can overshoot, needs tuning
AdaGrad 1 (squared-gradient sum) \(\eta\), \(\epsilon\) Adapts per-parameter rates Learning rate decays to zero
RMSprop 1 (EMA of squared gradients) \(\eta\), \(\gamma\), \(\epsilon\) Fixes AdaGrad’s decay problem No momentum component
Adam 2 (first and second moments) \(\eta\), \(\beta\), \(\delta\), \(\epsilon\) Combines momentum + adaptation Can converge to suboptimal points

4.8 Visualizing Optimizers

The following plot shows the paths taken by different optimizers on a contour plot of the loss function

\[ l(x,y)=1+0.1x^2+y^2. \]

The algorithms stop after at most 100 iterations, with stopping criterion \(l(x,y)<1.01\) (close to the global minimum value of 1). Most optimizers use a learning rate of 0.3; the momentum panel uses 0.5 because the convex-combination velocity \(v_t = \delta v_{t-1} + (1-\delta)g_t\) already shrinks the effective step by roughly \(1-\delta\). The SGD path is illustrated by adding mini-batch-style noise to the gradient update, since there is no actual dataset behind the contour. Code is foldable via “Show the code”.

Show the code
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Define loss function (less elongated for clearer visualization)
def f(x, y):
    return 1 + 0.1 * x**2 + 1 * y**2  # Less extreme ratio

def df(x, y):
    return 0.2 * x, 2 * y

# Create contour plot
x = np.linspace(-12, 12, 100)
y = np.linspace(-8, 8, 100)
X, Y = np.meshgrid(x, y)
Z = f(X, Y)

def run_batch_gd(x0, y0, lr=0.3, n_steps=100):
    """Pure batch gradient descent - smooth path"""
    path = [(x0, y0)]
    x, y = x0, y0
    for _ in range(n_steps):
        gx, gy = df(x, y)
        x -= lr * gx
        y -= lr * gy
        path.append((x, y))
        if f(x, y) < 1.01:
            break
    return np.array(path)

def run_sgd(x0, y0, lr=0.3, n_steps=100):
    """Realistic SGD - simulate mini-batch variance"""
    path = [(x0, y0)]
    x, y = x0, y0
    
    # Generate synthetic dataset that creates natural variance
    np.random.seed(123)  # Different seed for data generation
    data_variance = 0.3  # Controls how much mini-batches differ
    
    for step in range(n_steps):
        # Each mini-batch has slightly different gradient estimate
        gx_base, gy_base = df(x, y)
        
        # Add realistic mini-batch variance (not just random noise)
        gx_noise = np.random.normal(0, data_variance * abs(gx_base * 1))
        gy_noise = np.random.normal(0, data_variance * abs(gy_base * 1))
        
        gx = gx_base + gx_noise
        gy = gy_base + gy_noise
        
        x -= lr * gx
        y -= lr * gy
        path.append((x, y))
        
        if f(x, y) < 1.01:
            break
    
    return np.array(path)

def run_momentum(x0, y0, lr=0.5, beta=0.9, n_steps=100):
    """Momentum optimizer with convex-combination velocity (matches the text)."""
    path = [(x0, y0)]
    x, y = x0, y0
    vx, vy = 0, 0  # Initialize velocity

    for _ in range(n_steps):
        gx, gy = df(x, y)

        # Velocity as convex combination (matches v_t = beta v_{t-1} + (1 - beta) g_t)
        vx = beta * vx + (1 - beta) * gx
        vy = beta * vy + (1 - beta) * gy

        # Update parameters
        x -= lr * vx
        y -= lr * vy
        path.append((x, y))

        if f(x, y) < 1.01:
            break

    return np.array(path)

def run_adagrad(x0, y0, lr=0.3, eps=1e-8, n_steps=100):
    """AdaGrad optimizer - shows stalling behavior"""
    path = [(x0, y0)]
    x, y = x0, y0
    gx_sum_sq, gy_sum_sq = 0, 0
    
    for step in range(n_steps):
        gx, gy = df(x, y)
        
        # Accumulate squared gradients
        gx_sum_sq += gx**2
        gy_sum_sq += gy**2
        
        # AdaGrad update - learning rate decreases over time
        effective_lr_x = lr / (np.sqrt(gx_sum_sq) + eps)
        effective_lr_y = lr / (np.sqrt(gy_sum_sq) + eps)
        
        x -= effective_lr_x * gx
        y -= effective_lr_y * gy
        path.append((x, y))
        
        # More lenient stopping criterion to see stalling
        if f(x, y) < 1.01 and step > 20:
            break
            
        # Stop if steps become too small (stalling)
        if step > 10 and effective_lr_x < 1e-4 and effective_lr_y < 1e-4:
            break
    
    return np.array(path)

def run_rmsprop(x0, y0, lr=0.3, gamma=0.9, eps=1e-8, n_steps=100):
    """RMSprop optimizer - maintains learning rate"""
    path = [(x0, y0)]
    x, y = x0, y0
    vx, vy = 0, 0
    
    for step in range(n_steps):
        gx, gy = df(x, y)
        
        # Update moving average of squared gradients
        vx = gamma * vx + (1 - gamma) * gx**2
        vy = gamma * vy + (1 - gamma) * gy**2
        
        # RMSprop update - learning rate stays more stable
        effective_lr_x = lr / (np.sqrt(vx) + eps)
        effective_lr_y = lr / (np.sqrt(vy) + eps)
        
        x -= effective_lr_x * gx
        y -= effective_lr_y * gy
        path.append((x, y))
        
        # Same stopping criterion
        if f(x, y) < 1.01 and step > 20:
            break
    
    return np.array(path)

def run_adam(x0, y0, lr=0.3, beta1=0.9, beta2=0.999, eps=1e-8, n_steps=100):
    """Adam optimizer"""
    path = [(x0, y0)]
    x, y = x0, y0
    mx, my = 0, 0
    vx, vy = 0, 0
    
    for t in range(1, n_steps + 1):
        gx, gy = df(x, y)
        
        # Update moments
        mx = beta1 * mx + (1 - beta1) * gx
        my = beta1 * my + (1 - beta1) * gy
        vx = beta2 * vx + (1 - beta2) * gx**2
        vy = beta2 * vy + (1 - beta2) * gy**2
        
        # Bias correction
        mx_hat = mx / (1 - beta1**t)
        my_hat = my / (1 - beta1**t)
        vx_hat = vx / (1 - beta2**t)
        vy_hat = vy / (1 - beta2**t)
        
        # Update parameters
        x -= lr * mx_hat / (np.sqrt(vx_hat) + eps)
        y -= lr * my_hat / (np.sqrt(vy_hat) + eps)
        path.append((x, y))
        
        if f(x, y) < 1.01:
            break
    
    return np.array(path)

# Run all optimizers from the same starting point
start_x, start_y = -10, 5
path_batch = run_batch_gd(start_x, start_y)
path_sgd = run_sgd(start_x, start_y)
path_momentum = run_momentum(start_x, start_y)
path_adagrad = run_adagrad(start_x, start_y)
path_rmsprop = run_rmsprop(start_x, start_y)
path_adam = run_adam(start_x, start_y)

# Plot - using 2x3 grid to accommodate all 6 optimizers
fig, axes = plt.subplots(3, 2, figsize=(10, 10))
axes = axes.flatten()

paths_and_titles = [
    (path_batch, 'Batch Gradient Descent'),
    (path_sgd, 'Stochastic Gradient Descent'),
    (path_momentum, 'Momentum'),
    (path_adagrad, 'AdaGrad'),
    (path_rmsprop, 'RMSprop'),
    (path_adam, 'Adam')
]

for i, (path, title) in enumerate(paths_and_titles):
    ax = axes[i]
    
    # Create contour background
    levels = np.logspace(0, 2, 15)  # Better level spacing
    ax.contour(X, Y, Z, levels=levels, alpha=0.6, colors='gray', linewidths=0.5)
    ax.contourf(X, Y, Z, levels=levels, alpha=0.3, cmap='viridis')
    
    # Plot optimization path
    ax.plot(path[:, 0], path[:, 1], 'r-', linewidth=2, alpha=0.8)
    ax.plot(path[0, 0], path[0, 1], 'go', markersize=10, label='Start', zorder=5)
    ax.plot(path[-1, 0], path[-1, 1], 'ro', markersize=8, label='End', zorder=5)
    ax.plot(0, 0, 'k*', markersize=15, label='Optimum', zorder=5)
    
    # Add arrows to show direction (fewer arrows for clarity)
    if len(path) > 5:
        arrow_indices = np.linspace(0, len(path)-2, min(6, len(path)-1), dtype=int)
        for j in arrow_indices:
            dx = path[j+1, 0] - path[j, 0]
            dy = path[j+1, 1] - path[j, 1]
            # Only draw arrow if movement is significant
            if np.sqrt(dx**2 + dy**2) > 0.1:
                ax.arrow(path[j, 0], path[j, 1], dx, dy, 
                        head_width=0.3, head_length=0.2, fc='red', ec='red', alpha=0.7)
    
    ax.set_aspect('equal', adjustable='box')

    ax.set_title(f'{title}\n({len(path)-1} steps)')
    ax.set_xlabel('θ₁')
    ax.set_ylabel('θ₂')
    ax.legend(loc='upper right')
    ax.grid(True, alpha=0.3)
    ax.set_xlim(-12, 3)
    ax.set_ylim(-2, 6)

plt.tight_layout()
plt.show()
Figure 4.1: Comparison of different optimization algorithms on a 2D loss landscape.
Note on Algorithm Performance

While Adam might not appear dramatically superior to other algorithms in this visualization, keep in mind that we’re using a very well-behaved quadratic function for demonstration purposes. In practice, neural network loss landscapes are:

  • Highly non-convex with many local minima and saddle points
  • High-dimensional (often millions of parameters)
  • Ill-conditioned with vastly different scales across parameters
  • Noisy due to mini-batch sampling

In such challenging real-world scenarios, Adam’s adaptive learning rates and momentum typically provide significant advantages over simpler methods. This simple 2D example serves to illustrate the core behavioral differences between optimizers rather than their relative performance on complex problems.

We can make the minimization problem more challenging by having multiple local minima and one global minimum. Of course, depending on the hyperparameters chosen, all algorithms may find the global minimum. The goal is to find an optimizer that is largely robust to different choices of hyperparameters.

Show the code
import numpy as np
import matplotlib.pyplot as plt

# Set random seed for reproducibility
np.random.seed(42)

# Define a more challenging non-convex function with multiple local minima
def f_nonconvex(x, y):
    return 2/3 * ((x**2 + y - 11)**2 + (x + y**2 - 7)**2 + (x - 4)**2 * (y + 2)**2)

def df_nonconvex(x, y):
    # Partial derivative with respect to x
    dx = (2/3) * (4*x*(x**2 + y - 11) + 2*(x + y**2 - 7) + 2*(x - 4)*(y + 2)**2)
    
    # Partial derivative with respect to y
    dy = (2/3) * (2*(x**2 + y - 11) + 4*y*(x + y**2 - 7) + 2*(x - 4)**2*(y + 2))
    
    return dx, dy

# Create contour plot for non-convex function
x_nc = np.linspace(-6, 6, 600)
y_nc = np.linspace(-6, 6, 600)
X_nc, Y_nc = np.meshgrid(x_nc, y_nc)
Z_nc = f_nonconvex(X_nc, Y_nc)

def run_optimizer_nonconvex(optimizer_func, x0, y0, **kwargs):
    """Generic wrapper for running optimizers on non-convex function"""
    path = [(x0, y0)]
    x, y = x0, y0
    
    if 'momentum' in optimizer_func.__name__:
        vx, vy = 0, 0
        lr = kwargs.get('lr', 0.3)
        beta = kwargs.get('beta', 0.9)
        n_steps = kwargs.get('n_steps', 200)
        
        for _ in range(n_steps):
            gx, gy = df_nonconvex(x, y)
            vx = beta * vx + (1 - beta) * gx
            vy = beta * vy + (1 - beta) * gy
            x -= lr * vx
            y -= lr * vy
            path.append((x, y))
            if f_nonconvex(x, y) < 1 or np.isnan(x) or np.isnan(y):
                break
                
    elif 'adagrad' in optimizer_func.__name__:
        gx_sum_sq, gy_sum_sq = 0, 0
        lr = kwargs.get('lr', 0.3)
        eps = kwargs.get('eps', 1e-8)
        n_steps = kwargs.get('n_steps', 200)
        
        for step in range(n_steps):
            gx, gy = df_nonconvex(x, y)
            gx_sum_sq += gx**2
            gy_sum_sq += gy**2
            
            effective_lr_x = lr / (np.sqrt(gx_sum_sq) + eps)
            effective_lr_y = lr / (np.sqrt(gy_sum_sq) + eps)
            
            x -= effective_lr_x * gx
            y -= effective_lr_y * gy
            path.append((x, y))
            
            if f_nonconvex(x, y) < 1 or np.isnan(x) or np.isnan(y):
                break
            if step > 10 and effective_lr_x < 1e-6 and effective_lr_y < 1e-6:
                break
                
    elif 'rmsprop' in optimizer_func.__name__:
        vx, vy = 0, 0
        lr = kwargs.get('lr', 0.3)
        gamma = kwargs.get('gamma', 0.9)
        eps = kwargs.get('eps', 1e-8)
        n_steps = kwargs.get('n_steps', 200)
        
        for _ in range(n_steps):
            gx, gy = df_nonconvex(x, y)
            vx = gamma * vx + (1 - gamma) * gx**2
            vy = gamma * vy + (1 - gamma) * gy**2
            
            x -= lr * gx / (np.sqrt(vx) + eps)
            y -= lr * gy / (np.sqrt(vy) + eps)
            path.append((x, y))
            
            if f_nonconvex(x, y) < 1 or np.isnan(x) or np.isnan(y):
                break
                
    elif 'adam' in optimizer_func.__name__:
        mx, my = 0, 0
        vx, vy = 0, 0
        lr = kwargs.get('lr', 0.3)
        beta1 = kwargs.get('beta1', 0.9)
        beta2 = kwargs.get('beta2', 0.99)
        eps = kwargs.get('eps', 1e-8)
        n_steps = kwargs.get('n_steps', 200)
        
        for t in range(1, n_steps + 1):
            gx, gy = df_nonconvex(x, y)
            
            mx = beta1 * mx + (1 - beta1) * gx
            my = beta1 * my + (1 - beta1) * gy
            vx = beta2 * vx + (1 - beta2) * gx**2
            vy = beta2 * vy + (1 - beta2) * gy**2
            
            mx_hat = mx / (1 - beta1**t)
            my_hat = my / (1 - beta1**t)
            vx_hat = vx / (1 - beta2**t)
            vy_hat = vy / (1 - beta2**t)
            
            x -= lr * mx_hat / (np.sqrt(vx_hat) + eps)
            y -= lr * my_hat / (np.sqrt(vy_hat) + eps)
            path.append((x, y))
            
            if f_nonconvex(x, y) < 1 or np.isnan(x) or np.isnan(y):
                break
                
    elif 'sgd' in optimizer_func.__name__:
        lr = kwargs.get('lr', 0.3)
        n_steps = kwargs.get('n_steps', 200)
        np.random.seed(456)  # Different seed for SGD noise
        data_variance = 0.2
        
        for _ in range(n_steps):
            gx_base, gy_base = df_nonconvex(x, y)
            
            gx_noise = np.random.normal(0, data_variance * abs(gx_base * 5.3))
            gy_noise = np.random.normal(0, data_variance * abs(gy_base * 5.3))
            
            gx = gx_base + gx_noise
            gy = gy_base + gy_noise
            
            x -= lr * gx
            y -= lr * gy
            path.append((x, y))
            
            if f_nonconvex(x, y) < 1 or np.isnan(x) or np.isnan(y):
                break
                
    else:  # Batch GD
        lr = kwargs.get('lr', 0.3)
        n_steps = kwargs.get('n_steps', 200)
        
        for _ in range(n_steps):
            gx, gy = df_nonconvex(x, y)
            x -= lr * gx
            y -= lr * gy
            path.append((x, y))
            
            if f_nonconvex(x, y) < 1 or np.isnan(x) or np.isnan(y):
                break
    
    return np.array(path)

# Define dummy functions for the wrapper
def run_batch_gd_nc(): pass
def run_sgd_nc(): pass
def run_momentum_nc(): pass
def run_adagrad_nc(): pass
def run_rmsprop_nc(): pass
def run_adam_nc(): pass

# Run all optimizers from the same challenging starting point
start_x_nc, start_y_nc = -4, 3  # Start far from any minimum

path_batch_nc = run_optimizer_nonconvex(run_batch_gd_nc, start_x_nc, start_y_nc, lr=0.01)
path_sgd_nc = run_optimizer_nonconvex(run_sgd_nc, start_x_nc, start_y_nc, lr=0.01)
path_momentum_nc = run_optimizer_nonconvex(run_momentum_nc, start_x_nc, start_y_nc, lr=0.001, beta=0.9)
path_adagrad_nc = run_optimizer_nonconvex(run_adagrad_nc, start_x_nc, start_y_nc, lr=0.3)
path_rmsprop_nc = run_optimizer_nonconvex(run_rmsprop_nc, start_x_nc, start_y_nc, lr=0.3)
path_adam_nc = run_optimizer_nonconvex(run_adam_nc, start_x_nc, start_y_nc, lr=0.3)

# Plot - using 2x3 grid for all 6 optimizers
fig, axes = plt.subplots(3, 2, figsize=(10, 10))
axes = axes.flatten()

paths_and_titles_nc = [
    (path_batch_nc, 'Batch Gradient Descent'),
    (path_sgd_nc, 'Stochastic Gradient Descent'),
    (path_momentum_nc, 'Momentum'),
    (path_adagrad_nc, 'AdaGrad'),
    (path_rmsprop_nc, 'RMSprop'),
    (path_adam_nc, 'Adam')
]

# Known global minima of Himmelblau's function
global_minima = [(3.584428, -1.848126)]

for i, (path, title) in enumerate(paths_and_titles_nc):
    ax = axes[i]
    
    # Create contour background - use log scale for better visualization
    levels = np.logspace(0, 3.4, 20)
    ax.contour(X_nc, Y_nc, Z_nc, levels=levels, alpha=0.6, colors='gray', linewidths=0.5)
    ax.contourf(X_nc, Y_nc, Z_nc, levels=levels, alpha=0.3, cmap='viridis')
    
    # Plot the known global minimum once per panel
    gm_x, gm_y = global_minima[0]
    ax.plot(gm_x, gm_y, 'k*', markersize=12, alpha=0.8, label='Global minimum')
    
    # Plot optimization path
    if len(path) > 1:
        ax.plot(path[:, 0], path[:, 1], 'r-', linewidth=2, alpha=0.8)
        ax.plot(path[0, 0], path[0, 1], 'go', markersize=10, label='Start', zorder=5)
        ax.plot(path[-1, 0], path[-1, 1], 'ro', markersize=8, label='End', zorder=5)
        
        # Add arrows for direction (fewer for clarity)
        if len(path) > 5:
            arrow_indices = np.linspace(0, len(path)-2, min(5, len(path)-1), dtype=int)
            for j in arrow_indices:
                dx = path[j+1, 0] - path[j, 0]
                dy = path[j+1, 1] - path[j, 1]
                if np.sqrt(dx**2 + dy**2) > 0.05:
                    ax.arrow(path[j, 0], path[j, 1], dx, dy, 
                            head_width=0.1, head_length=0.08, fc='red', ec='red', alpha=0.7)
    
    ax.set_aspect('equal', adjustable='box')

    ax.set_title(f'{title}\n({len(path)-1} steps)')
    ax.set_xlabel('θ₁')
    ax.set_ylabel('θ₂')
    ax.legend(loc='upper right')
    ax.grid(True, alpha=0.3)
    ax.set_xlim(-6.1, 6.1)
    ax.set_ylim(-6.1, 6.1)

plt.tight_layout()
plt.show()
Figure 4.2: Comparison of optimization algorithms on a non-convex loss landscape with multiple local minima.

4.9 Summary

Key Takeaways
  1. Gradient descent is the basic optimization idea underlying modern neural-network training: update parameters in the direction of steepest descent.
  2. Mini-batch gradient descent is the practical standard because it balances computational efficiency with relatively stable gradient information.
  3. Learning-rate choice is central: too small leads to painfully slow progress, while too large leads to instability.
  4. Advanced optimizers such as Momentum, AdaGrad, RMSprop, and Adam modify vanilla gradient descent to improve stability, speed, or parameter-specific adaptation.
  5. Adam is often a strong default in practice, but optimizer choice matters more when the loss surface is non-convex, noisy, and high-dimensional.
Common Pitfalls

Learning Rate Issues

  • Too high: Training becomes unstable, loss explodes
  • Too low: Training is painfully slow, may get stuck in poor local minima
  • Solution: Use adaptive optimizers like Adam, or learning rate schedules

Overfitting to Simple Examples

  • Don’t judge optimizers solely on toy 2D functions
  • Real neural networks have millions of parameters and complex loss landscapes
  • What works on simple problems may not scale to “deep learning”

4.10 Exercises

Exercise 4.1: Learning Rates in an Ill-Conditioned Quadratic Problem

Consider the objective

\[ L(\theta_1,\theta_2)=\frac{1}{2}\left(a\theta_1^2+b\theta_2^2\right), \qquad a>b>0. \]

Suppose we apply gradient descent with constant learning rate \(\eta>0\):

\[ \boldsymbol{\theta}^{(t+1)}=\boldsymbol{\theta}^{(t)}-\eta \nabla L(\boldsymbol{\theta}^{(t)}). \]

  1. Derive the gradient and show that the updates satisfy \[ \theta_1^{(t+1)}=(1-\eta a)\theta_1^{(t)}, \qquad \theta_2^{(t+1)}=(1-\eta b)\theta_2^{(t)}. \]
  2. Show that convergence to \((0,0)\) requires \[ 0<\eta<\frac{2}{a}. \]
  3. Explain why, when \(a\gg b\), a single global learning rate creates a tension between fast progress in the \(\theta_2\) direction and stability in the \(\theta_1\) direction. Relate this to the motivation for adaptive or preconditioned optimization methods.

Exam level. The exercise formalizes why optimization becomes difficult in ill-conditioned problems and why a single learning rate can be inadequate.

Differentiate the quadratic objective coordinate by coordinate.

A scalar recursion \(x_{t+1}=cx_t\) converges to zero if and only if \(|c|<1\).

Compare the learning rate that would be natural for the flat direction, \(1/b\), with the stability restriction coming from the steep direction.

Part 1: Coordinate-Wise Gradient Descent Updates

The gradient is

\[ \nabla L(\theta_1,\theta_2)= \begin{pmatrix} a\theta_1\\ b\theta_2 \end{pmatrix}. \]

Hence gradient descent gives

\[ \theta_1^{(t+1)}=\theta_1^{(t)}-\eta a\theta_1^{(t)} =(1-\eta a)\theta_1^{(t)}, \]

and

\[ \theta_2^{(t+1)}=\theta_2^{(t)}-\eta b\theta_2^{(t)} =(1-\eta b)\theta_2^{(t)}. \]

Part 2: Deriving the Convergence Restriction

The iterates converge to zero if and only if both scalar recursions contract:

\[ |1-\eta a|<1 \qquad\text{and}\qquad |1-\eta b|<1. \]

These are equivalent to

\[ 0<\eta<\frac{2}{a} \qquad\text{and}\qquad 0<\eta<\frac{2}{b}. \]

Since \(a>b\), we have \(\frac{2}{a}<\frac{2}{b}\), so the binding restriction is

\[ 0<\eta<\frac{2}{a}. \]

Part 3: Why Ill-Conditioning Slows Optimization

If \(a\gg b\), then the objective is much steeper in the \(\theta_1\) direction than in the \(\theta_2\) direction. Stability forces

\[ \eta<\frac{2}{a}, \]

which can be very small. But then the update factor in the flat direction is

\[ 1-\eta b, \]

which stays close to 1, so convergence along \(\theta_2\) is very slow.

Thus a single global learning rate cannot be simultaneously well-tuned for both directions: a larger \(\eta\) would help progress in the flat direction but would destabilize the steep direction. This is one of the main motivations for adaptive or preconditioned methods, which scale updates differently across coordinates.

Exercise 4.2: Mini-Batch Gradients as Noisy Gradient Estimators

Let the full-sample objective be

\[ L(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^N \ell_i(\boldsymbol{\theta}), \]

and define the corresponding full-sample gradient

\[ g(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^N \nabla_{\boldsymbol{\theta}} \ell_i(\boldsymbol{\theta}). \]

Now draw a mini-batch \(B\) of size \(m\) uniformly at random from \(\{1,\dots,N\}\) without replacement, and define the mini-batch gradient estimator

\[ \hat g_B(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i\in B}\nabla_{\boldsymbol{\theta}} \ell_i(\boldsymbol{\theta}). \]

  1. Show that \[ \mathbb{E}\left[\hat g_B(\boldsymbol{\theta})\right] = g(\boldsymbol{\theta}), \] so the mini-batch gradient is an unbiased estimator of the full gradient.
  2. Consider the scalar case where \(\nabla_{\boldsymbol{\theta}} \ell_i(\boldsymbol{\theta})\) is replaced by \(z_i \in \mathbb{R}\). Show that \[ \mathrm{Var}\left(\frac{1}{m}\sum_{i\in B} z_i\right) = \frac{N-m}{m(N-1)} \cdot \frac{1}{N}\sum_{i=1}^N (z_i-\bar z)^2, \] where \(\bar z = \frac{1}{N}\sum_{i=1}^N z_i\).
  3. Use Part 2 to explain why smaller mini-batches produce noisier updates. Then explain one reason why such noise may nevertheless be useful in optimization.

Exam level. The exercise formalizes why mini-batch updates are noisy but still useful approximations to the full gradient.

Write the mini-batch estimator as an average of terms multiplied by inclusion indicators.

Use the standard variance formula for sampling without replacement, or derive it from inclusion probabilities.

Inspect the factor

\[ \frac{N-m}{m(N-1)}. \]

Part 1: Unbiasedness of the Mini-Batch Gradient

Let

\[ \mathbb{1}\{i \in B\} \]

denote the inclusion indicator for observation \(i\). Then

\[ \hat g_B(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i=1}^N \mathbb{1}\{i \in B\}\nabla_{\boldsymbol{\theta}}\ell_i(\boldsymbol{\theta}). \]

Taking expectations and using

\[ \mathbb{E}[\mathbb{1}\{i \in B\}] = \frac{m}{N}, \]

\[ \mathbb{E}[\hat g_B(\boldsymbol{\theta})] = \frac{1}{m}\sum_{i=1}^N \mathbb{E}[\mathbb{1}\{i \in B\}] \nabla_{\boldsymbol{\theta}}\ell_i(\boldsymbol{\theta}) = \frac{1}{m}\sum_{i=1}^N \frac{m}{N}\nabla_{\boldsymbol{\theta}}\ell_i(\boldsymbol{\theta}) = \frac{1}{N}\sum_{i=1}^N \nabla_{\boldsymbol{\theta}}\ell_i(\boldsymbol{\theta}) = g(\boldsymbol{\theta}). \]

So the mini-batch gradient is an unbiased estimator of the full gradient.

Part 2: Variance Under Sampling Without Replacement

For scalar values \(z_1,\dots,z_N\), the sample mean over a simple random sample without replacement satisfies

\[ \mathrm{Var}\left(\frac{1}{m}\sum_{i\in B} z_i\right) = \frac{N-m}{m(N-1)}\,S_N^2, \]

where

\[ S_N^2 = \frac{1}{N}\sum_{i=1}^N (z_i-\bar z)^2. \]

Hence

\[ \mathrm{Var}\left(\frac{1}{m}\sum_{i\in B} z_i\right) = \frac{N-m}{m(N-1)} \cdot \frac{1}{N}\sum_{i=1}^N (z_i-\bar z)^2. \]

This is exactly the stated formula.

Part 3: Why Gradient Noise Can Help

The variance factor

\[ \frac{N-m}{m(N-1)} \]

falls as \(m\) increases. Therefore smaller mini-batches produce noisier gradient estimates and hence noisier parameter updates.

That noise can nevertheless be useful because it can prevent the algorithm from following a rigid deterministic path. In non-convex problems, moderate noise may help the optimizer move away from poor local geometries such as narrow basins or saddle regions.

Note

Before moving on to neural networks, note the chapter boundary: so far the objects being optimized were generic objective functions. In the next chapter, the same optimization ideas are applied to feed-forward networks, where gradients are computed by backpropagation.

Exercise 4.3: Momentum on a Quadratic Objective

Consider the one-dimensional quadratic objective

\[ L(\theta)=\frac{a}{2}\theta^2, \qquad a>0, \]

and the momentum updates

\[ v_t=\delta v_{t-1}+(1-\delta)a\theta_t, \qquad \theta_{t+1}=\theta_t-\eta v_t, \]

where \(\eta>0\) is the learning rate and \(\delta\in[0,1)\) is the momentum parameter.

  1. Eliminate \(v_t\) and show that \(\theta_t\) satisfies the second-order recursion \[ \theta_{t+1} = \big(1+\delta-\eta(1-\delta)a\big)\theta_t -\delta \theta_{t-1}. \]
  2. Show that when \(\delta=0\) this reduces to the gradient descent recursion from Exercise 4.1. For the general case, use the following Schur-Cohn stability criterion (see, e.g., Hamilton 1994): a second-order linear recursion \(x_{t+1}=cx_t+dx_{t-1}\) with characteristic polynomial \(p(\lambda)=\lambda^2-c\lambda-d\) converges to zero if and only if \(p(1)>0\), \(p(-1)>0\), and \(|d|<1\). Apply this to show that, for general \(\delta\in[0,1)\), the recursion converges to zero if and only if \[ \eta<\frac{2(1+\delta)}{(1-\delta)\,a}. \]
  3. Compare this stability bound with the vanilla gradient descent bound \(\eta<2/a\). Explain why momentum widens the range of stable learning rates and why, despite this, momentum can still produce oscillatory transient behavior.

Exam level. The exercise turns momentum into a concrete dynamic system, derives its stability region, and compares it with vanilla gradient descent.

Use the update equation at times \(t\) and \(t-1\) to write

\[ v_{t-1}=\frac{\theta_{t-1}-\theta_t}{\eta}. \]

When \(\delta=0\) the lagged term vanishes. For the general case, identify \(c\) and \(d\) in the given stability criterion by matching the recursion to the form \(x_{t+1}=cx_t+dx_{t-1}\). Then evaluate \(p(1)\), \(p(-1)\), and \(|d|\) separately.

Compare the bound at \(\delta=0\) and \(\delta>0\). For oscillation, consider when the discriminant of the characteristic polynomial is negative, so the roots are complex.

Part 1: Deriving the Second-Order Recursion

From the parameter update,

\[ \theta_t=\theta_{t-1}-\eta v_{t-1}, \]

so

\[ v_{t-1}=\frac{\theta_{t-1}-\theta_t}{\eta}. \]

Now substitute this into the momentum update:

\[ v_t=\delta \frac{\theta_{t-1}-\theta_t}{\eta}+(1-\delta)a\theta_t. \]

Using \(\theta_{t+1}=\theta_t-\eta v_t\), we obtain \[\begin{align*} \theta_{t+1} &= \theta_t-\eta\left[\delta \frac{\theta_{t-1}-\theta_t}{\eta}+(1-\delta)a\theta_t\right] \\ &= \theta_t-\delta(\theta_{t-1}-\theta_t)-\eta(1-\delta)a\theta_t \\ &= \big(1+\delta-\eta(1-\delta)a\big)\theta_t-\delta\theta_{t-1}. \end{align*}\]

Part 2: Stability Analysis

When \(\delta=0\), the recursion becomes \(\theta_{t+1}=(1-\eta a)\theta_t\), which is the gradient descent recursion from Exercise 4.1. This converges iff \(|1-\eta a|<1\), i.e., \(0<\eta<2/a\).

For general \(\delta\in[0,1)\), write \(c=1+\delta-\eta(1-\delta)a\). The recursion has characteristic polynomial

\[ p(\lambda)=\lambda^2-c\lambda+\delta. \]

By the stability criterion given in the exercise, convergence requires \(p(1)>0\), \(p(-1)>0\), and \(|\delta|<1\).

  1. \(p(1)=1-c+\delta=\eta(1-\delta)a>0\), which always holds.

  2. \(p(-1)=1+c+\delta=2(1+\delta)-\eta(1-\delta)a>0\), giving

    \[ \eta<\frac{2(1+\delta)}{(1-\delta)\,a}. \]

  3. \(|\delta|<1\) holds by assumption.

Therefore convergence requires \(\eta<\frac{2(1+\delta)}{(1-\delta)\,a}\).

Part 3: Wider Stability but Possible Oscillation

At \(\delta=0\) the bound is \(\eta<2/a\). For \(\delta>0\) the bound becomes \(2(1+\delta)/((1-\delta)a)>2/a\), so momentum widens the range of stable learning rates.

However, the characteristic roots are complex when the discriminant \(c^2-4\delta<0\). Complex roots mean the iterates oscillate even while converging: the algorithm overshoots and corrects, cycling around the optimum before settling down. This explains why momentum can accelerate convergence in directions of persistent gradient but produces oscillatory transient behavior when the effective step size is aggressive.

4.11 References

Dauphin, Yann N., Razvan Pascanu, Caglar Gulcehre, Kyunghyun Cho, Surya Ganguli, and Yoshua Bengio. 2014. “Identifying and Attacking the Saddle Point Problem in High-Dimensional Non-Convex Optimization.” In Advances in Neural Information Processing Systems. Vol. 27. Curran Associates.
Duchi, John, Elad Hazan, and Yoram Singer. 2011. “Adaptive Subgradient Methods for Online Learning and Stochastic Optimization.” Journal of Machine Learning Research 12: 2121–59.
Hamilton, James D. 1994. Time Series Analysis. Princeton, NJ: Princeton University Press.
Kingma, Diederik P., and Jimmy Ba. 2015. “Adam: A Method for Stochastic Optimization.” In International Conference on Learning Representations. https://arxiv.org/abs/1412.6980.
Loshchilov, Ilya, and Frank Hutter. 2019. Decoupled Weight Decay Regularization.” In International Conference on Learning Representations. https://openreview.net/forum?id=Bkg6RiCqY7.
Nesterov, Yurii. 1983. “A Method of Solving a Convex Programming Problem with Convergence Rate \(O(1/k^2)\).” Soviet Mathematics Doklady 27 (2): 372–76.
Polyak, Boris T. 1964. “Some Methods of Speeding up the Convergence of Iteration Methods.” USSR Computational Mathematics and Mathematical Physics 4 (5): 1–17. https://doi.org/10.1016/0041-5553(64)90137-5.
Robbins, Herbert, and Sutton Monro. 1951. “A Stochastic Approximation Method.” Annals of Mathematical Statistics 22 (3): 400–407. https://doi.org/10.1214/aoms/1177729586.