9 Autoencoders
9.1 Simple Autoencoders
An autoencoder is a neural network trained to reconstruct its input after passing it through a lower-dimensional representation. If the encoder maps \(\mathbf{x}\in\mathbb{R}^d\) to a latent vector \(\mathbf{h}\in\mathbb{R}^k\) and the decoder maps \(\mathbf{h}\) back to \(\hat{\mathbf{x}}\in\mathbb{R}^d\), the usual objective is a reconstruction loss such as \(\|\mathbf{x}-\hat{\mathbf{x}}\|^2\).
The econometric interpretation is close to dimension reduction: a simple undercomplete autoencoder learns a low-dimensional representation that preserves as much relevant variation as possible for reconstructing the original variables. With linear activations and squared loss, this connects directly to PCA, which is the focus of the first exercise below.
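To fix notation before the probabilistic extension, here is a minimal numpy sketch of the encode/decode/loss pipeline for a linear undercomplete autoencoder. The dimensions and the random, untrained weights are illustrative only:

```python
import numpy as np

rng = np.random.default_rng(0)
d, k, n = 5, 2, 1000               # data dim, latent dim, sample size (all illustrative)

X = rng.normal(size=(n, d))        # toy centered data, one observation per row
W_e = rng.normal(size=(k, d))      # encoder weights (random, untrained)
W_d = rng.normal(size=(d, k))      # decoder weights

H = X @ W_e.T                      # latent codes h = W_e x, shape (n, k)
X_hat = H @ W_d.T                  # reconstructions x_hat = W_d h, shape (n, d)

loss = np.mean(np.sum((X - X_hat) ** 2, axis=1))   # squared reconstruction loss
```

Training would adjust \(W_e\) and \(W_d\) to drive this loss down; the exercises below characterize what the optimum looks like in the linear case.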
9.2 Variational Autoencoders
Variational autoencoders add an explicit probabilistic model for the latent representation. Instead of mapping each observation to a deterministic latent vector, the encoder produces an approximate posterior distribution \(q_\phi(\mathbf{z}\mid \mathbf{x})\), while the decoder specifies a likelihood \(p_\theta(\mathbf{x}\mid \mathbf{z})\).
This makes VAEs useful for connecting representation learning to likelihood, KL divergence, and regularization. The central training objective is the evidence lower bound (ELBO), which balances reconstruction quality against the KL divergence between the approximate posterior and a prior distribution over latent variables.
9.3 Exercises
Theoretical Question on Autoencoders: Consider a standard autoencoder with encoder \(f_\theta: \mathbb{R}^d \to \mathbb{R}^k\) and decoder \(g_\phi: \mathbb{R}^k \to \mathbb{R}^d\), where \(k < d\) (undercomplete autoencoder).
- Derive the relationship between the autoencoder’s reconstruction loss and Principal Component Analysis (PCA) when using linear activations and squared loss.
- Show that for a linear autoencoder with tied weights (\(W_d = W_e^T\)), the optimal solution spans the same subspace as the first \(k\) principal components.
- Analyze what happens when we add a sparsity constraint \(\lambda \sum_j |h_j|\) to the loss function, where \(\mathbf{h} = f_\theta(\mathbf{x})\) is the latent representation.
Assume the data is centered: \(\mathbb{E}[\mathbf{x}] = \mathbf{0}\).
- Start by writing the autoencoder reconstruction as \(\hat{\mathbf{x}} = W_d W_e \mathbf{x}\) for linear layers
- Express the reconstruction loss as \(\mathbb{E}[\|\mathbf{x} - \hat{\mathbf{x}}\|^2]\) and expand the quadratic term
- Use the trace operator and properties: \(\mathbf{a}^T \mathbf{b} = \text{tr}(\mathbf{a}^T \mathbf{b}) = \text{tr}(\mathbf{b} \mathbf{a}^T)\)
- Take derivatives with respect to \(W_d\) and \(W_e\) and set them to zero
- Remember that PCA finds the best \(k\)-dimensional subspace minimizing reconstruction error
- With tied weights, \(W_d = W_e^T\), so the reconstruction becomes \(\hat{\mathbf{x}} = W_e^T W_e \mathbf{x}\)
- The optimal \(W_e\) should have orthonormal rows: \(W_e W_e^T = I_k\)
- Think about the relationship between the data covariance matrix \(\Sigma\) and its eigendecomposition
- The solution involves the principal components (eigenvectors) of the covariance matrix
- The \(L_1\) penalty \(\lambda \sum_j |h_j|\) encourages sparsity (many zeros) in the latent representation
- Consider what happens when \(k > d\) (overcomplete case) without sparsity
- Think about the difference between sparse coding and autoencoders
- Consider optimization challenges: \(L_1\) norm is not differentiable at zero
Part 1: Relationship to PCA
For a linear autoencoder with encoder \(f_\theta(\mathbf{x}) = W_e \mathbf{x}\) and decoder \(g_\phi(\mathbf{h}) = W_d \mathbf{h}\), the reconstruction is:
\[\hat{\mathbf{x}} = g_\phi(f_\theta(\mathbf{x})) = W_d W_e \mathbf{x}\]
The reconstruction loss is:
\[L(\theta, \phi) = \mathbb{E}\left[\|\mathbf{x} - W_d W_e \mathbf{x}\|^2\right]\]
Expanding this:
\[\begin{align} L(\theta, \phi) &= \mathbb{E}\left[\mathbf{x}^T \mathbf{x} - 2\mathbf{x}^T W_d W_e \mathbf{x} + \mathbf{x}^T W_e^T W_d^T W_d W_e \mathbf{x}\right] \\ &= \text{tr}(\Sigma) - 2\text{tr}(W_d W_e \Sigma) + \text{tr}(W_e^T W_d^T W_d W_e \Sigma) \end{align}\]
where \(\Sigma = \mathbb{E}[\mathbf{x}\mathbf{x}^T]\) is the data covariance matrix.
Taking derivatives and setting to zero:
\[\frac{\partial L}{\partial W_d} = -2 \Sigma W_e^T + 2 W_d W_e \Sigma W_e^T = 0\]
\[\frac{\partial L}{\partial W_e} = -2 W_d^T \Sigma + 2 W_d^T W_d W_e \Sigma = 0\]
From the first equation, when \(W_e \Sigma W_e^T\) is invertible, \(W_d = \Sigma W_e^T (W_e \Sigma W_e^T)^{-1}\). Substituting back, \(W_d W_e\) is a rank-\(k\) projection of \(\mathbb{R}^d\) onto a \(k\)-dimensional subspace, and the loss depends only on which subspace that projection selects.
Connection to PCA: The optimal reconstruction minimizes the same objective as PCA: finding the best \(k\)-dimensional linear subspace for approximating the data. At the optimum, the columns of \(W_d\) span the same subspace as the first \(k\) principal components.
Part 2: Tied Weights Case
With tied weights, \(W_d = W_e^T\), so the reconstruction becomes:
\[\hat{\mathbf{x}} = W_e^T W_e \mathbf{x}\]
The loss function is:
\[L(W_e) = \mathbb{E}\left[\|\mathbf{x} - W_e^T W_e \mathbf{x}\|^2\right] = \text{tr}(\Sigma) - 2\text{tr}(W_e^T W_e \Sigma) + \text{tr}(W_e^T W_e \Sigma W_e^T W_e)\]
Taking the derivative:
\[\frac{\partial L}{\partial W_e} = -4 W_e \Sigma + 2 W_e \Sigma W_e^T W_e + 2 W_e W_e^T W_e \Sigma = 0\]
If \(W_e\) has orthonormal rows (i.e., \(W_e W_e^T = I_k\)), the third term reduces to \(2 W_e \Sigma\) and the stationarity condition becomes:
\[W_e \Sigma = W_e \Sigma W_e^T W_e\]
This says the row space of \(W_e\) is invariant under \(\Sigma\), so it must be spanned by \(k\) eigenvectors of \(\Sigma\); among these stationary points, the loss is smallest when the eigenvectors with the largest eigenvalues are chosen.
Optimal solution: The rows of \(W_e\) should be the first \(k\) principal components of \(\Sigma\). Specifically, if \(\Sigma = V \Lambda V^T\) is the eigendecomposition, then:
\[W_e = V_k^T\]
where \(V_k\) contains the first \(k\) eigenvectors corresponding to the largest eigenvalues.
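This closed form can be checked numerically. The numpy sketch below (toy data; dimensions are arbitrary) builds \(W_e\) from the top \(k\) eigenvectors of the sample covariance and verifies that the tied-weight reconstruction error equals the sum of the discarded eigenvalues, which is exactly the PCA optimum:

```python
import numpy as np

rng = np.random.default_rng(1)
d, k, n = 6, 2, 10_000
X = rng.normal(size=(n, d)) @ rng.normal(size=(d, d))  # toy correlated data
X -= X.mean(axis=0)                       # center, as the exercise assumes

Sigma = X.T @ X / n                       # sample covariance
evals, V = np.linalg.eigh(Sigma)          # eigh returns ascending eigenvalues
evals, V = evals[::-1], V[:, ::-1]        # reorder to descending

W_e = V[:, :k].T                          # rows of W_e = top-k principal components
X_hat = X @ W_e.T @ W_e                   # tied-weight reconstruction W_e^T W_e x
err = np.mean(np.sum((X - X_hat) ** 2, axis=1))
# err equals the variance discarded by PCA: the sum of the d - k smallest eigenvalues
```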
Part 3: Sparsity Constraint
With the sparsity penalty, the loss becomes:
\[L_{sparse} = \mathbb{E}\left[\|\mathbf{x} - g_\phi(f_\theta(\mathbf{x}))\|^2\right] + \lambda \mathbb{E}\left[\sum_j |h_j|\right]\]
where \(\mathbf{h} = f_\theta(\mathbf{x})\) and \(h_j\) is the \(j\)-th component of \(\mathbf{h}\).
Effect of sparsity constraint:
Sparse representations: The \(L_1\) penalty encourages most components of \(\mathbf{h}\) to be zero, leading to sparse latent representations.
Feature selection: The autoencoder learns to use only a subset of the \(k\) latent dimensions, effectively performing feature selection in the latent space.
Overcomplete case: When \(k > d\), an unregularized autoencoder can learn an identity-like mapping that simply copies the input; sparsity rules out this trivial solution and forces the autoencoder to learn meaningful features.
Optimization: The \(L_1\) penalty makes the objective non-smooth. Common approaches include:
- Soft thresholding: Apply the shrinkage operator \(\text{sign}(h_j) \max(0, |h_j| - \tau)\), where the threshold \(\tau\) scales with \(\lambda\) and the gradient step size
- Proximal gradient methods: Alternate between gradient steps on the smooth part and proximal operators on the \(L_1\) term
Comparison to sparse coding: The sparse autoencoder approximates the sparse coding objective:
\[\min_{\mathbf{h}} \|\mathbf{x} - W_d \mathbf{h}\|^2 + \lambda \|\mathbf{h}\|_1\]
but learns the dictionary \(W_d\) jointly with the sparse representations.
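The inference step of the sparse coding objective above can be sketched with proximal gradient descent (ISTA), which alternates a gradient step on the smooth reconstruction term with soft thresholding on the \(L_1\) term. A minimal numpy sketch, with illustrative helper names and a fixed dictionary:

```python
import numpy as np

def soft_threshold(v, t):
    """Proximal operator of t * ||.||_1: shrink v toward zero, zeroing |v_j| <= t."""
    return np.sign(v) * np.maximum(np.abs(v) - t, 0.0)

def ista(x, W_d, lam, n_iter=200):
    """Sparse code for x under min_h ||x - W_d h||^2 + lam * ||h||_1 via ISTA."""
    step = 1.0 / (2.0 * np.linalg.norm(W_d, 2) ** 2)     # 1 / Lipschitz constant of the gradient
    h = np.zeros(W_d.shape[1])
    for _ in range(n_iter):
        grad = 2.0 * W_d.T @ (W_d @ h - x)               # gradient of the smooth part
        h = soft_threshold(h - step * grad, step * lam)  # proximal step on the L1 term
    return h

rng = np.random.default_rng(0)
W_d = rng.normal(size=(8, 16))          # overcomplete dictionary: k = 16 > d = 8
h_true = np.zeros(16)
h_true[[2, 9]] = [1.0, -2.0]            # a genuinely sparse code
x = W_d @ h_true
h = ista(x, W_d, lam=0.1)               # recovered code: sparse, with small residual
```

A sparse autoencoder replaces this iterative inference with a single learned encoder pass and trains \(W_d\) jointly, trading exact sparse codes for amortized speed.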
Theoretical Question on Variational Autoencoders: Consider a Variational Autoencoder (VAE) with encoder \(q_\phi(\mathbf{z}|\mathbf{x})\) and decoder \(p_\theta(\mathbf{x}|\mathbf{z})\), where the prior is \(p(\mathbf{z}) = \mathcal{N}(\mathbf{0}, I)\).
- Derive the Evidence Lower Bound (ELBO) starting from the log-likelihood \(\log p_\theta(\mathbf{x})\).
- Show how the reparameterization trick enables gradient-based optimization of the ELBO.
- Analyze the two terms in the ELBO: explain their roles and what happens in the limits \(\beta \to 0\) and \(\beta \to \infty\) in the \(\beta\)-VAE formulation.
Assume Gaussian encoder distributions: \(q_\phi(\mathbf{z}|\mathbf{x}) = \mathcal{N}(\boldsymbol{\mu}_\phi(\mathbf{x}), \text{diag}(\boldsymbol{\sigma}_\phi^2(\mathbf{x})))\).
- Start with the log-likelihood \(\log p_\theta(\mathbf{x})\) and introduce any distribution \(q_\phi(\mathbf{z}|\mathbf{x})\)
- Use the “multiply by 1” trick: multiply and divide by \(q_\phi(\mathbf{z}|\mathbf{x})\) inside the integral
- Apply Jensen’s inequality: \(\log \mathbb{E}[X] \geq \mathbb{E}[\log X]\) (log is concave)
- The result will have two terms: a reconstruction term and a KL divergence term
- Remember: \(\mathbb{E}[\log \frac{p(\mathbf{z})}{q(\mathbf{z}|\mathbf{x})}] = -D_{KL}(q(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))\)
- The challenge is computing \(\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}[f(\mathbf{z})]\) where the expectation depends on \(\phi\)
- Direct gradient estimation using the log-derivative trick has high variance
- Key insight: express \(\mathbf{z}\) as a deterministic function of \(\phi\) and a noise variable \(\boldsymbol{\epsilon}\)
- For Gaussian distributions: \(\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}\) where \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)\)
- Now the expectation is over \(\boldsymbol{\epsilon}\), which doesn’t depend on \(\phi\)
- Identify the two terms: reconstruction loss vs. KL regularization
- Think about what each term encourages the model to do
- For \(\beta\)-VAE: consider what happens when you weight the KL term by \(\beta \neq 1\)
- \(\beta \to 0\): what happens when you ignore the KL penalty?
- \(\beta \to \infty\): what happens when the KL penalty dominates?
- Consider the trade-off between reconstruction quality and latent space structure
Part 1: ELBO Derivation
Starting with the log-likelihood, we use the fact that for any distribution \(q_\phi(\mathbf{z}|\mathbf{x})\):
\[\log p_\theta(\mathbf{x}) = \log \int p_\theta(\mathbf{x}, \mathbf{z}) d\mathbf{z} = \log \int p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z}) d\mathbf{z}\]
Multiplying and dividing by \(q_\phi(\mathbf{z}|\mathbf{x})\):
\[\log p_\theta(\mathbf{x}) = \log \int \frac{p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})} q_\phi(\mathbf{z}|\mathbf{x}) d\mathbf{z}\]
By Jensen’s inequality (since log is concave):
\[\begin{align} \log p_\theta(\mathbf{x}) &\geq \int q_\phi(\mathbf{z}|\mathbf{x}) \log \frac{p_\theta(\mathbf{x}|\mathbf{z}) p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})} d\mathbf{z} \\ &= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] + \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log \frac{p(\mathbf{z})}{q_\phi(\mathbf{z}|\mathbf{x})}\right] \\ &= \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z})) \end{align}\]
This is the Evidence Lower Bound (ELBO):
\[\mathcal{L}(\theta, \phi; \mathbf{x}) = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))\]
Both terms in the ELBO have direct information-theoretic meaning from the Information Theory chapter:
- The reconstruction term \(\mathbb{E}_{q_\phi}[\log p_\theta(\mathbf{x}|\mathbf{z})]\) is the expected log-likelihood of the data under the decoder, with latents drawn from the encoder — it measures how well the model can reconstruct data from its compressed representation.
- The KL term \(D_{\text{KL}}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))\) measures how far the learned posterior deviates from the prior. Recall that KL divergence is non-negative and zero only when the two distributions match exactly. This term acts as a regularizer: it penalizes complex, data-dependent posteriors that deviate far from the simple prior \(\mathcal{N}(\mathbf{0}, I)\).
For Gaussian encoder and prior, the KL term has the closed-form expression derived in Exercise 1.5 of the Information Theory chapter — which is why assuming Gaussian distributions makes VAE training tractable.
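That closed form is \(D_{KL} = \tfrac{1}{2}\sum_j\left(\sigma_j^2 + \mu_j^2 - 1 - \log\sigma_j^2\right)\), which can be sketched directly in numpy (the function name is illustrative; \(\log\sigma^2\) is passed rather than \(\sigma\), as is common for numerical stability):

```python
import numpy as np

def kl_to_std_normal(mu, log_var):
    """D_KL( N(mu, diag(exp(log_var))) || N(0, I) ), summed over latent dimensions."""
    return 0.5 * np.sum(np.exp(log_var) + mu ** 2 - 1.0 - log_var)

# a standard-normal posterior matches the prior, so the KL is exactly zero;
# any other (mu, log_var) gives a strictly positive penalty
kl = kl_to_std_normal(np.array([0.5, -0.3]), np.array([0.1, -0.2]))
```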
Part 2: Reparameterization Trick
The challenge is computing gradients of \(\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right]\) with respect to \(\phi\).
Problem: Direct gradient estimation via:
\[\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z}) \nabla_\phi \log q_\phi(\mathbf{z}|\mathbf{x})\right]\]
has high variance.
Reparameterization trick: Express \(\mathbf{z}\) as a deterministic function of \(\mathbf{x}\), \(\phi\), and a noise variable \(\boldsymbol{\epsilon}\):
\[\mathbf{z} = \boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon}\]
where \(\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)\) and \(\odot\) denotes element-wise multiplication.
Now the expectation becomes:
\[\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] = \mathbb{E}_{\boldsymbol{\epsilon} \sim \mathcal{N}(\mathbf{0}, I)}\left[\log p_\theta(\mathbf{x}|\boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon})\right]\]
Gradient computation:
\[\nabla_\phi \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] = \mathbb{E}_{\boldsymbol{\epsilon}}\left[\nabla_\phi \log p_\theta(\mathbf{x}|\boldsymbol{\mu}_\phi(\mathbf{x}) + \boldsymbol{\sigma}_\phi(\mathbf{x}) \odot \boldsymbol{\epsilon})\right]\]
This can be estimated using Monte Carlo with samples \(\boldsymbol{\epsilon}^{(l)} \sim \mathcal{N}(\mathbf{0}, I)\).
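The pathwise estimator can be checked numerically. In this numpy sketch (a toy one-dimensional case; the quadratic \(f\) stands in for \(\log p_\theta(\mathbf{x}|\mathbf{z})\) precisely so that the exact gradients are known), the Monte Carlo gradients recover the analytic values:

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, c = 0.5, 1.2, 2.0            # toy 1-D encoder outputs; c is an arbitrary constant

# f(z) = -(z - c)^2 stands in for log p_theta(x|z); for this choice the exact
# gradients are d/dmu E[f(z)] = -2 (mu - c) and d/dsigma E[f(z)] = -2 sigma
def grad_f(z):
    return -2.0 * (z - c)               # f'(z)

eps = rng.standard_normal(100_000)      # noise sampled independently of (mu, sigma)
z = mu + sigma * eps                    # reparameterized samples z ~ N(mu, sigma^2)

grad_mu = grad_f(z).mean()              # pathwise gradient wrt mu    (dz/dmu = 1)
grad_sigma = (grad_f(z) * eps).mean()   # pathwise gradient wrt sigma (dz/dsigma = eps)
# both estimates are close to the analytic values 3.0 and -2.4
```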
Part 3: Analysis of ELBO Terms
The ELBO has two terms:
1. Reconstruction term: \(\mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right]\)
2. Regularization term: \(-D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))\)
Reconstruction Term:
- Encourages the decoder to reconstruct the input accurately
- Forces the latent representation to contain the information necessary for reconstruction
- Acts like the reconstruction loss in standard autoencoders
KL Regularization Term:
- Forces the encoder distribution to stay close to the prior \(p(\mathbf{z})\)
- Enables sampling from the learned latent space
- Prevents overfitting and encourages meaningful latent representations
\(\beta\)-VAE Analysis: The \(\beta\)-VAE modifies the ELBO as:
\[\mathcal{L}_\beta = \mathbb{E}_{q_\phi(\mathbf{z}|\mathbf{x})}\left[\log p_\theta(\mathbf{x}|\mathbf{z})\right] - \beta \cdot D_{KL}(q_\phi(\mathbf{z}|\mathbf{x}) \| p(\mathbf{z}))\]
As \(\beta \rightarrow 0\):
- KL penalty vanishes
- Model focuses purely on reconstruction
- Latent space need not follow the prior distribution
- The objective reduces to that of a standard (stochastic) autoencoder
As \(\beta \rightarrow \infty\):
- KL penalty dominates
- Encoder is forced toward the prior, \(q_\phi(\mathbf{z}|\mathbf{x}) \approx p(\mathbf{z})\) for all \(\mathbf{x}\) — so-called “posterior collapse”
- Latent representation becomes independent of the input
- Reconstruction quality deteriorates, and the extreme regularization prevents learning meaningful representations
Optimal \(\beta\):
- \(\beta = 1\) corresponds to the true ELBO
- \(\beta > 1\) encourages disentangled representations at the cost of reconstruction quality
- \(\beta < 1\) prioritizes reconstruction quality
- The choice depends on the desired trade-off between reconstruction fidelity and latent-space structure