16 Foundation Models for Economic Text and Expectations

Experimental Chapter

This chapter is under active development. Content, notation, and exercises may change substantially between revisions.

16.1 Overview

A large and growing share of economic information arrives as text. Central-bank statements, monetary-policy minutes, press conferences, earnings disclosures, conference-call transcripts, analyst reports, and regulatory filings all carry signals about policy stance, risk, and expectations that are not fully captured by any scalar time series. Foundation models — large pre-trained language models such as the GPT and LLaMA families — can process this material and return numerical objects: conditional token probabilities, contextual embeddings, classification scores, or generated responses. A growing econometric literature studies exactly how these objects can be used in policy analysis, forecasting, and measurement without importing the methodological laxity of general-purpose machine-learning practice (Haghighi et al. 2025).

This chapter introduces foundation models as econometric tools, not as chatbots. The chapter has three aims. First, we put language models on formal ground as probabilistic sequence models, so that notions such as cross-entropy and perplexity mean exactly what they do in Information Theory. Second, we recast outputs of a foundation model — scores, labels, embeddings — as noisy measurements of latent economic concepts, which connects immediately to classical errors-in-variables reasoning. Third, we insist that any use of text in forecasting respect a real-time information set: retrieval, labeling, and scoring must be \(\mathcal{I}_t\)-measurable, or the forecast evaluation is not credible.

What this chapter does not do: it is not a software tutorial, not a prompt-engineering guide, and not a survey of transformer internals. Pre-training, fine-tuning, and the engineering details of attention architectures are discussed only to the extent necessary to read and evaluate empirical work that uses foundation models.

16.2 Roadmap

  1. We first formalize a language model as a distribution over token sequences and connect its log-likelihood to cross-entropy and perplexity.
  2. We then explain how tokens become representations, introduce self-attention as relevance-weighted averaging, and work through a small numerical example.
  3. Next, we recast prompt-based classification and scoring as measurement and analyze the consequences of the associated measurement error.
  4. After that, we define a real-time information set \(\mathcal{I}_t\) and state the admissibility condition \(D_t \subseteq \mathcal{I}_t\) that retrieval, labeling, and scoring must satisfy.
  5. We then embed text-derived signals into standard predictive regressions and discuss pseudo-out-of-sample evaluation.
  6. We treat synthetic agents — model-generated survey responses — as a disciplined tool for exploring expectations, with explicit limitations.
  7. We close with a summary, including common pitfalls, and three pen-and-paper exercises.

16.3 Language Models as Probabilistic Sequence Models

A language model is a probability distribution over sequences of tokens drawn from a finite vocabulary \(\mathcal{V}\). Tokens are the basic discrete units the model operates on. Depending on the tokenizer, a token can be a word, a subword, a punctuation mark, or a special marker. We will not discuss tokenization in detail. For the econometric analysis below, what matters is that a document is represented as a sequence \(w_{1:T} = (w_1, \ldots, w_T)\) with each \(w_t \in \mathcal{V}\), and that the model assigns a joint probability.

By the chain rule,

\[ p(w_{1:T}) = \prod_{t=1}^T p(w_t \mid w_{1:t-1}), \tag{16.1}\]

with \(p(w_1 \mid w_{1:0}) := p(w_1)\). A language model is thus a family of conditional next-token distributions. Modern foundation models parameterize these conditionals as \(p_\theta(w_t \mid w_{1:t-1})\) using a large neural network with parameters \(\theta\).

Derivation A: from log-likelihood to perplexity. Taking logs in Equation 16.1,

\[ \log p_\theta(w_{1:T}) = \sum_{t=1}^T \log p_\theta(w_t \mid w_{1:t-1}). \]

Dividing by \(T\) gives the average negative log-likelihood per token,

\[ \bar{\ell}(w_{1:T}; \theta) := -\frac{1}{T}\sum_{t=1}^T \log p_\theta(w_t \mid w_{1:t-1}). \tag{16.2}\]

If the test sequence is sampled from a true data-generating distribution \(p^\star\), then as \(T \to \infty\), under mild stationarity and ergodicity conditions, \(\bar{\ell}(w_{1:T}; \theta)\) converges to the cross-entropy

\[ H(p^\star, p_\theta) = -\mathbb{E}_{p^\star}\!\left[\log p_\theta(W_t \mid W_{1:t-1})\right], \]

exactly the object studied in Information Theory. Perplexity is defined as

\[ \text{PPL}(w_{1:T}; \theta) = \exp\!\bigl(\bar{\ell}(w_{1:T}; \theta)\bigr). \tag{16.3}\]

Perplexity has the interpretation of an effective vocabulary size per token: if the model is uniform over \(K\) tokens, then \(\bar{\ell} = \log K\) and \(\text{PPL} = K\). A lower perplexity means the model assigns, on average, a higher probability to the tokens that actually occur, and is better in a predictive-distribution sense in the terminology of Evaluating Predictive Distributions.

Two remarks matter for econometric work. First, perplexity is computed on a test sequence and is therefore subject to the same pseudo-out-of-sample discipline as any predictive score: the test tokens must not have been seen during training, which for modern foundation models is a nontrivial condition to verify. Second, because perplexity is an exponential of an average, it is dominated by tokens with very small assigned probabilities. A single unlikely token can inflate perplexity substantially; this is relevant when evaluating a language model on crisis-period text whose vocabulary or topic composition departs from the training corpus.
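Both properties, the effective-vocabulary interpretation and the sensitivity to a single unlikely token, can be verified in a few lines. A minimal sketch; the per-token probabilities here are illustrative placeholders, not model outputs:

```python
import numpy as np

# Check 1: a model uniform over K tokens has average NLL log K and PPL = K.
K = 50
T = 200
probs_uniform = np.full(T, 1.0 / K)
ppl_uniform = np.exp(-np.log(probs_uniform).mean())   # equals K

# Check 2: one very unlikely token moves the whole score noticeably.
# A sequence predicted at probability 0.2 per token has PPL exactly 5;
# replacing a single probability with 1e-6 lifts the exponential of
# the average NLL.
probs_good = np.full(T, 0.2)
ppl_good = np.exp(-np.log(probs_good).mean())         # 5.0
probs_one_bad = probs_good.copy()
probs_one_bad[0] = 1e-6
ppl_one_bad = np.exp(-np.log(probs_one_bad).mean())   # ≈ 5.3, from one token

print(ppl_uniform, ppl_good, ppl_one_bad)
```

The second check scales with \(T\): the shorter the evaluation sequence, the more a single rare token dominates perplexity, which is the crisis-period caveat in the text.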

Show the code
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(3)
tokens = ["<s>", "the", "central", "bank", "raised", "rates"]
T = len(tokens) - 1
V = 6

probs = np.zeros((T, V))
observed_idx = np.array([1, 2, 3, 4, 5])
for t in range(T):
    logits = rng.normal(0.0, 1.0, size=V)
    logits[observed_idx[t]] += 1.6
    p = np.exp(logits - logits.max())
    probs[t] = p / p.sum()

fig, ax = plt.subplots(figsize=(8.0, 3.4))
x = np.arange(1, T + 1)
width = 0.12
for j in range(V):
    heights = probs[:, j]
    colors = ["C3" if j == observed_idx[t] else "C0" for t in range(T)]
    for t in range(T):
        ax.bar(x[t] + (j - V/2) * width, heights[t], width=width,
               color=colors[t], alpha=0.85 if colors[t] == "C3" else 0.35)

ax.set_xticks(x)
ax.set_xticklabels([f"t={t}\n{tokens[t]}" for t in range(1, T + 1)])
ax.set_ylabel(r"$p_\theta(\,\cdot\,\mid w_{1:t-1})$")
ax.set_title("Conditional next-token distributions along a sequence")
ax.grid(True, alpha=0.2, axis="y")
plt.tight_layout()
plt.show()
Figure 16.1: Next-token factorization schematic. A document is represented as a token sequence \(w_{1:T}\). The model specifies a conditional distribution \(p_\theta(w_t \mid w_{1:t-1})\) over the vocabulary at each position, shown as a bar at position \(t\). The probability the model assigns to the observed token is the height of the dark bar. Sequence log-likelihood is the sum of the log-heights of the dark bars.

A compact check of Equation 16.2 and Equation 16.3 on this toy sequence:

Show the code
chosen = probs[np.arange(T), observed_idx]
avg_nll = -np.log(chosen).mean()
ppl = np.exp(avg_nll)
print(f"Average NLL: {avg_nll:.4f}")
print(f"Perplexity:  {ppl:.4f}")
Average NLL: 1.6615
Perplexity:  5.2670
Question for Reflection

What breaks in perplexity when the test corpus contains a regime that was not represented in the training corpus, for example monetary-policy statements written after a structural break in the inflation process?

Perplexity is only a meaningful predictive score under some form of exchangeability between training and test tokens. A structural break in the inflation process changes both vocabulary frequencies and conditional patterns, so the cross-entropy \(H(p^\star, p_\theta)\) evaluated on post-break text is computed against a different \(p^\star\) than the one the model was trained under. Perplexity will typically rise, but the more important point is that comparisons across regimes are no longer tests of the same learning problem. This is the same caution that applies to any out-of-sample score under non-stationarity.

16.4 From Words to Representations

Computing the conditionals \(p_\theta(w_t \mid w_{1:t-1})\) requires turning the history \(w_{1:t-1}\) into a numerical summary. Foundation models do this via contextual embeddings: each token position is associated with a vector in \(\mathbb{R}^d\) that summarizes the relevant context. Two points matter for the econometric reader.

Embeddings, fixed and contextual. In a fixed embedding scheme each token \(w \in \mathcal{V}\) has a single vector \(e(w) \in \mathbb{R}^d\) that is independent of context. In a contextual embedding the vector \(e_t\) at position \(t\) depends on the entire sequence \(w_{1:t-1}\) (or, for non-causal models, on both past and future context). Contextual embeddings are what let the word bank be represented differently in “central bank” and “river bank”.

Attention as a weighted average. The dominant mechanism for producing contextual embeddings is self-attention. Starting from fixed token embeddings \(e_1, \ldots, e_n \in \mathbb{R}^d\), three linear maps \(W_Q, W_K, W_V \in \mathbb{R}^{d \times d}\) produce queries, keys, and values,

\[ Q = E W_Q, \quad K = E W_K, \quad V = E W_V, \]

where \(E \in \mathbb{R}^{n \times d}\) stacks the embeddings as rows. The attention output is

\[ \mathrm{Att}(E) = \mathrm{softmax}\!\left(\frac{QK^\top}{\sqrt{d}}\right) V. \tag{16.4}\]

The softmax is applied row-wise, so each row of the result is a convex combination of the rows of \(V\), with weights that depend on the similarity between the corresponding query and every key. The scaling by \(\sqrt{d}\) prevents the softmax from saturating when \(d\) is large. We do not discuss multi-head attention, positional encodings, feed-forward blocks, or layer normalization; these are necessary engineering components but they do not change the econometric content.
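The role of the \(\sqrt{d}\) scaling is easy to see numerically. In the sketch below (toy random vectors, not trained parameters), raw scores \(q^\top k\) have standard deviation of order \(\sqrt{d}\), so the unscaled softmax collapses onto a single key, while the scaled version keeps the weights spread out; entropy of the weight vector measures the concentration:

```python
import numpy as np

rng = np.random.default_rng(0)

def softmax(v):
    w = np.exp(v - v.max())
    return w / w.sum()

def entropy(w):
    return float(-(w * np.log(w)).sum())

# One query against n keys in a high dimension d.
n, d = 8, 512
q = rng.normal(size=d)
keys = rng.normal(size=(n, d))
scores = keys @ q                     # each entry ~ N(0, d)

w_unscaled = softmax(scores)          # near one-hot: softmax saturates
w_scaled = softmax(scores / np.sqrt(d))  # spread out, entropy near log n

print(f"entropy unscaled: {entropy(w_unscaled):.4f}")
print(f"entropy scaled:   {entropy(w_scaled):.4f}  (log n = {np.log(n):.4f})")
```

Multiplying logits by any constant greater than one strictly lowers the entropy of the softmax (for non-constant logits), so the scaled weights always retain more spread than the unscaled ones.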

From recurrent states to attention.

The previous chapter, LSTM Networks, introduced a fixed-width hidden state \(h_t\) that carries information across time steps. This forces all past relevant information to pass through a bottleneck whose size does not grow with the sequence length. Attention avoids the bottleneck: at position \(t\), the contextual embedding is constructed as an explicit weighted average over representations of all earlier positions. The similarity \(q_t^\top k_s / \sqrt{d}\) acts as a learned relevance score: it measures how much position \(s\) should contribute to the representation of position \(t\). Nothing in Equation 16.4 is recurrent.

16.4.1 Derivation B: A Three-Token Self-Attention Example

Take three tokens in \(\mathbb{R}^2\) with fixed embeddings

\[ e_1 = \begin{pmatrix} 1 \\ 0 \end{pmatrix}, \quad e_2 = \begin{pmatrix} 0 \\ 1 \end{pmatrix}, \quad e_3 = \begin{pmatrix} 1 \\ 1 \end{pmatrix}, \]

and set \(W_Q = W_K = W_V = I_2\) to keep the arithmetic transparent. The query, key, and value matrices are then

\[ Q = K = V = E = \begin{pmatrix} 1 & 0 \\ 0 & 1 \\ 1 & 1 \end{pmatrix}. \]

The similarity matrix before scaling is

\[ Q K^\top = \begin{pmatrix} 1 & 0 & 1 \\ 0 & 1 & 1 \\ 1 & 1 & 2 \end{pmatrix}, \]

and after scaling by \(\sqrt{d} = \sqrt{2} \approx 1.414\),

\[ \tfrac{1}{\sqrt{2}} Q K^\top \approx \begin{pmatrix} 0.707 & 0 & 0.707 \\ 0 & 0.707 & 0.707 \\ 0.707 & 0.707 & 1.414 \end{pmatrix}. \]

Applying the softmax row-wise gives the attention weights

\[ A \approx \begin{pmatrix} 0.401 & 0.198 & 0.401 \\ 0.198 & 0.401 & 0.401 \\ 0.248 & 0.248 & 0.503 \end{pmatrix}, \]

which are nonnegative and each row sums to one. The contextual output is

\[ \mathrm{Att}(E) = A V \approx \begin{pmatrix} 0.802 & 0.599 \\ 0.599 & 0.802 \\ 0.752 & 0.752 \end{pmatrix}. \]

Three observations. First, row three attends most strongly to itself because \(e_3\) has the largest norm and therefore the largest self-similarity \(e_3^\top e_3 = 2\). Second, rows one and two remain distinct but each picks up a contribution from token three, which is the mechanism by which a repeated concept can contaminate or support neighboring positions. Third, the output vectors are not arbitrary: they lie in the column space of \(V\), which in turn is the column space of \(E\).

Show the code
import numpy as np
import matplotlib.pyplot as plt

E = np.array([[1.0, 0.0],
              [0.0, 1.0],
              [1.0, 1.0]])
d = E.shape[1]
scores = E @ E.T / np.sqrt(d)
A = np.exp(scores - scores.max(axis=1, keepdims=True))
A = A / A.sum(axis=1, keepdims=True)

fig, ax = plt.subplots(figsize=(4.6, 3.8))
im = ax.imshow(A, cmap="Blues", vmin=0.0, vmax=A.max())
for i in range(3):
    for j in range(3):
        ax.text(j, i, f"{A[i, j]:.3f}", ha="center", va="center",
                color="black" if A[i, j] < 0.35 else "white")
ax.set_xticks([0, 1, 2])
ax.set_yticks([0, 1, 2])
ax.set_xticklabels(["$e_1$", "$e_2$", "$e_3$"])
ax.set_yticklabels(["$e_1$", "$e_2$", "$e_3$"])
ax.set_xlabel("Attended position $s$ (key)")
ax.set_ylabel("Query position $t$")
ax.set_title("Self-attention weights")
fig.colorbar(im, ax=ax, fraction=0.046, pad=0.04)
plt.tight_layout()
plt.show()
Figure 16.2: Attention weights \(A\) from Derivation B. Cell \((t, s)\) shows the fraction of position \(t\)’s output representation that is contributed by position \(s\). Rows sum to one. The diagonal dominance of row three reflects the larger self-similarity of the third token.

Every foundation model we discuss below can be seen as iterating a version of this construction many times, with learned \(W_Q, W_K, W_V\) and additional transformations, and ultimately producing, at each position \(t\), a contextual embedding vector \(\phi_t \in \mathbb{R}^d\). For the econometric analysis in the next sections, \(\phi_t\) is the object that downstream labeling, scoring, and retrieval procedures will consume.

16.5 Foundation Models as Measurement Devices

A common use of a foundation model in macro and finance is to turn a text into a scalar or low-dimensional score. Leading recent examples in the econometric literature are Bertsch et al. (2025), who construct stance measures from Federal Reserve speeches to map central-bank mandates into policy-stance time series, and Siano (2025), who shows that extracting context-sensitive content from earnings-announcement disclosures using LLM methods outperforms the dictionary-based and bag-of-words measures that have dominated the earlier finance literature. Both studies are worth reading as full-length applications of exactly the measurement framing introduced below. Examples of typical latent targets include:

  • hawkishness of a central-bank statement,
  • recession concern in monetary-policy minutes,
  • financial-stability emphasis in a speech,
  • tone or sentiment of an earnings press release,
  • disclosure novelty of a conference-call transcript.

In every case, the procedure maps a document \(d_t\) to a number \(z_t = g(d_t)\) using either a prompt (“classify this statement as hawkish, neutral, or dovish”), a probability extracted from the model’s output distribution, or a fine-tuned classification head on top of a contextual embedding.

The measurement view. It is rarely reasonable to treat \(z_t\) as a direct observation of the underlying economic concept. A better model is

\[ z_t = s_t + u_t, \tag{16.5}\]

where \(s_t\) is the latent concept of interest (the true hawkishness of the statement at time \(t\)) and \(u_t\) is measurement error with properties determined by the labeling procedure. Three sources of error deserve attention.

Prompt sensitivity. Two prompts designed to elicit the same underlying concept can produce meaningfully different scores, because the conditional distribution the model samples from depends on the full prompt.

Model-version instability. Foundation models are updated. A score constructed with one model version and re-computed with a later version need not agree. If the econometrician uses one version early in the sample and another later, \(u_t\) has a time-varying distribution.

Threshold effects in classification. Asking a model to return a discrete class and then encoding that class as a number (\(-1, 0, +1\)) throws away calibration information near decision boundaries. Errors near the boundary are systematically related to the underlying \(s_t\), violating classical measurement-error assumptions.
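The threshold mechanism can be simulated directly. In the sketch below (the cutoffs \(\pm 0.5\) and the noise scale are illustrative assumptions), a continuous score is collapsed into \(\{-1, 0, +1\}\) and the implied measurement error \(u_t = z_t - s_t\) comes out strongly correlated with the latent \(s_t\), which is exactly the violation of the classical assumption described above:

```python
import numpy as np

rng = np.random.default_rng(5)
n = 100_000
s = rng.normal(size=n)                        # latent stance s_t
score = s + rng.normal(scale=0.3, size=n)     # continuous model score

# Hypothetical cutoffs at +/-0.5 map the score into {-1, 0, +1};
# the class code is then used as a numeric measurement z_t.
z = np.where(score > 0.5, 1.0, np.where(score < -0.5, -1.0, 0.0))
u = z - s                                     # implied measurement error

# Discretization compresses large |s_t| toward +/-1, so u_t is
# systematically negative when s_t is large and positive: Cov(u, s) < 0.
print(f"Cov(u, s) = {np.cov(u, s)[0, 1]:.3f}")
```

Because \(u_t\) is correlated with \(s_t\), the attenuation formula of the next derivation, which assumes \(u_t \perp s_t\), no longer applies exactly; the bias has to be worked out for the specific thresholding rule.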

The implications are those familiar from classical errors-in-variables. Let

\[ y_{t+1} = \alpha + \gamma\, s_t + \varepsilon_{t+1}, \quad \varepsilon_{t+1} \perp s_t, \]

and suppose the econometrician estimates

\[ y_{t+1} = \alpha + \delta\, z_t + \eta_{t+1} \]

by OLS with \(z_t = s_t + u_t\), \(u_t \perp s_t\), \(u_t \perp \varepsilon_{t+1}\), \(\mathrm{Var}(s_t) = \sigma_s^2\), \(\mathrm{Var}(u_t) = \sigma_u^2\). Then

\[ \mathrm{plim}\,\hat{\delta}_{\text{OLS}} = \gamma \cdot \frac{\sigma_s^2}{\sigma_s^2 + \sigma_u^2} = \gamma \cdot \lambda, \]

with \(\lambda \in (0, 1]\) the reliability ratio. The coefficient is attenuated, and the attenuation is worse when the language model is noisier. Exercise 2 asks you to derive this and extend it.
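A quick i.i.d. sanity check of the plim formula, with \(\gamma\), \(\sigma_s\), and \(\sigma_u\) chosen for illustration:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 200_000
gamma, sigma_s, sigma_u = 0.8, 1.0, 0.5

s = rng.normal(scale=sigma_s, size=n)
u = rng.normal(scale=sigma_u, size=n)
z = s + u                                             # observed signal
y = 0.2 + gamma * s + rng.normal(scale=0.4, size=n)   # outcome on latent s

delta_hat = np.cov(y, z)[0, 1] / np.var(z)            # OLS slope of y on z
lam = sigma_s**2 / (sigma_s**2 + sigma_u**2)          # reliability ratio
print(f"OLS slope:   {delta_hat:.3f}")
print(f"gamma * lam: {gamma * lam:.3f}")
```

With \(\sigma_s = 1\) and \(\sigma_u = 0.5\), the reliability ratio is \(\lambda = 1/1.25 = 0.8\), so the OLS slope converges to \(0.8 \times 0.8 = 0.64\) rather than to \(\gamma = 0.8\).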

This has a nontrivial consequence for econometric practice. For parameter estimation, attenuation is a problem and \(\delta\) is not informative about \(\gamma\) without a reliability correction. For forecasting, attenuation is less damaging: a consistent non-zero \(\hat\delta\) still implies that \(z_t\) carries predictive content for \(y_{t+1}\), even though \(\delta\) is not the coefficient on the latent concept. This distinction between prediction and inference, familiar from econometrics, is what makes foundation-model-derived signals potentially useful in forecasting while being suspect in causal exercises.

Show the code
import numpy as np
import matplotlib.pyplot as plt

rng = np.random.default_rng(7)
n = 250
s = rng.normal(0.0, 1.0, size=n)
eps = rng.normal(0.0, 0.6, size=n)
y = 1.0 * s + eps

sigmas_u = [0.0, 0.6, 1.4]
fig, axes = plt.subplots(1, 3, figsize=(11.6, 3.8), sharey=True)
for ax, sigma_u in zip(axes, sigmas_u):
    u = rng.normal(0.0, sigma_u, size=n)
    z = s + u
    slope = np.cov(y, z, ddof=0)[0, 1] / np.var(z, ddof=0)
    ax.scatter(z, y, s=11, alpha=0.45, color="C0")
    xs = np.linspace(z.min(), z.max(), 50)
    ax.plot(xs, slope * xs, color="C3", linewidth=2)
    ax.set_title(fr"$\sigma_u = {sigma_u:.1f}$,  $\hat\delta \approx {slope:.2f}$")
    ax.set_xlabel("Measured signal $z_t$")
    ax.grid(True, alpha=0.2)
axes[0].set_ylabel("Outcome $y_{t+1}$")
fig.suptitle("Attenuation as measurement noise grows", y=1.03, fontsize=13)
plt.tight_layout()
plt.show()
Figure 16.3: Attenuation in a predictive regression with a measured regressor. The left panel shows the latent concept \(s_t\) against the outcome \(y_{t+1}\). The right panels show the observed signal \(z_t = s_t + u_t\) under increasing noise \(\sigma_u\). OLS slopes are reported inside each panel; the true slope \(\gamma = 1\) is attenuated by the reliability ratio \(\lambda = \sigma_s^2 / (\sigma_s^2 + \sigma_u^2)\), which approaches zero as \(\sigma_u\) grows.
Example: Hawkish vs. dovish classification of central-bank text

A common construction in monetary-policy applications passes each Federal Open Market Committee (FOMC) statement, or a broader collection of Federal Reserve speeches as in Bertsch et al. (2025), through a foundation model with a prompt that asks for a label in \(\{-1, 0, +1\}\) or for a continuous stance score aligned with specific mandate dimensions. The resulting series \(z_t\) is then used as a predictor of future interest-rate changes or bond yields. Under the measurement view, \(z_t\) is a noisy estimate of the committee’s latent stance \(s_t\). The reliability ratio \(\lambda\) depends on the prompt, the model version, and the degree of ambiguity of each individual statement. Empirical work based on \(z_t\) should therefore (i) report sensitivity across prompt variants, (ii) hold the model version fixed across the evaluation sample, and (iii) treat regression slopes as attenuated relative to coefficients on \(s_t\).

A schematic of the full measurement pipeline is shown in Figure 16.4.

Figure 16.4: From text to a scalar signal. A document \(d_t\) is tokenized, embedded, and processed by a foundation model to produce contextual embeddings \(\phi_t\) or a conditional distribution \(p_\theta(\cdot \mid d_t)\). A labeling function \(g\) then maps these to a scalar or low-dimensional score \(z_t\). The econometric content of \(z_t\) is its relationship to the latent concept \(s_t\) and to the outcome \(y_{t+1}\); the rest of the pipeline is instrumentation.
Question for Reflection

Suppose two foundation-model versions disagree on 12 % of the hawkish/dovish labels over a 2015–2020 sample. You use version 1 through 2017 and version 2 from 2018 onward in a single predictive regression of bond yields on \(z_t\). Describe what happens to the interpretation of \(\hat\delta\) and what a minimal fix would look like.

The measurement error \(u_t\) has a discontinuous distribution at the version-switch date: its variance, and potentially its mean, differ across the two subsamples. If the break is ignored, \(\hat\delta\) is a weighted average of two regimes with different reliability ratios, and the overall attenuation is a mixture of the two that cannot be recovered from the pooled estimate. A minimal fix is to re-label the entire sample with a single frozen model version, ideally once with each version, and report both sets of results. A more careful fix would include a regime dummy and an interaction term, and report the estimated reliability ratio in each regime.

16.6 Retrieval and Real-Time Information Sets

Applied use of foundation models in forecasting almost always involves retrieval: selecting a subset of documents on which to run the model, or from which to construct the prompt. This section argues that retrieval is the most important place where text-based forecasting goes wrong, and that the corrective discipline is straightforward to state.

Information sets. Fix a forecast origin \(t\). Let \(\mathcal{I}_t\) denote the \(\sigma\)-algebra of information genuinely available to a forecaster at time \(t\) in real time. A document \(d\) is \(\mathcal{I}_t\)-measurable if and only if it had been published, in its current form, on or before \(t\), using the publication timestamps a real-time forecaster would have had. Let \(D_t\) denote the set of documents the text-based procedure actually uses at forecast origin \(t\), and let \(x_t^{\text{text}} = g(D_t)\) be the resulting text-derived feature.

The admissibility condition. For \(x_t^{\text{text}}\) to be valid as a predictor in a pseudo-out-of-sample forecasting exercise, the document set must satisfy

\[ D_t \subseteq \mathcal{I}_t, \tag{16.6}\]

and, crucially, the procedure that selects \(D_t\) must itself be \(\mathcal{I}_t\)-measurable.
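The condition is simple to enforce mechanically once the corpus records first-publication timestamps. A minimal sketch; the `Doc` record and its field names are hypothetical, not a reference to any particular data source:

```python
from dataclasses import dataclass
import datetime as dt

# Hypothetical document record. The key field is the timestamp of
# FIRST publication, not of any later revision.
@dataclass(frozen=True)
class Doc:
    doc_id: str
    first_published: dt.date
    text: str

def admissible_documents(corpus, origin):
    """Return D_t: documents first published on or before the forecast origin."""
    return [d for d in corpus if d.first_published <= origin]

corpus = [
    Doc("fomc-2016-03", dt.date(2016, 3, 16), "..."),
    Doc("fomc-2016-03-rev", dt.date(2016, 6, 1), "..."),  # later revision
    Doc("speech-2016-05", dt.date(2016, 5, 12), "..."),
]
D_t = admissible_documents(corpus, origin=dt.date(2016, 5, 31))
print([d.doc_id for d in D_t])   # original statement and speech, not the revision
```

Filtering by timestamp handles only the first half of the condition; the second half, that the selection rule itself be \(\mathcal{I}_t\)-measurable, is the subject of the next derivation and cannot be enforced by a date filter.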

16.6.1 Derivation C: Why Retrieval Must Be \(\mathcal{I}_t\)-Measurable

Suppose the forecaster estimates a model of the form \(\mathbb{E}[Y_{t+h} \mid \mathcal{I}_t]\) using features \(x_t\) that include \(x_t^{\text{text}} = g(D_t)\). The pseudo-out-of-sample forecast at origin \(t\) is

\[ \hat{y}_{t+h \mid t} = \hat{m}(x_t; \hat\theta_t), \]

where \(\hat{m}\) is the fitted model and \(\hat\theta_t\) are parameters estimated from data available at time \(t\).

For \(\hat{y}_{t+h \mid t}\) to be a valid estimate of a real-time forecast, every object on the right-hand side must be \(\mathcal{I}_t\)-measurable. If \(D_t\) is not, then \(x_t^{\text{text}}\) is not, and the forecast is computed from information a real-time forecaster could not have used. The pseudo-out-of-sample loss obtained under this violation is a biased estimate of real-time loss, typically downward biased: future-information contamination tends to improve apparent forecast accuracy.

A second subtlety concerns the retrieval function itself. Suppose \(D_t\) is chosen by taking the top-\(k\) documents nearest to a query under a similarity metric based on an embedding model \(E_\psi\) with parameters \(\psi\). If \(\psi\) was estimated on a corpus that includes documents published after \(t\), then even a set \(D_t\) consisting only of documents published before \(t\) is selected by a function that depends on post-\(t\) information. The admissibility condition must therefore be strengthened to: both \(D_t\) and the retrieval parameters \(\psi_t\) that produce it must be \(\mathcal{I}_t\)-measurable. This is a real constraint in practice, because most off-the-shelf embedding models are trained once on a fixed corpus and then reused across forecast origins, which can only satisfy the constraint if the training corpus is restricted to pre-sample text.

Concrete sources of leakage. Table 16.1 lists the recurring sources and the conditions they violate. The list is not exhaustive, but each item has appeared in published empirical work.

Table 16.1: Sources of leakage in text-based forecasting, the admissibility condition each violates, and a suggested remediation.
  • Revised transcripts replace originals. Condition violated: document identity in \(D_t\) depends on post-\(t\) revisions. Remediation: use the original release; store the timestamp of first publication.
  • Curated summaries written after the event. Condition violated: \(D_t \not\subseteq \mathcal{I}_t\) directly. Remediation: exclude summaries; use raw statements only.
  • Benchmark labels constructed from future outcomes. Condition violated: labels are \(\mathcal{I}_{t+h}\)-, not \(\mathcal{I}_t\)-measurable. Remediation: re-label using only pre-\(t\) information.
  • Retriever \(\psi\) trained on the full corpus. Condition violated: retrieval function is not \(\mathcal{I}_t\)-measurable. Remediation: use a retriever trained on pre-sample text only.
  • Foundation model pre-trained on text including post-\(t\) material. Condition violated: \(p_\theta\) itself is not \(\mathcal{I}_t\)-measurable. Remediation: disclose the training cutoff; restrict the evaluation window to pre-cutoff dates.
  • Metadata (topic tags, emotion scores) generated post hoc. Condition violated: metadata is \(\mathcal{I}_{t+\Delta}\)-measurable. Remediation: regenerate metadata from \(\mathcal{I}_t\) only.
Show the code
import numpy as np
import matplotlib.pyplot as plt
from matplotlib.patches import Rectangle

fig, ax = plt.subplots(figsize=(9.6, 3.0))
ax.set_xlim(0, 10)
ax.set_ylim(0, 4)
ax.axis("off")

ax.add_patch(Rectangle((5, 0), 5, 4, facecolor="#f2d7d5", alpha=0.55, edgecolor="none"))
ax.axvline(5, color="black", linewidth=1.0)
ax.text(5.0, 3.7, "forecast origin $t$", ha="center", fontsize=10)
ax.text(7.5, 3.7, "forbidden future", ha="center", fontsize=10, style="italic")

ax.annotate("", xy=(9.8, 0.3), xytext=(0.2, 0.3),
            arrowprops=dict(arrowstyle="->", linewidth=1.1))
ax.text(0.2, 0.05, "time", fontsize=9)

def mark(x, y, label, ok=True):
    ax.scatter([x], [y], s=55, color=("C2" if ok else "C3"), zorder=3)
    ax.text(x, y + 0.22, label, ha="center", fontsize=9)

mark(2.2, 1.2, "$d_1$ (original)", ok=True)
mark(6.2, 1.2, "$d_2$ (revision of $d_1$)", ok=False)
mark(7.3, 2.1, "$d_3$ (curated summary)", ok=False)
mark(4.1, 2.6, "$d_4$ (news summary)", ok=True)

plt.tight_layout()
plt.show()
Figure 16.5: Real-time timeline at forecast origin \(t\). The shaded region to the right of \(t\) is the forbidden future: no object produced or revised there may enter \(D_t\). Document \(d_1\) is admissible. Document \(d_2\) is a revision of \(d_1\) published after \(t\) and is not admissible, even though it concerns an event that occurred before \(t\). Document \(d_3\) is a curated summary published after \(t\) and is not admissible. Document \(d_4\) is a news summary published before \(t\) and is admissible as a document; its inclusion is nonetheless contaminated if the retriever that selects it weights by similarity to post-\(t\) content.

This is the single most important section of the chapter. Text-based forecasting is attractive because language is expressive, but every failure mode we have seen in macro-forecast evaluation has its textual counterpart, and several new ones — post-hoc retrieval, revised transcripts, pre-training contamination — do not have close macro analogues and therefore require explicit discipline. Ahrens et al. (2025) provide a useful operational example: they study high-frequency market responses to central-bank speeches, and the credibility of their identification hinges on precisely the timestamp discipline formalized above, because the window over which a response is measured must begin strictly after the speech is public and end before any other confounding information arrives.

16.7 Forecast Augmentation with Text-Derived Signals

With measurement and admissibility in hand, we can embed text-derived signals in a standard predictive regression. The workhorse is

\[ y_{t+h} = \alpha + \beta^\top x_t + \gamma^\top z_t + \varepsilon_{t+h}, \tag{16.7}\]

where \(y_{t+h}\) is the target (inflation \(h\) quarters ahead, next-day return around an earnings announcement, etc.), \(x_t\) is a vector of standard predictors (lags, factors, macro indicators), and \(z_t\) is a vector of text-derived scores constructed from \(D_t\) subject to the admissibility condition.

Design choices. Three choices deserve explicit statements, because each can change whether any forecast gain is genuine.

  1. Window. Rolling versus expanding estimation windows have different biases under regime change. The choice should be made up front, not after inspecting out-of-sample performance.
  2. Loss. For point forecasts, use RMSE or MAE; for predictive distributions, use the log score or CRPS as in Evaluating Predictive Distributions. A gain in RMSE that becomes a loss in log score usually reflects miscalibration of uncertainty.
  3. Baseline. The relevant comparison is not model-with-text versus AR(\(p\)). It is model-with-foundation-model-text versus (i) model-with-dictionary-text and (ii) model-with-static-embedding-text. Otherwise any gain might be attributable to text per se rather than to the foundation model.

A minimal evaluation loop. The following code illustrates the structure of a rolling pseudo-out-of-sample evaluation. No API calls are involved; \(z_t\) is treated as given.

Show the code
import numpy as np

def rolling_forecast(y, X, z, window, horizon=1):
    n = len(y)
    preds_base = np.full(n, np.nan)
    preds_text = np.full(n, np.nan)
    for t in range(window, n - horizon):
        Xt = X[t - window:t]
        yt = y[t - window + horizon:t + horizon]
        Xt_full = np.column_stack([np.ones(window), Xt])
        beta_base = np.linalg.lstsq(Xt_full, yt, rcond=None)[0]
        preds_base[t + horizon] = np.r_[1.0, X[t]] @ beta_base

        zt = z[t - window:t]
        Xt_aug = np.column_stack([Xt_full, zt])
        beta_aug = np.linalg.lstsq(Xt_aug, yt, rcond=None)[0]
        preds_text[t + horizon] = np.r_[1.0, X[t], z[t]] @ beta_aug
    return preds_base, preds_text

rng = np.random.default_rng(11)
n = 400
s = rng.normal(size=n)
x = 0.4 * s + rng.normal(size=n)
u = rng.normal(scale=0.7, size=n)
z = s + u
eps = rng.normal(scale=0.9, size=n)
y = 0.3 * x + 0.6 * s + eps

pb, pt = rolling_forecast(y, x, z, window=120, horizon=1)
mask = ~np.isnan(pb) & ~np.isnan(pt)
rmse_base = np.sqrt(np.mean((y[mask] - pb[mask]) ** 2))
rmse_text = np.sqrt(np.mean((y[mask] - pt[mask]) ** 2))
print(f"RMSE baseline:          {rmse_base:.3f}")
print(f"RMSE baseline + text:   {rmse_text:.3f}")
RMSE baseline:          1.162
RMSE baseline + text:   1.171

In this particular run the text-augmented model does not beat the baseline, even though \(z_t\) is a noisy but real proxy for the predictive state \(s_t\): the extra parameter adds estimation variance in each 120-observation window, and a single simulated path need not display the population gain. In empirical work the gain will typically be small and noisy in exactly this way; what matters is that the loop is constructed so that only information from \(\mathcal{I}_t\) enters any object on the right-hand side of Equation 16.7.
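A difference in RMSE on its own does not say whether the two forecasts are statistically distinguishable. One standard formalization is the Diebold-Mariano test on the loss differential; a minimal sketch for the squared-error case, with simulated errors standing in for the loop's residual series:

```python
import numpy as np

def dm_statistic(e_base, e_alt):
    """Diebold-Mariano statistic for equal squared-error loss at h = 1.

    For one-step-ahead forecasts the loss differential is serially
    uncorrelated under the null, so the lag-0 variance suffices; at
    longer horizons a HAC variance estimator is needed instead.
    """
    d = e_base**2 - e_alt**2          # positive values favor e_alt
    n = len(d)
    return d.mean() / np.sqrt(d.var(ddof=1) / n)

# Illustration on simulated forecast errors (placeholders, not the
# loop's output): the second series has slightly smaller variance.
rng = np.random.default_rng(4)
n = 250
e_base = rng.normal(scale=1.1, size=n)
e_text = 0.9 * e_base + rng.normal(scale=0.3, size=n)
print(f"DM statistic: {dm_statistic(e_base, e_text):.2f}")
```

Under the null of equal predictive accuracy the statistic is approximately standard normal, so values beyond \(\pm 1.96\) reject at the 5% level.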

Conformal wrappers. If we want prediction intervals rather than point forecasts, the conformal machinery of Conformal Prediction applies, but with a warning. The coverage guarantee requires exchangeability of the non-conformity scores, which is not automatic when the text-derived regressor is itself produced by a model whose behavior may drift over time. In time-series settings, the specialized variants (CV+, adaptive scores) are more appropriate; the standard exchangeability argument is fragile.
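To make the warning concrete, the sketch below implements a minimal split-conformal wrapper around any vector of rolling point forecasts, using trailing absolute residuals as non-conformity scores. The helper `rolling_conformal_interval` is hypothetical, not part of the chapter's code; the finite-sample quantile follows the standard split-conformal construction, and the coverage statement still rests on the exchangeability assumption flagged above, so on drifting text-derived regressors it should be read as approximate.

```python
import numpy as np

def rolling_conformal_interval(y, preds, t, calib_window, alpha=0.1):
    """Split-conformal interval for the forecast at index t, calibrated on
    the trailing `calib_window` absolute residuals |y_s - preds_s|, s < t.
    Coverage holds only under exchangeability of the scores, which a
    drifting text-derived regressor can break."""
    scores = np.abs(y[t - calib_window:t] - preds[t - calib_window:t])
    n = len(scores)
    # Finite-sample-adjusted quantile: the ceil((n+1)(1-alpha))-th order statistic.
    k = min(int(np.ceil((n + 1) * (1 - alpha))), n)
    q = np.sort(scores)[k - 1]
    return preds[t] - q, preds[t] + q

# Illustration on simulated point forecasts.
rng = np.random.default_rng(3)
y_sim = rng.normal(size=300)
preds_sim = y_sim + rng.normal(scale=0.5, size=300)  # noisy forecasts
lo, hi = rolling_conformal_interval(y_sim, preds_sim, t=250, calib_window=200)
print(lo, hi)
```

In a time-series application the calibration window itself should respect \(\mathcal{I}_t\): only residuals realized before the forecast origin may enter the scores.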

Example: Earnings-announcement returns with text

Short-window equity returns around earnings announcements are a canonical testbed; Siano (2025) is the cleanest recent finance application. Construct \(z_t\) as a tone score for the earnings press release and conference-call transcript, passed through a foundation model with a fixed prompt and a fixed model version. Augment a baseline regression on prior earnings surprises, market factors, and industry dummies with \(z_t\). A genuine gain from the foundation model (over a dictionary baseline, over a static-embedding baseline) is evidence that contextual representations capture content the simpler baselines miss; Siano (2025) finds exactly this pattern. A failure to beat the dictionary baseline is also informative: it tells you that, for this sample and target, the additional modeling capacity of the foundation model does not produce additional forecast value. A closely parallel macro exercise is Ahrens et al. (2025), who augment forecasts of short-window market reactions with text-derived features of central-bank speeches; the same design choices (window, loss, baselines) apply.

Question for Reflection

You find that adding \(z_t\) improves in-sample \(R^2\) by 0.08 but does not improve rolling pseudo-out-of-sample RMSE. Give two distinct explanations, one statistical and one data-pipeline, and describe how you would distinguish them.

Statistical explanation: \(z_t\) is highly collinear with \(x_t\), so in-sample it absorbs residual noise and mechanically raises \(R^2\), while adding no conditional information about \(y_{t+h}\) once \(x_t\) is known at the relevant horizon. This is diagnosed by regressing \(z_t\) on \(x_t\): a high \(R^2\) in that auxiliary regression indicates that the in-sample improvement came from fitting additional noise rather than from new information. Data-pipeline explanation: the in-sample construction of \(z_t\) used information not available in real time (revised transcripts, a retriever trained on the full corpus), so the in-sample gain is leakage that disappears once admissibility is enforced. The two can be distinguished by re-running the in-sample regression with \(z_t\) constructed under strict admissibility: if the \(R^2\) gain collapses, the problem is pipeline contamination; if it survives, the problem is statistical.

16.8 Synthetic Agents and Expectation Formation

Foundation models can be prompted to produce synthetic survey responses: the econometrician describes a persona — a household with given demographics, a firm with given sector and size, a forecaster with given information set — and asks the model to answer a survey question as that persona. The set of resulting answers \(\{\tilde r_i\}_{i=1}^N\) is sometimes called a set of synthetic agents. Their use in economics is still experimental, and the econometric question is when, if ever, moments of \(\{\tilde r_i\}\) are informative about moments of a real survey \(\{r_i\}\). The closest work to the treatment below is Zarifhonarvar (2026), who generates household inflation expectations with large language models and compares the induced moments with those of the Michigan Survey; the patterns he documents (approximate mean agreement under careful prompting, systematic under-dispersion, underrepresentation of tail respondents) are a natural empirical counterpart to the bias decomposition given below.

A minimal formal setup. Let \((\theta_i, r_i)\) index real survey respondents, with \(\theta_i\) observable persona attributes and \(r_i\) the answer. Let \(F\) denote the true joint distribution. The econometrician chooses a distribution \(G\) over personas from which to draw synthetic respondents \(\tilde\theta_i\), and for each synthetic persona queries the model to produce \(\tilde r_i\). Write the implicit model as a conditional distribution \(q_\theta(\tilde r \mid \tilde \theta)\) induced by the foundation model and the prompt.

A bias decomposition. The synthetic mean is

\[ \mathbb{E}_{G, q_\theta}[\tilde r] = \int \mathbb{E}_{q_\theta}[\tilde r \mid \theta] \, dG(\theta), \]

while the real survey mean is

\[ \mathbb{E}_F[r] = \int \mathbb{E}_F[r \mid \theta] \, dF_\theta(\theta), \]

where \(F_\theta\) is the persona marginal of the real survey. The gap decomposes as

\[ \begin{aligned} \mathbb{E}_{G, q_\theta}[\tilde r] - \mathbb{E}_F[r] &= \underbrace{\int \mathbb{E}_{q_\theta}[\tilde r \mid \theta] \bigl(dG(\theta) - dF_\theta(\theta)\bigr)}_{\text{persona-mismatch bias}} \\ &\quad + \underbrace{\int \bigl(\mathbb{E}_{q_\theta}[\tilde r \mid \theta] - \mathbb{E}_F[r \mid \theta]\bigr) dF_\theta(\theta)}_{\text{conditional bias}}. \end{aligned} \tag{16.8}\]

The first term is a familiar representativeness problem: if the distribution of personas used to prompt the model does not match the distribution in the real survey population, the synthetic mean is biased even if the model answers each persona correctly. The second term is the more fundamental problem: conditional on a persona, the model’s answer distribution is not the real conditional answer distribution. Both terms can be nonzero even for a good model and a carefully designed prompt.
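The two terms of Equation 16.8 can be computed directly in a stylized simulation. The sketch below assumes Gaussian personas and linear conditional means (all parameter values are illustrative, not taken from any real survey); it verifies that the persona-mismatch bias and the conditional bias add up exactly to the total gap.

```python
import numpy as np

rng = np.random.default_rng(0)
N = 200_000

# Real survey: personas theta ~ F_theta = N(0, 1), true conditional mean
# E_F[r | theta] = a * theta. Synthetic pipeline (illustrative): personas
# from G = N(mu_G, 1), model conditional mean E_q[r~ | theta] = b*theta + c,
# so both bias terms in Equation 16.8 are nonzero.
a, mu_G, b, c = 0.8, 0.5, 0.6, 0.3

theta_F = rng.normal(0.0, 1.0, N)   # real persona marginal F_theta
theta_G = rng.normal(mu_G, 1.0, N)  # econometrician's persona draw G

def m_q(th):  # model conditional mean
    return b * th + c

def m_F(th):  # true conditional mean
    return a * th

# Equation 16.8: total gap = persona-mismatch bias + conditional bias.
total_gap = m_q(theta_G).mean() - m_F(theta_F).mean()
persona_mismatch = m_q(theta_G).mean() - m_q(theta_F).mean()
conditional_bias = (m_q(theta_F) - m_F(theta_F)).mean()

print(total_gap, persona_mismatch + conditional_bias)  # identical by construction
```

Here the answer noise integrates out, so only conditional means are simulated; the analytic gap is \(b\,\mu_G + c = 0.6\), and the Monte Carlo estimate matches it up to sampling error.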

Consequences for empirical practice. Three implications follow.

  1. Synthetic agents should be used primarily for exploring heterogeneity within a population whose marginal distribution is known, not for replacing a real survey sample.
  2. Even within a known population, validating means is not enough. Variances, quantiles, and conditional patterns should be compared to real survey moments where possible; persistent underestimation of variance is a typical failure mode.
  3. Generated answers can be contaminated by training-data artifacts: the model may echo public-opinion distributions that prevailed during training rather than anything about the prompted persona. This is a form of training-cutoff leakage closely related to the discussion of retrieval in Section 16.6.
Example: Synthetic inflation-expectations respondents

The Michigan Survey of Consumer Expectations reports cross-sectional distributions of household inflation expectations. Prompting a foundation model with personas drawn from the Michigan demographic marginals produces a set of synthetic one-year-ahead inflation expectations. Comparing the mean of the synthetic distribution to the Michigan mean tests the persona-mismatch bias term in Equation 16.8; comparing conditional means by age, income, or education bracket tests the conditional-bias term. Zarifhonarvar (2026) reports exactly these comparisons: means can be approximately matched with careful prompting, variances are systematically understated, and tail respondents (very high or very low expectations) are under-represented. This is useful information about the model’s calibration, not a statement that synthetic respondents can substitute for the actual survey.

16.9 Summary

Key Takeaways
  1. A language model is a probability distribution over token sequences. Its log-likelihood is a sum of conditional token log-probabilities, its average negative log-likelihood is cross-entropy, and perplexity is the exponential of that average. Perplexity inherits all the interpretational caveats of any predictive score: it requires a well-defined test set and is sensitive to regime change.
  2. Self-attention constructs contextual embeddings as relevance-weighted averages of value vectors. The three-token example in Derivation B shows the mechanism explicitly; everything that follows in a foundation model is a more elaborate version of that construction.
  3. Foundation-model outputs used as regressors are best modeled as noisy measurements of a latent economic concept: \(z_t = s_t + u_t\). The OLS slope in a forecasting regression on \(z_t\) is attenuated toward zero by the reliability ratio \(\lambda = \sigma_s^2 / (\sigma_s^2 + \sigma_u^2)\).
  4. Attenuation matters for parameter estimation but not for the qualitative question of whether \(z_t\) has predictive content. A consistent non-zero slope on \(z_t\) is evidence of predictive content for \(y_{t+h}\), even though the slope is not an estimate of the coefficient on the latent concept.
  5. Retrieval and labeling must be \(\mathcal{I}_t\)-measurable — not only the set of retrieved documents \(D_t\), but also the retriever parameters \(\psi_t\) that produce it and the foundation-model parameters \(\theta\) that consume it. Pre-training contamination is a genuine concern and should be treated as a first-class evaluation constraint.
  6. Synthetic survey responses suffer from both a persona-mismatch bias and a conditional bias; their best use is for within-population heterogeneity exploration, not for replacing a real survey sample.
Common Pitfalls
  • Treating a classification label as a direct observation of the underlying economic concept. The measurement-error view is almost always more accurate and changes how the resulting regressions should be read.
  • Using a retriever or embedding model trained on a corpus that overlaps with the evaluation window. Even if retrieved documents themselves predate the forecast origin, the retriever that selects them must also be \(\mathcal{I}_t\)-measurable.
  • Silently switching foundation-model versions in the middle of a forecast sample. The measurement error \(u_t\) then has a time-varying distribution, and regression coefficients mix regimes.
  • Evaluating a text-augmented forecast against an AR(\(p\)) baseline only. A dictionary baseline and a static-embedding baseline are necessary to attribute any gain to the foundation model rather than to text in general.
  • Treating perplexity as a universal quality metric. Perplexity evaluated on a test set whose distribution differs from the training distribution is not comparable across models and can be dominated by a small number of rare tokens.
  • Reading coefficients on \(z_t\) as causal or structural. Foundation models are noisy measurement devices, not identification strategies.

16.10 Exercises

Exercise 16.1: Sequence probabilities, cross-entropy, and perplexity

Consider a vocabulary \(\mathcal{V} = \{a, b, c, d, e\}\) of size \(K = 5\). A language model \(p_\theta\) processes the sequence \(w_{1:8} = (a, b, a, c, d, a, b, e)\) and returns the following one-step-ahead conditional probabilities assigned to the observed token at each position:

\(t\) Observed \(w_t\) \(p_\theta(w_t \mid w_{1:t-1})\)
1 \(a\) 0.40
2 \(b\) 0.25
3 \(a\) 0.50
4 \(c\) 0.20
5 \(d\) 0.15
6 \(a\) 0.50
7 \(b\) 0.30
8 \(e\) 0.0001

Part 1. Compute \(\log p_\theta(w_{1:8})\) and the average negative log-likelihood \(\bar{\ell}(w_{1:8}; \theta)\). Use natural logarithms.

Part 2. Compute the perplexity \(\text{PPL}(w_{1:8}; \theta)\) and verify numerically that \(\text{PPL} = \exp(\bar{\ell})\).

Part 3. The token at position 8 is assigned probability \(10^{-4}\). Quantify its contribution to \(\bar{\ell}\) and to \(\text{PPL}\). Discuss whether perplexity is a robust evaluation criterion on text that contains rare events, for example monetary-policy statements issued during a crisis whose vocabulary departs sharply from the training corpus. Your answer should connect to the definition of cross-entropy and to the limit argument that motivates perplexity as an effective vocabulary size.

Exam-level. The numerical computation in Parts 1 and 2 is standard; Part 3 is the diagnostic part.

Compute \(-\log(0.0001) \approx 9.21\) and compare to the average of the other seven terms. Then ask what happens to \(\text{PPL}\) if you replace the last token’s probability with \(10^{-2}\) instead; the change in \(\text{PPL}\) is the diagnostic quantity.

Part 1. The log-probabilities are

\[ \log(0.40), \log(0.25), \log(0.50), \log(0.20), \log(0.15), \log(0.50), \log(0.30), \log(0.0001), \]

which numerically are approximately

\[ -0.9163, -1.3863, -0.6931, -1.6094, -1.8971, -0.6931, -1.2040, -9.2103. \]

The sum is \(\log p_\theta(w_{1:8}) \approx -17.6098\), and

\[ \bar{\ell}(w_{1:8}; \theta) = \frac{17.6098}{8} \approx 2.2012. \]

Part 2. Perplexity is

\[ \text{PPL} = \exp(2.2012) \approx 9.036. \]

Direct verification: \(\text{PPL} = (0.40 \cdot 0.25 \cdot 0.50 \cdot 0.20 \cdot 0.15 \cdot 0.50 \cdot 0.30 \cdot 0.0001)^{-1/8}\), which evaluates to the same value.

Part 3. The last token contributes \(-\log(0.0001) \approx 9.2103\) to the sum, i.e., \(9.2103 / 8 \approx 1.151\) to \(\bar{\ell}\). If instead the last-token probability had been \(10^{-2}\), the contribution would have been \(\log(100)/8 \approx 0.576\). The difference in \(\bar{\ell}\) is about \(0.576\), translating into a change in \(\text{PPL}\) from about \(9.04\) to about \(\exp(2.2012 - 0.576) \approx 5.08\).

The diagnostic point is that a single rare token shifts perplexity by nearly a factor of two. Perplexity is a geometric mean of reciprocal probabilities, so it is dominated by the smallest assigned probabilities. On crisis-period text, the vocabulary includes tokens that occur rarely or not at all in training; these will typically receive small probabilities under the trained model. Perplexity computed on such text is a valid summary of predictive loss only if the assumption of comparable training and test token distributions holds. Under regime change, cross-entropy \(H(p^\star, p_\theta)\) is computed against a different \(p^\star\) than during training, and the resulting perplexity is not on the same scale as in-distribution perplexity. It is therefore not directly comparable across regimes and should be supplemented with token-level diagnostics (e.g., the fraction of mass placed on tokens with probability below some threshold).
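The arithmetic in Parts 1–3 can be checked in a few lines, including the geometric-mean identity used in the direct verification of Part 2:

```python
import numpy as np

# One-step-ahead probabilities assigned to the observed tokens (Exercise 16.1).
p = np.array([0.40, 0.25, 0.50, 0.20, 0.15, 0.50, 0.30, 0.0001])

log_lik = np.log(p).sum()                 # log p_theta(w_{1:8})
avg_nll = -log_lik / len(p)               # cross-entropy, \bar{\ell}
ppl = np.exp(avg_nll)                     # perplexity
ppl_direct = np.prod(p) ** (-1 / len(p))  # geometric-mean identity

print(f"{log_lik:.4f} {avg_nll:.4f} {ppl:.3f}")  # -17.6098 2.2012 9.036

# Part 3 diagnostic: replace the rare last token's probability with 1e-2.
p2 = p.copy()
p2[-1] = 1e-2
ppl2 = np.exp(-np.log(p2).mean())
print(f"{ppl2:.3f}")  # 5.081
```

The single substitution in the last token drops perplexity from about 9.04 to about 5.08, which is the factor-of-two sensitivity discussed above.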


Exercise 16.2: Measurement error in a text-derived signal

Consider an econometric model with latent regressor

\[ y_{t+1} = \alpha + \gamma\, s_t + \varepsilon_{t+1}, \]

and a foundation-model-derived measurement

\[ z_t = s_t + u_t. \]

Assume \(s_t, u_t, \varepsilon_{t+1}\) are mean-zero and mutually uncorrelated, with \(\mathrm{Var}(s_t) = \sigma_s^2\), \(\mathrm{Var}(u_t) = \sigma_u^2\), and \(\mathrm{Var}(\varepsilon_{t+1}) = \sigma_\varepsilon^2\). The econometrician estimates

\[ y_{t+1} = \alpha + \delta\, z_t + \eta_{t+1} \]

by ordinary least squares.

Part 1. Derive

\[ \mathrm{plim}\,\hat{\delta}_{\text{OLS}} = \gamma \cdot \frac{\sigma_s^2}{\sigma_s^2 + \sigma_u^2}. \]

Show all steps starting from the OLS normal equations.

Part 2. Now drop the assumption \(\mathrm{Cov}(s_t, u_t) = 0\). Suppose instead \(\mathrm{Cov}(s_t, u_t) = \rho \sigma_s \sigma_u\) with \(\rho \in (-1, 1)\). Re-derive \(\mathrm{plim}\,\hat\delta\) and determine, given \(\gamma > 0\), the sign of the bias relative to Part 1 as a function of \(\rho\). Interpret the case \(\rho > 0\) in terms of a foundation model that is systematically more confident (smaller \(|u_t|\)) on high-\(s_t\) documents.

Part 3. A researcher reports a highly significant positive \(\hat\delta\) in Part 1’s setting and concludes that “hawkish monetary-policy statements predict higher future bond yields”. Another researcher argues that because \(\hat\delta\) is attenuated, the conclusion is unwarranted. Evaluate the two positions. Your answer should distinguish clearly between the objective of estimating \(\gamma\) (inference about the latent concept) and the objective of assessing whether \(z_t\) has predictive content for \(y_{t+1}\) (a forecasting statement). Conclude which of the two objectives each researcher is pursuing and which objective the data can address.

Exam-level. Part 1 is a standard derivation; Part 2 requires care with covariance; Part 3 is the econometric-diagnostic part.

Start from \(\hat\delta_{\text{OLS}} = \mathrm{Cov}(y_{t+1}, z_t) / \mathrm{Var}(z_t)\) and compute each object using the model equations and the independence assumptions.

Predictive content is about the joint distribution of \((y_{t+1}, z_t)\): it is enough that \(\mathbb{E}[y_{t+1} \mid z_t] \neq \mathbb{E}[y_{t+1}]\). Inference about \(\gamma\) is about the coefficient on \(s_t\), a different object. A non-zero \(\hat\delta\) is consistent with the first and uninformative about the second.

Part 1. Compute the covariance:

\[ \mathrm{Cov}(y_{t+1}, z_t) = \mathrm{Cov}(\alpha + \gamma s_t + \varepsilon_{t+1},\, s_t + u_t) = \gamma\, \mathrm{Var}(s_t) = \gamma \sigma_s^2, \]

using \(\mathrm{Cov}(s_t, u_t) = 0\) and \(\mathrm{Cov}(\varepsilon_{t+1}, z_t) = 0\). The variance is

\[ \mathrm{Var}(z_t) = \mathrm{Var}(s_t) + \mathrm{Var}(u_t) = \sigma_s^2 + \sigma_u^2. \]

Hence

\[ \mathrm{plim}\,\hat{\delta}_{\text{OLS}} = \frac{\gamma \sigma_s^2}{\sigma_s^2 + \sigma_u^2} = \gamma \lambda, \]

with reliability ratio \(\lambda \in (0, 1]\).

Part 2. Now

\[ \mathrm{Cov}(y_{t+1}, z_t) = \gamma\, \mathrm{Cov}(s_t, s_t + u_t) = \gamma(\sigma_s^2 + \rho \sigma_s \sigma_u), \]

and

\[ \mathrm{Var}(z_t) = \sigma_s^2 + 2\rho \sigma_s \sigma_u + \sigma_u^2. \]

Therefore

\[ \mathrm{plim}\,\hat\delta = \gamma \cdot \frac{\sigma_s^2 + \rho \sigma_s \sigma_u}{\sigma_s^2 + 2\rho \sigma_s \sigma_u + \sigma_u^2}. \]

Comparing to Part 1’s \(\gamma \lambda\), the difference (dropping the common positive factor \(\gamma\)) is

\[ \frac{\sigma_s^2 + \rho \sigma_s \sigma_u}{\sigma_s^2 + 2\rho \sigma_s \sigma_u + \sigma_u^2} - \frac{\sigma_s^2}{\sigma_s^2 + \sigma_u^2}. \]

For \(\gamma > 0\) and \(\rho > 0\), the numerator gains a term \(\rho \sigma_s \sigma_u > 0\) while the denominator gains \(2\rho \sigma_s \sigma_u\). The net effect on \(\mathrm{plim}\,\hat\delta\) depends on which dominates. Differentiating with respect to \(\rho\) at \(\rho = 0\) gives

\[ \left.\frac{\partial}{\partial \rho}\right|_{\rho=0} = \gamma \cdot \frac{\sigma_s \sigma_u (\sigma_s^2 + \sigma_u^2) - \sigma_s^2 \cdot 2\sigma_s \sigma_u}{(\sigma_s^2 + \sigma_u^2)^2} = \gamma \cdot \frac{\sigma_s \sigma_u (\sigma_u^2 - \sigma_s^2)}{(\sigma_s^2 + \sigma_u^2)^2}. \]

So positive \(\rho\) attenuates further (moves \(\hat\delta\) toward zero) when \(\sigma_s^2 > \sigma_u^2\), and reduces the attenuation when \(\sigma_s^2 < \sigma_u^2\). Interpretation: if the foundation model is more confident on high-\(s_t\) documents, the measurement error is smaller in magnitude there, which in a low-noise regime makes the slope look even flatter, and in a high-noise regime partially offsets the baseline attenuation.

Part 3. The two researchers are pursuing different objectives. The first is making a forecasting statement: \(z_t\) carries information about \(y_{t+1}\), in the sense that \(\mathbb{E}[y_{t+1} \mid z_t] \neq \mathbb{E}[y_{t+1}]\). A consistent non-zero \(\hat\delta\) in Part 1 is evidence for this, because \(\hat\delta\) converges to \(\gamma \lambda\), which is nonzero whenever \(\gamma \neq 0\) and \(\lambda > 0\). The attenuation does not undermine this conclusion; it only says \(\hat\delta \neq \gamma\).

The second is making an inference statement about the latent concept: the coefficient of \(y_{t+1}\) on \(s_t\) is \(\gamma\), and \(\hat\delta\) does not estimate it consistently. This is correct. Without a reliability correction (or an instrument for \(s_t\), or a validation subsample where \(s_t\) is observed), \(\hat\delta\) cannot be translated into an estimate of \(\gamma\). The second researcher is right to resist the causal-style interpretation.

The data therefore support the forecasting statement but not the inference statement. This is a general feature of foundation-model-derived signals in empirical work: they are usable as predictors, and usable as evidence of predictive content, but are not in themselves estimates of structural coefficients on the latent economic concept.
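A quick Monte Carlo confirms both the Part 1 attenuation formula and the Part 2 generalization. All parameter values below are illustrative; the correlated error in Part 2 is constructed so that \(\mathrm{Cov}(s_t, u_t) = \rho \sigma_s \sigma_u\) and \(\mathrm{Var}(u_t) = \sigma_u^2\).

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 1_000_000, 0.8            # illustrative values
sigma_s, sigma_u = 1.0, 0.7          # reliability lambda = 1 / 1.49

s = rng.normal(0.0, sigma_s, n)
y = gamma * s + rng.normal(size=n)   # y_{t+1} with alpha = 0

# Part 1: uncorrelated measurement error, plim = gamma * lambda.
z = s + rng.normal(0.0, sigma_u, n)
delta_hat = np.cov(y, z)[0, 1] / np.var(z)
lam = sigma_s**2 / (sigma_s**2 + sigma_u**2)
print(delta_hat, gamma * lam)        # both approximately 0.537

# Part 2: Cov(s, u) = rho * sigma_s * sigma_u, Var(u) = sigma_u**2.
rho = 0.4
u = rho * sigma_u * s / sigma_s + np.sqrt(1 - rho**2) * sigma_u * rng.normal(size=n)
z2 = s + u
delta2 = np.cov(y, z2)[0, 1] / np.var(z2)
pred2 = gamma * (sigma_s**2 + rho * sigma_s * sigma_u) / (
    sigma_s**2 + 2 * rho * sigma_s * sigma_u + sigma_u**2)
print(delta2, pred2)                 # both approximately 0.500
```

With \(\sigma_s^2 > \sigma_u^2\) here, the positive \(\rho\) attenuates further (0.500 versus 0.537), exactly as the derivative at \(\rho = 0\) predicts.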


Exercise 16.3: Real-time information sets and the retriever

Fix the forecast origin \(t =\) end of 2019-Q4 (i.e., 2019-12-31). Consider the following documents and artifacts:

  • \(d_1\): ECB monetary-policy statement, first published 2019-10-24.
  • \(d_2\): officially revised transcript of the Q&A following \(d_1\), re-published 2020-01-15.
  • \(d_3\): news summary of \(d_1\), published 2020-02-03, discussing the ensuing market reaction.
  • \(d_4\): academic working paper released 2019-11-30 that cites \(d_1\).
  • \(E_\psi\): a sentence-embedding model trained on a general web corpus crawled through 2021-06.
  • \(p_\theta\): a foundation model whose training cutoff is 2021-09.

The forecaster constructs \(z_t = g(D_t)\) by (i) retrieving from a document archive the top-\(k\) documents most similar to a query “ECB policy stance” under \(E_\psi\), (ii) prompting \(p_\theta\) with those documents to produce a hawkish/dovish score.

Part 1. Determine, for each of \(d_1, d_2, d_3, d_4\), whether the physical document is \(\mathcal{I}_t\)-measurable. State the timestamp that justifies each decision.

Part 2. Suppose the admissibility condition is interpreted strictly: both the document set \(D_t\) and the retriever parameters \(\psi\) that produced it must be \(\mathcal{I}_t\)-measurable. Is the retrieval step admissible as specified? Write down the admissibility condition on the retriever and explain which of its inputs (training corpus, query, similarity metric) violate the condition here.

Part 3. Is the foundation model \(p_\theta\) admissible? Construct a pseudo-out-of-sample protocol that restores validity. Be explicit about (i) which objects must be re-estimated at each forecast origin, (ii) which objects may be held fixed across origins, and (iii) how you would document the resulting protocol so that another researcher could verify admissibility.

Exam-level. Parts 1 and 2 are mostly mechanical once the admissibility condition is taken seriously; Part 3 requires a structured answer.

Think of the retriever as a function whose behavior depends on its parameters \(\psi\). The admissibility condition must apply to the function, not only to its outputs. If \(\psi\) depends on post-\(t\) information, two different forecast origins using the same \(\psi\) are both running a retriever that was not available in real time.

Part 1.

  • \(d_1\): published 2019-10-24, before \(t =\) 2019-12-31. Admissible in its 2019-10-24 form.
  • \(d_2\): revision re-published 2020-01-15, after \(t\). Not admissible in its revised form, even though it concerns an event that occurred before \(t\). If the original (pre-revision) transcript was published before \(t\), that original is admissible; the revision is not.
  • \(d_3\): published 2020-02-03, after \(t\). Not admissible. It also contains information about market reactions that occurred after \(t\), which is a second, independent violation.
  • \(d_4\): published 2019-11-30, before \(t\). The physical working paper is admissible. The fact that it cites \(d_1\) is fine; citations to earlier documents do not violate the admissibility condition.

Part 2. The retriever as specified is not admissible. The admissibility condition is: for each forecast origin \(t\), the retriever’s parameters \(\psi_t\) must be \(\mathcal{I}_t\)-measurable, i.e., they must have been estimable from information available on or before \(t\).

The violations: (i) \(E_\psi\) was trained on a web corpus crawled through 2021-06, which post-dates 2019-12-31. The similarity metric used for selection at \(t\) therefore depends on future corpora. (ii) The query string “ECB policy stance” is itself admissible if written before \(t\), but the way it is mapped to a vector depends on \(\psi\) and inherits the same violation. (iii) The similarity metric itself (cosine similarity in the \(E_\psi\) embedding space) depends on \(\psi\).

To restore admissibility: replace \(E_\psi\) with a retriever \(E_{\psi_t}\) trained on a corpus restricted to pre-\(t\) text, or use a retriever whose parameters are themselves \(\mathcal{I}_t\)-measurable (e.g., TF-IDF computed on the pre-\(t\) archive).

Part 3. The foundation model \(p_\theta\) is not admissible for the 2019-Q4 forecast origin because its training cutoff is 2021-09. Any knowledge about events between \(t\) and 2021-09 that is encoded in \(\theta\) contaminates \(p_\theta\).

A valid protocol:

(i) Objects re-estimated at each forecast origin \(t\):

  • the retriever parameters \(\psi_t\), trained on a corpus of documents published on or before \(t\);
  • the predictive-regression parameters \(\hat\theta_t\) in Equation 16.7;
  • labeling thresholds if \(z_t\) is a discretized score.

(ii) Objects held fixed across origins (once chosen):

  • the prompt template used to elicit \(z_t\);
  • the foundation-model architecture, although the model parameters \(\theta\) must be chosen so that the training cutoff of \(\theta\) precedes every origin \(t\) in the evaluation window;
  • the preprocessing pipeline (tokenizer, normalization).

(iii) Documentation requirements:

  • publication timestamps for every document in the archive, together with revision flags;
  • the training cutoff of \(\theta\), demonstrably before the earliest forecast origin;
  • the training corpus of \(\psi_t\) at each origin;
  • the exact prompt template.

If the evaluation window extends back further than the earliest foundation model with a defensible cutoff, the honest conclusion is that a pseudo-out-of-sample evaluation on this window cannot be conducted with current models, and the researcher should either restrict the window or state the limitation explicitly. It is not admissible to use a model whose training cutoff post-dates the forecast origin and hope that the contamination is small.
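The timestamp logic of Parts 1 and 2 reduces to a single measurability check, sketched below with hypothetical record types: each artifact carries one availability date, which is the publication date for a document and the training cutoff for a model, and admissibility at origin \(t\) is simply a comparison against that date.

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class Artifact:
    name: str
    available: date       # publication date; training cutoff for models
    is_model: bool = False

def admissible(artifact: Artifact, origin: date) -> bool:
    """An artifact is I_t-measurable iff it was available on or before t."""
    return artifact.available <= origin

t = date(2019, 12, 31)   # forecast origin: end of 2019-Q4
archive = [
    Artifact("d1: ECB statement", date(2019, 10, 24)),
    Artifact("d2: revised transcript", date(2020, 1, 15)),
    Artifact("d3: news summary", date(2020, 2, 3)),
    Artifact("d4: working paper", date(2019, 11, 30)),
    Artifact("E_psi retriever", date(2021, 6, 30), is_model=True),
    Artifact("p_theta foundation model", date(2021, 9, 30), is_model=True),
]
D_t = [a.name for a in archive if admissible(a, t)]
print(D_t)  # only d1 and d4 survive; both model artifacts are excluded
```

A real pipeline would additionally track revision flags per document (so that \(d_2\)'s pre-revision original can be admitted while the revision is not), but the filter above is the core of the protocol's documentation requirement.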

References

Ahrens, Maximilian, Deniz Erdemlioglu, Michael McMahon, Christopher J. Neely, and Xiye Yang. 2025. “Mind Your Language: Market Responses to Central Bank Speeches.” Journal of Econometrics 249: 105921. https://doi.org/10.1016/j.jeconom.2024.105921.
Bertsch, Christoph, Isaiah Hull, Robin L. Lumsdaine, and Xin Zhang. 2025. “Central Bank Mandates and Monetary Policy Stances: Through the Lens of Federal Reserve Speeches.” Journal of Econometrics 249: 105948. https://doi.org/10.1016/j.jeconom.2025.105948.
Haghighi, Maryam, Andreas Joseph, George Kapetanios, Christopher Kurz, Michele Lenza, and Juri Marcucci. 2025. “Machine Learning for Economic Policy.” Journal of Econometrics 249: 105970. https://doi.org/10.1016/j.jeconom.2025.105970.
Siano, Federico. 2025. “The News in Earnings Announcement Disclosures: Capturing Word Context Using LLM Methods.” Management Science 71 (11): 9831–55. https://doi.org/10.1287/mnsc.2024.05417.
Zarifhonarvar, Ali. 2026. “Generating Inflation Expectations with Large Language Models.” Journal of Monetary Economics 157: 103859. https://doi.org/10.1016/j.jmoneco.2025.103859.