17 Data Sets Used in This Book

This chapter is a reference appendix for the file-backed datasets shipped with the book repository. Earlier chapters link here instead of re-describing the source each time. Synthetic datasets generated inside code chunks are not listed here; they are documented where they appear.

17.1 FRED-QD

FRED-QD is the quarterly U.S. macroeconomic database developed by McCracken and Ng (2020). A repository copy of the release used in these notes is stored at data/fred_qd_current.csv.

  • Unit of observation: calendar quarter
  • Coverage: U.S. macro-financial series, 1959Q1 onward
  • Frequency: quarterly
  • File layout: the CSV contains two metadata rows before the actual data. The first row stores factor-group identifiers and the second row stores the authors’ suggested transformation codes. The sasdate column holds the end-of-quarter date.
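The layout described in the last bullet can be unpacked as in the sketch below. It is an illustration rather than the book's loader: the helper name split_fred_qd is hypothetical, the tiny CSV is made up (its labels and values are not taken from the shipped file), and it assumes the two metadata rows appear in the order given above, factor groups first and transformation codes second.

```python
import io

import pandas as pd


def split_fred_qd(source):
    """Split a FRED-QD style CSV into (factor_groups, tcodes, data).

    Assumes the layout described in the text: a header row, one row of
    factor-group identifiers, then one row of transformation codes.
    """
    raw = pd.read_csv(source)
    value_cols = [c for c in raw.columns if c != "sasdate"]

    # Rows 0 and 1 of the parsed frame are metadata, not observations.
    factor_groups = pd.to_numeric(raw.loc[0, value_cols], errors="coerce")
    tcodes = pd.to_numeric(raw.loc[1, value_cols], errors="coerce")

    # Everything from row 2 onward is data.
    data = raw.iloc[2:].copy()
    data["sasdate"] = pd.to_datetime(data["sasdate"])
    data[value_cols] = data[value_cols].apply(pd.to_numeric, errors="coerce")
    return factor_groups, tcodes, data.reset_index(drop=True)


# Made-up file in the same layout; values are illustrative only.
demo_csv = io.StringIO(
    "sasdate,GDPC1,UNRATE\n"
    "factors,0,1\n"
    "transform,5,2\n"
    "1959-03-31,100.0,5.0\n"
    "1959-06-30,101.0,5.2\n"
)
factor_groups, tcodes, data = split_fred_qd(demo_csv)
print(tcodes.to_dict())
print(data)
```

Keeping the metadata rows as separate objects makes it possible to inspect the suggested transformation codes without letting them contaminate the numeric columns.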

The chapters in the book use the subset of series listed below. Full variable documentation is available in the FRED-QD release notes.

Mnemonic   Description
GDPC1      Real gross domestic product, chained dollars
CPIAUCSL   Consumer price index, all urban consumers
UNRATE     Civilian unemployment rate
FEDFUNDS   Effective federal funds rate
GS10       10-year Treasury constant-maturity yield
TB3MS      3-month Treasury bill secondary-market rate
VIXCLSx    CBOE volatility index (VIX), quarterly closing value

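Figure 17.1 plots the series used in the book on a common standardized scale (each series is demeaned and divided by its sample standard deviation) so that their timing can be compared across the sample. Magnitudes are not directly comparable on this scale, and the plot covers only the quarters in which all six plotted series are observed.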

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

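# Read the CSV, then drop the two metadata rows (factor groups, transformation codes).
raw = pd.read_csv("data/fred_qd_current.csv")
fred_qd = raw.iloc[2:].copy()
fred_qd["sasdate"] = pd.to_datetime(fred_qd["sasdate"])

# Coerce every data column to numeric; unparseable entries become NaN.
for column in fred_qd.columns:
    if column != "sasdate":
        fred_qd[column] = pd.to_numeric(fred_qd[column], errors="coerce")

# Sort chronologically so that .diff() compares consecutive quarters.
fred_qd = fred_qd.sort_values("sasdate").reset_index(drop=True)

# Annualized quarterly log differences, in percent.
gdp_growth = 400 * np.log(fred_qd["GDPC1"]).diff()
inflation = 400 * np.log(fred_qd["CPIAUCSL"]).diff()
term_spread = fred_qd["GS10"] - fred_qd["TB3MS"]  # 10-year minus 3-month yield
log_vix = np.log(fred_qd["VIXCLSx"])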

plot_df = pd.DataFrame(
    {
        "date": fred_qd["sasdate"],
        "GDP growth": gdp_growth,
        "Inflation": inflation,
        "Unemployment": fred_qd["UNRATE"],
        "Fed funds": fred_qd["FEDFUNDS"],
        "Term spread": term_spread,
        "Log VIX": log_vix,
    }
).dropna()

standardized = plot_df.copy()
for column in standardized.columns[1:]:
    standardized[column] = (
        standardized[column] - standardized[column].mean()
    ) / standardized[column].std()

fig, ax = plt.subplots(figsize=(11.4, 4.8))
for column in standardized.columns[1:]:
    ax.plot(standardized["date"], standardized[column], linewidth=1.5, label=column)

ax.set_xlabel("Quarter")
ax.set_ylabel("Standardized value")
ax.grid(True, alpha=0.2)
ax.legend(frameon=False, ncol=3, loc="upper left")
plt.tight_layout()
plt.show()

Figure 17.1: Selected FRED-QD series used in the book, standardized to mean zero and variance one. Real GDP growth is the annualized log difference of GDPC1; inflation is the annualized log difference of CPIAUCSL; the remaining series are shown as levels, with the term spread constructed as GS10 - TB3MS and log VIX as log(VIXCLSx).

17.2 Loading FRED-QD

The snippet below is the canonical way to load FRED-QD in this book. It skips the two metadata rows, parses sasdate as a date, and coerces the remaining columns to numeric.

import pandas as pd

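# Read the CSV; the first two rows after the header are metadata, not observations.
raw = pd.read_csv("data/fred_qd_current.csv")
fred_qd = raw.iloc[2:].copy()
fred_qd["sasdate"] = pd.to_datetime(fred_qd["sasdate"])

# Coerce every data column to numeric; unparseable entries become NaN.
for column in fred_qd.columns:
    if column != "sasdate":
        fred_qd[column] = pd.to_numeric(fred_qd[column], errors="coerce")

# Sort chronologically and reset the index.
fred_qd = fred_qd.sort_values("sasdate").reset_index(drop=True)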

The authors’ transformation codes in the second metadata row are not applied here. Individual chapters construct the transformations they need directly from the raw levels so that the forecast origin and information set are explicit.
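For readers who nevertheless want the suggested transformations, the sketch below maps the seven transformation codes to operations on a series of raw levels, following the code definitions in the FRED-MD/FRED-QD documentation. The function name apply_tcode and the example levels are hypothetical and not part of the book's code; check the release notes of the release you use before relying on the mapping.

```python
import numpy as np
import pandas as pd


def apply_tcode(x: pd.Series, code: int) -> pd.Series:
    """Apply a McCracken-Ng style transformation code to raw levels."""
    if code == 1:  # level, no transformation
        return x
    if code == 2:  # first difference
        return x.diff()
    if code == 3:  # second difference
        return x.diff().diff()
    if code == 4:  # log level
        return np.log(x)
    if code == 5:  # first difference of logs
        return np.log(x).diff()
    if code == 6:  # second difference of logs
        return np.log(x).diff().diff()
    if code == 7:  # first difference of the simple growth rate
        return (x / x.shift(1) - 1.0).diff()
    raise ValueError(f"unknown transformation code: {code}")


# Made-up levels, for illustration only.
levels = pd.Series([100.0, 101.0, 103.0, 106.0])
log_growth = apply_tcode(levels, 5)
print(400 * log_growth)  # annualized quarterly log growth, in percent
```

Code 5 scaled by 400 reproduces the annualized log difference used for GDP growth and inflation elsewhere in this chapter. Each differencing step costs one observation at the start of the sample, which matters when aligning series with different codes.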

17.3 SPY TAQ 5-Minute Realized Measures

The repository also contains scripts for constructing realized-volatility measures from WRDS TAQ millisecond trade data for the SPDR S&P 500 ETF (SPY). The raw TAQ data and derived realized-measure files are not shipped with the public book because TAQ access is licensed through WRDS. The expected local paths are data/taq_spy/SPY_daily_measures.csv for the daily CSV used in the neural-network illustration and data/taq_spy/SPY_5min_daily_rv.parquet for the Parquet output produced by the 5-minute compilation script. The whole data/taq_spy/ directory is excluded from git.

  • Unit of observation: trading day
  • Underlying data: WRDS TAQ millisecond trades, queried from taqmsec.ctm_YYYYMMDD
  • Instrument: SPY
  • Frequency used for realized measures: 5-minute intraday prices
  • Coverage: 2015-01-02 through 2024-12-31
  • Construction scripts: scripts/download_spy_taq_wrds.R downloads and cleans TAQ trades; scripts/compile_spy_5min_rv_parquet.R compiles daily realized measures from the local 5-minute bar files.

The two local files have different column sets.

Daily CSV (SPY_daily_measures.csv) — used by the neural-network illustration chapter:

Variable   Description
date       Trading day
rv         Realized variance, sum of squared 5-minute log returns in percent
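A minimal sketch of reading this file in Python follows. The helper name load_spy_daily_measures is hypothetical, and the example runs on a made-up two-row CSV because the real file is not distributed; only the two documented columns, date and rv, are assumed.

```python
import io

import pandas as pd


def load_spy_daily_measures(source="data/taq_spy/SPY_daily_measures.csv"):
    """Read the daily realized-variance CSV and return it sorted by date."""
    daily = pd.read_csv(source)

    # Fail early if the documented columns are absent.
    missing = {"date", "rv"} - set(daily.columns)
    if missing:
        raise ValueError(f"missing expected columns: {sorted(missing)}")

    daily["date"] = pd.to_datetime(daily["date"])
    return daily.sort_values("date").reset_index(drop=True)


# Made-up rows in the documented layout; values are illustrative only.
demo_csv = io.StringIO("date,rv\n2015-01-05,1.20\n2015-01-02,0.85\n")
daily = load_spy_daily_measures(demo_csv)
print(daily)
```

Checking the column names up front gives a clearer error than a later KeyError if a locally built file uses different headers.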

Parquet file (SPY_5min_daily_rv.parquet) — output of the 5-minute compilation script:

Variable          Description
date              Trading day
rv_5min           Realized variance, sum of squared 5-minute log returns in percent
bv_5min           Bipower variation, computed from adjacent absolute 5-minute returns
rq_5min           Realized quarticity proxy based on fourth powers of 5-minute returns
intraday_return   Open-to-close log return in percent
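To make the column definitions concrete, the sketch below computes the four measures from one day's 5-minute prices. It is written in Python for illustration and is not a port of the R scripts: the function name daily_realized_measures is hypothetical, the prices are made up, and the scalings (pi/2 for bipower variation, M/3 for quarticity, with M the number of intraday returns) are the standard textbook conventions, assumed here rather than confirmed from the repository. It also assumes that "in percent" refers to the returns, and that the open-to-close return runs from the first to the last price on the 5-minute grid.

```python
import numpy as np


def daily_realized_measures(prices):
    """Compute realized measures from one day's 5-minute prices.

    Returns are log returns in percent. The bipower and quarticity
    scalings are textbook conventions and an assumption of this sketch.
    """
    prices = np.asarray(prices, dtype=float)
    r = 100.0 * np.diff(np.log(prices))  # 5-minute log returns, percent
    m = r.size

    rv = np.sum(r**2)  # realized variance
    # Bipower variation: scaled sum of products of adjacent absolute returns.
    bv = (np.pi / 2.0) * np.sum(np.abs(r[1:]) * np.abs(r[:-1]))
    rq = (m / 3.0) * np.sum(r**4)  # realized quarticity
    intraday_return = 100.0 * np.log(prices[-1] / prices[0])
    return {
        "rv_5min": rv,
        "bv_5min": bv,
        "rq_5min": rq,
        "intraday_return": intraday_return,
    }


# Made-up price path, for illustration only.
demo_prices = [100.0, 100.1, 100.05, 100.2, 100.15]
measures = daily_realized_measures(demo_prices)
print(measures)
```

Because bipower variation multiplies adjacent absolute returns, a single large jump affects it less than it affects realized variance, which is why the two are often reported side by side.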

The Parquet file can be loaded in R as follows:

library(arrow)

spy_rv <- read_parquet("data/taq_spy/SPY_5min_daily_rv.parquet")

17.4 References

McCracken, Michael W., and Serena Ng. 2020. “FRED-QD: A Quarterly Database for Macroeconomic Research.” Working Paper 26872. National Bureau of Economic Research.