MLmetrics: Machine Learning for Econometricians

This open textbook introduces modern machine-learning methods for graduate students in econometrics. The aim is not to replace econometric reasoning with black-box prediction, but to show how flexible ML tools can be used carefully in the settings econometricians actually face: forecasting, financial risk, macroeconomic data, firm-level panels, and distributional uncertainty.

The book assumes a strong background in econometrics and statistics, but no prior exposure to machine learning. Each method is introduced from first principles and connected to familiar econometric ideas such as likelihood, loss functions, forecast evaluation, time-series dependence, and model selection.

Why Does This Book Exist?

This book grew out of two related motivations. The first was my own curiosity about how modern machine-learning methods should be adapted to the econometric problems that arise in research on forecasting, macro-finance, and predictive distributions. The second was teaching the MSc course Machine Learning at the Econometrics Institute at Erasmus University Rotterdam.

In that course, I kept running into the same problem: standard ML texts are usually written for a different audience. They often assume little prior statistical training, work mostly with independent and identically distributed data, and motivate ideas with image or text classification. The econometric issues that arise naturally in economic and financial applications - serial dependence, distributional forecasting, real-time information sets, and leakage from invalid validation designs - usually receive much less attention. At the same time, econometrics texts that discuss prediction or regularization rarely connect those topics to the broader ML toolkit, and seldom explain how methods such as neural networks or gradient boosting should be adapted when the data have the structure of a macroeconomic or financial time series.

The aim of these notes is to bridge that gap. They are not a substitute for the many strong ML resources that already exist. Instead, they frame modern machine learning in a language that econometricians can use immediately, with the guiding question: how can we use flexible methods while respecting the information sets, dependence structures, and evaluation problems that arise in econometric work?

Who This Book Is For

This book is written for MSc students, research master’s students, and early PhD students in econometrics, economics, finance, and related quantitative fields.

Readers should be comfortable with probability, mathematical statistics, regression, likelihood-based estimation, and basic time-series methods. No prior knowledge of neural networks, random forests, boosting, or conformal prediction is assumed. Readers without a background in mathematical statistics or econometrics will find the pace challenging and may want to start with a general introduction such as James et al. (2013).

What You Will Learn

Machine-learning examples are often presented in clean i.i.d. settings. Econometric applications rarely look like that. Across the book, the emphasis is on the parts of machine learning that matter when that assumption breaks down:

  • evaluating predictions when dependence, nonstationarity, and information sets matter
  • using flexible models without contaminating inference through leakage or invalid validation schemes
  • thinking about predictive distributions, scoring rules, tail risk, and uncertainty quantification
  • understanding how neural networks and tree-based methods behave in economic and financial applications
  • tuning, comparing, and diagnosing models in a way that remains statistically defensible
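
As a small illustration of the validation point above, the following is a minimal sketch of an expanding-window split, in which every training observation precedes every test observation. The helper `expanding_window_splits` and its parameters are hypothetical names introduced here for illustration; they are not taken from the book's code.

```python
def expanding_window_splits(n_obs, n_splits, test_size):
    """Yield (train_idx, test_idx) pairs for time-ordered data.

    Hypothetical helper for illustration only: each training set is an
    expanding window, and every training index precedes every test index.
    """
    first_test_start = n_obs - n_splits * test_size
    if first_test_start <= 0:
        raise ValueError("not enough observations for the requested splits")
    for k in range(n_splits):
        test_start = first_test_start + k * test_size
        train_idx = list(range(test_start))  # everything before the test block
        test_idx = list(range(test_start, test_start + test_size))
        yield train_idx, test_idx


splits = list(expanding_window_splits(n_obs=12, n_splits=3, test_size=2))
for train_idx, test_idx in splits:
    # The last training index is the forecast origin; all test points lie after it.
    print(f"train 0..{train_idx[-1]}  ->  test {test_idx}")
```

A shuffled K-fold split, by contrast, would place future observations in the training set of an earlier test point, which is one way leakage arises. scikit-learn's `TimeSeriesSplit` offers a comparable time-ordered scheme.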

Book Roadmap

The chapters are organized into four parts.

Part | Chapters | Focus
Background | Information Theory, Cross Validation, Evaluating Predictive Distributions, Optimization | Core tools for measuring uncertainty, selecting models, evaluating forecasts, and training flexible models.
Neural Networks | Feed-Forward Networks, Recurrent Networks, LSTM Networks, Empirical Time-Series Exercise, Distribution Modeling | Neural-network models for nonlinear prediction, sequential data, and distributional forecasting.
Tree-Based Methods | Decision Trees, Random Forests, Gradient Boosting, Advanced Tree-Based Methods | Interpretable trees, variance reduction by forests, stagewise boosting, and distributional tree methods.
Further Topics | Hyperparameter Optimization, Conformal Prediction, Foundation Models for Economic Text, Data Sets Used in This Book | Modern tools for tuning, uncertainty quantification, text as an econometric input, and the data sources that anchor the empirical examples.

How To Use This Book

Each chapter combines formal notation, econometric interpretation, and Python examples. The code chunks are there to make the ideas concrete, but the mathematical arguments are written so the central logic can be followed without running code.

The exercises are designed as preparation for handwritten exams. They emphasize derivations, forecast-evaluation logic, validation pitfalls, and conceptual distinctions that matter when machine-learning methods are applied to economic data.

Why Python?

Many econometricians work primarily in R, and R has excellent support for statistical modeling and time-series analysis. This book uses Python because the modern ML ecosystem - PyTorch, scikit-learn, Optuna, Hugging Face - is implemented there, and students who go on to work with deep learning or foundation models will encounter Python as the default. The focus throughout remains on model choice, loss functions, validation design, and forecast evaluation rather than on software for its own sake.

What This Book Does Not Cover

This book is not a general introduction to econometrics, and it is not a full treatment of time-series econometrics. It does not replace a dedicated treatment of ARIMA, GARCH, state-space models, causal inference, or asymptotic theory. Those topics are assumed as background when needed.

It is also not a software-engineering manual. The code examples are there to make the statistical ideas concrete, not to provide production-ready ML pipelines.

How This Book Relates to Other Resources

Several good resources cover adjacent territory, and it helps to say briefly how this book fits in.

Coqueret and Guida (2020) provide a practitioner-oriented guide to ML in asset pricing, with a focus on return prediction and portfolio formation. This book says less about portfolio construction and more about time-series validation, distributional scoring rules, conformal prediction under dependence, and foundation models for economic text.

Standard ML textbooks such as James et al. (2013) or Hands-On Machine Learning are excellent starting points. This book picks up where they leave off: it assumes readers already understand regression and likelihood, and focuses on adapting ML methods to non-i.i.d. econometric settings.

Feedback and Updates

This book is an ongoing project. Its public home is mlmetrics.org. If you spot a typo, find an unclear explanation, or have suggestions for additional material, please open an issue on GitHub: github.com/onnokleen/mlmetrics/issues. Instructors who use the book in their courses are especially welcome to get in touch.

Acknowledgements

I am grateful to my colleagues for numerous discussions that helped shape the ideas and direction of this book. I would also like to thank the team behind Tidy Finance for providing an example of what an open, carefully structured, and practically useful textbook project can look like; their work inspired me to push this project forward. Finally, I thank my students for their many comments and suggestions on earlier versions of these notes.

How To Cite This Book

If you use this book in research, teaching materials, or course syllabi, please cite it as:

Kleen, Onno. 2026. MLmetrics: Machine Learning for Econometricians. Open textbook. https://mlmetrics.org.

BibTeX:

@book{Kleen2026MLmetrics,
  author = {Kleen, Onno},
  title = {MLmetrics: Machine Learning for Econometricians},
  year = {2026},
  url = {https://mlmetrics.org},
  note = {Open textbook}
}

Author and License

Onno Kleen is an Assistant Professor of Econometrics at the Department of Econometrics at Erasmus University Rotterdam and a fellow at the Tinbergen Institute. His research focuses on time-series econometrics, distribution forecasting, and volatility modeling - the same problems that motivate much of this book.

This book is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

References

Coqueret, Guillaume, and Tony Guida. 2020. Machine Learning for Factor Investing: R Version. Chapman and Hall/CRC. https://www.mlfactor.com/.
James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Springer. https://doi.org/10.1007/978-1-4614-7138-7.