Under active development. A stable version is expected in November 2026. Feedback is welcome by email or via GitHub issues.

MLmetrics: Machine Learning for Econometricians

This open textbook introduces modern machine-learning methods for graduate students in econometrics. The aim is not to replace econometric reasoning with black-box prediction, but to show how flexible ML tools can be used carefully in the settings econometricians face in practice: forecasting, financial risk, macroeconomic data, firm-level panels, and distributional uncertainty.

The book assumes a strong background in econometrics and statistics, but no prior exposure to machine learning. Each method is introduced from first principles and connected to familiar econometric ideas such as likelihood, loss functions, forecast evaluation, time-series dependence, and model selection.

Why Does This Book Exist?

This book grew out of two related motivations. The first was curiosity about how modern machine-learning methods should be adapted to the econometric problems that arise in research on forecasting, macro-finance, and predictive distributions. The second was teaching the MSc course Machine Learning at the Econometrics Institute at Erasmus University Rotterdam.

In that course, I kept running into the same problem: standard ML texts are usually written for a different audience. They often assume little prior statistical training, work mostly with independent and identically distributed data, and motivate ideas with image or text classification. The econometric issues that arise naturally in economic and financial applications — serial dependence, distributional forecasting, real-time information sets, and leakage from invalid validation designs — usually receive much less attention. At the same time, econometrics texts that discuss prediction or regularization rarely connect those topics to the broader ML toolkit, and seldom explain how methods such as neural networks or gradient boosting should be adapted when the data have the structure of a macroeconomic or financial time series.

The aim of these notes is to fill that gap directly. They are not a substitute for the many strong ML resources that already exist. Instead, they frame modern machine learning in a language that econometricians recognize, with the guiding question: how can we use flexible methods while respecting the information sets, dependence structures, and evaluation problems that arise in econometric work?

Who This Book Is For

This book is written for students in econometrics, economics, finance, and related quantitative fields.

Readers should be comfortable with probability, mathematical statistics, regression, likelihood-based estimation, and basic time-series methods. No prior knowledge of neural networks, random forests, boosting, or conformal prediction is assumed.

What You Will Learn

Machine-learning examples are often presented in clean i.i.d. settings. Econometric applications rarely look like that. Across the book, the emphasis is on the parts of machine learning that matter when that assumption breaks down:

evaluating predictions when dependence, nonstationarity, and information sets matter
using flexible models without contaminating inference through leakage or invalid validation schemes
thinking about predictive distributions, scoring rules, tail risk, and uncertainty quantification
understanding how neural networks and tree-based methods behave in economic and financial applications
tuning, comparing, and diagnosing models in a way that remains statistically defensible
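To make the point about leakage and validation design concrete, here is a minimal sketch of an expanding-window split, in which every training observation precedes every test observation in time. The function name `expanding_window_splits` and its parameters are hypothetical and chosen only for illustration; they are not taken from the book's own code, and the chapters may use a different scheme.

```python
from typing import Iterator, List, Tuple


def expanding_window_splits(
    n_obs: int, n_folds: int, min_train: int
) -> Iterator[Tuple[List[int], List[int]]]:
    """Yield (train, test) index lists with all training indices before all test indices.

    Hypothetical helper for illustration: the training window grows by one
    test block per fold, and each test block immediately follows its training window.
    """
    if min_train < 1 or n_folds < 1:
        raise ValueError("min_train and n_folds must be positive")
    test_size = (n_obs - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough observations for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * test_size  # first index of the test block
        train = list(range(train_end))
        test = list(range(train_end, train_end + test_size))
        yield train, test


# Twelve time-ordered observations, three folds, at least six training points.
splits = list(expanding_window_splits(n_obs=12, n_folds=3, min_train=6))
for train, test in splits:
    print(f"train 0..{train[-1]}  ->  test {test[0]}..{test[-1]}")
```

A randomly shuffled K-fold split would instead place future observations in the training set for some folds. With serially dependent data, that is one common way for information to leak into the validation score.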

Book Roadmap

The chapters are organized into four parts.

Part	Chapters	Focus
Background	Information Theory, Cross Validation, Evaluating Predictive Distributions, Optimization	Core tools for measuring uncertainty, selecting models, evaluating forecasts, and training flexible models.
Neural Networks	Feed-Forward Networks, Recurrent Networks, LSTM Networks, Empirical Time-Series Exercise, Distribution Modeling	Neural-network models for nonlinear prediction, sequential data, and distributional forecasting.
Tree-Based Methods	Decision Trees, Random Forests, Gradient Boosting, Advanced Tree-Based Methods	Interpretable trees, variance reduction by forests, stagewise boosting, and distributional tree methods.
Further Topics	Hyperparameter Optimization, Conformal Prediction, Foundation Models for Economic Text, Data Sets Used in This Book	Modern tools for tuning, uncertainty quantification, text as an econometric input, and the data sources that anchor the empirical examples.

How To Use This Book

Each chapter combines formal notation, econometric interpretation, and Python examples. The code chunks are there to make the ideas concrete, but the mathematical arguments are written so the central logic can be followed without running code.

The exercises are designed as preparation for handwritten exams. They emphasize derivations, forecast-evaluation logic, validation pitfalls, and conceptual distinctions that matter when machine-learning methods are applied to economic data.

Why Python?

Many econometricians work primarily in R, and R has excellent support for statistical modeling and time-series analysis. This book uses Python because much of the modern ML ecosystem (PyTorch, scikit-learn, Optuna) is built around it, and students who go on to work with deep learning or foundation models will encounter Python as the default. The focus throughout remains on model choice, loss functions, validation design, and forecast evaluation rather than on software for its own sake.

What This Book Does Not Cover

This book is not a general introduction to econometrics, and it is not a full treatment of time-series econometrics. It does not replace a dedicated treatment of ARIMA, GARCH, state-space models, causal inference, or asymptotic theory. Those topics are assumed as background when needed.

It is also not a software-engineering manual. The code examples are there to make the statistical ideas concrete, not to provide production-ready ML pipelines.

How This Book Relates to Other Resources

Several good resources cover adjacent territory, and it helps to say briefly how this book fits in.

Coqueret and Guida (2020) provide a practitioner-oriented guide to ML in asset pricing, with a focus on return prediction and portfolio formation. This book covers less on portfolio construction but more on time-series validation, distributional scoring rules, conformal prediction under dependence, and foundation models for economic text.

Standard ML textbooks such as James et al. (2013) or Hands-On Machine Learning are excellent starting points. This book picks up where they leave off: it assumes readers already understand regression and likelihood, and focuses on adapting ML methods to non-i.i.d. econometric settings.

Feedback and Updates

This book is an ongoing project. Its public home is mlmetrics.org. If you spot a typo, find an unclear explanation, or have suggestions for additional material, please open an issue on GitHub: github.com/onnokleen/mlmetrics/issues, or email me at kleen@ese.eur.nl. Instructors who use the book in their courses are especially welcome to get in touch.

Acknowledgements

I am grateful to my colleagues for discussions that shaped the ideas and direction of this book. I would also like to thank the team behind Tidy Finance for providing an example of what an open, carefully structured textbook project can look like; their work was the inspiration to start this one. Finally, I thank my students for the comments and suggestions they provided on earlier versions of these notes.

How To Cite This Book

If you use this book in research, teaching materials, or course syllabi, please cite it as:

Kleen, Onno. 2026. MLmetrics: Machine Learning for Econometricians. Open textbook. https://mlmetrics.org.

BibTeX:

@book{Kleen2026MLmetrics,
  author = {Kleen, Onno},
  title = {MLmetrics: Machine Learning for Econometricians},
  year = {2026},
  url = {https://mlmetrics.org},
  note = {Open textbook}
}

Author and License

Onno Kleen is an Assistant Professor of Econometrics at the Department of Econometrics at Erasmus University Rotterdam and a fellow at the Tinbergen Institute. His research focuses on time-series econometrics, distribution forecasting, and volatility modeling — the same problems that motivate much of this book.

This book is licensed under Creative Commons Attribution-NonCommercial-ShareAlike 4.0 International (CC BY-NC-SA 4.0).

References

Coqueret, Guillaume, and Tony Guida. 2020. Machine Learning for Factor Investing: R Version. Chapman and Hall/CRC. https://www.mlfactor.com/.

James, Gareth, Daniela Witten, Trevor Hastie, and Robert Tibshirani. 2013. An Introduction to Statistical Learning. Springer. https://doi.org/10.1007/978-1-4614-7138-7.