1 Referee Report on the mlmetrics.org Lecture Notes

1.1 Executive Summary

This report evaluates the public version of the lecture notes hosted at mlmetrics.org as of April 22, 2026. The homepage describes the project as an open textbook for MSc, research master’s, and early PhD students in econometrics, economics, and finance; it also states that the notes are still under active development, with a stable version expected in November 2026. The current structure contains 17 chapters organized into four parts: Background, Neural Networks, Tree-Based Methods, and Further Topics. The public homepage identifies the author as the econometrician Onno Kleen.

My overall judgment is positive but clearly pre-publication: the notes are already pedagogically promising, unusually well-targeted to econometric forecasting problems, and at their best when they explain why generic i.i.d.-style ML workflows fail in time-dependent economic data. The greatest strengths are the persistent emphasis on information sets, serial dependence, leakage, forecast-origin discipline, and predictive-distribution evaluation; the most successful chapter in the current public version is the conformal prediction chapter, which gives a clean exchangeability-based theorem, a rank-based proof skeleton, an explicit discussion of marginal versus conditional coverage, and a sensible warning about time-series adaptations.

The biggest weaknesses are not conceptual but editorial and production-related. The public site still shows multiple signs of an unreleased draft: duplicated top-level headings on some pages, public exposure of local file paths, a leaked Python warning in the random-forests chapter, uneven citation practice, and a few chapters whose content is still too thin to function as graduate lecture notes. The “Empirical Exercise” chapter is especially underdeveloped: in the current live version it is effectively a short pointer to a notebook plus one reflection question, not a self-contained chapter. The datasets appendix also exposes local paths and contains a duplicated heading.

From a referee perspective, I would classify the project as strong draft notes, not yet publication-ready textbook notes. The revisions needed to reach a high standard are concrete and tractable: regularize the references; remove rendering and reproducibility defects; complete the thin chapters; and tighten a handful of mathematical statements whose current wording is slightly too informal or too broad for a graduate audience. Relative to standard texts such as Probabilistic Machine Learning: An Introduction (Murphy, 2022), The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman, 2009), An Introduction to Statistical Learning (James, Witten, Hastie, and Tibshirani, 2023), Deep Learning (Goodfellow, Bengio, and Courville, 2016), and Elements of Information Theory (Cover and Thomas, 2006), the distinctive contribution of mlmetrics.org is its econometric orientation. Relative to those same texts, however, it is not yet comparable in polish, citation completeness, or consistency of mathematical exposition.

1.2 Scope and Overall Assessment

The public homepage makes the intended audience much more specific than a generic graduate ML audience: these notes are written for quantitatively trained econometrics and economics students who already know probability, mathematical statistics, regression, likelihood-based estimation, and basic time-series methods, but who may have little prior ML exposure. That framing is reflected throughout the book, and it is one of the strongest design decisions in the project. The notes consistently ask the right question for that audience: not “how do I use the latest ML package,” but “which ML workflows remain statistically defensible once dependence, publication lags, revised data, and real-time information sets matter?”

Pedagogically, the project is strongest when it leans into that comparative advantage. The random-forests, boosting, conformal, and forecasting-oriented sections repeatedly connect modern methods back to familiar econometric ideas such as likelihood, loss functions, out-of-sample design, and predictive distributions. The result is often more useful for this readership than generic ML texts, which typically assume i.i.d. data and motivate methods with image or text benchmarks rather than macro-financial prediction.

The same specialization also creates a trade-off. Readers outside econometrics will sometimes find the notes narrower than standard ML texts in breadth, and some chapters implicitly assume background knowledge in forecasting, time-series dependence, and distributional evaluation. For the stated audience, that is acceptable and often beneficial. For a broader graduate ML/statistics audience, however, the notes would need slightly more signposting, more notation reminders, and more canonical-machine-learning references to avoid feeling idiosyncratic.

The organization is coherent and generally well chosen:

Background → Neural Networks → Further Topics
Background → Tree-Based Methods → Further Topics
Further Topics → HPO, Conformal, Foundation Models, Data

The ordering mostly works. Information theory, validation, distributional evaluation, and optimization supply the right prerequisites for later chapters; the neural-network and tree-based sequences are logically arranged; and the move from core prediction tools to hyperparameter optimization, conformal prediction, foundation models, and datasets is sensible. Where the organization currently falters is not in macro-ordering but in chapter completeness: the book outline promises more uniformity than the current draft actually delivers.

1.3 Detailed Chapter-by-Chapter Critique

The chapter-by-chapter summary below synthesizes a close reading of the live site and public source files. It is intentionally selective: it highlights the most important strengths and the most pressing referee concerns for each chapter or section of the project, rather than summarizing every subsection exhaustively. The underlying evidence comes from the live site, the public source repository, and the canonical literature named later in this report.

About / front matter
  Strengths: Clear statement of audience and purpose; excellent motivation for an econometric-ML bridge.
  Concerns: Should state versioning and change policy more explicitly; the “under active development” message should ideally be paired with release tags or dated snapshots.

Information Theory
  Strengths: Good choice of preliminaries for later chapters on log scores, KL, and probabilistic forecasting.
  Concerns: Several continuous/discrete distinctions need tightening. In particular, statements about cross-entropy being negative and “MLE as KL minimization” should be phrased more carefully for graduate notes; the chapter also needs a more canonical information-theory reference base, especially beyond Murphy. Compare against Probabilistic Machine Learning: An Introduction (Murphy, 2022) and Elements of Information Theory (Cover and Thomas, 2006).

Cross Validation
  Strengths: Strong econometric instinct: the notes correctly foreground leakage, forecast-origin discipline, and time-aware validation.
  Concerns: Citation practice is too thin for such a central methodological chapter. A graduate chapter of this sort should explicitly anchor itself in nested-CV bias and dependence-aware CV references such as Varma–Simon, Cawley–Talbot, and Roberts et al.

Evaluating Predictive Distributions
  Strengths: This is an important and unusually valuable topic for econometrics students; the book’s overall emphasis on distributional evaluation is one of its most original features.
  Concerns: The chapter would benefit from a few more explicit bridges to the subsequent conformal and distributional-network chapters, plus a fuller “why proper scoring rules matter under dependence” discussion. Compare against Gneiting–Raftery.

Optimization
  Strengths: Clear intuition and useful exam-style exercises.
  Concerns: The chapter is under-referenced relative to its scope. For material covering SGD, momentum, adaptive methods, and modern optimizer design, one citation is not enough. It also needs a sharper distinction between theorem-like convergence statements and heuristic practitioner advice.

Feed-Forward Neural Networks
  Strengths: The econometric reinterpretation of FNNs as learned basis expansions is clear and effective; parameter-count explanations are pedagogically useful.
  Concerns: The universal approximation discussion should distinguish more clearly between bounded continuous activations in the Cybenko/Hornik style and the chapter’s later ReLU-centered practical emphasis. In its current form, the theorem statement is mathematically fine but pedagogically liable to blur which theorem covers which activation family.

Recurrent Neural Networks
  Strengths: The state-space/GARCH analogy is apt for the intended readership, and the discussion of forecast-origin discipline is strong.
  Concerns: Source support is too sparse. The chapter appears to rely almost entirely on the GARCH analogy and would benefit from explicit foundational references on vanilla RNNs and backpropagation through time. One empirical statement about typical sequence-length limits is too broad unless qualified more carefully.

LSTM Networks
  Strengths: Sensible bridge from vanishing gradients to gated architectures.
  Concerns: As with the RNN chapter, references are too thin relative to the conceptual load; more canonical sequencing from vanilla RNN to LSTM to attention-era architectures would help.

Empirical Exercise for Networks
  Strengths: In the live site this is currently not a real chapter: it links out to a notebook and gives one reflection prompt.
  Concerns: This section is far too slight for a graduate textbook chapter. Either expand it substantially into a real worked case study or fold it into neighboring chapters as an appendix.

Distributional Neural Networks
  Strengths: The topic choice is excellent and fits the broader distributional-evaluation strategy of the book.
  Concerns: The chapter would be stronger with a more explicit comparison among parametric likelihood-based distributional nets, quantile objectives, and later conformal wrappers; currently the coverage feels narrower than the chapter title suggests.

Decision Trees
  Strengths: Clear derivation of leaf means, impurity intuition, and pruning concepts; good econometric interpretation of threshold rules.
  Concerns: The chapter needs canonical references to CART and the related tree literature. Without them, the exposition reads more like polished lecture notes than a citable graduate chapter. Compare against The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman, 2009) and the original CART monograph.

Random Forests
  Strengths: Pedagogically strong treatment of OOB logic, weights, and the “prediction not causality” warning. The emphasis on invalid OOB use under dependence is especially good.
  Concerns: The live page currently leaks a local Python warning path into the public rendering, which materially reduces polish. The chapter also needs the canonical citation to Leo Breiman’s original random-forest paper.

Gradient Boosting
  Strengths: The residual and pseudo-residual derivations are useful; the econometric warning about time-ordered early stopping is well judged.
  Concerns: The chapter discusses gradient boosting without citing the foundational Friedman paper, while the one citation currently present is to AdaBoost. That is a conspicuous bibliographic gap for a graduate text.

Advanced Tree-Based Methods
  Strengths: Useful idea to compare quantile/distributional forests with NGBoost; the QRF/NGBoost contrast is pedagogically helpful.
  Concerns: The live page currently has a duplicated top-level heading (“NGBoost: Parametric Distributional Boosting” plus the chapter title), and the chapter is under-referenced relative to the topics it covers. A chapter on QRF and NGBoost should cite the canonical original papers directly.

Advanced Hyperparameter Optimization
  Strengths: One of the more mature chapters in the draft because it already seems better tied to the nested-CV and Bayesian-optimization literature.
  Concerns: Even here, the notes would benefit from one compact formal statement of what the outer validation loop is estimating and why inner-loop tuning invalidates naive reuse of the same folds. Compare against Varma–Simon and Cawley–Talbot.

Conformal Prediction
  Strengths: The strongest chapter in the current public version. The theorem statement, the exchangeability definition, the rank argument, the marginal-versus-conditional distinction, and the warnings for time-series applications are all carefully done. The inclusion of EnbPI is appropriate.
  Concerns: The chapter would still benefit from a slightly more explicit bridge between textbook split conformal and modern dependent-data variants, but this is refinement rather than repair.

Foundation Models for Economic Text
  Strengths: Ambitious and genuinely interesting econometric framing: sequence probabilities, embeddings as measurements, real-time retrieval discipline.
  Concerns: Not yet citation-complete relative to the standard foundation-model literature. A public graduate chapter covering attention, transformer-like representations, and LM measurement should cite the core lineage from Vaswani et al. onward, including the transformer paper, BERT, and GPT-3-style scaling.

Datasets appendix
  Strengths: Useful in principle to centralize data provenance.
  Concerns: The live page currently shows a duplicated heading (“Loading FRED-QD” above the actual chapter title), exposes local file-oriented phrasing, and is too informal for data provenance in a public textbook. It needs cleaner provenance and reproducibility language, plus explicit statements on licensing and access constraints.
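Several rows above turn on forecast-origin discipline. To make explicit the workflow I am crediting the notes for, here is a minimal walk-forward sketch in plain numpy (my own illustration, not code from the notes; the function name, the toy persistent series, and the deliberately naive mean forecaster are all invented for exposition):

```python
import numpy as np

def walk_forward_splits(n, min_train, horizon=1):
    """Yield (train_idx, test_idx) pairs with an expanding window: each
    test block lies strictly after its training window, so no future
    information leaks into the fit."""
    for origin in range(min_train, n - horizon + 1):
        yield np.arange(origin), np.arange(origin, origin + horizon)

# Toy persistent series plus a deliberately naive mean forecaster.
rng = np.random.default_rng(0)
y = np.empty(100)
y[0] = 0.0
for t in range(1, 100):
    y[t] = 0.8 * y[t - 1] + rng.standard_normal()

errors = []
for train_idx, test_idx in walk_forward_splits(len(y), min_train=50):
    forecast = y[train_idx].mean()   # the fit sees only the past
    errors.append(y[test_idx[0]] - forecast)

print(f"out-of-sample RMSE: {np.sqrt(np.mean(np.square(errors))):.3f}")
```

The point of the sketch is structural, not statistical: every fit uses only observations dated before its forecast origin, which is precisely the constraint that naive shuffled cross-validation violates on dependent data.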

1.4 Mathematical Correctness, Definitions, Proofs, and Notation

On mathematical correctness, the notes are generally reliable at the level of graduate lecture notes, especially in chapters built around standard constructions rather than highly abstract theorems. I did not find evidence of a systemic mathematical problem such as false theorem statements, broken derivations, or repeated misuse of probability notation. The most rigorous publicly visible chapter is conformal prediction, where the definition of exchangeability, the split-conformal theorem, and the rank-based coverage argument are all aligned with the standard literature. The theorem statement and the surrounding caveats are notably careful for draft notes.
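For concreteness, the standard statement that the chapter's theorem matches is the following (my paraphrase of the split-conformal literature, in my own notation, not a quotation from the notes):

```latex
% Split-conformal coverage, stated for regression with absolute-residual
% scores; notation is mine, not necessarily the notes' own.
Assume the calibration pairs and the test pair
$(X_1,Y_1),\dots,(X_m,Y_m),(X_{m+1},Y_{m+1})$ are exchangeable and that
the predictor $\hat f$ was fitted on a disjoint training split. Define
scores $s_i = \lvert Y_i - \hat f(X_i)\rvert$ and let $\hat q$ be the
$\lceil (1-\alpha)(m+1)\rceil$-th smallest of $s_1,\dots,s_m$
(assuming $\alpha \ge 1/(m+1)$). Then
\[
\mathbb{P}\!\left( Y_{m+1} \in
  \bigl[\, \hat f(X_{m+1}) - \hat q,\; \hat f(X_{m+1}) + \hat q \,\bigr]
\right) \;\ge\; 1-\alpha ,
\]
% because, by exchangeability, the rank of $s_{m+1}$ among
% $s_1,\dots,s_{m+1}$ is (sub)uniform on $\{1,\dots,m+1\}$ -- the rank
% argument the notes sketch.
```

Having this compact benchmark in mind is what makes the chapter's careful treatment of marginal (not conditional) coverage easy to verify.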

The main mathematical issue is different: precision varies by chapter. Some chapters are close to textbook-grade. Others use mathematically suggestive language without quite specifying assumptions sharply enough. The feed-forward chapter is a good example. The universal approximation theorem is attributed to the classic bounded-activation results, but the surrounding practical discussion emphasizes ReLU-heavy modern networks. A graduate reader should be told more explicitly that the classical theorem quoted there is not the same statement typically used for ReLU architectures.
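One way to phrase the requested clarification, paraphrasing the standard results rather than the notes' wording:

```latex
% Two distinct density statements that the chapter should separate.
% Classical bounded-activation form (Cybenko 1989; Hornik 1991): if
% $\sigma$ is continuous, bounded, and nonconstant, then the
% single-hidden-layer family
\[
\mathcal{N}_\sigma \;=\; \Bigl\{\, x \mapsto
  \textstyle\sum_{j=1}^{J} \beta_j\,
  \sigma\!\bigl(w_j^{\top} x + b_j\bigr)
  \;:\; J \in \mathbb{N},\ \beta_j, b_j \in \mathbb{R},\
  w_j \in \mathbb{R}^d \,\Bigr\}
\]
% is dense in $C(K)$ for every compact $K \subset \mathbb{R}^d$.
% The ReLU $\sigma(u) = \max(u, 0)$ is unbounded, so it is not covered
% by this statement; it falls instead under the nonpolynomial-activation
% version (Leshno et al., 1993), which, roughly, requires only that
% $\sigma$ not be a polynomial.
```

A one-paragraph remark along these lines would let the chapter keep its classical citation while making clear which theorem licenses the ReLU-centered practice it later recommends.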

A second recurring issue is that some claims are pedagogically correct but not quite referee-tight. In information-theoretic and optimization contexts, the notes sometimes move quickly from heuristic interpretation to theorem-like wording. That style is effective for teaching, but it benefits from one extra sentence separating exact results from informal intuition. Standard references such as Murphy and Cover–Thomas make that distinction with more consistency, and the notes would improve by borrowing that discipline.

Notation is mostly consistent within chapters. The book is better than average on introducing symbols before using them, and many chapters begin with a roadmap or notation block. Cross-chapter consistency, however, could be improved by adding a short global notation appendix or an early “notation conventions” page. At present, notation is chapter-local rather than book-global. That is workable, but a textbook of this scope benefits from a stable core vocabulary for loss functions, empirical risk, information sets, fitted distributions, and predictive targets. The same applies to terminology: the notes are usually careful about saying “predictive, not causal,” but the difference between distributional forecasting, calibrated prediction, density estimation, and uncertainty quantification could be cross-indexed more explicitly.

1.5 Comparison with Standard References

The comparison below benchmarks the notes against several standard texts that are widely used in graduate ML/statistics teaching. The point is not that mlmetrics.org must replicate those books. It should not. Its advantage is precisely that it is narrower and more econometric. The question is where that specialization clearly outperforms the standard texts, and where it currently falls short. The qualitative comparisons below synthesize the public descriptions of those books and the current public state of mlmetrics.org.

Probabilistic Machine Learning: An Introduction (Murphy, 2022)
  Relative strength of mlmetrics.org: Better focused on forecasting discipline, leakage, and econometric information sets.
  Relative weakness: Much less broad, less polished mathematically, and less comprehensive in references.

The Elements of Statistical Learning (Hastie, Tibshirani, and Friedman, 2009)
  Relative strength: Better tailored to macro-finance and dependent-data evaluation.
  Relative weakness: Weaker on canonical coverage of tree methods, model-assessment theory, and mature exposition.

An Introduction to Statistical Learning (James, Witten, Hastie, and Tibshirani, 2023)
  Relative strength: More advanced and more explicitly econometric in its validation cautions.
  Relative weakness: Less uniform pedagogically; some chapters are draft-like where ISL is clean and complete.

Deep Learning (Goodfellow, Bengio, and Courville, 2016)
  Relative strength: Better at translating ML concepts into forecasting and econometric language.
  Relative weakness: Much weaker on depth and completeness for neural-network theory, architectures, and literature lineage.

Elements of Information Theory (Cover and Thomas, 2006)
  Relative strength: Better at connecting information-theoretic notions directly to forecasting and predictive scores.
  Relative weakness: Not nearly as rigorous or complete on the underlying information theory.

The most favorable comparison for mlmetrics.org is against generic introductory ML texts on the axis of dependence-aware evaluation. The least favorable is on the axes of bibliographic completeness, editorial polish, and uniformity of chapter development. That is exactly what one would expect from a strong draft note set that has not yet been normalized into a release-quality book.

1.6 Prioritized Corrections and Suggestions

The list below is ordered by expected impact on the quality of the public notes, not merely by ease of implementation.

High — Regularize the references across the entire book: add missing canonical citations in the tree, neural-net, information-theory, and foundation-model chapters. At minimum: CART, random forests, gradient boosting, the transformer/BERT/GPT-3 lineage, and a canonical information-theory text.
  Why this matters: This is the single fastest way to make the notes feel scholarly rather than merely pedagogical. Right now, several strong chapters are under-anchored relative to the standard literature.
  Effort: Low to medium.

High — Fix public rendering defects and source reproducibility problems: remove leaked warnings and local-path artifacts; eliminate duplicated headings; make source files self-contained rather than dependent on local absolute bibliography/data paths.
  Why this matters: These defects are highly visible and avoidable. They weaken reader trust more than small mathematical imprecisions do.
  Effort: Medium.

High — Expand or restructure the thin chapters: the empirical neural-network example should become a real worked case study or be folded into adjacent chapters; the datasets appendix needs fuller provenance and less local-machine language.
  Why this matters: The current outline promises chapter-level completeness that the live book does not yet consistently deliver.
  Effort: Medium to high.

High — Preserve and foreground the econometric comparative advantage: the strongest parts of the notes are the repeated warnings about information sets, leakage, and dependent data; these should remain central and perhaps be summarized in a recurring “econometric checklist” box at the end of each applied chapter.
  Why this matters: This is what differentiates the project from standard ML texts.
  Effort: Low.

Medium — Tighten theorem-adjacent wording in a few places: clarify where statements are exact, where they are heuristics, and which assumptions matter. The feed-forward chapter’s universal-approximation discussion is the clearest case.
  Why this matters: This would noticeably improve referee confidence without changing the exposition style.
  Effort: Low to medium.

Medium — Add one global notation guide and one glossary of predictive objects.
  Why this matters: The current chapter-local notation is serviceable, but a shared notation layer would reduce re-learning costs across chapters.
  Effort: Medium.

Medium — Standardize end-of-chapter elements: every chapter should have a roadmap, key takeaways, common pitfalls, exercises, and references/further reading.
  Why this matters: The strongest chapters already do this; the reader notices when weaker chapters do not.
  Effort: Medium.

Medium — Strengthen chapter cross-references: explicitly link information theory ↔ predictive scores ↔ conformal ↔ distributional networks ↔ advanced tree methods.
  Why this matters: The conceptual architecture is present, but its internal weave could be made more visible.
  Effort: Low.

Lower — Broaden exercise types slightly: the current exercises are very exam-friendly and mostly good; adding a few structured computational or “replicate this figure responsibly” exercises would help bridge theory and practice.
  Why this matters: Useful, but less urgent than citations and production cleanup.
  Effort: Medium.
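On the nested-validation point raised in the chapter critique, the kind of compact demonstration I have in mind could look like the following (entirely my own numpy sketch with hypothetical function names; closed-form ridge stands in for an arbitrary learner, and the shuffled folds assume i.i.d. data, which the notes would replace with time-ordered or blocked folds):

```python
import numpy as np

def kfold_indices(n, k, rng):
    """Shuffled k-fold partition of {0, ..., n-1} (i.i.d. setting only)."""
    return np.array_split(rng.permutation(n), k)

def ridge_fit_predict(X_tr, y_tr, X_te, lam):
    """Closed-form ridge: beta = (X'X + lam*I)^{-1} X'y."""
    d = X_tr.shape[1]
    beta = np.linalg.solve(X_tr.T @ X_tr + lam * np.eye(d), X_tr.T @ y_tr)
    return X_te @ beta

def nested_cv_mse(X, y, lambdas, outer_k=5, inner_k=5, seed=0):
    """Outer folds estimate the risk of the *whole* tuning-plus-fitting
    procedure; lambda is chosen by the inner folds using outer-training
    data only, so the outer test folds never influence tuning."""
    rng = np.random.default_rng(seed)
    outer_mse = []
    for test_idx in kfold_indices(len(y), outer_k, rng):
        train_idx = np.setdiff1d(np.arange(len(y)), test_idx)
        X_tr, y_tr = X[train_idx], y[train_idx]
        inner_folds = kfold_indices(len(train_idx), inner_k, rng)
        # Inner loop: score every lambda on the same inner folds.
        inner_scores = []
        for lam in lambdas:
            fold_mse = []
            for val_idx in inner_folds:
                fit_idx = np.setdiff1d(np.arange(len(train_idx)), val_idx)
                pred = ridge_fit_predict(X_tr[fit_idx], y_tr[fit_idx],
                                         X_tr[val_idx], lam)
                fold_mse.append(np.mean((y_tr[val_idx] - pred) ** 2))
            inner_scores.append(np.mean(fold_mse))
        best_lam = lambdas[int(np.argmin(inner_scores))]
        # Outer evaluation: refit on all outer-training data, score once.
        pred = ridge_fit_predict(X_tr, y_tr, X[test_idx], best_lam)
        outer_mse.append(np.mean((y[test_idx] - pred) ** 2))
    return float(np.mean(outer_mse))

rng = np.random.default_rng(1)
X = rng.standard_normal((120, 5))
y = X @ np.array([1.0, 0.5, 0.0, 0.0, -0.5]) + rng.standard_normal(120)
print(f"nested-CV MSE estimate: {nested_cv_mse(X, y, [0.01, 0.1, 1.0, 10.0]):.3f}")
```

The design point is that the outer average estimates the risk of the entire procedure, tuning included; reusing the inner-fold scores as the final performance claim would be optimistically biased in exactly the Varma–Simon and Cawley–Talbot sense.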

A few specific edits would have especially high payoff. In the random-forests chapter, remove the rendered warning block from the public page and add the original random-forest citation by Breiman. In the gradient-boosting chapter, add Friedman’s 2001 paper and distinguish more clearly between AdaBoost as a special case and gradient boosting as the main object of the chapter. In the feed-forward chapter, add a one-paragraph note explaining that the classical universal-approximation theorem quoted there is not the only theorem relevant to modern ReLU networks. In the foundation-models chapter, add the minimal canonical lineage: transformer, BERT, and GPT-3 or equivalent scaling-era reference points.

1.7 Open Questions and Review Limitations

This report evaluates the public site and public source state as visible on April 22, 2026. The homepage itself warns that the book is still under active development, so some issues identified here may already be on the author’s revision list. I reviewed the live rendered pages and source files, but I did not re-run every notebook or code chunk end-to-end in a clean environment, so this report should be read primarily as a referee assessment of mathematical exposition, scholarly framing, pedagogical structure, and public-site quality, not as a full reproducibility audit of all computations.

The most important uncertainty is therefore temporal, not substantive: because the project is explicitly a moving draft, some chapter-level defects may disappear quickly. The high-confidence findings are the ones most visible in the current public version: the strength of the econometric framing, the quality of the conformal chapter, the unevenness of chapter completion, the incompleteness of several citation trails, and the need to clean up public rendering and source portability before the notes can be treated as publication-ready graduate lecture notes.