Walk-forward, OOS, Deflated Sharpe, and PBO: overfitting controls that matter

The controls that did the work here are walk-forward gates with in-window selection, bootstrap confidence intervals, dominance checks, and parameter-stability evidence. Deflated Sharpe and PBO are specified as acceptance criteria but deliberately deferred — and saying so plainly is part of the methodology.

An honest inventory

The Deflated Sharpe Ratio (DSR) and the Probability of Backtest Overfitting (PBO) appear in this archive’s specs as required_if_feasible acceptance criteria. As of this writing they are specified but not yet implemented. The controls that have actually run, and actually caught things, are simpler:

Walk-forward with in-window selection. Candidate selection must happen inside each walk-forward window; ranking on the full sample first is forbidden except as labeled exploration. The gates require ≥60% of windows to be positive and median out-of-sample (OOS) expectancy to be above zero.
Bootstrap confidence intervals. 1,000-iteration trade-level bootstrap, plus a month-block variant that resamples whole months so that trades clustered in time stay together; the acceptance gate is on the CI’s lower bound, not the mean.
Dominance checks. Maximum single-trade contribution ≤30% of total PnL, and the result must survive removing the top 1% of trades. Strategies whose PnL is two lucky days do not pass.
Parameter-stability evidence. Stability of re-estimated parameters across windows is treated as affirmative evidence against overfitting, not just absence of failure.
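
The first three gates above can be sketched in a few lines of plain Python. This is a minimal illustration, not the archive's implementation: the function names are hypothetical, the 95% confidence level and the choice of mean trade PnL as the bootstrapped statistic are assumptions (the text gives only the iteration count), and "survive removing the top 1% of trades" is read here as "total PnL stays positive". The month-block variant is not shown.

```python
import math
import random
import statistics


def walk_forward_gate(window_expectancies, min_positive_share=0.60):
    """Pass only if >=60% of windows are positive and the median is above zero."""
    positive = sum(1 for x in window_expectancies if x > 0)
    share = positive / len(window_expectancies)
    return share >= min_positive_share and statistics.median(window_expectancies) > 0


def bootstrap_ci_lower(trade_pnls, iterations=1000, alpha=0.05, seed=0):
    """Lower bound of a percentile bootstrap CI for mean trade PnL.

    alpha=0.05 (a 95% interval) is an assumption, not a value from the text.
    """
    rng = random.Random(seed)
    n = len(trade_pnls)
    # Resample trades with replacement and record the mean of each resample.
    means = sorted(
        statistics.fmean(rng.choices(trade_pnls, k=n)) for _ in range(iterations)
    )
    return means[int((alpha / 2) * iterations)]


def dominance_check(trade_pnls, max_share=0.30, trim_fraction=0.01):
    """Reject results that depend on one trade or on the top 1% of trades."""
    total = sum(trade_pnls)
    if total <= 0:
        return False
    ranked = sorted(trade_pnls, reverse=True)
    # Largest single trade may contribute at most 30% of total PnL.
    if ranked[0] / total > max_share:
        return False
    # Drop the top 1% of trades (at least one) and require PnL to stay positive.
    n_trim = max(1, math.ceil(trim_fraction * len(ranked)))
    return sum(ranked[n_trim:]) > 0
```

A candidate would then be accepted only if all three functions agree, with the bootstrap gate applied as `bootstrap_ci_lower(trades) > 0` rather than as a test on the sample mean.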

Two cases where the controls did their job

A failure caught. The daily stat-arb pipeline ran a walk-forward with 3 years of training, 1 year of validation, and 1 year of test per window, stepped every 6 months, for 10 windows in total. Verdict: FAIL. OOS Sharpe was −1.67 against a 1.2 gate, OOS CAGR was −5.6%, max drawdown was −30.5%, and 0 of 10 windows were positive. A single full-sample backtest of the same configuration looked viable; the walk-forward structure is what exposed the failure.
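
To make the window layout concrete, here is a rough sketch of how such a schedule could be generated. The helper names and the start date are hypothetical and chosen only for illustration; the text does not say which calendar dates the pipeline used.

```python
from datetime import date


def add_months(d, months):
    """Shift a first-of-month date by a whole number of months."""
    index = d.year * 12 + (d.month - 1) + months
    return date(index // 12, index % 12 + 1, 1)


def walk_forward_windows(start, n_windows=10, train=36, val=12, test=12, step=6):
    """Return (train_start, val_start, test_start, test_end) for each window.

    Lengths are in months; each boundary is the exclusive end of the
    preceding segment and the inclusive start of the next one.
    """
    windows = []
    for i in range(n_windows):
        train_start = add_months(start, i * step)
        val_start = add_months(train_start, train)
        test_start = add_months(val_start, val)
        test_end = add_months(test_start, test)
        windows.append((train_start, val_start, test_start, test_end))
    return windows


# Hypothetical start date, used only to show the shape of the schedule.
windows = walk_forward_windows(date(2010, 1, 1))
for train_start, val_start, test_start, test_end in windows:
    print(train_start, val_start, test_start, test_end)
```

With these lengths, 10 windows span 9.5 years of data: 5 years for the first window plus nine 6-month steps. Any candidate selection would happen on each window's train and validation segments only, with the test segment held back for the OOS verdict.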

A pass that earned trust. The macro recession overlay re-calibrated its thresholds each year over 21 expanding-window OOS years (2006–2026), and the chosen parameters barely moved: the defensive threshold was 0.6 in 14 of 21 years and 0.5 in the other 7; the aggressive threshold was identical in all 21. In-sample-to-OOS Sharpe decay was −0.006 (0.593 → 0.586). Just as important, the search space was 20 combinations, not 20,000. A small, pre-committed grid is itself an overfitting control: the fewer candidates tried, the less room there is for a lucky one to win by chance.
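
One simple way to put a number on "barely moved" is the share of windows that chose the modal parameter value. The sketch below is illustrative only: `modal_share` is a hypothetical helper, and because the text reports counts rather than a year-by-year series, the ordering of the list is arbitrary.

```python
from collections import Counter


def modal_share(values):
    """Return the most common value and the fraction of windows that chose it."""
    value, count = Counter(values).most_common(1)[0]
    return value, count / len(values)


# Counts come from the text (0.6 in 14 years, 0.5 in 7); the order is made up.
defensive = [0.6] * 14 + [0.5] * 7

value, share = modal_share(defensive)
print(f"modal defensive threshold {value} chosen in {share:.0%} of years")
```

A parameter that is identical in every window, like the aggressive threshold here, would have a modal share of 1.0. A share that falls toward 1 divided by the number of grid values would suggest the re-estimation is chasing noise.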

Why the famous statistics are deferred

DSR and PBO correct for the breadth of the search — which matters most when the search is huge and the selection automated. The projects’ current designs attack the same risk further upstream: keep the grid small, select inside windows, demand bootstrap-CI and dominance survival, and treat parameter drift as a red flag. The plan is still to add DSR/PBO as the final acceptance layer; claiming them before they run would be exactly the kind of overstatement the rest of the methodology exists to prevent.

The lesson

Overfitting controls are a budget, not a checklist. Spend first on the structural ones — out-of-sample discipline and honest search spaces — because no statistic computed after an unconstrained search can fully un-ring that bell.