'Defensive' is a claim that needs a significance test
A point-estimate ranking of "defensive" assets is not evidence of a defensive edge. Attach n, t, and p before believing it. Here the highest mean was not statistically significant, and the reliable hedge had the smaller mean.
The seductive ranking
Sort assets by average return during S&P 500 down months and gold tops the list, averaging +0.9% per month across 106 down months. It is tempting to crown gold the best down-month hedge. The mean is real; the edge is not established. The test statistics are t = 1.71, p = 0.09, which is not significant at the 5% level. The ranking is a hypothesis, not a finding.
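As a minimal sketch of the inference behind such numbers, assume the test is a one-sample t-test of the mean down-month return against zero (the text does not name the test). The function names below are hypothetical. The p-value uses a normal approximation from the standard library, which is close to the t distribution when n is around 100 but too optimistic for small samples.

```python
import math
from statistics import NormalDist, mean, stdev


def two_sided_p(t):
    """Two-sided p-value for a t statistic, via a normal approximation.

    Assumption: n is large enough (roughly 100 or more) that the t
    distribution is close to the standard normal.
    """
    return 2.0 * (1.0 - NormalDist().cdf(abs(t)))


def conditional_mean_test(returns):
    """Return (n, mean, t, p) for H0: the true mean return is zero."""
    n = len(returns)
    if n < 2:
        raise ValueError("need at least two observations")
    avg = mean(returns)
    std_err = stdev(returns) / math.sqrt(n)  # sample std dev (n - 1 denominator)
    if std_err == 0:
        raise ValueError("returns have zero variance; t is undefined")
    t = avg / std_err
    return n, avg, t, two_sided_p(t)
```

Under this approximation, `two_sided_p(1.71)` comes out close to 0.09, in line with the figure quoted for gold.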
What ranking-by-mean rewards
Sorting on the mean quietly rewards high-variance assets: a few large up-months pull the average up even when the asset is down about as often as it is up. Gold’s hit rate in down months (the share of those months with a positive return) is 53%, barely better than a coin flip. A high average with a near-50% hit rate is what a handful of outsized months produces; it is a signature of noise, not of reliable defensiveness.
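The effect can be illustrated with two toy series. The returns below are synthetic, invented purely for illustration, and are not gold or Treasury data. The sketch assumes "hit rate" means the share of months with a strictly positive return and uses a one-sample t statistic against zero.

```python
import math
from statistics import mean, stdev

# Synthetic monthly returns in percent, invented for illustration only.
# They are NOT gold or Treasury data.
lumpy = [8.0, 7.0, 0.5, 0.5, 0.5, -1.5, -1.5, -1.5, -1.5, -2.0]
steady = [0.6, 0.5, 0.7, 0.4, 0.6, 0.5, -0.3, -0.2, 0.6, 0.6]


def hit_rate(returns):
    """Share of months with a strictly positive return."""
    return sum(1 for r in returns if r > 0) / len(returns)


def t_stat(returns):
    """One-sample t statistic for H0: mean return is zero."""
    return mean(returns) / (stdev(returns) / math.sqrt(len(returns)))


for name, series in (("lumpy", lumpy), ("steady", steady)):
    print(f"{name:>6}: mean={mean(series):+.2f}%  "
          f"hit={hit_rate(series):.0%}  t={t_stat(series):.2f}")
```

The lumpy series wins a ranking by mean (about +0.85% against +0.40%), yet it is up in only half its months and its t statistic is below 1. The steady series is up in 80% of months with a t statistic above 3.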
The asset that actually held up
Treasuries (IEF) averaged a smaller +0.5% in the same down months, but with a 56% hit rate and t = 2.10, p = 0.04, which is significant at the 5% level. Cash is trivially positive in essentially every down month. The reliable hedge had the smaller headline mean; ranking by the mean would have pointed you at the wrong asset.
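One way to see why the smaller mean is the more reliable one is to back out the dispersion each t statistic implies. This is a rough sketch under two assumptions not stated in the text: both figures come from a one-sample t-test against zero, t = mean / (sd / sqrt(n)), and n = 106 applies to both assets. The helper name is hypothetical, and the inputs are the rounded figures quoted above, so the results are approximate.

```python
import math


def implied_std_dev(mean_pct, t, n):
    """Back out the sample std dev from t = mean / (sd / sqrt(n))."""
    return mean_pct * math.sqrt(n) / t


# Rounded figures quoted in the text; n = 106 is assumed for both assets.
gold_sd = implied_std_dev(0.9, 1.71, 106)  # roughly 5.4 percent per month
ief_sd = implied_std_dev(0.5, 2.10, 106)   # roughly 2.5 percent per month
print(f"gold: {gold_sd:.1f}%  IEF: {ief_sd:.1f}%")
```

On these assumptions, gold's down-month returns are more than twice as dispersed as IEF's, which is why its larger mean carries less statistical weight.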
The discipline
For any conditional “defensive” claim (performance in down months, in crises, in a regime), report n, t, and p next to the mean, and state explicitly whether p clears the 5% threshold. A point-estimate ranking tells you where to look; only the inference tells you whether there is anything there.
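A reporting habit like this can be enforced in code. The sketch below is one possible shape, not a prescribed format: the `ConditionalStat` class and its methods are hypothetical names, and n = 106 for IEF is assumed from "the same down months".

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ConditionalStat:
    asset: str
    n: int            # number of conditioning months (e.g. down months)
    mean_pct: float   # average return in those months, in percent
    t: float
    p: float

    def significant(self, alpha=0.05):
        return self.p < alpha

    def report_line(self, alpha=0.05):
        verdict = "significant" if self.significant(alpha) else "NOT significant"
        return (f"{self.asset}: mean={self.mean_pct:+.1f}%  n={self.n}  "
                f"t={self.t:.2f}  p={self.p:.2f}  ({verdict} at {alpha:.0%})")


# Figures quoted in the text; n = 106 for IEF is an assumption.
stats = [
    ConditionalStat("Gold", 106, 0.9, 1.71, 0.09),
    ConditionalStat("IEF", 106, 0.5, 2.10, 0.04),
]
for s in stats:
    print(s.report_line())
```

Because the mean cannot be printed without n, t, p, and the verdict beside it, a bare ranking by mean never reaches the reader on its own.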