Ockham’s Razor is about what to believe when we have no evidence, not how to pick between theories supported by equal amounts of evidence.
In slightly longer form
I’m in the middle of The Science of Conjecture and I just realized that I’ve been misinterpreting Ockham’s Razor for the last several years. Ockham’s Razor says:
Entities are not to be multiplied without necessity.
For a long time, I’d taken this to mean:
The best explanation is the simplest explanation that takes all the variables into account.
In statistical terms we might phrase this as “bet on sparsity”. Up until today I thought that this was a maxim we might appeal to in the model selection phase of modeling: given fits \(F_1\) and \(F_2\) with equal levels of support (say, WAIC), we should make inferences from the fit with fewer effective degrees of freedom.
But if \(F_1\) and \(F_2\) only differ in complexity, there’s no reason (statistical or philosophical) to prefer inferences made from one of the fits[1]. To borrow John Myles White’s language: both fits represent pseudo-truths under different models.
How should we perform inference given that we have two fits equally supported by the evidence at hand? In some cases, \(F_1\) and \(F_2\) come from model families that are “compatible” in some sense, and we might be able to use stacking (e.g. as recently proposed by Yao et al. or as in van der Laan’s TMLE).
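To make the idea of stacking concrete, here is a minimal sketch in plain Python. It is classic least-squares stacking of point predictions using leave-one-out predictions, not the Bayesian stacking of predictive distributions from Yao et al., and not TMLE. The two toy models (`fit_mean`, `fit_line`), the simulated data, and every function name are my own assumptions for illustration.

```python
import random


def fit_mean(xs, ys):
    """F1 (hypothetical): intercept-only model. Returns a prediction function."""
    m = sum(ys) / len(ys)
    return lambda x: m


def fit_line(xs, ys):
    """F2 (hypothetical): least-squares line y = a + b * x."""
    n = len(xs)
    xbar, ybar = sum(xs) / n, sum(ys) / n
    sxx = sum((x - xbar) ** 2 for x in xs)
    sxy = sum((x - xbar) * (y - ybar) for x, y in zip(xs, ys))
    b = sxy / sxx
    a = ybar - b * xbar
    return lambda x: a + b * x


def loo_predictions(fit, xs, ys):
    """Leave-one-out predictions: refit without point i, then predict point i."""
    preds = []
    for i in range(len(xs)):
        model = fit(xs[:i] + xs[i + 1:], ys[:i] + ys[i + 1:])
        preds.append(model(xs[i]))
    return preds


def stacking_weight(p1, p2, ys):
    """Weight w in [0, 1] on model 1 minimizing
    sum((y - (w * p1 + (1 - w) * p2)) ** 2) over held-out predictions."""
    num = sum((y - b) * (a - b) for a, b, y in zip(p1, p2, ys))
    den = sum((a - b) ** 2 for a, b in zip(p1, p2))
    if den == 0:
        return 0.5  # the two models predict identically; any weight works
    return min(1.0, max(0.0, num / den))


# Simulated data (an assumption for illustration): a clear linear trend.
rng = random.Random(0)
xs = [i / 10 for i in range(30)]
ys = [2.0 * x + rng.gauss(0, 0.5) for x in xs]

w_mean = stacking_weight(
    loo_predictions(fit_mean, xs, ys),
    loo_predictions(fit_line, xs, ys),
    ys,
)
print("stacking weight on the intercept-only fit:", w_mean)
print("stacking weight on the linear fit:", 1 - w_mean)
```

The point of the sketch is only that stacking sidesteps the "which fit do I pick?" question: predictions come from a weighted combination, with weights chosen by out-of-sample performance rather than by complexity.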
Perhaps more interesting is the case when \(F_1\) and \(F_2\) each allow for types of inference that are fundamentally incompatible with each other. People are thinking about this kind of thing – take for example this excerpt from Jim Savage’s Zen of Modeling:
- You never have enough observations to distinguish one possible data generating process from another process that has different implications. You should model both, giving both models weight in decision-making.
For the most part I’ve seen comments like this coming from people in industry who need to make business decisions under uncertainty. I’d be curious if there are any formal frameworks for synthesizing inference from “incompatible” models, or work to define what “compatibility” is and how it might allow inference from a collection of fits[2].
Anyway, circling back to Ockham’s Razor, I’m convinced that the point is not that we should prefer fits from simple models to fits from complex ones. Rather, I prefer to read the Razor as a statement about burden of proof: the more novel structure in a hypothesis (entities in the original statement), the greater the burden of proof. In cases when two theories have equal amounts of proof (e.g. similar AIC), the Razor is silent. When we have no evidence, simple explanations are better, but that’s it.
[2] As a concrete example of inference from a collection of “compatible” model fits, consider the path generated by a sequence of LASSO estimates for the linear model. In this case we might treat the first \(k\) variables to enter the model as being the most important.
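A rough sketch of that LASSO-path idea, again in plain Python: fit the LASSO by coordinate descent on a decreasing grid of penalties, warm-starting each fit from the previous one, and record the order in which coefficients first become nonzero. The simulated data, the penalty grid, and all function names are assumptions for illustration; in practice you would use an existing LASSO path implementation rather than this toy solver.

```python
import random


def soft_threshold(z, lam):
    """Soft-thresholding operator used in the LASSO coordinate update."""
    if z > lam:
        return z - lam
    if z < -lam:
        return z + lam
    return 0.0


def lasso_cd(X, y, lam, beta, n_iter=50):
    """Coordinate descent for (1 / 2n) * ||y - X b||^2 + lam * ||b||_1,
    warm-started at beta. Assumes X columns and y are centered (no intercept)."""
    n, p = len(X), len(X[0])
    beta = list(beta)
    resid = [y[i] - sum(X[i][j] * beta[j] for j in range(p)) for i in range(n)]
    col_sq = [sum(X[i][j] ** 2 for i in range(n)) / n for j in range(p)]
    for _ in range(n_iter):
        for j in range(p):
            # Correlation of column j with the partial residual (excluding j).
            rho = sum(X[i][j] * resid[i] for i in range(n)) / n + col_sq[j] * beta[j]
            new = soft_threshold(rho, lam) / col_sq[j]
            delta = new - beta[j]
            if delta != 0.0:
                for i in range(n):
                    resid[i] -= X[i][j] * delta
                beta[j] = new
    return beta


def lasso_entry_order(X, y, n_lambdas=30, ratio=0.9):
    """Order in which variables first get a nonzero coefficient as the
    penalty shrinks geometrically from just below lam_max."""
    n, p = len(X), len(X[0])
    lam_max = max(abs(sum(X[i][j] * y[i] for i in range(n))) / n for j in range(p))
    beta, order = [0.0] * p, []
    for k in range(1, n_lambdas + 1):
        beta = lasso_cd(X, y, lam_max * ratio ** k, beta)
        for j in range(p):
            if beta[j] != 0.0 and j not in order:
                order.append(j)
    return order


# Simulated data (an assumption): only variables 0 and 2 carry signal.
rng = random.Random(1)
n, p = 200, 5
X = [[rng.gauss(0, 1) for _ in range(p)] for _ in range(n)]
y = [5.0 * row[0] + 2.0 * row[2] + rng.gauss(0, 0.5) for row in X]

# Center columns and response so that no intercept is needed.
col_means = [sum(row[j] for row in X) / n for j in range(p)]
X = [[row[j] - col_means[j] for j in range(p)] for row in X]
y_mean = sum(y) / n
y = [v - y_mean for v in y]

entry_order = lasso_entry_order(X, y)
print("order in which variables enter the path:", entry_order)
```

Reading the first \(k\) entries of `entry_order` is the "first \(k\) variables to enter" heuristic: one ranking drawn from a whole collection of fits rather than from a single selected model.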