### Significant Moral Hazard

What follows is a guest post by my comrade Dan Malinsky. After the recent publication of the paper "Redefine statistical significance," Malinsky and I attended a talk by one of the paper's authors. I found Malinsky's comments after the talk so interesting and thought-provoking that I asked him to write up a post so I could share it with all yinz. Enjoy!

--------------------------------------------------------------------------------------------------------------------------

Benjamin et al. present an interesting and thought-provoking set of claims. There are, of course, many complexities to the P-value debate but I’ll just focus on one issue here.

Benjamin et al. propose to move the conventional statistical significance threshold in null hypothesis significance testing (NHST) from P < 0.05 to P < 0.005. Their primary motivation for making this recommendation is to reduce the rate of false positives in published research. I want to draw attention to the possibility that moving the threshold to P < 0.005 may not have its intended effect: despite the fact that “all else being equal” such a policy should theoretically reduce false positive rates, in practice this move may leave the false positive rates unchanged, or even make them worse. In particular, the “all else being equal” clause will fail to hold, because the policy may incentivize researchers to make more errors of model specification, which will contribute to a high false positive rate. It is at least an open question which causal factors will dominate, and what the resultant false positive rate will really look like.

An important contribution to the high false positive rates in some areas of empirical research is model misspecification, broadly understood. By model misspecification I mean anything which might make the likelihood wrong: confounding, misspecification of the relevant parametric distributions, incorrect functional forms, sampling bias of various sorts, sometimes non-i.i.d.-ness, etc. In fact, these factors are *more important* contributions to the false positive rate than the choice of P-value convention or decision threshold, in the sense that any plausible decision rule, no matter how stringent (whether it is based on P-values, Bayes factors, or posterior probabilities), will lead to unacceptably high false positive rates if model misspecification is widespread in the field.

Note that Benjamin et al. agree on this first claim. They mention some of these problems, agree that they are problems, and frankly admit that their proposal does nothing to address these or many other statistical issues. Model misspecification, in their view, ought to be tackled separately and independently of the decision rule convention. The authors also admit that these and related issues are “arguably bigger problems” than the choice of P-value threshold. I think these are bigger problems in the sense specified above: model misspecification will afflict any choice of decision rule. This is important because the proposed policy shift may actually lead to more model misspecification. So, the issues interact and it is not so straightforward to tackle them separately.
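To make the point that misspecification afflicts any threshold concrete, here is a minimal simulation of my own (an illustration, not from Benjamin et al.): an unmeasured confounder u drives both x and y, x has no effect on y at all, and yet an ordinary regression test rejects the null of no association at essentially any significance level, 0.005 included.

```python
# Sketch (my own example, not from the paper): confounding defeats any threshold.
# x has NO effect on y; both are driven by an unmeasured confounder u.
import random
from statistics import NormalDist, fmean

random.seed(0)
n = 5000
u = [random.gauss(0, 1) for _ in range(n)]
x = [ui + random.gauss(0, 1) for ui in u]
y = [ui + random.gauss(0, 1) for ui in u]   # y ignores x entirely

# OLS slope of y on x, with its usual standard error and a normal-theory p-value
mx, my = fmean(x), fmean(y)
sxx = sum((xi - mx) ** 2 for xi in x)
beta = sum((xi - mx) * (yi - my) for xi, yi in zip(x, y)) / sxx
resid = [yi - my - beta * (xi - mx) for xi, yi in zip(x, y)]
se = (sum(r * r for r in resid) / (n - 2) / sxx) ** 0.5
p = 2 * (1 - NormalDist().cdf(abs(beta / se)))
print(f"slope {beta:.2f}, p-value {p:.2e}")  # p is essentially zero: "significant" at any alpha
```

The estimated slope sits near 0.5 (the confounded association), and no choice of P-value cutoff distinguishes this spurious effect from a real one; only fixing the model specification can.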

P < 0.005 requires larger sample sizes (as the authors discuss), which are expensive and difficult to come by in many fields. In an effort to recruit more study participants, researchers may end up with samples that exhibit more bias -- less representative of the target population, not identically distributed, not homogeneous in the right ways, etc. Researchers may also be incentivized, given finite time and resources, to perform less model-checking and diagnostics to make sure the likelihood is empirically adequate. Furthermore, the P-value critically depends on the tails of the relevant probability distribution. (That’s because the P-value is calculated based on the “extreme values” of the distribution of the test statistic under the null model.) The tails of the distribution are rarely exactly right at finite sample sizes, but they need to be “right enough.” With a low P-value threshold like 0.005, getting the tails of the distribution “right enough” to achieve the advertised false positive rate becomes less likely, because with 0.005 one considers outcomes further out into the tails. Finally, other problems that inflate false positive rates, like p-hacking and failure to correct for multiple testing, may be exacerbated by the lower threshold. The mechanisms are not all obvious -- perhaps, for example, making it more difficult to publish “positive” findings will incentivize researchers to probe a wider space of (mostly false) hypotheses in search of a “significant” one, thereby worsening the p-hacking problem -- but it is at least worth taking seriously that these factors may offset the envisaged benefits of P < 0.005. (I think there are some interesting things which may be said about why these considerations are less worrisome in particle physics, where the famous 5-sigma criterion plays a role in announcements. I’ll leave that aside for now.)
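The point about tails can be made quantitative with a small stdlib-Python sketch (my numbers, not the authors'). Suppose the test statistic is assumed standard normal under the null, but its true null distribution is Student's t with 3 degrees of freedom -- heavier tails, and a closed-form CDF that makes the calculation easy. The realized false positive rate exceeds the nominal one at both thresholds, but the *relative* inflation is far worse at 0.005, precisely because that cutoff sits further out in the tails.

```python
# Sketch (my illustration): tail misspecification inflates false positive
# rates more, relatively, at stricter thresholds. Assumed-normal cutoffs are
# applied to a statistic whose true null distribution is Student's t with 3 df.
from math import atan, sqrt, pi
from statistics import NormalDist

def t3_sf(x):
    """Survival function P(T > x) for Student's t with 3 df (closed form)."""
    return 0.5 - (atan(x / sqrt(3)) + sqrt(3) * x / (x * x + 3)) / pi

for alpha in (0.05, 0.005):
    z = NormalDist().inv_cdf(1 - alpha / 2)   # nominal two-sided normal cutoff
    actual = 2 * t3_sf(z)                     # realized false positive rate
    print(f"nominal {alpha}: actual ~{actual:.3f}, inflation x{actual / alpha:.1f}")
```

Under these assumptions the nominal 0.05 test realizes roughly a 0.14 false positive rate (about 3x inflation), while the nominal 0.005 test realizes roughly 0.07 (about 13x inflation): the stricter threshold's advertised error rate is the further from the truth.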

I’m not disputing any mathematical claim made by the authors. Indeed, for two decision rules like P < 0.05 and P < 0.005 applied to the same hypotheses, likelihood, and data, the more stringent rule will lead to fewer expected false positives. My point is just that implementing the new policy will change the likelihoods and data under consideration, since researchers will face the same pressure to publish significant results but publishing will be made more difficult in a kind of crude way.
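One way to see how the “all else being equal” clause can fail, via a back-of-the-envelope calculation of my own (not from the paper): with a fixed number of independent true-null hypotheses probed, tightening the threshold helps enormously; but if the stricter standard pushes a lab to probe, say, ten times as many hypotheses in search of a “significant” one, the chance of turning up at least one false positive lands back where it started.

```python
# Hypothetical back-of-envelope (mine, not the authors'): the stricter
# threshold wins only if the number of true-null hypotheses probed stays fixed.
def p_any_false_positive(alpha, m):
    """Chance of at least one false positive among m independent true nulls."""
    return 1 - (1 - alpha) ** m

print(p_any_false_positive(0.05, 20))    # ~0.64 at the old threshold
print(p_any_false_positive(0.005, 20))   # ~0.10: a big gain, all else equal
print(p_any_false_positive(0.005, 200))  # ~0.63: the gain vanishes if 10x more hypotheses are probed
```

The mathematics of the decision rule is untouched here; what changes is the behavior the rule incentivizes.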

This worry will be relevant for any decision threshold convention, and so it speaks against any strict uniform standard. However, Benjamin et al. raise the important point that “it is helpful for consumers of research to have a consistent benchmark.” My friend and colleague Liam Kofi Bright reinforces this point in his blog post: there are all sorts of communal benefits to having some mechanism which distinguishes “significant” results from “insignificant.” I’d like to propose a different kind of mechanism.

Sometimes statisticians casually entertain the idea of requiring “staff statistician reviewers” to review (the data analysis portions of) empirical articles submitted for publication. I think we can plausibly institutionalize a version of this practice, and it can function as a benchmarking procedure. Every journal would pay some number of professional statisticians (who should be otherwise employed at universities, research centers, etc.) to act as statistical reviewers, and specifically to interrogate issues of model specification, sample selection, decision procedures, robustness, and so on. Only when a paper receives a stamp of approval from two or more statistical reviewers should it count as having “passed the benchmark.” The institutionalization of this proposal would have some corollary benefits: many professional statisticians are employed on “soft money,” i.e., they have to raise parts of their salaries by applying for grants. This mechanism could partially replace that grant cycle: journals would apply regularly every few years for funding from the NIH, NSF, and other funding agencies to compensate statistical reviewers (an amount dependent on the journal’s submission volume); the statisticians get to supplement their incomes with this funding rather than spend time applying for grants; and the public gets some comfort in knowing that the latest published results are not fraught with data analysis problems. I can imagine a host of other benefits too: e.g., statisticians will be inspired and motivated to direct their own research towards addressing live concerns shared by practicing empirical scientists, and the empirical scientists will be alerted to more sophisticated or state-of-the-art analytic methods. Statistical review may also reduce the prevalence of NHST, in favor of some of the alternative analytical tools mentioned in Benjamin et al.
The details of this proposed institutional practice need to be elaborated, but I conjecture it would be more effective at reducing false positives (and perhaps cheaper) than imposing P < 0.005 and requiring larger sample sizes across the board.

[I should acknowledge that, depending on how my career goes, I could be the kind of person who is employed in this capacity. So: conflict of interest alert! Acknowledgements to Liam Kofi Bright, Jacqueline Mauro, Maria Cuellar, and Luis Pericchi.]

--------------------------------------------------------------------------------------------------------------------------
