Supporting the Redefinition of Statistical Significance
Recently an article entitled `Redefining Statistical Significance' (RSS) has been made available. In this piece a diverse bunch of authors (including four philosophers of science - represent) put forward an argument with the thesis: ``[f]or fields where the threshold for defining statistical significance for new discoveries is P<0.05, we propose a change to P<0.005.'' In this very brief note I just want to state my support for the broad principle behind this proposal and make explicit an aspect of their reasoning that is hinted at in RSS but which I think is especially worth holding clear in our minds.
RSS argues that, basically, rejecting the null at P<0.05 represents (by Bayesian standards) very weak evidence against the null and in favour of the hypothesis under test, and further that its communal acceptance as the standard significance level for discovery predictably, and in fact, leads to unacceptably many false-positive discoveries. Taking P<0.005 as the norm would go some way towards solving both these problems, and the authors especially emphasise that it would bring the rate of false positives down to what they deem a more acceptable level. RSS doesn't claim originality for these points, and is a short and very readable paper; I recommend checking it out.
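To make the false-positive worry concrete, here is a minimal sketch of the kind of calculation driving it. The prior odds and power figures are purely illustrative assumptions of mine, not numbers taken from RSS: suppose one in ten tested hypotheses is true and tests have 80% power.

    # Illustrative only: prior_prob_true and power below are assumed for the sake
    # of the example, not figures from RSS.

    def false_positive_share(alpha, power, prior_prob_true):
        """Expected fraction of significant results ('discoveries') that are false positives."""
        true_positives = power * prior_prob_true          # real effects correctly detected
        false_positives = alpha * (1 - prior_prob_true)   # null effects wrongly flagged
        return false_positives / (false_positives + true_positives)

    for alpha in (0.05, 0.005):
        share = false_positive_share(alpha=alpha, power=0.8, prior_prob_true=0.1)
        print(f"alpha = {alpha}: about {share:.0%} of significant results are false positives")

On those made-up assumptions roughly a third of P<0.05 `discoveries' would be false positives, dropping to around one in twenty at P<0.005; the exact figures depend entirely on the assumed prior odds and power, which is part of why I hold back from endorsing any particular threshold.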
The authors then have a section replying to objections. They note that they do not think that changing the significance level communally required for discovery claims is a cure-all, and they deploy a number of brief but very interesting arguments against the counter-claim that the losses in terms of false negatives would outweigh the gains in avoiding false positives. This is all interesting stuff, but the point at which I wish to state my broad agreement comes when they consider the objection that ``The appropriate threshold for statistical significance should be different for different research communities.'' Their response is to agree in principle that different communities facing different sorts of puzzles ought to use different norms for discovery claims, but to note that many communities have settled on the idea that, given the sort of claims they consider and the tests they can run, P<0.05 is an appropriate standard for discovery claims. They address those communities in particular with their proposal, and so are addressing communities which have already come to agree that they should share a standard for discovery claims.
My one small contribution here, then, is in following up on this point. In their reply to this objection they briefly note that `it is helpful for consumers of research to have a consistent benchmark.' I think this point deserves elaboration and emphasis, and it is why I think that, although I am not sufficiently expert to comment on the specific threshold they propose, the broad contours of their argument are right. Why, after all, do we have to agree on a communal standard for what counts as an appropriate significance level for `claims of discovery of new effects' at all? Couldn't we leave that to the discretion of individual researchers? Or maybe foster for some time a diversity of standards across journals and let a kind of Millian intellectual marketplace do its work? To put it philosophically, why have something rather than nothing here?
I take it that a lot of what the communal standard does is provide a benchmark: it lets those unable to make an expert or highly informed personal assessment of the claims and evidence know that the hypothesis in question has been confirmed to the standards of those who can make such assessments. These consumers of the research are the people the consistent benchmark helps. Especially for the kind of social scientific fields which have in fact adopted this benchmark, a pressing methodological consideration has to be that non-scientists, or folk not able to assess statistical claims, and more pointedly people in policy-making or culturally influential positions, will consume the research and take actions based on what they believe to be reliable, or at least act on the grounds of what convinces them. The trade-off between Type 1 and Type 2 errors, then, must be made with it in mind that there is an audience of non-experts for the claims made in these fields, an audience who will shape actions and lives and self-perceptions (in part) around the results these fields put out. As a scientific community we must therefore decide which of our own work we think can be vouchsafed to these observers, or validated to the standard this cultural responsibility entails.
In theory, of course, we could still leave this up to individuals or allow for a diversity of standards among journals. But I think awareness of the scientific community's public role tends to speak against that. Such diversity, I'd wager, would either result in a cacophonic public discourse on science, in which the media and commentators constantly report results, then their failure to replicate, then their replication once more (as well as contrary results, their failures to replicate, and so on). This would be because the diversity of standards leads non-experts to pick who to believe more or less at random among folk with different standards, or according to who they judge to have the flashiest smile, or whichever university PR department reached out to them last, or by factionally choosing their favourite sources. Or, it would result in silence, as scientific results gradually come to be seen as too unreliable, too divided among themselves, to be worth paying much attention to at all. If you think that scientifically acquired information can make a positive difference to public discourse, either of these seems like a bad outcome. (The somewhat self-promoting Du Bois scholar nerd in me can't resist pointing out that Du Bois brought similar considerations to bear in responding to widespread failures of social scientific research in his day.) In fact, I think this epistemic environment makes a conservative attitude sensible, and speaks in favour of adopting a very low tolerance for false positives. This is because it is much harder to correct misinformation once it is out there than to defer announcing results until we are more confident, and the very act of correction may induce the same loss-of-trust worry mentioned before. This means that, in addition to elaborating upon RSS's reply to an objection, and without feeling competent to judge whether P<0.005 in particular is the right standard, I also think the overall direction of change advocated by RSS is the right one, relative to where we are now.