Time to abandon statistical significance

27 Sep, 2017 at 10:55 | Posted in Statistics & Econometrics | 6 Comments

We recommend dropping the NHST [null hypothesis significance testing] paradigm — and the p-value thresholds associated with it — as the default statistical paradigm for research, publication, and discovery in the biomedical and social sciences. Specifically, rather than allowing statistical significance as determined by p < 0.05 (or some other statistical threshold) to serve as a lexicographic decision rule in scientific publication and statistical decision making more broadly as per the status quo, we propose that the p-value be demoted from its threshold screening role and instead, treated continuously, be considered along with the neglected factors [such factors as prior and related evidence, plausibility of mechanism, study design and data quality, real world costs and benefits, novelty of finding, and other factors that vary by research domain] as just one among many pieces of evidence.

We make this recommendation for three broad reasons. First, in the biomedical and social sciences, the sharp point null hypothesis of zero effect and zero systematic error used in the overwhelming majority of applications is generally not of interest because it is generally implausible. Second, the standard use of NHST — to take the rejection of this straw man sharp point null hypothesis as positive or even definitive evidence in favor of some preferred alternative hypothesis — is a logical fallacy that routinely results in erroneous scientific reasoning even by experienced scientists and statisticians. Third, p-value and other statistical thresholds encourage researchers to study and report single comparisons rather than focusing on the totality of their data and results.

Andrew Gelman et al.

As shown over and over again when significance tests are applied, people have a tendency to read ‘not disconfirmed’ as ‘probably confirmed.’ Standard scientific methodology tells us that when there is only, say, a 10% probability that pure sampling error could account for the observed difference between the data and the null hypothesis, it would be more ‘reasonable’ to conclude that we have a case of disconfirmation. Especially if we perform many independent tests of our hypothesis and they all give about the same 10% result as our reported one, I guess most researchers would count the hypothesis as even more disconfirmed.
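A back-of-the-envelope sketch of that last point (my own illustration, not from the post, with invented p-values): if five independent studies each report p of roughly 0.10, Fisher's method for combining independent p-values turns them into a joint p-value of about 0.01, so the evidence against the null accumulates rather than washes out.

```python
import numpy as np
from scipy import stats

# Hypothetical p-values from five independent tests of the same hypothesis
p_values = np.array([0.10, 0.11, 0.09, 0.10, 0.12])

# Fisher's method: under H0, -2 * sum(ln p_i) follows a chi-squared
# distribution with 2k degrees of freedom (k = number of tests)
chi2_stat = -2.0 * np.sum(np.log(p_values))
combined_p = stats.chi2.sf(chi2_stat, df=2 * len(p_values))

print(f"Fisher chi-squared: {chi2_stat:.2f}")
print(f"Combined p-value:   {combined_p:.3f}")   # roughly 0.01
```

The same calculation is available as scipy.stats.combine_pvalues with method='fisher'; the point is only that a string of ‘non-significant’ results can jointly be quite strong evidence against the null, not that any particular threshold should decide the matter.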

We should never forget that the underlying parameters we use when performing significance tests are model constructions. Our p-values mean nothing if the model is wrong. And most importantly — statistical significance tests DO NOT validate models!

In journal articles a typical regression equation will have an intercept and several explanatory variables. The regression output will usually include an F-test, with p – 1 degrees of freedom in the numerator and n – p in the denominator. The null hypothesis will not be stated. The missing null hypothesis is that all the coefficients vanish, except the intercept.

If F is significant, that is often thought to validate the model. Mistake. The F-test takes the model as given. Significance only means this: if the model is right and the coefficients are 0, it is very unlikely to get such a big F-statistic. Logically, there are three possibilities on the table:
i) An unlikely event occurred.
ii) Or the model is right and some of the coefficients differ from 0.
iii) Or the model is wrong.
So?
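To see the point in practice, here is a minimal sketch (my own construction, not from the post, using numpy and statsmodels; the data-generating process is invented for illustration). The linear model is wrong by construction, yet the F-test rejects the null of all-zero slope coefficients emphatically.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(0)
n = 200
x = rng.uniform(0, 4, n)

# True data-generating process: nonlinear mean, non-constant variance.
# A linear model in x is therefore misspecified by construction.
y = np.exp(x) + rng.normal(0, 5 * x, n)

X = sm.add_constant(x)          # intercept plus one explanatory variable
fit = sm.OLS(y, X).fit()

print(f"F-statistic: {fit.fvalue:.1f}")
print(f"p-value:     {fit.f_pvalue:.2e}")
# The F-test rejects "all slope coefficients are zero" overwhelmingly,
# but that verdict is conditional on the linear model being right.
```

A plot of the residuals against x would show the curvature and the fanning variance at a glance; that kind of check is exactly what a significant F-statistic cannot substitute for.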

6 Comments

  1. Disagree strongly.
    .
    Lars does not put forth a viable alternative. Should we ignore data and instead rely on bullshit?
    .
    All studies are flawed in some way, but that does not mean they have no value.
    .
    For example, there are flaws and uncertainties in many climate change studies, but that does not mean that we should dismiss those studies. Instead, we should ask questions and do more studies to answer those questions. In the meantime, we should act on the best available data, flawed though it may be.
    .
    My work and my hobby involve constantly conducting experiments, collecting data, and trying to make sense of the data, to arrive at an understanding of how certain systems actually work. I am well aware of the limitations of data analysis, but the problem is, there is nothing better out there.

    • If you don’t like my argumentation, I strongly recommend that you at least read Gelman’s article 🙂

      • I had read Gelman’s article (brought to my attention by Tom Hickey at MNE) prior to reading your article (also brought to my attention by Tom).
        .
        Suggesting that I have not read the article is an ad hominem attack. Stick to the issues, please.
        .
        As I commented on MNE, what you and Gelman seem to be hinting at, but don’t actually phrase it that way, is that rigid dogmatic thinking about data analysis is bad. If you would phrase it that way, then I would agree with you. Instead the message comes across as “statistical testing is terrible. End statistical testing.” I cannot agree with that.
        .
        Data analysis is not the enemy; neither is it the only thing that should be considered. The Gelman paper makes vague suggestions about other things that might be considered, but its focus is on dissing statistical testing, as is your essay.
        .
        It’s hard to discuss this subject without concrete examples. In most of the problems that I work with, the data is flawed and incomplete, and, as you phrased it, the experimenter’s model may be wrong. The experimenter may be focusing on the wrong things while ignoring the things that are actually important. I try to point out those issues in my work, yet at the same time I take the data seriously because it’s often all I have at any given time other than speculation and armchair theory.
        .
        One single experiment rarely settles the question no matter what the statistics say. In general, the solution is not to diss data analysis, but to collect more data and have more eyes looking at the problem. The litmus test is having other people be able to conduct similar experiments and get repeatable results. Evolution and anthropogenic climate change are examples of models that have been confirmed by numerous studies, even though there may be flaws and omissions in any one study.
        .
        Pharmaceutical trials in the U.S. are examples of questionable testing practices — a few experiments may seem to confirm that a particular drug is safe and effective, while other experiments may show the opposite (and are sometimes not included in the final analysis). The solution is not to diss drug testing but to get more and better tests until there is high confidence in the collective results.

      • And I would also recommend reading (or rereading) this paper from Columbia University: http://www.stat.columbia.edu/~gelman/research/published/badbayesmain.pdf

  2. “Is our expectation of rain, when we start out for a walk, always more likely than not, or less likely than not, or as likely as not? I am prepared to argue that on some occasions none of these alternatives hold, and that it will be an arbitrary matter to decide for or against the umbrella. If the barometer is high, but the clouds are black, it is not always rational that one should prevail over the other in our minds, or even that we should balance them, though it will be rational to allow caprice to determine us and to waste no time on the debate.” John Maynard Keynes, A Treatise on Probability: https://web.archive.org/web/20120919051334/http://books.google.com/books?id=YmCvAAAAIAAJ&dq=keynes+A+Treatise+on+Probability&printsec=frontcover&source=bn&hl=es&ei=stqOSp6sLojLlAeNruW1DA&sa=X&oi=book_result&ct=result&resnum=5#v=onepage&q&f=false

