Deborah Mayo vs. Andrew Gelman on the (non-)significance of significance tests
17 October, 2015 at 10:34 | Posted in Statistics & Econometrics
Andrew: I believed, and still believe, in checking the fit of a model by comparing data to hypothetical replications. This is not the same as significance testing in which a p-value is used to decide whether to reject a model or whether to believe that a finding is true.
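Gelman's "comparing data to hypothetical replications" is the idea behind predictive model checking. A minimal sketch in Python, using made-up data and a plug-in fit in place of a full posterior, purely for illustration:

```python
import random
import statistics

random.seed(1)

# Hypothetical data; in a real check these would be the observed data
data = [random.gauss(0.0, 1.0) for _ in range(100)]

# Fit the model (plug-in estimates stand in for a full posterior here)
mu_hat = statistics.mean(data)
sigma_hat = statistics.stdev(data)

# Test statistic: the sample maximum, a tail feature the model may miss
t_obs = max(data)

# Hypothetical replications: datasets the fitted model says we could have seen
n_rep = 1000
t_rep = [max(random.gauss(mu_hat, sigma_hat) for _ in range(len(data)))
         for _ in range(n_rep)]

# Predictive p-value: fraction of replications at least as extreme as observed;
# values near 0 or 1 flag a feature of the data the model fails to reproduce
ppp = sum(t >= t_obs for t in t_rep) / n_rep
print(f"observed max = {t_obs:.2f}, predictive p = {ppp:.3f}")
```

The point of such a check is model criticism, not a reject/accept decision: an extreme predictive p-value directs attention to what the model gets wrong, which is the distinction Gelman draws in the exchange above.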
Mayo: I don’t know that significance tests are used to decide that a finding is true, and I’m surprised to see you endorsing/spreading the hackneyed and much lampooned view of significance tests, p-values, etc. despite so many of us trying to correct the record. And statistical hypothesis testing denies uncertainty? Where in the world do you get this? (I know it’s not because they don’t use posterior probabilities…) But never mind, let me ask: when you check the fit of a model using p-value assessments, are you not inferring the adequacy/inadequacy of the model? Tell me what you are doing if not. I don’t particularly like calling it a decision, neither do many people, and I like viewing the output as “whether to believe” even less. But I don’t know what your output is supposed to be.
Andrew: I don’t think hypothesis testing inherently denies uncertainty. But I do think that it is used by many researchers as a way of avoiding uncertainty: it’s all too common for “significant” to be interpreted as “true” and “non-significant” to be interpreted as “zero.” Consider, for example, all the trash science we’ve been discussing on this blog recently, studies that may have some scientific content but which get ruined by their authors’ deterministic interpretations. When I check the fit of a model, I’m assessing its adequacy for some purpose. This is not the same as looking for p < .05 or p < .01 in order to go around saying that some theory is now true.
Mayo: I fail to see how a deterministic interpretation could go hand in hand with error probabilities; and I never hear even the worst test abusers declare a theory is not true, give me a break… So when you assess adequacy for a purpose, what does this mean? Adequate vs. inadequate for a purpose is pretty dichotomous. Do you assess how adequate? I’m unclear as to where the uncertainty enters for you, because, as I understand it, it is not in terms of a posterior probability.
Andrew: Here’s a quote from a researcher, I posted it on the blog a few days ago: “Our results demonstrate that physically weak males are more reluctant than physically strong males to assert their self-interest…” Here’s another quote: “Ovulation led single women to become more liberal, less religious, and more likely to vote for Barack Obama. In contrast, ovulation led married women to become more conservative, more religious, and more likely to vote for Mitt Romney.” These are deterministic statements based on nothing more than p-values that happen to be statistically significant. Researchers make these sorts of statements all the time. It’s not your fault, I’m not saying you would do this, but it’s a serious problem. Along similar lines, we’ll see claims that a treatment has an effect on men and not on women, when really what is happening is that p < .05 for the men in the study and p > .05 for the women. In addition to brushing away uncertainty, people also seem to want to brush away variation, thus talking about “the effect” as if it is a constant across all groups and all people. A recent example featured on this blog was a study primarily of male college students which was referred to repeatedly (by its authors, not just by reporters and public relations people) as a study of “men” with no qualifications. P.S. Bayesians do this too; indeed there’s a whole industry (which I hate) of Bayesian methods for getting the posterior probability that a null hypothesis is true. Bayesians use different methods but often have the misguided goal of other statisticians: to deny uncertainty and variation.
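Gelman's "effect on men and not on women" example is the fallacy he and Hal Stern summarized as "the difference between 'significant' and 'not significant' is not itself statistically significant." A minimal numerical sketch, with hypothetical estimates and standard errors chosen for illustration:

```python
import math

def two_sided_p(estimate, se):
    """Two-sided p-value from a normal approximation to the test statistic."""
    z = abs(estimate / se)
    return math.erfc(z / math.sqrt(2))

# Hypothetical study: same standard error in both groups,
# different point estimates (all numbers are made up)
p_men = two_sided_p(0.25, 0.10)    # z = 2.5, "significant"
p_women = two_sided_p(0.10, 0.10)  # z = 1.0, "not significant"

# But the men-vs-women *difference* is itself far from significant
diff = 0.25 - 0.10
se_diff = math.sqrt(0.10**2 + 0.10**2)
p_diff = two_sided_p(diff, se_diff)

print(f"men: p = {p_men:.3f}, women: p = {p_women:.3f}, "
      f"difference: p = {p_diff:.3f}")
```

Here one group clears the .05 threshold and the other does not, yet the comparison between them carries no evidence of a real difference, which is exactly the interpretive trap Gelman describes.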
Mayo: These moves from observed associations, and even correlations, to causal claims are poorly warranted, but these are classic fallacies that go beyond tests to reading all manner of “explanations” into the data. I find it very odd to view this as a denial of uncertainty by significance tests. Even if they got their statistics right, the gap from statistical result to substantive causal claim would still exist. I just find it odd to regard the statistical-vs.-substantive and correlation-vs.-cause fallacies, which every child knows, as some kind of shortcoming of significance tests. Any method, or no method, can commit these fallacies, especially in observational studies. But when you berate the tests as somehow responsible, you misleadingly suggest that other methods are better, rather than worse. At least error-statistical methods can identify the flaws at three levels (data, statistical inference, statistical → substantive causal claim) in a systematic way. We can spot the flaws a mile off… I still don’t know where you want the uncertainty to show up; I’ve indicated how I do.
Andrew: You write, “I still don’t know where you want the uncertainty to show up;” I want the uncertainty to show up in a posterior distribution for continuous parameters, as described in my books.
Mayo: You write, “I want the uncertainty to show up in a posterior distribution for continuous parameters”. Let’s see if I have this right. You would report the posterior probabilities that a model was adequate for a goal. Yes? Now you have also said you are a falsificationist. So is your falsification rule to move from a low enough posterior probability in the adequacy of a model to the falsity of the claim that the model is adequate (for the goal)? And would a high enough posterior in the adequacy of a model translate into something like not being able to falsify its adequacy, or perhaps accepting it as adequate (the latter would not be falsificationist, but might be more sensible than the former)? Or are you no longer falsificationist-leaning?
Andrew: No, I would not “report the posterior probabilities that a model was adequate for a goal.” That makes no sense to me. I would report the posterior distribution of parameters and make probabilistic predictions within a model.
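What "report the posterior distribution of parameters and make probabilistic predictions within a model" can look like in the simplest case: a conjugate normal model for a mean with known variance. All numbers, including the prior, are illustrative assumptions, not anything from the exchange above:

```python
import math

# Hypothetical setup: normal data with assumed-known sd, normal prior on the mean
sigma = 1.0           # assumed known data sd
mu0, tau0 = 0.0, 1.0  # prior mean and sd (illustrative choices)
data = [0.3, 0.8, 0.5, 1.1, 0.2]
n, ybar = len(data), sum(data) / len(data)

# Conjugate update: the posterior for the mean is also normal
prec = 1 / tau0**2 + n / sigma**2          # posterior precision
mu_n = (mu0 / tau0**2 + n * ybar / sigma**2) / prec
tau_n = math.sqrt(1 / prec)

# Probabilistic prediction for a new observation: the predictive sd
# combines parameter uncertainty with sampling noise
pred_sd = math.sqrt(tau_n**2 + sigma**2)
print(f"posterior: N({mu_n:.3f}, {tau_n:.3f}^2), predictive sd = {pred_sd:.3f}")
```

The output is a full distribution over the parameter and over future data, rather than a probability that "the model is adequate", which matches the distinction Gelman insists on here.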
Mayo: Well if you’re going to falsify as a result, you need a rule from these posteriors to infer that the predictions are met satisfactorily or not. Otherwise there is no warrant for rejecting/improving the model. That’s the kind of thing significance tests can do. But specifically, with respect to the misleading interpretations of data that you were just listing, it isn’t obvious how they are avoided by you. The data may fit these hypotheses swimmingly. Anyhow, this is not the place to discuss this further. In signing off, I just want to record my objection to (mis)portraying statistical tests and other error-statistical methods as flawed because of some blatant, age-old misuses or misleading language, like “demonstrate” (flaws that are at least detectable and self-correctable by these same methods, whereas they might remain hidden by other methods now in use). [Those examples should not even be regarded as seeking evidence, but at best colorful and often pseudoscientific interpretations.] When the Higgs particle physicists found their 2- and 3-standard-deviation effects were disappearing with new data—just to mention a recent example from my blog—they did not say the flaw was with the p-values! They tightened up their analyses and made them more demanding. They didn’t report posterior distributions for the properties of the Higgs, but they were able to make inferences about their values and identify gaps for further analysis.
For my own take on significance tests see here.