## On significance and model validation

22 January, 2013 at 12:53 | Posted in Statistics & Econometrics | 9 Comments

Let us suppose that we, as educational reformers, have a hypothesis that implementing a voucher system would raise mean test results by 100 points (the null hypothesis). Instead, when sampling, it turns out that the system raises them by only 75 points, with a standard error (telling us how much the mean varies from one sample to another) of 20.

Does this imply that the data do not disconfirm the hypothesis? Given the usual normality assumptions on sampling distributions, with a t-value of 1.25 [(100-75)/20] the one-tailed p-value is approximately 0.11. Thus, approximately 11% of the time we would expect a score this low or lower if we were sampling from this voucher system population. That means that, using the ordinary 5% significance level, we would not reject the null hypothesis, although the test has shown that it is likely – the odds are 0.89/0.11 or 8-to-1 – that the hypothesis is false.
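The arithmetic in this paragraph can be reproduced with a few lines of Python. This is only a sketch under the normal approximation stated above; the helper `normal_cdf` and the variable names are introduced here for illustration. The last quantity is the ratio the text calls "odds", computed solely to show where the 8-to-1 figure comes from.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


hypothesized_gain = 100.0  # gain claimed under the null hypothesis
observed_gain = 75.0       # gain estimated from the sample
standard_error = 20.0      # standard error of the estimated gain

# Standardized distance between the estimate and the hypothesized value.
z = (observed_gain - hypothesized_gain) / standard_error

# Lower-tail p-value: probability of an estimate this low or lower
# if the true gain really were 100 points.
p_lower = normal_cdf(z)

# The ratio (1 - p)/p that the text describes as odds of about 8-to-1.
ratio = (1.0 - p_lower) / p_lower

print(f"z = {z:.2f}, p = {p_lower:.5f}, (1 - p)/p = {ratio:.2f}")
```

Note that the sign of `z` is negative here, whereas the text reports the absolute value 1.25; under a symmetric distribution the tail area is the same either way.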

In its standard form, a significance test is not the kind of “severe test” that we are looking for when we seek to confirm or disconfirm empirical scientific hypotheses. This is problematic for many reasons, one being that there is a strong tendency to accept the null hypothesis simply because it can’t be rejected at the standard 5% significance level. In their standard form, significance tests bias against new hypotheses by making it hard to disconfirm the null hypothesis.

And as shown over and over again when it is applied, people have a tendency to read “not disconfirmed” as “probably confirmed.” But looking at our example, standard scientific methodology tells us that since there is only an 11% probability that pure sampling error could account for the observed difference between the data and the null hypothesis, it would be more “reasonable” to conclude that we have a case of disconfirmation. Especially if we perform many independent tests of our hypothesis and they all give about the same result as the one reported here, I guess most researchers would count the hypothesis as even more disconfirmed.

And, most importantly, of course we should never forget that the underlying parameters we use when performing significance tests are model constructions. Our p-value of 0.11 means next to nothing if the model is wrong. As David Freedman writes in Statistical Models and Causal Inference:

I believe model validation to be a central issue. Of course, many of my colleagues will be found to disagree. For them, fitting models to data, computing standard errors, and performing significance tests is “informative,” even though the basic statistical assumptions (linearity, independence of errors, etc.) cannot be validated. This position seems indefensible, nor are the consequences trivial. Perhaps it is time to reconsider.

1. I can’t imagine wanting to calculate the odds a null is false by (1-p)/p. Very strange, and certainly not warranted by frequentists or Bayesians so far as I know.

Putting that entirely to one side, severity requires considering magnitudes indicated. Here, you might look at the lower bound: mu > 75 - 2SD or the like. But, like Fisher, we would never consider a single result, and like Freedman, we would always test assumptions.

• I appreciate you commenting on my article, Deborah, so let me just make some short remarks:
1) It’s pretty clear from the setting of my example that we’re in a policy situation, and so people do have to make decisions and act on them [“forever undecided” or waiting for “the long run” or “hypothetical parallel universe” won’t do (I do think “Student” had a point here …)], in that context you may very well have to act on your odds ratios.
2) When you write “WE” I have no problem as long as everyone is aware of it, including people like you and Aris and other “error statistical philosophy” statisticians. Fine with me, because compared to how statistical significance testing is practised out there in the social sciences you have a much more ambitious “program” and tougher demands on “severity” etc. But, nota bene, that is not what the other “WE” in the social sciences do. Not yet at least …
3) I DO think Deirdre and Stephen (not to talk of the dozens of earlier critics) have a couple of really good points, but I also think (like e.g. Thomas Mayer) that they overreach [and too often write in a style – “rhetorical”? – I don’t particularly appreciate in an academic context (blogposts, goes without saying, are something completely different … )]. My own position is leaning more to the less exciting but “sober” view of Olle Häggström (whom I cite in the article). It’s a question of balance, but I think we still see too much of a one-eyed focus on traditional simple-minded significance testing in the social sciences.

• Lars: you haven’t even computed the p-value correctly, if I’m understanding you. If the null is 100 and the observed value being used is 75, the numerator would be 75 – 100 (observed – expected), so the p-value is over .5.

• Deborah, I don’t follow you, to be honest. Since I did flag that normality conditions are assumed to apply, the one-tailed p-values of (100-75)/20 and (75-100)/20 are the same, and equal to 0.10565.

• No, if this is a one tailed test looking for positive discrepancies from mu = 100, then the p-value is .89.

• Corresponding to the observed value of the test statistic, the p-value is the LOWEST level of significance at which the null hypothesis can be rejected.
So, please Deborah, what’s the problem?
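
The two figures quoted in this exchange, roughly .11 and .89, are the two tails of the same standardized statistic, and which one counts as the p-value depends on the direction of the alternative. The sketch below assumes the normal approximation used in the post; `normal_cdf` and the variable names are illustrative helpers, not anything taken from the thread.

```python
import math


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


# Observed minus expected, divided by the standard error.
z = (75.0 - 100.0) / 20.0

# Alternative "true gain is below 100": small values of z count as evidence.
p_lower_tail = normal_cdf(z)

# Alternative "true gain is above 100": large values of z count as evidence.
p_upper_tail = 1.0 - normal_cdf(z)

# Symmetry of the normal curve: the upper tail beyond +1.25 has the
# same area as the lower tail beyond -1.25.
p_upper_tail_at_positive_z = 1.0 - normal_cdf(-z)

# Two-sided version, for comparison.
p_two_sided = 2.0 * min(p_lower_tail, p_upper_tail)

print(f"lower tail: {p_lower_tail:.5f}")
print(f"upper tail: {p_upper_tail:.5f}")
print(f"upper tail at +1.25: {p_upper_tail_at_positive_z:.5f}")
print(f"two-sided: {p_two_sided:.5f}")
```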

2. Lars; you write that “the test has shown that it is likely – the odds are 0.89/0.11 or 8-to-1 – that the hypothesis is false”. Writing about the odds of a hypothesis being false is explicitly a Bayesian concept, and you haven’t given a prior… so something is not right here. (I appreciate that it may just be a semantic slip.)

If you want to (loosely) connect p-values to quantities that appear in Bayesian measures of support for the null hypothesis, check out J Berger and T Sellke 1987, or G Casella and R Berger 1987. There are also other Bayesian interpretations of p-values, but they have little to do with measuring support for the null.

Finally, I think you may be overstating the reliance on assumptions here. Independence of observations is important – this is typically guaranteed by the study design. And the test is just comparing means in two populations, so “linearity” is trivially true. In fact, provided we allow for non-constant variance (i.e. the unequal-variance t-test) and have a moderate sample size then standard analyses are very robust. Normal sampling distributions are really not required; the Central Limit Theorem is what really does the work.
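
The claim about the Central Limit Theorem can be checked informally by simulation. The sketch below makes several assumptions that are not in the discussion above: a skewed exponential population with mean 1, a sample size of 50, a known standard deviation rather than an estimated one, and a fixed random seed. It compares how often the standardized sample mean falls at or below -1.25 with the normal-theory value of about 0.106.

```python
import math
import random


def normal_cdf(x: float) -> float:
    """Standard normal CDF, written with the complementary error function."""
    return 0.5 * math.erfc(-x / math.sqrt(2.0))


def lower_tail_frequency(sample_size: int, replications: int,
                         cutoff: float, seed: int = 0) -> float:
    """Share of standardized sample means at or below `cutoff`.

    Samples are drawn from an Exponential(1) population, which is
    clearly non-normal and has mean 1 and standard deviation 1.
    """
    rng = random.Random(seed)
    standard_error = 1.0 / math.sqrt(sample_size)
    hits = 0
    for _ in range(replications):
        draws = [rng.expovariate(1.0) for _ in range(sample_size)]
        sample_mean = sum(draws) / sample_size
        if (sample_mean - 1.0) / standard_error <= cutoff:
            hits += 1
    return hits / replications


simulated = lower_tail_frequency(sample_size=50, replications=10_000, cutoff=-1.25)
nominal = normal_cdf(-1.25)

print(f"simulated tail frequency: {simulated:.3f}")
print(f"normal-theory tail area:  {nominal:.3f}")
```

With a population this skewed the two numbers should be close but not identical, and the gap would be expected to shrink as `sample_size` grows. The simulation speaks only to the normality assumption; it says nothing about independence or about whether the model is otherwise appropriate.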