On significance and model validation
22 January, 2013 at 12:53 | Posted in Statistics & Econometrics
Let us suppose that we, as educational reformers, have a hypothesis that implementing a voucher system would raise mean test results by 100 points (the null hypothesis). Instead, when sampling, it turns out that it raises them by only 75 points, with a standard error (telling us how much the mean varies from one sample to another) of 20.
Does this imply that the data do not disconfirm the hypothesis? Given the usual normality assumptions on sampling distributions, with a t-value of 1.25 [(100-75)/20], the one-tailed p-value is approximately 0.11. Thus, approximately 11% of the time we would expect a score this low or lower if we were sampling from this voucher system population. That means that, using the ordinary 5% significance level, we would not reject the null hypothesis, although the test has shown that it is likely (the odds are 0.89/0.11, or about 8-to-1) that the hypothesis is false.
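The arithmetic above can be checked with a few lines of Python. This is only a sketch under the same normal approximation used in the text. The variable names are mine, and only the standard library is assumed. Without rounding, the p-value comes out at roughly 0.106 and the ratio (1 - p)/p at roughly 8.5, in line with the rounded figures of 0.11 and 8-to-1.

```python
from statistics import NormalDist

# Figures taken from the voucher example in the text.
hypothesized_gain = 100.0  # gain claimed by the null hypothesis
observed_gain = 75.0       # gain actually seen in the sample
standard_error = 20.0      # standard error of the sample mean

# Distance between hypothesis and observation, in standard errors.
t_value = (hypothesized_gain - observed_gain) / standard_error

# One-tailed p-value: probability of a result this low or lower
# if the null hypothesis were true (normal approximation).
p_value = NormalDist().cdf(-t_value)

# The ratio quoted in the text as "8-to-1".
odds_ratio = (1 - p_value) / p_value

print(f"t-value: {t_value:.2f}")
print(f"one-tailed p-value: {p_value:.3f}")
print(f"(1 - p) / p: {odds_ratio:.1f}")
```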
In its standard form, a significance test is not the kind of “severe test” that we are looking for when we seek to confirm or disconfirm empirical scientific hypotheses. This is problematic for many reasons, one being that there is a strong tendency to accept the null hypothesis simply because it can’t be rejected at the standard 5% significance level. In their standard form, significance tests are biased against new hypotheses because they make it hard to disconfirm the null hypothesis.
And as shown over and over again when significance tests are applied, people have a tendency to read “not disconfirmed” as “probably confirmed.” But looking at our example, standard scientific methodology tells us that since there is only an 11% probability that pure sampling error could account for the observed difference between the data and the null hypothesis, it would be more “reasonable” to conclude that we have a case of disconfirmation. Especially if we perform many independent tests of our hypothesis and they all give about the same result as our reported one, I guess most researchers would count the hypothesis as even more disconfirmed.
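The intuition about repeated tests can be made concrete. The sketch below uses Fisher's method for combining p-values, which is my choice of illustration and not something the argument above depends on. It assumes the tests are truly independent and all one-tailed in the same direction, and the numbers of tests are hypothetical. With each test reporting p = 0.11, no single test rejects at the 5% level, but the combined p-value falls below 0.05 after a handful of replications.

```python
import math

def fisher_combined_p(p_values):
    """Combine independent p-values with Fisher's method.

    Under the null, -2 * sum(ln p_i) follows a chi-square
    distribution with 2k degrees of freedom (k = number of tests).
    """
    k = len(p_values)
    half_stat = -sum(math.log(p) for p in p_values)  # statistic / 2
    # Closed-form upper tail of a chi-square with even df = 2k:
    # exp(-x/2) * sum_{i=0}^{k-1} (x/2)^i / i!
    tail_sum = sum(half_stat**i / math.factorial(i) for i in range(k))
    return math.exp(-half_stat) * tail_sum

# Hypothetical replications, each giving the same p-value as in the text.
for n_tests in (1, 2, 3, 5):
    combined = fisher_combined_p([0.11] * n_tests)
    print(f"{n_tests} test(s): combined p = {combined:.4f}")
```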
And, most importantly, of course we should never forget that the underlying parameters we use when performing significance tests are model constructions. Our p-value of 0.11 means next to nothing if the model is wrong. As David Freedman writes in Statistical Models and Causal Inference:
I believe model validation to be a central issue. Of course, many of my colleagues will be found to disagree. For them, fitting models to data, computing standard errors, and performing significance tests is “informative,” even though the basic statistical assumptions (linearity, independence of errors, etc.) cannot be validated. This position seems indefensible, nor are the consequences trivial. Perhaps it is time to reconsider.
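A small sketch shows how much the p-value leans on the model. The standard error of 20 in the voucher example is itself a product of assumptions such as independence of errors. The alternative standard errors of 10 and 30 below are made up purely for illustration, and the function name is mine. If the assumptions fail and the true standard error differs, the same observed gain of 75 points yields a very different verdict.

```python
from statistics import NormalDist

def one_tailed_p(hypothesized, observed, standard_error):
    """Lower-tail p-value under a normal sampling distribution."""
    z = (observed - hypothesized) / standard_error
    return NormalDist().cdf(z)

# 20 is the standard error from the text; 10 and 30 are hypothetical
# values standing in for a misspecified model.
for se in (10.0, 20.0, 30.0):
    p = one_tailed_p(hypothesized=100.0, observed=75.0, standard_error=se)
    print(f"standard error {se:>4.0f}: one-tailed p = {p:.3f}")
```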