Resisting the ‘statistical significance testing’ temptation

18 Sep, 2023 at 10:22 | Posted in Statistics & Econometrics | 3 Comments

Imagine a dictator “game” in which a mixed-sex group of experimental subjects are used as first players who can decide which share of their initial endowment they give to a second player (one person acts as second player for the whole group). Additionally, assume that the experimental subjects are a convenience sample but not a random sample of a well-defined broader population. What kind of statistical inferences are possible? Neither of the two chance mechanisms – random sampling or randomization – applies. Consequently, there is no role for the p-value …

Due to engrained disciplinary habits, researchers might be tempted to implement “statistical significance testing” routines in our dictator game example even though there is no chance model upon which to base statistical inference. While there is no random process, implementing a two-sample t-test might be the spontaneous reflex to find out whether there is a “statistically significant” difference between the two sexes. One should recognize, however, that doing so would require that some notion of a random mechanism is accepted. In our case, this would require imagining a randomization distribution that would result if money amounts were repeatedly assigned to sexes (“treatments”) at random. Our question would be whether the money amounts transferred to the second player differed more between the sexes than what would be expected in the case of such a random assignment. We must realize, however, that there was no random assignment of subjects to treatments, i.e. the sexes might not be independent of covariates. Therefore, the p-value based on a two-sample t-test for a difference in means does not address the question of whether the difference in the average transferred money is caused by the subjects’ being male or female. That could be the case, but the difference could also be due to other reasons such as female subjects being less or more wealthy than male subjects. As stated above, it would therefore make sense to control for known confounders in a regression analysis ex post – again, without reference to a p-value as long as the experimental subjects have not been recruited through random sampling.

Norbert Hirschauer et al.
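
To make concrete what kind of chance model such a test would have to presuppose, here is a minimal Python sketch of the randomization (permutation) calculation the quoted passage describes, using invented money amounts and group sizes. It imagines the amounts being repeatedly reassigned to the two sexes at random and compares the observed mean difference with that shuffling distribution.

```python
import numpy as np

rng = np.random.default_rng(0)

# Invented money amounts transferred by the female and male subjects
female = np.array([3.0, 5.0, 2.5, 4.0, 6.0, 3.5])
male = np.array([2.0, 4.5, 1.5, 3.0, 2.5, 5.0])

observed_diff = female.mean() - male.mean()

# Imagined randomization distribution: repeatedly reassign the amounts
# to the two "treatments" (sexes) at random and recompute the difference.
pooled = np.concatenate([female, male])
n_female = len(female)
perm_diffs = []
for _ in range(10_000):
    shuffled = rng.permutation(pooled)
    perm_diffs.append(shuffled[:n_female].mean() - shuffled[n_female:].mean())

# Two-sided "p-value" relative to that imagined random-assignment process
p_value = np.mean(np.abs(perm_diffs) >= abs(observed_diff))
print(f"observed difference: {observed_diff:.2f}, permutation p: {p_value:.3f}")
```

Nothing in this arithmetic checks whether the subjects were in fact randomly assigned or randomly sampled; that assumption has to be supplied by the design, not by the computation, which is exactly Hirschauer et al.’s point.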

As shown over and over again when significance tests are applied, people have a tendency to read ‘not disconfirmed’ as ‘probably confirmed.’ Standard scientific methodology tells us that when there is only, say, a 10 % probability that pure sampling error could account for the observed difference between the data and the null hypothesis, it would be more ‘reasonable’ to conclude that we have a case of disconfirmation. Especially if we perform many independent tests of our hypothesis and they all give about the same 10 % result as our reported one, I guess most researchers would count the hypothesis as even more disconfirmed.
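
One conventional way to formalize that last intuition is Fisher’s method for combining independent p-values. The sketch below uses five invented p-values of roughly 10 %; assuming the tests really are independent, their combined evidence against the null is far stronger than any single one of them.

```python
from scipy.stats import combine_pvalues

# Five invented independent tests, each coming out at roughly 10 %
p_values = [0.10, 0.11, 0.09, 0.10, 0.12]

# Fisher's method: -2 * sum(log p_i) is chi-squared with 2k degrees of freedom
statistic, combined_p = combine_pvalues(p_values, method="fisher")
print(f"chi-squared = {statistic:.1f}, combined p = {combined_p:.4f}")
```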

We should never forget that the underlying parameters we use when performing significance tests are model constructions. Our p-values mean nothing if the model is wrong. And most importantly — statistical significance tests DO NOT validate models!

In journal articles, a typical regression equation will have an intercept and several explanatory variables. The regression output will usually include an F-test, with p – 1 degrees of freedom in the numerator and n – p in the denominator. The null hypothesis will not be stated. The missing null hypothesis is that all the coefficients vanish, except the intercept.

If F is significant, that is often thought to validate the model. Mistake. The F-test takes the model as given. Significance only means this: if the model is right and the coefficients are 0, it is very unlikely to get such a big F-statistic. Logically, there are three possibilities on the table:
i) An unlikely event occurred.
ii) Or the model is right and some of the coefficients differ from 0.
iii) Or the model is wrong.

So?

David Freedman
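
Freedman’s third possibility is easy to demonstrate with a small simulation (a sketch with an invented data-generating process, using Python and statsmodels): the true relationship below is deliberately quadratic, yet the linear regression fitted to it returns a huge, highly ‘significant’ F-statistic, reported with the p – 1 and n – p degrees of freedom mentioned above.

```python
import numpy as np
import statsmodels.api as sm

rng = np.random.default_rng(1)

# Invented data generator: the true relation is quadratic in x
n = 200
x = rng.uniform(0, 3, n)
y = 1.0 + 0.5 * x**2 + rng.normal(scale=1.0, size=n)

# Fit a mis-specified linear model: y = b0 + b1*x + error
X = sm.add_constant(x)
fit = sm.OLS(y, X).fit()

# The F-test takes this linear specification as given and tests the null
# that every coefficient except the intercept is zero.
print(f"F({fit.df_model:.0f}, {fit.df_resid:.0f}) = {fit.fvalue:.1f}, "
      f"p = {fit.f_pvalue:.2g}")
```

A tiny p-value here is fully compatible with possibility (iii): the fitted model omits the curvature and is simply wrong, and the F-test cannot tell us so, because it takes the model as given.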

3 Comments

  1. As you know, I am an opponent of “statistical significance”: I regard it as an antiquated concept long overdue for replacement with more accurate terms, concepts, and interpretations for neutral statistical outputs such as P-values. One of the problems facing this reform program is that tradition has misidentified P-values with “statistical significance”, thanks in no small measure to the pernicious British tradition of calling P-values “significance levels” (a poor choice of terms which under Fisher’s influence became common in the 1920s at the same time the alternate term “P-value” began to appear – although “value of P” was already being used by Karl Pearson in 1900 and was sometimes used by Fisher as well).

    If one carefully separates the concepts one can find that it is wrong or at least confusing to claim “Our P-values mean nothing if the model is wrong”: On the contrary, the smaller the P-value the greater the indication or evidence that the model from which it was computed (the target model of the P-value) is wrong in some way. Thus, given that a central modeling task is to identify the ways in which our model is wrong, a large P-value means only that the P-value contains little information about the model. Conversely, we get the most information from a P-value precisely when it is small, for then it is indicating that the actual data generator deviates from the target model in a direction to which the P-value is sensitive (whether that deviation matters in practical terms depends on much more detailed statistical and contextual information).

    Even Freedman was confusing on this issue, as seen in the above quote in which he fails to clearly distinguish the model with no coefficient constraints from the target model with zero for all non-intercept coefficients (from which the P-value at issue was computed). This sort of confusion is aggravated in “significance testing” by the degradation of P-values to binary indicators for hypotheses embedded in highly restrictive models and evaluated only by repeated-sampling criteria. I see that tradition as one source of the deficiency of Freedman’s comment: he fails to note that a small P-value tells us the data are far from what its target model would have us expect – which was at the very core of Pearson’s rationale for considering “the value of P” computed from his chi-squared statistic.

    I have written many articles on these points, most recently a discussion paper in the Scandinavian Journal of Statistics, “Divergence vs. decision P-values: A distinction worth making in theory and keeping in practice” in issue 1 of 2023, with rejoinder in issue 3, “Connecting simple and precise P-values to complex and ambiguous realities”,
    which is followed by the discussant comments. On request I am happy to supply e-copies of the entire set, along with an errata sheet; my e-mail is lesdomes at ucla dot edu.

    • As always a pleasure to read your sharp and erudite comments, Sander. I have downloaded your discussion paper and look forward to reading it. David Freedman is probably the one statistician who has influenced my own thinking on (how to apply) statistics the most. If it is as you say, that he is “confusing” here, you sure have found something I have never been able to detect in his writings. Interesting!

      • Thanks Lars!

        Here are two small corrections to the print version of the main SJoS paper:
        p. 70, first line of Section 4.4: “Continuing the example, the Hodges and Lehmann (1954) UMPU decision P-value p_HL for H…” should be
        “Continuing the example, for μ̂ exterior to the open interval (m_L, m_U) the Hodges and Lehmann (1954) UMPU decision P-value p_HL for H…”
        p. 71, line 12: “α ≥ .16” should be “α < .16”.

        As did Fisher and many others whose perspectives and contributions have been of immense value to us, Freedman had his blind spots and confusions – he was after all only human, and was very much a product of his time. The confusion displayed above I attribute to his locking into a pure frequentist framework when interpreting statistics, as was common among American statisticians in the mid-20th century (but with profound exceptions such as Mosteller and Tukey, who took a more visual approach to statistics).

        Another common problem he shared was confusing studies and their data with observer interpretations of studies and data. As I like to say, "data say nothing at all – if you hear the data speaking, seek psychiatric care immediately". An example: Freedman repeated the near-catechism that observational studies of health effects of diet and nutrients were contradicted by large trials. That's nonsense sold as "science": The contradictions were between the excessively certain and overly broad conclusions of those reporting on the observational studies and those reporting on the trials – thanks largely to both sides overlooking the profound limitations and red flags in all the relevant studies, including the animal experiments. Their blindnesses synergized with some very naive use of "statistical significance" or lack thereof to make sweeping conclusions – a practice vociferously defended to this day by those who reify statistical models and prioritize deductions from them over facts about the actual data-generating mechanisms.

