## Monte Carlo simulation and p values (wonkish)

28 January, 2013 at 11:45 | Posted in Statistics & Econometrics | 9 Comments

In many social sciences, p values and null hypothesis significance testing (NHST) are often used to draw far-reaching scientific conclusions – despite the fact that they are, as a rule, poorly understood and that there exist alternatives that are easier to understand and more informative.

Not least, using confidence intervals (CIs) and effect sizes is to be preferred to the Neyman–Pearson–Fisher mishmash approach so often practised by applied researchers.

Running a Monte Carlo simulation with 100 replications of a fictitious sample – N = 20, 95% confidence intervals, a normally distributed population with mean = 10 and standard deviation = 20, taking two-tailed p values on a zero null hypothesis – we get varying CIs (since they are based on varying sample standard deviations). But with a minimum of 3.2 and a maximum of 26.1, they still give a clear picture of what would happen in an infinite limit sequence. The p values, on the other hand (even though in a purely mathematical-statistical sense they are more or less equivalent to CIs), vary strongly from sample to sample, and jumping around between a minimum of 0.007 and a maximum of 0.999 don’t give you a clue of what will happen in an infinite limit sequence! So I can’t but agree with Geoff Cumming:

The problems are so severe we need to shift as much as possible from NHST … The first shift should be to estimation: report and interpret effect sizes and CIs … I suggest p should be given only a marginal role, its problem explained, and it should be interpreted primarily as an indicator of where the 95% CI falls in relation to a null hypothesised value.

Geoff Cumming

[In case you want to do your own Monte Carlo simulation, here’s an example using Gretl:

```
nulldata 20
loop 100 --progressive
    series y = normal(10,15)
    scalar df = $nobs - 1
    scalar ybar = mean(y)
    scalar ysd = sd(y)
    scalar ybarsd = ysd/sqrt($nobs)
    scalar tstat = (ybar-10)/ybarsd
    pvalue t df tstat
    scalar lowb = ybar - critical(t,df,0.025)*ybarsd
    scalar uppb = ybar + critical(t,df,0.025)*ybarsd
    scalar pval = pvalue(t,df,tstat)
    store E:\pvalcoeff.gdt lowb uppb pval
endloop
```
]
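[For readers without Gretl, here is a rough Python sketch of the same exercise – my own translation, not the original script, with numpy and scipy assumed and an arbitrary seed. Like the Gretl code, it draws from normal(10, 15) and tests the null μ = 10:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(42)      # arbitrary seed, for reproducibility
N, REPS, MU, SD = 20, 100, 10, 15    # as in the Gretl script: normal(10,15)

lows, upps, pvals = [], [], []
tcrit = stats.t.ppf(0.975, df=N - 1)          # two-tailed 5% critical value
for _ in range(REPS):
    y = rng.normal(MU, SD, N)
    ybar = y.mean()
    se = y.std(ddof=1) / np.sqrt(N)           # estimated standard error
    tstat = (ybar - MU) / se                  # H0: mu = 10, as in the script
    pvals.append(2 * stats.t.sf(abs(tstat), df=N - 1))
    lows.append(ybar - tcrit * se)            # 95% CI bounds
    upps.append(ybar + tcrit * se)

print(f"p-values range from {min(pvals):.3f} to {max(pvals):.3f}")
print(f"CI lower bounds range from {min(lows):.1f} to {max(lows):.1f}")
```
]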

1. “p values … vary strongly from sample to sample, and jumping around between a minimum of 0.007 and a maximum of 0.999 don’t give you a clue of what will happen in an infinite limit sequence!”

Well, we actually do know. We have a uniform distribution from 0 to 1. Don’t we?

• Yeah, and that’s extremely helpful when testing null hypotheses …

• Well, I don’t see why looking at a collection of p-values getting closer and closer to a uniform distribution on (0,1) is a less clear picture of what is going to happen “in an infinite limit sequence” than a dancing collection of CIs.

I don’t see why this is a problem, let alone a problem that leads to “… shift as much as possible from NHST”.

CIs and effect sizes give us very useful information, but p-values beautifully summarize experiments. (Try to imagine using CIs in an experiment with hundreds of variables involved.)
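The uniformity claim is easy to check numerically. A sketch of my own (numpy and scipy assumed, arbitrary seed): repeatedly test a null that is actually true and collect the p-values – they come out roughly uniform on (0, 1):

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(1)        # arbitrary seed
N, REPS, MU, SD = 20, 10_000, 10, 20

# Test the TRUE null H0: mu = 10 over and over, collecting the p-values.
pvals = np.array([
    stats.ttest_1samp(rng.normal(MU, SD, N), popmean=MU).pvalue
    for _ in range(REPS)
])

# Under a true null the p-value is uniform on (0, 1):
print(pvals.mean())           # close to 0.5
print((pvals < 0.05).mean())  # close to 0.05, the nominal error rate
```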

2. And it’s even worse when you test a hypothesis which is actually true. In that case, p-values will be distributed uniformly between 0 and 1 – all over the place, indeed!

More seriously: p-values are functions of the data, and thus random. When testing a hypothesis that is actually false, their distribution will shift towards zero (i.e. towards the rejection region) in a way that depends on the power of the test. Simulating your particular example, I find that the p-value is less than 0.05 about 55% of the time, and less than 0.1 around 70% of the time. While such power may seem low (no surprise, with just 20 observations), it’s hardly the same as “not having a clue”.
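The power figures the comment reports can be reproduced with a short sketch of my own (numpy and scipy assumed, arbitrary seed): the same N = 20 sample with true mean 10 and standard deviation 20, now tested against the false null H0: μ = 0:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(2)        # arbitrary seed
N, REPS, MU, SD = 20, 10_000, 10, 20

# Now the null H0: mu = 0 is FALSE (the true mean is 10),
# so the p-values pile up near zero instead of spreading uniformly.
pvals = np.array([
    stats.ttest_1samp(rng.normal(MU, SD, N), popmean=0).pvalue
    for _ in range(REPS)
])

print((pvals < 0.05).mean())  # roughly 0.55 -- the power of the test
print((pvals < 0.10).mean())  # roughly 0.70
```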

• Comparing with the transparent CIs, I actually think my formulation is justifiable.

3. Something must definitely be wrong when the p value is > 5% and the CI does not contain zero.

You will need to recompute this, Lars.

• Really, Pontus? The CIs jump around the population mean and – with alpha = 0.05 – in the long run 95% would include μ = 10 and 5% would miss it (this jumping around over replications is well known and not really the issue in my article).

• If you have a symmetric distribution of errors, a p-value larger than 0.05 also means that the confidence bounds will include the H0 value. Thus, inasmuch as your p-value jumps above 0.05, your CIs must include the zero. A minimum of 3.2 must therefore be impossible and the result of an error in your program.

• Pontus, since I obviously can’t convince you, have a look at Geoff’s article
(http://www.stat.auckland.ac.nz/~iase/publications/icots8/ICOTS8_8J4_CUMMING.pdf)!
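The disagreement above turns on the duality between tests and CIs, which can be checked numerically. A sketch of my own (numpy and scipy assumed, arbitrary seed), using the null μ = 10 as in the Gretl script: for every sample, the two-sided p-value exceeds 0.05 exactly when the 95% CI contains the hypothesised mean:

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(3)  # arbitrary seed
N, MU0 = 20, 10                 # hypothesised mean, as in the Gretl script

# Test/CI duality: the two-sided p-value exceeds 0.05 precisely
# when the 95% CI contains the hypothesised mean.
for _ in range(1000):
    y = rng.normal(10, 20, N)
    p = stats.ttest_1samp(y, popmean=MU0).pvalue
    se = y.std(ddof=1) / np.sqrt(N)
    lo, hi = stats.t.interval(0.95, N - 1, loc=y.mean(), scale=se)
    assert (p > 0.05) == (lo <= MU0 <= hi)

print("duality holds in all 1000 replications")
```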