On several occasions I have been approached by students and researchers to “help them get a p-value for their results”. They believed that the purpose of an analysis was to dig this value out of the data; otherwise the analysis would be incomplete.

Sir Ronald Fisher, a British statistician and geneticist, introduced the p-value in 1925. This was around the time he was developing computational methods for analyzing data from his balanced experimental designs. He collected this work in his first book, Statistical Methods for Research Workers. The book went through many editions and translations over time, and later became the standard reference work for scientists in many disciplines.

At the time he suggested 0.05 as a convenient reference point for rejecting a null hypothesis, but not as a sharp cut-off. Fisher’s philosophy of significance testing interpreted the p-value as a measure of evidence from a single experiment. As a measure of evidence, the p-value was meant to be combined with other sources of information. Thus, there was no set threshold for “significance” (Fisher, 1973).

P-values have since been widely misunderstood in many of the circles where they are reported. Goodman’s article on the misinterpretation of p-values lists some of these misconceptions. In brief, the article explains that a p-value of, say, 0.05 does not mean that: there is only a 5% chance that the null hypothesis is true; there is a 5% chance of a Type I error (i.e. a false positive); there is a 95% chance that the results would replicate if the study were repeated; there is no difference between groups; or that you have proved your experimental hypothesis.

A p-value should be interpreted as: the probability of getting the results you have observed, or more extreme results, given that the null hypothesis is true. This might still not be clear, so let’s use the usual coin-toss example found in introductory probability lessons.

Suppose we toss a fair coin 20 times and observe the number of heads that come up. We would expect to obtain 10 heads in our experiment. This is because, for a fair coin, the probability of landing heads is 0.5, so the expected number of heads is 20*0.5=10.

Now let’s experiment with a coin with an unknown probability of landing heads. Our aim in the experiment is to quantify the evidence against our null hypothesis that the coin is fair. In our experiment the coin lands heads on 16 out of 20 tosses.

How do we interpret this result? Is it unusual, given that we were expecting about 10 heads? Let’s calculate a p-value. Remember that the p-value is the probability of getting the observed results (16 heads) or more extreme results (17, 18, 19, or 20 heads) if our null hypothesis is true: that the coin is fair. Treating each toss as a Bernoulli trial, we can obtain the probability that in 20 tosses we get x heads (x = 16, ..., 20) using the binomial distribution:

P\left(X \geq 16\right) = \sum_{x=16}^{20} \binom{20}{x} p^{x} \left(1-p\right)^{20-x}


where p is the probability of heads on each toss, which equals 0.5 under the null hypothesis.
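The sum above can be computed directly. Here is a minimal Python sketch (the function name `binom_pvalue` is my own, chosen for illustration) that evaluates each binomial term with `math.comb` and adds them up:

```python
from math import comb

def binom_pvalue(heads, tosses, p=0.5):
    """One-sided p-value: P(X >= heads) for X ~ Binomial(tosses, p)."""
    return sum(comb(tosses, x) * p**x * (1 - p)**(tosses - x)
               for x in range(heads, tosses + 1))

pval = binom_pvalue(16, 20)
print(round(pval, 4))  # → 0.0059
```

The same value is available from standard statistical libraries (for example, an exact one-sided binomial test), but the direct sum makes the definition transparent.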

The p-value obtained is 0.0059. This could mean either that an unlikely event occurred (a fair coin landing heads 16 times out of 20) or that the coin is not fair. However, the p-value does not tell us which of the two it is. Many people conclude that such an unlikely event suggests that the coin is not fair, rejecting the null hypothesis that the coin is fair, but do not recognize that the first possibility remains. So, I have heard statements like ‘the p-value was <0.05, which proves that the null hypothesis is false’ or ‘the p-value was >0.05, therefore we accept the null hypothesis’. This is where the misinterpretation comes in.
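One way to build intuition for what the p-value measures is to simulate the null hypothesis directly: repeatedly toss a fair coin 20 times and record how often 16 or more heads appear. A small sketch (the simulation size and seed are arbitrary choices of mine):

```python
import random

random.seed(1)

# Simulate many runs of 20 fair-coin tosses and count how often
# we observe a result at least as extreme as 16 heads.
n_sims = 200_000
extreme = sum(
    1 for _ in range(n_sims)
    if sum(random.random() < 0.5 for _ in range(20)) >= 16
)
print(extreme / n_sims)  # should be close to the exact p-value, 0.0059
```

The simulated proportion hovers around 0.006, illustrating that the p-value is a statement about how the data would behave under the null hypothesis, not about the probability that the null hypothesis itself is true.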

What are your views or questions on the p-value?