NOTE TO READERS: The following has been excerpted from Principles and Practice of Structural Equation Modeling (Third Edition) by Rex B. Kline. My contribution was limited to some small editing solely for this post. – Kevin Gray
There is ample evidence that many of us do not know the correct interpretation of outcomes of statistical tests, or p values. For example, at the end of a standard statistics course, most students know how to calculate statistical tests, but they do not typically understand what the results mean (Haller & Krauss, 2002). About 80% of psychology professors endorse at least one incorrect interpretation of statistical tests (Oakes, 1986). It is easy to find similar misinterpretations in books and articles (Cohen, 1994), so it seems that psychology students get their false beliefs from teachers and also from what students read. However, the situation is no better in other behavioral science disciplines (e.g., Hubbard & Armstrong, 2006).
Most misunderstandings about statistical tests involve overinterpretation, or the tendency to see too much meaning in statistical significance. Specifically, we tend to believe that statistical tests tell us what we want to know, but this is wishful thinking. Elsewhere I described statistical tests as a kind of collective Rorschach inkblot test for the behavioral sciences in that what we see in them has more to do with fantasy than with what is really there (Kline, 2004). Such wishful thinking is so pervasive that one could argue that much of our practice of hypothesis testing based on statistical tests is myth.
In order to better understand misinterpretations of p values, let us first deal with their correct meaning. Here it helps to adopt a frequentist perspective where probability is seen as the likelihood of an outcome over repeatable events under constant conditions except for chance (sampling error). From this view, a probability does not apply directly to a single, discrete event. Instead, probability is based on the expected relative frequency over a large number of trials, or in the long run. Also, there is no probability associated with whether or not a particular guess is correct in a frequentist perspective. The following mental exercises illustrate this point:
1. A die is thrown, and the outcome is a 2. What is the probability that this particular result is due to chance? The correct answer is not p = 1/6, or .17. This is because the probability .17 applies only in the long run to repeated throws of the die. In this case, we expect that .17 of the outcomes will be a 2. The probability that any particular outcome of the roll of a die is the result of chance is actually p = 1.00.
2. One person thinks of a number from 1 to 10. A second person guesses that number by saying, 6. What is the probability that the second person guessed right? The correct answer is not p = 1/10, or .10. This is because the particular guess of 6 is either correct or incorrect, so no probability (other than 0 for “wrong” or 1.00 for “right”) is associated with it. The probability .10 applies only in the long run after many repetitions of this game. That is, the second person should be correct about 10% of the time over all trials.
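The long-run reading of probability in these two exercises can be checked with a short simulation. What follows is only a sketch: the function name, the seed, and the numbers of throws are illustrative choices, not part of the original text. It shows the relative frequency of a 2 settling near 1/6 as the number of throws grows, which says nothing about any single throw.

```python
import random

def long_run_frequency(face, n_throws, seed=0):
    """Relative frequency of `face` over n_throws of a fair six-sided die."""
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    hits = sum(1 for _ in range(n_throws) if rng.randint(1, 6) == face)
    return hits / n_throws

# The relative frequency is only meaningful over many throws.
for n in (10, 1_000, 100_000):
    print(n, round(long_run_frequency(2, n), 3))
```

With only 10 throws the relative frequency can be far from .17; it is over many repetitions that the value 1/6 describes what happens.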
Let us now review the correct interpretation of statistical significance. You should know that the abbreviation p actually stands for the conditional relative-frequency probability, the likelihood of a sample result or one even more extreme (a range of results) assuming that the null hypothesis is true, the sampling method is random sampling, and all other assumptions for the corresponding test statistic, such as the normality requirement of the t-test, are tenable. Two correct interpretations for the specific case p < .05 are given next. Other correct definitions are probably just variations of the ones that follow:
1. Assuming that H0 is true (i.e., every result happens by chance) and the study is repeated many times by drawing random samples from the same population, less than 5% of these results will be even more inconsistent with H0 than the particular result observed in the researcher’s sample.
2. Less than 5% of test statistics from random samples are further away from the mean of the sampling distribution under H0 than the one for the observed result. That is, the odds are less than 1 to 19 of getting a result from a random sample even more extreme than the observed one.
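Both interpretations describe a relative frequency over repeated random samples drawn while H0 is true, and that can be imitated directly. The sketch below makes several assumptions that are not in the original: a z test with a known population standard deviation of 1, a sample size of 25, and a hypothetical observed test statistic of z = 2.25. It compares the analytic two-sided p value with the fraction of simulated samples under H0 that are at least as extreme.

```python
import math
import random
from statistics import NormalDist

def two_sided_p(z):
    """Analytic two-sided p value for a z statistic under H0."""
    return 2.0 * (1.0 - NormalDist().cdf(abs(z)))

def simulated_tail_fraction(z_obs, n, reps, seed=1):
    """Fraction of random samples drawn under H0 (mean 0, sd 1)
    whose z statistic is at least as extreme as z_obs."""
    rng = random.Random(seed)
    more_extreme = 0
    for _ in range(reps):
        sample_mean = sum(rng.gauss(0.0, 1.0) for _ in range(n)) / n
        z = sample_mean * math.sqrt(n)  # sd is assumed known and equal to 1
        if abs(z) >= abs(z_obs):
            more_extreme += 1
    return more_extreme / reps

analytic = two_sided_p(2.25)  # hypothetical observed result
simulated = simulated_tail_fraction(2.25, n=25, reps=20_000)
print(round(analytic, 4), round(simulated, 4))
```

The two numbers should agree closely, because p is nothing more than this long-run proportion computed on the assumption that H0 is true and that sampling is random.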
Described next are what I refer to as the “Big Five” false beliefs about p values. Three of the beliefs concern misinterpretation of p, but two concern misinterpretations of its complement, or 1 – p. Approximate base rates for some of these beliefs, reported by Oakes (1986) and Haller and Krauss (2002) in samples of psychology students and professors, are reported beginning in the next paragraph. What I believe is the biggest of the Big Five is the odds-against-chance fallacy, or the false belief that p indicates the probability that a result happened by chance (e.g., if p < .05, then the likelihood that the result is due to chance is < 5%).
Remember that p is estimated for a range of results, not for any particular result. Also, p is calculated assuming that H0 is true, so the probability that chance explains any individual result is already taken to be 1.0. Thus, it is illogical to view p as somehow measuring the probability of chance. I am not aware of an estimate of the base rate of the odds-against-chance fallacy, but I think that it is nearly universal in the behavioral sciences. It would be terrific if some statistical technique could estimate the probability that a particular result is due to chance, but there is no such thing.
The local Type I error fallacy for the case p < .05 is expressed as follows: I just rejected H0 at the .05 level. Therefore, the likelihood that this particular (local) decision is wrong (a Type I error) is < 5% (70% approximate base rate among psychology students and professors). This belief is false because any particular decision to reject H0 is either correct or incorrect, so no probability (other than 0 or 1.00; i.e., right or wrong) is associated with it. It is only with sufficient replication that we could determine whether or not the decision to reject H0 in a particular study was correct.
The inverse probability fallacy goes like this: Given p < .05; therefore, the likelihood that the null hypothesis is true is < 5% (30% approximate base rate). This error stems from forgetting that p values are probabilities of data under H0, not the other way around. It would be nice to know the probability that either the null hypothesis or the alternative hypothesis is true, but there is no statistical technique that can do so based on a single result.
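A simulation with a known ground truth can show why p < .05 does not translate into a 5% chance that H0 is true. Everything in this sketch is an assumption made for illustration: the share of studies in which H0 really is true (prior_null), the standardized effect size when it is not, the sample size, and the use of a two-sided z test. A researcher looking at a single result knows none of these quantities, which is the point of the passage above.

```python
import random

CRITICAL_Z = 1.96  # two-sided cutoff for alpha = .05

def share_of_true_nulls_among_rejections(prior_null, effect, n, studies, seed=2):
    """Among simulated studies that reject H0, the share where H0 was true.

    prior_null: assumed proportion of studies in which H0 is true
    effect:     assumed standardized mean difference when H0 is false
    """
    rng = random.Random(seed)
    rejections = 0
    true_null_rejections = 0
    for _ in range(studies):
        h0_true = rng.random() < prior_null
        shift = 0.0 if h0_true else effect * n ** 0.5
        z = rng.gauss(shift, 1.0)  # z statistic for this simulated study
        if abs(z) > CRITICAL_Z:
            rejections += 1
            true_null_rejections += h0_true
    return true_null_rejections / rejections

for prior in (0.5, 0.9):
    share = share_of_true_nulls_among_rejections(prior, effect=0.5, n=25, studies=100_000)
    print(prior, round(share, 3))
```

The same rejection rule gives very different shares of true nulls among significant results depending on the assumed base rate, so the p value alone cannot be read as the probability of H0.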
Two of the Big Five concern 1 – p. One is the replicability fallacy, which for the case of p < .05 says that the probability of finding the same result in a replication sample exceeds .95 (40% approximate base rate). If this fallacy were true, knowing the probability of replication would be useful. Unfortunately, a p value is just the probability of the data in a particular sample under a specific null hypothesis. In general, replication is a matter of experimental design and whether some effect actually exists in the population. It is thus an empirical question and one that cannot be directly addressed by statistical tests in a particular study.
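The dependence of replication on whether an effect exists can also be sketched by simulation. The assumptions here are illustrative and not from the original: a two-sided z test at α = .05, a sample size of 25 in both the original and the replication study, and "the same result" defined as a significant result in the same direction. The function estimates how often a significant original study is followed by a significant replication.

```python
import random

CRITICAL_Z = 1.96  # two-sided cutoff for alpha = .05

def replication_rate(effect, n, pairs, seed=3):
    """Among significant original studies, the share whose independent
    replication is also significant and in the same direction."""
    rng = random.Random(seed)
    shift = effect * n ** 0.5  # expected z under the assumed effect size
    significant_originals = 0
    replicated = 0
    for _ in range(pairs):
        z_original = rng.gauss(shift, 1.0)
        z_replication = rng.gauss(shift, 1.0)
        if abs(z_original) > CRITICAL_Z:
            significant_originals += 1
            same_sign = (z_original > 0) == (z_replication > 0)
            if same_sign and abs(z_replication) > CRITICAL_Z:
                replicated += 1
    return replicated / significant_originals

for effect in (0.0, 0.5):
    print(effect, round(replication_rate(effect, n=25, pairs=50_000), 3))
```

Under these assumptions the replication rate is close to the statistical power of the design when an effect exists and is very low when it does not. In neither case does it exceed .95 merely because the original result had p < .05.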
The last of the Big Five, the validity fallacy, refers to the false belief that the probability that H1 is true is greater than .95, given p < .05 (50% approximate base rate). The complement of p, or 1 – p, is also a probability, but it is just the probability of getting a result even less extreme under H0 than the one actually found. Again, p refers to the probability of the data, not to that of any particular hypothesis, H0 or H1. See Kline (2004, chap. 3) or Kline (2009, chap. 5) for descriptions of additional false beliefs about statistical significance.
It is pertinent to consider one last myth about statistical tests, and it is the view that the .05 and .01 levels of statistical significance, or α, are somehow universal or objective “golden rules” that apply across all studies and research areas. It is true that these levels of α are the conventional standards used today. They are generally attributed to R.A. Fisher, but he did not advocate that these values be applied across all studies (e.g., Fisher, 1956). There are ways in decision theory to empirically determine the optimal level of α given estimates of the costs of various types of decision errors (Type I vs. Type II error), but these methods are almost never used in the behavioral sciences. Instead, most of us automatically use α = .05 or α = .01 without acknowledging that these particular levels are arbitrary.
Even worse, some of us may embrace the sanctification fallacy, which refers to dichotomous thinking about p values that are actually continuous. If α = .05, for example, then a result where p = .049 and one where p = .051 are practically identical in terms of statistical outcomes. However, we usually make a big deal about the first (it’s significant!) but ignore the second. (Or worse, we interpret it as a “trend” as though it were really “trying” to be significant, but fell just short.) This type of black-and-white thinking is out of proportion to continuous changes in p values.
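How little separates p = .049 from p = .051 can be seen by converting each to the test statistic that produces it. This sketch assumes a two-sided z test, which is an illustrative choice and not something specified in the original; the function name is likewise hypothetical.

```python
from statistics import NormalDist

def z_for_two_sided_p(p):
    """The |z| statistic whose two-sided p value equals p."""
    return NormalDist().inv_cdf(1.0 - p / 2.0)

z_sig = z_for_two_sided_p(0.049)     # counted as "significant" at alpha = .05
z_nonsig = z_for_two_sided_p(0.051)  # counted as "not significant"
print(round(z_sig, 3), round(z_nonsig, 3), round(z_sig - z_nonsig, 3))
```

The two test statistics differ by less than 0.02 standard errors, a difference far smaller than the sampling error in either result.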
There are other areas in SEM where we commit the sanctification fallacy. This thought from the astronomer Carl Sagan (1996) is apropos: “When we are self-indulgent and uncritical, when we confuse hopes and facts, we slide into pseudoscience and superstition” (p. 27). Let there be no superstition concerning statistical significance going forward from this point.