Statistical Power

Statistical power quantifies the probability of making a "false negative" error when a specific alternative is in fact true. For example, and as we simulate and animate below, failing to reject the hypothesis that a coin is "fair" (p_heads = 0.5) when it is in fact "biased" (p_heads = 0.65) is a false negative. False negatives are extremely important in fields like epidemiology and immunological testing (think COVID-19): erroneously clearing a person who carries the virus but tests negative allows them to continue spreading it to others in the community. In some ways, the concept of statistical power serves to unify the topics covered in the previous two Issues, as it brings both the Law of Large Numbers (LLN) and the Central Limit Theorem (CLT) to bear on a single application.

By the CLT, we know that when the number of coins is large, the sampling distribution of the proportion of heads will be approximately Normal. And by the Law of Large Numbers, we know that if we toss each coin "enough times," the variation of the sample average shrinks toward zero and the sample average collapses to the true expectation. At that point we will be able to tell a fair coin (p_heads = 0.5) from a biased coin (p_heads = 0.65). So the question remains: exactly how many tosses are 'enough' for us to conclude that we are not making a false negative error due to random chance alone?
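As a quick sketch of that reasoning in symbols (writing p for p_heads and n for the number of tosses, and using the usual Normal approximation for a sample proportion):

$$
\hat{p}\;\overset{\text{approx.}}{\sim}\;\mathcal{N}\!\left(p,\;\frac{p(1-p)}{n}\right),
\qquad
\mathrm{SE}(\hat{p})=\sqrt{\frac{p(1-p)}{n}}\;\to\;0\ \text{ as } n\to\infty.
$$

Two Normal curves centered at 0.5 and 0.65 therefore become narrower as n grows, and the false negative probability lives precisely in their shrinking overlap.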

Statistical Errors

When deciding the truth of something, there are four possibilities, depicted by the table below.

| Predicted \ Reality | Positive | Negative |
| --- | --- | --- |
| Positive (Reject) | True Positive (Power, 1 - β) | False Positive (Type I Error, α) |
| Negative (Fail to Reject) | False Negative (Type II Error, β) | True Negative |
We see there are two types of errors we can make: false positives (we conclude something is significant when in fact it is not) and false negatives (we fail to conclude something is significant when in fact it is). The conventional, and largely arbitrary, standard across the field of statistics is to set our false positive (α) tolerance to 0.05. That is, regardless of the data or the context, the most extreme 5% of data generated under the null hypothesis will be false positives, by construction: their variation away from the center is due to random chance alone, yet they will be incorrectly deemed statistically significant. When it comes to false positives, we resign ourselves to being satisfied with this level of accuracy.
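To make that "by construction" concrete, here is a minimal simulation sketch in Python (the variable names and the 100-tosses-per-coin setting are illustrative assumptions, not the article's exact code): it tosses many fair coins and checks that roughly 5% of them land beyond the one-sided α = 0.05 critical value.

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(0)

n_coins = 10_000   # number of simulated fair coins (10K, as in the plots below)
n_tosses = 100     # tosses per coin, an illustrative choice
p_null = 0.5       # the null hypothesis: the coin is fair

# Simulated sampling distribution of the proportion of heads under the null
p_hat = rng.binomial(n_tosses, p_null, size=n_coins) / n_tosses

# One-sided critical value: reject "fair" when the proportion of heads exceeds
# the null's 95th percentile (Normal approximation)
crit = p_null + norm.ppf(0.95) * np.sqrt(p_null * (1 - p_null) / n_tosses)

false_positive_rate = np.mean(p_hat > crit)
print(f"critical value: {crit:.3f}")
print(f"empirical false positive rate: {false_positive_rate:.3f}")  # close to 0.05
```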

Conversely, when it comes to false negatives (β), more factors and more discretion are involved. First, we must choose a specific alternative hypothesis to test against. For example, the probability of making a false negative error will differ depending on whether we compare a fair coin (p_heads = 0.5) against a biased coin (p_heads = 0.65) or against a very biased coin (p_heads = 0.9) (the second comparison will produce far fewer false negatives). All else being equal, the 'further away' the alternative, the lower the probability of making a false negative error. Second, what becomes a false negative is determined by the prior choice of α. Data sourced from the alternative that lands in the α = 0.05 rejection zone will be rejected (correctly). However, data sourced from the alternative that falls short of the critical value, back inside the null's territory, will incorrectly fail to be rejected: a false negative. Finally, the probability of making a false negative error (β) depends on the number of trials. The more we toss the coins, the smaller the variation of the sampling distribution of the mean becomes (by the Law of Large Numbers), so less of the alternative's distribution falls short of the critical value. The statistical power is defined as 1 - β: the probability that we correctly determine something to be statistically significant (a True Positive). High power is good! It means we are rarely making false negative errors.
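As a rough illustration of the first point, here is a sketch under the Normal approximation (the helper function, its name, and the choice of 50 tosses are my own, not the article's) comparing the two alternatives mentioned above at a fixed number of tosses:

```python
from math import sqrt
from scipy.stats import norm

def beta_and_power(p_alt, p_null=0.5, n_tosses=50, alpha=0.05):
    """Normal-approximation Type II error and power for a one-sided test."""
    se_null = sqrt(p_null * (1 - p_null) / n_tosses)
    se_alt = sqrt(p_alt * (1 - p_alt) / n_tosses)
    crit = p_null + norm.ppf(1 - alpha) * se_null   # rejection threshold
    beta = norm.cdf(crit, loc=p_alt, scale=se_alt)  # alternative's mass below it
    return beta, 1 - beta

for p_alt in (0.65, 0.90):
    beta, power = beta_and_power(p_alt)
    print(f"p_heads = {p_alt}: beta = {beta:.3f}, power = {power:.3f}")
```

The farther alternative (0.9) leaves essentially no mass below the critical value, which is exactly the "far fewer false negatives" point above.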

Power at Play: Titrating Tosses

Now, with the terms and intuition established, we are ready to begin answering our "how many tosses are enough?" question. We begin by simulating coin tosses in the same way we have done before when exploring the LLN in Issue 1 and the CLT in Issue 2, constructing the sampling distribution of the proportion of heads.
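A minimal sketch of that simulation in Python (assuming 10,000 simulated coins per group as in the plots, an illustrative 50 tosses per coin, and omitting the plotting code itself):

```python
import numpy as np

rng = np.random.default_rng(42)

n_coins = 10_000   # simulated coins per group
n_tosses = 50      # tosses of each coin (the quantity we will titrate)

# Sampling distribution of the proportion of heads for each group
p_hat_fair   = rng.binomial(n_tosses, 0.50, size=n_coins) / n_tosses  # fair coin (red)
p_hat_biased = rng.binomial(n_tosses, 0.65, size=n_coins) / n_tosses  # biased coin (black)

print(f"fair:   mean = {p_hat_fair.mean():.3f}, sd = {p_hat_fair.std():.3f}")
print(f"biased: mean = {p_hat_biased.mean():.3f}, sd = {p_hat_biased.std():.3f}")
```

Repeating this for increasing values of n_tosses and drawing the two histograms side by side reproduces the animation described below.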

First, note that each coin's proportion of heads is an average over its tosses, so approximate (asymptotic) Normality of the sampling distribution comes as no surprise, by the CLT as in Issue 2; with a large number of simulated coins (10K), the histograms trace that Normal shape out smoothly. Second, we see that as the number of tosses of each coin increases, both sampling distributions collapse to their expectations (0.5 for the fair coin in red on the left, and 0.65 for the biased coin in black on the right), by the LLN as in Issue 1.

The new concepts of false positive/negative errors and statistical power are found in the shaded regions underneath the curves. The respective errors are shaded in the same colors as they appear in the two-way table above. Let's break it down, element by element.

The vertical dashed line is the α = 0.05 false positive "critical value" (threshold) for the fair coin. The area shaded in red to the right of the critical value is the 5% of fair coins that will be falsely rejected as 'biased', even though their larger proportion of heads is due to chance alone: the false positives. The area shaded in blue-gray to the left of the critical value is the false negative probability β: the probability of failing to reject biased coins actually coming from the distribution in black centered about 0.65, i.e., of failing to reject the hypothesis that they came from the distribution in red centered about 0.5. The color-changing area 1 - β to the right of the critical value is the power. It is the probability that we correctly determine a biased coin with p_heads = 0.65 as having a greater proportion of heads than a fair coin, and correctly reject the hypothesis that the biased coin came from the distribution in red.
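The three shaded quantities can also be read off directly from simulated proportions. Here is a self-contained sketch using the same illustrative setup as above (variable names and the 50-toss choice are assumptions of mine):

```python
import numpy as np
from scipy.stats import norm

rng = np.random.default_rng(42)
n_coins, n_tosses = 10_000, 50  # same illustrative setup as the earlier sketch

p_hat_fair   = rng.binomial(n_tosses, 0.50, size=n_coins) / n_tosses
p_hat_biased = rng.binomial(n_tosses, 0.65, size=n_coins) / n_tosses

# Critical value: upper 5% point of the fair coin's Normal approximation
crit = 0.5 + norm.ppf(0.95) * np.sqrt(0.5 * 0.5 / n_tosses)

alpha_hat = np.mean(p_hat_fair > crit)     # red area: empirical false positive rate
beta_hat  = np.mean(p_hat_biased <= crit)  # blue-gray area: empirical false negative rate
power_hat = 1 - beta_hat                   # biased curve's mass beyond the threshold

print(f"critical value {crit:.3f}: alpha {alpha_hat:.3f}, "
      f"beta {beta_hat:.3f}, power {power_hat:.3f}")
```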

We observe that as the sample size increases and the spread of both distributions diminishes, the amount of overlap between the two distributions shrinks. The α = 0.05 red shaded false positive error remains constant, but the β blue-gray shaded false negative error drops toward zero, and the power approaches 100%. We reach a reasonable power level of 80% very quickly, after only about 62 tosses, but picking up the remaining 20 percentage points requires roughly 310 tosses. Power does not increase linearly.
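For readers who want to titrate tosses themselves, here is a small sketch (one-sided α = 0.05, Normal approximation; the helper names and the power targets are illustrative) that scans the number of tosses until a target power is reached:

```python
from math import sqrt
from scipy.stats import norm

def power(n_tosses, p_alt=0.65, p_null=0.5, alpha=0.05):
    """One-sided power of detecting p_alt vs p_null, Normal approximation."""
    crit = p_null + norm.ppf(1 - alpha) * sqrt(p_null * (1 - p_null) / n_tosses)
    return 1 - norm.cdf(crit, loc=p_alt, scale=sqrt(p_alt * (1 - p_alt) / n_tosses))

def tosses_needed(target, max_n=2000):
    """Smallest number of tosses whose approximate power reaches the target."""
    return next(n for n in range(1, max_n + 1) if power(n) >= target)

for target in (0.80, 0.99, 0.9999):  # 0.9999 stands in for "essentially 100%"
    print(f"power {target:.2%}: about {tosses_needed(target)} tosses")
```

The exact counts depend on the approximation used, so this sketch should land in the same ballpark as the figures above rather than match them digit for digit.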

