Suppose one person in every 100 is actually an extraterrestrial. And further imagine that a brilliant scientist develops a test which can identify real aliens 100% of the time. But it also incorrectly flags 5% of normal people as aliens too.

Question: If someone tests positive, what’s the probability that they’re really an alien and not just a misclassified human?

95% maybe?

Well, let’s work it out. In a random sample of 100 people, we’d expect the test to identify one bona fide alien (1%) along with roughly five false positives (5%). That means that five of the six positive tests are wrong, so the probability of getting it right is only 1 in 6, or a mere 17%.

Next, suppose they discover an extraterrestrial enclave (Area 51?) where 40% of the population are known to be aliens in disguise.

Before blindly diving into more frequency counting, we should point out an important nuance. The group with false positives is restricted to those who really aren’t aliens, and so we need to adjust our expected false positive count by multiplying by this probability. Since the non-aliens now account for only 60% of the sample, we should multiply by 0.6. This was true in the previous example too but we glossed over it since it was 99%.

Similarly, the expected number of true positives needs to be weighted by the probability of the test detecting a positive. For this test, that probability is 100% and so we don’t need to do anything further. More on this important detail later when we introduce Bayes’ Theorem.

With that in mind, select a random sample of 100 “people” from this population. Since 60% are likely normal, we expect our test to now record only three false positives: 60% * 5% * 100 = 3. We also expect 40 true positives. The likelihood of nabbing an alien has suddenly increased to 40/(40 + 3), or ~93%. In other words, the same test now produces much better results.
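The frequency counting in both examples can be reproduced with a short script. This is a minimal sketch; the function name and the sample size of 100 are just the conventions used above.

```python
def expected_counts(n, prevalence, sensitivity, false_positive_rate):
    """Expected true and false positive counts in a sample of n people."""
    true_positives = n * prevalence * sensitivity
    false_positives = n * (1 - prevalence) * false_positive_rate
    return true_positives, false_positives

# First example: 1% aliens, a test that never misses, 5% false positives
tp, fp = expected_counts(100, 0.01, 1.0, 0.05)
print(tp / (tp + fp))  # ~0.168, close to the rough 1-in-6 estimate

# Second example: 40% aliens, same test
tp, fp = expected_counts(100, 0.40, 1.0, 0.05)
print(tp / (tp + fp))  # ~0.93
```

Note that the false positive count is weighted by the non-alien fraction, the nuance pointed out above.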

If any of this surprises you, then welcome to the base rate fallacy (a.k.a., base rate bias, neglect or the false positive paradox). Or, why screening matters.

These two examples demonstrate that if the probability of a true match is outweighed by the likelihood of a false positive then an otherwise decent test isn’t going to produce very good results.

Before continuing, it’s useful to identify the jargon that arises in examples like this. Then a bunch of math. Then we’ll do a few more examples.

#### Terminology

This field is packed with an ocean of abbreviations and similar-sounding terms. Worse yet, there are often multiple names for the same concept depending on the discipline (as sardonically demonstrated in this blog post).

Fortunately the terms listed below are all based on ratios of the form

A/(A + B)

where the value of A and B are combinations of true/false positive/negative counts (the acronyms TP, FP, TN, FN). Recognizing this pattern allows you to reconstruct a definition instead of blindly memorizing it.

So, here we go!

Sensitivity = TP/(TP + FN)

Sensitivity measures how well a test detects positive cases. Anything less than 1 (or 100%) indicates the possibility of false negatives.

Specificity = TN/(TN + FP)

Specificity measures how well a test detects negative cases. Anything less than 1 (or 100%) indicates the possibility of false positives.

In our first two examples, the sensitivity was 100% and the specificity was 95%: the test always identified real aliens but it sometimes misclassified regular folks too.

Here’s another example. Suppose a firefighter can always identify houses that are burning but may also confuse a house with a smoky fireplace as being on fire. Because he perfectly identifies burning houses, this firefighter has a 100% sensitivity to fires. But because he occasionally mistakes chimney smoke for house fires, his specificity is less than 100%.

Note that one can always construct a test with 100% sensitivity by simply providing a positive result for all test subjects. But the specificity of that test would be zero. For example, a test that blindly identifies everyone as an alien will never miss a real one (100% sensitivity) but unfortunately it generates false positives for everyone else (0% specificity since there would be no true negatives, just false positives).

An inversion of these concepts is the positive predictive value (PPV), which is the proportion of positive tests that were actually correct. It measures how well the test does at identifying positive results.

PPV = TP /(TP + FP)

Note that the only difference between the PPV and sensitivity is the second term in the denominator, the count of false positives instead of false negatives. Thus it sums over all positive test results instead of all positive cases. As such, when a test’s PPV is low, it isn’t terribly predictive.

The PPV is a key metric in diagnostic testing. It conveys the trustworthiness of a medical test to a doctor. It tells a data analyst how well their classifications are performing. The PPV is also the value we calculated in those frequency counting examples above: the ratio of true positives to all positive results.

The negative counterpart of the PPV is the negative predictive value (NPV), which is the proportion of negative results that were actually correct. It measures how well the test does at identifying negative results.

NPV = TN /(TN + FN)

Related to the NPV is the false omission rate (FOR), which is the ratio of false negatives to all negative results. If the FOR is high, then a negative result should be suspect. Note that NPV + FOR = 1 and so the FOR is often written as 1 - NPV.

FOR = FN /(FN + TN) = 1 - NPV

There’s also a summary metric called accuracy, which measures the proportion of correct results among all results.

Accuracy = (TN + TP)/(TP + TN + FP + FN)
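All of these ratios follow the A/(A + B) pattern and fall out directly from the four confusion counts. Here is a minimal sketch, using the (rounded) counts implied by the first alien example per 100 people: 1 TP, 5 FP, 94 TN, 0 FN.

```python
def metrics(tp, fp, tn, fn):
    """Ratio-based metrics from true/false positive/negative counts."""
    return {
        "sensitivity": tp / (tp + fn),
        "specificity": tn / (tn + fp),
        "ppv":         tp / (tp + fp),
        "npv":         tn / (tn + fn),
        "for":         fn / (fn + tn),
        "accuracy":    (tp + tn) / (tp + fp + tn + fn),
    }

m = metrics(tp=1, fp=5, tn=94, fn=0)
print(m["sensitivity"])  # 1.0 -- the test never misses an alien
print(m["ppv"])          # ~0.167 -- yet most positives are wrong
```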

Those are the main ratio-based definitions. This last one is a good segue to the next section on probability.

The belief that an event will occur before collecting any data is called the prior. In disease screening, the prior is often referred to as the prevalence, pre-test probability, clinical probability, or base rate. In the first alien example, the prevalence was 1%. In the second example it rose to 40%. And with that, the PPV rose too. This is not a coincidence, as will be shown.

As an example of how these terms go by multiple names, in the parlance of machine learning sensitivity is recall, PPV is precision and prevalence is prior.

#### PROBABILISTIC TAKE

Instead of frequency counting, PPV/NPV/FOR computations are normally presented as probabilities (or rates as they are sometimes called in practice). This lets a practitioner assess a diagnostic result using only prevalence and the known parameters of the test itself.

Recall that a conditional probability is the probability of an event occurring given that another event occurred previously. For example, the probability that Netflix stock rises given that one of its movies wins Best Picture. Specifically, the PPV is the conditional probability of having a disease given a positive test result. This is expressed as P(D+|T+). Sensitivity is the reverse scenario: it’s the probability of a positive test result given the patient having the disease, denoted P(T+|D+).

Bayes’ Theorem is an essential tool when analyzing conditional probabilities.

$P(A|B) = \frac{P(A) P(B|A)}{P(B)}$

Revised terminology and abbreviations

D+ = has disease
D- = does not have disease
T+ = tests positive
T− = tests negative

P(D+) = prevalence, probability of disease prior to testing for it

P(D+|T+) = PPV, probability of disease given a positive result

P(D-|T-) = NPV, probability of no disease given a negative result

P(T+|D+) = sensitivity, the probability of a positive test result given that the subject has the condition, a.k.a., the true positive rate (TPR)

P(T-|D-) = specificity, the probability of a negative result given that the subject does not have the condition, a.k.a., the true negative rate (TNR)

P(T+|D-) = false positive rate (FPR), the probability that a test will produce a false positive

P(T-|D+) = false negative rate (FNR), the probability that a test will produce a false negative

The goal now is to translate the PPV into an expression involving only the prevalence and the test parameters, sensitivity and specificity. We do that by first applying Bayes’ theorem:

$PPV = P(D+|T+) = \frac{P(D+) P(T+|D+)}{P(T+)}$

Note that the term in the denominator, P(T+), is the probability of getting a positive test result. But a positive test result falls into two non-overlapping categories: a true positive and a false positive. Because they are non-overlapping events, we can apply the law of total probability and express it as a weighted sum of both:

$P(T+) = P(T+|D+) P(D+) + P(T+|D-) P(D-)$

It’s worth noting that P(T+|D-), the false positive rate (FPR), is related to the specificity. Since someone without the disease will either receive a positive or a negative test result, then P(T+|D-) + P(T-|D-) = 1. The second term is just the specificity. So, FPR + specificity = 1. The same is true of the false negative rate (FNR) and sensitivity: FNR + sensitivity = 1.

We can now substitute this expanded P(T+) into our PPV expression:

$PPV = \frac{P(T+|D+) P(D+)}{P(T+|D+) P(D+) + P(T+|D-) P(D-)}$

This revised form of the PPV may look hideous but it’s still a ratio. The difference now is that the terms are based on known diagnostic parameters instead of frequency counts. In words, the Bayesian PPV reads

PPV = sensitivity*prior /
(sensitivity*prior + (1 - prior)*(1 - specificity))
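The word equation above translates directly into code. A quick sketch (the function name is ours):

```python
def ppv(prior, sensitivity, specificity):
    """Bayesian positive predictive value from prevalence and test parameters."""
    true_pos = sensitivity * prior
    false_pos = (1 - specificity) * (1 - prior)
    return true_pos / (true_pos + false_pos)

print(ppv(0.01, 1.0, 0.95))  # ~0.168 (first alien example)
print(ppv(0.40, 1.0, 0.95))  # ~0.930 (second alien example)
print(ppv(0.01, 1.0, 1.00))  # 1.0 -- perfect specificity, prevalence irrelevant
```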


Here’s an important detail. Notice what happens when the specificity is 1 or 100%. In this case, the second term in the denominator is zero, which causes the remaining terms to cancel out. That is, the PPV is simply 1 or 100%. This means that if the test is perfectly specific, there will be no false positives regardless of prevalence. And these types of tests exist. Several manufacturers of Covid-19 tests claim 100% specificity, for example, though sensitivities are typically much lower. For these tests, positive results are trustworthy but negatives could be wrong.

We can now revisit our first alien example and compute its exact PPV. In that example, the sensitivity was 100%, the specificity 95% and the prior 1%. Plugging this in, we get

(0.01 * 1.0)/(0.01 * 1.0 + 0.99 * 0.05) ≈ 0.168

We ball-parked it before as 1/6 = 0.166… The small discrepancy is explained by the added rigor of the Bayesian solution (recall that we rounded the 99% term to 100% before).

In the second example we were more precise with false positives and so both values are identical:

(0.4 * 1.0)/(0.4 * 1.0 + 0.6 * 0.05) ≈ 0.93

What about the negative predictive value (NPV)? Mimicking the steps we followed for the PPV gives

$NPV = P(D-|T-) = \frac{P(T-|D-) P(D-)}{P(T-|D-) P(D-) + P(T-|D+) P(D+)}$

Or in words:

NPV = specificity*(1 - prior) /
(specificity*(1 - prior) + prior*(1 - sensitivity))
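As with the PPV, the word equation maps directly onto a small function (a sketch, with a name of our choosing):

```python
def npv(prior, sensitivity, specificity):
    """Bayesian negative predictive value from prevalence and test parameters."""
    true_neg = specificity * (1 - prior)
    false_neg = (1 - sensitivity) * prior
    return true_neg / (true_neg + false_neg)

print(npv(0.01, 1.0, 0.95))   # 1.0 -- perfect sensitivity means no false negatives
print(npv(0.40, 0.75, 0.95))  # ~0.85 -- drops once sensitivity is imperfect
```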

It follows that as the prior increases, the NPV shrinks since the probability of not having the disease falls with it. This in turn causes the FOR (i.e., 1 - NPV) to increase with prevalence. However, the effects of prevalence on both will drop off as the sensitivity approaches 100%. This is analogous to the effect of specificity on the PPV.

The three relations PPV, NPV and FOR plotted in Matlab for a given sensitivity and specificity. Note how quickly the PPV rises with prevalence.

#### DISEASE EXAMPLES

To experiment with these scenarios check out this test calculator.

A doctor does a blood test for celiac (1% prevalence, se=98%, sp=95%) and it comes back positive, yet the patient has no celiac symptoms despite having a gluten-heavy diet. Because the PPV for this test is around 17%, the doctor suggests they dismiss the result unless symptoms arise.

Using the PPV expression derived above we can verify this result:

(0.01 * 0.98)/(0.01 * 0.98 + 0.99 * 0.05) ≈ 0.165

Next, suppose there’s an open screening for a low prevalence disease of 2% using a test with a 75% sensitivity and 95% specificity. Here the PPV is only 23%; that is, only about 1 in 4 positive test results will be true positives. The FOR, on the other hand, is less than 1%, meaning that negative results are much more reliable.

Verify:
PPV = (0.02 * 0.75)/(0.02 * 0.75 + 0.98 * 0.05) ≈ 0.234
FOR = 1 - (0.98 * 0.95)/(0.98 * 0.95 + 0.02 * 0.25) ≈ 0.005

Now if you start pre-screening (fever, breathing issues, etc.) and raise the prevalence to 70%, then the PPV rises to 97.2% for the same test. Conversely, the FOR rises to 38%. That means roughly 2 in 5 negative results will now belong to sick people, since there are more sick people on whom the imperfect test can fail.

Verify:
PPV = (0.7 * 0.75)/(0.7 * 0.75 + 0.3 * 0.05) ≈ 0.972
FOR = 1 - (0.3 * 0.95)/(0.3 * 0.95 + 0.7 * 0.25) ≈ 0.38
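Both screening scenarios can be checked numerically from the word equations derived earlier. A quick sketch (function name ours):

```python
def predictive_values(prior, sensitivity, specificity):
    """Return (PPV, FOR) for a given prevalence and test parameters."""
    ppv = sensitivity * prior / (
        sensitivity * prior + (1 - specificity) * (1 - prior))
    npv = specificity * (1 - prior) / (
        specificity * (1 - prior) + (1 - sensitivity) * prior)
    return ppv, 1 - npv

print(predictive_values(0.02, 0.75, 0.95))  # PPV ~0.234, FOR ~0.005
print(predictive_values(0.70, 0.75, 0.95))  # PPV ~0.972, FOR ~0.380
```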

Let’s look again at the effect specificity can have on the PPV. Suppose that our brilliant scientist revises his alien test so that it is perfectly specific as well as perfectly sensitive, and then applies it to the general population. Because this test will never record a false positive or a false negative regardless of prevalence, both its PPV and its NPV will always be 1 or 100%. To reiterate:

The only time we can ignore prevalence in a PPV is when there’s no chance of the test producing a false positive.

Most tests are not perfect though, which puts the burden back on the prevalence. Consider this: at a specificity of 99.7% and a sensitivity of 100%, the PPV at 1% prevalence is only 77%. Thus even at near-perfect specificity, there’s still a non-trivial margin of error when the prevalence is low.
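Sweeping the prior through the Bayesian PPV formula makes the low-prevalence penalty obvious. This sketch uses the same 100%/99.7% test parameters (it is not the Matlab code behind the plots):

```python
def ppv(prior, sensitivity=1.0, specificity=0.997):
    """Bayesian PPV with the near-perfect test parameters from the text."""
    return sensitivity * prior / (
        sensitivity * prior + (1 - specificity) * (1 - prior))

for p in (0.001, 0.01, 0.05, 0.20):
    print(f"prevalence {p:6.1%} -> PPV {ppv(p):.1%}")
# At 1% prevalence the PPV is only ~77% despite perfect sensitivity
```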

To underscore this point, here’s a Matlab plot of the PPV for varying prevalence and a fixed sensitivity. Note the vertical line at 1% prevalence.

Zooming in a bit

#### TAKEAWAYS

Prevalence (the prior) is the key aspect of the base rate fallacy. For the same test, an increase in the prevalence (by screening, for example) can increase the PPV and thus the quality of the results. This was demonstrated using both frequency analysis and Bayes’ Theorem. But as noted, if a test has perfect sensitivity or specificity then the effects of prevalence can be ignored on the corresponding measure, NPV or PPV, respectively.

If you enjoyed this post please click the like button, share it, or leave a comment. It would mean a lot considering how much effort went into writing it. Thanks!