Laboratory testing is principally done for two reasons: (1) to obtain information that cannot be determined clinically, but which is often important in forming hypotheses and (2) to test hypotheses.
Tests in the first category are those commonly described as “routine” testing and include serum electrolytes, blood urea nitrogen, creatinine, complete blood counts, urinalysis, and, less commonly, transaminases and erythrocyte sedimentation rate or C-reactive protein. Some, or all, of these tests are performed in patients with significant illness in order to help the clinician identify significant abnormalities in major organ function or laboratory signs of inflammation or infection.
Tests in the second category are innumerable, and more are being developed as you read. They are used to identify specific abnormalities and diseases. The diagnostic performance of these tests is highly dependent upon the patient population tested. Tests in this category are most useful when the diagnosis in question is in the mid-range of probability on the basis of your clinical assessment—that is, the probability for the disease being present is roughly between 20% and 80%.
To understand why this is so, it is necessary to understand the measures of test performance and how interpretation is dependent on both the diagnostic criteria for a disease or condition and the pretest probability that the disease is present.
Principles of Testing for Disease
Disease present or absent
The first question is, how do we determine who has the disease and who does not? This is done with an independent test or set of criteria accepted as establishing “the diagnosis.” The assumption is that a disease is either present or absent. Although this may seem obvious for diseases such as cancer or an infection, where a tissue biopsy or culture is the diagnostic standard, most biologic measurements are continuous variables, not either/or determinations; it is often difficult to say whether rheumatoid arthritis is present or not, or which level of creatinine determines renal failure.
Most diseases have variable clinical severity; hence the diagnostic standard used to establish the disease can be either very inclusive (sensitive) or more exclusive (specific). A good example is the American Rheumatologic Association criteria for the diagnosis of rheumatic syndromes. These criteria were developed because no laboratory tests of sufficient accuracy are available to identify these patients. The goal was to identify persons eligible for inclusion in research studies of the specific diseases. Hence, the criteria for diagnosis of these syndromes are set to be quite specific; that is to say, patients meeting the criteria are very likely to have the syndrome. However, it cannot be concluded that patients not meeting the criteria, who have many of the features and no other explanation, do not have the syndrome.
Test positive or negative
The definition of normal for continuous variables is a statistical determination (see the discussion of cholesterol in Chapter 18 for an exception). At what point “abnormal” becomes an illness or disease is a judgment based upon the desire to identify those with disease (true positives), but not include a significant portion of patients without the disease (false positives). Furthermore, most tests are not positive in all the patients with a given disease, so there will be patients with the disease who are missed by the test (false negatives). Finally, we want to be sure that a very high proportion of patients who do not have the disease, have a negative test (true negatives).
The decision of what cut point constitutes an abnormal test is determined by comparing the distribution of the test results in patients with the disease (determined as above) and in those without the disease (Fig. 17-1). When the two populations overlap in part of their range, a cut point is chosen to minimize the misallocation of patients (false positives and false negatives).
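The choice of a cut point described above can be illustrated with a small sketch. The test values below are hypothetical (they are not taken from Figure 17-1), higher values are assumed to be more abnormal, and "best" is taken here to mean the fewest misallocated patients; a real laboratory would weigh false positives and false negatives according to the clinical question.

```python
# Hypothetical test values; higher values are assumed to be more abnormal.
diseased = [5.1, 5.8, 6.2, 6.9, 7.4, 7.7, 8.3, 9.0]
healthy = [3.2, 3.9, 4.4, 4.8, 5.3, 5.6, 5.7, 6.5]


def misallocated(cut_point):
    """Count false negatives plus false positives when values >= cut_point are called positive."""
    false_negatives = sum(1 for value in diseased if value < cut_point)
    false_positives = sum(1 for value in healthy if value >= cut_point)
    return false_negatives + false_positives


# Try every observed value as a candidate cut point and keep the one
# that misallocates the fewest patients.
candidates = sorted(set(diseased + healthy))
best_cut = min(candidates, key=misallocated)

sensitivity = sum(1 for value in diseased if value >= best_cut) / len(diseased)
specificity = sum(1 for value in healthy if value < best_cut) / len(healthy)
print(f"cut point {best_cut}: {misallocated(best_cut)} misallocated, "
      f"sensitivity {sensitivity:.3f}, specificity {specificity:.3f}")
```

Moving the cut point lower than the chosen value picks up the remaining diseased patient at the cost of more false positives; moving it higher does the reverse.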
Figure 17-1. Interpretation of Test Results. A population of unaffected patients is compared with a population of diseased patients. The cut point for determining normal versus abnormal is the value with the best compromise between sensitivity and specificity. Likelihood ratios (LR) can be calculated for test values above and below the usual cut point.
Note, however, that much of the information is lost in looking at tests of continuous variables as positive or negative: very abnormal tests are more likely to be associated with disease than mildly abnormal tests. As we discuss below, likelihood ratios (LR) are a good way to capture this information for making diagnostic decisions.
Probability is a ratio or proportion of one part of a population to the population as a whole. A racehorse that wins 1 race in 20 has a 5% probability of winning (1/20 = 0.05).
Odds are the ratio of two probabilities: the probability that an event occurs and the probability that it does not. Because most events are uncommon (otherwise we would not need to make all these calculations), odds are customarily expressed as the odds against an event. For our horse, the probability of losing is 0.95, whereas the probability of winning is 0.05; that is, the odds are 19:1 that it will lose.
Odds and probabilities can be derived from one another: odds = probability/(1 − probability), and probability = odds/(1 + odds), where the odds are those in favor of the event.
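As a minimal sketch of the conversion between probability and odds, applied to the racehorse above. The function names are illustrative, and odds are expressed as a single number in favor of the event (so odds of 1:19 in favor are about 0.053).

```python
def probability_to_odds(probability):
    """Odds in favor of an event, given its probability (must be less than 1)."""
    return probability / (1 - probability)


def odds_to_probability(odds):
    """Probability of an event, given the odds in its favor."""
    return odds / (1 + odds)


odds_of_winning = probability_to_odds(0.05)   # 1:19 in favor of winning
odds_against_winning = 1 / odds_of_winning    # 19:1 against winning

print(f"odds against winning: {odds_against_winning:.0f}:1")
print(f"back to probability: {odds_to_probability(odds_of_winning):.2f}")
```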
Pretest probability and prior odds
We have generated a set of hypotheses at the bedside and have formulated a differential diagnosis (see Chapter 1). An essential part of the differential diagnosis process is to make a conscious estimate of the probability for the patient to have each disease in the differential diagnosis. If we were to see 100 patients exactly like the patient before us—same age, sex, comorbidities, presenting symptoms, and physical signs—how many would have each condition? This estimate is the pretest probability. Expressed as odds (the probability of having the disease over the probability of not having the disease), this is known as the prior odds.
Tests can be systematically evaluated in a 2-cell × 2-cell table whose parameters are whether the disease is present or absent and whether the test is positive or negative by predetermined criteria (Fig. 17-2). The four cells, conventionally labeled a, b, c, and d, represent the true-positive tests (disease present and test positive), the false-positive tests (disease absent, but test positive), the false-negative tests (disease present, but test negative), and the true-negative tests (disease absent and test negative), respectively.
Figure 17-2. The 2 × 2 Table: Sensitivity, Specificity, and Positive and Negative Predictive Values
What additional information is in the table? We can see in the first column all the patients who have the disease (a + c) and in the second column all the patients who are free of disease (b + d). The ratio of the first column to the sum of the two columns is the prevalence of disease in the group of patients who generated the data in the table: prevalence = (a + c)/(a + b + c + d).
To know how to interpret this prevalence, we need to know how the patients were selected for inclusion in each column. If this was a randomly selected, population-based sample, then the prevalence is that of the disease in the population, often a useful number. On the other hand, the investigators may have selected a group of patients with the disease and another group known not to have the disease in a predetermined ratio or by some nonrandom method. If this is the case, the “prevalence” is essentially meaningless for understanding disease prevalence in any useful clinical sense.
Aids in the Selection and Interpretation of Tests
Sensitivity is the number of patients with the disease who have a positive test, divided by the total number with the disease: sensitivity = a/(a + c), a probability. With highly sensitive tests, the vast majority of patients with the disease have a positive test (very few false negatives). Tests with high sensitivity (>0.95) are most useful when negative, thereby making the diagnosis less likely. Note that the sensitivity of a test, because it is calculated only in those with the disease, is independent of the prevalence of the disease. Sensitivity can be increased by changing the cutoff for defining a positive test to a less abnormal value (see Fig. 17-1).
Because sensitivity is independent of prevalence, it is susceptible to overinterpretation when disease prevalence is very low (see Example 1 below). In this case, the false-positive tests (b) may significantly outnumber the true positives (a).
Sensitive tests are used when you do not wish to miss a serious disease because of the consequences of a delayed diagnosis. A negative result makes the disease unlikely and helps to reassure the patient and clinician, and serves to narrow the diagnostic possibilities. A positive test needs to be confirmed with more specific tests before a diagnosis can be established.
Specificity is the proportion of patients without the disease who have a negative test: specificity = d /(b + d), a probability. With highly specific tests, the vast majority of patients without the disease have negative tests (very few false positives). However, the test may also be negative in those with the disease. Note that patients with the disease do not enter into the determination of specificity; it, like sensitivity, is independent of disease prevalence. Specificity can also be improved by changing the cut point for defining abnormal to a more abnormal value (see Fig. 17-1).
Because specificity is independent of prevalence, it is susceptible to overinterpretation when disease prevalence (pretest probability) is high (see Example 4 below). In this case, the false-negative tests (c) may significantly outnumber the true negatives (d).
Highly specific tests are used to confirm a diagnosis. This is especially important when the consequences of the diagnosis are serious for the patient, either for prognosis or therapy.
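The definitions of prevalence, sensitivity, and specificity can be sketched directly from the four cells of the 2 × 2 table. The function name is illustrative, and the counts are those implied by a test with 95% sensitivity and 95% specificity applied to 1000 patients of whom 100 have the disease (the setting of Example 2 below).

```python
def two_by_two_summary(a, b, c, d):
    """Summarize a 2 x 2 table.

    a = true positives, b = false positives,
    c = false negatives, d = true negatives.
    """
    return {
        "prevalence": (a + c) / (a + b + c + d),
        "sensitivity": a / (a + c),   # computed only in those with the disease
        "specificity": d / (b + d),   # computed only in those without the disease
    }


summary = two_by_two_summary(a=95, b=45, c=5, d=855)
for name, value in summary.items():
    print(f"{name}: {value:.2f}")
```

Because sensitivity uses only the first column and specificity only the second, changing the relative size of the two columns changes the prevalence but neither test characteristic.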
Setting your positive/negative cut point
For most diagnostic tests, the clinical laboratory supplies a reference range (see Chapter 18). This range is determined by testing hundreds of samples of unselected patients, patients with the disease, and patients known to not have the disease. From these data, graphs such as Figure 17-1 can be generated. The data are analyzed to determine the statistical best fit for distinguishing the diseased from the nondiseased populations.
For many clinical tests, such as treadmill exercise tests, interpretation of imaging studies, and application of diagnostic tests, the clinician must decide, based upon the clinical scenario and the type of diagnostic question being asked (screening, case finding, hypothesis testing) what cut point will best serve to answer the question. Consultation with specialists in laboratory medicine and with experts in the diseases in your differential diagnosis can assist you in determining what should be regarded as a positive or negative test in each specific clinical situation.
When we do a test, we are not really interested in the test (sensitivity and specificity), but in how it can help us in understanding our patient’s problem: does a positive test predict that the patient has the disease (positive predictive value [PPV]), and does a negative test predict the absence of the disease (negative predictive value [NPV])? Predictive values are calculated from 2 × 2 tables (see Fig. 17-2). As we shall see, the predictive values for a test are dependent upon the population that was used to generate the data in the 2 × 2 table; different populations have different disease prevalences. To generate meaningful predictive values, the patients generating the data must be chosen randomly from a clinical population that is relevant to your question and patient.
Positive predictive value
The PPV is calculated from our 2 × 2 table. It is the proportion of patients with a positive test who have the disease: PPV = a/(a + b), a probability. Tests with a high PPV have few false-positive tests; therefore, a positive test supports the diagnosis. Note, however, that if the disease is rare in the population (so that (b + d) >> (a + c)), the test will have to be extremely specific (low false positives, b) for the true positives to be greater than the false positives (see Examples). Therefore, when the pretest probability of disease is low (low prevalence), even seemingly good tests (sensitivity, specificity) may perform badly for predicting the presence of disease.
Negative predictive value
The NPV is the proportion of patients with a negative test who do not have the disease: NPV = d/(c + d), a probability. Tests with a high NPV have few false negatives, therefore a negative test argues against the disease. When the condition is prevalent in the population to begin with, a negative test may not be very helpful; that is, the NPV may be low and the disease may be present despite a negative test.
Consequently, to use the PPV and NPV, the clinician must know, or have a good estimate of, the prevalence of the condition being tested for, in the population which the patient represents. Most clinicians do not have this data readily available. What we do have is our clinical estimate of the probability of disease that we have generated from our history and physical examination in generating our differential diagnosis.
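The dependence of the predictive values on prevalence can be made concrete with a short sketch. It assumes that sensitivity and specificity are fixed properties of the test and fills in the four cells of the table as proportions of the whole population; the function name is illustrative.

```python
def predictive_values(sensitivity, specificity, prevalence):
    """Return (PPV, NPV) for a test applied at a given disease prevalence."""
    true_pos = sensitivity * prevalence                # cell a, as a proportion
    false_pos = (1 - specificity) * (1 - prevalence)   # cell b
    false_neg = (1 - sensitivity) * prevalence         # cell c
    true_neg = specificity * (1 - prevalence)          # cell d
    ppv = true_pos / (true_pos + false_pos)            # a / (a + b)
    npv = true_neg / (true_neg + false_neg)            # d / (c + d)
    return ppv, npv


# The same test (95% sensitive, 95% specific) at four different prevalences.
for prevalence in (0.01, 0.10, 0.50, 0.90):
    ppv, npv = predictive_values(0.95, 0.95, prevalence)
    print(f"prevalence {prevalence:.2f}: PPV {ppv:.3f}, NPV {npv:.3f}")
```

The PPV rises and the NPV falls as the prevalence increases, even though the test itself has not changed.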
Another way of expressing the usefulness of a test is in LR. A positive likelihood ratio (LR+) is the ratio of the probability of a positive test in people with the disease (the sensitivity) to the probability of a positive test in people without the disease: LR+ = [a/(a + c)] ÷ [b/(b + d)]. A negative likelihood ratio (LR−) is the probability of a negative test in patients with the disease divided by the probability of a negative test in people without the disease (the specificity): LR− = [c/(a + c)] ÷ [d/(b + d)] (Fig. 17-3). LR, the ratio of two probabilities, is an odds.
Figure 17-3. Positive and Negative LR
LRs show how well a result more abnormal (LR+) or less abnormal (LR−) than a given value for the test (the cut point for “test positive” in the 2 × 2 table) discriminates between those with and without the disease. They are a function of the defined parameters of the test and are independent of the prevalence of the disease (see the Examples). LRs contain all the sensitivity and specificity information and express the relationship between sensitivity and specificity for positive and negative results.
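A minimal sketch of the LR formulas, using the cell labels of the 2 × 2 table. The two sets of counts are those implied by a 95% sensitive, 95% specific test applied to 1000 patients at 10% and at 90% prevalence; the function name is illustrative, and the sketch assumes b and d are not zero.

```python
def likelihood_ratios(a, b, c, d):
    """Return (LR+, LR-) from the cells of a 2 x 2 table."""
    lr_positive = (a / (a + c)) / (b / (b + d))   # sensitivity / (1 - specificity)
    lr_negative = (c / (a + c)) / (d / (b + d))   # (1 - sensitivity) / specificity
    return lr_positive, lr_negative


low_prevalence = likelihood_ratios(a=95, b=45, c=5, d=855)    # 10% prevalence
high_prevalence = likelihood_ratios(a=855, b=5, c=45, d=95)   # 90% prevalence

print(f"10% prevalence: LR+ {low_prevalence[0]:.1f}, LR- {low_prevalence[1]:.3f}")
print(f"90% prevalence: LR+ {high_prevalence[0]:.1f}, LR- {high_prevalence[1]:.3f}")
```

Both tables give the same pair of LRs, which illustrates their independence from prevalence.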
A big advantage of LRs is that they can be calculated for a range of test values, rather than the single normal/abnormal cut point used for sensitivity and specificity. Thus, LRs allow us to use all the information, rather than the limited information in a single normal/abnormal cut point.
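The idea of an LR for each range of test values (often called an interval likelihood ratio) can be sketched as follows. The ranges and patient counts are entirely hypothetical and chosen only for illustration; for each range, the LR is the proportion of diseased patients whose result falls in that range divided by the proportion of nondiseased patients whose result does.

```python
# Hypothetical counts of patients whose test value falls in each range.
ranges = ["< 10", "10-19", "20-39", ">= 40"]
with_disease = [5, 15, 30, 50]
without_disease = [60, 25, 12, 3]

total_with = sum(with_disease)
total_without = sum(without_disease)

interval_lr = {}
for label, n_with, n_without in zip(ranges, with_disease, without_disease):
    # Probability of this result in the diseased over its probability in the nondiseased.
    interval_lr[label] = (n_with / total_with) / (n_without / total_without)

for label, lr in interval_lr.items():
    print(f"test value {label}: LR {lr:.2f}")
```

In this made-up data set, a very abnormal result carries a far larger LR than a mildly abnormal one, information that a single positive/negative cut point would discard.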
As the LR+ becomes larger, the likelihood of the disease increases; as the LR− approaches zero, the disease becomes much less likely. Generally speaking, LRs between 0.5 and 2.0 are not useful, and those between 0.2 and 0.5 or between 2.0 and 5.0 are suggestive but not conclusive. An LR >5 argues for the disease, whereas an LR <0.2 argues against it.
Post-test probability and posterior odds
LRs include information from each cell of the 2 × 2 table; they are not susceptible to the errors that occur in the application of predictive values to conditions of low and high prevalence, respectively, as discussed above. This makes them much more useful diagnostically.
Because LRs are ratios of probabilities, they are an expression of odds. We can use them to derive a new probability for the disease based upon the test result. Because this new probability is determined after the test is done, it is the post-test probability (PP). To calculate the post-test probability, convert the pretest probability to pretest odds and then multiply by the LR to get the post-test odds (posterior odds). Then, convert the post-test odds back to the post-test probability (see Example 1). The post-test probability can be calculated for both a positive test and a negative test.
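The three steps just described (probability to odds, multiply by the LR, odds back to probability) can be sketched as a single function. The function name is illustrative; the LRs used are those of a test with 95% sensitivity and 95% specificity (LR+ = 0.95/0.05 = 19, LR− = 0.05/0.95), applied to a pretest probability of 50%.

```python
def post_test_probability(pretest_probability, likelihood_ratio):
    """Apply a likelihood ratio to a pretest probability."""
    pretest_odds = pretest_probability / (1 - pretest_probability)
    post_test_odds = pretest_odds * likelihood_ratio
    return post_test_odds / (1 + post_test_odds)


after_positive = post_test_probability(0.50, 19)
after_negative = post_test_probability(0.50, 0.05 / 0.95)

print(f"after a positive test: {after_positive:.2f}")
print(f"after a negative test: {after_negative:.2f}")
```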
Clinicians should learn to think in terms of the LR for the parameter ranges of the tests they use. This is the implicit reasoning that experienced and efficient clinicians use in selecting and interpreting their laboratory tests. It is useful to make this process explicit. This allows us to actually do the calculations in the occasional situation where it will be useful, but also helps us to understand and dissect our decision-making processes and to avoid misinterpretation of the significance of either normal or abnormal laboratory results [Barry HC, Ebell MH. Test characteristics and decision rules. Endocrinol Metab Clin North Am. 1997;26:45–65].
Four examples of clinical testing scenarios are given, each with different estimated disease prevalence. For each example, the test is assumed to have 95% sensitivity and 95% specificity. These examples should help you to understand the concepts discussed above.
Example 1: Disease prevalence 1%
Of 10,000 patients, only 100 have the disease (99:1 odds against) (Fig. 17-4). False positives are five times more likely to be found than true positives. The calculations will only be shown for this example.
Figure 17-4. Example 1: Disease Prevalence 1%. The test has a sensitivity of 0.95 and a specificity of 0.95. Calculation of positive and negative predictive values and of positive and negative LR.
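The calculations for Example 1 can be reproduced from the figures stated in the text: 10,000 patients, 1% prevalence, and a test with 95% sensitivity and 95% specificity. The following is a sketch that fills in the four cells of the 2 × 2 table and then derives the predictive values and LRs; the variable names are illustrative.

```python
total = 10_000
prevalence = 0.01
sensitivity = 0.95
specificity = 0.95

diseased = round(total * prevalence)    # a + c
healthy = total - diseased              # b + d

a = round(diseased * sensitivity)       # true positives
c = diseased - a                        # false negatives
d = round(healthy * specificity)        # true negatives
b = healthy - d                         # false positives

ppv = a / (a + b)
npv = d / (c + d)
lr_positive = sensitivity / (1 - specificity)
lr_negative = (1 - sensitivity) / specificity

print(f"a = {a}, b = {b}, c = {c}, d = {d}")
print(f"PPV = {ppv:.3f}, NPV = {npv:.4f}")
print(f"LR+ = {lr_positive:.1f}, LR- = {lr_negative:.3f}")
```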
The PPV (0.16) is better than the baseline risk (0.01), but is still quite low, so a positive test does not even make the diagnosis very probable. The NPV is 0.999 (a residual risk of about 1 in 2,000), which sounds good, but is actually not much better than the already low baseline risk of 0.01 (1 in 100).
Has this highly sensitive and specific test helped you in this situation? Not much. Although the likelihood of disease is much higher with a positive test (the odds against fall from 99:1 to about 5:1), still most positive tests are false positives and further evaluation is necessary.
This example is typical of a screening situation for an uncommon disease in an asymptomatic population. The test has to be very sensitive and very specific to be useful. An example of such a test is HIV testing in pregnant women, but most clinical tests have neither the sensitivity nor specificity required to be effective when disease prevalence is low.
Example 2: Disease prevalence 10%
Of 1000 patients, 100 have the disease (9:1 odds against) (Fig. 17-5).
Figure 17-5. Example 2: Disease Prevalence 10%. The test has a sensitivity of 0.95 and a specificity of 0.95.
In this scenario, 68% of the patients with a positive test have the disease (95 true positives among 140 positive tests), a definite improvement over the 10% at baseline. A positive test is twice as likely to be a true positive as a false positive. The NPV is quite high (0.99), so a negative test is helpful in reducing the likelihood of disease.
Are either the positive or negative results likely to be diagnostically sufficient? A negative test is useful to reduce the post-test probability of disease below any reasonable clinical threshold. A positive test will need to be followed with more specific testing to confirm the diagnosis (raise the probability of disease above the level needed for clinical certainty). This is especially true for any disease with an adverse prognosis or for which therapies are potentially toxic.
This scenario is representative of a case-finding strategy in an at-risk population. A test with 95% sensitivity and specificity could be used to separate the population into a low-risk pool and a high-risk pool, the latter to undergo further testing.
Example 3: Disease prevalence 50%
You have worked up a patient and your clinical impression is that the patient has a 50% chance (1:1 odds) of having the disease (disease prevalence of 0.5) (Fig. 17-6). You construct a 2 × 2 table with what you know.
Figure 17-6. Example 3: Disease Prevalence 50%. The test has a sensitivity of 0.95 and a specificity of 0.95.
The PPV and NPV (both 0.95) are significant improvements over the baseline risk of 0.5. There are relatively few false positives or false negatives.
Does the test help you in this clinical situation? The test is clinically useful regardless of the result. Both a positive and a negative test make substantial changes in the disease probability, and both probably exceed the level of certainty required in most clinical situations.
This example is representative of the situation in which laboratory testing is most useful, that is, true uncertainty, with even odds for and against the disease. Selecting tests with good LR in this setting will have a profound impact on your diagnostic process.
Example 4: Disease prevalence 90%
You have worked up a patient and your clinical impression is that the patient has a 90% chance (9:1 odds in favor) of having the disease (disease prevalence of 0.9) (Fig. 17-7). You construct a 2 × 2 table with what you know.
Figure 17-7. Example 4: Disease Prevalence 90%. The test has a sensitivity of 0.95 and a specificity of 0.95.
It is quite likely the patient has the disease based upon your clinical assessment. A positive test (PPV) only minimally improves your accuracy. A negative test (NPV) reduces the probability, but the disease remains a strong possibility: about one-third of those with a negative test will be misclassified (false negatives).
Has the test helped you in reaching your predetermined levels of certainty required to either diagnose the disease or exclude it from further consideration? No, a positive result adds almost nothing and a negative result has about a one-in-three chance of being an error.
This scenario is representative of a situation when too high a level of certainty is expected for the clinical situation. Neither a positive nor a negative test is helpful.
Note that our test has excellent LRs, and the LRs are the same regardless of the prevalence of disease. Like sensitivity and specificity, the LR is a function of the value of the test chosen as the cut point. This confirms that the LR tells how well a positive or negative test separates the population into higher and lower risk groups. However, the interpretation and usefulness of the test still depend upon the baseline probability of disease (pretest probability): a 20-fold improvement in very long odds is still long odds (999:1 to 49:1), and a 20-fold decrease in very short odds is still an almost even proposition (1:24 to 20:24).
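The two odds illustrations in the preceding paragraph can be checked with exact fractions. This sketch uses the exact LRs of the example test (19 and 1/19) rather than the rounded factor of 20 used in the text, so the results differ slightly from the figures quoted there; the variable names are illustrative, and all odds are expressed in favor of disease.

```python
from fractions import Fraction

lr_positive = Fraction(95, 5)    # 0.95 / 0.05 = 19
lr_negative = Fraction(5, 95)    # 0.05 / 0.95 = 1/19

# Very long odds: 999:1 against disease, followed by a positive test.
long_odds = Fraction(1, 999)
after_positive = long_odds * lr_positive

# Very short odds: 24:1 in favor of disease, followed by a negative test.
short_odds = Fraction(24, 1)
after_negative = short_odds * lr_negative


def to_probability(odds):
    """Convert odds in favor to a probability."""
    return odds / (1 + odds)


print(f"positive test: odds against are now about {float(1 / after_positive):.0f}:1 "
      f"(probability {float(to_probability(after_positive)):.3f})")
print(f"negative test: odds in favor are now about {float(after_negative):.2f}:1 "
      f"(probability {float(to_probability(after_negative)):.3f})")
```

Even after a positive result the first patient is still very unlikely to have the disease, and even after a negative result the second patient is still slightly more likely than not to have it.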
The reader is encouraged to construct their own examples and vary the prevalence of disease and the sensitivity and specificity of the test in order to familiarize themselves with these concepts. The formal calculations are rarely done in clinical practice, but the principles and concepts are used every day by skilled clinicians in deciding how to evaluate their differential diagnoses.
As demonstrated in Example 3, diagnostic testing is most useful when true uncertainty exists with nearly even odds for and against the condition. The purpose of forming a probabilistic differential diagnosis is to identify the conditions that are truly uncertain (approximately even odds), where testing can improve your probability assessment. Most physical findings do not have positive LRs of sufficient magnitude to be used as a clinical test of the hypotheses; they do not establish a diagnosis [McGee S. Evidence-Based Physical Diagnosis. 2nd ed. Philadelphia, PA: WB Saunders; 2007]. The history and physical examination are for hypothesis generation and estimation of pretest probabilities (prior odds). Many laboratory tests have LRs that allow accurate diagnostic discrimination: the laboratory is the best place to test your specific hypotheses. When clinical probability estimates are either very high or very low, further testing is not useful and is often misleading.
2 × 2 Tables Revisited: Caveat Emptor
If you plan to use the sensitivity, specificity or LR generated from a 2 × 2 table, it is necessary to understand the methods used for selection of the test sample that produced the data. Each of these parameters is dependent upon the inclusion criteria for the categories disease-present and disease-absent and the method for identifying the population(s) that were tested.
Most diseases have a broad spectrum of severity that is generally reflected in the amount of aberration in the tests characteristic of the disease: more-severe disease, more-abnormal tests; less-severe disease, less-abnormal or even normal test results. As can be seen from Figure 17-1 and the preceding discussion, if the investigators choose to define the presence of disease as those with more-severe disease (cut point moved to the right), the test will appear more specific and less sensitive, and the positive LR will improve, whereas the negative LR becomes less useful. If they choose an inclusive definition to reflect the broad range of those with the disease (cut point moved to the left), the test will be more sensitive and less specific, and the negative LR will improve, whereas the positive LR becomes less useful. In addition, if the 2 × 2 table was constructed using patients with unusually severe disease (as may be seen in an academic referral practice), the sensitivity and specificity calculated may be inappropriately high if applied to a more representative population of patients.
Broadly speaking there are three methods of identifying patients to generate data for a 2 × 2 table.
By far the easiest method is to take patients from a known diseased group (e.g., patients with known systemic lupus erythematosus, attending a rheumatology clinic) and another group of patients from a nondiseased population (e.g., blood donors) and apply the test (e.g., an antinuclear antibody test) to both groups. This will generate a 2 × 2 table weighted to more severe disease because the patients are already diagnosed and attending a clinic (see Severity of disease above). The apparent prevalence is not a real population prevalence; it is an a priori choice of the investigators as to how many people they want in each group. Tests evaluated by this method often appear very good when you calculate the sensitivity, specificity, and LR. Because the population of the table is really two independent populations, the PPV and NPV have no meaning. The clinical usefulness of information generated by this sampling method is marginal at best when the clinician attempts to apply the test parameters to an unselected population.
The second, far more difficult, method is to select a population of patients that represent the community at large (e.g., a random sample of adults), perform the test on all of them, and also evaluate all of them by the gold-standard criteria for the disease. When diseases have a low prevalence in the population (e.g., systemic lupus erythematosus), huge numbers of patients would need very thorough evaluations at tremendous expense to identify enough cases to produce any meaningful data. Hence, this method is only applied to the evaluation of screening tests proposed for large populations (e.g., fecal occult blood testing for colon cancer). This method also does not generate clinically useful data for the clinician outside of the screening paradigm.
The most clinically useful information is generated by selecting patients from a population that presents with the challenge faced by the physician: patients who might have the disease based upon history and physical examination (an intermediate pretest probability, near even odds). A consecutive series of such patients is identified and the test and diagnostic gold standard are applied to all. The data generated in this way are far more useful to the clinician when faced with a diagnostic challenge. The test parameters (LR, sensitivity, and specificity), the prevalence of disease, and the PPV and NPV are much more likely to be applicable to clinical decision making. The clinician still must assess whether the gender, ethnic mix, ages, and comorbidities of the test population are representative of their patient population.
The phrases “rule in” and “rule out” are commonplace in the clinical vernacular but are discouraged.
Some diagnoses can be confirmed by specific pathologic tests (e.g., neoplasms, vasculitis), by laboratory tests (e.g., HIV infection, myocardial injury, sickle cell disease), and by microbiologic tests (e.g., cultures and polymerase chain reaction [PCR] identification of specific organisms).
It is impossible, short of necropsy, to “rule out” a diagnosis. When tests with a very low negative LR are negative in patients with intermediate or low pretest probability for the disease, the probability of the disease becomes very small, but never zero. In each clinical case, we empirically set a clinical level of certainty required to confirm a diagnosis, as we discussed in Chapter 1. We also determine a level of clinical certainty needed to effectively exclude a diagnosis from further consideration. This will depend upon the patient, the clinical scenario, and the risk associated with drawing a false-negative conclusion. When we have assured ourselves that the diagnosis is less probable than our threshold, we can say it is excluded clinically, but it is never “ruled out.”
Furthermore, many clinical conditions, especially syndromes (e.g., rheumatoid arthritis), have no gold-standard diagnostic test or exclusion criteria.
The skilled clinician uses a patient’s history and physical examination to generate pathophysiologic and diagnostic hypotheses. A differential diagnosis includes those diseases with the highest estimated probability of being present in this patient, and less likely diseases associated with severe morbidity if not promptly diagnosed. An explicit estimation of the probability for each is made. Tests are selected for which the result (positive, negative, or a specific value or finding) will generate a post-test probability (applying the LR) of a clinically significant high or low level. On the basis of the first round of test results and repeat examination of the patient, a refined differential diagnosis is generated and a second round of tests may be ordered. This process is repeated until the post-test probability for the diagnosis exceeds the threshold required by the clinical situation. At that point, a working diagnosis is established.