Table 1–2 lists the general characteristics of useful diagnostic tests. Most of the principles detailed below can be applied not only to laboratory and radiologic tests but also to elements of the history and physical examination. An understanding of these characteristics is very helpful to the clinician when ordering and interpreting diagnostic tests.
The accuracy of a laboratory test is its correspondence with the true value. A test is deemed inaccurate when the result differs from the true value even though the results may be reproducible (Figure 1–1A); this represents systematic error (or bias). For example, serum creatinine is commonly measured by a kinetic Jaffe method, which has a systematic error as large as 0.23 mg/dL when compared with the gold standard gas chromatography-isotope dilution mass spectrometry method. In the clinical laboratory, accuracy of tests is maximized by calibrating laboratory equipment with reference material and by participating in external proficiency testing programs.
Relationship between accuracy and precision in diagnostic tests. The center of the target represents the true value of the substance being tested. (A) A diagnostic test that is precise but inaccurate; repeated measurements yield very similar results, but all results are far from the true value. (B) A test that is imprecise and inaccurate; repeated measurements yield widely different results, and the results are far from the true value. (C) An ideal test that is both precise and accurate.
Test precision is a measure of a test's reproducibility when repeated on the same sample. If the same specimen is analyzed many times, some variation in results (random error) is expected; this variability is expressed as a coefficient of variation (CV: the standard deviation divided by the mean, often expressed as a percentage). For example, if the laboratory reports a CV of 5% for serum creatinine and accepts results within ±2 standard deviations, then for a sample with a serum creatinine of 1.0 mg/dL, repeated measurements of that same sample may be reported as anywhere from 0.90 to 1.10 mg/dL.
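As a minimal sketch of this arithmetic (the 5% CV and the 1.0 mg/dL creatinine value are the illustrative figures above; the helper function is hypothetical), the expected reporting range is the measured value ± 2 standard deviations, where the standard deviation equals CV × value:

```python
# Minimal sketch: coefficient of variation (CV) and the ±2 SD reporting range.
# The CV (5%) and creatinine value (1.0 mg/dL) are the illustrative figures from the text.

def reporting_range(value, cv, n_sd=2):
    """Range of results expected from analytic imprecision alone:
    value ± n_sd standard deviations, where SD = CV * value."""
    sd = cv * value
    return value - n_sd * sd, value + n_sd * sd

low, high = reporting_range(1.0, 0.05)  # serum creatinine 1.0 mg/dL, CV 5%
print(f"Expected range: {low:.2f} to {high:.2f} mg/dL")  # 0.90 to 1.10 mg/dL
```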
An imprecise test is one that yields widely varying results on repeated measurements (Figure 1–1B). The precision of diagnostic tests, which is monitored in clinical laboratories by using control material, must be good enough to distinguish clinically relevant changes in a patient's status from the analytic variability (imprecision) of the test. For instance, the manual peripheral white blood cell differential count may not be precise enough to detect important changes in the distribution of cell types, because it is based on a subjective evaluation of a small sample (eg, 100 cells); repeated counts by different technicians on the same sample yield widely differing results. Automated differential counts are more precise because they are obtained from machines that use objective physical characteristics to classify a much larger sample (eg, 10,000 cells).
An ideal test is both precise and accurate (Figure 1–1C).
Some diagnostic tests are reported as positive or negative, but many are reported quantitatively. Use of reference intervals is one technique for interpreting quantitative results. Reference intervals are often method- and laboratory-specific. In practice, they often represent test results found in 95% of a small population presumed to be healthy; by definition, then, 5% of healthy patients will have an abnormal test result (Figure 1–2). Slightly abnormal results should be interpreted critically; they may be either truly abnormal or falsely abnormal. Statistically, the probability that a healthy person will have 2 separate test results within the reference interval is 0.95 × 0.95, ie, 90.25%; for 5 separate tests, it is 77.4%; for 10 tests, 59.9%; and for 20 tests, 35.8%. The larger the number of tests ordered, the greater the probability that one or more of the test results will fall outside the reference intervals (Table 1–3). Conversely, values within the reference interval may not rule out the actual presence of disease, since the reference interval does not establish the distribution of results in patients with disease.
The reference interval is usually defined as within 2 SD of the mean test result (shown as –2 and 2) in a small population of healthy volunteers. Note that in this example, test results are normally distributed; however, many biologic substances have distributions that are skewed.
Table 1–3. Relationship between Number of Tests and Probability of One or More Abnormal Results in a Healthy Person.

|Number of Tests|Probability of One or More Abnormal Results (%)|
|1|5.0|
|2|9.8|
|5|22.6|
|10|40.1|
|20|64.2|
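These probabilities follow from the 95% definition of the reference interval. A brief sketch of the calculation, assuming each test result is independent:

```python
# Minimal sketch: probability that a healthy person has one or more results
# outside the reference interval, assuming independent tests and a 95% interval.

def p_one_or_more_abnormal(n_tests, p_within=0.95):
    """Probability that at least one of n independent results is abnormal."""
    return 1 - p_within ** n_tests

for n in (1, 2, 5, 10, 20):
    print(f"{n:>2} tests: {p_one_or_more_abnormal(n):.1%}")
# 1 test: 5.0%; 2 tests: 9.8%; 5 tests: 22.6%; 10 tests: 40.1%; 20 tests: 64.2%
```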
It is important to consider also whether published reference intervals are appropriate for the particular patient being evaluated, since some intervals depend on age, sex, weight, diet, time of day, activity status, posture, or even season. Biologic variability occurs among individuals as well as within the same individual. For instance, serum estrogen levels in women vary from day to day, depending on the menstrual cycle; serum cortisol shows diurnal variation, being highest in the morning and decreasing later in the day; and vitamin D shows seasonal variation with lower values in winter.
The results of diagnostic tests can be altered by external factors, such as ingestion of drugs, and by internal factors, such as abnormal physiologic states. These factors contribute to biologic variability and must be considered in the interpretation of test results.
External interferences can affect test results in vivo or in vitro. In vivo, alcohol increases γ-glutamyl transpeptidase, and diuretics can affect sodium and potassium concentrations. Cigarette smoking can induce hepatic enzymes and thus reduce levels of substances such as theophylline that are metabolized by the liver. In vitro, cephalosporins may produce spurious serum creatinine levels due to interference with a common laboratory method of analysis.
Internal interferences result from abnormal physiologic states interfering with the test measurement. For example, patients with gross lipemia may have spuriously low serum sodium levels if the test methodology includes a step in which serum is diluted before sodium is measured, and patients with endogenous antibodies (eg, human anti-mouse antibodies) may have falsely high or low results in automated immunoassays. Because of the potential for test interference, clinicians should be wary of unexpected test results and should investigate reasons other than disease that may explain abnormal results, including pre-analytical and analytical laboratory error.
Sensitivity and Specificity
Clinicians should use measures of test performance such as sensitivity and specificity to judge the quality of a diagnostic test for a particular disease.
Test sensitivity is the ability of a test to detect disease and is expressed as the percentage of patients with disease in whom the test is positive. Thus, a test that is 90% sensitive gives positive results in 90% of diseased patients and negative results in 10% of diseased patients (false negatives). Generally, a test with high sensitivity is useful to exclude a diagnosis, because a highly sensitive test yields fewer results that are falsely negative. To exclude infection with the virus that causes AIDS, for instance, a clinician might choose a highly sensitive test, such as the HIV antibody test or antigen/antibody combination test.
A test's specificity is the ability to detect absence of disease and is expressed as the percentage of patients without disease in whom the test is negative. Thus, a test that is 90% specific gives negative results in 90% of patients without disease and positive results in 10% of patients without disease (false positives). A test with high specificity is useful to confirm a diagnosis, because a highly specific test has fewer results that are falsely positive. For instance, to make the diagnosis of gouty arthritis, a clinician might choose a highly specific test, such as the presence of negatively birefringent needle-shaped crystals within leukocytes on microscopic evaluation of joint fluid.
To determine test sensitivity and specificity for a particular disease, the test must be compared against an independent "gold standard" test or established standard diagnostic criteria that define the true disease state of the patient. For instance, the sensitivity and specificity of rapid antigen detection testing in diagnosing group A β-hemolytic streptococcal pharyngitis are obtained by comparing the results of rapid antigen testing with the gold standard test, throat swab culture. Importantly, the gold standard test must be applied to patients with both positive and negative rapid antigen results; if patients with negative rapid antigen tests do not undergo the gold standard test, false negatives will not be identified and test sensitivity will be overestimated. However, for many disease states (eg, pancreatitis), an independent gold standard test either does not exist or is very difficult or expensive to apply; in such cases, reliable estimates of test sensitivity and specificity are sometimes difficult to obtain.
Sensitivity and specificity can also be affected by the population from which these values are derived. For instance, many diagnostic tests are evaluated first using patients who have severe disease and control groups who are young and well. Compared with the general population, these study groups will have more results that are truly positive (because patients have more advanced disease) and more results that are truly negative (because the control group is healthy). Thus, test sensitivity and specificity will be higher than would be expected in the general population, where more of a spectrum of health and disease is found. Clinicians should be aware of this spectrum bias when generalizing published test results to their own practice. To minimize spectrum bias, the control group should include individuals who have diseases related to the disease in question, but who lack this principal disease. For example, to establish the sensitivity and specificity of the anti-cyclic citrullinated peptide test for rheumatoid arthritis, the control group should include patients with rheumatic diseases other than rheumatoid arthritis. Other biases, including spectrum composition, population recruitment, absent or inappropriate reference standard, and verification bias, are discussed in the references.
It is important to remember that the reported sensitivity and specificity of a test depend on the analyte level (threshold) used to distinguish a normal from an abnormal test result. If the threshold is lowered, sensitivity is increased at the expense of decreased specificity. If the threshold is raised, sensitivity is decreased while specificity is increased (Figure 1–3).
Hypothetical distribution of test results for healthy and diseased individuals. The position of the "cutoff point" between "normal" and "abnormal" (or "negative" and "positive") test results determines the test's sensitivity and specificity. If point A is the cutoff point, the test would have 100% sensitivity but low specificity. If point C is the cutoff point, the test would have 100% specificity but low sensitivity. For many tests, the cutoff point (B) is set at the value of the mean plus 2 SD of test results for healthy individuals. In some situations, the cutoff is altered to enhance either sensitivity or specificity.
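The trade-off shown in Figure 1–3 can be sketched with two hypothetical, normally distributed populations; the means, standard deviations, and cutoff values below are invented solely for illustration:

```python
# Sketch of the cutoff trade-off in Figure 1-3, using hypothetical, normally
# distributed analyte values; the means, SDs, and cutoffs are illustrative only.
from statistics import NormalDist

healthy = NormalDist(mu=50, sigma=10)    # hypothetical healthy population
diseased = NormalDist(mu=80, sigma=15)   # hypothetical diseased population

def sens_spec(cutoff):
    """Call results above the cutoff positive; return (sensitivity, specificity)."""
    sensitivity = 1 - diseased.cdf(cutoff)  # diseased patients correctly called positive
    specificity = healthy.cdf(cutoff)       # healthy patients correctly called negative
    return sensitivity, specificity

for cutoff in (55, 70, 85):  # loosely analogous to points A, B, and C
    sens, spec = sens_spec(cutoff)
    print(f"cutoff {cutoff}: sensitivity {sens:.0%}, specificity {spec:.0%}")
# Lowering the cutoff raises sensitivity at the expense of specificity, and vice versa.
```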
Figure 1–4 shows how test sensitivity and specificity can be calculated using test results from patients previously classified by the gold standard test as diseased or nondiseased.
Calculation of sensitivity, specificity, and probability of disease after a positive test (posttest probability). TP, true positive; FP, false positive; FN, false negative; TN, true negative.
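A minimal sketch of the calculations summarized in Figure 1–4, using an invented 2 × 2 table of counts against the gold standard:

```python
# Minimal sketch: sensitivity, specificity, and posttest probability after a
# positive test, computed from a 2x2 table. All counts below are hypothetical.

TP, FP, FN, TN = 90, 40, 10, 860  # hypothetical counts vs. the gold standard

sensitivity = TP / (TP + FN)           # positive results among the diseased
specificity = TN / (TN + FP)           # negative results among the nondiseased
posttest_probability = TP / (TP + FP)  # probability of disease after a positive test

print(f"Sensitivity: {sensitivity:.0%}")                    # 90%
print(f"Specificity: {specificity:.0%}")                    # 96%
print(f"Posttest probability: {posttest_probability:.0%}")  # 69%
```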
The performance of two different tests can be compared by plotting their receiver operator characteristic (ROC) curves, obtained by plotting sensitivity against (1 − specificity) across a range of cutoff values for each test. The resulting curves often show which test is better: a clearly superior test will have an ROC curve that always lies above and to the left of the inferior test's curve, and, in general, the better test will have a larger area under the ROC curve. For instance, Figure 1–5 shows the ROC curves for prostate-specific antigen (PSA) and prostatic acid phosphatase in the diagnosis of prostate cancer. PSA is a superior test because it has higher sensitivity and specificity at every cutoff value.
Receiver operator characteristic (ROC) curves for prostate-specific antigen (PSA) and prostatic acid phosphatase (PAP) in the diagnosis of prostate cancer. For all cutoff values, PSA has higher sensitivity and specificity; therefore, it is a better test based on these performance characteristics. (Data from Nicoll CD et al. Routine acid phosphatase testing for screening and monitoring prostate cancer no longer justified. Clin Chem 1993;39:2540.)
Note that, for a given test, the ROC curve also allows one to identify the cutoff value that minimizes both false-positive and false-negative results. This is located at the point closest to the upper-left corner of the curve. The optimal clinical cutoff value, however, depends on the condition being detected and the relative importance of false-positive versus false-negative results.
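One common way to summarize and compare ROC curves is the trapezoidal area under the curve together with the distance from each point to the upper-left corner; the following sketch uses hypothetical sensitivity/specificity pairs rather than data from any real test:

```python
# Minimal sketch: comparing ROC curves by the trapezoidal area under the curve
# (AUC) and finding the cutoff closest to the upper-left corner of the plot.
# The (cutoff, sensitivity, specificity) triples below are hypothetical.
import math

roc_points = [
    (1, 0.99, 0.10),
    (2, 0.95, 0.40),
    (4, 0.90, 0.75),
    (8, 0.70, 0.92),
    (16, 0.40, 0.99),
]

def auc(points):
    """Trapezoidal area under the curve of sensitivity vs. (1 - specificity)."""
    xy = sorted([(1 - spec, sens) for _, sens, spec in points] + [(0.0, 0.0), (1.0, 1.0)])
    return sum((x2 - x1) * (y1 + y2) / 2 for (x1, y1), (x2, y2) in zip(xy, xy[1:]))

def best_cutoff(points):
    """Cutoff whose ROC point lies closest to the upper-left corner (0, 1)."""
    return min(points, key=lambda p: math.hypot(1 - p[2], 1 - p[1]))[0]

print(f"AUC = {auc(roc_points):.2f}")                                          # ~0.89
print(f"Cutoff closest to the upper-left corner: {best_cutoff(roc_points)}")   # 4
```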