Table e3–2 lists the general
characteristics of useful diagnostic tests. Most of the principles
detailed below can be applied not only to laboratory and radiologic
tests but also to elements of the history and physical examination.
An understanding of these characteristics is very helpful to the clinician
when ordering and interpreting diagnostic tests.
The accuracy of a laboratory test is its correspondence with the true value. A test is deemed inaccurate when the result differs from the true value even though the results may be reproducible (Figure e3–1A); this type of discrepancy is also called systematic error (or bias). For example, serum creatinine is commonly measured by a kinetic Jaffe method, which has a systematic error as large as 0.23 mg/dL (20.33 mcmol/L) when compared with the gold standard gas chromatography-isotope dilution mass spectrometry method. In the clinical laboratory, accuracy of tests is maximized by calibrating laboratory equipment with standard reference material and by participation in external proficiency testing programs.
Figure e3–1. Relationship between accuracy and precision in diagnostic tests. The center of the target represents the true value of the substance being tested. A: A diagnostic test that is precise but inaccurate; repeated measurements yield very similar results, but all results are far from the true value. B: A test that is imprecise and inaccurate; repeated measurements yield widely different results, and the results are far from the true value. C: An ideal test that is both precise and accurate. (Reproduced, with permission, from Nicoll D et al. Pocket Guide to Diagnostic Tests, 6th ed. McGraw-Hill, 2012.)
Test precision is a measure of a test’s reproducibility when repeated on the same sample. If the same specimen is analyzed many times, some variation in results (random error) is expected; this variability is expressed as a coefficient of variation (CV: the standard deviation divided by the mean, often expressed as a percentage). For example, when the laboratory reports a CV of 5% for serum creatinine and accepts results within ± 2 standard deviations, it denotes that, for a sample with serum creatinine of 1.0 mg/dL (88.4 mcmol/L), the laboratory may report the result as anywhere from 0.90 (79.56 mcmol/L) to 1.10 mg/dL (97.24 mcmol/L) on repeated measurements from the same sample.
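The arithmetic in the creatinine example can be checked with a short sketch. The function name `reporting_range` is illustrative only, and the conversion factor of 88.4 is inferred from the paired units given in the text (1.0 mg/dL = 88.4 mcmol/L).

```python
# Conversion factor inferred from the text: 1.0 mg/dL creatinine = 88.4 mcmol/L.
MG_DL_TO_MCMOL_L = 88.4


def reporting_range(mean, cv_percent, n_sd=2.0):
    """Return (low, high) for mean +/- n_sd standard deviations.

    The CV is the standard deviation divided by the mean, expressed here
    as a percentage, so SD = mean * CV / 100.
    """
    sd = mean * cv_percent / 100.0
    return mean - n_sd * sd, mean + n_sd * sd


# Example from the text: creatinine 1.0 mg/dL, CV 5%, results accepted within 2 SD.
low, high = reporting_range(mean=1.0, cv_percent=5.0)
print(f"{low:.2f} to {high:.2f} mg/dL")
print(f"{low * MG_DL_TO_MCMOL_L:.2f} to {high * MG_DL_TO_MCMOL_L:.2f} mcmol/L")
```

With a CV of 5%, the standard deviation at 1.0 mg/dL is 0.05 mg/dL, so ± 2 standard deviations spans 0.90 to 1.10 mg/dL, in agreement with the range quoted above.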
An imprecise test is one that yields widely varying results on
repeated measurements (Figure e3–1B). The precision of diagnostic tests, which is monitored in clinical laboratories
by using quality control material, must be good enough to distinguish clinically
relevant changes in a patient’s status from the analytic
variability (imprecision) of the test. For instance, the manual
peripheral blood white blood cell differential count may not be
precise enough to detect important changes in the distribution of
cell types, because it is calculated by subjective evaluation of
a small sample (eg, 100 cells). Repeated measurements by different technicians
on the same sample often yield widely differing results. Automated
differential counts are more precise because they are obtained from
machines that use objective physical characteristics to classify
a much larger sample (eg, 10,000 cells).
Some diagnostic tests are reported as positive or negative, but
many are reported quantitatively. Use of reference intervals is
a technique for interpreting quantitative results. Reference intervals
are often method- and laboratory-specific. In practice, they often represent
test results found in 95% of a small population presumed
to be healthy; by definition, then, 5% of healthy patients
will have an abnormal test result (Figure e3–2). Slightly abnormal results should be interpreted critically—they
may be either truly abnormal or falsely abnormal. Statistically, the probability that a healthy person will have 2 separate test results within the reference interval is 0.95 × 0.95 = 0.9025, or 90.25%; for 5 separate tests, it is 77.4%; for 10 tests, 59.9%; and for 20 tests, 35.8%.
The larger the number of tests ordered, the greater the probability
that one or more of the test results will fall outside the reference
interval (Table e3–3). Conversely, values
within the reference interval may not rule out the actual presence
of disease since the reference interval does not establish the distribution
of results in patients with disease. As such, reference intervals
must be used within the context of medical knowledge about the disorder.
Figure e3–2. The reference interval is usually defined as within 2 SD of the mean test result (shown as -2 and 2) in a small population of healthy volunteers. Note that in this example, test results are normally distributed; however, many biologic substances have distributions that are skewed. (Reproduced, with permission, from Nicoll D et al. Pocket Guide to Diagnostic Tests, 6th ed. McGraw-Hill, 2012.)
Table e3–3. Relationship
between the number of tests and the probability that a healthy person
will have one or more abnormal results.
|Number of Tests||Probability That One or More Results Will Be Abnormal|
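
The relationship between the number of tests and the chance of an abnormal result can be sketched as follows. This is a minimal illustration and not a reproduction of Table e3–3: it assumes that each test has a 95% reference interval and that the results are statistically independent, which is the assumption behind the multiplication 0.95 × 0.95 in the text. The function names are illustrative only.

```python
def prob_all_within(n_tests, p_within=0.95):
    """Probability that a healthy person has all n_tests results within
    the reference interval, assuming independent tests."""
    return p_within ** n_tests


def prob_any_abnormal(n_tests, p_within=0.95):
    """Probability that one or more of n_tests results fall outside
    the reference interval (the complement of the above)."""
    return 1.0 - prob_all_within(n_tests, p_within)


for n in (1, 2, 5, 10, 20):
    print(
        f"{n:>2} tests: all within = {prob_all_within(n):.3f}, "
        f"one or more abnormal = {prob_any_abnormal(n):.3f}"
    )
```

In practice, results from related analytes are often correlated, so these figures should be read as an approximation under the independence assumption.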
It is important to consider also whether published reference
intervals are appropriate for the particular patient being evaluated,
since some intervals depend on age, sex, weight, diet, time of day,
activity status, posture, or even season. Biologic variability occurs among
individuals as well as within the same individual. For instance,
serum estrogen levels in women vary from day to day depending on
the menstrual cycle; serum cortisol shows diurnal variation, being
highest in the morning and decreasing later in the day; and vitamin
D shows seasonal variation with lower values in winter.
Table 3 of the Appendix contains the reference intervals for
commonly used chemistry and hematology tests. Test performance characteristics
such as sensitivity and specificity are needed to interpret results
and are discussed below.
The results of diagnostic tests can be altered by external factors,
such as ingestion of drugs, and by internal factors, such as abnormal
physiologic states. These factors contribute to biologic variability
and must be considered in the interpretation of test results.
External interferences can affect test results in vivo or in vitro. In vivo, alcohol increases gamma-glutamyl transpeptidase, and diuretics can affect sodium and potassium concentrations. Cigarette smoking can induce hepatic enzymes and thus reduce levels of substances such as theophylline that are metabolized by the liver. In vitro, cephalosporins may produce spurious serum creatinine levels due to interference with a common laboratory method of analysis.
Internal interferences result from abnormal physiologic states
interfering with the test measurement. For example, patients with
gross lipemia may have spuriously low serum sodium levels if the
test methodology includes a step in which serum is diluted
before sodium is measured, and patients with endogenous antibodies (eg, human anti-mouse antibodies) may have falsely high or low results in automated immunoassays. Because of the potential for test interference,
clinicians should be wary of unexpected test results and should investigate
reasons other than disease that may explain abnormal results, including
pre-analytical and analytical laboratory error.
Ismail AA. Interference from endogenous antibodies in automated immunoassays: what laboratorians need to know. J Clin Pathol. 2009 Aug;62(8):673–8.
Sturgeon CM et al. Analytical error and interference in immunoassay: minimizing risk. Ann Clin Biochem. 2011 Sep;48(Pt 5):418–32.
Sensitivity & Specificity
Clinicians should use measures of test performance such as sensitivity
and specificity to judge the quality of a diagnostic test for a particular disease.
Test sensitivity is the ability of a test to detect disease and is expressed as the percentage of patients with disease in whom the test is positive. Thus, a test that is 90% sensitive gives positive results in 90% of diseased patients and negative results in 10% of diseased patients (false negatives). Generally, a test with high sensitivity is useful to exclude a diagnosis because a highly sensitive test renders few results that are falsely negative. To exclude infection with the virus that causes AIDS, for instance, a clinician might choose a highly sensitive test, such as the HIV p24 antigen and HIV antibody combination test.
A test’s specificity is the ability to detect absence of disease and is expressed as the percentage of patients without disease in whom the test is negative. Thus, a test that is 90% specific gives negative results in 90% of patients without disease and positive results in 10% of patients without disease (false positives). A test with high specificity is useful to confirm a diagnosis, because a highly specific test has few results that are falsely positive. For instance, to make the diagnosis of gouty arthritis, a clinician might choose a highly specific test, such as the presence of negatively birefringent needle-shaped urate crystals on microscopic evaluation of joint fluid.
To determine test sensitivity and specificity for a particular disease, the test must be compared against an independent "gold standard" test or established standard diagnostic criteria that define the true disease state of the patient. For instance, the sensitivity and specificity of rapid antigen detection testing in diagnosing group A beta-hemolytic streptococcal pharyngitis are obtained by comparing the results of rapid antigen testing with the gold standard test, throat swab culture. Application of the gold standard test to patients with positive rapid antigen testing establishes specificity. Failure to apply the gold standard test following a negative rapid test may result in an overestimation of sensitivity, since false negatives will not be identified. However, for many disease states (eg, pancreatitis), an independent gold standard test either does not exist or is very difficult or expensive to apply, and in such cases reliable estimates of test sensitivity and specificity are sometimes difficult to obtain.
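The overestimation of sensitivity that results from incomplete verification can be illustrated with a small sketch. The cohort below is entirely hypothetical (100 patients who truly have the disease, 80 of them rapid-test positive), and the function name `apparent_sensitivity` is illustrative only.

```python
def apparent_sensitivity(true_pos, false_neg, fraction_negatives_verified):
    """Sensitivity as it would be estimated when only a fraction of
    rapid-test-negative patients go on to the gold standard test.

    False negatives that are never verified are never counted, so the
    denominator shrinks and the estimate rises.
    """
    detected_false_neg = false_neg * fraction_negatives_verified
    return true_pos / (true_pos + detected_false_neg)


# Hypothetical cohort: 100 diseased patients, 80 positive and 20 negative
# on the rapid test (true sensitivity 80%).
true_pos, false_neg = 80, 20

# Every negative rapid test is followed by culture.
print(f"{apparent_sensitivity(true_pos, false_neg, 1.0):.3f}")
# Only one quarter of negative rapid tests are followed by culture.
print(f"{apparent_sensitivity(true_pos, false_neg, 0.25):.3f}")
```

When all negatives are verified the estimate equals the true sensitivity; when only some are verified, the estimate is higher than the true value.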
Sensitivity and specificity can also be affected by the population
from which these values are derived. For instance, many diagnostic
tests are evaluated first using patients who have severe disease
and control groups who are young and well. Compared with the general
population, this study group will have more results that are truly
positive (because patients have more advanced disease) and more
results that are truly negative (because the control group is healthy).
Thus, test sensitivity and specificity will be higher than would
be expected in the general population, where more of a spectrum
of health and disease is found. Clinicians should be aware of this spectrum bias when generalizing published
test results to their own practice. To minimize spectrum bias, the control group should include individuals who have diseases related to the disease in question, but who lack this principal disease. For example, to establish the sensitivity and specificity of the anti-cyclic citrullinated peptide test for rheumatoid arthritis, the control group should include patients with rheumatic diseases other than rheumatoid arthritis. Other biases, including spectrum composition, population recruitment, absent or inappropriate reference standard, and verification bias, should also be considered in certain situations, where critical appraisal of published articles may be necessary.
It is important to remember that the reported sensitivity and
specificity of a test depend on the analyte level (threshold) used
to distinguish a normal from an abnormal test result. If the threshold
is lowered, sensitivity is increased at the expense of decreased
specificity. If the threshold is raised, sensitivity is decreased
while specificity is increased (Figure e3–3).
Figure e3–3. Hypothetical distribution of test results for healthy and diseased individuals. The position of the “cutoff point” between “normal” and “abnormal” (or “negative” and “positive”) test results determines the test’s sensitivity and specificity. If point A is the cutoff point, the test would have 100% sensitivity but low specificity. If point C is the cutoff point, the test would have 100% specificity but low sensitivity. For many tests, the cutoff point is determined by the reference interval, ie, the range of test results that is within 2 SD of the mean of test results for healthy individuals (point B). In some situations, the cutoff is altered to enhance either sensitivity or specificity. (Reproduced, with permission, from Nicoll D et al. Pocket Guide to Diagnostic Tests, 6th ed. McGraw-Hill, 2012.)
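The trade-off between sensitivity and specificity as the cutoff moves can be sketched numerically. The two normal distributions below are hypothetical (arbitrary units, with higher values indicating disease) and are not taken from the figure. Because normal distributions have infinite tails, the sketch approaches but never reaches exactly 100% sensitivity or specificity.

```python
from statistics import NormalDist

# Hypothetical result distributions in arbitrary units (assumptions, not source data).
healthy = NormalDist(mu=50, sigma=10)
diseased = NormalDist(mu=80, sigma=10)


def sensitivity(cutoff):
    """Fraction of diseased individuals whose result exceeds the cutoff."""
    return 1.0 - diseased.cdf(cutoff)


def specificity(cutoff):
    """Fraction of healthy individuals whose result is at or below the cutoff."""
    return healthy.cdf(cutoff)


# Low, intermediate (healthy mean + 2 SD), and high cutoffs.
for cutoff in (55, 70, 85):
    print(
        f"cutoff {cutoff}: sensitivity {sensitivity(cutoff):.3f}, "
        f"specificity {specificity(cutoff):.3f}"
    )
```

Lowering the cutoff raises sensitivity and lowers specificity; raising it does the reverse.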
Figure e3–4 shows how test sensitivity
and specificity can be calculated using test results from patients
previously classified by the gold standard test as diseased or nondiseased.
Figure e3–4. Calculation of sensitivity, specificity, and probability of disease after a positive test (posttest probability). TP, true positive; FP, false positive; FN, false negative; TN, true negative. (Reproduced, with permission, from Nicoll D et al. Pocket Guide to Diagnostic Tests, 6th ed. McGraw-Hill, 2012.)
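A minimal sketch of these calculations follows, using the standard definitions that match the text: sensitivity is the percentage of diseased patients with a positive test, and specificity is the percentage of nondiseased patients with a negative test. The counts are hypothetical, and the function name `two_by_two` is illustrative only.

```python
def two_by_two(tp, fp, fn, tn):
    """Return (sensitivity, specificity, posttest probability after a positive test)
    from the four cells of a 2 x 2 table classified by the gold standard."""
    sensitivity = tp / (tp + fn)            # positives among the diseased
    specificity = tn / (tn + fp)            # negatives among the nondiseased
    posttest_if_positive = tp / (tp + fp)   # diseased among those testing positive
    return sensitivity, specificity, posttest_if_positive


# Hypothetical study: 100 diseased and 100 nondiseased patients.
sens, spec, posttest = two_by_two(tp=90, fp=10, fn=10, tn=90)
print(f"sensitivity {sens:.0%}, specificity {spec:.0%}, posttest probability {posttest:.0%}")
```

The posttest probability computed this way reflects the proportion of diseased patients in the study sample, so it applies to other settings only if the prevalence of disease is similar.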
The performance of two different tests can be compared by plotting their receiver operator characteristic (ROC) curves at various reference interval cutoff values. The resulting curve for each test, obtained by plotting the sensitivity against (1 − specificity), often shows which test is better; a clearly superior test will have an ROC curve that always lies above and to the left of the inferior test curve, and, in general, the better test will have a larger area under the ROC curve. For instance, Figure e3–5 shows the ROC curves for prostate-specific antigen (PSA) and prostatic acid phosphatase (PAP) in the diagnosis of prostate cancer. PSA is the superior test because it has higher sensitivity and specificity at all cutoff values.
Figure e3–5. Receiver operator characteristic (ROC) curves for prostate-specific antigen (PSA) and prostatic acid phosphatase (PAP) in the diagnosis of prostate cancer. For all cutoff values, PSA has higher sensitivity and specificity; therefore, it is a better test based on these performance characteristics. (Modified and reproduced, with permission, from Nicoll D et al. Routine acid phosphatase testing for screening and monitoring prostate cancer no longer justified. Clin Chem. 1993 Dec;39(12):2540–1.)
Note that, for a given test, the ROC curve also allows one to identify the cutoff value that best balances false-positive and false-negative results; this is the point on the curve closest to the upper-left corner of the graph (100% sensitivity and 100% specificity). The optimal clinical cutoff value, however, depends on the condition being detected and the relative importance of false-positive versus false-negative results.
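The construction of an ROC curve, its area, and the point nearest the upper-left corner can be sketched as follows. The test results are hypothetical (arbitrary units, higher values indicating disease) and are not the PSA or PAP data. The area is computed with the trapezoidal rule, and the function names are illustrative only.

```python
import math


def roc_points(diseased, healthy):
    """Return (1 - specificity, sensitivity, cutoff) triples, one per candidate
    cutoff, ordered from the strictest cutoff to the most lenient.
    A result greater than or equal to the cutoff is called positive."""
    cutoffs = sorted(set(diseased) | set(healthy), reverse=True)
    points = [(0.0, 0.0, math.inf)]  # cutoff above every result: no positives
    for c in cutoffs:
        sens = sum(x >= c for x in diseased) / len(diseased)
        false_pos_rate = sum(x >= c for x in healthy) / len(healthy)
        points.append((false_pos_rate, sens, c))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    area = 0.0
    for (x0, y0, _), (x1, y1, _) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def closest_to_upper_left(points):
    """ROC point nearest to (0, 1), ie, 100% sensitivity and 100% specificity."""
    return min(points, key=lambda p: math.hypot(p[0], 1.0 - p[1]))


# Hypothetical results; not data from the source.
diseased = [4, 6, 7, 8, 9, 10]
healthy = [1, 2, 3, 4, 5, 7]

pts = roc_points(diseased, healthy)
print(f"AUC = {auc(pts):.3f}")
fpr, sens, cutoff = closest_to_upper_left(pts)
print(f"cutoff {cutoff}: sensitivity {sens:.2f}, specificity {1 - fpr:.2f}")
```

The distance-to-corner rule weights false positives and false negatives equally; a clinical cutoff that reflects unequal consequences of the two errors would have to be chosen differently.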
Bossuyt X. Clinical performance characteristics of a laboratory test. A practical approach in the autoimmune laboratory. Autoimmun Rev. 2009 Jun;8(7):543–8.
Irwin RJ et al. A principled approach to setting optimal diagnostic thresholds: where ROC and indifference curves meet. Eur J Intern Med. 2011 Jun;22(3):230–4.
Soreide K et al. Diagnostic accuracy and receiver-operating characteristics curve analysis in surgical research and decision making. Ann Surg. 2011 Jan;253(1):27–34.