Clinicians should use measures of test performance such as sensitivity and specificity to judge the quality of a diagnostic test for a particular disease.
Test sensitivity is the ability of a test to detect disease and is expressed as the percentage of patients with disease in whom the test is positive. Thus, a test that is 90% sensitive gives positive results in 90% of diseased patients and negative results in 10% of diseased patients (false negatives). Generally, a test with high sensitivity is useful to exclude a diagnosis because a highly sensitive test yields few results that are falsely negative. To exclude infection with the virus that causes AIDS, for instance, a clinician might choose a highly sensitive test, such as a fourth-generation HIV-1/2 antigen/antibody combination rapid test (HIV-1/2 Ag/Ab Combo rapid test), which detects both the HIV p24 antigen and antibodies to HIV-1 and HIV-2.
A test’s specificity is the ability to detect absence of disease and is expressed as the percentage of patients without disease in whom the test is negative. Thus, a test that is 90% specific gives negative results in 90% of patients without disease and positive results in 10% of patients without disease (false positives). A test with high specificity is useful to confirm a diagnosis, because a highly specific test has few results that are falsely positive. For instance, to make the diagnosis of gouty arthritis, a clinician might choose a highly specific test, such as the presence of negatively birefringent needle-shaped urate crystals on microscopic evaluation of joint fluid.
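To make these definitions concrete, the following minimal sketch (in Python, using hypothetical counts chosen to match the 90% examples above) computes sensitivity and specificity from counts of true and false results.

```python
# Hypothetical counts for a test evaluated against a gold standard
# (values chosen to match the 90% examples above).
true_positives = 90    # diseased patients with a positive test
false_negatives = 10   # diseased patients with a negative test (false negatives)
true_negatives = 90    # nondiseased patients with a negative test
false_positives = 10   # nondiseased patients with a positive test (false positives)

# Sensitivity: percentage of diseased patients in whom the test is positive.
sensitivity = true_positives / (true_positives + false_negatives)

# Specificity: percentage of nondiseased patients in whom the test is negative.
specificity = true_negatives / (true_negatives + false_positives)

print(f"sensitivity: {sensitivity:.0%}")  # 90%
print(f"specificity: {specificity:.0%}")  # 90%
```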
To determine test sensitivity and specificity for a particular disease, the test must be compared against an independent “gold standard” test or established standard diagnostic criteria that define the true disease state of the patient. For instance, the sensitivity and specificity of rapid antigen detection testing in diagnosing group A beta-hemolytic streptococcal pharyngitis are obtained by comparing the results of rapid antigen testing with the gold standard test, throat swab culture. Application of the gold standard test to patients with positive rapid antigen tests establishes specificity, whereas application of the gold standard test to patients with negative rapid antigen tests establishes sensitivity. Failure to apply the gold standard test after negative rapid tests may result in an overestimation of sensitivity, since false negatives will not be identified. However, for many disease states (eg, pancreatitis), an independent gold standard test either does not exist or is very difficult or expensive to apply; in such cases, reliable estimates of test sensitivity and specificity are sometimes difficult to obtain.
Sensitivity and specificity can also be affected by the population from which these values are derived. For instance, many diagnostic tests are evaluated first using patients who have severe disease and control groups who are young and well. Compared with the general population, this study group will have more results that are truly positive (because patients have more advanced disease) and more results that are truly negative (because the control group is healthy). Thus, test sensitivity and specificity will be higher than would be expected in the general population, where a broader spectrum of health and disease is found. Clinicians should be aware of this spectrum bias when generalizing published test results to their own practice. To minimize spectrum bias, the control group should include individuals who have diseases related to the disease in question but who lack this principal disease. For example, to establish the sensitivity and specificity of the anti-cyclic citrullinated peptide antibody test for rheumatoid arthritis, the control group should include patients with rheumatic diseases other than rheumatoid arthritis. Other biases, including those related to spectrum composition, population recruitment, an absent or inappropriate reference standard, and verification bias, should also be considered in certain situations; critical appraisal of the published articles may then be necessary.
It is important to remember that the reported sensitivity and specificity of a test depend on the analyte level (threshold) used to distinguish a normal from an abnormal test result. If the threshold is lowered, sensitivity is increased at the expense of decreased specificity. If the threshold is raised, sensitivity is decreased while specificity is increased (Figure e2–3).
Hypothetical distribution of test results for healthy and diseased individuals. The position of the “cutoff point” between “normal” and “abnormal” (or “negative” and “positive”) test results determines the test’s sensitivity and specificity. If point A is the cutoff point, the test would have 100% sensitivity but low specificity. If point C is the cutoff point, the test would have 100% specificity but low sensitivity. For many tests, the cutoff point is determined by the reference interval, ie, the range of test results that is within 2 SD of the mean of test results for healthy individuals (point B). In some situations, the cutoff is altered to enhance either sensitivity or specificity. (Reproduced, with permission, from Nicoll D et al. Guide to Diagnostic Tests, 7th ed. McGraw-Hill, 2017.)
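The trade-off illustrated in Figure e2–3 can also be sketched numerically. The short Python example below uses simulated, purely hypothetical analyte values for healthy and diseased individuals and shows how moving the cutoff shifts sensitivity and specificity in opposite directions.

```python
import random

random.seed(0)  # reproducible illustration

# Hypothetical, overlapping analyte distributions for healthy and diseased
# individuals (illustrative values only, analogous to Figure e2-3).
healthy = [random.gauss(100, 10) for _ in range(1000)]
diseased = [random.gauss(130, 10) for _ in range(1000)]

def sensitivity_specificity(cutoff):
    """Call results above the cutoff positive; return (sensitivity, specificity)."""
    sensitivity = sum(value > cutoff for value in diseased) / len(diseased)
    specificity = sum(value <= cutoff for value in healthy) / len(healthy)
    return sensitivity, specificity

# A low cutoff (like point A) favors sensitivity; a high cutoff (like point C)
# favors specificity.
for cutoff in (105, 115, 125):
    sens, spec = sensitivity_specificity(cutoff)
    print(f"cutoff {cutoff}: sensitivity {sens:.0%}, specificity {spec:.0%}")
```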
Figure e2–4 shows how test sensitivity and specificity can be calculated using test results from patients previously classified by the gold standard test as diseased or nondiseased.
Calculation of sensitivity, specificity, and probability of disease after a positive test (posttest probability). TP, true positive; FP, false positive; FN, false negative; TN, true negative. (Reproduced, with permission, from Nicoll D et al. Guide to Diagnostic Tests, 7th ed. McGraw-Hill, 2017.)
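As an illustration of the calculations summarized in Figure e2–4, the sketch below uses a hypothetical 2×2 table (all counts are assumed for the example) to compute sensitivity, specificity, and the probability of disease after a positive test, which for the prevalence in that study group equals TP/(TP + FP).

```python
# Hypothetical 2x2 table, with disease status assigned by the gold standard.
TP, FP = 45, 5    # patients with a positive test: diseased (TP) vs nondiseased (FP)
FN, TN = 5, 45    # patients with a negative test: diseased (FN) vs nondiseased (TN)

sensitivity = TP / (TP + FN)   # positive results among all diseased patients
specificity = TN / (TN + FP)   # negative results among all nondiseased patients

# Probability of disease after a positive test, given the disease prevalence
# in this particular study group:
posttest_probability = TP / (TP + FP)

print(f"sensitivity: {sensitivity:.0%}")
print(f"specificity: {specificity:.0%}")
print(f"posttest probability after a positive test: {posttest_probability:.0%}")
```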
Receiver Operator Characteristic
The performance of two different tests can be compared by plotting their receiver operator characteristic (ROC) curves over a range of cutoff values. Each point on an ROC curve represents a sensitivity/specificity pair corresponding to a particular decision threshold. The resulting curve for each test, obtained by plotting the sensitivity against (1–specificity), often shows which test is more accurate; a clearly superior test will have an ROC curve that always lies above and to the left of the inferior test's curve, and, in general, the better test will have a larger area under the ROC curve. For instance, Figure e2–5 shows the ROC curves for PSA and prostatic acid phosphatase in the diagnosis of prostate cancer. PSA is a superior test because it has a significantly larger area under the curve and higher sensitivity and specificity for all cutoff values.
Receiver operator characteristic (ROC) curves for prostate-specific antigen (PSA) and prostatic acid phosphatase (PAP) in the diagnosis of prostate cancer. For all cutoff values, PSA has higher sensitivity and specificity; therefore, it is a better test based on these performance characteristics. (Modified and reproduced, with permission, from Nicoll D et al. Routine acid phosphatase testing for screening and monitoring prostate cancer no longer justified. Clin Chem. 1993 Dec;39(12):2540–1.)
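The comparison of two tests by area under the ROC curve can be sketched as follows. The sensitivity/(1–specificity) pairs are hypothetical (they are not the PSA or PAP data of Figure e2–5), and the area is estimated with the trapezoidal rule.

```python
# Each ROC curve is a list of (1 - specificity, sensitivity) points, ordered
# from the most stringent to the most lenient cutoff. The numbers below are
# hypothetical and are not the PSA/PAP data of Figure e2-5.
test_a = [(0.0, 0.0), (0.05, 0.60), (0.10, 0.80), (0.30, 0.95), (1.0, 1.0)]
test_b = [(0.0, 0.0), (0.10, 0.40), (0.30, 0.65), (0.60, 0.85), (1.0, 1.0)]

def area_under_curve(points):
    """Area under the ROC curve, estimated by the trapezoidal rule."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        area += (x1 - x0) * (y0 + y1) / 2
    return area

# The test with the larger area discriminates better overall.
print(f"AUC, test A: {area_under_curve(test_a):.2f}")
print(f"AUC, test B: {area_under_curve(test_b):.2f}")
```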
Note that, for a given test, the ROC curve also allows one to identify the cutoff value that best balances false-positive and false-negative results; it is located at the point closest to the upper-left corner of the curve. The optimal clinical cutoff value, however, depends on the condition being detected and the relative importance of false-positive versus false-negative results.
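Finding the cutoff closest to the upper-left corner amounts to minimizing the distance from each sensitivity/specificity pair to the point of perfect performance (sensitivity 100%, false-positive rate 0%). A minimal sketch, again with hypothetical cutoff values:

```python
from math import hypot

# Hypothetical (cutoff, sensitivity, specificity) triples along a single ROC curve.
candidates = [
    (4.0, 0.98, 0.55),
    (6.0, 0.90, 0.75),
    (8.0, 0.78, 0.88),
    (10.0, 0.60, 0.96),
]

def distance_to_corner(sensitivity, specificity):
    """Distance from the ROC point to the upper-left corner (0, 1)."""
    return hypot(1 - specificity, 1 - sensitivity)

best_cutoff, best_sens, best_spec = min(
    candidates, key=lambda c: distance_to_corner(c[1], c[2])
)
print(f"cutoff closest to the upper-left corner: {best_cutoff} "
      f"(sensitivity {best_sens:.0%}, specificity {best_spec:.0%})")
```

Whether this geometric optimum is the best clinical cutoff still depends, as noted above, on the relative cost of false-positive versus false-negative results.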