In secondary studies, new data are not presented. Rather, existing data from other studies are collected and perhaps reanalyzed to synthesize the body of evidence on a particular topic. All such studies need to consider the possible role of what is known as publication bias, which stems from the belief that studies that find an effect (“positive studies”) are more likely to be published than studies that do not (“negative studies”). If this belief is warranted, a synthesis of only published studies will be biased toward finding an effect.
There are 3 main types of secondary studies: review articles, systematic reviews, and meta-analyses. In a review article, the authors identify as many studies as can be found related to a particular research question (and perhaps related topics). They then present in an organized way the key findings of those studies, and draw conclusions based on critical evaluation of the full body of evidence on the question.
Systematic reviews have largely taken the place of the basic review article. In a systematic review, the investigators must spell out the systematic approach used in identifying the research to be included in the review. Typically, a very wide net is cast and a very large number of studies need to be vetted for inclusion or exclusion from the review.
In a meta-analysis, investigators gather all the studies relevant to a particular question in a similar manner as one would do for a systematic review. They then do a new analysis of the combined set of data from those studies, in order to present an overall estimate of the effect of interest that is more precise due to the much larger number of subjects from all studies combined.
Most clinical studies involve one or more statistical hypothesis tests to address the research question(s) at hand. There are many types of such hypothesis tests. For complex questions or designs, a biostatistician should be consulted, but some of the basic tests used, and issues related to their proper use, are presented here. The types of tests can be grouped as those related to continuous outcomes, categorical or discrete outcomes, and time-to-event (survival) data.
Continuous outcomes—Many of the outcomes we measure are quantitative in nature and lend themselves to inference based on continuous distributions. Many of the tests of this type involve comparing the mean (or other measure of central tendency) of 2 or more populations. When only 2 groups are to be compared, a t-test for 2 independent samples is often employed. If a comparison group internal to the study is not available, sometimes the mean of a single group is compared to a known population value using a one-sample t-test. If the 2 samples of data to be compared are not from 2 independently selected groups, but are either from the same subjects (paired data) or a comparison group especially selected to match the first group on important characteristics (matched data), then a paired t-test is often employed. When more than 2 groups are to be compared to test for equality of means, a method called analysis of variance (ANOVA) can be used.
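To make the 2-sample t-test concrete, here is a minimal sketch of the (equal-variance) t statistic computed by hand with only the Python standard library. The blood-pressure readings and group names are hypothetical, invented purely for illustration.

```python
import math
from statistics import mean, variance

def two_sample_t(x, y):
    """Classic 2-independent-samples t statistic (pooled, equal-variance form)."""
    nx, ny = len(x), len(y)
    # pooled sample variance (statistics.variance uses the n-1 denominator)
    sp2 = ((nx - 1) * variance(x) + (ny - 1) * variance(y)) / (nx + ny - 2)
    return (mean(x) - mean(y)) / math.sqrt(sp2 * (1 / nx + 1 / ny))

# hypothetical systolic BP readings in 2 independently selected groups
control = [120, 132, 118, 125, 130]
treated = [110, 115, 108, 120, 112]
t = two_sample_t(control, treated)  # compare to a t distribution with nx+ny-2 df
```

In practice one would hand the resulting statistic (and its degrees of freedom) to statistical software to obtain a P value; the sketch only shows where the number comes from.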
All of these types of t-tests and the ANOVA method make an assumption that the data within each population are (approximately) normally distributed. Minor deviations from this assumption can be tolerated, but when this assumption is not realistic, as is often the case, the analyst has 2 options: (1) the data can be transformed in a way that results in a more symmetric distribution (eg, positively skewed data may benefit from a logarithmic transformation) or (2) a nonparametric test may be employed instead of the t-test or ANOVA. These nonparametric tests, as their name implies, do not make distributional assumptions about the data. They are insensitive to extreme outliers as they make use of the rank ordering of the observations, not their actual values. So, for example, if the 5 oldest patients in a sample were 102, 64, 63, 59, and 56 years of age, in a nonparametric analysis these values would be converted to ranks of 1, 2, 3, 4, and 5 thus diminishing the influence of the extreme value of 102. Examples of nonparametric tests (not an exhaustive list) include the Mann-Whitney U test or the Wilcoxon rank-sum test in place of the 2-sample t-test, the Wilcoxon signed rank test in place of the paired t-test, and Friedman’s test in place of ANOVA.
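The rank conversion described above can be shown in a few lines. This sketch mirrors the ages example from the text (ranking from oldest to youngest, with no handling of ties, which real rank tests average):

```python
def to_ranks(values):
    """Rank observations from largest to smallest (1 = largest); ties not handled."""
    order = sorted(values, reverse=True)
    return [order.index(v) + 1 for v in values]

# the 5 oldest patients from the text: the outlier age of 102 becomes just rank 1
ages = [102, 64, 63, 59, 56]
ranks = to_ranks(ages)  # [1, 2, 3, 4, 5]
```

Whether 102 or 72, the oldest patient contributes the same rank, which is exactly why nonparametric tests are insensitive to extreme outliers.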
Categorical outcomes—When the outcome of interest is a qualitative factor, we say it is a categorical variable, meaning that it can only take on a discrete set of possible values. Categorical variables may further be classified as either nominal (named categories, such as blood type) or ordinal (ordered categories, such as severity of pain if measured as none, mild, moderate, or severe).
Analysis of categorical outcomes often involves using one of a set of tests known collectively as chi-square tests. In the Pearson chi-square test, 2 or more groups are compared with respect to the proportion exhibiting the outcome. The data are often summarized in a contingency table or cross-tabulation, in which subjects are cross-classified by the combination of their exposure group and whether or not they have the outcome. The cross-tabulation presents counts and percentages for each possible combination of the 2 factors, with one factor listed on the rows and the other on the columns. The Pearson chi-square uses a continuous distribution to approximate the behavior of the categorical data; when the number of subjects in 1 or more of the groups (defined by either exposure or outcome) is small, however, the validity of that approximation is questionable, and a Fisher’s exact test should be used instead. Most statistical software warns users when a Fisher’s exact test is preferred. Just as is the case with continuous outcomes, if the groups to be compared are not independently selected but are instead paired or matched, then the method of analysis needs to account for the interdependence within the pairs. In this case, one could use a McNemar’s test for paired data. The contingency table is set up differently, however, as the unit of observation becomes the matched pair instead of the individual subject. Each pair is cross-classified with respect to the outcome for each of the 2 members of the pair.
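As a sketch of how the Pearson chi-square statistic arises from a 2 × 2 contingency table, the function below compares observed cell counts to the counts expected from the table margins. The counts in the example (30 of 100 exposed vs 15 of 100 unexposed with the outcome) are hypothetical.

```python
def chi_square_2x2(a, b, c, d):
    """Pearson chi-square statistic for the 2x2 table [[a, b], [c, d]]:
    sum of (observed - expected)^2 / expected over the 4 cells,
    with expected counts computed from the row and column totals."""
    n = a + b + c + d
    row1, row2 = a + b, c + d
    col1, col2 = a + c, b + d
    expected = [row1 * col1 / n, row1 * col2 / n,
                row2 * col1 / n, row2 * col2 / n]
    observed = [a, b, c, d]
    return sum((o - e) ** 2 / e for o, e in zip(observed, expected))

# hypothetical cohort: 30/100 exposed vs 15/100 unexposed develop the outcome
x2 = chi_square_2x2(30, 70, 15, 85)
```

The resulting statistic is referred to a chi-square distribution with 1 degree of freedom; small expected counts in any cell are the signal to switch to Fisher’s exact test.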
Time-to-event outcomes (survival analysis)—When the outcome is an event that occurs over time, one is often interested in both whether subjects experience the event and how soon they experience the event. If the event is one for which all subjects can be observed completely with respect to both the fact and the timing of the event, then the methods described earlier can be employed. Many studies, however, follow subjects for the occurrence of an event where, at the end of the study, some subjects have not experienced the event but have been at risk for that event for a period of time. The time to event for such subjects is said to be censored, in that we only partially observe it (ie, we know the time is at least as long as the amount of time they were at risk, but do not know how much longer). This arises often in studies where the event of interest is mortality, as such studies will often not be continued until all subjects have died. As such, the analytic methods for dealing with censored time-to-event data are often referred to as survival analysis. The survival (or event-free time) experience of 1 or more groups can be estimated using a variety of methods, including the Kaplan–Meier or product-limit method. This can be illustrated graphically with a Kaplan–Meier curve (actually a step function) that plots study time on the horizontal axis against survival (event-free) probability on the vertical axis. The survival experience of 2 or more groups can be compared with various methods, including the log-rank test.
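The product-limit idea can be sketched directly: at each event time, the survival probability is multiplied by (1 − deaths/at-risk), and censored subjects simply leave the risk set without contributing a step. The follow-up times below are hypothetical.

```python
def kaplan_meier(times, events):
    """Product-limit (Kaplan-Meier) survival estimates.
    times: follow-up time per subject; events: True if the event occurred,
    False if the time is censored. Returns (event_time, S(t)) step points."""
    data = sorted(zip(times, events))
    at_risk = len(data)
    surv = 1.0
    curve = []
    i = 0
    while i < len(data):
        t = data[i][0]
        deaths = sum(1 for tt, e in data if tt == t and e)
        removed = sum(1 for tt, e in data if tt == t)  # deaths + censored at t
        if deaths > 0:
            surv *= 1 - deaths / at_risk
            curve.append((t, surv))
        at_risk -= removed
        while i < len(data) and data[i][0] == t:  # advance past this time
            i += 1
    return curve

# 5 hypothetical subjects; False marks a censored follow-up time
curve = kaplan_meier([2, 3, 3, 5, 8], [True, True, False, True, False])
```

Plotting these (time, probability) points as a step function gives the familiar Kaplan–Meier curve; note how the censored subject at time 3 shrinks the risk set for the later event at time 5 without producing a drop of its own.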
P-values—Regardless of the type of statistical test used, a hypothesis test is performed by comparing the observed data to what would be expected under what is termed the null hypothesis, which, generally speaking, represents the absence of a difference between groups, an effect, an association, etc. If what was observed would be unlikely under the null hypothesis, then that null hypothesis is rejected in favor of the alternative (eg, that there is a difference or an effect). Specifically, a test statistic is calculated (it may be a t statistic, a chi-square statistic, or some other) whose probability distribution under the null hypothesis is known. When an “unlikely” value of the test statistic occurs, we reject the null hypothesis. How we define unlikely is determined by the significance level of the test, denoted α, which is typically set by convention at 0.05. So, with α = 0.05, if the test statistic falls in the range of values that are the 5% least likely to occur under the null hypothesis, we reject that hypothesis. But, in addition to this reject or do-not-reject dichotomy, our statistical tests produce a measure of the likelihood of our test statistic known as the P value. Specifically, the P value is the chance of observing a test statistic at least as large (in absolute value) as the one observed if the null hypothesis were true. Because the test statistic is a function of the observed data, one can think of the P value as the chance of observing as great a degree of difference as was observed if, in fact, the populations represented by the groups do not differ at all. One thing that the P value is not, though it is often misinterpreted as such, is the probability that a conclusion to reject or not reject is due to chance alone.
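As an illustration of turning a test statistic into a two-sided P value, the sketch below uses a standard-normal (z) statistic, whose null distribution has a closed-form CDF via the error function. (The t and chi-square distributions used by the tests above follow the same logic but need their own tail-probability functions.)

```python
import math

def two_sided_p_from_z(z):
    """Two-sided P value for a standard-normal test statistic:
    the chance, under the null, of a value at least as large in
    absolute value as the one observed."""
    cdf = 0.5 * (1 + math.erf(abs(z) / math.sqrt(2)))  # standard normal CDF
    return 2 * (1 - cdf)

p = two_sided_p_from_z(1.96)  # about 0.05, the conventional alpha
```

This makes the convention concrete: a z statistic of 1.96 sits exactly at the boundary of the 5% least likely values, so its P value is (approximately) 0.05.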
Measures of Disease Occurrence and Association
The aims of a study often include a desire to quantify the frequency with which, or the rate at which, a disease (or other outcome) occurs and/or the association of 1 or more risk factors with the occurrence of disease. For this, we need to have metrics to describe both disease occurrence and the associations between risk factors and outcomes.
The proportion of a particular population that develops a particular disease is called the risk of disease. The risk is estimated simply as the number with the disease over the total number at risk of the disease. The proportion developing a disease per unit time is called the rate of disease. The numerator for the rate is still the same, but the denominator is no longer a count of the at-risk population, but the person-time of observation. To calculate person-time, say in years, the number of years that each subject in the at-risk group was followed is summed to get the total person-years for the rate calculation. The line between risks and rates may sometimes be fine, as a risk is often for a fixed period of time, for example, the 5-year risk of cancer recurrence. We do not often think of the odds of disease, but we sometimes need to calculate odds for measures of association (described later). If the probability of an outcome is P, then the odds of that outcome is P/(1 – P), that is, the chance of it happening divided by the chance of it not happening. Note that if an outcome is rare (P close to zero), then the chance of it not happening is close to one, and the odds and the risk are almost the same.
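The three quantities just defined can be computed side by side. The cohort counts below are hypothetical, chosen only to show the arithmetic and the rare-outcome behavior of odds versus risk.

```python
# hypothetical cohort: 40 of 500 at-risk subjects develop the disease
cases, at_risk = 40, 500
risk = cases / at_risk        # proportion developing disease: 0.08

# rate: same numerator, but the denominator is person-time of observation
person_years = 2000.0         # total follow-up summed over all subjects
rate = cases / person_years   # 0.02 per person-year

# odds from a probability p: chance of happening over chance of not happening
p = risk
odds = p / (1 - p)            # 0.08 / 0.92, close to the risk since p is small
```

With p = 0.08 the odds (about 0.087) are already close to the risk; the rarer the outcome, the closer the two become.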
Ratio Measures of Association
If 2 groups are being compared with respect to the occurrence of an outcome, they can be compared with either ratio measures or difference measures. We first consider ratio measures. The relative risk is simply the risk in the group of interest divided by the risk in the comparison group. If rates rather than risks are being compared, one may instead compute a rate ratio. In the context of survival analysis, the rate of the event in one group divided by the rate in the other is called the hazard (rate) ratio. For the case-control study design, risk in each comparison group cannot be observed directly, and we instead calculate an odds ratio, defined as the odds of exposure in the cases over the odds of exposure in the controls. If the outcome is rare, one can exploit the fact that odds and risk are almost the same and employ the so-called rare-disease assumption to interpret an odds ratio as an estimate of relative risk. When outcomes are not rare, however, the odds ratio tends to overestimate the relative risk (ie, be further away from a null ratio of one).
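A small sketch makes the relative risk and odds ratio calculations, and the overestimation point, concrete. The 2 × 2 counts are hypothetical, with a deliberately common outcome so that the odds ratio lands further from one than the relative risk.

```python
def relative_risk(a, b, c, d):
    """RR from a 2x2 table: exposed group has a events and b non-events,
    unexposed group has c events and d non-events."""
    return (a / (a + b)) / (c / (c + d))

def odds_ratio(a, b, c, d):
    """OR from the same table: the cross-product ratio (a*d)/(b*c)."""
    return (a * d) / (b * c)

# hypothetical, non-rare outcome: 40% risk in exposed vs 20% in unexposed
rr = relative_risk(40, 60, 20, 80)  # 2.0
orr = odds_ratio(40, 60, 20, 80)    # about 2.67, further from 1 than the RR
```

With a 40% versus 20% outcome, the odds ratio (≈2.67) exceeds the relative risk (2.0), illustrating why the rare-disease assumption matters before reading an odds ratio as a relative risk.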
Difference Measures of Association
Sometimes the absolute difference in risk or rate between 2 groups is of greater interest than the relative difference. In such cases, a difference measure would be preferred over a ratio measure. The risk difference is simply the risk in one group minus the risk in the other. In clinical studies, where the goal is often to test an experimental treatment to see if it reduces risk of an adverse outcome relative to a control treatment, these measures sometimes go by more specific names. The control event rate and the experimental event rate are simply the rates (or risks) of the outcome in the control group and experimental group, respectively. The difference between these is known as the absolute risk reduction (ARR), and is a measure of how much of the event (disease, mortality, etc) could be prevented (or cured) by replacement of the control treatment with the experimental treatment. Another measure of association, related to the ARR and often important in clinical studies, is the number needed to treat (NNT). It is defined as NNT = 1/ARR. It is an estimate of how many patients would need to be treated with the experimental treatment in order to prevent 1 adverse outcome (eg, death). So, for example, if a new treatment reduced mortality in a certain patient population from 10% to 4%, the relative risk (of death) would be 0.40 (4%/10%), the ARR would be 6% (10% – 4%), and NNT = 1/0.06 = 16.67, suggesting that 17 patients would need to be treated to prevent 1 death. The number needed to harm is an analogous measure in studies where the groups are being compared with respect to a risk factor or exposure that increases, rather than decreases, the risk of the outcome. It is still calculated as 1 divided by the absolute difference in risk, but has an opposite interpretation because the group with the factor of interest is harmed rather than helped.
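The worked example from the text (mortality reduced from 10% to 4%) can be reproduced in a few lines:

```python
import math

control_event_rate = 0.10       # risk of death with the control treatment
experimental_event_rate = 0.04  # risk of death with the new treatment

rr = experimental_event_rate / control_event_rate    # relative risk: 0.40
arr = control_event_rate - experimental_event_rate   # absolute risk reduction: 0.06
nnt = 1 / arr                                        # about 16.67
patients_to_treat = math.ceil(nnt)                   # round up: 17 patients per death prevented
```

Rounding the NNT up rather than to the nearest integer is the usual convention, since treating 16 patients would, on average, prevent slightly less than 1 death.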
Any of the measures mentioned earlier can be estimated from a given set of data. Such an estimate is called a point estimate. Because it is based on limited sample data, the point estimate will differ from the actual value it is intended to estimate due to sampling variability. To get a better idea of what the true value of a population parameter may be, one may calculate and present a confidence interval (CI) around the point estimate. A CI has a confidence level associated with it, denoted (1 – α) and usually expressed as a percent, corresponding to a significance level α for a related hypothesis test. Typically, α = 0.05, corresponding to 95% CIs. Even though a CI gives us a better idea of what the true population value is than does the point estimate, any given CI from a single study may or may not contain that true value. But the formulas for calculating 95% CIs are constructed in such a way that, over repeated studies, they will contain the correct value 95% of the time. It is from this that CIs get their name, as we claim 95% confidence that the interval will contain the true value.
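As one concrete (and deliberately simple) example of a CI, the sketch below computes the normal-approximation (Wald) 95% interval for a risk estimated as a proportion; the counts are hypothetical, and the multiplier 1.96 is the normal quantile corresponding to α = 0.05.

```python
import math

def wald_ci(events, n, z=1.96):
    """Approximate 95% CI for a proportion (Wald/normal approximation).
    z = 1.96 corresponds to a significance level alpha = 0.05."""
    p = events / n
    se = math.sqrt(p * (1 - p) / n)  # standard error of the estimated proportion
    return p - z * se, p + z * se

# hypothetical: 40 of 500 subjects develop the disease (point estimate 0.08)
lo, hi = wald_ci(40, 500)
```

The point estimate of 8% is reported with an interval of roughly 5.6% to 10.4%; a larger sample shrinks the standard error and therefore narrows the interval around the same point estimate.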
Diagnostic and Screening Test Performance Measures
Diagnostic and screening tests are used to classify patients with respect to the presence or absence of a disease, syndrome, or other condition. These tests are not always accurate, and their performance is described by various metrics that relate to the correct classification of those with and without disease. To define these measures, consider a population of N patients represented in the following “truth table.” The columns of the table represent the patients’ true presence or absence of disease while the rows represent their test result. In practice, when evaluating new tests, we may not know with absolute certainty the true disease status of a set of patients, but instead compare a new test to a “gold standard” that is assumed to represent true disease status (Table 116–1).
Table 116–1. Truth table (columns: true disease status; rows: test result).

| Test Result | Disease | No Disease | Total |
|-------------|---------|------------|-------|
| Positive    | a       | b          | a + b |
| Negative    | c       | d          | c + d |
| Total       | a + c   | b + d      | N     |
The sensitivity of a test refers to how well it correctly classifies those with disease, and is equal to [a/(a + c)], the proportion of those with disease who test positive. The specificity of a test refers to how well it correctly classifies those without disease, and is equal to [d/(b + d)], the proportion of those without disease who test negative. The positive predictive value (PPV) of a test refers to how likely it is that a positive test result indicates the presence of disease, and is equal to [a/(a + b)], the proportion of all those with a positive test result who actually have disease. The negative predictive value (NPV) of a test refers to how likely it is that a negative test result indicates the absence of disease, and is equal to [d/(c + d)], the proportion of all those with a negative test result who actually do not have disease.
The sensitivity and specificity of a test are characteristics of the test itself and thus do not change unless something about the test itself is changed (eg, by altering a cutoff for what is considered a positive test result). The PPV and NPV, however, depend not only on how accurate the test is but also on how prevalent the disease is in the population to whom the test is applied. The prevalence of a disease is the proportion of the population that has the disease. In our truth table discussed earlier, the prevalence is [(a + c)/N]. In particular, if the prevalence is low in the population being tested, a test with fairly high sensitivity and specificity can still have very poor PPV. In such instances, more specific confirmatory tests may be needed in those who initially test positive.
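The dependence of PPV on prevalence can be seen by holding sensitivity and specificity fixed and varying prevalence; the 95%/95% test and the two prevalence values below are hypothetical.

```python
def ppv(sensitivity, specificity, prevalence):
    """PPV via Bayes' rule: expected true positives over all positives."""
    true_pos = sensitivity * prevalence
    false_pos = (1 - specificity) * (1 - prevalence)
    return true_pos / (true_pos + false_pos)

# the same 95%-sensitive, 95%-specific test in two populations
high_prev = ppv(0.95, 0.95, 0.20)   # about 0.83 when 1 in 5 has the disease
low_prev = ppv(0.95, 0.95, 0.001)   # under 0.02 when 1 in 1000 has the disease
```

With 1-in-1000 prevalence, fewer than 1 in 50 positive results reflects true disease, which is exactly why screening programs in low-prevalence populations follow positives with more specific confirmatory tests.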