Clinical research often uses techniques of statistical inference to determine the ‘statistical significance’ of the results, that is, how likely it is that the results obtained by the research methods may be ascribed to chance, rather than some effect of an intervention. Similar methods are also used in epidemiologic studies of risk factors to identify associations between exposure to the risk factor(s) and a disease or other health-related outcome. Even in descriptive studies, statistical inference may be used to identify differences between subgroups in the population that is being described. Inferences are often drawn, and statistical significance established, by using a method known as hypothesis testing. A research hypothesis is established, usually stating that there is some difference between or among the groups studied, and the observed data are analyzed in order to decide whether to accept or reject the corresponding null hypothesis of no difference. Hypothesis testing has become de rigueur in clinical research, but its value as a primary means of analysis has been questioned.1-4 The purpose of this paper is to describe confidence intervals (CIs) as a statistical tool in clinical research and explain their utility as an alternative to hypothesis testing.
CONVENTIONAL USES OF HYPOTHESIS TESTING
Hypothesis testing results in a yes-no decision regarding the authenticity of the research findings. The decision to accept or reject the null hypothesis is based on a statistical test that yields a probability (the p-value) of obtaining results at least as extreme as those observed if only chance, in other words, random variation, were operating. The calculated p-value is compared to a probability value, alpha, which is traditionally set at .05. If the calculated p-value is less than alpha, the null hypothesis is rejected in favor of the alternative hypothesis. The value of alpha represents the probability of a type I error, which occurs when the null hypothesis is rejected even though it is true. A ‘nonsignificant’ or ‘negative’ finding is interpreted to mean that any difference or association that may have been observed is not a true difference because it can be attributed to random variation of the measure in the population.
Several approaches to reporting the results of hypothesis testing have been used in the literature. The most basic and least informative approach is to merely report that the results are significant or nonsignificant based on the predetermined alpha level. A more informative approach is to report that the results are significant at or below some alpha value, for example p …
Limitations of Hypothesis Testing
Hypothesis testing as a method of determining significance in research has its origins in the agricultural work of R. A. Fisher in the 1920s.4,5 Fisher originated the methods of randomized assignment and of using the p-value to establish significance, arbitrarily selecting the probability of .05 as a threshold because he felt that it was ‘convenient.’4 Since that time, Fisher’s method of hypothesis testing using an alpha value of .05 has become entrenched in scientific literature, despite the feeling of many authorities, including Fisher himself, that this inflexible practice is not warranted.4
The limits of hypothesis testing as it has been used throughout much of the clinical research literature become evident when issues like sample size, statistical power, and effect size are considered. Power is related to the value of beta (power = 1 – beta), which is the probability of a type II error (defined as acceptance of a null hypothesis that is false). For example, a study with a .10 (or 10%) probability of type II error has a power of .90 (or 90%). Effect size is related to the difference in a measure that is deemed to be clinically significant.6 Each of these concepts is crucial to the understanding of research literature because they, too, are represented as terms in the equations that are used to calculate the numbers that allow us to make judgments about research findings. Using hypothesis testing to determine a study’s significance relative to alpha conceals the fact that the value of alpha is balanced by power, sample size, and effect size in any given investigation. These issues typically are considered prior to beginning a study, when an investigator must specify alpha, power, and some minimum clinically significant difference in a measure in order to calculate an adequate sample size.
Often, clinical researchers do not have the luxury of obtaining large sample sizes to ensure adequate power to detect small but clinically important differences. Studies with small sample sizes (small-n studies) may result in ‘negative’ findings based on hypothesis testing, yet these findings may have clinical significance.7 The negative findings of small-n studies may be misleading when hypothesis testing methods are rigidly applied, causing us to ignore potentially useful interventions.2 Clearly, there is a need for an alternative to hypothesis testing that permits a broader and more flexible interpretation of research findings, and in which the nuances of a study’s findings are not obscured by a binary decision regarding significance.
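The way alpha, power, effect size, and sample size balance one another can be illustrated with the widely used normal-approximation formula for comparing two means, n per group = 2(z for 1 – alpha/2 + z for power)² × (SD / difference)². The following is only a sketch under that assumption: the function name `sample_size_per_group` and the example values (a standard deviation of 200 feet and a minimum important difference of 100 feet) are hypothetical and are not drawn from any study cited in this paper.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_group(alpha, power, sd, min_difference):
    """Approximate participants needed per group (two-sided test, two means).

    Normal approximation: n = 2 * (z_alpha + z_beta)^2 * (sd / difference)^2.
    All inputs are chosen by the investigator before the study begins.
    """
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value tied to alpha
    z_beta = NormalDist().inv_cdf(power)           # value tied to power = 1 - beta
    n = 2 * (z_alpha + z_beta) ** 2 * (sd / min_difference) ** 2
    return ceil(n)  # round up to whole participants


# Hypothetical planning values (not from the cited studies).
n_80 = sample_size_per_group(alpha=0.05, power=0.80, sd=200, min_difference=100)
n_90 = sample_size_per_group(alpha=0.05, power=0.90, sd=200, min_difference=100)
print(n_80, n_90)
```

Holding alpha fixed, asking for more power or for the ability to detect a smaller difference raises the required sample size, which is the balance that a bare comparison of p with alpha does not show.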
An alternative approach is available in the use of confidence intervals (CIs). A CI is a range of values, calculated from the observed data, which is likely to contain the true value at a specified probability. The probability is chosen by the investigator(s), and it is equal to 1 – alpha. Thus for an investigation that uses an alpha of .05, the corresponding interval will be a 95% CI. Confidence intervals provide information that may be used to test hypotheses, as well as additional information related to precision, power, sample size, and effect size.
Methods for calculating CIs vary according to the type of measure (mean, difference between rates, odds ratio, etc.) around which the CIs are constructed. It is beyond the scope of this article to specify formulas for calculating CIs. Interested readers may find these methods elsewhere.1,3,6,8 In general, however, the interval is computed by adding and subtracting some quantity from the point estimate, which is the value of the target measure that is calculated from the data. Calculation of this quantity requires at a minimum the standard error, or a related measure, and a value related to alpha, such as a t- or Z-statistic.1,3,6
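As a rough illustration of the general recipe (point estimate plus and minus a quantity built from the standard error and a value related to alpha), the sketch below computes a CI around a sample mean using a Z-statistic. It is not a substitute for the methods in the cited references: the function name `mean_confidence_interval` and the sample data are hypothetical, and for small samples a t-statistic would give a somewhat wider interval than the Z-based one shown here.

```python
from math import sqrt
from statistics import NormalDist, mean, stdev


def mean_confidence_interval(values, alpha=0.05):
    """Return (lower, point_estimate, upper) for the mean of `values`.

    Uses a Z-statistic (large-sample assumption); the confidence level
    is 1 - alpha, so alpha = 0.05 gives a 95% CI.
    """
    n = len(values)
    point_estimate = mean(values)
    standard_error = stdev(values) / sqrt(n)   # SE of the mean
    z = NormalDist().inv_cdf(1 - alpha / 2)    # value related to alpha
    margin = z * standard_error                # quantity added and subtracted
    return point_estimate - margin, point_estimate, point_estimate + margin


# Hypothetical walk distances in feet (illustrative only).
distances = [1650, 1720, 1800, 1760, 1690, 1840, 1710, 1780]
lower, estimate, upper = mean_confidence_interval(distances)
print(round(lower), round(estimate), round(upper))
```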
A CI may be constructed around a point estimate of a continuous variable such as a mean. For example, Berry and colleagues measured six-minute walk distance in a randomized clinical trial of long-term versus short-term exercise in participants with chronic obstructive pulmonary disease. At the end of the trial, participants who were involved in long-term (18 months) exercise walked a mean distance of 1,815 feet, with a 95% CI of 1,750 – 1,880 feet, and participants in the short-term (3 months) program walked a mean distance of 1,711 feet, with a 95% CI of 1,640 – 1,782 feet.9 The interpretation of the CIs is that we can be 95% confident that the true mean lies between 1,750 and 1,880 feet for the long-term exercise group and between 1,640 and 1,782 feet for the short-term exercise group.
Confidence intervals also may be constructed around a point estimate representing a categorical variable, such as the proportion of individuals who respond favorably to an intervention, and around epidemiologic measures of effect such as a relative risk or odds ratio. For example, Pereira and associates10 studied health outcomes in women 10 years after an exercise (walking) intervention. They calculated a relative risk of 0.18 (95% CI = 0.04-0.80) for heart disease in women who participated in the intervention, indicating a strong protective association.10 In this case, the CI indicates that we can be 95% confident that the true relative risk lies somewhere between 0.04 and 0.80.
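For ratio measures such as the relative risk, one commonly used approach builds the interval on the logarithmic scale and then converts back, which is why such intervals are not symmetric around the point estimate. The sketch below assumes that log-based method with a Z-statistic. The function name `relative_risk_ci` and the counts in the example are hypothetical; they are not the data of Pereira and associates, and the method those authors used is not stated here.

```python
from math import exp, log, sqrt
from statistics import NormalDist


def relative_risk_ci(events_exposed, n_exposed, events_control, n_control, alpha=0.05):
    """Return (lower, relative_risk, upper) using the log-scale method.

    SE of ln(RR) = sqrt(1/a - 1/n1 + 1/c - 1/n0), where a and c are the
    event counts and n1 and n0 the group sizes. Requires nonzero counts.
    """
    risk_exposed = events_exposed / n_exposed
    risk_control = events_control / n_control
    rr = risk_exposed / risk_control
    se_log_rr = sqrt(1 / events_exposed - 1 / n_exposed
                     + 1 / events_control - 1 / n_control)
    z = NormalDist().inv_cdf(1 - alpha / 2)
    lower = exp(log(rr) - z * se_log_rr)
    upper = exp(log(rr) + z * se_log_rr)
    return lower, rr, upper


# Hypothetical counts: 10 of 100 events with intervention, 20 of 100 without.
lower, rr, upper = relative_risk_ci(10, 100, 20, 100)
print(round(lower, 2), round(rr, 2), round(upper, 2))
```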
CONFIDENCE INTERVALS AND HYPOTHESIS TESTING
Although CIs may be used for hypothesis testing of group differences in continuously measured variables, in practice this is rarely done. More commonly, CIs are used to test hypotheses involving proportions and ratio measures of effect. When considering a 95% CI around a relative risk, an investigator notes whether the CI includes the null value of the ratio, which for a relative risk is one. A CI that includes the null value is equivalent to a p-value that exceeds the specified value of alpha. For example, in the investigation of long-term outcomes following the walking intervention cited above, the relative risk for high blood pressure was 0.90 (95% CI = 0.47-1.74).10 Because this CI includes the null value of 1, a hypothesis test would accept the null hypothesis of no difference in high blood pressure risk between intervention and control groups. In contrast, the association between the walking intervention and heart disease (relative risk = 0.18, 95% CI = 0.04-0.80) is interpreted as statistically significant because it does not contain the null value for relative risk. For this association, the null hypothesis of no difference in risk would be rejected in favor of the alternative hypothesis that the intervention protects against heart disease.
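The rule of checking whether an interval includes the null value can be written as a short sketch. The function name `ci_excludes_null` is hypothetical; the two intervals are the relative risk CIs from the walking intervention study reported above.10

```python
def ci_excludes_null(lower, upper, null_value=1.0):
    """Return True when the CI does not contain the null value.

    For ratio measures such as relative risk the null value is 1;
    for a difference between groups it would be 0.
    """
    return not (lower <= null_value <= upper)


# Relative risk CIs from the walking intervention study.
heart_disease = ci_excludes_null(0.04, 0.80)        # interval lies below 1
high_blood_pressure = ci_excludes_null(0.47, 1.74)  # interval spans 1
print(heart_disease, high_blood_pressure)
```

A result of True corresponds to rejecting the null hypothesis at the alpha that matches the interval (.05 for a 95% CI), and False corresponds to a ‘nonsignificant’ finding.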
Confidence Intervals: Beyond Hypothesis Testing
To construe CIs as merely a different way to test hypotheses, however, would ignore other important information conveyed. A CI informs the investigator and the reader about the power of the study and whether or not the data are compatible with a clinically significant treatment effect. The width of the CI is an indication of the precision of the point estimate – a narrower CI indicates a more precise estimate, while a wide CI indicates a less precise estimate. Precision is related to sample size and power such that the larger the sample size, and the greater the power, the more precise will be the estimate of the measure.8,11 Assessing the width of the CI is particularly useful in studies with small sample sizes. In small-n studies with ‘negative’ findings, where hypothesis testing fails to find statistically significant treatment effects or associations, point estimates with wide CIs that include the null value may be consistent with clinically significant findings.8 This is because hypothesis testing alone fails to account for statistical power and sample size. Since power is equal to 1 – beta, it follows that studies with small sample sizes, and low statistical power, have a higher probability of failing to identify true treatment effects or associations (a type II error). As mentioned previously, type II errors can have adverse consequences in clinical research, particularly where large sample sizes are simply not feasible.2
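The link between sample size and precision can be seen in a simple sketch of the width of a Z-based 95% CI around a mean, which is 2 × z × SD / √n. The function name `ci_width_for_mean` and the standard deviation of 200 are assumptions made for illustration only.

```python
from math import sqrt
from statistics import NormalDist


def ci_width_for_mean(sd, n, alpha=0.05):
    """Width of a Z-based CI around a mean: 2 * z * sd / sqrt(n)."""
    z = NormalDist().inv_cdf(1 - alpha / 2)
    return 2 * z * sd / sqrt(n)


# Hypothetical standard deviation; each step quadruples the sample size.
for n in (10, 40, 160):
    print(n, round(ci_width_for_mean(sd=200, n=n), 1))
```

Because the width varies with the inverse square root of n, quadrupling the sample size only halves the width of the interval, which is one reason precise estimates are hard to obtain in small-n studies.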
The lower limit of a CI, which is the limit closest to the null value, is typically used for hypothesis testing. The higher limit, the limit furthest from the null value, can be used to indicate whether or not a treatment effect or association is compatible with the data.11 In any investigation, the true value of the variable under study is unknown, but it is estimated by the data. A confidence interval around the point estimate indicates a range of credible values of the variable that is consistent with the observed data. If the interval contains the value of a variable that corresponds to a clinically significant treatment effect or association, the study has not ruled out that such an effect exists, even if the finding ‘failed’ a hypothesis test. The evidence for a treatment effect or association may not be conclusive, but the finding need not be rejected unequivocally as the logic of hypothesis testing demands. Further study using larger sample sizes or meta-analysis may reveal a positive effect. This approach to analysis is more accommodating to thoughtful, yet judicious interpretation, allowing authors and readers to reflect on the nuances of the data as they consider the meaning of a study.
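The two questions a CI can answer, whether the null value is excluded and whether a clinically significant effect remains compatible with the data, can be kept separate in a small sketch. It assumes a difference between groups in which the null value is 0, a positive difference means benefit, and the investigator has chosen a positive minimum important difference. The function name `interpret_difference_ci` and all numbers are hypothetical.

```python
def interpret_difference_ci(lower, upper, min_important_difference):
    """Interpret a CI around a between-group difference (null value = 0).

    Returns two booleans:
      statistically_significant: the CI excludes 0
      important_effect_possible: the CI reaches the minimum important
        difference, so a clinically significant benefit is not ruled out
    """
    statistically_significant = lower > 0 or upper < 0
    important_effect_possible = upper >= min_important_difference
    return statistically_significant, important_effect_possible


# Hypothetical small-n study: wide interval that includes 0 but also
# includes differences larger than the minimum important difference of 50.
print(interpret_difference_ci(-10, 120, min_important_difference=50))

# Hypothetical larger study: narrow interval that includes 0 and stops
# short of the minimum important difference.
print(interpret_difference_ci(-10, 30, min_important_difference=50))
```

Both hypothetical studies would ‘fail’ a hypothesis test, yet only the second provides evidence against a clinically significant benefit; the first leaves the question open.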
For example, White et al12 examined outcomes in patients with severe chronic obstructive pulmonary disease who participated in pulmonary rehabilitation or received advice and recommendations about exercise. They found no statistically significant differences in quality of life outcomes between the 2 groups, but the confidence intervals around these measures allowed them to suggest that some of their findings approached clinically significant differences. These authors acknowledge that recruitment difficulties lowered the sample sizes they were able to obtain, hence lowering the power of their study to detect statistically significant differences.12 This study illustrates some of the difficulties in conducting and interpreting research in populations with rare or severe conditions, and how the use of confidence intervals can assist in the interpretation of otherwise negative findings.
Confidence intervals also provide a more appropriate means of analysis for studies that seek to describe or explain, rather than to make decisions about treatment efficacy. The logic of hypothesis testing uses a decision-making mode of thinking that is more suitable to randomized, controlled trials (RCTs) of health care interventions. Indeed, hypothesis testing to determine statistical significance was initially intended to be used only in randomized experiments5 such as RCTs, which are typically not feasible in clinical research involving identification of risk factors, etiology, clinical diagnosis, or prognosis.13 Use of CIs permits hypothesis testing, if warranted, but it also allows a more flexible approach to analysis that accounts for the objectives of each investigation in its proper context.
SOME CAVEATS IN THE INTERPRETATION OF RESULTS
As we consider the relative utility of hypothesis testing and CIs in the interpretation of research studies, it is important to appreciate the limits inherent in any statistical analysis of data. A fundamental assumption in analysis is that measurements are without bias, ie, measurement error or misclassification that is systematic or nonrandom. Neither hypothesis testing nor the use of CIs can correct for bias, which may lead to erroneous conclusions based on the observed data. Readers are also reminded that determination of statistical significance does not imply that results are clinically significant. Because of the interrelationship of alpha, power, effect size, and sample size, studies with large sample sizes may produce statistically significant results, even if a difference between groups (effect size) or an association is small.8 Determination of clinical significance requires additional interpretation based on clinical experience and prior literature.
Confidence intervals permit a more flexible and nuanced approach to analysis of research data. Not only do CIs enable investigators to test hypotheses about their data, they are also more informative about such important features as sample size and the precision of point estimates of group differences and associations. Confidence intervals also are useful in the interpretation of studies with small sample sizes, allowing researchers and consumers of scientific literature to draw more meaningful conclusions about the clinical significance of such studies. Increased use of CIs by researchers and journal editors along with improved understanding of CIs on the part of clinicians will help us avoid unnecessarily rigid interpretation of clinical research as we move toward evidence-based practice.
1. Sim J, Reid N. Statistical inference by confidence intervals: Issues of interpretation and utilization. Phys Ther. 1999;79:186-195.
2. Ottenbacher KJ, Barrett KA. Statistical conclusion validity of rehabilitation research. Am J Phys Med Rehabil. 1990;69:102-107.
3. Simon R. Confidence intervals for reporting results of clinical trials. Ann Intern Med. 1986;105:429-435.
4. Feinstein AR. P-values and confidence intervals: Two sides of the same unsatisfactory coin. J Clin Epidemiol. 1998;51:355-360.
5. Salsberg D. The Lady Tasting Tea: How Statistics Revolutionized Science in the Twentieth Century. New York, NY: WH Freeman and Co.; 2001.
6. Portney LG, Watkins MP. Foundations of Clinical Research: Applications to Practice. 2nd ed. Upper Saddle River, NJ: Prentice Hall Health; 2000.
7. Goodman SN, Berlin JA. The use of predicted confidence intervals when planning experiments and the misuse of power when interpreting results. Ann Intern Med. 1994;121:200-206.
8. Hennekens CH, Buring JE. Epidemiology in Medicine. 1st ed. Boston, Mass: Little, Brown and Company; 1987.
9. Berry MJ, Rejeski WJ, Adair NE, et al. A randomized controlled trial comparing long-term and short-term exercise in patients with chronic obstructive pulmonary disease. J Cardiopulm Rehabil. 2003;23:60-68.
10. Pereira MA, Kriska AM, Day RD, et al. A randomized walking trial in postmenopausal women: Effects on physical activity and health 10 years later. Arch Intern Med. 1998;158:1695-1701.
11. Smith AH, Bates MN. Confidence limit analysis should replace power calculations in the interpretation of epidemiologic studies. Epidemiology. 1992;3:449-452.
12. White RJ, Rudkin ST, Harrison ST, et al. Pulmonary rehabilitation compared with brief advice given for severe chronic obstructive pulmonary disease. J Cardiopulm Rehabil. 2002;22:338-344.
13. Feinstein AR, Horwitz RI. Problems in the “evidence” of evidence-based medicine. Am J Med. 1997;103:529-535.
Gary Brooks, PT, DrPH, CCS
Associate Professor, School of Health Professions, Grand Valley State University and Research Associate, Grand Rapids Medical Education and Research Center, Grand Rapids, MI
Address correspondence to: Gary Brooks, PT, DrPH, CCS, Grand Rapids Medical Education and Research Center, 1000 Monroe Ave, NW, Grand Rapids, MI 49503.
Copyright Cardiopulmonary Physical Therapy Journal Sep 2003