Quality Issues and Standards
JAMA. 2002;287(21):2815-2817. doi: 10.1001/jama.287.21.2815

Extrapolation of Correlation Between 2 Variables in 4 General Medical Journals

Yen-Hong Kuo, ScM, MS

Author Affiliation: Department of Research, Jersey Shore Medical Center, Meridian Health System, Neptune, NJ.

Abstract

Context  An estimated correlation between 2 variables is valid only within the range of observed data. Extrapolation is risky and should be performed with caution.

Methods  To assess the prevalence of problems with data extrapolation in the medical literature, all articles published from January through June 2000 in BMJ, JAMA, The Lancet, and The New England Journal of Medicine (NEJM) were reviewed manually. Articles containing at least 1 scatterplot with raw data and a corresponding fitted regression line were included in the analysis. Articles were considered to involve extrapolation if they contained at least 1 fitted line beyond the observed data in any scatter plot.

Results  A total of 178 articles presenting at least 1 scatterplot were identified. Among them, 37 articles (21%) (5 from BMJ, 7 from JAMA, 23 from The Lancet, and 2 from NEJM) were included. Twenty-two articles (59% [95% confidence interval, 42%-75%]) from all 4 journals involved extrapolation. None changed the line type to indicate extrapolation. Four articles (11%) contained a plot in which the fitted line reached unreasonable or meaningless values. Three articles (8%) stated explicit conclusions about values outside the range of the observed data.

Conclusions  A high proportion of the articles analyzed from all 4 weekly general medical journals involved extrapolation without indication. Researchers, reviewers, and editors should be aware of this problem and work to eliminate it.

Adding a fitted regression line to a scatterplot is helpful for describing an estimated relationship between 2 continuous variables. However, this estimation is valid only within the range of data. Therefore, without knowing information beyond the observed characteristics, it is very risky to extrapolate the fitted line.1-3 In spite of this concern, several articles have been published in which the fitted line not only exceeded the range of the data but also reached an undesirable value.4-6

This study was undertaken to assess how prevalent the extrapolation problem is and how this issue was managed in 4 general medical journals.

METHODS

All of the articles published from January through June 2000 in 4 weekly general medical journals (BMJ, JAMA, The Lancet, and The New England Journal of Medicine [NEJM]) were manually reviewed. The main outcome measure was the proportion of articles that involved data extrapolation problems. Extrapolation was defined as an instance in which the fitted line exceeded the observed data range of explanatory variables in the regression model (as shown in the Figure). The first step was to identify articles in which at least 1 scatterplot was presented. Subsequently, to assess extrapolation, only articles showing a scatterplot of raw data and a corresponding fitted regression line were included in the analysis.

Figure. Illustration of Extrapolation Problems

The hypothetical response scores are on a scale of 0 to 40. The regression line is incorrectly extrapolated to an unreasonable weight (negative value) and meaningless score (>40).
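
The definition of extrapolation used here can be checked mechanically. The following Python sketch uses invented weight and score data (not taken from any reviewed article) to fit a least-squares line and report whether a line drawn between 2 chosen x-values leaves the observed range or takes values off the 0-to-40 scale. The function names, the data, and the plotting limits are illustrative assumptions.

```python
from statistics import mean

def fit_line(xs, ys):
    """Ordinary least-squares fit; returns (intercept, slope)."""
    x_bar, y_bar = mean(xs), mean(ys)
    sxx = sum((x - x_bar) ** 2 for x in xs)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope

def check_drawn_line(xs, ys, x_start, x_end, y_min, y_max):
    """Flag a fitted line drawn from x_start to x_end.

    Returns (extrapolated, out_of_scale):
      extrapolated  - the drawn line extends beyond the observed x range
      out_of_scale  - the line takes a y value outside [y_min, y_max]
    """
    intercept, slope = fit_line(xs, ys)
    extrapolated = x_start < min(xs) or x_end > max(xs)
    # A straight line attains its extreme y values at its endpoints.
    y_ends = [intercept + slope * x for x in (x_start, x_end)]
    out_of_scale = min(y_ends) < y_min or max(y_ends) > y_max
    return extrapolated, out_of_scale

# Hypothetical data: weight (kg) vs response score on a 0-40 scale.
weights = [50, 60, 70, 80, 90]
scores = [12, 17, 21, 27, 32]

# Line drawn only across the observed weights.
print(check_drawn_line(weights, scores, 50, 90, 0, 40))
# Line drawn to the edges of a wider plotting area, including a negative weight.
print(check_drawn_line(weights, scores, -10, 120, 0, 40))
```

With these invented numbers, the first call flags nothing and the second flags both problems, mirroring the situation drawn in the Figure.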

RESULTS

A total of 178 articles with at least 1 scatter plot were identified. Among them, 37 articles (21%) with scatterplots presenting raw data and a corresponding fitted regression line were included in this study: 5 from BMJ,7-11 7 from JAMA,12-18 23 from The Lancet,19-41 and 2 from NEJM.42-43

Altogether, 22 articles (59% [95% confidence interval, 42%-75%]) from these journals had an extrapolation problem. The proportions by journal (from lowest to highest, 40%, 50%, 57%, and 86%) did not differ significantly (P = .37 by 2-sided Fisher exact test), possibly because of the small sample size. This problem was found regardless of whether the correlation was assessed using simple linear regression. None of the fitted lines was graphically distinguished to indicate extrapolation. Four articles (11%) presented a fitted line that reached a meaningless11, 40 or unreasonable14, 28 value. In addition, 3 articles (8%) stated explicit conclusions about values outside the range of the observed data.33, 40, 43
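
The summary statistics in this paragraph can be recomputed as a rough check. The report does not name the confidence-interval method, so the Python sketch below assumes an exact (Clopper-Pearson) binomial interval. The per-journal counts (2 of 5, 6 of 7, 13 of 23, and 1 of 2) are inferred from the reported percentages and article totals rather than stated in the text, and the Fisher exact test is written in its general r x 2 form. All function names are illustrative.

```python
from itertools import product
from math import comb

def binom_cdf(k, n, p):
    """P(X <= k) for X ~ Binomial(n, p)."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k + 1))

def _solve(f, target, tol=1e-10):
    """Bisection on [0, 1] for a function f that decreases in p."""
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if f(mid) > target:
            lo = mid
        else:
            hi = mid
    return (lo + hi) / 2

def clopper_pearson(k, n, alpha=0.05):
    """Exact 2-sided confidence interval for a binomial proportion."""
    # Lower limit: the p at which P(X <= k - 1) equals 1 - alpha/2.
    lower = 0.0 if k == 0 else _solve(lambda p: binom_cdf(k - 1, n, p), 1 - alpha / 2)
    # Upper limit: the p at which P(X <= k) equals alpha/2.
    upper = 1.0 if k == n else _solve(lambda p: binom_cdf(k, n, p), alpha / 2)
    return lower, upper

def fisher_exact_rx2(successes, totals):
    """2-sided exact P value for an r x 2 table with fixed margins.

    Sums the probabilities of all tables that are no more likely
    than the observed one.
    """
    n_success = sum(successes)
    denom = comb(sum(totals), n_success)

    def prob(ks):
        ways = 1
        for k, n in zip(ks, totals):
            ways *= comb(n, k)
        return ways / denom

    observed = prob(successes)
    p_value = 0.0
    for ks in product(*(range(n + 1) for n in totals)):
        if sum(ks) == n_success:
            pr = prob(ks)
            if pr <= observed * (1 + 1e-9):  # tolerance for floating-point ties
                p_value += pr
    return p_value

low, high = clopper_pearson(22, 37)
print(f"95% CI: {low:.2f} to {high:.2f}")

# Counts inferred from the reported percentages (an assumption, not source data).
extrapolating = [2, 6, 13, 1]
included = [5, 7, 23, 2]
print(f"P = {fisher_exact_rx2(extrapolating, included):.2f}")
```

The printed values can be compared with the reported interval and P value. Any disagreement would point to a different interval method or to different per-journal counts than those assumed here.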

COMMENT

This study reveals that almost 60% of the included articles involved extrapolation when displaying a correlation between 2 variables with a fitted line. This common problem was found in all 4 weekly general medical journals.

Mathematically, a fitted regression line can be drawn by substituting almost any real number into the estimated equation. However, in clinical applications, this line should not be presented beyond the range of the data. Otherwise, extrapolation can reach an undesirable value. For example, in one study, the fitted line reached a negative value for time from stroke symptom onset to emergency department arrival.14 This would imply that patients arrived at the emergency department before the onset of stroke symptoms. Another article presented 2 regression lines that unreasonably arrived at a negative level of proteinuria.28

Sometimes the error of reaching an undesirable value is not obvious and is difficult to discern unless readers can carefully identify the reasonable range of data in the study (which is not always available). For example, in one article, the fitted line crossed a meaningless area of the Townsend score.11 The range of scores was not described in the legend of the graph but, rather, in the text. Another study showed a fitted line reaching a value of undetectable cytomegalovirus viral load.40 A nominal value of negative result was described in the text but not shown in the graph. As a consequence, using an undetectable amount to make a prediction is apparently meaningless.

Limitations of, or errors generated by, computer programs could be one cause of extrapolating or reaching an undesirable value. For example, in the 4 articles that involved extrapolating to an undesirable value, all of the lines reached the edges of the graph.11, 14, 28, 40 However, 9 of 22 articles had extrapolation problems in which the fitted lines did not reach the margins of the graph. Authors are responsible for ensuring that the estimation and presentation of their data are clinically meaningful and should carefully check all data and graphs generated by a computer program.

Problems with extrapolation also involved stating explicit conclusions about values outside the range of the observed data. In some cases, the reader must perform calculations to become aware of the extrapolation problem. For example, in one study, the 2 doses used to demonstrate the effect of inhaled corticosteroids on bone mineral density were both outside the range of observed data: one was evidently higher, and the other, after computation, was lower.33 In another study, the expected time to reach undetectable levels of thymus-dependent T-cell antigen-receptor episomes not only exceeded the maximum number of years after transplantation but also corresponded to neither the fitted line in the graph nor the regression equation.43
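
The kind of calculation a reader must perform in such cases is an inverse prediction: solve the fitted equation for the x-value at which a target y is reached, then compare that x with the observed range. A minimal Python sketch follows. The coefficients, target, and range are invented for illustration and are not taken from the cited studies, and the function names are hypothetical.

```python
def x_at_target(intercept, slope, y_target):
    """Solve y_target = intercept + slope * x for x."""
    if slope == 0:
        raise ValueError("a flat line never reaches a different y value")
    return (y_target - intercept) / slope

def is_within_observed(x, x_min, x_max):
    """True when x lies inside the observed range of the explanatory variable."""
    return x_min <= x <= x_max

# Hypothetical fit: marker level = 100 - 4 * years, observed from 1 to 15 years.
years_to_zero = x_at_target(intercept=100.0, slope=-4.0, y_target=0.0)
print(years_to_zero, is_within_observed(years_to_zero, 1.0, 15.0))  # 25.0 False
```

In this invented case the line reaches 0 at 25 years, well past the last observation at 15 years, so a conclusion stated at that point would rest entirely on extrapolation.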

With raw data, it is very easy for a reader to be aware of an extrapolation problem from a graph. However, if only a fitted line is presented without the original data points, or even an integrated plot, the extrapolation problem becomes harder to identify. For example, in one article, some of the areas in the constructed contour plot exceeded the range of data.40 However, this problem could not be identified readily by the reader without performing calculations based on information from the text.44

Extrapolation is very dangerous for medical decision making and can result in damaging outcomes.1-3 Describing or presenting the estimated correlation within the range of data can prevent potential problems. However, if making an extrapolation is necessary for a researcher, such analysis must be handled with caution. That is, it needs to be described explicitly in the text or indicated in the plot by use of differentiating line types, especially when the range of data is not provided. In addition, during the peer review process, editors and reviewers need to be aware of and identify such extrapolation issues to prevent and eliminate potential problems.
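
One way to apply differentiating line types is to split the plotted line at the limits of the observed data and assign a separate line type to each piece. The Python sketch below only computes the segments. The function name and the style labels "solid" and "dashed" are illustrative assumptions, and the result would be handed to whatever plotting tool is in use.

```python
def styled_segments(x_obs, x_from, x_to):
    """Split the plotting interval [x_from, x_to] at the observed data limits.

    Returns a list of (start, end, style) tuples: 'solid' inside the
    observed range of x_obs, 'dashed' where the line is extrapolated.
    """
    if x_from >= x_to:
        raise ValueError("x_from must be less than x_to")
    lo, hi = min(x_obs), max(x_obs)
    segments = []
    # Extrapolated part below the smallest observed x.
    if x_from < lo:
        segments.append((x_from, min(x_to, lo), "dashed"))
    # Part supported by the data.
    start, end = max(x_from, lo), min(x_to, hi)
    if start < end:
        segments.append((start, end, "solid"))
    # Extrapolated part above the largest observed x.
    if x_to > hi:
        segments.append((max(x_from, hi), x_to, "dashed"))
    return segments

# Hypothetical observed x values and a wider plotting interval.
print(styled_segments([50, 60, 70, 80, 90], 40, 100))
```

Drawing only the "solid" segment is the stricter option; keeping the "dashed" pieces makes any extrapolation visible without hiding it.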

Acknowledgments

Author Contribution: Mr Kuo contributed the conception and design; data acquisition, analysis, and interpretation; manuscript drafting and critical revision for important intellectual content; statistical expertise; and administrative, technical, and material support for this article.

Corresponding Author and Reprints: Yen-Hong Kuo, ScM, MS, Department of Research, Jersey Shore Medical Center, Meridian Health System, 1945 State Rte 33, Neptune, NJ 07753 (e-mail: yhkuo@jhu.edu).

References

  1. 1.
  2. 2.
  3. 3.
  4. 4.
  5. 5.
  6. 6.
  7. 7.
  8. 8.
  9. 9.
  10. 10.
  11. 11.
  12. 12.
  13. 13.
  14. 14.
  15. 15.
  16. 16.
  17. 17.
  18. 18.
  19. 19.
  20. 20.
  21. 21.
  22. 22.
  23. 23.
  24. 24.
  25. 25.
  26. 26.
  27. 27.
  28. 28.
  29. 29.
  30. 30.
  31. 31.
  32. 32.
  33. 33.
  34. 34.
  35. 35.
  36. 36.
  37. 37.
  38. 38.
  39. 39.
  40. 40.
  41. 41.
  42. 42.
  43. 43.
  44. 44.