# Guest Post: Confidence in Intervals and Diffidence in the Courts

This guest post comes to the STLR Blog from CLS Lecturer-in-Law Nathan A. Schachtman. He blogs regularly at http://schachtmanlaw.com/blog/. This post was originally published at that site and is available here.

Next year, the Supreme Court’s *Daubert* decision will turn 20. The decision, in interpreting Federal Rule of Evidence 702, dramatically changed the landscape of expert witness testimony. Still, there are many who would turn back the clock by disabling the gatekeeping function. In past posts, I have identified scholars, such as Erica Beecher-Monas and the late Margaret Berger, who tried to eviscerate judicial gatekeeping. Recently, a student note argued for the complete abandonment of all judicial control of expert witness testimony. *See* Note, “Admitting Doubt: A New Standard for Scientific Evidence,” 123 *Harv. L. Rev*. 2021 (2010)(arguing that courts should admit all relevant evidence).

One advantage that comes from requiring trial courts to serve as gatekeepers is that the expert witnesses’ reasoning is approved or disapproved in an open, transparent, and rational way. Trial courts subject themselves to public scrutiny in a way that jury decision making does not permit. The critics of *Daubert* often engage in a cynical attempt to remove all controls over expert witnesses in order to empower juries to act on their populist passions and prejudices. When courts misinterpret statistical and scientific evidence, there is some hope of changing subsequent decisions by pointing out their errors. Jury errors, on the other hand, unless they involve determinations of issues for which there was “no evidence,” are immune to institutional criticism or correction.

Despite my whining, not all courts butcher statistical concepts. There are many astute judges out there who see error and call it error. Take, for instance, the trial judge who was confronted with this typical argument:

“While Giles admits that a p-value of .15 is three times higher than what scientists generally consider statistically significant—that is, a p-value of .05 or lower—she maintains that this ‘‘represents 85% certainty, which meets any conceivable concept of preponderance of the evidence.’’ (Doc. 103 at 16).”

*Giles v. Wyeth, Inc*., 500 F.Supp. 2d 1048, 1056-57 (S.D.Ill. 2007), *aff’d*, 556 F.3d 596 (7th Cir. 2009). Despite having case law cited to it (such as *In re Ephedra*), the trial court looked to the *Reference Manual on Scientific Evidence*, a resource that seems to be ignored by many federal judges, and rejected the bogus argument. Unfortunately, the lawyers who made the bogus argument still are licensed, and at large, to incite the same error in other cases.
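The gap between a p-value of .15 and “85% certainty” can be made concrete with a short calculation. What follows is only an illustrative sketch: the function name, the assumed power, and the assumed prior share of true null hypotheses are hypothetical inputs that appear nowhere in *Giles*, and the calculation conditions on a result at or below the .15 cutoff rather than on the exact observed value. It suggests that the probability that the null hypothesis is true, given such a result, depends on those inputs and cannot be read off the p-value.

```python
def prob_null_given_significant(alpha, power, prior_null):
    """P(null true | p <= alpha), by Bayes' theorem.

    alpha      -- cutoff, i.e. P(p <= alpha | null true)
    power      -- assumed P(p <= alpha | null false); hypothetical input
    prior_null -- assumed prior probability that the null is true; hypothetical input
    """
    false_positives = alpha * prior_null
    true_positives = power * (1.0 - prior_null)
    return false_positives / (false_positives + true_positives)


if __name__ == "__main__":
    # Same cutoff (.15), different assumed priors and power: the answer moves.
    for prior_null in (0.5, 0.9):
        for power in (0.3, 0.8):
            p = prob_null_given_significant(0.15, power, prior_null)
            print(f"prior_null={prior_null}, power={power}: "
                  f"P(null | p <= .15) = {p:.2f}")
```

Under these assumed inputs the probability of the null swings widely from one scenario to the next, which is why “1 minus the p-value” cannot serve as a measure of certainty.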

This business perhaps would be amenable to an empirical analysis. An enterprising sociologist of the law could conduct some survey research on the science and math training of the federal judiciary, on whether the federal judges have read chapters of the *Reference Manual* before deciding cases involving statistics or science, and on whether federal judges expressed the need for further education. This survey evidence could be capped by an analysis of the prevalence of certain kinds of basic errors, such as the transpositional fallacy committed by so many judges (but decisively rejected in the *Giles* case). Perhaps such an empirical analysis would advance our understanding of whether we need specialty science courts.

One of the reasons that the *Reference Manual on Scientific Evidence* is worthy of so much critical attention is that the volume has the imprimatur of the Federal Judicial Center, and now the National Academies of Science. Putting aside the idiosyncratic chapter by the late Professor Berger, the *Manual* clearly presents guidance on many important issues. To be sure, there are gaps, inconsistencies, and mistakes, but the statistics chapter should be a must-read for federal (and state) judges.

Unfortunately, the *Manual* has competition from lesser authors whose work obscures, misleads, and confuses important issues. Consider an article by two would-be expert witnesses, who testify for plaintiffs, and confidently misstate the meaning of a confidence interval:

“Thus, a RR [relative risk] of 1.8 with a confidence interval of 1.3 to 2.9 could very likely represent a true RR of greater than 2.0, and as high as 2.9 in 95 out of 100 repeated trials.”

Richard W. Clapp & David Ozonoff, “Environment and Health: Vital Intersection or Contested Territory?” 30 *Am. J. L. & Med*. 189, 210 (2004). This misstatement was then cited and quoted with obvious approval by Professor Beecher-Monas, in her text on scientific evidence. Erica Beecher-Monas, *Evaluating Scientific Evidence: An Interdisciplinary Framework for Intellectual Due Process* 60-61 n. 17 (2007). Beecher-Monas goes on, however, to argue that confidence interval coefficients are not the same as burdens of proof, but then implies that scientific standards of proof are different from the legal preponderance of the evidence. She provides no citation or support for the higher burden of scientific proof:

“Some commentators have attributed the causation conundrum in the courts to the differing burdens of proof in science and law.^{28} In law, the civil standard of ‘more probable than not’ is often characterized as a probability greater than 50 percent.^{29} In science, on the other hand, the most widely used standard is a 95 percent confidence interval (corresponding to a 5 percent level of significance, or p-level).^{30} Both sound like probabilistic assessment. As a result, the argument goes, civil judges should not exclude scientific testimony that fails scientific validity standards because the civil legal standards are much lower. The transliteration of the ‘more probable than not’ standard of civil factfinding into a quantitative threshold of statistical evidence is misconceived. The legal and scientific standards are fundamentally different. They have different goals and different measures. Therefore, one cannot justifiably argue that evidence failing to meet the scientific standards nonetheless should be admissible because the scientific standards are too high for preponderance determinations.”

*Id*. at 65. This seems to be on the right track, although Beecher-Monas does not state clearly whether she subscribes to the notion that the burdens of proof in science and law differ. The argument then takes a wrong turn:

“Equating confidence intervals with burdens of persuasion is simply incoherent. The goal of the scientific standard – the 95 percent confidence interval – is to avoid claiming an effect when there is none (i.e., a false positive).^{31}”

*Id*. at 66. But this is crazy error; confidence intervals are not burdens of persuasion, legal **or** scientific. Beecher-Monas is not, however, content to leave this alone:

“Scientists using a 95 percent confidence interval are making a prediction about the results being due to something *other than chance*.”

*Id*. at 66 (emphasis added). Other than chance? Well, this implies causality, as well as bias and confounding, but the confidence interval, like the p-value, addresses only random or sampling error. Beecher-Monas’s error is neither random nor scientific. Indeed, she perpetuates the same error committed by the Fifth Circuit in a frequently cited Bendectin case, which interpreted the confidence interval as resolving questions of the role of matters “other than chance,” such as bias and confounding. *Brock v. Merrell Dow Pharmaceuticals, Inc.*, 874 F.2d 307, 311-12 (5th Cir. 1989)(“Fortunately, we do not have to resolve any of the above questions [as to bias and confounding], since the studies presented to us incorporate the possibility of these factors by the use of a confidence interval.”)(emphasis in original). *See, e.g.,* David H. Kaye, David E. Bernstein, and Jennifer L. Mnookin, *The New Wigmore – A Treatise on Evidence: Expert Evidence* § 12.6.4, at 546 (2d ed. 2011); Michael O. Finkelstein, *Basic Concepts of Probability and Statistics in the Law* 86-87 (2009)(criticizing the overinterpretation of confidence intervals by the *Brock* court).
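The point that a confidence interval speaks only to sampling error can be illustrated with a small simulation. This is a sketch under stated assumptions, not a model of any study discussed here: the function name, the normally distributed estimates, and the fixed `bias` term standing in for systematic error are all hypothetical. It compares how often a nominal 95% interval contains the true value when the studies are unbiased and when every study is shifted by the same systematic error.

```python
import random
from statistics import NormalDist


def coverage(bias, true_value=0.0, sigma=1.0, n=100, trials=5000, seed=1):
    """Share of nominal 95% intervals that contain true_value.

    Each simulated study estimates true_value + bias, where 'bias' is a
    hypothetical, fixed systematic distortion. The interval width reflects
    sampling error only, exactly as a textbook interval would.
    """
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)   # about 1.96
    se = sigma / n ** 0.5             # standard error of the study estimate
    hits = 0
    for _ in range(trials):
        estimate = rng.gauss(true_value + bias, se)
        if estimate - z * se <= true_value <= estimate + z * se:
            hits += 1
    return hits / trials


if __name__ == "__main__":
    print("coverage with no bias:  ", coverage(bias=0.0))
    print("coverage with bias 0.5: ", coverage(bias=0.5))
```

In the unbiased scenario the intervals cover the truth at about the nominal rate; with the assumed bias they almost never do, although each interval is just as narrow. Nothing in the interval itself signals the problem.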

Clapp, Ozonoff, and Beecher-Monas are not alone in offering bad advice to judges who must help resolve statistical issues. Déirdre Dwyer, a prominent scholar of expert evidence in the United Kingdom, manages to bundle up the transpositional fallacy and a misstatement of the meaning of the confidence interval into one succinct exposition:

“By convention, scientists require a 95 per cent probability that a finding is not due to chance alone. The risk ratio (e.g. ‘2.2’) represents a mean figure. The actual risk has a 95 per cent probability of lying somewhere between upper and lower limits (e.g. 2.2 ±0.3, which equals a risk somewhere between 1.9 and 2.5) (the ‘confidence interval’).”

Déirdre Dwyer, *The Judicial Assessment of Expert Evidence* 154-55 (Cambridge Univ. Press 2008).

Of course, Clapp, Ozonoff, Beecher-Monas, and Dwyer build upon a long tradition of academics’ giving errant advice to judges on this very issue. *See, e.g.,* Christopher B. Mueller, “Daubert Asks the Right Questions: Now Appellate Courts Should Help Find the Right Answers,” 33 *Seton Hall L. Rev*. 987, 997 (2003)(describing the 95% confidence interval as “the range of outcomes that would be expected to occur by chance no more than five percent of the time”); Arthur H. Bryant & Alexander A. Reinert, “The Legal System’s Use of Epidemiology,” 87 *Judicature* 12, 19 (2003)(“The confidence interval is intended to provide a range of values within which, at a specified level of certainty, the magnitude of association lies.”) (incorrectly citing the first edition of Rothman & Greenland, *Modern Epidemiology* 190 (Philadelphia 1998)); John M. Conley & David W. Peterson, “The Science of Gatekeeping: The Federal Judicial Center’s New *Reference Manual on Scientific Evidence*,” 74 *N.C.L.Rev*. 1183, 1212 n.172 (1996)(“a 95% confidence interval … means that we can be 95% certain that the true population average lies within that range”).

Who has prevailed? The statistically correct authors of the statistics chapter of the *Reference Manual on Scientific Evidence*, or the errant commentators? It would be good to have some empirical evidence to help evaluate the judiciary’s competence. Here are some cases, many drawn from the *Manual*’s discussions, arranged chronologically, before and after the first appearance of the *Manual*:

**Before First Edition of the Reference Manual on Scientific Evidence:**

*DeLuca v. Merrell Dow Pharms., Inc.,* 911 F.2d 941, 948 (3d Cir. 1990)(“A 95% confidence interval is constructed with enough width so that one can be confident that it is only 5% likely that the relative risk attained would have occurred if the true parameter, i.e., the actual unknown relationship between the two studied variables, were outside the confidence interval. If a 95% confidence interval thus contains ‘1’, or the null hypothesis, then a researcher cannot say that the results are ‘statistically significant’, that is, that the null hypothesis has been disproved at a .05 level of significance.”)(internal citations omitted)(citing in part, D. Barnes & J. Conley, *Statistical Evidence in Litigation* § 3.15, at 107 (1986), as defining a CI as “a limit above or below or a range around the sample mean, beyond which the true population is unlikely to fall”).

*United States ex rel. Free v. Peters*, 806 F. Supp. 705, 713 n.6 (N.D. Ill. 1992) (“A 99% confidence interval, for instance, is an indication that if we repeated our measurement 100 times under identical conditions, 99 times out of 100 the point estimate derived from the repeated experimentation will fall within the initial interval estimate … .”), *rev’d in part*, 12 F.3d 700 (7th Cir. 1993)

*DeLuca v. Merrell Dow Pharms., Inc.,* 791 F. Supp. 1042, 1046 (D.N.J. 1992)(“A 95% confidence interval means that there is a 95% probability that the ‘true’ relative risk falls within the interval”), *aff’d*, 6 F.3d 778 (3d Cir. 1993)

*Turpin v. Merrell Dow Pharms., Inc.,* 959 F.2d 1349, 1353-54 & n.1 (6th Cir. 1992)(describing a 95% CI of 0.8 to 3.10, to mean that “random repetition of the study should produce, 95 percent of the time, a relative risk somewhere between 0.8 and 3.10”)

*Hilao v. Estate of Marcos*, 103 F.3d 767, 787 (9th Cir. 1996)(Rymer, J., dissenting and concurring in part).

**After the first publication of the Reference Manual on Scientific Evidence:**

*American Library Ass’n v. United States*, 201 F.Supp. 2d 401, 439 & n.11 (E.D.Pa. 2002), *rev’d on other grounds*, 539 U.S. 194 (2003)

*SmithKline Beecham Corp. v. Apotex Corp*., 247 F.Supp.2d 1011, 1037-38 (N.D. Ill. 2003)(“the probability that the true value was between 3 percent and 7 percent, that is, within two standard deviations of the mean estimate, would be 95 percent”)(also confusing attained significance probability with posterior probability: “This need not be a fatal concession, since 95 percent (i.e., a 5 percent probability that the sign of the coefficient being tested would be observed in the test even if the true value of the sign was zero) is an arbitrary measure of statistical significance. This is especially so when the burden of persuasion on an issue is the undemanding ‘preponderance’ standard, which requires a confidence of only a mite over 50 percent. So recomputing Niemczyk’s estimates as significant only at the 80 or 85 percent level need not be thought to invalidate his findings.”), *aff’d on other grounds*, 403 F.3d 1331 (Fed. Cir. 2005)

*In re Silicone Gel Breast Implants Prods. Liab. Litig.*, 318 F.Supp.2d 879, 897 (C.D. Cal. 2004) (interpreting a relative risk of 1.99, in a subgroup of women who had had polyurethane foam covered breast implants, with a 95% CI that ran from 0.5 to 8.0, to mean that “95 out of 100 a study of that type would yield a relative risk somewhere between on 0.5 and 8.0. This huge margin of error associated with the PUF-specific data (ranging from a potential finding that implants make a woman 50% *less likely* to develop breast cancer to a potential finding that they make her 800% *more likely* to develop breast cancer) render those findings meaningless for purposes of proving or disproving general causation in a court of law.”)(emphasis in original)

*Ortho–McNeil Pharm., Inc. v. Kali Labs., Inc*., 482 F.Supp. 2d 478, 495 (D.N.J.2007)(“Therefore, a 95 percent confidence interval means that if the inventors’ mice experiment was repeated 100 times, roughly 95 percent of results would fall within the 95 percent confidence interval ranges.”)(apparently relying on a party’s expert witness’s report), *aff’d in part, vacated in part, sub nom. Ortho McNeil Pharm., Inc. v. Teva Pharms Indus., Ltd.*, 344 Fed.Appx. 595 (Fed. Cir. 2009)

*Eli Lilly & Co. v. Teva Pharms, USA*, 2008 WL 2410420, *24 (S.D.Ind. 2008)(stating incorrectly that “95% percent of the time, the true mean value will be contained within the lower and upper limits of the confidence interval range”)

*Benavidez v. City of Irving*, 638 F.Supp. 2d 709, 720 (N.D. Tex. 2009)(interpreting a 90% CI to mean that “there is a 90% chance that the range surrounding the point estimate contains the truly accurate value.”)

*Estate of George v. Vermont League of Cities and Towns*, 993 A.2d 367, 378 n.12 (Vt. 2010)(erroneously describing a confidence interval to be a “range of values within which the results of a study sample would be likely to fall if the study were repeated numerous times”)

**Correct Statements**

There is no reason for any of these courts to have struggled so with the concept of statistical significance or of the confidence interval. These concepts are well elucidated in the *Reference Manual on Scientific Evidence (RMSE)*:

“To begin with, ‘confidence’ is a term of art. The confidence level indicates the percentage of the time that intervals from repeated samples would cover the true value. The confidence level does not express the chance that repeated estimates would fall into the confidence interval.^{91}

* * *

According to the frequentist theory of statistics, probability statements cannot be made about population characteristics: Probability statements apply to the behavior of samples. That is why the different term ‘confidence’ is used.”

*RMSE* 3d at 247 (2011).
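The distinction the *Manual* draws, between intervals covering the true value and repeated estimates falling inside one interval, can be checked with a simulation. This is a minimal sketch under assumptions chosen purely for illustration: normally distributed data, a known standard deviation, and hypothetical parameter values and names. For each simulated pair of studies, it records whether the first study’s 95% interval covers the true mean and whether the second study’s point estimate lands inside that same interval.

```python
import random
from statistics import NormalDist


def simulate(true_mean=10.0, sigma=2.0, n=25, trials=5000, seed=7):
    """Return (share of intervals covering the true mean,
               share of intervals capturing a repeat study's estimate)."""
    rng = random.Random(seed)
    z = NormalDist().inv_cdf(0.975)
    half_width = z * sigma / n ** 0.5   # sigma treated as known (assumption)
    covers_truth = 0
    captures_repeat = 0
    for _ in range(trials):
        first = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        repeat = sum(rng.gauss(true_mean, sigma) for _ in range(n)) / n
        lo, hi = first - half_width, first + half_width
        covers_truth += lo <= true_mean <= hi
        captures_repeat += lo <= repeat <= hi
    return covers_truth / trials, captures_repeat / trials


if __name__ == "__main__":
    cover, capture = simulate()
    print(f"intervals covering the true mean:      {cover:.3f}")
    print(f"intervals capturing a repeat estimate: {capture:.3f}")
```

In this setup the first share should sit near 95%, while the second is noticeably lower (about 83% in theory, because the first and the repeat estimates both carry sampling error). The “repeat the study 100 times” readings in the cases above describe the second quantity, not the first.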

Even before the *Manual*, many capable authors tried to reach the judiciary to help judges learn and apply statistical concepts more confidently. Professors Michael Finkelstein and Bruce Levin, of Columbia University’s Law School and Mailman School of Public Health, respectively, have worked hard to educate lawyers and judges in the important concepts of statistical analysis:

“It is the confidence limits $P_L$ and $P_U$ that are random variables based on the sample data. Thus, a confidence interval $(P_L, P_U)$ is a random interval, which may or may not contain the population parameter $P$. The term ‘confidence’ derives from the fundamental property that, whatever the true value of $P$, the 95% confidence interval will contain $P$ within its limits 95% of the time, or with 95% probability. This statement is made only with reference to the general property of confidence intervals and not to a probabilistic evaluation of its truth in any particular instance with realized values of $P_L$ and $P_U$.”

Michael O. Finkelstein & Bruce Levin, *Statistics for Lawyers* at 169-70 (2d ed. 2001)

Courts have no doubt been confused to some extent between the operational definition of a confidence interval and the role of the sample point estimate as an estimator of the population parameter. In some instances, the sample statistic may be the best estimate of the population parameter, but that estimate may be rather crummy because of the sampling error involved. *See, e.g*., Kenneth J. Rothman, Sander Greenland, Timothy L. Lash, *Modern Epidemiology* 158 (3d ed. 2008) (“Although a single confidence interval can be much more informative than a single P-value, it is subject to the misinterpretation that values inside the interval are equally compatible with the data, and all values outside it are equally incompatible. * * * A given confidence interval is only one of an infinite number of ranges nested within one another. Points nearer the center of these ranges are more compatible with the data than points farther away from the center.”); Nicholas P. Jewell, *Statistics for Epidemiology* 23 (2004)(“A popular interpretation of a confidence interval is that it provides values for the unknown population proportion that are ‘compatible’ with the observed data. But we must be careful not to fall into the trap of assuming that each value in the interval is equally compatible.”); Charles Poole, “Confidence Intervals Exclude Nothing,” 77 *Am. J. Pub. Health* 492, 493 (1987)(“It would be more useful to the thoughtful reader to acknowledge the great differences that exist among the p-values corresponding to the parameter values that lie within a confidence interval … .”).
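The “nested ranges” point made by Rothman, Greenland, and Lash, and by Poole, can be sketched by computing a p-value for each hypothesized relative risk rather than only for the null. The example borrows the figures quoted above from the breast implant case (relative risk 1.99, 95% CI 0.5 to 8.0). It is a rough illustration only: it assumes the interval was a Wald-type interval symmetric on the log scale, which the opinion does not say, and the function names are hypothetical.

```python
from math import log
from statistics import NormalDist


def p_value_function(point_estimate, lower, upper, level=0.95):
    """Return f(rr): the two-sided p-value for a hypothesized relative risk.

    Assumes the reported interval is symmetric on the log scale, so the
    standard error can be backed out of the reported limits.
    """
    z = NormalDist().inv_cdf(0.5 + level / 2)
    se = (log(upper) - log(lower)) / (2 * z)

    def p_value(hypothesized_rr):
        z_score = abs(log(point_estimate) - log(hypothesized_rr)) / se
        return 2 * (1 - NormalDist().cdf(z_score))

    return p_value


if __name__ == "__main__":
    p = p_value_function(1.99, 0.5, 8.0)
    for rr in (0.5, 1.0, 1.99, 4.0, 8.0):
        print(f"hypothesized RR {rr}: p = {p(rr):.2f}")
```

Under these assumptions, hypothesized values near the point estimate have p-values close to 1, the null value of 1.0 sits well inside the interval, and values at the interval’s limits have p-values of roughly .05. The values inside the interval are therefore far from equally compatible with the data.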

Admittedly, I have given an impressionistic account, and I have used anecdotal methods, to explore the question whether the courts have improved in their statistical assessments in the 20 years since the Supreme Court decided *Daubert*. Many decisions go unreported, and perhaps many errors are cut off from the bench in the course of testimony or argument. I personally doubt that judges exercise greater care in their comments from the bench than they do in published opinions. Still, the quality of care exercised by the courts would be a worthy area of investigation by the Federal Judicial Center, or perhaps by other sociologists of the law.