Discussion of The Problem of False Discoveries: How to Balance Objectives

2009 IES Research Conference
David Judkins
I would like to commend the authors on their fine
I found nothing to disagree with.
I would like to spend my time talking about the nature
of confirmatory versus exploratory analysis, how to
group outcomes, how to drill down, and the utility of
single-dimensional summaries of multi-dimensional outcomes.
Thanks to Andrea Piesse of Westat for valuable
Of course, my remarks are personal and do not
necessarily reflect Westat policies.
G.E.P. Box
I haven’t read any work by him directly on
multiple comparisons or false discovery control
But he has written elegantly about the nature of
discovery and the use of statistics in that process
An understanding of his work will help
researchers distinguish between exploratory and
confirmatory analysis in their own work
Statistics for Discovery
Box, 2001, Journal of Applied Statistics
Based on his 2000 Deming Lecture
Knowledge development is an iterative process
Alternates between induction and deduction
In the inductive phase, we use new data to
improve current models
In the deductive phase, we design and conduct
experiments to test the logical consequences of
the improved models
Long History
Francis Bacon discussed the iterative nature of
knowledge development at the beginning of the
Age of Enlightenment.
Steve Stigler told Box that Bishop Robert
Grosseteste, one of the founders of Oxford
University in the 1200s, also talked about this
idea and attributed it to Aristotle.
Box’s Illustration
Model: Today is like every day.
Deduction: My car will be in my parking space.
Data: It isn’t!
Induction: Someone must have taken it.
Box’s Illustration (2)
Model: My car has been stolen.
Deduction: My car will not be in the parking lot.
Data: No. It’s over there!
Induction: Someone took it and brought it back.
Box’s Illustration (3)
Model: A thief took it and brought it back.
Deduction: My car will be broken into.
Data: No. It’s unharmed and locked!
Induction: Someone who had a key took it.
Box’s Illustration (4)
Model: My wife used my car.
Deduction: She has probably left me a note.
Data: Yes. Here it is!
Box on Judge versus Detective
In the trial, there is a judge and jury before
whom, under very strict rules, all the evidence
must be brought together at one time and the
jury must decide, whether the hypothesis of
innocence can be rejected beyond all reasonable
doubt. This is very much like a statistical test.
Box on Judge versus Detective (2)
However, the apprehension of the defendant by
a detective will have been conducted by a very
different process. … The approach of the
detective closely parallels that of the scientific investigator.
Fitting Randomized Trials into this
“Randomized trials” is, I believe, the name
favored in education research for experiments.
Much of the tradition for how to run them and
analyze them comes from the fields of medical
interventions, devices and pharmaceuticals,
where, of course, they are known as randomized
clinical trials.
What aspects of that tradition are appropriate in
education research?
Regulatory Role of RCTs
I think that much of the tradition has arisen
from the regulatory role of RCTs.
The FDA panels are much like Box’s juries, and
the FDA administrators like Box’s judges.
Of course, there is a huge set of investigators at
the drug companies working to synthesize new
drugs and to develop new devices.
But there is a severe administrative and legal
separation between the two operations.
Education Researchers Wear Both Hats
So when are we acting like judges and when like
investigators? When like the FDA and when like
the drug company development arms?
This determines to a large extent whether
formal control over the family-wise error rate (FWER) is
appropriate and thus whether adjustments must
be made for multiple comparisons.
I would say that we should treat an analysis as a
confirmatory analysis in the language of
Schochet and Deke if there is a good chance
that the findings will become accepted
knowledge for years to come.
I also think that there is a fairly strong danger of
exploratory analyses being mistaken for
confirmatory ones, so I urge very clear language in
the caveats attached to exploratory analyses.
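Why formal FWER control matters for a confirmatory analysis can be illustrated with a little arithmetic. The sketch below is not from the talk. It assumes m independent tests of true null hypotheses, each run at level alpha, which is a simplification because education outcomes are usually correlated. The function names are hypothetical.

```python
def fwer_independent(m, alpha=0.05):
    """Chance of at least one false rejection among m independent
    tests of true null hypotheses, each run at level alpha.
    (Independence is an assumption made for illustration only.)"""
    return 1 - (1 - alpha) ** m


def bonferroni_level(m, alpha=0.05):
    """Per-test level under the Bonferroni adjustment."""
    return alpha / m


if __name__ == "__main__":
    for m in (1, 5, 10):
        unadjusted = fwer_independent(m)
        adjusted = fwer_independent(m, bonferroni_level(m))
        print(m, round(unadjusted, 3), round(adjusted, 3))
```

Under these assumptions, ten unadjusted tests at the 0.05 level give roughly a 40 percent chance of at least one false discovery, while the Bonferroni-adjusted tests keep that chance below 0.05.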
What Works Clearinghouse
The title suggests that all the guidance to be
found is very solid and reliable.
Thus, I think that requiring FWER control for
entry into WWC is very appropriate.
But then how do we facilitate the induction phase?
How do we work to improve the models that for
the most part are still very primitive in education research?
What Might Work Clearinghouse
Report all the findings from randomized trials
with no concern about FWER?
Also, report findings from poorly controlled
observational studies?
A resource for experimenters not for
Peter and John mention that grouping outcomes
is a powerful way of mitigating the multiple
comparison problem
But how to form them?
In education research, there is a strong urge to
treat each assessment as a separate domain
Are receptive and expressive vocabulary skills really
separate domains?
Sources of Resistance to Grouping
Maybe a sense among researchers that they want to be doing
investigative work rather than judging work?
Pressure from test publishers to see results for
their assessments presented separately?
Post-Peek Grouping
What happens when we use conditional grouping rules?
Let T be the treatment, X and Y be two outcome variables, and Z be the
average of the two.
Suppose we estimate the effect of T on Z only if we
first find that the effects of T on X and on Y
are not statistically different from each other.
Otherwise, we publish the effects of T on X and T on
Y separately with a multiple comparison adjustment.
Post-Peek Grouping (2)
The math is complex. Preliminary simulations
hint, however, that the procedure is too liberal,
failing to provide FWER control.
It is also not clear how to generalize to more than
two outcomes.
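The conditional grouping rule can be explored with a small Monte Carlo sketch. This is not the speaker's simulation. It is a minimal illustration under assumptions that are not in the talk: the global null holds (T affects neither X nor Y), X and Y have equal variance, and the standardized effect estimates for X and Y are bivariate standard normal with correlation rho (a large-sample approximation). The peek is a two-sided z-test of the difference in effects, and the fallback is a Bonferroni adjustment for two tests. The function name and parameters are hypothetical.

```python
import math
import random
from statistics import NormalDist


def simulate_post_peek_fwer(rho=0.5, alpha=0.05, n_reps=20000, seed=0):
    """Estimate the family-wise error rate of the conditional grouping
    rule under the global null. Requires -1 < rho < 1."""
    rng = random.Random(seed)
    norm = NormalDist()
    crit = norm.inv_cdf(1 - alpha / 2)       # two-sided critical value
    crit_bonf = norm.inv_cdf(1 - alpha / 4)  # Bonferroni, two tests

    false_rejections = 0
    for _ in range(n_reps):
        # Correlated z-statistics for the effects of T on X and on Y.
        e1, e2 = rng.gauss(0, 1), rng.gauss(0, 1)
        z_x = e1
        z_y = rho * e1 + math.sqrt(1 - rho ** 2) * e2

        # Peek: are the two effects statistically different?
        z_diff = (z_x - z_y) / math.sqrt(2 * (1 - rho))
        if abs(z_diff) <= crit:
            # Not different: test the composite Z = (X + Y) / 2.
            z_comp = (z_x + z_y) / math.sqrt(2 * (1 + rho))
            rejected = abs(z_comp) > crit
        else:
            # Different: test X and Y separately with Bonferroni.
            rejected = abs(z_x) > crit_bonf or abs(z_y) > crit_bonf
        false_rejections += rejected

    return false_rejections / n_reps


if __name__ == "__main__":
    for rho in (0.0, 0.5, 0.9):
        print(rho, simulate_post_peek_fwer(rho=rho))
```

An estimate that sits above alpha by more than Monte Carlo error would be consistent with the liberal behavior described above. The sketch says nothing about designs that violate its assumptions.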
Drilling Down
If there is a significant effect on the composite
domain outcome, then there is natural interest in
the components.
I think this falls under the rubric of exploratory
analysis done to facilitate the induction phase of
knowledge building.
If FWER control is attempted for the drill-down, resampling methods would certainly
appear best suited given the strong correlations among the components.
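The talk does not name a specific resampling method. One common choice, shown here purely as an assumption, is the single-step max-T permutation adjustment of Westfall and Young. Treatment labels are permuted jointly across all component outcomes, so the correlation among the components is preserved in the reference distribution. The sketch below assumes simple individual-level randomization with no clustering or covariates, and both function names are hypothetical.

```python
import math
import random
from statistics import mean, variance


def abs_t_stats(outcomes, treated):
    """Absolute Welch-type t statistic for each outcome column.
    outcomes: list of rows (one per unit); treated: list of bools.
    Assumes each group has at least two units and nonzero variance."""
    stats = []
    for col in zip(*outcomes):
        a = [v for v, t in zip(col, treated) if t]
        b = [v for v, t in zip(col, treated) if not t]
        se = math.sqrt(variance(a) / len(a) + variance(b) / len(b))
        stats.append(abs(mean(a) - mean(b)) / se)
    return stats


def max_t_adjusted_pvalues(outcomes, treated, n_perm=2000, seed=0):
    """Single-step max-T adjusted p-values from label permutations."""
    rng = random.Random(seed)
    observed = abs_t_stats(outcomes, treated)
    labels = list(treated)
    exceed = [0] * len(observed)
    for _ in range(n_perm):
        rng.shuffle(labels)  # permute labels for all outcomes at once
        max_stat = max(abs_t_stats(outcomes, labels))
        for j, obs in enumerate(observed):
            exceed[j] += max_stat >= obs
    # Add-one correction keeps the adjusted p-values strictly positive.
    return [(1 + e) / (n_perm + 1) for e in exceed]
```

Each adjusted p-value compares a component's observed statistic with the permutation distribution of the largest statistic across all components, which is what yields family-wise control over the drill-down.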
Multi-Domain Outcome Indices
Not every summary measure needs to be built up from
a set of correlated items around the same latent variable.
Think of the quality-of-life indices published for cities
around the world.
Educational and developmental progress is multidimensional, but that does not mean that every
dimension needs to be reported separately.
We should not insist that all outcome measures have
high reliability for uni-dimensional latent variables.