An introduction to Impact Evaluation Markus Goldstein Poverty Reduction Group

An introduction to Impact Evaluation
Markus Goldstein
Poverty Reduction Group
The World Bank
My question is: Are we making an impact?
2 parts
• Impact evaluation methods
• Impact evaluation practicalities: IE and the
project cycle
• Use rural project examples
Outline - methods
Monitoring and impact evaluation
Why do impact evaluation
Why we need a comparison group
Methods for constructing the comparison
• When to do an impact evaluation
Monitoring and IE
Effect on living standards
- infant and child mortality,
- prevalence of specific disease
Access, usage and satisfaction of users
- number of children vaccinated,
- percentage within 5 km of health center
Goods and services generated
- number of nurses
- availability of medicine
Financial and physical resources
- spending in primary health care
Monitoring and IE
Program impacts
confounded by local,
national, global effects
Users meet
Impact evaluation
• Many names (e.g. Rossi et al call this
impact assessment) so need to know the
• Impact is the difference between
outcomes with the program and without it
• The goal of impact evaluation is to
measure this difference in a way that can
attribute the difference to the program, and
only the program
Why it matters
• We want to know if the program had an
impact and the average size of that impact
– Understand if policies work
• Justification for program (big $$)
• Scale up or not – did it work?
• Meta-analyses – learning from others
– (with cost data) understand the net benefits of
the program
– Understand the distribution of gains and losses
What we need
• The difference in outcomes with the
program versus without the program – for
the same unit of analysis (e.g. individual)
• Problem: individuals only have one
• Hence, we have a problem of a missing
counter-factual, a problem of missing data
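The missing-data problem can be written down in the usual potential-outcomes notation. This is a sketch, and the symbols are assumptions introduced here, not taken from the slides: Y_i(1) is the outcome of unit i with the program, Y_i(0) the outcome without it, and D = 1 marks participants.

```latex
% Impact for unit i: only one of the two terms is ever observed.
\[
\Delta_i = Y_i(1) - Y_i(0)
\]
% What a raw participant vs. non-participant comparison measures:
\[
\underbrace{E[Y(1)\mid D=1] - E[Y(0)\mid D=0]}_{\text{observed difference}}
= \underbrace{E[Y(1) - Y(0)\mid D=1]}_{\text{impact on participants}}
+ \underbrace{E[Y(0)\mid D=1] - E[Y(0)\mid D=0]}_{\text{selection bias}}
\]
```

The second line follows by adding and subtracting E[Y(0) | D=1]. The comparison-group methods discussed later are different ways of making the last term zero, or of estimating it.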
Thinking about the counterfactual
• Why not compare individuals before and
after (the reflexive)?
– The rest of the world moves on and you are
not sure what was caused by the program
and what by the rest of the world
• We need a control/comparison group that
will allow us to attribute any change in the
“treatment” group to the program
comparison group issues
• Two central problems:
– Programs are targeted
• Program areas will differ in observable and unobservable
ways precisely because the program intended this
– Individual participation is (usually) voluntary
• Participants will differ from non-participants in observable
and unobservable ways
• Hence, a comparison of participants and an
arbitrary group of non-participants can lead to
heavily biased results
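A small simulation can make the bias from voluntary participation concrete. This is a minimal sketch under invented assumptions: the variable names, the yield equation and all numbers are hypothetical, and "ability" stands in for any unobservable that drives both enrollment and outcomes.

```python
import random
import statistics


def simulate_naive_comparison(n=20000, true_effect=2.0, seed=0):
    """Compare participants with an arbitrary group of non-participants."""
    rng = random.Random(seed)
    participants, non_participants = [], []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)  # unobservable (hypothetical)
        # Voluntary enrollment: higher-ability individuals enroll more often.
        enrolls = ability + rng.gauss(0.0, 1.0) > 0.0
        outcome_without_program = 10.0 + 3.0 * ability + rng.gauss(0.0, 1.0)
        if enrolls:
            participants.append(outcome_without_program + true_effect)
        else:
            non_participants.append(outcome_without_program)
    return statistics.mean(participants) - statistics.mean(non_participants)


naive_estimate = simulate_naive_comparison()
print(f"true effect: 2.00, naive estimate: {naive_estimate:.2f}")
```

In this setup the naive difference comes out well above the true effect, because it also picks up the ability gap between those who enroll and those who do not.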
Example: providing fertilizer to farmers
• The intervention: provide fertilizer to farmers in a
poor region of a country (call it region A)
– Program targets poor areas
– Farmers have to enroll at the local extension office to
receive the fertilizer
– Starts in 2002, ends in 2004, we have data on yields
for farmers in the poor region and another region
(region B) for both years
• We observe that the farmers we provide fertilizer
to have a decrease in yields from 2002 to 2004
Did the program not work?
• Further study reveals there was a national
drought, and everyone’s yields went down
(failure of the reflexive comparison)
• We compare the farmers in the program region
to those in another region. We find that our
“treatment” farmers have a larger decline than
those in region B. Did the program have a
negative impact?
– Not necessarily (program placement)
• Farmers in region B have better quality soil (unobservable)
• Farmers in the other region have more irrigation, which is key
in this drought year (observable)
OK, so let’s compare the farmers in
region A
• We compare “treatment” farmers with their neighbors.
We think the soil is roughly the same.
• Let’s say we observe that treatment farmers’ yields
decline by less than comparison farmers. Did the
program work?
– Not necessarily. Farmers who went to register with the program
may have more ability, and thus could manage the drought better
than their neighbors, but the fertilizer was irrelevant. (individual unobservables)
• Let’s say we observe no difference between the two
groups. Did the program not work?
– Not necessarily. What little rain there was caused the fertilizer to
run off onto the neighbors’ fields. (spillover/contamination)
The comparison group
• In the end, with these naïve comparisons,
we cannot tell if the program had an impact
• We need a comparison group that is as
identical in observable and unobservable
dimensions as possible, to those receiving
the program, and a comparison group that
will not receive spillover benefits.
How to construct a comparison
group – building the counterfactual
Instrumental variables
Regression discontinuity
1. Randomization
• Individuals/communities/firms are randomly assigned
into participation
• Counterfactual: randomized-out group
• Advantages:
– Often referred to as the “gold standard”: by design, selection
bias is zero on average and the mean impact is revealed
– Perceived as a fair process of allocation with limited resources
• Disadvantages:
– Ethical issues, political constraints
– Internal validity (exogeneity): people might not comply with the
assignment (selective non-compliance)
– Unable to estimate entry effect
– External validity (generalizability): usually run controlled
experiment on a pilot, small scale. Difficult to extrapolate the
results to a larger population.
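The claim that selection bias is zero on average under randomization can be checked in a simulation. This is a sketch with hypothetical numbers: the same unobserved "ability" drives outcomes, but assignment is a coin flip, so ability is balanced across the two groups.

```python
import random
import statistics


def simulate_randomized_trial(n=20000, true_effect=2.0, seed=1):
    """Difference in mean outcomes between randomized-in and randomized-out."""
    rng = random.Random(seed)
    treated, control = [], []
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)  # unobservable (hypothetical)
        outcome_without_program = 10.0 + 3.0 * ability + rng.gauss(0.0, 1.0)
        # Random assignment: independent of ability by construction.
        if rng.random() < 0.5:
            treated.append(outcome_without_program + true_effect)
        else:
            control.append(outcome_without_program)
    return statistics.mean(treated) - statistics.mean(control)


randomized_estimate = simulate_randomized_trial()
print(f"true effect: 2.00, randomized estimate: {randomized_estimate:.2f}")
```

The sketch assumes full compliance with the assignment. With selective non-compliance, the simple difference in means no longer measures the effect of actually receiving the program.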
Randomization in our example…
• Simple answer: randomize farmers within
a community to receive fertilizer...
• Potential problems?
– Run-off (contamination) so control for this
– Take-up (what question are we answering)
2. Matching
• Match participants with non-participants from a
larger survey
• Counterfactual: matched comparison group
• Each program participant is paired with one or more non-participants who are similar based on observable characteristics
• Assumes that, conditional on the set of observables, there
is no selection bias based on unobserved heterogeneity
• When the set of variables to match is large, often match
on a summary statistic: the probability of participation as
a function of the observables (the propensity score)
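The pairing step can be sketched as nearest-neighbor matching. To keep the example short it matches on a single observable (a hypothetical land size) instead of an estimated propensity score; with many observables one would first estimate the participation probability and match on that. All names and numbers are invented, and selection here depends only on the observable, which is exactly the assumption matching needs.

```python
import bisect
import random
import statistics


def matched_estimate(n=4000, true_effect=2.0, seed=2):
    """Return (naive difference, matched difference) on simulated data."""
    rng = random.Random(seed)
    participants, non_participants = [], []  # lists of (land, outcome)
    for _ in range(n):
        land = rng.uniform(0.5, 5.0)  # observable (hypothetical)
        # Selection on the observable only: smaller farms join more often.
        joins = rng.random() < 0.8 - 0.12 * land
        outcome = 5.0 + 2.0 * land + rng.gauss(0.0, 1.0)
        if joins:
            participants.append((land, outcome + true_effect))
        else:
            non_participants.append((land, outcome))

    non_participants.sort()
    lands = [land for land, _ in non_participants]
    gaps = []
    for land, outcome in participants:
        # Nearest non-participant by land size (matching with replacement).
        i = bisect.bisect_left(lands, land)
        candidates = [j for j in (i - 1, i) if 0 <= j < len(lands)]
        j = min(candidates, key=lambda k: abs(lands[k] - land))
        gaps.append(outcome - non_participants[j][1])

    naive = (statistics.mean(y for _, y in participants)
             - statistics.mean(y for _, y in non_participants))
    return naive, statistics.mean(gaps)


naive_diff, matched_diff = matched_estimate()
print(f"naive: {naive_diff:.2f}, matched: {matched_diff:.2f}, true: 2.00")
```

If participation also depended on something unobserved, the matched estimate would be biased as well; matching only removes differences in the variables that are matched on.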
2. Matching
• Advantages:
– Does not require randomization or baseline (pre-intervention) data
• Disadvantages:
– Strong identification assumptions
– Requires very good quality data: need to control for
all factors that influence program placement
– Requires significantly large sample size to generate
comparison group
Matching in our example…
• Using statistical techniques, we match a
group of non-participants with participants
using variables like gender, household
size, education, experience, land size
(rainfall to control for drought), irrigation
(as many observable characteristics not
affected by fertilizer)
Matching in our example…
2 scenarios
– Scenario 1: We show up afterwards, we can only
match (within region) those who got fertilizer with
those who did not. Problem?
• Problem: select on expected gains and/or ability
– Scenario 2: The program is allocated based on
historical crop choice and land size. We show up
afterwards and match those eligible in region A with
those in region B. Problem?
• Problems: same issues of individual unobservables, but
lessened because we compare eligible to potentially eligible
• now unobservables across regions
An extension of matching:
pipeline comparisons
• Idea: compare those just about to get an
intervention with those getting it now
• Assumption: the stopping point of the
intervention does not separate two
fundamentally different populations
• example: extending irrigation networks
3. Difference-in-difference
• Observations over time: compare observed
changes in the outcomes for a sample of
participants and non-participants
• Identification assumption: the selection bias is time-invariant (‘parallel trends’ in the absence of the program)
• Counterfactual: changes over time for the non-participants
• Constraint: requires at least two cross-sections of data, pre-program and post-program, on participants and non-participants
– Need to think about the evaluation ex-ante, before the program
• Can be in principle combined with matching to adjust for
pre-treatment differences that affect the growth rate
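The arithmetic behind diff-in-diff is short enough to show directly. The yields below are invented for illustration (they are not data from any program): both groups lose yield in a drought year, and the participants start from a lower level.

```python
def diff_in_diff(treat_pre, treat_post, comp_pre, comp_post):
    """Change for participants minus change for non-participants."""
    return (treat_post - treat_pre) - (comp_post - comp_pre)


# Hypothetical mean yields (tons per hectare), before and after the program.
treat_2002, treat_2004 = 2.0, 1.7
comp_2002, comp_2004 = 2.4, 1.8

reflexive = treat_2004 - treat_2002      # before/after for participants only
cross_section = treat_2004 - comp_2004   # participants vs. others, after only
did = diff_in_diff(treat_2002, treat_2004, comp_2002, comp_2004)

print(f"reflexive: {reflexive:+.2f}")
print(f"cross-section: {cross_section:+.2f}")
print(f"diff-in-diff: {did:+.2f}")
```

With these numbers the reflexive and cross-section comparisons are both negative, while the diff-in-diff estimate is positive: participants' yields fell by less than the comparison group's. That reading is only valid if the two groups would have followed parallel trends without the program.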
Implementing differences in
differences in our example…
• Some arbitrary comparison group
• Matched diff in diff
• Randomized diff in diff
• These are ordered from more problems to
fewer problems; think about this as we look
at this graphically
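As long as the bias is additive and time-invariant, diff-in-diff will work…
[Figure: plot against time, with t=1 marked on the time axis]
What if the observed changes over time
are affected?
[Figure: plot against time, with t=1 marked on the time axis]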
4. Instrumental Variables
• Identify variables that affect participation in the
program, but not outcomes conditional on
participation (exclusion restriction)
• Counterfactual: The causal effect is identified out of the
exogenous variation of the instrument
• Advantages:
– Does not require the exogeneity assumption of matching
• Disadvantages:
– The estimated effect is local: IV identifies the effect of the
program only for the sub-population of those induced to take-up
the program by the instrument
– Therefore different instruments identify different parameters. End
up with different magnitudes of the estimated effects
– Validity of the instrument can be questioned, cannot be tested.
IV in our example
• It turns out that outreach was done
randomly…so the time/intake of farmers
into the program is essentially random.
• We can use this as an instrument
• Problems?
– Is it really random? (roads, etc)
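With a binary instrument such as randomly assigned outreach, the IV estimate reduces to a ratio of two differences in means (the Wald estimator). This is a sketch on simulated data: the enrollment rule, the outcome equation and all numbers are hypothetical, and the program effect is assumed to be the same for everyone.

```python
import random
import statistics


def wald_estimate(n=100000, true_effect=2.0, seed=3):
    """(Outcome gap by instrument) / (participation gap by instrument)."""
    rng = random.Random(seed)
    outcomes = {0: [], 1: []}
    enrolled = {0: [], 1: []}
    for _ in range(n):
        ability = rng.gauss(0.0, 1.0)             # unobservable
        outreach = 1 if rng.random() < 0.5 else 0  # instrument, random
        # Enrollment depends on outreach AND on unobserved ability.
        enrolls = 1 if ability + 1.5 * outreach + rng.gauss(0.0, 1.0) > 1.0 else 0
        outcome = 10.0 + 3.0 * ability + true_effect * enrolls + rng.gauss(0.0, 1.0)
        outcomes[outreach].append(outcome)
        enrolled[outreach].append(enrolls)
    reduced_form = statistics.mean(outcomes[1]) - statistics.mean(outcomes[0])
    first_stage = statistics.mean(enrolled[1]) - statistics.mean(enrolled[0])
    return reduced_form / first_stage


iv_estimate = wald_estimate()
print(f"true effect: 2.00, IV (Wald) estimate: {iv_estimate:.2f}")
```

The estimate rests on outreach affecting outcomes only through enrollment. If outreach followed the roads, and roads also raise yields, the ratio would no longer isolate the program effect.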
5. Regression discontinuity design
• Exploit the rule generating assignment into a program
given to individuals only above a given threshold.
Assumes a discontinuity in participation at the threshold,
but not in counterfactual outcomes
• Counterfactual: individuals just below the cut-off who did
not participate
• Advantages:
– Identification built in the program design
– Delivers marginal gains from the program around the
eligibility cut-off point. Important for program
• Disadvantages:
– Threshold has to be applied in practice, and
individuals should not be able to manipulate the score
used in the program to become eligible.
Figure 1: Kernel Densities of Discriminant Scores and Threshold points by region
[Figure panels include Regions 3, 4, 5, 6, 12, 27 and 28; horizontal axis: Discriminant Score]
Example from Buddelmeyer and
Skoufias, 2005
RDD in our example…
• Back to the eligibility criteria: land size and
crop history
• We use those right below the cut-off and
compare them with those right above…
• Problems:
– How well enforced was the rule?
– Can the rule be manipulated?
– Local effect
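The comparison of units just below and just above the cut-off can be sketched as two local straight-line fits, one on each side, evaluated at the threshold. Everything here is hypothetical: a single eligibility score, a perfectly enforced rule (score at or above the cut-off gets the program), and a made-up outcome equation.

```python
import random
import statistics


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    mean_x, mean_y = statistics.mean(xs), statistics.mean(ys)
    sxx = sum((x - mean_x) ** 2 for x in xs)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    slope = sxy / sxx
    return mean_y - slope * mean_x, slope


def rdd_estimate(n=20000, cutoff=0.0, bandwidth=0.5, true_effect=2.0, seed=4):
    """Jump in the fitted outcome at the cut-off, using a local window."""
    rng = random.Random(seed)
    x_below, y_below, x_above, y_above = [], [], [], []
    for _ in range(n):
        score = rng.uniform(-2.0, 2.0)
        eligible = score >= cutoff  # rule enforced without exceptions
        outcome = 5.0 + 1.5 * score + rng.gauss(0.0, 1.0)
        if eligible:
            outcome += true_effect
        if abs(score - cutoff) <= bandwidth:
            centered = score - cutoff
            if eligible:
                x_above.append(centered)
                y_above.append(outcome)
            else:
                x_below.append(centered)
                y_below.append(outcome)
    intercept_below, _ = fit_line(x_below, y_below)
    intercept_above, _ = fit_line(x_above, y_above)
    return intercept_above - intercept_below


rdd_effect = rdd_estimate()
print(f"true effect: 2.00, RDD estimate at the cut-off: {rdd_effect:.2f}")
```

The estimate applies to units near the cut-off only. A narrower bandwidth keeps the comparison more local but leaves fewer observations, so the choice is a trade-off and not something this sketch settles.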
Discussion example:
building a control group for irrigation
• Scenario: we have a project to extend
existing reaches and build some new
• An initial analysis shows that farmers who
are newly irrigated have increased
yield…was the project a success?
• What is the evaluation question?
• What is a logical comparison group and
Investment operation vs
adjustment/budget support
• Project
– Maybe evaluate all, but unlikely
• Pick subcomponents
• Adjustment/budget support
– Build a strong M&E unit
• Impact evaluation designed by govt
– Evaluate policy reform pilots
– e.g. health insurance pilot, P4P, tariff changes
– Anything economy wide ≠ impact evaluation
Prioritizing for Impact Evaluation
• It is not cheap – relative to monitoring
• Possible prioritization criteria:
– Don’t know if policy is effective
• e.g. conditional cash transfers
– Politics
• e.g. Argentina workfare program
– It’s a lot of money
• Note that 2 & 3 are variants of not
“knowing” – in this context, etc.
Summing up:
• No clear “gold standard” in reality – do
what works best in the context
• Watch for unobservables, but don’t forget
• Be flexible, be creative – use the context
• IE requires good monitoring and
monitoring will help you understand the
effect size
Impact Evaluation and the
Project Cycle
Objective of this part of the
• Walk you through what it takes to do an
impact evaluation for your project from
Identification to ICR
• Persuade you that impact evaluation will
add value to your project
We will talk about…
• General Principles
• In the context of 3 project periods:
– Evaluation activities – the core issues for
evaluation design and implementation, and
– Housekeeping activities—procedural,
administrative and financial management
• Where to go for assistance
Some general principles
• Government ownership as a whole—what
matters is institutional buy-in so that the
results get used
• Relevance and applicability—asking the
right questions
• Flexibility and adaptability
• Horizon matters
• IE can provide one avenue to build institutional
capacity and a culture of managing-by-results – so
the IE should be as widely owned within gov’t as possible
• Agree on a dissemination plan to maximize use of
results for policy development.
• Identify entry points in project and policy cycles
– midpoint and closing, for project;
– sector reporting, CGs, MTEF, budget, for WB
– Budget cycles, policy reviews for gov’t
• Use partnerships with local academics to build
local capacity for impact evaluation.
Relevance and Applicability
• For an evaluation to be relevant, it must be
designed to respond to the policy questions that
are of importance.
• Clarifying early what it is that will be learned and
designing the evaluation to that end will go some
way to ensure that the recommendations of the
evaluation will feed into policy making.
• Make sure to think about unintended
consequences (e.g. export crop promotion shifts
the intrahousehold allocation of power or S.
Africa pensions) – qualitative and
interdisciplinary perspectives are key here
Flexibility and adaptability
• The evaluation must be tailored to the specific project
and adapted to the specific institutional context.
• The project design must be flexible to secure our ability
to learn in a structured manner, feed evaluation results
back into the project and change the project mid-course
to improve project end results.
• Can be broad project redesign or push in new directions
e.g. feed into nutritional targeting design
This is an important point: In the past projects have been
penalized for effecting mid-course changes in project
design. Now we want to make change part of the project
Horizon matters
• The time it takes to achieve results is an important
consideration for timing the evaluation. Conversely, the
timing of the evaluation will determine what outcomes
should be focused on.
– Early evaluations should focus on outcomes that are quick to
show change
– For long-term outcomes, evaluations may need to span beyond
project cycle. e.g. Indonesia school building project
• Think through how things are expected to change over
time and focus on what is within the time horizon for the
Do not confuse the importance of an outcome with the time
it takes for it to change—some important outcomes are
obtained instantaneously!
But don’t be afraid to look at intermediate outcomes either
Stage 1:
Identification to PCN
Get an Early Start
How do you get started?
• Get help and access to resources: contact
person in your region or sector responsible for
impact evaluation and/or Thematic Group on
Impact Evaluation
• Define the timing for the various steps of the
evaluation to ensure you have enough lead time
for preparatory activities (e.g. baseline goes to
the field before program activities start)
• The evaluation will require support from a range
of policy-makers: start building and maintaining
constituents, dialogue with relevant actors in
government, build a broad base of support,
include stakeholders
Build the Team
• Select impact evaluation team and define
responsibilities of:
program managers (government),
WB project team, and other donors,
lead evaluator (impact evaluation specialist),
local research/evaluation team, and
data collection agency or firm
Selection of lead evaluator is critical for ensuring
quality of product, and so is the capacity of the data
collection agency
• Partner with local researchers and research
institutes to build local capacity
Shift Paradigm
• From a project design based on “we know what’s best”
• To project design based on the notion that “we can learn
what’s best in this context, and adapt to new knowledge as
Work iteratively:
– Discuss what the team knows and what it needs to learn–the
questions for the evaluation—to deliver on project objectives
– Discuss translating this into a feasible project design
– Figure out what questions can feasibly be addressed
– Housekeeping: Include these first thoughts in a paragraph in the
• e.g. ARV evaluation – funding constraints shifted radically,
quickly – design changed, and changed again
Stage 2:
Preparation through appraisal
Define project development
objectives and results framework
• This activity
– clarifies the results chain (logic of impacts) for the
– identifies the outcomes of interest and the indicators
best suited to measure changes in those outcomes,
– the expected time horizon for changes in those
• This will provide the lead evaluator with the
project specific variables that must be included
in the survey questionnaire and a notion of
timing for scheduling data collection.
Work out project design features
that will affect evaluation design
• Target population and rules of selection
– This provides the evaluator with the universe for the
treatment and comparison sample
• Roll out plan
– This provides the evaluation with a framework for
timing data collection and, possibly, an opportunity to
define a comparison group
• Think about changes that do not undermine project
objectives but will enhance the evaluation (and this will
likely be iterative)
Narrow down the questions for the
• Questions aimed at measuring the impact of
the project on a set of outcomes, and
• Questions aimed at measuring the relative
effectiveness of different features of the
Questions aimed at measuring the impact of
the project are relatively straightforward
• What is your hypothesis? (Results framework)
– By expanding water supply, the use of clean water will increase,
water borne disease decline, and health status will improve
• What is the evaluation question?
– Does improved water supply result in better health outcomes?
• How do you test the hypothesis?
– The government might randomly assign areas for expansion in water
supply during the first and second phase of the program
• What will you measure?
– Measure the change in health outcomes in phase I areas relative to
the change in outcomes in phase II areas. Outcomes will include use
of safe water (S-T), incidence of diarrhea (S/M-T), and health status
(L-T, depending on when phase II occurs). Add other outcomes.
• What will you do with the results?
– If the hypothesis proves true go to phase II; if false, modify policy.
Questions aimed at measuring the relative
effectiveness of different project features
require identifying the tough design choices on the
• What is the issue?
– What is the best package of products or services?
• Where do you start from (what is the counterfactual)?
– What package is the government delivering now?
• Which changes do you or the government think could
be made to improve effectiveness?
• How do you test it?
– The government might agree to provide a package to a
randomly selected group of households and another
package to another group of households to see how the
two packages perform
• What will you measure?
– The average change in relevant outcomes for
households receiving one package versus the same for
households receiving the other package
• e.g. extension vs fertilizer+extension vs
• What will you do with the results?
– The package that is most effective in delivering
desirable outcomes becomes the one adopted by the
project from the evaluation onwards
Application, features that should be
tested early on
• Early testing of project features (say 6 months to
1 year) can provide the team with the
information needed to adjust the project early on
in the direction most likely to deliver success.
• Features might include:
– alternative modes of delivery (e.g. use seed
merchants vs. extension agents),
– alternative packages of outputs, or
– different pricing schemes (e.g. alternative subsidy
Develop identification strategy
(to identify the impact of the project separately from changes due to
other causes )
• Once the questions are defined, the lead
evaluator selects one or more comparison
groups against which to measure results in the
treatment group.
• The “rigor” with which the comparison group is
selected will determine the reliability of the
impact estimates.
• Rigor?
– More: same observables and unobservables
– Less: same observables (non-experimental)
Explore Existing Data
• Explore what data exists that might be relevant for use in
the evaluation.
– Discuss with the agencies of the national statistical system and
universities to identify existing data sources and future data
collection plans.
– Check DECDG website
• Record data periodicity, quality, variables covered and
sampling frame and sample size, for
Surveys (household, firms, facility, etc)
Administrative data
Data from the project monitoring system
New Data
• Start identifying additional data collection needs.
– Data for impact evaluation must be representative of treatment and
comparison group
– Questionnaires must include outcomes of interest (consumption,
income, assets etc), questions about the program in question and
questions about other programs, as well as control variables
– The data might be at household, community, firm, facility, or farm
levels and might be combined with specialty data such as those
from water or land quality tests.
• Investigate synergies with other projects to combine data
collection efforts and/or explore existing data collection
efforts on which the new data collection could piggy back
• Develop a data strategy for the impact evaluation including:
The timing for data collection
The variables needed
The sample (including size)
Plans to integrate data from other sources (e.g project monitoring
Prepare for collecting data
• Identify data collection agency
• Lead evaluator or team will work with the data
collection agency to design sample, and train
• Lead evaluator or team will prepare survey
questionnaire or questionnaire module as
• Pre-testing survey instrument may take place at
this stage to finalize instruments
• If financed with outside funds, baseline can now
go to the field. If financed by project funds,
baseline will go to the field just after
effectiveness but before implementation starts
Develop a Financial Plan
• Costs:
Lead evaluator and research/evaluation team,
Data collection,
Supervision and
• Finances:
Trust fund,
Research grants,
Project funds, or
Other donor funds
• Initiate an IE activity. The IE code in SAP is a way of
formalizing evaluation activities. The IE code recognizes
the evaluation as a separate AAA product.
– Prepare concept note
– Identify peer reviewers – impact evaluation and sector specialist
– Carry out review process
• Appraisal documents
– Include in the project description plans to modify project over time
to incorporate results
– Work the impact evaluation into the M&E section of the PAD and
Annex 3
• Include the impact evaluation in the Quality
Enhancement Review (TTL).
Stage 3:
Negotiations to Completion
Ensure timely implementation
• Ensure timely procurement of evaluation
services especially contracting the data
collection, and
• Supervise timely implementation of the
evaluation including
– Data collection
– Data analysis
– Dissemination and feedback
Data collection agency/firm
• Data collection agency or firm must have
technical knowledge and sufficient
logistical capacity relative to the scale of
data collection required
• The same agency or firm should be
expected to do baseline and follow up data
collection (and use the same survey
Baseline data collection and analysis
• Baseline data collection should be carried
out before program implementation
begins; optimally even before program is
• Analysis of baseline data will provide
program management with additional
information that might help finalize
program design
Follow-up data collection and
• The timing of follow-up data collection
must reflect the learning strategy adopted
• Early data collection will help modify
programs mid-course to maximize longer-term effectiveness
• Later data collection will confirm
achievement of longer-term outcomes and
justify continued flows of fiscal resources
into the program
Watch implementation closely from
an evaluation point of view
• Watch (monitor) what is actually being implemented:
– Will help understand results of evaluation
– Will help with timing of evaluation activities
• Watch for contamination in the control group
• Watch for violation of eligibility criteria
• Watch for other programs for the same beneficiaries
• Look for unintended impacts
• Look for unexploited evaluation opportunities
• Good evaluation team communication is key here
• Implement plan for dissemination of evaluation results
ensuring that the timing is aligned with government’s
decision making cycle.
• Ensure that results are used to inform project
management and that available entry points are
exploited to provide additional feedback to the
• Ensure that wider dissemination takes place only after
the client has had a chance to preview and discuss the
• Nurture collaboration with local researchers throughout
the process
• Put in place arrangements to procure the
impact evaluation work and fund it on time
• Use early results to inform mid-term
• Use later results to inform the ICR, CAS
and future operations
Summing up:
• Making evaluation work for you requires a
change in the culture of project design and
implementation, one that maximizes the use of
learning to change course when necessary and
improve the chances for success
• Impact evaluation is more than a tool – it is an
organizing analytical framework for doing this –
it is not about measuring success or failure so
much as it is about learning…
Where to go for assistance / more
• Clinics
– Brochure here, PREM
• TG resources
Searchable database of evaluations
Searchable roster of consultants
Doing IE series – general and sector notes
Website (http://impactevaluation)
• Courses – workshop on IE, WBI training, PAL
• South Asia resources: Jishnu Das (12/06)
Thank you