ApEc 8212 Econometric Analysis II -- Lecture #16

Models of Sample Selection and Attrition
(Wooldridge, Chapter 19, Sections 1-6)
I. Introduction
So far in this class we have assumed that the data we
have are a random sample from some underlying
population. But sometimes the data collected are not
a random sample, and sometimes the relationships of
interest are not observed for some part of the
population, although in theory the relationship exists
for all members of the population. Examples are:
1. We are interested in estimating the determinants
of women’s wages, not just of women who are
working but of all women (those who are not
working would earn a wage if they did work).
2. We want to estimate the impact of some
education policy on the test scores of high
school students, but the data are collected only
from students currently in school and thus data
do not exist for students who “drop out”.
The first example is of data censoring, where we
have a random sample of the population but for some
members of the sample (and perhaps some of the
population) we are missing data on one or more of
the variables. More specifically, we observe those
variables only if they fall within a particular range.
The second example is data truncation. We don’t
have a random sample of the population of interest.
A third possibility is incidental truncation. Certain
variables are observed only if other variables take
particular values. The first example fits this.
If the data are not a random sample then we say that
they are a selected sample, and that some kind of
selection mechanism generates the selected sample.
II. Data Censoring
Let’s start with an example. In some data sets if the
value of a variable is “too high” it is just set at some
large value for that variable. For example, for a
household income variable, for any households with
an annual income of $200,000 or more the income
variable is set to $200,000. This is called top coding.
It is important to realize that this is different from a
corner solution (Tobit) model where, say, there are a
lot of values at zero (or some other number). In a
Tobit the “real values” of the variable are in fact zero.
Yet in data censoring the “real” values are not equal
to the “top code” value. However, the estimation
methods are the same, which sometimes confuses
people when it is time to interpret the estimates.
Let’s start with a simple linear model:
y = x′β + u, with E[u| x] = 0
Let w be the observed value of y. In the above top
coding example, we would have:
w = min(y, 200,000)
Binary censoring
Sometimes the w variable is simply a binary (dummy)
variable. Suppose we want to estimate the willingness
of the population in some community to pay for a
public good, such as a public park. Let wtp represent
this “willingness to pay”. Since people may have
trouble telling interviewers their precise willingness to
pay, one approach is to randomly choose a value, call
it “r” (which could vary for the people in the survey),
and ask survey respondents a very simple question:
Are you willing to pay r for a new public park?
Let yi be the willingness to pay for person i. Person
i’s answer to the above question, when asked for a
“reference” value of ri, can be coded as:
wi = 1[yi > ri]
Assume that yi = xi′β + ui. How can we estimate β
with such data? If we are willing to assume that:
ui| xi, ri ~ N(0, σ2)
then we can use probit estimation if we make the
further assumption that ri is independent of ui
(independent of yi conditional on xi):
D(yi| xi, ri) = D(yi| xi)
This allows us to specify the probability that wi = 1,
conditional on xi and ri as:
Prob[wi = 1| xi, ri] = Prob[yi > ri| xi, ri]
= Prob[ui/σ > (ri – xi′β)/σ| xi, ri]
= 1 – Φ((ri – xi′β)/σ) = Φ((xi′β – ri)/σ)
Question: In the probit model, we could only
estimate β/σ, not β or σ separately. Is that also the
case here? [Hint: We know the values of the ri’s]
We can use maximum likelihood methods to estimate
this model. However, as with the standard probit, if u
is heteroscedastic or not normally distributed then our
estimates will be inconsistent.
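To see how this works in practice, here is a minimal simulation sketch (hypothetical data and parameter values; Python with numpy/scipy assumed, not part of the lecture). Because the ri are known and vary across respondents, both β and σ can be estimated, not just β/σ:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(0)
n = 5000
x = np.column_stack([np.ones(n), rng.normal(size=n)])  # constant + one regressor
beta_true, sigma_true = np.array([1.0, 2.0]), 1.5
y = x @ beta_true + sigma_true * rng.normal(size=n)    # latent willingness to pay
r = rng.uniform(-2.0, 4.0, size=n)                     # randomized reference values
w = (y > r).astype(float)                              # observed binary answers

def negloglik(theta):
    b, s = theta[:2], np.exp(theta[2])                 # exponential keeps s > 0
    p = norm.cdf((x @ b - r) / s)                      # Prob[w = 1 | x, r]
    p = np.clip(p, 1e-12, 1 - 1e-12)
    return -np.sum(w * np.log(p) + (1 - w) * np.log(1 - p))

res = minimize(negloglik, x0=np.zeros(3), method="BFGS")
beta_hat, sigma_hat = res.x[:2], np.exp(res.x[2])
```

Unlike a standard probit, σ is recovered here because the known ri enter the index with a fixed coefficient of 1, which pins down the scale.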
If we could get people to tell us their willingness to
pay (their value for y), we could use linear methods,
but we probably should not use this specification if a
substantial proportion report a willingness to pay of
zero. In this case, a (Type I) Tobit specification
makes more sense. See p.782 of Wooldridge for
further discussion and a couple other ideas for
specifying willingness to pay.
Interval Coding
Sometimes the value of y is not the precise value but
only an “ordered” indicator variable that denotes
which “interval” y falls into. The most common
example is income; in some household surveys
respondents are not asked their precise income, but
only what “range” it falls in. This type of data is
called interval-coded (or interval censored) data.
Assume again that E[y| x] = x′β. Let the known
interval limits be r1 < r2 < … < rJ. The censored
variable, w, is related to y as follows:
w = 0 <=> y ≤ r1
w = 1 <=> r1 < y ≤ r2
…
w = J <=> y > rJ
In terms of estimation this is very much like an
ordered probit or logit. If we assume that the error
term is normally distributed we have an ordered
probit, and the log likelihood function is given by:
ℓi(β, σ) = 1[wi = 0]ln[Φ((r1 - xi′β)/σ)]
+ 1[wi = 1]ln[Φ((r2 - xi′β)/σ) - Φ((r1 - xi′β)/σ)] + …
… + 1[wi = J]ln[1 - Φ((rJ - xi′β)/σ)]
Question: In the standard ordered probit, we can only
estimate β/σ. Is that the case here? What is the
intuition for your answer? What does it imply for
estimating partial effects?
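The log likelihood above translates directly into code. A sketch under made-up cutpoints and parameter values (Python assumed; not from the text):

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(1)
n = 4000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
beta_true, sigma_true = np.array([0.5, 1.0]), 1.0
y = x @ beta_true + sigma_true * rng.normal(size=n)  # latent y, never observed
cuts = np.array([-1.0, 0.0, 1.0, 2.0])               # known limits r1 < ... < rJ
w = np.searchsorted(cuts, y)                         # w = 0, ..., J: interval index

def negloglik(theta):
    b, s = theta[:2], np.exp(theta[2])
    lims = np.concatenate([[-np.inf], cuts, [np.inf]])  # pad so every w has bounds
    lo = (lims[w] - x @ b) / s
    hi = (lims[w + 1] - x @ b) / s
    p = np.clip(norm.cdf(hi) - norm.cdf(lo), 1e-12, None)
    return -np.sum(np.log(p))

res = minimize(negloglik, x0=np.zeros(3), method="BFGS")
beta_hat, sigma_hat = res.x[:2], np.exp(res.x[2])
```

Because the cutpoints are known constants rather than estimated thresholds, both β and σ are identified, which is what distinguishes this from the standard ordered probit.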
Additional comments for this model (see
Wooldridge, p.784):
1. Sometimes the r’s vary over different observations.
That does not cause any problems for estimation.
2. If any x variables are endogenous, this can be
fixed using the Rivers and Vuong (1988) method.
3. Panel data methods (random effects) can be used.
Censoring from Above and Below
Recall the “top coding” example above. It is straightforward to estimate this using the Tobit approach
discussed in the previous lecture. The important
thing is to be careful when interpreting the results. If
y really does have values beyond the “top code”
(which it does) then this is “real” censoring and what
we want to estimate is β; we are not interested in
estimating the probability that an observation hits the
“top code”. In contrast, in “corner solution” models
we are interested in that probability. For a more
detailed discussion, see Wooldridge, pp.785-790.
Wooldridge also provides an overview of sample
selection, with 2 examples, on pp.790-792.
III. When is Sample Selection NOT a Problem?
Sample selection does not always lead to bias if
standard methods (e.g. OLS) are applied to the
selected sample. In general, if sample selection is
based on exogenous explanatory variables, that is on
variables that are uncorrelated with the error term in
the equation of interest, then there is no problem with
applying standard methods to the selected sample.
Let’s start with linear models, both OLS and 2SLS
(IV). The model is:
y = β1x1 + β2x2 + … + βKxK + u = x′β + u
E[zu] = 0
where z is a vector of L instruments for possible use
in IV estimation. Note that x1 is just a constant.
If x = z then E[xu] = 0, so we can use OLS:
E[y| x] = x′β
Returning to the general case where some elements in
x may be correlated with u, let s be a binary (dummy
variable) selection indicator. For any member of the
population s = 1 indicates that the observation is not
“blocked” from being drawn into our sample, but if s
= 0 then it is “blocked”, i.e. cannot be in our sample.
Suppose we draw a random sample of {xi, yi, zi, si}
from some population. In fact, if si = 0 it cannot be
in our sample. The 2SLS estimate of β using the
observed data, which can be denoted as βˆ 2SLS, is:
i 1
i 1
i 1
βˆ 2SLS = [((1/N)  sixizi′)((1/N)  sizizi′)-1((1/N)  sizixi′)]-1
i 1
i 1
× ((1/N)  sixizi′)((1/N)  sizizi′) ((1/N)  siziyi)
i 1
(Notice that when si = 0 the observation is dropped
from the estimation.) Next, replace yi with xi′β + ui:
i 1
i 1
i 1
βˆ 2SLS = β + [((1/N)  sixizi′)((1/N)  sizizi′)-1((1/N)  sizixi′)]-1
i 1
i 1
× ((1/N)  sixizi′)((1/N)  sizizi′) ((1/N)  siziui)
i 1
Everthing to the right of β “disappears” if E[siziui] = 0.
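The estimator is only a few lines of matrix algebra. A hypothetical numpy sketch (simulated data; the instrument and selection rule are made up) in which selection depends only on the instrument, so E[szu] = 0 holds and 2SLS on the selected sample is consistent:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20000
z2 = rng.normal(size=n)                         # excluded instrument
u = rng.normal(size=n)
x2 = 0.8 * z2 + 0.5 * u + rng.normal(size=n)    # endogenous regressor
x = np.column_stack([np.ones(n), x2])           # x1 is the constant
z = np.column_stack([np.ones(n), z2])
y = x @ np.array([1.0, 2.0]) + u
s = (z2 > -0.5).astype(float)                   # selection is a function of z only

# sample moments, each weighted by the selection indicator s_i
Sxz = (s[:, None] * x).T @ z / n                # (1/N) sum s_i x_i z_i'
Szz = (s[:, None] * z).T @ z / n                # (1/N) sum s_i z_i z_i'
Szy = (s[:, None] * z).T @ y / n                # (1/N) sum s_i z_i y_i
A = Sxz @ np.linalg.inv(Szz)
beta_hat = np.linalg.solve(A @ Sxz.T, A @ Szy)  # the 2SLS formula above
```

Changing the selection rule to depend on u (e.g. s = 1[u > 0]) would break the consistency, as the theorem below makes precise.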
More formally, we have the following theorem:
Theorem 19.1: Consistency of 2SLS (and OLS)
under Sample Selection
Assume that E[u2] < ∞, E[xj2] < ∞ for all j = 1, … K,
and E[zj2] < ∞ for all j = 1, … L. Assume also that:
E[szu] = 0
rank{E[zz′| s = 1]} = L
rank{E[zx′| s = 1]} = K
Then plim[βˆ 2SLS] = β, and βˆ 2SLS is asymptotically
normally distributed.
The assumption that E[szu] = 0 is key, so it merits
further discussion. Note first that E[zu] = 0 does not
imply that E[szu] = 0. However, if E[zu] = 0 and s is
independent of z and u then we have:
E[szu] = E[s]E[zu] = 0
The assumption that s is independent of z and u is
very strong. In effect it assumes that the censored
observations are “dropped randomly”, which if true
implies that censoring does not lead to bias. This can
be called missing completely at random (MCAR).
A somewhat more realistic assumption is that the
selection (censoring) is a function of the exogenous
variables but not a function of u. That is, conditional
on z, u and s are uncorrelated:
E[u| z, s] = 0
You should be able to show (applying iterated
expectations) that this implies E[szu] = 0. This is
sometimes called exogenous sampling.
That is, after conditioning on (controlling for) z, s has
no predictive power for (and so is uncorrelated with) u.
To see why selection that is a function only of the
exogenous variables implies that E[u| z, s] = 0,
“strengthen” the assumption E[zu] = 0 to E[u| z] = 0.
Then when selection is a function only of z, which
can be expressed as s = h(z), we have:
E[u| z] = 0 => E[u| z, h(z)] = 0, which => E[u| z, s] = 0
More generally, if s is independent of z and u (which is
true if s is independent of y, z and x), then E[u| z, s] = 0.
Note that if we don’t need to instrument for x, that is
if E[u| x, s] = 0, then we have:
E[y| x, s] = E[y| x] = x′β
Sometimes the assumption that E[y| x, s] = x′β is
called missing at random (MAR).
If we make the more general assumption that,
conditional on z, u and s are independent, that is
D[s| z, u] = D[s| z], then:
Prob[s = 1| z, u] = Prob[s = 1| z]
If we add the homoscedasticity assumption that E[u2|
z, s] = σ2, then the standard estimate of the
covariance matrix for βˆ 2SLS is valid (for details see
p.796 of Wooldridge). If there is heteroscedasticity
we can use the heteroscedastic-robust standard errors
for 2SLS (see pp.16-17 of Lecture 4).
For OLS, analogous results hold. Just replace z with
x in everything (including the assumptions).
A final useful result occurs when s is a non-random
function of x and some variable v not in x: s = s(x, v).
Here we allow u and s to be correlated: E[u| s] ≠ 0.
If the joint distribution of u and v is independent of x,
then E[u| x, v] = E[u| v]. This implies that:
E[y| x, s(x, v)] = E[y| x, v] = x′β + E[u|v]
Assuming a particular functional form for E[u| v],
such as a linear form so E[u| v] = γv, implies
E[y| x, s] = x′β + γv
Thus adding v as a regressor will give consistent
estimates of β (and γ) even for a sample that includes
only the observations with s = 1.
Similar results hold for nonlinear models. See
Wooldridge, pp.798-799.
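A small simulation (hypothetical numbers; Python assumed) makes the contrast in this section concrete: OLS on a sample selected on an exogenous x remains consistent, while OLS on a sample selected on the error term does not:

```python
import numpy as np

rng = np.random.default_rng(3)
n = 50000
x = np.column_stack([np.ones(n), rng.normal(size=n)])
u = rng.normal(size=n)
y = x @ np.array([1.0, 2.0]) + u

def ols(keep):
    """OLS of y on x using only the observations with keep == True."""
    return np.linalg.lstsq(x[keep], y[keep], rcond=None)[0]

b_exog = ols(x[:, 1] > 0)   # selection on x: E[u | x, s] = 0 still holds
b_endog = ols(u > 0)        # selection on u: E[u | s = 1] = E[u | u > 0] > 0
```

Selecting on u > 0 shifts the estimated intercept up by roughly E[u| u > 0] = √(2/π) ≈ 0.8, while selecting on x leaves both coefficients essentially at their true values.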
IV. Selection Based on y (Truncated Regression)
Suppose that inclusion in the sample is based on the
value of the y variable. One example is a 1970s study
of the impact of a “negative income tax”. This was an
experimental study that excluded households whose
income was more than 1.5 times the poverty line.
Suppose that y is a continuous variable, and that data
are available for y (and x) if the following holds:
a1 < yi < a2, which implies si = 1[a1 < yi < a2]
where 1[ ] is an indicator function that = 1 if the
condition inside the brackets holds, otherwise it = 0.
Continue to assume that E[yi| xi] = xi′β. We are
interested in estimating β. The selection rule depends
on y, and thus depends on u. If we simply apply OLS
to the sample for which si = 1, the estimates of β will
be inconsistent (draw a picture to give intuition).
Maximum likelihood estimation will give consistent
estimates. This requires specification of the conditional
density function of yi, conditional on xi: f(y| xi; β, γ),
where γ is a vector of parameters regarding u.
It is easier to start with the cdf (cumulative distribution
function) of yi conditional on xi and si = 1 (a1 < yi < a2):
P[yi ≤ y| xi, si = 1] = P[yi ≤ y, si = 1| xi]/P[si = 1| xi]
Because we are conditioning on si = 1 we need to
divide the probability that both yi ≤ y and si = 1 by
the probability that si = 1.
P[si = 1| xi] = P[a1 < yi < a2| xi], which is equal to
F(a2| xi; β, γ) - F(a1| xi; β, γ). If a2 = ∞ (no upper limit
to y) then F(a2| xi; β, γ) = 1. If a1 = -∞ (no lower limit
to y) then F(a1| xi; β, γ) = 0.
What about the numerator of the above expression?
For any y between a1 and a2 we have:
P[yi ≤ y, si = 1| xi] = P[a1 < yi < y| xi]
= F(y| xi; β, γ) - F(a1| xi; β, γ)
The density corresponding to this cdf, conditional on
xi and si = 1, is obtained by differentiating the cdf
with respect to y:

f(y| xi; si = 1) = f(y| xi; β, γ)/[F(a2| xi; β, γ) − F(a1| xi; β, γ)]
[Wooldridge’s notation is p(y| xi; si = 1) for the density.]
To estimate β we need to specify the distribution of u
and then use (conditional) maximum likelihood
methods (we are conditioning on si = 1). Replace y in
the above expression for f(y| xi; si = 1) with yi for all
observed data. Typically we assume that u is
normally distributed with variance σ2. This amounts
to assuming that y ~ N(x′β, σ2). This is called the
truncated Tobit model. In this case γ is just σ2.
Question: How is this different from the Tobit model
discussed in Lecture 15?
Unfortunately, if the normality and homoscedasticity
assumptions are false, we get inconsistent estimates.
V. A Probit Selection Equation (selection on “y2”)
A more complicated model involves missing
observations for y (and perhaps some of the
associated x’s) based on the value of some other
endogenous variable, call it “y2”. The most common
example of this is when we do not observe wages for
people who are not working. We can think of
another equation, one for hours (y2). We do not
observe y (wages) when y2 = 0 (the person is not
working). The first paper that examined this in detail
was Gronau (1974); this example is explained in
Wooldridge, pp.802-803. More simply, y2 can be a
dummy variable that = 0 when a person is not
working and = 1 if a person is working.
The following 2 equations give the general model:
y1 = x1′β1 + u1
y2 = 1[x′δ2 + v2 > 0]
where again 1[ ] is an indicator function and x1 is a
subset of the variables in x. These 2 equations are
sometimes called the Type II Tobit Model. It is
useful to state some assumptions that will be used in
much of the rest of this lecture:
Assumption 19.1: (a) x and y2 are always observed,
but y1 is observed only if y2 = 1.
(b) Both u1 and v2 are independent
of x, but perhaps not of each other.
(c) v2 ~ N(0,1)
(d) E[u1| v2] = γ1v2
Note that (b) implies that if we observed all the
observations we could estimate E[y1| x] using OLS.
So the problem here is not endogenous x variables.
So, how can we estimate this thing? Any random
“draw” from a population gives the following for
some observation i: {y1, y2, x, u1 and v2}. The
problem is that when y2 = 0 we do not observe y1,
although it does really exist (e.g. if a person did
choose to work he or she would get some wage).
We definitely observe, and thus we can hope to
estimate, E[y1| x, y2 = 1]. We can also estimate
P[y2 = 1| x], which under the assumptions will give
us an estimate of δ2. But with this can we estimate
β1? The answer is: yes, we can. To start, note that:
E[y1| x, v2] = x1′β1 + E[u1| x, v2]
= x1′β1 + E[u1| v2] = x1′β1 + γ1v2
One useful implication of this expression is that when
γ1 = 0 then E[y1| x, v2] = x1′β1 = E[y1| x]. In this case
we can estimate β1 by running OLS on the observations
for which y1 is observed.
For the case where γ1 ≠ 0 we can modify the above
(conditioning on y2 is more useful than conditioning
on v2 since we observe y2 but not v2):
E[y1| x, y2] = E[E[y1| x, v2]| x, y2]
= x1′β1 + γ1E[v2| x, y2] = x1′β1 + γ1h(x, y2)
where h(x, y2) is defined as E[v2| x, y2]. [The first
equality holds by the law of iterated expectations; (x,
y2) is the “smaller information set” relative to (x, v2)
because x and v2 together tell us what y2 is, but x and
y2 together do not tell us what v2 is (they just give a
range for v2).]
If we knew the functional form of h(x, y2), then by
Theorem 19.1 we could create the variable h(x, y2)
from x and y2 and then regress y1 on x1 and h(x, y2) to
obtain a consistent estimate of β1 using only the
observations for which y1 is observed.
In particular, if y1 is observed then we know y2 = 1.
Thus we need to know h(x, 1) = E[v2| x, y2 = 1] =
E[v2| x, v2 > -x′δ2] = E[v2| v2 > -x′δ2] = λ(x′δ2), where
λ(x′δ2) = φ(x′δ2)/Φ(x′δ2). (The second to last equality
follows from the independence of v2 and x.)
Thus we have:
E[y1| x, y2 = 1] = x1′β1 + γ1λ(x′δ2)
This suggests the following procedure for estimating
β, which was introduced by Heckman (1976, 1979):
1. Estimate δ2 by running a probit on y2 and x.
2. Generate λ(x′δ̂2).
3. Regress y1 on x1 and on λ(x′δ̂2) to estimate β1
and γ1.
This procedure gives estimates of β1 and γ1 that are
consistent and asymptotically normal. You can test
whether there is “sample selection bias” by testing
whether γ1 = 0.
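The two-step procedure is straightforward to code. A minimal sketch on simulated data (variable names and parameter values are made up; Python assumed). Note that the second-step standard errors are not corrected here:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(5)
n = 20000
x2 = rng.normal(size=n)                  # appears in both x1 and x
z = rng.normal(size=n)                   # exclusion restriction: in x, not x1
v2 = rng.normal(size=n)
u1 = 0.7 * v2 + 0.5 * rng.normal(size=n)           # E[u1 | v2] = 0.7 v2
y2 = (0.5 + x2 + z + v2 > 0).astype(float)         # selection equation
y1 = 1.0 + 2.0 * x2 + u1                           # observed only if y2 = 1

# Step 1: probit of y2 on x = (1, x2, z) to estimate delta2
X = np.column_stack([np.ones(n), x2, z])
def probit_nll(d):
    q = 2 * y2 - 1                                  # +1 / -1 coding
    return -np.sum(norm.logcdf(q * (X @ d)))
d_hat = minimize(probit_nll, np.zeros(3), method="BFGS").x

# Step 2: inverse Mills ratio, then OLS on the selected sample
xb = X @ d_hat
mills = norm.pdf(xb) / norm.cdf(xb)                 # lambda(x' delta2_hat)
sel = y2 == 1
W = np.column_stack([np.ones(n), x2, mills])[sel]
coef = np.linalg.lstsq(W, y1[sel], rcond=None)[0]   # (beta1, gamma1_hat last)
```

A significant coefficient on mills (here γ1 = 0.7 by construction) signals sample selection bias; without the excluded variable z, identification would rest entirely on the nonlinearity of λ.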
When γ1 ≠ 0 things get a little messy because it
introduces heteroscedasticity. This heteroscedasticity
is not hard to fix; the harder problem is that the
standard errors for β1 and γ1 need to be recalculated
because δ̂2 is an estimate of δ2, and the variance of
this estimate must be accounted for in calculating the
covariance matrix of the estimates of β1 and γ1.
An important point is that, technically speaking, the
assumption that v2 is normally distributed means that
we do not have to have any “identifying” variables in
x beyond what is in x1. This sort of “identification
from a functional form assumption” (the assumption
that v2 is normally distributed) is not very credible. It
is much more convincing to have some variable in x
that is not in x1, and to have a sound (economic)
theoretical reason for excluding that variable from x1.
Two final notes:
1. If you assume that u1 is also normally
distributed, then you can use (conditional)
maximum likelihood, which is more efficient
than this 2-step method.
2. Some recent methods have been developed that
do not require either u1 or v2 to follow any
particular distribution. References are Ahn and
Powell (1993) and Vella (1998).
Endogenous Explanatory Variables
Suppose that one of the x variables is endogenous.
This leads to the model:
y1 = z1′δ1 + α1y2 + u1
y2 = z′δ2 + v2
y3 = 1[z′δ3 + v3 > 0]
The errors u1, v2 and v3 are all assumed to be
uncorrelated with z but they could be correlated with
each other. We are primarily interested in estimating
δ1 and α1, the parameters for the structural equation
of interest.
The following assumption clarifies what data are observed:
Assumption 19.2: (a) z and y3 are always observed,
y1 and y2 are observed if y3 = 1.
(b) u1 and v3 are independent of z.
(c) v3 ~ N(0, 1)
(d) E[u1| v3] = γ1v3
(e) E[zv2] = 0, and z′δ2 = z1′δ21 + z2′δ22,
where δ22 ≠ 0.
Note that (e) is needed to identify the first equation.
We can write the structural equation as:
y1 = z1′δ1 + α1y2 + g(z, y3) + e1
where g(z, y3) = E[u1| z, y3] and e1 ≡ u1 − E[u1| z, y3]. By
definition, E[e1| z, y3] = 0. Note: g(z, y3) here plays the
same role that h(x, y2) played above (a control for
selection bias).
If we knew g(z, y3), we could just estimate this equation
using 2SLS on the selected sample. Since we do not
know it, we need to estimate it, just as we did above
where we used OLS after generating λ( ).
Thus we can use the following procedure:
1. Estimate δ3 for the selection equation using a probit.
2. Generate the inverse Mills ratio λ(z′δ̂3).
3. Estimate the following equation on the selected
sample, using 2SLS:
y1 = z1′δ1 + α1y2 + γ1λ(z′δˆ 3) + error
The IVs are z and λ(z′δˆ 3).
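A sketch of the full procedure on simulated data (all names and parameter values are hypothetical; Python assumed). The probit for y3 supplies the inverse Mills ratio, and y2 is then instrumented in the second step:

```python
import numpy as np
from scipy.optimize import minimize
from scipy.stats import norm

rng = np.random.default_rng(6)
n = 30000
z1 = rng.normal(size=n)                 # exogenous, in the structural equation
z2 = rng.normal(size=n)                 # excluded instrument for y2
z3 = rng.normal(size=n)                 # excluded variable driving selection
v3 = rng.normal(size=n)
u1 = 0.6 * v3 + 0.5 * rng.normal(size=n)       # E[u1 | v3] = 0.6 v3
y2 = z1 + z2 + 0.5 * u1 + rng.normal(size=n)   # endogenous regressor
y1 = 1.0 + 1.5 * z1 + 0.8 * y2 + u1            # structural equation
y3 = (0.3 + z1 + z3 + v3 > 0).astype(float)    # selection equation

# Step 1: probit for y3 on z, then the inverse Mills ratio
Z = np.column_stack([np.ones(n), z1, z2, z3])
def probit_nll(d):
    q = 2 * y3 - 1
    return -np.sum(norm.logcdf(q * (Z @ d)))
d3 = minimize(probit_nll, np.zeros(4), method="BFGS").x
mills = norm.pdf(Z @ d3) / norm.cdf(Z @ d3)

# Steps 2-3: 2SLS on the selected sample; mills appears in both X and IV
sel = y3 == 1
X = np.column_stack([np.ones(n), z1, y2, mills])[sel]       # regressors
IV = np.column_stack([np.ones(n), z1, z2, z3, mills])[sel]  # instruments
Xhat = IV @ np.linalg.lstsq(IV, X, rcond=None)[0]           # first stage
coef = np.linalg.lstsq(Xhat, y1[sel], rcond=None)[0]        # (const, delta1, alpha1, gamma1)
```

The coefficient on y2 (α1) is recovered despite both endogeneity and selection, because z2 instruments for y2 while mills controls for the selection term.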
Some other points:
• If γ1 = 0 then there is no sample selection bias
and you can just use 2SLS on the selected sample
and ignore the selection equation.
• As in the selection model with only exogenous
explanatory variables, the standard errors need to
be adjusted since δ̂3 is an estimate of δ3.
• y2 could be a dummy variable, and no
assumptions are needed that u1 and/or v2 follow a
particular distribution.
• There should be at least 2 variables in z that are
not in z1, since 2 variables are being instrumented
(counting λ(z′δ̂3) as an IV for g(z, y3)).
Returning to the case where all explanatory variables
are exogenous, Wooldridge explains (pp.813-814)
how to estimate a model where both y1 and y2 are
dummy variables.