Bryan Graham - University of California - Berkeley
(with assistance from Seongjoo Min)
Ec 240a: Econometrics, Fall 2015
Class lectures have emphasized the various iteration properties of linear predictors. In this notebook we will use the subsample of white male NLSY79 respondents featured in Notebook 1 to concretely illustrate the sample analogs of these properties.
We will consider the "long" least squares fit of log earnings onto a constant, years of completed schooling and AFQT percentile score. We will show that the coefficient on schooling in the "short" least squares fit of log earnings onto a constant and years of completed schooling alone can be recovered from a combination of the "long" fit and the "auxiliary" least squares fit of AFQT onto a constant and years of completed schooling. This provides a concrete illustration of what is sometimes called the "omitted variable bias" formula in econometrics textbooks. I would like to emphasize that I dislike that terminology. A more insightful framing of the result is as a consequence of the law-of-iterated linear predictors.
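In symbols (the notation here is mine, not from the lecture notes): if the long linear predictor of log earnings given schooling $S$ and AFQT $A$ is $E^*[\log W \mid 1, S, A] = \alpha + \beta S + \gamma A$, and the auxiliary linear predictor is $E^*[A \mid 1, S] = \pi + \delta S$, then substituting the second into the first gives the schooling coefficient in the short predictor:

$$\beta_{\text{short}} = \beta + \gamma\delta.$$

The "omitted variable bias" reading treats $\gamma\delta$ as a distortion; the iterated-predictor reading treats it as an exact accounting identity connecting the two fits.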
We then turn to an illustration of the Frisch-Waugh Theorem. Specifically we show that the coefficient on schooling in the "long" least squares fit can be recovered from the least squares fit of log earnings onto a certain residual. More specifically we compute the least squares fit of schooling onto a constant and AFQT score. We then compute the vector of residuals associated with this fit. This provides the component of years of completed schooling that is uncorrelated with AFQT scores. Finally we compute the least squares fit of log earnings onto these residuals (excluding the constant term). The coefficient on these residuals numerically coincides with the coefficient on schooling in the long regression fit. This is the Frisch-Waugh result.
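In symbols (again my own notation): let $\tilde S = S - E^*[S \mid 1, A]$ denote the residual from the auxiliary fit of schooling onto a constant and AFQT. The Frisch-Waugh Theorem states that the schooling coefficient $\beta$ in the long fit satisfies

$$\beta = \frac{\mathrm{Cov}(\log W, \tilde S)}{\mathrm{Var}(\tilde S)},$$

i.e., the slope from the no-constant regression of log earnings onto $\tilde S$ (the constant can be dropped because $\tilde S$ has mean zero by construction).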
The residual regression result has a variety of computational uses (it arises frequently in panel data analysis), and also provides insight into how conditioning on additional control variables affects the coefficient on the variable of interest in the context of linear regression analysis. It can also aid your understanding of more complex methods of covariate adjustment, as, for example, in the partially linear model.
The notebook ends by providing an illustration of the so-called Bayesian Bootstrap. The Bayesian Bootstrap is a method of drawing from a particular posterior distribution for our linear predictor coefficients. We motivate and develop this approach later in the course.
# Direct Python to plot all figures inline (i.e., not in a separate window)
%matplotlib inline
# Load libraries
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import statsmodels.api as sm
from __future__ import division
We begin by loading the subsample of white male NLSY79 respondents used in Notebook 1.
# Directory where nlsy79.csv file (created in Ec240a Python Notebook 1) is located
workdir = '/Users/bgraham/Dropbox/Teaching/Teaching_Datasets/'
# Read in NLSY79 Extract
nlsy79 = pd.read_csv(workdir+'nlsy79.csv') # Reading .csv as DataFrame
# Hierarchical index: household, then individual; keep indices as columns too
nlsy79 = nlsy79.set_index(['HHID_79','PID_79'], drop=False)
# Display the first few rows of the dataframe
nlsy79.head()
Our linear regression analysis will use log earnings. A long literature in labor economics documents that average log earnings are approximately linear in years of completed schooling. A consequence of working with log earnings as our outcome variable is that we will need to confine our analysis to NLSY respondents with non-zero earnings. This further reduces the size of our sample (from 1,969 to 1,906). Recall that our target subsample included 2,439 respondents.
# Drop units with zero earnings (will evaluate to -Inf by np.log) and compute log earnings
nlsy79 = nlsy79[nlsy79.Earnings!=0]
nlsy79['LogEarn'] = np.log(nlsy79.Earnings) # Log earnings
# Only return earnings, schooling and AFQT variables
nlsy79 = nlsy79[["PID_79","HHID_79","LogEarn","HGC_Age28","AFQT"]]
nlsy79.describe()
We begin our analysis by computing the least squares fit of log earnings onto a constant and years of completed schooling (i.e., the short regression). We will use the implementation of OLS found in the statsmodels library. Documentation for this library as well as a variety of code examples can be found at http://statsmodels.sourceforge.net/#.
y = nlsy79['LogEarn']
X = nlsy79['HGC_Age28']
W = nlsy79['AFQT']
XW = nlsy79[['HGC_Age28','AFQT']]
# Short regression: OLS fit of log earnings on years of schooling
# Use White's (1980) heteroscedastic robust variance-covariance estimator
short_reg=sm.OLS(y,sm.add_constant(X)).fit(cov_type='HC0')
print('------------------------------------------------------------------------------')
print('- Model SR : Short Regression                                                -')
print('------------------------------------------------------------------------------')
print('')
print(short_reg.summary())
Our OLS fit suggests that schooling is a strong (linear) predictor of log earnings. Imagine we drew two individuals at random (from the population of employed white males born between 1957 and 1964 and resident in the United States in 1979) and observed that one of these individuals had completed college, while the second had completed only high school. What should our prediction of the earnings gap between these two individuals be? Our OLS fit suggests using the prediction $4\times 0.16\times 100 = 64$ percent.
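A note on the arithmetic: $4\times 0.16 = 0.64$ is a gap in log points, which only approximates a percent gap; the exact proportional gap implied by the fit is $e^{0.64}-1 \approx 0.90$. A quick check (plain numpy; the coefficient value is taken from the table above):

```python
import numpy as np

# Schooling coefficient from the short regression table above
b_school = 0.16
gap_logpts = 4*b_school   # college vs. high school: four extra years of schooling

# Log-point differences approximate percent differences for small gaps;
# the exact proportional gap implied by the fit is exp(gap) - 1.
print(gap_logpts*100)                # approximate percent gap (~64)
print((np.exp(gap_logpts) - 1)*100)  # exact percent gap (~90)
```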
The standard error reported in the results table is the one due to White (1980). This standard error provides an estimate of the variability of our coefficient estimate across repeated large samples. We will motivate this measure of uncertainty and derive it formally later in the course.
Next we compute the long least squares fit of log earnings onto a constant, schooling and AFQT score percentile.
# Long regression: OLS fit of log earnings on years of schooling & AFQT percentile
# Use White's (1980) heteroscedastic robust variance-covariance estimator
long_reg=sm.OLS(y,sm.add_constant(XW)).fit(cov_type='HC0')
print('------------------------------------------------------------------------------')
print('- Model LR : Long Regression                                                 -')
print('------------------------------------------------------------------------------')
print('')
print(long_reg.summary())
Additionally conditioning on AFQT changes the coefficient of schooling quite a bit (from 0.16 to 0.11). If our two randomly drawn individuals also happened to have identical AFQT scores (and we observed this coincidence), then our long regression fit suggests that our predicted earnings gap should only be $4\times 0.11\times 100 = 44$ percent.
Why are the predictions associated with the short and long regression coefficients different? The coefficient on schooling in the short regression reflects the fact that the distribution of AFQT differs systematically across subpopulations defined in terms of schooling. We can see this more concretely by computing the auxiliary regression of AFQT onto a constant and years of completed schooling.
# Auxiliary linear regression: OLS fit of AFQT on years of schooling
# Use White's (1980) heteroscedastic robust variance-covariance estimator
aux_reg=sm.OLS(W,sm.add_constant(X)).fit(cov_type='HC0')
print('------------------------------------------------------------------------------')
print('- Model AR : Auxiliary regression of AFQT onto years of schooling           -')
print('------------------------------------------------------------------------------')
print('')
print(aux_reg.summary())
The auxiliary least squares fit confirms that AFQT varies systematically with schooling. Specifically we would predict an AFQT gap between a college and high school graduate of $4\times 7.6 = 30.4$ percentile points. If AFQT also helps to (linearly) predict earnings (which the long regression fit suggests it does), then the coefficient on schooling in the short and long fits will differ.
Using the law of iterated linear predictors we have that the short regression coefficient on schooling should equal the long regression coefficient plus the product of the auxiliary regression coefficient on schooling and the long regression coefficient on AFQT: $0.1086 + 7.6237\times 0.0071 = 0.1626$.
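This identity is not special to our data: it holds exactly in any sample. The following sketch verifies it on synthetic data (the variable names are stand-ins of my own, not the NLSY variables):

```python
import numpy as np

# Synthetic stand-ins for schooling (s), AFQT (a) and log earnings (y)
rng = np.random.RandomState(0)
n = 2000
s = rng.normal(12, 2, n)
a = 5*s + rng.normal(0, 10, n)           # correlated with schooling by design
y = 0.5 + 0.10*s + 0.01*a + rng.normal(0, 1, n)

ones = np.ones(n)
b_short = np.linalg.lstsq(np.column_stack((ones, s)), y, rcond=None)[0]
b_long  = np.linalg.lstsq(np.column_stack((ones, s, a)), y, rcond=None)[0]
b_aux   = np.linalg.lstsq(np.column_stack((ones, s)), a, rcond=None)[0]

# Short slope = long slope + (aux slope) x (long AFQT slope) -- exact in-sample
print(np.isclose(b_short[1], b_long[1] + b_aux[1]*b_long[2]))  # True
```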
We next compute the residual regression, illustrating the Frisch-Waugh Theorem.
# Residual linear regression: OLS fit of log earnings on (residualized) years of schooling
# Auxiliary regression of years of schooling onto AFQT; save residuals
aux_reg=sm.OLS(X,sm.add_constant(W)).fit(cov_type='HC0')
V = pd.DataFrame({'Res_HGC':aux_reg.resid})
# Compute residual regression
res_reg=sm.OLS(y,V,hasconst=False).fit(cov_type='HC0')
print('------------------------------------------------------------------------------')
print('- Model RR : Residual regression of earnings onto (residualized) schooling  -')
print('------------------------------------------------------------------------------')
print('')
print(res_reg.summary())
The Frisch-Waugh Theorem provides insight into the effect of conditioning in the linear regression context. The coefficient on schooling in the long regression reflects the covariance between log earnings and that component of schooling which is uncorrelated with the additional included (auxiliary) variables.
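The following minimal sketch verifies the residual regression equivalence numerically with plain numpy, again using synthetic stand-in data rather than the NLSY sample:

```python
import numpy as np

# Synthetic stand-ins: a = "AFQT", s = "schooling", y = "log earnings"
rng = np.random.RandomState(1)
n = 1000
a = rng.normal(size=n)
s = 0.8*a + rng.normal(size=n)
y = 1.0 + 0.1*s + 0.01*a + rng.normal(size=n)

ones = np.ones(n)
Z = np.column_stack((ones, a))

# Residualize schooling with respect to a constant and AFQT
s_tilde = s - Z.dot(np.linalg.lstsq(Z, s, rcond=None)[0])

# Regress y on the residual alone (no constant) ...
b_res = np.linalg.lstsq(s_tilde[:, None], y, rcond=None)[0][0]

# ... and compare with the schooling coefficient in the long regression
b_long = np.linalg.lstsq(np.column_stack((ones, s, a)), y, rcond=None)[0][1]
print(np.isclose(b_res, b_long))  # True
```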
We end our notebook with a snippet of code which implements the Bayesian Bootstrap. You can read more about this procedure in Chamberlain and Imbens (2003, Journal of Business and Economic Statistics). We will develop this resampling procedure more carefully later in the course. Mechanically it generates an estimate of the posterior distribution of our long linear regression coefficients by repeated computation of the least squares fit using different random perturbations of our sample. The code provides an illustration of how to construct a "for loop" in Python and also illustrates more of the graphing capabilities of the matplotlib library and its seaborn add-on.
An interesting question is how the "degree-of-belief" motivated posterior distribution generated by the Bayesian Bootstrap relates to the asymptotically normal repeated sampling distribution.
##############################################################################
# Bayesian Bootstrap (cf., Chamberlain and Imbens (2003, JBES)) #
##############################################################################
XW = sm.add_constant(XW) # "Permanently" add constant to design matrix
B = 10000 # Number of posterior draws
M = np.empty((B,3)) # Matrix with posterior draws of statistics of interest
N = len(y) # Number of observations in dataset
for b in range(0, B):
    wgt = np.random.gamma(1., 1., N)           # Random draws of Gamma(1,1) variables
    wgt = wgt/np.sum(wgt)                      # Normalized draws are jointly Dirichlet
    result = sm.WLS(y, XW, weights=wgt).fit()  # Linear regression with Dirichlet wgts
    M[b,:] = result.params                     # Store b-th posterior draw

M[:,0] = M[:,1]/M[:,2]                         # Replace the constant coefficient with
                                               # ratio of school-to-AFQT coefficients
# Put Bayesian Bootstrap results in a pandas dataframe
BB=pd.DataFrame({'Ratio':M[:,0], 'Return-to-Schooling':M[:,1], 'Return-to-AFQT':M[:,2]})
# Scatter (use seaborn add-on to matplotlib)
sns.set(style="white")
sns.jointplot("Return-to-AFQT","Return-to-Schooling", data=BB, kind="scatter", \
size=7, space=0, color="g", xlim=(0.002,0.012), ylim=(0.06,0.16), stat_func=None)
print('\n')
print('=======================================================================')
print('BAYESIAN BOOTSTRAP RESULTS')
print('=======================================================================')
print('2.5, 5, 50, 95, 97.5 Percentiles')
print('=======================================================================')
print(BB.quantile(q=[0.025, 0.05, 0.5, 0.95, 0.975]))
print('=======================================================================')
print('Summary Statistics')
print('=======================================================================')
print(BB.describe())
print('=======================================================================')
del y, X, XW, B, M, N, result
The figure above suggests that the posterior distribution of the linear predictor coefficients on schooling and AFQT could be well approximated by a bivariate normal distribution.
How about the distribution of the ratio of the schooling and AFQT coefficients? This distribution is depicted in the histogram below. Here we see that a normal approximation may be less accurate. Specifically it would fail to capture the right-skewness of the posterior.
# Histogram of coefficient ratio
sns.distplot(BB['Ratio']);
plt.title("Posterior Distribution of Ratio of Schooling-to-AFQT Linear Predictor Coefficients")
# This imports an attractive notebook style from Github
from IPython.core.display import HTML
import urllib2
HTML(urllib2.urlopen('http://bit.ly/1Bf5Hft').read())