{ "cells": [ { "cell_type": "markdown", "metadata": {}, "source": [ "# Notebook 4: K-Normal Means\n", "_Bryan Graham - University of California - Berkeley_\n", "\n", "_Ec 240a: Econometrics, Fall 2015_" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let $E[Y|X] = m(X) + \\sigma U$ with $U$ a standard normal random variable independent of $X$. Assume that $m(X)$ equals some linear combination of a potentially high dimensional set of basis functions (e.g., a power series in $X$). In lecture we saw how we could use Gram-Schmidt orthonormalization to generate a representation of $m(X)$ in terms of a set of orthonormal basis functions of the same dimension. This, in turn, transforms our series regression problem into a canonical K-Normal means one (cf., Wasserman (2006, Chapter 7)).\n", "
\n", "
\n", "In this notebook I use K-Normal Means theory to study the conditional mean of log Earnings give years of completed schooling and AFQT scores among a sample of 733 white male respondents from the NLSY79. I focus on respondents who were 18 or younger when they took the AFQT test and restrict the analysis to high school graduates. The the population of interest, therefore, is white male high school graduates born in 1962, 1963 or 1964 and resident in the United States in 1979." ] }, { "cell_type": "code", "execution_count": 9, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Direct Python to plot all figures inline (i.e., not in a separate window)\n", "%matplotlib inline\n", "\n", "# Load libraries\n", "import numpy as np\n", "import numpy.linalg\n", "import pandas as pd\n", "import matplotlib.pyplot as plt\n", "import seaborn as sns\n", "\n", "from __future__ import division" ] }, { "cell_type": "code", "execution_count": 10, "metadata": { "collapsed": true }, "outputs": [], "source": [ "# Directory where NLSY79_TeachingExtract.csv file is located\n", "workdir = '/Users/bgraham/Dropbox/Teaching/Teaching_Datasets/'" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "First we import the NLSY79_TeachingExtract.csv dataset as a pandas dataframe and select the appropriate subsample." ] }, { "cell_type": "code", "execution_count": 11, "metadata": { "collapsed": false, "scrolled": true }, "outputs": [ { "name": "stdout", "output_type": "stream", "text": [ "Year-of-birth distribution of target sample\n", "62 307\n", "63 263\n", "64 251\n", "dtype: int64\n" ] }, { "data": { "text/html": [ "
\n", "\n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", " \n", "
PID_79HHID_79EarningsSchoolAFQTLogEarn
count733.000000733.000000733.000000733.000000733.000000733.000000
mean3497.2864943496.60163768569.92340313.82946859.99549210.830661
std2619.1655442619.22855557868.0697392.17275826.3883550.888534
min7.0000007.0000008.72573012.0000000.0940002.166276
25%1805.0000001804.00000034621.59560012.00000040.80199810.452233
50%3171.0000003171.00000053954.79685713.00000063.18399810.895902
75%4648.0000004648.00000080325.22214316.00000081.32499711.293839
max12139.00000012137.000000546176.49428620.000000100.00000013.210697
\n", "
" ], "text/plain": [ " PID_79 HHID_79 Earnings School AFQT \\\n", "count 733.000000 733.000000 733.000000 733.000000 733.000000 \n", "mean 3497.286494 3496.601637 68569.923403 13.829468 59.995492 \n", "std 2619.165544 2619.228555 57868.069739 2.172758 26.388355 \n", "min 7.000000 7.000000 8.725730 12.000000 0.094000 \n", "25% 1805.000000 1804.000000 34621.595600 12.000000 40.801998 \n", "50% 3171.000000 3171.000000 53954.796857 13.000000 63.183998 \n", "75% 4648.000000 4648.000000 80325.222143 16.000000 81.324997 \n", "max 12139.000000 12137.000000 546176.494286 20.000000 100.000000 \n", "\n", " LogEarn \n", "count 733.000000 \n", "mean 10.830661 \n", "std 0.888534 \n", "min 2.166276 \n", "25% 10.452233 \n", "50% 10.895902 \n", "75% 11.293839 \n", "max 13.210697 " ] }, "execution_count": 11, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# Read in NLSY79 Extract as a pandas dataframe\n", "nlsy79 = pd.read_csv(workdir+'NLSY79_TeachingExtract.csv') # Reading .csv as DataFrame\n", "\n", "# Hierarchical index: household, then individual; keep indices as columns too\n", "nlsy79.set_index(['HHID_79','PID_79'], drop=False)\n", "nlsy79.rename(columns = {'HGC_Age28':'School'}, inplace=True) # Renaming schooling\n", "\n", "# Only retain non-black, non-hispanic, male NLSY79 respondents belonging to core sample\n", "nlsy79 = nlsy79[(nlsy79.core_sample != 0) & (nlsy79.male != 0) & (nlsy79.black != 1) & (nlsy79.hispanic != 1)]\n", "\n", "# Like in Neal and Johnson (1996), only retain respondents born in 1962 or later and hence aged\n", "# 16 to 18 at the time of AFQT testing. Additionally require that they graduated from highschool\n", "nlsy79 = nlsy79[(nlsy79.year_born >= 62) & (nlsy79.School >= 12)]\n", "\n", "# Summarize age distribution of target sample\n", "print 'Year-of-birth distribution of target sample'\n", "print pd.value_counts(nlsy79.year_born)\n", "\n", "# Calculate average earnings across the 1999, 2001, 2003, 2005 and 2007 calendar years\n", "# NOTE: This is an average over non-missing earnings values\n", "nlsy79['Earnings'] = nlsy79[[\"real_earnings_1997\", \"real_earnings_1999\", \"real_earnings_2001\", \"real_earnings_2003\", \\\n", " \"real_earnings_2005\", \"real_earnings_2007\", \"real_earnings_2009\"]].mean(axis=1)\n", "\n", "# Only retain complete cases with earnings, schooling and AFQT\n", "nlsy79 = nlsy79[[\"PID_79\",\"HHID_79\",\"Earnings\",\"School\",\"AFQT\"]] \n", "nlsy79 = nlsy79.dropna()\n", "\n", "# Drop units with zero earnings (will evaluate to -Inf by np.log) and compute log earnings\n", "nlsy79 = nlsy79[nlsy79.Earnings!=0]\n", "nlsy79['LogEarn'] = np.log(nlsy79.Earnings) # Log earnings\n", "\n", "# Basic summary statistics of nlsy79 dataframe\n", "nlsy79.describe()" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Next we create a power series in years of schooling and AFQT scores (we first transform both variables to lie on the unit interval for numerical stability). We will work with a 6th order power series expansion, which with two covariates, includes 28 terms." ] }, { "cell_type": "code", "execution_count": 12, "metadata": { "collapsed": false }, "outputs": [], "source": [ "# Create 7th order power series in School and AFQT\n", "SeriesTerms = np.array([[0,1,0,2,0,1,3,0,2,1,4,0,3,1,2,5,0,4,1,3,2,6,0,5,1,4,2,3,7,0,6,1,5,2,4,3],\\\n", " [0,0,1,0,2,1,0,3,1,2,0,4,1,3,2,0,5,1,4,2,3,0,6,1,5,2,4,3,0,7,1,6,2,5,3,4]])\n", "# Only retain terms up to 6th order\n", "SeriesTerms = SeriesTerms[:,0:27]\n", "SeriesLabels = []\n", "\n", "for p in range(np.shape(SeriesTerms)[1]):\n", " i = SeriesTerms[0,p] #Exponent for School\n", " j = SeriesTerms[1,p] #Exponent for AFQT\n", " SeriesLabels.append('Sch_'+str(i)+'_X_AFQT_'+str(j))\n", " nlsy79['Sch_'+str(i)+'_X_AFQT_'+str(j)]=(((nlsy79['School']-12)/8)**i)*((nlsy79['AFQT']/100)**j)" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "Let $G$ be our $N\\times K$ matrix of basis functions. The GramSchmidt() function defined below transforms $G$ in a matrix $W$ of orthnormal columns using the inner product and norm definitions given in lecture." ] }, { "cell_type": "code", "execution_count": 13, "metadata": { "collapsed": false }, "outputs": [], "source": [ "def proj(f, g):\n", " # Calculate orthogonal projection of g onto f \n", " return ( np.dot(f,g) / np.dot(f,f) ) * f \n", " \n", "def GramSchmidt(G):\n", " \n", " G = np.array(G, dtype=float) # Convert design matrix to float type\n", " F = np.copy(G) # Copy of G -- will be used to store orthogonalization\n", " \n", " # orthogonalize design matrix\n", " for i in range(1, G.shape[1]): # Iterate over columns of G\n", " for j in range(i): # Iterate over columns less than current one\n", " F[:,i] -= proj(F[:,j], G[:,i]) # Subtract projection of g_i onto f_j for all j" ] }, "metadata": {}, "output_type": "display_data" }, { "data": { "text/plain": [ "" ] }, "metadata": {}, "output_type": "display_data" } ], "source": [ "Y = nlsy79['LogEarn'] # Outcome variable\n", "X = nlsy79[['School','AFQT']]\n", "\n", "N = len(Y)\n", "\n", "G = nlsy79[SeriesLabels] # Basis functions\n", "W = GramSchmidt(G) # Orthonormalized basis functions\n", "Z = W.T.dot(Y)/N # K Normal Means\n", "\n", "theta_ML = Z # Maximum likelihood estimate of basis function coefficients\n", "mu_MLE = W.dot(Z) # MLE of m(X) \n", "sigma2 = np.sum((Y - mu_MLE) ** 2)/(N-len(SeriesLabels)) # MLE of sigma2\n", "\n", "# Shrink basis function coeffients as described in lecture and very close\n", "# to the universal series estimator (USE) of Efromovich (1999). Truncate shrinkage \n", "# at zero.\n", "ShkCoef = []\n", "for k in range(0, len(SeriesLabels)):\n", " ShkCoef.append(max([1-(sigma2/N)/(Z[k]**2),0]))\n", "\n", "mu_USE = W.dot(ShkCoef*Z) # USE estimate of m(X)\n", "\n", "print \"\"\n", "print \"Empirical risk minimizing amount of shrinkage of K basis\"\n", "print \"function coeffients\"\n", "print \"\"\n", "print ShkCoef\n", "\n", "# Form interpolation grid upon which to plot estimated regression surface\n", "School_i = np.linspace(12, 20)\n", "AFQT_i = np.linspace(0, 100)\n", "mu_MLE_i = plt.mlab.griddata(X['School'], X['AFQT'], mu_MLE, School_i, AFQT_i,interp='linear')\n", "mu_USE_i = plt.mlab.griddata(X['School'], X['AFQT'], mu_USE, School_i, AFQT_i,interp='linear')\n", "\n", "\n", "# Create a 1 x 2 array of plots, with a contour representation of (i) the MLE regression surface\n", "# as well as (ii) its smoothed USE counterpart.\n", "plt.figure(1)\n", "plt.subplot(121)\n", "surf_MLE = plt.contourf(School_i, AFQT_i,mu_MLE_i, \\\n", " levels = [9,9.25,9.5,9.75,10,10.25,\\\n", " 10.35,10.45,10.55,10.65,10.75,10.85,10.95,11.05,11.15,\\\n", " 11.25,11.5,11.75,12], cmap='Greens')\n", "surf_MLE_cl = plt.contour(surf_MLE,colors = 'Green')\n", "plt.title('Maximum Likelihood')\n", "plt.xlabel('Years-of-Schooling')\n", "plt.ylabel('AFQT Percentile')\n", "\n", "plt.subplot(122)\n", "surf_USE = plt.contourf(School_i, AFQT_i,mu_USE_i, \\\n", " levels = [9,9.25,9.5,9.75,10,10.25,\\\n", " 10.35,10.45,10.55,10.65,10.75,10.85,10.95,11.05,11.15,\\\n", " 11.25,11.5,11.75,12], cmap='Greens')\n", "cb = plt.colorbar(surf_USE)\n", "cb.set_label('Log Earnings')\n", "surf_USE_cl = plt.contour(surf_USE,colors = 'Green')\n", "plt.title('Shrinkage')\n", "plt.xlabel('Years-of-Schooling')\n", "plt.ylabel('AFQT Percentile')\n", "plt.tight_layout()\n", "plt.show()\n", "plt.savefig(workdir+'Fig_KNormalMeansExample.png')" ] }, { "cell_type": "markdown", "metadata": {}, "source": [ "The two regression surfaces are depicted as countour plots in the figure above. There are no contours in the lower-right-hand region of the figures, since this is outside the support of the sample.\n", "
\n", "
\n", "The two figures are based on the same set of 28 basis functions, with the difference being that the basis function coefficients used to draw the right hand figure have been shrunk toward zero. Note that coefficient shrinkage induces smoothness in the resulting conditional mean function.\n", "
\n", "
\n", "The two plots reveal several interesting features about the conditional mean of log earnings given schooling and AFQT. In particular note that the \"return\" to additional years of schooling is very high for individuals with high AFQT scores relative to those with low scores. Put differently, for individuals with less than 16 years of schooling (which corresponds to an undergraduate degree), earnings only weakly increase in AFQT scores." ] }, { "cell_type": "code", "execution_count": 16, "metadata": { "collapsed": false }, "outputs": [ { "data": { "text/html": [ "" ], "text/plain": [ "" ] }, "execution_count": 16, "metadata": {}, "output_type": "execute_result" } ], "source": [ "# This imports an attractive notebook style from Github\n", "from IPython.core.display import HTML\n", "import urllib2\n", "HTML(urllib2.urlopen('http://bit.ly/1Bf5Hft').read())" ] } ], "metadata": { "kernelspec": { "display_name": "Python 2", "language": "python", "name": "python2" }, "language_info": { "codemirror_mode": { "name": "ipython", "version": 2 }, "file_extension": ".py", "mimetype": "text/x-python", "name": "python", "nbconvert_exporter": "python", "pygments_lexer": "ipython2", "version": "2.7.10" } }, "nbformat": 4, "nbformat_minor": 0 }