The MLE procedure allows users to program their own likelihood functions. For most purposes, however, users will want to use the SST procedures which have been preprogrammed for standard models, such as the LOGIT, PROBIT, TOBIT, MNL, and DURAT commands. These commands have roughly the same syntax as the REG command and make it no more difficult to estimate a probit model, for instance, than to run a regression.
The purpose of this chapter is to describe the various models which can be estimated by SST and to explain how to execute the appropriate commands. The discussion of the models is kept at a fairly simple level and references are provided for more detailed discussions about the statistical properties of the procedures.
We focus initially upon models for dichotomous dependent variables. These models are somewhat simpler to explain than models for polytomous (multiple category) dependent variables and, for some users, are all that is required. In the case of a dichotomous dependent variable, the logit and probit models will produce very similar results (the estimated coefficients should be approximately proportional to one another, as described below). For this reason, we discuss the so-called binary models together.
When the dependent variable is polytomous, the LOGIT and PROBIT commands, as implemented in SST, will produce rather different output. The PROBIT command in SST is for an ordered categorical model. This will be appropriate when the categories of the dependent variable can be ordered in a natural way. Survey responses often can be treated this way. For example, if respondents are asked whether they "agree strongly," "agree weakly," "disagree weakly," or "disagree strongly" with some statement, then their responses can be ordered. The ordered probit model in SST provides an alternative to the arbitrary scoring schemes of responses that would be required for regression analysis.
The LOGIT command in SST is for unordered responses. A typical example would be when an individual can choose from a set of discrete alternatives. In transportation mode analysis, one may try to predict which mode (e.g., car, bus, or train) an individual will pick. Since there is no natural ordering of these alternatives, one approach would be to try to estimate the probability that each alternative would be chosen. For this purpose, the unordered logit model would be suitable. Later we also discuss how random utility models can be used for this purpose (the MNL command can be used for estimation of random utility models). These two models are closely related. In fact, the MNL command can also be used to estimate unordered logit models, just as the LOGIT command can, but it is much more cumbersome to do it this way.
The TOBIT command is used to estimate censored regression models and is most often applied to the analysis of expenditure data. The DURAT command is used to estimate discrete time hazard models (e.g., for analyzing spells of unemployment). Both commands are discussed later in the chapter.
The last section of the chapter discusses the MLE command, which computes maximum likelihood estimates for user-defined models. The same procedure can be used for nonlinear least squares estimation. The MLE command differs from the previously mentioned commands in that the user must also specify the derivatives of the log likelihood function, while in the other commands these derivatives have been preprogrammed. Since the algorithm used for the MLE command is slightly different from that used for the other commands, the options available also differ.
If the data are continuous, rather than discrete, we simply substitute the probability density function for the probability mass function. Also, for most purposes it is easier to work with the log of the likelihood function. Since the logarithm is a monotone increasing function, log L(beta_1, ..., beta_K) and L(beta_1, ..., beta_K) will be maximized at the same values.
When the model is misspecified, maximum likelihood estimators may lose some of their desirable properties. However, it has been shown that under very weak conditions, maximum likelihood estimators will still have a well-defined probability limit and will be approximately normally distributed. Moreover, it is still possible to compute the variance of maximum likelihood estimators based on a random sample even if the model is badly misspecified. The ROBUST subop to the maximum likelihood procedures computes standard errors for maximum likelihood estimates that are insensitive (or "robust") to model misspecification. For details, see Halbert White, "Misspecification and Maximum Likelihood Estimation", Econometrica (January, 1982). White suggests comparing the usual and robust estimates of the covariance matrix of maximum likelihood estimators as a test of model specification.
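For example, to request robust standard errors for a binary probit, the subop is appended to the estimation command (a sketch, assuming ROBUST is given as a bare subop with no argument):

probit dep[y] ind[one x1 x2] robust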
Let Y_i denote the dependent variable and x_1i ... x_Ki
denote the independent variables. A reasonable choice of functional form
for the probabilities is:

    Prob(Y_i = 1) = F(beta' x_i)
where beta' x_i = beta_1 x_1i + ... + beta_K x_Ki and F(.) is a nondecreasing function such that:

    F(t) -> 0 as t -> -infinity and F(t) -> 1 as t -> +infinity
Any cumulative probability distribution function will satisfy these
conditions. Two distribution functions which have often been used are the
logistic:

    F(t) = 1/(1 + exp(-t))

and the normal:

    F(t) = PHI(t), the integral from -infinity to t of (1/sqrt(2 pi)) exp(-s^2/2) ds
PHI(.) corresponds to the SST function cumnorm. In either case, if beta' x_i is large, then the probability that Y_i equals one is close to one. Similarly, if beta' x_i is small, then the probability that Y_i equals one is close to zero. Whatever values the independent variables take, the probability that Y_i equals one will be admissible (i.e., between zero and one).
When we specify F(.) to be logistic, then we have the logit model. The log likelihood function for the logit model is given by:

    log L(beta) = sum_i { Y_i log F(beta' x_i) + (1 - Y_i) log[1 - F(beta' x_i)] }
where F(t) denotes the cumulative logistic distribution function.
When we specify F(.) to be normal, then we have the probit model. The log likelihood in this case is given by:

    log L(beta) = sum_i { Y_i log PHI(beta' x_i) + (1 - Y_i) log[1 - PHI(beta' x_i)] }
Both the logit and probit log likelihoods are globally concave and
hence relatively easy to maximize using the Newton-Raphson algorithm.
The LOGIT command uses an iterative nonlinear optimization procedure to obtain parameter estimates that maximize the logit log likelihood function. You specify the dependent variable in the DEP subop and the independent variables in the IND subop, in much the same way as in the REG command. Suppose, for example, that the variable y takes only two values. In our examples, we will assume y equals zero or one, but the LOGIT and PROBIT commands do not require this. Then to obtain estimates for a binary logit model, type:
logit dep[y] ind[x1 x2 x3]
As in regressions, it is advisable to include a constant term in the IND subop, though SST does not require you to do so.
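For example, if the constant term is stored in the variable one (the convention used in the examples below), the command becomes:

logit dep[y] ind[one x1 x2 x3]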
The PROBIT command produces parameter estimates that maximize this function. The syntax is identical to that for the LOGIT command:

probit dep[y] ind[x1 x2 x3]

The same advice about including a constant term in the IND subop applies here as well.
For binary dependent variables, LOGIT and PROBIT will produce similar results. The logistic distribution has variance (pi^2)/3 while the standard normal distribution, of course, has variance one. This means that the coefficients based on the binary logit specification should be approximately eighty percent larger than those based on the binary probit specification. (Multiplying the binary logit estimates by sqrt(3)/pi makes them roughly comparable to the binary probit estimates.) If the distribution of the dependent variable is highly skewed, then the choice between the logit and probit specifications may be of greater consequence, but such cases are rare in practice.
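As an illustrative sketch of this conversion, the logit coefficients could be saved with the COEF subop (described below) and rescaled with a SET statement; the variable names blogit and bscaled are arbitrary, and 0.551 approximates sqrt(3)/pi:

logit dep[y] ind[one x1 x2] coef[blogit]
set bscaled = 0.551*blogit

The variable bscaled then holds the logit estimates expressed on the probit scale.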
After estimation with the LOGIT or PROBIT commands, SST allows the user to save the predicted probabilities with the PROB subop. If the dependent variable has two categories, then the predicted probability is the estimated probability that the dependent variable takes its high value. You should specify one variable name in the PROB subop. Following our example above, where y takes the values zero and one, we could give the LOGIT command:
logit dep[y] ind[one x1 x2 x3] prob[phat]
SST will compute the probability that y takes its high value and save these estimated probabilities in the variable phat. The ith observation on the variable phat would be:

    phat_i = 1/(1 + exp(-beta' x_i))
If the PROB subop is included with the PROBIT command, then the estimated probability that y takes its high value is given by:

    phat_i = PHI(beta' x_i)
In either case, the estimated probabilities can be used to compute a
goodness of fit statistic, the percentage of cases correctly predicted. In
the binary logit and probit models, we would predict that a case would fall
into the category with the modal (highest) predicted probability.
That is, if Prob(Y_i=1) >= Prob(Y_i=0), then we would
predict that Y_i equals one, while if Prob(Y_i=0) >
Prob(Y_i=1), then we would predict Y_i equals zero. We
could then compare our predictions to the actual outcomes on the dependent
variable. The percent "correctly predicted" is just the fraction of cases
where the actual outcome corresponds to the "predictions" described above.
We describe this calculation in some detail for the probit model. The calculations for the logit model are identical, except that we initially use the LOGIT command instead of the PROBIT command.
First, we compute maximum likelihood probit estimates and save the probabilities in the variable phat:
probit dep[y] ind[one x1 x2] prob[phat]
Next, we create a dummy variable which equals one if the case is predicted correctly and equals zero otherwise:
set success = ((phat >= 0.5) & (y == 1)) | ((phat < 0.5) & (y == 0))
Finally, we calculate the percentage of successes:
freq var[success]
The number produced is the "percentage correctly predicted". It should be understood that while this statistic can be a useful diagnostic, it is not really a "prediction" (since the actual outcomes have been used to compute the probit or logit estimates). Furthermore, the percent correctly predicted is guaranteed to be fairly high, and will be very high if the dependent variable is skewed.
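For comparison, the fraction of cases in the modal category of the dependent variable is the success rate that would be obtained by always "predicting" the more common outcome; it can be read off a frequency table of the dependent variable:

freq var[y]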
As with the REG command, estimated coefficients and the variance-covariance matrix can be saved using the COEF and COVMAT subops. The user supplies a variable name in each subop, and SST stores the coefficient estimates or covariance matrix (stacked by column) in the specified variable.
With the PROBIT command, the values of beta' x_i can be saved for each observation by specifying a variable in the PRED subop. This provides another route for obtaining predicted probabilities, since the probability that the dependent variable takes its high value can be obtained by evaluating the cumulative normal distribution function at beta' x_i. Users may find this option helpful for performing selectivity corrections (see James Heckman, "Sample Selection Bias as a Specification Error", Econometrica, January 1979) or simultaneous probit estimation (see L. F. Lee, "Simultaneous Equations Models with Discrete Endogenous Variables", in C. Manski and D. McFadden, eds., Structural Analysis of Discrete Data with Econometric Applications, MIT Press, 1981).
This option is not available for the LOGIT command, but the same results can be obtained in the binary case by saving the probabilities in the PROB subop (as described above) and then using a SET statement to apply the log odds transformation, which is the inverse of the logistic function:

set xb = log(phat/(1.0 - phat))
There appear to be fewer situations where these values are useful in logit analysis.
For concreteness we assume that we have a variable Y_i which takes three different values, "low," "medium," and "high," scored one, two, and three, respectively. (It does not matter to SST which values the variable actually takes, since the PROBIT command will temporarily recode values from low to high as integers.) We can think of the discrete dependent variable Y_i as being a rough categorization of a continuous, but unobserved, variable Y_i^*.
If we observed Y_i^* we could apply standard regression methods. For example, we might assume that Y_i^* is a linear function of some independent variables x_1i ... x_Ki plus an additive disturbance u_i:

    Y_i^* = beta_1 x_1i + ... + beta_K x_Ki + u_i = beta' x_i + u_i

where u_i is normally distributed. Since the scale of Y_i^* cannot be determined, there is no loss of generality in assuming that the variance of u_i is equal to one; this represents an arbitrary normalization. Next, we specify the relationship between the categories of Y_i and the values of Y_i^*:

    Y_i = 1 ("low")    if Y_i^* <= 0
    Y_i = 2 ("medium") if 0 < Y_i^* <= mu
    Y_i = 3 ("high")   if Y_i^* > mu
The "threshold value" mu is a parameter to be estimated, as are the
unknown coefficients beta_1 ... beta_K. Setting the threshold
between the "low" and "medium" categories equal to zero is arbitrary, but
inconsequential if one of the independent variables is constant. In fact,
it is not possible to estimate the coefficient of a constant term (the
intercept) and two thresholds in the three category case: a shift
in the intercept cannot be distinguished from a shift in the thresholds.
In the general case of C categories, there will be C-2 thresholds to estimate. SST will always set the threshold between the lowest and next lowest categories equal to zero. Moreover, the threshold values must be ordered from lowest to highest. If at some point during the estimation procedure the thresholds get out of order, the maximization algorithm perturbs the estimates enough to put them back into ascending order.
The probability that Y_i falls into the jth category is given by:

    Prob(Y_i = j) = PHI(mu_(j+1) - beta' x_i) - PHI(mu_j - beta' x_i)

where mu_j and mu_(j+1) denote the lower and upper threshold values for category j. If j is the low category, then the lower threshold value is -infinity and the upper threshold value is zero. If j is the high category, the upper threshold value is +infinity. The log likelihood function is the sum of the individual log probabilities:

    log L = sum_i log Prob(Y_i falls into its observed category)
The ordered probit model is estimated with the PROBIT command. The model described above is a straightforward generalization of the binary probit model and, not surprisingly, the same syntax works for the ordered probit model:
probit dep[y] ind[one x1 x2]
SST counts the number of categories in the dependent variable
and automatically orders the categories from "low" to "high".
Optional subops are the same for the ordered probit model as the
binary probit model with only slight modification. The number of
probabilities produced by the ordered probit model is equal to
one less than the number of categories in the dependent variable.
The omitted category is the low category. If the dependent
variable takes three values, then the user should supply two
variables in the PROB
subop, and SST will store the estimated
probabilities that the dependent variable falls into the middle
and high categories in the specified variables:
probit dep[y] ind[one x1 x2] prob[p2 p3]
The probability that the dependent variable falls into the low
category can be obtained using a SET
statement:
set p1 = 1.0 - p2 - p3
A similar calculation can be made for the percentage of cases correctly predicted in the three alternative case:
set success=1; if[(p1 >= p2) & (p1 >= p3)]
set success=2; if[(p2 > p1) & (p2 >= p3)]
set success=3; if[(p3 > p1) & (p3 > p2)]
set success=(success==y)
freq var[success]
The more alternatives, the more tedious this calculation becomes. Typing can be reduced by defining a macro to perform this procedure (see Chapter 8).
The PRED subop can be used to compute beta' x_i. The threshold values are not considered part of beta' x_i, and only one variable should be specified in the PRED subop regardless of the number of categories of the dependent variable. Coefficients and the covariance matrix can be stored using the COEF and COVMAT subops. The thresholds are stored as the last elements of the coefficient vector and correspond to the last rows and columns of the covariance matrix. The ROBUST subop is also allowed for the ordered probit model.
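For example, to save the ordered probit coefficients (with the thresholds as the last elements) and the covariance matrix, one might type the following, where the variable names b and v are arbitrary:

probit dep[y] ind[one x1 x2] coef[b] covmat[v]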
The unordered logit model has attained some popularity in sociology because of its usefulness in modelling interactions in contingency tables. If the independent variables consist solely of dummy variables for categories of other discrete variables and various interaction terms, then the unordered logit model corresponds to the log-linear model for contingency table analysis. See S. Fienberg, The Analysis of Cross-Classified Categorical Data, MIT Press, 1981, for an introductory treatment of these models and related issues.
The likelihood function for the unordered logit model is the product, over cases, of the probability of each case taking its observed value. With categories numbered 0 through C-1, the probability that case i falls into category j is:

    Prob(Y_i = j) = exp(beta_j' x_i) / [exp(beta_0' x_i) + ... + exp(beta_(C-1)' x_i)]

where beta_0 is a K vector of zeroes and each of the remaining beta_j is a K vector of parameters to be estimated.
Since the number of parameters grows with the number of
categories in the dependent variable, users may want to consider
grouping categories of the dependent variable which do not occur
frequently.
The syntax is the same as in the binary case:

logit dep[y] ind[one x1 x2]
Probabilities can be saved by specifying variable names in the PROB subop. The number of variable names specified should be one less than the number of categories in the dependent variable.
Coefficients can be saved with the COEF subop. Coefficients are grouped according to which log odds they pertain to. If there are K independent variables, the first K coefficients correspond to the log odds of choosing category 1 versus category 0, the next K coefficients to the odds of category 2 versus category 0, and so on. The covariance matrix, which can be saved with the COVMAT subop, is ordered in the same way. ROBUST is also an allowable option.
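Putting these options together for a three category dependent variable, a complete command might look like the following sketch (p2, p3, b, and v are arbitrary variable names, and ROBUST is again assumed to be a bare subop):

logit dep[y] ind[one x1 x2] prob[p2 p3] coef[b] covmat[v] robust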
The Tobit model also fits within the general regression framework. Let
Y_i^* denote the unobserved latent variable which is subject to
censoring. We assume that Y_i^* is generated by a linear combination of
the independent variables x_1i ... x_Ki plus an error u_i
which is normally distributed with mean zero and variance sigma^2:
Y_i^* is observed only if Y_i^* exceeds some threshold. Again there
is no loss of generality in assuming that the threshold equals zero since
any shift in the threshold value can be absorbed into the intercept. Let
Y_i denote the observed variable and suppose censored values of Y_i^* have been coded to zero (in the TOBIT command, unlike the other limited dependent variable commands, SST does care how the variable is coded). Then:

    Y_i = Y_i^* if Y_i^* > 0
    Y_i = 0     if Y_i^* <= 0
The probability that Y_i^* < 0 is given by:

    Prob(Y_i^* < 0) = 1 - PHI(beta' x_i / sigma)
The conditional distribution of Y_i^* given that Y_i^* > 0 is given by the following probability density function:

    f(y | Y_i^* > 0) = (1/sigma) phi((y - beta' x_i)/sigma) / PHI(beta' x_i / sigma)

where phi(.) denotes the standardized normal density function:

    phi(t) = (1/sqrt(2 pi)) exp(-t^2/2)
Combining the above expressions, we obtain the sample log likelihood function:

    log L = sum over censored observations of log[1 - PHI(beta' x_i / sigma)]
            + sum over uncensored observations of { log phi((Y_i - beta' x_i)/sigma) - log sigma }
The difficulty in maximizing the Tobit log likelihood arises from the presence of sigma^2. SST computes a starting value for sigma^2 based on user-supplied starting values for beta_1 ... beta_K. If you do not supply starting values, SST computes an initial regression to obtain them. Poor starting values can cause the Newton-Raphson algorithm to diverge more frequently in Tobit than in most other maximum likelihood procedures.
The TOBIT command maximizes the above log likelihood using an iterative non-linear optimization algorithm. If censored values of the variable y have been coded equal to zero, Tobit estimation can be accomplished by giving the command:
tobit dep[y] ind[one x1 x2]
The options to the TOBIT command are similar to those for the other limited dependent variable commands. The user may specify a variable for the censoring probabilities (1 - PHI(beta' x_i / sigma)) in the PROB subop, while beta' x_i can be stored in a variable specified in the PRED subop:
tobit dep[y] ind[one x1 x2] prob[phat] pred[xb]
Coefficient estimates can be saved using the COEF subop and the covariance matrix with the COVMAT subop. The coefficients in the Tobit model include, in addition to beta_1 ... beta_K, the disturbance variance sigma^2, which SST automatically appends to the end of the coefficient vector and stores as the last row and column of the covariance matrix.
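For example (b and v are arbitrary variable names; the last element of b will be the estimate of sigma^2):

tobit dep[y] ind[one x1 x2] coef[b] covmat[v]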
Further details on Tobit and related models may be found in a survey paper by T. Amemiya, "Tobit Models," Journal of Econometrics (1984).
In the random utility framework, individual i is assumed to derive utility U_ij = beta' x_ij + u_ij from alternative j, where x_ij is a vector of observed characteristics of alternative j as faced by individual i. The probability that an individual chooses alternative j is given by:

    Prob(Y_i = j) = Prob(U_ij >= U_im for all alternatives m)
Once a joint probability distribution is specified for the disturbances u_i1 ... u_iC, then the likelihood function will be determined and estimation can proceed in the usual manner. One choice of distribution would be the normal, but in this case the computations turn out to be rather time-consuming. Daniel McFadden ("Conditional Logit Analysis of Qualitative Choice Behavior," in P. Zarembka, ed., Frontiers in Econometrics, Academic Press, 1973) proposed the type I extreme value distribution as a distribution for the errors. This distribution is particularly convenient, since the probability that an individual picks alternative j is then given by:

    Prob(Y_ij = 1) = exp(beta' x_ij) / [exp(beta' x_i1) + ... + exp(beta' x_iC)]
where Y_ij is a dummy variable equal to one if individual i
chose alternative j and equal to zero otherwise.
(The same probabilities could also be generated by the unordered
logit, which demonstrates the essential connection between these two
models.) The likelihood function is then the product of the
individual choice probabilities:

    L = product over i and j of [Prob(Y_ij = 1)]^(Y_ij)
The multinomial logit log likelihood is globally concave in the
parameters and generally easy to maximize.
Alternative-specific independent variables are supplied in the IVALT subop. The user provides a label for each set of independent variables that will be used to identify the estimate of the coefficient associated with those variables in the MNL output. Using the transportation mode choice example discussed above, let mode equal zero if "bus" is chosen and equal one if "train" is chosen. We also have data on "busfare", "trainfare", "bustime" and "traintime" reflecting the costs and amount of time for each individual and each mode:
mnl dep[mode] ivalt[cost: busfare trainfare commut: bustime traintime]
The labels "cost" and "commut" are printed with the output to identify
which variables correspond to the parameter estimates. These are just
labels used for printing and have no other purpose. SST assume that the
dependent variable for the MNL
command will be coded with values
that are in the same order as the associated variable in the IVALT
subop. Thus, since the value of "mode" for "bus" is lower than that for
"train", the variables associated with "bus" should precede those for
"train" in the IVALT
subop. The number of variables following each
label in IVALT
should be the same as the number of alternatives or
categories in the DEP
variable.
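For instance, extending the example to three modes scored car = 0, bus = 1, train = 2 (the variables carfare and cartime are hypothetical), each label would be followed by three variables, ordered to match the codes:

mnl dep[mode] ivalt[cost: carfare busfare trainfare commut: cartime bustime traintime]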
User-defined models are estimated with the MLE command. The MLE command is much slower than the preprogrammed maximum likelihood procedures, so it should only be used for problems that do not fit within any of the models described above. To use the MLE command, you will need to obtain the derivatives of the log likelihood with respect to each parameter in the model and then to specify these using the DEFINE command. For purposes of exposition, we will illustrate the procedure using a normal linear regression model:
Y_i = alpha + beta x_i + u_i
where u_i has a normal distribution with mean zero and variance sigma^2. If x_i is non-random, the density for Y_i is:

    f(Y_i) = (1/sqrt(2 pi sigma^2)) exp[-(Y_i - alpha - beta x_i)^2 / (2 sigma^2)]
The log likelihood for a sample (Y_1 ... Y_n) of size n is given by:

    log L(alpha, beta, sigma^2) = -(n/2) log(2 pi) - (n/2) log(sigma^2)
                                  - (1/(2 sigma^2)) sum_i (Y_i - alpha - beta x_i)^2
The partial derivatives of log L with respect to alpha, beta, and sigma^2 are given by:

    d log L / d alpha   = sum_i (Y_i - alpha - beta x_i) / sigma^2
    d log L / d beta    = sum_i x_i (Y_i - alpha - beta x_i) / sigma^2
    d log L / d sigma^2 = -n/(2 sigma^2) + sum_i (Y_i - alpha - beta x_i)^2 / (2 sigma^4)
The maximum likelihood estimator is found by setting these partial
derivatives equal to zero and finding a solution to the equations.
In this case, of course, a closed-form solution to the likelihood
equations is available, but in general it will be necessary to
resort to an iterative non-linear procedure to solve the likelihood
equations. The method used by SST was proposed by E. R. Berndt,
B. H. Hall, R. E. Hall, and J. A. Hausman, "Estimation and Inference
in Nonlinear Structural Models," Annals of Economic and Social
Measurement (1974).
The likelihood and its derivatives are supplied to SST through the DEFINE statement. The log likelihood function is a sum of n terms, one for each observation. We specify the likelihood and its derivatives in terms of the contributions from each observation. For the example above, suppose data on the variables x and y have already been loaded into SST. We would give the commands:
define llk(a,b,s2) = -0.5*log(6.28) - 0.5*log(s2) - 0.5*(y-a-b*x)^2/s2
define ga(a,b,s2) = (y-a-b*x)/s2
define gb(a,b,s2) = x*(y-a-b*x)/s2
define gs2(a,b,s2) = -0.5/s2 + 0.5*(y-a-b*x)^2/(s2*s2)
The likelihood is specified in the LIKE subop, its derivatives in the GRAD subop, and the list of parameters in the PARM subop:
mle like[llk(a,b,s2)] grad[ga(a,b,s2),gb(a,b,s2),gs2(a,b,s2)] \
    parm[a,b,s2] start[0.0,0.0,1.0]
The order in which the gradients are given in the GRAD subop should be the same as the order in which the parameters are specified in the PARM subop. We have supplied starting values in the START subop. If no starting values are specified, SST will use zero as a starting value for each parameter by default. In this case, using zero as a starting value for the parameter s2 (sigma^2) would cause a division by zero and the MLE command would abort. Starting values are very important for the MLE command. Good starting values speed up the procedure; bad ones often cause the algorithm to fail.
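One convenient source of starting values for this example is an ordinary regression of y on x; the numbers below are illustrative stand-ins for the estimates that REG would report:

reg dep[y] ind[one x]
mle like[llk(a,b,s2)] grad[ga(a,b,s2),gb(a,b,s2),gs2(a,b,s2)] \
    parm[a,b,s2] start[0.21,1.37,0.85]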
Users can supply starting values in the START subop. The number of starting values should correspond to the number of parameters in the model. (In the case of MLE, the number of parameters is equal to the number of parameter names specified in the PARM subop. For other procedures, consult the descriptions above.) When starting values are not specified by the user, typically zeroes are used instead.
The criterion used to determine whether the algorithm has converged depends on a quadratic form in the gradient of the log likelihood function, with matrix equal to the negative of the inverse of the Hessian. The default convergence criterion is set at 0.001. To set the convergence criterion at another value, include the CONVG subop with a value of your own choosing. For example:

probit dep[y] ind[one x] convg[0.0001]

would set a more stringent criterion for convergence.
By default, the maximum likelihood routines continue for a maximum
of fifteen iterations. To increase (or decrease) the iteration
limit, include the MAXIT
subop. For example:
logit dep[y] ind[one x] maxit[50]
would allow SST to continue for up to fifty iterations.
At each iteration, SST normally picks a direction to search in the
parameter space and attempts to pick the optimum stepsize for taking
a step in this direction. Occasionally, the program will fail to
find a stepsize which will increase the likelihood and will
report a convergence failure. You may force SST to take a fixed stepsize of your own choosing at each iteration by including the STEP subop, which takes as its argument a positive number:
tobit dep[y] ind[one x] step[1.0]
With the above command, SST will use a stepsize of one at each iteration.
If SST has difficulty achieving convergence, it is a good idea to
increase the amount of information that the program prints at each
iteration using the PRT
subop. The default level of printing
is one and includes the value of the likelihood function at each
iteration, stepsize, and the convergence criteria. To have SST
print out the covariance matrix after the last iteration, increase
the print level to two. To have parameter values output at each
iteration, set the print level at three. A common problem is
incorrectly specified derivatives. SST will calculate numerical
derivatives and print them on the first iteration if you set the
print level at four. For example:
mnl dep[y] ivalt[constant: one one x: x1 x2] prt[4]
will produce the maximum available output.