The MLE procedure allows users to program their own likelihood functions. For most purposes, however, users will want to use SST procedures which have been preprogrammed for standard models, such as the LOGIT, PROBIT, TOBIT, MNL, and DURAT commands. These commands have roughly the same syntax as the REG command and make it no more difficult to estimate a probit model, for instance, than to run a regression.
The purpose of this chapter is to describe the various models which can be estimated by SST and to explain how to execute the appropriate commands. The discussion of the models is kept at a fairly simple level and references are provided for more detailed discussions about the statistical properties of the procedures.
We focus initially upon models for dichotomous dependent variables. These models are somewhat simpler to explain than models for polytomous (multiple category) dependent variables and, for some users, are all that is required. In the case of a dichotomous dependent variable, the logit and probit models will produce very similar results (the estimated coefficients should be approximately proportional to one another, as described below). For this reason, we discuss the so-called binary models together.
When the dependent variable is polytomous, the LOGIT and
PROBIT commands, as implemented in SST, will produce rather
different output. The
PROBIT command in SST is for an ordered
categorical model. This will be appropriate when the categories
of the dependent variable can be ordered in a natural way.
Survey responses often can be treated this way. For example, if
respondents are asked whether they "agree strongly," "agree
weakly," "disagree weakly," or "disagree strongly" with some
statement, then their responses can be ordered. The ordered
probit model in SST provides an alternative to arbitrary scoring
schemes of responses that would be required for regression
analysis.
The LOGIT command in SST is for unordered responses. A typical
example would be when an individual can choose from a set of discrete
alternatives. In transportation mode analysis, one may try to predict which
mode (e.g., car, bus, or train) an individual will pick. Since there is no
natural ordering of these alternatives, one approach would be to try to
estimate the probability that each alternative would be chosen. For this
purpose, the unordered logit model would be suitable. Later we also discuss
how random utility models can be used for this purpose (the MNL
command can be used for estimation of random utility models). These two
models are closely related. In fact, the
MNL command can be used to
estimate unordered logit models as well as the
LOGIT command, but it
is much more cumbersome to do it this way.
The TOBIT command is used to estimate censored regression
models and is most often applied to the analysis of expenditure
data. The DURAT command is used to estimate discrete time hazard
models (e.g., for analyzing spells of unemployment). Both are
discussed later in the chapter.
The last section of the chapter discusses the MLE command,
which computes maximum likelihood estimates for user-defined
models. The same procedure can be used for nonlinear least
squares estimation. The
MLE command differs from the previously
mentioned commands in that the user must also specify the
derivatives of the log likelihood function, while in the other
commands these derivatives have been preprogrammed. Since the
algorithm used for the MLE command is slightly different from
that used for the other commands, the options available also
differ.
If the data are continuous, rather than discrete, we simply substitute the probability density function for the probability mass function. Also, for most purposes it is easier to work with the log of the likelihood function. Since the logarithm is a monotone increasing function, log L(beta_1, ..., beta_K) and L(beta_1, ..., beta_K) will be maximized at the same values.
When the model is misspecified, then maximum likelihood estimators may lose
some of their desirable properties. However, it has been shown that under
very weak conditions, maximum likelihood estimators will still have a
well-defined probability limit and will be approximately normally
distributed. Moreover, it is still possible to compute the variance of
maximum likelihood estimators based on a random sample even if the model is
badly misspecified. The
ROBUST subop to the maximum likelihood
procedures computes standard errors for maximum likelihood estimates that
are insensitive (or "robust") to model misspecification. For details, see
Halbert White, "Maximum Likelihood Estimation of Misspecified Models",
Econometrica (January, 1982). White suggests comparing the usual
and robust estimates of the covariance matrix of maximum likelihood
estimators as a test of model specification.
Let Y_i denote the dependent variable and x_1i ... x_Ki
denote the independent variables. A reasonable choice of functional form
for the probabilities is:
Prob(Y_i = 1) = F(beta' x_i)
Prob(Y_i = 0) = 1 - F(beta' x_i)
where beta' x_i = beta_1 x_1i + ... + beta_K x_Ki and F(.) is a nondecreasing function such that:
F(-infinity) = 0 and F(+infinity) = 1
Any cumulative probability distribution function will satisfy these conditions. Two distribution functions which have often been used are the logistic:
F(t) = 1/(1 + exp(-t))
and the normal:
PHI(t) = integral from -infinity to t of (2 pi)^(-1/2) exp(-s^2/2) ds
PHI(.) is also available as a built-in SST function. In
either case, if beta' x_i is large, then the probability that Y_i
equals one is close to one. Similarly, if beta' x_i is small, then
the probability that Y_i equals one is close to zero. Whatever values
the independent variables take, the probability that Y_i equals one
will be admissible (i.e., between zero and one).
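When we specify F(.) to be logistic, then we have the
logit model. The log likelihood function for the logit model is given by:
log L = sum over i of [ Y_i log F(beta' x_i) + (1 - Y_i) log(1 - F(beta' x_i)) ]
where F(t) denotes the cumulative logistic distribution function.
When we specify F(.) to be normal, then we have the
probit model. The log likelihood in this case is given by:
log L = sum over i of [ Y_i log PHI(beta' x_i) + (1 - Y_i) log(1 - PHI(beta' x_i)) ]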
Both the logit and probit log likelihoods are globally concave and hence relatively easy to maximize using the Newton-Raphson algorithm.
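To make the Newton-Raphson idea concrete, the following Python sketch maximizes the binary logit log likelihood for a model with a constant and a single regressor. It is an illustration only and is not SST's own routine; the function names and the small data set are invented for the example.

```python
import math

def logistic(t):
    """Cumulative logistic distribution function F(t)."""
    return 1.0 / (1.0 + math.exp(-t))

def logit_newton(y, x, tol=1e-8, max_iter=25):
    """Newton-Raphson for a logit model with a constant and one regressor.

    Hypothetical helper for illustration; returns (b0, b1, iterations used).
    """
    b0, b1 = 0.0, 0.0
    for it in range(1, max_iter + 1):
        g0 = g1 = 0.0            # gradient of the log likelihood
        h00 = h01 = h11 = 0.0    # negative of the Hessian
        for yi, xi in zip(y, x):
            p = logistic(b0 + b1 * xi)
            r = yi - p
            w = p * (1.0 - p)
            g0 += r
            g1 += r * xi
            h00 += w
            h01 += w * xi
            h11 += w * xi * xi
        det = h00 * h11 - h01 * h01
        # Newton step: solve (negative Hessian) * step = gradient
        s0 = (h11 * g0 - h01 * g1) / det
        s1 = (h00 * g1 - h01 * g0) / det
        b0, b1 = b0 + s0, b1 + s1
        if abs(s0) + abs(s1) < tol:
            return b0, b1, it
    return b0, b1, max_iter

# Made-up data in which the two outcomes overlap, so the estimates are finite.
y = [0, 0, 1, 0, 1, 1, 0, 1]
x = [-2.0, -1.0, -0.5, 0.0, 0.5, 1.0, 1.5, 2.0]
b0, b1, n_iter = logit_newton(y, x)
print(b0, b1, n_iter)
```

Because the log likelihood is globally concave, the negative Hessian is positive definite whenever the regressors are not collinear, which is why the 2 by 2 system above can be solved directly at every iteration.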
The LOGIT command uses an iterative nonlinear optimization procedure to obtain parameter estimates that maximize the logit log likelihood function. You specify the dependent variable in the DEP subop and the independent variables in the IND subop in much the same way as in the REG command. Suppose, for example, that the variable y takes only two values. In our examples, we will assume y equals zero or one, but the LOGIT and PROBIT commands don't require this. Then to obtain estimates for a binary logit model, type:
logit dep[y] ind[x1 x2 x3]
As in regressions, it is advisable to include a constant term in the
IND subop, though SST does not require you to do so.
The PROBIT command produces parameter estimates that maximize
the probit log likelihood function. The syntax is identical to that for the LOGIT command:
probit dep[y] ind[x1 x2 x3]
The same advice about including a constant term in the IND subop
applies here as well.
In the binary case, LOGIT and PROBIT will produce similar results. The logistic distribution has variance (pi^2)/3 while the standard normal distribution, of course, has variance one. This means that the coefficients based on the binary logit specification should be approximately eighty percent larger than those based on the binary probit specification, since pi/sqrt(3) is about 1.8. (Multiplying the binary logit estimates by sqrt(3)/pi makes them roughly comparable to the binary probit estimates.) If the observations are highly skewed, then the choice between the logit and probit specifications may be of greater consequence, but this is a rarity in most applications.
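The rescaling factor that follows from the two variances can be checked with a few lines of Python. This is only a sketch of the arithmetic; the helper name and the coefficient values are made up for illustration.

```python
import math

# Ratio of standard deviations: logistic (pi/sqrt(3)) versus standard normal (1).
probit_to_logit = math.pi / math.sqrt(3.0)
logit_to_probit = math.sqrt(3.0) / math.pi

def rescale_logit(coefs):
    """Put binary logit coefficients on roughly the probit scale (hypothetical helper)."""
    return [b * logit_to_probit for b in coefs]

logit_coefs = [0.90, -1.81]           # made-up logit estimates
approx_probit = rescale_logit(logit_coefs)
print(probit_to_logit, logit_to_probit, approx_probit)
```

The factor is an approximation based on matching variances, so rescaled logit estimates should be expected to be close to, not identical to, the probit estimates.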
In both the LOGIT and PROBIT commands, SST allows the user to save the predicted probabilities with the PROB subop. If the dependent variable has two categories, then the predicted probability is the estimated probability that the dependent variable takes its high value. You should specify one variable name in the PROB subop. Following our example above, where y takes the values zero and one, we could give the command:
logit dep[y] ind[one x1 x2 x3] prob[phat]
SST will compute the probability y takes its high value and save these
estimated probabilities in the variable phat. The ith
observation on the variable phat would be:
phat_i = 1/(1 + exp(-b' x_i))
where b denotes the vector of estimated coefficients.
If the PROB subop is included with the PROBIT command, then
the estimated probability that y takes its high value is given by:
phat_i = PHI(b' x_i)
In either case, the estimated probabilities can be used to compute a goodness of fit statistic, the percentage of cases correctly predicted. In the binary logit and probit models, we would predict that a case would fall into the category with the modal (highest) predicted probability. That is, if Prob(Y_i=1) >= Prob(Y_i=0), then we would predict that Y_i equals one, while if Prob(Y_i=0) > Prob(Y_i=1), then we would predict Y_i equals zero. We could then compare our predictions to the actual outcomes on the dependent variable. The percent "correctly predicted" is just the fraction of cases where the actual outcome corresponds to the "predictions" described above. We describe this calculation in some detail for the probit model. The calculations for the logit command are identical, except that we initially use the
LOGIT command instead of the PROBIT command.
First, we compute maximum likelihood probit estimates and
save the probabilities in the variable phat:
probit dep[y] ind[one x1 x2] prob[phat]
Next, we create a dummy variable which equals one if the case is predicted correctly and equals zero otherwise:
set success = ((phat >= 0.5) & (y == 1)) | ((phat < 0.5) & (y == 0))
Finally, we calculate the percentage of successes:
freq var[success]
The number produced is the "percentage correctly predicted". It should be understood that while this statistic can be a useful diagnostic, it is not really a "prediction" (since the actual outcomes have been used to compute the probit or logit estimates). Furthermore, the percent correctly predicted is guaranteed to be fairly high, and will be very high if the dependent variable is skewed.
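The same calculation can be sketched outside SST. The Python fragment below applies the modal-category rule to a list of outcomes and predicted probabilities; the data are invented and the function name is hypothetical.

```python
def percent_correct(y, phat):
    """Percentage of cases whose modal predicted category matches the outcome."""
    hits = 0
    for yi, p in zip(y, phat):
        predicted = 1 if p >= 0.5 else 0   # ties go to the high category
        hits += (predicted == yi)
    return 100.0 * hits / len(y)

y = [1, 0, 1, 1, 0]                  # made-up outcomes
phat = [0.8, 0.3, 0.4, 0.5, 0.6]     # made-up predicted probabilities
pct = percent_correct(y, phat)
print(pct)
```

With these five cases the rule predicts 1, 0, 0, 1, 1, so three of the five outcomes are matched.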
As in the REG command, estimated coefficients and the variance-covariance matrix can be saved using the COEF and COVMAT subops. The user supplies a variable name in each subop, and SST stores the coefficient estimates or covariance matrix (stacked by column) in the specified variable.
In the PROBIT command, the values of beta' x_i can be saved
for each observation by specifying a variable in the PRED subop.
This provides another route for obtaining predicted probabilities, since
the probability that the dependent variable takes its high value can be
obtained by evaluating the cumulative normal distribution function at
beta' x_i. Users may find this option helpful for performing selectivity
corrections (see James Heckman, "Sample Selection Bias as a Specification
Error", Econometrica, January 1979) or simultaneous probit
estimation (see L. F. Lee, "Simultaneous Equations Models with Discrete
Endogenous Variables", in C. Manski and D. McFadden, eds., Structural
Analysis of Discrete Data with Econometric Applications, MIT Press, 1981).
This option is not available for the
LOGIT command, but the same
results can be obtained in the binary case by saving the probabilities in
the PROB subop (as described above) and then using a SET command
to convert them to log odds:
set xb = log(phat/(1.0 - phat))
There appear to be fewer situations where these values are useful in logit analysis.
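In the binary logit model the predicted probability is p = 1/(1 + exp(-beta' x)), so beta' x is recovered from p as the log odds, log(p/(1 - p)). The short Python check below illustrates the round trip; the function names and the value 0.75 are arbitrary.

```python
import math

def logistic(t):
    """Cumulative logistic distribution function."""
    return 1.0 / (1.0 + math.exp(-t))

def log_odds(p):
    """Inverse of the logistic function: recovers beta' x from a probability."""
    return math.log(p / (1.0 - p))

xb = 0.75                 # arbitrary value of beta' x
p = logistic(xb)          # predicted probability
recovered = log_odds(p)   # should reproduce xb
print(p, recovered)
```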
For concreteness we assume that we have a variable Y_i which takes
three different values, "low," "medium," and "high," scored one, two,
and three, respectively. (It does not matter to SST which values the
variable actually takes, since the
PROBIT command will temporarily
recode values from low to high as integers.) We can think of the discrete
dependent variable Y_i as being a rough categorization of a continuous,
but unobserved, variable Y_i^*.
If we observed Y_i^* we could apply standard regression methods. For example, we might assume that Y_i^* is a linear function of some independent variables x_1i ... x_Ki plus an additive disturbance u_i:
Y_i^* = beta' x_i + u_i
where u_i is normally distributed. Since the scale of Y_i^* cannot be determined, there is no loss of generality in assuming that the variance of u_i is equal to one; this represents an arbitrary normalization. Next, we specify the relationship between the categories of Y_i and the values of Y_i^*:
Y_i = 1 if Y_i^* < 0
Y_i = 2 if 0 <= Y_i^* < mu
Y_i = 3 if Y_i^* >= mu
The "threshold value" mu is a parameter to be estimated, as are the unknown coefficients beta_1 ... beta_K. Setting the threshold between the "low" and "medium" categories equal to zero is arbitrary, but inconsequential if one of the independent variables is constant. In fact, it is not possible to estimate the coefficient of a constant term (the intercept) and two thresholds in the three category case: a shift in the intercept cannot be distinguished from a shift in the thresholds.
In the general case of C categories, there will be C-2 thresholds to estimate. SST will always set the threshold between the lowest and next lowest categories equal to zero. Moreover, the threshold values must be ordered from lowest to highest. If at some point during the estimation procedure the thresholds get out of order, the maximization algorithm perturbs the estimates enough to put them back into ascending order.
The probability that Y_i falls into the jth category is given by:
Prob(Y_i = j) = PHI(mu_(j+1) - beta' x_i) - PHI(mu_j - beta' x_i)
where mu_j and mu_(j+1) denote the lower and upper threshold values for category j. If j is the low category, then the lower threshold value is -infinity and the upper threshold value is zero. If j is the high category, the upper threshold value is +infinity. The log likelihood function is the sum of the individual log probabilities:
log L = sum over i of log Prob(Y_i = j_i), where j_i is the category observed for case i.
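The category probabilities are differences of the normal distribution function evaluated at adjacent thresholds. The Python sketch below computes them for a given value of beta' x; it assumes the first interior threshold is zero, as in the text, and the function names are hypothetical.

```python
import math

def norm_cdf(t):
    """Standard normal cumulative distribution function PHI(t)."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def ordered_probit_probs(xb, thresholds):
    """Probabilities of each ordered category, from low to high.

    thresholds: interior cut points in ascending order; the first is 0.0.
    """
    cuts = [-math.inf] + list(thresholds) + [math.inf]
    return [norm_cdf(cuts[j + 1] - xb) - norm_cdf(cuts[j] - xb)
            for j in range(len(cuts) - 1)]

# Three categories: thresholds 0 and mu = 1, evaluated at beta' x = 0.
probs = ordered_probit_probs(0.0, [0.0, 1.0])
print(probs)
```

Since the outermost cut points are minus and plus infinity, the probabilities always sum to one regardless of the value of beta' x.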
The ordered probit model is estimated with the PROBIT command. The model described above is a straightforward generalization of the binary probit model and, not surprisingly, the same syntax works for the ordered probit model:
probit dep[y] ind[one x1 x2]
SST counts the number of categories in the dependent variable
and automatically orders the categories from "low" to "high".
Optional subops are the same for the ordered probit model as the
binary probit model with only slight modification. The number of
probabilities produced by the ordered probit model is equal to
one less than the number of categories in the dependent variable.
The omitted category is the low category. If the dependent
variable takes three values, then the user should supply two
variables in the
PROB subop, and SST will store the estimated
probabilities that the dependent variable falls into the middle
and high categories in the specified variables:
probit dep[y] ind[one x1 x2] prob[p2 p3]
The probability that the dependent variable falls into the low
category can be obtained using a SET command:
set p1 = 1.0 - p2 - p3
A similar calculation can be made for the percentage of cases correctly predicted in the three alternative case:
set success=1; if[(p1 >= p2) & (p1 >= p3)]
set success=2; if[(p2 > p1) & (p2 >= p3)]
set success=3; if[(p3 > p1) & (p3 > p2)]
set success=(success==y)
freq var[success]
The more alternatives, the more tedious this calculation becomes. Typing can be reduced by defining a macro to perform this procedure (see Chapter 8).
The PRED subop can be used to compute beta' x_i. The
threshold values are not considered part of
beta' x_i and only one variable
should be specified in the
PRED subop regardless of the number of
categories of the dependent variable. Coefficients and the
covariance matrix can be stored using the COEF and COVMAT subops.
The thresholds are stored as the last elements of the coefficient
vector and correspond to the last rows and columns of the
covariance matrix. The
ROBUST subop also is allowed for the
ordered probit model.
The unordered logit model has attained some popularity in sociology because of its usefulness in modelling interactions in contingency tables. If the independent variables consist solely of dummy variables for categories of other discrete variables and various interaction terms, then the unordered logit model corresponds to the log-linear model for contingency table analysis. See S. Fienberg, The Analysis of Cross-Classified Categorical Data, MIT Press, 1981, for an introductory treatment of these models and related issues.
The likelihood function for the unordered logit model is given by the
product of the probabilities for each case taking its observed value:
L = product over i of [ exp(beta_(Y_i)' x_i) / sum over all categories k of exp(beta_k' x_i) ]
where beta_0 is a K vector of zeroes and each of the remaining beta_j is a K vector of parameters to be estimated. Since the number of parameters grows with the number of categories in the dependent variable, users may want to consider grouping categories of the dependent variable which do not occur frequently.
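The normalization of beta_0 to a vector of zeroes can be seen in a short Python sketch of the category probabilities. The function name and the numbers are illustrative only; category 0 is the base category whose coefficients are fixed at zero.

```python
import math

def unordered_logit_probs(x, betas):
    """Probabilities for categories 0, 1, ..., given coefficient vectors
    for categories 1 and up. Category 0 has beta_0 = 0 by normalization."""
    scores = [0.0] + [sum(b * v for b, v in zip(beta, x)) for beta in betas]
    m = max(scores)                       # subtract the max for numerical stability
    expo = [math.exp(s - m) for s in scores]
    total = sum(expo)
    return [e / total for e in expo]

# Two categories beyond the base, two independent variables (constant and x1).
p_equal = unordered_logit_probs([1.0, 2.0], [[0.0, 0.0], [0.0, 0.0]])
p_two = unordered_logit_probs([1.0, 5.0], [[1.0, 0.0]])
print(p_equal, p_two)
```

With C categories and K independent variables the list `betas` holds (C - 1) vectors of length K, which is the sense in which the number of parameters grows with the number of categories.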
logit dep[y] ind[one x1 x2]
Probabilities can be saved by specifying variable names in the
PROB subop. The number of variable names specified should be one
less than the number of categories in the dependent variable.
Coefficients can be saved with the COEF subop. Coefficients are
grouped according to which log odds they pertain to. If there
are K independent variables, the first K coefficients correspond
to the log odds of choosing category 1 versus category 0, the next K
coefficients to the log odds of category 2 versus category 0, and so
on. The covariance matrix, which can be saved with the COVMAT
subop, is ordered in the same way. ROBUST is also an allowable
subop.
The Tobit model also fits within the general regression framework. Let
Y_i^* denote the unobserved latent variable which is subject to
censoring. We assume that Y_i^* is generated by a linear combination of
the independent variables x_1i ... x_Ki plus an error u_i
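which is normally distributed with mean zero and variance sigma^2:
Y_i^* = beta' x_i + u_i
Y_i^* is observed only if Y_i^* exceeds some threshold. Again there is no loss of generality in assuming that the threshold equals zero since any shift in the threshold value can be absorbed into the intercept. Let Y_i denote the observed variable and suppose censored values of Y_i^* have been coded to zero (in the
TOBIT command, unlike the
other limited dependent variable commands, SST does care how the
variable is coded). Then:
Y_i = Y_i^* if Y_i^* > 0, and Y_i = 0 otherwise
The probability that Y_i^* < 0 is given by:
Prob(Y_i^* < 0) = 1 - PHI(beta' x_i / sigma)
The conditional distribution of Y_i^* given that Y_i^* > 0 is given by the following probability density function:
f(Y_i^* | Y_i^* > 0) = phi((Y_i^* - beta' x_i)/sigma) / (sigma PHI(beta' x_i / sigma))
where phi(.) denotes the standardized normal density function:
phi(t) = (2 pi)^(-1/2) exp(-t^2/2)
Combining the above expressions, we obtain the sample log likelihood function:
log L = sum over censored cases (Y_i = 0) of log(1 - PHI(beta' x_i / sigma)) + sum over uncensored cases (Y_i > 0) of [ log phi((Y_i - beta' x_i)/sigma) - log sigma ]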
The difficulty in maximizing the Tobit log likelihood occurs because of the presence of sigma^2. SST computes a starting value for sigma^2 based on user-supplied starting values for beta_1 ... beta_K. If you do not supply starting values, SST computes an initial regression to obtain better starting values. Poor starting values can cause the Newton-Raphson algorithm to diverge more frequently in Tobit than in most other maximum likelihood procedures.
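As an illustration of how the censored and uncensored cases enter the Tobit log likelihood, the Python sketch below evaluates it for given values of beta' x_i and sigma. It assumes censoring at zero with censored values coded as zero; the function names are hypothetical and the code is not SST's implementation.

```python
import math

def norm_cdf(t):
    """Standard normal cumulative distribution function."""
    return 0.5 * (1.0 + math.erf(t / math.sqrt(2.0)))

def norm_pdf(t):
    """Standard normal density function."""
    return math.exp(-0.5 * t * t) / math.sqrt(2.0 * math.pi)

def tobit_loglik(y, xb, sigma):
    """Tobit log likelihood; y is the observed variable, xb holds beta' x_i."""
    total = 0.0
    for yi, xbi in zip(y, xb):
        if yi <= 0.0:
            # censored case: probability that the latent variable is below zero
            total += math.log(1.0 - norm_cdf(xbi / sigma))
        else:
            # uncensored case: normal density of the observed value
            total += math.log(norm_pdf((yi - xbi) / sigma)) - math.log(sigma)
    return total

ll = tobit_loglik([0.0, 1.0], [0.0, 1.0], 1.0)
print(ll)
```

The division by sigma in both branches is what makes the function sensitive to poor starting values for the disturbance variance.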
The TOBIT command maximizes the above log likelihood using an iterative non-linear optimization algorithm. If censored values of the variable y have been coded equal to zero, Tobit estimation can be accomplished by giving the command:
tobit dep[y] ind[one x1 x2]
The options to the TOBIT command are similar to those for the other
limited dependent variable commands. The user may specify a variable
for the censoring probabilities (1 - PHI(beta' x_i)) in the
PROB subop, while beta' x_i can be stored in a variable
specified in the PRED subop:
tobit dep[y] ind[one x1 x2] prob[phat] pred[xb]
Coefficient estimates can be saved using the
COEF subop and the
covariance matrix with the
COVMAT subop. The coefficients in the
TOBIT model include, in addition to
beta_1 ... beta_K, the
disturbance variance sigma^2, which SST automatically
appends to the end of the
coefficient vector and as the last row and column of the covariance
matrix.
Further details on Tobit and related models may be found in a survey paper by T. Amemiya, "Tobit Models: A Survey," Journal of Econometrics (1984).
The probability that an individual chooses alternative j is given by:
Once a joint probability distribution is specified for the disturbances u_i1 ... u_iC, then the likelihood function will be determined and estimation can proceed in the usual manner. One choice of distribution would be the normal, but in this case the computations turn out to be rather time-consuming. Daniel McFadden ("Conditional Logit Analysis of Qualitative Choice Behavior," in P. Zarembka, ed., Frontiers in Econometrics, Academic Press, 1973) proposed the type I extreme value distribution as a distribution for the errors. This distribution is particularly convenient, since the probability that an individual picks alternative j is then given by:
where Y_ij is a dummy variable equal to one if individual i chose alternative j and equal to zero otherwise. (The same probabilities could also be generated by the unordered logit, which demonstrates the essential connection between these two models.) The likelihood function is then the product of the individual choice probabilities:
The multinomial logit log likelihood is globally concave in the parameters and generally easy to maximize.
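For reference, the choice probability and log likelihood described above can be written out as follows. This is a sketch under assumed notation: x_{ij} stands for the vector of attributes of alternative j facing individual i and beta for the common coefficient vector, neither of which is spelled out in the surrounding text; C is the number of alternatives and n the number of individuals.

```latex
\[
P_{ij} = \Pr(Y_{ij} = 1)
       = \frac{\exp(\beta' x_{ij})}{\sum_{k=1}^{C} \exp(\beta' x_{ik})},
\qquad
\log L = \sum_{i=1}^{n} \sum_{j=1}^{C} Y_{ij} \log P_{ij}.
\]
```

Because exactly one Y_{ij} equals one for each individual, the inner sum simply picks out the log probability of the alternative actually chosen.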
The independent variables for the MNL command are specified in the IVALT subop. The user provides a label for each set of independent variables that will be used to identify the estimate of the coefficient associated with those variables in the MNL output. Using the transportation mode choice example discussed above, let mode equal zero if "bus" is chosen and equal one if "train" is chosen. We also have data on "busfare", "trainfare", "bustime" and "traintim" reflecting the costs and amount of time for each individual and each mode:
mnl dep[mode] ivalt[cost: busfare trainfare commut: bustime traintim]
The labels "cost" and "commut" are printed with the output to identify
which variables correspond to the parameter estimates. These are just
labels used for printing and have no other purpose. SST assumes that the
dependent variable for the MNL command will be coded with values
that are in the same order as the associated variables in the IVALT
subop. Thus, since the value of "mode" for "bus" is lower than that for
"train", the variables associated with "bus" should precede those for
"train" in the IVALT subop. The number of variables following each
label in the IVALT subop should be the same as the number of alternatives or
categories in the dependent variable.
The MLE command is much slower than the preprogrammed maximum likelihood procedures, so it should only be used for problems that do not fit within any of the models described above.
To use the
MLE command, you will need to obtain the derivatives
of the log likelihood with respect to each parameter in the model and
then to specify these using the
DEFINE command. For purposes of
exposition, we will illustrate the procedure using a normal linear
regression model:
Y_i = alpha + beta x_i + u_i
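where u_i has a normal distribution with mean zero and variance
sigma^2. If x_i is non-random, the density for Y_i is:
f(Y_i) = (2 pi sigma^2)^(-1/2) exp(-(Y_i - alpha - beta x_i)^2 / (2 sigma^2))
The log likelihood for a sample (Y_1 ... Y_n) of size n is given by:
log L = -(n/2) log(2 pi) - (n/2) log(sigma^2) - (1/(2 sigma^2)) sum over i of (Y_i - alpha - beta x_i)^2
The partial derivatives of log L with respect to alpha, beta, and sigma^2 are given by:
d log L / d alpha = (1/sigma^2) sum over i of (Y_i - alpha - beta x_i)
d log L / d beta = (1/sigma^2) sum over i of x_i (Y_i - alpha - beta x_i)
d log L / d sigma^2 = -n/(2 sigma^2) + (1/(2 sigma^4)) sum over i of (Y_i - alpha - beta x_i)^2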
The maximum likelihood estimator is found by setting these partial derivatives equal to zero and finding a solution to the equations. In this case, of course, a closed-form solution to the likelihood equations is available, but in general it will be necessary to resort to an iterative non-linear procedure to solve the likelihood equations. The method used by SST was proposed by E. R. Berndt, B. H. Hall, R. E. Hall, and J. A. Hausman, "Estimation and Inference in Nonlinear Structural Models," Annals of Economic and Social Measurement (1974).
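The method of Berndt, Hall, Hall, and Hausman replaces the Hessian with the sum of outer products of the per-observation gradients. The Python sketch below takes one such step, with step halving, for the normal linear regression example. It illustrates the general idea only and is not SST's implementation; the helper names and the data are invented.

```python
import math

def contributions(y, x, a, b, s2):
    """Per-observation log likelihood and gradient with respect to (a, b, s2)."""
    out = []
    for yi, xi in zip(y, x):
        e = yi - a - b * xi
        llk = -0.5 * math.log(2 * math.pi) - 0.5 * math.log(s2) - 0.5 * e * e / s2
        grad = [e / s2, xi * e / s2, -0.5 / s2 + 0.5 * e * e / (s2 * s2)]
        out.append((llk, grad))
    return out

def solve(A, b):
    """Solve A z = b by Gaussian elimination with partial pivoting."""
    n = len(b)
    M = [row[:] + [bi] for row, bi in zip(A, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(M[r][c]))
        M[c], M[p] = M[p], M[c]
        for r in range(c + 1, n):
            f = M[r][c] / M[c][c]
            for k in range(c, n + 1):
                M[r][k] -= f * M[c][k]
    z = [0.0] * n
    for r in reversed(range(n)):
        z[r] = (M[r][n] - sum(M[r][k] * z[k] for k in range(r + 1, n))) / M[r][r]
    return z

def bhhh_step(y, x, theta):
    """One outer-product-of-gradients update; returns (new_theta, new_loglik)."""
    cont = contributions(y, x, *theta)
    ll0 = sum(l for l, _ in cont)
    g = [sum(gr[j] for _, gr in cont) for j in range(3)]
    B = [[sum(gr[j] * gr[k] for _, gr in cont) for k in range(3)] for j in range(3)]
    d = solve(B, g)                      # search direction
    step = 1.0
    while step > 1e-10:                  # halve the step until the likelihood rises
        new = [t + step * di for t, di in zip(theta, d)]
        if new[2] > 0:                   # keep the variance positive
            ll1 = sum(l for l, _ in contributions(y, x, *new))
            if ll1 > ll0:
                return new, ll1
        step /= 2.0
    return theta, ll0

x = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0]       # made-up data
y = [1.1, 2.9, 5.2, 6.8, 9.1, 10.9]
theta0 = [0.0, 0.0, 1.0]
ll_start = sum(l for l, _ in contributions(y, x, *theta0))
theta1, ll1 = bhhh_step(y, x, theta0)
print(ll_start, ll1, theta1)
```

Repeating the step until the gradient is negligible gives the iterative procedure; only first derivatives are needed, which is why the user supplies gradients but not second derivatives.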
The log likelihood and its derivatives are specified using the DEFINE statement. The log likelihood function is a sum of n terms, one for each observation. We specify the likelihood and its derivatives in terms of the contributions from each observation. For the example above, suppose data on the variables x and y have already been loaded into SST. We would give the commands:
define llk(a,b,s2) = -0.5*log(6.28) - 0.5*log(s2) - 0.5*(y-a-b*x)^2/s2
define ga(a,b,s2) = (y-a-b*x)/s2
define gb(a,b,s2) = x*(y-a-b*x)/s2
define gs2(a,b,s2) = - 0.5/s2 + 0.5*(y-a-b*x)^2/(s2*s2)
The log likelihood is specified in the LIKE subop, its derivatives in the GRAD subop, and a list of parameters in the PARM subop:
mle like[llk(a,b,s2)] grad[ga(a,b,s2),gb(a,b,s2),gs2(a,b,s2)] \
    parm[a,b,s2] start[0.0,0.0,1.0]
The order in which the gradients are given in the
GRAD subop should
be the same as the order in which the parameters are specified in the
PARM subop. We have supplied starting values in the START
subop. If no starting values are specified, SST will use zero as a
starting value for each parameter by default. In this case, using zero
as a starting value for the parameter
s2 (sigma^2) would cause
a division by zero and the
MLE command would abort. Starting values
are very important for the
MLE command. Good starting values
speed up the procedure; bad ones often cause the algorithm to fail.
Users can supply
starting values in the
START subop. The number of starting values
should correspond to the number of parameters in the model. (In
the case of
MLE, the number of parameters is equal to the number
of parameter names specified in the
PARM subop. For other
procedures, consult the description above.) When starting values
are not specified by the user, typically zeroes are used instead.
The criterion used to determine whether the algorithm has converged
depends on a quadratic form in the gradient of the log likelihood
function with matrix equal to the negative of the inverse of the Hessian.
The default convergence criterion is set at 0.001. To set the
convergence criterion at another value, include the CONVG
subop with a value of your own choosing. For example:
probit dep[y] ind[one x] convg[0.0001]
would set a more stringent criterion for convergence.
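The quadratic form behind the convergence test can be sketched in Python for a two-parameter model. The function name and the numbers are made up; the comparison with 0.001 mirrors the default described above, and the exact scaling SST applies is not specified here.

```python
def convergence_measure(grad, neg_hessian):
    """Quadratic form g' (-H)^(-1) g for a two-parameter model."""
    (a, b), (c, d) = neg_hessian
    det = a * d - b * c
    inv = [[d / det, -b / det], [-c / det, a / det]]   # inverse of the 2 by 2 matrix
    g0, g1 = grad
    return (g0 * (inv[0][0] * g0 + inv[0][1] * g1)
            + g1 * (inv[1][0] * g0 + inv[1][1] * g1))

grad = [0.01, -0.02]                       # made-up gradient near the maximum
neg_hessian = [[2.0, 0.0], [0.0, 4.0]]     # made-up negative Hessian
measure = convergence_measure(grad, neg_hessian)
converged = measure < 0.001
print(measure, converged)
```

The measure is small when the gradient is small relative to the curvature of the log likelihood, so it does not depend on the units in which the independent variables are measured.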
By default, the maximum likelihood routines continue for a maximum
of fifteen iterations. To increase (or decrease) the iteration
limit, include the
MAXIT subop. For example:
logit dep[y] ind[one x] maxit[50]
would allow SST to continue for up to fifty iterations.
At each iteration, SST normally picks a direction to search in the
parameter space and attempts to pick the optimum stepsize for taking
a step in this direction. Occasionally, the program will fail to
find a stepsize which will increase the likelihood and will
report a convergence failure. You may force SST to take a fixed
stepsize of your own choosing at each iteration by including the
STEP subop which takes as its argument a positive number:
tobit dep[y] ind[one x] step[1.0]
With the above command, SST will use a stepsize of one at each iteration.
If SST has difficulty achieving convergence, it is a good idea to
increase the amount of information that the program prints at each
iteration using the
PRT subop. The default level of printing
is one and includes the value of the likelihood function at each
iteration, stepsize, and the convergence criterion. To have SST
print out the covariance matrix after the last iteration, increase
the print level to two. To have parameter values output at each
iteration, set the print level at three. A common problem is
incorrectly specified derivatives. SST will calculate numerical
derivatives and print them on the first iteration if you set the
print level at four. For example:
mnl dep[y] ivalt[constant: one one x: x1 x2] prt[4]
will produce the maximum available output.
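The kind of check that numerical derivatives make possible can be reproduced by hand. The Python sketch below compares the analytic derivatives from the normal regression example with central finite differences at a single observation; the function names, the evaluation point, and the step size are assumptions chosen for illustration.

```python
import math

def llk(a, b, s2, y, x):
    """Log likelihood contribution of one observation."""
    e = y - a - b * x
    return -0.5 * math.log(2 * math.pi) - 0.5 * math.log(s2) - 0.5 * e * e / s2

def analytic_grad(a, b, s2, y, x):
    """Derivatives with respect to a, b, and s2, as in ga, gb, and gs2."""
    e = y - a - b * x
    return [e / s2, x * e / s2, -0.5 / s2 + 0.5 * e * e / (s2 * s2)]

def numeric_grad(a, b, s2, y, x, h=1e-6):
    """Central finite-difference approximation to the same derivatives."""
    theta = [a, b, s2]
    out = []
    for j in range(3):
        up, dn = theta[:], theta[:]
        up[j] += h
        dn[j] -= h
        out.append((llk(*up, y, x) - llk(*dn, y, x)) / (2 * h))
    return out

ga = analytic_grad(0.5, 1.5, 2.0, y=3.0, x=1.0)
gn = numeric_grad(0.5, 1.5, 2.0, y=3.0, x=1.0)
max_diff = max(abs(p - q) for p, q in zip(ga, gn))
print(ga, gn, max_diff)
```

A large discrepancy between the two sets of numbers usually points to a sign error or a missing factor in one of the user-supplied derivatives.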