SST, short for Statistical Software Tools, performs a large (and ever expanding) variety of statistical functions. These functions include the entering, editing, transforming, and recoding of data that every statistical package must have. Beyond data manipulation and frequently used statistical procedures (such as regression analysis), SST is geared toward the estimation of complicated statistical models. It is SST's ability to handle difficult estimation problems -- and handle them relatively quickly -- that distinguishes it from its competitors.
The purpose of this tutorial is to give the new user a short guided tour through SST. We do not attempt to cover any advanced features here, but after spending twenty or thirty minutes with the tutorial, you should be able to run regressions using SST and understand the flavor of the program.
A bit of background. SST was the idea of two econometricians at the California Institute of Technology, Jeff Dubin and Doug Rivers. The bulk of the programming was carried out by a group of present and former Caltech students (Bob Lord, Richard Murray, Steve Beccue, Dave Agabra, Carl Lydick, Dave Beccue, and others). SST, however, is not supported by or in any way affiliated with the California Institute of Technology. SST was designed with four primary considerations in mind:
If SST meets these four goals -- simple to learn and use commands, powerful estimation capabilities, speedy operation, and complete portability and compatibility of versions between different operating systems, then we think our effort will have been successful. This does not mean that SST will satisfy everyone's statistical needs. We wrote it to satisfy our own needs, not some nonexistent general user. But if your needs are at all like our own, then we think you will be satisfied with the result.
We will start with SST initially in interactive mode. To invoke SST from MS-DOS, issue the command:
C>sst
SST loads quickly (under five seconds on an IBM PC/AT). You will be greeted with the following response:
SST - Statistical Software Tools - Version 1.0 Copyright 1985,1986 by J.A. Dubin and R.D. Rivers SST1>
The string SST1>
(not shown in the examples below) is the
SST prompt. Whenever it
appears, SST is ready to receive your commands. SST doesn't need to
know the size of your datasets, but you can save the program some work
if you tell it the maximum number of observations (or cases) that you
expect to be working with. We will initially specify a limit of one
hundred observations:
range obs[1-100]
The RANGE
statement doesn't commit us to 100 observations. We can
always issue a new RANGE
statement increasing the number of
observations, but by limiting the sample this way we tell SST not to waste
its time (and ours) by worrying about data vectors longer than 100
observations. Failure to issue the range statement will slow processing and
may lead to out of memory conditions.
The RANGE
statement also illustrates the standard format of SST
commands. Each SST command has a name, such as `range', which should be
typed first. The command is modified by subops (in this case, the
OBS
subop informs SST about which observations are active). Most
subops take arguments, enclosed with square brackets []. The OBS
subop, for example, takes a list of observations as its argument (in the
above example 1-100
or, equivalently if you really like to type,
1 2 3-8 9 10 11-99 100
). There are also subops which don't take
arguments. One that works with nearly every command is the TIME
subop. Including the TIME
subop in the RANGE
command, as
follows:
range obs[1-100] time
would cause SST to print the elapsed time between the time it received a
command and it finished executing it. (If you want a thrill, try timing a
statistical procedure using your old statistical package and then try the
same thing with SST using the TIME
subop.)
For a listing of available SST commands, type:
help
and SST responds with a complete list of command names. If you don't know or remember the syntax of an SST command, type `help' followed by the command name. For example:
help range
produces a summary of the syntax of the range statement:
RANGE {OBS[observation list]2 {IF[expression]2 {TIME2
Keywords are printed in upper case. SST doesn't care whether you type
keywords in upper case or lower case, but you may not abbreviate or
misspell the keywords. Optional subops are enclosed in braces (`{' and
`}'). Don't type the braces (they have a special meaning in SST).
Arguments to subops are described in lower case letters. For example, the
syntax summary reminds you that the OBS
subop takes
a list of observations
as its argument. If you need more information about a command beyond that
provided by the HELP
command, consult the SST User's Guide or
Reference Manual.
Congratulations. If you have followed up to this point, you have successfully executed some SST commands. Of course, nothing useful has been accomplished yet. Now, we're ready to get down to work.
There are two basic ways to organize data in a text file. If you have ten
observations on five variables, one way is to first list the values for
each of the five variables for the first observation, followed by the
values of the five variables for the second observation, and so forth. This
is called listing by observation, and is the default in SST.
The other way is to first list all ten
observations for the first variable, followed by all ten observations for
the second variable, and so forth. This way is called listing
by variable or,
naturally enough, BYVAR
in SST.
The last thing to do before giving SST your data file is to think up names
for your variables. SST names are limited to a maximum of eight characters
and can be composed of alphabetic characters, digits, and the underscore
character (`_'). For illustration, suppose the file data.raw
contains the following numbers:
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15
(Not a very interesting dataset, but it will do for now.) Further, suppose
that these data represent five observations on each of three variables and
are organized by observation. We have chosen the rather uninspired names
var1
, var2
, and var3
for the variables. The following
command would have SST read the data:
read to[var1 var2 var3] file[data.raw]
The TO
subop tells SST to create three variables with the assigned
names and then to read data on these variables from the file specified in
the FILE
subop. Note that variable names in the TO
subop were
separated by spaces. We could just as well have used commas to separate the
variable names. SST now has the following data in memory:
obsno var1 var2 var3 1 1 2 3 2 4 5 6 3 7 8 9 4 10 11 12 5 13 14 15
You did not create the variable obsno (SST did this automatically), but you can refer to it just like any other variable.
If we had told SST instead that the data were organized by variable:
read to[var1 var2 var3] file[data.raw] byvar
then SST would have the following data in memory:
obsno var1 var2 var3 1 1 6 11 2 2 7 12 3 3 8 13 4 4 9 14 5 5 10 15
The BYVAR
subop does not take any arguments.
Now that your data is in SST, you may want to save it in a form that will
allow you to skip the READ
command when you use the data again. SST has
its own compact format for saving data files. These files have the
advantage that they can be read very quickly and that all information about
a variable (labels, missing values, date last modified) is stored with the
data values in an SST system file. If you give the command:
save file[olddata]
SST will save var1, var2, and var3 in the file newdata.sav
. Note
that SST
automatically adds the extension `.sav' to system files, unless you
specify some other extension in the FILE
subop. If, for some
reason, you only want to save variables selectively, add the VAR
subop specifying which variables are to be saved:
save file[newdata] var[var1 var2]
The file newdata.sav
would differ from olddata.sav
only in that
var3 was excluded from the former and included in the latter.
Saving data does not remove that data from memory. It is still available
for use during the current session. It is also a simple matter to reenter
saved data into SST. Just issue the command
LOAD
:
load file[newdata]
As with the SAVE
command, the FILE
subop of the LOAD
command will assume the file extension `.sav' unless some other extension is
specified. The SST system file newdata.sav
contains
five observations on
each of the variables var1 and var2 which would again be available for data
analysis.
There are other ways to enter data into SST. The ENTER
command
(described in the User's Guide) is convenient for entering small amounts of
data from the keyboard. SST can also interchange data from other programs
such as dBASE II and VisiCalc by simply specifying the format of the file
to be read (DB2 or DIF). This feature is covered in more detail in the
User's Guide.
faminc
) might be the
income of families in a sample of households, but the relevant variable
might be the logarithm of family income. We can use the SET statement to
create a new variable loginc
:
set loginc = log(faminc)
SST has a wide variety of functions including, as we will explain below, ones you define yourself. A few of the most commonly used functions are:
log() natural logarithm exp() exponentiation sqrt() square roots abs() absolute value cumnorm() cumulative normal probabilities phi() normal probability density function bvnorm() bivariate normal probabilities
SST also can perform all the standard arithmetic operations. For example:
set z = (x-y)\^3 / sqrt(x*y)
would be equivalent to:
(x - y)^3 z = --------- sqrt(xy)
The rules of precedence for evaluating complex expressions are standard, but if you have any doubts about what SST will do, add extra parentheses to be on the safe side.
It is also possible to perform conditional transformations. For
example, suppose our dataset has three variables: hinc
(husband's
income), winc
(wife's income), and head
(a dummy variable which
takes the value 1 if the wife is classified as head of household and the
value 0 if the husband is classified as the head of household). To
form a new variable which is the income of the head of household, we
could issue two set statements:
set headinc = winc; if[head==1] set headinc = hinc; if[head==0]
We have introduced a new subop, IF
, which controls which
observations the command will be applied to. The argument to the IF
subop is a logical expression, which can either be true or false. If the
expression in the IF
subop is true for a particular observation in
the active sample range, then the set statement will be performed for that
observation; otherwise that observation will be skipped.
Note in the above SET
statement that the transformation was set off
from the IF
subop with a semicolon (`;'). The semicolon is required
if some additional subops appear in the command, but otherwise may be
omitted.
Logical expressions in the IF
subop can use any of the following
relational operators:
== equals to > greater than >= greater than or equals to < less than <= less than or equals to != not equals to
Logical expressions can also be made fairly complex by using some additional operators. Any of the standard arithmetic operators can be used in logical expressions, as well as the following logical operators:
& and | or (nonexclusive) ! negation
For example, to set faminc
equal to the combined income of the
husband and wife only for those families with a male head and
combined income of less than $25,000:
set faminc = hinc+winc; if[(head==1) & (hinc+winc$<$25000)]
Also, if the IF
subop does not contain a relational operator, the
logical expression is evaluated numerically and values of one are
interpreted as being true. Thus, since head
is a dummy variable
(taking values of one and zero), the following SET
statement will
work on only the one values (female
head
):
set faminc = winc; if[head]
while the following SET
statement will only work on the zero
values (male
head
):
set faminc = hinc; if[!head]
This feature means that the IF
subop can be used like the Boolean
subop found in some other statistical packages.
It is also possible to modify the operation of the SET
statement
with the OBS
subop, which restricts the range of observations on
which the operation will be performed. For example, to transform only the
first ten observations:
set loginc = log(faminc); obs[1-10]
The OBS
subop can be combined with the IF
subop if further
control over the range of the transformation is desired. The OBS
and
IF subops only modify the active sample range for the particular command
they are issued with. Afterwards, the sample range returns to whatever it
was previously (as determined by the RANGE
statement, if issued).
If a particular transformation is going to be used over and over again, it
is simpler to define a function which will perform this transformation
using the DEFINE
statement. Sociologists frequently "standardize"
their data to have mean zero and variance one, and the resulting variable
is sometimes called a z-score. If you do this often, it is probably
worthwhile to define a z-score function:
define zscore(x,meanx,varx) = (x - meanx)/sqrt(varx)
In the above expression, x
, meanx
, and varx
are
parameters that the user can supply when needed, rather than existing
variables in memory. Later, to standardize a variable y
which has
mean -5.33 and variance 8.85, we could issue the following set statement:
set zy = zscore(y,-5.33,8.85)
Then zy
would be the standardized version of y
.
SST also has the ability to recode data. The RECODE
command uses the
MAP
subop to provide a list of values of the old variable which will
be recoded into a new variable. Sometimes, for example, we might want to
have a dummy variable coded +1 and -1 instead of 1 and 0. We could recode
the variable head by issuing a SET
statement:
set newhead = -1; if[head==0]
The variable newhead
equals 1 if the wife is head of household and
-1 if the husband is head of household. The same task could be accomplished
using the RECODE
command:
recode var[head] map[1=1 0=-1] to[newhead]
Three new subops have been encountered in the RECODE
command. The
VAR
subop takes a list of one or more variables which will be
operated upon. The TO
subop takes a list of one or more
variables which will be created by the procedure. In the above example,
both the VAR
and TO
subops were given only one variable, but
it would be possible to recode several variables simultaneously. The number
of variables specified in the VAR
subop must always equal the number of
variables specified in the TO
subop. The MAP
subop in the
example
tells SST how to recode the variable head
into the new variable
newhead
. When head
equals 1, newhead
will equal 0,
and when head
equals 0, newhead
will equal -1. Actually, the
assignment of the value 1 to newhead
when head
equals 1 is
redundant, since values of head
which are not specified in the MAP
subop are automatically written to the new variable. Thus the last
RECODE
statement is equivalent to:
recode var[head] map[0 = -1] to[newhead]
The TO
subop is optional in the RECODE
statement. If
the TO
subop is omitted, the new variable is written over the old
variable. When this is done, the old data is lost.
LIST
command:
list
LIST
can be used with some subops to obtain different information,
but we will not discuss them here. We will assume that variables x
,
y
, and z
are now stored in memory. It is always a good
practice to examine the data carefully to see that the data are what you
think they should be. Descriptive statistics can be calculated using the
COVA
command:
cova var[x y z]
COVA
now prints the means, standard deviations, minimum and maximum
values of each variable in the VAR
subop. It is possible to
restrict the amount of output that COVA
prints by specifying one or
more of the following subops:
MEAN STD MIN MAX COV
If any of these subops is included with the COVA
statement, only the
requested information will be printed. To get only the means and standard
deviations of the variables x
and y
, give the command:
cova var[x y] mean std
Using the COV
subop will display the correlation/covariance matrix.
The correlation/covariance matrix contains variances of the variables on
the diagonal, correlations above the diagonal, and covariances below the
diagonal.
Some users find typing subop names tedious and prefer to have SST prompt
them for subop arguments and options. If you type a command such as
COVA
which requires one or more subops, SST will prompt you for missing
subops. For example, type:
cova
and SST will respond:
VAR[]:
You now type the list of variables for which you desire descriptive statistics:
VAR[]: x y
In using subop prompting you do not enclose subop arguments in brackets.
Since VAR
is the only required subop for the COVA
command,
SST now asks you for a list of options:
OPTIONS: mean std
where the response mean std
was supplied by the user. Subops entered in
response to the OPTIONS
prompt should be written out in full, e.g.
OPTIONS: coef[beta] covmat[cov] pred[xbeta]
You may also want a scatterplot of the data. Currently SST supports plots in both character and graphics mode, depending upon how you have your system equipped. (See the Installation Instructions for details on how to configure your system.) Character plotting is rather crude since the limited resolution of the screen does not permit plotting more than one or two hundred data points. In graphics mode, however, SST scatterplots are feasible for large datasets.
To obtain a simple scatterplot of x
and y
, with x
on
the horizontal axis and y
on the vertical axis, give the following
command:
scat var[y x]
The VAR
subop specifies which variables to plot with the first
variable defining the horizontal axis and the second variable defining
vertical axis. (In fact, the SCAT
command has quite a few
variations. See the SST User's Guide for further details.)
A linear regression equation takes the form:
where i denotes observations, Y is the dependent
variable, X_1, ..., X_k are the independent variables, and Ui is an
unobserved error. Usually, one of the independent variables
is constant (e.g., X_1i = 1 for all i). In SST if you want a
constant term, you will have to create one and include it in the equation:
set one=1
Note that one pitfall of the SET statement is that if no RANGE
statement is
in effect, the above command will create a vector of 8000 one's, which is
time consuming and wastes memory. At this point we can regress y
on
x
, z
, and a constant:
reg dep[y] ind[one x z]
SST will print out ordinary least squares estimates of this equation along with standard errors, t-statistics, and the R^2.
The power of the REG
command, however, is not illustrated by this
simple example. If a regression is worth running, it is also worth the
time to check for patterns in the residuals. SST allows users to save
residuals and predicted values with the subops RSD
and PRED
:
reg dep[y] ind[one x z] pred[yhat] rsd[u]
The subops PRED
and RSD
take as their argument a variable
name in which the predicted values and residuals, respectively, will be
stored. Once created, the variables yhat
and u
can be printed, plotted,
and otherwise manipulated like any other variable.
Two other useful diagnostics, suggested by Belsey, Kuh, and Welch in their
book, Regression Diagnostics (Wiley, 1980), are available in SST.
Studentized residuals (residuals divided by their estimated standard
deviations) can be obtained by including the SRSD
subop, which takes
a variable name as an argument. The diagonal elements of the hat matrix
can be obtained by including the HAT
subop, which also takes a
variable name as an argument. Large values (positive or negative) of the
studentized residuals indicate outliers, while large values of the hat
matrix diagonal elements indicate observations with high leverage (i.e.,
the estimated regression will be sensitive to the deletion of these
observations).
It is often informative to rerun a regression after deleting observations
which are either outliers or high leverage points. First, we might run the
above regression and save the studentized residuals in a variable called
srsd1
and the hat values in a variable called hat1
:
reg dep[y] ind[one x z] srsd[srsd1] hat[hat1]
Next, we could delete observations for which the studentized residuals are greater than two in absolute value or the corresponding hat element is greater than two:
reg dep[y] ind[one x z] if[(abs(srsd1) $<$ 2) $|$ (hat $<$ 2)]
The IF
statement modifies the sample range used to calculate the
regression. Only observations for which the expression inside the If subop
is true will be used for estimation.
The estimation range for the REG
command can also be modified using
the OBS
subop. For example, we may want to check for possible
nonlinearities or interactions by splitting the sample depending on whether
z was greater than or less than some value (say 3.5):
reg dep[y] ind[one x z] if[z > 3.5] reg dep[y] ind[one x z] if[z <= 3.4]
The above two REG
commands would produce two regressions
corresponding to the division of the sample by values of z
.
quit
and SST returns you to the operating system. The QUIT
command is
very simple; it takes no subops.
Operating SST in interactive mode sometimes seems to require too much typing. If an operation needs to be repeated with minor modification, we would rather not have to retype the entire command. One way to lessen the typing required is to create a batch file with a word processor or editor. Your word processor allows you to duplicate lines, making small changes as necessary, and to create large files of commands relatively quickly.
If you have created a file batch.cmd
with SST commands, you
can run it by invoking SST in the following way:
C$>$sst batch
Note that SST assumes an extension of `.cmd' for command files, unless some
other extension is specified. You can then watch your commands and
SST output roll quickly down your screen (probably too quickly to
read). You can save the output for later viewing by redirecting it to a
file, say batch.out
:
C$>$sst batch $>$batch.out
All this is a little awkward and greatly cuts down on helpful interaction with the program. There is a better way.
The first thing to realize is that editing batch files and running SST need not be distinct processes. If you are using WordStar, for example, and WordStar can be accessed from your current directory, give the command:
sys ws
and, presto, you are in WordStar while everything in SST remains
undisturbed. When you exit WordStar, the SST prompt will reappear and you
can continue your SST session. The SYS
command can be used to
execute most programs from DOS. To obtain a listing of your current
directory, for example, just give the command:
sys dir
and a directory returns on screen before returning you to SST.
If you created a batch file of SST commands while you were temporarily
using your editor, these can now be executed using the RUN
command:
run batch
The SST commands in the file batch.cmd
will be executed. When SST has
finished executing the batch file, the SST prompt will reappear and SST is
ready to accept new commands.
During an SST session, you may want to save a record of your work. For
this, the SPOOL
command is handy:
spool file[sst.out]
The SPOOL
command, as issued above, causes a copy of the session to
be saved in the file specified in the FILE
subop (in this case,
sst.out
). To turn spooling off, give the command:
spool off
and all SPOOL
files are closed. Type the SPOOL
command by
itself, and SST reminds you which spool files you have open:
spool No spool files open
The SPOOL
command also enables you to create SST batch files
without using your word processor. If the subop CMD
is added to the
SPOOL
command, only commands are saved to the specified file. If the
subop OUT
is added to the SPOOL
command, only the output of
executed commands is saved to the specified file. It is possible to have two
SPOOL
files open at once: one for commands, the other for output:
spool cmd file[file1.cmd] spool out file[file2.out]
The file file1.cmd
would get your commands while the file
file2.out
would get the output produced by SST. This is useful
because the file generated by SPOOL CMD
is executable by SST.
You can rerun the
session by, first, closing the CMD
file:
spool cmd off
You may want to close the OUT
file too. The command:
spool off
would close both files, while the command:
spool out off
would only close the OUT
file. Next, give the command:
run file1
and SST runs the file file1.cmd
, giving you a rerun of the session!
Of course, you may want to make some minor modifications to the
CMD
file, which can be done using your word processor in
conjunction with the SYS
command. Thus, SST can provide a complete,
integrated environment for all your statistical work.