Entering Data into SST

The first task of any SST session is to input your data. SST provides a variety of ways to enter data. Once you have input your data, it can be stored in the SST system file format and recalled with only a few keystrokes. Entering data for the first time, however, can be a laborious task, so you should read this chapter carefully to determine the easiest way to enter your data.

Sample dataset

In the first half of the User's Guide, we will use an example to illustrate the use of various commands. In this example, we shall explore the relationship between growth in the money supply, the rate of inflation and unemployment, and the party occupying the White House. Since the goal is to learn about the use of SST commands, the theories that relate these variables will be vastly oversimplified to suit our purposes.

Annual data for the period from 1960 to 1980 are taken from the Economic Report of the President. The data are as follows:

Year   Money   Inflation   Unemployment   Party
1960    0.7       1.6          5.5          1
1961    3.2       0.9          6.7          0
1962    1.8       1.8          5.5          0
1963    3.7       1.5          5.7          0
1964    4.6       1.5          5.2          0
1965    4.7       2.2          4.5          0
1966    2.5       3.2          3.8          0
1967    6.6       3.0          3.8          0
1968    7.7       4.4          3.6          0
1969    3.2       5.1          3.5          1
1970    5.3       5.4          4.9          1
1971    6.5       5.0          5.9          1
1972    9.3       4.2          5.6          1
1973    5.5       5.7          4.9          1
1974    4.4       8.7          5.6          1
1975    5.0       9.3          8.5          1
1976    6.6       5.2          7.7          1
1977    8.1       5.8          7.1          0
1978    8.3       7.3          6.1          0
1979    7.2       8.5          5.8          0
1980    6.4       9.0          7.1          0

Money is the money supply growth rate (percent increase in M1 over each year). Inflation is the percent increase in the implicit GNP price deflator. Unemployment is measured as a percent of the civilian labor force. Party is the party holding the presidency (one for Republicans, zero for Democrats).

Ways to enter data

There are three commands in SST to input data: ENTER, READ, and LOAD. ENTER is used to input small amounts of data from the keyboard, READ to enter data stored in text files, and LOAD to enter previously saved data from SST or other programs.

Entering data from the keyboard

The ENTER command is used to enter new data or change existing data from the keyboard in interactive mode. You tell SST the variables you wish to create or alter and a range of observations and then SST prompts you for data values. The syntax for the ENTER command is:

enter to[variable list] obs[observation list]

SST will prompt you for data values on the variables specified in the TO subop in the range specified by the OBS subop. When finished with data entry, type the letter `q' or `quit'.

SST will supply the variable name, with the observation number in parentheses, followed by the current value of that variable in brackets. ("MD" indicates missing data if no value currently exists for the particular observation.) You can either change the value by typing a new value, followed by a carriage return, or leave the value unchanged by typing a carriage return. To enter a missing value, type either `MD', `md' or a period `.'.

Multiple values can be entered on one line, separated by blanks or commas. After the carriage return is pressed, the program then prompts you for the next data value. If you have not supplied a data value for all variables for the particular observation, it will remind you which variable comes next. If all data has been entered for a particular observation, it then prompts you for the next observation.

For example, to enter the data listed at the start of this chapter, type:

enter to[year money inflat unemp party] obs[1-21]

SST responds with the prompt:

year(1) [ MD ]:

You could then type `1960' followed by a carriage return. The remainder of the session might continue as follows (carriage returns are entered after each list of data values):

money(1) [ MD ]: 0.7
inflat(1) [ MD ]: 1.6
unemp(1) [ MD ]: 5.5
party(1) [ MD ]: 1

To speed things up, you may want to type more than one value after each prompt. For example:

year(2) [ MD ]: 1961 3.2 0.9
unemp(2) [ MD ]: 6.7 0

Thus, the value of money for observation 2 is 3.2, the value of inflat for observation 2 is 0.9, and so forth.

For small amounts of data, ENTER works well, but you probably would not want to enter large amounts of data this way. To stop entering data at any point during the ENTER command, just type `q' or `quit':

year(3) [ MD ]: quit

and SST will be ready to accept new commands.

SST allows you to designate some values as missing with the ENTER command. When prompted for a value, type either `MD' or a period (`.'), and SST will mark that data value as missing.

Reading ASCII data files

As we have seen, the ENTER command lets you input data from the keyboard in response to prompts. In some situations, however, it may be faster to create a text file with your data and input it using the READ command. For instance, you may have already typed your data into a file or have been given your data in this format. Text files on the IBM-PC (and most other computers) represent characters using standard ASCII codes that can be displayed by using the DOS type command. Some spreadsheet and statistical programs (including SST) store data in a more compact binary format that cannot be displayed using the type command. If your data is in this form, check the LOAD command for details.

A data file can be created using a text editor or word processor (such as WordStar). If a word processor is used, be sure to use the "non-document" mode so that your word processor does not insert "invisible" control characters into the text file. SST normally ignores control characters. Although data need only be separated by a comma, space, or carriage return, the data set will be easier for you to read if the data is separated into fixed columns. You may want to set up tabs to input data into a fixed column format.

By variable or by observation?

SST expects data in text files to be organized in one of two ways: by variable or by observation. Which option is specified determines how values in the text file are read into SST.

For our example, the "observations" correspond to years. For the data to be organized by observation, the input file would look like:

1960 0.7 1.6 5.5 1
1961 3.2 0.9 6.7 0

and so on. On each line of the input file, there are five data values corresponding to the year, money supply growth rate, inflation rate, unemployment rate, and party holding the presidency for the particular observation (year) in question. The data for a single observation can occupy more than one line of the input file, but in general you might think of a data file organized by observation as being a rectangular array with variables defining columns and observations defining rows. Unless you state otherwise, the READ command expects data to be stored by observation.

Data organized by variable for the above example might look like:

1960 1961 1962 1963 1964 1965 1966 1967 1968 1969
1970 1971 1972 1973 1974 1975 1976 1977 1978 1979
1980 0.7 3.2 1.8 3.7 4.6 4.7 2.5 6.6 7.7 3.2 5.3

and so on. For data organized by variable, all data is input for one variable before data is entered on the next variable. If the data for each variable could fit on one line of the input file, then data organized by variable could be viewed as a rectangular array with variables defining rows and observations defining columns.

In the example we are using, the distinction between variables and observations is probably pretty clear. In other cases, however, the distinction will depend on how one wants to use the data. For example, each year a number of organizations prepare forecasts of GNP, inflation and other macroeconomic indicators. Suppose one wanted to analyze this data. In this case, it is not obvious what the variables should be. One possibility is to make the variable the forecast of a particular organization. Thus one has variables like OMB, CBO, DRI, CHASE, and WHARTON, with observations defined by which macroeconomic indicator is being forecast. Another possibility is to have variables corresponding to different macroeconomic indicators with each observation corresponding to the organization that produced the forecast. Which way of thinking about the data in terms of variables and observations is appropriate will depend on how one wants to use the data.

Note that the computer simply reads left to right, by row. Thus there is no difference to the computer between the following two data sets:

4.2  3.5  4.0  3.3
0.5  1.0  1.5  1.2

and

4.2  3.5  4.0  3.3  0.5  1.0  1.5  1.2

Of course if one is creating the data set, it may be simpler to read if each column and row correspond to a different variable or observation.

To summarize, decide which are variables and which are observations for your purpose. If the data set exists, then see whether the computer reading by row will first see different observations for the first variable (data by variable) or the values of different variables for the first observation (data by observation). If you are creating the data set, the method of organization is a matter of convenience.
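
The two organizations can be made concrete with a short sketch. The Python below is not part of SST, and the function name arrange and its arguments are hypothetical. It takes a flat stream of values, in the order they would be read left to right by row, and assigns them to variables under each of the two organizations.

```python
def arrange(values, names, byvar=False):
    """Assign a flat stream of values to named variables.

    byvar=False: values cycle through the variables, one observation at a time.
    byvar=True:  all values of the first variable come first, then the second, ...
    """
    nvars = len(names)
    if len(values) % nvars != 0:
        raise ValueError("number of values is not a multiple of the number of variables")
    nobs = len(values) // nvars
    if byvar:
        # Each variable owns one contiguous block of nobs values.
        return {name: values[i * nobs:(i + 1) * nobs] for i, name in enumerate(names)}
    # Each variable takes every nvars-th value, starting at its own position.
    return {name: values[i::nvars] for i, name in enumerate(names)}


# The eight values from the example above, read left to right.
stream = [4.2, 3.5, 4.0, 3.3, 0.5, 1.0, 1.5, 1.2]
by_obs = arrange(stream, ["x", "y"])              # x and y alternate
by_var = arrange(stream, ["x", "y"], byvar=True)  # four x values, then four y values
print(by_obs)
print(by_var)
```

The same stream of eight numbers yields different variables depending only on which organization is assumed, which is why the choice has to be stated when the data is read.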

How to use the READ command

To use the READ command you must supply two pieces of information: the name of the file which contains the data and the names of the variables in the file. If the data in our example were organized by observation, the SST command to read the data from the file mydata would be:

read to[year money inflat unemp party] file[mydata]

Unless instructed otherwise, the READ command expects data to be organized by observation. If the data in the file mydata were organized by variable, the appropriate command would be instead:

read to[year money inflat unemp party] file[mydata] obs[1-21] byvar

The BYVAR subop tells SST that the data is organized by variable instead of by observation.

Errors in the READ command

There are several errors that occur frequently in using the READ command. First, the data file may contain illegal characters. Data files used in the READ command should only contain valid numbers. Valid numbers can be in integer, decimal, or exponential format. For example:

1 1.0 +1.0e0

are all examples of valid numbers (each with the same meaning). On the other hand, a file containing:

1 abc 2.0

would cause SST to issue an error message and abort the READ command.

Second, the number of values in the data file may not correspond to your instructions in the READ command. The number of data values in a file should be a multiple of the number of variables specified in the TO subop. If data is read by observation, SST has no way to determine the number of observations in the file other than to divide the total number of data values in the file by the number of variables. If the total is not an exact multiple of the number of variables, it assumes that you have made a mistake and issues an error. If data is read by variable, it will issue a warning. Generally this means something is amiss and you should examine your data file.

Third, reading data normally requires two passes through the data file: one to determine how many observations are in the data file, and another to process the data. If you know how many observations are in a file, you can speed up the READ command by giving it this information in the NOBS subop:

read to[year money inflat unemp party] file[mydata] nobs[21]

In reading large datasets, you may run out of memory. Reading data by variable is somewhat more efficient than reading data by observation. The former only requires that the data on a single variable fit into memory at once, while the latter requires that the entire data set fit into memory at once. If you run out of memory, try entering data in smaller batches and saving them using the SAVE command, which can handle very large data sets efficiently.
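
Both of the first two errors can be caught before the data ever reaches SST. The following Python sketch is one way to pre-check a free-format file. It is not SST's own logic, and the function name check_free_format is hypothetical. Note also that Python's float conversion is slightly more permissive than the rules described above (it accepts words such as `nan' and `inf'), so this is a rough screen rather than an exact test.

```python
import re


def check_free_format(text, nvars):
    """Return (nobs, problems) for free-format text holding nvars variables."""
    # Values may be separated by commas, spaces, or line breaks.
    tokens = [t for t in re.split(r"[,\s]+", text) if t]
    problems = []
    for t in tokens:
        try:
            float(t)  # accepts integer, decimal, and exponential notation
        except ValueError:
            problems.append(f"not a number: {t!r}")
    if len(tokens) % nvars != 0:
        problems.append(f"{len(tokens)} values is not a multiple of {nvars} variables")
    # Whole observations only; any leftover values are reported above.
    return len(tokens) // nvars, problems


good = "1960 0.7 1.6 5.5 1\n1961 3.2 0.9 6.7 0\n"
bad = "1 abc 2.0"
print(check_free_format(good, 5))
print(check_free_format(bad, 2))
```

The observation count returned for a clean file is also the number that would be supplied in the NOBS subop.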

Data in fixed format

So far we have been discussing data in free format. It makes no difference to SST how many spaces you have between data values or on which line they appear so long as the data values are in the correct order. Thus, SST will read the following data values:

1 2    3 , 4      5

the same as it will read:

1,2,3 4,5

For your own sanity, we suggest that you use a consistent system for data entry, but don't worry about SST -- it's very tolerant.

Users accustomed to mainframe computing often prefer to store their data in a fixed column format without spaces, commas, or other delimiters between data values. This format saves space, though it is somewhat difficult to examine. SST allows users to specify a FORTRAN-style format statement for data in this form using the FMT subop.

In fixed format the data are required to appear in specified positions within the file. A summary of FORTRAN format statements appears in an appendix so we will only provide a few simple examples here. The letter F in a FORTRAN format statement tells SST that you will be inputting a floating point number. The letter F is followed by an integer indicating how many columns the number will occupy. Thus F3 tells SST that you will be inputting a floating point number occupying three columns in the data file. For example, the following data file:

123456

could be read using the FMT statement:

fmt[F3,F3]

The first number (123) occupies three columns and the second number (456) also occupies three columns. Instead of repeating the specification F3 twice, you could specify repeats of the same specification by preceding the letter F with an integer indicating the number of times the specification is to be repeated. Thus:

fmt[2F3]

is equivalent to the specification above. FORTRAN format statements are quite flexible, though perhaps a bit complicated for new users.
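
To see what such a format statement asks the reader to do, consider the following Python sketch. It is an illustration only, not SST's parser: the function names expand_format and read_fixed are hypothetical, and it handles only the simple F items with an optional repeat count used in the examples above, not the full range of FORTRAN format items.

```python
import re


def expand_format(spec):
    """Turn a spec such as 'F3,F3' or '2F3' into a list of column widths."""
    widths = []
    for item in spec.split(","):
        # Optional repeat count, the letter F, then the field width.
        m = re.fullmatch(r"\s*(\d*)[Ff](\d+)\s*", item)
        if m is None:
            raise ValueError(f"unsupported format item: {item!r}")
        repeat = int(m.group(1)) if m.group(1) else 1
        widths.extend([int(m.group(2))] * repeat)
    return widths


def read_fixed(line, spec):
    """Slice one line into numbers using the widths from the format spec."""
    values, start = [], 0
    for width in expand_format(spec):
        values.append(float(line[start:start + width]))
        start += width
    return values


print(read_fixed("123456", "F3,F3"))
print(read_fixed("123456", "2F3"))
```

Both calls split the line at the same column boundaries, which is the sense in which the two format statements are equivalent.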

Checking inputted data

After the data is successfully read in, it is prudent to perform a few quick checks to make sure that the data values correspond to what you intended to enter. First, try listing which variables have been entered into SST using the LIST command:

list

SST will now provide you a listing of all variables entered, the number of non-missing observations on each variable, the date created, and the variable's label, if any (see below, for details of how to label a variable). For example:

Listing of variables in memory:
year      21  Thu Jan 09 14:41:06 1986
money     21  Thu Jan 09 14:41:06 1986  change in M1 from year earlier
inflat    21  Thu Jan 09 14:41:06 1986  change in GNP implicit price deflator
unemp     21  Thu Jan 09 14:41:06 1986  civilian unemployment rate
party     21  Thu Jan 09 14:41:06 1986  republican president dummy

Are all the variables entered that you thought should be entered? Does each variable have the number of observations that you expected? If you just input a variable using the READ command, the date and time on the variable should be very recent.

Even if the information supplied by the LIST command is what you expected, you will still want to check if the data values are correct. There are several ways to do this. If you don't have too much data, you can examine it using the PRINT command. For example, type:

print var[year money]

and SST will print the values of the variables year and money that you have input. For large datasets, you will probably want to restrict the observations printed by specifying a limited observation range:

print var[year money] obs[1-10]

The OBS subop restricts which values will be printed out on the screen. The above example would only print the data for observations one through ten. Alternatively, the observation range can be restricted using the IF subop:

print var[year money] if[year > 1975]

which would print out data for years after 1975.

Another way to check the data that you have input is to compute some descriptive statistics on the data. If the data are discrete (i.e., take only a few distinct values), the FREQ command will show you which values the variable takes and the percentage of observations falling into each category. For example:

freq var[party]

would compute a frequency distribution for the variable party. For variables that take a large number of distinct values (any of the other variables in our data set), the COVA command will produce a few useful descriptive statistics on the variable:

cova var[year money inflat unemp]

The COVA command automatically produces the mean, standard deviation, minimum, and maximum of each variable specified in the VAR subop. Usually if there has been some error in data entry, one or more of these statistics will tip you off.

Further details of the PRINT, FREQ and COVA commands can be found in Chapter 4 of the User's Guide.

Variable labels

Since SST variable names are limited to eight characters, you may forget the meaning of some variables. SST provides a facility for adding short descriptions to each variable with the LABEL command. For example, type:

label var[money] lab[change in M1 from year earlier]

Inside the LAB subop, you type whatever description you want attached to the variable. The variable label ordinarily should not exceed thirty characters. The label will be printed when you issue the LIST command and at other points when you access the variable.

Value labels

A variable like party which takes only a few values (in our case, Republican and Democratic) can also have labels assigned to specific values. We have coded party equal to one when Republicans hold the White House and zero when the Democrats hold the White House:

label var[party] val[1 Repub 0 Democrat]

In the VAL subop, you first list a value of the variable in the VAR subop followed by its label and continue until you have finished labelling the values. Value labels are restricted to a maximum of eight characters and must not contain spaces or commas. Multiple variables whose categories have the same value labels can be labelled simultaneously by specifying more than one variable name in the VAR subop. In principle, there is no limit to the number of value labels that can be assigned, but few people have enough patience to type more than ten labels.

Removing labeling information

If you give the LABEL command with the VAR subop, but omit both the LAB and VAL subops, SST will remove all labelling information from the variables specified. Since labelling information requires relatively little storage and is an invaluable reminder when you return to a data set that you have not worked with for a while, we recommend that you keep as much labelling information as possible.

Saving data in SST system file format

Once data has been input into SST and you have entered as much labelling information as you desire, it's good practice to save the data in an SST system file. Once the data is saved, you will not have to reconstruct your earlier work if you have a power failure or some other catastrophe. SST system files preserve your variable and value labels and can be loaded very quickly (much faster, for example, than an ASCII file read with the READ command). The command to save all data currently in SST into a file myfile.sav is:

save file[myfile]

SST automatically adds the extension `.sav' to the filename you specify in the FILE subop. (If for some reason you wanted another extension, you would have to specify the full filename and extension in the FILE subop.) You may not want to save all variables in memory. In this case SST allows you to list which variables you want saved:

save file[myfile] var[year money inflat]

Alternatively, you might only want to save some subset of the data. To save only the first ten observations, add an observation range using the OBS subop:

save file[myfile] obs[1-10]

The observation range can also be restricted using the IF subop. To save only the post 1975 data, type:

save file[myfile] if[year > 1975]

Listing the contents of an SST system file

To see which variables have been stored in an SST system file and the number of observations on each variable, use the LIST command with the FILE subop:

list file[myfile]

SST only reads the "header" off the system file, so issuing this command does not cause the data to be actually entered into SST. It tells you what is in the file, but does not waste time reading through the entire file.

Loading data from an SST system file

The LOAD command is used to load a data set previously saved during an SST session. Once you have gone to the trouble of saving data in the form of an SST system file, reloading it is easy. Just type:

load file[myfile]

and SST loads the data and labelling information. It's fast and simple. If no filename extension is specified in the FILE subop, SST assumes the extension `.sav'. To load only selected variables stored in the system file myfile.sav, include the variables that you want in the TO subop:

load file[myfile] to[year money]

Appending data from an SST system file

Sometimes you may want to combine two data sets. If the data sets include different variables on the same observations, then just load the second data set into memory using the LOAD command, and the additional variables will be loaded into memory. (Caution: If some of the variables in the second data set have the same names as variables in the first, the old values will be overwritten.)

On other occasions, you may have two or more samples of data on the same variables. For example, you may have several household expenditure surveys conducted in different years. The variables in each data set are the same (or at least overlapping) and you want to combine the various samples. To do this, just add the APPEND subop to the LOAD command:

load file[yourdata] append

The data on the variables in the file yourdata will be appended to whatever variables are currently in memory. The starting observation for the new data is determined by the maximum observation number of the data currently in memory (which can be determined using the LIST command).
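
The placement rule can be pictured with a small Python sketch. This is an illustration of the rule as described above, not SST's implementation. The function name append and the sample variables are hypothetical, and filling the gaps with missing values (None here) when a variable appears in only one of the two data sets is an assumption of the sketch.

```python
def append(memory, incoming, md=None):
    """Stack incoming observations below those already in memory.

    New data starts after the largest observation number in memory.
    Variables present on only one side are padded with md (an assumption).
    """
    start = max((len(v) for v in memory.values()), default=0)
    n_new = max((len(v) for v in incoming.values()), default=0)
    names = list(memory) + [n for n in incoming if n not in memory]
    result = {}
    for name in names:
        old = memory.get(name, [])
        new = incoming.get(name, [])
        result[name] = (old + [md] * (start - len(old))
                        + new + [md] * (n_new - len(new)))
    return result


# Hypothetical survey samples with overlapping variables.
memory = {"income": [10, 20], "food": [1, 2]}
incoming = {"income": [30], "rent": [5]}
combined = append(memory, incoming)
print(combined)
```

In this sketch the single new observation becomes observation 3, because two observations were already in memory.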

Loading data produced by other programs

SST can also read some files produced by other programs. If you have used the popular database management program dBASE II to input your data, it can be read directly into SST using the LOAD command. Give the LOAD command and add the DB2 subop:

load file[filename] db2

SST assumes an extension of `.dbf' to the filename specified in the FILE subop, unless told otherwise. (dBASE II uses this extension by default when it produces a file in its standard format.)

Another common format for files produced by spreadsheet programs is DIF (Data Interchange Format), used by VisiCalc and other programs. To load a DIF file, enter:

load file[filename] dif

If no extension is specified, `.dif' is assumed when the DIF subop is present. With DIF files, column labels are used for variable names. If no column labels are present, names are assigned by SST.

Deleting variables

Keeping too many variables in memory can slow the operation of the program, so SST allows you to delete variables that you won't be needing again using the DEL command:

del var[x y]

After they have been deleted, the variables x and y are lost unless you previously saved them in a file. Use the DEL command with caution!

A wholesale delete of all variables (and everything else) from memory can be accomplished using the CLEAR command:

clear

CLEAR also resets the range, so you should issue a new RANGE statement after the CLEAR command. The primary use of the CLEAR command is to restart an SST session without having to QUIT and reload the program into memory. Remember, however, that CLEAR removes everything from memory. It does not affect files that have been written to disk, but it is your responsibility to SAVE any data that you will need in the future.

Sorting data in SST

The SORT command sorts the variables specified in the VAR subop according to values of the variables specified in the BY subop. If more than one variable is specified in the BY subop, the sort is lexicographic--that is, first the data is sorted according to the first variable, and the second variable is only used to break ties in the first variable, and so on. Variables are sorted in ascending order: low values are put ahead of high values. Missing values are treated as large values so that missing values in the variables specified in the BY subop tend to end up at the bottom of the data file.

The VAR subop is optional. If it is omitted, SST assumes that you want all of your data sorted so that observations are kept intact. The SORT command writes over the variables specified in the VAR subop; it is wise to save your data using the SAVE command prior to using SORT.

One use of the SORT command is to arrange your data in a way that permits visual (as opposed to statistical) analysis. For example, suppose you wanted to examine the relationship between growth in the money supply and inflation. You could sort the data by money supply growth, and then look at the associated inflation rates, as below:

sort by[money]
print var[year money inflat unemp]

 OBS              VARIABLES
             year         money        inflat         unemp
  1:         1960           0.7           1.6           5.5
  2:         1962           1.8           1.8           5.5
  3:         1966           2.5           3.2           3.8
  4:         1969           3.2           5.1           3.5
  5:         1961           3.2           0.9           6.7
  6:         1963           3.7           1.5           5.7
  7:         1974           4.4           8.7           5.6
  8:         1964           4.6           1.5           5.2
  9:         1965           4.7           2.2           4.5
 10:         1975           5.0           9.3           8.5
 11:         1970           5.3           5.4           4.9
 12:         1973           5.5           5.7           4.9
 13:         1980           6.4           9.0           7.1
 14:         1971           6.5           5.0           5.9
 15:         1967           6.6           3.0           3.8
 16:         1976           6.6           5.2           7.7
 17:         1979           7.2           8.5           5.8
 18:         1968           7.7           4.4           3.6
 19:         1977           8.1           5.8           7.1
 20:         1978           8.3           7.3           6.1
 21:         1972           9.3           4.2           5.6

It appears that years with high money supply growth rates accompany years with high inflation rates. This relationship could then be investigated further using the statistical procedures described in later chapters.

The same sorting could be accomplished with:

sort by[money] var[year money inflat unemp party]

since SST assumes that you want all variables sorted if the VAR subop is omitted. If we had specified only a subset of variables in the VAR subop, then the data on different observations would be split up.
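
The ordering rules described in this section (ascending order, ties broken by later BY variables, missing values last) can be sketched in Python as follows. This is an illustration, not SST's sorting routine. The names MD, sort_key and sort_rows are hypothetical, and one money value has been deliberately set to missing to show where it lands. A second BY variable is used here so that the order of tied observations is fully determined.

```python
import math

MD = math.nan  # stand-in for a missing value; an assumption of this sketch


def sort_key(row, by):
    """Build a lexicographic key; missing values sort after every number."""
    return tuple(
        (math.isnan(row[name]), 0.0 if math.isnan(row[name]) else row[name])
        for name in by
    )


def sort_rows(rows, by):
    """Return whole observations reordered by the BY variables, ascending."""
    return sorted(rows, key=lambda row: sort_key(row, by))


rows = [
    {"year": 1967, "money": 6.6, "inflat": 3.0},
    {"year": 1960, "money": 0.7, "inflat": 1.6},
    {"year": 1976, "money": 6.6, "inflat": 5.2},
    {"year": 1961, "money": MD, "inflat": 0.9},  # money set to missing for illustration
    {"year": 1972, "money": 9.3, "inflat": 4.2},
]
ordered = sort_rows(rows, by=["money", "year"])
for row in ordered:
    print(row["year"], row["money"], row["inflat"])
```

Because each row is moved as a unit, the observations stay intact, which corresponds to sorting with the VAR subop omitted.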

Saving data for input to other programs

If you add either the DIF or DB2 subop to the SAVE command, SST will write either a DIF file or a dBASE II file. For DIF files, variable names will be used for column names. For dBASE II files, variable names will be used for field names.

Writing an ASCII file

SST will output data into a text file using the WRITE command. Unless a FORTRAN format is specified using the FMT subop, the data will be output with a space separating each data value. The default output format is by observation. For example:

write var[year money] file[myfile.out]

would create a file myfile.out with contents:

1960 0.7
1961 3.2

and so on.

