Data Transformations

Once you have input your data, you may want to perform some data transformations, recode some variables, or construct some additional variables. SST provides a wide variety of data transformation capabilities that simplify the process. SST also allows you to perform conditional transformations depending upon whether a condition you specify is true or false.

Logical expressions in SST

In most SST commands that deal with data, you may include the IF subop, in which you supply a logical expression. This logical expression is evaluated for each observation. If it is true for that observation, the command is executed for that observation; otherwise the observation is ignored for purposes of this command.

We encountered a logical expression earlier, when we mentioned in passing the use of the PRINT command. In that case, recall that we only wanted observations printed when:

year > 1975

The variable year took values ranging from 1960 to 1980. If, for instance, year equalled 1964, the above expression would be false, and that observation would not be printed.

Logical expressions may use the following operators:

<       less than
<=      less than or equal to
==      equals
>=      greater than or equal to
>       greater than
!=      not equals
&       logical and
|       logical or

The double equals (`==') is used to test equality and should not be confused with the single equals (`=') which is used in the SET statement (described below) for assignment. For example:

print var[money inflat] if[year == 1980]

would print the values of the variables money and inflat for the year 1980. We add spaces around logical operators to make the expressions easier to read, but the spaces have no significance for SST and you may omit them if you like.

The logical and (`&') and logical or (`|') symbols may require some explanation. SST allows you to build up fairly complicated logical expressions where multiple conditions are tested. For example:

print var[money inflat] if[(party == 1) & (inflat > 0)]

would print data only for those observations where party equals one and inflat was greater than zero. The command:

print var[money inflat] if[(party == 1) | (inflat > 0)]

would print data for those observations where either party equals one or inflat was greater than zero or both. Thus `|' means the non-exclusive or: either one condition or both conditions must be true for the expression to be true.

The SET statement

The SET statement performs data transformations, either modifying existing variables or creating new ones, using arithmetic expressions. The simplest SET statement would create a constant taking the same value for all observations:

set one = 1

You will now have a variable which is equal to one for all observations. (This is convenient for regressions and other statistical procedures.)

There is always only one variable name on the lefthand side of the equals sign in a SET statement. The expression on the righthand side of the equals sign can be as complicated as you like. Let's start with some simple examples before building up to more complicated cases. Suppose, for example, that you have a variable x and you would like a copy of it. Try:

set y = x

The variable y contains exactly the same data values as the variable x. You can now manipulate y as you like without fear of disturbing the values in x. SST allows you to use virtually any arithmetic expression that you might desire in the SET statement:

set y = x+z
set y = x*z
set y = x/z
set y = x^z

SST uses `*' to indicate multiplication, `/' for division, and `^' for exponentiation. When you build up more complicated expressions, either surround terms of the expression in parentheses:

set x = y+(z/w)

or remember the rules of precedence:

set x = y+z/w

The previous two expressions have the same meaning. When the order of operations is not determined by parentheses, exponentiation is performed first, followed by either multiplication or division, and addition or subtraction is performed last. Operations of equal order of precedence (e.g., multiplication and division) are performed from left to right. Whenever in doubt, add parentheses to be on the safe side.

Some functions available in SST

SST provides a number of functions that can be used in the SET statement:

exp(x)          exponential function, e^x
log(x)          natural logarithm, ln x
abs(x)          absolute value, |x|
sqrt(x)         square root of x
sin(x)          sine function (x in radians)
cos(x)          cosine function (x in radians)
tan(x)          tangent function (x in radians)
cumnorm(x)      cumulative normal distribution function evaluated at x
invnorm(x)      inverse cumulative normal evaluated at x
bvnorm(h,k,r)   bivariate normal probability with correlation r
phi(x)          normal probability density evaluated at x
floor(x)        greatest integer less than or equal to x

For the most part, the use of these functions is straightforward. You type an arithmetic expression substituting variable names, numbers or other expressions in place of the arguments of these predefined functions. For example:

set x = sin(x+exp(y+1))

would be equivalent to x = sin(x + e^(y+1)). When a function takes multiple arguments (such as the bivariate normal distribution function), the arguments should be separated by commas. For example:

set x = bvnorm(0,0,0.5)

gives the probability that two jointly normal variables, each with mean zero and variance one and with correlation 0.5, will both be negative.
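As a cross-check on the value this example should produce, the probability that both variables are negative has a well-known closed form when both cutoffs are zero: 1/4 + arcsin(r)/(2*pi). The following is a minimal Python sketch of that formula, not SST code, and it says nothing about how SST itself evaluates bvnorm. The function name both_negative_probability is made up for this illustration.

```python
import math


def both_negative_probability(r: float) -> float:
    """P(X < 0 and Y < 0) for standard normal X, Y with correlation r.

    Closed-form orthant probability; valid only when both cutoffs are zero.
    """
    return 0.25 + math.asin(r) / (2.0 * math.pi)


p = both_negative_probability(0.5)
print(round(p, 6))  # 0.333333
```

With correlation zero the formula reduces to 1/4, as expected for two independent variables.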

Other functions available in SST

The functions described above are scalar functions: they take as arguments single numbers. SST also supports several vector functions which take vectors as arguments. For example, you might want to "standardize" a variable x by deviating it from its mean and dividing by its standard deviation. To do this, you would use the mean() and stddev() functions:

set z = (x-mean(x))/stddev(x)

The vector functions available in SST include:

sum(x)      sum of the values of x, x_1 + x_2 + ... + x_n
mean(x)     mean of x, (x_1 + x_2 + ... + x_n)/n
stddev(x)   standard deviation of x

where n indicates the number of nonmissing observations in the variable x.

In evaluating vector functions, SST uses all valid observations of its argument. This means that mean(x+y) may not equal mean(x)+mean(y) since missing value deletion is not done listwise. (The COVA command, on the other hand, uses listwise deletion of missing values so that all statistics are based on a common set of observations.)
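The effect of non-listwise deletion can be seen in a small Python analogue (not SST code). It assumes that None stands in for a missing value and that a sum is missing whenever either operand is missing; the helper names mean_valid and add are invented for this sketch.

```python
from typing import List, Optional

Value = Optional[float]  # None stands in for an SST missing value


def mean_valid(values: List[Value]) -> float:
    """Mean over the nonmissing entries only."""
    valid = [v for v in values if v is not None]
    return sum(valid) / len(valid)


def add(a: List[Value], b: List[Value]) -> List[Value]:
    """Elementwise sum; assumed missing if either operand is missing."""
    return [None if u is None or v is None else u + v for u, v in zip(a, b)]


x = [1.0, 2.0, None, 4.0]
y = [10.0, None, 30.0, 40.0]

mean_of_sum = mean_valid(add(x, y))           # uses observations 1 and 4 only
sum_of_means = mean_valid(x) + mean_valid(y)  # each mean uses three observations
print(mean_of_sum, sum_of_means)
```

Here the mean of the sum is 27.5 while the sum of the means is about 29, because each calculation draws on a different set of valid observations.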

In addition, you can define your own functions using the DEFINE command described in Chapter 8 of the User's Guide.

Random number generators

SST makes available two random number generators that enable you to perform Monte Carlo simulations. These are the urnd function for generating random variables with a uniform distribution over the interval [0,1] and the nrnd function for generating standard normal random variables:

urnd        uniform random variable
nrnd        normal random variable

Neither of these functions takes arguments, so you do not supply parentheses or arguments when using them. To create 1000 random variables from a normal distribution with mean 1.0 and standard deviation 4.0, give the commands:

range obs[1-1000]
set x = 1.0 + 4.0*nrnd

Reserved variable names

SST allows you to use a number of predefined variables in SET statements. These include:

nobs          number of observations for the variable
obsno         the number of the observation being evaluated

For example, the variable x can be set equal to the observation number:

set x = obsno

The variable x will equal one for the first observation, two for the second observation, and so forth.

How to lag a variable

SST allows you to refer to the value of a variable for a particular observation by enclosing the observation number in parentheses after a reference to that variable. For example, the command:

set y = x(1)

would set all observations on the variable y equal to whatever value happened to be stored as the first observation of the variable x. This device can be used to "lag" variables. For example, to set xlag equal to the value of x lagged one period:

set xlag = x(obsno-1)

The values of xlag will correspond to the values of x for the preceding observation. The first observation of the lagged variable "xlag" will be missing.
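The lagging behavior just described can be sketched in Python (an analogue for illustration, not SST code). None stands in for a missing value, and the function name lag and its periods parameter are assumptions of this sketch.

```python
from typing import List, Optional


def lag(x: List[Optional[float]], periods: int = 1) -> List[Optional[float]]:
    """Shift x down by `periods` observations; leading entries become missing."""
    return [x[i - periods] if i >= periods else None for i in range(len(x))]


x = [5.0, 7.0, 11.0, 13.0]
xlag = lag(x)
print(xlag)  # [None, 5.0, 7.0, 11.0]
```

Each entry of xlag holds the value of x from the preceding observation, and the first entry is missing because there is no observation before the first.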

Conditional transformations

Sometimes you will want to perform a transformation on only a subset of the data. This can be done by including the IF or OBS subops in the SET statement. These subops are optional; if present, they determine which observations will be affected by the SET statement.

The SET statement differs from other SST commands in that any subops, such as IF or OBS, must be separated from the righthand side of the arithmetic expression in a SET statement by a semicolon. This is how SST knows that you are finished entering your arithmetic expression. For example, you can't take the square root of a negative number and it's best not to try to tempt SST to do so. If you want to set y to be the square root of x when x is non-negative, give the command:

set y = sqrt(x); if[x>=0.0]

and SST will skip those observations for which x is negative. (Had the IF subop been omitted, SST would have assigned missing values to those observations for which the operation was illegal.)

The RP subop

In performing a conditional transformation, observations not satisfying the condition specified in the IF or OBS subops will not be affected by the SET statement. If the variable on the lefthand side of the SET statement is mentioned for the first time, these observations will be assigned missing values. If the variable has already been created, however, observations outside the active observation range determined by these subops will be left with their old values. Occasionally you will want to replace old values outside of the active observation range with missing values. To do this, use the RP subop and these observations will be overwritten as missing data. For example:

set x = 1; obs[1-10] rp

For observations one through ten, the value of x will equal one. For all other observations, x will be missing, regardless of what previous data values were stored there.
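The three cases described in this section (new variable, existing variable without RP, existing variable with RP) can be laid side by side in a short Python analogue. This is only a sketch of the rules as stated above, not SST code; the function set_constant and its parameters are hypothetical, and None stands in for a missing value.

```python
from typing import List, Optional


def set_constant(
    old: Optional[List[Optional[float]]],
    value: float,
    first: int,
    last: int,
    nobs: int,
    rp: bool = False,
) -> List[Optional[float]]:
    """Mimic `set x = value; obs[first-last]`, optionally with rp.

    `old` is None when the variable is being created for the first time.
    """
    result: List[Optional[float]] = []
    for obsno in range(1, nobs + 1):
        if first <= obsno <= last:
            result.append(value)           # inside the active range
        elif old is None or rp:
            result.append(None)            # new variable, or rp: missing
        else:
            result.append(old[obsno - 1])  # existing variable: keep old value
    return result


old_x = [9.0, 9.0, 9.0, 9.0]
print(set_constant(old_x, 1.0, 1, 2, 4))           # [1.0, 1.0, 9.0, 9.0]
print(set_constant(old_x, 1.0, 1, 2, 4, rp=True))  # [1.0, 1.0, None, None]
print(set_constant(None, 1.0, 1, 2, 4))            # [1.0, 1.0, None, None]
```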

Calculating single values

Sometimes you may want to transform only a single value of a variable or to check one data value. This can be done by putting SST into "calculator mode". Type:

calc

and SST responds with the calculator prompt:

CALC>

You can now enter the same kind of expressions that you would with the SET statement and the answer will appear on your screen:

CALC>sqrt(4)
           2.00000

If you specify a variable name in calculator mode, it should normally be accompanied by a reference to the observation you want. For example,

CALC>year(1)
        1960.00000

since 1960 is the value of the variable year corresponding to the first observation. With vector functions, no reference to an observation number is used. For example, to obtain the mean of the variable x, type:

CALC>mean(x)

and SST would respond by printing the mean of x. To exit calculator mode, type `quit' or `q'.

CALC can be run from the command line without entering calculator mode by typing the expression that you want on the command line:

calc 2+2

SST will calculate the value you want, print it on the screen, and return you to normal command mode.

Recoding data

The RECODE command allows you to reassign values of a variable. You supply a list of variable names in the VAR subop and a "map" in the MAP subop that instructs SST how to reassign the values of the variables. In the MAP subop, you provide a list of values, enclosed in parentheses if the list consists of more than one value, followed by a new value to which the old values in the list are to be recoded. For example, suppose the variable x takes the values 1, 2 and 3, but we want to make it into a 0-1 dummy variable with the old value 1 coded as 0 and the values 2 and 3 coded as 1. Enter:

recode var[x] map[1=0,(2,3)=1]

The same operation could be performed specifying how each old value is to be recoded, though this is tedious:

recode var[x] map[1=0,2=1,3=1]

The RECODE command allows you to simplify the MAP subop using the keywords hi, lo, thru and else. For example, to recode all negative values of x as -1 and positive values as 1, you could enter:

recode var[x] map[(lo thru 0)=-1,(0 thru hi)=1,0=0]

When a range is specified using the keyword thru, SST assumes that the range includes its endpoints. When a value falls into more than one range specified in the MAP subop, the last recoding is the one used. Thus, in the above example, zero falls into all three entries and is recoded as zero by the final entry, 0=0.
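The two rules stated here, inclusive endpoints and last matching entry wins, can be traced with a small Python analogue (not SST code). The function recode and the LO and HI constants are names invented for this sketch, and leaving unmatched values unchanged is an assumption about the behavior when RP is not given.

```python
import math
from typing import List, Tuple

LO, HI = -math.inf, math.inf  # stand-ins for the lo and hi keywords

MapEntry = Tuple[Tuple[float, float], float]


def recode(value: float, mapping: List[MapEntry]) -> float:
    """Apply every entry in order; ranges include endpoints; last match wins."""
    result = value  # assumption: a value matching no entry is left unchanged
    for (low, high), new_value in mapping:
        if low <= value <= high:
            result = new_value
    return result


with_zero_entry = [((LO, 0), -1), ((0, HI), 1), ((0, 0), 0)]
without_zero_entry = [((LO, 0), -1), ((0, HI), 1)]

print([recode(v, with_zero_entry) for v in (-2.5, 0, 3)])     # [-1, 0, 1]
print([recode(v, without_zero_entry) for v in (-2.5, 0, 3)])  # [-1, 1, 1]
```

With a final entry for zero, the value zero takes that entry's new value; without it, zero matches both ranges and takes the value from the later one.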

If you want to preserve the old variable in the RECODE command, specify a new variable name for the recoded values using the TO subop:

recode var[x] map[(lo thru 0)=-1,(0 thru hi)=1,0=0] to[y]

In the example above, the variable x would remain unchanged while the new variable y would receive the recoded values. If the TO subop is specified, the same number of variables must be included in the TO subop as in the VAR subop or an error will occur.

The RECODE command can also be used in conjunction with the IF, OBS, and RP subops. IF and OBS control which observations the recoding will be applied to, while RP determines whether or not old values of the variable not included in any of the ranges specified in the MAP subop will be recoded as missing. Missing values are admissible as either old or new values of the variable and are designated using either `md' or a period (`.').

Note that anything that can be done by the RECODE command can also be done using a series of SET statements. For example, the above recoding could also be accomplished by:

set y = -1; if[x < 0]
set y = 1; if[x > 0]
set y = 0; if[x == 0]

The RECODE command just decreases the amount of typing that is necessary.

Setting missing values

The RECODE command can be used to assign missing values. For example, if you would like the value -99 of the variable x to be treated as missing, use:

recode var[x] map[-99=md]

To change all missing values to a numerical value, use (for example):

recode var[x] map[md=-99]

Once SST marks a value as missing it does not preserve the old value. Thus the second command above would change all values which were missing to -99, not just the ones that were recoded to missing by the first RECODE.

Creating dummy variables

Many times you may want to create a dummy variable which takes the value one if some condition holds and otherwise takes the value zero. There are two simple ways to create a dummy variable in SST which we now illustrate. Suppose we want to create a dummy variable y which equals one if the variable x is greater than or equal to 100 and zero otherwise. We could use the RECODE command:

recode var[x] map[(lo thru 100)=0,(100 thru hi)=1] to[y]

The same operation could be performed using the SET command:

set y = (x>=100)

The last command works because the logical expression x>=100 is assigned the value one if it is true and zero if it is false. This is a somewhat exotic usage of the SET statement, but if you understand it, it can save you some typing.

