# Descriptive Statistics

SST will compute a wide variety of descriptive statistics on your data set. These include means, standard deviations, ranges, correlations, and one-way and multi-way cross-tabulations. These procedures are useful for checking data entry and exploring simple relationships between variables.

## Setting the range for procedures

Each SST statistical procedure allows you to restrict the range of observations over which statistics will be computed using the `IF` and `OBS` subops. However, you may want to use the same range for several procedures and avoid typing the `IF` and `OBS` subops over and over. To do this, issue a `RANGE` statement with the `IF` or `OBS` subops. For example:

```
range obs[100-200] if[x > 0]
```

Until you issue a new `RANGE` statement, only observations numbered between 100 and 200 for which the value of the variable `x` is positive will be used.
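The effect of a `RANGE` statement can be pictured as a filter applied before any statistic is computed. The following Python sketch is not SST code: the helper `in_range` and the sample data are hypothetical, and it assumes that observations are numbered from 1 and that both bounds of `obs[100-200]` are inclusive.

```python
# Illustrative sketch only (not SST): emulate range obs[100-200] if[x > 0].
# Assumptions: observations are numbered from 1 and both bounds are inclusive.

def in_range(obs_number, x, first=100, last=200):
    """Return True when an observation passes both the OBS and the IF test."""
    return first <= obs_number <= last and x > 0

# Hypothetical data: x is positive for even-numbered observations only.
x_values = [(-1) ** i * i for i in range(1, 301)]

# Observation numbers that remain active under the range.
active = [n for n, x in enumerate(x_values, start=1) if in_range(n, x)]

print(len(active), active[:3])
```

Every later computation in the sketch would then use only the observations listed in `active`.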

## Obtaining frequency distributions

The `FREQ` command calculates one-way frequency distributions. In our example, the command:

```
freq var[party]
```

produces the following output:

```
                    party

                    0           1
             Democrat       Repub
           ----------  ----------
Count              12           9
Percent         57.14       42.86
```

Across the top of the frequency table are the values of the variable `party`. Underneath each value is its label (if any). The rows of the table give the number of observations taking the value and the percent of nonmissing observations taking that value.
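The arithmetic behind the two rows can be sketched in Python. This is not SST code; the counts 12 and 9 come from the output above, while the two missing values (encoded here as `None`) are hypothetical and are included only to show that missing observations do not enter the denominator.

```python
from collections import Counter

# Illustrative sketch only (not SST): a one-way frequency table.
# None stands in for a missing value; this encoding is an assumption.
party = [0] * 12 + [1] * 9 + [None] * 2

nonmissing = [v for v in party if v is not None]
counts = Counter(nonmissing)

# Percentages use the number of nonmissing observations as the denominator.
percents = {value: round(100 * n / len(nonmissing), 2) for value, n in counts.items()}

for value in sorted(counts):
    print(value, counts[value], percents[value])
```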

## Contingency tables

The `TABLE` command is used to produce two-way, three-way, and higher dimensional contingency tables. `TABLE` crosstabulates the first variable specified in the `VAR` subop (the row variable) by the second variable in the `VAR` subop (the column variable). In its simplest and most common use, only two variables will be specified and one table will be created.

Output from the `TABLE` command shows the number of occurrences in each cell as well as the percentage this represents of the column. If one wants percentages of the row total instead, specify the `ROW` subop.

Contingency tables are most useful when each variable takes on only a few values. In our example, most of the variables are continuous (taking many distinct values, with the same value rarely occurring more than once), so a contingency table on these variables directly would be of little use. One possibility is to recode these variables into a few categories. For our purposes, two categories will be enough for each variable:

```
recode var[money inflat unemp] map[lo thru 7=0,else=1]
```
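The mapping performed by such a recode can be sketched in Python. This is not SST code: the function `recode_low_high` and the sample values are hypothetical, the cutoff of seven percent follows the discussion of the tables in this section, and treating the upper bound of the interval as inclusive is an assumption.

```python
# Illustrative sketch only (not SST): a two-category recode with a cutoff of 7.
# Assumption: the upper bound of the "lo thru" interval is inclusive.

def recode_low_high(value, cutoff=7.0):
    """Map values up to the cutoff to 0 (low) and all other values to 1 (high)."""
    return 0 if value <= cutoff else 1

labels = {0: "low", 1: "high"}

# Hypothetical growth rates, in percent.
sample = [0.7, 5.3, 7.0, 9.3]
recoded = [recode_low_high(v) for v in sample]

print([labels[c] for c in recoded])
```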

We will label the categories `low` and `high`. First we try a two-way table:

```
table var[inflat money]
```

The `TABLE` output is:

```
********** Crosstabulation of inflat by money **********

                       money
      |---------|---------|---------|
      |  COUNT  |      low|     high|   ROW
      | COL PCT |        0|        1|  TOTAL
inflat|---------|---------|---------|
      |        0|     13  |      3  |     16
      |      low|   81.3  |   60.0  |   76.2
      |---------|---------|---------|
      |        1|      3  |      2  |      5
      |     high|   18.8  |   40.0  |   23.8
      |---------|---------|---------|
         COLUMN       16         5        21
          TOTAL     76.2      23.8     100.0
```

Note that there are 13 observations for which both `money` and `inflat` are zero, which means that both money growth and inflation are under seven percent. The numbers around the edges of the table are the row and column marginal totals. For example, the number of `low` values for the variable `money` is given by the column total 16. Similarly, the row totals give the marginal distribution of the row variable `inflat`: the 16 observations for which inflation is `low` (i.e., under seven percent) represent 76.2% of the total of 21 observations.

By default, SST computes column percentages. If you would like the cell entries to be row percentages instead, add the `ROW` subop to the `TABLE` command.
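The difference between the two kinds of percentages is only a matter of which marginal total is used as the denominator. The following Python sketch (not SST code) uses the cell counts from the `inflat` by `money` table; the variable names are hypothetical.

```python
# Illustrative sketch only (not SST): column and row percentages for a 2x2 table.
# Cell counts come from the inflat-by-money table (rows: inflat, columns: money).
counts = [[13, 3],
          [3, 2]]

row_totals = [sum(row) for row in counts]
col_totals = [sum(col) for col in zip(*counts)]
grand_total = sum(row_totals)

# Column percentages: each cell divided by its column total.
col_pct = [[100 * c / col_totals[j] for j, c in enumerate(row)] for row in counts]

# Row percentages: each cell divided by its row total.
row_pct = [[100 * c / row_totals[i] for c in row] for i, row in enumerate(counts)]

print(col_pct)
print(row_pct)
```

In this particular table the row and column totals happen to coincide (16 and 5), so some cells have the same row and column percentage; the off-diagonal cells do not (for example, 60.0 percent of the column but 18.75 percent of the row for the upper right cell).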

SST will also compute chi-square statistics for testing the independence of two discrete variables if you add the `MEASURES` subop to the `TABLE` command:

```
table var[money inflat] measures
```

SST prints the value of the chi-square statistic for the table along with its degrees of freedom. The hypothesis of independence is rejected if the computed value of this statistic exceeds the critical value corresponding to the significance level chosen for the test. The critical value is chosen so that, if the null hypothesis of independence is correct, the probability of obtaining a value of the test statistic larger than the critical value is equal to the significance level. The critical value can be determined by consulting a table of the chi-square distribution.
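For readers who want to see what such a statistic measures, here is a Python sketch of the standard Pearson chi-square computation. It is not SST code, and the text does not say which formula SST uses, so the absence of a continuity correction is an assumption and the result should not be read as SST's own output. The function name `pearson_chi_square` is hypothetical; the cell counts are those of the `money` and `inflat` table.

```python
# Illustrative sketch only (not SST): Pearson chi-square test of independence.
# Assumption: no continuity correction; SST's exact formula is not given in the text.

def pearson_chi_square(counts):
    """Return (statistic, degrees of freedom) for a table of cell counts."""
    row_totals = [sum(row) for row in counts]
    col_totals = [sum(col) for col in zip(*counts)]
    n = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(counts):
        for j, observed in enumerate(row):
            # Count expected in this cell if rows and columns were independent.
            expected = row_totals[i] * col_totals[j] / n
            stat += (observed - expected) ** 2 / expected
    df = (len(row_totals) - 1) * (len(col_totals) - 1)
    return stat, df

stat, df = pearson_chi_square([[13, 3], [3, 2]])
print(round(stat, 3), df)
```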

Instead of determining the critical value from a table of the chi-square distribution, you may prefer to compute a p-value for the test statistic. The p-value is the probability (under the null hypothesis of independence) of obtaining a value of the test statistic greater than or equal to the observed value of the test statistic. Suppose for example that SST produces a chi-square statistic of 5.02 with one degree of freedom. The `CALC` command can be used to evaluate the upper tail probability:

```
calc 1.0-cumchi(5.02,1)
0.025
```

Thus the p-value for this test statistic is 0.025. This means that the null hypothesis can be rejected at any significance level larger than 0.025 (for example, the conventional 0.05 level), but not at a smaller level such as 0.01.
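The same upper tail probability can be checked outside SST. The Python sketch below is not SST code and covers only the case of one degree of freedom, where a chi-square variable is the square of a standard normal variable, so the upper tail probability equals `erfc(sqrt(x / 2))`. The function name `chi2_upper_tail_1df` is hypothetical.

```python
import math

# Illustrative sketch only (not SST): upper-tail chi-square probability, 1 df.
# With 1 df the statistic is a squared standard normal, so
# P(X >= x) = P(|Z| >= sqrt(x)) = erfc(sqrt(x / 2)).

def chi2_upper_tail_1df(x):
    """Probability that a chi-square(1) variable is at least x."""
    return math.erfc(math.sqrt(x / 2.0))

p_value = chi2_upper_tail_1df(5.02)
print(round(p_value, 3))

# Decision at two common significance levels.
for alpha in (0.05, 0.01):
    print(alpha, "reject" if p_value < alpha else "do not reject")
```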

## Multiway contingency tables

Finally, n-way tables can be produced by specifying additional variables in the `VAR` subop. For example:

```
table var[money inflat unemp]
```

will produce two tables. The variable `unemp` has been recoded so that it takes two values (zero and one). The `TABLE` command will produce a cross-tabulation of the variables `money` and `inflat` for each value of the variable `unemp`:

```
********** Crosstabulation of money by inflat **********

unemp = low ( 0 )

                     inflat
     |---------|---------|---------|
     |  COUNT  |      low|     high|   ROW
     | COL PCT |        0|        1|  TOTAL
money|---------|---------|---------|
     |        0|     12  |      1  |     13
     |      low|   85.7  |   33.3  |   76.5
     |---------|---------|---------|
     |        1|      2  |      2  |      4
     |     high|   14.3  |   66.7  |   23.5
     |---------|---------|---------|
        COLUMN       14         3        17
         TOTAL     82.4      17.6     100.0

********** Crosstabulation of money by inflat **********

unemp = hi ( 1 )

                     inflat
     |---------|---------|---------|
     |  COUNT  |      low|     high|   ROW
     | COL PCT |        0|        1|  TOTAL
money|---------|---------|---------|
     |        0|      1  |      2  |      3
     |      low|   50.0  |  100.0  |   75.0
     |---------|---------|---------|
     |        1|      1  |      0  |      1
     |     high|   50.0  |    0.0  |   25.0
     |---------|---------|---------|
        COLUMN        2         2         4
         TOTAL     50.0      50.0     100.0
```

If additional variables are specified in the `VAR` subop, a separate table crosstabulating the first and the second variables for each combination of values in the remaining variables will be created. In this manner, an n-way table is constructed.
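The way an n-way table decomposes into a stack of two-way tables can be sketched in Python. This is not SST code: the function `crosstab` and the six records are hypothetical and do not reproduce the data set used in this section.

```python
from collections import defaultdict

# Illustrative sketch only (not SST): one two-way table per combination of the
# remaining variables. The records are hypothetical, not the manual's data.
records = [
    {"money": 0, "inflat": 0, "unemp": 0},
    {"money": 0, "inflat": 1, "unemp": 0},
    {"money": 1, "inflat": 1, "unemp": 0},
    {"money": 0, "inflat": 1, "unemp": 1},
    {"money": 1, "inflat": 0, "unemp": 1},
    {"money": 0, "inflat": 0, "unemp": 0},
]

def crosstab(records, variables):
    """Count cells of variables[0] by variables[1] within each level of the rest."""
    row_var, col_var, *layer_vars = variables
    tables = defaultdict(lambda: defaultdict(int))
    for rec in records:
        layer = tuple(rec[v] for v in layer_vars)   # e.g. (unemp,)
        tables[layer][(rec[row_var], rec[col_var])] += 1
    return tables

tables = crosstab(records, ["money", "inflat", "unemp"])
for layer in sorted(tables):
    print(layer, dict(tables[layer]))
```

With a fourth variable in the list, `layer` would become a pair of values and one table would be counted for each pair that occurs in the data.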

## Univariate statistics

The `COVA` command computes descriptive statistics (means, standard deviations, ranges, correlations) on a set of one or more variables. We will first consider the use of the `COVA` command for producing univariate statistics. If you do not include any subops other than the `VAR` subop, `COVA` will calculate by default the mean, minimum, maximum, and standard deviation of each variable in the `VAR` subop. Using an asterisk (`*`) to match all variables, we obtain a complete set of univariate statistics for the variables in memory:

```
cova var[*]
```

SST produces the following output:

```
          nobs        mean         min         max     std dev
year        21        1970        1960        1980       6.055
money       21         5.3         0.7         9.3       2.211
inflat      21       4.729         0.9         9.3       2.625
unemploy    21       5.571         3.5         8.5       1.312
party       21       0.429           0           1       0.495
```

Thus we see that the variable `year` ranges from 1960 to 1980, with its mean equal to 1970 and its standard deviation equal to 6.055. The same information is provided for the remaining variables.
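The row for `year` can be reproduced with a few lines of Python. This sketch is not SST code. It assumes one observation for each year from 1960 through 1980, and it uses the form of the standard deviation that divides by the number of observations; that choice reproduces the printed 6.055, but the text does not state which formula SST uses.

```python
import statistics

# Illustrative sketch only (not SST): the default univariate summary for one variable.
# Assumptions: one observation per year, and a standard deviation that divides by n
# (population form), which matches the 6.055 printed for year.
year = list(range(1960, 1981))   # 21 annual observations, 1960 through 1980

summary = {
    "nobs": len(year),
    "mean": statistics.mean(year),
    "min": min(year),
    "max": max(year),
    "std dev": statistics.pstdev(year),
}
print(summary)
```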

## Correlations and covariances

If the subop `COV` is added to the `COVA` command, SST will produce a matrix of correlations and covariances for the variables specified in the `VAR` subop instead of the univariate statistics obtained above:

```
cova var[*] cov
```

The output looks like:

```
Correlation and Covariance matrix
                 year        money       inflat        unemp        party

year       36.6666667    9.5285715   14.4666668    3.5428573    0.4761905
money       0.7117068    4.8885714    2.7971430    0.5428572   -0.0571429
inflat      0.9101309    0.4819420    6.8906124    1.3865307    0.3639456
unemploy    0.4458577    0.1870996    0.4025120    1.7220408    0.0931973
party       0.1589104   -0.0522250    0.2801657    0.1435123    0.2448980
```

The entries along the diagonal of the matrix are the variances of each variable. Below the diagonal are Pearson correlation coefficients between the row and column variables. Above the diagonal are covariances among the variables.
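The three kinds of entries are linked by the definition of the Pearson correlation: a covariance divided by the product of the two standard deviations. The Python sketch below (not SST code) checks this for `year` and `money` using values copied from the output; the variable names are hypothetical.

```python
import math

# Illustrative sketch only (not SST): relate the entries of the combined matrix.
# Values are copied from the year and money rows of the output.
var_year = 36.6666667        # diagonal entry for year
var_money = 4.8885714        # diagonal entry for money
cov_year_money = 9.5285715   # above the diagonal (row year, column money)

# Pearson correlation = covariance / (std dev of first * std dev of second).
corr_year_money = cov_year_money / math.sqrt(var_year * var_money)

print(round(corr_year_money, 4))
```

The result agrees with the entry below the diagonal (row `money`, column `year`) to the precision shown.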

## Options to the COVA command

SST allows you to modify the output of the `COVA` command by specifying subops indicating which statistics you want output. The possible options are `MEAN` (for means), `MIN` (for minimums), `MAX` (for maximums), `STDDEV` (for standard deviations), and `COV` (for a correlation/covariance matrix).

As with most SST commands, the range of observations over which the statistics are calculated can be altered by adding the `IF` or `OBS` subops. To obtain the mean inflation rate in Republican administrations, one would enter:

```
cova var[inflat] mean if[party==1]

          nobs        mean
inflat       9       5.578
```

The `IF` and `OBS` subops determine which of the observations currently active under the `RANGE` statement will be used to calculate the requested statistics; thus these subops act as further qualifications to the `RANGE` statement.

If an observation has missing data for any variable specified in the variable list, the entire observation is omitted from the calculations on all variables. This "listwise" deletion of missing data ensures that the statistics for all variables depend upon a common set of observations.
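Listwise deletion can be sketched in Python as follows. This is not SST code: the rows are hypothetical, and representing a missing value by `None` is an assumption made for the example.

```python
# Illustrative sketch only (not SST): listwise deletion before computing means.
# None marks a missing value; the rows are hypothetical.
rows = [
    {"money": 5.0, "inflat": 4.0},
    {"money": None, "inflat": 6.0},   # dropped: money is missing
    {"money": 7.0, "inflat": None},   # dropped: inflat is missing
    {"money": 3.0, "inflat": 2.0},
]
variables = ["money", "inflat"]

# Keep only observations that are complete on every listed variable.
complete = [r for r in rows if all(r[v] is not None for v in variables)]

means = {v: sum(r[v] for r in complete) / len(complete) for v in variables}
print(len(complete), means)
```

Note that the `inflat` value 6.0 in the second row is discarded even though it is present, because `money` is missing for that observation.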