4 Summary statistics

This section dicusses how to produce and analyse basic summary statistics. We make a distinction between categorical and continuous variables, for which different statistics are permissible.

OK to compute…. Nominal Ordinal Interval Ratio
frequency distribution Yes Yes Yes Yes
median and percentiles No Yes Yes Yes
mean, standard deviation, standard error of the mean No No Yes Yes
ratio, or coefficient of variation No No No Yes

As an example data set, we will be using the MRDA course survey data. Let’s load and inspect the data first.

test_data <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/survey2017.dat", 
    sep = "\t", header = TRUE)
head(test_data)

4.1 Categorical variables

Categorical variables contain a finite number of categories or distinct groups and are also known as qualitative variables. There are different types of categorical variables:

  • Nominal variables: variables that have two or more categories but no logical order (e.g., music genres). A dichotomous variables is simply a nominal variable that only has two categories (e.g., gender).
  • Ordinal variables: variables that have two or more categories that can also be ordered or ranked (e.g., income groups).

For this example, we are interested in the following two variables

  • “overall_knowledge”: measures the self-reported prior knowledge of marketing students in statistics before taking the marketing research class on a 5-point scale with the categories “none”, “basic”, “intermediate”,“advanced”, and “proficient”.
  • “gender”: the gender of the students (1 = male, 2 = female)

In a first step, we convert the variables to factor variables using the factor() function to assign appropriate labels according to the scale points:

test_data$overall_knowledge_cat <- factor(test_data$overall_knowledge, 
    levels = c(1:5), labels = c("none", "basic", "intermediate", 
        "advanced", "proficient"))
test_data$gender_cat <- factor(test_data$gender, levels = c(1:2), 
    labels = c("male", "female"))

The table() function creates a frequency table. Let’s start with the number of occurences of the categories associated with the prior knowledge and gender variables separately:

table(test_data[, c("overall_knowledge_cat")])  #absolute frequencies
## 
##         none        basic intermediate     advanced   proficient 
##           10           32            9            4            0
table(test_data[, c("gender_cat")])  #absolute frequencies
## 
##   male female 
##     14     41

It is obvious that there are more female than male students. For variables with more categories, it might be less obvious and we might compute the median category using the median() function, or the summary() function, which produces further statistics.

median((test_data[, c("overall_knowledge")]))
## [1] 2
summary((test_data[, c("overall_knowledge")]))
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##   1.000   2.000   2.000   2.127   2.000   4.000

Often, we are interested in the relative frequencies, which can be obtained by using the prop.table() function.

prop.table(table(test_data[, c("overall_knowledge_cat")]))  #relative frequencies
## 
##         none        basic intermediate     advanced   proficient 
##   0.18181818   0.58181818   0.16363636   0.07272727   0.00000000
prop.table(table(test_data[, c("gender_cat")]))  #relative frequencies
## 
##      male    female 
## 0.2545455 0.7454545

Now let’s investigate if the prior knowledge differs by gender. To do this, we simply apply the table() function to both variables:

table(test_data[, c("overall_knowledge_cat", "gender_cat")])  #absolute frequencies
##                      gender_cat
## overall_knowledge_cat male female
##          none            1      9
##          basic           8     24
##          intermediate    4      5
##          advanced        1      3
##          proficient      0      0

Again, it might be more meaningful to look at the relative frequencies using prop.table():

prop.table(table(test_data[, c("overall_knowledge_cat", 
    "gender_cat")]))  #relative frequencies
##                      gender_cat
## overall_knowledge_cat       male     female
##          none         0.01818182 0.16363636
##          basic        0.14545455 0.43636364
##          intermediate 0.07272727 0.09090909
##          advanced     0.01818182 0.05454545
##          proficient   0.00000000 0.00000000

Note that the above output shows the overall relative frequencies when male and female respondents are considered together. In this context, it might be even more meaningful to look at the conditional relative frequencies. This can be achieved by adding a ,2 to the prop.table() command, which tells R to compute the relative frequencies by the columns (which is in our case the gender variable):

prop.table(table(test_data[, c("overall_knowledge_cat", 
    "gender_cat")]), 2)  #conditional relative frequencies
##                      gender_cat
## overall_knowledge_cat       male     female
##          none         0.07142857 0.21951220
##          basic        0.57142857 0.58536585
##          intermediate 0.28571429 0.12195122
##          advanced     0.07142857 0.07317073
##          proficient   0.00000000 0.00000000

4.2 Continuous variables

4.2.1 Descriptive statistics

Continuous variables are numeric variables that can take on any value on a measurement scale (i.e., there is an infinite number of values between any two values). There are different types of continuous variables:

  • Interval variables: while the zero point is arbitrary, equal intervals on the scale represent equal differences in the property being measured. E.g., on a temperature scale measured in Celsius the difference between a temperature of 15 degrees and 25 degrees is the same difference as between 25 degrees and 35 degrees but the zero point is arbitrary.
  • Ratio variables: has all the properties of an interval variable, but also has an absolute zero point. When the variable equals 0.0, it means that there is none of that variable (e.g., number of products sold, willingness-to-pay, mileage a car gets).

Computing descriptive statistics in R is easy and there are many functions from different packages that let you calculate summary statistics (including the summary() function from the base package). In this tutorial, we will use the describe() function from the psych package:

library(psych)
psych::describe(test_data[, c("duration", "overall_100")])
##             vars  n    mean       sd median trimmed   mad min    max
## duration       1 55 4977.16 33672.62    237  271.20 93.40 107 250092
## overall_100    2 55   39.18    20.43     40   38.89 29.65   0     80
##              range skew kurtosis      se
## duration    249985 7.01    48.05 4540.41
## overall_100     80 0.12    -1.13    2.75

In the above command, we used the psych:: prefix to avoid confusion and to make sure that R uses the describe() function from the psych package since there are many other packages that also contain a desribe() function. Note that you could also compute these statistics separately by using the respective functions (e.g., mean(), sd(), median(), min(), max(), etc.).

The psych package also contains the describeBy() function, which lets you compute the summary statistics by sub-group separately. For example, we could easily compute the summary statistics by gender as follows:

describeBy(test_data[, c("duration", "overall_100")], 
    test_data$gender_cat)
## 
##  Descriptive statistics by group 
## group: male
##             vars  n     mean       sd median trimmed    mad min    max
## duration       1 14 18073.86 66779.48  238.5  236.25 132.69 107 250092
## overall_100    2 14    43.93    18.52   42.5   44.17  18.53  10     75
##              range skew kurtosis       se
## duration    249985 2.98     7.41 17847.57
## overall_100     65 0.00    -1.19     4.95
## -------------------------------------------------------- 
## group: female
##             vars  n   mean     sd median trimmed   mad min  max range skew
## duration       1 41 505.12 906.32    237  283.42 93.40 114 4735  4621 3.70
## overall_100    2 41  37.56  21.01     30   36.97 22.24   0   80    80 0.21
##             kurtosis     se
## duration       13.09 141.54
## overall_100    -1.17   3.28

Note that you could just as well use other packages to compute the descriptive statistics. For example, you could have used the stat.desc() function from the pastecs package:

library(pastecs)
stat.desc(test_data[, c("duration", "overall_100")])
##                       duration  overall_100
## nbr.val              55.000000   55.0000000
## nbr.null              0.000000    1.0000000
## nbr.na                0.000000    0.0000000
## min                 107.000000    0.0000000
## max              250092.000000   80.0000000
## range            249985.000000   80.0000000
## sum              273744.000000 2155.0000000
## median              237.000000   40.0000000
## mean               4977.163636   39.1818182
## SE.mean            4540.414684    2.7547438
## CI.mean.0.95       9102.983359    5.5229288
## var          1133845102.435690  417.3737374
## std.dev           33672.616507   20.4297268
## coef.var              6.765423    0.5214083

Computing statistics by group is also possible by using the wrapper function by(). Within the function, you first specify the data on which you would like to perform the grouping test_data[,c("duration", "overall_100")], followed by the grouping variable test_data$gender_cat and the function that you would like to execute (e.g., stat.desc()):

library(pastecs)
by(test_data[, c("duration", "overall_100")], test_data$gender_cat, 
    stat.desc)
## test_data$gender_cat: male
##                       duration overall_100
## nbr.val              14.000000   14.000000
## nbr.null              0.000000    0.000000
## nbr.na                0.000000    0.000000
## min                 107.000000   10.000000
## max              250092.000000   75.000000
## range            249985.000000   65.000000
## sum              253034.000000  615.000000
## median              238.500000   42.500000
## mean              18073.857143   43.928571
## SE.mean           17847.566755    4.949708
## CI.mean.0.95      38557.323811   10.693194
## var          4459498946.901099  342.994505
## std.dev           66779.479984   18.520111
## coef.var              3.694811    0.421596
## -------------------------------------------------------- 
## test_data$gender_cat: female
##                   duration overall_100
## nbr.val          41.000000   41.000000
## nbr.null          0.000000    1.000000
## nbr.na            0.000000    0.000000
## min             114.000000    0.000000
## max            4735.000000   80.000000
## range          4621.000000   80.000000
## sum           20710.000000 1540.000000
## median          237.000000   30.000000
## mean            505.121951   37.560976
## SE.mean         141.543022    3.281145
## CI.mean.0.95    286.069118    6.631442
## var          821411.509756  441.402439
## std.dev         906.317555   21.009580
## coef.var          1.794255    0.559346

These examples are meant to exemplify that there are often many different ways to reach a goal in R. Which one you choose depends on what type of information you seek (the results provide slightly different information) and on personal preferences.

4.2.2 Creating subsets

From the above statistics it is clear that the data set contains some severe outliers on some variables. For example, the maximum duration is 2.9 days. You might want to investigate these cases and delete them if they would turn out to indeed induce a bias in your analyses. You could easily create a subset of the original data, which you would then use for estimation using the subset() function. For example, the follwing code creates a subset that excludes all cases with a duration of more than 900 seconds (15 minutes):

estimation_sample <- subset(test_data, duration < 900)
psych::describe(estimation_sample[, c("duration", "overall_100")])
##             vars  n   mean     sd median trimmed   mad min max range skew
## duration       1 51 266.51 148.48    230  239.02 88.96 107 785   678 1.70
## overall_100    2 51  38.53  20.40     40   38.29 29.65   0  80    80 0.11
##             kurtosis    se
## duration        2.64 20.79
## overall_100    -1.15  2.86

4.2.3 Confidence intervals

Finally, we could be interested the generalize the finding from our sample to the entire student population. We could ask: what is the likely range of values that the overall knowledge might take in the entire student population with a certain level of confidence (e.g., 95%). Confidence interval provide answers to this kind of questions. Remember the formula for the confidence interval:

\[\begin{equation} \begin{split} CI = \overline{X}\pm(z*\frac{s}{\sqrt{n}}) \end{split} \tag{4.1} \end{equation}\]

, where \(\overline{x}\) is the sample mean, \(\frac{s}{\sqrt{n}}\) is the standard error of the mean, and z is the critical z-score for a given level of confidence. If the interval is small, the sample must be very close to the population. Conversely, when the interval is wide, the sample mean is likely very different from the population mean and therefore a bad representation of the population. We can obtain the critical z-score for the 95% CI using the qnorm()-function. Using nrow(data_set), we can obtain the number ob rows (i.e., observations) in a dataset.

mean(test_data$overall_100)
## [1] 39.18182
error <- qnorm(0.975) * sd(test_data$overall_100)/sqrt(nrow(test_data))
ci_lower <- mean(test_data$overall_100) - error
ci_upper <- mean(test_data$overall_100) + error
print(ci_lower)
## [1] 33.78262
print(ci_upper)
## [1] 44.58102

Given our sample size and the variability in our data, we are 95% confident that the true level of overall statstical knowledge is between 34 and 45!

Note that the CI is also implicitely included in the summary statistics created by the stat.desc() function.

mean_x <- stat.desc(test_data$overall_100)["mean"]
error_x <- stat.desc(test_data$overall_100)["CI.mean.0.95"]
ci_lower_x <- mean_x - error_x
ci_upper_x <- mean_x + error_x
ci_lower_x
##     mean 
## 33.65889
ci_upper_x
##     mean 
## 44.70475