4 Summary statistics
This section dicusses how to produce and analyse basic summary statistics. We make a distinction between categorical and continuous variables, for which different statistics are permissible.
OK to compute…. | Nominal | Ordinal | Interval | Ratio |
---|---|---|---|---|
frequency distribution | Yes | Yes | Yes | Yes |
median and percentiles | No | Yes | Yes | Yes |
mean, standard deviation, standard error of the mean | No | No | Yes | Yes |
ratio, or coefficient of variation | No | No | No | Yes |
As an example data set, we will be using the MRDA course survey data. Let’s load and inspect the data first.
test_data <- read.table("https://raw.githubusercontent.com/IMSMWU/Teaching/master/MRDA2017/survey2017.dat",
sep = "\t", header = TRUE)
head(test_data)
4.1 Categorical variables
Categorical variables contain a finite number of categories or distinct groups and are also known as qualitative variables. There are different types of categorical variables:
- Nominal variables: variables that have two or more categories but no logical order (e.g., music genres). A dichotomous variables is simply a nominal variable that only has two categories (e.g., gender).
- Ordinal variables: variables that have two or more categories that can also be ordered or ranked (e.g., income groups).
For this example, we are interested in the following two variables
- “overall_knowledge”: measures the self-reported prior knowledge of marketing students in statistics before taking the marketing research class on a 5-point scale with the categories “none”, “basic”, “intermediate”,“advanced”, and “proficient”.
- “gender”: the gender of the students (1 = male, 2 = female)
In a first step, we convert the variables to factor variables using the factor()
function to assign appropriate labels according to the scale points:
test_data$overall_knowledge_cat <- factor(test_data$overall_knowledge,
levels = c(1:5), labels = c("none", "basic", "intermediate",
"advanced", "proficient"))
test_data$gender_cat <- factor(test_data$gender, levels = c(1:2),
labels = c("male", "female"))
The table()
function creates a frequency table. Let’s start with the number of occurences of the categories associated with the prior knowledge and gender variables separately:
table(test_data[, c("overall_knowledge_cat")]) #absolute frequencies
##
## none basic intermediate advanced proficient
## 10 32 9 4 0
table(test_data[, c("gender_cat")]) #absolute frequencies
##
## male female
## 14 41
It is obvious that there are more female than male students. For variables with more categories, it might be less obvious and we might compute the median category using the median()
function, or the summary()
function, which produces further statistics.
median((test_data[, c("overall_knowledge")]))
## [1] 2
summary((test_data[, c("overall_knowledge")]))
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 1.000 2.000 2.000 2.127 2.000 4.000
Often, we are interested in the relative frequencies, which can be obtained by using the prop.table()
function.
prop.table(table(test_data[, c("overall_knowledge_cat")])) #relative frequencies
##
## none basic intermediate advanced proficient
## 0.18181818 0.58181818 0.16363636 0.07272727 0.00000000
prop.table(table(test_data[, c("gender_cat")])) #relative frequencies
##
## male female
## 0.2545455 0.7454545
Now let’s investigate if the prior knowledge differs by gender. To do this, we simply apply the table()
function to both variables:
table(test_data[, c("overall_knowledge_cat", "gender_cat")]) #absolute frequencies
## gender_cat
## overall_knowledge_cat male female
## none 1 9
## basic 8 24
## intermediate 4 5
## advanced 1 3
## proficient 0 0
Again, it might be more meaningful to look at the relative frequencies using prop.table()
:
prop.table(table(test_data[, c("overall_knowledge_cat",
"gender_cat")])) #relative frequencies
## gender_cat
## overall_knowledge_cat male female
## none 0.01818182 0.16363636
## basic 0.14545455 0.43636364
## intermediate 0.07272727 0.09090909
## advanced 0.01818182 0.05454545
## proficient 0.00000000 0.00000000
Note that the above output shows the overall relative frequencies when male and female respondents are considered together. In this context, it might be even more meaningful to look at the conditional relative frequencies. This can be achieved by adding a ,2
to the prop.table()
command, which tells R to compute the relative frequencies by the columns (which is in our case the gender variable):
prop.table(table(test_data[, c("overall_knowledge_cat",
"gender_cat")]), 2) #conditional relative frequencies
## gender_cat
## overall_knowledge_cat male female
## none 0.07142857 0.21951220
## basic 0.57142857 0.58536585
## intermediate 0.28571429 0.12195122
## advanced 0.07142857 0.07317073
## proficient 0.00000000 0.00000000
4.2 Continuous variables
4.2.1 Descriptive statistics
Continuous variables are numeric variables that can take on any value on a measurement scale (i.e., there is an infinite number of values between any two values). There are different types of continuous variables:
- Interval variables: while the zero point is arbitrary, equal intervals on the scale represent equal differences in the property being measured. E.g., on a temperature scale measured in Celsius the difference between a temperature of 15 degrees and 25 degrees is the same difference as between 25 degrees and 35 degrees but the zero point is arbitrary.
- Ratio variables: has all the properties of an interval variable, but also has an absolute zero point. When the variable equals 0.0, it means that there is none of that variable (e.g., number of products sold, willingness-to-pay, mileage a car gets).
Computing descriptive statistics in R is easy and there are many functions from different packages that let you calculate summary statistics (including the summary()
function from the base
package). In this tutorial, we will use the describe()
function from the psych
package:
library(psych)
psych::describe(test_data[, c("duration", "overall_100")])
## vars n mean sd median trimmed mad min max
## duration 1 55 4977.16 33672.62 237 271.20 93.40 107 250092
## overall_100 2 55 39.18 20.43 40 38.89 29.65 0 80
## range skew kurtosis se
## duration 249985 7.01 48.05 4540.41
## overall_100 80 0.12 -1.13 2.75
In the above command, we used the psych::
prefix to avoid confusion and to make sure that R uses the describe()
function from the psych
package since there are many other packages that also contain a desribe()
function. Note that you could also compute these statistics separately by using the respective functions (e.g., mean()
, sd()
, median()
, min()
, max()
, etc.).
The psych
package also contains the describeBy()
function, which lets you compute the summary statistics by sub-group separately. For example, we could easily compute the summary statistics by gender as follows:
describeBy(test_data[, c("duration", "overall_100")],
test_data$gender_cat)
##
## Descriptive statistics by group
## group: male
## vars n mean sd median trimmed mad min max
## duration 1 14 18073.86 66779.48 238.5 236.25 132.69 107 250092
## overall_100 2 14 43.93 18.52 42.5 44.17 18.53 10 75
## range skew kurtosis se
## duration 249985 2.98 7.41 17847.57
## overall_100 65 0.00 -1.19 4.95
## --------------------------------------------------------
## group: female
## vars n mean sd median trimmed mad min max range skew
## duration 1 41 505.12 906.32 237 283.42 93.40 114 4735 4621 3.70
## overall_100 2 41 37.56 21.01 30 36.97 22.24 0 80 80 0.21
## kurtosis se
## duration 13.09 141.54
## overall_100 -1.17 3.28
Note that you could just as well use other packages to compute the descriptive statistics. For example, you could have used the stat.desc()
function from the pastecs
package:
library(pastecs)
stat.desc(test_data[, c("duration", "overall_100")])
## duration overall_100
## nbr.val 55.000000 55.0000000
## nbr.null 0.000000 1.0000000
## nbr.na 0.000000 0.0000000
## min 107.000000 0.0000000
## max 250092.000000 80.0000000
## range 249985.000000 80.0000000
## sum 273744.000000 2155.0000000
## median 237.000000 40.0000000
## mean 4977.163636 39.1818182
## SE.mean 4540.414684 2.7547438
## CI.mean.0.95 9102.983359 5.5229288
## var 1133845102.435690 417.3737374
## std.dev 33672.616507 20.4297268
## coef.var 6.765423 0.5214083
Computing statistics by group is also possible by using the wrapper function by()
. Within the function, you first specify the data on which you would like to perform the grouping test_data[,c("duration", "overall_100")]
, followed by the grouping variable test_data$gender_cat
and the function that you would like to execute (e.g., stat.desc()
):
library(pastecs)
by(test_data[, c("duration", "overall_100")], test_data$gender_cat,
stat.desc)
## test_data$gender_cat: male
## duration overall_100
## nbr.val 14.000000 14.000000
## nbr.null 0.000000 0.000000
## nbr.na 0.000000 0.000000
## min 107.000000 10.000000
## max 250092.000000 75.000000
## range 249985.000000 65.000000
## sum 253034.000000 615.000000
## median 238.500000 42.500000
## mean 18073.857143 43.928571
## SE.mean 17847.566755 4.949708
## CI.mean.0.95 38557.323811 10.693194
## var 4459498946.901099 342.994505
## std.dev 66779.479984 18.520111
## coef.var 3.694811 0.421596
## --------------------------------------------------------
## test_data$gender_cat: female
## duration overall_100
## nbr.val 41.000000 41.000000
## nbr.null 0.000000 1.000000
## nbr.na 0.000000 0.000000
## min 114.000000 0.000000
## max 4735.000000 80.000000
## range 4621.000000 80.000000
## sum 20710.000000 1540.000000
## median 237.000000 30.000000
## mean 505.121951 37.560976
## SE.mean 141.543022 3.281145
## CI.mean.0.95 286.069118 6.631442
## var 821411.509756 441.402439
## std.dev 906.317555 21.009580
## coef.var 1.794255 0.559346
These examples are meant to exemplify that there are often many different ways to reach a goal in R. Which one you choose depends on what type of information you seek (the results provide slightly different information) and on personal preferences.
4.2.2 Creating subsets
From the above statistics it is clear that the data set contains some severe outliers on some variables. For example, the maximum duration is 2.9 days. You might want to investigate these cases and delete them if they would turn out to indeed induce a bias in your analyses. You could easily create a subset of the original data, which you would then use for estimation using the subset()
function. For example, the follwing code creates a subset that excludes all cases with a duration of more than 900 seconds (15 minutes):
estimation_sample <- subset(test_data, duration < 900)
psych::describe(estimation_sample[, c("duration", "overall_100")])
## vars n mean sd median trimmed mad min max range skew
## duration 1 51 266.51 148.48 230 239.02 88.96 107 785 678 1.70
## overall_100 2 51 38.53 20.40 40 38.29 29.65 0 80 80 0.11
## kurtosis se
## duration 2.64 20.79
## overall_100 -1.15 2.86
4.2.3 Confidence intervals
Finally, we could be interested the generalize the finding from our sample to the entire student population. We could ask: what is the likely range of values that the overall knowledge might take in the entire student population with a certain level of confidence (e.g., 95%). Confidence interval provide answers to this kind of questions. Remember the formula for the confidence interval:
\[\begin{equation} \begin{split} CI = \overline{X}\pm(z*\frac{s}{\sqrt{n}}) \end{split} \tag{4.1} \end{equation}\], where \(\overline{x}\) is the sample mean, \(\frac{s}{\sqrt{n}}\) is the standard error of the mean, and z is the critical z-score for a given level of confidence. If the interval is small, the sample must be very close to the population. Conversely, when the interval is wide, the sample mean is likely very different from the population mean and therefore a bad representation of the population. We can obtain the critical z-score for the 95% CI using the qnorm()
-function. Using nrow(data_set)
, we can obtain the number ob rows (i.e., observations) in a dataset.
mean(test_data$overall_100)
## [1] 39.18182
error <- qnorm(0.975) * sd(test_data$overall_100)/sqrt(nrow(test_data))
ci_lower <- mean(test_data$overall_100) - error
ci_upper <- mean(test_data$overall_100) + error
print(ci_lower)
## [1] 33.78262
print(ci_upper)
## [1] 44.58102
Given our sample size and the variability in our data, we are 95% confident that the true level of overall statstical knowledge is between 34 and 45!
Note that the CI is also implicitely included in the summary statistics created by the stat.desc()
function.
mean_x <- stat.desc(test_data$overall_100)["mean"]
error_x <- stat.desc(test_data$overall_100)["CI.mean.0.95"]
ci_lower_x <- mean_x - error_x
ci_upper_x <- mean_x + error_x
ci_lower_x
## mean
## 33.65889
ci_upper_x
## mean
## 44.70475