2 Data handling

This chapter covers the basics of data handling in R

2.1 Basic data handling

2.1.1 Creating objects

Anything created in R is an object. You can assign values to objects using the assignment operator <-:

x <- "hello world"  #assigns the words 'hello world' to the object x
# this is a comment

Note that comments may be included in the code after a #. The text after # is not evaluated when the code is run; they can be written directly after the code or in a separate line.

To see the value of an object, simply type its name into the console and hit enter:

x  #print the value of x to the console
## [1] "hello world"

You can also explicitly tell R to print the value of an object:

print(x)  #print the value of x to the console
## [1] "hello world"

Note that because we assign characters in this case (as opposed to e.g., numeric values), we need to wrap the words in quotation marks, which must always come in pairs. Although RStudio automatically adds a pair of quotation marks (i.e., opening and closing marks) when you enter the opening marks it could be that you end up with a mismatch by accident (e.g., x <- "hello). In this case, R will show you the continuation character “+”. The same could happen if you did not execute the full command by accident. The “+” means that R is expecting more input. If this happens, either add the missing pair, or press ESCAPE to abort the expression and try again.

To change the value of an object, you can simply overwrite the previous value. For example, you could also assign a numeric value to “x” to perform some basic operations:

x <- 2  #assigns the value of 2 to the object x
print(x)
## [1] 2
x == 2  #checks whether the value of x is equal to 2
## [1] TRUE
x != 3  #checks whether the value of x is NOT equal to 3
## [1] TRUE
x < 3  #checks whether the value of x is less than 3
## [1] TRUE
x > 3  #checks whether the value of x is greater than 3
## [1] FALSE

Note that the name of the object is completely arbitrary. We could also define a second object “y”, assign it a different value and use it to perform some basic mathematical operations:

y <- 5  #assigns the value of 2 to the object x
x == y  #checks whether the value of x to the value of y
## [1] FALSE
x * y  #multiplication of x and y
## [1] 10
x + y  #adds the values of x and y together
## [1] 7
y^2 + 3 * x  #adds the value of y squared and 3x the value of x together
## [1] 31

Object names

Please note that object names must start with a letter and can only contain letters, numbers, as well as the ., and _ separators. It is important to give your objects descriptive names and to be as consistent as possible with the naming structure. In this tutorial we will be using lower case words separated by underscores (e.g., object_name). There are other naming conventions, such as using a . as a separator (e.g., object.name), or using upper case letters (objectName). It doesn’t really matter which one you choose, as long as you are consistent.

2.1.2 Data types

The most important types of data are:

Data type Description
Numeric Approximations of the real numbers, \(\normalsize\mathbb{R}\) (e.g., mileage a car gets: 23.6, 20.9, etc.)
Integer Whole numbers, \(\normalsize\mathbb{Z}\) (e.g., number of sales: 7, 0, 120, 63, etc.)
Character Text data (strings, e.g., product names)
Factor Categorical data for classification (e.g., product groups)
Logical TRUE, FALSE
Date Date variables (e.g., sales dates: 21-06-2015, 06-21-15, 21-Jun-2015, etc.)

Variables can be converted from one type to another using the appropriate functions (e.g., as.numeric(),as.integer(),as.character(), as.factor(),as.logical(), as.Date()). For example, we could convert the object y to character as follows:

y <- as.character(y)
print(y)
## [1] "5"

Notice how the value is in quotation marks since it is now of type character.

Entering a vector of data into R can be done with the c(x1,x2,..,x_n) (“concatenate”) command. In order to be able to use our vector (or any other variable) later on we want to assign it a name using the assignment operator <-. You can choose names arbitrarily. Just make sure they are descriptive and unique. Assigning the same name to two variables (e.g. vectors) will result in deletion of the first.

# Numeric:
top10_track_streams <- c(163608, 126687, 120480, 110022, 
    108630, 95639, 94690, 89011, 87869, 85599)

# Character:
top10_artist_names <- c("Axwell /\\ Ingrosso", "Imagine Dragons", 
    "J. Balvin", "Robin Schulz", "Jonas Blue", "David Guetta", 
    "French Montana", "Calvin Harris", "Liam Payne", 
    "Lauv")  # Characters have to be put in ''

# Factor variable with two categories:
top10_track_explicit <- c(0, 0, 0, 0, 0, 0, 1, 1, 0, 
    0)
top10_track_explicit <- factor(top10_track_explicit, 
    levels = c(0:1), labels = c("not explicit", "explicit"))

# Factor variable with more than two categories:
top10_artist_genre <- c("Dance", "Alternative", "Latino", 
    "Dance", "Dance", "Dance", "Hip-Hop/Rap", "Dance", 
    "Pop", "Pop")
top10_artist_genre <- as.factor(top10_artist_genre)

# Date:
top_10_track_release_date <- as.Date(c("2017-05-24", 
    "2017-06-23", "2017-07-03", "2017-06-30", "2017-05-05", 
    "2017-06-09", "2017-07-14", "2017-06-16", "2017-05-18", 
    "2017-05-19"))

# Logical
top10_track_explicit_1 <- c(FALSE, FALSE, FALSE, FALSE, 
    FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)

In order to “call” a vector we can now simply enter its name:

top10_track_streams
##  [1] 163608 126687 120480 110022 108630  95639  94690  89011  87869  85599
top_10_track_release_date
##  [1] "2017-05-24" "2017-06-23" "2017-07-03" "2017-06-30" "2017-05-05"
##  [6] "2017-06-09" "2017-07-14" "2017-06-16" "2017-05-18" "2017-05-19"

In order to check the type of a variable the class() function is used.

class(top_10_track_release_date)
## [1] "Date"

The video below gives a general overview of vectors and provides a more in-depth discussion.

2.1.3 Data structures

Now let’s create a table that contains the variables in columns and each observation in a row (like in SPSS or Excel). There are different data structures in R (e.g., Matrix, Vector, List, Array). In this course, we will mainly use data frames.

data types

Data frames are similar to matrices but are more flexible in the sense that they may contain different data types (e.g., numeric, character, etc.), where all values of vectors and matrices have to be of the same type (e.g. character). It is often more convenient to use characters instead of numbers (e.g. when indicating a persons sex: “F”, “M” instead of 1 for female , 2 for male). Thus we would like to combine both numeric and character values while retaining the respective desired features. This is where “data frames” come into play. Data frames can have different types of data in each column. For example, we can combine the vectors created above in one data frame using data.frame(). This creates a separate column for each vector, which is usually what we want (similar to SPSS or Excel).

music_data <- data.frame(top10_track_streams, top10_artist_names, 
    top10_track_explicit, top10_artist_genre, top_10_track_release_date, 
    top10_track_explicit_1)

2.1.3.1 Accessing data in data frames

When entering the name of a data frame, R returns the entire data frame:

music_data  # Returns the entire data frame

Hint: You may also use the View()-function to view the data in a table format (like in SPSS or Excel), i.e. enter the command View(data). Note that you can achieve the same by clicking on the small table icon next to the data frame in the “Environment”-window on the right in RStudio.

Sometimes it is convenient to return only specific values instead of the entire data frame. There are a variety of ways to identify the elements of a data frame. One easy way is to explicitly state, which rows and columns you wish to view. The general form of the command is data.frame[rows,columns]. By leaving one of the arguments of data.frame[rows,columns] blank (e.g., data.frame[rows,]) we tell R that we want to access either all rows or columns, respectively. Here are some examples:

music_data[, 2:4]  # all rows and columns 2,3,4
music_data[, c("top10_artist_names", "top_10_track_release_date")]  # all rows and columns 'top10_artist_names' and 'top_10_track_release_date'
music_data[c(1:5), c("top10_artist_names", "top_10_track_release_date")]  # rows 1 to 5 and columns 'top10_artist_names'' and 'top_10_track_release_date'

You may also create subsets of the data frame, e.g., using mathematical expressions:

music_data[top10_track_explicit == "explicit", ]  # show only tracks with explicit lyrics  
music_data[top10_track_streams > 1e+05, ]  # show only tracks with more than 100,000 streams  
music_data[top10_artist_names == "Robin Schulz", ]  # returns all observations from artist 'Robin Schulz'
music_data[top10_track_explicit == "explicit", ]  # show only explicit tracks

The same can be achieved using the subset()-function

subset(music_data, top10_track_explicit == "explicit")  # selects subsets of observations in a data frame
# creates a new data frame that only contains
# tracks from genre 'Dance'
music_data_dance <- subset(music_data, top10_artist_genre == 
    "Dance")
music_data_dance
rm(music_data_dance)  # removes an object from the workspace

You may also change the order of the variables in a data frame by using the order()-function

# Orders by genre (ascending) and streams
# (descending)
music_data[order(top10_artist_genre, -top10_track_streams), 
    ]

2.1.3.2 Inspecting the content of a data frame

The head() function displays the first X elements/rows of a vector, matrix, table, data frame or function.

head(music_data, 3)  # returns the first X rows (here, the first 3 rows)

The tail() function is similar, except it displays the last elements/rows.

tail(music_data, 3)  # returns the last X rows (here, the last 3 rows)

names() returns the names of an R object. When, for example, it is called on a data frame, it returns the names of the columns.

names(music_data)  # returns the names of the variables in the data frame
## [1] "top10_track_streams"       "top10_artist_names"       
## [3] "top10_track_explicit"      "top10_artist_genre"       
## [5] "top_10_track_release_date" "top10_track_explicit_1"

str() displays the internal structure of an R object. In the case of a data frame, it returns the class (e.g., numeric, factor, etc.) of each variable, as well as the number of observations and the number of variables.

str(music_data)  # returns the structure of the data frame
## 'data.frame':    10 obs. of  6 variables:
##  $ top10_track_streams      : num  163608 126687 120480 110022 108630 ...
##  $ top10_artist_names       : Factor w/ 10 levels "Axwell /\\ Ingrosso",..: 1 5 6 10 7 3 4 2 9 8
##  $ top10_track_explicit     : Factor w/ 2 levels "not explicit",..: 1 1 1 1 1 1 2 2 1 1
##  $ top10_artist_genre       : Factor w/ 5 levels "Alternative",..: 2 1 4 2 2 2 3 2 5 5
##  $ top_10_track_release_date: Date, format: "2017-05-24" "2017-06-23" ...
##  $ top10_track_explicit_1   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

nrow() and ncol() return the rows and columns of a data frame or matrix, respectively. dim() displays the dimensions of an R object.

nrow(music_data)  # returns the number of rows 
## [1] 10
ncol(music_data)  # returns the number of columns 
## [1] 6
dim(music_data)  # returns the dimensions of a data frame
## [1] 10  6

ls() can be used to list all objects that are associated with an R object.

ls(music_data)  # list all objects associated with an object
## [1] "top_10_track_release_date" "top10_artist_genre"       
## [3] "top10_artist_names"        "top10_track_explicit"     
## [5] "top10_track_explicit_1"    "top10_track_streams"

2.1.3.3 Append and delete variables to/from data frames

To call a certain column in a data frame, we may also use the $ notation. For example, this returns all values associated with the variable “top10_track_streams”:

music_data$top10_track_streams
##  [1] 163608 126687 120480 110022 108630  95639  94690  89011  87869  85599

Assume that you wanted to add an additional variable to the data frame. You may use the $ notation to achieve this:

# Create new variable as the log of the number of
# streams
music_data$log_streams <- log(music_data$top10_track_streams)
# Create an ascending count variable which might
# serve as an ID
music_data$obs_number <- 1:nrow(music_data)
head(music_data)

To delete a variable, you can simply create a subset of the full data frame that excludes the variables that you wish to drop:

music_data <- subset(music_data, select = -c(log_streams))  # deletes the variable log streams 
head(music_data)

You can also rename variables in a data frame, e.g., using the rename()-function from the plyr package. In the following code “::” signifies that the function “rename” should be taken from the package “plyr”. This can be useful if multiple packages have a function with the same name. Calling a function this way also means that you can access a function without loading the entire package via library().

library(plyr)
music_data <- plyr::rename(music_data, c(top10_artist_genre = "genre", 
    top_10_track_release_date = "release_date"))
head(music_data)

Note that the same can be achieved using:

names(music_data)[names(music_data) == "genre"] <- "top10_artist_genre"
head(music_data)

Or by referring to the index of the variable:

names(music_data)[4] <- "genre"
head(music_data)
## Warning: package 'dplyr' was built under R version 3.4.4
## 
## Attaching package: 'dplyr'
## The following objects are masked from 'package:plyr':
## 
##     arrange, count, desc, failwith, id, mutate, rename, summarise,
##     summarize
## The following objects are masked from 'package:stats':
## 
##     filter, lag
## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
## Warning: package 'stringr' was built under R version 3.4.4

2.2 Advanced data handling

This chapter covers more advanced techniques for handling data in R. It is primarily based on

  • Wickham, H., & Grolemund, G. (2016). R for Data Science - Import, Tidy, Transform, Visualize, and Model Data. O’Reilly Media.

2.2.1 The dplyr package

There are many ways to achieve the same thing in R. Data transformation and handling tasks can be solved with the functions provided by base R (i.e. the functions that come with every R installation), but the dplyrpackage offers a comprehensive suite of functions that makes many tasks easier, while keeping code very readable. As such, we will be presenting it here, but keep in mind that all of these tasks could also be achieved without any packages.

Before we continue, ensure that the dplyr() package is installed and loaded.

The dplyr() package includes six core functions that make many data handling tasks a lot easier:

  • filter(): filter rows
  • select(): select columns
  • arrange(): re-order or arrange rows
  • mutate(): create new columns
  • summarise(): summarise values
  • group_by(): allows for group operations

Each of these functions will be discussed in the following in more detail.

2.2.1.1 Filter rows

One of the most basic tasks one could wish to perform on a data set is select certain observations based on various characteristics. dplyr uses the filter() function to this end. To select certain rows from a data set you simply supply the data frame as the first argument and then tell filter() the logical criteria it should use to select observations. While this may sound fairly abstract, it will become very clear after a few examples.

Recall the music_data data frame from the previous chapter. Suppose we want to select only observations where the lyrics are not explicit. The code to achieve this would looks as follows:

filter(music_data, top10_track_explicit == "not explicit")

The first argument supplied to the filter() function is the data frame we want to subset. The second argument tells filter() that we only want observations where the column top10_track_explicit is equal to the value "not explicit". If you look into the output you will notice that only tracks with non explicit lyrics have been returned.

Another way to filter observations is to choose all observations where one column is within a certain range. This can be achieved with the logical operators introduced in the basic data handling chapter. In the following example we select all tracks with less than 100000 streams.

filter(music_data, top10_track_streams < 100000)

You can enforce multiple conditions with &. The following example selects all observations with less than 150000 but more than 100000 streams.

filter(music_data,  top10_track_streams > 100000 & top10_track_streams < 150000)

The | symbol is the way R expresses “or”. This way you can select observations that fulfill either one or the other condition. Say we would like to select all observations with less than 100000 or more than 150000 streams. The following code would do exactly that

filter(music_data,  top10_track_streams < 100000 | top10_track_streams > 150000)

A very useful feature of the filter() function is its ability to accept multiple criteria at once. Say we want to select all tracks marked as "not explicit" with less than 100000 streams. This can be achieved by simply supplying the function with additional arguments, as in the example below.

filter(music_data, top10_track_explicit == "not explicit", top10_track_streams < 100000)

2.2.1.2 Select columns

Another common task is to select or exclude certain columns of a data frame. The dplyr package contains the select() function for exactly this purpose. Similarly to filter() you first supply the function with the data frame you wish to apply the selection to, followed by the columns you wish to select or exclude.

The following code selects the two columns top10_track_explicit and top10_track_streams from the music_data data set.

select(music_data, top10_track_explicit, top10_track_streams)

To remove columns from a data frame you simply put a - before the column name.

select(music_data, -top10_track_explicit, -top10_track_streams)

You can also select or exclude a whole range of columns through numbers or names.

# Selects all columns from top10_track_explicit to top_10_track_release_date
select(music_data, top10_track_explicit:top_10_track_release_date)
# This is equivalent to  
select(music_data, 3:5)

2.2.1.3 Arrange rows

If you just want to change the order of a data frame without discarding any observations or columns, you can use the arrange() function. It takes a data frame and a set of column names to order by, always in ascending order. If you provide more than one column name, each additional column will be used to break ties in the values of preceding columns.

arrange(music_data, top10_artist_genre, top10_track_streams)

If you wish to arrange them in descending order, you can wrap the column name in the desc() function.

arrange(music_data, top10_artist_genre, desc(top10_track_streams))

2.2.1.4 Adding and changing variables

There are two functions to create new variables based on other variables in the dplyr package, mutate() and transmute(). They both do the same thing, with one key difference: mutate() returns both the newly created variables and the ones previously contained in the data frame, while transmute() only returns the newly created variables.

Both functions take variables already present in the data set and apply a function to them to generate new variables. This can be something as simple as adding 1 to every entry in a column or more complex procedures, like standardizing variables. The syntax is fairly straightforward: The first argument is the data frame we wish to apply the transformation to, and all the following arguments are the new variables we want to create. An example will make this clearer. Say we (for some reason) want to add 10 to every entry of the top10_track_streams column in our data set. The code to do this would look as follows:

mutate(music_data, streams_plus_10 = top10_track_streams + 10)

This code essentially defines a new column named streams_plus_10, which is the column top10_track_streams + 10.

As previously mentioned, we can also perform more complex operations, such as standardizing variables (i.e. subtracting the mean value and dividing by the standard deviation). Note that we are generating the mean and the standard deviation in our code by applying functions (mean() and sd()) to the entire column and then we use these values to perform the standardization on each value of the column.

mutate(music_data, streams_standardised = (top10_track_streams - mean(top10_track_streams))/ sd(top10_track_streams))

Note that you could also use the scale() function to do the same:

mutate(music_data, streams_standardised = scale(top10_track_streams))

You can also add other vectors to an existing data frame with mutate, given that it is the same length as the data frame you want to add it to.

# create a vector of length 10
extra_column <- c(1,2,3,4,5,6,7,8,9,10)
mutate(music_data, new_data = extra_column) 

If you don’t want to add a new variable and only want to edit a variable already present in the data frame, dplyr has you covered with the mutate_at() function. The practical thing about this function is that it can also be applied to a whole range of variables if you want to perform the same operation on multiple columns. To do this, you give the function not just a single column name, but a whole vector of column names.

Say we first want to change the type of a single column. For example, maybe we want top10_track_explicit to be a character column and not a factor. We again tell mutate_at() which data frame we want to change, followed by the column(s) to change and finally the function we want to apply. You may have noticed that the column name is in quotation marks here, as opposed to the other functions we have met from the dplyrpackage. This is a side effect of its ability to apply a function to a range of columns and not just a single one, as this can only be achieved via a character vector. So from a technical standpoint we are giving the mutate_at() function a character vector of length one, solely containing the element "top10_track_explicit".

mutate_at(music_data, "top10_track_explicit", as.character)

Now say we want to change multiple columns to be character vectors. To do this we will first create a vector of the names of the columns we want to apply the function to and then simply give mutate_at()this vector as one of its arguments. If you do this, don’t forget the c() function to create a vector.

columns <- c("top10_track_explicit", "top10_artist_genre", "top10_track_explicit_1")

mutate_at(music_data, columns, as.character)

Note that if you merely want to rename a variable without changing its content, you may use the rename() function to achieve this. The syntax may seem familiar at this point, with the first argument being the data frame to apply the function to and the following arguments being the transformations to apply. The example changes the names of the top10_track_explicit and top10_artist_names columns into explicit and names, respectively.

rename(music_data, explicit = top10_track_explicit, names = top10_artist_names)

2.2.1.5 Creating custom summaries

The summarise() function lets you build customized summaries of your data. This can range from creating means and standard deviations of certain variables to simply counting how many observations are in a data frame. Say we want to find out the mean and standard deviation of the number of streams and also count the number of observations. With summarise() that would look as follows:

summarise(music_data, n_observations = n(), mean_streams = mean(top10_track_streams), sd_streams = sd(top10_track_streams))

On its own, this function is not that impressive. After all, we could just apply the mean(), sd() and nrow() functions individually and would have gotten the same result, albeit not in such a nice format. However, when combined with the group_by() function, summarise() becomes very useful as we will see next.

2.2.1.6 Group operations

The group_by() splits a data frame into groups, by the values of a column in the data frame. Say we wanted to calculate the mean and standard deviation of explicit and non-explicit songs separately.

music_data <- group_by(music_data, top10_track_explicit)
summarise(music_data, n_observations = n(), mean_streams = mean(top10_track_streams), sd_streams = sd(top10_track_streams))

2.2.1.7 Pipes

A very practical feature of the dplyr package are so called “pipes”. Say you want to apply three of the previously mentioned functions to the same data frame. So far, the way we learned to do this would be as follows:

# First use select() to take only certain columns
music_data_new <- select(music_data, top10_track_explicit_1, top10_artist_names, top10_track_streams)

# Now use filter() to choose only rows that fulfill certain criteria 
music_data_new <- filter(music_data_new, top10_track_streams < 100000)

# Then change order with arrange()
music_data_new <- arrange(music_data_new, top10_track_streams)

# Print to console
music_data_new

While this does achieve our objective, it is quite tedious. With the pipes offered by dplyr, you can chain these commands together to streamline your code, while keeping it very readable. The symbol for a pipe is %>%. From a technical perspective, this hands the preceding object to the next function as the first argument. This may sound complicated, but will become clear after an example. The code below will create exactly the same data frame as the example above, but in a much more compact form.

music_data_new <- music_data %>%
  select(top10_track_explicit_1, top10_artist_names, top10_track_streams) %>%
  filter(top10_track_streams < 100000) %>%
  arrange(top10_track_streams)

# Print to console
music_data_new

Let’s unpack what happened here. The first line “pipes” music_data into the first function, select(), which is in the second line. Here we remove all columns except for top10_track_explicit_1, top10_artist_names and top10_track_streams. Then we take this data frame with the reduced columns and hand it to filter(), which only selects observations with less than 100000 streams. Finally, we pass the filtered, column reduced data frame to arrange, which sorts the rows by the number of streams per track. The assignment operator (<-) at the top then saves this data frame in the environment as music_new_data. Note that, in contrast to the previous examples, we no longer have to specify which data frame we want to apply the various functions to, as the pipes take care of this for us.

2.2.2 Dealing with strings

Strings (which is short for “character strings”), can be tough to deal with. They are unstructured, messy and getting them into a format that one can perform analysis with is often a task that requires a lot of time. However, seeing as they appear fairly frequently in data sets and often contain valuable information, it is definitely worth the time to learn how to deal with them.

2.2.2.1 The stringr package

A very accessible package for manipulating strings is the stringr package. It is designed to be as uniform as possible, meaning that once you have understood the basic syntax of any one of its functions it is very easy to apply all of them. It sacrifices some flexibility for this simplicity, so if you ever encounter a task you can not easily solve with stringr it is worth checking out the package it is built on, stringi. For now, however, stringr will be more than sufficient.

The majority of functions in stringr are built around two core arguments: a string to be worked on and a pattern. There are quite a few that aren’t, such as str_length(), but these are (for the most part) fairly self explanatory and will not be explained further here. A good overview of the included functions can be found here.

The string to be worked on can either be an individual string in quotation marks or an entire vector or column of strings that the same operation should be applied to. The pattern can technically also be a vector of patterns to look for, however, in this tutorial we will only use single patterns.

Recall the list of artist names in the music_data data frame.

music_data$top10_artist_names
##  [1] "Axwell /\\ Ingrosso" "Imagine Dragons"     "J. Balvin"          
##  [4] "Robin Schulz"        "Jonas Blue"          "David Guetta"       
##  [7] "French Montana"      "Calvin Harris"       "Liam Payne"         
## [10] "Lauv"

Say, for example, we want to see which names contain a “g”. stringr contains the function str_detect, which tells you exactly that. Note that these functions are all case sensitive, i.e. the “G” in “David Guetta” is not detected.

str_detect(string = music_data$top10_artist_names, pattern = "g")
##  [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE FALSE FALSE FALSE

Say we want to be a bit more specific and want to know exactly where the g is located in each artist name. str_locate_all returns a list of start and end values for each entry in the character vector.

str_locate_all(string = music_data$top10_artist_names, pattern = "g")
## [[1]]
##      start end
## [1,]    13  13
## 
## [[2]]
##      start end
## [1,]     4   4
## [2,]    12  12
## 
## [[3]]
##      start end
## 
## [[4]]
##      start end
## 
## [[5]]
##      start end
## 
## [[6]]
##      start end
## 
## [[7]]
##      start end
## 
## [[8]]
##      start end
## 
## [[9]]
##      start end
## 
## [[10]]
##      start end

This obviously also works for patterns longer than just one letter.

str_locate_all(string = music_data$top10_artist_names, pattern = "vin")
## [[1]]
##      start end
## 
## [[2]]
##      start end
## 
## [[3]]
##      start end
## [1,]     7   9
## 
## [[4]]
##      start end
## 
## [[5]]
##      start end
## 
## [[6]]
##      start end
## 
## [[7]]
##      start end
## 
## [[8]]
##      start end
## [1,]     4   6
## 
## [[9]]
##      start end
## 
## [[10]]
##      start end

Another common task is to replace all instances of one character with another character. Say, for example, we want to replace all occurrences of “a” with “b”. The function str_replace_all() does exactly that. Note that there is also a function called str_replace(), but this only replaces the first match that the pattern finds.

str_replace_all(music_data$top10_artist_names, pattern = "a", replacement = "b")
##  [1] "Axwell /\\ Ingrosso" "Imbgine Drbgons"     "J. Bblvin"          
##  [4] "Robin Schulz"        "Jonbs Blue"          "Dbvid Guettb"       
##  [7] "French Montbnb"      "Cblvin Hbrris"       "Libm Pbyne"         
## [10] "Lbuv"

A common application for this function is to replace all “.” with “,” or vice versa in price data. R, for example, needs the decimal separator to be a “.”, so if you want to perform numerical analyses on price data that isn’t in a uniform format, this function can be very useful.

Say we want to replace all “.” with “,” in the artist names.

str_replace_all(music_data$top10_artist_names, pattern = ".", replacement = ",")
##  [1] ",,,,,,,,,,,,,,,,,," ",,,,,,,,,,,,,,,"    ",,,,,,,,,"         
##  [4] ",,,,,,,,,,,,"       ",,,,,,,,,,"         ",,,,,,,,,,,,"      
##  [7] ",,,,,,,,,,,,,,"     ",,,,,,,,,,,,,"      ",,,,,,,,,,"        
## [10] ",,,,"

This result may be somewhat surprising. Why did the function replace every character with “,” even though we specified pattern = "."? The explanation for this can be found in how the function matches patterns.

2.2.2.2 A crash course in regex

Behind the scenes stringr uses something called regex (short for “regular expressions”) to match patterns. regex allows you to match not just individual letters, but more abstract patterns. It does this by using special characters that do not match literally. You have already met the first such special character “.”, which is regex’s version of a wildcard, meaning that it matches any other character. That’s why, when we told stringr to replace all dots with commas, it replaced every character with a comma.

This raises the question of what to do if we want to literally match a dot and not any character. There are two ways to go about this. First, we could escape the wildcard character. What this means is that we use another special character to tell regex that the next character is meant to be taken literally and not as a wildcard. The symbol for this is a double back slash “\\”. So, if we wanted to change only the dot to a comma we would write

str_replace_all(music_data$top10_artist_names, pattern = "\\.", replacement = ",")
##  [1] "Axwell /\\ Ingrosso" "Imagine Dragons"     "J, Balvin"          
##  [4] "Robin Schulz"        "Jonas Blue"          "David Guetta"       
##  [7] "French Montana"      "Calvin Harris"       "Liam Payne"         
## [10] "Lauv"

The second way to achieve this would be by passing the pattern string through the fixed() function, which tells stringr that it should take the entire string literally. This means that the following code achieves the same result as using “\\”.

str_replace_all(music_data$top10_artist_names, pattern = fixed("."), replacement = ",")
##  [1] "Axwell /\\ Ingrosso" "Imagine Dragons"     "J, Balvin"          
##  [4] "Robin Schulz"        "Jonas Blue"          "David Guetta"       
##  [7] "French Montana"      "Calvin Harris"       "Liam Payne"         
## [10] "Lauv"
2.2.2.2.1 Other special characters in regex

We will quickly go through the most important regex special characters. Be aware that this list is by no means exhaustive and is only meant to give you some basic tools that can help you with string manipulation. Keep this in mind if a regex is displaying unexpected behavior, as it could be due to some wildcard you are not aware of. If all else fails you can always used the fixed() function to just match literal strings.

2.2.2.2.1.1 Square brackets []

Square brackets can be used to match from a set of different letters. This means that [abc] will match a, b or c. The following code will replace a, b or c with a capital X.

str_replace_all(music_data$top10_artist_names, pattern = "[abc]", replacement = "X")
##  [1] "Axwell /\\ Ingrosso" "ImXgine DrXgons"     "J. BXlvin"          
##  [4] "RoXin SXhulz"        "JonXs Blue"          "DXvid GuettX"       
##  [7] "FrenXh MontXnX"      "CXlvin HXrris"       "LiXm PXyne"         
## [10] "LXuv"

Note again that this is case sensitive, meaning that A, B and C are not replaced. However, square brackets are a great way to replace both capitalized and non-capitalized occurrences at once.

str_replace_all(music_data$top10_artist_names, pattern = "[ABCabc]", replacement = "X")
##  [1] "Xxwell /\\ Ingrosso" "ImXgine DrXgons"     "J. XXlvin"          
##  [4] "RoXin SXhulz"        "JonXs Xlue"          "DXvid GuettX"       
##  [7] "FrenXh MontXnX"      "XXlvin HXrris"       "LiXm PXyne"         
## [10] "LXuv"

If you include a ^ in the beginning of a square bracket pattern, regex will interpret that to mean any character except the ones in brackets. This means that if we take the same code as before and include a ^, all letters except for A, B and C (and their non-capitalized counterparts) will be replaced by a capital X.

str_replace_all(music_data$top10_artist_names, pattern = "[^ABCabc]", replacement = "X")
##  [1] "AXXXXXXXXXXXXXXXXX" "XXaXXXXXXXaXXXX"    "XXXBaXXXX"         
##  [4] "XXbXXXXcXXXX"       "XXXaXXBXXX"         "XaXXXXXXXXXa"      
##  [7] "XXXXcXXXXXXaXa"     "CaXXXXXXaXXXX"      "XXaXXXaXXX"        
## [10] "XaXX"
2.2.2.2.1.2 Repetition operators: *,+ and {}

Repetition operators can be used to match the same character (or set of characters) multiple times. + matches a character one or more times, * matches a character zero or more times and with {} you can specify the range that matches can occur in.

vector <- c("", "a", "aa", "aaa", "aaaa")
# Replace one or more a with an X
str_replace(vector, pattern = "a+", replacement = "X")
## [1] ""  "X" "X" "X" "X"
# replace zero or more a with an X
str_replace(vector, pattern = "a*", replacement = "X")
## [1] "X" "X" "X" "X" "X"
# replace exactly two a with an X
str_replace(vector, pattern = "a{2}", replacement = "X")
## [1] ""    "a"   "X"   "Xa"  "Xaa"
# replace two to three a with an X
str_replace(vector, pattern = "a{2,3}", replacement = "X")
## [1] ""   "a"  "X"  "X"  "Xa"

Note that the + and * operators are “greedy”, meaning that they try to match as much as possible, which can often lead to unintended consequences. It is often a good practice with regex to be as specific as possible while remaining as general as needed.

2.2.2.2.1.3 Parentheses: ()

Parentheses are used to create groups. Groups always match in their entirety and can be combined with other operators.

vector <- c("abc", "abcabc", "123abc", "abcabcabc")

str_replace_all(vector, pattern = "(abc){2}", replacement = "X")
## [1] "abc"    "X"      "123abc" "Xabc"

The pattern "(abc){2}" will match only "abcabc", seeing as it looks for matches that repeat the group "(abc)" twice.

2.2.2.2.1.4 Optional characters: ?

The question mark tells regex that the preceding character is optional for a match.

vector <- c("abc", "ac")

str_replace_all(vector, pattern = "abc", replacement = "X")
## [1] "X"  "ac"

As expected, this only replaces the first element of the vector, as the second ("ac") is not an exact match.

str_replace_all(vector, pattern = "ab?c", replacement = "X")
## [1] "X" "X"

By including ? after the b, we tell regex that it is optional, i.e. that both "abc" and "ac" are correct matches. This can also be applied to groups and sets.

2.2.2.2.1.5 Anchors: ^ and $

Anchors can be used to specify that a match should only occur at the very beginning or end of a character string, with ^ and $ standing for the beginning and the end, respectively. Note that the ^ operator has a different meaning inside square brackets ([]), as discussed above.

vector <- c("abc123", "123abc")

str_replace_all(vector, pattern = "^abc", replacement = "X")
## [1] "X123"   "123abc"

This code only replaces the "abc" in "abc123" because it appears at the beginning of the string.

str_replace_all(vector, pattern = "abc$", replacement = "X")
## [1] "abc123" "123X"

abc$, on the other hand, only matches the "abc" in "123abc" because it appears at the end of the string.

regex can do a lot more than shown here, but these basic tools already enable you to do a lot of things that would take much more time when done by hand.

2.2.3 Case study

Let’s take everything we have learned in this chapter and apply it to a practical example. We will be using survey data from Qualtrics, which was created by a group of students for this course in 2017. We will only be looking at a small subsection of the variables available, to keep things from becoming unwieldy.

As always, let’s first load all the required libraries and the data set and take a look at it.

library(dplyr)
library(stringr)

data <- read.csv("https://raw.githubusercontent.com/IMSMWU/MRDA2018/master/Survey_data.csv?token=AVa281hYEyEQbqaSBcQZcYU-da4rv9xkks5bXyUMwA%3D%3D", stringsAsFactors = FALSE)
data

This data frame consists of 11 variables and 305 observations.

  • Progress: How much of the survey (in percent) was completed.
  • ResponseId: A unique ID for each participant
  • Q39: A multiple choice question on supermarket recognition. Participants were presented with 8 supermarket brands and asked which of them they were familiar with. A value of 1,2,3,4,5,6,7,8 means that the participant knew all eight, while, e.g., 3,5 means that the person only knew supermarkets 3 and 5.
  • Q18_1 to Q18_1_8:_ A series of questions on willingness to pay for various products.

The data has a couple problems we need to take care of before we can start analyzing it properly.

  • The first two rows: Qualtrics data comes with two rows that contain no useful information. Additionally, these force all columns to be of type “character”, which we can’t perform all types of analyses on.
  • Not all respondents finished the survey: As you can tell by the progress column, not all respondents finished the survey. To be able to perform proper analysis, we only want those that completed the survey.
  • Multiple choice question: The multiple choice question is currently in a format that is very hard to work with. It would be best to have eight individual columns that each correspond to an individual supermarket.
  • The price data is a mess: The price data is arguably the hardest challenge. The data is not uniform, with the decimal separator symbol varying and some rows containing additional text and symbols that we do not need. We need to filter out only the relevant parts and then transform it from a character to a numeric column to work on it.

As a first step, we will filter out all observations where the progress column is unequal 100. This has the added advantage of removing the first two rows as well.

data <- data %>%
  filter(Progress == 100)

Next we want to turn the multiple choice question into something a bit more useful. We will create eight new columns, called Q39_A1 to Q39_A8, which contain a 1 if the respective number appears in Q39 and a 0 otherwise. So, for example, if Q39is equal to 1,2,5 then Q39_A1, Q39_A2 and Q39_A5 will be set to 1 and all others will be set to 0.

We will achieve this with mutate and str_detect.

data <- data %>%
   mutate(Q39_A1 = str_detect(Q39, pattern = "1"),
         Q39_A2 = str_detect(Q39, pattern = "2"),
         Q39_A3 = str_detect(Q39, pattern = "3"),
         Q39_A4 = str_detect(Q39, pattern = "4"),
         Q39_A5 = str_detect(Q39, pattern = "5"),
         Q39_A6 = str_detect(Q39, pattern = "6"),
         Q39_A7 = str_detect(Q39, pattern = "7"),
         Q39_A8 = str_detect(Q39, pattern = "8"))

First, we always define the name of the new column we want to create. Then we tell mutate the function with which to create the new column. In this case we use str_detect() to check if the correct number appears in the column Q39. You can read the first line in mutate() as: “If a 1 appears in the field Q39 set the column Q39_A1 to TRUE and otherwise set it to FALSE”.

Now is a good time to take a look at the columns we have created and see if it did what we wanted.

data %>%
  select(Q39, Q39_A1:Q39_A8)

Everything seems to have worked! In a later step we will change the values TRUE and FALSE to 1 and 0, respectively, simply because it is easier to work with.

The last big task is to clean up the price data. Let’s take a look at it.

data %>% 
  select(Q18_1_1:Q18_1_8)

The problems can be grouped roughly into two categories:

  • Additional symbols: Some respondents added additional symbols or text that we need to get rid of.
  • Wrong decimal separator: R needs all decimal separators to be . and not ,, so we need to make sure all of them are dots.

We will tackle both of these problems with mutate_at() and the the stringr package. First we will strip out any character that isn’t a number, a comma or a dot. str_remove_all removes all instances of characters that are matched by the expression given in “pattern”. Note that the function str_remove_all is called without parentheses. Its argument(s) are passed directly to mutate_at(), separated by commas. The pattern we use here is [^0-9,\\.], which can be read as “match all characters that are not (remember that a ^ in square brackets means ‘everything but’) 0-9 a comma (,) or a dot (\\.)”. Remember that the dot is a special character and therefore has to be escaped with the double backslash.

data <- data %>%
  mutate_at(vars(Q18_1_1:Q18_1_8), str_remove_all, pattern = "[^0-9,\\.]")

Let’s take a look and see if the code did what we wanted it to.

data %>% 
  select(Q18_1_1:Q18_1_8)

So far, so good. To complete the clean up of the price data, we simply want to replace any instances of a comma with a dot. The code for this is quite simple.

data <- data %>%
  mutate_at(vars(Q18_1_1:Q18_1_8), str_replace_all, pattern = ",", replacement = ".") 


# Print variables we just mutated to see if everything worked
data %>% 
  select(Q18_1_1:Q18_1_8)

As a final step we want to convert all variables that we want to work with into numeric variables. Again, we use mutate_at(), this time coupled with the as.numeric() function.

data <- data %>% 
  mutate_at(vars(Q18_1_1:Q39_A8), as.numeric)

Finally, let’s have a look at the finished data frame. All the relevant data is now in a format that we can perform further analysis with.

# Print entire data frame
data

2.3 Data import and export

Before you can start your analysis in R, you first need to import the data you wish to perform the analysis on. You will often be faced with different types of data formats (usually produced by some other statistical software like SPSS or Excel or a text editor). Fortunately, R is fairly flexible with respect to the sources from which data may be imported and you can import the most common data formats into R with the help of a few packages. R can, among others, handle data from the following sources:

In the previous chapter, we saw how we may use the keyboard to input data in R. In the following sections, we will learn how to import data from text files and other statistical software packages.

2.3.1 Getting data for this course

Most of the data sets we will be working with in this course will be stored in text files (i.e., .dat, .txt, .csv). There are two ways for you to obtain access to the data sets:

2.3.1.2 Download and import datasets from “Learn@WU”

It is also possible to download the data from the respective folder on the “Learn@WU” platform, placing it in the working directory and importing it from there. However, this requires an additional step to download the file manually first. If you chose this option, please remember to put the data file in the working directory first. If the import is not working, check your working directory setting using getwd(). Once you placed the file in the working directory, you can import it using the same command as above. Note that the file must be given as a character string (i.e., in quotation marks) and has to end with the file extension (e.g., .csv, .tsv, etc.).

test_data <- read.table("test_file.extension", header=TRUE)

2.3.2 Import data created by other software packages

Sometimes, you may need to import data files created by other software packages, such as Excel or SPSS. In this section we will use the readxl and haven packages to do this. To import a certain file you should first make sure that the file is stored in your current working directory. You can list all file names in your working directory using the list.files() function. If the file is not there, either copy it to your current working directory, or set your working directory to the folder where the file is located using setwd("/path/to/file"). This tells R the folder you are working in. Remember that you have to use / instead of \ to specify the path (if you use Windows paths copied from the explorer they will not work). When your file is in your working directory you can simply enter the filename into the respective import command. The import commands offer various options. For more details enter ?read_excel, ?read_spss after loading the packages.

list.files() #lists all files in the current working directory
#setwd("/path/to/file") #may be used to change the working directory to the folder that contains the desired file

#import excel files
library(readxl) #load package to import Excel files
excel_sheets("test_data.xlsx")
test_data_excel <- read_excel("test_data.xlsx", sheet = "mrda_2016_survey") # "sheet=x"" specifies which sheet to import
head(test_data_excel)

library(haven) #load package to import SPSS files
#import SPSS files
test_data_spss <- read_sav("test_data.sav")
head(test_data_spss)

The import of other file formats works in a very similar way (e.g., Stata, SAS). Please refer to the respective help-files (e.g., ?read_dta, ?read_sas …) if you wish to import data created by other software packages.

2.3.3 Export data

Exporting to different formats is also easy, as you can just replace “read” with “write” in many of the previously discussed functions (e.g. write.table(object, "file_name")). This will save the data file to the working directory. To check what the current working directory is you can use getwd(). By default, the write.table(object, "file_name")function includes the row number as the first variable. By specifying row.names = FALSE, you may exclude this variable since it doesn’t contain any useful information.

write.table(test_data, "testData.dat", row.names = FALSE, sep = "\t") #writes to a tab-delimited text file
write.table(test_data, "testData.csv", row.names = FALSE, sep = ",") #writes to a comma-separated value file 
write_sav(test_data, "my_file.sav")

2.3.4 Import data from the Web

2.3.4.1 Scraping data from websites

Sometimes you may come across interesting data on websites that you would like to analyse. Reading data from websites is possible in R, e.g., using the rvest package. Let’s assume you would like to read a table that lists the population of different countries from this Wikipedia page. It helps to first inspect the structure of the website (e.g., using tools like SelectorGadget), so you know which elements you would like to extract. In this case it is fairly obvious that the data are stored in a table for which the associated html-tag is <table>. So let’s read the entire website using read_html(url) and filter all tables using read_html(html_nodes(...,"table")).

library(rvest)
url <- "https://en.wikipedia.org/wiki/List_of_countries_and_dependencies_by_population"
population <- read_html(url) 
population <- html_nodes(population, "table")
print(population)
## {xml_nodeset (3)}
## [1] <table class="plainlinks metadata ambox ambox-move" role="presentati ...
## [2] <table class="wikitable sortable" style="text-align:right"><tbody>\n ...
## [3] <table class="nowraplinks hlist collapsible autocollapse navbox-inne ...

The output shows that there are two tables on the website and the first one appears to contain the relevant information. So let’s read the first table using the html_table() function. Note that population is of class “list”. A list is a vector that has other R objects (e.g., other vectors, data frames, matrices, etc.) as its elements. If we want to access the data of one of the elements, we have to use two square brackets on each side instead of just one (e.g., population[[1]] gets us the first table from the list of tables on the website; the argument fill = TRUE ensures that empty cells are replaced with missing values when reading the table).

population <- population[[2]] %>% html_table(fill = TRUE)
head(population) #checks if we scraped the desired data

You can see that population is read as a character variable because of the commas.

class(population$Population)
## [1] "character"

If we wanted to use this variable for some kind of analysis, we would first need to convert it to numeric format using the as.numeric() function. However, before we can do this, we can use the str_replace_all() function from the stringr package, which replaces all matches of a string. In our case, we would like to replace the commas (",") with nothing ("").

library(stringr)
population$Population <- as.numeric(str_replace_all(population$Population, pattern = ",", replacement = "")) #convert to numeric
head(population) #checks if we scraped the desired data

Now the variable is of type “numeric” and could be used for analysis.

class(population$Population)
## [1] "numeric"

2.3.4.2 Scraping data from APIs

2.3.4.2.1 Scraping data from APIs directly

Reading data from websites can be tricky since you need to analyze the page structure first. Many web-services (e.g., Facebook, Twitter, YouTube) actually have application programming interfaces (API’s), which you can use to obtain data in a pre-structured format. JSON (JavaScript Object Notation) is a popular lightweight data-interchange format in which data can be obtained. The process of obtaining data is visualized in the following graphic:

Obtaining data from APIs

Obtaining data from APIs

The process of obtaining data from APIs consists of the following steps:

  • Identify an API that has enough data to be relevant and reliable (e.g., www.programmableweb.com has >12,000 open web APIs in 63 categories).
  • Request information by calling (or, more technically speaking, creating a request to) the API (e.g., R, python, php or JavaScript).
  • Receive response messages, which is usually in JavaScript Object Notation (JSON) or Extensible Markup Language (XML) format.
  • Write a parser to pull out the elements you want and put them into a of simpler format
  • Store, process or analyze data according the marketing research question.

Let’s assume that you would like to obtain population data again. The World Bank has an API that allows you to easily obtain this kind of data. The details are usually provided in the API reference, e.g., here. You simply “call” the API for the desired information and get a structured JSON file with the desired key-value pairs in return. For example, the population for Austria from 1960 to 2016 can be obtained using this call. The file can be easily read into R using the fromJSON()-function from the jsonlite-package. Again, the result is a list and the second element ctrydata[[2]] contains the desired data, from which we select the “value” and “data” columns using the square brackets as usual [,c("value","date")]

library(jsonlite)
url <- "http://api.worldbank.org/countries/AT/indicators/SP.POP.TOTL/?date=1960:2016&format=json&per_page=100" #specifies url
ctrydata <- fromJSON(url) #parses the data 
str(ctrydata)
## List of 2
##  $ :List of 4
##   ..$ page    : int 1
##   ..$ pages   : int 1
##   ..$ per_page: chr "100"
##   ..$ total   : int 57
##  $ :'data.frame':    57 obs. of  5 variables:
##   ..$ indicator:'data.frame':    57 obs. of  2 variables:
##   .. ..$ id   : chr [1:57] "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" "SP.POP.TOTL" ...
##   .. ..$ value: chr [1:57] "Population, total" "Population, total" "Population, total" "Population, total" ...
##   ..$ country  :'data.frame':    57 obs. of  2 variables:
##   .. ..$ id   : chr [1:57] "AT" "AT" "AT" "AT" ...
##   .. ..$ value: chr [1:57] "Austria" "Austria" "Austria" "Austria" ...
##   ..$ value    : chr [1:57] "8736668" "8642699" "8546356" "8479823" ...
##   ..$ decimal  : chr [1:57] "0" "0" "0" "0" ...
##   ..$ date     : chr [1:57] "2016" "2015" "2014" "2013" ...
head(ctrydata[[2]][,c("value","date")]) #checks if we scraped the desired data
2.3.4.2.2 Scraping data from APIs via R packages

An even more convenient way to obtain data from web APIs is to use existing R packages that someone else has already created. There are R packages available for various web-services. For example, the gtrendsR package can be used to conveniently obtain data from the Google Trends page. The gtrends() function is easy to use ans returns a list of elements (e.g., “interest over time”, “interest by city”, “related topics”), which can be inspected using the ls() function. The following example can be used to obtain data for the search term “data science” in the US between September 1 and October 6:

library(gtrendsR)
index = 1
success = FALSE
while(!(success | index == 10)){
  google_trends <- try(gtrends("data science", geo = c("US"), gprop = c("web"), time = "2017-09-01 2017-10-06"), silent = TRUE)
  
  if(!is(google_trends, "try-error")){
    ls(google_trends)
    head(google_trends$interest_over_time)
    success = TRUE
  }else{
  index = index + 1
  Sys.sleep(runif(1,0,3))}
}
if(success == FALSE){
  warning("Google Trends has exited unsuccessfully")
}