2 Basic data handling

This chapter covers the basics of data handling in R.

2.1 Creating objects

Anything created in R is an object. You can assign values to objects using the assignment operator <-:

x <- "hello world"  #assigns the words 'hello world' to the object x
# this is a comment

Note that comments may be included in the code after a #. The text after the # is not evaluated when the code is run. Comments can be written directly after the code or on a separate line.

To see the value of an object, simply type its name into the console and hit enter:

x  #print the value of x to the console
## [1] "hello world"

You can also explicitly tell R to print the value of an object:

print(x)  #print the value of x to the console
## [1] "hello world"

Note that because we assign characters in this case (as opposed to, e.g., numeric values), we need to wrap the words in quotation marks, which must always come in pairs. Although RStudio automatically adds the closing quotation mark when you type the opening one, you may still end up with a mismatch by accident (e.g., x <- "hello). In this case, R will show you the continuation character “+”. The same happens if you accidentally execute only part of a command. The “+” means that R is expecting more input. If this happens, either add the missing closing quotation mark or press ESCAPE to abort the expression and try again.

To change the value of an object, you can simply overwrite the previous value. For example, you could also assign a numeric value to “x” to perform some basic operations:

x <- 2  #assigns the value of 2 to the object x
print(x)
## [1] 2
x == 2  #checks whether the value of x is equal to 2
## [1] TRUE
x != 3  #checks whether the value of x is NOT equal to 3
## [1] TRUE
x < 3  #checks whether the value of x is less than 3
## [1] TRUE
x > 3  #checks whether the value of x is greater than 3
## [1] FALSE

Note that the name of the object is completely arbitrary. We could also define a second object “y”, assign it a different value and use it to perform some basic mathematical operations:

y <- 5  #assigns the value of 5 to the object y
x == y  #checks whether the value of x is equal to the value of y
## [1] FALSE
x * y  #multiplication of x and y
## [1] 10
x + y  #adds the values of x and y together
## [1] 7
y^2 + 3 * x  #adds the value of y squared and 3x the value of x together
## [1] 31

Object names

Please note that object names must start with a letter and can only contain letters, numbers, and the separators . and _. It is important to give your objects descriptive names and to be as consistent as possible with the naming structure. In this tutorial we will be using lower case words separated by underscores (e.g., object_name). There are other naming conventions, such as using a . as a separator (e.g., object.name), or starting each new word with an upper case letter (e.g., objectName). It doesn’t really matter which one you choose, as long as you are consistent.
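As a quick check of these naming rules, the base R function make.names() turns arbitrary strings into syntactically valid names. The following is only a minimal sketch; the candidate names are made up for illustration and are not objects used elsewhere in this tutorial:

```r
# Hypothetical candidate names, chosen only for illustration
candidate_names <- c("object_name", "object.name", "objectName",
    "2nd_object", "my object")

# make.names() returns a syntactically valid version of each name
valid_names <- make.names(candidate_names)

# A candidate is already valid if make.names() leaves it unchanged
is_valid <- candidate_names == valid_names
print(data.frame(candidate_names, valid_names, is_valid))
```

The last two candidates are changed because one starts with a number and the other contains a space.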

2.2 Data types

The most important types of data are:

| Data type | Description |
|---|---|
| Numeric | Approximations of the real numbers, \(\mathbb{R}\) (e.g., mileage a car gets: 23.6, 20.9, etc.) |
| Integer | Whole numbers, \(\mathbb{Z}\) (e.g., number of sales: 7, 0, 120, 63, etc.) |
| Character | Text data (strings, e.g., product names) |
| Factor | Categorical data for classification (e.g., product groups) |
| Logical | TRUE, FALSE |
| Date | Date variables (e.g., sales dates: 21-06-2015, 06-21-15, 21-Jun-2015, etc.) |

Variables can be converted from one type to another using the appropriate functions (e.g., as.numeric(), as.integer(), as.character(), as.factor(), as.logical(), as.Date()). For example, we could convert the object y to character as follows:

y <- as.character(y)
print(y)
## [1] "5"

Notice how the value is in quotation marks since it is now of type character.
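The other conversion functions work in a similar way. The sketch below uses made-up values and object names (price, units_sold, etc. are assumptions for illustration only) to show a few typical conversions, including a value that cannot be converted:

```r
# Character to numeric: works when the text represents a number
price <- as.numeric("23.6")

# Numeric to integer: the decimal part is dropped
units_sold <- as.integer(7.9)

# Numeric to logical: 0 becomes FALSE, any other number becomes TRUE
flags <- as.logical(c(0, 1, 2))

# Character to Date: the format argument describes how the text is laid out
sales_date <- as.Date("21-06-2015", format = "%d-%m-%Y")

# Text that cannot be read as a number becomes NA (R also issues a warning,
# which is suppressed here)
not_a_number <- suppressWarnings(as.numeric("hello"))

class(price)
class(units_sold)
class(sales_date)
```

Note that as.integer() does not round: 7.9 becomes 7.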

Entering a vector of data into R can be done with the c(x1, x2, ..., xn) (“combine” or “concatenate”) function. In order to be able to use our vector (or any other variable) later on, we assign it a name using the assignment operator <-. You can choose names arbitrarily; just make sure they are descriptive and unique. Assigning a name that is already in use (e.g., to a second vector) overwrites the first object.

# Numeric:
top10_track_streams <- c(163608, 126687, 120480, 110022, 
    108630, 95639, 94690, 89011, 87869, 85599)

# Character:
top10_artist_names <- c("Axwell /\\ Ingrosso", "Imagine Dragons", 
    "J Balvin", "Robin Schulz", "Jonas Blue", "David Guetta", 
    "French Montana", "Calvin Harris", "Liam Payne", 
    "Lauv")  # Character values have to be wrapped in quotation marks

# Factor variable with two categories:
top10_track_explicit <- c(0, 0, 0, 0, 0, 0, 1, 1, 0, 
    0)
top10_track_explicit <- factor(top10_track_explicit, 
    levels = c(0:1), labels = c("not explicit", "explicit"))

# Factor variable with more than two categories:
top10_artist_genre <- c("Dance", "Alternative", "Latino", 
    "Dance", "Dance", "Dance", "Hip-Hop/Rap", "Dance", 
    "Pop", "Pop")
top10_artist_genre <- as.factor(top10_artist_genre)

# Date:
top_10_track_release_date <- as.Date(c("2017-05-24", 
    "2017-06-23", "2017-07-03", "2017-06-30", "2017-05-05", 
    "2017-06-09", "2017-07-14", "2017-06-16", "2017-05-18", 
    "2017-05-19"))

# Logical
top10_track_explicit_1 <- c(FALSE, FALSE, FALSE, FALSE, 
    FALSE, FALSE, TRUE, TRUE, FALSE, FALSE)

In order to “call” a vector we can now simply enter its name:

top10_track_streams
##  [1] 163608 126687 120480 110022 108630  95639  94690  89011  87869  85599
top_10_track_release_date
##  [1] "2017-05-24" "2017-06-23" "2017-07-03" "2017-06-30" "2017-05-05"
##  [6] "2017-06-09" "2017-07-14" "2017-06-16" "2017-05-18" "2017-05-19"

In order to check the type of a variable the class() function is used.

class(top_10_track_release_date)
## [1] "Date"
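The same check is useful for factors, where the class changes once the numeric codes are converted. The following sketch uses a short, made-up vector of codes (explicit_codes is an assumption for illustration) and a few base R functions to inspect the result:

```r
# Hypothetical codes: 0 = not explicit, 1 = explicit
explicit_codes <- c(0, 0, 1, 1, 0)
explicit_factor <- factor(explicit_codes, levels = 0:1,
    labels = c("not explicit", "explicit"))

class(explicit_codes)        # class before the conversion
class(explicit_factor)       # class after the conversion
levels(explicit_factor)      # the category labels
table(explicit_factor)       # number of observations per category
as.integer(explicit_factor)  # internal codes of the factor, which start at 1
```

Keep in mind that the internal codes of a factor start at 1, so they differ from the original 0/1 coding.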

The video below gives a general overview of vectors and provides a more in-depth discussion.

2.3 Data structures

Now let’s create a table that contains the variables in columns and each observation in a row (like in SPSS or Excel). There are different data structures in R (e.g., matrix, vector, list, array). In this course, we will mainly use data frames. The following graphic shows the different data structures in R (based on “R in Action” by R. Kabacoff).

[Figure: data structures in R]

Data frames are similar to matrices but are more flexible in the sense that they may contain different data types (e.g., numeric, character, etc.), whereas all values of a vector or matrix have to be of the same type (e.g., character). It is often more convenient to use characters instead of numbers (e.g., when indicating a person’s sex: “F”, “M” instead of 1 for female, 2 for male). Thus we would like to combine both numeric and character values while retaining the respective desired features. This is where data frames come into play, since each column of a data frame can have a different type. For example, we can combine the vectors created above in one data frame using data.frame(). This creates a separate column for each vector, which is usually what we want (similar to SPSS or Excel).

music_data <- data.frame(top10_track_streams, top10_artist_names, 
    top10_track_explicit, top10_artist_genre, top_10_track_release_date, 
    top10_track_explicit_1)
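To illustrate the difference between matrices and data frames described above, the following sketch combines a character and a numeric vector in both ways. The vectors artist and streams are made-up examples, not part of the music data:

```r
# Hypothetical example data
artist <- c("Artist A", "Artist B", "Artist C")
streams <- c(1200, 950, 430)

# cbind() builds a matrix, so the numbers are converted to character
as_matrix <- cbind(artist, streams)
class(as_matrix[, "streams"])

# data.frame() keeps the numeric type of the streams column
as_data_frame <- data.frame(artist, streams)
class(as_data_frame[, "streams"])

# Calculations therefore only work on the data frame column
mean(as_data_frame[, "streams"])
```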

2.3.1 Accessing data in data frames

When entering the name of a data frame, R returns the entire data frame:

music_data  # Returns the entire data frame

Hint: You may also use the View() function to view the data in a table format (like in SPSS or Excel), e.g., by entering the command View(music_data). Note that you can achieve the same by clicking on the small table icon next to the data frame in the “Environment” window on the right in RStudio.

Sometimes it is convenient to return only specific values instead of the entire data frame. There are a variety of ways to identify the elements of a data frame. One easy way is to explicitly state which rows and columns you wish to view. The general form of the command is data_frame[rows, columns]. By leaving one of the two arguments blank (e.g., data_frame[rows, ]), we tell R that we want to access all columns or all rows, respectively. Here are some examples:

music_data[, 2:4]  # all rows and columns 2,3,4
music_data[, c("top10_artist_names", "top_10_track_release_date")]  # all rows and columns 'top10_artist_names' and 'top_10_track_release_date'
music_data[1:5, c("top10_artist_names", "top_10_track_release_date")]  # rows 1 to 5 and columns 'top10_artist_names' and 'top_10_track_release_date'
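The same indexing logic can be tried out on a small data frame. The sketch below uses a made-up data frame called songs (an assumption for illustration) and stores each selection in its own object:

```r
# Hypothetical data frame with three rows and three columns
songs <- data.frame(title = c("Song A", "Song B", "Song C"),
    streams = c(300, 200, 100),
    explicit = c(FALSE, TRUE, FALSE))

first_row <- songs[1, ]                        # row 1, all columns
two_columns <- songs[, c("title", "streams")]  # all rows, two columns
single_value <- songs[2, "streams"]            # row 2, column 'streams'
without_first_column <- songs[, -1]            # all columns except the first

single_value
names(without_first_column)
```

A negative index, as in the last selection, excludes the given position instead of selecting it.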

You may also create subsets of the data frame, e.g., using logical expressions. In the examples below, music_data$column_name refers to a single column of the data frame (the $ notation is covered in more detail in Section 2.3.3):

music_data[music_data$top10_track_explicit == "explicit", ]  # show only tracks with explicit lyrics
music_data[music_data$top10_track_streams > 1e+05, ]  # show only tracks with more than 100,000 streams
music_data[music_data$top10_artist_names == "Robin Schulz", ]  # returns all observations from artist 'Robin Schulz'
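It can help to look at what happens inside the square brackets: the condition is evaluated first and yields one TRUE or FALSE per row, and only the rows with TRUE are returned. A minimal sketch on a made-up data frame (songs and its columns are assumptions for illustration; songs$streams refers to the column streams of songs):

```r
# Hypothetical data frame
songs <- data.frame(title = c("Song A", "Song B", "Song C"),
    streams = c(300, 200, 100),
    explicit = c(FALSE, TRUE, FALSE))

# Step 1: the condition yields one logical value per row
is_popular <- songs$streams > 150
is_popular

# Step 2: rows for which the condition is TRUE are kept
popular_songs <- songs[is_popular, ]

# Conditions can be combined with & (and), | (or) and ! (not)
popular_clean <- songs[songs$streams > 150 & !songs$explicit, ]
popular_clean
```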

The same can be achieved using the subset() function:

subset(music_data, top10_track_explicit == "explicit")  # selects subsets of observations in a dataframe
# creates a new data frame that only contains
# tracks from genre 'Dance'
music_data_dance <- subset(music_data, top10_artist_genre == 
    "Dance")
music_data_dance
rm(music_data_dance)  # removes an object from the workspace
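subset() can also combine several conditions and, via its select argument, restrict the result to certain columns. The following sketch assumes a small, made-up data frame called songs:

```r
# Hypothetical data frame
songs <- data.frame(title = c("Song A", "Song B", "Song C", "Song D"),
    genre = c("Dance", "Pop", "Dance", "Dance"),
    streams = c(300, 200, 100, 150))

# Keep only Dance tracks with more than 100 streams,
# and only the columns 'title' and 'streams'
dance_titles <- subset(songs, genre == "Dance" & streams > 100,
    select = c(title, streams))
dance_titles
```

Inside subset(), column names can be written without quotation marks and without the name of the data frame.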

You may also change the order of the observations (rows) in a data frame by using the order() function:

# Orders by genre (ascending) and streams
# (descending)
music_data[order(music_data$top10_artist_genre, -music_data$top10_track_streams), ]
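Note that order() does not sort the data itself. It returns the row positions in sorted order, and these positions are then used as the row index. A minimal sketch with a made-up data frame (songs is an assumption for illustration):

```r
# Hypothetical data frame
songs <- data.frame(title = c("Song A", "Song B", "Song C", "Song D"),
    genre = c("Pop", "Dance", "Pop", "Dance"),
    streams = c(300, 200, 400, 250))

# order() returns row positions: genre ascending, streams descending
row_order <- order(songs$genre, -songs$streams)
row_order

# Using these positions as the row index returns the sorted data frame
songs_sorted <- songs[row_order, ]
songs_sorted
```

The minus sign in front of a numeric variable reverses the sort direction for that variable.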

2.3.2 Inspecting the content of a data frame

The head() function displays the first X elements/rows of a vector, matrix, table, data frame or function.

head(music_data, 3)  # returns the first X rows (here, the first 3 rows)

The tail() function is similar, except it displays the last elements/rows.

tail(music_data, 3)  # returns the last X rows (here, the last 3 rows)
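Both functions also work on vectors. The sketch below uses a short, made-up numeric vector and additionally shows that a negative number drops elements from the end instead of selecting them:

```r
# Hypothetical vector of stream counts
streams <- c(163608, 126687, 120480, 110022, 108630)

first_two <- head(streams, 2)      # first two elements
last_two <- tail(streams, 2)       # last two elements
all_but_last <- head(streams, -1)  # everything except the last element

first_two
last_two
all_but_last
```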

names() returns the names of an R object. When, for example, it is called on a data frame, it returns the names of the columns.

names(music_data)  # returns the names of the variables in the data frame
## [1] "top10_track_streams"       "top10_artist_names"       
## [3] "top10_track_explicit"      "top10_artist_genre"       
## [5] "top_10_track_release_date" "top10_track_explicit_1"

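str() displays the internal structure of an R object. In the case of a data frame, it returns the class (e.g., numeric, factor, etc.) of each variable, as well as the number of observations and the number of variables. Note that the output shown here lists top10_artist_names as a factor, which is the behavior of R versions before 4.0.0, where data.frame() converted character vectors to factors by default. In R 4.0.0 and later, this column is shown as chr unless stringsAsFactors = TRUE is set.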

str(music_data)  # returns the structure of the data frame
## 'data.frame':    10 obs. of  6 variables:
##  $ top10_track_streams      : num  163608 126687 120480 110022 108630 ...
##  $ top10_artist_names       : Factor w/ 10 levels "Axwell /\\ Ingrosso",..: 1 5 6 10 7 3 4 2 9 8
##  $ top10_track_explicit     : Factor w/ 2 levels "not explicit",..: 1 1 1 1 1 1 2 2 1 1
##  $ top10_artist_genre       : Factor w/ 5 levels "Alternative",..: 2 1 4 2 2 2 3 2 5 5
##  $ top_10_track_release_date: Date, format: "2017-05-24" "2017-06-23" ...
##  $ top10_track_explicit_1   : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...

nrow() and ncol() return the number of rows and columns of a data frame or matrix, respectively. dim() returns both dimensions at once (rows first, then columns).

nrow(music_data)  # returns the number of rows 
## [1] 10
ncol(music_data)  # returns the number of columns 
## [1] 6
dim(music_data)  # returns the dimensions of a data frame
## [1] 10  6

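ls() lists the names of the objects in an environment. When it is called on a data frame, it returns the names of the variables in alphabetical order (in contrast to names(), which keeps the column order).

ls(music_data)  # lists the variables of the data frame in alphabetical order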
## [1] "top_10_track_release_date" "top10_artist_genre"       
## [3] "top10_artist_names"        "top10_track_explicit"     
## [5] "top10_track_explicit_1"    "top10_track_streams"
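The difference between names() and ls() is easiest to see on a small example. The following sketch assumes a made-up data frame whose columns are deliberately not in alphabetical order:

```r
# Hypothetical data frame with columns in non-alphabetical order
songs <- data.frame(title = c("Song A", "Song B"),
    streams = c(300, 200),
    genre = c("Pop", "Dance"))

names(songs)  # column names in their original order
ls(songs)     # the same names, sorted alphabetically
```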

2.3.3 Append and delete variables to/from data frames

To call a certain column in a data frame, we may also use the $ notation. For example, this returns all values associated with the variable “top10_track_streams”:

music_data$top10_track_streams
##  [1] 163608 126687 120480 110022 108630  95639  94690  89011  87869  85599

Assume that you wanted to add an additional variable to the data frame. You may use the $ notation to achieve this:

# Create new variable as the log of the number of
# streams
music_data$log_streams <- log(music_data$top10_track_streams)
# Create an ascending count variable which might
# serve as an ID
music_data$obs_number <- 1:nrow(music_data)
head(music_data)
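New variables can be built from any expression that returns one value per row. As a further sketch, the example below adds a share and a logical indicator to a made-up data frame (songs, stream_share and above_average are assumptions for illustration):

```r
# Hypothetical data frame
songs <- data.frame(title = c("Song A", "Song B", "Song C"),
    streams = c(300, 200, 500))

# Share of each track in the total number of streams
songs$stream_share <- songs$streams / sum(songs$streams)

# Logical variable indicating tracks with above-average streams
songs$above_average <- songs$streams > mean(songs$streams)

songs
```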

To delete a variable, you can simply create a subset of the full data frame that excludes the variables that you wish to drop:

music_data <- subset(music_data, select = -c(log_streams))  # deletes the variable log_streams
head(music_data)
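There are other common ways to drop variables in base R. The sketch below shows two of them on a made-up data frame (songs and columns_to_drop are assumptions for illustration); it is meant as an alternative, not as a replacement for the subset() approach:

```r
# Hypothetical data frame with two extra variables
songs <- data.frame(title = c("Song A", "Song B"),
    streams = c(300, 200))
songs$log_streams <- log(songs$streams)
songs$obs_number <- seq_len(nrow(songs))

# Option 1: assigning NULL removes a single variable
songs$log_streams <- NULL

# Option 2: keep all variables whose name is not in a list of names to drop
# (drop = FALSE makes sure the result stays a data frame)
columns_to_drop <- c("obs_number")
songs <- songs[, !(names(songs) %in% columns_to_drop), drop = FALSE]

names(songs)
```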

You can also rename variables in a data frame, e.g., using the rename() function from the plyr package. In the following code, “::” signifies that the function “rename” should be taken from the package “plyr”. This can be useful if multiple packages have a function with the same name. Calling a function this way also means that you can access it without loading the entire package via library(), as long as the package is installed.

library(plyr)  # optional here, since rename() is called via plyr::
music_data <- plyr::rename(music_data, c(top10_artist_genre = "genre", 
    top_10_track_release_date = "release_date"))
head(music_data)
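If you prefer not to depend on an additional package, variables can also be renamed with names() in base R. The following is a minimal sketch on a made-up data frame that reuses two of the column names from above:

```r
# Hypothetical data frame with two of the original column names
songs <- data.frame(top10_artist_genre = c("Dance", "Pop"),
    top_10_track_release_date = as.Date(c("2017-05-24", "2017-06-23")))

# Replace a name by selecting it from the vector of column names
names(songs)[names(songs) == "top10_artist_genre"] <- "genre"
names(songs)[names(songs) == "top_10_track_release_date"] <- "release_date"

names(songs)
```

Selecting the column by its old name, rather than by its position, keeps the code working even if the column order changes.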

The video below provides an overview of dataframes, going into more detail on some points.