4 Data types

To a human, the difference between something numeric, like a person’s age, and something textual, like their name, isn’t a big deal. To a computer, however, this matters a lot. To ensure that there is sufficient memory to store the information and that the information can be used in an operation, the computer needs to know what type of data it’s working with. In other words: 5 + “Steve” = Huh?

In this chapter, we’ll talk through the various primitive data types that R supports. By the end of this chapter, you will be able to answer the following:

What are the different data types?
When and how is one type converted to another?
How can I work with dates?
What the heck is a factor?

4.1 Data types

R has four “primitive” data types that you’ll use day to day (two more, complex and raw, exist but aren’t covered here):

logical
integer
double
character

To know what type of data you’re working with, you use the (wait for it) typeof function. If you want to test for a specific data type, you can use the suite of is.* functions, such as is.logical and is.double. Have a look at the example below. Note that when we want a number to be stored as an integer, we type the letter “L” after it.

x <- 6
y <- 3L
z <- TRUE
typeof(x)
#> [1] "double"
typeof(y)
#> [1] "integer"
typeof(z)
#> [1] "logical"
is.logical(x)
#> [1] FALSE
is.double(x)
#> [1] TRUE

4.2 Data conversion

It’s possible to convert from one type to another. Most of the time, this happens implicitly as part of an operation. R will alter data in order for calculations to take place. For example, let’s say that I’m adding together x and y from the code snippet above. We know that an integer and a real number will add together easily, but the computer needs to convert the integer before the operation can take place.

typeof(x + y)
#> [1] "double"

Implicit conversion will change data types in the order shown below. Note that all values in an operation will be converted to the most complex type involved in the calculation.

logical -> integer -> double -> character
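One way to watch this ordering in action is to combine values of different types with c(), which we’ll meet properly in the next chapter. The snippet below is only a sketch; the variable names are made up for illustration.

```r
# Combining values forces them into a single common type,
# following logical -> integer -> double -> character.
logicalAndInteger <- c(TRUE, 2L)
integerAndDouble <- c(2L, 5.8)
doubleAndCharacter <- c(5.8, "Steve")

typeof(logicalAndInteger)
#> [1] "integer"
typeof(integerAndDouble)
#> [1] "double"
typeof(doubleAndCharacter)
#> [1] "character"

# With everything mixed together, character wins
everything <- c(TRUE, 2L, 5.8, "Steve")
everything
#> [1] "TRUE"  "2"     "5.8"   "Steve"
```

Notice that the conversion only ever moves to the right in the ordering. R won’t turn "Steve" into a number on its own.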

Note that implicit conversion can’t always help us. Let’s try the example from the start of this chapter.

5 + 'Steve'
#> Error in 5 + "Steve": non-numeric argument to binary operator

Here, R is telling us that it doesn’t know how to add a number and a word. I don’t either.

For explicit conversion, use the as.* functions. Note that when explicit conversion is used to convert a value to a simpler data type (double to integer, say), there will likely be a loss of information.

# Implicit conversion
w <- TRUE
x <- 4L
y <- 5.8
z <- w + x + y
typeof(z)
#> [1] "double"

# Explicit conversion. Note loss of data.
as.integer(z)
#> [1] 10
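A few more explicit conversions may help show where information gets lost. This is a small sketch using base R’s as.* functions; the variable names are just for illustration.

```r
# Double to integer drops the fractional part (it truncates toward zero,
# it does not round)
truncatedUp <- as.integer(10.8)
truncatedDown <- as.integer(-10.8)
truncatedUp
#> [1] 10
truncatedDown
#> [1] -10

# Numbers convert to text and back without trouble
asText <- as.character(5.8)
backToNumber <- as.numeric("5.8")
asText
#> [1] "5.8"
backToNumber
#> [1] 5.8

# Text that isn't a number becomes NA. Without suppressWarnings(), R also
# warns that NAs were introduced by coercion.
notANumber <- suppressWarnings(as.numeric("Steve"))
notANumber
#> [1] NA
```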

In addition to typeof there are two other functions which will return basic information about an object.

The mode of an object will return a value indicating how the object is meant to be stored. This will generally mirror the output produced by typeof except that double and integers both have a mode of “numeric”. This function has never improved my life and it won’t be discussed any further.

The class of an object is a very special kind of metadata. (We’ll get more into metadata in the next chapter, Vectors.) When we get beyond primitive data types, this starts to become important. We’ll see two examples in just a moment when we talk about dates and factors. The class of a basic type will be equal to its type apart from ‘double’, whose class is ‘numeric’ for reasons I don’t pretend to understand.

class(TRUE)
#> [1] "logical"
class(pi)
#> [1] "numeric"
class(4L)
#> [1] "integer"
class(Sys.Date())
#> [1] "Date"

The table below summarizes most of the ways we can sort out what sort of data we’re working with.

Table 4.1: Functions for determining what sort of data an object holds
Function	Returns
typeof	The type of the object
mode	Storage mode of the object
class	The class(es) of the object
inherits	Whether the object is a particular class
is.*	Whether the object is a particular type
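To see the functions in the table side by side, here is a quick sketch that applies each of them to the same object. I’ve used today’s date because its type and class differ; the variable name is made up.

```r
today <- Sys.Date()

typeof(today)             # how the value is stored
#> [1] "double"
mode(today)               # the storage mode
#> [1] "numeric"
class(today)              # the class metadata
#> [1] "Date"
inherits(today, "Date")   # test for a class
#> [1] TRUE
is.double(today)          # test for a type
#> [1] TRUE
is.character(today)
#> [1] FALSE
```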

4.3 Dates and times

Dates in R can be tricky. There are two basic classes: Date and POSIXt. The Date class does not get more granular than days. The POSIXt class can handle seconds, milliseconds, etc. My recommendation is to stick with the “Date” class. Introducing times means introducing time zones and the possibility for confusion or error. Actuaries rarely need to measure things in minutes.

x <- as.Date('2010-01-01')
class(x)
#> [1] "Date"
typeof(x)
#> [1] "double"
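Why is the type “double”? Under the hood, a Date is just a count of days since January 1, 1970, with a class attached. The sketch below strips the class off to show the number, and then does a little arithmetic. The variable names are mine.

```r
x <- as.Date('2010-01-01')

# Remove the class to see the number of days since 1970-01-01
daysSinceEpoch <- unclass(x)
daysSinceEpoch
#> [1] 14610

# Adding a number adds that many days
thirtyDaysLater <- x + 30
thirtyDaysLater
#> [1] "2010-01-31"

# Subtracting two dates gives the number of days between them
daysInBetween <- as.Date('2010-12-31') - x
daysInBetween
#> Time difference of 364 days
```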

By default, dates don’t follow US conventions. Much like avoiding the metric system, United Statesians are sticking with a convention that doesn’t have a lot of logical support. If you want to preserve your sanity, stick with year, month, day order.

# Don't do this:
x <- as.Date('06-30-2010')
#> Error in charToDate(x): character string is not in a standard unambiguous format

# This runs without an error, but R still reads the string as
# year-month-day and silently returns the wrong date:
x <- as.Date('30-06-2010')

# Year, month, day is your friend
x <- as.Date('2010-06-30')
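If your data arrives in some other order, you don’t have to retype it. The as.Date function takes a format argument that describes the layout of the string. A minimal sketch, with variable names invented for the example:

```r
# %m = month, %d = day, %Y = four-digit year
usDate <- as.Date('06/30/2010', format = '%m/%d/%Y')
usDate
#> [1] "2010-06-30"

# The same idea works for day-month-year strings
euDate <- as.Date('30-06-2010', format = '%d-%m-%Y')
euDate
#> [1] "2010-06-30"

# format() goes the other way, from a Date to a string
usString <- format(usDate, '%m/%d/%Y')
usString
#> [1] "06/30/2010"
```

Being explicit about the format is safer than hoping R guesses correctly.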

To get the date and time of the computer, use either Sys.Date() or Sys.time(). Note that Sys.time() will return both the day AND the time as a POSIXct object.

x <- Sys.Date()
y <- Sys.time()
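A quick check of the two results shows the difference in class. This is just a sketch; I’ve used new variable names so it can be run on its own.

```r
today <- Sys.Date()
now <- Sys.time()

class(today)
#> [1] "Date"
class(now)
#> [1] "POSIXct" "POSIXt"

# Both are stored as doubles: days since 1970-01-01 for a Date,
# seconds since then for a POSIXct
typeof(today)
#> [1] "double"
typeof(now)
#> [1] "double"
```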

It’s worth reading the documentation about dates. Measuring time periods is a common task for actuaries. It’s easy to make huge mistakes by getting dates wrong.

The lubridate package has some nice convenience functions for setting month and day and reasoning about time periods. It also enables you to deal with time zones, leap days and leap seconds. This is probably more than most folks need, but it’s worth looking into.

The mondate package was written by Daniel Murphy (an actuary) and supports handling time periods in terms of months. This is a very good thing. You’ll quickly learn that the base functions don’t like dealing with time periods as measured in months. Why? Because months have different lengths, so it’s not clear how to add “one month” to a set of dates. And yet, we very often want to do this. An easy example is adding a number of months to the last day of a month, which is what closing a quarter, a common task in financial circles, requires. The code below will produce the quarter-end dates for a single year.⁵

library(mondate)
#> Loading required package: methods
#> 
#> Attaching package: 'mondate'
#> The following object is masked from 'package:base':
#> 
#>     as.difftime
add(mondate("2010-03-31"), c(0, 3, 6, 9), units = "months")
#> mondate: timeunits="months"
#> [1] 2010-03-31 2010-06-30 2010-09-30 2010-12-31
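If you don’t have mondate installed, one workaround in base R is to step between the first days of months, which always exist, and then subtract a day. This is only a sketch of that trick, not a replacement for the package, and the variable names are mine.

```r
# The day after each quarter-end is the first of a month
dayAfterQuarterEnd <- seq(as.Date("2010-04-01"), by = "3 months", length.out = 4)

# Step back one day to land on the quarter-end itself
quarterEnds <- dayAfterQuarterEnd - 1
quarterEnds
#> [1] "2010-03-31" "2010-06-30" "2010-09-30" "2010-12-31"

# By contrast, stepping one month at a time from the 31st goes wrong:
# there is no February 31, so the second date rolls over into March
rolledOver <- seq(as.Date("2010-01-31"), by = "month", length.out = 3)
```

The second half is the reason months are awkward in base R: the month is advanced first, and an invalid day is counted forward into the next month.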

The items below are all worth reading.

Date class: https://stat.ethz.ch/R-manual/R-devel/library/base/html/Dates.html
lubridate: http://www.jstatsoft.org/v40/i03/paper
Ripley and Hornik: http://www.r-project.org/doc/Rnews/Rnews_2001-2.pdf
mondate: https://code.google.com/p/mondate/

4.4 Factors

Factors are a pretty big gotcha. They were necessary many years ago when data collection and storage were expensive. A factor maps a character string to an integer, so that it takes up less space. The code below will illustrate the difference between a factor and a comparable character vector⁶.

myColors <- c("Red", "Blue", "Green", "Red", "Blue", "Red")
myFactor <- factor(myColors)
myColors
#> [1] "Red"   "Blue"  "Green" "Red"   "Blue"  "Red"
myFactor
#> [1] Red   Blue  Green Red   Blue  Red  
#> Levels: Blue Green Red
typeof(myFactor)
#> [1] "integer"
class(myFactor)
#> [1] "factor"
is.character(myFactor)
#> [1] FALSE
is.character(myColors)
#> [1] TRUE

Note that when we printed the value of myFactor we got the list of colors, but without the quotes around them. We are also told that our object has “Levels”. This is important as it defines the set of possible values for the factor. This is rather useful if you have a data set where the permissible values are constrained to a closed set, like gender, education, smoker/non-smoker, etc.
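To see the mapping between strings and integers directly, we can pull the two pieces of a factor apart. The sketch below re-creates the same objects so that it runs on its own.

```r
myColors <- c("Red", "Blue", "Green", "Red", "Blue", "Red")
myFactor <- factor(myColors)

# The levels are the distinct values, sorted alphabetically
factorLevels <- levels(myFactor)
factorLevels
#> [1] "Blue"  "Green" "Red"

# The data itself is just the position of each value within the levels
factorCodes <- as.integer(myFactor)
factorCodes
#> [1] 3 1 2 3 1 3

# Looking the codes up in the levels gives back the original strings
rebuilt <- factorLevels[factorCodes]
identical(rebuilt, myColors)
#> [1] TRUE
```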

So, what happens if we want to add a new element to our factor?

# This probably won't give you what you expect
myOtherFactor <- c(myFactor, "Orange")
myOtherFactor
#> [1] "3"      "1"      "2"      "3"      "1"      "3"      "Orange"

# And this will give you a warning and an NA instead of the new value
myFactor[length(myFactor)+1] <- "Orange"
#> Warning in `[<-.factor`(`*tmp*`, length(myFactor) + 1, value = "Orange"):
#> invalid factor level, NA generated

# Must do things in two steps: first add the new level, then append the value
myOtherFactor <- factor(myColors, levels = c(levels(myFactor), "Orange"))
myOtherFactor[length(myOtherFactor)+1] <- "Orange"

Ugh. In the first instance, R recognizes that it can’t append a new item to the factor. So, it converts the values to strings and then appends the string “Orange”. But note that the items are string values of integers. That’s because the underlying data of a factor is an integer. In the second instance, R warns us that “Orange” isn’t a valid level and stores NA instead. In the final instance, we first have to add the new level to the factor and then we can append our new data element.

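Before version 4.0.0, R’s default behavior when creating a data frame was to convert character values into factors. Since you can’t always be sure which version will run your code, it’s safest to be explicit. When we get to creating data frames and importing data, you’ll often see us use code like the following: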

mojo <- read.csv("myFile.csv", stringsAsFactors = FALSE)
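If you’d like to see the effect without a file on disk, read.csv can also read from a string through its text argument. The data below is made up purely for illustration, and we haven’t covered data frames or the $ operator yet, so treat this as a preview.

```r
# A tiny, hypothetical CSV held in a string instead of a file
csvText <- "name,age\nSteve,40\nMaria,35"

withFactors <- read.csv(text = csvText, stringsAsFactors = TRUE)
withStrings <- read.csv(text = csvText, stringsAsFactors = FALSE)

class(withFactors$name)
#> [1] "factor"
class(withStrings$name)
#> [1] "character"

# Numeric columns are unaffected either way
class(withStrings$age)
#> [1] "integer"
```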

Now that you know what they are, you can spend the next few months avoiding factors. When R was created, there were compelling reasons to include factors and they still have some utility. More often than not, though, they’re a confusing hindrance. If characters aren’t behaving the way you expect them to, check the variables with class or is.factor. Convert them with as.character and you’ll be back on the road to happiness.
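One situation where this bites is a column of numbers that has been read in as a factor. Converting it straight to a number returns the integer codes, not the values. The sketch below uses a made-up set of ages.

```r
# Hypothetical ages that ended up stored as a factor
ages <- factor(c("40", "35", "62"))
is.factor(ages)
#> [1] TRUE

# Converting directly returns the underlying codes
wrongAges <- as.numeric(ages)
wrongAges
#> [1] 2 1 3

# Going through as.character first recovers the real values
rightAges <- as.numeric(as.character(ages))
rightAges
#> [1] 40 35 62
```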

4.5 Exercises

Create a logical, integer, double and character variable.
Can you create a vector with both logical and character values?
What happens when you try to add a logical to an integer? An integer to a double?

4.5.1 Answers

myLogical <- TRUE
myInteger <- 1:4
myDouble <- 3.14
myCharacter <- "Hello!"

y <- myLogical + myInteger
typeof(y)
#> [1] "integer"
y <- myInteger + myDouble
typeof(y)
#> [1] "double"

⁵ You can also use the quarter function to achieve much the same thing.
⁶ We haven’t covered vectors yet, but we’re getting there. If this code is confusing, just skip this section for now and come back after you’ve read up on vectors.