NULL: Absence of a vector.
Four basic types:
Logical: TRUE or FALSE.
Integer: Whole numbers, written with an L behind them (for “long integer”): -1L, 0L, 1L, 2L, 3L, etc.
Double: Decimal numbers: 1, 1.0, 1.01, etc. Inf, -Inf, and NaN are also doubles.
Character: Text in quotes: "1", "one", "1 won one", etc.
You create vectors with c(), for “combine”.
x <- c(TRUE, TRUE, FALSE, TRUE) ## logical
x <- c(1L, 1L, 0L, 1L) ## integer
x <- c(1, 1, 0, 1) ## double
x <- c("1", "1", "0", "1") ## character
There are no scalars in R. A “scalar” is just a vector of length 1.
is.vector(TRUE)
## [1] TRUE
Integers and doubles are together called “numerics”.
You can determine the type with typeof().
x <- c(TRUE, FALSE)
typeof(x)
## [1] "logical"
x <- c(0L, 1L)
typeof(x)
## [1] "integer"
x <- c(0, 1)
typeof(x)
## [1] "double"
x <- c("0", "1")
typeof(x)
## [1] "character"
The special values Inf, -Inf, and NaN are doubles.
typeof(c(Inf, -Inf, NaN))
## [1] "double"
Determine the length of a vector using length()
length(x)
## [1] 2
Missing values are represented by NA.
NA is technically a logical value.
typeof(NA)
## [1] "logical"
This rarely matters because logicals get coerced to other types when needed.
typeof(c(1L, NA))
## [1] "integer"
typeof(c(1, NA))
## [1] "double"
typeof(c("1", NA))
## [1] "character"
But if you need missing values of other types, you can use
NA_integer_ ## integer NA
NA_real_ ## double NA
NA_character_ ## character NA
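This typically shows up in dplyr::if_else(), where the return values all need to be of the same type. (The error below comes from an older dplyr; since dplyr 1.1.0, if_else() casts its inputs to a common type and accepts a bare NA.)
dplyr::if_else(c(TRUE, FALSE), 1, NA) ## errors in dplyr versions before 1.1.0
## Error in `dplyr::if_else()`:
## ! `false` must be a double vector, not a logical vector.
dplyr::if_else(c(TRUE, FALSE), 1, NA_real_) ## works fine
## [1] 1 NA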
Never use == when testing for missingness. It will return NA, since it is always unknown if two unknowns are equal. Use is.na().
x <- c(NA, 1)
x == NA
## [1] NA NA
is.na(x)
## [1] TRUE FALSE
You can check the type with is.logical(), is.integer(), is.double(), and is.character().
is.logical(TRUE)
## [1] TRUE
is.integer(1L)
## [1] TRUE
is.double(1)
## [1] TRUE
is.character("1")
## [1] TRUE
Attempting to combine vectors of different types coerces them to the same type. The order of preference is character > double > integer > logical.
typeof(c(1L, TRUE))
## [1] "integer"
typeof(c(1, 1L))
## [1] "double"
typeof(c("1", 1))
## [1] "character"
Exercise (from Advanced R): Predict the output:
c(1, FALSE)
c("a", 1)
c(TRUE, 1L)
Exercise (from Advanced R): Explain these results:
1 == "1"
## [1] TRUE
-1 < FALSE
## [1] TRUE
"one" < 2
## [1] FALSE
Attributes are meta information applied to atomic vectors.
Many common objects (like matrices, arrays, factors, date-times) are just atomic vectors with special attributes.
You get and set attributes with attr().
a <- 1:3
attr(a, "x") <- "abcdef" # sets x attribute of vector a to be "abcdef"
attr(a, "x") # retrieve the x attribute of vector a
## [1] "abcdef"
You can see all attributes of a vector with attributes().
attr(a, "y") <- 4:6
attributes(a)
## $x
## [1] "abcdef"
##
## $y
## [1] 4 5 6
You can set many attributes at the same time with structure().
b <- structure(1:3,
               x = "abcdef",
               y = 4:6)
attributes(b)
## $x
## [1] "abcdef"
##
## $y
## [1] 4 5 6
Attributes are name-value pairs, and all of these attributes are associated with an object. For example, a vector c(1, 2, 3) can point to attributes x and y that each have their own values.
Most attributes are lost by most operations.
attributes(a[[1]])
## NULL
attributes(sum(a))
## NULL
Exception: Two attributes are typically not lost: names and dim.
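As a small illustration of this exception (a sketch only; the attribute name note is made up for the example), single-bracket subsetting keeps names but drops a custom attribute:

```r
## "note" is a hypothetical custom attribute used only for illustration
v <- structure(1:3, names = c("a", "b", "c"), note = "custom")

attributes(v)       ## both names and note are present

attributes(v[1:2])  ## only names survives the subsetting
## $names
## [1] "a" "b"
```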
Names are a character vector the same length as the atomic vector. Each name corresponds to a single element.
You could set names using attr(), but you should not.
x <- 1:3
attr(x, "names") <- c("a", "b", "c")
attributes(x)
## $names
## [1] "a" "b" "c"
Names are so special that there are special ways to create and view them.
x <- c(a = 1, b = 2, c = 3)
names(x)
## [1] "a" "b" "c"
x <- 1:3
names(x) <- c("a", "b", "c")
names(x)
## [1] "a" "b" "c"
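The proper way to think about names is as one attribute, a character vector, attached to the whole vector.
But each name corresponds to a specific element, so Hadley (in Advanced R) draws each name next to the element it labels.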
Names stay with single bracket subsetting (not double bracket subsetting)
names(x[1])
## [1] "a"
names(x[1:2])
## [1] "a" "b"
names(x[[1]])
## NULL
Names can be used for subsetting (more in Chapter 4).
x[["a"]]
## [1] 1
You can remove names with unname().
unname(x)
## [1] 1 2 3
The dim attribute makes a vector into a matrix (a rectangle of numbers) or an array (a block of numbers).
Again, you could use attr() to set dim, but you should not.
x <- 1:6
attr(x, "dim") <- c(2, 3)
x
## [,1] [,2] [,3]
## [1,] 1 3 5
## [2,] 2 4 6
x <- 1:12
attr(x, "dim") <- c(2, 2, 3)
x
## , , 1
##
## [,1] [,2]
## [1,] 1 3
## [2,] 2 4
##
## , , 2
##
## [,1] [,2]
## [1,] 5 7
## [2,] 6 8
##
## , , 3
##
## [,1] [,2]
## [1,] 9 11
## [2,] 10 12
You should either use matrix() or array() to create these objects, or set the dimension with dim().
x <- 1:6
dim(x) <- c(2, 3)
dim(x)
## [1] 2 3
x <- matrix(1:6, nrow = 2, ncol = 3)
dim(x)
## [1] 2 3
y <- 1:12
dim(y) <- c(2, 2, 3)
dim(y)
## [1] 2 2 3
y <- array(1:12, dim = c(2, 2, 3))
dim(y)
## [1] 2 2 3
length() still works on matrices and arrays, but is less useful.
length(x)
## [1] 6
length(y)
## [1] 12
Instead, use nrow() and ncol() (for matrices), or dim() (for arrays).
nrow(x)
## [1] 2
ncol(x)
## [1] 3
dim(y)
## [1] 2 2 3
Instead of having names, arrays and matrices have dimnames. The dimnames of an array is a list the same length as the number of dimensions of the array.
x <- array(1:12, dim = c(2, 2, 3))
dimnames(x) <- list(first = c("a", "b"),
                    second = c("c", "d"),
                    third = c("e", "f", "g"))
dimnames(x)
## $first
## [1] "a" "b"
##
## $second
## [1] "c" "d"
##
## $third
## [1] "e" "f" "g"
x
## , , third = e
##
## second
## first c d
## a 1 3
## b 2 4
##
## , , third = f
##
## second
## first c d
## a 5 7
## b 6 8
##
## , , third = g
##
## second
## first c d
## a 9 11
## b 10 12
This is useful for subsetting, and for bookkeeping when you have data structured in a complicated multidimensional array (e.g. it is hard to remember what indexes the first vs second vs third dimensions without dimnames).
x["a", "c", "g"]
## [1] 9
A vector is not a matrix with 1 dimension. It has NULL dimensions.
z <- c(1, 2, 3)
dim(z)
## NULL
Exercise: What is the difference between ncol() and NCOL()? Read the help file and demonstrate some code where they provide different results.
Exercise (from Advanced R): How would you describe the following three objects? What makes them different from 1:5?
x1 <- array(1:5, c(1, 1, 5))
x2 <- array(1:5, c(1, 5, 1))
x3 <- array(1:5, c(5, 1, 1))
Exercise: How do you get rid of the dimensions in the following array?
x <- array(1:12, dim = c(2, 2, 3))
The class of an object is an important attribute that controls R’s S3 system for object-oriented programming.
The class of an object determines its behavior when you pass it to a generic function such as print() or summary().
You can create your own S3 classes (Chapter 13).
Here, we will talk about some S3 classes that come with R by default.
You can determine the class of an object with class(), and you can remove the class attribute with unclass().
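To make the link between the class attribute and generic functions concrete, here is a minimal sketch. The class name temperature and its print method are hypothetical, invented only for this illustration; they are not part of base R.

```r
## "temperature" is a hypothetical class name used only for illustration
temp <- structure(c(20.5, 22.1), class = "temperature")

## print() is a generic: it dispatches to print.temperature()
## because of the class attribute
print.temperature <- function(x, ...) {
  cat("Temperatures (C):", format(unclass(x)), "\n")
  invisible(x)
}

class(temp)
## [1] "temperature"
print(temp)
## Temperatures (C): 20.5 22.1
class(unclass(temp)) ## without the class attribute it is a plain double vector
## [1] "numeric"
```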
Factors, Dates, and POSIXct (date-times)
A factor is an integer vector with a class attribute of factor and a levels attribute describing the possible levels.
x <- factor(c("a", "b", "b", "a"))
x
## [1] a b b a
## Levels: a b
typeof(x)
## [1] "integer"
class(x)
## [1] "factor"
attributes(x)
## $levels
## [1] "a" "b"
##
## $class
## [1] "factor"
R also has many methods written specifically for factors, which handle the integer encoding behind the scenes.
Factors are R’s way of storing categorical variables, and are useful when a variable only has a certain number of possible values.
Learn more about factors here.
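A short sketch of the integer encoding described above: the integer codes index into the levels attribute, and a level can exist without being observed. The variable names are chosen only for this example.

```r
x <- factor(c("a", "b", "b", "a"))
as.integer(x)              ## the underlying integer codes
## [1] 1 2 2 1
levels(x)[as.integer(x)]   ## codes index into the levels to recover the labels
## [1] "a" "b" "b" "a"

## A fixed set of possible values, one of which is never observed
y <- factor(c("low", "high"), levels = c("low", "medium", "high"))
table(y)                   ## "medium" is counted as 0 rather than dropped
```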
A Date is a double vector with class attribute Date.
today <- Sys.Date()
typeof(today)
## [1] "double"
attributes(today)
## $class
## [1] "Date"
class(today)
## [1] "Date"
Let’s look at the double underlying today, which counts the days since 1970-01-01:
unclass(today)
## [1] 19031
unclass(as.Date("1970-01-01"))
## [1] 0
Date-time classes are called either POSIXct (Portable Operating System Interface in Unix, Calendar Time) or POSIXlt (Portable Operating System Interface in Unix, Local Time).
POSIXct shows up more often. It is a double representing the number of seconds since the beginning of 1970.
now <- Sys.time()
typeof(now)
## [1] "double"
class(now)
## [1] "POSIXct" "POSIXt"
unclass(now)
## [1] 1.644e+09
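The day and second counts can be checked with fixed inputs. This is only a sketch; the time zone is set to UTC explicitly so that the result does not depend on the local machine.

```r
## Dates count days since 1970-01-01
as.numeric(as.Date("1970-01-11"))
## [1] 10
as.Date(19031, origin = "1970-01-01")  ## convert a day count back to a Date
## [1] "2022-02-08"

## POSIXct counts seconds since 1970-01-01 00:00:00 UTC
one_day_in <- as.POSIXct("1970-01-02", tz = "UTC")
as.numeric(one_day_in)                 ## 24 * 60 * 60 seconds
## [1] 86400
```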
POSIXlt is a named list of vectors with elements representing seconds, minutes, hours, days of the month, months, years, weekdays, etc.
ltvec <- as.POSIXlt(x = c("1980-10-10 01:11:01",
                          "1970-01-11 10:15:22",
                          "2010-05-30 20:01:18"))
typeof(ltvec)
## [1] "list"
unclass(ltvec)
## $sec
## [1] 1 22 18
##
## $min
## [1] 11 15 1
##
## $hour
## [1] 1 10 20
##
## $mday
## [1] 10 11 30
##
## $mon
## [1] 9 0 4
##
## $year
## [1] 80 70 110
##
## $wday
## [1] 5 0 0
##
## $yday
## [1] 283 10 149
##
## $isdst
## [1] 1 0 1
##
## $zone
## [1] "EDT" "EST" "EDT"
##
## $gmtoff
## [1] NA NA NA
You mostly interact with these date-time objects through the {lubridate} package, but base R has its own interfaces (which I think are more difficult to use).
Learn more about dates and date-times here.
Exercise (from Advanced R): table() will take as input a vector or vectors and count how many observations have each value. What sort of object does table() return? What is its type? What attributes does it have? How does the dimensionality change as you tabulate more variables?
In many applications, you will want to create empty vectors or vectors filled with missing values.
Create an empty vector with vector().
vector(mode = "character", length = 0)
## character(0)
vector(mode = "double", length = 0)
## numeric(0)
vector(mode = "integer", length = 0)
## integer(0)
vector(mode = "logical", length = 0)
## logical(0)
Shorthand for this is
character()
## character(0)
double()
## numeric(0)
integer()
## integer(0)
logical()
## logical(0)
Empty vectors often show up as the default return value when someone asks for something of length 0.
E.g., if you are simulating something, you might return a vector of length 0 if the user asks for 0 elements.
f <- function(n) {
  sout <- double(n)
  for (i in seq_len(n)) {
    sout[[i]] <- simcode(...) ## put simulation code here
  }
  return(sout)
}
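A runnable version of the skeleton above might look like the following sketch. The function name sim_means and the use of mean(rnorm(5)) as the simulation step are stand-ins chosen for illustration.

```r
## sim_means() is a hypothetical example; mean(rnorm(5)) stands in
## for whatever simulation code you actually need
sim_means <- function(n) {
  sout <- double(n)
  for (i in seq_len(n)) {
    sout[[i]] <- mean(rnorm(5))
  }
  return(sout)
}

sim_means(0)   ## the loop body never runs, so an empty vector comes back
## numeric(0)
length(sim_means(3))
## [1] 3

## seq_len(n) matters here: 1:n would iterate over c(1, 0) when n is 0
1:0
## [1] 1 0
```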
You often want to create an empty vector that you then fill in with values. I like to fill this vector with missing values, so that I know I made a mistake if they are not all replaced.
n <- 100
x <- rep(NA_character_, length.out = n)
x <- rep(NA_integer_, length.out = n)
x <- rep(NA_real_, length.out = n)
x <- rep(NA, length.out = n)
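As a sketch of why the missing-value fill helps, the loop below deliberately stops one element short (a made-up bug), and the leftover NA exposes it:

```r
n <- 5
x <- rep(NA_real_, length.out = n)

## Deliberate bug for illustration: the loop skips the last element
for (i in seq_len(n - 1)) {
  x[[i]] <- i^2
}

anyNA(x)          ## TRUE signals that something was never filled in
## [1] TRUE
which(is.na(x))   ## and this shows where
## [1] 5
```

Had x been created with double(n), the last element would silently be 0 and the mistake would be harder to spot.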
E.g. in a for-loop, you often fill in the elements of a vector. Let’s suppose we are evaluating the performance of the mean in a simulation study.
nsim <- 1000 ## number of simulations
nsamp <- 10 ## sample size
mvec <- rep(NA_real_, length.out = nsim)
true_mean <- 0
for (i in seq_len(nsim)) {
  mvec[[i]] <- mean(rnorm(n = nsamp, mean = true_mean))
}
mean((mvec - true_mean)^2) ## mean squared error
## [1] 0.09854
If you are filling in the values of a matrix, you need to be able to create a matrix with missing values.
n <- 100
p <- 3
matval <- matrix(NA_character_, nrow = p, ncol = n)
matval <- matrix(NA_real_, nrow = p, ncol = n)
matval <- matrix(NA_integer_, nrow = p, ncol = n)
matval <- matrix(NA, nrow = p, ncol = n)
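Filling such a matrix works the same way as filling a vector. The sketch below assumes a made-up setup (sample means for a few sample sizes, one row per size); the names sizes and mmat are illustrative only.

```r
nsim <- 4                 ## simulations per sample size
sizes <- c(5, 10, 20)     ## hypothetical sample sizes, one per row
mmat <- matrix(NA_real_, nrow = length(sizes), ncol = nsim)

for (i in seq_along(sizes)) {
  for (j in seq_len(nsim)) {
    mmat[i, j] <- mean(rnorm(n = sizes[[i]]))
  }
}

dim(mmat)
## [1] 3 4
anyNA(mmat)   ## FALSE confirms that every cell was filled
## [1] FALSE
```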
Lists are like vectors except each element can be of any type.
You create lists with list().
lobj <- list(a = 1:3,
             log_val = TRUE,
             list(c = 10))
You can view a list with str().
str(lobj)
## List of 3
## $ a : int [1:3] 1 2 3
## $ log_val: logi TRUE
## $ :List of 1
## ..$ c: num 10
c() will combine lists into a single list. If you use c() with a list and a vector, then it will first coerce the vector into a list in which each element of the vector becomes its own list element.
l1 <- list(1:2,
           c("a", "b"))
l2 <- list(c(TRUE, FALSE))
c(l1, l2)
## [[1]]
## [1] 1 2
##
## [[2]]
## [1] "a" "b"
##
## [[3]]
## [1] TRUE FALSE
c(l1, c("c", "d"))
## [[1]]
## [1] 1 2
##
## [[2]]
## [1] "a" "b"
##
## [[3]]
## [1] "c"
##
## [[4]]
## [1] "d"
as.list(c("c", "d")) ## this is what it does before combining
## [[1]]
## [1] "c"
##
## [[2]]
## [1] "d"
typeof() will return "list" and is.list() tests for a list.
typeof(l1)
## [1] "list"
is.list(l1)
## [1] TRUE
Use unlist() to remove the list structure.
l1
## [[1]]
## [1] 1 2
##
## [[2]]
## [1] "a" "b"
unlist(l1)
## [1] "1" "2" "a" "b"
The dim attribute can be applied to lists.
lmat <- list(1:2,
             3:10,
             runif(4),
             c("Hello", "world"))
dim(lmat) <- c(2, 2)
lmat
## [,1] [,2]
## [1,] integer,2 numeric,4
## [2,] integer,8 character,2
lmat[[1, 2]]
## [1] 0.7492 0.4440 0.7360 0.5687
Data frames are lists where each element is a vector of the same length, with names, row.names, and class attributes.
df <- data.frame(a = 4:6,
                 b = c("A", "B", "C"))
typeof(df)
## [1] "list"
attributes(df)
## $names
## [1] "a" "b"
##
## $class
## [1] "data.frame"
##
## $row.names
## [1] 1 2 3
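Because a data frame is a list of columns, the usual list tools apply to it. A brief sketch (stringsAsFactors = FALSE is set explicitly so the result is the same on R versions before 4.0.0):

```r
df <- data.frame(a = 4:6,
                 b = c("A", "B", "C"),
                 stringsAsFactors = FALSE)

is.list(df)
## [1] TRUE
length(df)          ## the number of columns, not rows
## [1] 2
df[["a"]]           ## list-style extraction returns the column vector
## [1] 4 5 6
sapply(df, typeof)  ## apply a function to each column
##           a           b
##   "integer" "character"
```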
Above, the “names” attribute holds the column names, and you can get them with colnames() or names().
colnames(df)
## [1] "a" "b"
names(df)
## [1] "a" "b"
The row.names are the row names, and you can obtain them with row.names() or rownames().
row.names() is specifically for data frames, whereas rownames() was designed for extracting dimnames and was also altered to work with data frames.
row.names(df)
## [1] "1" "2" "3"
rownames(df)
## [1] "1" "2" "3"
Those row names are automatically generated, but you can set them with rownames().
rownames(df) <- c("h", "i", "j")
df
## a b
## h 4 A
## i 5 B
## j 6 C
Tibbles, from the package {tibble}, are tidyverse data frames. The main differences are:
Tibbles do not automatically coerce data (such as from strings to factors). Data frames used to do this in older versions of R (before 4.0.0).
data.frame(x = c("a", "b", "c"),
stringsAsFactors = FALSE) ## needed to be safe for older versions of R
## x
## 1 a
## 2 b
## 3 c
tibble::tibble(x = c("a", "b", "c"))
## # A tibble: 3 × 1
## x
## <chr>
## 1 a
## 2 b
## 3 c
Tibbles do not change names if they happen to be non-syntactic (e.g. have spaces in them)
data.frame(`hello world` = c(1, 2, 3))
## hello.world
## 1 1
## 2 2
## 3 3
tibble::tibble(`hello world` = c(1, 2, 3))
## # A tibble: 3 × 1
## `hello world`
## <dbl>
## 1 1
## 2 2
## 3 3
Tibbles will only recycle vectors of length 1.
data.frame(x = c(1, 2, 3, 4),
y = c(1, 2))
## x y
## 1 1 1
## 2 2 2
## 3 3 1
## 4 4 2
tibble::tibble(x = c(1, 2, 3, 4),
               y = c(1, 2))
## Error:
## ! Tibble columns must have compatible sizes.
## • Size 4: Existing data.
## • Size 2: Column `y`.
## ℹ Only values of size one are recycled.
Tibbles do not reduce to vectors when you subset one column. Folks disagree on whether this is good or bad.
df <- data.frame(`hello world` = c(1, 2, 3))
tib <- tibble::tibble(`hello world` = c(1, 2, 3))
attributes(df[, 1])
## NULL
attributes(tib[, 1])
## $names
## [1] "hello world"
##
## $row.names
## [1] 1 2 3
##
## $class
## [1] "tbl_df" "tbl" "data.frame"
Data frames allow for row names; tibbles do not. Folks disagree on whether row names are desirable (Hadley is extremely against them).
Tibbles print differently than data frames. Tibbles only print 10 rows and only the columns that will fit on screen. I actually prefer the data frame method, because pretty doesn’t matter when you are doing data analysis, and it’s better to see all columns.
Exercise: Based on our discussion of making zero-length vectors, create a data frame with zero rows and columns a and b. Both should be double columns.
Exercise: What does data.frame() do without any arguments?
Exercise: Use the row.names argument of data.frame() to create a data frame with 100 rows and no columns.
NULL is its own data type that always has length 0.
typeof(NULL)
## [1] "NULL"
length(NULL)
## [1] 0
NULL is used to represent an empty vector.
c()
## NULL
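A few consequences of NULL behaving like an empty vector, as a brief sketch (the variable res is illustrative):

```r
c(1, NULL, 2)        ## NULL contributes nothing when combined
## [1] 1 2
is.null(c())         ## c() with no arguments is NULL
## [1] TRUE
length(list(NULL))   ## but a list can hold NULL as an element
## [1] 1

## Starting from NULL and growing with c() works, though preallocating
## is usually the better habit for long loops
res <- NULL
for (i in 1:3) {
  res <- c(res, i^2)
}
res
## [1] 1 4 9
```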
NULL is often used as a default argument in a function for complicated arguments. The function operates one way unless a user specifies something for that argument. Look at ?ashr::ash.workhorse for multiple examples.
f <- function(x = NULL) {
  if (is.null(x)) {
    ## do something
  } else {
    ## do something else
  }
}
E.g., let’s create a function wmean that calculates a weighted mean if weights are provided, and the sample mean otherwise.
wmean <- function(x, w = NULL) {
  if (is.null(w)) {
    w <- rep(1 / length(x), length.out = length(x))
  } else {
    w <- w / sum(w)
  }
  return(sum(x * w))
}
x <- c(1, 2, 3)
wmean(x)
## [1] 2
wmean(x, w = c(5, 1, 1))
## [1] 1.429
There are two alternative strategies to this. First, use missingArg() to test if an argument is missing.
wmean <- function(x, w) {
  if (missingArg(w)) {
    w <- rep(1 / length(x), length.out = length(x))
  } else {
    w <- w / sum(w)
  }
  return(sum(x * w))
}
x <- c(1, 2, 3)
wmean(x)
## [1] 2
wmean(x, w = c(5, 1, 1))
## [1] 1.429
This works because of lazy evaluation (which we will learn about later).
I don’t like this because it is confusing to the user, who thinks w is a required argument.
Second, you can include more complicated defaults.
wmean <- function(x, w = rep(1, length(x))) {
  w <- w / sum(w)
  return(sum(x * w))
}
x <- c(1, 2, 3)
wmean(x)
## [1] 2
wmean(x, w = c(5, 1, 1))
## [1] 1.429
I don’t like this because default arguments are evaluated inside the function, but user-provided arguments are evaluated outside the function (more on this later). This can lead to strange results.
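One way such a strange result can arise is sketched below. The function g is hypothetical and written only to expose when the default is evaluated: because the default is computed inside the function the first time n is used, it sees the modified x.

```r
## g() is a made-up function for illustration
g <- function(x, n = length(x)) {
  x <- x[x > 0]   ## x is modified before n is first used
  n               ## the default length(x) is only evaluated here
}

y <- c(-1, 2, 4)
g(y)                 ## default evaluated inside: sees the shortened x
## [1] 2
g(y, n = length(y))  ## user-supplied value evaluated outside: sees the original y
## [1] 3
```

The two calls look equivalent to a user reading the function signature, yet they return different values.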
NULL is one of the ways R handles missingness. The others are NA and NaN.
NULL: An empty object. Can be thought of as a zero-length vector.
NA: A missing value. Can be used as an element of a vector.
NaN: Undefined numeric values, such as the output of 0/0.
typeof(): Determine the type of an object (character, double, integer, or logical).
attr(): Get or set an attribute.
attributes(): View all attributes.
structure(): Create an object with many attributes.
names(): Get or set the names attribute.
unname(): Remove the names attribute.
dim(): Get or set the dim attribute.
class(): Get or set the class attribute.
unclass(): Remove the class attribute.