Subsetting is extracting elements from an object.
There are six ways to subset an atomic vector.
x <- c(8, 1.2, 33, 14)
Positive integers in brackets extract the elements at those positions. R starts counting at 1.
x[1]
## [1] 8
x[c(1, 3)]
## [1] 8 33
iset <- c(1, 3)
x[iset]
## [1] 8 33
Integer subsetting can be used for sorting: order() returns the positions that put the vector in increasing order.
order(x)
## [1] 2 1 4 3
x[order(x)]
## [1] 1.2 8.0 14.0 33.0
You can use duplicate integers to extract elements more than once.
x[c(2, 2, 2)]
## [1] 1.2 1.2 1.2
Negative integers return all elements except those at the given positions.
x[-1]
## [1] 1.2 33.0 14.0
x[c(-1, -3)]
## [1] 1.2 14.0
x[-c(1, 3)]
## [1] 1.2 14.0
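Positive and negative integers cannot be mixed in one subscript. The sketch below uses a made-up vector `v` (not part of the notes) and captures the resulting error with tryCatch() so the script keeps running. It also shows that dropping positions is the same as keeping their complement.

```r
v <- c(8, 1.2, 33, 14)

# Mixing signs is an error, so capture the condition instead of stopping
res <- tryCatch(v[c(-1, 2)], error = function(e) "error")
res

# Dropping positions 1 and 3 is equivalent to keeping the other positions
keep <- setdiff(seq_along(v), c(1, 3))
identical(v[-c(1, 3)], v[keep])
```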
Logical vectors: wherever there is a TRUE, the corresponding element is returned.
x[c(TRUE, FALSE, TRUE, FALSE)]
## [1] 8 33
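In practice the logical vector usually comes from a comparison rather than being typed by hand. A minimal sketch, using a made-up vector `v` and an arbitrary cutoff of 10:

```r
v <- c(8, 1.2, 33, 14)

# A comparison returns a logical vector the same length as v
big <- v > 10
big

# Use it inside the brackets to keep the matching elements
v[big]

# which() converts the logical vector to integer positions
which(big)
v[which(big)]
```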
Empty brackets will return the original object.
x[]
## [1] 8.0 1.2 33.0 14.0
Using 0 in a bracket will return a zero-length vector.
x[0]
## numeric(0)
If a vector has names, then you can subset using those names in quotes.
names(x) <- c("a", "b", "c", "d")
x["a"]
## a
## 8
x[c("a", "c")]
## a c
## 8 33
x[c("a", "a")]
## a a
## 8 8
If you know which names you want to remove, use setdiff().
setdiff(names(x), "a")
## [1] "b" "c" "d"
x[setdiff(names(x), "a")]
## b c d
## 1.2 33.0 14.0
Exercise: Explain the output of the following:
y <- 1:9
y[c(TRUE, TRUE, FALSE)]
## [1] 1 2 4 5 7 8
y[TRUE]
## [1] 1 2 3 4 5 6 7 8 9
y[FALSE]
## integer(0)
Exercise: Explain the output of the following:
y <- c(1, 2)
y[c(TRUE, TRUE, FALSE, TRUE, TRUE, FALSE)]
## [1] 1 2 NA NA
Exercise: Show all the ways to extract the second element of the following vector:
y <- c(af = 3, bd = 6, dd = 2)
Double brackets enforce that you extract exactly one element. This is useful in places where you know you should subset only one element (such as for-loops).
x <- runif(100)
sval <- 0
for (i in seq_along(x)) {
  sval <- sval + x[[i]]
}
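To see the "exactly one element" rule in action, the sketch below asks double brackets for two elements of an atomic vector. The vector `v` is a made-up example, and tryCatch() is used only to capture the error so the script does not halt.

```r
v <- c(8, 1.2, 33, 14)

# Single brackets happily return several elements
v[c(1, 2)]

# Double brackets refuse; capture the error rather than halting the script
res <- tryCatch(v[[c(1, 2)]], error = function(e) "error")
res
```

Inside a loop, this means an accidental vector index fails immediately instead of silently returning several values.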
Double brackets remove attributes of the vector (even names).
x <- c(a = 1, b = 2)
x[1]
## a
## 1
x[[1]]
## [1] 1
Include row and column indices, separated by a comma.
x <- matrix(1:6, ncol = 2, nrow = 3)
x
## [,1] [,2]
## [1,] 1 4
## [2,] 2 5
## [3,] 3 6
x[1, 2]
## [1] 4
x[2, 2]
## [1] 5
Leave an index empty to get the whole row or the whole column.
x[1, ]
## [1] 1 4
x[, 1]
## [1] 1 2 3
If you want the result to stay a matrix (not convert to a vector), use drop = FALSE.
x[, 1, drop = FALSE]
## [,1]
## [1,] 1
## [2,] 2
## [3,] 3
x[1, , drop = FALSE]
## [,1] [,2]
## [1,] 1 4
You can include vectors of indices to get submatrices.
x[c(1, 3), 1:2]
## [,1] [,2]
## [1,] 1 4
## [2,] 3 6
If you subset a matrix using just a single vector of indices, the elements are taken in column-major order, i.e., going down the first column, then the second column, then the third column, and so on.
x[4:5]
## [1] 4 5
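Column-major order means that element (i, j) of a matrix with n rows sits at linear position (j - 1) * n + i. A small sketch with a made-up matrix `m` (same shape as the one above) that checks this and uses arrayInd() to go back from a linear position to a row and column:

```r
m <- matrix(1:6, nrow = 3, ncol = 2)

# Position of element (i, j) when the matrix is read column by column
i <- 2
j <- 2
k <- (j - 1) * nrow(m) + i
k

# Both forms pick out the same element
m[k]
m[i, j]

# arrayInd() goes the other way: linear index to (row, column)
arrayInd(k, dim(m))
```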
You can also subset a matrix by providing a matrix of indices. The first column contains the row indices and the second column contains the column indices.
imat <- matrix(c(1, 3, 1, 2), nrow = 2)
imat
## [,1] [,2]
## [1,] 1 1
## [2,] 3 2
## extract (1, 1) and (3, 2) elements
x[imat]
## [1] 1 6
For arrays, just add more commas.
x <- array(1:30, dim = c(2, 3, 5))
x[2, 3, 4]
## [1] 24
x[1, , ]
## [,1] [,2] [,3] [,4] [,5]
## [1,] 1 7 13 19 25
## [2,] 3 9 15 21 27
## [3,] 5 11 17 23 29
x[2, , 1]
## [1] 2 4 6
If you subset a list using single brackets, you will get a sublist. You can use integers, negative integers, logicals, and names as before.
x <- list(a = 1:3, b = "hello", c = 4:6)
str(x)
## List of 3
## $ a: int [1:3] 1 2 3
## $ b: chr "hello"
## $ c: int [1:3] 4 5 6
x[1]
## $a
## [1] 1 2 3
x[c(1, 3)]
## $a
## [1] 1 2 3
##
## $c
## [1] 4 5 6
x[-1]
## $b
## [1] "hello"
##
## $c
## [1] 4 5 6
x[c(TRUE, FALSE, FALSE)]
## $a
## [1] 1 2 3
x["a"]
## $a
## [1] 1 2 3
x[c("a", "c")]
## $a
## [1] 1 2 3
##
## $c
## [1] 4 5 6
Using double brackets extracts a single element.
x[[1]]
## [1] 1 2 3
x[["a"]]
## [1] 1 2 3
A shorthand for using names inside double brackets is to use dollar signs.
x$a
## [1] 1 2 3
Exercise: Why does this not work? Suggest a correction.
var <- "a"
x$var
## NULL
Data frame subsetting behaves both like lists and like matrices.
df <- data.frame(a = 1:3,
                 b = c("a", "b", "c"),
                 c = 4:6)
It behaves like a list for $, [[, and [ if you only provide one index. The columns are the elements of the list.
df$a
## [1] 1 2 3
df[1]
## a
## 1 1
## 2 2
## 3 3
df[[1]]
## [1] 1 2 3
df[c(1, 3)]
## a c
## 1 1 4
## 2 2 5
## 3 3 6
It behaves like a matrix if you provide two indices.
df[1:2, 2]
## [1] "a" "b"
You can keep the data frame structure by using drop = FALSE.
df[1:2, 2, drop = FALSE]
## b
## 1 a
## 2 b
It is common to filter rows using matrix-style indexing.
df[df$a < 3, ]
## a b c
## 1 1 a 4
## 2 2 b 5
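One caveat with this kind of row filter is that a condition evaluating to NA produces a row of NAs instead of dropping the row. The sketch below uses a made-up data frame `d` with a missing value to show the effect, and wraps the condition in which() as one possible workaround.

```r
d <- data.frame(a = c(1, NA, 3), b = c("a", "b", "c"))

# The condition is TRUE, NA, FALSE; the NA yields a row of NAs
d[d$a < 3, ]

# which() keeps only the positions that are TRUE, so the NA is dropped
d[which(d$a < 3), ]
```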
Fix each of the following common data frame subsetting errors:
mtcars[mtcars$cyl = 4, ]
mtcars[-1:4, ]
mtcars[mtcars$cyl <= 5]
mtcars[mtcars$cyl == 4 | 6, ]
Why does the following code yield five missing values? (Hint: why is it different from x[NA_real_]?)
x <- 1:5
x[NA]
## [1] NA NA NA NA NA
What does upper.tri() return? How does subsetting a matrix with it work?
x <- outer(1:5, 1:5, FUN = "*")
x[upper.tri(x)]
## [1] 2 3 6 4 8 12 5 10 15 20
Why does mtcars[1:20] return an error? How does it differ from the similar mtcars[1:20, ]?
An lm object is a list-like object. Given a linear model, e.g., mod <- lm(mpg ~ wt, data = mtcars), extract the residual degrees of freedom. Then extract the R squared from the model summary (summary(mod)).
All subsetting operators can be used to assign subsets of a vector new values. This is called subassignment.
x <- 1:5
x[[2]] <- 200
x
## [1] 1 200 3 4 5
x[c(1, 3)] <- 0
x
## [1] 0 200 0 4 5
x[x == 0] <- NA_real_
x
## [1] NA 200 NA 4 5
y <- list(a = 1:3,
          b = "hello",
          c = 4:6)
y$a <- "no way"
y
## $a
## [1] "no way"
##
## $b
## [1] "hello"
##
## $c
## [1] 4 5 6
Remove a list element with NULL.
y[[1]] <- NULL
y
## $b
## [1] "hello"
##
## $c
## [1] 4 5 6
y$b <- NULL
y
## $c
## [1] 4 5 6
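Because assigning NULL deletes the element, storing an actual NULL in a list needs a different form. A minimal sketch with a made-up list `z`, showing that single-bracket assignment of list(NULL) keeps the slot:

```r
z <- list(a = 1:3, b = "hello")

# Assigning NULL removes the element
z$a <- NULL
length(z)

# Assigning list(NULL) with single brackets keeps the slot, holding NULL
z["b"] <- list(NULL)
length(z)
is.null(z$b)
```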
Lookup tables: use a vector of values to subset a named vector, which recodes each value.
x <- c("m", "f", "u", "f", "f", "m", "m")
lookup <- c(m = "Male", f = "Female", u = NA)
lookup[x]
## m f u f f m m
## "Male" "Female" NA "Female" "Female" "Male" "Male"
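The result of a lookup carries the codes along as names. If only the recoded values are wanted, unname() drops them. A short sketch; the names `codes` and `recoded` are made up for illustration:

```r
codes <- c("m", "f", "u", "f")
lookup <- c(m = "Male", f = "Female", u = NA)

# Names from the lookup vector are carried along
lookup[codes]

# unname() drops them, leaving only the recoded values
recoded <- unname(lookup[codes])
recoded
```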
Resampling approaches are often used in Statistics to obtain standard errors, confidence intervals, and p-values.
Resampling: you draw observations from the data with replacement, so some observations appear more than once and others not at all.
We typically resample entire rows of data frames (though not always, we’ll have a homework about this).
## Create fake data
df <- data.frame(x = 1:10)
df$y <- df$x * 2 + rnorm(nrow(df))
## obtain indices of rows to sample
ind <- sample(seq_len(nrow(df)), replace = TRUE)
df_samp <- df[ind, ]
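Repeating the row resampling many times and recomputing a statistic each time gives a rough standard error for that statistic. The sketch below does this for the slope of a simple linear regression. The choice of statistic, the seed, and the number of resamples (`n_boot`) are illustrative assumptions, not part of the notes.

```r
set.seed(1)  # arbitrary seed so the sketch is reproducible

# Fake data with the same structure as above
df <- data.frame(x = 1:10)
df$y <- df$x * 2 + rnorm(nrow(df))

n_boot <- 200  # number of resamples; an arbitrary choice
slopes <- numeric(n_boot)
for (b in seq_len(n_boot)) {
  # resample entire rows with replacement
  ind <- sample(seq_len(nrow(df)), replace = TRUE)
  df_samp <- df[ind, ]
  # refit the model on the resampled rows and keep the slope
  slopes[[b]] <- coef(lm(y ~ x, data = df_samp))[["x"]]
}

# The spread of the resampled slopes estimates the standard error
sd(slopes)
```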
These are just meant to buff up your Base R skills. Consider the data from the {Sleuth3} package that contains information on sex and salary at a bank.
library(Sleuth3)
data("case0102")
sal <- case0102
What is the salary of the person in the 51st row? Use two different subsetting strategies to get this.
What is the mean salary of Males?
How many Females are in the data?
How many Females make over $6000?