R is a statistical programming language designed to analyze data.
This is not an R course. But you need to know some tools to summarize/plot/model data.
R is free, widely used, more generally applicable (beyond linear regression), and a useful tool for reproducibility. So this is what we will use.
Python would have been a good choice too, but you can learn that in the machine learning course (STAT 427/627).
Install here R : https://cran.r-project.org/
Install R Studio here: https://www.rstudio.com/
NOTE: R is a programming language. R Studio is an IDE, a program for interacting with programming language (specifically R in this case). Thus, on your resume, you should say that you know R, not R Studio.
I cannot teach you everything there is to know in R. When you
know the name of a function, but don’t know the commands, use the
help()
function. For example, to learn more about
log()
type
help(log)
Alternatively, if you do not know the name of the function, you can Google the functionality you want. Googling coding solutions is a lot of what real data scientists do. Just append what you are Googling with “in R”. So, for example, “linear mixed effects models in R”.
When you first open up R Studio, it should look something like this
The area to the right of the carrot “>
” is called
a prompt. You insert commands into the prompt.
You can use R as a powerful calculator. Try typing some of the following into the command prompt:
3 * 7
9 / 3
4 + 6
3 - 9
(3 + 5) * 6
3 ^ 2
4 ^ 2
Exercise: What do you think the following will evaluate to? Try to guess before running it in R:
8 / 4 + 3 * 2 ^ 2
R consists of two things: variables and functions (computer scientists would probably disagree with this categorization).
A variable stores a value. You use the assignment
operator “<-
” to assign values to variables.
For example, we can assign the value of 10
to the variable
x
.
x <- 10
=
, and I think there is nothing
wrong with that. But for some reason the field has decided to only use
<-
, so you should too.Whenever we use x
later, it will use the value of
10
x
## [1] 10
This is useful because you can reuse this value over and over again:
y <- 0
x + y
x * y
x / y
x - y
To assign a “string” (a fancy way to say a word) to
x
, put the string in quotes. For example, we can assign the
value of "Hello World"
to x
.
x <- "Hello World"
x
## [1] "Hello World"
Functions take objects (such as numbers or variables) as input and output new objects. Let’s look at a simple function that takes the log of a number:
log(x = 4, base = 2)
The inputs are called “arguments”. Generally, every function will be for the form:
function_name(arg1 = val1, arg2 = val2, ...)
If you do not specify the name of the argument, R will assume you are assigning in their order.
log(4, 2)
You can change the order of the arguments if you specify them.
log(base = 2, x = 4)
To see the list of all possible arguments of a function, use the
help()
function:
help(log)
In the help file, there are often default values
for an argument. For example, the following indicates the the default
value of base
is exp(1)
.
log(x, base = exp(1))
This indicates that you can omit the base
argument
and R will assume that it should be exp(1)
.
log(x = 4, base = exp(1))
## [1] 1.386
log(x = 4)
## [1] 1.386
If an argument does not have a default, then it must be specified when calling a function.
Type this:
log(x = 4,
The “+
” indicates that R is expecting more input
(you forgot either a parentheses or a quotation mark). You can get back
to the prompt by hitting the ESCAPE key.
c()
creates a vector (sequence of
values)
y <- c(8, 1, 3, 4, 2)
y
## [1] 8 1 3 4 2
You can perform vectorized operations on these vectors
y + 2
## [1] 10 3 5 6 4
y / 2
## [1] 4.0 0.5 1.5 2.0 1.0
y - 2
## [1] 6 -1 1 2 0
exp()
: Exponentiation. This is the inverse of
log()
.
exp(10)
## [1] 22026
log(exp(10))
## [1] 10
mean()
: The mean of a vector
mean(y)
## [1] 3.6
sd()
The standard deviation of a vector
sd(y)
## [1] 2.702
sum()
: Sum the values of a vector.
sum(y)
## [1] 18
seq()
: Create a sequence of numbers
seq(from = 1, to = 10)
## [1] 1 2 3 4 5 6 7 8 9 10
Exercise: What does the by
argument
do in seq()
? Read the help file and modify it to
2
in the example code above.
head()
: Show the first six values of an
object.
A package is a collection of functions that don’t come with R by default.
There are many many packages available. If you need to do any data analysis, there is probably an R package for it.
Using install.packages()
, you can install packages
that contain functions and datasets that are not available by default.
Do this now with the tidyverse package:
install.packages("tidyverse")
You will only need to install a package once per computer. Once
it is installed you can gain access to all of the functions and datasets
in a package by using the library()
function.
library(tidyverse)
You will need to run library()
at the start of every
R session if you want to use the functions in a package.
When I want to write the name of a function, I will write it like
this()
.
The fundamental unit object of data analysis is the data frame.
A data frame has variables in the columns, and observations in the rows.
R comes with a bunch of famous datasets in the form of a data
frame. Such as the mtcars
dataset.
data("mtcars")
mtcars
## mpg cyl disp hp drat wt qsec vs am gear carb
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258.0 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360.0 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225.0 105 2.76 3.460 20.22 1 0 3 1
## Duster 360 14.3 8 360.0 245 3.21 3.570 15.84 0 0 3 4
## Merc 240D 24.4 4 146.7 62 3.69 3.190 20.00 1 0 4 2
## Merc 230 22.8 4 140.8 95 3.92 3.150 22.90 1 0 4 2
## Merc 280 19.2 6 167.6 123 3.92 3.440 18.30 1 0 4 4
## Merc 280C 17.8 6 167.6 123 3.92 3.440 18.90 1 0 4 4
## Merc 450SE 16.4 8 275.8 180 3.07 4.070 17.40 0 0 3 3
## Merc 450SL 17.3 8 275.8 180 3.07 3.730 17.60 0 0 3 3
## Merc 450SLC 15.2 8 275.8 180 3.07 3.780 18.00 0 0 3 3
## Cadillac Fleetwood 10.4 8 472.0 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460.0 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440.0 230 3.23 5.345 17.42 0 0 3 4
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1
## Toyota Corona 21.5 4 120.1 97 3.70 2.465 20.01 1 0 3 1
## Dodge Challenger 15.5 8 318.0 150 2.76 3.520 16.87 0 0 3 2
## AMC Javelin 15.2 8 304.0 150 3.15 3.435 17.30 0 0 3 2
## Camaro Z28 13.3 8 350.0 245 3.73 3.840 15.41 0 0 3 4
## Pontiac Firebird 19.2 8 400.0 175 3.08 3.845 17.05 0 0 3 2
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2
You can extract individual variables from a data frame using
$
mtcars$mpg
## [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
## [31] 15.0 21.4
You can explore these in a spreadsheet format using
View()
(note the capital “V”).
View(mtcars)
Most datasets will nead to be loaded into R. To do so, we will
use the {readr}
package.
library(readr)
The only function I will require you to know from this package is
read_csv()
, which loads in data from a CSV file
(“Comma-separated values”), a very popular format for storing
data.
If you have the CSV file somewhere on your computer, then specify the path from the current working directory, and assign the data frame to a variable.
For other file formats, you need to use other functions, such as
read_tsv()
, read_table()
,
read_fwf()
, etc. I will try to make sure
read_csv()
works for all datasets in this course.
I will typicaly post course datasets at https://dcgerard.github.io/stat_415_615/data.html. You
can load those data into R by pasting their URL’s into
read_csv()
.
copiers <- read_csv("https://dcgerard.github.io/stat_415_615/data/copiers.csv")
## Rows: 45 Columns: 3
## ── Column specification ────────────────────────────────────────────────────────
## Delimiter: ","
## chr (1): model
## dbl (2): minutes, copiers
##
## ℹ Use `spec()` to retrieve the full column specification for this data.
## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
head(copiers)
## # A tibble: 6 × 3
## minutes copiers model
## <dbl> <dbl> <chr>
## 1 20 2 S
## 2 60 4 L
## 3 46 3 L
## 4 41 2 L
## 5 12 1 L
## 6 137 10 L
Exercise: Load in the County Demographic Information data into R and print out the first six rows.
You will need to know just a few data frame manipulations, which
we will perform using the {dplyr}
package.
library(dplyr)
The first argument for {dplyr}
functions is always
the data frame you are modifying. The following arguments typically
involve the columns of that data frame.
Use the mutate()
function from the
{dplyr}
package to make variable transformations.
mtcars <- mutate(mtcars, kpg = mpg * 1.61)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs am gear carb kpg
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4 33.81
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4 33.81
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1 36.71
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1 34.45
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2 30.11
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1 29.14
Use glimpse()
to get a brief look at the data
frame.
glimpse(mtcars)
## Rows: 32
## Columns: 12
## $ mpg <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
## $ cyl <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
## $ hp <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
## $ wt <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
## $ vs <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
## $ am <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
## $ kpg <dbl> 33.81, 33.81, 36.71, 34.45, 30.11, 29.14, 23.02, 39.28, 36.71, 30…
Use View()
to see a spreadsheet of the data frame
(never put this in an R Markdown file). Note the capital “V”.
View(mtcars)
Use rename()
to rename variables.
mtcars <- rename(mtcars, auto_man = am)
head(mtcars)
## mpg cyl disp hp drat wt qsec vs auto_man gear carb
## Mazda RX4 21.0 6 160 110 3.90 2.620 16.46 0 1 4 4
## Mazda RX4 Wag 21.0 6 160 110 3.90 2.875 17.02 0 1 4 4
## Datsun 710 22.8 4 108 93 3.85 2.320 18.61 1 1 4 1
## Hornet 4 Drive 21.4 6 258 110 3.08 3.215 19.44 1 0 3 1
## Hornet Sportabout 18.7 8 360 175 3.15 3.440 17.02 0 0 3 2
## Valiant 18.1 6 225 105 2.76 3.460 20.22 1 0 3 1
## kpg
## Mazda RX4 33.81
## Mazda RX4 Wag 33.81
## Datsun 710 36.71
## Hornet 4 Drive 34.45
## Hornet Sportabout 30.11
## Valiant 29.14
Use filter()
to remove rows.
==
to select rows based on equality<
and >
to select rows based on
inequality<=
and >=
to select rows based
on inequality/equality.filter(mtcars, auto_man == 1)
## mpg cyl disp hp drat wt qsec vs auto_man gear carb kpg
## Mazda RX4 21.0 6 160.0 110 3.90 2.620 16.46 0 1 4 4 33.81
## Mazda RX4 Wag 21.0 6 160.0 110 3.90 2.875 17.02 0 1 4 4 33.81
## Datsun 710 22.8 4 108.0 93 3.85 2.320 18.61 1 1 4 1 36.71
## Fiat 128 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 52.16
## Honda Civic 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 48.94
## Toyota Corolla 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 54.58
## Fiat X1-9 27.3 4 79.0 66 4.08 1.935 18.90 1 1 4 1 43.95
## Porsche 914-2 26.0 4 120.3 91 4.43 2.140 16.70 0 1 5 2 41.86
## Lotus Europa 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2 48.94
## Ford Pantera L 15.8 8 351.0 264 4.22 3.170 14.50 0 1 5 4 25.44
## Ferrari Dino 19.7 6 145.0 175 3.62 2.770 15.50 0 1 5 6 31.72
## Maserati Bora 15.0 8 301.0 335 3.54 3.570 14.60 0 1 5 8 24.15
## Volvo 142E 21.4 4 121.0 109 4.11 2.780 18.60 1 1 4 2 34.45
filter(mtcars, mpg < 15)
## mpg cyl disp hp drat wt qsec vs auto_man gear carb
## Duster 360 14.3 8 360 245 3.21 3.570 15.84 0 0 3 4
## Cadillac Fleetwood 10.4 8 472 205 2.93 5.250 17.98 0 0 3 4
## Lincoln Continental 10.4 8 460 215 3.00 5.424 17.82 0 0 3 4
## Chrysler Imperial 14.7 8 440 230 3.23 5.345 17.42 0 0 3 4
## Camaro Z28 13.3 8 350 245 3.73 3.840 15.41 0 0 3 4
## kpg
## Duster 360 23.02
## Cadillac Fleetwood 16.74
## Lincoln Continental 16.74
## Chrysler Imperial 23.67
## Camaro Z28 21.41
Exercise: Calculate the log-displacement, add
this to the mtcars
data frame.
Exercise: Filter out cars with only one carborator (keep cars with more than 1 carborator).
Exercise: Rename the hp
variable to
horse
.
Here is the list of basic R stuff I expect you to know off the top of your head. We will add to this list throughout the semester.
help()
: Open help file.install.packages()
: Install an external R package. Do
this once per computer for each package.library()
: Load the functions of an external R package
so you can use them. Do this each time you start up R for each
package.<-
: Variable assignment.+
, -
, /
, *
:
Arithmetic operations.^
: Powers.sqrt()
: Square root.log()
: Log (base e).$
: Extracting a variable from a data frame.View()
: Look at a spreadsheet of data.head()
: See first six elements.{readr}
:
read_csv()
: Loading in tabular data.{dplyr}
:
glimpse()
: Look at a data frame.mutate()
: Variable transformation.rename()
: Variable renaming.filter()
: Select rows based on variable values.{ggplot2}
(see 01_ggplot).
ggplot2()
: Set a dataset and aesthetic map.geom_point()
: Make a scatterplot.geom_histogram()
: Make a histogram.geom_bar()
: Make a bar plot.geom_boxplot()
: Make a box plot.geom_smooth()
: Add a smoother.