Introduction

  • R is a statistical programming language designed to analyze data.

  • This is not an R course. But you need to know some tools to summarize/plot/model data.

  • R is free, widely used, more generally applicable (beyond linear regression), and a useful tool for reproducibility. So this is what we will use.

  • Python would have been a good choice too, but you can learn that in the machine learning course (STAT 427/627).

Installation

  • Install here R : https://cran.r-project.org/

  • Install R Studio here: https://www.rstudio.com/

  • NOTE: R is a programming language. R Studio is an IDE, a program for interacting with programming language (specifically R in this case). Thus, on your resume, you should say that you know R, not R Studio.

Before we begin

  • I cannot teach you everything there is to know in R. When you know the name of a function, but don’t know the commands, use the help() function. For example, to learn more about log() type

    help(log)
  • Alternatively, if you do not know the name of the function, you can Google the functionality you want. Googling coding solutions is a lot of what real data scientists do. Just append what you are Googling with “in R”. So, for example, “linear mixed effects models in R”.

R Basics

  • When you first open up R Studio, it should look something like this

     

  • The area to the right of the carrot “>” is called a prompt. You insert commands into the prompt.

  • You can use R as a powerful calculator. Try typing some of the following into the command prompt:

    3 * 7
    9 / 3
    4 + 6
    3 - 9
    (3 + 5) * 6
    3 ^ 2
    4 ^ 2
  • Exercise: What do you think the following will evaluate to? Try to guess before running it in R:

    8 / 4 + 3 * 2 ^ 2
  • R consists of two things: variables and functions (computer scientists would probably disagree with this categorization).

Variables

  • A variable stores a value. You use the assignment operator<-” to assign values to variables. For example, we can assign the value of 10 to the variable x.

    x <- 10
    • It is possible to use =, and I think there is nothing wrong with that. But for some reason the field has decided to only use <-, so you should too.
  • Whenever we use x later, it will use the value of 10

    x
    ## [1] 10
  • This is useful because you can reuse this value over and over again:

    y <- 0
    x + y
    x * y
    x / y
    x - y
  • To assign a “string” (a fancy way to say a word) to x, put the string in quotes. For example, we can assign the value of "Hello World" to x.

    x <- "Hello World"
    x
    ## [1] "Hello World"

Functions

  • Functions take objects (such as numbers or variables) as input and output new objects. Let’s look at a simple function that takes the log of a number:

    log(x = 4, base = 2)
  • The inputs are called “arguments”. Generally, every function will be for the form:

    function_name(arg1 = val1, arg2 = val2, ...)
  • If you do not specify the name of the argument, R will assume you are assigning in their order.

    log(4, 2)
  • You can change the order of the arguments if you specify them.

    log(base = 2, x = 4)
  • To see the list of all possible arguments of a function, use the help() function:

    help(log)
  • In the help file, there are often default values for an argument. For example, the following indicates the the default value of base is exp(1).

    log(x, base = exp(1))
  • This indicates that you can omit the base argument and R will assume that it should be exp(1).

    log(x = 4, base = exp(1))
    ## [1] 1.386
    log(x = 4)
    ## [1] 1.386
  • If an argument does not have a default, then it must be specified when calling a function.

  • Type this:

    log(x = 4,
  • The “+” indicates that R is expecting more input (you forgot either a parentheses or a quotation mark). You can get back to the prompt by hitting the ESCAPE key.

Useful Functions

  • c() creates a vector (sequence of values)

    y <- c(8, 1, 3, 4, 2)
    y
    ## [1] 8 1 3 4 2
  • You can perform vectorized operations on these vectors

    y + 2
    ## [1] 10  3  5  6  4
    y / 2
    ## [1] 4.0 0.5 1.5 2.0 1.0
    y - 2
    ## [1]  6 -1  1  2  0
  • exp(): Exponentiation. This is the inverse of log().

    exp(10)
    ## [1] 22026
    log(exp(10))
    ## [1] 10
  • mean(): The mean of a vector

    mean(y)
    ## [1] 3.6
  • sd() The standard deviation of a vector

    sd(y)
    ## [1] 2.702
  • sum(): Sum the values of a vector.

    sum(y)
    ## [1] 18
  • seq(): Create a sequence of numbers

    seq(from = 1, to = 10)
    ##  [1]  1  2  3  4  5  6  7  8  9 10
  • Exercise: What does the by argument do in seq()? Read the help file and modify it to 2 in the example code above.

  • head(): Show the first six values of an object.

R Packages

  • A package is a collection of functions that don’t come with R by default.

  • There are many many packages available. If you need to do any data analysis, there is probably an R package for it.

  • Using install.packages(), you can install packages that contain functions and datasets that are not available by default. Do this now with the tidyverse package:

    install.packages("tidyverse")
  • You will only need to install a package once per computer. Once it is installed you can gain access to all of the functions and datasets in a package by using the library() function.

    library(tidyverse)
  • You will need to run library() at the start of every R session if you want to use the functions in a package.

  • When I want to write the name of a function, I will write it like this().

Data Frames

  • The fundamental unit object of data analysis is the data frame.

  • A data frame has variables in the columns, and observations in the rows.

     

  • R comes with a bunch of famous datasets in the form of a data frame. Such as the mtcars dataset.

    data("mtcars")
    mtcars
    ##                      mpg cyl  disp  hp drat    wt  qsec vs am gear carb
    ## Mazda RX4           21.0   6 160.0 110 3.90 2.620 16.46  0  1    4    4
    ## Mazda RX4 Wag       21.0   6 160.0 110 3.90 2.875 17.02  0  1    4    4
    ## Datsun 710          22.8   4 108.0  93 3.85 2.320 18.61  1  1    4    1
    ## Hornet 4 Drive      21.4   6 258.0 110 3.08 3.215 19.44  1  0    3    1
    ## Hornet Sportabout   18.7   8 360.0 175 3.15 3.440 17.02  0  0    3    2
    ## Valiant             18.1   6 225.0 105 2.76 3.460 20.22  1  0    3    1
    ## Duster 360          14.3   8 360.0 245 3.21 3.570 15.84  0  0    3    4
    ## Merc 240D           24.4   4 146.7  62 3.69 3.190 20.00  1  0    4    2
    ## Merc 230            22.8   4 140.8  95 3.92 3.150 22.90  1  0    4    2
    ## Merc 280            19.2   6 167.6 123 3.92 3.440 18.30  1  0    4    4
    ## Merc 280C           17.8   6 167.6 123 3.92 3.440 18.90  1  0    4    4
    ## Merc 450SE          16.4   8 275.8 180 3.07 4.070 17.40  0  0    3    3
    ## Merc 450SL          17.3   8 275.8 180 3.07 3.730 17.60  0  0    3    3
    ## Merc 450SLC         15.2   8 275.8 180 3.07 3.780 18.00  0  0    3    3
    ## Cadillac Fleetwood  10.4   8 472.0 205 2.93 5.250 17.98  0  0    3    4
    ## Lincoln Continental 10.4   8 460.0 215 3.00 5.424 17.82  0  0    3    4
    ## Chrysler Imperial   14.7   8 440.0 230 3.23 5.345 17.42  0  0    3    4
    ## Fiat 128            32.4   4  78.7  66 4.08 2.200 19.47  1  1    4    1
    ## Honda Civic         30.4   4  75.7  52 4.93 1.615 18.52  1  1    4    2
    ## Toyota Corolla      33.9   4  71.1  65 4.22 1.835 19.90  1  1    4    1
    ## Toyota Corona       21.5   4 120.1  97 3.70 2.465 20.01  1  0    3    1
    ## Dodge Challenger    15.5   8 318.0 150 2.76 3.520 16.87  0  0    3    2
    ## AMC Javelin         15.2   8 304.0 150 3.15 3.435 17.30  0  0    3    2
    ## Camaro Z28          13.3   8 350.0 245 3.73 3.840 15.41  0  0    3    4
    ## Pontiac Firebird    19.2   8 400.0 175 3.08 3.845 17.05  0  0    3    2
    ## Fiat X1-9           27.3   4  79.0  66 4.08 1.935 18.90  1  1    4    1
    ## Porsche 914-2       26.0   4 120.3  91 4.43 2.140 16.70  0  1    5    2
    ## Lotus Europa        30.4   4  95.1 113 3.77 1.513 16.90  1  1    5    2
    ## Ford Pantera L      15.8   8 351.0 264 4.22 3.170 14.50  0  1    5    4
    ## Ferrari Dino        19.7   6 145.0 175 3.62 2.770 15.50  0  1    5    6
    ## Maserati Bora       15.0   8 301.0 335 3.54 3.570 14.60  0  1    5    8
    ## Volvo 142E          21.4   4 121.0 109 4.11 2.780 18.60  1  1    4    2
  • You can extract individual variables from a data frame using $

    mtcars$mpg
    ##  [1] 21.0 21.0 22.8 21.4 18.7 18.1 14.3 24.4 22.8 19.2 17.8 16.4 17.3 15.2 10.4
    ## [16] 10.4 14.7 32.4 30.4 33.9 21.5 15.5 15.2 13.3 19.2 27.3 26.0 30.4 15.8 19.7
    ## [31] 15.0 21.4
  • You can explore these in a spreadsheet format using View() (note the capital “V”).

    View(mtcars)

Reading in Data Frames

  • Most datasets will nead to be loaded into R. To do so, we will use the {readr} package.

    library(readr)
  • The only function I will require you to know from this package is read_csv(), which loads in data from a CSV file (“Comma-separated values”), a very popular format for storing data.

  • If you have the CSV file somewhere on your computer, then specify the path from the current working directory, and assign the data frame to a variable.

  • For other file formats, you need to use other functions, such as read_tsv(), read_table(), read_fwf(), etc. I will try to make sure read_csv() works for all datasets in this course.

  • I will typicaly post course datasets at https://dcgerard.github.io/stat_415_615/data.html. You can load those data into R by pasting their URL’s into read_csv().

    copiers <- read_csv("https://dcgerard.github.io/stat_415_615/data/copiers.csv")
    ## Rows: 45 Columns: 3
    ## ── Column specification ────────────────────────────────────────────────────────
    ## Delimiter: ","
    ## chr (1): model
    ## dbl (2): minutes, copiers
    ## 
    ## ℹ Use `spec()` to retrieve the full column specification for this data.
    ## ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
    head(copiers)
    ## # A tibble: 6 × 3
    ##   minutes copiers model
    ##     <dbl>   <dbl> <chr>
    ## 1      20       2 S    
    ## 2      60       4 L    
    ## 3      46       3 L    
    ## 4      41       2 L    
    ## 5      12       1 L    
    ## 6     137      10 L
  • Exercise: Load in the County Demographic Information data into R and print out the first six rows.

Basic Data Frame Manipulations

  • You will need to know just a few data frame manipulations, which we will perform using the {dplyr} package.

    library(dplyr)
  • The first argument for {dplyr} functions is always the data frame you are modifying. The following arguments typically involve the columns of that data frame.

  • Use the mutate() function from the {dplyr} package to make variable transformations.

    mtcars <- mutate(mtcars, kpg = mpg * 1.61)
    head(mtcars)
    ##                    mpg cyl disp  hp drat    wt  qsec vs am gear carb   kpg
    ## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0  1    4    4 33.81
    ## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0  1    4    4 33.81
    ## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1  1    4    1 36.71
    ## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1  0    3    1 34.45
    ## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0  0    3    2 30.11
    ## Valiant           18.1   6  225 105 2.76 3.460 20.22  1  0    3    1 29.14
  • Use glimpse() to get a brief look at the data frame.

    glimpse(mtcars)
    ## Rows: 32
    ## Columns: 12
    ## $ mpg  <dbl> 21.0, 21.0, 22.8, 21.4, 18.7, 18.1, 14.3, 24.4, 22.8, 19.2, 17.8,…
    ## $ cyl  <dbl> 6, 6, 4, 6, 8, 6, 8, 4, 4, 6, 6, 8, 8, 8, 8, 8, 8, 4, 4, 4, 4, 8,…
    ## $ disp <dbl> 160.0, 160.0, 108.0, 258.0, 360.0, 225.0, 360.0, 146.7, 140.8, 16…
    ## $ hp   <dbl> 110, 110, 93, 110, 175, 105, 245, 62, 95, 123, 123, 180, 180, 180…
    ## $ drat <dbl> 3.90, 3.90, 3.85, 3.08, 3.15, 2.76, 3.21, 3.69, 3.92, 3.92, 3.92,…
    ## $ wt   <dbl> 2.620, 2.875, 2.320, 3.215, 3.440, 3.460, 3.570, 3.190, 3.150, 3.…
    ## $ qsec <dbl> 16.46, 17.02, 18.61, 19.44, 17.02, 20.22, 15.84, 20.00, 22.90, 18…
    ## $ vs   <dbl> 0, 0, 1, 1, 0, 1, 0, 1, 1, 1, 1, 0, 0, 0, 0, 0, 0, 1, 1, 1, 1, 0,…
    ## $ am   <dbl> 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 0, 1, 1, 1, 0, 0,…
    ## $ gear <dbl> 4, 4, 4, 3, 3, 3, 3, 4, 4, 4, 4, 3, 3, 3, 3, 3, 3, 4, 4, 4, 3, 3,…
    ## $ carb <dbl> 4, 4, 1, 1, 2, 1, 4, 2, 2, 4, 4, 3, 3, 3, 4, 4, 4, 1, 2, 1, 1, 2,…
    ## $ kpg  <dbl> 33.81, 33.81, 36.71, 34.45, 30.11, 29.14, 23.02, 39.28, 36.71, 30…
  • Use View() to see a spreadsheet of the data frame (never put this in an R Markdown file). Note the capital “V”.

    View(mtcars)
  • Use rename() to rename variables.

    mtcars <- rename(mtcars, auto_man = am)
    head(mtcars)
    ##                    mpg cyl disp  hp drat    wt  qsec vs auto_man gear carb
    ## Mazda RX4         21.0   6  160 110 3.90 2.620 16.46  0        1    4    4
    ## Mazda RX4 Wag     21.0   6  160 110 3.90 2.875 17.02  0        1    4    4
    ## Datsun 710        22.8   4  108  93 3.85 2.320 18.61  1        1    4    1
    ## Hornet 4 Drive    21.4   6  258 110 3.08 3.215 19.44  1        0    3    1
    ## Hornet Sportabout 18.7   8  360 175 3.15 3.440 17.02  0        0    3    2
    ## Valiant           18.1   6  225 105 2.76 3.460 20.22  1        0    3    1
    ##                     kpg
    ## Mazda RX4         33.81
    ## Mazda RX4 Wag     33.81
    ## Datsun 710        36.71
    ## Hornet 4 Drive    34.45
    ## Hornet Sportabout 30.11
    ## Valiant           29.14
  • Use filter() to remove rows.

    • Use == to select rows based on equality
    • Use < and > to select rows based on inequality
    • Use <= and >= to select rows based on inequality/equality.
    filter(mtcars, auto_man == 1)
    ##                 mpg cyl  disp  hp drat    wt  qsec vs auto_man gear carb   kpg
    ## Mazda RX4      21.0   6 160.0 110 3.90 2.620 16.46  0        1    4    4 33.81
    ## Mazda RX4 Wag  21.0   6 160.0 110 3.90 2.875 17.02  0        1    4    4 33.81
    ## Datsun 710     22.8   4 108.0  93 3.85 2.320 18.61  1        1    4    1 36.71
    ## Fiat 128       32.4   4  78.7  66 4.08 2.200 19.47  1        1    4    1 52.16
    ## Honda Civic    30.4   4  75.7  52 4.93 1.615 18.52  1        1    4    2 48.94
    ## Toyota Corolla 33.9   4  71.1  65 4.22 1.835 19.90  1        1    4    1 54.58
    ## Fiat X1-9      27.3   4  79.0  66 4.08 1.935 18.90  1        1    4    1 43.95
    ## Porsche 914-2  26.0   4 120.3  91 4.43 2.140 16.70  0        1    5    2 41.86
    ## Lotus Europa   30.4   4  95.1 113 3.77 1.513 16.90  1        1    5    2 48.94
    ## Ford Pantera L 15.8   8 351.0 264 4.22 3.170 14.50  0        1    5    4 25.44
    ## Ferrari Dino   19.7   6 145.0 175 3.62 2.770 15.50  0        1    5    6 31.72
    ## Maserati Bora  15.0   8 301.0 335 3.54 3.570 14.60  0        1    5    8 24.15
    ## Volvo 142E     21.4   4 121.0 109 4.11 2.780 18.60  1        1    4    2 34.45
    filter(mtcars, mpg < 15)    
    ##                      mpg cyl disp  hp drat    wt  qsec vs auto_man gear carb
    ## Duster 360          14.3   8  360 245 3.21 3.570 15.84  0        0    3    4
    ## Cadillac Fleetwood  10.4   8  472 205 2.93 5.250 17.98  0        0    3    4
    ## Lincoln Continental 10.4   8  460 215 3.00 5.424 17.82  0        0    3    4
    ## Chrysler Imperial   14.7   8  440 230 3.23 5.345 17.42  0        0    3    4
    ## Camaro Z28          13.3   8  350 245 3.73 3.840 15.41  0        0    3    4
    ##                       kpg
    ## Duster 360          23.02
    ## Cadillac Fleetwood  16.74
    ## Lincoln Continental 16.74
    ## Chrysler Imperial   23.67
    ## Camaro Z28          21.41
  • Exercise: Calculate the log-displacement, add this to the mtcars data frame.

  • Exercise: Filter out cars with only one carborator (keep cars with more than 1 carborator).

  • Exercise: Rename the hp variable to horse.

Summary

Here is the list of basic R stuff I expect you to know off the top of your head. We will add to this list throughout the semester.

  • help(): Open help file.
  • install.packages(): Install an external R package. Do this once per computer for each package.
  • library(): Load the functions of an external R package so you can use them. Do this each time you start up R for each package.
  • <-: Variable assignment.
  • +, -, /, *: Arithmetic operations.
  • ^: Powers.
  • sqrt(): Square root.
  • log(): Log (base e).
  • $: Extracting a variable from a data frame.
  • View(): Look at a spreadsheet of data.
  • head(): See first six elements.
  • From {readr}:
    • read_csv(): Loading in tabular data.
  • From {dplyr}:
    • glimpse(): Look at a data frame.
    • mutate(): Variable transformation.
    • rename(): Variable renaming.
    • filter(): Select rows based on variable values.
  • From {ggplot2} (see 01_ggplot).
    • ggplot2(): Set a dataset and aesthetic map.
    • geom_point(): Make a scatterplot.
    • geom_histogram(): Make a histogram.
    • geom_bar(): Make a bar plot.
    • geom_boxplot(): Make a box plot.
    • geom_smooth(): Add a smoother.