R as a Research Tool: Lectures

Table of Contents

1. Introduction

1.1. Why R?

  • Why R?

    ?

    • Why use a programming language at all?
    • Why R vs Python, C/C++, Rust, Julia, Java, etc?
  • Why use a programming language at all?
    • Powerful
    • Reproducible
    • Interoperability
  • Why R vs others?
    • A language focused on data analysis and and well suited to interactive exploration of data.
    • By far the most popular software for statistics and data analyses – although losing ground to python.
    • A massive set of packages.
    • Strong and friendly community. R is the de facto standard language in statistics and many areas of biology.
    • Open source:
      • Available to all, not just the rich.
      • Contributes to reproducible research.
  • Shortcomings of R compared to other languages
    • Less "general purpose" than Python and some others.
    • There are nutty aspects of the language (eg counting starts at 1, not zero
    • There are MANY ways to accomplish the same thing in R. Many R functions use tricks to reduce the amount of typing at the cost of making code that is hard to understand. It is not really elegant.
    • R "evolved." Many APIs are inconsistent (especially in base R!)
    • Most R programmers are not computer scientists; there is a lot of poorly written and poorly documented R code out there.
    • R is not very fast. It uses memory inefficiently. But it is often fast enough.
  • Computer Languages
    • Machine Language

      :1111 0000 1011 0100 (in hexadecimal: F0B4)

    • Assembly
      iload intRate
      bipush 100
      if_icmpgt intError
      
    • High level language (C, Python, R …)
      if (intRate > 100)
      
  • Interpreted vs compiled languages
    • Compiled

      higher level language code is converted into machine code and then 'directly' executed by the host CPU

      • Faster execution, but needs compilation time
      • No dynamic typing or scoping
    • Interpreted (eg R)

      Programs are indirectly executed ("interpreted") by an interpreter program

      • Platform independence
      • Reflection (ability to examine and modify the structure and behavior of an object at run time)
      • Dynamic typing
  • R
    • R is a high level computer language for statistical computing
    • R is an interpreted language
    • FORTRAN, C, C++ are all high level compiled languages
    • other interpreted languages used in science: Python, Perl, Matlab/Octave, Julia.

1.2. Logistics

1.3. Getting started

  • Installing R

    See syllabus. You will install R (the open source language) and RStudio which is an integrated development environment many folks use for writing and running R code.

  • Running R from the command line

    Start R and you'll get a greeting, and then the R prompt, the > sign. The screen will look something like:

    R
    
    R version 4.3.2 (2023-10-31) -- "Eye Holes"
    Copyright (C) 2023 The R Foundation for Statistical Computing
    Platform: x86_64-pc-linux-gnu (64-bit)
    
    [...MORE...]
    
    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.
    
  • RStudio

    You will probably use this. The Desktop version is open source (AGPLv3) and free. It is an IDE (integrated development environment) that allows one to edit scripts, send code directly to the R process in a console window, debug, edit RMarkdown, etc all in one application.

    • Windows: console, workspace, files, current open file (script).
    • Projects: RStudio saves a .RProj file (probably hidden by your file browser) in your project directory.
  • RStudio: keyboard shortcuts
    • In editor
      • Ctrl + 2: move cursor to console
      • Ctrl + enter: send code to console
      • Ctrl + Shift + /: rewrap comment block
      • Ctrl + Shift + c: comment/uncomment region
    • In console
      • Ctrl + 1: move cursor to editor
      • Up arrow: retrieve previous command
      • Ctrl + up arrow: search commands

1.4. Fundamentals of R

  • R as a calculator:
    4.0 * 3.5
    log(10)  # natural log
    log10(10)
    (3 + 1) * 5
    3^-1
    1/0
    
    [1] 14
    [1] 2.302585
    [1] 1
    [1] 20
    [1] 0.3333333
    [1] Inf
    
  • Interactive

    You can start an R session on the command line or as a "console window" in RStudio. This is an interactive session.

  • Or a stand alone script
    Rscript run-stats.R
    

    Or make an executable script: On linux/mac, just start the text file with a "shebang" line and make the file executable:

    #!/usr/bin/Rscript
    x <- 10
    print(x)
    
  • Statements
    • one statement per line
    • but R will keep reading if the statement is "not finished" (open parentheses, hanging operator, etc)
    • semicolon can end a statement (rarely needed)
    1 +
    2  # second statement ends here
    3 + 4 + ( 7
             - 7)
    
    [1] 3
    [1] 7
    
  • Assigning values to variables
    • Variables are assigned using <-
      x <- 12.6
      x
      x == 13
      
      [1] 12.6
      [1] FALSE
      
  • R now allows = for assignment
    • Originally only <- was used for assignment
      x = 13 # same as x <- 13
      x == 13 # tests equality
      
      [1] TRUE
      
  • Variables that contain many values
    • Create vectors with the concatenate function, c() :
      y <- c(3, 4, 9, 11)
      y
      
      [1]  3  4  9 11
      
  • There are no scalars in R

    But you can have vectors of length 1. Can you figure out what the [1], [20] etc output below means?

    y <- rep("a",50)
    y
    
     [1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
    [20] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
    [39] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
    
  • Naming variables (objects)
    • 1-63 characters
    • first character a letter
    • remaining characters: letters or numbers or period (+ underscore)
    • case sensitive
  • Parenthesis and braces
    • () Parenthesis group operators and are used in function calls
    • {} braces group statements and return last expression evaluated. This is needed for function definitions, if-then statements and a few other areas.
  • Functions

    Functions are modules of code that accomplish a specific task. Functions take data as "arguments" and "return" a result. Once a function is written, it can be used over and over and over again. Functions can be "called" from the console or from within other functions.

  • Functions in R

    Arguments to R functions can be matched by position or by name. So the following calls to sd() are all equivalent

    mydata <- rnorm(100)
    sd(mydata)
    sd(x = mydata)
    sd(x = mydata, na.rm = FALSE)
    sd(na.rm = FALSE, x = mydata)
    sd(na.rm = FALSE, mydata)
    
  • Example: random numbers
    runif(5)
    
    [1] 0.0705746 0.7757365 0.2578603 0.3275505 0.6414583
    
    [1] 0.0705746 0.7757365 0.2578603 0.3275505 0.6414583
    
    runif(5, 0, 10)
    
    [1] 2.9466004 8.9681397 2.7195012 6.7328881 0.2388521
    
    round(runif(5,-0.5,10.5))
    
    [1] 10  0  5 10  9
    
    floor(runif(5,0,11))
    
    [1]  7  5  8 10  7
    
  • Inspecting functions
    sd
    
    function (x, na.rm = FALSE) 
    sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
        na.rm = na.rm))
    <bytecode: 0x5d0f98e12650>
    <environment: namespace:stats>
    
    function (x, na.rm = FALSE) 
    sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
        na.rm = na.rm))
    <bytecode: 0x5ec3b5ea8de0>
    <environment: namespace:stats>
    
  • Comments

    The R interpreter ignores anything after the "#" symbol on a lines:

    x <- 2
    # On the line above, I assigned the value of 2 to x. This comment is for human
    # readers
    
    x <- 3 # This comment is "inline" (use judiciously)
    

1.5. Getting help

  • Help in R
    • Use R's built in self-documentation:
      help(seq)
      ?seq
      ?"<"
      ?"for"
      
    • For examples:
      example(seq)
      example(persp)
      
    • If you don't know exactly what you are looking for:
      help.search("mixed model")
      ??"mixed model"
      
  • Help on the internet
    • The R Project’s own manuals are available at the R home page and some search engines are listed there, too.
    • RSeek search engine: http://www.rseek.org
    • DANGER: You can post your R questions to r-help, the R listserve. See http://www.r-project.org/mail.html.
    • DANGER: Check out StackOverflow for programming questions tagged "r": http://stackoverflow.com/
    • StackOverflow is dying in part because of chatgpt — which was trained on all its answers.

2. Plots with ggplot

2.1. Setup

  • Housekeeping and reminders
    • Resources at http://r-research-tool.schwilk.org/ : class session (.R) files, weekly vocabulary lists, lecture notes
    • Follow along with code and STOP ME if you get an error.
    • Homework due on Mondays via email. These must be R scripts (plain text files). Write any answers or explanations asked for as comments (line starts with #).
  • Some advice on setting up your computer
    • Set your file browser to show file extensions. Those are part of the file name despite microsoft's defaults. You might set your browser to show "hidden" fiiles as well but note you are not expected to modify those manually.
    • Make a separate folder for each assignment and major analysis project. Use plain text for everything (you can edit in RStudio).

2.2. ggplot: introduction by example

  • Making pretty figures
    • Today we will get ahead of ourselves. Don't worry! I'll come back to the basics after showing you some fun with plots.
    • We will start with the ggplot() function
  • Where do functions come from?

    Many are part of base R. Others come from packages. Others you will write yourself.

    • R packages are loaded with library()
      library(ggplot2)
      
    • To install a new package:
      install.packages("ggplot2")
      
    • Note   B_note
      • In the R interpreter console, install.packages will install for the current user. Or use the menu in Rstudio. If you want to install system-wide you can use your operating system's package manager. Example on Debian-based linux: apt install r-cran-ggplot2
  • ggplot2 and built in data sets in R
    #install.packages("ggplot2")
    library(ggplot2)
    
    #?mpg
    head(mpg)
    #str(mpg)
    summary(mpg)
    
  • ggplot() function

    Create a plot object to which one can add layers (geoms)

    • ggplot(data, mapping, ...)
      data
      data frame with variables as columns
      mapping
      an aesthetic mapping of variable names to x, y position. Create this with the aes() function
    • aes(x, y, ...)
      x
      x axis variable
      y
      y axis variable
      other aesthetics (color, size, etc)
  • First plot
    ggplot(mpg, aes(x=displ, y=hwy)) + geom_point()
    
    • Result

      figplot_1.png

  • geoms
    • ggplot() simply sets up the axes and the mapping. To put data on the plot you must add a "geometric object" or "geom"
    • Use geom_point() for a scatterplot. See other geoms such as geom_line(), geom_smooth(), geom_boxplot()
  • Additional variables, an example
    ggplot(mpg, aes(displ, hwy, color=class)) + geom_point()
    
    • Result

      figggplot_2.png

  • Saving plots
    1. You can store the plot object to a variable name then save with ggsave():

      p <- ggplot(mpg, aes(displ, hwy)) + geom_point()
      ggsave("plots/plot1.pdf", p)
      
    2. Save the last plot even if it was not assigned to a variable:

      ggplot(mpg, aes(cyl, hwy)) + geom_point()
      ggsave("plots/plot2.pdf")
      
    3. You can use the RStudio GUI ("export" button in the plots window)

2.3. Aesthetics and faceting

  • Aesthetics in addition to x and y
    Aesthetic Discrete variable Continuous variables
    Color Rainbow of colors Gradient red to blue
    Size Discrete size steps Mapping radius->value
    Shape Different shape for each Doesn't work
  • Example aesthetics
    ggplot(mpg, aes(displ, hwy, size = cyl)) + geom_point()
    
    • Result

      figggplot_3.png

  • Faceting
    • Small multiples displaying different subsets of the data.
    • Useful for exploring conditional relationships and for large data.
  • Examples to try:
    p <- ggplot(mpg, aes(displ, hwy)) + geom_point()
    p + facet_grid(. ~ cyl)
    p + facet_grid(drv ~ .)
    p + facet_grid(drv ~ cyl)
    p + facet_wrap(~ class)
    
  • Faceting
    facet_grid()
    2d grid, rows ~ cols, . for no split
    facet_wrap()
    1d ribbon wrapped into 2d
    scales
    argument controls whether position scales are fixed or free
  • Overplotting: What is wrong here?
    p <- ggplot(mpg, aes(cty, hwy))
    p + geom_point()
    

    ggplot-no-jitter.png

  • Jitter
    p + geom_jitter()
    

    ggplot_jitter.png

  • How can we improve this plot?
    ggplot(mpg, aes(class, hwy)) + geom_boxplot()
    

    figgplot_4.png

  • Reorder
    p <- ggplot(mpg, aes(reorder(class, hwy), hwy))
    p + geom_boxplot()
    

    figgplot_5.png

  • Multiple geoms
    p + geom_jitter() + geom_violin()
    

    figggplot_6.png

  • Your turn
    • Read the help for reorder. Redraw the previously plots with class ordered by median hwy.
    • How would you put the jittered points on top of the violin plots?

3. Math and Data structures 1

3.1. More on getting started in R

  • Session data
    .Rprofile
    File loaded on startup which can contain your own custom functions
    .Rhistory
    File that contains the session history of commands
    .RData
    Binary blob of data which contains all the objects created in the session. Watch out for this!

    I suggest you write code which does not use or depend upon these files! For your homework, you must turn in code that runs in a "clean" environment. Check Tools -> Global options -> General in RStudio.

  • Working directory

    Set your working directory. You can use the GUI, an RStudio .Rproj file, or interactively on the command line:

    getwd()
    setwd("C:/Users/Username/Desktop")
    setwd(path.expand("~/projects/flammability") )
    
    • Note   B_note
      • Windows users: what looks wrong about that file path? Use forward slashes in code!
      • paths are a place of cross-platform incompatibility. Be careful with them. Never use absolute paths such as these in your own code.
  • Never set the working directory in code, only in an interactive session or a project file.
    • Your code will not be portable across machines.
    • Instead, document in your code or elsewhere the assumptions your code makes about paths and working directories.
    • For this class, we will always assume that the working directory is the same as the folder containing the current R script.
  • Special "hidden files" .Rdata and .Rhistory
    • When you quit a session you are asked if you want to save the workspace
    • or you can manually use save.image()
    • When R starts it loads all objects from ./.Rdata (Unset this in RStudio!)
    • Saving your workspace should usually be avoided. Make sure your code runs from a clean workspace: try rm(list=ls(all=TRUE)) and then rerun your code.
    • Note   B_note

      I set my preferences NOT to save the workspace and not to prompt me. That way, everything I do is documented in the code. If I must save objects (model results, etc) I do so explicitly as named files.

3.2. Math

  • Operators
    • Binary operators

      Are just functions whose two arguments are on left and right side

      Eg: 5 + 3 is same as '+'(5,3)

      To refer to an operator by its function syntax, you must put it in quotation marks.

    • Unary operators

      Have just one argument. Eg help operator (?log10) or negation (!5==3)

  • Operator precedence
    $ Subsetting by name
    @ component / slot extraction
    [ [[ subsetting
    ^ exponentiation
    : sequence operation
    %any% special operators, including %% and %/%
    * / multiply and divide
    + - add and subtract
    < > <= >= = ! greater, less than
    ! negation
    & and && and
    | and || or
    ~ "depends on" in formulae
    <- assignment
    <<- global assignment
    ? help
  • Order of operations
    x <- 3 + 4 * 2
    x
    
    [1] 11
    
    x <- (3 + 4) * 2
    x
    x <- (3 + 4) * sqrt(2)
    x
    
    [1] 14
    [1] 9.899495
    

3.3. Vector basics

  • R data structures
    • vectors
    • matrices and arrays
    • lists
    • data frames

    And other specialized structures provided by packages (eg tibbles, data.table, phylogenetic trees …)

  • Vectors: the heart of R
    x <- 1:20
    x
    y <- seq(1,20)
    y
    z <- seq(1,40,2)
    z
    
     [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
    [20] 20
     [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19
    [20] 20
     [1]  1  3  5  7  9 11 13 15 17 19 21 23 25 27 29 31 33 35 37
    [20] 39
    
  • Vectors: inspecting
    x <- c(5,12,13)
    x
    length(x)
    mode(x)
    
    [1]  5 12 13
    [1] 3
    [1] "numeric"
    
  • The ":" operator
    5:8
    5:1
    
    [1] 5 6 7 8
    [1] 5 4 3 2 1
    
  • Beware of operator precedence:
    i <- 4
    1:i-1
    1:(i-1)
    
    [1] 0 1 2 3
    [1] 1 2 3
    
  • seq()
    seq(from=12,to=30,by=3)
    seq(from=1.1,to=2,length=10)
    seq(from=12,to=30,by=3)
    
    [1] 12 15 18 21 24 27 30
     [1] 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
    [1] 12 15 18 21 24 27 30
    
  • rep()
    rep(8,4)
    rep(c(5,12,13),3)
    rep(c(5,12,13),each=2)
    
    [1] 8 8 8 8
    [1]  5 12 13  5 12 13  5 12 13
    [1]  5  5 12 12 13 13
    
  • The subscripting operators (used to extract a subset)

    R has three subscripting operators:

    • [
    • [[
    • $

    We'll start with one you've already used in your homework: [

  • Vector subsetting
    v <- 1:100
    v
    
     [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14
    [15]  15  16  17  18  19  20  21  22  23  24  25  26  27  28
    [29]  29  30  31  32  33  34  35  36  37  38  39  40  41  42
    [43]  43  44  45  46  47  48  49  50  51  52  53  54  55  56
    [57]  57  58  59  60  61  62  63  64  65  66  67  68  69  70
    [71]  71  72  73  74  75  76  77  78  79  80  81  82  83  84
    [85]  85  86  87  88  89  90  91  92  93  94  95  96  97  98
    [99]  99 100
    
    v <- v * 10
    v[1:3]
    
    [1] 10 20 30
    
  • Vector indexing (subsetting)

    There are six things that you can use to subset a vector. First three are used commonly:

    • positive integers
    • negative integers
    • logical vectors
    • nothing ([] returns original vector)
    • zero ([0] returns empty vector)
    • character vectors (only if vector is named)
  • Subsetting vectors using [
    y <- c(1.2,3.9,0.4,0.12)
    y[c(1,3)]
    y[1:3]
    
    [1] 1.2 0.4
    [1] 1.2 3.9 0.4
    
    v <- c(1,4)
    y[-v]  # returns all except 1st and 4th element
    
    [1] 3.9 0.4
    
  • Subsetting with logical vectors
    y <- c(1.2,3.9,0.4,0.12)
    l <- y < 2
    l
    y[l]
    
    [1]  TRUE FALSE  TRUE  TRUE
    [1] 1.20 0.40 0.12
    
  • Character vectors (aka vectors of "strings")
    y <- "abc"
    length(y)
    mode(y)
    y <- c("abc","1","2")
    length(y)
    
    [1] 1
    [1] "character"
    [1] 3
    
    • Note   B_note

      If you are coming from another computer language, R's nomenclature may be a bit confusing. "Character vectors" are really vectors of character strings, not of individual characters.

  • Strings
    u <- paste("abc", "de", "f") # concatenate the strings
    u
    
    [1] "abc de f"
    
    v <- strsplit(u, " ") # split the string according to blanks
    v
    
    [[1]]
    [1] "abc" "de"  "f"
    
    • Note   B_note

      Vectors that hold pieces of text are called character vectors in R. Formally, the mode of an object that holds character strings in R is "character". The individual elements of an R "character vector" are character strings, or what most computer languages call "strings".

      See the stringr package for useful string functions. I'll introduce this package later in the course.

  • Some vector concepts
    Recycling
    Automatic lengthening of vectors in certain settings
    Filtering
    Extracting subsets of vectors
    Vectorization
    Where functions are applied element-wise to vectors
  • Recycling
    x <- c(1, 2, 3, 4)
    

    Recycling:

    c(1,2) + x
    
    [1] 2 4 4 6
    
    c(1, 2, 3) + x
    
    [1] 2 4 6 5
    
  • Vectorization

    Many functions are vectorized, including operators such as >. They conduct element-wise operations

    u <- c(5,2,8)
    v <- c(1,3,9)
    u > v
    
    [1]  TRUE FALSE FALSE
    
    sqrt(1:9)
    
    [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
    [9] 3.000000
    
  • Learn these functions

    exp(), log(), log10(), sqrt(), abs(), min(), which.min(), max(), which.max(), pmin(), pmax(), sum(), prod() (for products of multiple factors), round(), floor(), ceiling(), sort(), factorial()

    x <- c(1, 2, 3, 4)
    y <- c(3, 6, 1, 4)
    min(y)
    which.min(y)
    pmin(x, y)
    
    [1] 1
    [1] 3
    [1] 1 2 1 4
    

4. Data structures 2

4.1. Logical vectors

  • Boolean operators

    Operate on TRUE/FALSE values only

    Operation Math R
    ------------ ------------------------ ----------
    AND x ∧ y &
    OR x ∨ y |
    NOT !x !
    ------------ ------------------------ ----------
  • Boolean operators
    operator function
    -------- -----------------
    == Test for equality
    < Less than
    > Greater than
    <=,>= lt or equal, gt or equal
    & Boolean AND for vectors
    | Boolean OR for vectors
    !x Boolean negation
    xor() Exclusive OR
    %in% Set inclusion
    any(), all() Boolean vector reduction
  • Logical vectors and filtering (subscripting)
    y <- c(1.2, 3.9, 0.4, 0.12)
    v <- c(TRUE, TRUE, FALSE, FALSE)  # unusual to see literal booleans in real code
    y[v]
    
    [1] 1.2 3.9
    
    y[ y > 1 ]
    
    [1] 1.2 3.9
    
    y > 1
    
    [1]  TRUE  TRUE FALSE FALSE
    
  • Practice
    # All the values that are evenly divisible by 9 and are greater than 200
    v <- 1:500
    ch <- as.character(v)
    ch[v %% 9 == 0 & v > 200]
    
     [1] "207" "216" "225" "234" "243" "252" "261" "270" "279"
    [10] "288" "297" "306" "315" "324" "333" "342" "351" "360"
    [19] "369" "378" "387" "396" "405" "414" "423" "432" "441"
    [28] "450" "459" "468" "477" "486" "495"
    
    • Note   B_note

      I made a character vector by converting numbers to strings and then showed how to use the values in one vector to subscript into another

  • Practice:
    # Get all the elements that evenly divisible by 2 or by 3
    v <- 1:100
    
    # ?
    
  • match and %in%
    names <- c("fortis", "spinosa", "macrocarpa", "leucophylla")
    samples <- c("spinosa", "leucophylla")
    samples %in% names
    
    [1] TRUE TRUE
    
    names %in% samples
    
    [1] FALSE  TRUE FALSE  TRUE
    
    match(samples, names)
    
    [1] 2 4
    
  • match and %in%
    • match gives indices of where elements of first argument appear in the second argument
    • %in% returns vector of same length as first argument with TRUE for each element that occurs in second argument.
  • Practice
    head(LETTERS)
    match("G", LETTERS)
    length("G" %in% LETTERS) == 1
    length(LETTERS %in% "G") == 26
    
    [1] "A" "B" "C" "D" "E" "F"
    [1] 7
    [1] TRUE
    [1] TRUE
    
  • any(), all()

    Reduce boolean (logical) vectors to a scalar.

    any(y > 1)
    
    [1] TRUE
    
    all(y > 1)
    
    [1] FALSE
    
  • Additional logical functions
    is.na()
    is.finite()
    duplicated()
    which()
    ifelse()
    
  • Assignment to subset
    which(y > 1)
    
    [1] 1 2
    
    y[y > 1] <- 0
    y
    
    [1] 0.00 0.00 0.40 0.12
    
  • Vectorized if-else
    x <- 1:10
    y <- ifelse(x %% 2 == 0, 0, 2*x)
    y
    
    [1]  2  0  6  0 10  0 14  0 18  0
    

4.2. More vectors

  • NA and NULL
    NA
    Missing data
    NULL
    The null object
    x <- c(1,NA,3,4,5,6)
    mean(x)
    
    [1] NA
    
    mean(x, na.rm=TRUE)
    
    [1] 3.8
    
    x <- c(1, NULL, 3, 4, 5, 6)
    x # NULL can't be part of a vector
    mean(x)
    
    [1] 1 3 4 5 6
    [1] 3.8
    
  • Testing for NA
    x <- c(1, 2, NA)
    x==NA ## Oops
    is.na(x) # Correct
    
    [1] NA NA NA
    [1] FALSE FALSE  TRUE
    
  • unique()
    x <- c(1, 2, 3, 3, 3, 4, 4, 1, 2)
    unique(x)
    
    [1] 1 2 3 4
    
  • Counting TRUES
    x < 3
    length(which(x < 3))
    sum(x < 3)
    
    [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
    [1] 4
    [1] 4
    
  • Vector element names
    x <- c(1,2,4)
    names(x)
    
    NULL
    
    names(x) <- c("one","two","four")
    x
    
    one  two four 
      1    2    4
    
  • Subsetting on names
    x[c("one", "four")]
    
    one four 
      1    4
    
  • c(): Type coercion:
    • When you attempt to combine different types they will be coerced to the most flexible type.
    • The ordering from least to most flexible is logical < integer < numeric < complex < character < list.
    c(5,2,"abc")
    c(5,list(a=1,b=4))
    
    [1] "5"   "2"   "abc"
    [[1]]
    [1] 5
    
    $a
    [1] 1
    
    $b
    [1] 4
    
  • c(): Flattening:

    c() creates vectors

    c(1,2,c(100,200,300))
    
    [1]   1   2 100 200 300
    

4.3. Matrices

  • Matrices
    m <- matrix(data = c(1,3,5,2,4,6), nrow=3, ncol=2)
    m
    
         [,1] [,2]
    [1,]    1    2
    [2,]    3    4
    [3,]    5    6
    
    m <- matrix(data = c(1,3,5,2,4,6), nrow=3)
    m
    
         [,1] [,2]
    [1,]    1    2
    [2,]    3    4
    [3,]    5    6
    
  • Matrices con.
    a <- matrix(c(1, 2, 3, 4), nrow=2)
    a
    
         [,1] [,2]
    [1,]    1    3
    [2,]    2    4
    
    b <- matrix(c(6, 7, 8, 9), nrow=2)
    b
    
         [,1] [,2]
    [1,]    6    8
    [2,]    7    9
    
  • Matrix addition
    a + b
    
         [,1] [,2]
    [1,]    7   11
    [2,]    9   13
    
  • Matrices math, dimensions
    c <- matrix(c(3,4,5,6), nrow=1)
    c
    
         [,1] [,2] [,3] [,4]
    [1,]    3    4    5    6
    
    a + c
    
    # result:
    Error in a + c : non-conformable arrays
    
  • Matrix math, con.
    a * b
    
         [,1] [,2]
    [1,]    6   24
    [2,]   14   36
    
    a^3
    
         [,1] [,2]
    [1,]    1   27
    [2,]    8   64
    
  • Matrix multiplication
    a <- matrix(c(1,2,3),ncol=3)
    b <- matrix(c(4,5,6),ncol=1)
    a
    b
    
         [,1] [,2] [,3]
    [1,]    1    2    3
         [,1]
    [1,]    4
    [2,]    5
    [3,]    6
    

    Calculate dot product (inner product)? Or generally, matrix product?

  • Matrix multiplication
    # a*b # ERROR
    a %*% b
    
         [,1]
    [1,]   32
    
    b %*% a
    
         [,1] [,2] [,3]
    [1,]    4    8   12
    [2,]    5   10   15
    [3,]    6   12   18
    
  • rbind and cbind
    rbind(a, a)
    # rbind(a,b) # ERROR!
    cbind(b, b)
    
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    1    2    3
         [,1] [,2]
    [1,]    4    4
    [2,]    5    5
    [3,]    6    6
    
  • Row and column means

    See:

    • rowMeans()
    • colMeans()
    • rowSums()
    • colSums()
  • Subsetting matrices
    m <- matrix(c(1,2,3,4,5,6),nrow=3)
    m[1:2,1:2]
    
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    
  • Dimension reduction
    z <- matrix(1:8, nrow=4)
    z
    
         [,1] [,2]
    [1,]    1    5
    [2,]    2    6
    [3,]    3    7
    [4,]    4    8
    
    r <- z[2,]
    r
    
    [1] 2 6
    
  • Dimension reduction result
    attributes(z)
    attributes(r)
    
    $dim
    [1] 4 2
    
    NULL
    
  • Avoiding dimension reduction
    r <- z[2,, drop=FALSE]
    r
    
         [,1] [,2]
    [1,]    2    6
    
    dim(r)
    
    [1] 1 2
    

5. Data structures 3

5.1. Lists

  • Lists

    An R list is a container whose contents can be items of different data types, unlike a vector.

    
    x <- list(u=2, v="abc")
    x
    x$u
    
    $u
    [1] 2
    
    $v
    [1] "abc"
    
    [1] 2
    
  • Lists vs vectors

    An R list is a container whose contents can be items of diverse data types, as opposed to them being, say, all numbers, as in a vector.

    • Lists are ordered lookup tables:

      One can look up values by key (named elements) but they are also ordered (like vectors).

  • Lists uses

    A common usage of lists is to package the return values of complex statistical functions. For example, the histogram function hist():

    hist(Nile)
    hn <- hist(Nile)
    print(hn)
    
    $breaks
     [1]  400  500  600  700  800  900 1000 1100 1200 1300 1400
    
    $counts
     [1]  1  0  5 20 25 19 12 11  6  1
    
    $density
     [1] 0.0001 0.0000 0.0005 0.0020 0.0025 0.0019 0.0012 0.0011 0.0006 0.0001
    
    $mids
     [1]  450  550  650  750  850  950 1050 1150 1250 1350
    
    $xname
    [1] "Nile"
    
    $equidist
    [1] TRUE
    
    attr(,"class")
    [1] "histogram"
    
  • Subsetting lists
    [
    returns a list
    [[
    returns an element in the list
    (no term)
    $ returns an element by name
  • List subsetting
    j <- list(name=c("Joe", "Doe"), salary=39000, permanent=TRUE)
    j[["salary"]]
    j$salary # shorthand for above
    j$sal    # "$" does partial matching - dangerous!
    j[[2]]   # extract by position
    
    [1] 39000
    [1] 39000
    [1] 39000
    [1] 39000
    
  • List indexing
    fieldname <- "salary"
    j[[fieldname]]  ## '[[' most robust way to extract a single element
    
    [1] 39000
    
  • List indexing
    j[2]
    
    $salary
    [1] 39000
    
  • [] vs [[]]
    j[[1]]
    Extracts a single item
    j[1]
    Extracts a sublist
  • [] vs [[]]
    class(j[["name"]])
    class(j["name"])
    
    [1] "character"
    [1] "list"
    
    • Note   B_note

      You can use [[ for vectors as well and it is not a bad idea when you are making clear you always want a single element.

  • Adding elements
    j$rnumber <- "R00001111"
    j
    
    $name
    [1] "Joe" "Doe"
    
    $salary
    [1] 39000
    
    $permanent
    [1] TRUE
    
    $rnumber
    [1] "R00001111"
    
  • Adding more elements
    length(j)
    j[[5]] <- "Extra"
    j[6:7] <- c(FALSE,FALSE)
    length(j)
    
    [1] 4
    [1] 7
    
  • Adding more elements
    j[3:7]
    
    $permanent
    [1] TRUE
    
    $rnumber
    [1] "R00001111"
    
    [[3]]
    [1] "Extra"
    
    [[4]]
    [1] FALSE
    
    [[5]]
    [1] FALSE
    
  • Deleting elements
    length(j)
    j$rnumber <- NULL
    length(j)
    names(j)
    
    [1] 7
    [1] 6
    [1] "name"      "salary"    "permanent" ""         
    [5] ""          ""
    
  • Inserting elements

    Either break apart and re-concatenate with c(), or use append():

    newj <- append(j, list(birthday="5/17/1990"), after=3)
    newj[3:5]
    
    $permanent
    [1] TRUE
    
    $birthday
    [1] "5/17/1990"
    
    [[3]]
    [1] "Extra"
    
  • Lists to vectors: unlist()
    l <- list(a=1,b=10,c=100)
    v <- unlist(l)
    v
    mode(l)
    mode(v)
    unname(v)
    
      a   b   c 
      1  10 100 
    [1] "list"
    [1] "numeric"
    [1]   1  10 100
    

5.2. Data Frames

  • Data Frames

    Like a matrix, but for mixed types. This is what you will use most often!

    d <- data.frame(kids=c("Jack","Jill","Igor"), ages=c(12,10,11))
    d
    
      kids ages
    1 Jack   12
    2 Jill   10
    3 Igor   11
    
  • Classes

    R is an object-oriented language in which objects are instances of classes. Much of R uses "S3" classes.

    mode(d)
    class(d)
    
    [1] "list"
    [1] "data.frame"
    
  • Reading data

    For now, we will assume you save your data as a comma separated values file (say, in microsoft excel using "save as"), and then read that file into a dataframe:

    fish <- read.csv("../data/fish-data.csv")
    # or via http:
    fish <- read.csv("http://r-research-tool.schwilk.org/data/fish-data.csv")
    head(fish, 3)
    
      aquarium nutrient color.score
    1        a      low         5.3
    2        a      low         4.7
    3        a      low         5.0
    
  • Try ?read.csv
    read.csv(file, header = TRUE, sep = ",", quote = "\"",
                  dec = ".", fill = TRUE, comment.char = "", ...) 
    
    • Note   B_note
      • read.csv is just a wrapper function (it is read.table with some defaults changed). Look at the arguments to read.table to understand.
  • Storing data as character delimited files

    Limitations:

    • The delim character cannot appear in a field unless it is quoted
    • The quote character should not occur at all in a field

    comma separated values (csv) and tab-delimited files (tsv) are both fairly common. You might also run into fixed-width data

5.3. Testing, inspecting, coercing

  • Vectors and lists relationships

    vector-tree.png

  • Testing
    • Vector types (internal storage mode)
      • is.logical(), is.integer(), is.double(), and is.character()
      • BUT AVOID is.vector(), is.atomic(), and is.numeric() because they don't do exactly what you expect.
      • mode(): Mutually exclusive classification according to structure with atomic modes named by type and recursive objects like lists and functions separate. typeof() will distinguish numeric types as integer vs double
    • class()
      • Property assigned to an object to determine how some generic functions like summary() work.
    • Lecture notes   B_note

      Class is just a property — you can change this.

  • Type coercion

    as.logical(), as.integer(), as.double(), or as.character()

    as.logical(0:3)
    as.logical(c("a", "b", "c"))
    as.character(0:3)
    as.logical(as.character(0:3))
    as.integer(as.character(0:3))
    
    [1] FALSE  TRUE  TRUE  TRUE
    [1] NA NA NA
    [1] "0" "1" "2" "3"
    [1] NA NA NA NA
    [1] 0 1 2 3
    

6. Introduction to Functions

6.1. Functions

  • Advice following HW02
    • Don't copy data around unnecessarily. Use good variable names.
    • Write code to obtain values
    • Don't hard-code indices.
    • Read style guide now.
  • Why write functions?

    Why not just copy and paste code?

    • You can give a function a name that makes your code easier to understand.
    • If you need to change the code, you only need to change it in one place.
    • You eliminate the chance of making mistakes when you copy and paste.
    • You can pass functions to other functions
  • When to write a function?

    Whenever you find yourself copying and pasting more than once or twice.

  • Functions are objects

    So we create them and assign them to a name.

    addone <- function(x) {
       return(x+1)
    }
    addone(1)
    
    [1] 2
    
  • Function example
    pluralize_if <- function(thenames, nums) {
      endings <- ifelse(nums > 1, "s", "")
      result <- paste(thenames, endings, sep="")
      return(result)
    }
    
    pluralize_if("computer", 44)
    pluralize_if("instructor", 1)
    pluralize_if(c("dog", "cat", "parakeet"), c(1,2,3))
    
    [1] "computers"
    [1] "instructor"
    [1] "dog"       "cats"      "parakeets"
    
  • Functions in R

    Arguments to functions can be required or optional. Optional arguments have default values provided in the function definition:

    pluralize_if <- function(thenames, nums=1) {
      endings <- ifelse(nums > 1, "s", "")
      result <- paste(thenames, endings, sep="")
      return(result)
    }
    
    pluralize_if("computer")
    pluralize_if(rep("cat", 3))
    pluralize_if(rep("dog", 3), c(1,2,3))
    
    [1] "computer"
    [1] "cat" "cat" "cat"
    [1] "dog"  "dogs" "dogs"
    

6.2. In class project 1

  • Function randomIntegers()
    • RandomIntegers(n, min_val, max_val)

      Function returns n random uniform integers between min_val and max_val, inclusive

      • Look up:
        ?runif
        ?round
        
  • Problem: convert random real numbers to integers

    rand_to_int.png

  • Your turn
    
    # randomIntegers: produce vector of uniformly
    # distributed random integers.
    #
    # Arguments:
    #   n = number of random integers to be generated.
    #         min_val, max_val = min, max possible values,
    #                      inclusive.
    # Returns:
    #         vector of random integers.
    
    randomIntegers <- function(n, min_val, max_val) {
    ## ????
      return(result)
    }
    

6.3. Notes on functions

  • Steps to writing a function
    1. Decide what information is needed (arguments) and what to return
    2. Develop the algorithm
    3. Implement in code
  • Potential problems
    • Syntax errors (error when you define the function)
    • Errors during run time (eg undefined variable names)
    • Runs with no errors but the output is wrong!
  • Debugging

    More on this later … For now:

    • use print()
    • use stopifnot()
  • Lecture notes   B_note

    Default printing only works in certain scope. You can use the print() function to be explicit and guarantee display to screen.

7. Control structures

7.1. if else

  • if-else control

    In R we can often treat function arguments and other objects with the same code regardless of values. If we need to make a decision based on element values, the ifelse() function is often enough.

    However, sometimes, we need to conditionally execute chunks of code.

  • if-else decisions
    tf <- function(a) {
      if (a == 13) {
        return("Lucky!")
      } else {
        return("No luck")
      }
    }
    
    tf(5)
    tf(13)
    
    [1] "No luck"
    [1] "Lucky!"
    
  • if ()

    The value in parentheses after the if should evaluate to a single boolean (logical) value. Note that if is not a function, we usually put space between if and (.

  • Be careful inverting boolean logic!
    A <- -8
    if (A>0 & A<10) { print("YES") } else { print("NO") }
    # If we want to invert the logic:
    if (! A>0 & !A<10) { print("YES") } else { print("NO") } # Not what we want
    if (! (A>0 & A<10) ) { print("YES") } else { print("NO") }
    
    [1] "NO"
    [1] "NO"
    [1] "YES"
    
  • if, else if, else
    tf <- function(r) {
      if (r == 13) {
        return("Lucky 13!")
      } else if (r < 0) {
        return("Negative!")
      } else {
        return("Just a number")
      }
    }
    
    tf(13)
    tf(-100)
    tf(10)
    
    [1] "Lucky 13!"
    [1] "Negative!"
    [1] "Just a number"
    
  • if needs a single logical value
    if (c(TRUE, FALSE)) {return("DONE")}
    
    Error in if (c(TRUE, FALSE)) { : the condition has length > 1
    
  • | vs || and & vs &&

    || and && can be safer in if tests: Require logical(1) arguments. These are "short circuiting"

    x <- 2; y <- 1:3
    if (x > 0 &&  x > y) {
      print("Yes")
    }
    
    Error in x > 0 && x > y 
    'length = 3' in coercion to 'logical(1)'
    
    • Lecture notes   B_note

      There are no true scalars in R so if you use, & or "|", then R will produce an output logical vector as long as the longest argument. But if/else decisions require a single result.

  • Short circuiting
    a
    TRUE || a
    TRUE | a
    
    Error: object 'a' not found
    [1] TRUE
    Error: object 'a' not found
    

7.2. Loops

  • for and while loops
    x <- c(1, 2, 3)
    for (i in x) { print(i^2) }
    
    [1] 1
    [1] 4
    [1] 9
    
    i <- 1
    while (i <= 10) {
      i <- i+4
      print(i)
    }
    
    [1] 5
    [1] 9
    [1] 13
    
    • Note   B_note

      When you start to use R in more powerful ways, you won't end up writing loops very often.

  • Breaking out of loops
    i <- 1
    while(TRUE) {
    i <- i+4
      if (i>10) break
    }
    i
    
    [1] 13
    

    Or use repeat:

    i <- 1
    repeat {
      i <- i+4
      if (i>10) break
    }
    i
    
    [1] 13
    
  • Loop hierarchy

    specific to general: for, while, repeat

    • So when do we use each type?
      • Use for loops when you need to iterate over a vector. There are a known number of iterations.
      • Use while loops when the loop must continue until some condition is met. If the condition is met when the loop is first encountered, the loop is skipped completely.
      • Use repeat as for while, but when the loop must execute at least once or when the best exit point(s) is/are somewhere in the middle of the loop block.
  • Two idioms for for loops

    Idiom 1: Do something to each element

    v <- c(2,4,5,8)
    for(n in v) {
      print(n)
    }
    
    [1] 2
    [1] 4
    [1] 5
    [1] 8
    
  • Idiom 2: Iterate over an indexing vector

    in order to do something to elements over multiple vectors

    v2 <- c(1,1,2,10)
    for(i in seq_along(v)) {
      r <- v[i] * v2[i]
      print(sprintf("Loop %d, Result: %d", i, r))
    }
    
    [1] "Loop 1, Result: 2"
    [1] "Loop 2, Result: 4"
    [1] "Loop 3, Result: 10"
    [1] "Loop 4, Result: 80"
    
  • seq_along vs :
    v <- c(2,4,5)
    for(i in 1:length(v) ) {
      r <- v[i] * v2[i]
      print(c(i, r))
    }
    v <- NULL  # Loop should not cycle even once
    for(i in 1:length(v)) {
      r <- v[i] * v2[i]
      print(c(i, r))
    }
    #print(1:0)
    
    [1] 1 2
    [1] 2 4
    [1]  3 10
    [1] 1
    [1] 0
    

8. Lists and apply functions

8.1. Functions review

  • The return() statement

    If you do not include one, R uses the last statement evaluated

    • This allows very concise functions:
      isEven <- function(x) x%%2 == 0
      

      OK for very short functions such as lambda functions (unnamed functions)

  • Side effects

    "Pure" functions have no side effects, only returned values. But you have probably already used functions entirely for their side effect (eg library(), write.csv(), ggplot())

  • Luckily, R protects you from most accidental side effects

    Function has read access to variables at higher scope, but no write access.

    x <- y <- 10
    func <- function(x) {
       x <- 0
       y <- y*2
       return(y)
    }
    func()
    x
    y
    
    [1] 20
    [1] 10
    [1] 10
    

8.2. In-class project

8.3. Apply functions

  • apply family of functions
    apply
    apply a function to the rows, columns or values in a matrix/array
    lapply
    apply a function to every item in a list and return a list
    sapply
    lapply but simplifies the result and returns a vector
    vapply
    Like `sapply` but safer when you know the element types you expect in the output. Use this when possible.
    mapply
    You have multiple data structures and want to hand elements of each as separate arguments to a function

    There are more in this family that we use less commonly. See ??apply and scroll down.

  • Applying functions to lists
    • lapply
      lapply(list(1:3,25:29), median)
      
      [[1]]
      [1] 2
      
      [[2]]
      [1] 27
      
      
    • sapply
      sapply(list(1:3,25:29),median)
      
      [1]  2 27
      
  • lapply example

    Consider sexing 10,000 shrimp. We'll use a subset. We'd like to know the indices for all females, males and juveniles:

    shrimp <- c("M", "F", "J", "J", "F", "F", "M", "F", "F")
    shrimp=="F"
    which(shrimp=="F")
    
    [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
    [1] 2 5 6 8 9
    
    • We can use apply here, how?
  • Example, continued
    codes <- c("F","M","J")
    lapply(codes, function(sex) which(shrimp==sex))
    
    [[1]]
    [1] 2 5 6 8 9
    
    [[2]]
    [1] 1 7
    
    [[3]]
    [1] 3 4
    
  • Whoa, where was the function?

    It was unnamed! Alternative with named function:

    
    # function to return indices of x in v
    indices <- function(x, v) {
      return(which(x==v))
    }
    # and use sapply for USE.NAMES=TRUE:
    sapply(codes, indices, v=shrimp, simplify=FALSE)
    
    $F
    [1] 2 5 6 8 9
    
    $M
    [1] 1 7
    
    $J
    [1] 3 4
    
  • Let's look at documentation
  • Lists of lists
    a <- list(name="Dylan", dept="Biology", role="faculty")
    b <- list(name="Jane", dept="Plant and Soil", role="student")
    c <- list(name="Kathryn", dept="NRM", role="student")
    course <- list(Schwilk=a, Doe=b, Smith=c)
    course[[3]]$role
    
    [1] "student"
    

    How can we extract a vector of just the first names?

  • sapply
    sapply(course,"[[","name")
    
    Schwilk       Doe     Smith 
    "Dylan"    "Jane" "Kathryn"
    
  • vapply

    The problem: R is flexible but that can lead to silent bugs

    test <- list(a = c(1, 3, 5), b = c(2,4,6), c = c(9,8,7))
    sapply(test, max)
    test$d <- c("a", "b", "c")
    sapply(test, max)
    
    a b c 
    5 6 9 
      a   b   c   d 
    "5" "6" "9" "c"
    
  • vapply
    vapply(test, max, numeric(1)) # error
    vapply(test[-4], max, numeric(1)) # works
    
    Error in vapply(test, max, numeric(1)) : values must be type 'double',
     but FUN(X[[4]]) result is type 'character'
    a b c 
    5 6 9
    
  • Apply functions to matrices: apply()

    y <- apply(X, MARGIN, FUN, ...)

    X
    an array
    MARGIN
    subscripts which the function gets applied over
    FUN
    the function to be applied
    ...
    additional arguments to function

    Returns a vector or array

  • Applying a function to rows or columns

    the apply() function. MARGIN argument = 1 if the function applies to rows, 2 for columns, etc.

    z <- matrix(c(1,2,3,4,5,6), ncol=2)
    z
    
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6
    
    apply(z, 2, mean)
    
    [1] 2 5
    
  • on rows, using our own function:
    f <- function(x) x/c(2, 8)
    y <- apply(z, 1, f)
    y
    
         [,1]  [,2] [,3]
    [1,]  0.5 1.000 1.50
    [2,]  0.5 0.625 0.75
    
  • What are "margins"?
    • d-dimensional arrays
      • can be split \(2^d - 1\) ways
      • example: 2-d matrix can be split 3 ways: rows, columns or cells
  • Examples: matrix
    • MARGIN = 1: slice up into rows
    • MARGIN = 2: slice up into columns
    • MARGIN = c(1,2): slice up into individual cells

    hadley-matrix1.jpg

    — Wickham (2011): http://www.jstatsoft.org/v40/i01/

  • 3-d arrays

    can be split \(2^3 - 1 = 7\) ways

    hadley-array1.jpg

    — Wickham (2011): http://www.jstatsoft.org/v40/i01/

  • Example: grid of genotypes

    Let's store genotypes as 1st dimension on a 2d landscape of organisms. So we will make a 3d array:

    makeGenotype <- function() sample(1:100, 25, replace=TRUE)
    # simple 10 x 10 landscape
    landscape <-  array(replicate(100, makeGenotype()),
                        dim =c(25,10,10 ))
    # The first genotype in landscape position 1,1:
    landscape[, 1, 1]
    
    
     [1]  62  90  35   1  44  76  50  29  96   2  38  98  54  34
    [15] 100   2  22  30  92  30  47   8  54  53  34
    
  • Example: apply
    ## get matrix of fitnesses
    testFitness <-  function(g, e) {
      return( 1 / ( sum((e-g)^2) + 1))
    }
    opt <- makeGenotype() # "best" genotype
    fitnesses <- apply(landscape, c(2,3), testFitness, e = opt)
    # lets see the relative fitnesses as percentages
    round(fitnesses / max(fitnesses)*100)
    
  • Results:
          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
     [1,]   39   48   50   31   47   47   42   49   35    33
     [2,]   34   75   30   38   60   48   38   37   65    45
     [3,]   57   35   54   38   35   38   42  100   37    30
     [4,]   35   39   49   56   47   37   47   47   40    32
     [5,]   45   40   68   60   49   52   47   47   39    30
     [6,]   52   39   37   44   47   36   36   27   45    42
     [7,]   31   54   41   54   41   59   33   32   47    37
     [8,]   31   89   51   40   52   36   37   37   26    36
     [9,]   25   37   38   34   31   54   51   53   35    65
    [10,]   57   50   31   30   45   45   37   45   51    72
    
  • Where is the most fit genotype?
    max(fitnesses)
    indices <-  which(fitnesses == max(fitnesses), arr.ind=TRUE)
    indices
    landscape[, indices[1], indices[2]]
    opt
    
    [1] 6.573758e-05
         row col
    [1,]   3   8
     [1] 78 87 54 73 81 47 44 28 39 56 51 30 36 92 60  7 22 98 51
    [20] 93 42  3 54 18 54
     [1]  80  93  10  94  64  21  48  51  59  40  20   1  30  92
    [15] 100  40  63  57  44  65  56  34  66  29  32
    
  • Heatmap of landscape:
    image(fitnesses)
    

    fig_heatmap.png

9. Loops and more functions

9.1. In class project

  • Project 1: calculating a probability

    Suppose we have n independent events each with a separate probability of occurring. What is the probability of exactly one of those events occurring?

    • If there are three events (A, B and C)

      P(exactly one event occurring) =

      • P(A and not B and not C) +
      • P(not A and B and not C) +
      • P(not A and not B and C)
  • The algorithm

    if the probability of the \(i^{th}\) event is \(p_i\), then P(A and not B and not C) would be be \(p_A(1 - p_B)(1 - p_c)\). For general n:

    \[ \sum\limits_{i=1}^{n} p_i (1 - p_1) \ldots (1 - p_{i-1}) (1 - p_{i+1}) \ldots{} (1 - p_n) \]

  • Your turn: translate this to code

    Function takes vector of probabilities and returns a scalar probability of exactly one event occurring.

    exactlyOne <- function(p) {
    ## ?
    }
    

    You will want to loop through the vector p while also looping through "not p" (1-p)

10. Function exercises

10.1. In class projects

  • Converting units

    Write a function called convertInches that accepts two arguments. The first is a numeric vector representing measurements in inches, the second is a character vector whose elements have the possible values "cm", "m", "ft". Depending on the second argument, the function converts inches to either centimeters, meters, or feet. Make the second argument an optional argument with a default value.

    ## convertInches: Converts measurements in inches to centimeters, meters, or feet.
    ##     Args:
    ##       x = a numeric vector containing measurements in inches units = a
    ##       character vector with possible values "cm", "m", or "ft"
    ##     Returns:
    ##       a vector representing the measurements in the requested units. 
    convertInches <- function(x, units="cm") {
      # your code here
    }
    
  • Above approach with lookup table
    conversions <- list(cm = 2.54, m = 0.0254, ft = 1/12)
    getConversionMultiplier <- function(units) {
      if (!(units %in% names(conversions))) {
        stop(paste("Unrecognized units. Allowed values are ", 
                    names(conversions)))
      }
      return(conversions[[units]])
    }
    
    convertInches3 <- function(x, units="cm") {
      multipliers <- sapply(units, getConversionMultiplier)
      return(x*multipliers)
    }   
    
    convertInches3(1:3, c("cm", "m", "ft"))
    
    • Lecture notes   B_note

      Only reason to use a separate function here is to have a clean place to put the error checking. Otherwise, we could just make a lambda function

  • Choose n points within a circle
    • Lecture notes   B_note

      \(r^2 = x^2 + y^2\)

  • Choosing a random point in a circle

    Several methods:

    1. Rejection sampling
    2. Choose random angle and random distance from origin (need to scale radius to account for area going up as square of distance from the center). Then convert from radians and distance to Cartesian coordinates.
    3. Archimedes and triangles (not shown, check out StackOverflow)
  • Approach #1
    • The uniform distribution has the nice property that if we reject a portion of the sampled space, the remaining points are uniform in the remaining space.
    • However, if we use rejection sampling, we can't predict how many points we will need ahead of time (we should end up needing about 27% more random points than we use)
    • So we need to use a while loop to grab the points, then only keep each point if sqrt(x^2 + y^2) <= r
  • result

    circleplot_1.png

11. Recursion

11.1. Recursion

  • Project 2: Fibonacci sequence

    0,1,1,2,3,5,8,13 …

    • Definition:

      \(F_1 = 0, F_2 = 1\) and \(F_n = F_{n-1} + F_{n-2}\)

    • fib()

      Write a function called fib(n) that returns the nth value in the Fibonacci sequence.

  • The recursion stack

    recursion-stack.png

  • "Exploding" dice

    In some table-top role playing games, dice are used to provide stochasticity in the simulation of events. In order to produce a more "exponential-like" distribution, some games use "acing" or "exploding dice" in which the die is re-rolled if the first roll results in the maximum value and the new value is added to the first.

    • Example:

      Some potential results from roll when "acing" is allowed on a 6-sided die (a 'd6'):

      • 2
      • 5
      • 6 + 5 = 11
      • 6 + 6 + 1 = 13
      • 4
  • Dice

    Photo of a set of multi sided dice. Set of 4-sided, 6-sided, 8-sided, 10-sided (units and decades), 12-sided, 20-sided and 30-sided dice. By Clément Bucco-Lechat https://commons.wikimedia.org/wiki/File:Set_of_roleplaying_dice.jpg Photo by Clément Bucco-Lechat, CC BY-SA 3.0.

  • Simulating such a die roll
    • rollDie()

      Write a function called rollDie(x) that simulates rolling a dice with x sides and allows "acing" or "exploding dice". Under this rule, a maximum roll (eg a "6" on a six-sided die) results in a second roll whose value is added to the first. This is open-ended and the second roll can "ace" as well: in other words, the resulting probability distribution is unbounded on the upper end.

  • rollDice() ?

    Similar function but returns result of n independent rolls?

    # RollDice: A dice rolling simulator which returns a vector 
    #           of optionally acing die rolls
    #
    #     args:
    #       n         = size of output vector desired
    #       die       = An integer representing the die type
    #       allow.ace = a logical value indicating if the rolls 
    #                   are allowed to "ace"
    #     -----------------------------------------------------
    #     returns:
    #       a vector of length n containing the simulated 
    #       die roll results
    rollDice <-  function(n, die, allow.ace = FALSE) {
      # ?
    }
    
    
    • Lecture notes   B_note

      Modify the rollDie() function we wrote so that it is similar to the builtin random number functions such as runif() and returns a vector rather than a scalar. rollDice(n,die) takes an initial argument, n, that gives the number of output values and returns a vector of length n. Remember, this die roll can be "open ended:" a maximum result (eg a 6 on a six-sided die) results in a second roll the result of which is added to the first. This is controlled by the "allow.ace" argument ("acing" is another term for "exploding" dice). So there is no theoretical upper limit to the result.

  • Testing
    # rollTester: simulates many rolls and returns proportions
    rollTester <- function(die) {
      n <- 10000
      rolls <- rollDice(n, die, TRUE)
      # hist(rolls)
      res <- sapply(1:(die-1), function(x) sum(rolls==x) / n)
      res[die] <- sum(rolls > die) / n
      res <- res * die # scale so that all proportions should be 1 not 1/die
      return(res) 
    }
    rollTester(die=8) # should all be about 1.0
    
    [1] 1.0152 1.0120 0.9792 0.9664 0.9776 1.0328 0.9976 1.0192
    

11.2. Pros / Cons

  • Recursion pros / cons
    • Recursion allows a very simple definition of the function
    • Can be memory hungry: function is added to stack and values kept until call finishes
    • Can result in "stack overflow" if the function is called too many times before the first finally returns
    • Can be slower than iterative version (but in some cases, can be faster)

12. Scope and design

12.1. Scope

  • Where do names apply?
    • Environment

      A function consists not only of its arguments and body, but also its environment: the collection of objects present when the function is created.

      Within a function, variables defined at a higher scope are visible, but are not writeable with <-

  • Nested functions
     w <- 12
     f <- function(y) {
       d <- 8
       h <- function() {
         return(d * (w+y) )
        }
       print(environment(h))
       return(h())
     }
    
    environment(f)
    ls()
    f(10)
    
    <environment: R_GlobalEnv>
    [1] "f" "w"
    <environment: 0x5d0f9a9b8b50>
    [1] 176
    

    Note: confusingly, we refer to the top environment as .GlobalEnv in code.

  • Scope hierarchy
    • w is in global env which is parent to f().

      f is parent to h(). d is local to f() but in parent env to h().

      f(2)
      h()
      
      <environment: 0x5610aaa30fc0>
      [1] 112
      Error in h() : could not find function "h"
      
  • Scope hierarchy con.
    f <- function(y, ftn) {
      d <- 8
      print(environment(ftn))
      return(ftn(d, y))
    }
    
    h <- function(dd, yy) {
       return(dd * (w + yy))
    }
    
    w <- 12
    f(3, h)
    
    <environment: R_GlobalEnv>
    [1] 120
    
  • Scope and assignment

    In general, R functions do not change variables at a higher scope

    x <- 1
    increment.x <- function() {
      print(x)
      x <- x+1
      print(x)
    }
    increment.x()
    x
    
    [1] 1
    [1] 2
    [1] 1
    
  • Super assignment
    increment.x <- function() {
       x <<- x+1
    }
    x
    increment.x()
    x
    
    [1] 1
    [1] 2
    
  • assign()

    Write to upper level variables using character strings for names

    filelist <- list.files(path="../handouts/", pattern="vocab_week_[0-9][0-9].org$")
    #filelist
    for(s in filelist) {
      assign(s, readLines(file(file.path("../handouts/", s))) )
    }
    ls()
    
     [1] "f"                 "filelist"         
     [3] "h"                 "increment.x"      
     [5] "s"                 "vocab_week_10.org"
     [7] "vocab_week_11.org" "vocab_week_12.org"
     [9] "vocab_week_13.org" "vocab_week_14.org"
    [11] "w"                 "x"
    

12.2. Programming style and organization

  • Programming style

    "Design" and "style" are related but separate ideas. Design is more important but style matters, too.

    • Good style aids maintenance
      • Use comments to tell the readers what they need to know
      • Use indentation to mark the levels of program control
      • Use meaningful names
      • Develop a convention for names (global variables vs other variables vs function names)
      • Use standard idioms where appropriate
      • Avoid unnecessary complexity. Avoid code repetition!
  • Function Documentation

    For example, Google's style:

    calculateSampleCovariance <- function(x, y, verbose = TRUE) {
      # Computes the sample covariance between two vectors.
      #
      # Args:
      #   x: One of two vectors whose sample covariance is to be calculated.
      #   y: The other vector. x and y must have the same length, greater than one,
      #      with no missing values.
      #   verbose: If TRUE, prints sample covariance; if not, not. Default is TRUE.
      #
      # Returns:
      #   The sample covariance between x and y.
      n <- length(x)
      # Error handling
      if (n <= 1 || n != length(y)) {
        stop("Arguments x and y have invalid lengths: ",
             length(x), " and ", length(y), ".")
      }
      if (TRUE %in% is.na(x) || TRUE %in% is.na(y)) {
        stop(" Arguments x and y must not have missing values.")
      }
      covariance <- var(x, y)
      if (verbose) cat("Covariance = ", round(covariance, 4), ".\n", sep = "")
      return(covariance)
    
  • Organizing tasks (design)
    • Procedural programming

      is the concept of developing programs based on sub tasks (procedures = functions). The sub-tasks are called and connected by a single parent task that coordinates all of the procedures into a working solution.

  • Modularization

    Modularization is the process programmers use to break up complex programs into smaller more manageable procedures.

    • Divide and Conquer!
      • Review overall goal and identify subtasks
      • Make each function perform one small task and test it separately.
      • Keep functions small enough to understand
      • If you find yourself repeating code, reevaluate!
  • Constants
    • What is a "constant?"

      An identifier whose associated value cannot (or should not) be altered by the program during its execution

  • Global variables and constants

    It can be useful to put some infrequently changing constants at the top of a script file:

    GRID_SIZE <- 10
    NUM_ITERATIONS <- 10000
    

    Then use these global variables, perhaps as default values to functions:

    makeGrid <- function(gridSize = GRID_SIZE) {
      g <- matrix(1:(gridSize^2), nrow=gridSize)
      return(g)
    }
    
  • Multiple scripts

    In larger projects in can be helpful to break up the code into multiple files. For example:

    • read_clean_data.R
    • run_models.R
    • ms_figs.R
    • run_all.R

    In this example, run_all.R contains "steering code and might just look like

    source("./scripts/read_clean_data.R")
    source("./run_models.R")
    # etc
    

13. Debugging

13.1. Debugging approaches

  • Debugging

    "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."

    — Brian W. Kernighan.

  • What is a bug?
    • Code fails (produces error)
    • Code runs but produces the wrong result
    • Lecture notes   B_note

      A syntax error is not a bug. That is something R catches as it reads the instructions.

  • Debugging

    Differential diagnosis

    • Characterize the error
    • Localize the error
    • Fix the error (and don't introduce a new bug!)
  • Characterizing the bug
    • Make the error reproducible
      • Can we always get the error when re-running the same code and values?
      • If we start the same code in a clean workspace, does the same thing happen?
    • Bound the error
      • How much can we change the inputs and get the same error? A different error?
      • For what inputs (if any) does the bug go away?
      • How big is the error?
    • Get more information

      Add extra output (e.g., number of optimization steps, did the loop converge, final value of optimized function)

  • Localizing the bug

    Tools:

    • traceback (where did an error message come from?)
    • Our own extra debugging statements: print, warning, stopifnot (messages from the code as it goes)
    • Trying controlled inputs
    • Interactive debugging
  • traceback
    • When an R function fails, an error is printed to the screen. Immediately after the error, you can call traceback() to see in which function the error occurred.
    • The traceback() function prints the list of functions that were called before the error occurred.
    • The functions are printed in reverse order.
  • traceback example
    f <- function(x) {
      r <- x - g(x)
      r
    }
    
    g <- function(y) {
      r <- y * h(y)
      r
    }
    
    h <- function(z) {
      r <- log(z)
      if(r < 10) r^2
      else r^3
    }
    f(-1)
    traceback()
    
  • traceback results
    Error in if (r < 10) r^2 else r^3 (from #3) : missing value where TRUE/FALSE needed
    In addition: Warning message:
    In log(z) : NaNs produced
    3: h(y) at #2
    2: g(x) at #2
    1: f(-1)
    
  • Interactive debugging

    traceback() does not always tell you where the error originated. In order to know which line causes the error, you may want to step through the function using debug().

  • debug() and browser()
    • debug(foo) flags the function foo() for debugging.
    • undebug(foo) unflags the function.
    • When a function is flagged for debugging, each statement in the function is executed one at a time. After a statement is executed, the function suspends and you can interact with the environment.
    • After you obtained necessary information from the environment, you can let the function to execute the next statement.
    • You can call browser() from any point in your code to enter the debugger. This is useful if you know about where the bug is and you don't want to start at the beginning of the function.
  • debug example
    ## buggy version:
    mateVectors <- function(x, y, r) {
      child <- y
      usingX <- runif(1) <= 0.5
      for (i in x) {
        if(usingX) {
          child[i] <- x[i]
        }
        if(runif(1) <= r) {
          !usingX
        }
      }
      return(child)
    }
    
  • Trying the buggy version
    set.seed(1001)
    mateVectors(1:10, 11:20, 1)
    mateVectors(c("A","B","C"), c("a","b","c"), 0.3)
    
    
     [1] 11 12 13 14 15 16 17 18 19 20
                  A   B   C 
    "a" "b" "c"  NA  NA  NA
    
  • Let's debug this function
    • Useful debug commands
      • 'n' next line
      • 'c' Continue to end of context (finish loop, or finish function)
      • 'where' print a stack trace of all active function calls
      • 'Q' quit the debugger
    • Lecture notes   B_note

      So: with numeric inputs produces kind of ok result but no recombination. With character input it produces NAs

      Two errors: wrong `for` idiom and also does not do actual crossover.

  • Buggy swapHalves()
    swapHalves <- function(x) {
      vlen <- length(x)
      if(vlen %% 2 == 0) { # even case
        half <- vlen %/% 2
        firstHalf <- 1:half
        secondHalf <- -1 * 1:half
        result <- c(x[secondHalf], x[firstHalf])
      } else {
        firstHalf <- 1 : vlen %/% 2
        midValue <- vlen %/% 2 + 1
        secondHalf <- vlen%/%2 + 2 : vlen
        result <- c(x[secondHalf], x[midValue], x[firstHalf])
      }
      return(result)
    }
    
  • Buggy swapHalves() test
    print("even:")
    even <- c("a","b","c","d")
    print("odd:")
    odd  <- c("a","b","c","d","e")
    swapHalves(even)
    swapHalves(odd)
    
    [1] "even:"
    [1] "odd:"
    [1] "c" "d" "a" "b"
    [1] "d" "e" NA  NA  "c" "a" "a" "b" "b"
    
  • Debugging using RStudio

    RStudio includes a visual debugger. Have you noticed the "environment" window? It shows values for variables in the current environment. When debug is active, this shows the function being debugged.

    • RStudio’s error inspector and traceback()
    • RStudio’s "Rerun with Debug" tool and options(error = browser) which open an interactive session where the error occurred.
    • RStudio’s breakpoints and browser() which open an interactive session at an arbitrary location in the code.

14. Data Frames and Factors

14.1. Creation and operations

  • Transition point in course
    • Thus far:
      • Basics of R syntax
      • Manipulating vectors
      • Overview main R data types
      • Functions, algorithms, debugging
    • Next:
      • More on data frames and factors
      • Math operations and how computers store numbers
      • String operations and regular expressions
      • Dates and times and computers
      • The Grammar of Graphics (more ggplot2)
      • Data manipulation tidying and analysis using tidyverse tools.
  • Today: more detail on data frames

    Most useful data structure for analyses! Has characteristics of lists and matrices.

    kids <- c("Jack","Jill")
    ages <- c(10,12)
    d <- data.frame(kids, ages)
    d
    d[2,2]
    
      kids ages
    1 Jack   10
    2 Jill   12
    [1] 12
    
  • Extract columns 3 ways
    d[[1]]
    d$kids
    d[,1]
    
    [1] "Jack" "Jill"
    [1] "Jack" "Jill"
    [1] "Jack" "Jill"
    
  • Add to data frame

    rbind and cbind work as with matrices

    d <- rbind(d, list("Alberita", 11))
    d
    
          kids ages
    1     Jack   10
    2     Jill   12
    3 Alberita   11
    
    • See also

      bind_rows and bind_cols from the dplyr package. rbind_rows will concatenate a list of data frames filling missing columns with NA. I use this most of the time. More on this later in the course.

  • Merging
    kids <- c("Jack","Joan","Alberita")
    states <- c("CA","TX","TX")
    d2 <- data.frame(kids,states)
    merge(d, d2)
    
          kids ages states
    1 Alberita   11     TX
    2     Jack   10     CA
    
  • Merging (all, all.x, all.y)
    d3 <- merge(d, d2, all = TRUE)
    names(d3) <- c("name","age","state")
    d3
    
          name age state
    1 Alberita  11    TX
    2     Jack  10    CA
    3     Jill  12  <NA>
    4     Joan  NA    TX
    
  • Merging (all.x)
    d4 <- merge(d, d2, all.x = TRUE)
    d4
    
          kids ages states
    1 Alberita   11     TX
    2     Jack   10     CA
    3     Jill   12   <NA>
    
  • Tool: str() to examine complex objects
    str(d4)
    #str(hist(cars$speed))
    
    'data.frame':	3 obs. of  3 variables:
     $ kids  : chr  "Alberita" "Jack" "Jill"
     $ ages  : num  11 10 12
     $ states: chr  "TX" "CA" NA
    
  • Data frames: subset()

    Convenience function more concise than using [. Names assumed to be columns.

    subset(d3, state=="TX")
    subset(d3, age > 10)
    
          name age state
    1 Alberita  11    TX
    4     Joan  NA    TX
          name age state
    1 Alberita  11    TX
    3     Jill  12  <NA>
    
  • Data frames: add new columns
    d3$birth.year <- 2012 - d3$age
    d3
    
          name age state birth.year
    1 Alberita  11    TX       2001
    2     Jack  10    CA       2002
    3     Jill  12  <NA>       2000
    4     Joan  NA    TX         NA
    
  • tibbles

    Faster, more restrive data frames used by tidyverse functions:

    tibble::tibble(d3)
    
    # A tibble: 4 × 4
      name       age state birth.year
      <chr>    <dbl> <chr>      <dbl>
    1 Alberita    11 TX          2001
    2 Jack        10 CA          2002
    3 Jill        12 <NA>        2000
    4 Joan        NA TX            NA
    
  • More on data frames
    • Read data frame from file

      read.table() and read.csv()

    • Many matrix operations work on dataframes

      Eg rowMeans(), colMeans()

14.2. Merging examples

  • Data is often stored in separate files
    oak_wp <-
    read.csv("http://r-research-tool.schwilk.org/data/oak-water-potentials-simple.csv")
    species <- read.csv("http://r-research-tool.schwilk.org/data/oak-species.csv")
    head(oak_wp, 3)
    
       site tag spcode       date pd.psi md.psi year
    1 GATEN 110   QUEM 10/24/2010     NA     NA   10
    2 GATEN 111   QUEM 10/27/2010     NA     NA   10
    3 GATEN 113   QUEM 10/24/2010     NA     NA   10
    
    head(species, 3)
    
        genus specific.epithet   family spcode display.name CM DM
    1 Quercus           emoryi Fagaceae   QUEM    Q. emoryi  1  1
    2 Quercus         gambelii Fagaceae   QUGA  Q. gambelii  1  1
    3 Quercus         gravesii Fagaceae  QUGR2  Q. gravesii  1  1
      GM
    1 NA
    2  1
    3 NA
    
  • Plot data, but use real species names
    oak_wp <- merge(oak_wp, species)
    names(oak_wp)
    
     [1] "spcode"           "site"             "tag"             
     [4] "date"             "pd.psi"           "md.psi"          
     [7] "year"             "genus"            "specific.epithet"
    [10] "family"           "display.name"     "CM"              
    [13] "DM"               "GM"
    
  • Plot data with nice display names
    library(ggplot2)
    ggplot(oak_wp, aes(display.name, md.psi)) + geom_boxplot() +
      xlab("") + ylab("Mid-day water potental (MPa)")
    

    figggplot_oakdispl.png

  • Merging is also called "joining"

    In base R the function is merge(). In dplyr (tidyverse), there are several separate functions: left_join(), right_join(), full_join(), etc. We will learn about those later.

14.3. Factors

  • Grouping variables
    # head(iris)
    class(iris$Species)
    levels(iris$Species)
    
    [1] "factor"
    [1] "setosa"     "versicolor" "virginica"
    
  • Factors
    • stored as integers
    • but have a class attribute "factor" and an attribute "levels" which defines the allowed values.
    shrimp <- c("M", "J", "J", "M", "M")
    shrimp.factor <- factor(shrimp, levels = c("J", "F", "M"))
    
    table(shrimp.factor)
    
    shrimp.factor
    J F M 
    2 0 3
    
  • Factors vs strings
    • We often store factors as strings in a csv file
    • read.csv() will convert them to factors but how can it know the allowed levels?
    • Leave factors stored as strings alone until a factor is needed (use stringsAsFactors=FALSE in read.csv())
  • Factors vs strings

    Factors are stored as integers with a display name

    v <- c("3", "2", "10")
    factor(v)
    vf <- as.factor(v)
    vf
    as.numeric(vf[3])  ## How did "10" become [1] ?
    
    
    [1] 3  2  10
    Levels: 10 2 3
    [1] 3  2  10
    Levels: 10 2 3
    [1] 1
    
  • Factors: reassign levels
    v <- c("Med", "High", "Low", "Low", "Med")
    factor(v)
    vf <- factor(v, levels=c("Low", "Med", "High"))
    vf
    
    [1] Med  High Low  Low  Med 
    Levels: High Low Med
    [1] Med  High Low  Low  Med 
    Levels: Low Med High
    
  • Factors: rename levels
    # vf
    levels(vf) <- tolower(levels(vf))
    # equivalent to 
    # levels(vf) <- c("low", "medium", "high")
    vf
    
    [1] med  high low  low  med 
    Levels: low med high
    
  • split()
    split(iris$Sepal.Length,iris$Species)
    
    $setosa
     [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7
    [20] 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9
    [39] 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0
    
    $versicolor
     [1] 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2
    [20] 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3
    [39] 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7
    
    $virginica
     [1] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
    [20] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4
    [39] 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
    
    
  • Subsetting a dataframe

    Shorthand for [ operator use

    x <- subset(iris, Species == "virginica")
    head(x)
    
        Sepal.Length Sepal.Width Petal.Length Petal.Width
    101          6.3         3.3          6.0         2.5
    102          5.8         2.7          5.1         1.9
    103          7.1         3.0          5.9         2.1
    104          6.3         2.9          5.6         1.8
    105          6.5         3.0          5.8         2.2
    106          7.6         3.0          6.6         2.1
          Species
    101 virginica
    102 virginica
    103 virginica
    104 virginica
    105 virginica
    106 virginica
    
  • tapply()

    Apply by factor level:

    tapply(iris$Sepal.Length, iris$Species, sd)
    
       setosa versicolor  virginica 
    0.3524897  0.5161711  0.6358796 
    
  • tapply() with multiple factors
    oak_wp <-
    read.csv("http://r-research-tool.schwilk.org//data/oak-water-potentials-simple.csv")
    tapply(oak_wp$md.psi, list(oak_wp$year, oak_wp$spcode), mean, na.rm=TRUE)
    
            QUEM      QUGA     QUGR2    QUGR3      QUHY
    10       NaN -1.818000 -1.900000      NaN -1.632500
    11 -2.417674 -2.920455 -4.297826 -3.42000 -2.426383
    12 -1.666667 -1.780000 -2.440000 -2.51500 -1.236000
    13 -1.587308 -1.400000 -1.686364 -1.94359 -1.215000
    
    • Note   B_note

      tapply returns a matrix. This is compact, but not the best form for some further analysis where we want the grouping factors to still be columns.

  • aggregate() will return a dataframe
    df <- aggregate(oak_wp$md.psi, list(yr=oak_wp$year, spcode=oak_wp$spcode),
                    mean, na.rm=TRUE)
    head(df)
    
      yr spcode         x
    1 10   QUEM       NaN
    2 11   QUEM -2.417674
    3 12   QUEM -1.666667
    4 13   QUEM -1.587308
    5 10   QUGA -1.818000
    6 11   QUGA -2.920455
    
    • Note   B_note

      If we name the items in the list that we hand to the by argument to aggregate, then we get names columns for the grouping variables in the output. Note that the na.rm argument was handed down to mean.

  • aggregate() with formula notation
    df <- aggregate(md.psi ~ year+spcode, data=oak_wp, FUN=mean, na.rm=TRUE)
    head(df)
    
      year spcode    md.psi
    1   11   QUEM -2.417674
    2   12   QUEM -1.666667
    3   13   QUEM -1.587308
    4   10   QUGA -1.818000
    5   11   QUGA -2.920455
    6   12   QUGA -1.780000
    
  • Shrimp again: split() and aggregate()
    shrimp <- c("M", "F", "J", "J", "F", "F", "M", "F", "F")
    split(seq_along(shrimp), shrimp)
    
    $F
    [1] 2 5 6 8 9
    
    $J
    [1] 3 4
    
    $M
    [1] 1 7
    
  • Frequencies:
    aggregate(shrimp, list(sex=shrimp), length)
    
      sex x
    1   F 5
    2   J 2
    3   M 2
    
  • Final thoughts
    • You will run into all these base R functions like tapply, aggregate, split. They are useful
    • Later in this course we will learn some more consistent (tidyverse) ways to accomplish these sorts of split-apply-recombine idioms.

15. Math and numbers

15.1. Number systems

  • A note on integers vs "floating point" numbers
    typeof(2)
    typeof(as.integer(2))
    typeof(c(1,2,3))
    typeof(1:3)
    typeof(c(1L,2L,3L))
    
    [1] "double"
    [1] "integer"
    [1] "double"
    [1] "integer"
    [1] "integer"
    
    • Note   B_note

      R has integers and real numbers. To enter integer numeric literals you must follow numeral with an "L". Otherwise R assumes you want a double.

      The information that follows is to help you have some idea of how things work under the hood on the computer. You don't need to memorize or fully understand how numbers are represented in order to use R.

  • Positional number systems
    • Simply ways to count things. Ours is the base-10.
    • There is no symbol "10" in base-10 and no symbol "2" in base-2, etc
    • Each column stands for a power of the base (eg powers of 10)
    • 1,357,896:
      1 3 5 7 8 9 6
      106 105 104 103 102 101 100
  • Binary: base-2
    • A single binary number is called a B inary dig IT, or bit.
    • Computers perform operations on binary number groups called "words". Today, most computers use 32-, 64-, or 128-bit words.
    • Words are subdivided into 8-bit groups called bytes.
    • 25510 = 111111112

      \( 1 \times 2^7 + 1 \times 2^6 + 1 \times 2^5 + 1 \times 2^4 + 1 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 \)

      \(= 2^8 - 1 \)

  • Hexadecimal: base-16
    Dec Hex Binary Dec Hex Binary
    0 0 0000 8 8 1000
    1 1 0001 9 9 1001
    2 2 0010 10 a 1010
    3 3 0011 11 b 1011
    4 4 0100 12 c 1100
    5 5 0101 13 d 1101
    6 6 0110 14 e 1110
    7 7 0111 15 f 1111
  • Hexadecimal example

    hex-example.png

    • Lecture notes   B_note

      16 * 5 + F (=15) = 95

  • You might have seen HTML colors

    written as 6 digit hexidecimal (really three 2-digit values)

    #CD5C5C means red=205 (CD), green=92 (5C), blue=92 (5C)

    • Lecture notes   B_note

      A 2 digit hex is 4 bits is 256 possible values (0-255) = FF

  • Negative numbers

    Negative numbers: signed integers

    • Takes an extra bit of information because it doubles the size we must be able to store.
    • Multiple ways to represent (Use first bit for sign, top half of range, or 2's complement)
    • Last way is what is used in computers — but you don't need to worry about representation.
  • Integer Overflow

    If we have a 1-byte (8 bit) integer in which the first bit stores the sign:

    01011101 : 93
    + 01101011 : 107
    \= 11001000 : -56
    • Less of an issue now that R uses 53 bit integers and converts to double as necessary

      This code used to return a result that was wrong rather than NA

      .Machine$integer.max
      .Machine$integer.max + 1L
      
      [1] 2147483647
      [1] NA
      

15.2. Floating point

  • Real numbers on computers: floating point
    • Problem:

      How to represent real numbers with integers? Binary point?

      101.011:

      1 0 1 . 0 1 1
      22 21 20 . 2-1 2-2 2-3
    • Solution:

      Fix the location of the binary point. Think of scientific notation with fixed precision: \(6.020123 \times 10^{23}\)

      In binary: \(\pm\) X.YYYYYY \(\times 2^{\pm N}\)

  • IEEE single precision: 32 bits, one word

    Simply "float" type in most languages: \(\pm\) 1.YYYYY \(\times 2^{N-127}\)

    • assumes X is always 1 (no leading zeros, trick to get one more bit of precision)
    • 1 sign bit
    • 8 bit biased exponent (n-127)
    • 23 bit mantissa (YYYYY)
  • "single precision floating point": 32 bits

    IEEE_754_Single_Negative_Zero.jpg

  • Different floating point types

    IEEE Standard for Floating-Point Arithmetic (IEEE 754) (1980s):

    • Single precision, 32 bits. Called "float" in the C language family. It has a one sign bit, 23 bit mantissa (so 24 bits of precision with the assumed 1) and 8 bit exponent .
    • Double precision, called "double" in the C language family, occupies 64 bits (8 bytes) and its significand has a precision of 53 with the assumed 1. Sign is 1 bit, exponent is 11 bits and mantissa is 52 bits.

    In R there are only 2 numeric types: integer and numeric (double precision). "Single precision" aka "float" is not used much anywhere nowadays.

  • Note   B_note
    • Some issues
      • Doubles types ("numeric" class) cannot represent all 64-bit integers! But since R uses 32 (now 53) bit integers, double can store more integers than can the integer type. See =?integer
      • Rounding error: The only numbers that can be represented exactly in R's numeric type are integers and fractions whose denominator is a power of 2. Other numbers have to be rounded to (typically) 53 binary digits accuracy.
  • In R, numeric literals are treated as floating point
    class(c(1,2,3))
    
    [1] "numeric"
    
    class(1:3)
    class(c(2L, 3L))
    
    [1] "integer"
    [1] "integer"
    
  • Rounding error

    64 bits are not enough to store every possible value on the number line!

    1.0 - 0.9 - 0.1
    
    [1] -2.775558e-17
    
  • Floating point error

    As a result, two floating point numbers will not reliably be equal:

    a <- sqrt(2)
    a * a == 2
    a * a - 2
    
    [1] FALSE
    [1] 4.440892e-16
    

    The function all.equal() allows you to compare numeric values. It basically tests abs(a - b) < epsilon value.

  • Testing equality of floating point numbers

    Compare all.equal, identical and ==

    all.equal(a*a, 2)  # see documentation for using in if statement
    identical(a*a, 2.0)
    all.equal(as.double(8), as.integer(8))
    identical(as.double(8), as.integer(8))
    
    [1] TRUE
    [1] FALSE
    [1] TRUE
    [1] FALSE
    

16. Strings

16.1. Text on computers

  • How to represent characters?

    Simple idea: numeric code for each character. Originated at Bell labs in 1960.

    • ASCII: American Standard Code for Information Interchange
      • 7 bits (128 symbols)
      • On modern computers, stored in one byte with zero as first bit
      • For example: "A" = 01000001
  • ASCII table

    0–31 used for "unprintable" characters (BEEP! page feed, etc)

    800px-ASCII_Code_Chart.png

  • All was good

    Assuming you were an English speaker.

    • Extended ASCII

      We have 8 bits! What about codes 128-255?

      • IBM-PC: The "OEM" character set: European accented characters and drawing characters.
      • But everyone had a different idea of what to do with the upper codes.
  • ANSI standard

    MS windows type solution to inconsistency in upper set use: define "code pages"

    • Same as ASCII for 0-127
    • But different "code pages" for upper set.
    • So everything was crazy but ok

      As long as you spoke a western language where 256 symbols is enough.

      In Asia, there was a whole other mess using a mixture of one and two-byte characters.

  • Unicode

    Basic idea started in 1980s: two bytes. Then expanded to more (up to 3 bytes) Enough to store every character in every language.

    • Originally: \(2^{14} = 16,384\) (2 bits reserved)
    • Currently: 154,998 code points (characters)
    • Every letter maps to a "code point" — a platonic ideal of a letter.
    • "A" is different from "a", but "A" in Times New Roman is the same as "A" in Arial
    • Notation is "U+" followed by at least four hexadecimal digits. "Hello" = U+0048 U+0065 U+006C U+006C U+006F
    • Since 2010: many emoji code points
    • Overseen by Unicode Consortium (part of United Nations).
    • Lecture notes   B_note

      There are separate code points for some italic and bold characters in some languages when those have different meaning (eg math)..

  • Encodings

    Still the issue of exactly how to store the numbers:

    There were multiple standards. Now:

    • UTF-8

      Idea: use one byte for code points 0-127, and only use two for points above that. UTF-8 is the de facto standard encoding for unicode in US.

      You can read in utf-8 encoded text in R. In fact utf-8 is probably the current default ("locale"):

      Sys.getlocale()
      
      [1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
      
    • Lecture notes   B_note

      Unicode adoption into email systems has been slow and you might see hiccups since utf8 is not universal.

      Most fonts only support a small set of unicode needed for a particular script/language.

  • Unicode strings
    a <- c("\U00B5", "µ")  # note that "\" is escape char in R strings
    a
    
    [1] "µ" "µ"
    
  • You can use unicode (utf-8) in R scripts

    But may be awkward to enter depending on keyboard and text editor

    µ <- 0.0001
    x <-1000
    2*x*µ
    
    [1] 0.2
    
  • Note   B_note

    pi symbol ("\U03C0") won't show up in my lecture slides because of my export system. So I illustrate with mu

16.2. Strings

  • Character objects (strings)

    A "character" object is used to represent string values in R. We convert objects into character values with the as.character() function:

    x = as.character(pi)
    x
    typeof(x)
    
    [1] "3.14159265358979"
    [1] "character"
    
  • Formatting strings

    using paste():

    first <- "Dylan"
    last <- "Schwilk"
    n <- paste(first, last)
    n
    paste(last, first, sep=", ")
    
    [1] "Dylan Schwilk"
    [1] "Schwilk, Dylan"
    
  • Formatting strings

    using sprintf():

    s <- sprintf("%s has %d dollars", n, 50)
    s
    sprintf("%s has $%2.2f", n, 50)
    
    [1] "Dylan Schwilk has 50 dollars"
    [1] "Dylan Schwilk has $50.00"
    
    • Lecture notes   B_note

      See documentation for full details on format specification. s, d and f are most common letters used, but there are ways to indicate octal and hexadecimal, to pad leading zeros, etc.

  • Substrings

    substr() and substr<-()

    substr(s, 7, 13)
    
    [1] "Schwilk"
    

    Assignment:

    substr(s, 7, 13) <- "SCHWILK"
    s
    substr(s, 7, 13) <- "???"
    s
    
    [1] "Dylan SCHWILK has 50 dollars"
    [1] "Dylan ???WILK has 50 dollars"
    
  • Substitution
    sub("???WILK", "S.", s, fixed=TRUE)
    
    [1] "Dylan S. has 50 dollars"
    

16.3. String operations

  • Converting strings
    tolower(s)
    toupper(s)
    
    [1] "dylan ???wilk has 50 dollars"
    [1] "DYLAN ???WILK HAS 50 DOLLARS"
    
  • Built-in string functions
    grep()
    Searches for a substring, like the Linux command of the same name
    nchar()
    Finds the length of a string
    paste()
    Assembles a string from parts
    sprintf()
    Also assembles a string from parts
    substr()
    Extracts a substring
    strsplit()
    Splits a string into substrings
  • Example: testing for a file suffix
    # tests whether the file name `fname` has the suffix `suffix`,
    # e.g. "abc" in "x.abc"
    hasSuffix <- function(fname, suffix) {
      ncf <- nchar(fname)               # nchar() gives the string length
      dotpos <- ncf - nchar(suffix) + 1 # dot would start here if there is one
      # now check that stuff is at the end of the name
      return(substr(fname, dotpos, ncf) == suffix)
    }
    
    hasSuffix("file.csv", "csv")
    hasSuffix("file.csv", "txt")
    hasSuffix("file.textinfo", "textinfo")
    
    [1] TRUE
    [1] FALSE
    [1] TRUE
    
  • grep() and grepl()

    Takes its name from the linux/unix command.

    v <- c("beach", "beech", "back", "buster", "abea")
    grep("bust", v)
    
    v[grep("bust", v)]
    
    
    [1] 4
    [1] "buster"
    
    grepl("bust", v)
    
    [1] FALSE FALSE FALSE  TRUE FALSE
    

16.4. Regular expressions

  • Regular Expressions

    Sometimes we need to look for patterns, not specific strings

    • "regex"
      • Come from regular set theory (1950s)
      • Formal way of specifying a set string matches
      • Really are a whole mini computer language
      • Used in many command line tools (eg grep), and implemented as at least a library in most computer languages
      • Most text editors and word processors allow regex search and replace
  • Regex basics
    • Single character wildcard: "."
    • Single match from multiple options: [a-z] or [acdf-h]. Matches one of the characters in brackets or implied by range (hyphen is non-literal here).
    • [a-zA-Z] should match any ASCII alphabetic letter
    • To match all EXCEPT a set start the set with a ^: [^,] matches any character except a comma.
  • Examples
    v <- c("A27", "B30", "c56", "123", "33")
    grep(".[0-9]", v)
    
    [1] 1 2 3 4 5
    
    grep(".[0-9][0-9]", v)
    grep("[A-Z][0-9][0-9]", v)
    grep("[^0-9][0-9][0-9]", v)
    
    [1] 1 2 3 4
    [1] 1 2
    [1] 1 2 3
    
  • Positional matching
    • ^ matches beginning of string (note different meaning than when in brackets)
    • $ matches end of string
    • \b matches word boundary (empty string where non-word char followed by word char)
  • Positional matching example
    x <- unlist(strsplit("Out in the west Texas town of El Paso", split=" "))
    # capitalized words:
    grep("^[A-Z]", x)
    
    [1] 1 5 8 9
    
  • Matching word boundaries
    v <- c("dylan Schwilk", "Dylan schwilk", "dylanSchwilk", "dylan.SCHWILK")
    grep("\\b[A-Z]", v)
    
    [1] 1 2 4
    
  • Matching end of string
    v <- c("Dylan Schwilk", "dylan schwilk", "Schwilk, Dylan")
    grep("[Ss]chwilk$", v)
    
    [1] 1 2
    
  • Matching multiples

    Modify preceding character or expression

    • "*" Matches any number of preceding expression (zero or more!)
    • "?" matches one or none
    • "+" matches one or more
    • {n} matches n times
    • ={n,} matches n or more times
    • ={n,m} matches n to m times
  • Multiples examples
    x <- c("999xyz", "000", "A123", "a123", "1way", "WAY1", "a")
    grep("^[0-9]*[a-z]", x)
    grep("^[0-9]{3}", x)
    grep("[0-9]{3}$", x)
    grep("^[0-5]+", x)
    
    [1] 1 4 5 7
    [1] 1 2
    [1] 2 3 4
    [1] 2 5
    
  • Grouping
    • Use parentheses to group
      v <- c("aba", "abba", "aabba", "accc",  "ababa")
      grep("(ab)+a", v)
      
      [1] 1 5
      
    • Boolean OR
      v <- c("beach", "beech", "back", "buster", "abea")
      grep("be(e|a)ch", v, value=TRUE)
      grep("b(e+|a)c[hk]", v, value=TRUE)
      
      [1] "beach" "beech"
      [1] "beech" "back"
      
      • Note   B_note

        OR is greedy: it matches everything on left OR evertythinh on right up to end of regex or end of group

    • Referring to groups (back referencing)

      Find cases of thrice repeated numerals

      x <- c("999xyz", "000", "A123", "a123", "1way", "aaa", "a21113")
      grep("([0-9])\\1\\1", x)
      
      [1] 1 2 7
      
  • Regular expressions in R
    • R string functions (and functions in the stringr package) can handle regular expressions. Use fixed=TRUE to avoid.
    • R includes some predefined character classes (eg [:digit:], [:punct:]). See ?regex
    • In regexes, one must escape special characters with a backslash. But keep in mind that R will first interpret the string before it hands it to the regex engine. So to match a literal "?" use \\?
  • Escaping example
    v <- c("The first", "firstofall", "first time", "Who's on first?", "bfirstb")
    grep("first", v)
    grep("^first\\b", v)
    grep("\\bfirst\\?", v)
    
    [1] 1 2 3 4 5
    [1] 3
    [1] 4
    
  • R strings
    x <- "\\b" ; sprintf("x has %d chars", nchar(x))
    x <- "\n" ; sprintf("x has %d chars", nchar(x))
    x <- "\U00B5" ; sprintf("x has %d chars", nchar(x))
    paste(x)
    
    
    [1] "x has 2 chars"
    [1] "x has 1 chars"
    [1] "x has 1 chars"
    [1] "µ"
    
  • Text searches
    library(gutenbergr)
    oos <- gutenberg_download(1228) # origin of species
    subset(oos, grepl("Galapagos (Islands?|Archipelago).+bird", text))
    

    #+beginexample:

    gutenberg_id text
           <int> <chr>
            1228 in the Galapagos Islands nearly every land-bird, but only two ou…
            1228 Galapagos Islands reptiles, and in New Zealand gigantic wingless…
            1228 genera. In the Galapagos Archipelago, many even of the birds, th…
    

    #+endexample

16.5. stringr Package

  • Why the need?
    • R string functions are inconsistent

      Remedy: stringr package

      library(stringr)
      s <- c("(806) 777-7777", "(401) 555-5555", "(806) 444-4444")
      number_pattern <- "\\(([0-9]{3})\\) ([0-9]{3}-[0-9]{4})"
      str_match(s, number_pattern )
      
           [,1]             [,2]  [,3]      
      [1,] "(806) 777-7777" "806" "777-7777"
      [2,] "(401) 555-5555" "401" "555-5555"
      [3,] "(806) 444-4444" "806" "444-4444"
      
  • stringr and stringi

    stringr is built on top of a fast and efficient set of functions in the stringi package.

    stringi provides tools to do anything we could ever want to do with strings, whereas stringr provides tools to do the most common 95% of operations.

    library(stringi)
    ls("package:stringi")
    # vs
    ls("package:stringr")
    
  • Basic string operations with stringr
    str_c()
    like paste, but it uses the empty string as the default separator and silently removes zero length arguments
    str_length()
    is equivalent to nchar, but it preserves NAs and converts factors to characters (not integers)
    str_sub()
    is equivalent to substr() but it returns a zero length vector if any of its inputs are zero length, and otherwise expands each argument to match the longest. It also accepts negative positions, which are calculated from the end of the string. The end position defaults to -1, which corresponds to the last character.
    str_sub<-
    is equivalent to substr<-, but like str_sub it understands negative indices, and replacement strings not do need to be the same length as the string they are replacing
    str_trim
    Trim leading and trailing whitespace
  • str_c(), str_length()
    str_c("Hello", " ", "world", "!")
    v <- c("A", "B", "C")
    str_c(v,v)
    str_length(c("Hello", " ", "world", "!"))
    
    [1] "Hello world!"
    [1] "AA" "BB" "CC"
    [1] 5 1 5 1
    
  • str_length() also automatically converts factors to strings
    # nchar(factor(v)) # produces error
    str_length(factor(v))
    
    [1] 1 1 1
    
  • str_sub()
    s <- "What do you do with a drunken sailor?"
    str_sub(s, -str_length("sailor?"), -1 ) <-  "student?"
    s
    
    [1] "What do you do with a drunken student?"
    
  • str_sort, str_order

    Provides option more natural-language friendly sorting than sort.

    s <- c("100a10", "100a5", "2b", "2a")
    str_sort(s, numeric=TRUE)
    
    [1] "2a"     "2b"     "100a5"  "100a10"
    
  • str_detect(), str_which()

    Instead of grepl() and grep()

    s <- c("100a10", "100a5", "2b", "2a")
    str_which(s, "[a-b]$")
    
    [1] 3 4
    
  • And more

17. Math II

17.1. Sorting

  • sort and order
    • sort()
      x <- c(13,5,12,5)
      sort(x)
      
      [1]  5  5 12 13
      
    • order()
      x <- c(13,5,12,5)
      order(x)
      x[order(x)]
      
      [1] 2 4 3 1
      [1]  5  5 12 13
      
  • Use order() to sort dataframes
    d1 <- data.frame(sites = c("MDS", "BGN", "MDN", "LCN"),
                     elevation = c(1900, 2300, 1950, 1750))
    d1
    
      sites elevation
    1   MDS      1900
    2   BGN      2300
    3   MDN      1950
    4   LCN      1750
    
  • Use order() to sort dataframes
    y <- order(d1$elevation)
    d1[y, ]
    
      sites elevation
    4   LCN      1750
    1   MDS      1900
    3   MDN      1950
    2   BGN      2300
    
  • Sets in R

    A set is a collection of distinct (unique) objects. In R we simply use a vector.

    numbers <- sample(1:10, 20, replace=TRUE)
    numbers
    myset <- unique(numbers)
    myset
    
     [1] 10  4  3  1  2  8  2  6  2  7  6  5  2  4  2  6  6  1  8
    [20]  6
    [1] 10  4  3  1  2  8  6  7  5
    
  • Set operations
    union(x,y)
    Union of sets x and y
    intersect(x,y)
    Intersection of sets x and y
    setdiff(x,y)
    All element in x that are not in y
    setequal(x,y)
    Test for set equality
    c %in% y
    Test for membership in set, returns boolean vector
    choose(n,k)
    Number of possible subsets of size k chosen from set of size n
    combn(x,n)
    Find all combinations of size m in x
  • We can add some functions
    1. Calculate the symmetric difference of two sets
    2. Define a subset operator
    • Note   B_note
      • symmetric difference is the set of elements which are in either of the sets, but not in their intersection.
      • subset: we want to return TRUE or FALSE based on if x is a subset of y

17.2. Math functions

  • Built-in functions
    exp()
    Exponential function, base e
    log()
    Natural logarithm
    log10()
    Logarithm base 10
    sqrt()
    Square root
    abs()
    Absolute value
    sin()
    Trig functions: sin(), cos(), tan(), atan(), atan2()
    min(), max()
    Min and max within a vector
    pmin(), pmax
    Element-wise maxima and minima of several vectors
    sum(), prod()
    Sum and product of elements of vector
    cumsum, cumprod()
    Cumulative sum or product of all elements in a vector
    round()
    Rounding to integers. See floor(), ceiling()
    factorial()
    Factorial function
    • Note   B_note

      Remember to check the weekly vocabulary lists. You should be trying out all of those functions.

  • Minimization and Maximization

    nlm() and optim() both find minima numerically

    f <- function(x) {
      return(x^2 - sin(x) )
    }
    
    d <- data.frame(x = seq(-10, 10, 0.01))
    d$y <- f(d$x)
    
    library(ggplot2)
    ggplot(d, aes(x, y)) + geom_point()
    
  • Plot of x,y

    fig15_1.png

  • Using nlm() (Newton algorithm only)
    nlm(f, 5)
    
    $minimum
    [1] -0.2324656
    
    $estimate
    [1] 0.4501833
    
    $gradient
    [1] 4.501954e-07
    
    $code
    [1] 1
    
    $iterations
    [1] 6
    
  • Using optim()
    optim(5, f, method = "Brent", lower = -100, upper = 100)
    
    $par
    [1] 0.4501836
    
    $value
    [1] -0.2324656
    
    $counts
    function gradient 
          NA       NA 
    
    $convergence
    [1] 0
    
    $message
    NULL
    
  • Symbolic math: differentiation

    There are packages to do symbolic math (and other open source alternatives eg sage) but base R has D()

    D(quote( x^2 + 2*x ), "x")
    
    2 * x + 2
    
    D(quote( exp(x^2) ), "x")
    
    exp(x^2) * (2 * x)
    
    D(quote(log(2*x) + x^2), "x")
    
    2/(2 * x) + 2 * x
    
  • Expressions: quote() and expression()

    quote returns its argument as an unevaluated expression, expression returns a list of unevaluated expressions.

    e1 <- quote(sin(x+y))
    e2 <- expression(sin(x+y))
    str(e1)
    str(e2)
    all.equal(e1, e2)
    all.equal(e1, e2[[1]])
    
     language sin(x + y)
      expression(sin(x + y))
    [1] "Modes of target, current: call, expression"
    [2] "target, current do not match when deparsed"
    [1] TRUE
    
  • D() continued
    # first argument treated as an expression
    curve(x^2 + 2*x ,0,10)
    

    fig10_2.png

    • Note   B_note

      curve() treats the first argument as an expression. It is confusing and don't worry about how this works for now. It gets into the internals of R.

  • D() continued II
    curve(eval(D(expression( x^2 + 2*x ), "x") ), 0, 10)
    

    fig15_3.png

  • Simple calculus: numeric integration

    integrate()

    f <- function(x) x^2
    integrate(f,0,1)
    
    0.3333333 with absolute error < 3.7e-15
    

17.3. Statistical distributions

  • Distribution functions
    Distribution pdf cdf quantiles rand. numbers
    Uniform dunif() punif() qunif() runif()
    Normal dnorm() pnorm() qnorm() rnorm()
    Chi square dchisq() pchisq() qchisq() rchisq()
    Binomial dbinom() pbinom() qbinom() rbinom()
    Poisson dpois() ppois() qpois() rpois()
    Student's T dt() pt() qt() rt()
    F dist. df() pf() qf() rf()

    and more!

    For continuous distributions, the most useful functions are the "p" (cumulative density) and "q" (quantile = inverse cdf) functions.

  • Using distributions

    \(95^{th}\) percentile of the Chi square distribution with 2 degrees of freedom:

    qchisq(0.95,df=2)
    
    [1] 5.991465
    

    \(5^{th}\) and \(95^{th}\) percentiles of the normal distribution with mean zero and sd of one:

    qnorm( c(0.05, 0.95) )
    
    [1] -1.644854  1.644854
    
  • Example: Binomial Distribution

    Probability of exactly 2 heads out of 12 coin flips:

    dbinom(2, size=12, prob=0.5)
    
    [1] 0.01611328
    

    Should be same as probability of 10 heads:

    dbinom(10, size=12, prob=0.5)
    
    [1] 0.01611328
    

    Probability of 1-10 heads:

    pbinom(10, size=12, prob=0.5)
    
    [1] 0.9968262
    

17.4. Randomization and simulation

  • Random variates

    Built-in random number generators for many statistical distributions.

    • Find probability of four or five heads out of five coin tosses:
      x <- rbinom(500000, 5, 0.5)
      mean(x >= 4)
      
      # The exact way:
      1 - pbinom(3, 5, 0.5)
      dbinom(4, 5, 0.5) + dbinom(5, 5, 0.5)
      
      [1] 0.187234
      [1] 0.1875
      [1] 0.1875
      
  • Expectation

    find \( E[max(X,Y)] \) when X,Y are normal random variables mean=0 sd=1

    sum <- 0
    nreps <- 100000
    for (i in 1:nreps) {
      xy <- rnorm(2)
      sum <- sum + max(xy)
    }
    print(sum/nreps)
    
    [1] 0.5648268
    
    # or store in vector:
    v <- replicate(nreps, max(rnorm(2)))
    hist(v)
    
  • Plot the results

    biv-norm-exp.png

  • Setting the random number seed

    Important for debugging!

    set.seed(100)
    mean(rnorm(10000))
    mean(rnorm(10000))
    set.seed(100)
    
    mean(rnorm(10000))
    
    [1] 0.00387737
    [1] -0.01183919
    [1] 0.00387737
    

17.5. Linear algebra

  • Linear Algebra

    Matrix multiplication:

    a <- matrix(c(1,3,2,4), ncol=2)
    b <- matrix(c(1,0,-1,1), ncol=2)
    a %*% b
    
         [,1] [,2]
    [1,]    1    1
    [2,]    3    1
    
  • Some linear algebra functions
    crossprod()
    Inner product (dot product) of two vectors. Not the real crossproduct.
    t()
    Matrix transpose
    qr()
    QR decomposition
    chol()
    Cholesky decomposition
    det()
    Determinant
    eigen()
    Eigenvalues / eigenvectors
    diag()
    Extracts the diagonal of a square matrix (useful for obtaining variances from a covariances matrix and for constructing a diagonal matrix)
    sweep()
    Numerical analysis sweep operations (a more capable and complicated "apply" type function)

18. The Grammar of Graphics and ggplot2 I

18.1. Grammar of graphics

  • Grammer of Graphics

    The Grammar of Graphics. Leland Wilkinson, 1999, 2005. Springer.

    • An abstraction

      To describe, think about, and implement graphical displays of data

  • What is a "plot"?

    A plot is

    • A data set and a set of mappings from variables to aesthetics
    • One or more layers
    • One scale for each aesthetic mapping
    • A coordinate system (we won't worry about this and just use cartesian coordinates!)
    • The faceting specification (how many subplots)
  • Example scatterplot

    A simple dataset. Plot A vs C with symbol shape determined by categorical variable, D

    A B C D
    2 3 4 a
    1 2 1 a
    4 5 15 b
    9 10 80 b
  • Map variables to aesthetic space
    • Turn values in coordinates, eg

      \(floor\left( \frac{x - min(x)}{range(x)} \times screen.width \right)\)

    • Transformed:
      x y Shape
      25 11 circle
      0 0 circle
      75 53 square
      200 300 square
  • Next steps
    • Statistical transformation

      Boring this time: the identity function

    • Combine the parts
      • Data (represented by the point geom)
      • Scales and coordinate system (axes and legends)
      • Annotations (such as background and plot title)
  • Grammar

    We can produce grammatically correct plots that are nonsensical, just as in English we can produce grammatically correct nonsense

  • Nonsense plot
    library(ggplot2)
    ggplot(mpg, aes(displ, hwy)) + geom_line()
    

    ggplot_nonsense.png

  • Resources for ggplot2

18.2. Components of a plot

  • Components of a plot
    • A layer:
      • data and aesthetic mappings
      • geometric objects
      • statistical transformations
    • The rest of the plot:
      • scales
      • facet specification
      • the coordinate system
  • Scales, facets, coordinate system
    • We need a scale for each aesthetic in the plot. A scale operates across all data in one plot.
      • A scale is a function and its inverse, along with some parameters
      • Example: a color gradient scale
      • The inverse gives us the axes (for position scales) or the legend (for other scales such as colors or shapes)
    • You've seen facets and we will stick with cartesian coordinates for now.
  • Layers

    composed of five parts:

    • data
    • aesthetic mapping (what is x, y, color, shape, etc). aes() command
    • statistical transformation (stat)
    • geometric object (geom) (geoms have default stats and vice versa)
    • a position adjustment

18.3. Aesthetics

  • Aesthetic mappings

    Refer to variables in the dataset only! So, back to our oak water potential data:

    oaks <- 
    read.csv("http://r-research-tool.schwilk.org//data/oak-water-potentials-simple.csv")
    oaks <-  oaks[complete.cases(oaks) & oaks$pd.psi < 0,]
    a <- aes(x = pd.psi, y = md.psi, color = spcode)
    p <- ggplot(oaks, a) + geom_point()
    p
    
  • Result:

    ggplot-aes.png

  • Mapping vs setting

    You can set certain aesthetics to fixed values rather than mapping to variables

    p <- ggplot(oaks, aes(pd.psi, md.psi))
    # set point color to dark blue
    p + geom_point(color = "darkblue", size=2.5, alpha=0.8)
    
  • Result

    ggplot-aes2.png

  • But don't map when you mean to set:
    # map new unamed variable with value "darkblue" to color
    p + geom_point(aes(color="darkblue"))
    
  • Result

    ggplot-aes3.png

  • A more useful mapping

    Don't worry about lubdridate::mdy() for now. That converted the date strings into continuous values on a timeline.

    # more useful:
    p + geom_point(aes(color=lubridate::mdy(date)))
    
  • Result

    ggplot-aes4.png

  • Individual vs collective geoms
    • Individual: Graphical object for each row (eg points)
    • Collective: statistical summary: polygon geom
    • Lines and smoothed paths somewhere in between
  • Grouping

    Lets look at each individual tree pre-dawn psi over time:

    p <- ggplot(oaks, aes(lubridate::mdy(date), pd.psi, group=tag)) +
                geom_line()
    p
    
  • Result:

    ggplot-oaksbytag.png

  • Grouping

    We'll use the Oxboys data set in nlme

    library(nlme)
    # ?Oxboys
    # head(Oxboys)
    # class(Oxboys)
    p <- ggplot(Oxboys, aes(age, height, group = Subject)) +
         geom_line()
    p
    
  • Result:

    ggplot-oxboys.png

  • Different groups for different layers
    # try adding lm line for all data
    p + geom_smooth(method="lm", se=FALSE)
    # same as:
    p + geom_smooth(aes(group = Subject), method="lm", se=FALSE)
    # what we want:
    p + geom_smooth(aes(group = NULL), method="lm", se=FALSE)
    
  • Default groups on categorical variables
    p <- ggplot(Oxboys, aes(Occasion, height)) + geom_boxplot()
    p + geom_line(aes(group = Subject), color = "#3366FF")
    
  • Result:

    ggplot-groups.png

  • Violin plot:
    ggplot(mpg, aes(class,hwy)) + geom_violin()
    

    ggplot-violin.png

  • Tables of geoms and stats (layer functions)
  • Stats

    Default stat for histogram geom is bin

    ggplot(diamonds, aes(carat)) +
      geom_histogram(aes(y= ..density..), binwidth=0.1)
    ## with counts
    ggplot(diamonds, aes(carat)) +
      geom_histogram(aes(y= ..count..), binwidth=0.1)
    
  • Position adjustment
    Adjustment Description
    dodge Adjust position by dodging overlaps on sides
    fill Stack overlapping objects and standardize to equal height
    identity Don't adjust position
    jitter Jitter points to avoid overlap
    stack Stack overlapping objects on top of one
  • Position examples
    # stack
    p <- ggplot(diamonds, aes(carat))
    p + stat_bin(aes(y = ..count.., fill = cut), binwidth=0.2, position = "stack")
    # fill
    p +  stat_bin(aes(y = ..count.., fill = cut), binwidth=0.2, position = "fill")
    # dodge
    p + stat_bin(aes(y = ..count.., fill = cut), binwidth=0.2, position = "dodge")
    # Identity (use line not area)
    p + stat_bin(aes(y =..count.., color = cut), geom = "line",  binwidth=0.2,
                 position = "identity")
    
  • Writing your own layer functions
    stat_sum_single <- function(fun, ...) {
      stat_summary(fun = fun, colour="gray15",
                   geom="point", size = 4, ...)
    }
    se_bar <- function(){
      stat_summary(fun.data = mean_se,
                   geom = "errorbar",
                   width = 0.15)
    }
    
    p <- ggplot(iris, aes(Species, Petal.Length)) +
      se_bar() + stat_sum_single(mean)
    p
    # p + stat_sum_single(median,  size = 3, shape = 5)
    
  • Result:

    ggplot-own-func.png

19. The Grammar of Graphics and ggplot2 II

19.1. Scales

  • Scales
    • Control how data is mapped to perceptual properties, and produce guides (axes and legends) which allow us to read the plot.
    • Important parameters: name, breaks & labels, limits.
    • Naming scheme: scale_aesthetic_name. All default scales have name continuous or discrete.
  • Scales

    Position scales, color scales, manual scales, identity scale

    name
    text string, mathematical expressions (see ?plotmath). Use labs(), ylab(), xlab()
    limits
    Domain of the scale, Continuous scales take a vector of length 2, eg c(4,10).
    breaks
    sets the values that appear on the axis or the legend (tick marks).
  • Scales examples
    p <- ggplot(mpg, aes(cty, hwy, color = displ)) + geom_point()
    p
    p + scale_x_continuous("City mpg")
    p + xlab("City mpg")
    p + ylab("Highway mpg")
    p + labs(x="City mpg",
             y = expression(paste("Highway ",frac(miles,gal))),
             color= "Displacement (l)" )
    
  • Result

    ggplot-scales.png

  • Scale transformations

    "transformers":

    • Manage the transformation and the inverse for legends, values
    • Common ones: log, log10, logit, sqrt, exp, reverse
    • shorthand for common ones: scale_x_log10() is equivalent to scale_x_continuous(trans = "log10")
  • More scales examples
    library(scales) # for math_format()
    library(stringr) # for str_c
    p <- ggplot(diamonds, aes(carat, price)) + geom_point()
    p + scale_x_log10() + scale_y_log10()
    
    log10breaks <- seq(-0.8,0.8,.2)
    p + scale_x_log10(breaks = 10^(log10breaks),
                      labels = parse(
                      text=(str_c("10^",log10breaks)))) +
        scale_y_log10()
    
    # natural log
    p + scale_x_continuous(trans = "log",
                           breaks = trans_breaks("log", "exp"),
                           labels = trans_format("log", math_format(e^.x)))
    
  • Result

    ggplot-scales2.png

19.2. Themes

  • ggplot Themes
    p + theme_bw()
    
    theme_nogrid <- theme_bw() + theme(
                    axis.title = element_text(size = 16),
                    axis.text = element_text(size = 13),
                    panel.grid.major = element_blank(),
                    panel.grid.minor = element_blank(),
                    panel.background = element_blank()
      )
    p + theme_nogrid + xlab("Carat") + scale_y_continuous("Price ($)", labels = comma)
    
    # ?theme
    
  • Result:

    ggplot-theme.png

  • Some more resources for ggplot tweaking

19.3. Maps

  • Maps in ggplot

    Let's grab some public map data on Texas county boundaries

    library(maps)
    library(RColorBrewer)
    county_df <- ggplot2::map_data('county')
    tx.counties <- subset(county_df, region == "texas")
    names(tx.counties)[6] <- "county"
    
  • The TX county data
    head(tx.counties)
    
               long      lat group order region   county
    74520 -95.75271 31.53560  2492 74520  texas anderson
    74521 -95.76989 31.55852  2492 74521  texas anderson
    74522 -95.76416 31.58143  2492 74522  texas anderson
    74523 -95.72979 31.58143  2492 74523  texas anderson
    74524 -95.74698 31.61008  2492 74524  texas anderson
    74525 -95.72405 31.63873  2492 74525  texas anderson
    
  • The path geom
    tx.map <- ggplot(tx.counties, aes(x=long, y=lat, group=county))
    tx.map + geom_path(color="gray")
    

    tx_cnty_map1.png

  • Grab population data from texas.gov

    Originally at https://demographics.texas.gov/

    tx.pop <- read.csv("https://r-research-tool.schwilk.org/data/2023_txpopest_county.csv")
    tx.pop$county <- tolower(tx.pop$county)
    tx.pop <- merge(tx.counties, tx.pop)
    head(tx.pop,3)
    
        county      long      lat group order region FIPS
    1 anderson -95.75271 31.53560  2492 74520  texas    1
    2 anderson -95.76989 31.55852  2492 74521  texas    1
    3 anderson -95.76416 31.58143  2492 74522  texas    1
      census_2020_count july1_2023_pop_est jan1_2024_pop_est
    1             57922              57501             57465
    2             57922              57501             57465
    3             57922              57501             57465
      num_chg_20_23 num_chg_20_24 pct_chg_20_23 pct_chg_20_24
    1          -421          -457          -0.7          -0.8
    2          -421          -457          -0.7          -0.8
    3          -421          -457          -0.7          -0.8
    
  • TX population map
    tx.map <- ggplot(tx.pop, aes(x=long, y=lat, group=county))
    tx.map <- tx.map + geom_path(color="gray")
    tx.map <- tx.map + geom_polygon(aes(fill=log10(jan1_2024_pop_est)))
    tx.map
    
  • Result:

    ggplot_map1.png

  • Add a background map

    With Open Street Map data via stadia server (ggmap library). You will need an API key

    library(ggmap)
    bb <- c(min(tx.pop$long), min(tx.pop$lat), max(tx.pop$long), max(tx.pop$lat))
    stm.tx <- get_stadiamap(bbox=bb, zoom=6)
    tx.map2 <- ggmap(stm.tx)
    tx.map2 <- tx.map2  +  geom_path(aes(x=long, y=lat, group = county),
                                     color="gray", data=tx.pop)
    tx.map2 <- tx.map2 + geom_polygon(aes(x=long, y=lat, group = county,
                                      fill =jan1_2024_pop_est),
                                      alpha = 0.80, data=tx.pop)
    tx.map2  + scale_fill_gradientn("Population", trans="log10",
                                    colours=brewer.pal(5,"OrRd"),
                                    labels = scales::number_format(accuracy = 1))
    
  • Result with openstreetmap background

    ggplot_map2.png

  • Other mapping resources

20. Function exercises

20.1. Functions

  • Exercises 1

    Create a function that given two strings (two character vectors each of length one), check if one is an anagram of another. Look at ~sort~() function and think about how to use it in this case (hint: you need to split up the string into characters using `strsplit`).

  • Anagram finder

    Now what about producing an anagram finder using a corpus of words?

    library(qdapDictionaries)
    wordlist <- GradyAugmented
    length(wordlist)
    wordlist[1:5]
    
    [1] 122806
    [1] "cyber"        "ceasefire"    "it'll"       
    [4] "billionaires" "wheelhouse"
    
  • Approach
    • Create a lookup table with sorted words as keys and a list of all anagrams of that sorted (canonical word) as the values.
    • Then write a function that takes a word and returns all anagrams of that word by sorting it, then subscripting into the lookup table.
  • First: a function to sort a single string
    sortstring1 <- function(s) {
      return(paste(sort(unlist(strsplit(s, ""))), collapse = ""))
    }
    # better:
    sortstring <- function(s) {
      s <- stri_split_boundaries(s, type = "character")
      s <- lapply(s, stri_sort)
      s <- lapply(s, stri_join, collapse = "")
      unlist(s)
    }
    
    sortstring("goat")
    
    [1] "agot"
    
  • Second: let's use a smaller dictionary to save time here

    You will see why if you try my first solution on the long list!

    shortwordlist <- DICTIONARY[-(1:107),1] # ~20,000 words
    length(shortwordlist)
    shortwordlist[1:5]
    
    [1] 20030
    [1] "aback"   "abacus"  "abaft"   "abalone" "abandon"
    
  • Make the lookup table
    lookup_table <- list() # list of lists
    for (word in shortwordlist) {
    # 1. sort the word and call this the key
    # 2. Check if this key is already an item name in lookup_table:
    #    (!is.null(lookup_table[sorted_word]))
    # 3. If key does not exist add make a new list with word as only item and
    #    append it to lookup_table with name 'key'
    # 4. If key exists, append current word to lookup_table[key]
    }
    
    # Then write function that takes word and returns the anagrams by using that
    # global lookup table
    
    
    
    
  • Test
    getAnagrams("goat")
    getAnagrams("peels")  # our wordlist is too small, no plurals
    
    [1] "goat" "toga"
    [1] "sleep"
    
  • Trick for speed

    Using a list as a lookup table with tens of thousands of keys will be slow. It would be nice if R had a built in `dictionary` or `lookup table` class. But it does in a way: We can use an R environment as a hash map (lookup table). This is faster for large lookup tables than are lists when order is not important.

    lookup <- new.env(hash=TRUE, parent=emptyenv())
    lookup[["agot"]] <- list("goat","toga")
    lookup[["agot"]]
    
  • Test
    getAnagrams("goat")
    getAnagrams("estrngi")
    getAnagrams("peels")
    
    [1] "goat" "toga"
    [1] "resting" "stinger"
    [1] "peels" "peles" "sleep" "speel"
    

21. The Tidyverse

21.1. The "tidyverse"

  • R has a history
    • And that means some aspects have been proven effective, and some not. For example read.csv at one time has bad default stringsAsFactors=TRUE, matrix operations still have drop=TRUE)
    • Hadley Wickham and now a large group of others have worked on a set of packages that provide a consistent interface. This approach is heavily pushed by Posit (the RStudio folks).
    • You can even load all of these at once with the meta-package: library(tidyverse)
    • However, I recommend only loading what you need. Not all of these packages require committing to the "tidyverse" way of doing things.
  • Tidyverse good and bad (well, tricky)
    • Good: provides consistent, high-level appraoches to solving common data tidying and shaping tasks
    • Tricky: depends heavily on non-standard evaluation and messing with environments. Blurs the line between strings and object names even more than does base R.
  • readr

    Fixes some of the bad defaults in read.table, read.csv, etc.

  • dplyr
    • a grammar of data manipulation
    • summarizing by group
    • transforming by row or by group
    • filtering out specific records
  • tidyr
    • For obtaining tidy data (needed before steps above)
    • wide formats to long
    • long formats to wide
  • tibbles
    • tibbles are just data frames with nicer, more limited, and consistent behavior.
    • You don't need to explicitly load this package, it is loaded when you call library() on any other tidyverse package.

21.2. Reading data using readr

  • readr
    library(readr)
    fish <- read_csv("http://r-research-tool.schwilk.org/data/fish-data.csv")
    
    head(fish)
    
    # A tibble: 6 × 3
      aquarium nutrient color.score
      <chr>    <chr>          <dbl>
    1 a        low              5.3
    2 a        low              4.7
    3 a        low              5  
    4 a        low              4.3
    5 a        low              4.8
    6 b        low              4.8
    
  • Column names

    readr makes a tibble and does not automatically convert names

    bad_names <- read_csv("http://r-research-tool.schwilk.org/data/bad_names.csv", 
                          show_col_types=FALSE)
    
    bad_names
    
    # A tibble: 4 × 3
      `sample id` `1st mass` `2nd mass`
      <chr>            <dbl>      <dbl>
    1 a                100         200 
    2 b                100.        155.
    3 c                 99         106 
    4 d                 98.8       180.
    
  • Referring to non-syntactic column names
    bad_names$`2nd mass`
    
    [1] 200.0 155.3 106.0 180.3
    
  • Compare with read.csv

    read.csv replaces spaces with "." and adds "X" before numbers to make syntactically allowed R names

    bad_names2 <- read.csv("http://r-research-tool.schwilk.org/data/bad_names.csv")
    head(bad_names2)
    
      sample.id X1st.mass X2nd.mass
    1         a     100.0     200.0
    2         b     100.5     155.3
    3         c      99.0     106.0
    4         d      98.8     180.3
    
  • Other formats
    • tab delimited: read_tsv
    • other delims: read_delim
    • fixed width file: read_fwf
  • Fixed width files (base R)
    readsurvey <- function(filen) {
      d <- read.fwf(filen,
        # col_positions = (c(1, 7, 9, 16, 24, 32, 33, 50:100)),
        widths = c(6, -1, 1, -1, 5, -2, 1, -7, 8, 1, -17, rep(1,50) ),
                   na.strings = c(" ", "*", ""))
      names(d) <- c("V1", "V2", "V3", "V4", "rnumber", "version", paste("Q", as.character(seq(1:50)), sep="" ) )
      d$rnumber <- paste("R", d$rnumber, sep="")
      return(d)
    }
    grades <- readsurvey("http://r-research-tool.schwilk.org/data/fixed_width.txt")
    grades[1:3, 1:10]
    
          V1 V2 V3 V4   rnumber version Q1 Q2 Q3 Q4
    1 703001  2 11  1 R11111111       1  3  2  1  3
    2 703001  4 21  1 R12345678       2  1  3  3  3
    3 703001  6 31  1 R98765432       1  3  2  1  3
    
  • Using readr:read_fwf()

    See documentation. Similar arguments to read.fwf with some ability to automatically guess where fields begin and end.

21.3. Tidyverse idioms

  • Phylosophy
    • High level tools for data science.
    • Move away from worrying about details of subsetting, loops, etc when a common pattern is needed such as "split-apply-recombine"
    • Less focus on stability than in base R. The Tidyverse changes.
  • Everything operates on tibbles (data frames)
    • Generally, all functions take a tibble as the first argument and return a tibble as well
    • So one can chain together operations
  • An introduction to the pipe

    Defined in magrittr package (required by dplyr and re-exported). It is a bit like the | pipe in the unix command line. It simply lets the left hand object replace the first argument of the right hand function.

    library(dplyr)
    
    mtcars %>% filter(mpg > 30) %>% select(-qsec, -gear, -carb)
    # instead of filter(mtcars, mpg > 30)
    
    
                    mpg cyl disp  hp drat    wt vs am
    Fiat 128       32.4   4 78.7  66 4.08 2.200  1  1
    Honda Civic    30.4   4 75.7  52 4.93 1.615  1  1
    Toyota Corolla 33.9   4 71.1  65 4.22 1.835  1  1
    Lotus Europa   30.4   4 95.1 113 3.77 1.513  1  1
    
  • As of R 4.1 there is a native pipe
    mtcars |> filter(mpg > 30) |> select(-qsec, -gear, -carb)
    
                    mpg cyl disp  hp drat    wt vs am
    Fiat 128       32.4   4 78.7  66 4.08 2.200  1  1
    Honda Civic    30.4   4 75.7  52 4.93 1.615  1  1
    Toyota Corolla 33.9   4 71.1  65 4.22 1.835  1  1
    Lotus Europa   30.4   4 95.1 113 3.77 1.513  1  1
    
  • Also in 4.1: A lambda syntax
    myvec <- 1:10
    # instead of
    sapply(myvec, function(x) x+2)
    sapply(myvec, \(x) x+2) 
    # useful for reordering arguments for use with the native pipe:
    paste("Dr. Schwilk", "has $50.") |>
      {\(x) gsub("50", "2", x)}()
    
     [1]  3  4  5  6  7  8  9 10 11 12
     [1]  3  4  5  6  7  8  9 10 11 12
    [1] "Dr. Schwilk has $2."
    
  • Alternatives to using the pipe
    # the pipe:
    result1 <- mtcars %>% filter(mpg > 30) %>% select(-qsec, -gear, -carb)
    # nested function calls:
    result2 <- select(filter(mtcars, mpg > 30), -qsec, -gear, -carb)
    # intermediate objects:
    intermediate.result <- filter(mtcars, mpg > 30)
    result3 <- select(intermediate.result, -qsec, -gear, -carb)
    
    identical(result1, result2)
    identical(result1, result3)
    
    [1] TRUE
    [1] TRUE
    
  • Data science workflow

    From R for Data Science by Hadley and Grolemund 2017:

    data-science.png

    readr -> tidyr -> dplyr (maybe lubridate, stringr, others) -> ggplot2 and domain specific modeling packages.

  • Tidyverse design guide

    Good information on programming design decisions

    https://design.tidyverse.org/index.html

  • joins vs merge

    dplyr provides join functions to replace merge

22. Shaping data and dplyr

22.1. The dplyr package

  • dplyr

    dplyr includes a family of individually simple functions that all have the same structure and all operate on tibbles (will convert data frames to tibbles). Use non-standard evaluation.

    Five important functions all with verb names

    • select certain columns of data
    • filter your data to select specific rows (similar to base R subset)
    • arrange the rows of your data into an order
    • mutate your data frame to contain new columns based on calculations
    • summarize chunks of you data
  • also: joins
    • left_join like merge with all.x=TRUE
    • right_join like merge with all.y=TRUE
    • full_join All x and y
    • anti_join Try it!
  • Structure
    • Note   B_note

      So generally restrict dplyr operations to global scope code. Write functions that you use in dplyr operations.

  • Some sample data

    The Batting data set (in the Lahman package) contains the batting records for all professional US players. For more information use the command ?Batting

    #library(dplyr)
    library(Lahman) # baseball dataset
    names(Batting)
    
     [1] "playerID" "yearID"   "stint"    "teamID"   "lgID"    
     [6] "G"        "AB"       "R"        "H"        "X2B"     
    [11] "X3B"      "HR"       "RBI"      "SB"       "CS"      
    [16] "BB"       "SO"       "IBB"      "HBP"      "SH"      
    [21] "SF"       "GIDP"
    
  • ID variables

    Factors by which we can group the data. For example, let's get mean stats (games, at bats, runs and hits, columns 6-9) for Babe Ruth across all years he played.

    colMeans( filter(Batting, playerID == "ruthba01")[6:9] )
    
            G        AB         R         H 
    113.77273 381.72727  98.81818 130.59091
    
  • Another ID variable: year

    Let's get mean stats (games, at bats, runs and hits, columns 6-9) across all players for 1871.

    colMeans( filter(Batting, yearID == 1871)[6:9] )
    
           G       AB        R        H 
    19.96522 94.10435 23.12174 26.96522
    
  • Obtain summaries for groups

    Finding summaries for one group is easy, but what if we want to find the same summaries for every group?

    • split, apply, combine.

      Summarize mean number of games, at bats, runs and hits (columns 6-9) for each year across all players.

  • In base R
    # Split
    pieces <- split(Batting[,6:9], Batting$yearID)
    years <- names(pieces)
    
    # Apply
    results <- lapply(pieces, colMeans)
    
    # Combine
    # do.call turns second argument (list) into individual arguments function
    result <- do.call(rbind, results)  
    result <- data.frame(result, year=years)
    
  • Results:
    head(result)
    
                G        AB        R        H year
    1871 19.96522  94.10435 23.12174 26.96522 1871
    1872 21.05732  99.77707 21.59236 28.45223 1872
    1873 28.83200 135.67200 28.64000 39.40800 1873
    1874 34.13821 155.31707 28.21138 42.47154 1874
    1875 28.79263 123.65438 19.51152 31.39171 1875
    1876 37.87097 162.26613 24.72581 43.04839 1876
    
  • using dplyr
    allyears <- select(Batting, yearID, G, AB, R, H)
    years <- group_by(allyears, yearID)
    yearstats <- summarize_all(years, list(mean=mean))
    head(yearstats, 3)
    
    # A tibble: 3 × 5
      yearID G_mean AB_mean R_mean H_mean
       <int>  <dbl>   <dbl>  <dbl>  <dbl>
    1   1871   20.0    94.1   23.1   27.0
    2   1872   21.1    99.8   21.6   28.5
    3   1873   28.8   136.    28.6   39.4
    
  • Split-apply-recombine

    When do we need to split-apply-recombine?

    • During data preparation, when performing group-wise ranking, normalization, etc
    • When creating summaries for display or analysis (marginal means, conditioning tables)
    • During modelling, when fitting separate models to subsets of the data
    • How to in other systems/languages?

      Excel pivot tables or the "by" argument to SAS procedures

22.2. dplyr functions

  • filter

    How many players in 1871?

    y1871 <- filter(Batting, yearID==1871)
    length(unique(y1871$playerID))
    
    [1] 115
    
  • mutate and summarise functions
    • mutate() creates a tibble based on an existing data frame; a row of output per row of input
    • summarise() A row of output per group
    • Arguments to mutate() and summarize()

      Function form is mutate(.data, ...) where the ... arguments are named parameters defining new columns

  • Strategy for grouped calculations
    1. Extract a single group
    2. Solve problem for that one group — perhaps write a wrapper function that takes the group and returns the summary or transformation
    3. Use dplyr functions and group_by to solve it for all groups
  • 1-2. Calculate career year for Babe Ruth

    career year = number of years since player began.

    baberuth <- filter(Batting, playerID == "ruthba01")
    baberuth <- mutate(baberuth,
                       cyear = yearID - min(yearID) + 1)
    head(baberuth, 5)[,c(2, 6:9, 23)]
    
      yearID  G  AB  R  H cyear
    1   1914  5  10  1  2     1
    2   1915 42  92 16 29     2
    3   1916 67 136 18 37     3
    4   1917 52 123 14 40     4
    5   1918 95 317 50 95     5
    
  • 3. Calculate career year for all players
    byplayer <- group_by(Batting, playerID)
    Batting <- mutate(byplayer, cyear = yearID - min(yearID) + 1)
    Batting <- arrange(Batting, playerID)
    head(Batting)[, c(1, 2, 6:9, 23)]
    
    # A tibble: 6 × 7
    # Groups:   playerID [1]
      playerID  yearID     G    AB     R     H cyear
      <chr>      <int> <int> <int> <int> <int> <dbl>
    1 aardsda01   2004    11     0     0     0     1
    2 aardsda01   2006    45     2     0     0     3
    3 aardsda01   2007    25     0     0     0     4
    4 aardsda01   2008    47     1     0     0     5
    5 aardsda01   2009    73     0     0     0     6
    6 aardsda01   2010    53     0     0     0     7
    
  • More summaries: Teams per player

    Total number of teams in the data frame

    length(unique(Batting$teamID))
    
    [1] 149
    

    Number of teams for which each player played:

    nteams <- summarise(byplayer,
                    nteams = length(unique(teamID)))
    head(nteams, 5)
    
    # A tibble: 5 × 2
      playerID  nteams
      <chr>      <int>
    1 aardsda01      8
    2 aaronha01      3
    3 aaronto01      2
    4 aasedo01       5
    5 abadan01       3
    
  • Do players improve?
    • Code   BMCOL
      brplot <- ggplot(baberuth, aes(cyear, RBI/AB)) +
            geom_point()
      brplot + geom_smooth(method="lm")
      
    • Result   BMCOL

      baseball-baberuth1.png

  • Do players improve? Modeling for one player

    Fitting a regression model relating runs per bat (rbi/ab) to career year

    mod.br <- lm(RBI/AB ~ cyear, data=baberuth)
    mod.br
    #summary(mod.br)
    
    
    Call:
    lm(formula = RBI/AB ~ cyear, data = baberuth)
    
    Coefficients:
    (Intercept)        cyear  
       0.203200     0.003413
    
  • Do players improve? Modeling for all

    How to model this for all players?

    lmslope <- function(x,y){
      linmod <- lm(y~x)
      coefs <- coef(linmod)
      return(coefs[2])
     }
    
    bb <- subset(Batting, AB >= 25)
    bb <- mutate(group_by(bb, playerID),
                 include = length(yearID) > 10)
    bb <- filter(bb, include)
    player_traj <- summarize(bb, slope=lmslope(cyear, RBI/AB))
    
    ggplot(player_traj, aes(x=slope)) +
      geom_histogram( binwidth=0.005)
    
  • Results:

    baseball-allplayers.png

  • Maybe players get better than worse again
    modq.br <- lm(RBI/AB ~ cyear + I(cyear^2), data=baberuth)
    #mod.br
    summary(modq.br)
    
    
    Call:
    lm(formula = RBI/AB ~ cyear + I(cyear^2), data = baberuth)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.10600 -0.01994  0.01035  0.04025  0.06291 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
    (Intercept)  0.1267872  0.0372004   3.408  0.00295 **
    cyear        0.0225163  0.0074510   3.022  0.00701 **
    I(cyear^2)  -0.0008306  0.0003146  -2.640  0.01613 * 
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.05295 on 19 degrees of freedom
    Multiple R-squared:  0.3592,	Adjusted R-squared:  0.2917 
    F-statistic: 5.325 on 2 and 19 DF,  p-value: 0.01459
    
  • Quadratic plot
    brplot + stat_smooth(method="lm", formula = y~x+I(x^2))
    

    baseball-baberuth2.png

  • All players, quadratic term
    qtermmodel <- function(df){
      mod <- lm(RBI/AB ~ cyear + I(cyear^2), data = df)
      coefs <- data.frame(t(coef(mod)))
      names(coefs) <- c("int", "slope", "qterm")
      return(coefs)
     }
    
    bcoefs <- do(bb, qtermmodel(.))
    qplot(qterm, data=bcoefs, binwidth=0.0005) + xlim(c(-0.015, 0.015))
    
  • Results: all players, quadratic term

    baseball-allplayers-quad.png

23. More dplyr and data exploration

23.1. Baby names

  • US baby names data

    baby-names-ssa.png

  • Download the data
  • Read data
    bnames <- read.csv("http://r-research-tool.schwilk.org/data/top-1000-baby-names.csv")
    # gender here is labeled "sex" and "boy/girl" means male/female sex assigned at birth
    bnames$sex <- factor(bnames$sex)
    levels(bnames$sex) <- c("girl", "boy")
    head(bnames, 3)
    
      name  sex year rank  percent
    1 Mary girl 1880    1 7.764248
    2 Anna girl 1880    2 2.861727
    3 Emma girl 1880    3 2.201244
    
  • Name diversity

    Investigate the percentage of all names that the top 1000 names occupy

    bn <- group_by(bnames, year, sex)
    ofall <- summarize(bn, tprop = sum(percent))
    ggplot(ofall, aes(year, tprop, colour = sex)) + geom_line() +
     ylab("Percentage of all names")
    
  • Result:

    baby-names-ofall.png

  • Look at name trend
    a <- aes(year, percent, color = sex)
    p <- ggplot(filter(bnames, name=="Dylan"), a) + geom_line()
    p
    

    baby-names-dylan.png

  • Some helper functions
    # Return letter at position pos
    library(stringr)
    letter <- function(string, pos) {
      return(str_sub(string, pos, pos))
    }
    
    # and some consistent axes
    axes <- list(
      scale_x_continuous(breaks = c(1900, 1920, 1940, 1960, 1980, 2000)),
      scale_y_continuous("percent")
      )
    
    
  • Trend in first letter
    bn <- group_by(bnames, year,sex, first = letter(name, 1))
    fl <- summarize(bn, percent=sum(percent))
    ggplot(fl, a) + geom_line() + facet_wrap(~ first) + axes
    
  • Result

    baby-names-first-letter.png

  • Trend in last letter
    bn <- group_by(bnames, year,sex, last = letter(name, -1))
    ll <- summarize(bn, percent=sum(percent))
    ggplot(ll, a) + geom_line() + facet_wrap(~ last) + axes
    
  • Result

    baby-names-last-letter.png

  • Rise in names that end in n
    lastn <- group_by(filter(bnames, letter(name, -1)=="n"), year)
    lastn <- summarize(lastn, percent=sum(percent))
    ggplot(lastn, aes(year, percent)) + geom_line() + axes
    
  • Result

    baby-names-last-letter-n.png

  • What about names that sound alike?

    see http://en.wikipedia.org/wiki/Soundex

    soundex <- function(name) {
      name <- toupper(name)
      name <- unlist(strsplit(name,"")) # split into individual chars
      sdx <- name[1]
      dict <- c("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R", "AEIOUHWY")
      codes <- c("1","2","3","4","5","6",".")
            for (i in 2:length(name)) {
          code = codes[grep(name[i], dict, fixed=TRUE)]
           if (code != str_sub(sdx,-1,-1)) {
             sdx <- str_c(sdx, code)
          }
         }
      sdx <- gsub(".", "", sdx, fixed=TRUE)
      sdx <- str_pad(sdx, 3, pad = "0")
      return(sdx)
    }
    
  • Get soundex for each name in data
    unames <- unique(bnames$name)
    sdxs <- sapply(unames, soundex)
    bnames$soundex <- sdxs[ match(bnames$name, unames )]
    bnames$soundend <- str_sub(bnames$soundex,2)
    
  • So what is the deal with names that end in n?
    sdylan <- filter(bnames, soundex==soundex("Dylan"))
    unique(sdylan$name)
    
     [1] "Delma"   "Dillon"  "Delano"  "Delwin"  "Dylan"  
     [6] "Delaney" "Dillan"  "Dillion" "Dylon"   "Dyllan" 
    [11] "Dallin"  "Dilan"   "Daylen"
    
  • Plot sound alikes
    sdy <- group_by(sdylan, sex, year)
    sdylan.sum <- summarize(sdy, percent = sum(percent))
    ggplot(sdylan.sum, a)  + geom_line() + axes
    
  • Result

    baby-names-soundex-dylan.png

  • Maybe all soundexs that end in "45" + "n"?
    sy <- group_by(bnames, year, sex)
    s45 <- summarize(filter(sy, soundend == "45" & letter(name,-1)=="n"), percent=sum(percent))
    ggplot(s45, a)  + geom_line() + axes
    
    
  • Result

    baby-names-soundex-45.png

24. Tidy Data

24.1. Tidy data

  • What is tidy data?
    • A step along the road to clean data
    • Data that is easy to model, visualise and aggregate (i.e. works well with statistical model functions like lm, works well with ggplot, and dplyr)
    • Tidy:
      • Each variable must have its own column.
      • Each observation must have its own row.
      • Each value must have its own cell.
  • Example of messy data
    • There are three variables in this data set
        Pregnant Not Pregnant
      male 0 5
      female 1 4
    • Tidy:
      Sex Pregnant N
      male Yes 0
      male No 5
      female Yes 1
      female No 4
  • Data storage in tidy format
    Storage Meaning
    Table / File Data set
    Rows Observation
    Columns Variables
  • Common causes of messiness
    • column headers are values, not variable names
    • multiple variables are stored in one column
    • variables are stored in both rows and columns
    • multiple types of experimental unit stored in the same table
    • one type of experimental unit stored in multiple tables
  • Some tools
    library(tidyr)
    ?pivot_longer
    ?pivot_wider
    ?separate
    ?extract
    library(stringr)
    ?str_replace
    ?str_sub
    ?str_split_fixed
    
    

24.2. Long vs wide data

  • You've seen long vs wide data formats
    date species psi.pd psi.md RH.pd RH.md
    110810 JUDE2 -2.0 -4.5 88 50
    110810 QUGR -1.0 -2.2 90 40
    110810 PICE -0.4 -1.7 91 33

    Useful for regressing md.psi on pd.psi, but not useful if we want to facet on time of day. So is this tidy? Depends if we think of pd.psi and md.psi as separate variables or not.

  • Long format
    date species time psi RH
    110810 JUDE2 pd -2.0 88
    110810 JUDE2 md -4.5 90
    110810 QUGR pd -1.0 91
    110810 QUGR md -2.2 50
    110810 PICE pd -0.4 40
    110810 PICE md -1.7 33
  • Conceptual framework
    • Identifer (id) variables identify the unit that measurements take place on. ID variables are usually discrete, and are typically fixed by design.
    • Measured variables represent what is measured on that unit.
  • Further step

    In truly "long" format there are only id variables and a value column! (variable names all in a "variable" column)

    date species time variable value
    110810 JUDE2 pd psi -2.0
    110810 JUDE2 md psi -4.5
    110810 QUGR pd psi -1.0
    110810 QUGR md psi -2.2
    110810 PICE pd psi -0.4
    110810 PICE md psi -1.7
    110810 JUDE2 pd RH 88
    110810 JUDE2 md RH 90
    110810 QUGR pd RH 91
    110810 QUGR md RH 50
    110810 PICE pd RH 40
    110810 PICE md RH 33
  • Problem: Column headers are values, not variable names
    raw <- read.csv("https://r-research-tool.schwilk.org/data/pew.csv",
                    check.names = FALSE)
    head(raw[,1:6])
    
                religion <$10k $10-20k $20-30k $30-40k $40-50k
    1           Agnostic    27      34      60      81      76
    2            Atheist    12      27      37      52      35
    3           Buddhist    27      21      30      34      33
    4           Catholic   418     617     732     670     638
    5 Don’t know/refused    15      14      15      11      10
    6   Evangelical Prot   575     869    1064     982     881
    
  • Problem: Multiple variables in one column

    The WHO tuberculosis data:

    library(stringr)
    course_url <- "http://r-research-tool.schwilk.org"
    raw <- read.csv(file.path(course_url, "data/tb-simple-2020.csv"))
    raw %>% select(1:7) %>% tail()
    
          country iso2 g_whoregion year new_sp new_sp_m04
    8487 Zimbabwe   ZW         AFR 2014     NA         NA
    8488 Zimbabwe   ZW         AFR 2015     NA         NA
    8489 Zimbabwe   ZW         AFR 2016     NA         NA
    8490 Zimbabwe   ZW         AFR 2017     NA         NA
    8491 Zimbabwe   ZW         AFR 2018     NA         NA
    8492 Zimbabwe   ZW         AFR 2019     NA         NA
         new_sp_m514
    8487          NA
    8488          NA
    8489          NA
    8490          NA
    8491          NA
    8492          NA
    
  • Clean a bit
    ## get rid of the "overall count" for now
    raw$new_sp <- NULL
    ## rename rest of columns for clarity
    names(raw) <- str_replace(names(raw), "new_sp_", "")
    raw %>% select(1:8, -iso2) %>% tail()
    
          country g_whoregion year m04 m514 m014 m1524
    8487 Zimbabwe         AFR 2014  NA   NA   NA    NA
    8488 Zimbabwe         AFR 2015  NA   NA   NA    NA
    8489 Zimbabwe         AFR 2016  NA   NA   NA    NA
    8490 Zimbabwe         AFR 2017  NA   NA   NA    NA
    8491 Zimbabwe         AFR 2018  NA   NA   NA    NA
    8492 Zimbabwe         AFR 2019  NA   NA   NA    NA
    

24.3. Reshaping

  • Variables in rows and columns

    Example data, photosynthesis and respiration in two species

    library(tidyr)
    
    tab1 <- readr::read_csv("http://r-research-tool.schwilk.org/data/carex.csv")
    head(tab1)
    
    Species Treatment A.1 A.2 A.3 A.4 R.1 R.2 R.3 R.4
    Pa 0.15 8.36 7.632 5.522 6.3 0.7358 0.587 0.6236 0.4354
    Pa 3 11.06 11.54 7.914 7.914        
    Pa 15 22.32 18.64 17.2 18.4 1.168 0.96 1.546  
    Cs 0.15 15.6 10.24 13.1 10.48 1.484 0.7682 0.6236 1.306
    Cs 3 13.74 15.24 13.92 10.5        
    Cs 15 9.37 18.76 15.6 10.66 1.036 1.064 0.5276 0.899
    • Note   B_note

      But A, A.1 and A.2 all represent measurements of A (photosynthesis). Someone put multiple observations all in a single row.

  • Make data longer (wide to long format)
    tab.tidier <- pivot_longer(tab1, cols=c(-Species, -Treatment),
                               names_to="measurement")
    head(tab.tidier)
    
    # A tibble: 6 × 4
      Species Treatment measurement value
      <chr>       <dbl> <chr>       <dbl>
    1 Pa           0.15 A.1         8.36 
    2 Pa           0.15 A.2         7.63 
    3 Pa           0.15 A.3         5.52 
    4 Pa           0.15 A.4         6.3  
    5 Pa           0.15 R.1         0.736
    6 Pa           0.15 R.2         0.587
    
  • Separate replicate id from measurement type
    tab.long <- tab.tidier %>% separate(measurement, into = c("measurement", "rep"),
                                                     sep = "\\.")
    head(tab.long)
    
    # A tibble: 6 × 5
      Species Treatment measurement rep   value
      <chr>       <dbl> <chr>       <chr> <dbl>
    1 Pa           0.15 A           1     8.36 
    2 Pa           0.15 A           2     7.63 
    3 Pa           0.15 A           3     5.52 
    4 Pa           0.15 A           4     6.3  
    5 Pa           0.15 R           1     0.736
    6 Pa           0.15 R           2     0.587
    

    In more complicated cases you may need to use regular expressions to pull apart column data. See names_pattern argument, also tidyr::extract, stringr::str_match

  • Make data wider (long to wide)

    Why?

    tab.tidy <- tab.long %>% pivot_wider(names_from=measurement,
                                           values_from=value)
    head(tab.tidy)
    
    
    # A tibble: 6 × 5
      Species Treatment rep       A      R
      <chr>       <dbl> <chr> <dbl>  <dbl>
    1 Pa           0.15 1      8.36  0.736
    2 Pa           0.15 2      7.63  0.587
    3 Pa           0.15 3      5.52  0.624
    4 Pa           0.15 4      6.3   0.435
    5 Pa           3    1     11.1  NA    
    6 Pa           3    2     11.5  NA
    
  • pivot longer: TB data again
    tempwide <- raw %>% select(1:7, -iso2)
    tail(tempwide)
    
          country g_whoregion year m04 m514 m014
    8487 Zimbabwe         AFR 2014  NA   NA   NA
    8488 Zimbabwe         AFR 2015  NA   NA   NA
    8489 Zimbabwe         AFR 2016  NA   NA   NA
    8490 Zimbabwe         AFR 2017  NA   NA   NA
    8491 Zimbabwe         AFR 2018  NA   NA   NA
    8492 Zimbabwe         AFR 2019  NA   NA   NA
    
  • pivot longer
    temp_long <- tempwide %>%
      pivot_longer(cols=starts_with("m"), names_to="sexage", values_to="count")
    # More correctly, use a regex to capture m|f|sexunk
    temp_long <- tempwide %>%
      pivot_longer(cols=matches("^[mfs]"), names_to="sexage", values_to="count")
    
    head(temp_long)
    
    # A tibble: 6 × 5
      country     g_whoregion  year sexage count
      <chr>       <chr>       <int> <chr>  <int>
    1 Afghanistan EMR          1980 m04       NA
    2 Afghanistan EMR          1980 m514      NA
    3 Afghanistan EMR          1980 m014      NA
    4 Afghanistan EMR          1981 m04       NA
    5 Afghanistan EMR          1981 m514      NA
    6 Afghanistan EMR          1981 m014      NA
    
  • Several choices now for grabbing sex and age:
    1. Check possible values. I omitted most cols so this is incomplete.
    2. Then separate into age and sex columns by using either names_pattern in pivot_longer directly, tidyr:extract, or stringr::str_match
  • What to do with NAs?
    • In long data, some NAs can be implicit (missing rows), but when spread to a wider form we need to make them explicit because there is a cell to fill.
    • In some cases, an NA in one form (or a missing row) can imply a zero in another aggregate form (for example: oak cover on my Big Bend transects!).
  • Things you can do with tidyr
    • pivot wider or longer
    • split or combine columns (separate(), extract(), and unite())
    • make explicit missing value explicit (complete()) or vice-versa (drop_na()), or replace with known value (replace_na()).
  • Non standard evaluation

    Most dplyr verbs use non-standard evaluation and treat some arguments as expressions. In more detail:

    • arrange(), count(), filter(), group_by(), mutate(), and summarise() use data masking so that you can use data variables as if they were variables in the environment (i.e. you write my_variable not df$myvariable).
    • across(), relocate(), rename(), select(), and pull() use tidy selection so you can easily choose variables based on their position, name, or type (e.g. starts_with("x") or is.numeric).

      Look up these terms and look at examples.

25. Dates and Times

25.1. How to represent time

  • Time is tricky to represent

    Time is a deceptively complicated system to work with. Do we want it to behave like a number line, or according to complicated conventions?

    • Possible time objects
      • instants (specific moment)
      • intervals (specific timespans with a start and stop)
      • durations (length of timespan between two instants)
      • periods (time span in units larger than a second that can be of variable length)
  • Instants
    • An instant refers to a specific moment of time.
    • Instants are combination of years, months, days, hours, minutes, seconds, and timezone. e.g., 2011-03-28 08:44:32 CST
    • There are two main time data types in R:
  • POSIXct

    The POSIXct data type is the number of seconds since the start of January 1, 1970. Negative numbers represent the number of seconds before this time, and positive numbers represent the number of seconds afterwards.

  • POSIXlt

    The POSIXlt data type is a vector, and the entries in the vector have the following meanings:

    • seconds
    • minutes
    • hours
    • day of month (1-31)
    • month of the year (0-11)
    • years since 1900
    • day of the week (0-6 where 0 represents Sunday)
    • day of the year (0-365)
    • Daylight savings indicator (positive if it is daylight savings)

25.2. lubridate

  • lubridate package

    See https://lubridate.tidyverse.org/ for cheatsheet.

    
    #  lubridate turns strings into instants with functions
    # that have y, m, and d in their names.
    
    ymd("2011-03-28")
    mdy("03/28/2011")
    dmy("28032011")
    ymd_hms("2011:03:28 08:50:30", tz="America/Chicago")
    dmy("05-03-2011")
    mdy("05-03-2011")
    
    [1] "2011-03-28"
    [1] "2011-03-28"
    [1] "2011-03-28"
    [1] "2011-03-28 08:50:30 CDT"
    [1] "2011-03-05"
    [1] "2011-05-03"
    
  • What type of objects do these return?
    # what is the object type (class)?
    class(mdy("05-03-2011"))
    class(ymd_hms("2011:03:28 08:50:30", tz="America/Chicago"))
    
    [1] "Date"
    [1] "POSIXct" "POSIXt"
    
  • Example use
    url <- "http://r-research-tool.schwilk.org/data/soil-sensors.csv"
    soil <- read.csv(url, na.strings = c("NA", "#N/A", ""))
    names(soil) <- c("time", "p1_psi", "p2_VWC",
                     "p3_VWC", "p4_psi")
    head(soil)
    class(soil$time)
    
                      time p1_psi p2_VWC p3_VWC p4_psi
    1 24 May 2012 12:00 PM    -36  0.230  0.338    -23
    2 24 May 2012 12:30 PM    -36  0.229  0.338    -23
    3 24 May 2012 01:00 PM    -36  0.226  0.337    -23
    4 24 May 2012 01:30 PM    -37  0.224  0.337    -23
    5 24 May 2012 02:00 PM    -37  0.222  0.336    -23
    6 24 May 2012 02:30 PM    -38  0.220  0.335    -24
    [1] "character"
    
  • Try:

    Convert time column into date-time objects

    • Note   B_note

      currently these functions rely on base R strptime function. use parse_date_time2() if you run into problems such as nonstandard month abbreviations, two-digit years, etc.

  • Manipulating instants
    now() > ymd("2025-04-16")
    today() == ymd("2025-04-15")  # date object
    now() > ymd("2025-04-15", tz="America/Chicago")
    
    
    [1] TRUE
    [1] FALSE
    [1] TRUE
    
  • Timezones

    Tricky to specify. Use this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

    d <- ymd_hms("2020-10-22 14:30:00", tz="America/Chicago")
    d
    with_tz(d, tz="UTC")
    with_tz(d, tz="America/Los_Angeles")
    force_tz(d, tz="America/Los_Angeles")
    force_tz(d, tz="America/Whitehorse")
    
    [1] "2020-10-22 14:30:00 CDT"
    [1] "2020-10-22 19:30:00 UTC"
    [1] "2020-10-22 12:30:00 PDT"
    [1] "2020-10-22 14:30:00 PDT"
    [1] "2020-10-22 14:30:00 PDT"
    
    • Note   B_note

      Note that TZ names like "US/Central" are officially considered deprecated by the tz database that is used globally (by many different software). They are unlikely to ever not work but just a heads up that the continent/city format is now the standard.

      Timezones are very confusing because daylight savings dates and observance could differ by municipality, and of course over time. And political boundaries cahnge. You might be surprised if you have a time series that crosses 1867 in Alaska, for example (Julian vs Gregorian calendar, international date line all changed with purchase from Russia).

  • Rounding times
    
    # round
    floor_date(now(), "month")
    ceiling_date(now(), "month")
    round_date(now(), "month")
    
    # extract
    year(now())
    
    [1] "2026-04-01 CDT"
    [1] "2026-05-01 CDT"
    [1] "2026-05-01 CDT"
    [1] 2026
    
  • Accessing components
    Date component Accessor
    Year year()
    Month month()
    Week week()
    Day of year yday()
    Day of month mday()
    Day of week wday()
    Hour hour()
    Minute minute()
    Second second()
    Time Zone tz()
  • Manipulating instants
    now()
    year(now())
    hour(now())
    day(now())
    yday(now())
    
    [1] "2026-04-28 10:18:36 CDT"
    [1] 2026
    [1] 10
    [1] 28
    [1] 118
    
  • Instants, continued
    wday(now())
    wday(now(), label = TRUE)
    wday(now(), label = TRUE, abbr = FALSE)
    month(now(), label = TRUE, abbr = FALSE)
    
    [1] 3
    [1] Tue
    Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat
    [1] Tuesday
    7 Levels: Sunday < Monday < Tuesday < ... < Saturday
    [1] April
    12 Levels: January < February < March < April < ... < December
    
  • Upon which day of the week were you born?
  • Changing components
    # Accessor functions can also be used to change the
    # elements of a date-time
    ti <- now()
    year(ti) <- 1999
    hour(ti) <- 23
    day(ti) <- 45
    tz(ti) <- "UTC"
    ti
    
    [1] "1999-05-15 23:18:36 UTC"
    
  • Summarizing over time
    library(dplyr)
    library(ggplot2)
    soil$date <- floor_date(soil$time, "day") # could use dplyr::mutate
    # or could use date(soil$time)
    psi.day <- soil %>% group_by(date) %>%
                summarize(avepsi = mean(p1_psi, na.rm=TRUE))
    ggplot(psi.day, aes(date, avepsi)) + geom_line()
    
  • Result

    date-aggregate.png

25.3. Time spans

  • Time spans
    #?dseconds
    dseconds(140)
    dhours(32)
    ddays(2)
    class(ddays(1))
    
    [1] "140s (~2.33 minutes)"
    [1] "115200s (~1.33 days)"
    [1] "172800s (~2 days)"
    [1] "Duration"
    attr(,"package")
    [1] "lubridate"
    
  • Let's use R to set our alarm clock
    x <- ymd_hms("2026-01-01 08:30:00", tz = "") +
            ddays(0:30)
    head(x)
    # cool
    
    [1] "2026-01-01 08:30:00 CST" "2026-01-02 08:30:00 CST"
    [3] "2026-01-03 08:30:00 CST" "2026-01-04 08:30:00 CST"
    [5] "2026-01-05 08:30:00 CST" "2026-01-06 08:30:00 CST"
    
  • A problem:
    ## what about:
    x <- ymd_hms("2026-03-01 08:30:00", tz = "") +
            ddays(0:30)
    head(x, 12)
    
     [1] "2026-03-01 08:30:00 CST" "2026-03-02 08:30:00 CST"
     [3] "2026-03-03 08:30:00 CST" "2026-03-04 08:30:00 CST"
     [5] "2026-03-05 08:30:00 CST" "2026-03-06 08:30:00 CST"
     [7] "2026-03-07 08:30:00 CST" "2026-03-08 09:30:00 CDT"
     [9] "2026-03-09 09:30:00 CDT" "2026-03-10 09:30:00 CDT"
    [11] "2026-03-11 09:30:00 CDT" "2026-03-12 09:30:00 CDT"
    
  • What went wrong?
  • 3 types of time spans

    We can still do math on the time line, as long as we’re specific about how we want to measure time.

    • durations (exact number of seconds)
    • periods (clock times)
    • intervals (two specific datetimes)
  • Durations
    # Use d + the plural of of a unit of time to create
    # a duration.
    dminutes(5)
    dhours(278)
    #dmonths(4) # why doesn't this work?
    
    [1] "300s (~5 minutes)"
    [1] "1000800s (~1.65 weeks)"
    
  • Periods
    # Use the plural of a time unit to create a period.
    minutes(5)
    hours(278)
    months(4) # months are not a problem
    years(1:10)
    
    [1] "5M 0S"
    [1] "278H 0M 0S"
    [1] "4m 0d 0H 0M 0S"
     [1] "1y 0m 0d 0H 0M 0S"  "2y 0m 0d 0H 0M 0S" 
     [3] "3y 0m 0d 0H 0M 0S"  "4y 0m 0d 0H 0M 0S" 
     [5] "5y 0m 0d 0H 0M 0S"  "6y 0m 0d 0H 0M 0S" 
     [7] "7y 0m 0d 0H 0M 0S"  "8y 0m 0d 0H 0M 0S" 
     [9] "9y 0m 0d 0H 0M 0S"  "10y 0m 0d 0H 0M 0S"
    
  • Why use periods?

    Appointments that refer to wall clock time once per week, per month, per year, etc.

    2020-11-01 00:00:00 + months(1) will always equal 2020-12-01 00:00:00 no matter how many leap seconds, leap days or changes in DST occur in between

  • Intervals
    # To create an interval, use the special %--%
    # operator:
    int <- ymd("2024-01-01") %--% ymd("2024-01-02")
    # Access and set the endpoints with int_start() and
    # int_end().
    int_start(int)
    int_end(int) <- ymd("2024-03-14")
    as.duration(int)
    as.period(int)
    # Intervals can be accurately converted to either
    # periods or durations with as.period() and
    # as.duration()
    
    [1] "2024-01-01 UTC"
    [1] "6307200s (~10.43 weeks)"
    [1] "2m 13d 0H 0M 0S"
    

25.4. Math with date-times

  • Math with date-times
    birthday <- ymd("1830-12-10") # Emily Dickinson
    life <- birthday %--% now()
    life / ddays(1)
    life / days(1)
    life %/% days(1)
    
    
    [1] 71362.64
    [1] 71362.64
    [1] 71362
    
  • Try:

    Calculate your age in minutes. In hours. In days. How old is someone who was born on April 4th, 1969?

26. Statistical Models

26.1. Statistical models

  • Definition
    Data
    A collection of variables, each variable being a vector of measurements of some specific trait/characteristic
    Problem
    How does variable \(Y\) depend upon variables \(X_1, X_2 \dots X_n\)
    Model
    The model represents the mathematical relationship between X and Y. It aims to represent the true relationship
  • First steps in modeling
    • Identify the response variable
    • Identify the explanatory variables
    • Are the explanatory variables continuous, categorical or a mixture of both?
    • What is the response variable? Continuous, a count or proportion, category, or time-at-death?
  • Explanatory variables
    Continuous
    Regression
    Categorical
    Analysis of Variance (ANOVA)
    Mixture
    Analysis of covariance (ANCOVA)
  • Response variables
    Continuous
    Linear regression, anova, ancova
    Proportion
    Logistic regression (and variants for counts, binary data)
    Time-at-death
    Survival analysis

26.2. Model formulas in R

  • Model formulas
    • A model formula is a sequence of R variables and symbols that specifies the variables in the model and their relationships
    • The formula is a necessary argument to many model-fitting functions in R (treated as an expression)
    • Although it resembles a mathematical formula, the symbols in the "equation" have different meanings
  • Model formula notation

    A model formula in R can look like this:

    y ~ A + B + A:B
    

    the ~ means: "is modeled as a function of"

  • Example: Linear regression

    Setup:

    library(ggplot2)
    x <- runif(1000, 0, 10)
    y <- 1.5*x + 0.09 + rnorm(1000, 0, 2)
    d <- data.frame(x=x, y=y)
    ggplot(d, aes(x, y)) + geom_point()
    

    figmodels1.png

    • Lecture notes   B_note
      • If we run a linear regression, what should the estimated intercept and slope be?
      • What should be the residual error?
  • Example: Linear regression
    fit <- lm(y ~ x, data = d)
    fit
    
    
    Call:
    lm(formula = y ~ x, data = d)
    
    Coefficients:
    (Intercept)            x  
        0.09578      1.50354
    
  • Example: Linear regression
    summary(fit)
    
    
    Call:
    lm(formula = y ~ x, data = d)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -5.2281 -1.3923 -0.0896  1.3105  6.9257 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  0.09578    0.12165   0.787    0.431    
    x            1.50354    0.02114  71.106   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 1.947 on 998 degrees of freedom
    Multiple R-squared:  0.8352,	Adjusted R-squared:  0.835 
    F-statistic:  5056 on 1 and 998 DF,  p-value: < 2.2e-16
    
  • Extracting information from a fitted object
    typeof(fit)
    class(fit)
    names(fit)
    coef(fit)
    
    [1] "list"
    [1] "lm"
     [1] "coefficients"  "residuals"     "effects"      
     [4] "rank"          "fitted.values" "assign"       
     [7] "qr"            "df.residual"   "xlevels"      
    [10] "call"          "terms"         "model"        
    (Intercept)           x 
     0.09577804  1.50353900
    
  • Residuals
    head(residuals(fit),5)
    sd(residuals(fit))
    
              1           2           3           4           5 
    -2.95736864  0.02948475 -0.95593011 -1.22266288 -1.48669100 
    [1] 1.945628
    
  • Intercepts

    But wait! R estimated two coefficients and we only had one explanatory variable.

    • R assumes that you want to include an intercept term (our call was equivalent to y ~ 1 + x)
    • You can exclude it by adding a 0 to the formula (or a -1). But you (almost) never want to do that.
  • Example of no intercept
    fitnoi <- lm(y ~ 0 + x, data = d)
    summary(fitnoi)
    
    
    Call:
    lm(formula = y ~ 0 + x, data = d)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -5.1886 -1.3703 -0.0788  1.3384  6.9321 
    
    Coefficients:
      Estimate Std. Error t value Pr(>|t|)    
    x   1.5179     0.0107   141.9   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 1.946 on 999 degrees of freedom
    Multiple R-squared:  0.9527,	Adjusted R-squared:  0.9527 
    F-statistic: 2.013e+04 on 1 and 999 DF,  p-value: < 2.2e-16
    
  • Let's build some models
    # names(mtcars)
    ggplot(mtcars, aes(disp, mpg)) + geom_point()
    ggplot(mtcars, aes(wt, mpg)) + geom_point()
    ggplot(mtcars, aes(disp, wt)) + geom_point()
    fuel.mod <- lm(mpg ~ disp + wt + wt:disp, data=mtcars)
    summary(fuel.mod)
    
  • Model results
    
    Call:
    lm(formula = mpg ~ disp + wt + wt:disp, data = mtcars)
    
    Residuals:
       Min     1Q Median     3Q    Max 
    -3.267 -1.677 -0.836  1.351  5.017 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 44.081998   3.123063  14.115 2.96e-14 ***
    disp        -0.056358   0.013239  -4.257  0.00021 ***
    wt          -6.495680   1.313383  -4.946 3.22e-05 ***
    disp:wt      0.011705   0.003255   3.596  0.00123 ** 
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 2.455 on 28 degrees of freedom
    Multiple R-squared:  0.8501,	Adjusted R-squared:  0.8341 
    F-statistic: 52.95 on 3 and 28 DF,  p-value: 1.158e-11
    
  • anova() Test if added variables reduce residual sum-of-squares
    # type I tests (sequential):
    anova(fuel.mod)
    
    Analysis of Variance Table
    
    Response: mpg
              Df Sum Sq Mean Sq F value   Pr(>F)    
    disp       1 808.89  808.89 134.217  3.4e-12 ***
    wt         1  70.48   70.48  11.694 0.001942 ** 
    disp:wt    1  77.93   77.93  12.931 0.001227 ** 
    Residuals 28 168.75    6.03                     
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
  • But with regression, data are likely unbalanced, best to use type II or III SS
    library(car)
    
    
    # set contrasts different than system difaults: must sum to 0
    options(contrasts = c("contr.sum","contr.poly"))
    
    # car::Anova() can use type II and type III sum of squares. 
    # We found a significant interaction in our test with anova() 
    # so we should use type III. This will match SAS output.
    fuel.mod <- lm(mpg ~ disp + wt + wt:disp,
                   data=mtcars)
    Anova(fuel.mod, contrasts=list(disp=contr.sum, wt=contr.sum),
          type=3)
    
    
  • Results
    Anova Table (Type III tests)
    
    Response: mpg
                 Sum Sq Df F value    Pr(>F)    
    (Intercept) 1200.72  1 199.233 2.956e-14 ***
    disp         109.22  1  18.123 0.0002102 ***
    wt           147.42  1  24.461 3.217e-05 ***
    disp:wt       77.93  1  12.931 0.0012270 ** 
    Residuals    168.75 28                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    • Note   B_note

      Usually the hypothesis of interest is about the significance of one factor while controlling for the level of the other factors. This equates to using type II or III sum of squares. In general, if there is no significant interaction effect, then type II is more powerful, and follows the principle of marginality. If an interaction is present, then type II is inappropriate while type III can still be used, but results need to be interpreted with caution (in the presence of interactions, main effects are rarely interpretable).

  • Do use anova() to compare two models
    fuel.mod2 <- lm(mpg ~ disp + wt, data=mtcars)
    anova(fuel.mod2, fuel.mod)
    
    Analysis of Variance Table
    
    Model 1: mpg ~ disp + wt
    Model 2: mpg ~ disp + wt + wt:disp
      Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
    1     29 246.68                                
    2     28 168.75  1    77.934 12.931 0.001227 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
  • Linear models: summary vs anova vs car::Anova
    • In short:
      • summary() gives T-test against null that \(\Beta = 0\).
      • anova() does sequential F-test if each variable improves model (lowers residual sum-of-squares).
      • car::Anova() allows type II and III sum of squares.
  • Linear models: summary vs anova vs car::Anova
    • Longer

      Ben Bolker (https://stackoverflow.com/questions/12863178/comparing-two-linear-models-with-anova-in-r): The p-values you get for summary() of each individual model are the p-values for the effects of each of the parameters in each model, conditional on all the other parameters in that model. If your data are perfectly balanced (which is unlikely in a regression design), you should get the same answers from summary() and anova(), but otherwise the results from anova are generally preferable.

  • Categorical variables and contrasts

    If you have N levels of a factor, you have N-1 contrasts (reported as coefficients).

    • Default is dummy coding: each level compared to reference level (eg first). You can re-order levels to change reference.
    • Reverse Helmert coding (contr.helmert) compares each level of a categorical variable to the mean of previous levels of the variable.
    • And others including user-defined

    See https://cran.r-project.org/web/packages/codingMatrices/vignettes/codingMatrices.pdf

    or https://marissabarlaz.github.io/portfolio/contrastcoding/

  • Symbols in formulas
    + separate effects in a formula
    : interaction (A:B is interaction of A and B)
    * main effects plus interactions
    ^ crossed
    %in% nested within (but see lme4 and nlme)
    / nested within
    | conditional on (rarer)

    A*B is equivalent to A + B + A:B

  • lm() works for categorical variables too
    mtcars$cylnum <- factor(mtcars$cyl)
    levels(mtcars$cylnum)
    # treatment or dummy coding
    mtcars$cylnum <- C(mtcars$cylnum, treatment)
    fuel.aov <- lm(mpg ~ cylnum, data=mtcars)
    Anova(fuel.aov)
    
    [1] "4" "6" "8"
    Anova Table (Type II tests)
    
    Response: mpg
              Sum Sq Df F value    Pr(>F)    
    cylnum    824.78  2  39.697 4.979e-09 ***
    Residuals 301.26 29                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
  • Coefficients of categorical variables
    summary(fuel.aov)
    
    
    Call:
    lm(formula = mpg ~ cylnum, data = mtcars)
    
    Residuals:
        Min      1Q  Median      3Q     Max 
    -5.2636 -1.8357  0.0286  1.3893  7.2364 
    
    Coefficients:
                Estimate Std. Error t value Pr(>|t|)    
    (Intercept)  26.6636     0.9718  27.437  < 2e-16 ***
    cylnum6      -6.9208     1.5583  -4.441 0.000119 ***
    cylnum8     -11.5636     1.2986  -8.905 8.57e-10 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 3.223 on 29 degrees of freedom
    Multiple R-squared:  0.7325,	Adjusted R-squared:  0.714 
    F-statistic:  39.7 on 2 and 29 DF,  p-value: 4.979e-09
    
    • Note   B_note

      The "contrasts" vector is set through `options()` in your R environment. This vector of length 2 sets contrast types represented as strings determines how categorical variables are handled in your models. The most common scheme in regression is called "treatment contrasts": with treatment contrasts, the first level of the categorical variable is assigned the value 0, and then other levels measure the change from the first level. The contrasts argument to options always takes two arguments. the first refers to the contrasts used for unordered factors and the second for ordered factors. The default for unordered variables is `contr.treatment()`.

      in anova, we usually use contrasts.sum. For ordered factors, contrasts.poly.

      Look up more on contrasts coding. contrasts() function and cont.treatment() etc:

      https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/ or

      For more control, there is the `contrasts` package: https://cran.r-project.org/web/packages/contrast/vignettes/contrast.html

26.3. Beyond the linear model

  • Beyond the linear model
    GLM
    Generalized linear modeling allow non-normal errors in response variable through a linear link function
    Mixed effects models
    relax assumption of independence. Model nested observations and random effects
    GLS
    Generalized least squares relax assumption of homogeneity, model the heterogeneity
    GLMM
    combine first and second, above.
    GAM and GAMM
    Generalized additive models and mixed versions
  • Generalized Linear Models (glm)

    Consist of:

    • Error distributions

      The distribution of the response around its mean. for ANOVA and regression, this is the normal; for logistic regression it is the binomial

    • Link function

      The function, g that shows how the linear function of the explanatory variables is related to the expected value of the response:

      \(g(\mu) = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n\)

    • Variance function

      This captures how the variance of the response depends upon the mean

  • Example logistic regression
    trees <- data.frame(fire.sev = seq(0.01,1,0.005))
    fire.mort <- runif(length(trees$fire.sev)) < trees$fire.sev
    trees$mortality <- ifelse(fire.mort,1,0)
    ggplot(trees, aes(fire.sev, mortality)) + geom_point()
    
  • Not a linear relationship!

    fig_log_reg.png

  • Logistic regression
    trees.glm <- glm(mortality ~ fire.sev, data=trees, family = binomial())
    trees.glm
    # summary(trees.glm)
    
    
    Call:  glm(formula = mortality ~ fire.sev, family = binomial(), data = trees)
    
    Coefficients:
    (Intercept)     fire.sev  
         -2.323        4.627  
    
    Degrees of Freedom: 198 Total (i.e. Null);  197 Residual
    Null Deviance:	    275.9 
    Residual Deviance: 212.7 	AIC: 216.7
    
  • Prediction
    trees$pred.mort <- predict(trees.glm, type="response")
    ggplot(trees, aes(fire.sev, mortality)) + geom_point() +
      geom_line(aes(y =pred.mort) )
    
  • Prediction: plot

    fig_log_pred.png

26.4. Mixed models

  • Example: Mixed model
    fish <- read.csv("http://r-research-tool.schwilk.org/data/fish-data.csv")
    head(fish)
    
      aquarium nutrient color.score
    1        a      low         5.3
    2        a      low         4.7
    3        a      low         5.0
    4        a      low         4.3
    5        a      low         4.8
    6        b      low         4.8
    
  • Look at these data

    Is each fish by treatment row independent?

  • Mixed model using nlme
    library(nlme)
    fit1.lme <- lme(color.score ~ 1, random = ~ 1 | aquarium,
                    method="ML", data=fish)
    fit2.lme <- lme(color.score ~ nutrient, random = ~ 1 | aquarium,
                     method="ML", data=fish)
    
    anova(fit1.lme, fit2.lme)
    
             Model df      AIC      BIC    logLik   Test  L.Ratio
    fit1.lme     1  3 168.2086 173.9446 -81.10429                
    fit2.lme     2  4 147.2213 154.8694 -69.61064 1 vs 2 22.98729
             p-value
    fit1.lme        
    fit2.lme  <.0001
    
  • BEWARE! Those p values are wrong in any non classical design.

    Beware of p-values in multi-level models. Consider model selection (AIC or BIC, but careful there as well). Consider Markov chain Monte Carlo methods (MCMC) for confidence intervals.

    See package lmerTest, and pbkrtest for Kenward-Roger approximation, package afex, and car

  • Using lme4
    library(lme4)
    library(pbkrtest)
    
    fit1.lmer <- lmer(color.score ~ 1 + (1 | aquarium),
                      data=fish)
    fit2.lmer <- lmer(color.score ~ nutrient + (1 | aquarium),
                     data=fish)
    KRmodcomp(fit1.lmer, fit2.lmer)
    
    boundary (singular) fit: see help('isSingular')
    large : color.score ~ nutrient + (1 | aquarium)
    small : color.score ~ 1 + (1 | aquarium)
            stat    ndf    ddf F.scaling   p.value    
    Ftest 67.198  1.000  8.000         1 3.662e-05 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
  • Mixed models in R
    • Pinheiro, J. C. & Bates, D. Mixed-Effects Models in S and S-Plus Springer-Verlag, New York, 2000.
    • Zuur, A.F., Ieno, E.N., Walker, N., Saveliev, A.A., and Smith, G.M. Mixed Effects Models and Extensions in Ecology with R. Springer. 2009.

27. Perception and Color

27.1. Graphics and color use

  • Future topics
    • Markup languages
    • Version control (git) - could spend multiple lectures.
    • ? emacs text editor / org mode
    • ? Survey of some domain specific R packages, PCA, time series, phylogenies
    • ? Many models in R
  • Exploratory vs communication graphics
    • Exploratory graphics

      Are for you (not others). Need to be able to create rapidly because your first attempt will never be the most revealing. Iteration is crucial for developing the best display of your data.

    • Communication graphics

      When you communicate your findings, you need to spend a lot of time polishing your graphics to eliminate distractions and focus on the story. Iteration is crucial to ensure all the the small stuff works well: labels, color choices, tick marks, etc

    • Note   B_note

      Check out Kieran Healey's book, Data Visualization

      https://socviz.co/index.html

27.2. Color

  • What are the components of color?

    RGB:

    RGB cube creative commons license, https://commons.wikimedia.org/wiki/File:RGB_Cube_Show_lowgamma_cutout_a.png

  • RGB triplets
    Notation RGB triplet
    Arithmetic (1.0, 0.0, 0.0)
    8-bit per channel (255, 0, 0)
    8-bit as hexidecimal #FF0000
  • Other color models?
    • Easy linear transformations of RGB:
      • RGB
      • HSV: Hue, Saturation, Value
      • HSL: Hue, Saturation, Lightness
    • Computationally more intensive:
      • LAB color spaces, eg CIE L*a*b* (CIELAB)
      • HCL : Hue, Chroma, Luminance
  • Luminance / Lightness

    cspace-compare.png

  • HCL

    fig_log_reg.png

  • Benefit
  • Color blindness
    • 7-10% of men are red-green color "blind."
    • Solutions: avoid red-green contrasts; use redundant mappings; test.
    • Color Oracle: http://colororacle.org

27.3. Colors in R

  • Colors in R

    Default in R is an hcl color scale. One can use scales in other packages (eg RColorBRewer) as well.

    • Continuous scales

      The default continuous scale in ggplot2 is a blue gradient, from low = #132B43 to high = #56B1F7 which can be reproduced as:

      library(ggplot2)
      scale_color_gradient(low = "#132B43", high = "#56B1F7", space = "Lab")
      
    • Discrete scales

      The default discrete scale in ggplot2 is a range of hues from hcl,

      scale_color_hue(h = c(0, 360) + 15, c = 100,
                      l = 65, h.start = 0, direction = 1)
      
  • Discrete color scales
    p = ggplot(iris, aes(Petal.Length, Sepal.Length, color = Species)) + geom_point()
    p
    p + scale_color_hue(l=40,c=30) # change luminance and chroma
    p + scale_color_hue(h = c(90,200), l = 50,c = 50) # change luminance and chroma
    p + scale_color_hue(h = c(90,200), l = 50,c = 50, direction = -1)
    
  • Results:

    color-hue.png

  • Continuous scales
    p <- ggplot(mpg, aes(hwy, cty, color = displ)) +
         geom_point(size = 3.5, alpha = 0.6, position = "jitter")
    p
    p +  scale_colour_gradient(low = "tan", high = "red")
    
  • Results:

    color-gradient.png

  • Other color palettes

    Cynthia Brewer's palettes for discrete scales http://colorbrewer2.org

    These are optimized for maps

    library(RColorBrewer)
    display.brewer.all()
    
    p <- qplot(carat, price, color = cut, data=diamonds)
    p + scale_color_brewer(palette = "Blues")
    
    
    p <- ggplot(mpg, aes(reorder(class, -cty), cty, color = cut(displ, breaks = 4))) + geom_boxplot(position = "dodge")
    p + scale_color_brewer(palette="Dark2")
    
  • Result:

    color-brewer.png

  • colorspace package

    Can be used wioth base graphics, ggplot2 etc. See https://colorspace.r-forge.r-project.org/articles/colorspace.html

    library(colorspace)
    
    • Note   B_note

      The scales are named via the scheme scale_<aesthetic>_<datatype>_<colorscale>(), where <aesthetic> is the name of the aesthetic (fill, color, colour), <datatype> is the type of the variable plotted (discrete or continuous) and <colorscale> sets the type of the color scale used (qualitative, sequential, diverging, divergingx).

  • colorspace palettes with ggplot2
    ggplot(iris, aes(x = Sepal.Length, fill = Species)) + geom_density(alpha = 0.6) +
      scale_fill_discrete_qualitative(palette = "Dark 3")
    

    colorspace-ggplot-ex1.png

  • Example of sequential palette
    dsamp <- diamonds[1 + 1:1000 * 50, ]
    ggplot(dsamp, aes(carat, price, color = cut)) + geom_point() +
      scale_color_discrete_sequential(palette = "Purples 3", nmax = 6, order = 2:6)
    

    colorspace-ggplot-ex2.png

27.4. Perception and comparisons

  • It is easier to make nearby comparisons
    • slide 1   B_onlyenv

      diamond-price1.png

    • slide 2   B_onlyenv

      diamond-price2.png

  • Beware of visual illusions

    vis-illusion.png

  • Boxplots and relatives

    barboxdotviolin_ggplot.png

  • My recommendations

    For comparing univariate responses across discrete groups. According to data set size

    • small: dotplots or jittered dotplots (a special challenge are small, discrete data sets, where there are lots of repeated values)
    • moderate: boxplots
    • moderate to large: violin plots, density strips, beanplots, box-percentile plots etc. etc.

28. Version Control

28.1. Setup

  • The Bash Shell

    Bash is a commonly-used shell that gives you the power to do simple tasks more quickly.

    • Windows

      Install Git for Windows by download and running the installer this will provide you with both Git and Bash in the Git Bash program.

    • Mac

      The default shell in all versions of Mac OS X is bash, so no need to install anything. You access bash from the Terminal (found in /Applications/Utilities). You may want to keep Terminal in your dock for this workshop.

    • Linux

      The default shell is usually bash, Just open a terminal. There is no need to install anything.

  • Git
    • Windows

      Git should be installed on your computer as part of your Bash install (described above).

    • Mac OS X

      For OS X 10.9 and higher, Install xcode command line tools. Easiest is just to try to run git: git --version

  • The shell
    • What is the shell?

      The shell a command line interface which allows you to control your computer.

      Tasks are accomplished by entering commands with a keyboard (or by running a plain text script) instead of controlling a graphical user interface.

      The most popular shell is Bash ("Bourne Again SHell")

    • What is a terminal?

      A terminal is a program you run that gives you access to the shell. There are many different terminal programs that vary across operating systems.

  • Using the shell

    Open a terminal (bash)

    ls
    echo "hello"
    echo "hello" > newfile.txt
    cat newfile.txt
    
  • Full vs relative paths
    pwd
    cd ~
    pwd
    
  • Shell command options and arguments
    ls -l
    ls -al
    man ls
    

28.2. Version control

  • Why use version control?
    • You already use some sort of version control
      • File naming schemes (eg my_file_April18-2023.docx, better as my_file_20230418.docx) or by copying folders around
      • Simple but error-prone
      • Does not help with branching, collaboration
    • A version control system (VCS) allows you to:
      • revert files back to a previous state
      • revert the entire project back to a previous state
      • review changes made over time
      • see who last modified something that might be causing a problem, who introduced an issue and when
      • bisect the history to discover what change caused the error
      • collaborate safely on a project without destroying anyone's work.
  • Local version control systems (VCS)
  • Centralized VCS

    central-vc.png

  • Distributed VCS

    distributed-vc.png

28.3. git

  • git
    • A distributed VCS
    • Written first by Linus Torvalds specifically for managing the linux kernel project.
    • Has become the standard VCS for many software projects, especially open-source projects
    • Many academics/scientists use git in part because of the popularity of the hosting site, github.com
  • Configuring
    git config --global user.name "Dylan Schwilk"
    git config --global user.email "dylan@schwilk.org"
    
  • Making a "repo" (repository)
    cd ~
    mkdir newrepo
    cd newrepo
    git init
    touch README.md
    ls .git -al
    
  • What happened?

    Hidden directory called .git

    .
    ├── .git
    │   ├── branches
    │   ├── config
    │   ├── description
    │   ├── HEAD
    │   ├── hooks
    │   │   ├── applypatch-msg.sample
    │   │   ├── commit-msg.sample
    │   │   ├── post-update.sample
    │   │   ├── pre-applypatch.sample
    │   │   ├── pre-commit.sample
    │   │   ├── prepare-commit-msg.sample
    │   │   ├── pre-push.sample
    │   │   ├── pre-rebase.sample
    │   │   └── update.sample
    │   ├── info
    │   │   └── exclude
    │   ├── objects
    │   │   ├── info
    │   │   └── pack
    │   └── refs
    │       ├── heads
    │       └── tags
    └── README.md
    
  • Checking status
    git status
    
    On branch master
    
    Initial commit
    
    Untracked files:
      (use "git add <file>..." to include in what will be committed)
    
            README.md
    
    nothing added to commit but untracked files present (use "git add" to track)
    
  • git commands
    • git states   BMCOL

      git-states.png

    • git commands   BMCOL
      • git init
      • git add
      • git status
      • git log
      • git commit
  • staging files
    git add .
    git status
    
    On branch master
    
    Initial commit
    
    Changes to be committed:
      (use "git rm --cached <file>..." to unstage)
    
            new file:   README.md
    
  • Committing changes
    git commit -m "initial commit"
    git status
    
    On branch master
    nothing to commit, working directory clean
    
  • git commit

    A commit is a snapshot taken from the index (staging area) not from the working directory!

  • What exists now?

    git-filesystem2.png

  • Cloning repo that already exists
    git clone git@github.com:schwilklab/trait-flam.git
    cd trait-flam
    ls
    
    data  ms  README.md  results  scripts
    
  • A basic workflow
    • Edit files
    • Stage the changes (git add)
    • Review the changes (git diff)
    • Commit the changes (git commit)
  • git log
    git log --pretty=oneline --abbrev-commit -10
    
    d2ec17b Reduce markup lecture
    e4d405f Add folder for tangled session files
    bb2d29e Add shell (Bash) examples and some data
    261af83 Clean up birdd files
    e289d7e Minor edit
    f2820d3 Scatterplot fig in birdd example
    75a4be6 Add slides on Bash shell
    66c9a7f Add etherpad link for workshop
    e23b559 Suggest getting github.com account
    7e6a6df Add time and date for workshop
    
  • git diff
    git diff
    difference between working directory and index (staging area)
    git diff --staged
    difference between index and last commit

    With right options, diff can show changes between any two commits, any two files, etc.

  • git diff example

    Try:

    git diff HEAD~1
    
  • Diffs on GitHub

    github-diff-example.png

  • Practice
    • edit file
    • stage
    • commit

28.4. Working with branches

  • Branches

    A branch represents an independent line of development.

  • Show and create branches
    • Show
      git branch
      
      * master
        new-feature
      
    • Create
      git branch <newbranch>
      
    • Delete
      git branch -D <newbranch>
      
  • Using branches

    git-branch-diag.png

  • Switch branches
    git checkout <existing-branch>
    
  • Merge

    Merging is Git's way of putting a branched history back together again.

    • Note:

      This merges into the current branch

      git merge <branch>
      
  • Fast-forward merge

    git-ff-merge.png

  • Three-way merge

    git-merge-3w.png

  • Example fast-foward merge
    # Start a new feature
    git checkout -b new-feature master
    
    # Edit some files
    git add <file>
    git commit -m "Start a feature"
    
    # Edit some files
    git add <file>
    git commit -m "Finish a feature"
    
    # Merge in the new-feature branch
    git checkout master
    git merge new-feature
    git branch -d new-feature
    
  • Merge vs rebase

28.5. Working with remotes

  • What is a "remote"
    • a version of the repository that hosted somewhere on the internet.
    • Remotes have a url (https or ssh), but the can also have names.
    • It is common to have only one remote at it is named "origin"
    • Why remotes

      Allow collaboration!

  • GitHub

    Many foilks use GitHub for hosting. If you do, most of of your local repos will be associated with a remote named "origin" and a url something like git@github.com:schwilklab/CSC-sky-island-forest.git

  • Working with remotes

    When you clone from a remote, git remembers and calls that remote "origin". You can look at .git/config or, better use git commands to view:

    git remote -v
    
    origin	git@github.com:schwilklab/R-research-tool.git (fetch)
    origin	git@github.com:schwilklab/R-research-tool.git (push)
    
  • Adding a remote

    If you need to add a remote to an existing repository:

    git remote add rem2 git@github.com:dschwilk/software-tools-course.git
    
  • Fetching and pulling
    • fetch
      git fetch origin
      git fetch
      

      This pulls the data to your local repository but doe not merge it. It just means you have a local copy of the changes in the remote so that you can merge them.

    • pull
      git pull
      

      Fetches data and then merges with your local repo.

  • git push
    git push <remote> <branch name>
    eg: git push origin master
    or just: git push
    

    This works if no one else has pushed changes to that remote since you last pulled. If there is an error, you will need to pull (fetch and merge) first.

  • git and GitHub
    • "forking": creating a repo hosted on github that was cloned from someone else's repo.
    • Your local version will then have your "fork" as the remote named "origin" and the original repo will be a remote called "upstream"
    • This is overkill for most small groups but essential for open source projects with many contributors

28.6. Resolving conflict

  • Resolving conflict

    Walk through example

28.7. Advanced uses

  • stage only portions of change
    git add -p
    

    Interactively add changed "hunks".

  • Interactive rebase
  • git on emacs with magit
Back to top | E-mail Schwilk