R as a Research Tool: Lectures

Table of Contents

1 Introduction

1.1 Introduction to course

  • Website and homework
    Website
    http://r-research-tool.schwilk.org/
    Homework
    Email me; see the webpage for format instructions. dylan.schwilk@ttu.edu
    Syllabus
    See handout, link to syllabus
  • No textbook

    but you can check out these books/ebooks (all linked on the course website):

    • R for Data Science by Garrett Grolemund and Hadley Wickham: http://r4ds.had.co.nz/. This is a great new book focused on tidying, visualizing and exploring data in R using the great packages of the "hadleyverse" or "tidyverse". I will cover some of the concepts in this book, but our class is more focused on general-purpose programming.
    • Books in the Springer Use R! series are free from a TTU IP address: http://www.springerlink.com/books/
    • Advanced R by Hadley Wickham: http://adv-r.had.co.nz/. Available free online.
    • The Art of R Programming by Norman Matloff.
  • Me
  • Computers

    If you don't know anything about computers, just remember that they are machines that do exactly what you tell them but often surprise you in the result.

    — Richard Dawkins, The Blind Watchmaker (1986)

  • Computer Architecture
    • Architecture

      comp-arch.png

    • Instructions
  • Computer Languages
    • Machine Language

      1111 0000 1011 0100 (in hexadecimal: F0B4)

    • Assembly
      iload intRate
      bipush 100
      if_icmpgt intError
      
    • High level language (C, Python, R …)
      if (intRate > 100)
      
  • Interpreted vs compiled languages
    • Compiled

      Higher-level language code is converted into machine code and then 'directly' executed by the host CPU

      • Faster execution, but needs compilation time
      • No dynamic typing or scoping
    • Interpreted (eg R)

      Programs are indirectly executed ("interpreted") by an interpreter program

      • Platform independence
      • Reflection (ability to examine and modify the structure and behavior of an object at run time)
      • Dynamic typing
  • R
    • R is a computer language for statistical computing
    • R is an interpreted language
    • FORTRAN, C, and C++ are all compiled languages
    • Other interpreted languages used in science: Perl, Python, Julia, Matlab (Octave).
    • R and Python are the most commonly used in science.

1.2 R

  • Why R?
    • Open source, available on every major platform. Ensures that scientists around the world — and not just those in rich countries — are co-owners of the software tools needed to carry out research. Contributes to reproducible research!
    • A language designed around data analysis and well suited to interactive exploration of data.
    • A massive set of packages. If you want to do something, chances are that someone has already tried to do it.
    • The product of thousands of leading experts in the fields they know best, and therefore a source of cutting-edge tools.
    • Strong and friendly community. R is the de facto standard language in statistics and many areas of biology.
  • Shortcomings of R compared to other languages
    • There are nutty aspects of the language and many ways to do the same thing. Many R functions use tricks to reduce the amount of typing (interaction!) at the cost of making code that is hard to understand.
    • R "evolved" it was not designed. Many APIs are inconsistent (Even in core R!)
    • Most R programmers are focused on results, using data. They are not computer scientists. There is a lot of poorly written and poorly documented R code out there.
    • R is not very fast. It uses memory inefficiently. But it is often fast enough. This is less true than it once was.
  • Installing R

    CRAN: The Comprehensive R Archive Network

    • R lives at http://cran.r-project.org/
    • If you run Windows or Mac go to the address above and download precompiled binary distributions.
    • If you run a Debian-based Linux (eg Ubuntu), simply do:

      sudo apt-get update
      sudo apt-get install r-base
      
    • On Linux, you can keep up to date with the newest changes by adding a CRAN repository to your package manager, eg add

      deb http://<cran.mirror>/bin/linux/ubuntu trusty/
      

      to sources.list

  • Running R on Windoze
    • You can use the Rgui program which will open a console window and an editing window, but…
    • Use RStudio. Instructions for installing it are in the syllabus.
  • Running R from command line

    Start R and you'll get a greeting, and then the R prompt, the > sign. The screen will look something like:

    R
    
    R version 3.3.0 (2016-05-03) -- "Supposedly Educational"
    Copyright (C) 2016 The R Foundation for Statistical Computing
    Platform: x86_64-pc-linux-gnu (64-bit)
    
    [...MORE...]
    
    Type 'demo()' for some demos, 'help()' for on-line help, or
    'help.start()' for an HTML browser interface to help.
    Type 'q()' to quit R.
    
  • RStudio

    rstudio-screenshot.jpg

  • RStudio
    • Panes: console, workspace, files, current open file (script).
    • Projects: Use them! RStudio saves a .Rproj file in your project directory.
  • RStudio: keyboard shortcuts
    • In editor
      • Command/ctrl + enter: send code to console
      • Ctrl + 2: move cursor to console
    • In console
      • Up arrow: retrieve previous command
      • Ctrl + up arrow: search commands
      • Ctrl + 1: move cursor to editor
  • As a script from the command line
    • You can write an script and save it as a plain text file (usually with the extension ".R")
    • Then call your script by typing:

      Rscript myscript.R
      
    • On linux, you can use the bash "shebang" mechanism to make the script executable: just put #!/usr/bin/Rscript as the first line of the script. then you can call your script directly:

      myscript.R
      

1.3 Fundamentals of R

  • R as a calculator:
    4.0 * 3.5
    log(10)  # natural log
    log10(10)
    (3 + 1) * 5
    3^-1
    1/0
    
    [1] 14
    [1] 2.302585
    [1] 1
    [1] 20
    [1] 0.3333333
    [1] Inf
    
  • Statements
    • one statement per line
    • but R will keep reading if the statement is "not finished" (open parentheses, hanging operator, etc)
    • semicolon can end a statement (rarely, if ever, needed)
    1 +
    2  # second statement ends here
    3 + 4 + ( 7 
             - 7)
    
    [1] 3
    [1] 7
    
  • Assigning values to variables
    • Variables are assigned using "<-"

      "=" is also now allowed. "==" tests equality.

      x <- 12.6
      x
      x == 13
      
      [1] 12.6
      [1] FALSE
      
    • Variables that contains many values

      Vectors are created with the concatenate function, c():

      y <- c(3,7,9,11)
      y
      
      [1]  3  7  9 11
      
  • Naming variables (objects)
    • 1-63 characters
    • first character a letter
    • remaining characters: letters or numbers or period (+ underscore)
    • case sensitive
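
    For example (a quick sketch; illegal names are left commented out so the chunk runs):

    my.data <- 1    # legal: letters, digits, periods
    x_total2 <- 2   # legal: underscore and a digit (not first)
    # 2data <- 3    # illegal: cannot start with a digit
    # my-data <- 4  # illegal: "-" is the subtraction operator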
  • Parentheses and braces
    • () Parentheses group expressions
    • {} Braces group statements and return the last expression evaluated. This is needed for function definitions, if-else statements, and a few other areas.
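
      For example:

      z <- {2 + 2; 10}  # both statements are evaluated; z gets the last value, 10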
  • Functions

    Functions are modules of code that accomplish a specific task. Functions take data as "arguments" and "return" a result. Once a function is written, it can be used over and over and over again. Functions can be "called" from within other functions.

  • Functions in R

    Arguments to R functions can be matched by position or by name. So the following calls to sd() are all equivalent

    mydata <- rnorm(100)
    sd(mydata)
    sd(x = mydata)
    sd(x = mydata, na.rm = FALSE)
    sd(na.rm = FALSE, x = mydata)
    sd(na.rm = FALSE, mydata)
    
  • Example: random numbers
    runif(5)
    
    [1] 0.0705746 0.7757365 0.2578603 0.3275505 0.6414583
    
    runif(5, 0, 10)
    
    ## five random reals between 0 and 10 (your values will differ)
    
    floor(runif(5,0,11))
    
    [1] 10  0  5 10  9
    
  • Inspecting functions
    sd
    
    function (x, na.rm = FALSE) 
    sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), 
        na.rm = na.rm))
    <bytecode: 0x1ef8258>
    <environment: namespace:stats>
    
  • Where do these functions come from?

    Many are part of base R. Others come from packages. Others you will write yourself.

    • R packages are loaded with library()
      library(ggplot2)
      
    • To install a new package:
      install.packages("ggplot2")
      
  • Scatterplots
    library(ggplot2)
    # ?mpg
    head(mpg)
    str(mpg)
    summary(mpg)
    
    qplot(displ, hwy, data = mpg)
    
  • The result

    fig1_1.png

  • Vectors: the heart of R
    x <- 1:20
    x
    
     [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
    
  • Matrices
    B <- matrix( c(2,4,3,1,5,7), nrow=3, ncol=2)
    B
    
         [,1] [,2]
    [1,]    2    1
    [2,]    4    5
    [3,]    3    7
    
  • Session data
    .Rprofile
    File loaded on startup which can contain your own custom functions
    .Rhistory
    File that contains the session history of commands (as plain text)
    .RData
    Binary blob of data which contains all the objects created in the session. Watch out for this!
  • Working directory

    Set your working directory! If you have a $HOME environment variable set, then path.expand is helpful.

    getwd()
    setwd(path.expand("~/") )
    
  • .RData and .Rhistory
    • When you quit a session you are asked if you want to save the workspace
    • or use save.image()
    • When R starts it loads all objects from ./.RData !
    • It is a good idea to start scripts with:

      rm(list=ls(all=TRUE))
      

1.4 Getting help and essential vocabulary

  • Help in R
    • Use R's built in self-documentation:
      help(seq)
      ?seq
      ?"<"
      ?"for"
      
    • For examples:
      example(seq)
      example(persp)
      
    • If you don't know exactly what you are looking for:
      help.search("multivariate normal")
      ??"multivariate normal"
      
  • Help on the internet
    • The R Project’s own manuals are available at the R home page and some search engines are listed there, too.
    • Quick R website: http://www.statmethods.net/
    • RSeek search engine: http://www.rseek.org/
    • You can post your R questions to r-help, the R listserve. See http://www.r-project.org/mail.html.
    • Check out StackOverflow for programming questions tagged "r": http://stackoverflow.com/
    • R is difficult to search for when using general search engines such as Google. One trick: use Google's filetype criterion. To search for scripts with a .R suffix that pertain to, say, permutations, do: filetype:R permutations -rebol
  • Essential vocabulary

    See Hadley Wickham's list: http://adv-r.had.co.nz/Vocabulary.html

2 Plots with ggplot

2.1 Introduction by example

  • Making pretty figures

    Today we will get ahead of ourselves. Don't worry! I'll come back to the basics after showing you some fun with plots.

  • ggplot2
    #install.packages("ggplot2")
    library(ggplot2)
    
    #?mpg
    head(mpg)
    #str(mpg)
    summary(mpg)
    
  • ggplot() function

    Create a plot object to which one can add layers (geoms)

    • ggplot(data, mapping, ...)
      data
      data frame with variables as columns
      mapping
      an aesthetic mapping of variable names to x, y position. Create this with the aes() function
    • aes(x, y, ...)
      x
      x axis variable
      y
      y axis variable
      ...
      other aesthetics (color, size, etc)
  • First plot
    ggplot(mpg, aes(displ, hwy)) + geom_point()
    
    • Result

      figggplot_1.png

  • geoms
    • ggplot() simply sets up the axes and the mapping. To plot data you must add a "geometric object" or "geom"
    • Use geom_point() for a scatterplot. See other geoms such as geom_line(), geom_smooth(), geom_boxplot()
  • Additional variables, an example
    ggplot(mpg, aes(displ, hwy, color=class)) + geom_point()
    
    • Result

      figggplot_2.png

  • Saving plots
    1. You can store the plot object to a variable name then save with ggsave():
      • p <- ggplot(mpg, aes(displ, hwy)) + geom_point()
      • ggsave("plot1.pdf", p)
    2. Save the last plot even if it was not assigned to a variable:
      • ggplot(mpg, aes(displ, hwy)) + geom_point()
      • ggsave("plot1.pdf")
    3. You can use the RStudio GUI ("export" button in the plots window)

2.2 Aesthetics and faceting

  • Aesthetics in addition to x and y
    Aesthetic   Discrete variable          Continuous variable
    ---------   ------------------------   -----------------------
    Color       Rainbow of colors          Gradient red to blue
    Size        Discrete size steps        Mapping radius -> value
    Shape       Different shape for each   Doesn't work
  • Example aesthetics
    ggplot(mpg, aes(displ, hwy, size = cyl)) + geom_point()
    
    • Result

      figggplot_3.png

  • Faceting
    • Small multiples displaying different subsets of the data.
    • Useful for exploring conditional relationships and for large data.
  • Examples to try:
    p <- ggplot(mpg, aes(displ, hwy)) + geom_point()
    p + facet_grid(. ~ cyl)
    p + facet_grid(drv ~ .)
    p + facet_grid(drv ~ cyl)
    p + facet_wrap(~ class)
    
  • Faceting
    facet_grid()
    2d grid, rows ~ cols, . for no split
    facet_wrap()
    1d ribbon wrapped into 2d
    scales
    argument controls whether position scales are fixed or free
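
    For example, letting each panel scale its y axis independently (a sketch reusing p from the previous slide):

    p + facet_wrap(~ class, scales = "free_y")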
  • Overplotting: What is wrong here?
    p <- ggplot(mpg, aes(cty, hwy))
    p + geom_point()
    

    ggplot-no-jitter.png

  • Jitter
    p + geom_jitter()
    

    ggplot_jitter.png

  • How can we improve this plot?
    ggplot(mpg, aes(class, hwy)) + geom_boxplot()
    

    figggplot_4.png

  • Reorder
    p <- ggplot(mpg, aes(reorder(class, hwy), hwy))
    p + geom_boxplot()
    

    figggplot_5.png

  • Multiple geoms
    p + geom_jitter() + geom_violin()
    

    figggplot_6.png

  • Your turn
    • Read the help for reorder. Redraw the previous plots with class ordered by median hwy.
    • How would you put the jittered points on top of the violin plots?

3 Data Structures

3.1 Math in R

  • Doing math in R

    exp(), log(), log10(), sqrt(), abs(), min(), which.min(), max(), which.max(), pmin(), pmax(), sum(), prod() (for products of multiple factors), round(), floor(), ceiling(), sort(), factorial() … and more.

  • Operator precedence
    $ Subsetting by name
    @ component / slot extraction
    [ [[ subsetting
    ^ exponentiation
    : sequence operation
    %any% special operators, including %% and %/%
    * / multiply and divide
    + - add and subtract
    < > <= >= == != comparisons (ordering, equality)
    ! negation
    & && and
    | || or
    ~ "depends on" in formulae
    <- assignment
    <<- global assignment
    ? help
  • Order of operations
    x <- 3 + 4 * 2
    x
    
    [1] 11
    
    x <- (3 + 4) * 2
    x
    x <- (3 + 4) * sqrt(2)
    x
    
    [1] 14
    [1] 9.899495
    

3.2 Vector basics

  • R data structures
    • vectors
    • matrices and arrays
    • lists
    • data frames

    And other specialized structures provided by packages (eg tibbles, data.table, phylogenetic trees …)

  • Vectors
    x <- c(5,12,13)
    x
    length(x)
    mode(x)
    
    [1]  5 12 13
    [1] 3
    [1] "numeric"
    
  • The ":" operator

    Note the ":" operator you used in homework:

    5:8
    5:1
    
    [1] 5 6 7 8
    [1] 5 4 3 2 1
    

    Beware of the operator precedence:

    i <- 3
    1:i-1
    1:(i-1)
    
    [1] 0 1 2
    [1] 1 2
    
  • seq()
    seq(from=12,to=30,by=3)
    seq(from=1.1,to=2,length=10)
    
    [1] 12 15 18 21 24 27 30
    [1] 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
    
  • rep()
    rep(8,4)
    rep(c(5,12,13),3)
    rep(c(5,12,13),each=2)
    
  • The subsetting operators

    R has three subset operators:

    • [
    • [[
    • $

    We'll start with one you've already used in your homework: [

  • Vector subsetting
    v <- 1:100
    v
    
     [1]   1   2   3   4   5   6   7   8   9  10  11  12  13  14  15  16  17  18
    [19]  19  20  21  22  23  24  25  26  27  28  29  30  31  32  33  34  35  36
    [37]  37  38  39  40  41  42  43  44  45  46  47  48  49  50  51  52  53  54
    [55]  55  56  57  58  59  60  61  62  63  64  65  66  67  68  69  70  71  72
    [73]  73  74  75  76  77  78  79  80  81  82  83  84  85  86  87  88  89  90
    [91]  91  92  93  94  95  96  97  98  99 100
    
    v <- v * 10
    v[1:3]
    
    [1] 10 20 30
    
  • Character vectors (strings)
    y <- "abc"
    length(y)
    mode(y)
    y <- c("abc","1","2")
    length(y)
    
    [1] 1
    [1] "character"
    [1] 3
    
  • Strings
    u <- paste("abc", "de", "f") # concatenate the strings
    u
    
    [1] "abc de f"
    
    v <- strsplit(u, " ") # split the string according to blanks
    v
    
    [[1]]
    [1] "abc" "de"  "f"
    
    • Note

      See stringr package for useful string functions

  • Some vector concepts
    Recycling
    Automatic lengthening of vectors in certain settings
    Filtering
    Extracting subsets of vectors
    Vectorization
    Where functions are applied element-wise to vectors
  • Recycling
    x <- c(1, 2, 3, 4)
    

    Recycling:

    c(1,2) + x
    
    [1] 2 4 4 6
    
    c(1, 2, 3) + x
    
    [1] 2 4 6 5
    Warning message:
    In c(1, 2, 3) + x :
      longer object length is not a multiple of shorter object length
    
  • Vectorization

    Many functions are vectorized, including operators such as >

    u <- c(5,2,8)
    v <- c(1,3,9)
    u > v
    
    [1]  TRUE FALSE FALSE
    
    sqrt(1:9)
    
    [1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
    [9] 3.000000
    
  • Vectorized if-then-else
    x <- 1:10
    y <- ifelse(x %% 2 == 0, 0, 2*x)
    y
    
     [1]  2  0  6  0 10  0 14  0 18  0
    

3.3 Subsetting

  • Vector indexing (subsetting)

    There are six things that you can use to subset a vector. The first three are commonly used:

    • positive integers
    • negative integers
    • logical vectors
    • nothing ([] returns original vector)
    • zero ([0] returns empty vector)
    • character vectors (only if vector is named)
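
    The three less common forms in action:

    y <- c(a = 1.2, b = 3.9, c = 0.4)
    y[]             # nothing: returns the original vector
    y[0]            # zero: returns an empty vector
    y[c("a", "c")]  # character vector: subset by name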
  • Subsetting vectors
    y <- c(1.2,3.9,0.4,0.12)
    y[c(1,3)]
    y[1:3]
    
    [1] 1.2 0.4
    [1] 1.2 3.9 0.4
    
    v <- c(1,4)
    y[-v]
    
    [1] 3.9 0.4
    
  • Logical vectors and filtering
    v <- c(TRUE,FALSE, FALSE, TRUE)
    y[v]
    
    [1] 1.20 0.12
    
    y[ y > 1 ]
    
    [1] 1.2 3.9
    
    y > 1
    
    [1]  TRUE  TRUE FALSE FALSE
    
  • match and %in%
    names <- c("fortis", "spinosa", "macrocarpa", "leucophylla")
    samples <- c("spinosa", "leucophylla")
    samples %in% names
    
    [1] TRUE TRUE
    
    names %in% samples
    
    [1] FALSE  TRUE FALSE  TRUE
    
    match(samples, names)
    
    [1] 2 4
    
  • any(), all()

    Useful for boolean (logical) vectors

    any(y > 1)
    
    [1] TRUE
    
    all(y > 1)
    
    [1] FALSE
    
  • Assignment to subset
    which(y > 1)
    
    [1] 1 2
    
    y[y > 1] <- 0
    y
    
    [1] 0.00 0.00 0.40 0.12
    
  • Boolean operators
    operator      function
    --------      -------------------------
    ==            Test for equality
    <             Less than
    >             Greater than
    <= >=         Less/greater than or equal
    &&            Boolean AND for scalars
    ||            Boolean OR for scalars
    &             Boolean AND for vectors
    |             Boolean OR for vectors
    !x            Boolean negation
    xor()         Exclusive OR
    any(), all()  Boolean vector reduction
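
    A quick illustration of the vector forms versus the scalar forms:

    a <- c(TRUE, TRUE, FALSE)
    b <- c(TRUE, FALSE, FALSE)
    a & b         # [1]  TRUE FALSE FALSE  (element-wise)
    xor(a, b)     # [1] FALSE  TRUE FALSE
    a[1] && b[1]  # [1] TRUE  (scalars; short-circuits)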

3.4 Final words on vectors

  • NA and NULL
    NA
    Missing data
    NULL
    The null object: The value simply does not exist (eg an impossible value)
    x <- c(1,NA,3,4,5,6)
    mean(x)
    
    [1] NA
    
    mean(x,na.rm=TRUE)
    
    [1] 3.8
    
    x <- c(1,NULL,3,4,5,6)
    x # NULL can't be part of a vector
    mean(x)
    
    [1] 1 3 4 5 6
    [1] 3.8
    
  • unique()
    x <- c(1, 2, 3, 3, 3, 4, 4, 1, 2)
    unique(x)
    
    [1] 1 2 3 4
    
  • Counting TRUES
    x < 3
    length(which(x < 3))
    sum(x < 3)
    
    [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
    [1] 4
    [1] 4
    
  • Vector element names
    x <- c(1,2,4)
    names(x)
    
    NULL
    
    names(x) <- c("one","two","four")
    x
    
    one  two four 
      1    2    4
    
  • Subsetting on names
    x[c("one", "four")]
    
    one four 
      1    4
    
  • c(): Type coercion:
    • Values are converted to the simplest type required to represent all information.
    • The ordering is roughly logical < integer < numeric < complex < character < list.
    c(5,2,"abc")
    c(5,list(a=1,b=4))
    
    [1] "5"   "2"   "abc"
    [[1]]
    [1] 5
    
    $a
    [1] 1
    
    $b
    [1] 4
    
  • c(): Flattening:
    c(1,2,c(100,200,300))
    
    [1]   1   2 100 200 300
    

3.5 Matrices and arrays

  • Matrices
    m <- matrix(data = c(1,3,5,2,4,6), nrow=3, ncol=2)
    m
    
         [,1] [,2]
    [1,]    1    2
    [2,]    3    4
    [3,]    5    6
    
    m <- matrix(data = c(1,3,5,2,4,6), nrow=3)
    m
    
         [,1] [,2]
    [1,]    1    2
    [2,]    3    4
    [3,]    5    6
    
  • Matrices cont.
    a <- matrix(c(1,2,3,4), nrow=2)
    a
    
         [,1] [,2]
    [1,]    1    3
    [2,]    2    4
    
    b <- matrix(c(6,7,8,9), nrow=2)
    b
    
         [,1] [,2]
    [1,]    6    8
    [2,]    7    9
    
  • Matrix addition
    a + b
    
         [,1] [,2]
    [1,]    7   11
    [2,]    9   13
    
  • Matrix math: dimensions
    c <- matrix(c(3,4,5,6), nrow=1)
    c
    
         [,1] [,2] [,3] [,4]
    [1,]    3    4    5    6
    
    a + c
    
    # result:
    Error in a + c : non-conformable arrays
    
  • Matrix math, cont.
    a * b
    
         [,1] [,2]
    [1,]    6   24
    [2,]   14   36
    
    a^3
    
         [,1] [,2]
    [1,]    1   27
    [2,]    8   64
    
  • Matrix multiplication
    a <- matrix(c(1,2,3),ncol=3)
    b <- matrix(c(4,5,6),ncol=1)
    a
    b
    
     
        [,1] [,2] [,3]
    [1,]    1    2    3
     
        [,1]
    [1,]    4
    [2,]    5
    [3,]    6
    

    Calculate dot product (inner product)? Or generally, matrix product?

  • Matrix multiplication
    # a*b # ERROR
    a %*% b
    
         [,1]
    [1,]   32
    
    b %*% a
    
         [,1] [,2] [,3]
    [1,]    4    8   12
    [2,]    5   10   15
    [3,]    6   12   18
    
  • rbind and cbind
    rbind(a, a)
    # rbind(a,b) # ERROR!
    cbind(b, b)
    
         [,1] [,2] [,3]
    [1,]    1    2    3
    [2,]    1    2    3
     
        [,1] [,2]
    [1,]    4    4
    [2,]    5    5
    [3,]    6    6
    
  • Row and column means

    See:

    • rowMeans()
    • colMeans()
    • rowSums()
    • colSums()
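
    For example:

    m <- matrix(1:6, nrow=2)
    rowSums(m)   # [1]  9 12
    colMeans(m)  # [1] 1.5 3.5 5.5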
  • Subsetting matrices
    m <- matrix(c(1,2,3,4,5,6),nrow=3)
    m[1:2,1:2]
    
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    
  • Dimension reduction
    z <- matrix(1:8, nrow=4)
    z
    
         [,1] [,2]
    [1,]    1    5
    [2,]    2    6
    [3,]    3    7
    [4,]    4    8
    
    r <- z[2,]
    r
    
    [1] 2 6
    
  • Dimension reduction result
    attributes(z)
    attributes(r)
    
    $dim
    [1] 4 2
    NULL
    
  • Avoiding dimension reduction
    r <- z[2,, drop=FALSE]
    r
    
         [,1] [,2]
    [1,]    2    6
    
    dim(r)
    
    [1] 1 2
    

3.6 Lists

  • Lists

    An R list is a container whose contents can be items of different data types, unlike a vector.

    x <- list(u=2, v="abc")
    x
    x$u
    
    $u
    [1] 2
    
    $v
    [1] "abc"
    
    [1] 2
    
  • Subsetting lists
    [
    returns a list
    [[
    returns an element of the list
    $
    returns an element by name
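
    For example, with x from the previous slide:

    x["u"]    # a one-element list containing u
    x[["u"]]  # the element itself: [1] 2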
  • Lists continued

    A common usage of lists is to package the return values of complex statistical functions. For example, the histogram function hist():

    hist(Nile)
    hn <- hist(Nile)
    print(hn)
    
    $breaks
     [1]  400  500  600  700  800  900 1000 1100 1200 1300 1400
    
    $counts
     [1]  1  0  5 20 25 19 12 11  6  1
    
    $density
     [1] 0.0001 0.0000 0.0005 0.0020 0.0025 0.0019 0.0012 0.0011 0.0006 0.0001
    
    $mids
     [1]  450  550  650  750  850  950 1050 1150 1250 1350
    
    $xname
    [1] "Nile"
    
    $equidist
    [1] TRUE
    
    attr(,"class")
    [1] "histogram"
    

3.7 Data Frames

  • Data Frames

    Like a matrix, but for mixed types. This is what you will use most often!

    d <- data.frame(kids=c("Jack","Jill","Igor"), ages=c(12,10,11))
    d
    
      kids ages
    1 Jack   12
    2 Jill   10
    3 Igor   11
    
  • Reading data

    For now, we will assume you save your data as a comma-separated values file (say, from Excel), and then read that file into a data frame:

    fish <- read.csv("./data/fish-data.csv", stringsAsFactors=FALSE)
    head(fish,3)
    
      aquarium nutrient color.score
    1        a      low         5.3
    2        a      low         4.7
    3        a      low         5.0
    

    Always use stringsAsFactors = FALSE (or use Hadley Wickham's readr package).

4 Introduction to functions and more lists

4.1 Functions

  • Why write functions?

    Why not just copy and paste code?

    • You can give a function a name that makes your code easier to understand.
    • If you need to change the code, you only need to change it in one place.
    • You eliminate the chance of making mistakes when you copy and paste.
  • When to write a function

    Whenever you find yourself copying and pasting more than twice.

  • Functions are objects

    So we create them and assign them to a name.

    addone <- function(x) {
       return(x+1)
    }
    addone(1)
    
    [1] 2
    
  • Function example
    num2string <- function(thename, num) {
      if (num > 1) {
         thename <- paste(thename, "s", sep="")
      }
    
      return(paste(as.character(num), thename))
    }
    
    num2string("computer", 44)
    num2string("instructor", 1)
    
    [1] "44 computers"
    [1] "1 instructor"
    

4.2 In class project

  • Function RandomIntegers()
    • RandomIntegers(n, min_val, max_val)

      Returns n uniformly distributed random integers between min_val and max_val, inclusive

      • Look up:
        ?runif
        ?round
        
  • Problem: convert random real numbers to integers

    rand_to_int.png

  • Solution
    # RandomIntegers: produce vector of uniformly 
    # distributed random integers.
    #
    #   Arguments:
    #     n = number of random integers to be generated.
    #     min_val, max_val = min, max possible values, 
    #                      inclusive.
    # Returns:
    #     vector of random integers.
    
    RandomIntegers <- function(n, min_val, max_val) {
      # Extend upper boundary by 1 and use floor()
      result <- runif(n, min_val, max_val + 1)
      result <- floor(result)
      return(result)
    }
    
  • Other solutions, using round():
    RandomIntegers2 <- function(n, min_val, max_val) {
      # Extend each boundary by 0.5 and use round()
      r = runif(n, min_val-0.5, max_val+0.5)
      return(round(r))
    }
    

    Or … extend lower bound and use ceiling()

  • Other solutions, using sample()
    RandomIntegers3 <- function(n, min_val, max_val) {
      sample(min_val:max_val, n, replace=TRUE)
    }
    

4.3 Some notes on functions

  • Steps to writing a function
    1. Decide what information is needed (arguments) and what to return
    2. Develop the algorithm
    3. Implement in code
  • Potential problems
    • Syntax errors (error when you define the function)
    • Errors during run time (eg undefined variable names)
    • Runs, no errors but output wrong!
  • Debugging

    More on this later …

    • use print()
    • use stopifnot()
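
    A minimal sketch of both (meanPositive is a made-up example):

    meanPositive <- function(x) {
      stopifnot(is.numeric(x), all(x > 0))  # fail early with a clear message
      print(length(x))                      # crude trace while developing
      return(mean(x))
    }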

4.4 Lists

  • Lists vs vectors

    An R list is a container whose contents can be items of diverse data types, as opposed to them being, say, all numbers, as in a vector.

    • Lists are ordered lookup tables:

      Compared with other computer languages, R lists are a cross between a hash map and a list: they act as hash maps (like a Python dict or Perl hash), but their elements are ordered, like a linked list or a vector.

  • List indexing
    j <- list(name="Joe", salary=39000, permanent=TRUE)
    j$salary
    j$sal  # partial name matching!
    j[["salary"]]
    j[[2]]
    j[2]
    
    [1] 39000
    [1] 39000
    [1] 39000
    [1] 39000
    $salary
    [1] 39000
    
  • [] vs [[]]
    j[[1]]
    Extracts a single item
    j[1]
    Extracts a sublist
  • Adding elements
    j$rnumber <- "R00001111"
    j
    
    $name
    [1] "Joe"
    
    $salary
    [1] 39000
    
    $permanent
    [1] TRUE
    
    $rnumber
    [1] "R00001111"
    
  • Adding more elements
    length(j)
    j[[5]] <- "Extra"
    j[6:7] <- c(F,F)
    j[3:7]
    
    [1] 4
    $permanent
    [1] TRUE
    
    $rnumber
    [1] "R00001111"
    
    [[3]]
    [1] "Extra"
    
    [[4]]
    [1] FALSE
    
    [[5]]
    [1] FALSE
    
  • Deleting elements
    length(j)
    j$rnumber <- NULL
    length(j)
    names(j)
    
    [1] 7
    [1] 6
    [1] "name"      "salary"    "permanent" ""          ""          ""
    
  • Inserting elements

    Either break apart and re-concatenate with c(), or use append():

    newj <- append(j, list(birthday="12/10/1970"), after=3)
    newj[1:5]
    
    $name
    [1] "Joe"
    
    $salary
    [1] 39000
    
    $permanent
    [1] TRUE
    
    $birthday
    [1] "12/10/1970"
    
    [[5]]
    [1] "Extra"
    
  • Lists to vectors: unlist()
    l <- list(a=1,b=10,c=100)
    v <- unlist(l)
    v
    mode(l)
    mode(v)
    unname(v)
    
      a   b   c 
      1  10 100 
    [1] "list"
    [1] "numeric"
    [1]   1  10 100
    

4.5 Apply functions

  • Apply family of functions
    apply
    apply a function to the rows, columns or values in a matrix/array
    lapply
    apply a function to every item in a list and return a list
    sapply
    like lapply, but simplifies the result, usually to a vector
    mapply
    You have multiple data structures and want to hand elements of each as separate arguments to a function

    There are more in this family that we use less commonly. See ??apply and scroll down.
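
    For example, mapply() pairs up the elements of its arguments:

    mapply(rep, 1:3, 3:1)
    
    [[1]]
    [1] 1 1 1
    
    [[2]]
    [1] 2 2
    
    [[3]]
    [1] 3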

  • Applying functions to lists
    • lapply
      lapply(list(1:3,25:29),median)
      
      [[1]]
      [1] 2
      
      [[2]]
      [1] 27
      
    • sapply
      sapply(list(1:3,25:29),median)
      
      [1]  2 27
      
  • lapply example

    Consider sexing 10,000 shrimp. We'll use a subset. We'd like to know the indices for all females, males and juveniles:

    shrimp <- c("M", "F", "J", "J", "F", "F", "M", "F", "F")
    shrimp=="F"
    which(shrimp=="F")
    
    [1] FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
    [1] 2 5 6 8 9
    
  • Example, continued
    codes <- c("F","M","J")
    lapply(codes, function(sex) which(shrimp==sex))
    
    [[1]]
    [1] 2 5 6 8 9
    
    [[2]]
    [1] 1 7
    
    [[3]]
    [1] 3 4
    
  • Whoa, where was the function?

    It was anonymous (unnamed)! An alternative with a named function:

    # function to return indices of x in v
    indices <- function(x, v) {
      return(which(x==v))
    }
    # and use sapply for USE.NAMES=TRUE:
    sapply(codes, indices, v=shrimp, simplify=FALSE)
    
    $F
    [1] 2 5 6 8 9
    
    $M
    [1] 1 7
    
    $J
    [1] 3 4
    

5 Control Structures

5.1 Logistics and homework solutions

  • Logistics
  • unique()
    x <- c(1, 2, 3, 3, 3, 4, 4, 1, 2)
    unique(x)
    
    [1] 1 2 3 4
    
  • Counting TRUES
    x < 3
    length(which(x < 3))
    sum(x < 3)
    
    [1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
    [1] 4
    [1] 4
    
  • Solutions Q. 1-3
    oak_wp <-
    read.csv("http://r-research-tool.schwilk.org/lectures/data/oak-water-potentials-simple.csv",
               stringsAsFactors = FALSE)
    
    n_species <- length(unique(oak_wp$spcode)) # 1
    any(oak_wp$md.psi > oak_wp$pd.psi) # This answers #2
    which_fail <- which(oak_wp$md.psi > oak_wp$pd.psi) #2 and
    n_fail_check <- length(which_fail) # 2
    min_index <- which.min(oak_wp$md.psi) #3
    oak_wp[min_index, c("md.psi", "date", "spcode")] #3
    
  • Solution Q. 4
    years <- unique(oak_wp$year)
    mean_for_year <- function(year){
      return(mean(oak_wp$md.psi[oak_wp$year==year], na.rm=TRUE))
    }
    wp_means <- sapply(years, mean_for_year)
    years[which.min(wp_means)]
    
  • R vocab for the first two weeks of class

5.2 Apply functions, cont.

  • Lists of lists
    a <- list(name="Dylan", dept="Biology")
    b <- list(name="Gondah", dep="Plant and Soil")
    c <- list(name="Kathryn", role="NRM")
    course <- list(Schwilk=a, Zulue=b, Brautigam=c)
    course
    
    $Schwilk
    $Schwilk$name
    [1] "Dylan"
    
    $Schwilk$dept
    [1] "Biology"
    
    
    $Zulue
    $Zulue$name
    [1] "Gondah"
    
    $Zulue$dept
    [1] "Plant and Soil"
    
    
    $Brautigam
    $Brautigam$name
    [1] "Kathryn"
    
    $Brautigam$role
    [1] "NRM"
    
  • sapply

    How can we extract a vector of just the first names?

    sapply(course,"[[","name")
    
    Schwilk     Zulue Brautigam 
    "Dylan"  "Gondah" "Kathryn"
    
  • Apply functions to matrices: apply()

    y <- apply(X, MARGIN, FUN, ...)

    X
    an array
    MARGIN
    subscripts which the function gets applied over
    FUN
    the function to be applied
    ...
    additional arguments to function

    Returns a vector or array

  • Applying a function to rows or columns

    The apply() function: the MARGIN argument is 1 if the function applies over rows, 2 for columns, etc.

    z <- matrix(c(1,2,3,4,5,6), ncol=2)
    z
    
         [,1] [,2]
    [1,]    1    4
    [2,]    2    5
    [3,]    3    6
    
    apply(z, 2, mean)
    
    [1] 2 5
    
  • on rows, using our own function:
    f <- function(x) x/c(2, 8)
    y <- apply(z, 1, f)
    y
    
     
        [,1]  [,2] [,3]
    [1,]  0.5 1.000 1.50
    [2,]  0.5 0.625 0.75
    
  • What are "margins"?
    • d-dimensional arrays
      • can be split \(2^d - 1\) ways
      • example: 2-d matrix can be split 3 ways: rows, columns or cells
  • Examples: matrix
    • MARGIN = 1: slice up into rows
    • MARGIN = 2: slice up into columns
    • MARGIN = c(1,2): slice up into individual cells

    hadley-matrix1.jpg

    — Wickham (2011): http://www.jstatsoft.org/v40/i01/

  • 3-d arrays

    can be split \(2^3 - 1 = 7\) ways

    hadley-array1.jpg

    — Wickham (2011): http://www.jstatsoft.org/v40/i01/

  • Example: grid of genotypes

    Let's store genotypes along the 1st dimension, on a 2d landscape of organisms. So we will make a 3d array:

    makeGenotype <- function() sample(1:100, 25, replace=TRUE)
    # simple 10 x 10 landscape
    landscape <- array(replicate(100, makeGenotype()),
                       dim = c(25, 10, 10))
    # The first genotype in landscape position 1,1:
    landscape[, 1, 1]
    
    [1] 55 22  4 80 57  8 14 30 51  9 89  7 31 33 47 77 62 91 58 34 49 82 20 41  2
    
  • Example: apply
    ## get matrix of fitnesses
    testFitness <-  function(g, e) {
      return( 1 / ( sum((e-g)^2) + 1))
    }
    opt <- makeGenotype() # "best" genotype
    fitnesses <- apply(landscape, c(2,3), testFitness, e = opt)
    # let's see the relative fitnesses as percentages
    round(fitnesses / max(fitnesses)*100)
    
  • Results:
          [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
     [1,]   40   45   27   38   40   36   38   44   36    33
     [2,]   41   34   45   37   51   46   37   49   34    40
     [3,]   35   40   35   35   43   35   41   31   45    51
     [4,]   31   58   24   46   44   38   33   38   40    37
     [5,]   34   36   56   44   45   58   37   39   46    68
     [6,]   39   40   47   51   33   36   54   47   44    42
     [7,]   42   56   44   44   38   35   36   38   40    70
     [8,]   43   53   67   50   36   39   59   38   40    55
     [9,]   34   42   61  100   29   42   37   27   50    35
    [10,]   37   58   39   64   53   54   82   42   40    31
    
  • Where is the most fit genotype?
    max(fitnesses)
    indices <-  which(fitnesses == max(fitnesses), arr.ind=TRUE)
    indices
    landscape[, indices[1], indices[2]]
    opt
    
    [1] 4.968944e-05
     
        row col
    [1,]   9   4
     
    [1] 25 89 23 48 69 84 40 30 19 52 87 35 15 54 30 30 48 85 47 19  5  6 62 20 21
     
    [1]  9 54 89  2 67 94  8  5 19 96 95 51 15 27  4 10 47 82 91 49  7 38 63 69 12
    
  • Heatmap of landscape:
    image(fitnesses)
    

    fig17_1.png

5.3 if-else

  • if else decisions
    tf <- function(r) {
      if (r == 13) {
        print("Lucky!")
      } else {
        print("No luck")
      }
    }
    
    tf(5)
    tf(13)
    
    [1] "No luck"
    [1] "Lucky!"
    
  • if, else if, else
    tf <- function(r) {
      if (r == 13) {
        print("Lucky 13!")
      } else if (r < 0) {
        print("Negative!")
      } else {
        print("Just a number")
      }
    }
    
    tf(13)
    tf(-100)
    tf(10)
    
    [1] "Lucky 13!"
    [1] "Negative!"
    [1] "Just a number"
    
  • if needs a single logical value
    if (c(TRUE, FALSE)) {}
    
    NULL
    Warning message:
    In if (c(TRUE, FALSE)) { :
      the condition has length > 1 and only the first element will be used
    
  • | vs || and & vs &&

    Use || and && in if statements. They are "short-circuiting"

    x <- 1
    if (x > 0 && c(TRUE, FALSE)) {
      print("Yes")
    }
    
    [1] "Yes"
    

6 Loops and functions II

6.1 Functions: review

  • The return() statement

    If you do not include one, R returns the value of the last expression evaluated

    • This allows very concise functions:
      isEven <- function(x) x%%2==0
      

      OK for very short, lambda-like (unnamed) functions such as the one above

  • But explicit return is clearer

    And can avoid some weird R gotchas

    funnyFunc1 <- function(x) {
      if(x == 10) "TEN" 
      else "NOT TEN"
    }
    funnyFunc2 <- function(x) {
      if(x == 10) r <- "TEN"
      else r <- "NOT TEN"
    } # invisible return from assignment
    
    funnyFunc1(10)
    funnyFunc2(10)
    print("huh?")
    print(funnyFunc2(10))
    
    [1] "TEN"
    [1] "huh?"
    [1] "TEN"
    
  • Side effects

    "Pure" functions have no side effects, just returned values. But you have used functions entirely for their side effect (eg library(), write.csv())

  • Luckily R protects you from most side effects

    Function has read access to variables at higher scope, but no write access.

    x <- y <- 10
    func <- function() {
       x <- 0
       y <- y*2
       return(y)
    }
    func()
    x
    y
    
    [1] 20
    [1] 10
    [1] 10
    

6.2 In-class project 1

  • Name the functions

    Read the code for each of the following three functions, puzzle out what they do, and then brainstorm better names.

    f1 <- function(x) {
        if (length(x) <= 1) return(x)
        return(x[length(x):1])
    }
    f2 <- function(string, prefix) {
      substr(string, 1, nchar(prefix)) == prefix
    }
    f3 <- function(x) {
        if(! is.vector(x) || ! is.numeric(x)) return(FALSE)
        return(length(x) == 1)
    }
    

6.3 Loops

  • for and while loops
    x <- c(1, 2, 3)
    for (i in x) { print(i^2) }
    
    [1] 1
    [1] 4
    [1] 9
    
    i <- 1
    while (i <= 10) {
      i <- i+4
      print(i)
    }
    
    [1] 5
    [1] 9
    [1] 13
    
  • Breaking out of loops
    i <- 1
    while(TRUE) {
    i <- i+4
      if (i>10) break
    }
    i
    
    [1] 13
    

    Or use repeat:

    i <- 1
    repeat {
      i <- i+4
      if (i>10) break
    }
    i
    
    [1] 13
    
  • Loop hierarchy

    specific to general: for, while, repeat

    So when do we use each type?

    • Use for loops when you need to iterate over a vector. There are a known number of iterations.
    • Use while loops when the loop must continue until some condition is met. If the condition is met when the loop is first encountered, the loop is skipped completely.
    • Use repeat as you would while, but when the loop must execute at least once or when the best exit point(s) lie somewhere in the middle of the loop block.
  • Two idioms for for loops
    • Idiom 1: Do something to each element
      v <- c(2,4,5,8)
      for(n in v) {
        print(n)
      }
      
      [1] 2
      [1] 4
      [1] 5
      [1] 8
      
  • Idiom 2: Iterate over an indexing vector

    in order to do something to elements over multiple vectors

    v2 <- c(1,1,2,10)
    for(i in seq_along(v)) {
      r <- v[i] * v2[i]
      print(sprintf("Loop %d, Result: %d", i, r))
    }
    
    [1] "Loop 1, Result: 2"
    [1] "Loop 2, Result: 4"
    [1] "Loop 3, Result: 10"
    [1] "Loop 4, Result: 80"
    

7 Exercises with functions

7.1 Homework Solutions

  • euclideanDist()

    Straightforward solution:

    euclideanDist <- function (p1, p2) {
      dist = sqrt((p1[1] - p2[1])^2 + (p1[2] - p2[2])^2)
      return(dist) 
    }
    
    

    Idiomatic R:

    euclideanDist <- function (p1, p2) {
      dist <- sqrt(sum((p2 - p1)^2)) 
      return(dist) 
    }
    
    
  • decomposeTime() with modulo arithmetic
    ## decomposeTime: Given a value in seconds, convert to days, hours, minutes, and
    #       seconds. If input is negative, all returned components will be negative
    #       as well.
    #
    #    args:
    #        nSeconds = scalar number of seconds.
    #         ----------------------------------------
    #    returns:
    #         a list containing:
    #            days = integer number of days.
    #            hours = integer number of hours.
    #            minutes = integer number of minutes.
    #            seconds = residual seconds.
    
  • decomposeTime() with modulo arithmetic
    decomposeTime <- function (nSeconds) {
      sInMinute <- 60
      sInHour <- 60 * 60
      sInDay <- sInHour * 24
    
      # deal with negative input by recording and getting absolute value
      sign <- 1
      if(nSeconds < 0) {
        sign <- -1
        nSeconds <- -1 * nSeconds
      }
    
      result <- list(days = nSeconds %/% sInDay, 
                   hours = (nSeconds %% sInDay) %/% sInHour,  
                   minutes  = (nSeconds %% sInHour) %/% sInMinute,  
                   seconds = nSeconds %% sInMinute)
      return(lapply(result, '*', sign))
    }
    
  • decomposeTime() without using %% or %/%
    decomposeTime <- function(inputSeconds) {
      # Some constants:
      secondsPerMinute = 60
      secondsPerHour = secondsPerMinute * 60
      secondsPerDay = secondsPerHour * 24
    
      # calculate each set:
      days = floor(inputSeconds / secondsPerDay)
      residualSeconds = inputSeconds - (days * secondsPerDay)
    
      hours = floor(residualSeconds / secondsPerHour)
      residualSeconds = residualSeconds - (hours * secondsPerHour)
    
      minutes = floor(residualSeconds / secondsPerMinute)
      seconds = residualSeconds - (minutes * secondsPerMinute)
    
      return(list(days=days,hours=hours,minutes=minutes,seconds=seconds))
    }
    

7.2 In class projects

  • Project 1: calculating a probability

    Suppose we have n independent events each with a separate probability of occurring. What is the probability of exactly one of those events occurring?

    • If there are three events (A, B and C)

      P(exactly one event occurring) =

      • P(A and not B and not C) +
      • P(not A and B and not C) +
      • P(not A and not B and C)
  • The algorithm

    If the probability of the \(i^{th}\) event is \(p_i\), then P(A and not B and not C) would be \(p_A(1 - p_B)(1 - p_C)\). For general n:

    \[ \sum\limits_{i=1}^{n} p_i (1 - p_1) \ldots (1 - p_{i-1}) (1 - p_{i+1}) \ldots{} (1 - p_n) \]

  • Your turn: translate this to code

    Function takes vector of probabilities and returns a scalar probability.

    exactlyone <- function(p) {
    ## ?
    }
    

    You will want to loop through the vector p while also looping through "not p" (1-p)

  • The code
    exactlyone <- function(p) {
      notp <- 1 - p
      totalProb <- 0.0
      for (i in seq_along(p)) {
        totalProb <- totalProb + p[i] * prod(notp[-i])
      }
      return(totalProb)
    }
    
  • Project 2: Fibonacci sequence

    0,1,1,2,3,5,8,13 …

    • Definition:

      \(F_1 = 0, F_2 = 1\) and \(F_n = F_{n-1} + F_{n-2}\)

    • fib()

      Write a function called fib(n) that returns the nth value in the Fibonacci sequence.

  • Solution
    fib <- function(n) {
        ## Two termination cases
        if (n==1) {
            return(0)
        } else if (n==2) {
            return(1)
        }
        # Otherwise use the definition of the sequence:
        return(fib(n-1) + fib(n-2))
    }
    
    # example: first 15 values in sequence:
    sapply(1:15, fib)
    
    [1]   0   1   1   2   3   5   8  13  21  34  55  89 144 233 377
    
  • The recursion stack

    recursion-stack.png

7.3 More recursion

  • "Exploding" dice

    In some table-top role playing games, dice are used to provide stochasticity in the simulation of events. In order to produce a more "exponential-like" distribution, some games use "acing" or "exploding dice" in which the die is re-rolled if the first roll results in the maximum value and the new value is added to the first.

    • Example:

      Some potential results from a roll when "acing" is allowed on a 6-sided die (a 'd6'):

      • 2
      • 5
      • 6 + 5 = 11
      • 6 + 6 + 1 = 13
      • 4
  • Simulating such a die roll
    • rollDie()

      Write a function called rollDie(x) that simulates rolling a die with x sides and allows "acing" or "exploding dice". Under this rule, a maximum roll (eg a "6" on a six-sided die) results in a second roll whose value is added to the first. This is open-ended and the second roll can "ace" as well: in other words, the resulting probability distribution is unbounded on the upper end.

  • A rollDie() Solution
    rollDie <-  function(die) {
       rolls <- sample(1:die, 1, replace=TRUE)
       if (rolls == die) {
         rolls <- rolls + rollDie(die)
       }
       return(rolls)
    }
    rollDie(6)
    
    [1] 4
    
  • rollDice()
    rollDice <-  function(die, n, allow.ace = TRUE) {
      stopifnot(die > 1) # no such thing as 1-sided die
      rolls <- sample(1:die, n, replace = TRUE)
      if(!allow.ace) return(rolls)
    
      aces <- rolls == die
      if ( any(aces)) {
        rolls[aces] <- rolls[aces] + rollDice(die, sum(aces), allow.ace)
      }
      return(rolls)
    }
    

8 Data Frames and Factors

8.1 Homework

  • Walk through algorithm:

    Example parents ["A", "B", "C", "D"] and ["a", "b", "c", "d"] .

    1. Choose one parent at random. Flip a coin
    2. Read one value from chosen parent and copy to offspring
    3. Test for recombination, if it occurs, start reading from other parent
    4. Go to step 2
    5. Done when length(child) is length(parent)
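
    A minimal sketch of these steps, assuming the mateVectors(mother, father, p) signature used in the test code of section 9.1:

    mateVectors <- function(mother, father, p) {
      parents <- list(mother, father)
      cur <- sample(1:2, 1)                # 1. choose a parent at random
      child <- vector(mode(mother), length(mother))
      for (i in seq_along(mother)) {
        child[i] <- parents[[cur]][i]      # 2. copy one value to the offspring
        if (runif(1) < p) cur <- 3 - cur   # 3. recombination: switch parents
      }
      return(child)                        # 5. child is now full length
    }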
  • Some ideas for testing?

    What must be true?

    1. The allele at each position must be the same as either the mother's or the father's allele at that same position. You could write a test for that.
    2. What should the distribution of parental run lengths be?
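
    Test 1 as code (a sketch, reusing the example parents above):

    mother <- c("A", "B", "C", "D")
    father <- c("a", "b", "c", "d")
    child <- mateVectors(mother, father, 0.5)
    stopifnot(all(child == mother | child == father))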

8.2 Creation and merging

  • What is a data frame?

    Has characteristics of lists and matrices.

    kids <- c("Jack","Jill")
    ages <- c(10,12)
    d <- data.frame(kids, ages, stringsAsFactors=FALSE)
    d
    d[[1]]
    d$kids
    d[,1]
    
      kids ages
    1 Jack   10
    2 Jill   12
    [1] "Jack" "Jill"
    [1] "Jack" "Jill"
    [1] "Jack" "Jill"
    
  • Add to data frame

    rbind and cbind work as with matrices

    d <- rbind(d, list("Laura", 11))
    d
    
       kids ages
    1  Jack   10
    2  Jill   12
    3 Laura   11
    

    See also bind_rows and bind_cols from the dplyr package. bind_rows will concatenate a list of data frames, filling missing columns with NA. I use this most of the time.
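
    For example (a sketch):

    library(dplyr)
    bind_rows(d, data.frame(kids = "Mike"))  # the missing ages value is filled with NA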

  • Merging
    kids <- c("Jack","Joan","Laura")
    states <- c("CA","TX","TX")
    d2 <- data.frame(kids,states)
    merge(d, d2)
    
       kids ages states
    1  Jack   10     CA
    2 Laura   11     TX
    
  • Merging (all, all.x, all.y)
    d3 <- merge(d, d2, all = TRUE)
    names(d3) <- c("name","age","state")
    d3
    
       name age state
    1  Jack  10    CA
    2  Jill  12  <NA>
    3  Joan  NA    TX
    4 Laura  11    TX
    
  • Merging (all.x)
    d4 <- merge(d, d2, all.x = TRUE)
    d4
    
       kids ages states
    1  Jack   10     CA
    2  Jill   12   <NA>
    3 Laura   11     TX
    

8.3 Subsetting

  • Data frames: subsetting

    subset() is more concise than using [. Names are assumed to be columns.

    subset(d3, state=="TX")
    #
    subset(d3, age > 10)
    
       name age state
    3  Joan  NA    TX
    4 Laura  11    TX
       name age state
    2  Jill  12  <NA>
    4 Laura  11    TX
    
  • Data frames: add new columns
    d3$birth.year <- 2012 - d3$age
    d3
    
       name age state birth.year
    1  Jack  10    CA       2002
    2  Jill  12  <NA>       2000
    3  Joan  NA    TX         NA
    4 Laura  11    TX       2001
    
  • More on data frames
    • Read data frame from file

      read.table() and read.csv()

    • Many matrix operations work on dataframes

      Eg rowMeans(), colMeans()

8.4 Factors

  • Grouping variables
    # head(iris)
    class(iris$Species)
    levels(iris$Species)
    
    [1] "factor"
    [1] "setosa"     "versicolor" "virginica"
    
  • Factors vs strings
    • We often store factors as strings in a csv file
    • I suggest always leaving them as strings until factor is needed (use stringsAsFactors=FALSE in read.csv())
    • Factors stored as integers internally
  • Factors vs strings
    v <- c("3", "2", "10")
    factor(v)
    vf <- as.factor(v)
    vf
    as.numeric(vf[3])  ## How did "10" become 1?
    
    [1] 3  2  10
    Levels: 10 2 3
    [1] 3  2  10
    Levels: 10 2 3
    [1] 1
    
  • Factors: reassign levels
    v <- c("Med", "High", "Low", "Low", "Med")
    factor(v)
    vf <- factor(v, levels=c("Low", "Med", "High"))
    vf
    
    [1] Med  High Low  Low  Med 
    Levels: High Low Med
    [1] Med  High Low  Low  Med 
    Levels: Low Med High
    
  • Factors: rename levels
    # vf
    levels(vf) <- c("low", "medium", "high")
    vf
    
    [1] medium high   low    low    medium
    Levels: low medium high
    
  • split()
    split(iris$Sepal.Length,iris$Species)
    
    $setosa
     [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7
    [20] 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9
    [39] 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0
    
    $versicolor
     [1] 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2
    [20] 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3
    [39] 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7
    
    $virginica
     [1] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
    [20] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4
    [39] 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
    
  • tapply()
    tapply(iris$Sepal.Length, iris$Species, sd)
    
        setosa versicolor  virginica 
     0.3524897  0.5161711  0.6358796 
    
  • Subsetting a dataframe
    x <- subset(iris, Species == "virginica")
    head(x)
    
        Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
    101          6.3         3.3          6.0         2.5 virginica
    102          5.8         2.7          5.1         1.9 virginica
    103          7.1         3.0          5.9         2.1 virginica
    104          6.3         2.9          5.6         1.8 virginica
    105          6.5         3.0          5.8         2.2 virginica
    106          7.6         3.0          6.6         2.1 virginica
    
  • by()
    colMeans(x[ , 1:4])
    
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
           6.588        2.974        5.552        2.026
    
    by(iris[ , 1:4],iris$Species,colMeans)
    
    iris$Species: setosa
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
           5.006        3.428        1.462        0.246 
    ------------------------------------------------------------ 
    iris$Species: versicolor
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
           5.936        2.770        4.260        1.326 
    ------------------------------------------------------------ 
    iris$Species: virginica
    Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
           6.588        2.974        5.552        2.026
    
  • Shrimp again: split()
    shrimp <- c("M", "F", "J", "J", "F", "F", "M", "F", "F")
    split(1:length(shrimp), shrimp)
    
    $F
    [1] 2 5 6 8 9
    
    $J
    [1] 3 4
    
    $M
    [1] 1 7
    

    Frequencies:

    sapply(split(1:length(shrimp), shrimp), length)
    # or use table()
    print("using table():")
    table(shrimp)
    
    F J M 
    5 2 2
    [1] "using table():"
    shrimp
    F J M 
    5 2 2
    

8.5 Merging example

  • Data often stored in separate files

    Consider

    oak_wp <-
    read.csv("http://r-research-tool.schwilk.org/lectures/data/oak-water-potentials-simple.csv",
               stringsAsFactors = FALSE)
    species <- read.csv("http://r-research-tool.schwilk.org/lectures/data/oak-species.csv",
               stringsAsFactors=FALSE)
    head(oak_wp, 3)
    head(species, 3)
    
       site tag spcode       date pd.psi md.psi year
    1 GATEN 110   QUEM 10/24/2010     NA     NA   10
    2 GATEN 111   QUEM 10/27/2010     NA     NA   10
    3 GATEN 113   QUEM 10/24/2010     NA     NA   10
        genus specific.epithet   family spcode display.name CM DM GM
    1 Quercus           emoryi Fagaceae   QUEM    Q. emoryi  1  1 NA
    2 Quercus         gambelii Fagaceae   QUGA  Q. gambelii  1  1  1
    3 Quercus         gravesii Fagaceae  QUGR2  Q. gravesii  1  1 NA
    
  • We want to plot data but use real names
    oak_wp <- merge(oak_wp, species)
    head(oak_wp, 2)
    
      spcode  site tag       date pd.psi md.psi year   genus specific.epithet
    1   QUEM GATEN 110 10/24/2010     NA     NA   10 Quercus           emoryi
    2   QUEM GATEN 111 10/27/2010     NA     NA   10 Quercus           emoryi
        family display.name CM DM GM
    1 Fagaceae    Q. emoryi  1  1 NA
    2 Fagaceae    Q. emoryi  1  1 NA
    
  • "Merging" = "joining"

    See left_join, inner_join, etc in dplyr

    We'll come back to this.
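
    A quick taste (a sketch using the small d and d2 data frames from section 8.2):

    library(dplyr)
    left_join(d, d2, by = "kids")  # keeps all rows of d, like merge(d, d2, all.x = TRUE)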

8.6 Data frames and the "tidyverse"

  • R has a history
    • And that means some aspects have been proven effective, and some not (eg bad defaults like drop=TRUE)
    • Hadley Wickham and others have worked on a set of packages that provide a consistent interface. These, too, have a history (eg plyr, reshape2), but the current best choices are readr, dplyr, tidyr, tibble, and ggplot2.
    • You can even load all of these at once with the meta-package: library(tidyverse)
  • dplyr and tibbles
    • You can use base R (subset, merge, etc), but dplyr provides a clean interface for working with data frames
    • In fact, dplyr will convert your data frames to efficient "tibbles"
    • tibbles are just data frames with nicer consistent behavior
  • Some dplyr examples to get you started
    library(readr)
    library(dplyr)
    oak_wp <-
    read_csv("http://r-research-tool.schwilk.org/lectures/data/oak-water-potentials-simple.csv")
    
    head(filter(oak_wp, spcode=="QUEM" & year==13), 2)
    
    Attaching package: ‘dplyr’
    
    The following objects are masked from ‘package:stats’:
    
        filter, lag
    
    The following objects are masked from ‘package:base’:
    
        intersect, setdiff, setequal, union
    Source: local data frame [2 x 7]
    
       site   tag spcode       date pd.psi md.psi  year
      <chr> <chr>  <chr>      <chr>  <dbl>  <dbl> <int>
    1   MDS   238   QUEM 05/23/2013  -0.85  -1.95    13
    2   MDS   237   QUEM 05/23/2013  -0.76  -1.16    13
    
  • mutate()
    nd <- mutate(oak_wp, proj_year = year - 9)
    head(nd, 2)
    
    Source: local data frame [2 x 8]
    
       site   tag spcode       date pd.psi md.psi  year proj_year
      <chr> <chr>  <chr>      <chr>  <dbl>  <dbl> <int>     <dbl>
    1 GATEN   110   QUEM 10/24/2010     NA     NA    10         1
    2 GATEN   111   QUEM 10/27/2010     NA     NA    10         1
    
  • select()
    nd <- select(oak_wp, spcode, date, md.psi)
    tail(nd, 3)
    
    Source: local data frame [3 x 3]
    
      spcode       date md.psi
       <chr>      <chr>  <dbl>
    1   QUHY 06/25/2013  -1.22
    2   QUHY 06/25/2013  -1.87
    3   QUHY 06/25/2013  -1.33
    

9 Math I

9.1 Homework

  • Homework
    • Solutions are included with this week's assignment
    • Remember to cite any sources or collaborators and provide some detail – avoid plagiarism.
    • Please read the style guide. Adhere to it.
    • Avoid literal values – rarely needed. Example on board.
  • Testing example
    ## test if distribution of run lengths looks like a Bernoulli process
    ## (geometric distribution). As length -> infinity this should be true.
    
    # Mean should be 1 / p
    # Variance should be (1 - p) / p^2
    mother <- rep(0,10000)
    father <- rep(1,10000)
    child <- mateVectors(mother, father, 0.05)
    
    runs <- rle(child)
    runlengths <- runs$lengths
    mean(runlengths) # 1 / 0.05 = 20
    var(runlengths)  # (1 - p) / p^2  = 380 when p=0.05
    

9.2 Sorting and sets

  • sort and order
    • sort()
      x <- c(13,5,12,5)
      sort(x)
      
      .Rprofile: Setting TX repository
      [1]  5  5 12 13
      
    • order()
      x <- c(13,5,12,5)
      order(x)
      x[order(x)]
      
      .Rprofile: Setting TX repository
      [1] 2 4 3 1
      [1]  5  5 12 13
      
  • Use order() to sort dataframes
    d1 <- data.frame(sites = c("MDS", "BGN", "MDN", "LCN"),
                     elevation = c(1900, 2300, 1950, 1750))
    d1
    
      sites elevation
    1   MDS      1900
    2   BGN      2300
    3   MDN      1950
    4   LCN      1750
    
  • Use order() to sort dataframes
    y <- order(d1$elevation)
    d1[y, ]
    
      sites elevation
    4   LCN      1750
    1   MDS      1900
    3   MDN      1950
    2   BGN      2300
    
  • Sets in R

    A set is a collection of distinct (unique) objects. In R we simply use a vector.

    numbers <- sample(1:10, 20, replace=TRUE) 
    numbers
    myset <- unique(numbers)
    myset
    
     [1]  6  3  1  8  6  1  2  3  6  1  9  1  4  4  5  8  7 10  6  4
     
    [1]  6  3  1  8  2  9  4  5  7 10
    
  • Set operations
    union(x,y)
    Union of sets x and y
    intersect(x,y)
    Intersection of sets x and y
    setdiff(x,y)
    All elements in x that are not in y
    setequal(x,y)
    Test for set equality
    x %in% y
    Test for membership in set, returns boolean vector
    choose(n,k)
    Number of possible subsets of size k chosen from set of size n
    combn(x, m)
    Find all combinations of size m in x
  • Let's write some functions (one possible sketch below)
    1. Write a function to calculate the symmetric difference of two sets
    2. Define a subset operator
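
    One possible sketch (the names symdiff and %subset% are just illustrative choices):

    # symmetric difference: elements in exactly one of x, y
    symdiff <- function(x, y) {
      union(setdiff(x, y), setdiff(y, x))
    }
    symdiff(c(1, 2, 3), c(2, 3, 4))  # [1] 1 4

    # infix subset operator: TRUE if every element of x is also in y
    "%subset%" <- function(x, y) {
      all(x %in% y)
    }
    c(1, 2) %subset% c(1, 2, 3)      # [1] TRUE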

9.3 Number systems

  • A note on integers vs real numbers

    R has integers and real numbers (the "numeric" class, which has type "double"). I have sometimes referred to "integers" when we have created vectors such as c(1,2,3), but usually one ends up with the numeric class, which stores double-precision floating point numbers. Examples:

    typeof(2)
    typeof(as.integer(2))
    typeof(c(1,2,3))
    typeof(1:3)
    
    [1] "double"
    [1] "integer"
    [1] "double"
    [1] "integer"
    
  • Positional number systems
    • Simply ways to count things. Ours is base-10.
    • There is no symbol "10" in base-10 and no symbol "2" in base-2, etc
    • Each column stands for a power of the base (eg powers of 10)
    • 1,357,896:
      1 3 5 7 8 9 6
      \(10^6\) \(10^5\) \(10^4\) \(10^3\) \(10^2\) \(10^1\) \(10^0\)
  • Binary: base-2
    • A single binary digit is called a bit (Binary digIT).
    • Computers perform operations on binary number groups called "words". Today, most computers use 32-, 64-, or 128-bit words.
    • Words are subdivided into 8-bit groups called bytes.
    • \(255_{10} = 11111111_2\)

      \( 1 \times 2^7 + 1 \times 2^6 + 1 \times 2^5 + 1 \times 2^4 + 1 \times 2^3 + 1 \times 2^2 + 1 \times 2^1 + 1 \times 2^0 \)

      \(= 2^8 - 1 \)

  • Hexadecimal: base-16
    Dec Hex Binary    Dec Hex Binary
    0   0   0000      8   8   1000
    1   1   0001      9   9   1001
    2   2   0010      10  a   1010
    3   3   0011      11  b   1011
    4   4   0100      12  c   1100
    5   5   0101      13  d   1101
    6   6   0110      14  e   1110
    7   7   0111      15  f   1111
  • Hexadecimal example

    hex-example.png

  • Negative numbers

    Negative numbers: signed integers

    • Takes an extra bit of information because the sign doubles the range of values we must be able to represent.
    • Multiple ways to represent (Use first bit for sign, top half of range, or 2's complement)
    • The last way (two's complement) is what is used in computers — but you don't need to worry about the representation.
  • Integer Overflow

    If we have a 1-byte signed (two's complement) integer:

      01011101 :  93
    + 01101011 : 107
    = 11001000 : -56
    • Produces warning in R
      sum(rep(as.integer(100000), 100000))
      
      [1] NA
      Warning message:
      In sum(rep(as.integer(1e+05), 1e+05)) :
        integer overflow - use sum(as.numeric(.))
      

      R uses 32 bit integers for now. See ?integer. There are packages to handle big integers.
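
      The maximum integer is available as .Machine$integer.max; stepping past it with integer arithmetic likewise produces NA with a warning:

      .Machine$integer.max        # [1] 2147483647
      .Machine$integer.max + 1L   # NA, with an integer overflow warning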

9.4 Floating point

  • Real numbers on computers: floating point
    • Problem:

      How to represent real numbers with integers? Binary point?

      101.011:

      1 0 1 . 0 1 1
      \(2^2\) \(2^1\) \(2^0\) . \(2^{-1}\) \(2^{-2}\) \(2^{-3}\)
    • Solution:

      Fix the location of the binary point. Think of scientific notation with fixed precision: \(6.020123 \times 10^{23}\)

      In binary: \(\pm\) X.YYYYYY \(\times 2^{\pm N}\)

  • IEEE single precision: 32 bits, one word

    Simply "float" type in most languages: \(\pm\) 1.YYYYY \(\times 2^{N-127}\)

    • assumes X is always 1 (no leading zeros, trick to get one more bit of precision)
    • 1 sign bit
    • 8 bit biased exponent (N - 127)
    • 23 bit mantissa (YYYYY)
  • "single precision floating point": 32 bits

    IEEE_754_Single_Negative_Zero.jpg

  • Different floating point types

    IEEE Standard for Floating-Point Arithmetic (IEEE 754) (1980s):

    • Single precision, called "float" in the C language family, has a 23 bit mantissa (so 24 bits of precision with assumed 1).
    • Double precision, called "double" in the C language family, occupies 64 bits (8 bytes) and its significand has a precision of 53 bits (including the assumed leading 1). Sign is 1 bit, exponent is 11 bits and mantissa is 52 bits.

    In R there are only 2 numeric types: integer and numeric (double precision). "Single precision" not used much anywhere nowadays.

  • Some issues
    • The double type ("numeric" class) cannot represent all 64-bit integers! But since R uses 32 bit integers, double can store more integers than can the integer type. See ?integer
    • Rounding error: The only numbers that can be represented exactly in R's numeric type are integers and fractions whose denominator is a power of 2. Other numbers have to be rounded to (typically) 53 binary digits accuracy.
  • Rounding error

    64 bits are not enough to store every possible value on the number line!

    1.0 - 0.9 - 0.1
    
    [1] -2.775558e-17
    
  • Floating point error

    As a result, two floating point numbers will not reliably be equal:

    a <- sqrt(2)
    a * a == 2
    a * a - 2
    
    [1] FALSE
    [1] 4.440892e-16
    

    The function all.equal() allows you to compare doubles. It basically tests whether abs(a - b) is smaller than a small tolerance (epsilon) value.
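
    Note that all.equal() returns TRUE when the values are nearly equal, but returns a character string describing the difference otherwise, so wrap it in isTRUE() inside an if statement (here a is sqrt(2) from above):

    if (isTRUE(all.equal(a * a, 2))) print("close enough")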

  • Testing equality of floating point numbers

    Compare all.equal, identical and ==

    all.equal(a*a, 2)  # see documentation for using in if statement
    identical(a*a, 2.0)
    all.equal(as.double(8), as.integer(8))
    identical(as.double(8), as.integer(8))
    
    [1] TRUE
    [1] FALSE
    [1] TRUE
    [1] FALSE
    

10 Math II

10.1 Math functions

  • Built-in functions
    exp()
    Exponential function, base e
    log()
    Natural logarithm
    log10()
    Logarithm base 10
    sqrt()
    Square root
    abs()
    Absolute value
    sin()
    Trig functions: sin(), cos(), tan(), atan(), atan2()
    min(), max()
    Min and max within a vector
    pmin(), pmax()
    Element-wise maxima and minima of several vectors
    sum(), prod()
    Sum and product of elements of vector
    cumsum(), cumprod()
    Cumulative sum or product of all elements in a vector
    round()
    Rounding to integers. See floor(), ceiling()
    factorial()
    Factorial function
  • Minimization and Maximization

    nlm() and optim() both find minima

    f <- function(x) {
      return(x^2 - sin(x) )
    }
    
    d <- data.frame(x = seq(-10, 10, 0.01))
    d$y <- f(d$x)
    
    library(ggplot2)
    qplot(x, y, data = d)
    
  • Plot of x,y

    fig11_1.png

  • Using nlm() (Newton algorithm only)
    nlm(f, 5)
    
    $minimum
    [1] -0.2324656
    
    $estimate
    [1] 0.4501833
    
    $gradient
    [1] 4.501954e-07
    
    $code
    [1] 1
    
    $iterations
    [1] 6
    
  • Using optim()
    optim(5, f, method = "Brent", lower = -100, upper = 100)
    
    $par
    [1] 0.4501836
    
    $value
    [1] -0.2324656
    
    $counts
    function gradient 
          NA       NA 
    
    $convergence
    [1] 0
    
    $message
    NULL
    
  • Symbolic math: differentiation

    There are packages to do symbolic math (and there are other open source alternatives, eg Sage) but base R has D()

    D(expression( x^2 + 2*x ), "x")
    
    2 * x + 2
    
    D(expression( exp(x^2) ), "x")
    
    exp(x^2) * (2 * x)
    
    D(expression(log(2*x) + x^2), "x")
    
    2/(2 * x) + 2 * x
    
  • D() continued
    curve(  x^2 + 2*x ,0,10)
    

    diff-plot1.png

  • D() continued II
    curve(eval(D(expression( x^2 + 2*x ), "x") ), 0, 10)
    

    diff-plot2.png

  • Simple calculus: numeric integration

    integrate()

    f <- function(x) x^2
    integrate(f,0,1)
    
    0.3333333 with absolute error < 3.7e-15
    

10.2 Statistical distributions

  • Distribution functions
    Distribution pdf cdf quantiles rand. numbers
    Uniform dunif() punif() qunif() runif()
    Normal dnorm() pnorm() qnorm() rnorm()
    Chi square dchisq() pchisq() qchisq() rchisq()
    Binomial dbinom() pbinom() qbinom() rbinom()
    Poisson dpois() ppois() qpois() rpois()
    Student's T dt() pt() qt() rt()
    F dist. df() pf() qf() rf()

    and more!

    For continuous distributions, the most useful functions are the "p" (cumulative distribution) and "q" (quantile = inverse cdf) functions.

  • Using distributions

    \(95^{th}\) percentile of the Chi square distribution with 2 degrees of freedom:

    qchisq(0.95,df=2)
    
    [1] 5.991465
    

    \(5^{th}\) and \(95^{th}\) percentiles of the normal distribution with mean zero and sd of one:

    qnorm( c(0.05, 0.95) )
    
    [1] -1.644854  1.644854
    
  • Example: Binomial Distribution

    Probability of exactly 2 heads out of 12 coin flips:

    dbinom(2, size=12, prob=0.5)
    
    [1] 0.01611328
    

    Should be same as probability of 10 heads:

    dbinom(10, size=12, prob=0.5)
    
    [1] 0.01611328
    

    Probability of 10 or fewer heads:

    pbinom(10, size=12, prob=0.5)
    
    [1] 0.9968262
    

10.3 Randomization and simulation

  • Random variates

    Built-in random number generators for many statistical distributions.

    • Find probability of four or five heads out of five coin tosses:
      x <- rbinom(500000,5,0.5)
      mean(x >= 4)
      
      # The exact way:
      1 - pbinom(3, 5, 0.5)
      dbinom(4, 5, 0.5) + dbinom(5, 5, 0.5)
      
      [1] 0.187228
      [1] 0.1875
      [1] 0.1875
      
  • Expectation

    find \( E[max(X,Y)] \) when X,Y are normal random variables mean=0 sd=1

    sum <- 0
    nreps <- 100000
    for (i in 1:nreps) {
      xy <- rnorm(2)
      sum <- sum + max(xy)
    }
    print(sum/nreps)
    
    [1] 0.5663903
    
    # or store in vector:
    v <- replicate(nreps, max(rnorm(2)))
    hist(v)
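    # A fully vectorized alternative to the loop: pmax() takes element-wise
    # maxima of two vectors (the true expectation is 1/sqrt(pi), about 0.564)
    mean(pmax(rnorm(nreps), rnorm(nreps)))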
    
  • Plot the results

    biv-norm-exp.png

  • Setting the random number seed

    Important for debugging!

    set.seed(100)
    mean(rnorm(10000))
    mean(rnorm(10000))
    set.seed(100)
    
    mean(rnorm(10000))
    
    [1] 0.00387737
    [1] -0.01183919
    [1] 0.00387737
    

10.4 Linear Algebra

  • Linear Algebra

    Matrix multiplication:

    a <- matrix(c(1,3,2,4), ncol=2)
    b <- matrix(c(1,0,-1,1), ncol=2)
    a %*% b
    
         [,1] [,2]
    [1,]    1    1
    [2,]    3    1
    
  • Linear algebra

    Solve system of equations:

    \begin{array}{lcl} x + y & = & 4 \\ x - y & = & 2 \end{array}
    a <- matrix(c(1,1,1,-1), ncol=2)
    b <- c(4, 2)
    solve(a, b)
    
    [1] 3 1
    
  • Some linear algebra functions
    crossprod()
    Inner product (dot product) of two vectors: t(x) %*% y. Not the vector cross product.
    t()
    Matrix transpose
    qr()
    QR decomposition
    chol()
    Cholesky decomposition
    det()
    Determinant
    eigen()
    Eigenvalues / eigenvectors
    diag()
    Extracts the diagonal of a square matrix (useful for obtaining variances from a covariance matrix and for constructing a diagonal matrix)
    sweep()
    Sweep a summary statistic out of an array, eg subtracting column means (a more capable and complicated "apply"-type function)
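
    A couple of quick illustrations of these functions (a small sketch):

    m <- matrix(c(2, 1, 1, 2), ncol = 2)
    t(m)      # transpose
    det(m)    # [1] 3
    diag(m)   # the diagonal: [1] 2 2
    solve(m)  # with a single argument, solve() returns the inverse of m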

11 Environment, Scope and Style

11.1 Scope

  • Where do names apply?
    • Environment

      A function consists not only of its arguments and body, but also its environment: the collection of objects present when the function is created.

      Within a function, variables defined at a higher scope are visible, but not writeable with "<-"

  • Nested functions
     w <- 12
     f <- function(y) {
       d <- 8
       h <- function() {
         return(d * (w+y) )
        }
       print(environment(h))
       return(h())
     }
    
    environment(f)
    ls()
    
    <environment: R_GlobalEnv>
    [1] "d"     "f"     "i"     "nreps" "sum"   "v"     "w"     "x"     "xy"
    

    Note: confusingly, we refer to the top environment as .GlobalEnv in code.

  • Scope hierarchy
    • w is global to f() and h(). d is local to f() but global to h().
      f(2)
      h()
      
      <environment: 0x5a36020>
      [1] 112
      Error: could not find function "h"
      
  • Scope hierarchy con.
    f <- function(y, ftn) {
      d <- 8
      print(environment(ftn))
      return(ftn(d, y))
    }
    
    h <- function(dd, yy) {
       return(dd * (w + yy))
    }
    
    w <- 12
    f(3, h)
    
    <environment: R_GlobalEnv>
    [1] 120
    

11.2 Superassignment

  • Scope and assignment

    In general, R functions do not change variables at a higher scope

    x <- 1
    increment.x <- function() {
      print(x)
      x <- x+1
      print(x)
    }
    increment.x()
    x
    
    [1] 1
    [1] 2
    [1] 1
    
  • Super assignment
    increment.x <- function() {
       x <<- x+1
    }
    x
    increment.x()
    x
    
    [1] 1
    [1] 2
    
  • assign()

    Write to upper level variables using character strings for names

    filelist <- list.files(pattern="*.org$")
    #filelist
    for(s in filelist) {
      assign(s, readLines(file(s)) )
    }
    ls()
    
    [1] "filelist"             "r-course-outline.org" "s"                   
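
    The complement of assign() is get(), which retrieves an object using its name given as a character string:

    assign("myvar", 42)
    get("myvar")  # [1] 42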
    

11.3 Programming style

  • Programming style
    • Good style aids maintenance
      • Use comments to tell the readers what they need to know
      • Use indentation to mark the levels of program control
      • Use meaningful names
      • Develop a convention for names (global variables vs other variables vs function names)
      • Use standard idioms where appropriate
      • Avoid unnecessary complexity. Avoid code repetition!
  • Coding style
  • Function Documentation

    For example, Google's style:

11.4 Craft and organization

  • Beyond "style": crafting a program
    • Procedural programming

      is the concept of developing programs based on subtasks (procedures = functions). The subtasks are called and connected by a single parent task that coordinates all of the procedures into a working solution.

  • Modularization

    Modularization is the process programmers use to break up complex programs into smaller, more manageable procedures.

    • Divide and Conquer!
      • Review overall goal and identify subtasks
      • Make each function perform one small task and test it separately.
      • Keep functions small enough to understand
      • If you find yourself repeating code, reevaluate!
  • Constants
    • What is a "constant?"

      An identifier whose associated value cannot (or should not) be altered by the program during its execution

  • Global variables and constants

    It can be useful to put some infrequently changing constants at the top of a script file:

    GRID_SIZE <- 10
    NUM_ITERATIONS <- 10000
    

    Then use these global variables:

    makeGrid <- function(gridSize = GRID_SIZE) {
      g <- matrix(1:(gridSize^2),nrow=gridSize)
      return(g)
    }
    

12 Strings

12.1 "Text" on computers

  • How to represent characters?

    Simple idea: a numeric code for each character. Originated at Bell Labs in 1960.

    • ASCII: American Standard Code for Information Interchange
      • 7 bits (128 symbols)
      • On modern computers, stored in one byte with zero as first bit
      • For example: "A" = 01000001
  • ASCII table

    0–31 used for "unprintable" characters (BEEP! page feed, etc)

    800px-ASCII_Code_Chart.png

  • All was good

    Assuming you were an English speaker.

    • Extended ASCII

      We have 8 bits! What about codes 128-255?

      • IBM-PC: The "OEM" character set: European accented characters and drawing characters.
      • But everyone had a different idea of what to do with the upper codes.
  • ANSI standard
    • Same as ASCII for 0-127
    • But different "code pages" for upper set
    • So everything was crazy but ok

      As long as you spoke a western language.

      In Asia, there was a whole other mess using a mixture of one and two-byte characters.

  • Unicode

    Basic idea: two bytes. Then expanded to more (up to 3 bytes). Enough to store every character in every language.

    • Originally: \(2^{14} = 16,384\) (2 bits reserved)
    • Currently: 110,182 code points
    • Every letter maps to a "code point" — a platonic ideal of a letter.
    • "A" is different from "a", but "A" in Times New Roman is the same as "A" in Arial
    • "Hello" = U+0048 U+0065 U+006C U+006C U+006F
    • Still the issue of exactly how to store the numbers
  • Encodings
    • There were multiple standards for STORING the numbers. The original, UCS-2, uses 2 bytes.
    • Americans were lazy, so:
    • UTF-8

      Idea: use one byte for code points 0-127, and two or more bytes for points above that. UTF-8 is the de facto standard encoding for Unicode in the US.

      You can read in UTF-8 encoded text in R. In fact UTF-8 may be your current default ("locale"):

      Sys.getlocale()
      
      [1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
      
  • Unicode strings
    a <- "\U00B5"  # note that "\" is escape char in R strings
    a
    b <- c("\U03C0", "π")
    b
    
    [1] "µ"
    [1] "π" "π"
    
  • You can use unicode (utf-8) in R scripts

    But may be awkward to enter depending on keyboard and text editor

    π <- 3.14
    r <-2
    2*r*π
    
    [1] 12.56
    

12.2 Strings

  • Characters in R
    • Until 2004, R only used ASCII!
    • R now supports UTF-8 (make sure "locale" is set correctly on your computer) — works a bit differently on Windows vs Linux
    • So you can use Chinese, Russian, etc.
    • Sys.setlocale("LC_CTYPE","chinese")
  • Character objects (strings)

    A character object is used to represent string values in R. We convert objects into character values with the as.character() function:

    x = as.character(pi)
    x
    typeof(x)
    
    [1] "3.14159265358979"
    [1] "character"
    
  • Formatting strings

    using paste():

    first <- "Dylan"
    last <- "Schwilk"
    n <- paste(first, last)
    n
    paste(last, first, sep=", ")
    
    [1] "Dylan Schwilk"
    [1] "Schwilk, Dylan"
    
  • Formatting strings

    using sprintf():

    s <- sprintf("%s has %d dollars", n, 50)
    s
    sprintf("%s has $%2.2f", n, 50)
    
    [1] "Dylan Schwilk has 50 dollars"
    [1] "Dylan Schwilk has $50.00"
    
  • Substrings

    substr() and substr<-()

    substr(s, 7, 13)
    
    [1] "Schwilk"
    

    Assignment:

    substr(s, 7, 13) <- "SCHWILK"
    s
    substr(s, 7, 13) <- "???"
    s
    
    [1] "Dylan SCHWILK has 50 dollars"
    [1] "Dylan ???WILK has 50 dollars"
    
  • Substitution
    sub("???WILK", "S.", s, fixed=TRUE)
    
    [1] "Dylan S. has 50 dollars"
    

12.3 String operations

  • Converting strings
    tolower(s)
    toupper(s)
    
    [1] "dylan ???wilk has 50 dollars"
    [1] "DYLAN ???WILK HAS 50 DOLLARS"
    
  • Built-in string functions
    grep()
    Searches for a substring, like the Linux command of the same name
    nchar()
    Finds the length of a string
    paste()
    Assembles a string from parts
    sprintf()
    Also assembles a string from parts
    substr()
    Extracts a substring
    strsplit()
    Splits a string into substrings
  • Example: testing for a file suffix
    # tests whether the file name fnname has the suffix suffix,
    # e.g. "abc" in "x.abc"
    hasSuffix <- function(fname, suffix) {
      ncf <- nchar(fname)               # nchar() gives the string length
      dotpos <- ncf - nchar(suffix) + 1 # position where the suffix would start
      # now check that stuff is at the end of the name
      return(substr(fname, dotpos, ncf) == suffix)
    }
    
    hasSuffix("file.csv", "csv")
    hasSuffix("file.csv", "txt")
    
    [1] TRUE
    [1] FALSE
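
    An alternative sketch using a regular expression (see the next slide and the regex section below): anchor the suffix at the end of the string with "$" and escape the literal dot (this assumes suffix contains no regex metacharacters):

    hasSuffix2 <- function(fname, suffix) {
      return(grepl(paste0("\\.", suffix, "$"), fname))
    }
    hasSuffix2("file.csv", "csv")  # [1] TRUE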
    
  • grep() and grepl()

    Takes its name from the Linux/Unix command.

    v <- c("beach", "beech", "back", "buster", "abea")
    grep("bust", v)
    
    v[grep("bust", v)]
    
    [1] 4
    [1] "buster"
    
    grepl("bust", v)
    
    [1] FALSE FALSE FALSE  TRUE FALSE
    

12.4 Regular expressions

  • Regular Expressions

    Sometimes we need to look for patterns, not specific strings

    • "regex"
      • Come from regular set theory (1950s)
      • Formal way of specifying a set of string matches
      • Really are a whole mini computer language
      • Used in many command line tools (eg grep), and implemented at least as a library in most computer languages
      • Most text editors and word processors allow regex search and replace
  • regex basics
    • Single character wildcard: "."
    • Single match from multiple options: [a-z] or [acdf-h]. Matches one of the characters in brackets or implied by range
    • [a-zA-Z] should match any ASCII alphabetic letter
  • Positional matching
    • ^ matches beginning of line
    • $ matches end of line
    • \b matches word boundary (space or punctuation, etc)
  • Positional matching example
    x <- unlist(strsplit("Out in the west Texas town of El Paso", split=" "))
    # capitalized words:
    grep("^[A-Z]", x)
    
    [1] 1 5 8 9
    
  • Matching multiples

    Modify preceding character or expression

    • "*" Matches any number of the preceding expression (zero or more!)
    • "?" matches one or none
    • "+" matches one or more
    • {n} matches n times
    • {n,} matches n or more times
    • {n,m} matches n to m times
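
    A small sketch of these quantifiers:

    v <- c("ct", "cat", "caat", "caaat")
    grep("ca*t", v)    # zero or more "a": [1] 1 2 3 4
    grep("ca+t", v)    # one or more:      [1] 2 3 4
    grep("ca{2}t", v)  # exactly two:      [1] 3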
  • Grouping
    • Use parentheses to group
      v <- c("aba", "abba", "aabba", "accc",  "ababa")
      grep("(ab)+a", v)
      
      [1] 1 5
      
    • Boolean OR
      v <- c("beach", "beech", "back", "buster", "abea")
      grep("be(e|a)ch", v)
      
      [1] 1 2
      
  • Regular expressions in R
    • R string functions (and functions in the stringr package) can handle regular expressions. Use fixed=TRUE to avoid regex interpretation and match literally.
    • R includes some predefined character classes (eg [:digit:], [:punct:]). See ?regex
    • In regexes, one must escape special characters with a backslash. But keep in mind that R will first interpret the string before it hands it to the regex engine. So to match a literal "?" use \\?
  • Escaping example
    v <- c("The first", "firstofall", "first time", "Who's on first?", "bfirstb")
    grep("first", v)
    grep("first\\b", v)
    grep("^first\\b", v)
    
    [1] 1 2 3 4 5
    [1] 1 3 4
    [1] 3
    

12.5 stringr Package

  • Why the need?
    • R string functions are inconsistent

      Remedy: Hadley Wickham's stringr library

      library(stringr)
      s <- c("(806) 777-7777", "(401) 555-5555", "(806) 444-4444")
      number_pattern <- "\\(([0-9]{3})\\) ([0-9]{3}-[0-9]{4})"
      str_match(s, number_pattern )
      
           [,1]             [,2]  [,3]      
      [1,] "(806) 777-7777" "806" "777-7777"
      [2,] "(401) 555-5555" "401" "555-5555"
      [3,] "(806) 444-4444" "806" "444-4444"
      
  • Basic string operations with stringr
    str_c()
    like paste, but it uses the empty string as the default separator and silently removes zero length arguments
    str_length()
    is equivalent to nchar, but it preserves NAs and converts factors to characters (not integers)
    str_sub()
    is equivalent to substr() but it returns a zero length vector if any of its inputs are zero length, and otherwise expands each argument to match the longest. It also accepts negative positions, which are calculated from the end of the string. The end position defaults to -1, which corresponds to the last character.
    str_sub<-
    is equivalent to substr<-, but like str_sub it understands negative indices, and replacement strings do not need to be the same length as the string they are replacing
  • str_c()
    str_c("Hello", " ", "world", "!")
    v <- c("A", "B", "C")
    str_c(v,v)
    
    [1] "Hello world!"
    [1] "AA" "BB" "CC"
    
  • str_length()
    v <- factor(c("Aa", "Bb", "C", NA))
    nchar(v)
    str_length(v)
    
    Error in nchar(v) : 'nchar()' requires a character vector
    [1]  2  2  1 NA
    
  • str_sub()
    s <- "What do you do with a drunken sailor"
    str_sub(s, -nchar("sailor"), -1 ) <-  "student"
    s
    
    [1] "What do you do with a drunken student"
    

13 Debugging

13.1 Debugging approaches

  • Debugging

    "Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."

    — Brian W. Kernighan.

  • What is a bug?
    • Code fails (produces error)
    • Code runs but produces the wrong result
  • Debugging

    Differential diagnosis

    • Characterize the error
    • Localize the error
    • Fix the error (and don't introduce a new bug!)
  • Characterizing the bug
    • Make the error reproducible
      • Can we always get the error when re-running the same code and values?
      • If we start the same code in a clean workspace, does the same thing happen?
    • Bound the error
      • How much can we change the inputs and get the same error? A different error?
      • For what inputs (if any) does the bug go away?
      • How big is the error?
    • Get more information

      Add extra output (e.g., number of optimization steps, did the loop converge, final value of optimized function)

  • Localizing the bug

    Tools:

    • traceback (where did an error message come from?); print, warning, stopifnot (messages from the code as it goes)
    • Trying controlled inputs
    • Interactive debugging

13.2 traceback()

  • traceback
    • When an R function fails, an error is printed to the screen. Immediately after the error, you can call traceback() to see in which function the error occurred.
    • The traceback() function prints the list of functions that were called before the error occurred.
    • The functions are printed in reverse order.
  • traceback example
    f <- function(x) {
      r <- x - g(x)
      r
    }
    
    g <- function(y) {
      r <- y * h(y)
      r
    }
    
    h <- function(z) {
      r <- log(z)
      if(r < 10) r^2
      else r^3
    }
    f(-1)
    traceback()
    
  • traceback results
    Error in if (r < 10) r^2 else r^3 (from #3) : missing value where TRUE/FALSE needed
    In addition: Warning message:
    In log(z) : NaNs produced
    3: h(y) at #2
    2: g(x) at #2
    1: f(-1)
    

13.3 Interactive Debugging

  • Interactive debugging

    traceback() does not always tell you where the error originated. In order to know which line causes the error, you may want to step through the function using debug().

  • debug() and browser()
    • debug(foo) flags the function foo() for debugging.
    • undebug(foo) unflags the function.
    • When a function is flagged for debugging, each statement in the function is executed one at a time. After a statement is executed, the function suspends and you can interact with the environment.
    • After you have obtained the necessary information from the environment, you can let the function execute the next statement.
    • You can call browser() from any point in your code to enter the debugger. This is useful if you know about where the bug is and you don't want to start at the beginning of the function.
  • debug example
    ## buggy version:
    mateVectors <- function(x, y, r) {
      child <- y
      usingX <- runif(1) <= 0.5
      for (i in x) {
        if(usingX) {
          child[i] <- x[i]
        }
        if(runif(1) <= r) {
          !usingX
        }
      }
      return(child)
    }
    
  • Trying the buggy version
    set.seed(1001)
    mateVectors(1:10, 11:20, 0.4)
    mateVectors(c("A","B","C"), c("a","b","c"),0.3)
    
     [1] 11 12 13 14 15 16 17 18 19 20
     
                 A   B   C 
    "a" "b" "c"  NA  NA  NA
    
  • Let's debug this function
    • Useful debug commands
      • 'n' next line
      • 'c' Continue to end of context (finish loop, or finish function)
      • 'where' print a stack trace of all active function calls
      • 'Q' quit the debugger
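
    For reference, one possible fix (the version above loops over the values of x rather than its indices, and !usingX on its own never assigns the toggled value):

    mateVectors <- function(x, y, r) {
      child <- y
      usingX <- runif(1) <= 0.5
      for (i in seq_along(x)) {    # iterate over indices, not values
        if (usingX) {
          child[i] <- x[i]
        }
        if (runif(1) <= r) {
          usingX <- !usingX        # actually assign the toggled value
        }
      }
      return(child)
    }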
  • Debugging using RStudio

    RStudio includes a visual debugger. Have you noticed the "environment" window? It shows values for variables in the current environment. When debug is active, this shows the function being debugged.

    • RStudio’s error inspector and traceback()
    • RStudio’s "Rerun with Debug" tool and options(error = browser) which open an interactive session where the error occurred.
    • RStudio’s breakpoints and browser() which open an interactive session at an arbitrary location in the code.

14 The Grammar of Graphics and ggplot I

14.1 Grammar of graphics

  • Happy Ada Lovelace Day
  • Grammar of Graphics

    The Grammar of Graphics. Leland Wilkinson, 1999, 2005. Springer.

    • An abstraction

      To describe, think about, and implement graphical displays of data

  • What is a "plot"?

    A plot is

    • A data set and a set of mappings from variables to aesthetics
    • One or more layers
    • One scale for each aesthetic mapping
    • A coordinate system (we won't worry about this and just use cartesian coordinates!)
    • The faceting specification (how many subplots)
  • What is a scatterplot?

    A simple dataset. Plot A vs C with symbol shape determined by categorical variable, D

    A B C D
    2 3 4 a
    1 2 1 a
    4 5 15 b
    9 10 80 b
  • Map variables to aesthetic space
    • Turn values into coordinates, eg

      \(floor\left( \frac{x - min(x)}{range(x)} \times screen.width \right)\)

    • Transformed:
      x y Shape
      25 11 circle
      0 0 circle
      75 53 square
      200 300 square
  • Next steps
    • Statistical transformation

      Boring this time: the identity function

    • Combine the parts
      • Data (represented by the point geom)
      • Scales and coordinate system (axes and legends)
      • Annotations (such as background and plot title)
  • The parts

    scatterplot-parts.png

  • The plot

    scatterplot-final.png

  • Grammar

    We can produce grammatically correct plots that are nonsensical, just as in English we can produce grammatically correct nonsense

  • Nonsense plot
    ggplot(mpg, aes(displ, hwy)) + geom_line()
    

    ggplot_nonsense.png

  • Learning ggplot2

14.2 Components of a plot

  • Components of a plot
    • A layer:
      • data and aesthetic mappings
      • geometric objects
      • statistical transformations
    • The rest of the plot:
      • scales
      • facet specification
      • the coordinate system
  • Scales, facets, coordinate system
    • We need a scale for each aesthetic in the plot. A scale operates across all data in one plot.
      • A scale is a function and its inverse, along with some parameters
      • Example: a color gradient scale
      • The inverse gives us the axes (for position scales) or the legend (for other scales such as colors or shapes)
    • You've seen facets and we will stick with cartesian coordinates for now.
  • Layers

    composed of five parts:

    • data
    • aesthetic mapping (what is x, y, color, shape, etc). aes() command
    • statistical transformation (stat)
    • geometric object (geom) (geoms have default stats and vice versa)
    • a position adjustment

14.3 Building a plot by layers

  • ggplot() function

    Use the diamonds data set

    p <- ggplot(diamonds, aes(carat, price, color =cut))
    #p
    

    Blank plot

  • Adding layers
    p <- p + layer(geom = "point", stat= "identity", position = "identity")
    p
    

    ggplot-add-layer.png

  • More complicated plot
    p <- ggplot(diamonds, aes(x=carat))
    p <- p + layer(geom = "bar",
                   stat = "bin",
                   position = "identity",
                   params = list(fill = "steelblue", binwidth=1))
    p
    
  • Result:

    ggplot-more-comp.png

  • Shortcuts!

    Every geom is associated with a default statistic and position, and every statistic with a default geom:

    p <- ggplot(diamonds, aes(x=carat))
    p <- p + geom_histogram(binwidth = 1, fill = "steelblue")
    p
    
  • Result:

    ggplot-short-cut.png

  • ggplot book

    http://www.springerlink.com/content/978-0-387-98140-6/

    Tables 4.2 and 4.3 have default stats and geoms

  • Layer shortcuts
    • Shortcut functions
      • geom_XXX(mapping, data, ..., geom, position)
      • stat_XXX(mapping, data, ..., stat, position)
    • Descriptions:
      mapping
      A set of aesthetic mappings using aes()
      data
      A dataset to override the default one
      ...
      Parameters for the geom or the stat
      geom or stat
      You can override defaults
      position
      Method for adjusting overlapping objects
  • Layers are objects
    bestfit <- geom_smooth(method = "lm",
                           color = "steelblue",
                           size = 2,
                           se = FALSE)
    
    qplot(cty, hwy, data = mpg) + bestfit
    
  • Result:

    ggplot-layer-obj.png

14.4 Aesthetics

  • Aesthetic mappings

    Refer to variables in the dataset only! So, back to our mpg data:

    a <- aes(x = displ, y = hwy, color = class)
    p <- ggplot(mpg, a) + geom_point()
    p
    
  • Result:

    ggplot-aes.png

  • Mapping vs setting

    Because R can handle arguments by position or name this can be confusing:

    p <- ggplot(mpg, aes(cty, displ))
    # set point color to dark blue
    p + geom_point(color = "darkblue")
    # map new unnamed variable with value "darkblue" to color
    p + geom_point(aes(color="darkblue"))
    # more useful:
    p + geom_point(aes(color=cyl))
    
  • Individual vs collective geoms
    • Individual: Graphical object for each row (eg points)
    • Collective: statistical summary, polygon geom
    • Lines and smoothed paths somewhere in between
  • Grouping

    We'll use the Oxboys data set in nlme

    library(nlme)
    # ?Oxboys
    # head(Oxboys)
    # class(Oxboys)
    p <- ggplot(Oxboys, aes(age, height, group = Subject)) +
         geom_line()
    p
    
  • Result:

    ggplot-oxboys.png

  • Different groups for different layers
    # try adding lm line for all data
    p + geom_smooth(method="lm", se=FALSE)
    # same as:
    p + geom_smooth(aes(group = Subject),
                    method="lm", se=FALSE)
    # what we want:
    p + geom_smooth(aes(group = NULL),
                    method="lm", se=FALSE)
    
  • Default groups on categorical variables
    p <- ggplot(Oxboys, aes(Occasion, height)) +
           geom_boxplot()
    p + geom_line(aes(group = Subject), color = "#3366FF")
    
  • Result:

    ggplot-groups.png

  • Violin plot:
    ggplot(mpg, aes(class,hwy)) + geom_violin()
    

    ggplot-violin.png

  • Tables of geoms and stats (layer functions)
  • Stats

    Default stat for histogram geom is bin

    ggplot(diamonds, aes(carat)) +
      geom_histogram(aes(y= ..density..), binwidth=0.1)
    ## in qplot:
    qplot(carat, ..density.., data= diamonds,
          geom = "histogram", binwidth = 0.1)
    ## with counts
    ggplot(diamonds, aes(carat)) +
      geom_histogram(aes(y= ..count..), binwidth=0.1)
    
  • Position adjustment
    Adjustment Description
    dodge Adjust position by dodging overlaps on sides
    fill Stack overlapping objects and standardize to equal height
    identity Don't adjust position
    jitter Jitter points to avoid overlap
    stack Stack overlapping objects on top of one another
  • Position examples
    # stack
    p <- ggplot(diamonds, aes(carat))
    p + stat_bin(aes(ymax = ..count.., fill = cut),
                  binwidth=0.2, position = "stack")
    # fill
    p +  stat_bin(aes(ymax = ..count.., fill = cut),
               binwidth=0.2, position = "fill")
    # dodge
    p + stat_bin(aes(ymax = ..count.., fill = cut),
                  binwidth=0.2, position = "dodge")
    # Identity (use line not area)
    p + stat_bin(aes(ymax = ..count.., color = cut),
                 geom = "line",  binwidth=0.2,
                 position = "identity")
    
  • Writing your own layer functions
    stat_sum_single <- function(fun, ...) {
       stat_summary(fun.y = fun, colour="gray15",
                    geom="point", size = 4, ...)
    }
    se_bar <- function(){
      stat_summary(fun.data = mean_se,
                   geom = "errorbar",
                   width = 0.15, size = 1)
    }
    
    p <- ggplot(iris, aes(Species, Petal.Length)) +
          se_bar() + stat_sum_single(mean)
    p
    p + stat_sum_single(median,  size = 3, shape = 5)
    
  • Result:

    ggplot-own-func.png

15 The Grammar of Graphics and ggplot II

15.1 Scales

  • Scales
    • Control how data is mapped to perceptual properties, and produce guides (axes and legends) which allow us to read the plot.
    • Important parameters: name, breaks & labels, limits.
    • Naming scheme: scale_aesthetic_name. The default scales all have the name continuous or discrete (eg scale_x_continuous).
  • Scales

    Position scales, color scales, manual scales, identity scale

    name
    text string, mathematical expressions (see ?plotmath). Use labs(), ylab(), xlab()
    limits
    Domain of the scale. Continuous scales take a vector of length 2, eg c(4,10).
    breaks
    sets the values that appear on the axis or the legend (tick marks).
  • Scales examples
    p <- qplot(cty, hwy, data= mpg, color = displ)
    p
    p + scale_x_continuous("City mpg")
    p + xlab("City mpg")
    p + ylab("Highway mpg")
    p + labs(x="City mpg",
             y = expression(paste("Highway ",frac(miles,gal))),
             color= "Displacement (l)" )
    
  • Result

    ggplot-scales.png

  • Scale transformations

    "transformers":

    • Manage the transformation and the inverse for legends, values
    • Common ones: log, log10, logit, sqrt, exp, reverse
    • shorthand for common ones: scale_x_log10() is equivalent to scale_x_continuous(trans = "log10")
  • More scales examples
    library(scales) # for math_format()
    library(stringr) # for str_c
    p <- qplot(carat, price, data=diamonds)
    p + scale_x_log10() + scale_y_log10()
    
    log10breaks <- seq(-0.8,0.8,.2)
    p + scale_x_log10(breaks = 10^(log10breaks),
                      labels = parse(
                      text=(str_c("10^",log10breaks)))) +
        scale_y_log10()
    
    # natural log
    p + scale_x_continuous(trans = "log",
                           breaks = trans_breaks("log", "exp"),
                           labels = trans_format("log", math_format(e^.x)))
    
  • Result

    ggplot-scales2.png

15.2 Themes

  • ggplot Themes
    p + theme_bw()
    
    theme_nogrid <- theme_bw() + theme(
                    axis.title = element_text(size = 16),
                    axis.text = element_text(size = 13),
                    panel.grid.major = element_blank(),
                    panel.grid.minor = element_blank(),
                    panel.background = element_blank()
      )
    p + theme_nogrid + xlab("Carat") + scale_y_continuous("Price ($)", labels = comma)
    
    # ?theme
    
  • Result:

    ggplot-theme.png

  • Some more resources for ggplot tweaking

15.3 Maps

  • Maps in ggplot

    ggmap by David Kahle and Hadley Wickham: http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

    library(maps)
    library(ggmap)
    library(RColorBrewer)
    
    county_df <- map_data('county')
    tx.counties <- subset(county_df, region == "texas")
    names(tx.counties)[6] <- "county"
    
  • Maps in ggplot: TX population
    tx.pop <- read.csv("http://demographics.texas.gov/Resources/TPEPP/Estimates/2014/2014_txpopest_county.csv", stringsAsFactors=FALSE)
    tx.pop$county <- tolower(tx.pop$county)
    tx.pop <- merge(tx.counties, tx.pop)
    head(tx.pop,3)
    
        county      long      lat group order region FIPS census_2010_count
    1 anderson -95.75271 31.53560  2492 74520  texas    1             58458
    2 anderson -95.76989 31.55852  2492 74521  texas    1             58458
    3 anderson -95.76416 31.58143  2492 74522  texas    1             58458
      july1_2014_pop_est jan1_2015_pop_est num_chg_10_14 num_chg_10_15
    1              58501             58243            43          -215
    2              58501             58243            43          -215
    3              58501             58243            43          -215
      pct_chg_10_14 pct_chg_10_15
    1           0.1          -0.4
    2           0.1          -0.4
    3           0.1          -0.4
    
  • TX population map
    tx.map <- ggplot(tx.pop, aes(x=long, y=lat, group=county))
    tx.map <- tx.map + geom_path(color="gray")
    tx.map <- tx.map + geom_polygon(aes(fill=log10(jan1_2015_pop_est)))
    tx.map
    
  • Result:

    ggplot_map1.png

  • With google maps
    goog.tx <- get_map(location = "Texas", zoom=6)
    tx.map2 <- ggmap(goog.tx)
    tx.map2 <- tx.map2  +  geom_path(aes(x=long, y=lat, group = county),
                                     color="gray", data=tx.pop)
    tx.map2 <- tx.map2 + geom_polygon(aes(x=long, y=lat, group = county,
                                      fill =log10(jan1_2015_pop_est)),
                                      alpha = 0.80, data=tx.pop)
    llabel <- expression(paste("Population (", log[10], ")"))
    tx.map2  + scale_fill_gradientn(llabel, colours=brewer.pal(5,"OrRd"))
    
  • Result:

    ggplot_map2.png

16 Shaping data and dplyr

16.1 Shaping data:

  • dplyr (grammar of data manipulation)
    • summarizing by group
    • transforming by row or by group
    • filtering out specific records
  • tidyr

    For obtaining tidy data (needed before steps above)

16.2 The dplyr package

  • dplyr

    dplyr includes a family of individually simple functions that all have the same structure and all operate on data frames. They employ a bit of magic for convenience that allows non-standard evaluation.

    Five important verbs (functions)

    • select certain columns of data
    • filter your data to select specific rows
    • arrange the rows of your data into an order
    • mutate your data frame to contain new columns
    • summarize chunks of your data
  • data frames vs "tibbles"

    Packages by Hadley Wickham in the "tidyverse" (or "hadleyverse") use "tibbles" instead of R's traditional data.frame. Tibbles are data frames, but they tweak some older behaviors to make life a little easier.

    library(tibble)
    

    but you rarely need to do this because dplyr and other packages will load this for you

  • Structure
  • Some sample data

    The Batting data set (in the Lahman package) contains the batting records for all professional US players. For more information use the command ?Batting

    library(dplyr)
    library(Lahman) # baseball dataset
    names(Batting)
    
     [1] "playerID" "yearID"   "stint"    "teamID"   "lgID"     "G"       
     [7] "AB"       "R"        "H"        "X2B"      "X3B"      "HR"      
    [13] "RBI"      "SB"       "CS"       "BB"       "SO"       "IBB"     
    [19] "HBP"      "SH"       "SF"       "GIDP"
    
  • ID variables

    Factors by which we can group the data. For example, let's get mean stats (games, at bats, runs and hits, columns 6-9) for Babe Ruth across all years he played.

    colMeans( filter(Batting, playerID == "ruthba01")[6:9] )
    
            G        AB         R         H 
    113.77273 381.72727  98.81818 130.59091
    
  • Another ID variable: year

    Let's get mean stats (games, at bats, runs and hits, columns 6-9) across all players for 1871.

    colMeans( filter(Batting, yearID == 1871)[6:9] )
    
           G       AB        R        H 
    19.96522 94.10435 23.12174 26.96522
    
  • Obtain summaries for groups

    Finding summaries for one group is easy, but what if we want to find the same summaries for every group?

    • split, apply, combine.

      Summarize mean number of games, at bats, runs and hits (columns 6-9) for each year across all players.

  • use dplyr
    allyears <- select(Batting, yearID, G, AB, R, H)
    years <- group_by(allyears, yearID)
    yearstats <- summarize_each(years, funs(mean))
    head(yearstats, 3)
    
    Source: local data frame [3 x 5]
    
      yearID        G        AB        R        H
       <int>    <dbl>     <dbl>    <dbl>    <dbl>
    1   1871 19.96522  94.10435 23.12174 26.96522
    2   1872 21.19872 100.50641 21.73077 28.76282
    3   1873 28.82400 135.79200 28.64000 39.38400
    
  • Split-apply-recombine

    When do we need to split-apply-recombine?

    • During data preparation, when performing group-wise ranking, normalization, etc
    • When creating summaries for display or analysis (marginal means, conditioning tables)
    • During modelling, when fitting separate models to subsets of the data
    • How to in other systems/languages?

      Excel pivot tables, "by" argument to SAS procedures, or, in R, the "plyr" package

16.3 Using dplyr functions

  • filter

    How many players in 1871?

    y1871 <- filter(Batting, yearID==1871)
    length(unique(y1871$playerID))
    
    [1] 115
    
  • mutate and summarise functions
    • mutate() allows us to modify an existing data frame; a row of output per row of input
    • summarise() A row of output per group
    • Arguments to mutate() and summarize()

      Function form is mutate(.data, ...) where the ... arguments are named parameters defining new columns

  • Strategy for grouped calculations
    1. Extract a single group
    2. Solve problem for that one group — perhaps write a wrapper function that takes the group and returns the summary or transformation
    3. Use dplyr functions and groupby to solve it for all groups
  • 1-2. Calculate career year for Babe Ruth

    Number of years since player began.

    baberuth <- filter(Batting, playerID == "ruthba01")
    baberuth <- mutate(baberuth, 
                       cyear = yearID - min(yearID) + 1)
    head(baberuth, 5)[,c(2, 6:9, 23)]
    
      yearID  G  AB  R  H cyear
    1   1914  5  10  1  2     1
    2   1915 42  92 16 29     2
    3   1916 67 136 18 37     3
    4   1917 52 123 14 40     4
    5   1918 95 317 50 95     5
    
  • 3. Calculate career year for all players
    byplayer <- group_by(Batting, playerID)
    Batting <- mutate(byplayer, cyear = yearID - min(yearID) + 1)
    Batting <- arrange(Batting, playerID)
    head(Batting)[, c(1, 2, 6:9, 23)]
    
    Source: local data frame [6 x 7]
    Groups: playerID [1]
    
       playerID yearID     G    AB     R     H cyear
          <chr>  <int> <int> <int> <int> <int> <dbl>
    1 aardsda01   2004    11     0     0     0     1
    2 aardsda01   2006    45     2     0     0     3
    3 aardsda01   2007    25     0     0     0     4
    4 aardsda01   2008    47     1     0     0     5
    5 aardsda01   2009    73     0     0     0     6
    6 aardsda01   2010    53     0     0     0     7
    
  • More summaries: Teams per player

    Total number of teams in the data frame

    length(unique(Batting$teamID))
    
    [1] 149
    

    Number of teams for which each player played:

    nteams <- summarise(byplayer,
                    nteams = length(unique(teamID)))
    head(nteams, 5)
    
    Source: local data frame [5 x 2]
    
       playerID nteams
          <chr>  <int>
    1 aardsda01      8
    2 aaronha01      3
    3 aaronto01      2
    4  aasedo01      5
    5  abadan01      3
    
  • Do players improve?
    • Code
      library(ggplot2)
      brplot <- ggplot(baberuth, 
            aes(cyear, RBI/AB)) + 
            geom_point()
      brplot + 
         geom_smooth(method="lm")
      


    • Result

      baseball-baberuth1.png

  • Do players improve? Modeling for one player

    Fitting a regression model relating runs batted in per at-bat (RBI/AB) to career year

    mod.br <- lm(RBI/AB ~ cyear, data=baberuth)
    mod.br
    #summary(mod.br)
    
    
    Call:
    lm(formula = RBI/AB ~ cyear, data = baberuth)
    
    Coefficients:
    (Intercept)        cyear  
       0.203200     0.003413
    
  • Do players improve? Modeling for all

    How to model this for all players?

    lmslope <- function(x,y){
      linmod <- lm(y~x)
      coefs <- coef(linmod)
      return(coefs[2])
     }
    
    bb <- subset(Batting, AB >= 25)
    bb <- mutate(group_by(bb, playerID), 
                 include = length(yearID) > 10)
    bb <- filter(bb, include)
    player_traj <- summarize(bb, slope=lmslope(cyear, RBI/AB))
    
    qplot(slope, data=player_traj, binwidth=0.005)
    
  • Results:

    baseball-allplayers.png

  • Maybe players get better, then worse again
    modq.br <- lm(RBI/AB ~ cyear + I(cyear^2), data=baberuth)
    #mod.br
    summary(modq.br)
    
    Call:
    lm(formula = RBI/AB ~ cyear + I(cyear^2), data = baberuth)
    
    Residuals:
         Min       1Q   Median       3Q      Max 
    -0.10600 -0.01994  0.01035  0.04025  0.06291 
    
    Coefficients:
                  Estimate Std. Error t value Pr(>|t|)   
    (Intercept)  0.1267872  0.0372004   3.408  0.00295 **
    cyear        0.0225163  0.0074510   3.022  0.00701 **
    I(cyear^2)  -0.0008306  0.0003146  -2.640  0.01613 * 
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 0.05295 on 19 degrees of freedom
    Multiple R-squared:  0.3592,	Adjusted R-squared:  0.2917 
    F-statistic: 5.325 on 2 and 19 DF,  p-value: 0.01459
    
  • Quadratic plot
    brplot + stat_smooth(method="lm", formula = y~x+I(x^2))
    

    baseball-baberuth2.png

  • All players, quadratic term
    qtermmodel <- function(df){
      mod <- lm(RBI/AB ~ cyear + I(cyear^2), data = df)
      coefs <- data.frame(t(coef(mod)))
      names(coefs) <- c("int", "slope", "qterm")
      return(coefs)
     }
    
    bcoefs <- do(bb, qtermmodel(.))
    qplot(qterm, data=bcoefs, binwidth=0.0005) + xlim(c(-0.015, 0.015))
    
  • Results: all players, quadratic term

    baseball-allplayers-quad.png

16.4 US baby names data

  • US baby names data

    baby-names-ssa.png

  • Download the data
  • Read data
    bnames <- read.csv("http://r-research-tool.schwilk.org/assignments/top-1000-baby-names.csv",
                       stringsAsFactors = FALSE)
    bnames$sex <-  factor(bnames$sex)
    levels(bnames$sex) <- c("girl", "boy")
    head(bnames, 3)
    
      name  sex year rank  percent
    1 Mary girl 1881    1 7.524496
    2 Anna girl 1881    2 2.934108
    3 Emma girl 1881    3 2.212000
    

16.5 Explore data

  • Name diversity

    Investigate the percentage of all names that the top 1000 names occupy

    bn <- group_by(bnames, year, sex)
    ofall <- summarize(bn, tprop = sum(percent))
    qplot(year, tprop, data = ofall, colour = sex, geom = "line", ylab="Percentage of all names")
    
  • Result:

    baby-names-ofall.png

  • Look at name trend
    a <- aes(year, percent, color = sex)
    p <- ggplot(subset(bnames, name=="Dylan"), a) + geom_line()
    p
    

    baby-names-dylan.png

  • Some helper functions
    # Return letter at position pos
    library(stringr)
    letter <- function(string, pos) {
      return(str_sub(string, pos, pos))
    }
    
    # and some consistent axes
    axes <- list(
      scale_x_continuous(breaks = c(1900, 1920, 1940, 1960, 1980, 2000)),
      scale_y_continuous("percent")
      )
    
  • Trend in first letter
    bn <- group_by(bnames, year,sex, first = letter(name, 1))
    fl <- summarize(bn, percent=sum(percent))
    ggplot(fl, a) + geom_line() + facet_wrap(~ first) + axes
    
  • Result

    baby-names-first-letter.png

  • Trend in last letter
    bn <- group_by(bnames, year,sex, last = letter(name, -1))
    ll <- summarize(bn, percent=sum(percent))
    ggplot(ll, a) + geom_line() + facet_wrap(~ last) + axes
    
  • Result

    baby-names-last-letter.png

  • Wow, rise in names that end in n
    lastn <- group_by(filter(bnames, letter(name, -1)=="n"), year)
    lastn <- summarize(lastn, percent=sum(percent))
    ggplot(lastn, aes(year, percent)) + geom_line() + axes
    

    baby-names-last-letter-n.png


16.6 soundex

  • What about names that sound alike?

    see http://en.wikipedia.org/wiki/Soundex

    soundex <- function(name) {
      name <- toupper(name)
      name <- unlist(strsplit(name,"")) # split into individual chars
      sdx <- name[1]
      dict <- c("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R", "AEIOUHWY")
      codes <- c("1","2","3","4","5","6",".")
        for (i in 2:length(name)) {
          code = codes[grep(name[i], dict, fixed=TRUE)]
           if (code != str_sub(sdx,-1,-1)) {
             sdx <- str_c(sdx, code)
          }
         }
      sdx <- gsub(".", "", sdx, fixed=TRUE)
      sdx <- str_pad(sdx, 3, pad = "0")
      return(sdx)
    }
    
  • Get soundex for each name in data
    unames <- unique(bnames$name)
    sdxs <- sapply(unames, soundex)
    
    bnames$soundex <- sdxs[ match(bnames$name, unames )] 
    bnames$soundend <- str_sub(bnames$soundex,2)
    
  • So what is the deal with names that end in n?
    sdylan <- subset(bnames, soundex==soundex("Dylan"))
    unique(sdylan$name)
    
    [1] "Delma"   "Dillon"  "Delano"  "Delwin"  "Dylan"   "Delaney" "Dillan" 
    [8] "Dillion" "Dylon"   "Dyllan"  "Dallin"  "Dilan"   "Daylen"
    
  • Plot sound alikes
    sdy <- group_by(sdylan, sex, year)
    sdylan.sum <- summarize(sdy, percent = sum(percent))
    ggplot(sdylan.sum, a)  + geom_line() + axes
    
  • Result

    baby-names-soundex-dylan.png

  • Maybe all soundexs that end in "45" + "n"?
    sy <- group_by(bnames, year, sex)
    s45 <- summarize(filter(sy, soundend == "45", letter(name, -1) == "n"),
                     percent = sum(percent))
    ggplot(s45, a) + geom_line() + axes
    
  • Result

    baby-names-soundex-45.png

17 Tidy Data

17.1 Tidy data

  • First an introduction to the pipe

    The %>% operator comes from the magrittr package (required by dplyr and re-exported). It is a bit like the | pipe on the unix command line: it passes the left-hand object as the first argument of the right-hand function.

    library(dplyr)
    
    mtcars %>% filter(mpg > 30)
    
       mpg cyl disp  hp drat    wt  qsec vs am gear carb
    1 32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
    2 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
    3 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
    4 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
    
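    The piped call above is just a rewriting of the ordinary call, with
    mtcars slotted in as the first argument:

    # exactly equivalent, without the pipe:
    filter(mtcars, mpg > 30)
    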
  • Causes of messiness
    • What is tidy data?
      • A step along the road to clean data
      • Data that is easy to model, visualise and aggregate (i.e. works well with lm, ggplot, and ddply)
    • Tidy:
      • Each variable must have its own column.
      • Each observation must have its own row.
      • Each value must have its own cell.
  • Example of messy data
    • There are three variables in this data set
               Pregnant   Not Pregnant
      male            0              5
      female          1              4
    • Tidy:
      Sex      Pregnant   N
      male     Yes        0
      male     No         5
      female   Yes        1
      female   No         4
  • Data storage in "long format"
    Storage        Meaning
    Table / File   Data set
    Rows           Observations
    Columns        Variables
  • Common causes of messiness
    • column headers are values, not variable names
    • multiple variables are stored in one column
    • variables are stored in both rows and columns
    • multiple types of experimental unit stored in the same table
    • one type of experimental unit stored in multiple tables
  • Some tools
    library(tidyr)
    ?gather
    ?separate
    ?spread
    library(stringr)
    ?str_replace
    ?str_sub
    ?str_split_fixed
    library(dplyr)
    ?arrange
    

17.2 Long vs wide data

  • You've seen long vs wide data formats
    date    species  psi.pd  psi.md  RH.pd  RH.md
    110810  JUDE2      -2.0    -4.5     88     50
    110810  QUGR       -1.0    -2.2     90     40
    110810  PICE       -0.4    -1.7     91     33

    Useful for regressing psi.md on psi.pd, but not useful if we want to facet on time of day. So is this tidy? It depends on whether we think of psi.pd and psi.md as separate variables or not.

  • Long format
    date    species  time  psi   RH
    110810  JUDE2    pd    -2.0  88
    110810  JUDE2    md    -4.5  50
    110810  QUGR     pd    -1.0  90
    110810  QUGR     md    -2.2  40
    110810  PICE     pd    -0.4  91
    110810  PICE     md    -1.7  33
  • Conceptual framework
    • Identifier (id) variables identify the unit that measurements take place on. ID variables are usually discrete, and are typically fixed by design.
    • Measured variables represent what is measured on that unit.
  • Further step

    In truly "long" format there are only id variables, a variable column, and a value column!

    date    species  time  variable  value
    110810  JUDE2    pd    psi        -2.0
    110810  JUDE2    md    psi        -4.5
    110810  QUGR     pd    psi        -1.0
    110810  QUGR     md    psi        -2.2
    110810  PICE     pd    psi        -0.4
    110810  PICE     md    psi        -1.7
    110810  JUDE2    pd    RH         88
    110810  JUDE2    md    RH         50
    110810  QUGR     pd    RH         90
    110810  QUGR     md    RH         40
    110810  PICE     pd    RH         91
    110810  PICE     md    RH         33
  • Problem: Column headers are values, not variable names
    raw <- read.delim("http://r-research-tool.schwilk.org/lectures/data/pew.txt",
                      check.names = FALSE, stringsAsFactors = FALSE)
    head(raw)
    
                religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k
    1           Agnostic    27      34      60      81      76     137      122
    2            Atheist    12      27      37      52      35      70       73
    3           Buddhist    27      21      30      34      33      58       62
    4           Catholic   418     617     732     670     638    1116      949
    5 Don’t know/refused    15      14      15      11      10      35       21
    6   Evangelical Prot   575     869    1064     982     881    1486      949
      $100-150k >150k Don't know/refused
    1       109    84                 96
    2        59    74                 76
    3        39    53                 54
    4       792   633               1489
    5        17    18                116
    6       723   414               1529
    
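    Looking ahead to the tidyr tools introduced in the next section, a
    sketch of one way to tidy this (the column names income and n are
    my own choices):

    library(tidyr)
    tidy <- raw %>% gather(income, n, -religion)
    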
  • Problem: Multiple variables in one column

    The WHO tuberculosis data:

    library(stringr)
    raw <- read.csv("http://r-research-tool.schwilk.org/lectures/data/tb-simple-2011.csv", stringsAsFactors = FALSE)
    raw %>% select(1:7) %>% tail()
    
         iso2 year new_sp new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524
    6801   ZW 2006  12718         NA          NA         215          736
    6802   ZW 2007  10583          6         132         138          500
    6803   ZW 2008   9830         NA          NA         127          614
    6804   ZW 2009  10195         NA          NA         125          578
    6805   ZW 2010  11654         20         130         150          710
    6806   ZW 2011  12596         30         122         152          784
    
    ## get rid of the "overall count" for now
    raw$new_sp <- NULL
    
    ## rename rest of columns for clarity
    names(raw) <- str_replace(names(raw), "new_sp_", "")
    
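    A sketch of the remaining steps (the column names sexage, cases,
    sex and age are my own choices): gather the sex-and-age columns,
    then split strings like "m014" apart:

    library(tidyr)
    tb.long <- raw %>% gather(sexage, cases, -iso2, -year, na.rm = TRUE)
    tb.long$sex <- str_sub(tb.long$sexage, 1, 1) # first character: m or f
    tb.long$age <- str_sub(tb.long$sexage, 2)    # the rest: age class
    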

17.3 Reshaping with tidyr

  • Variables in rows and columns

    Example data, photosynthesis and respiration in two species

    library(tidyr)
    tab1 <- read.csv("http://r-research-tool.schwilk.org/lectures/data/carex.csv")
    head(tab1)
    
      Species Treatment   A.1    A.2    A.3    A.4    R.1    R.2    R.3    R.4
    1      Pa      0.15  8.36  7.632  5.522  6.300 0.7358 0.5870 0.6236 0.4354
    2      Pa      3.00 11.06 11.540  7.914  7.914     NA     NA     NA     NA
    3      Pa     15.00 22.32 18.640 17.200 18.400 1.1680 0.9600 1.5460     NA
    4      Cs      0.15 15.60 10.240 13.100 10.480 1.4840 0.7682 0.6236 1.3060
    5      Cs      3.00 13.74 15.240 13.920 10.500     NA     NA     NA     NA
    6      Cs     15.00  9.37 18.760 15.600 10.660 1.0360 1.0640 0.5276 0.8990
    

    But A.1, A.2, etc. all represent measurements of A (photosynthesis); the replicate number is buried in the column name

  • Gather data (wide to long format)
    tab.tidier <- tab1 %>% gather(measurement, value, -Species, -Treatment)
    head(tab.tidier)
    
      Species Treatment measurement value
    1      Pa      0.15         A.1  8.36
    2      Pa      3.00         A.1 11.06
    3      Pa     15.00         A.1 22.32
    4      Cs      0.15         A.1 15.60
    5      Cs      3.00         A.1 13.74
    6      Cs     15.00         A.1  9.37
    
  • Separate replicate id from measurement type
    tab.tidy <- tab.tidier %>% separate(measurement, into = c("measurement", "rep"), 
                                                     sep = "\\.") 
    head(tab.tidy)
    
      Species Treatment measurement rep value
    1      Pa      0.15           A   1  8.36
    2      Pa      3.00           A   1 11.06
    3      Pa     15.00           A   1 22.32
    4      Cs      0.15           A   1 15.60
    5      Cs      3.00           A   1 13.74
    6      Cs     15.00           A   1  9.37
    

    In more complicated cases you may need the stringr package and regular expressions to pull apart column data. See also the convert argument to separate().

  • Spreading data (long to wide)

    Why? Some analyses want each measurement type in its own column (eg to regress R on A, as in the spread result below).

    tab.spread <- tab.tidy %>% spread(measurement, value)
    head(tab.spread)
    
      Species Treatment rep     A      R
    1      Cs      0.15   1 15.60 1.4840
    2      Cs      0.15   2 10.24 0.7682
    3      Cs      0.15   3 13.10 0.6236
    4      Cs      0.15   4 10.48 1.3060
    5      Cs      3.00   1 13.74     NA
    6      Cs      3.00   2 15.24     NA
    
  • What to do with NAs?
    • In long data, some NAs can be implicit (missing rows), but when spread to a wider form we need to make them explicit since there is a cell to fill.
    • In other cases, an NA in one form can imply a zero in another aggregate form (eg oak cover on my Chisos transects?!).

18 Dates and Times

18.1 How to represent time

  • Time is tricky to represent

    Time is a deceptively complicated system to work with. Do we want it to behave like a number line, or according to complicated conventions?

    • Possible time objects
      • instants (specific moment)
      • intervals (specific timespans with a start and stop)
      • durations (length of timespan between two instants)
      • periods (time span in units larger than a second that can be of variable length)
  • Instants
    • An instant refers to a specific moment of time.
    • Instants are a combination of years, months, days, hours, minutes, seconds, and a timezone, e.g., 2011-03-28 08:44:32 CST
    • There are two main time data types in R:
  • POSIXct

    The POSIXct data type is the number of seconds since the start of January 1, 1970. Negative numbers represent the number of seconds before this time, and positive numbers represent the number of seconds afterwards.

  • POSIXlt

    The POSIXlt data type is a list, and the entries in the list have the following meanings:

    • seconds
    • minutes
    • hours
    • day of month (1-31)
    • month of the year (0-11)
    • years since 1900
    • day of the week (0-6 where 0 represents Sunday)
    • day of the year (0-365)
    • Daylight savings indicator (positive if it is daylight savings)
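
    A minimal illustration of the two types (the values follow from the
    definitions above):

    t.ct <- as.POSIXct("1970-01-01 00:00:10", tz = "UTC")
    as.numeric(t.ct) # 10: seconds since 1970-01-01 00:00:00 UTC
    
    t.lt <- as.POSIXlt(t.ct)
    t.lt$sec  # 10
    t.lt$year # 70: years since 1900
    t.lt$mon  # 0: January (months run 0-11)
    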

18.2 lubridate

  • lubridate package

    More info at http://www.jstatsoft.org/v40/i03/

    library(lubridate)
    
    #  lubridate turns strings into instants with
    #  functions that have y, m, and d in their
    #  names. Use the function whose name matches
    #  the order of the elements in the date
    ymd("2011-03-28")
    mdy("03/28/2011")
    dmy("28032011")
    ymd_hms("2011-03-28 08:50:30", tz="CST")
    
    #order matters
    dmy("05-03-2011")
    mdy("05-03-2011")
    
  • What type of objects do these return?
    # what is the object type (class)?
    class(mdy("05-03-2011"))
    
    [1] "Date"
    
  • Example use
    # example:
    soil <- read.csv("http://r-research-tool.schwilk.org/lectures/data/soil-sensors.csv",
                      stringsAsFactors = FALSE,
                      na.strings = c("NA", "#N/A", ""))
    names(soil) <- c("time", "p1_psi", "p2_VWC",
                     "p3_VWC", "p4_psi")
    head(soil)
    class(soil$time)
    
                      time p1_psi p2_VWC p3_VWC p4_psi
    1 24 May 2012 12:00 PM    -36  0.230  0.338    -23
    2 24 May 2012 12:30 PM    -36  0.229  0.338    -23
    3 24 May 2012 01:00 PM    -36  0.226  0.337    -23
    4 24 May 2012 01:30 PM    -37  0.224  0.337    -23
    5 24 May 2012 02:00 PM    -37  0.222  0.336    -23
    6 24 May 2012 02:30 PM    -38  0.220  0.335    -24
    [1] "character"
    
  • Your turn

    Convert time column into date-time objects
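
    One possible solution, mirroring the dmy_hm() call used in the
    aggregation example below (assigned to a new column so later
    examples still see the original character column):

    soil$dtime <- dmy_hm(soil$time)
    class(soil$dtime) # "POSIXct" "POSIXt"
    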

  • Manipulating instants
    now() > ymd("2014-11-10")
    today() > ymd("2011-03-29")  # date object
    # oops
    floor_date(now(), "day") > ymd("2011-03-29",             
                                   tz="CST")
    
    [1] TRUE
    [1] TRUE
    [1] TRUE
    
  • Rounding times
    # round
    floor_date(now(), "month")
    ceiling_date(now(), "month")
    round_date(now(), "month")
    
    # extract
    year(now())
    
    [1] "2016-12-01 CST"
    [1] "2017-01-01 CST"
    [1] "2016-12-01 CST"
    [1] 2016
    
  • Accessing components
    Date component   Accessor
    Year             year()
    Month            month()
    Week             week()
    Day of year      yday()
    Day of month     mday()
    Day of week      wday()
    Hour             hour()
    Minute           minute()
    Second           second()
    Time Zone        tz()
  • Manipulating instants
    now()
    year(now())
    hour(now())
    day(now())
    yday(now())
    
    [1] "2016-12-01 12:04:47 CST"
    [1] 2016
    [1] 12
    [1] 1
    [1] 336
    
  • Instants, continued
    wday(now())
    wday(now(), label = TRUE)
    wday(now(), label = TRUE, abbr = FALSE)
    month(now(), label = TRUE, abbr = FALSE)
    
    [1] 5
    [1] Thurs
    Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
    [1] Thursday
    7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
    [1] December
    12 Levels: January < February < March < April < May < June < ... < December
    
  • Upon which day of the week were you born?
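
    A sketch using the accessors above (the 1973-12-10 birthday comes
    from a later example; substitute your own):

    wday(ymd("1973-12-10"), label = TRUE)
    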
  • Changing components
    # Accessor functions can also be used to change the
    # elements of a date-time
    ti <- now()
    year(ti) <- 1999
    hour(ti) <- 23
    day(ti) <- 45
    tz(ti) <- "UTC"
    ti
    
    [1] "2000-01-14 23:04:47 UTC"
    
  • Aggregate
    ## using dplyr:
    library(dplyr)
    library(ggplot2)
    soil$date <- floor_date(dmy_hm(soil$time), "day")
    psi.day <- soil %>% group_by(date) %>% 
                summarize(avepsi = mean(p1_psi, na.rm=TRUE))
    ggplot(psi.day, aes(date, avepsi)) + geom_line()
    
  • Result

    date-aggregate.png

18.3 Time Spans

  • Time spans
    #?dseconds
    dseconds(140)
    dhours(32)
    ddays(2)
    class(ddays(1))
    
    [1] "140s (~2.33 minutes)"
    [1] "115200s (~1.33 days)"
    [1] "172800s (~2 days)"
    [1] "Duration"
    attr(,"package")
    [1] "lubridate"
    
  • Let's use R to set our alarm clock
    ymd_hms("2014-01-01 08:30:00", tz = "") +
            ddays(0:30)
    
    # cool
    
     [1] "2014-01-01 08:30:00 CST" "2014-01-02 08:30:00 CST"
     [3] "2014-01-03 08:30:00 CST" "2014-01-04 08:30:00 CST"
     [5] "2014-01-05 08:30:00 CST" "2014-01-06 08:30:00 CST"
     [7] "2014-01-07 08:30:00 CST" "2014-01-08 08:30:00 CST"
     [9] "2014-01-09 08:30:00 CST" "2014-01-10 08:30:00 CST"
    [11] "2014-01-11 08:30:00 CST" "2014-01-12 08:30:00 CST"
    [13] "2014-01-13 08:30:00 CST" "2014-01-14 08:30:00 CST"
    [15] "2014-01-15 08:30:00 CST" "2014-01-16 08:30:00 CST"
    [17] "2014-01-17 08:30:00 CST" "2014-01-18 08:30:00 CST"
    [19] "2014-01-19 08:30:00 CST" "2014-01-20 08:30:00 CST"
    [21] "2014-01-21 08:30:00 CST" "2014-01-22 08:30:00 CST"
    [23] "2014-01-23 08:30:00 CST" "2014-01-24 08:30:00 CST"
    [25] "2014-01-25 08:30:00 CST" "2014-01-26 08:30:00 CST"
    [27] "2014-01-27 08:30:00 CST" "2014-01-28 08:30:00 CST"
    [29] "2014-01-29 08:30:00 CST" "2014-01-30 08:30:00 CST"
    [31] "2014-01-31 08:30:00 CST"
    
  • A problem:
    ## what about:
    ymd_hms("2014-03-01 08:30:00", tz = "") +
            ddays(0:30)
    
     [1] "2014-03-01 08:30:00 CST" "2014-03-02 08:30:00 CST"
     [3] "2014-03-03 08:30:00 CST" "2014-03-04 08:30:00 CST"
     [5] "2014-03-05 08:30:00 CST" "2014-03-06 08:30:00 CST"
     [7] "2014-03-07 08:30:00 CST" "2014-03-08 08:30:00 CST"
     [9] "2014-03-09 09:30:00 CDT" "2014-03-10 09:30:00 CDT"
    [11] "2014-03-11 09:30:00 CDT" "2014-03-12 09:30:00 CDT"
    [13] "2014-03-13 09:30:00 CDT" "2014-03-14 09:30:00 CDT"
    [15] "2014-03-15 09:30:00 CDT" "2014-03-16 09:30:00 CDT"
    [17] "2014-03-17 09:30:00 CDT" "2014-03-18 09:30:00 CDT"
    [19] "2014-03-19 09:30:00 CDT" "2014-03-20 09:30:00 CDT"
    [21] "2014-03-21 09:30:00 CDT" "2014-03-22 09:30:00 CDT"
    [23] "2014-03-23 09:30:00 CDT" "2014-03-24 09:30:00 CDT"
    [25] "2014-03-25 09:30:00 CDT" "2014-03-26 09:30:00 CDT"
    [27] "2014-03-27 09:30:00 CDT" "2014-03-28 09:30:00 CDT"
    [29] "2014-03-29 09:30:00 CDT" "2014-03-30 09:30:00 CDT"
    [31] "2014-03-31 09:30:00 CDT"
    
  • What went wrong?
  • Clock time and the number line

    clock-time.png

  • 3 types of time spans

    We can still do math on the time line, as long as we’re specific about how we want to measure time (see the sketch after this list).

    • durations (exact times)
    • periods (clock times)
    • intervals (specific times)
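
    For the alarm-clock problem above, a period is the fix: adding
    days() instead of ddays() keeps the clock time at 08:30 across the
    DST change (a minimal sketch):

    ymd_hms("2014-03-01 08:30:00", tz = "") + days(0:30)
    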
  • Durations

    Durations measure the exact number of seconds that pass between two time points.

    duration.png

  • Durations
    # Use d + the plural of a unit of time to create
    # a duration.
    dminutes(5)
    dhours(278)
    #dmonths(4) # why doesn't this work?
    
    [1] "300s (~5 minutes)"
    [1] "1000800s (~1.65 weeks)"
    
  • Periods

    Periods measure time spans in units larger than seconds. Periods pay no attention to how many sub-units occur during the unit of measurement.

    period.png

  • Periods
    # Use the plural of a time unit to create a period.
    minutes(5)
    hours(278)
    months(4) # months are not a problem
    years(1:10)
    
    [1] "5M 0S"
    [1] "278H 0M 0S"
    [1] "4m 0d 0H 0M 0S"
     [1] "1y 0m 0d 0H 0M 0S"  "2y 0m 0d 0H 0M 0S"  "3y 0m 0d 0H 0M 0S" 
     [4] "4y 0m 0d 0H 0M 0S"  "5y 0m 0d 0H 0M 0S"  "6y 0m 0d 0H 0M 0S" 
     [7] "7y 0m 0d 0H 0M 0S"  "8y 0m 0d 0H 0M 0S"  "9y 0m 0d 0H 0M 0S" 
    [10] "10y 0m 0d 0H 0M 0S"
    
  • Why use periods?

    No surprises.

    2010-11-01 00:00:00 + months(1) will always equal 2010-12-01 00:00:00, no matter how many leap seconds, leap days, or changes in DST occur in between.
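
    A minimal demonstration:

    ymd("2010-11-01") + months(1) # [1] "2010-12-01"
    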

  • Intervals

    Intervals measure a time span by recording its endpoints. Since we know when the time span occurs, we can calculate the lengths of all the units involved.

    interval.png

  • Intervals examples
    # To create an interval, use the special %--%
    # operator:
    int <- ymd("2010-01-01") %--% ymd("2009-01-01")
    # Access and set the endpoints with int_start() and
    # int_end().
    int_start(int)
    int_end(int) <- ymd("2010-03-14")
    # Intervals can be accurately converted to either
    # periods or durations with as.period() and
    # as.duration()
    
    [1] "2010-01-01 UTC"
    

18.4 Math with date-times

  • Math with date-times
    birthday <- ymd("1973-12-10")
    life <- birthday %--% now()
    life / ddays(1)
    life / days(1)
    life %/% days(1)
    
    [1] 15697.75
    [1] 15697.75
    Note: method with signature ‘Timespan#Timespan’ chosen for function ‘%/%’,
     target signature ‘Interval#Period’.
     "Interval#ANY", "ANY#Period" would also be valid
    [1] 15697
    
  • Your turn

    Calculate your age in minutes. In hours. In days. How old is someone who was born on March 6th, 1942?
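
    A sketch using the interval pattern from the previous slide (the
    birthday is the example one; substitute your own):

    birthday <- ymd("1973-12-10")
    life <- birthday %--% now()
    life / dminutes(1) # age in minutes
    life / dhours(1)   # age in hours
    life / ddays(1)    # age in days
    (ymd("1942-03-06") %--% now()) / dyears(1) # age in years, born 1942-03-06
    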

19 Statistical Models

19.1 Statistical models

  • Definition
    Data
    A collection of variables, each variable being a vector of measurements of some specific trait/characteristic
    Problem
    How does variable \(Y\) depend upon variables \(X_1, X_2 \dots X_n\)?
    Model
    The model represents the mathematical relationship between \(X\) and \(Y\). It aims to represent the true relationship.
  • First steps in modeling
    • Identify the response variable
    • Identify the explanatory variables
    • Are the explanatory variables continuous, categorical or a mixture of both?
    • What is the response variable? Continuous, a count or proportion, category, or time-at-death?
  • Explanatory variables
    Continuous
    Regression
    Categorical
    Analysis of Variance (ANOVA)
    Mixture
    Analysis of covariance (ANCOVA)
  • Response variables
    Continuous
    Normal regression, ANOVA, ANCOVA
    Proportion
    Logistic regression (and variants for counts, binary data)
    Time-at-death
    Survival analysis

19.2 Model formulas in R

  • Model formulas
    • A model formula is a sequence of R variables and symbols that specifies the variables in the model and their relationships
    • The formula is a necessary argument to many model-fitting functions in R
    • Although it resembles a mathematical formula, the symbols in the "equation" have a different meaning
  • Model formula notation

    A model formula in R can look like this:

    y ~ A + B + A:B
    

    the ~ means: "is modeled as a function of"

  • Example: Linear regression

    Setup:

    library(ggplot2)
    x <- runif(1000, 0, 10)
    y <- 1.5*x + rnorm(1000, 0, 2) + 0.01
    d <- data.frame(x=x, y=y)
    qplot(x, y, data=d)
    

    figmodels1.png

  • Example: Linear regression
    fit <- lm(y ~ x)
    fit
    
    Call:
    lm(formula = y ~ x)
    
    Coefficients:
    (Intercept)            x  
        0.01578      1.50354
    
  • Extracting information from a fitted object
    typeof(fit)
    class(fit)
    names(fit)
    coef(fit)
    
    [1] "list"
    [1] "lm"
     
    [1] "coefficients"  "residuals"     "effects"       "rank"         
     [5] "fitted.values" "assign"        "qr"            "df.residual"  
     [9] "xlevels"       "call"          "terms"         "model"
    (Intercept)           x 
     0.01577804  1.50353900
    
  • Residuals
    head(residuals(fit))
    
              1           2           3           4           5           6 
    -2.95736864  0.02948475 -0.95593011 -1.22266288 -1.48669100 -0.85117376
    
  • Intercepts

    But wait! R estimated two coefficients and we only had one explanatory variable.

    • R assumes that you want to include an intercept term (our call was the same as y ~ 1 + x)
    • You can exclude it by adding a 0 (or -1) to the formula. But you almost never want to do that.
  • Example of no intercept
    fitnoi <- lm(y ~ 0 + x)
    summary(fitnoi)
    
    Call:
    lm(formula = y ~ 0 + x)
    
    Residuals:
       Min     1Q Median     3Q    Max 
    -5.222 -1.393 -0.084  1.317  6.927 
    
    Coefficients:
      Estimate Std. Error t value Pr(>|t|)    
    x  1.50590    0.01069   140.8   <2e-16 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 1.946 on 999 degrees of freedom
    Multiple R-squared:  0.952,	Adjusted R-squared:  0.952 
    F-statistic: 1.983e+04 on 1 and 999 DF,  p-value: < 2.2e-16
    
  • Let's build some models
    # names(mtcars)
    ggplot(mtcars, aes(disp, mpg)) + geom_point()
    ggplot(mtcars, aes(wt, mpg)) + geom_point()
    ggplot(mtcars, aes(disp, wt)) + geom_point()
    fuel.mod <- lm(mpg ~ disp + wt + wt:disp, data=mtcars)
    summary(fuel.mod)
    
  • Model results
    Call:
    lm(formula = mpg ~ disp + wt + wt:disp, data = mtcars)
    
    Residuals:
       Min     1Q Median     3Q    Max 
    -3.267 -1.677 -0.836  1.351  5.017 
    
    Coefficients:
                 Estimate Std. Error t value Pr(>|t|)    
    (Intercept) 44.081998   3.123063  14.115 2.96e-14 ***
    disp        -0.056358   0.013239  -4.257  0.00021 ***
    wt          -6.495680   1.313383  -4.946 3.22e-05 ***
    disp:wt      0.011705   0.003255   3.596  0.00123 ** 
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
    Residual standard error: 2.455 on 28 degrees of freedom
    Multiple R-squared:  0.8501,	Adjusted R-squared:  0.8341 
    F-statistic: 52.95 on 3 and 28 DF,  p-value: 1.158e-11
    
  • anova() Test the significance of terms
    anova(fuel.mod)
    
    Analysis of Variance Table
    
    Response: mpg
              Df Sum Sq Mean Sq F value   Pr(>F)    
    disp       1 808.89  808.89 134.217  3.4e-12 ***
    wt         1  70.48   70.48  11.694 0.001942 ** 
    disp:wt    1  77.93   77.93  12.931 0.001227 ** 
    Residuals 28 168.75    6.03                     
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
  • Linear models: summary() vs anova()
    • Ben Bolker:

      The p-values you get for summary() of each individual model are the p-values for the effects of each of the parameters in each model, conditional on all the other parameters in that model. If your data are perfectly balanced (which is unlikely in a regression design), you should get the same answers from summary() and anova(), but otherwise the results from anova are generally preferable.

  • anova() to compare models
    fuel.mod2 <- lm(mpg ~ disp + wt, data=mtcars)
    fuel.mod3 <- lm(mpg ~ disp, data=mtcars)
    anova(fuel.mod3, fuel.mod2, fuel.mod)
    
    Analysis of Variance Table
    
    Model 1: mpg ~ disp
    Model 2: mpg ~ disp + wt
    Model 3: mpg ~ disp + wt + wt:disp
      Res.Df    RSS Df Sum of Sq      F   Pr(>F)   
    1     30 317.16                                
    2     29 246.68  1    70.476 11.694 0.001942 **
    3     28 168.75  1    77.934 12.931 0.001227 **
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
  • Symbols in formulas
    +     separate effects in a formula
    :     interaction (A:B is the interaction of A and B)
    *     main effects plus interactions
    ^     crossed effects
    %in%  nested within
    /     nested within
    |     conditional on (rarer)

    A*B is equivalent to A + B + A:B
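
    For example, the fuel model fit earlier could have been written with *:

    fuel.mod.star <- lm(mpg ~ disp * wt, data = mtcars)
    # same fit as lm(mpg ~ disp + wt + wt:disp, data = mtcars)
    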

  • lm() works for categorical variables too
    mtcars$cylnum <- factor(mtcars$cyl)
    levels(mtcars$cylnum)
    fuel.aov <- lm(mpg ~ cylnum, data=mtcars)
    anova(fuel.aov)
    
    [1] "4" "6" "8"
    Analysis of Variance Table
    
    Response: mpg
              Df Sum Sq Mean Sq F value    Pr(>F)    
    cylnum     2 824.78  412.39  39.697 4.979e-09 ***
    Residuals 29 301.26   10.39                      
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    

19.3 Beyond the linear model

  • Beyond the linear model
    GLM
    Generalized linear models allow non-normal error distributions in the response variable via a link function
    Mixed effects models
    relax the assumption of independence; model nested observations and random effects
    GLS
    Generalized least squares relaxes the assumption of homogeneity by modeling the heterogeneity
    GLMM and GAMM
    combine the first and second, above.
  • Generalized Linear Models (glm)

    Consist of:

    • Error distributions

      The distribution of the response around its mean. For ANOVA and regression this is the normal distribution; for logistic regression it is the binomial

    • Link function

      The function \(g\) that shows how the linear function of the explanatory variables is related to the expected value of the response:

      \(g(\mu) = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n\)

    • Variance function

      This captures how the variance of the response depends upon the mean

  • Example logistic regression
    # simulate mortality as a Bernoulli trial with P(death) = fire.sev
    trees <- data.frame(fire.sev = seq(0.01, 1, 0.005))
    fire.mort <- runif(length(trees$fire.sev)) < trees$fire.sev
    trees$mortality <- ifelse(fire.mort, 1, 0)
    qplot(fire.sev, mortality, data = trees)
    
  • Not a linear relationship!

    fig_log_reg.png

  • Logistic regression
    trees.glm <- glm(mortality ~ fire.sev, data = trees, family = binomial())
    trees.glm
    # summary(trees.glm)
    
    Call:  glm(formula = mortality ~ fire.sev, family = binomial(), data = trees)
    
    Coefficients:
    (Intercept)     fire.sev  
         -2.323        4.627  
    
    Degrees of Freedom: 198 Total (i.e. Null);  197 Residual
    Null Deviance:	    275.9 
    Residual Deviance: 212.7 	AIC: 216.7
    
  • Prediction
    trees$pred.mort <- predict(trees.glm, type="response")
    qplot(fire.sev, mortality, data=trees) + geom_line(aes(y =pred.mort) )
    

    fig_log_pred.png

19.4 Mixed models

  • Example: Mixed model
    fish <- read.csv("http://r-research-tool.schwilk.org/lectures/data/fish-data.csv")
    head(fish)
    
      aquarium nutrient color.score
    1        a      low         5.3
    2        a      low         4.7
    3        a      low         5.0
    4        a      low         4.3
    5        a      low         4.8
    6        b      low         4.8
    
  • Look at these data

    Is each fish by treatment row independent?

  • Mixed model using nlme
    library(nlme)
    fit1.lme <- lme(color.score ~ 1, random = ~ 1 | aquarium,
                    method="ML", data=fish)
    fit2.lme <- lme(color.score ~ nutrient, random = ~ 1 | aquarium,
                     method="ML", data=fish)
    
    anova(fit1.lme, fit2.lme)
    
     
            Model df      AIC      BIC    logLik   Test  L.Ratio p-value
    fit1.lme     1  3 139.7463 144.8129 -66.87315                        
    fit2.lme     2  4 122.6974 129.4529 -57.34872 1 vs 2 19.04886  <.0001
    
  • BEWARE! Those p values are wrong.

    Beware of p-values in multi-level models. Consider model selection (AIC or BIC, but careful there as well). Consider Markov chain Monte Carlo methods (MCMC) for confidence intervals.

    See https://seriousstats.wordpress.com/tag/mcmc-methods/

    See package "pbkrtest" for Kenward-Roger approximation

  • Using lme4
    library(lme4)
    library(pbkrtest)
    fit1.lmer <- lmer(color.score ~ 1 + (1 | aquarium),
                      data=fish)
    fit2.lmer <- lmer(color.score ~ nutrient + (1 | aquarium),
                     data=fish)
    KRmodcomp(fit1.lmer, fit2.lmer)
    
    Loading required package: Matrix
    
    Attaching package: ‘Matrix’
    
    The following object is masked from ‘package:tidyr’:
    
        expand
    
    
    Attaching package: ‘lme4’
    
    The following object is masked from ‘package:nlme’:
    
        lmList
    F-test with Kenward-Roger approximation; computing time: 0.10 sec.
    large : color.score ~ nutrient + (1 | aquarium)
    small : color.score ~ 1 + (1 | aquarium)
            stat    ndf    ddf F.scaling   p.value    
    Ftest 54.912  1.000  6.000         1 0.0003104 ***
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
  • Mixed models in R
    • Pinheiro, J. C. & Bates, D. Mixed-Effects Models in S and S-Plus Springer-Verlag, New York, 2000.
    • Zuur, A.F., Ieno, E.N., Walker, N., Saveliev, A.A., and Smith, G.M. Mixed Effects Models and Extensions in Ecology with R. Springer. 2009.

20 Perception and Color

20.1 Graphics and color use

  • Homeworks

    I apologize for my mistakes on the webpage. In light of that, I have been very lenient in grading HW09. But I have some comments:

    1. When I ask you to "tell me something interesting and show your work," that means explore the data. DO SOME WORK. I am giving you flexibility and choices for a reason.
    2. My questions are brief so that you can use them as jumping-off points. This is grad school, not undergrad.
    3. Review the naming conventions. I sort and rename your submissions. Some of you are sloppy with case and with given vs surname order. Minor issue, but please be exact.
  • Exploratory vs communication graphics
    • Exploratory graphics

      Are for you (not others). Need to be able to create rapidly because your first attempt will never be the most revealing. Iteration is crucial for developing the best display of your data.

    • Communication graphics

      When you communicate your findings, you need to spend a lot of time polishing your graphics to eliminate distractions and focus on the story. Iteration is crucial to ensure all the small stuff works well: labels, color choices, tick marks, etc.

20.2 Perception

  • Which represents the larger value?

    perc1.jpg

  • Which represents the larger value?

    perc2.jpg

  • Which represents the larger value?

    perc3.jpg

  • Which represents the larger value?

    perc4.jpg

  • Which represents the larger value?

    perc5.jpg

  • Which represents the larger value?

    perc6.jpg

  • Which represents the larger value?

    perc7.jpg

  • Which represents the larger value?

    perc8.jpg

20.3 Color

  • What are the components of color?

    RGB:

    RGB_color_solid_cube.png

  • RGB triplets
    Notation            RGB triplet
    Arithmetic          (1.0, 0.0, 0.0)
    8-bit per channel   (255, 0, 0)
    8-bit as hex        #FF0000
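
    In R, the rgb() function converts between these notations:

    rgb(1, 0, 0)                        # arithmetic in, "#FF0000" out
    rgb(255, 0, 0, maxColorValue = 255) # 8-bit in, "#FF0000" out
    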
  • RGB

    rgb-example.png

  • Other color models?
    • Easy linear transformations of RGB:
      • RGB
      • HSV: Hue, Saturation, Value
      • HSL: Hue, Saturation, Lightness
    • Computationally more intensive:
      • LAB color spaces, eg CIE L*a*b* (CIELAB)
      • HCL : Hue, Chroma, Luminance
  • Luminance / Lightness

    cspace-compare.png

  • HCL

    HCL-dots.png

  • Benefit
    • Perceptually uniform
    • Hue is unordered. Use evenly spaced hues with equal chroma and luminance to make aesthetically pleasing discrete palettes
    • Chroma and luminance are ordered. Easy to make perceptually uniform gradients by varying either (or both).
  • Color blindness
    • 7-10% of men are red-green color "blind."
    • Solutions: avoid red-green contrasts; use redundant mappings; test.
    • Color Oracle: http://colororacle.org/

20.4 Colors in R

  • Colors in R

    The default in ggplot2 is an HCL color scale; one can use scales from other packages (eg RColorBrewer) as well.

    • Continuous scales

      The default continuous scale in ggplot2 is a blue gradient, from low = #132B43 to high = #56B1F7 which can be reproduced as:

      library(ggplot2)
      scale_color_gradient(low = "#132B43", high = "#56B1F7", space = "Lab")
      
    • Discrete scales

      The default discrete scale in ggplot2 is a range of hues from hcl,

      scale_color_hue(h = c(0, 360) + 15, c = 100,
                      l = 65, h.start = 0, direction = 1)
      
  • Discrete color scales
    p <- qplot(Petal.Length, Sepal.Length, data = iris, color = Species)
    p
    p + scale_color_hue(l = 40, c = 30) # change luminance and chroma
    p + scale_color_hue(h = c(90, 200), l = 50, c = 50) # restrict hue range as well
    p + scale_color_hue(h = c(90, 200), l = 50, c = 50, direction = -1)
    
  • Results:

    color-hue.png

  • Continuous scales
    p <- ggplot(mpg, aes(hwy, cty, color = displ)) +
         geom_point(size = 3.5, alpha = 0.6, position = "jitter")
    p
    p +  scale_colour_gradient(low = "tan", high = "red")
    
  • Results:

    color-gradient.png

  • Other color palettes

    Cynthia Brewer's palettes for discrete scales http://colorbrewer2.org/

    These are optimized for maps

    library(RColorBrewer)
    display.brewer.all()
    
    p <- qplot(carat, price, color = cut, data=diamonds)
    p + scale_color_brewer(palette = "Blues")
    
    
    p <- ggplot(mpg, aes(reorder(class, -cty), cty,
                         color = cut(displ, breaks = 4))) +
         geom_boxplot(position = "dodge")
    p + scale_color_brewer(palette="Dark2")
    
  • Result:

    color-brewer.png

20.5 Perception and comparisons

  • Some comparisons are easier than others

    cleveland1.png

  • Some comparisons are easier than others

    cleveland2.png

  • Some comparisons are easier than others

    cleveland3.png

  • Some comparisons are easier than others

    cleveland4.png

  • Some comparisons are easier than others

    cleveland5.png

  • Some comparisons are easier than others

    cleveland6.png

  • We perceive relative difference
    • slide 1

      reldiff1.png

    • slide 2

      reldiff2.png

  • It is easier to make nearby comparisons
    • slide 1

      diamond-price1.png

    • slide 2

      diamond-price2.png

  • Beware of visual illusions

    vis-illusion.png

  • Dynamite plots
  • My recommendations

    For comparing univariate responses across discrete groups, my recommendations by data set size:

    • small to moderate: dotplots, jittered dotplots (a special challenge are small, discrete data sets, where there are lots of repeated values)
    • moderate: boxplots
    • moderate to large: violin plots, density strips, beanplots, box-percentile plots etc. etc.

    See http://www.jstatsoft.org/v28/c01/paper

21 Markup languages

21.1 Why plain text?

  • Our main goal as scientists

    To make our research as reproducible and visible as possible

    • This entails:
      • Sharing of code
      • Sharing of data (if possible and not proprietary nor privacy sensitive)
      • Sharing of output (presentation, article, website)
  • The power of plain text
    • Ubiquitous
    • Usually small in size
    • Portable across platforms (and versions)
    • It is scriptable (both as input and output).
      • Code is always in text format.
      • Data is often in text format.
      • Underlying format for output (presentation, website, tables, articles, books) can be text as well.
  • Manipulation of text

    Most output is based on simple text files. There are two main final formats: HTML and PDF. One is pageless; the other is centered on the idea of the printed page.

    • This formatting requires markup languages
      • HTML and CSS
      • LaTeX
      • Markdown

      And others (ReST, org-mode, etc)

21.2 Markup

  • What is markup?
    • Even with unicode, plain text is, well, plain
    • How do we produce pretty formatted output?
    • With markup languages

      HTML (+CSS), LaTeX, Markdown, org-mode, RMarkdown, ReStructuredText.

21.3 LaTeX

  • LaTeX
    • A set of macros around TeX, a markup language invented by Donald Knuth
    • A document preparation system and document markup language.
    • De facto standard in academic publishing. Its formula notation is even used in HTML pages (e.g., Wikipedia)
    • But …

      Notation is cumbersome. It is a "heavyweight" markup language, geared towards printed (PDF) output where pages matter (eg printed journal articles)

  • A LaTeX example
    \documentclass[12pt]{article}
    \usepackage{booktabs}
    \begin{document}
    \section*{My Paper}
    I just discovered that:
    \begin{equation} e=mc^2 \end{equation}
    
    \begin{table}[h] \centering
    \caption{My table}
    \begin{tabular}{lrc}
    \toprule
    Left & right & center \\
    \midrule
    data & 0.01 & 2.3 \\
    \bottomrule
    \end{tabular}
    \end{table}
    \end{document}
    
  • It isn't as bad as it seems

    latex-example-emacs.png

  • Ready for submission

    latex-paper-output.png

21.4 Markdown

  • Still, LaTeX is cumbersome

    Alternatives?

    • Markdown
    • Purposes
      • LaTeX is for paper — PDF
      • Markdown is for HTML (blogs, wikipedia) — but sneakily uses some LaTeX when needed
      • But with pandoc (http://johnmacfarlane.net/pandoc/), Markdown is becoming the language of choice among scientists
      • You can even export from markdown to MS Word to keep a collaborator happy.
  • Markdown example
    ## My Paper ##
    
    I just discovered that:
    
    $$e=mc^2$$
    
    ### My Table ###
    
    | Left | right | center |
    |------|-------|--------|
    | data | 0.01  | 2.3    |
    
  • Which renders as

    markdown-simple-example.png

  • Text markup
    *italic*, **bold**
    
    unordered list:
    - item 1
    - item 2
    - item 3
    
    Ordered list:
    1. item 1
    2. item 2
    3. item 3
    
  • Headings
    Heading1
    ========
    
    # Heading1 #
    
    Heading2
    --------
    
    ## Heading 2 ##
    
    ### Heading 3 ###
    
  • Markdown links
    Links:
    
    http://daringfireball.net/projects/markdown/syntax
    [Markdown syntax](http://daringfireball.net/projects/markdown/syntax)
    [Markdown syntax][mkdn]
    
    [mkdn]: http://daringfireball.net/projects/markdown/syntax
    
  • Markdown math

    Just use LaTeX math!

    $$\frac{n!}{k!(n-k)!} = \binom{n}{k}$$
    
    • Which renders as:
  • Markdown math

    Inline equations just need a single $

    In ecology, "true" species diversity is:
    $D_q = \frac{1}{\sqrt[q-1]{\sum_{i=1}^R p_i p_i^{q-1}}}$
    
    • Which renders as

21.5 Managing references

  • BibTeX

    A reference database format

    • Very versatile and very powerful (most other markup languages work with bibtex as well)
    • Free managers, such as BibDesk or Mendeley, are now ubiquitous. See also JabRef. You do not need or want Endnote!
  • Bibtex entry
    @article{Schwilk+Ackerly-2001,
      title={Flammability and serotiny as strategies:
             correlated evolution in pines},
      author={Schwilk, Dylan W and Ackerly, David D},
      journal={Oikos},
      volume={94},
      number={2},
      pages={326--336},
      year={2001},
      publisher={Wiley Online Library}
    }
    
  • In JabRef
  • LaTeX and BibTeX example
    \section{Introduction}
    Fire has been a powerful disturbance on the global
    landscape for hundreds of thousands of years,
    promoting traits in plants which confer an
    advantage in the presence of fire
    \cite{Keeley+Pausas+etal-2011,He+Pausas+etal-2012}.
    Understanding variation in fire response
    strategies both across and within fire regimes is
    a major goal of plant fire ecology
    \cite{Bond+vanWilgen-1996, Bellingham-2000,
    Schwilk+Ackerly-2001, Keeley+Pausas+etal-2011}.
    
  • Markdown and BibTeX example

    Requires pandoc markdown extensions.

    # Introduction #
    Fire has been a powerful disturbance on the global
    landscape for hundreds of thousands of years,
    promoting traits in plants which confer an
    advantage in the presence of fire
    [@Keeley+Pausas+etal-2011; @He+Pausas+etal-2012].
    Understanding variation in fire response
    strategies both across and within fire regimes is
    a major goal of plant fire ecology
    [@Bond+vanWilgen-1996; @Bellingham-2000;
    @Schwilk+Ackerly-2001; @Keeley+Pausas+etal-2011].
    

21.6 Which markup to use?

  • Which do I use and when:
    Scientific papers
    LaTeX + BibTeX. Outlining in org-mode first. I'm starting to use Markdown for some papers (easy conversion to MS Word for collaborators).
    Presentations
    LaTeX + Beamer when I need to control details of format, spacing, placement. Org-mode with export via LaTeX + Beamer otherwise. The source for this lecture is plain org-mode (similar to Markdown)
    Websites
    Markdown with html and css templates (jekyll, eg http://www.schwilk.org/) or org-mode export (eg http://r-research-tool.schwilk.org/)
    Other
    org-mode or markdown (documentation, notes, protocols, literate programming)
  • Pandoc

    So how do we glue everything together and produce wonderful htmls and pdfs out of thin air?

    With pandoc: http://johnmacfarlane.net/pandoc/

    • pandoc
      Pandoc can convert from
      markdown, LaTeX, HTML, DocBook, Org-mode, and … Microsoft Word .docx
      To
      HTML formats (including HTML5 slides), PDF via LaTeX, MS Word (support somewhat limited) and OpenOffice formats, various markup formats, and much more

21.7 Scientific workflow

22 Version Control

22.1 Setup

  • The Bash Shell

    Bash is a commonly-used shell that gives you the power to do simple tasks more quickly.

    • Windows

      Install Git for Windows by downloading and running the installer. This will provide you with both Git and Bash in the Git Bash program.

    • Mac OS X

      The default shell in all versions of Mac OS X is bash, so no need to install anything. You access bash from the Terminal (found in /Applications/Utilities). You may want to keep Terminal in your dock for this workshop.

    • Linux

      The default shell is usually bash, but if your machine is set up differently you can run it by opening a terminal and typing bash. There is no need to install anything.

  • Git
    • Windows

      Git should be installed on your computer as part of your Bash install (described above).

    • Mac OS X

      For OS X 10.9 and higher, install Git for Mac by downloading and running the installer. After installing Git, there will not be anything in your /Applications, as Git is a command line program.

      For older versions of OS X (10.5-10.8) use the most recent available installer for your OS. Use the Leopard installer for 10.5 and the Snow Leopard installer for 10.6-10.8.

  • Additional information

    See my reproducible research workshop:

    http://rr-workshop.schwilk.org/

22.2 The shell

  • The shell
    • What is the shell?

      The shell is a command line interface which allows you to control your computer.

      Tasks are accomplished by entering commands with a keyboard (or by running a plain text script) instead of controlling a graphical user interface.

      The most popular shell is Bash ("Bourne Again SHell")

    • What is a terminal?

      A terminal is a program you run that gives you access to the shell. There are many different terminal programs that vary across operating systems.

  • Using the shell

    Open a terminal (bash)

    ls
    echo "hello"
    echo "hello" > newfile.txt
    cat newfile.txt
    
  • Full vs relative paths
    pwd
    cd ~
    pwd
    
  • Shell command options and arguments
    ls -l
    ls -al
    man ls
    

22.3 What is version control?

  • Why use version control?
    • You already use some sort of version control
      • File naming schemes (eg my-file-July18-2013.doc) or by copying folders around
      • Simple but error-prone
      • Does not help with branching, collaboration
    • A version control system (VCS) allows you to:
      • revert files back to a previous state
      • revert the entire project back to a previous state
      • review changes made over time
      • see who last modified something that might be causing a problem, who introduced an issue and when
  • Local version control systems (VCS)
  • Centralized VCS

    central-vc.png

  • Distributed VCS

    distributed-vc.png

22.4 git

  • Configuring
    git config --global user.name "Dylan Schwilk"
    git config --global user.email "dylan@schwilk.org"
    
  • Getting a "repo" (repository)
    cd ~
    mkdir newrepo
    cd newrepo
    git init
    touch README.md
    ls .git -al
    
  • What happened?

    Hidden directory called .git

    .
    ├── .git
    │   ├── branches
    │   ├── config
    │   ├── description
    │   ├── HEAD
    │   ├── hooks
    │   │   ├── applypatch-msg.sample
    │   │   ├── commit-msg.sample
    │   │   ├── post-update.sample
    │   │   ├── pre-applypatch.sample
    │   │   ├── pre-commit.sample
    │   │   ├── prepare-commit-msg.sample
    │   │   ├── pre-push.sample
    │   │   ├── pre-rebase.sample
    │   │   └── update.sample
    │   ├── info
    │   │   └── exclude
    │   ├── objects
    │   │   ├── info
    │   │   └── pack
    │   └── refs
    │       ├── heads
    │       └── tags
    └── README.md
    
  • Checking status
    git status
    
    On branch master
    
    Initial commit
    
    Untracked files:
      (use "git add <file>..." to include in what will be committed)
    
            README.md
    
    nothing added to commit but untracked files present (use "git add" to track)
    
  • git commands
    • git states

      git-states.png

    • git commands
      • git init
      • git add
      • git status
      • git log
      • git commit
  • staging files
    git add .
    git status
    
    On branch master
    
    Initial commit
    
    Changes to be committed:
      (use "git rm --cached <file>..." to unstage)
    
            new file:   README.md
    
  • Committing changes
    git commit -m "initial commit"
    git status
    
    On branch master
    nothing to commit, working directory clean
    
  • git commit

    A commit is a snapshot taken from the index (staging area) not from the working directory!

  • What exists now?

    git-filesystem2.png

  • Clone repo
    git clone https://github.com/schwilklab/protocols.git
    cd protocols
    ls
    
    ./  ../   html/  lab-org/  methods/  README.md  safety/
    
  • A basic workflow
    • Edit files
    • Stage the changes (git add)
    • Review the changes (git diff)
    • Commit the changes (git commit)
  • The git repository

    git-repo-diagram.png

  • git log
    git log --pretty=oneline --abbrev-commit -10
    
    d2ec17b Reduce markup lecture
    e4d405f Add folder for tangled session files
    bb2d29e Add shell (Bash) examples and some data
    261af83 Clean up birdd files
    e289d7e Minor edit
    f2820d3 Scatterplot fig in birdd example
    75a4be6 Add slides on Bash shell
    66c9a7f Add etherpad link for workshop
    e23b559 Suggest getting github.com account
    7e6a6df Add time and date for workshop
    
  • git diff
    git diff
    difference between working directory and index (staging area)
    git diff --staged
    difference between index and last commit

    With right options, diff can show changes between any two commits, any two files, etc.

  • git diff example

    Try:

    git diff HEAD~1
    
  • Diffs on GitHub

    github-diff-example.png

  • Practice
    • edit file
    • stage
    • commit

22.5 Working with branches

  • Branches

    A branch represents an independent line of development.

  • Show and create branches
    • Show
      git branch
      
      * master
        new-feature
      
    • Create
      git branch <newbranch>
      
    • Delete
      git branch -D <newbranch>
      
  • Using branches

    git-branch-diag.png

  • Switch branches
    git checkout <existing-branch>
    
  • Merge

    Merging is Git's way of putting a branched history back together again.

    • Note:

      This merges into the current branch

      git merge <branch>
      
  • Fast-forward merge

    git-ff-merge.png

  • Three-way merge

    git-merge-3w.png

  • Example fast-foward merge
    # Start a new feature
    git checkout -b new-feature master
    
    # Edit some files
    git add <file>
    git commit -m "Start a feature"
    
    # Edit some files
    git add <file>
    git commit -m "Finish a feature"
    
    # Merge in the new-feature branch
    git checkout master
    git merge new-feature
    git branch -d new-feature
    
  • Merge vs rebase

22.6 Working with remotes

  • What is a "remote"
    • a version of the repository hosted somewhere on the internet.
    • Remotes have a url (https or ssh), but they can also have names.
    • It is common to have only one remote, and by convention it is named "origin"
    • Why remotes

      Allow collaboration!

  • GitHub

    We use GitHub for hosting, so most of your local repos will be associated with a remote named "origin" and a url something like git@github.com:schwilklab/CSC-sky-island-forest.git

  • Working with remotes

    When you clone from a remote, git remembers and calls that remote "origin". You can look at .git/config or, better, use git commands to view it:

    git remote -v
    
    origin  git@github.com:schwilklab/R-research-tool.git (fetch)
    origin  git@github.com:schwilklab/R-research-tool.git (push)
    
  • Adding a remote

    If you need to add a remote to an existing repository:

    git remote add rem2 git@github.com:dschwilk/software-tools-course.git
    
  • Fetching and pulling
    • fetch
      git fetch origin
      git fetch
      

      This pulls the data to your local repository but does not merge it. It just means you have a local copy of the changes in the remote so that you can merge them.

    • pull
      git pull
      

      Fetches data and then merges with your local repo.

  • git push
    git push <remote> <branch name>
    eg: git push origin master
    or just: git push
    

    This works if no one else has pushed changes to that remote since you last pulled. If there is an error, you will need to pull (fetch and merge) first.

  • git and GitHub

    idea of "forking"

22.7 Advanced uses

  • stage only portions of change
    git add -p
    

    Interactively add changed "hunks".

23 Packages for Statistics

23.1 RStudio Project

  • RStudio project
    • Fork and clone or just clone: https://github.com/schwilklab/BBNP-transects
    • eg clone: git clone https://github.com/schwilklab/BBNP-transects.git
    • Enable version control in RStudio
    • Make RStudio project
    • Start rmarkdown file as a report
  • Questions
    • Were oaks or pines harder hit by 2011 drought? Short vs long term?
    • Did the 2011 drought influence species diversity?
    • Did the drought effects differ by elevation?

23.2 Statistical Modeling

  • Stats models
    • GLM (Generalized linear models). For modeling distributions with link and variance functions (eg binomial = logistic regression)
    • GAM (Generalized Additive Models). gam or mgcv packages.
    • GLS (Generalized least squares models). These allow for heterogeneity. nlme library (Pinheiro and Bates)
    • Mixed (nested) models: lme() in the nlme package or lmer() in lme4. Use lme4 (see function lmer()) when you can.
    • GLMM (Generalized Linear Mixed Models). Multiple packages, still cutting edge: glmmPQL() from the MASS package, glmer() from the lme4 package, glmmML from the glmmML package
  • Bayesian models
    • rstan (R interface to the Stan language)
    • brms (Use lmer-style formulae that are automatically translated to Stan via rstan)
  • Multi model inference (AIC) and model averaging

    MuMIn

  • Bioinformatics and genomics
  • smatr package

    Standardized major axis regression. Useful for comparing allometries.

    http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2011.00153.x/abstract

23.3 Packages useful in Ecology

23.4 Phylogenies, comparative methods

23.5 Spatial statistics

23.6 Open science

24 Extra: Plain text and Emacs

24.1 Working with plain text

  • Why you should love plain text
    • Portability. Linux, Windows, Mac, Android, etc. Standard.
    • Easy to use. Don't let a word processor get in your way when you just need to write
    • Keep content separate from formatting
    • Computers love plain text. Programs are written in plain text. Use one powerful text editing tool for everything you do.
    • Version control: VC systems work best on text files (eg git)
  • What you want in a text editor
    • Syntax highlighting
    • Braces, parenthesis matching, automatic indenting
    • Ability to work with interpreters, shells
    • Spell checking

    And some other things you did not know you wanted: TODOs and outlining modes, exporting to HTML or PDF, web browsing, email …

  • Editors: free as in speech and beer
    • OS agnostic:
      • Emacs My favorite. Runs on everything, super powerful.
      • Vim The "other editor". Also powerful and customizable
      • atom. By the github folks: https://atom.io/. New kid on the block
      • RStudio For R code but also reasonable for markdown editing
    • Windows:
      • notepad++. Lightweight editor. Can syntax highlight R code
    • Mac
      • textmate (textmate2 is open source)
  • Some non-free editors

24.2 Emacs

  • Emacs

    One editor to rule them all.

  • Why use Emacs? Availability
    • All platforms, operating systems
    • Will Always Exist (cue music)
    • Open source and free
  • Why use Emacs? Best tools
    • Emacs has "major modes" for most computer languages and types of files
    • Powerful text management tools:
      • Compare (diff) documents or buffers
      • Regular expressions
      • Can deal with rectangular text (columns, tables)
    • Can handle very large files

24.3 Basics

  • Start Emacs

    emacs-simple-start.png

    1: Menus, 2: Toolbar, 3: Buffer, 4: Mode line (status bar), 5: Mini buffer

  • What I see:

    emacs-schwilk-start.png

  • What you'll see if you use my student starter

    emacs-student-starter-start.png

  • A quick note on terminology

    Emacs was first written before modern window managers existed (before MS Windows 3.1 even!). So it uses its own terminology for a few things.

  • Frames, windows and buffers
    Frame
    What we call a "window" on a modern operating system. One instance can control multiple frames. Try File -> New Frame
    Window
    What we would call a "panel." A subdivided part of an Emacs "frame". A way to view multiple buffers at once (eg R code and R console). Try File -> Split Window
    Buffer
    A name associated with the contents of a window. Can be the contents of a file, a process, an interpreter. Try Buffers -> ...
  • Emacs "Major Modes"
    • Emacs can customize itself based on the language you are working in
    • R code will look different (font colors, comments etc) from Python from Markdown.
    • Emacs will guess the mode from the file extension (eg .R), or you can specify the mode: type Alt-x, release, then type a valid mode name
  • Some examples:
    • R (using emacs package ess)
    • Bash shell
    • Tramp (open files over ssh). Or ssh then open emacs.
    • git via package 'magit'
    • markdown and Pandoc
    • Python and C, C++, etc
    • Movement, replace-regexp
  • Emacs keyboard commands
    • Emacs was created in the days before
      • Mice
      • Windowing desktop environments
      • Microsoft windows or Macs existed.
    • Everything can be done from keyboard
      C
      Control key
      S
      Shift key
      M
      Meta key (usually "Alt")
  • Keyboard nomenclature
    "C-x g"
    hold Control while typing "x", release, then type "g"
    "M-x"
    Hold down Alt then hit "x". This brings you to mini-buffer to type longer commands
    "C-c C-c"
    Hold down Control while typing "c" then do that again.
  • That seems complicated

    So you don't have to use those keys if you don't want to!

    • The keyboard commands are featured in most Emacs tutorials. And that may scare folks away.
    • I have a "cheat sheet" of commands taped on the wall next to my computer
    • But you can always use the pull down menus and still do more than with any other editor.
  • Just use the menus and keys you are used to
    • Select all:

      Edit -> Select All vs "C-x h"

    • End of buffer:

      "page down" key a few times vs "M->" (Alt, Shift, greater than)

    • Moving by character:

      arrow keys, "home", "end" vs "C-f", "C-b", "C-n", "C-p", "C-a", "C-e"

  • One problem:
    • You are probably used to MS or Mac cut-copy-paste
      keystroke   result
      C-x         cut
      C-c         copy
      C-v         paste
      C-z         undo


    • But Emacs defaults:
      keystroke   result
      C-w         cut ("wipe")
      M-w         copy
      C-y         paste ("yank")
      C-/         undo
  • Easily fixed

    Turn on "CUA mode", multiple ways:

    • Options -> CUA
    • "M-x cua-mode"
    • Add the following to your emacs startup file: (cua-mode t)

    Watch out: This overrides some basic Emacs keyboard functionality. For example, you must turn this off for the built-in Emacs tutorial!

  • Some shortcuts you should use
    M-x
    Execute a command in the minibuffer; this gives you access to every lisp function
    C-g
    Quit. Exits a function, exits mini-buffer. Useful if you get stuck
    C-s
    Forward search. I use this very frequently
    M-%
    Query replace
    C-M-%
    Regular expression query replace
    M-q
    Rewrap paragraph (does the right thing for code comments)
    M-;
    Comment or uncomment selected region; knows which comment character to use
  • More on M-x
    • "M-x" gets the mini-buffer ready for a command
    • Tab completion works, so just type the first few letters


    • Example: indent-region
      • Highlight text ("select region")
      • Type "M-x indent-region"
      • And Emacs even reminds you of the keyboard shortcut.

24.4 Installing

  • Installing Emacs
  • Install Emacs on Windows
    • go to http://ftp.gnu.org/gnu/emacs/windows/
    • Download Emacs by right-clicking on either emacs-xx.x-bin-i386.zip or emacs-xx.x-fullbin-i386.zip.
    • Extract the zip file to C:\Program Files\emacs
    • Add shortcuts to the executable and set the $HOME environment variable (My Computer … Properties … Advanced)
    • You will need to learn about environment variables so that emacs can find R, etc.
  • Customizing Emacs
  • Installing my emacs student starter
  • Set up on Linux
    cd ~
    git clone https://github.com/schwilklab/emacs-starter.git .emacs.d
    
  • Some features and departures from defaults

    Note that some things in my configuration are different from the documentation you will find online:

    • Uses Windows-style cut-copy-paste keys (cua-mode).
    • A dark color theme ("schwilk" theme under options…customize emacs…custom themes)
    • C-x g for quick access to git interface "magit"
  • Emacs escape hatch

    C-g

  • A poster for learning emacs

25 Example packages

25.1 PCA and ordination

  • J.A.'s scorpion morphology data
    library(dplyr)
    
    # read the raw morphology data, drop V.W, and log-transform all
    # measurement columns (everything except Sample)
    scorpion <- read.csv("http://r-research-tool.schwilk.org/lectures/data/Azzinari_raw.csv",
                         stringsAsFactors=FALSE) %>%
        select(-V.W) %>%
        mutate(across(-Sample, log))
    head(scorpion, 2)
    
      Sample      CaL       I.L      I.W      V.L     FemL      FemW     Ch.L
    1      A 2.066863 0.6931472 1.131402 1.856298 1.609438 0.9555114 2.564949
    2      A 2.008214 0.7178398 1.098612 1.840550 1.589235 0.7884574 2.459589
           ChW      ChD      FFL      MFL    Ves.L    Ves.W    Ves.D    Tel.L
    1 1.458615 1.625311 1.667707 1.960095 1.526056 1.098612 1.098612 1.887070
    2 1.386294 1.458615 1.609438 1.916923 1.568616 1.064711 1.081805 1.902108
      Leg.Tib.L Leg.Tib.W
    1  1.629241 0.6830968
    2  1.568616 0.7419373
    
  • Look at correlations
    scorpion %>% select(-Sample) %>%
        pairs()
    

    pca_splom.png
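
  • Correlations, numerically

    The scatterplot matrix suggests strong positive correlations among the measurements; a correlation matrix quantifies this. A quick sketch (use="complete.obs" drops the rows with NAs, which we filter out in the next step anyway):

    round(cor(select(scorpion, -Sample), use="complete.obs"), 2)
    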

  • PCA
    # PCA cannot handle NA's:
    scorpion <- scorpion %>% filter(complete.cases(.))
    sc.complete <- scorpion %>% select(-Sample)
    pca <- prcomp(sc.complete)
    plot(pca)
    
  • Plot of PC axis variances

    pca_var.png
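
  • Variance explained, numerically

    summary() on the prcomp object reports the same information as proportions (a quick sketch, assuming the pca object fit above):

    summary(pca)  # sdev, proportion of variance, cumulative proportion per PC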

  • Look at the loadings
    pca$rotation[,1:2]
    
                    PC1         PC2
    CaL       0.2210499 -0.01444772
    I.L       0.2422791 -0.18468988
    I.W       0.2443911 -0.14540620
    V.L       0.2305940 -0.12903198
    FemL      0.2190032  0.11698766
    FemW      0.2322420 -0.22215989
    Ch.L      0.2271538  0.20442404
    ChW       0.2737845 -0.16207092
    ChD       0.2808830 -0.04576147
    FFL       0.2227986  0.67095346
    MFL       0.2473843  0.43824096
    Ves.L     0.2828978 -0.20786541
    Ves.W     0.2459926 -0.21128623
    Ves.D     0.2566618 -0.18374224
    Tel.L     0.2440554 -0.01627861
    Leg.Tib.L 0.2300402  0.12883335
    Leg.Tib.W 0.2060285  0.11827706
    
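  • Loadings and scores together

    A biplot overlays the variable loadings (arrows) on the sample scores, which can help interpret the axes. A minimal sketch with base graphics, using the pca object from above:

    biplot(pca, cex=0.6)  # arrows = variable loadings, points = samples
    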
  • Size differences across samples?
    library(ggplot2)
    scores <- data.frame(pca$x[, 1:10])
    scores$sample <- scorpion$Sample
    ggplot(scores, aes(sample, PC1)) + geom_point(size=3)
    
  • Results

    pca_size.png
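
  • Or as boxplots

    With many individuals per sample the points overplot; a boxplot of the same scores may be easier to read (a sketch using the scores data frame built above):

    ggplot(scores, aes(sample, PC1)) + geom_boxplot()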

  • Plot PC2 and PC3
    ggplot(scores, aes(PC2, PC3, color=sample)) + geom_point(size=4)
    
  • Result

    pca_scores.png

  • What is PC2? Largely FFL and MFL.
    ggplot(scores, aes(sample, PC2)) + geom_point(size=4)
    

    pca_pc2.png

  • Look at FFL residuals
    scorpion$size <- scores$PC1
    fflmod <- lm(FFL ~ size, data=scorpion)
    scorpion$FFL.sc <- fflmod$residuals
    ggplot(scorpion, aes(Sample, FFL.sc)) + geom_point(size=4)
    
  • Results

    pca_ffl.png

  • Better yet, create model with size (PC1) as covariate
    fflmod <- lm(FFL ~ Sample + size + Sample:size, data=scorpion)
    anova(fflmod)
    
    Analysis of Variance Table
    
    Response: FFL
                Df  Sum Sq Mean Sq  F value Pr(>F)    
    Sample       8 0.89796 0.11224  40.1027 <2e-16 ***
    size         1 0.55940 0.55940 199.8638 <2e-16 ***
    Sample:size  8 0.03268 0.00409   1.4596 0.1944    
    Residuals   53 0.14834 0.00280                    
    ---
    Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
    
  • And plot these allometries
    ggplot(scorpion, aes(size, FFL, color=Sample)) + geom_point(size=4) +
         geom_smooth(method="lm")
    
  • Result

    pca_ffl_allom.png
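
  • Check the model

    Before leaning on the ANCOVA table, the standard lm diagnostic plots are worth a glance (a sketch using the fflmod object fit above):

    op <- par(mfrow=c(2, 2))  # 2x2 grid of diagnostic panels
    plot(fflmod)              # residuals vs fitted, Q-Q, scale-location, leverage
    par(op)                   # restore previous plotting parameters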

  • Discriminant Function Analysis
    # CV=TRUE gives leave-one-out cross-validated class predictions
    dfamod <- MASS::lda(sample ~ PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
                        data=scores, CV=TRUE)
    ct <- table(scores$sample, dfamod$class)
    diag(prop.table(ct, 1))    # proportion classified correctly, by sample
    sum(diag(prop.table(ct)))  # total proportion correct
    
            A         B         C         D         E         F         G         H 
    0.3000000 0.2500000 0.2857143 0.2500000 0.5555556 0.4444444 0.3333333 0.3333333 
            I 
    0.4000000
    
  • Predict
    # refit without CV to get discriminant scores for plotting
    lda.values <- predict(MASS::lda(sample ~ PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
                                    data=scores))
    ldascores <- data.frame(lda.values$x)
    ldascores$sample <- scores$sample
    ggplot(ldascores, aes(LD1, LD2, color=sample)) + geom_point(size=5)
    
  • Results

    dfa.png

  • Notes / see also
    • pcaMethods for imputed values in PCA (deal with NAs)
    • prcomp and princomp have different terminology (see sketch below)
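
      For example (a sketch; both functions live in base R's stats package, and the signs of the loadings can differ arbitrarily between the two):

      pr1 <- prcomp(sc.complete)    # SVD-based; parts: rotation, x, sdev
      pr2 <- princomp(sc.complete)  # eigen-based; parts: loadings, scores, sdev
      pr1$rotation[, 1]             # what prcomp calls "rotation" ...
      pr2$loadings[, 1]             # ... princomp calls "loadings"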

25.2 M.F.'s orchid-fungi data

  • Mycorrhizal fungi and orchids
    myco <- read.csv("http://r-research-tool.schwilk.org/lectures/data/Fernandez_orchid_fungi.csv",
              stringsAsFactors=FALSE)
    head(myco, 2)
    
             genus  species sp_code sample_nr site status      long     lat
    1 Masdevallia  nidifica    mani         1    a common -83.78525 9.75116
    2 Masdevallia  nidifica    mani         2    a common -83.78525 9.75116
                  life_zone elevation rainfall temp host_tree light_.    fungal.ID
    1 Premontane Rainforest      1250     4100 23.2  saur_mon      27 Sebacinaceae
    2 Premontane Rainforest      1250     4100 23.2  saur_mon      36 Sebacinaceae
    
  • Explore by orchid species
    # most common fungi by orchid species
    myco %>% group_by(sp_code, fungal.ID) %>% 
         summarize(count = length(fungal.ID)) %>%
         top_n(n=1, wt=count)
    
     Source: local data frame [4 x 3]
    Groups: sp_code [4]
    
      sp_code      fungal.ID count
        <chr>          <chr> <int>
    1    difr Cantharellales    18
    2    mach Cantharellales    15
    3    mani   Sebacinaceae    10
    4    onkl   Sebacinaceae    12
    
  • By site:
    # three most common fungi by site
    myco %>% group_by(site, fungal.ID) %>% 
         summarize(count = length(fungal.ID)) %>%
         top_n(n=3, wt=count)
    
     Source: local data frame [11 x 3]
    Groups: site [4]
    
        site      fungal.ID count
       <chr>          <chr> <int>
    1      a Cantharellales     9
    2      a   Sebacinaceae    10
    3      a Tulasnellaceae     1
    4      b Cantharellales    10
    5      b   Sebacinaceae    10
    6      c Atractiellales     4
    7      c Cantharellales    11
    8      c Tulasnellaceae     3
    9      d Atractiellales     2
    10     d Cantharellales    12
    11     d Tulasnellaceae     4
    
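  • Full site-by-fungus table

    The grouped summaries above show only the top taxa per group; base R's xtabs gives the full contingency table in one line (a sketch using the myco data frame loaded above):

    xtabs(~ site + fungal.ID, data=myco)  # counts of each fungal group per site
    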