# R as a Research Tool: Lectures

## Table of Contents

- 1. Introduction:
- 2. Plots with ggplot
- 3. Data Structures
- 4. Introduction to functions and more lists
- 5. Lists and apply functions
- 6. Control structures
- 7. Loops and functions
- 8. Data Frames and Factors
- 9. Math I
- 10. Math II and scope
- 11. Style and Debugging
- 12. Strings
- 13. The Grammar of Graphics and ggplot I
- 14. The Grammar of Graphics and ggplot II
- 15. Shaping data and
`dplyr`

- 16. Tidy Data
- 17. Dates and Times
- 18. Statistical Models
- 19. Perception and Color
- 20. Markup languages
- 21. Version Control

## 1 Introduction:

### 1.1 Why R?

- R popularity

- R: growth mostly in academia and healthcare

- Example: Shiny (interactive web apps in R)

- Example: R Markdown reports

- Example: Publish dynamic report

- Why use a programming language at all?

- Powerful
- Reproducible
- Interoperability

- Why R?

- A language
*designed*around data analysis and and well suited to interactive exploration of data. - By far the most popular software for statistics and data analyses.
- A massive set of packages.
- Strong and friendly community. R is the de facto standard language in statistics and many areas of biology.
:**Open source**- Available to all, not just those in rich countries.
- Contributes to reproducible research.

- A language
- Shortcomings of R compared to other languages

- Less "general purpose" than Python, etc.
- There are nutty aspects of the language and many ways to do the same thing. Many R functions use tricks to reduce the amount of typing (interaction!) at the cost of making code that is hard to understand.
- R "evolved" it was not designed. Many APIs are inconsistent (Even in core R!)
- Most R programmers are not computer scientists. There is a lot of poorly written and poorly documented R code out there.
- R is not very fast. It uses memory inefficiently. But it is often fast enough.

- Computer Languages

- Interpreted vs compiled languages

- Compiled

higher level language code is converted into machine code and then 'directly' executed by the host CPU

- Faster execution, but needs compilation time
- No dynamic typing or scoping

- Interpreted (eg R)

Programs are indirectly executed ("interpreted") by an interpreter program

- Platform independence
- Reflection (ability to examine and modify the structure and behavior of an object at run time)
- Dynamic typing

- Compiled
- R

- R is a computer language for statistical computing
- R is an
**interpreted**language - FORTRAN, C, C++ are all
**compiled**languages - other interpreted languages used in science:
**Python**,**Julia**, Perl, Matlab/Octave.

### 1.2 Logistics

- Website and homework

- Website
- http://r-research-tool.schwilk.org
- Homework
- Email to me each Tuesday. See webpage for format instructions. dylan.schwilk@ttu.edu
- Syllabus
- http://r-research-tool.schwilk.org/handouts/R-as-a-Research-Tool-Syllabus.html

- Lecture code

All code from my lecture slides is in session files:

http://r-research-tool.schwilk.org

Use these files as your notes templates.

- Me

- Plant ecologist: fire ecology, evolutionary ecology
- On the intertubes:
- http://www.schwilk.org
- twitter: DylanSchwilk
- github: https://github.com/dschwilk and https://github.com/dschwilk

### 1.3 Getting started

- Installing R

See syllabus

- Running R from the command line

Start R and you'll get a greeting, and then the R prompt, the > sign. The screen will look something like:

R

R version 3.4.2 (2017-09-28) -- "Short Summer" Copyright (C) 2017 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit) [...MORE...] Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.

- RStudio

- RStudio

Many people use RStudio. It is not open source, but the basic version is free. It is an IDE (integrated development environment) that allows one to edit scripts, send code directly to the R process in a console window, debug, edit RMarkdown, etc all in one application.

- Windows: console, workspace, files, current open file (script).
- Projects: Use them! saves a hidden .RProj file in your project directory.

- RStudio: keyboard shortcuts

### 1.4 Fundamentals of R

- R as a calculator:

4.0 * 3.5 log(10) # natural log log10(10) (3 + 1) * 5 3^-1 1/0

[1] 14 [1] 2.302585 [1] 1 [1] 20 [1] 0.3333333 [1] Inf

- Statements

- one statement per line
- but R will keep reading if the statement is "not finished" (open parentheses, haning operator, etc)
- semicolon can end a statement (rarely, if ever, needed)

1 + 2 # second statement ends here 3 + 4 + ( 7 - 7)

[1] 3 [1] 7

- Assigning values to variables

- Variables that contain many values

- There are no scalars in R

But you can have vectors of length 1. Can you figure out what [1] output is for?

y <- rep("a",50) y

[1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" [20] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" [39] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"

- Naming variables (objects)

- 1-63 characters
- first character a letter
- remaining characters: letters
**or**numbers**or**period (+ underscore) - case sensitive

- Parenthesis and braces

`()`

Parenthesis group operators`{}`

braces group statements and return last expression evaluated. This is needed for function definitions, if-then statements and a few other areas.

- Functions

Functions are modules of code that accomplish a specific task. Functions take data as "arguments" and "return" a result. Once a function is written, it can be used over and over and over again. Functions can be "called" from within other functions.

- Functions in R

Arguments to R functions can be matched by position or by name. So the following calls to sd() are all equivalent

mydata <- rnorm(100) sd(mydata) sd(x = mydata) sd(x = mydata, na.rm = FALSE) sd(na.rm = FALSE, x = mydata) sd(na.rm = FALSE, mydata)

- Example: random numbers

runif(5)

[1] 0.12573941 0.09336869 0.96068974 0.90558315 0.30646899

runif(5, 0, 10)

[1] 5.300959 8.737525 3.814049 6.926031 4.954221

round(runif(5,-0.5,10.5))

[1] 5 0 1 4 6

floor(runif(5,0,11))

[1] 7 5 8 10 7

- Inspecting functions

sd

function (x, na.rm = FALSE) sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm)) <bytecode: 0x5555ab88d040> <environment: namespace:stats>

### 1.5 Getting help

- Help in R

- Help on the internet

- The R Project’s own manuals are available at the R home page and some search engines are listed there, too.
- Quick R website: http://www.statmethods.net/
- RSeek search engine: http://www.rseek.org
- You can post your R questions to r-help, the R listserve. See http://www.r-project.org/mail.html.
- Check out StackOverflow for programming questions tagged "r": http://stackoverflow.com/
- R is difficult to search for when using general search engines such as Google. One trick: use google's filetype criterion. To search for scripts with a .R suffix, that pertain to, say, permutations, do:
`filetype:R permutations -rebol`

- Essential vocabulary

See Hadley Wickham's list: http://adv-r.had.co.nz/Vocabulary.html

## 2 Plots with ggplot

### 2.1 Working in R

- Etherpad

- Use this etherpad for sharing code in class: https://public.etherpad-mozilla.org/p/biol6325-2018

- Vectors: the heart of R

`x <- 1:20 x`

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20

- Matrices

`B <- matrix( c(2,4,3,1,5,7), nrow=3, ncol=2) B`

[,1] [,2] [1,] 2 1 [2,] 4 5 [3,] 3 7

- Session data

- .Rprofile
- File loaded on startup which can contain your own custom functions
- .Rhistory
- File that contains the session history of commands
- .RData
- Binary blob of data which contains all the objects created in the session. Watch out for this!

- Working directory

Set your working directory. You can use the GUI, a .Rproj file or in code:

getwd() setwd(path.expand("~/") ) # on linux at least

`.Rdata`

and`.Rhistory`

- When you quit a session you are asked if you want to save the workspace
- or use
`save.image()`

- When R starts it loads all objects from
`./.Rdata`

! Good idea to make sure your code runs from a clean workspace:

rm(list=ls(all=TRUE))

### 2.2 ggplot: introduction by example

- Making pretty figures

Today we will get ahead of ourselves. Don't worry! I'll come back to the basics after showing you some fun with plots.

- Where do functions come from?

Many are part of base R. Others come from packages. Others you will write yourself.

- ggplot2

#install.packages("ggplot2") library(ggplot2) #?mpg head(mpg) #str(mpg) summary(mpg)

`ggplot()`

function

Create a plot object to which one can add layers (geoms)

- First plot

ggplot(mpg, aes(x=displ, y=hwy)) + geom_point()

- geoms

`ggplot()`

simply sets up the axes and the mapping. To plot data you must add a "geometric object" or "geom"- Use
`geom_point()`

for a scatterplot. See other geoms such as`geom_line()`

,`geom_smooth()`

,`geom_boxplot()`

- Additional variables, an example

ggplot(mpg, aes(displ, hwy, color=class)) + geom_point()

- Saving plots

You can store the plot object to a variable name then save with

`ggsave()`

:p <- ggplot(mpg, aes(displ, hwy)) + geom_point() ggsave("plots/plot1.pdf", p)

Save the last plot even if it was not assigned to a variable:

`ggplot(mpg, aes(cyl, hwy)) + geom_point() ggsave("plots/plot2.pdf")`

- You can use the RStudio GUI ("export" button in the plots window)

### 2.3 Aesthetics and faceting

- Aesthetics in addition to x and y

Aesthetic Discrete variable Continuous variables Color Rainbow of colors Gradient red to blue Size Discrete size steps Mapping radius->value Shape Different shape for each Doesn't work - Example aesthetics

ggplot(mpg, aes(displ, hwy, size = cyl)) + geom_point()

- Faceting

- Small multiples displaying different subsets of the data.
- Useful for exploring conditional relationships and for large data.

- Examples to try:

`p <- ggplot(mpg, aes(displ, hwy)) + geom_point() p + facet_grid(. ~ cyl) p + facet_grid(drv ~ .) p + facet_grid(drv ~ cyl) p + facet_wrap(~ class)`

- Faceting

`facet_grid()`

- 2d grid,
`rows ~ cols`

,`.`

for no split `facet_wrap()`

- 1d ribbon wrapped into 2d
`scales`

- argument controls whether position scales are fixed or free

- Overplotting: What is wrong here?

`p <- ggplot(mpg, aes(cty, hwy)) p + geom_point()`

- Jitter

p + geom_jitter()

- How can we improve this plot?

ggplot(mpg, aes(class, hwy)) + geom_boxplot()

- Reorder

`p <- ggplot(mpg, aes(reorder(class, hwy), hwy)) p + geom_boxplot()`

- Multiple geoms

p + geom_jitter() + geom_violin()

- Your turn

- Read the help for
`reorder`

. Redraw the previously plots with class ordered by median hwy. - How would you put the jittered points on top of the violin plots?

- Read the help for

## 3 Data Structures

### 3.1 Math

- First week vocabulary

- Doing math in R

- Operators

- Operator precedence

`$`

Subsetting by name `@`

component / slot extraction `[`

`[[`

subsetting `^`

exponentiation `:`

sequence operation `%any%`

special operators, including %% and %/% `* /`

multiply and divide `+ -`

add and subtract < > <= >= `= !`

greater, less than `!`

negation `& &&`

and | || or `~`

"depends on" in formulae `<-`

assignment `<<-`

global assignment `?`

help - Order of operations

`x <- 3 + 4 * 2 x`

[1] 11

x <- (3 + 4) * 2 x x <- (3 + 4) * sqrt(2) x

[1] 14 [1] 9.899495

### 3.2 Vector basics

- R data structures

- vectors
- matrices and arrays
- lists
- data frames

And other specialized structures provided by packages (eg tibbles, data.table, phylogenetic trees …)

- Vectors

`x <- c(5,12,13) x length(x) mode(x)`

[1] 5 12 13 [1] 3 [1] "numeric"

- The ":" operator

Note the ":" operator you used in homework:

5:8 5:1

[1] 5 6 7 8 [1] 5 4 3 2 1

Beware of the operator precedence:

`i <- 3 1:i-1 1:(i-1)`

[1] 0 1 2 [1] 1 2

- seq()

seq(from=12,to=30,by=3) seq(from=1.1,to=2,length=10) seq(from=12,to=30,by=3)

[1] 12 15 18 21 24 27 30 [1] 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 [1] 12 15 18 21 24 27 30

- rep()

rep(8,4) rep(c(5,12,13),3) rep(c(5,12,13),each=2)

- The subsetting operators

R has three subset operators:

`[`

`[[`

`$`

We'll start with one you've already used in your homework:

`[`

- Vector subsetting

`v <- 1:100 v`

[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 [19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 [37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 [55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72 [73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90 [91] 91 92 93 94 95 96 97 98 99 100

`v <- v * 10 v[1:3]`

[1] 10 20 30

- Character vectors (aka vectors of "strings")

y <- "abc" length(y) mode(y) y <- c("abc","1","2") length(y)

[1] 1 [1] "character" [1] 3

- Strings

u <- paste("abc", "de", "f") # concatenate the strings u

[1] "abc de f"

v <- strsplit(u, " ") # split the string according to blanks v

[[1]] [1] "abc" "de" "f"

- Some vector concepts

- Recycling
- Automatic lengthening of vectors in certain settings
- Filtering
- Extracting subsets of vectors
- Vectorization
- Where functions are applied element-wise to vectors

- Recycling

`x <- c(1, 2, 3, 4)`

Recycling:

c(1,2) + x

[1] 2 4 4 6

c(1, 2, 3) + x

[1] 2 4 6 5 Warning message: In c(1, 2, 3) + x : longer object length is not a multiple of shorter object length

- Vectorization

Many functions are vectorized, including operators such as

`>`

u <- c(5,2,8) v <- c(1,3,9) u > v

[1] TRUE FALSE FALSE

sqrt(1:9)

[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427 [9] 3.000000

### 3.3 Subsetting

- Vector indexing (subsetting)

There are six things that you can use to subset a vector. First three are used commonly:

- positive integers
- negative integers
- logical vectors
- nothing (
`[]`

returns original vector) - zero (
`[0]`

returns empty vector) - character vectors (only if vector is named)

- Subsetting vectors

`y <- c(1.2,3.9,0.4,0.12) y[c(1,3)] y[1:3]`

[1] 1.2 0.4 [1] 1.2 3.9 0.4

`v <- c(1,4) y[-v]`

[1] 3.9 0.4

- Boolean operators

Operate on TRUE/FALSE values only

Operation Math R ------------ ------------------------ ---------- AND x ∧ y &, && OR x ∨ y || NOT !x ! ------------ ------------------------ ---------- - Boolean operators

operator function -------- ----------------- == Test for equality < Less than > Greater than <=,>= lt or equal, gt or equal && Boolean AND for scalars || Boolean OR for scalar & Boolean AND for vectors | Boolean OR for vectors !x Boolean negation xor() Exclusive OR any(), all() Boolean vector reduction - Logical vectors and filtering

v <- c(TRUE,FALSE, FALSE, TRUE) y[v]

[1] 1.20 0.12

y[ y > 1 ]

[1] 1.2 3.9

y > 1

[1] TRUE TRUE FALSE FALSE

`match`

and`%in%`

names <- c("fortis", "spinosa", "macrocarpa", "leucophylla") samples <- c("spinosa", "leucophylla") samples %in% names

[1] TRUE TRUE

names %in% samples

[1] FALSE TRUE FALSE TRUE

match(samples, names)

[1] 2 4

- any(), all()

Reduce boolean (logical) vectors to a scalar

any(y > 1)

[1] TRUE

all(y > 1)

[1] FALSE

- Assignment to subset

which(y > 1)

[1] 1 2

`y[y > 1] <- 0 y`

[1] 0.00 0.00 0.40 0.12

- Vectorized if-then-else

x <- 1:10 y <- ifelse(x %% 2 == 0, 0, 2*x) y

[1] 2 0 6 0 10 0 14 0 18 0

### 3.4 More vectors

- NA and NULL

- NA
- Missing data
- NULL
- The null object: The value simply does not exist (eg an impossible value)

x <- c(1,NA,3,4,5,6) mean(x)

[1] NA

`mean(x,na.rm=TRUE)`

[1] 3.8

x <- c(1,NULL,3,4,5,6) x # NULL can't be part of a vector mean(x)

[1] 1 3 4 5 6 [1] 3.8

- unique()

`x <- c(1, 2, 3, 3, 3, 4, 4, 1, 2) unique(x)`

[1] 1 2 3 4

- Counting TRUES

x < 3 length(which(x < 3)) sum(x < 3)

[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE [1] 4 [1] 4

- Vector element names

`x <- c(1,2,4) names(x)`

NULL

names(x) <- c("one","two","four") x

one two four 1 2 4

- Subsetting on names

x[c("one", "four")]

one four 1 4

- c(): Type coercion:

- Values are converted to the simplest type required to represent all information.
- The ordering is roughly logical < integer < numeric < complex < character < list.

`c(5,2,"abc") c(5,list(a=1,b=4))`

[1] "5" "2" "abc" [[1]] [1] 5 $a [1] 1 $b [1] 4

- c(): Flattening:

c(1,2,c(100,200,300))

[1] 1 2 100 200 300

### 3.5 Matrices

- Matrices

`m <- matrix(data = c(1,3,5,2,4,6), nrow=3, ncol=2) m`

[,1] [,2] [1,] 1 2 [2,] 3 4 [3,] 5 6

`m <- matrix(data = c(1,3,5,2,4,6), nrow=3) m`

[,1] [,2] [1,] 1 2 [2,] 3 4 [3,] 5 6

- Matrices con.

`a <- matrix(c(1,2,3,4), nrow=2) a`

[,1] [,2] [1,] 1 3 [2,] 2 4

`b <- matrix(c(6,7,8,9), nrow=2) b`

[,1] [,2] [1,] 6 8 [2,] 7 9

- Matrix addition

a + b

[,1] [,2] [1,] 7 11 [2,] 9 13

- Matrices math, dimensions

`c <- matrix(c(3,4,5,6), nrow=1) c`

[,1] [,2] [,3] [,4] [1,] 3 4 5 6

a + c # result: Error in a + c : non-conformable arrays

- Matrix math, con.

a * b

[,1] [,2] [1,] 6 24 [2,] 14 36

a^3

[,1] [,2] [1,] 1 27 [2,] 8 64

- Matrix multiplication

a <- matrix(c(1,2,3),ncol=3) b <- matrix(c(4,5,6),ncol=1) a b

[,1] [,2] [,3] [1,] 1 2 3 [,1] [1,] 4 [2,] 5 [3,] 6

Calculate dot product (inner product)? Or generally, matrix product?

- Matrix multiplication

# a*b # ERROR a %*% b

[,1] [1,] 32

b %*% a

[,1] [,2] [,3] [1,] 4 8 12 [2,] 5 10 15 [3,] 6 12 18

- rbind and cbind

rbind(a, a) # rbind(a,b) # ERROR! cbind(b, b)

[,1] [,2] [,3] [1,] 1 2 3 [2,] 1 2 3 [,1] [,2] [1,] 4 4 [2,] 5 5 [3,] 6 6

- Row and column means

See:

`rowMeans()`

`colMeans()`

`rowSums()`

`colSums()`

- Subsetting matrices

`m <- matrix(c(1,2,3,4,5,6),nrow=3) m[1:2,1:2]`

[,1] [,2] [1,] 1 4 [2,] 2 5

- Dimension reduction

`z <- matrix(1:8, nrow=4) z`

[,1] [,2] [1,] 1 5 [2,] 2 6 [3,] 3 7 [4,] 4 8

`r <- z[2,] r`

[1] 2 6

- Dimension reduction result

attributes(z) attributes(r)

$dim [1] 4 2 NULL

- Avoiding dimension reduction

r <- z[2,, drop=FALSE] r

[,1] [,2] [1,] 2 6

dim(r)

[1] 1 2

### 3.6 Lists

- Lists

An R list is a container whose contents can be items of different data types, unlike a vector.

x <- list(u=2, v="abc") x x$u

$u [1] 2 $v [1] "abc" [1] 2

- Subsetting lists

`[`

- returns a list)
`[[`

- returns an element in the list
- (no term)
`$`

returns an element by name

- Lists continued

A common usage of lists is to package the return values of complex statistical functions. For example, the histogram function hist():

`hist(Nile) hn <- hist(Nile) print(hn)`

$breaks [1] 400 500 600 700 800 900 1000 1100 1200 1300 1400 $counts [1] 1 0 5 20 25 19 12 11 6 1 $density [1] 0.0001 0.0000 0.0005 0.0020 0.0025 0.0019 0.0012 0.0011 0.0006 0.0001 $mids [1] 450 550 650 750 850 950 1050 1150 1250 1350 $xname [1] "Nile" $equidist [1] TRUE attr(,"class") [1] "histogram"

### 3.7 Data Frames

- Data Frames

Like a matrix, but for mixed types. This is what you will use most often!

d <- data.frame(kids=c("Jack","Jill","Igor"), ages=c(12,10,11)) d

kids ages 1 Jack 12 2 Jill 10 3 Igor 11

- Reading data

For now, we will assume you save your data as a comma separated values file (say, in excel), and then read that file into a dataframe:

fish <- read.csv("../data/fish-data.csv", stringsAsFactors=FALSE) head(fish,3)

aquarium nutrient color.score 1 a low 5.3 2 a low 4.7 3 a low 5.0

Always use

`stringsAsFactors = FALSE`

(or used the readr package).

## 4 Introduction to functions and more lists

### 4.1 Functions

- Why write functions?

Why not just copy and paste code?

- You can give a function a name that makes your code easier to understand.
- If you need to change the code, you only need to change it in one place.
- You eliminate the chance of making mistakes when you copy and paste.
- You can pass functions to other functions

- When to write a function

Whenever you find yourself copying and pasting more than twice.

- Functions are objects

So we create them and assign them to a name.

addone <- function(x) { return(x+1) } addone(1)

[1] 2

- Function example

num2string <- function(thename, num) { if (num > 1) { thename <- paste(thename, "s", sep="") } return(paste(as.character(num), thename)) } num2string("computer", 44) num2string("instructor", 1)

[1] "44 computers" [1] "1 instructor"

### 4.2 In class project

- Function randomIntegers()

- Problem: convert random real numbers to integers

- Your turn

# RandomIntegers: produce vector of uniformly # distributed random integers. # # Arguments: # n = number of random integers to be generated. # min_val, max_val = min, max possible values, # inclusive. # Returns: # vector of random integers. RandomIntegers <- function(n, min_val, max_val) { ## ???? return(result) }

- Solution

# RandomIntegers: produce vector of uniformly # distributed random integers. # # Arguments: # n = number of random integers to be generated. # min_val, max_val = min, max possible values, # inclusive. # Returns: # vector of random integers. RandomIntegers <- function(n, min_val, max_val) { # Extend upper boundary by 1 and use floor() result <- runif(n, min_val, max_val+1) result <- floor(result) return(result) }

- Other solutions, using round():

RandomIntegers2 <- function(n, min_val, max_val) { # Extend each boundary by 0.5 and use round() r <- runif(n, min_val-0.5, max_val+0.5) return(round(r)) }

Or … extend lower bound and use

`ceiling()`

- Other solutions, using sample()

RandomIntegers3 <- function(n, min_val, max_val) { sample(min_val:max_val, n, replace=TRUE) }

## 5 Lists and apply functions

- Vocab, homework

- R functions over second week
- Homework: cite sources including fellow students
- Homework: I'm expecting you to be able to sove porblems with what I've taught thus far.

### 5.1 Lists

- Lists vs vectors

An R list is a container whose contents can be items of diverse data types, as opposed to them being, say, all numbers, as in a vector.

- List indexing

j <- list(name=c("Joe", "Doe"), salary=39000, permanent=TRUE) j$salary j$sal j[["salary"]] j[[2]]

[1] 39000 [1] 39000 [1] 39000 [1] 39000

- List indexing

j[2]

$salary [1] 39000

- [] vs [[]]

`j[[1]]`

- Extracts a single item
`j[1]`

- Extracts a sublist

- [] vs [[]]

class(j[["name"]]) class(j["name"])

[1] "character" [1] "list"

- Adding elements

j$rnumber <- "R00001111" j

$name [1] "Joe" "Doe" $salary [1] 39000 $permanent [1] TRUE $rnumber [1] "R00001111"

- Adding more elements

length(j) j[[5]] <- "Extra" j[6:7] <- c(FALSE,FALSE) length(j)

[1] 4 [1] 7

- Adding more elements

j[3:7]

$permanent [1] TRUE $rnumber [1] "R00001111" [[3]] [1] "Extra" [[4]] [1] FALSE [[5]] [1] FALSE

- Deleting elements

length(j) j$rnumber <- NULL length(j) names(j)

[1] 7 [1] 6 [1] "name" "salary" "permanent" "" "" ""

- Inserting elements

Either break apart and re-concatenate with

`c()`

, or use`append()`

:newj <- append(j, list(birthday="12/10/1970"), after=3) newj[3:5]

$permanent [1] TRUE $birthday [1] "12/10/1970" [[3]] [1] "Extra"

- Lists to vectors:
`unlist()`

l <- list(a=1,b=10,c=100) v <- unlist(l) v mode(l) mode(v) unname(v)

a b c 1 10 100 [1] "list" [1] "numeric" [1] 1 10 100

### 5.2 Apply functions

- Apply family of functions

`apply`

- apply a function to the rows, columns or values in a matrix/array
`lapply`

- apply a function to every item in a list and return a list
`sapply`

- lapply but simplifies the result and returns a vector
`mapply`

- You have multiple data structures and want to hand elements of each as separate arguments to a function

There are more in this family that we use less commonly. See ??apply and scroll down.

- Applying functions to lists

- lapply example

Consider sexing 10,000 shrimp. We'll use a subset. We'd like to know the indices for all females, males and juveniles:

shrimp <- c("M", "F", "J", "J", "F", "F", "M", "F", "F") shrimp=="F" which(shrimp=="F")

[1] FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE [1] 2 5 6 8 9

- Example, continued

codes <- c("F","M","J") lapply(codes, function(sex) which(shrimp==sex))

[[1]] [1] 2 5 6 8 9 [[2]] [1] 1 7 [[3]] [1] 3 4

- Whoa, where was the function?

It was unnamed! Alternative with named function:

# function to return indices of x in v indices <- function(x, v) { return(which(x==v)) } # and use sapply for USE.NAMES=TRUE: sapply(codes, indices, v=shrimp, simplify=FALSE)

$F [1] 2 5 6 8 9 $M [1] 1 7 $J [1] 3 4

- Lists of lists

a <- list(name="Dylan", dept="Biology") b <- list(name="Jane", dept="Plant and Soil") c <- list(name="Kathryn", dept="NRM", role="student") course <- list(Schwilk=a, Doe=b, Smith=c) course[[3]]$role

[1] "student"

`sapply`

How can we extract a vector of just the first names?

sapply(course,"[[","name")

Schwilk Doe Smith "Dylan" "Jane" "Kathryn"

- Apply functions to matrices:
`apply()`

`y <- apply(X, MARGIN, FUN, ...)`

`X`

- an array
`MARGIN`

- subscripts which the function gets applied over
`FUN`

- the function to be applied
`...`

- additional arguments to function

Returns a vector or array

- Applying a function to rows or columns

the

`apply()`

function.`MARGIN`

argument = 1 if the function applies to rows, 2 for columns, etc.`z <- matrix(c(1,2,3,4,5,6), ncol=2) z`

[,1] [,2] [1,] 1 4 [2,] 2 5 [3,] 3 6

apply(z, 2, mean)

[1] 2 5

- on rows, using our own function:

f <- function(x) x/c(2, 8) y <- apply(z, 1, f) y

[,1] [,2] [,3] [1,] 0.5 1.000 1.50 [2,] 0.5 0.625 0.75

- What are "margins"?

- Examples: matrix

- MARGIN = 1: slice up into rows
- MARGIN = 2: slice up into columns
- MARGIN = c(1,2): slice up into individual cells

— Wickham (2011): http://www.jstatsoft.org/v40/i01/

- 3-d arrays

- Example: grid of genotypes

Let's store genotypes as 1st dimension on a 2d landscape of organisms. So we will make a 3d array:

makeGenotype <- function() sample(1:100, 25, replace=TRUE) # simple 10 x 10 landscape landscape <- array(replicate(100, makeGenotype()), dim =c(25,10,10 )) # The first genotype in landscape position 1,1: landscape[, 1, 1]

[1] 55 22 4 80 57 8 14 30 51 9 89 7 31 33 47 77 62 91 58 34 49 82 20 41 2

- Example: apply

## get matrix of fitnesses testFitness <- function(g, e) { return( 1 / ( sum((e-g)^2) + 1)) } opt <- makeGenotype() # "best" genotype fitnesses <- apply(landscape, c(2,3), testFitness, e = opt) # lets see the relative fitnesses as percentages round(fitnesses / max(fitnesses)*100)

- Results:

[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 40 45 27 38 40 36 38 44 36 33 [2,] 41 34 45 37 51 46 37 49 34 40 [3,] 35 40 35 35 43 35 41 31 45 51 [4,] 31 58 24 46 44 38 33 38 40 37 [5,] 34 36 56 44 45 58 37 39 46 68 [6,] 39 40 47 51 33 36 54 47 44 42 [7,] 42 56 44 44 38 35 36 38 40 70 [8,] 43 53 67 50 36 39 59 38 40 55 [9,] 34 42 61 100 29 42 37 27 50 35 [10,] 37 58 39 64 53 54 82 42 40 31

- Where is the most fit genotype?

max(fitnesses) indices <- which(fitnesses == max(fitnesses), arr.ind=TRUE) indices landscape[, indices[1], indices[2]] opt

[1] 4.968944e-05 row col [1,] 9 4 [1] 25 89 23 48 69 84 40 30 19 52 87 35 15 54 30 30 48 85 47 19 5 6 62 20 21 [1] 9 54 89 2 67 94 8 5 19 96 95 51 15 27 4 10 47 82 91 49 7 38 63 69 12

- Heatmap of landscape:

image(fitnesses)

:noexport:

y <- "fruit" switch(y, fruit = "banana", vegetable = "broccoli", "Neither") y <- "meat" switch(y, fruit = "banana", vegetable = "broccoli", "Neither")

[1] "banana" [1] "Neither"

## 6 Control structures

### 6.1 Homework solutions

- Solutions Q. 1-3

oak_wp <- read.csv("http://r-research-tool.schwilk.org//data/oak-water-potentials-simple.csv", stringsAsFactors = FALSE) n_species <- length(unique(oak_wp$spcode)) # 1 any(oak_wp$md.psi > oak_wp$pd.psi) # This answers #2 which_fail <- which(oak_wp$md.psi > oak_wp$pd.psi) #2 and n_fail_check <- length(which_fail) # 2 min_index <- which.min(oak_wp$md.psi) #3 oak_wp[min_index, c("md.psi", "date", "spcode")] #3

- Solution Q. 4

years <- unique(oak_wp$year) mean_for_year <- function(year){ return(mean(oak_wp$md.psi[oak_wp$year==year], na.rm=TRUE)) } wp_means <- sapply(years, mean_for_year) years[which.min(wp_means)]

### 6.2 if-else

- if else decisions

tf <- function(a) { if (a == 13) { return("Lucky!") } else { return("No luck") } }

tf(5) tf(13)

[1] "No luck" [1] "Lucky!"

- Be careful inverting boolean logic!

A <- -8 if (A>0 & A<10) { print("YES") } else { print("NO") }

[1] "NO"

- So we invert the condition

if ( !(A>0 & A<10) ) { print("YES") } else { print("NO") }

[1] "YES"

- Careful rewriting

# wrong if (A<=0 & A>=10) { print("YES") } else { print("NO") } #right if (A<=0 | A>=10) { print("YES") } else { print("NO") }

[1] "NO" [1] "YES"

- if, else if, else

tf <- function(r) { if (r == 13) { print("Lucky 13!") } else if (r < 0) { print("Negative!") } else { print("Just a number") } }

tf(13) tf(-100) tf(10)

[1] "Lucky 13!" [1] "Negative!" [1] "Just a number"

`if`

needs a single logical value

if (c(TRUE, FALSE)) {}

NULL Warning message: In if (c(TRUE, FALSE)) { : the condition has length > 1 and only the first element will be used

`|`

vs`||`

and`&`

vs`&&`

Use

`||`

and`&&`

in`if`

statements. They are "short-circuiting"x <- 1 if (x >0 && c(TRUE, FALSE)) { print("Yes") }

[1] "Yes"

### 6.3 Functions: review

- Steps to writing a function

- Decide what information is needed (arguments) and what to return
- Develop the algorithm
- Implement in code

- Potential problems

- Syntax errors (error when you define the function)
- Errors during run time (eg undefined variable names)
- Runs with no errors but the output wrong!

- Debugging

More on this later …

- use print()
- use stopifnot()

- The return() statement

If you do not include one, R uses the last statement evaluated

- But explicit return is clearer

And can avoid some weird R gotchas

funnyFunc1 <- function(x) { if(x == 10) "TEN" else "NOT TEN" } funnyFunc2 <- function(x) { if(x == 10) r <- "TEN" else r <- "NOT TEN" } # invisible return from assignment funnyFunc1(10) funnyFunc2(10) print("huh?") print(funnyFunc2(10))

[1] "TEN" [1] "huh?" [1] "TEN"

- Side effects

"Pure" functions have no side effects, just returned values. But you have used functions entirely for their side effect (eg

`library()`

,`write.csv()`

) - Luckily R protects you from most side effects

Function has read access to variables at higher scope, but no write access.

x <- y <- 10 func <- function(x) { x <- 0 y <- y*2 return(y) } func() x y

[1] 20 [1] 10 [1] 10

### 6.4 In-class project

- Name the functions

Read the code for each of the following three functions, puzzle out what they do, and then brainstorm better names.

f1 <- function(x) { if (length(x) <= 1) return(x) return(x[length(x):1]) } f2 <- function(string, prefix) { substr(string, 1, nchar(prefix)) == prefix } f3 <- function(x) { if(! is.vector(x) || ! is.numeric(x)) return(FALSE) return(length(x) == 1) }

## 7 Loops and functions

### 7.1 Loops

**for**and**while**loops

x <- c(1, 2, 3) for (i in x) { print(i^2) }

[1] 1 [1] 4 [1] 9

i <- 1 while (i <= 10) { i <- i+4 print(i) }

[1] 5 [1] 9 [1] 13

- Breaking out of loops

i <- 1 while(TRUE) { i <- i+4 if (i>10) break } i

[1] 13

Or use

**repeat**:i <- 1 repeat { i <- i+4 if (i>10) break } i

[1] 13

- Loop hierarchy

specific to general:

`for`

,`while`

,`repeat`

So when do we use each type?

- Use
`for`

loops when you need to iterate over a vector. There are a known number of iterations. - Use
`while`

loops when the loop must continue until some condition is met. If the condition is met when the loop is first encountered, the loop is skipped completely. - Use
`repeat`

as for`while`

, but when the loop must execute at least once or when the best exit point(s) is/are somewhere in the middle of the loop block.

- Use
- Two idioms for
loops`for`

- Idiom 2: Iterate over an indexing vector

in order to do something to elements over multiple vectors

v2 <- c(1,1,2,10) for(i in seq_along(v)) { r <- v[i] * v2[i] print(sprintf("Loop %d, Result: %d", i, r)) }

[1] "Loop 1, Result: 2" [1] "Loop 2, Result: 4" [1] "Loop 3, Result: 10" [1] "Loop 4, Result: 80"

### 7.2 In class projects

- Project 1: calculating a probability

Suppose we have

`n`

independent events each with a separate probability of occurring. What is the probability of exactly one of those events occurring? - The algorithm

if the probability of the \(i^{th}\) event is \(p_i\), then P(A and not B and not C) would be be \(p_A(1 - p_B)(1 - p_c)\). For general n:

\[ \sum\limits_{i=1}^{n} p_i (1 - p_1) \ldots (1 - p_{i-1}) (1 - p_{i+1}) \ldots{} (1 - p_n) \]

- Your turn: translate this to code

Function takes vector of probabilities and returns a scalar probability of exactly one event occurring.

exactlyone <- function(p) { ## ? }

You will want to loop through the vector

`p`

while also looping through "not p" (`1-p`

) - The code

exactlyone <- function(p) { notp <- 1 - p totalProb <- 0.0 for (i in 1:length(p)) { totalProb <- totalProb + p[i] * prod(notp[-i]) } return(totalProb) }

- Project 2: Fibonacci sequence

0,1,1,2,3,5,8,13 …

- Solution

fib <- function(n) { ## Two termination cases if (n==1) { return(0) } else if (n==2) { return(1) } # Otherwise use the definition of the sequence: return(fib(n-1) + fib(n-2)) } # example: first 15 values in sequence: sapply(1:15, fib)

[1] 0 1 1 2 3 5 8 13 21 34 55 89 144 233 377

- The recursion stack

### 7.3 More recursion

- "Exploding" dice

In some table-top role playing games, dice are used to provide stochasticity in the simulation of events. In order to produce a more "exponential-like" distribution, some games use "acing" or "exploding dice" in which the die is re-rolled if the first roll results in the maximum value and the new value is added to the first.

- Simulating such a die roll

`rollDie()`

Write a function called

`rollDie(x)`

that simulates rolling a dice with x sides and allows "acing" or "exploding dice". Under this rule, a maximum roll (eg a "6" on a six-sided die) results in a second roll whose value is added to the first. This is open-ended and the second roll can "ace" as well: in other words, the resulting probability distribution is unbounded on the upper end.

- A
`rollDie()`

Solution

rollDie <- function(die) { rolls <- sample(1:die, 1, replace=TRUE) if (rolls == die) { rolls <- rolls + rollDie(die) } return(rolls) } rollDie(6)

[1] 4

- rollDice()

rollDice <- function(die, n, allow.ace = TRUE) { stopifnot(die > 1) # no such thing as 1-sided die rolls <- sample(1:die, n, replace = TRUE) if(!allow.ace) return(rolls) aces <- rolls == die if ( any(aces)) { rolls[aces] <- rolls[aces] + rollDice(die, sum(aces), allow.ace) } return(rolls) }

## 8 Data Frames and Factors

### 8.1 Homework Solutions

- euclideanDistance()

euclideanDist <- function (p1, p2) { dist = sqrt((p1[1] - p2[1])^2 + (p1[2] - p2[2])^2) return(dist) }

Idiomatic R (works with vector of xs and vector of ys):

euclideanDist <- function (p1, p2) { dist <- sqrt(sum((p2 - p1)^2)) return(dist) }

- decomposeTime() with modulo arithmetic

## decomposeTime: Given a value in seconds, convert to days, hours, minutes, and # seconds. If input is negative, all returned components will be negative # as well. # # args: # nSeconds = scalar number of seconds. # ---------------------------------------- # returns: # a list containing: # days = integer number of days. # hours = integer number of hours. # minutes = integer number of minutes. # seconds = residual seconds.

- decomposeTime() with modulo arithmetic

decomposeTime <- function (nSeconds) { sInMinute <- 60 sInHour <- 60 * 60 sInDay <- sInHour * 24 # deal with negative input by recording sign sign <- 1 if(nSeconds < 0) { sign <- -1 nSeconds <- -1 * nSeconds # back to positive } result <- list(days = nSeconds %/% sInDay, hours = (nSeconds %% sInDay) %/% sInHour, minutes = (nSeconds %% sInHour) %/% sInMinute, seconds = nSeconds %% sInMinute) return(lapply(result, '*', sign)) }

- decomposeTime() without using %% or %/%

decomposeTime <- function(inputSeconds) { # Some constants: secondsPerMinute = 60 secondsPerHour = secondsPerMinute * 60 secondsPerDay = secondsPerHour * 24 # calculate each set: days = floor(inputSeconds / secondsPerDay) residualSeconds = inputSeconds - (days * secondsPerDay) hours = floor(residualSeconds / secondsPerHour) residualSeconds = residualSeconds - (hours * secondsPerHour) minutes = floor(residualSeconds / secondsPerMinute) seconds = residualSeconds - (minutes * secondsPerMinute) return(list(days=days, hours=hours, minutes=minutes, seconds=seconds)) }

### 8.2 Homework 4

- Walk through algorithm:

Example parents

`["A", "B", "C", "D"]`

and`["a", "b", "c", "d"]`

.- Choose one parent at random. Flip a coin
- Read one value from chosen parent and copy to offspring
- Test for recombination, if it occurs, start reading from other parent
- Go to step 2
- Done when length(child) is length(parent)

- Some ideas for testing?

What must be true?

- Allele at each position must be same as either mother or father's at that same position. You could write a test for that.
- What should be distribution of parental runs lengths?

- Testing example

## test if distribution of run lengths looks like a bernoulli process ## (geometric distribution). As length -> infinity this should be true. # Mean should be 1 / p # Variance should be (1 - p) / p^2 mother <- rep(0,10000) father <- rep(1,10000) child <- mateVectors(mother, father, 0.05) runs <- rle(child) runlengths <- runs$lengths mean(runlengths) # 1 / 0.05 = 20 var(runlengths) # (1 - p) / p^2 = 380 when p=0.05

### 8.3 Creation and merging

- What is a data frame?

Has characteristics of lists and matrices.

kids <- c("Jack","Jill") ages <- c(10,12) d <- data.frame(kids, ages, stringsAsFactors=FALSE) d d[[1]] d$kids d[,1]

kids ages 1 Jack 10 2 Jill 12 [1] "Jack" "Jill" [1] "Jack" "Jill" [1] "Jack" "Jill"

- Add to data frame

rbind and cbind work as with matrices

d <- rbind(d, list("Laura", 11)) d

kids ages 1 Jack 10 2 Jill 12 3 Laura 11

See also

`bind_rows`

and`bind_cols`

from the`dplyr`

package.`rbind_rows`

will concatenate a list of data frames filling missing columns with**NA**. I use this most of the time. - Merging

kids <- c("Jack","Joan","Laura") states <- c("CA","TX","TX") d2 <- data.frame(kids,states) merge(d, d2)

kids ages states 1 Jack 10 CA 2 Laura 11 TX

- Merging (all, all.x, all.y)

d3 <- merge(d, d2, all = TRUE) names(d3) <- c("name","age","state") d3

name age state 1 Jack 10 CA 2 Jill 12 <NA> 3 Joan NA TX 4 Laura 11 TX

- Merging (all.x)

d4 <- merge(d, d2, all.x = TRUE) d4

kids ages states 1 Jack 10 CA 2 Jill 12 <NA> 3 Laura 11 TX

### 8.4 Subsetting

- Data frames: subsetting

More concise than using

`[`

. Names assumed to be columns.subset(d3, state=="TX") # subset(d3, age > 10)

name age state 3 Joan NA TX 4 Laura 11 TX name age state 2 Jill 12 <NA> 4 Laura 11 TX

- Data frames: add new columns

`d3$birth.year <- 2012 - d3$age d3`

name age state birth.year 1 Jack 10 CA 2002 2 Jill 12 <NA> 2000 3 Joan NA TX NA 4 Laura 11 TX 2001

- More on data frames

### 8.5 Factors

- Grouping variables

# head(iris) class(iris$Species) levels(iris$Species)

[1] "factor" [1] "setosa" "versicolor" "virginica"

- Factors vs strings

- We often store factors as strings in a csv file
- I suggest always leaving them as strings until factor is needed (use
`stringsAsFactors=FALSE`

in`read.csv()`

) - Factors stored as integers internally

- Factors vs strings

v <- c("3", "2", "10") factor(v) vf <- as.factor(v) vf as.numeric(vf[3]) ## How did "10" become [1] ?

[1] 3 2 10 Levels: 10 2 3 [1] 3 2 10 Levels: 10 2 3 [1] 1

- Factors: reassign levels

v <- c("Med", "High", "Low", "Low", "Med") factor(v) vf <- factor(v, levels=c("Low", "Med", "High")) vf

[1] Med High Low Low Med Levels: High Low Med [1] Med High Low Low Med Levels: Low Med High

- Factors: rename levels

# vf levels(vf) <- c("low", "medium", "high") vf

[1] medium high low low medium Levels: low medium high

- split()

split(iris$Sepal.Length,iris$Species)

$setosa [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 [20] 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 [39] 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 $versicolor [1] 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 [20] 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 [39] 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 $virginica [1] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 [20] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 [39] 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9

- tapply()

tapply(iris$Sepal.Length, iris$Species, sd)

setosa versicolor virginica 0.3524897 0.5161711 0.6358796

- Subsetting a dataframe

x <- subset(iris, Species == "virginica") head(x)

Sepal.Length Sepal.Width Petal.Length Petal.Width Species 101 6.3 3.3 6.0 2.5 virginica 102 5.8 2.7 5.1 1.9 virginica 103 7.1 3.0 5.9 2.1 virginica 104 6.3 2.9 5.6 1.8 virginica 105 6.5 3.0 5.8 2.2 virginica 106 7.6 3.0 6.6 2.1 virginica

- by()

colMeans(x[ , 1:4])

Sepal.Length Sepal.Width Petal.Length Petal.Width 6.588 2.974 5.552 2.026

by(iris[ , 1:4],iris$Species,colMeans)

iris$Species: setosa Sepal.Length Sepal.Width Petal.Length Petal.Width 5.006 3.428 1.462 0.246 ------------------------------------------------------------ iris$Species: versicolor Sepal.Length Sepal.Width Petal.Length Petal.Width 5.936 2.770 4.260 1.326 ------------------------------------------------------------ iris$Species: virginica Sepal.Length Sepal.Width Petal.Length Petal.Width 6.588 2.974 5.552 2.026

- Shrimp again: split()

shrimp <- c("M", "F", "J", "J", "F", "F", "M", "F", "F") split(1:length(shrimp), shrimp)

$F [1] 2 5 6 8 9 $J [1] 3 4 $M [1] 1 7

Frequencies:

sapply(split(1:length(shrimp), shrimp), length) # or use table() print("using table():") table(shrimp)

F J M 5 2 2 [1] "using table():" shrimp F J M 5 2 2

### 8.6 Merging example

- Data often stored in separate files

Consider

oak_wp <- read.csv("http://r-research-tool.schwilk.org/data/oak-water-potentials-simple.csv", stringsAsFactors = FALSE) species <- read.csv("http://r-research-tool.schwilk.org/data/oak-species.csv", stringsAsFactors=FALSE) head(oak_wp, 3) head(species, 3)

site tag spcode date pd.psi md.psi year 1 GATEN 110 QUEM 10/24/2010 NA NA 10 2 GATEN 111 QUEM 10/27/2010 NA NA 10 3 GATEN 113 QUEM 10/24/2010 NA NA 10 genus specific.epithet family spcode display.name CM DM GM 1 Quercus emoryi Fagaceae QUEM Q. emoryi 1 1 NA 2 Quercus gambelii Fagaceae QUGA Q. gambelii 1 1 1 3 Quercus gravesii Fagaceae QUGR2 Q. gravesii 1 1 NA

- We want to plot data but use real names

`oak_wp <- merge(oak_wp, species) head(oak_wp, 2)`

spcode site tag date pd.psi md.psi year genus specific.epithet 1 QUEM GATEN 110 10/24/2010 NA NA 10 Quercus emoryi 2 QUEM GATEN 111 10/27/2010 NA NA 10 Quercus emoryi family display.name CM DM GM 1 Fagaceae Q. emoryi 1 1 NA 2 Fagaceae Q. emoryi 1 1 NA

- "Merging" = "joining"

See

`left_join`

,`inner_join`

, etc in`dplyr`

We'll come back to this.

### 8.7 Data frames and the "tidyverse"

- R has a history

- And that means some aspects have been proven effective, and some not (eg bad defaults like
`drop=TRUE`

) - Hadley Wickham and others have worked on a set of packages that provide a consistent interface. These, too, have a history (eg
`plyr`

,`reshape2`

), but the current best choices are`readr`

,`dplyr`

,`tidyr`

,`tibble`

, and`ggplot2`

. - You can even load all of these at once with the meta-package:
`library(tidyverse)`

- And that means some aspects have been proven effective, and some not (eg bad defaults like
`dplyr`

and`tibbles`

- You can use base R (subset, merge, etc), but
`dplyr`

provides a clean interface for working with data frames - In fact,
`dplyr`

will convert your data frames to efficient "tibbles" - tibbles are just data frames with nicer consistent behavior

- You can use base R (subset, merge, etc), but
- Some dplyr examples to get you started

suppressMessages(library(readr)) suppressMessages(library(dplyr)) oak_wp <- read_csv("http://r-research-tool.schwilk.org/data/oak-water-potentials-simple.csv")

`head(filter(oak_wp, spcode=="QUEM" & year==13), 2)`

# A tibble: 2 x 7 site tag spcode date pd.psi md.psi year <chr> <chr> <chr> <chr> <dbl> <dbl> <int> 1 MDS 238 QUEM 05/23/2013 -0.85 -1.95 13 2 MDS 237 QUEM 05/23/2013 -0.76 -1.16 13

- mutate()

`nd <- mutate(oak_wp, proj_year = year - 9) head(nd, 2)`

# A tibble: 2 x 8 site tag spcode date pd.psi md.psi year proj_year <chr> <chr> <chr> <chr> <dbl> <dbl> <int> <dbl> 1 GATEN 110 QUEM 10/24/2010 NA NA 10 1 2 GATEN 111 QUEM 10/27/2010 NA NA 10 1

- select()

`nd <- select(oak_wp, spcode, date, md.psi) tail(nd, 3)`

# A tibble: 3 x 3 spcode date md.psi <chr> <chr> <dbl> 1 QUHY 06/25/2013 -1.22 2 QUHY 06/25/2013 -1.87 3 QUHY 06/25/2013 -1.33

## 9 Math I

### 9.1 Sorting and sets

`sort`

and`order`

- Use
`order()`

to sort dataframes

d1 <- data.frame(sites = c("MDS", "BGN", "MDN", "LCN"), elevation = c(1900, 2300, 1950, 1750)) d1

sites elevation 1 MDS 1900 2 BGN 2300 3 MDN 1950 4 LCN 1750

- Use
`order()`

to sort dataframes

`y <- order(d1$elevation) d1[y, ]`

sites elevation 4 LCN 1750 1 MDS 1900 3 MDN 1950 2 BGN 2300

- Sets in R

A set is a collection of distinct (unique) objects. In R we simply use a vector.

numbers <- sample(1:10, 20, replace=TRUE) numbers myset <- unique(numbers) myset

[1] 6 3 1 8 6 1 2 3 6 1 9 1 4 4 5 8 7 10 6 4 [1] 6 3 1 8 2 9 4 5 7 10

- Set operations

`union(x,y)`

- Union of sets x and y
`intersect(x,y)`

- Intersection of sets x and y
`setdiff(x,y)`

- All element in x that are not in y
`setequal(x,y)`

- Test for set equality
`c %in% y`

- Test for membership in set, returns boolean vector
`choose(n,k)`

- Number of possible subsets of size k chosen from set of size n
`combn(x,n)`

- Find all combinations of size m in x

- Let's write some functions

- Write a function to calculate the symmetric difference of two sets
- Define a subset operator

### 9.2 Number systems

- A note on integers vs real numbers

R has integers and real numbers ("numeric" class and "double" type). I have sometimes referred to "integers" when we have created vectors such as c(1,2,3), but usually one ends up with the numeric class which stores double-precision floating points. Examples:

typeof(2) typeof(as.integer(2)) typeof(c(1,2,3)) typeof(1:3)

[1] "double" [1] "integer" [1] "double" [1] "integer"

- Positional number systems

- Simply ways to count things. Ours is the base-10.
- There is no symbol "10" in base-10 and no symbol "2" in base-2, etc
- Each column stands for a power of the base (eg powers of 10)

- Binary: base-2

- A single binary number is called a
**B**inary dig**IT**, or**bit**. - Computers perform operations on binary number groups called "words". Today, most computers use 32-, 64-, or 128-bit words.
- Words are subdivided into 8-bit groups called bytes.

- A single binary number is called a
- Hexadecimal: base-16

Dec Hex Binary Dec Hex Binary 0 0 0000 8 8 1000 1 1 0001 9 9 1001 2 2 0010 10 a 1010 3 3 0011 11 b 1011 4 4 0100 12 c 1100 5 5 0101 13 d 1101 6 6 0110 14 e 1110 7 7 0111 15 f 1111 - Hexadecimal example

- Negative numbers

Negative numbers: signed integers

- Takes an extra bit of information because it doubles the size we must be able to store.
- Multiple ways to represent (Use first bit for sign, top half of range, or 2's complement)
- Last way is what is used in computers — but you don't need to worry about representation.

- Integer Overflow

If we have a 1-byte (8 bit) integer in which the first bit stores the sign:

01011101 : 93 + 01101011 : 107 \= 11001000 : -56

### 9.3 Floating point

- Real numbers on computers: floating point

- Problem:

How to represent real numbers with integers? Binary point?

101.011:

1 0 1 . 0 1 1 2 ^{2}2 ^{1}2 ^{0}. 2 ^{-1}2 ^{-2}2 ^{-3} - Solution:

Fix the location of the binary point. Think of scientific notation with fixed precision: \(6.020123 \times 10^{23}\)

In binary: \(\pm\) X.YYYYYY \(\times 2^{\pm N}\)

- Problem:
- IEEE single precision: 32 bits, one word

Simply "float" type in most languages: \(\pm\) 1.YYYYY \(\times 2^{N-127}\)

- assumes X is always 1 (no leading zeros, trick to get one more bit of precision)
- 1 sign bit
- 8 bit biased exponent (n-127)
- 23 bit mantissa (YYYYY)

- "single precision floating point": 32 bits

- Different floating point types

IEEE Standard for Floating-Point Arithmetic (IEEE 754) (1980s):

- Single precision, called "float" in the C language family, has a 23 bit mantissa (so 24 bits of precision with assumed 1).
- Double precision, called "double" in the C language family, occupies 64 bits (8 bytes) and its significand has a precision of 53 with the assumed 1. Sign is 1 bit, exponent is 11 bits and mantissa is 52 bits.

In R there are only 2 numeric types: integer and numeric (double precision). "Single precision" not used much anywhere nowadays.

- Some issues

- Doubles types ("numeric" class) cannot represent all 64-bit integers! But since R uses 32 bit integers, double can store more integers than can the integer type. See =?integer
- Rounding error: The only numbers that can be represented exactly in R's numeric type are integers and fractions whose denominator is a power of 2. Other numbers have to be rounded to (typically) 53 binary digits accuracy.

- Rounding error

64 bits are not enough to store every possible value on number line!

1.0 - 0.9 - 0.1

[1] -2.775558e-17

- Floating point error

As a result, two floating point numbers will not reliably be equal:

`a <- sqrt(2) a * a == 2 a * a - 2`

[1] FALSE [1] 4.440892e-16

The function

`all.equal()`

allows you to compare doubles. It basically tests`abs(a - b) < epsilon value`

. - Testing equality of floating point numbers

Compare

`all.equal`

,`identical`

and`==`

all.equal(a*a, 2) # see documentation for using in if statement identical(a*a, 2.0) all.equal(as.double(8), as.integer(8)) identical(as.double(8), as.integer(8))

[1] TRUE [1] FALSE [1] TRUE [1] FALSE

## 10 Math II and scope

### 10.1 Math functions

- Built-in functions

- exp()
- Exponential function, base e
- log()
- Natural logarithm
- log10()
- Logarithm base 10
- sqrt()
- Square root
- abs()
- Absolute value
- sin()
- Trig functions:
`sin()`

,`cos()`

,`tan()`

,`atan()`

,`atan2()`

- min(), max()
- Min and max within a vector
- pmin(), pmax
- Element-wise maxima and minima of several vectors
- sum(), prod()
- Sum and product of elements of vector
- cumsum, cumprod()
- Cumulative sum or product of all elements in a vector
- round()
- Rounding to integers. See
`floor()`

,`ceiling()`

- factorial()
- Factorial function

- Minimization and Maximization

`nlm()`

and`optim()`

both find minimaf <- function(x) { return(x^2 - sin(x) ) } d <- data.frame(x = seq(-10, 10, 0.01)) d$y <- f(d$x)

`library(ggplot2) ggplot(d, aes(x, y)) + geom_point()`

- Plot of x,y

- Using nlm() (Newton algorithm only)

nlm(f, 5)

$minimum [1] -0.2324656 $estimate [1] 0.4501833 $gradient [1] 4.501954e-07 $code [1] 1 $iterations [1] 6

- Using optim()

`optim(5, f, method = "Brent", lower = -100, upper = 100)`

$par [1] 0.4501836 $value [1] -0.2324656 $counts function gradient NA NA $convergence [1] 0 $message NULL

- Symbolic math: differentiation

There are packages to do symbolic math (and other open source alternatives eg sage) but base R has

`D()`

`D(expression( x^2 + 2*x ), "x")`

2 * x + 2

`D(expression( exp(x^2) ), "x")`

exp(x^2) * (2 * x)

`D(expression(log(2*x) + x^2), "x")`

2/(2 * x) + 2 * x

- D() continued

curve( x^2 + 2*x ,0,10)

- D() continued II

`curve(eval(D(expression( x^2 + 2*x ), "x") ), 0, 10)`

- Simple calculus: numeric integration

`integrate()`

f <- function(x) x^2 integrate(f,0,1)

0.3333333 with absolute error < 3.7e-15

### 10.2 Statistical distributions

- Distribution functions

Distribution pdf cdf quantiles rand. numbers Uniform dunif() punif() qunif() runif() Normal dnorm() pnorm() qnorm() rnorm() Chi square dchisq() pchisq() qchisq() rchisq() Binomial dbinom() pbinom() qbinom() rbinom() Poisson dpois() ppois() qpois() rpois() Student's T dt() pt() qt() rt() F dist. df() pf() qf() rf() and more!

For continuous distributions, the most useful functions are the "p" (cumulative density) and "q" (quantile = inverse cdf) functions.

- Using distributions

\(95^{th}\) percentile of the Chi square distribution with 2 degrees of freedom:

qchisq(0.95,df=2)

[1] 5.991465

\(5^{th}\) and \(95^{th}\) percentiles of the normal distribution with mean zero and sd of one:

qnorm( c(0.05, 0.95) )

[1] -1.644854 1.644854

- Example: Binomial Distribution

Probability of exactly 2 heads out of 12 coin flips:

dbinom(2, size=12, prob=0.5)

[1] 0.01611328

Should be same as probability of 10 heads:

dbinom(10, size=12, prob=0.5)

[1] 0.01611328

Probability of 1-10 heads:

pbinom(10, size=12, prob=0.5)

[1] 0.9968262

### 10.3 Randomization and simulation

- Random variates

Built-in random number generators for many statistical distributions.

- Expectation

find \( E[max(X,Y)] \) when X,Y are normal random variables mean=0 sd=1

sum <- 0 nreps <- 100000 for (i in 1:nreps) { xy <- rnorm(2) sum <- sum + max(xy) } print(sum/nreps)

[1] 0.5663903

# or store in vector: v <- replicate(nreps, max(rnorm(2))) hist(v)

- Plot the results

- Setting the random number seed

Important for debugging!

set.seed(100) mean(rnorm(10000)) mean(rnorm(10000)) set.seed(100) mean(rnorm(10000))

[1] 0.00387737 [1] -0.01183919 [1] 0.00387737

### 10.4 Linear Algebra

- Linear Algebra

Matrix multiplication:

a <- matrix(c(1,3,2,4), ncol=2) b <- matrix(c(1,0,-1,1), ncol=2) a %*% b

[,1] [,2] [1,] 1 1 [2,] 3 1

- Linear algebra

Solve system of equations:

\begin{array}{lcl} x + y & = & 4 \\ x - y & = & 2 \end{array}a <- matrix(c(1,1,1,-1), ncol=2) b <- c(4, 2) solve(a, b)

[1] 3 1

- Some linear algebra functions

`crossprod()`

- Inner product (dot product) of two vectors. Not the real crossproduct.
`t()`

- Matrix transpose
`qr()`

- QR decomposition
`chol()`

- Cholesky decomposition
`det()`

- Determinant
`eigen()`

- Eigenvalues / eigenvectors
`diag()`

- Extracts the diagonal of a square matrix (useful for obtaining variances from a covariances matrix and for constructing a diagonal matrix)
`sweep()`

- Numerical analysis sweep operations (a more capable and complicated "apply" type function)

### 10.5 Scope

- Where do names apply?

- Nested functions

w <- 12 f <- function(y) { d <- 8 h <- function() { return(d * (w+y) ) } print(environment(h)) return(h()) } environment(f) ls()

<environment: R_GlobalEnv> [1] "f" "w"

Note: confusingly, we refer to the top environment as .GlobalEnv in code.

- Scope hierarchy

- Scope hierarchy con.

f <- function(y, ftn) { d <- 8 print(environment(ftn)) return(ftn(d, y)) } h <- function(dd, yy) { return(dd * (w + yy)) } w <- 12 f(3, h)

<environment: R_GlobalEnv> [1] 120

### 10.6 Superassignment

- Scope and assignment

In general, R functions do not change variables at a higher scope

x <- 1 increment.x <- function() { print(x) x <- x+1 print(x) } increment.x() x

[1] 1 [1] 2 [1] 1

- Super assignment

increment.x <- function() { x <<- x+1 } x increment.x() x

[1] 1 [1] 2

- assign()

Write to upper level variables using character strings for names

filelist <- list.files(pattern="*.org$") #filelist for(s in filelist) { assign(s, readLines(file(s)) ) } ls()

[1] "filelist" "r-course-outline.org" "s"

## 11 Style and Debugging

### 11.1 Programming style and organization

- Programming style

- Good style aids maintenance

- Use comments to tell the readers what they need to know
- Use indentation to mark the levels of program control
- Use meaningful names
- Develop a convention for names (global variables vs other variables vs function names)
- Use standard idioms where appropriate
- Avoid unnecessary complexity. Avoid code repetition!

- Good style aids maintenance
- Coding style

- Function Documentation

For example, Google's style:

CalculateSampleCovariance <- function(x, y, verbose = TRUE) { # Computes the sample covariance between two vectors. # # Args: # x: One of two vectors whose sample covariance is to be calculated. # y: The other vector. x and y must have the same length, greater than one, # with no missing values. # verbose: If TRUE, prints sample covariance; if not, not. Default is TRUE. # # Returns: # The sample covariance between x and y. n <- length(x) # Error handling if (n <= 1 || n != length(y)) { stop("Arguments x and y have invalid lengths: ", length(x), " and ", length(y), ".") if (TRUE %in% is.na(x) || TRUE %in% is.na(y)) { stop(" Arguments x and y must not have missing values.") covariance <- var(x, y) if (verbose) cat("Covariance = ", round(covariance, 4), ".\n", sep = "") return(covariance)

- Organizing tasks

- Modularization

Modularization is the process programmers use to break up complex programs into smaller more manageable procedures.

- Constants

- Global variables and constants

It can be useful to put some infrequently changing constants at the top of a script file:

GRID_SIZE <- 10 NUM_ITERATIONS <- 10000

Then use these global variables:

makeGrid <- function(gridSize = GRID_SIZE) { g <- matrix(1:(gridSize^2), nrow=gridSize) return(g) }

### 11.2 Debugging approaches

- Debugging

"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."

— Brian W. Kernighan.

- What is a bug?

- Code fails (produces error)
- Code runs but produces the wrong result

- Debugging

Differential diagnosis

- Characterize the error
- Localize the error
- Fix the error (and don't introduce a new bug!)

- Characterizing the bug

- Make the error reproducible

- Can we always get the error when re-running the same code and values?
- If we start the same code in a clean workspace, does the same thing happen?

- Bound the error

- How much can we change the inputs and get the same error? A different error?
- For what inputs (if any) does the bug go away?
- How big is the error?

- Get more information

Add extra output (e.g., number of optimization steps, did the loop converge, final value of optimized function)

- Make the error reproducible
- Localizing the bug

Tools:

- traceback (where did an error message come from?); print, warning, stopifnot (messages from the code as it goes)
- Trying controlled inputs
- Interactive debugging

### 11.3 `traceback()`

- traceback

- When an R function fails, an error is printed to the screen. Immediately after the error, you can call traceback() to see in which function the error occurred.
- The traceback() function prints the list of functions that were called before the error occurred.
- The functions are printed in reverse order.

- traceback example

f <- function(x) { r <- x - g(x) r } g <- function(y) { r <- y * h(y) r } h <- function(z) { r <- log(z) if(r < 10) r^2 else r^3 } f(-1) traceback()

- traceback results

Error in if (r < 10) r^2 else r^3 (from #3) : missing value where TRUE/FALSE needed In addition: Warning message: In log(z) : NaNs produced 3: h(y) at #2 2: g(x) at #2 1: f(-1)

### 11.4 Interactive Debugging

- Interactive debugging

`traceback()`

does not always tell you where the error originated. In order to know which line causes the error, you may want to step through the function using debug(). `debug()`

and`browser()`

`debug(foo)`

flags the function`foo()`

for debugging.`undebug(foo)`

unflags the function.- When a function is flagged for debugging, each statement in the function is executed one at a time. After a statement is executed, the function suspends and you can interact with the environment.
- After you obtained necessary information from the environment, you can let the function to execute the next statement.
- You can call
`browser()`

from any point in your code to enter the debugger. This is useful if you know about where the bug is and you don't want to start at the beginning of the function.

- debug example

## buggy version: mateVectors <- function(x, y, r) { child <- y usingX <- runif(1) <= 0.5 for (i in x) { if(usingX) { child[i] <- x[i] } if(runif(1) <= r) { !usingX } } return(child) }

- Trying the buggy verson

set.seed(1001) mateVectors(1:10, 11:20, 0.4) mateVectors(c("A","B","C"), c("a","b","c"),0.3)

[1] 11 12 13 14 15 16 17 18 19 20 A B C "a" "b" "c" NA NA NA

- Let's debug this function

- Debugging using RStudio

RStudio includes a visual debugger. Have you noticed the "environment" window? It shows values for variables in the current environment. When debug is active, this shows the function being debugged.

- RStudio’s error inspector and
`traceback()`

- RStudio’s "Rerun with Debug" tool and
`options(error = browser)`

which open an interactive session where the error occurred. - RStudio’s breakpoints and
`browser()`

which open an interactive session at an arbitrary location in the code.

- RStudio’s error inspector and

## 12 Strings

### 12.1 "Text" on computers

- How to represent characters?

Simple idea: numeric code for each character. Originated at Bell labs in 1960.

- ASCII table

0–31 used for "unprintable" characters (BEEP! page feed, etc)

- All was good

Assuming you were an English speaker.

- ANSI standard

- Same as ASCII for 0-127
- But different "code pages" for upper set

- Unicode

Basic idea: two bytes. Then expanded to more (up to 3 bytes) Enough to store every character in every language.

- Originally: \(2^{14} = 16,384\) (2 bits reserved)
- Currently: 110,182 code points
- Every letter maps to a "code point" — a platonic ideal of a letter.
- "A" is different from "a", but "A" in Times New Roman is the same as "A" in Arial
- "Hello" =
`U+0048 U+0065 U+006C U+006C U+006F`

- Still the issue of exactly how to
*store*the numbers

- Encodings

- There were multiple standards for STORING the numbers. Original, UCS-2, uses 2 bytes.
- Americans were lazy, so:

- UTF-8

Idea: use one byte for code points 0-127, and only use two for points above that. UTF-8 is the de facto standard encoding for unicode in US.

You can read in utf-8 encoded text in R. In fact utf-8 may be the current default ("locale"):

Sys.getlocale()

[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"

- Unicode strings

a <- "\U00B5" # note that "\" is escape char in R strings a b <- c("\U03C0", "π") b

[1] "µ" [1] "π" "π"

- You can use unicode (utf-8) in R scripts

But may be awkward to enter depending on keyboard and text editor

pi symbol won't show below because of my export system!

π <- 3.14 r <-2 2*r*π

[1] 12.56

### 12.2 Strings

- Characters in R

- until 2004, R only used ASCII!
- R now supports UTF-8 (make sure "locale" is set correctly on your computer) — works a bit differently on Windows vs Linux
- So you can use Chinese, Russian, etc.
`Sys.setlocale("LC_CTYPE","chinese")`

- Character objects (strings)

A character object is used to represent string values in R. We convert objects into character values with the

`as.character()`

function:x = as.character(pi) x typeof(x)

[1] "3.14159265358979" [1] "character"

- Formatting strings

using paste():

first <- "Dylan" last <- "Schwilk" n <- paste(first, last) n paste(last, first, sep=", ")

[1] "Dylan Schwilk" [1] "Schwilk, Dylan"

- Formatting strings

using sprintf():

s <- sprintf("%s has %d dollars", n, 50) s sprintf("%s has $%2.2f", n, 50)

[1] "Dylan Schwilk has 50 dollars" [1] "Dylan Schwilk has $50.00"

- Substrings

`substr()`

and`substr<-()`

substr(s, 7, 13)

[1] "Schwilk"

Assignment:

substr(s, 7, 13) <- "SCHWILK" s substr(s, 7, 13) <- "???" s

[1] "Dylan SCHWILK has 50 dollars" [1] "Dylan ???WILK has 50 dollars"

- Substitution

sub("???WILK", "S.", s, fixed=TRUE)

[1] "Dylan S. has 50 dollars"

### 12.3 String operations

- Converting strings

tolower(s) toupper(s)

[1] "dylan ???wilk has 50 dollars" [1] "DYLAN ???WILK HAS 50 DOLLARS"

- Built-in string functions

`grep()`

- Searches for a substring, like the Linux command of the same name
`nchar()`

- Finds the length of a string
`paste()`

- Assembles a string from parts
`sprintf()`

- Also assembles a string from parts
`substr()`

- Extracts a substring
`strsplit()`

- Splits a string into substrings

- Example: testing for a file suffix

# tests whether the file name fnname has the suffix suffix, # e.g. "abc" in "x.abc" hasSuffix <- function(fname, suffix) { ncf <- nchar(fname) # nchar() gives the string length dotpos <- ncf - nchar(suffix) + 1 # dot would start here if there is one # now check that stuff is at the end of the name return(substr(fname, dotpos, ncf) == suffix) } hasSuffix("file.csv", "csv") hasSuffix("file.csv", "txt")

[1] TRUE [1] FALSE

- grep() and grepl()

Takes its name from the linux/unix command.

v <- c("beach", "beech", "back", "buster", "abea") grep("bust", v) v[grep("bust", v)]

[1] 4 [1] "buster"

`grepl("bust", v)`

[1] FALSE FALSE FALSE TRUE FALSE

### 12.4 Regular expressions

- Regular Expressions

Sometimes we need to look for patterns, not specific strings

- "regex"

- Come from regular set theory (1950s)
- Formal way of specifying a set string matches
- Really are a whole mini computer language
- Used in many command line tools (eg grep), and implemented as at least a library in most computer languages
- Most text editors and word processors allow regex search and replace

- "regex"
- regex basics

- Single character wildcard: "."
- Single match from multiple options:
`[a-z]`

or`[acdf-h]`

. Matches**one**of the characters in brackets or implied by range `[a-zA-Z]`

should match any ASCII alphabetic letter

- Positional matching

`^`

matches beginning of line`$`

matches end of line`\b`

matches word boundary (space or punctuation, etc)

- Positional matching example

x <- unlist(strsplit("Out in the west Texas town of El Paso", split=" ")) # capitalized words: grep("^[A-Z]", x)

[1] 1 5 8 9

- Matching multiples

Modify preceding character or expression

- "*" Matches any number of preceding expression (zero or more!)
- "?" matches one or none
- "+" matches one or more
`{n}`

matches n times- ={n,} matches n or more times
- ={n,m} matches n to m times

- Examples

x <- c("999xyz", "000", "A123", "a123", "1way") grep("[0-9]{3}", x) grep("[0-9]{3}$", x) grep("^[0-5]+", x)

[1] 1 2 3 4 [1] 2 3 4 [1] 2 5

- Grouping

- Use parentheses to group

v <- c("aba", "abba", "aabba", "accc", "ababa") grep("(ab)+a", v)

[1] 1 5

- Boolean OR

v <- c("beach", "beech", "back", "buster", "abea") grep("be(e|a)ch", v)

[1] 1 2

- Using groups

Find cases thrice repeated numerals

x <- c("999xyz", "000", "A123", "a123", "1way", "aaa", "a21113") grep("([0-9])\\1\\1", x)

[1] 1 2 7

- Use parentheses to group
- Regular expressions in R

- R string functions (and functions in the
`stringr`

package) can handle regular expressions. Use fixed=TRUE to avoid. - R includes some predefined character classes (eg
`[:digit:]`

,`[:punct:]`

). See`?regex`

- In regexes, one must escape special characters with a backslash. But keep in mind that R will first interpret the string before it hands it to the regex engine. So to match a literal "?" use
`\\?`

- R string functions (and functions in the
- Escaping example

v <- c("The first", "firstofall", "first time", "Who's on first?", "bfirstb") grep("first", v) grep("first\\b", v) grep("^first\\b", v)

[1] 1 2 3 4 5 [1] 1 3 4 [1] 3

- Text searches

library(gutenbergr) oos <- gutenberg_download(1228) # origin of species subset(oos, grepl("Galapagos (Islands?|Archipelago).+bird", text))

# A tibble: 3 x 2 gutenberg_id text <int> <chr> 1 1228 in the Galapagos Islands nearly every land-bird, but only two o… 2 1228 Galapagos Islands reptiles, and in New Zealand gigantic wingles… 3 1228 genera. In the Galapagos Archipelago, many even of the birds, t…

### 12.5 `stringr`

Package

- Why the need?

- R string functions are inconsistent

Remedy: Hadley Wickham's

`stringr`

librarylibrary(stringr) s <- c("(806) 777-7777", "(401) 555-5555", "(806) 444-4444") number_pattern <- "\\(([0-9]{3})\\) ([0-9]{3}-[0-9]{4})" str_match(s, number_pattern )

[,1] [,2] [,3] [1,] "(806) 777-7777" "806" "777-7777" [2,] "(401) 555-5555" "401" "555-5555" [3,] "(806) 444-4444" "806" "444-4444"

- R string functions are inconsistent
- Basic string operations with stringr

`str_c()`

- like
`paste`

, but it uses the empty string as the default separator and silently removes zero length arguments `str_length()`

- is equivalent to
`nchar`

, but it preserves NAs and converts factors to characters (not integers) `str_sub()`

- is equivalent to
`substr()`

but it returns a zero length vector if any of its inputs are zero length, and otherwise expands each argument to match the longest. It also accepts negative positions, which are calculated from the end of the string. The end position defaults to -1, which corresponds to the last character. `str_sub<-`

- is equivalent to
`substr<-`

, but like`str_sub`

it understands negative indices, and replacement strings not do need to be the same length as the string they are replacing `str_trim`

- Trim leading and trailing whitespace

`str_c()`

str_c("Hello", " ", "world", "!") v <- c("A", "B", "C") str_c(v,v)

[1] "Hello world!" [1] "AA" "BB" "CC"

`str_length()`

v <- factor(c("Aa", "Bb", "C", NA)) nchar(v) str_length(v)

Error in nchar(v) : 'nchar()' requires a character vector [1] 2 2 1 NA

`str_sub()`

s <- "What do you do with a drunken sailor?" str_sub(s, -nchar("sailor?"), -1 ) <- "student?" s

[1] "What do you do with a drunken student?"

## 13 The Grammar of Graphics and ggplot I

### 13.1 Grammar of graphics

- Grammer of Graphics

**The Grammar of Graphics.**Leland Wilkinson, 1999, 2005. Springer. - What is a "plot"?

A plot is

- A data set and a set of mappings from variables to aesthetics
- One or more layers
- One scale for each aesthetic mapping
- A coordinate system (we won't worry about this and just use cartesian coordinates!)
- The faceting specification (how many subplots)

- Example scatterplot

A simple dataset. Plot A vs C with symbol shape determined by categorical variable, D

A B C D 2 3 4 a 1 2 1 a 4 5 15 b 9 10 80 b - Map variables to aesthetic space

- Next steps

- The parts

- The plot

- Grammar

We can produce grammatically correct plots that are nonsensical, just as in English we can produce grammatically correct nonsense

- Nonsense plot

ggplot(mpg, aes(displ, hwy)) + geom_line()

- Learning ggplot2

- ggplot2 mailing list, http://groups.google.com/group/ggplot2
- stackoverflow, http://stackoverflow.com/tags/ggplot2
- Cookbook for common graphics, http://wiki.stdout.org/rcookbook/Graphs/

- ggplot2 book

*ggplot2: Elegant Graphics for Data Analysis*by Hadley Wickham is free from Springer on TTU network: http://www.springerlink.com/content/978-0-387-98140-6/

### 13.2 Components of a plot

- Components of a plot

- Scales, facets, coordinate system

- We need a scale for each aesthetic in the plot. A scale operates across all data in one plot.

- A scale is a function and its inverse, along with some parameters
- Example: a color gradient scale
- The inverse gives us the axes (for position scales) or the legend (for other scales such as colors or shapes)

- You've seen facets and we will stick with cartesian coordinates for now.

- We need a scale for each aesthetic in the plot. A scale operates across all data in one plot.
- Layers

composed of five parts:

- data
- aesthetic mapping (what is x, y, color, shape, etc).
`aes()`

command - statistical transformation (
`stat`

) - geometric object (
`geom`

) (geoms have default stats and vice versa) - a position adjustment

### 13.3 Building a plot by layers

- ggplot() function

Use the diamonds data set

library(ggplot2) p <- ggplot(diamonds, aes(carat, price, color =cut)) #p

Blank plot

- Adding layers

p <- p + layer(geom = "point", stat= "identity", position = "identity") p

- More complicated plot

p <- ggplot(diamonds, aes(x=carat)) p <- p + layer(geom = "bar", stat = "bin", position = "identity", params = list(fill = "steelblue", binwidth=1), ) p

- Result:

- Shortcuts!

Every geom is associated with a default statistic and position, and every statistic with a default geom:

p <- ggplot(diamonds, aes(x=carat)) p <- p + geom_histogram(binwidth = 1, fill = "steelblue") p

- Result:

- ggplot book

http://www.springerlink.com/content/978-0-387-98140-6/

Tables 4.2 and 4.3 have default stats and geoms

- Layer shortcuts

- Shortcut functions

`geom_XXX(mapping, data, ..., geom, position)`

`stat_XXX(mapping, data, ..., stat, position)`

- Descriptions:

- mapping
- A set of aesthetic mappings using
`aes()`

- data
- A dataset to override the default one
`...`

- Parameters for the geom or the stat
- geom or stat
- You can override defaults
- position
- Method for adjusting overlapping objects

- Shortcut functions
- Layers are objects

bestfit <- geom_smooth(method = "lm", color = "steelblue", size = 2, se = FALSE) ggplot(mpg, aes(cty, hwy)) + geom_point() + bestfit

- Result:

### 13.4 Aesthetics

- Aesthetic mappings

Refer to variables in the dataset only! So, back to our mpg data:

a <- aes(x = displ, y = hwy, color = class) p <- ggplot(mpg, a) + geom_point() p

- Result:

- Mapping vs setting

Because R can handle arguments by position or name this can be confusing:

p <- ggplot(mpg, aes(cty, displ)) # set point color to dark blue p + geom_point(color = "darkblue") # map new unamed variable with value "darkblue" to color p + geom_point(aes(color="darkblue")) # more useful: p + geom_point(aes(color=cyl))

- Individual vs collective geoms

- Individual: Graphical object for each row (eg points)
- Collective: statistical summary, polygon geom
- Lines and smoothed paths somewhere in between

- Grouping

We'll use the Oxboys data set in nlme

library(nlme) # ?Oxboys # head(Oxboys) # class(Oxboys) p <- ggplot(Oxboys, aes(age, height, group = Subject)) + geom_line() p

- Result:

- Different groups for different layers

# try adding lm line for all data p + geom_smooth(method="lm", se=FALSE) # same as: p + geom_smooth(aes(group = Subject), method="lm", se=FALSE) # what we want: p + geom_smooth(aes(group = NULL), method="lm", se=FALSE)

- Default groups on categorical variables

p <- ggplot(Oxboys, aes(Occasion, height)) + geom_boxplot() p + geom_line(aes(group = Subject), color = "#3366FF")

- Result:

- Violin plot:

ggplot(mpg, aes(class,hwy)) + geom_violin()

- Tables of geoms and stats (layer functions)

- geoms and stats: http://docs.ggplot2.org/current/
- aesthetic specifications: http://docs.ggplot2.org/current/vignettes/ggplot2-specs.html

- Stats

Default stat for histogram geom is bin

ggplot(diamonds, aes(carat)) + geom_histogram(aes(y= ..density..), binwidth=0.1) ## with counts ggplot(diamonds, aes(carat)) + geom_histogram(aes(y= ..count..), binwidth=0.1)

- Position adjustment

Adjustment Description dodge Adjust position by dodging overlaps on sides fill Stack overlapping objects and standardize to equal height identity Don't adjust position jitter Jitter points to avoid overlap stack Stack overlapping objects on top of one - Position examples

# stack p <- ggplot(diamonds, aes(carat)) p + stat_bin(aes(ymax = ..count.., fill = cut), binwidth=0.2, position = "stack") # fill p + stat_bin(aes(ymax = ..count.., fill = cut), binwidth=0.2, position = "fill") # dodge p + stat_bin(aes(ymax = ..count.., fill = cut), binwidth=0.2, position = "dodge") # Identity (use line not area) p + stat_bin(aes(ymax = ..count.., color = cut), geom = "line", binwidth=0.2, position = "identity")

- Writing your own layer functions

stat_sum_single <- function(fun, ...) { stat_summary(fun.y = fun, colour="gray15", geom="point", size = 4, ...) } se_bar <- function(){ stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.15, size = 1) } p <- ggplot(iris, aes(Species, Petal.Length)) + se_bar() + stat_sum_single(mean) p p + stat_sum_single(median, size = 3, shape = 5)

- Result:

## 14 The Grammar of Graphics and ggplot II

### 14.1 Scales

- Scales

- Control how data is mapped to perceptual properties, and produce guides (axes and legends) which allow us to read the plot.
- Important parameters: name, breaks & labels, limits.
- Naming scheme:
`scale_aesthetic_name`

. All default scales have name continuous or discrete.

- Scales

Position scales, color scales, manual scales, identity scale

- name
- text string, mathematical expressions (see
`?plotmath`

). Use`labs()`

,`ylab()`

,`xlab()`

- limits
- Domain of the scale, Continuous scales take a vector of length 2, eg
`c(4,10)`

. - breaks
- sets the values that appear on the axis or the legend (tick marks).

- Scales examples

p <- ggplot(mpg, aes(cty, hwy, color = displ)) + geom_point() p p + scale_x_continuous("City mpg") p + xlab("City mpg") p + ylab("Highway mpg") p + labs(x="City mpg", y = expression(paste("Highway ",frac(miles,gal))), color= "Displacement (l)" )

- Result

- Scale transformations

"transformers":

- Manage the transformation and the inverse for legends, values
- Common ones: log, log10, logit, sqrt, exp, reverse
- shorthand for common ones:
`scale_x_log10()`

is equivalent to`scale_x_continuous(trans = "log10")`

- More scales examples

library(scales) # for math_format() library(stringr) # for str_c p <- ggplot(diamonds, aes(carat, price)) + geom_point() p + scale_x_log10() + scale_y_log10() log10breaks <- seq(-0.8,0.8,.2) p + scale_x_log10(breaks = 10^(log10breaks), labels = parse( text=(str_c("10^",log10breaks)))) + scale_y_log10() # natural log p + scale_x_continuous(trans = "log", breaks = trans_breaks("log", "exp"), labels = trans_format("log", math_format(e^.x)))

- Result

### 14.2 Themes

- ggplot Themes

p + theme_bw() theme_nogrid <- theme_bw() + theme( axis.title = element_text(size = 16), axis.text = element_text(size = 13), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank() ) p + theme_nogrid + xlab("Carat") + scale_y_continuous("Price ($)", labels = comma) # ?theme

- Result:

- Some more resources for ggplot tweaking

- ggplot2 themes: http://docs.ggplot2.org/dev/vignettes/themes.html
- Tweaking text on ggplot plots: http://www.r-bloggers.com/textual-healing

### 14.3 Maps

- Maps in ggplot

ggmap by David Kahle and Hadley Wickham: http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

library(maps) library(ggmap) library(RColorBrewer) county_df <- map_data('county') tx.counties <- subset(county_df, region == "texas") names(tx.counties)[6] <- "county"

- Maps in ggplot: TX population

tx.pop <- read.csv("http://demographics.texas.gov/Resources/TPEPP/Estimates/2014/2014_txpopest_county.csv", stringsAsFactors=FALSE) tx.pop$county <- tolower(tx.pop$county) tx.pop <- merge(tx.counties, tx.pop) head(tx.pop,3)

county long lat group order region FIPS census_2010_count 1 anderson -95.75271 31.53560 2492 74520 texas 1 58458 2 anderson -95.76989 31.55852 2492 74521 texas 1 58458 3 anderson -95.76416 31.58143 2492 74522 texas 1 58458 july1_2014_pop_est jan1_2015_pop_est num_chg_10_14 num_chg_10_15 1 58501 58243 43 -215 2 58501 58243 43 -215 3 58501 58243 43 -215 pct_chg_10_14 pct_chg_10_15 1 0.1 -0.4 2 0.1 -0.4 3 0.1 -0.4

- TX population map

tx.map <- ggplot(tx.pop, aes(x=long, y=lat, group=county)) tx.map <- tx.map + geom_path(color="gray") tx.map <- tx.map + geom_polygon(aes(fill=log10(jan1_2015_pop_est))) tx.map

- Result:

## 15 Shaping data and `dplyr`

### 15.1 Shaping data:

### 15.2 The `dplyr`

package

- dplyr

`dplyr`

includes a family of individually simple functions that all have the same structure and all operate on data frames. They employ a bit of magic for convenience that allow non-standard evaluation.Five important verbs (functions)

`select`

certain columns of data`filter`

your data to select specific rows`arrange`

the rows of your data into an order`mutate`

your data frame to contain new columns`summarize`

chunks of you data

- data frames vs "tibbles"

Packages by Hadley Wickham in the "tidyverse" (or "hadleyverse") use "tibbles" instead of R's traditional data.frame. Tibbles are data frames, but they tweak some older behaviors to make life a littler easier.

`library(tibble)`

but you rarely need to do this because dplyr and other packages will load this for you

- Structure

- All verbs take a data frame as the first argument
- Non standard evaluation: reduces typing. But can be dangerous in functions. There are also standard evaluation forms of each function. See https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html

- Some sample data

The Batting data set (in the

`Lahman`

package) contains the batting records for all professional US players. For more information use the command`?Batting`

library(dplyr) library(Lahman) # baseball dataset names(Batting)

[1] "playerID" "yearID" "stint" "teamID" "lgID" "G" [7] "AB" "R" "H" "X2B" "X3B" "HR" [13] "RBI" "SB" "CS" "BB" "SO" "IBB" [19] "HBP" "SH" "SF" "GIDP"

- ID variables

Factors by which we can group the data. For example, let's get mean stats (games, at bats, runs and hits, columns 6-9) for Babe Ruth across all years he played.

`colMeans( filter(Batting, playerID == "ruthba01")[6:9] )`

G AB R H 113.77273 381.72727 98.81818 130.59091

- Another ID variable: year

Let's get mean stats (games, at bats, runs and hits, columns 6-9) across all players for 1871.

colMeans( filter(Batting, yearID == 1871)[6:9] )

G AB R H 19.96522 94.10435 23.12174 26.96522

- Obtain summaries for groups

Finding summaries for one group is easy, but what if we want to find the same summaries for every group?

- use
`dplyr`

allyears <- select(Batting, yearID, G, AB, R, H) years <- group_by(allyears, yearID) yearstats <- summarize_all(years, funs(mean)) head(yearstats, 3)

# A tibble: 3 x 5 yearID G AB R H <int> <dbl> <dbl> <dbl> <dbl> 1 1871 20.0 94.1 23.1 27.0 2 1872 21.2 101. 21.7 28.8 3 1873 28.8 136. 28.6 39.4

- Split-apply-recombine

When do we need to split-apply-recombine?

- During data preparation, when performing group-wise ranking, normalization, etc
- When creating summaries for display or analysis (marginal means, conditioning tables)
- During modelling, when fitting separate models to subsets of the data

### 15.3 Using dplyr functions

- filter

How many players in 1871?

`y1871 <- filter(Batting, yearID==1871) length(unique(y1871$playerID))`

[1] 115

- mutate and summarise functions

`mutate()`

allows us to modify an existing data frame; a row of output per row of input`summarise()`

A row of output per group

- Strategy for grouped calculations

- Extract a single group
- Solve problem for that one group — perhaps write a wrapper function that takes the group and returns the summary or transformation
- Use dplyr functions and group
_{by}to solve it for all groups

- 1-2. Calculate career year for Babe Ruth

Number of years since player began.

baberuth <- filter(Batting, playerID == "ruthba01") baberuth <- mutate(baberuth, cyear = yearID - min(yearID) + 1) head(baberuth, 5)[,c(2, 6:9, 23)]

# A tibble: 5 x 6 yearID G AB R H cyear <int> <int> <int> <int> <int> <dbl> 1 1914 5 10 1 2 1 2 1915 42 92 16 29 2 3 1916 67 136 18 37 3 4 1917 52 123 14 40 4 5 1918 95 317 50 95 5

- 3. Calculate career year for all players

byplayer <- group_by(Batting, playerID) Batting <- mutate(byplayer, cyear = yearID - min(yearID) + 1) Batting <- arrange(Batting, playerID) head(Batting)[, c(1, 2, 6:9, 23)]

# A tibble: 6 x 7 # Groups: playerID [1] playerID yearID G AB R H cyear <chr> <int> <int> <int> <int> <int> <dbl> 1 aardsda01 2004 11 0 0 0 1 2 aardsda01 2006 45 2 0 0 3 3 aardsda01 2007 25 0 0 0 4 4 aardsda01 2008 47 1 0 0 5 5 aardsda01 2009 73 0 0 0 6 6 aardsda01 2010 53 0 0 0 7

- More summaries: Teams per player

Total number of teams in the data frame

length(unique(Batting$teamID))

[1] 149

Number of teams for which each player played:

`nteams <- summarise(byplayer, nteams = length(unique(teamID))) head(nteams, 5)`

# A tibble: 5 x 2 playerID nteams <chr> <int> 1 aardsda01 8 2 aaronha01 3 3 aaronto01 2 4 aasedo01 5 5 abadan01 3

- Do players improve?

- Do players improve? Modeling for one player

Fitting a regression model relating runs per bat (rbi/ab) to career year n

mod.br <- lm(RBI/AB ~ cyear, data=baberuth) mod.br #summary(mod.br)

Call: lm(formula = RBI/AB ~ cyear, data = baberuth) Coefficients: (Intercept) cyear 0.203200 0.003413

- Do players improve? Modeling for all

How to model this for all players?

lmslope <- function(x,y){ linmod <- lm(y~x) coefs <- coef(linmod) return(coefs[2]) } bb <- subset(Batting, AB >= 25) bb <- mutate(group_by(bb, playerID), include = length(yearID) > 10) bb <- filter(bb, include) player_traj <- summarize(bb, slope=lmslope(cyear, RBI/AB)) ggplot(player_traj, aes(x=slope)) + geom_histogram( binwidth=0.005)

- Results:

- Maybe players get better than worse again

modq.br <- lm(RBI/AB ~ cyear + I(cyear^2), data=baberuth) #mod.br summary(modq.br)

Call: lm(formula = RBI/AB ~ cyear + I(cyear^2), data = baberuth) Residuals: Min 1Q Median 3Q Max -0.10600 -0.01994 0.01035 0.04025 0.06291 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.1267872 0.0372004 3.408 0.00295 ** cyear 0.0225163 0.0074510 3.022 0.00701 ** I(cyear^2) -0.0008306 0.0003146 -2.640 0.01613 * --- codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.05295 on 19 degrees of freedom Multiple R-squared: 0.3592, Adjusted R-squared: 0.2917 F-statistic: 5.325 on 2 and 19 DF, p-value: 0.01459

- Quadratic plot

`brplot + stat_smooth(method="lm", formula = y~x+I(x^2))`

- All players, quadratic term

qtermmodel <- function(df){ mod <- lm(RBI/AB ~ cyear + I(cyear^2), data = df) coefs <- data.frame(t(coef(mod))) names(coefs) <- c("int", "slope", "qterm") return(coefs) } bcoefs <- do(bb, qtermmodel(.)) qplot(qterm, data=bcoefs, binwidth=0.0005) + xlim(c(-0.015, 0.015))

- Results: all players, quadratic term

### 15.4 US baby names data

- US baby names data

- Download the data

- All data in zipfile: http://www.ssa.gov/oact/babynames/names.zip
- Zipped file contains one txt file (csv) for each year 1880-2017. Eg: "yob1885.txt"
- My script extracts this data, combines it, and saves the top 1000 names for each sex and year: ../assignments/hw10-clean-bnames.R. You don't need to run it.
- Saved data: ../data/top-1000-baby-names.csv

- Read data

bnames <- read.csv("http://r-research-tool.schwilk.org/data/top-1000-baby-names.csv", stringsAsFactors = FALSE) bnames$sex <- factor(bnames$sex) levels(bnames$sex) <- c("girl", "boy") head(bnames, 3)

name sex year rank percent 1 Mary girl 1880 1 7.764334 2 Anna girl 1880 2 2.861759 3 Emma girl 1880 3 2.201268

### 15.5 Explore data

- Name diversity

Investigate the percentage of all names that the top 1000 names occupy

bn <- group_by(bnames, year, sex) ofall <- summarize(bn, tprop = sum(percent)) qplot(year, tprop, data = ofall, colour = sex, geom = "line", ylab="Percentage of all names")

- Result:

- Look at name trend

a <- aes(year, percent, color = sex) p <- ggplot(subset(bnames, name=="Dylan"), a) + geom_line() p

- Some helper functions

# Return letter at position pos library(stringr) letter <- function(string, pos) { return(str_sub(string, pos, pos)) } # and some consistent axes axes <- list( scale_x_continuous(breaks = c(1900, 1920, 1940, 1960, 1980, 2000)), scale_y_continuous("percent") )

- Trend in first letter

bn <- group_by(bnames, year,sex, first = letter(name, 1)) fl <- summarize(bn, percent=sum(percent)) ggplot(fl, a) + geom_line() + facet_wrap(~ first) + axes

- Result

- Trend in last letter

bn <- group_by(bnames, year,sex, last = letter(name, -1)) ll <- summarize(bn, percent=sum(percent)) ggplot(ll, a) + geom_line() + facet_wrap(~ last) + axes

- Result

- Rise in names that end in n

lastn <- group_by(filter(bnames, letter(name, -1)=="n"), year) lastn <- summarize(lastn, percent=sum(percent)) ggplot(lastn, aes(year, percent)) + geom_line() + axes

- Result

### 15.6 soundex

- What about names that sound alike?

see http://en.wikipedia.org/wiki/Soundex

soundex <- function(name) { name <- toupper(name) name <- unlist(strsplit(name,"")) # split into individual chars sdx <- name[1] dict <- c("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R", "AEIOUHWY") codes <- c("1","2","3","4","5","6",".") for (i in 2:length(name)) { code = codes[grep(name[i], dict, fixed=TRUE)] if (code != str_sub(sdx,-1,-1)) { sdx <- str_c(sdx, code) } } sdx <- gsub(".", "", sdx, fixed=TRUE) sdx <- str_pad(sdx, 3, pad = "0") return(sdx) }

- Get soundex for each name in data

unames <- unique(bnames$name) sdxs <- sapply(unames, soundex) bnames$soundex <- sdxs[ match(bnames$name, unames )] bnames$soundend <- str_sub(bnames$soundex,2)

- So what is the deal with names that end in n?

sdylan <- subset(bnames, soundex==soundex("Dylan")) unique(sdylan$name)

[1] "Delma" "Dillon" "Delano" "Delwin" "Dylan" "Delaney" "Dillan" [8] "Dillion" "Dylon" "Dyllan" "Dallin" "Dilan" "Daylen"

- Plot sound alikes

sdy <- group_by(sdylan, sex, year) sdylan.sum <- summarize(sdy, percent = sum(percent)) ggplot(sdylan.sum, a) + geom_line() + axes

- Result

- Maybe all soundexs that end in "45" + "n"?

sy <- group_by(bnames, year, sex) s45 <- summarize(filter(sy, soundend == "45" & letter(name,-1)=="n"), percent=sum(percent)) ggplot(s45, a) + geom_line() + axes

- Result

## 16 Tidy Data

### 16.1 Tidy data

- First an introduction to the pipe

in magrittr package (required by dplyr and re-exported). It is a bit like the

`|`

pipe in the unix command line. It simply lets the left hand object replace the first argument of the right hand function.`library(dplyr)`

mtcars %>% filter(mpg > 30) # instead of filter(mtcars, mpg > 30)

mpg cyl disp hp drat wt qsec vs am gear carb 1 32.4 4 78.7 66 4.08 2.200 19.47 1 1 4 1 2 30.4 4 75.7 52 4.93 1.615 18.52 1 1 4 2 3 33.9 4 71.1 65 4.22 1.835 19.90 1 1 4 1 4 30.4 4 95.1 113 3.77 1.513 16.90 1 1 5 2

- Causes of messiness

- Example of messy data

- Data storage in "long format"

Storage Meaning Table / File Data set Rows Observation Columns Variables - Common causes of messiness

- column headers are values, not variable names
- multiple variables are stored in one column
- variables are stored in both rows and columns
- multiple types of experimental unit stored in the same table
- one type of experimental unit stored in multiple tables

- Some tools

library(tidyr) ?gather ?separate ?spread library(stringr) ?str_replace ?str_sub ?str_split_fixed library(dplyr) ?arrange

### 16.2 Long vs wide data

- You've seen long vs wide data formats

date species psi.pd psi.md RH.pd RH.md 110810 JUDE2 -2.0 -4.5 88 50 110810 QUGR -1.0 -2.2 90 40 110810 PICE -0.4 -1.7 91 33 Useful for regressing md.psi on pd.psi, but not useful if we want to facet on time of day. So is this tidy? Depends if we think of pd.psi and md.psi as separate variables or not.

- Long format

date species time psi RH 110810 JUDE2 pd -2.0 88 110810 JUDE2 md -4.5 90 110810 QUGR pd -1.0 91 110810 QUGR md -2.2 50 110810 PICE pd -0.4 40 110810 PICE md -1.7 33 - Conceptual framework

- Identifer (id) variables identify the unit that measurements take place on. ID variables are usually discrete, and are typically fixed by design.
- Measured variables represent what is measured on that unit.

- Further step

In truly "long" format there are only id variables and a value column!

date species time variable value 110810 JUDE2 pd psi -2.0 110810 JUDE2 md psi -4.5 110810 QUGR pd psi -1.0 110810 QUGR md psi -2.2 110810 PICE pd psi -0.4 110810 PICE md psi -1.7 110810 JUDE2 pd RH 88 110810 JUDE2 md RH 90 110810 QUGR pd RH 91 110810 QUGR md RH 50 110810 PICE pd RH 40 110810 PICE md RH 33 - Problem: Column headers are values, not variable names

raw <- read.delim("http://r-research-tool.schwilk.org/data/pew.txt", check.names = FALSE, stringsAsFactors = FALSE) head(raw)

religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k 1 Agnostic 27 34 60 81 76 137 122 2 Atheist 12 27 37 52 35 70 73 3 Buddhist 27 21 30 34 33 58 62 4 Catholic 418 617 732 670 638 1116 949 5 Don’t know/refused 15 14 15 11 10 35 21 6 Evangelical Prot 575 869 1064 982 881 1486 949 $100-150k >150k Don't know/refused 1 109 84 96 2 59 74 76 3 39 53 54 4 792 633 1489 5 17 18 116 6 723 414 1529

- Problem: Multiple variables in one column

The WHO tuberculosis data:

library(stringr) raw <- read.csv("http://r-research-tool.schwilk.org/data/tb-simple-2018.csv", stringsAsFactors = FALSE) raw %>% select(1:7) %>% tail()

iso2 year new_sp new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524 6801 ZW 2006 12718 NA NA 215 736 6802 ZW 2007 10583 6 132 138 500 6803 ZW 2008 9830 NA NA 127 614 6804 ZW 2009 10195 NA NA 125 578 6805 ZW 2010 11654 20 130 150 710 6806 ZW 2011 12596 30 122 152 784

## get rid of the "overall count" for now raw$new_sp <- NULL ## rename rest of columns for clarity names(raw) <- str_replace(names(raw), "new_sp_", "")

### 16.3 Reshaping with tidyr

- Variables in rows and columns

Example data, photosynthesis and respiration in two species

`library(tidyr)`

tab1 <- read.csv("http://r-research-tool.schwilk.org/data/carex.csv") head(tab1)

Species Treatment A.1 A.2 A.3 A.4 R.1 R.2 R.3 R.4 1 Pa 0.15 8.36 7.632 5.522 6.300 0.7358 0.5870 0.6236 0.4354 2 Pa 3.00 11.06 11.540 7.914 7.914 NA NA NA NA 3 Pa 15.00 22.32 18.640 17.200 18.400 1.1680 0.9600 1.5460 NA 4 Cs 0.15 15.60 10.240 13.100 10.480 1.4840 0.7682 0.6236 1.3060 5 Cs 3.00 13.74 15.240 13.920 10.500 NA NA NA NA 6 Cs 15.00 9.37 18.760 15.600 10.660 1.0360 1.0640 0.5276 0.8990

But A, A.1 and A.2 all represent measurements of A (photosynthesis)

- Gather data (wide to long format)

`tab.tidier <- tab1 %>% gather(measurement, value, -Species, -Treatment) head(tab.tidier)`

Species Treatment measurement value 1 Pa 0.15 A.1 8.36 2 Pa 3.00 A.1 11.06 3 Pa 15.00 A.1 22.32 4 Cs 0.15 A.1 15.60 5 Cs 3.00 A.1 13.74 6 Cs 15.00 A.1 9.37

- Separate replicate id from measurement type

tab.tidy <- tab.tidier %>% separate(measurement, into = c("measurement", "rep"), sep = "\\.") head(tab.tidy)

Species Treatment measurement rep value 1 Pa 0.15 A 1 8.36 2 Pa 3.00 A 1 11.06 3 Pa 15.00 A 1 22.32 4 Cs 0.15 A 1 15.60 5 Cs 3.00 A 1 13.74 6 Cs 15.00 A 1 9.37

In more complicated cases you may need to use stringr library, regular expressions to pull apart column data. See also argument

`convert`

. - Spreading data (long to wide)

Why?

`tab.spread <- tab.tidy %>% spread(measurement, value) head(tab.spread)`

Species Treatment rep A R 1 Cs 0.15 1 15.60 1.4840 2 Cs 0.15 2 10.24 0.7682 3 Cs 0.15 3 13.10 0.6236 4 Cs 0.15 4 10.48 1.3060 5 Cs 3.00 1 13.74 NA 6 Cs 3.00 2 15.24 NA

- What to do with NAs?

- In long data, some NAs can be implicit (missing rows), but when spread to a wider form we need to make them explicit since there is a cell to fill.
- In other cases, an NA in one form can imply a zero in another aggregate form (eg oak cover on my Chisos transects?!).

## 17 Dates and Times

### 17.1 How to represent time

- Time is tricky to represent

Time is a deceptively complicated system to work with. Do we want it to behave like a number line, or according to complicated conventions?

- Instants

- An instant refers to a specific moment of time.
- Instants are combination of years, months, days, hours, minutes, seconds, and timezone. e.g.,
`2011-03-28 08:44:32 CST`

- There are two main time data types in R:

- POSIXct

The POSIXct data type is the number of seconds since the start of January 1, 1970. Negative numbers represent the number of seconds before this time, and positive numbers represent the number of seconds afterwards.

- POSIXlt

The POSIXlt data type is a vector, and the entries in the vector have the following meanings:

- seconds
- minutes
- hours
- day of month (1-31)
- month of the year (0-11)
- years since 1900
- day of the week (0-6 where 0 represents Sunday)
- day of the year (0-365)
- Daylight savings indicator (positive if it is daylight savings)

### 17.2 lubridate

- lubridate package

More info at http://www.jstatsoft.org/v40/i03/

library(lubridate) # lubridate turns strings into instants with # functions that have y, m, and d in their # names. Use the function whose name matches # the order of the elements in the date ymd("2011-03-28") mdy("03/28/2011") dmy("28032011") ymd_hms("2011:03:28 08:50:30", tz="America/Chicago") #order matters dmy("05-03-2011") mdy("05-03-2011")

- What type of objects do these return?

# what is the object type (class)? class(mdy("05-03-2011"))

[1] "Date"

- Example use

# example: soil <- read.csv("http://r-research-tool.schwilk.org/data/soil-sensors.csv", stringsAsFactors = FALSE, na.strings = c("NA", "#N/A", "")) names(soil) <- c("time", "p1_psi", "p2_VWC", "p3_VWC", "p4_psi") head(soil) class(soil$time)

time p1_psi p2_VWC p3_VWC p4_psi 1 24 May 2012 12:00 PM -36 0.230 0.338 -23 2 24 May 2012 12:30 PM -36 0.229 0.338 -23 3 24 May 2012 01:00 PM -36 0.226 0.337 -23 4 24 May 2012 01:30 PM -37 0.224 0.337 -23 5 24 May 2012 02:00 PM -37 0.222 0.336 -23 6 24 May 2012 02:30 PM -38 0.220 0.335 -24 [1] "character"

- Your turn

Convert time column into date-time objects

- Manipulating instants

now() > ymd("2018-10-10") today() > ymd("2018-10-29") # date object now() > ymd("2018-10-23", tz="America/Chicago")

[1] TRUE [1] FALSE [1] TRUE

- Timezones

Tricky to specify. Use this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones

d <- ymd_hms("2018-10-10 14:30:00", tz="America/Chicago") d with_tz(d, tz="UTC") with_tz(d, tz="America/Whitehorse") force_tz(d, tz="America/Whitehorse")

[1] "2018-10-10 14:30:00 CDT" [1] "2018-10-10 19:30:00 UTC" [1] "2018-10-10 12:30:00 PDT" [1] "2018-10-10 14:30:00 PDT"

- Rounding times

# round floor_date(now(), "month") ceiling_date(now(), "month") round_date(now(), "month") # extract year(now())

[1] "2018-10-01 CDT" [1] "2018-11-01 CDT" [1] "2018-11-01 CDT" [1] 2018

- Accessing components

Date component Accessor Year `year()`

Month `month()`

Week `week()`

Day of year `yday()`

Day of month `mday()`

Day of week `wday()`

Hour `hour()`

Minute `minute()`

Second `second()`

Time Zone `tz()`

- Manipulating instants

now() year(now()) hour(now()) day(now()) yday(now())

[1] "2018-10-23 09:48:26 CDT" [1] 2018 [1] 9 [1] 23 [1] 296

- Instants, continued

wday(now()) wday(now(), label = TRUE) wday(now(), label = TRUE, abbr = FALSE) month(now(), label = TRUE, abbr = FALSE)

[1] 3 [1] Tue Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat [1] Tuesday 7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday [1] October 12 Levels: January < February < March < April < May < June < ... < December

- Upon which day of the week were you born?
- Changing components

# Accessor functions can also be used to change the # elements of a date-time ti <- now() year(ti) <- 1999 hour(ti) <- 23 day(ti) <- 45 tz(ti) <- "UTC" ti

[1] "1999-11-14 23:48:52 UTC"

- Aggregate

## using ddply: library(dplyr) library(ggplot2) soil$date <- floor_date(dmy_hm(soil$time), "day") psi.day <- soil %>% group_by(date) %>% summarize(avepsi = mean(p1_psi, na.rm=TRUE)) ggplot(psi.day, aes(date, avepsi)) + geom_line()

- Result

### 17.3 Time spans

- Time spans

#?dseconds dseconds(140) dhours(32) ddays(2) class(ddays(1))

[1] "140s (~2.33 minutes)" [1] "115200s (~1.33 days)" [1] "172800s (~2 days)" [1] "Duration" attr(,"package") [1] "lubridate"

- Let's use R to set our alarm clock

ymd_hms("2018-01-01 08:30:00", tz = "") + ddays(0:30) # cool

[1] "2018-01-01 08:30:00 CST" "2018-01-02 08:30:00 CST" [3] "2018-01-03 08:30:00 CST" "2018-01-04 08:30:00 CST" [5] "2018-01-05 08:30:00 CST" "2018-01-06 08:30:00 CST" [7] "2018-01-07 08:30:00 CST" "2018-01-08 08:30:00 CST" [9] "2018-01-09 08:30:00 CST" "2018-01-10 08:30:00 CST" [11] "2018-01-11 08:30:00 CST" "2018-01-12 08:30:00 CST" [13] "2018-01-13 08:30:00 CST" "2018-01-14 08:30:00 CST" [15] "2018-01-15 08:30:00 CST" "2018-01-16 08:30:00 CST" [17] "2018-01-17 08:30:00 CST" "2018-01-18 08:30:00 CST" [19] "2018-01-19 08:30:00 CST" "2018-01-20 08:30:00 CST" [21] "2018-01-21 08:30:00 CST" "2018-01-22 08:30:00 CST" [23] "2018-01-23 08:30:00 CST" "2018-01-24 08:30:00 CST" [25] "2018-01-25 08:30:00 CST" "2018-01-26 08:30:00 CST" [27] "2018-01-27 08:30:00 CST" "2018-01-28 08:30:00 CST" [29] "2018-01-29 08:30:00 CST" "2018-01-30 08:30:00 CST" [31] "2018-01-31 08:30:00 CST"

- A problem:

## what about: ymd_hms("2018-03-01 08:30:00", tz = "") + ddays(0:30)

[1] "2018-03-01 08:30:00 CST" "2018-03-02 08:30:00 CST" [3] "2018-03-03 08:30:00 CST" "2018-03-04 08:30:00 CST" [5] "2018-03-05 08:30:00 CST" "2018-03-06 08:30:00 CST" [7] "2018-03-07 08:30:00 CST" "2018-03-08 08:30:00 CST" [9] "2018-03-09 08:30:00 CST" "2018-03-10 08:30:00 CST" [11] "2018-03-11 09:30:00 CDT" "2018-03-12 09:30:00 CDT" [13] "2018-03-13 09:30:00 CDT" "2018-03-14 09:30:00 CDT" [15] "2018-03-15 09:30:00 CDT" "2018-03-16 09:30:00 CDT" [17] "2018-03-17 09:30:00 CDT" "2018-03-18 09:30:00 CDT" [19] "2018-03-19 09:30:00 CDT" "2018-03-20 09:30:00 CDT" [21] "2018-03-21 09:30:00 CDT" "2018-03-22 09:30:00 CDT" [23] "2018-03-23 09:30:00 CDT" "2018-03-24 09:30:00 CDT" [25] "2018-03-25 09:30:00 CDT" "2018-03-26 09:30:00 CDT" [27] "2018-03-27 09:30:00 CDT" "2018-03-28 09:30:00 CDT" [29] "2018-03-29 09:30:00 CDT" "2018-03-30 09:30:00 CDT" [31] "2018-03-31 09:30:00 CDT"

- What went wrong?
- Clock time and the number line

- 3 types of time spans

We can still do math on the time line, as long as we’re specific about how we want to measure time.

- durations (exact times)
- periods (clock times)
- intervals (specific times)

- Durations

Durations measure the exact amount of seconds that pass between two time points.

- Durations

# Use d + the plural of of a unit of time to create # a duration. dminutes(5) dhours(278) #dmonths(4) # why doesn't this work?

[1] "300s (~5 minutes)" [1] "1000800s (~11.58 days)"

- Periods

Periods measure time spans in units larger than seconds. Periods pay no attention to how many sub-units occur during the unit of measurement.

- Periods

# Use the plural of a time unit to create a period. minutes(5) hours(278) months(4) # months are not a problem years(1:10)

[1] "5M 0S" [1] "278H 0M 0S" [1] "4m 0d 0H 0M 0S" [1] "1y 0m 0d 0H 0M 0S" "2y 0m 0d 0H 0M 0S" "3y 0m 0d 0H 0M 0S" [4] "4y 0m 0d 0H 0M 0S" "5y 0m 0d 0H 0M 0S" "6y 0m 0d 0H 0M 0S" [7] "7y 0m 0d 0H 0M 0S" "8y 0m 0d 0H 0M 0S" "9y 0m 0d 0H 0M 0S" [10] "10y 0m 0d 0H 0M 0S"

- Why use periods?

No surprises.

`2010-11-01 00:00:00 + months(1)`

will always equal`2010-12-01 00:00:00`

no matter how many leap seconds, leap days or changes in DST occur in between - Intervals

Intervals measure a time span by recording its endpoints. Since we know when the time span occurs, we can calculate the lengths of all the units involved.

- Intervals examples

# To create an interval, use the special %--% # operator: int <- ymd("2018-01-01") %--% ymd("2019-01-01") # Access and set the endpoints with int_start() and # int_end(). int_start(int) int_end(int) <- ymd("2019-03-14") as.duration(int) as.period(int) # Intervals can be accurately converted to either # periods or durations with as.period() and # as.duration()

[1] "2018-01-01 UTC" [1] "37756800s (~1.2 years)" [1] "1y 2m 13d 0H 0M 0S"

### 17.4 Math with date-times

## 18 Statistical Models

### 18.1 Statistical models

- Definition

- Data
- A collection of variables, each variable being a vector of measurements of some specific trait/characteristic
- Problem
- How does variable \(Y\) depend upon variables \(X_1, X_2 \dots X_n\)
- Model
- The model represents the mathematical relationship between X and Y. It aims to represent the true relationship

- First steps in modeling

- Identify the
**response variable** - Identify the
**explanatory variables** - Are the explanatory variables continuous, categorical or a mixture of both?
- What is the response variable? Continuous, a count or proportion, category, or time-at-death?

- Identify the
- Explanatory variables

- Continuous
- Regression
- Categorical
- Analysis of Variance (ANOVA)
- Mixture
- Analysis of covariance (ANCOVA)

- Response variables

- Continuous
- Normal regression, anova, ancova
- Proportion
- Logistic regression (and variants for counts, binary data)
- Time-at-death
- Survival analysis

### 18.2 Model formulas in R

- Model formulas

- A model formula is a sequence of R variables and symbols that specifies the variables in the model and their relationships
- The formula is a necessary argument to many model-fitting functions in R
- Although it resembles a mathematical formula, the symbols in the "equation" have a different meaning

- Model formula notation

A model formula in R can look like this:

y ~ A + B + A:B

the

`~`

means: "is modeled as a function of" - Example: Linear regression

Setup:

library(ggplot2) x <- runif(1000, 0, 10) y <- 1.5*x + 0.01 + rnorm(1000, 0, 2) d <- data.frame(x=x, y=y) ggplot(d, aes(x, y)) + geom_point()

- Example: Linear regression

`fit <- lm(y ~ x, data = d) fit`

Call: lm(formula = y ~ x, data = d) Coefficients: (Intercept) x 0.01578 1.50354

- Example: Linear regression

summary(fit)

Call: lm(formula = y ~ x) Residuals: Min 1Q Median 3Q Max -5.2281 -1.3923 -0.0896 1.3105 6.9257 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.01578 0.12165 0.13 0.897 x 1.50354 0.02114 71.11 <2e-16 *** --- codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.947 on 998 degrees of freedom Multiple R-squared: 0.8352, Adjusted R-squared: 0.835 F-statistic: 5056 on 1 and 998 DF, p-value: < 2.2e-16

- Extracting information from a fitted object

typeof(fit) class(fit) names(fit) coef(fit)

[1] "list" [1] "lm" [1] "coefficients" "residuals" "effects" "rank" [5] "fitted.values" "assign" "qr" "df.residual" [9] "xlevels" "call" "terms" "model" (Intercept) x 0.1425765 1.4993422

- Residuals

head(residuals(fit))

1 2 3 4 5 6 -0.76364790 -2.29546231 -0.08862918 -2.11099583 -1.98748818 1.50347650

- Intercepts

But wait! R estimated two coefficients and we only had one explanatory variable.

- R assumes that you want to include an intercept term (our call was the same as
`y ~ 1 + x`

) - You can exclude it by adding a
`0`

to the formula (or`-1`

). But you never want to do that (almost never).

- R assumes that you want to include an intercept term (our call was the same as
- Example of no intercept

`fitnoi <- lm(y ~ 0 + x, data = d) summary(fitnoi)`

Call: lm(formula = y ~ 0 + x, data = d) Residuals: Min 1Q Median 3Q Max -5.222 -1.393 -0.084 1.317 6.927 Coefficients: Estimate Std. Error t value Pr(>|t|) x 1.50590 0.01069 140.8 <2e-16 *** --- codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.946 on 999 degrees of freedom Multiple R-squared: 0.952, Adjusted R-squared: 0.952 F-statistic: 1.983e+04 on 1 and 999 DF, p-value: < 2.2e-16

- Let's build some models

# names(mtcars) ggplot(mtcars, aes(disp, mpg)) + geom_point() ggplot(mtcars, aes(wt, mpg)) + geom_point() ggplot(mtcars, aes(disp, wt)) + geom_point() fuel.mod <- lm(mpg ~ disp + wt + wt:disp, data=mtcars) summary(fuel.mod)

- Model results

Call: lm(formula = mpg ~ disp + wt + wt:disp, data = mtcars) Residuals: Min 1Q Median 3Q Max -3.267 -1.677 -0.836 1.351 5.017 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 44.081998 3.123063 14.115 2.96e-14 *** disp -0.056358 0.013239 -4.257 0.00021 *** wt -6.495680 1.313383 -4.946 3.22e-05 *** disp:wt 0.011705 0.003255 3.596 0.00123 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.455 on 28 degrees of freedom Multiple R-squared: 0.8501, Adjusted R-squared: 0.8341 F-statistic: 52.95 on 3 and 28 DF, p-value: 1.158e-11

`anova()`

Test if added variables reduce residual sum-of-squares

anova(fuel.mod)

Analysis of Variance Table Response: mpg Df Sum Sq Mean Sq F value Pr(>F) disp 1 808.89 808.89 134.217 3.4e-12 *** wt 1 70.48 70.48 11.694 0.001942 ** disp:wt 1 77.93 77.93 12.931 0.001227 ** Residuals 28 168.75 6.03 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

- Linear models:
`summary()`

vs`anova()`

- Short:

`summary()`

gives T-test against null that \(\Beta = 0\).`anova()`

does sequential F-test if each variable improves model (lowers residual sum-of-squares) - Ben Bolker (https://stackoverflow.com/questions/12863178/comparing-two-linear-models-with-anova-in-r):

The p-values you get for

`summary()`

of each individual model are the p-values for the effects of each of the parameters in each model, conditional on all the other parameters in that model. If your data are perfectly balanced (which is unlikely in a regression design), you should get the same answers from`summary()`

and`anova()`

, but otherwise the results from anova are generally preferable.

- Short:
`anova()`

to compare models

fuel.mod2 <- lm(mpg ~ disp + wt, data=mtcars) fuel.mod3 <- lm(mpg ~ disp, data=mtcars) anova(fuel.mod3, fuel.mod2, fuel.mod)

Analysis of Variance Table Model 1: mpg ~ disp Model 2: mpg ~ disp + wt Model 3: mpg ~ disp + wt + wt:disp Res.Df RSS Df Sum of Sq F Pr(>F) 1 30 317.16 2 29 246.68 1 70.476 11.694 0.001942 ** 3 28 168.75 1 77.934 12.931 0.001227 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

- Symbols in formulas

`+`

separate effects in a formula `:`

interaction (A:B is interaction of A and B) `*`

main effects plus interactions `^`

crossed `%in%`

nested within `/`

nested within | conditional on (rarer) `A*B`

is equivalent to`A + B + A:B`

- lm() works for categorical variables too

mtcars$cylnum <- factor(mtcars$cyl) levels(mtcars$cylnum) fuel.aov <- lm(mpg ~ cylnum, data=mtcars) anova(fuel.aov)

[1] "4" "6" "8" Analysis of Variance Table Response: mpg Df Sum Sq Mean Sq F value Pr(>F) cylnum 2 824.78 412.39 39.697 4.979e-09 *** Residuals 29 301.26 10.39 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

- Coefficients of categorical variables

summary(fuel.aov)

Call: lm(formula = mpg ~ cylnum, data = mtcars) Residuals: Min 1Q Median 3Q Max -5.2636 -1.8357 0.0286 1.3893 7.2364 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 26.6636 0.9718 27.437 < 2e-16 *** cylnum6 -6.9208 1.5583 -4.441 0.000119 *** cylnum8 -11.5636 1.2986 -8.905 8.57e-10 *** --- codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.223 on 29 degrees of freedom Multiple R-squared: 0.7325, Adjusted R-squared: 0.714 F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09

### 18.3 Beyond the linear model

- Beyond the linear model

- GLM
- Generalized linear modeling allow non-normal errors in response variable through a linear link function
- Mixed effects models
- relax assumption of independence. Model nested observations and random effects
- GLS
- Generalized least squares relax assumption of homogeneity, model the heterogeneity
- GLMM
- combine first and second, above.
- GAM and GAMM
- Generalized additive models and mixed versions

- Generalized Linear Models (glm)

Consist of:

- Error distributions

The distribution of the response around its mean. for ANOVA and regression, this is the normal; for logistic regression it is the binomial

- Link function

The function,

`g`

that shows how the linear function of the explanatory variables is related to the expected value of the response:\(g(\mu) = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n\)

- Variance function

This captures how the variance of the response depends upon the mean

- Error distributions
- Example logistic regression

trees <- data.frame(fire.sev = seq(0.01,1,0.005)) fire.mort <- runif(length(trees$fire.sev)) < trees$fire.sev trees$mortality <- ifelse(fire.mort,1,0) qplot(fire.sev, mortality, data=trees)

- Not a linear relationship!

- Logistic regression

trees.glm <- glm(mortality ~ fire.sev, data=trees, family = binomial()) trees.glm # summary(trees.glm)

Call: glm(formula = mortality ~ fire.sev, family = binomial(), data = trees) Coefficients: (Intercept) fire.sev -2.730 5.081 Degrees of Freedom: 198 Total (i.e. Null); 197 Residual Null Deviance: 275.3 Residual Deviance: 203.4 AIC: 207.4

- Prediction

trees$pred.mort <- predict(trees.glm, type="response") qplot(fire.sev, mortality, data=trees) + geom_line(aes(y =pred.mort) )

### 18.4 Mixed models

- Example: Mixed model

fish <- read.csv("http://r-research-tool.schwilk.org/data/fish-data.csv") head(fish)

aquarium nutrient color.score 1 a low 5.3 2 a low 4.7 3 a low 5.0 4 a low 4.3 5 a low 4.8 6 b low 4.8

- Look at these data

Is each fish by treatment row independent?

- Mixed model using nlme

library(nlme) fit1.lme <- lme(color.score ~ 1, random = ~ 1 | aquarium, method="ML", data=fish) fit2.lme <- lme(color.score ~ nutrient, random = ~ 1 | aquarium, method="ML", data=fish) anova(fit1.lme, fit2.lme)

Model df AIC BIC logLik Test L.Ratio p-value fit1.lme 1 3 139.7463 144.8129 -66.87315 fit2.lme 2 4 122.6974 129.4529 -57.34872 1 vs 2 19.04886 <.0001

- BEWARE! Those p values are wrong.

Beware of p-values in multi-level models. Consider model selection (AIC or BIC, but careful there as well). Consider Markov chain Monte Carlo methods (MCMC) for confidence intervals.

See https://seriousstats.wordpress.com/tag/mcmc-methods/

See package "pbkrtest" for Kenward-Roger approximation

- Using lme4

library(lme4) library(pbkrtest) fit1.lmer <- lmer(color.score ~ 1 + (1 | aquarium), data=fish) fit2.lmer <- lmer(color.score ~ nutrient + (1 | aquarium), data=fish) KRmodcomp(fit1.lmer, fit2.lmer)

F-test with Kenward-Roger approximation; computing time: 0.07 sec. large : color.score ~ nutrient + (1 | aquarium) small : color.score ~ 1 + (1 | aquarium) stat ndf ddf F.scaling p.value Ftest 54.912 1.000 6.000 1 0.0003104 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

- Mixed models in R

- Pinheiro, J. C. & Bates, D. Mixed-Effects Models in S and S-Plus Springer-Verlag, New York, 2000.
- Zuur, A.F., Ieno, E.N., Walker, N., Saveliev, A.A., and Smith, G.M. Mixed Effects Models and Extensions in Ecology with R. Springer. 2009.

## 19 Perception and Color

### 19.1 Graphics and color use

- Exploratory vs communication graphics

- Exploratory graphics

Are for you (not others). Need to be able to create rapidly because your first attempt will never be the most revealing. Iteration is crucial for developing the best display of your data.

- Communication graphics

When you communicate your findings, you need to spend a lot of time polishing your graphics to eliminate distractions and focus on the story. Iteration is crucial to ensure all the the small stuff works well: labels, color choices, tick marks, etc

- Exploratory graphics

### 19.2 Perception

### 19.3 Color

- What are the components of color?

RGB:

- RGB triplets

Notation RGB triplet Arithmetic (1.0, 0.0, 0.0) 8-bit per channel (255, 0, 0) 8- bit as hex `#FF0000`

- RGB

- Other color models?

- Luminance / Lightness

- HCL

- Benefit

- Perceptually uniform
- Hue is unordered. Use evenly spaced hues with equal chroma and luminance to make aesthetically pleasing discrete palettes
- Chroma and luminance are ordered. Easy to make perceptually uniform gradients by varying either (or both).

- Color blindness

- 7-10% of men are red-green color "blind."
- Solutions: avoid red-green contrasts; use redundant mappings; test.
- Color Oracle: http://colororacle.org

### 19.4 Colors in R

- Colors in R

Default in R is an hcl color scale, one can use scales in other packages (eg RColorBRewer) as well.

- Continuous scales

The default continuous scale in ggplot2 is a blue gradient, from low =

`#132B43`

to high =`#56B1F7`

which can be reproduced as:library(ggplot2) scale_color_gradient(low = "#132B43", high = "#56B1F7", space = "Lab")

- Discrete scales

The default discrete scale in ggplot2 is a range of hues from hcl,

scale_color_hue(h = c(0, 360) + 15, c = 100, l = 65, h.start = 0, direction = 1)

- Continuous scales
- Discrete color scales

p = ggplot(iris, aes(Petal.Length, Sepal.Length, color = Species)) + geom_point() p p + scale_color_hue(l=40,c=30) # change luminance and chroma p + scale_color_hue(h = c(90,200), l = 50,c = 50) # change luminance and chroma p + scale_color_hue(h = c(90,200), l = 50,c = 50, direction = -1)

- Results:

- Continuous scales

p <- ggplot(mpg, aes(hwy, cty, color = displ)) + geom_point(size = 3.5, alpha = 0.6, position = "jitter") p p + scale_colour_gradient(low = "tan", high = "red")

- Results:

- Other color palettes

Cynthia Brewer's palettes for discrete scales http://colorbrewer2.org

These are optimized for maps

library(RColorBrewer) display.brewer.all() p <- qplot(carat, price, color = cut, data=diamonds) p + scale_color_brewer(palette = "Blues") p <- ggplot(mpg, aes(reorder(class, -cty), cty, color = cut(displ, breaks = 4))) + geom_boxplot(position = "dodge") p + scale_color_brewer(palette="Dark2")

- Result:

### 19.5 Perception and comparisons

- Some comparisons are easier than others

- Some comparisons are easier than others

- Some comparisons are easier than others

- Some comparisons are easier than others

- Some comparisons are easier than others

- Some comparisons are easier than others

- We perceive relative difference

- It is easier to make nearby comparisons

- Beware of visual illusions

- Boxplots and relatives

- My recommendations

For comparing univariate responses across discrete groups. According to data set size

- small: dotplots or jittered dotplots (a special challenge are small, discrete data sets, where there are lots of repeated values)
- moderate: boxplots
- moderate to large: violin plots, density strips, beanplots, box-percentile plots etc. etc.

## 20 Markup languages

### 20.1 Why plain text?

- For next class:

- Install git on your computer

- Our main goal as scientists

To make our research as reproducible and visible as possible

- The power of plain text

- Ubiquitous
- Usually small in size
- Portable across platforms (and versions)

- Manipulation of text

Most output is based on simple text file. There are two main final formats: HTML and PDF. One is pageless and one is centered on the idea of the printed page.

### 20.2 Markup

### 20.3 LaTeX

- LaTeX

- A set of macros around Tex, a markup language invented by Donald Knuth
- A document preparation system and document markup language.
- Defacto standard for
**math**in academic publishing. Formulae used in HTML pages (e.g., Wikipedia)

- A LaTeX example

%\documentclass[12pt]{article} \begin{document} \pagecolor{white} \section*{My Paper} I just discovered that: \begin{equation} e=mc^2 \end{equation} \begin{table}[h] \centering \caption{My table} \begin{tabular}{lrs} \hline Left & right & center \\ \midrule data & 0.01 & 2.3 \\ \bottomrule \end{tabular} \end{table} \end{document}

- Result

- It isn't as bad as it seems

- Ready for submission

### 20.4 Markdown

- Still, LaTeX is cumbersome

Alternatives?

- Markdown

- Easy to learn http://daringfireball.net/projects/markdown/
- Much more "lightweight" than LaTeX

- Purposes

- LaTeX is for paper — PDF
- Markdown is for HTML (blogs, wikipedia) — but sneakily uses some Latex when needed
- But with pandoc (http://johnmacfarlane.net/pandoc/), Markdown is becoming language of choice among scientists
- You can even export from markdown to MS Word to keep a collaborator happy.

- Markdown
- Markdown example

## My Paper ## I just discovered that: $$e=mc^2$$ ### My Table ### | Left | right | center | |------|-------|--------| | data | 0.01 | 2.3 |

- Which renders as

- Text markup

*italic*, **bold** unordered list: - item 1 - item 2 - item 3 Ordered list: 1. item 1 2. item 2 3. item 3

- Headings

Heading1 ======== # Heading1 # Heading2 -------- ## Heading 2 ## ### Heading 3 ###

- Markdown links

Links: http://daringfireball.net/projects/markdown/syntax [Markdown syntax](http://daringfireball.net/projects/markdown/syntax) [Markdown syntax][mkdn] [mkdn]: http://daringfireball.net/projects/markdown/syntax

- Markdown math

Just use LateX math!

$$\frac{n!}{k!(n-k)!} = \binom{n}{k}$$

- Markdown math

Inline equations just need a single

`$`

In ecology, "true" species diversity is: $ D_q = \frac{1}{\sqrt[q-1]{\sum_{i=1}^R{p_{i} p_{i}^{q-1}}}}

### 20.5 Managing references

- BibTeX

A reference database format

- Very versatile and very powerful (most other markup languages work with bibtex as well)
- Free managers, such as bibdesk or mendeley, are now ubiquitous. See also jabbref. You do not need or want Endnote!

- Bibtex entry

@article{Schwilk+Ackerly-2001, title={Flammability and serotiny as strategies: correlated evolution in pines}, author={Schwilk, Dylan W and Ackerly, David D}, journal={Oikos}, volume={94}, number={2}, pages={326--336}, year={2001}, publisher={Wiley Online Library} }

- In Jabref

- LaTeX and BibTeX example

\section{Introduction} Fire has been a powerful disturbance on the global landscape for hundreds of thousands of years, promoting traits in plants which confer an advantage in the presence of fire \cite{Keeley+Pausas+etal-2011,He+Pausas+etal-2012}. Understanding variation in fire response strategies both across and within fire regimes is a major goal of plant fire ecology \cite{Bond+vanWilgen-1996, Bellingham-2000, Schwilk+Ackerly-2001, Keeley+Pausas+etal-2011}.

- Markdown and BibTeX example

Requires pandoc markdown extensions.

# Introduction # Fire has been a powerful disturbance on the global landscape for hundreds of thousands of years, promoting traits in plants which confer an advantage in the presence of fire [@Keeley+Pausas+etal-2011; @He+Pausas+etal-2012]. Understanding variation in fire response strategies both across and within fire regimes is a major goal of plant fire ecology [@Bond+vanWilgen-1996; @Bellingham-2000; @Schwilk+Ackerly-2001; @Keeley+Pausas+etal-2011].

### 20.6 Which markup to use?

- Which do I use and when:

- Scientific papers
- Depends upon whom I'm writing with. I write in org-mode first. Sometimes LaTeX + BibTeX. I'm starting to use Markdown for some papers (easy conversion to MS word for collaborators).
- Presentations
- LaTeX + Beamer when I need to control details of format, spacing, placement. Org-mode with export via LaTeX + Beamer otherwise. The source for this lecture is plain org-mode (similar to Markdown)
- Websites
- Markdown with html and css templates (via jekyll, eg http://www.schwilk.org) or org-mode export (eg http://r-research-tool.schwilk.org/)
- Other
- org-mode or markdown (documentation, notes, protocols, literate programming)

- Pandoc

So how do we glue everything together and produce wonderful htmls and pdfs out of thin air?

With pandoc: http://johnmacfarlane.net/pandoc/

### 20.7 Scientific workflow

- An RMarkdown workflow

Figure from Kieran Healy: http://kieranhealy.org/resources/

- RMarkdown in RStudio

## 21 Version Control

### 21.1 Setup

- The Bash Shell

Bash is a commonly-used shell that gives you the power to do simple tasks more quickly.

- Windows

Install Git for Windows by download and running the installer this will provide you with both Git and Bash in the Git Bash program.

- Mac OS X

The default shell in all versions of Mac OS X is bash, so no need to install anything. You access bash from the Terminal (found in

`/Applications/Utilities`

). You may want to keep Terminal in your dock for this workshop. - Linux

The default shell is usually

`bash`

, Just open a terminal. There is no need to install anything.

- Windows
- Git

- Windows

Git should be installed on your computer as part of your Bash install (described above).

- Mac OS X

For OS X 10.9 and higher, install Git for Mac by downloading and running the installer. After installing Git, there will not be anything in your

`/Applications`

, as Git is a command line program.For older versions of OS X (10.6-10.8) use the most recent available installer for your OS available here. Use the Leopard installer for 10.5 and the Snow Leopard installer for 10.6-10.8.

- Windows
- Additional information

See my reproducible research workshop:

### 21.2 The shell

- The shell

- What is the shell?

The shell a command line interface which allows you to control your computer.

Tasks are accomplished by entering commands with a keyboard (or by running a plain text script) instead of controlling a graphical user interface.

The most popular shell is Bash ("Bourne Again SHell")

- What is a terminal?

A terminal is a program you run that gives you access to the shell. There are many different terminal programs that vary across operating systems.

- What is the shell?
- Using the shell

Open a terminal (bash)

ls echo "hello" echo "hello" > newfile.txt cat newfile.txt

- Full vs relative paths

pwd cd ~ pwd

- Shell command options and arguments

ls -l ls -al man ls

### 21.3 What is version control?

- Why use version control?

- You already use some sort of version control

- File naming schemes (
*eg*`my-file-July18-2013.doc`

) or by copying folders around - Simple but error-prone
- Does not help with branching, collaboration

- File naming schemes (
- A version control system (VCS) allows you to:

- revert files back to a previous state
- revert the entire project back to a previous state
- review changes made over time
- see who last modified something that might be causing a problem, who introduced an issue and when

- You already use some sort of version control
- Local version control systems (VCS)

- Centralized VCS

- Distributed VCS

### 21.4 git

- Configuring

git config --global user.name "Dylan Schwilk" git config --global user.email "dylan@schwilk.org"

- Getting a "repo" (repository)

cd ~ mkdir newrepo cd newrepo git init touch README.md ls .git -al

- What happened?

Hidden directory called

`.git`

. ├── .git │ ├── branches │ ├── config │ ├── description │ ├── HEAD │ ├── hooks │ │ ├── applypatch-msg.sample │ │ ├── commit-msg.sample │ │ ├── post-update.sample │ │ ├── pre-applypatch.sample │ │ ├── pre-commit.sample │ │ ├── prepare-commit-msg.sample │ │ ├── pre-push.sample │ │ ├── pre-rebase.sample │ │ └── update.sample │ ├── info │ │ └── exclude │ ├── objects │ │ ├── info │ │ └── pack │ └── refs │ ├── heads │ └── tags └── README.md

- Checking status

git status

On branch master Initial commit Untracked files: (use "git add <file>..." to include in what will be committed) README.md nothing added to commit but untracked files present (use "git add" to track)

- git commands

- staging files

git add . git status

On branch master Initial commit Changes to be committed: (use "git rm --cached <file>..." to unstage) new file: README.md

- Committing changes

`git commit -m "initial commit" git status`

On branch master nothing to commit, working directory clean

- git commit

A commit is a snapshot taken from the index (staging area)

**not from the working directory**! - What exists now?

- Clone repo

`git clone https://github.com/schwilklab/protocols.git cd protocols ls`

./ ../ html/ lab-org/ methods/ README.md safety/

- A basic workflow

- Edit files
- Stage the changes (git add)
- Review the changes (git diff)
- Commit the changes (git commit)

- The git repository

- git log

git log --pretty=oneline --abbrev-commit -10

d2ec17b Reduce markup lecture e4d405f Add folder for tangled session files bb2d29e Add shell (Bash) examples and some data 261af83 Clean up birdd files e289d7e Minor edit f2820d3 Scatterplot fig in birdd example 75a4be6 Add slides on Bash shell 66c9a7f Add etherpad link for workshop e23b559 Suggest getting github.com account 7e6a6df Add time and date for workshop

- git diff

`git diff`

- difference between working directory and index (staging area)
`git diff --staged`

- difference between index and last commit

With right options, diff can show changes between any two commits, any two files, etc.

- git diff example

Try:

git diff HEAD~1

- Diffs on GitHub

- Practice

- edit file
- stage
- commit

### 21.5 Working with branches

- Branches

A branch represents an independent line of development.

- Show and create branches

- Using branches

- Switch branches

git checkout <existing-branch>

- Merge

Merging is Git's way of putting a branched history back together again.

- Fast-forward merge

- Three-way merge

- Example fast-foward merge

# Start a new feature git checkout -b new-feature master # Edit some files git add <file> git commit -m "Start a feature" # Edit some files git add <file> git commit -m "Finish a feature" # Merge in the new-feature branch git checkout master git merge new-feature git branch -d new-feature

- Merge vs rebase

### 21.6 Working with remotes

- What is a "remote"

- a version of the repository that hosted somewhere on the internet.
- Remotes have a url (https or ssh), but the can also have names.
- It is common to have only one remote at it is named "origin"

- GitHub

We use GitHub for hosting, so most of of your local repos will be associated with a remote named "origin" and a url something like

`git@github.com:schwilklab/CSC-sky-island-forest.git`

- Working with remotes

When you clone from a remote, git remembers and calls that remote "origin". You can look at .git/config or, better use git commands to view:

git remote -v

origin git@github.com:dschwilk/rr-workshop.git (fetch) origin git@github.com:dschwilk/rr-workshop.git (push)

- Adding a remote

If you need to add a remote to an existing repository:

git remote add rem2 git@github.com:dschwilk/software-tools-course.git

- Fetching and pulling

- git push

git push <remote> <branch name> eg: git push origin master or just: git push

This works if none else has pushed changes to that remote since you last pulled. If there is an error, you will need to pull (fetch and merge) first.

- git and GitHub

idea of "forking"