# R as a Research Tool: Lectures

## Table of Contents

- 1. Introduction
- 2. Plots with ggplot
- 3. Data Structures
- 4. Introduction to functions and more lists
- 5. Control Structures
- 6. Loops and functions II
- 7. Exercises with functions
- 8. Data Frames and Factors
- 9. Math I
- 10. Math II
- 11. Environment, Scope and Style
- 12. Strings
- 13. Debugging
- 14. The Grammar of Graphics and ggplot I
- 15. The Grammar of Graphics and ggplot II
- 16. Shaping data and `dplyr`

- 17. Tidy Data
- 18. Dates and Times
- 19. Statistical Models
- 20. Perception and Color
- 21. Markup languages
- 22. Version Control
- 23. Packages for Statistics
- 24. Extra: Plain text and Emacs
- 25. Example packages

## 1 Introduction

### 1.1 Introduction to course

- Website and homework
- Website
- http://r-research-tool.schwilk.org/
- Homework
- email me. See webpage for format instructions. dylan.schwilk@ttu.edu
- Syllabus
- See handout, link to syllabus

- No textbook, but you can check out these books/ebooks (all linked on the course website):
  - *R for Data Science* by Garrett Grolemund and Hadley Wickham: http://r4ds.had.co.nz/. A great book focused on tidying, visualizing and exploring data in R using the packages of the "hadleyverse" or "tidyverse". I will cover some of the concepts in this book, but our class is more focused on general-purpose programming.
  - Books in the Springer Use R! series are free from a TTU IP address: http://www.springerlink.com/books/
  - *Advanced R Programming* by Hadley Wickham: http://adv-r.had.co.nz/. Available free as a wiki.
  - *The Art of R Programming* by Norman Matloff.

- Me
- Plant ecologist: fire ecology, evolutionary ecology
- On the intertubes:
- http://www.schwilk.org/
- twitter: DylanSchwilk
- github: https://github.com/dschwilk

- Computers
If you don't know anything about computers, just remember that they are machines that do exactly what you tell them but often surprise you in the result.

— Richard Dawkins, The Blind Watchmaker (1986)

- Computer Architecture
- Computer Languages
- Interpreted vs compiled languages
- Compiled
higher level language code is converted into machine code and then 'directly' executed by the host CPU

- Faster execution, but needs compilation time
- No dynamic typing or scoping

- Interpreted (eg R)
Programs are indirectly executed ("interpreted") by an interpreter program

- Platform independence
- Reflection (ability to examine and modify the structure and behavior of an object at run time)
- Dynamic typing

- R
  - R is a computer language for statistical computing
  - R is an **interpreted** language
  - FORTRAN, C, and C++ are all **compiled** languages
  - Other interpreted languages used in science: Perl, **Python**, **Julia**, Matlab (Octave)
  - R and Python are the most commonly used in science

### 1.2 R

- Why R?
- Open source and available on every major platform. This ensures that scientists around the world, not just those in rich countries, are co-owners of the software tools needed to carry out research. It also contributes to reproducible research!
- A language *designed* around data analysis and well suited to interactive exploration of data.
- A massive set of packages. If you want to do something, chances are that someone has already tried to do it.
- The product of thousands of leading experts in the fields they know best, and therefore provides cutting-edge tools.
- Strong and friendly community. R is the de facto standard language in statistics and many areas of biology.

- Shortcomings of R compared to other languages
- There are nutty aspects of the language and many ways to do the same thing. Many R functions use tricks to reduce the amount of typing (interaction!) at the cost of making code that is hard to understand.
- R "evolved"; it was not designed. Many APIs are inconsistent (even in core R!)
- Most R programmers are focused on results, using data. They are not computer scientists. There is a lot of poorly written and poorly documented R code out there.
- R is not very fast. It uses memory inefficiently. But it is often fast enough. This is less true than it once was.

- Installing R
CRAN: The Comprehensive R Archive Network

- R lives at http://cran.r-project.org/
- If you run Windows or Mac go to the address above and download precompiled binary distributions.
If you run a debian-based linux (eg ubuntu), simply do:

```
sudo apt-get update
sudo apt-get install r-base
```

In Linux, you can keep up to date with the newest changes by adding a CRAN repository to your package manager, eg add

```
deb http://<cran.mirror>/bin/linux/ubuntu trusty/
```

to sources.list

- Running R on Windoze
- Running R from command line
Start R and you'll get a greeting, then the R prompt, the `>` sign. The screen will look something like:

```
R version 3.3.0 (2016-05-03) -- "Supposedly Educational"
Copyright (C) 2016 The R Foundation for Statistical Computing
Platform: x86_64-pc-linux-gnu (64-bit)
[...MORE...]
Type 'demo()' for some demos, 'help()' for on-line help, or
'help.start()' for an HTML browser interface to help.
Type 'q()' to quit R.
```

- RStudio
- Windows: console, workspace, files, current open file (script).
- Projects: Use them! saves a hidden .RProj file in your project directory.

- RStudio: keyboard shortcuts
- As a script from the command line
  - You can write a script and save it as a plain text file (usually with the extension ".R"). Then call your script by typing:

```
Rscript myscript.R
```

On Linux, you can use the bash "shebang" mechanism to make the script executable: just put

`#!/usr/bin/Rscript`

as the first line of the script. Then you can call your script directly: `./myscript.R`

### 1.3 Fundamentals of R

- R as a calculator:

```r
4.0 * 3.5
log(10)    # natural log
log10(10)
(3 + 1) * 5
3^-1
1/0
```

```
[1] 14
[1] 2.302585
[1] 1
[1] 20
[1] 0.3333333
[1] Inf
```

- Statements
  - one statement per line
  - but R will keep reading if the statement is "not finished" (open parentheses, hanging operator, etc)
  - a semicolon can end a statement (rarely, if ever, needed)

```r
1 + 2  # second statement ends here
3 + 4 +
  ( 7 - 7)
```

```
[1] 3
[1] 7
```

- Assigning values to variables
  - Variables are assigned using `<-`. `=` is also now allowed. `==` tests equality.

```r
x <- 12.6
x
x == 13
```

```
[1] 12.6
[1] FALSE
```

  - Variables that contain many values are **vectors**; create them with the concatenate function, `c()`:

```r
y <- c(3, 7, 9, 11)
y
```

```
[1]  3  7  9 11
```
- Naming variables (objects)
  - 1-63 characters
  - first character a letter
  - remaining characters: letters **or** numbers **or** period (+ underscore)
  - case sensitive

- Parentheses and braces
  - `()` : parentheses group operators
  - `{}` : braces group statements and return the last expression evaluated. This is needed for function definitions, if-else statements, and a few other places.
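Because a braced block evaluates to its last expression, a block can stand anywhere an expression can. A minimal sketch (my own example, not from the lecture):

```r
# The value of a braced block is its last evaluated expression.
z <- {
  a <- 2
  b <- 3
  a * b
}
z  # 6
```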

- Functions
Functions are modules of code that accomplish a specific task. Functions take data as "arguments" and "return" a result. Once a function is written, it can be used over and over and over again. Functions can be "called" from within other functions.

- Functions in R
Arguments to R functions can be matched by position or by name. So the following calls to sd() are all equivalent

```r
mydata <- rnorm(100)
sd(mydata)
sd(x = mydata)
sd(x = mydata, na.rm = FALSE)
sd(na.rm = FALSE, x = mydata)
sd(na.rm = FALSE, mydata)
```

- Example: random numbers

```r
runif(5)
```

```
[1] 0.0705746 0.7757365 0.2578603 0.3275505 0.6414583
```

```r
runif(5, 0, 10)         # real numbers between 0 and 10
floor(runif(5, 0, 11))  # integers from 0 to 10
```

```
[1] 10  0  5 10  9
```

- Inspecting functions
```r
sd
```

```
function (x, na.rm = FALSE) 
sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x),
    na.rm = na.rm))
<bytecode: 0x1ef8258>
<environment: namespace:stats>
```

- Where do these functions come from?
Many are part of base R. Others come from packages. Others you will write yourself.

- Scatterplots
```r
library(ggplot2)
# ?mpg
head(mpg)
str(mpg)
summary(mpg)
qplot(displ, hwy, data = mpg)
```

- The result
- Vectors: the heart of R
```r
x <- 1:20
x
```

```
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18 19 20
```

- Matrices
```r
B <- matrix(c(2,4,3,1,5,7), nrow=3, ncol=2)
B
```

```
     [,1] [,2]
[1,]    2    1
[2,]    4    5
[3,]    3    7
```

- Session data
- .Rprofile
- File loaded on startup which can contain your own custom functions
- .Rhistory
- File that contains the session history of commands
- .RData
- Binary blob of data which contains all the objects created in the session. Watch out for this!

- Working directory
Set your working directory! If you have a $HOME environmental variable set, then path.expand is helpful.

```r
getwd()
setwd(path.expand("~/"))
```

- `.RData` and `.Rhistory`
  - When you quit a session you are asked if you want to save the workspace
  - or use `save.image()`
  - When R starts it loads all objects from `./.RData`!

Good idea to start scripts with:

```r
rm(list=ls(all=TRUE))
```

### 1.4 Getting help and essential vocabulary

- Help in R
- Help on the internet
- The R Project’s own manuals are available at the R home page and some search engines are listed there, too.
- Quick R website: http://www.statmethods.net/
- RSeek search engine: http://www.rseek.org/
- You can post your R questions to r-help, the R listserve. See http://www.r-project.org/mail.html.
- Check out StackOverflow for programming questions tagged "r": http://stackoverflow.com/
- R is difficult to search for when using general search engines such as Google. One trick: use google's filetype criterion. To search for scripts with a .R suffix, that pertain to, say, permutations, do:
`filetype:R permutations -rebol`

- Essential vocabulary
See Hadley Wickham's list: http://adv-r.had.co.nz/Vocabulary.html

## 2 Plots with ggplot

### 2.1 Introduction by example

- Making pretty figures
Today we will get ahead of ourselves. Don't worry! I'll come back to the basics after showing you some fun with plots.

- ggplot2

```r
# install.packages("ggplot2")
library(ggplot2)
# ?mpg
head(mpg)
# str(mpg)
summary(mpg)
```

- The `ggplot()` function
  Creates a plot object to which one can add layers (geoms)

- First plot
ggplot(mpg, aes(displ, hwy)) + geom_point()

- geoms
  - `ggplot()` simply sets up the axes and the mapping. To plot data you must add a "geometric object" or "geom".
  - Use `geom_point()` for a scatterplot. See other geoms such as `geom_line()`, `geom_smooth()`, and `geom_boxplot()`.

- Additional variables, an example
ggplot(mpg, aes(displ, hwy, color=class)) + geom_point()

- Saving plots
  - You can store the plot object in a variable and then save it with `ggsave()`:

```r
p <- ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("plot1.pdf", p)
```

  - Save the last plot even if it was not assigned to a variable:

```r
ggplot(mpg, aes(displ, hwy)) + geom_point()
ggsave("plot1.pdf")
```

  - Or use the RStudio GUI ("Export" button in the Plots window)

### 2.2 Aesthetics and faceting

- Aesthetics in addition to x and y

| Aesthetic | Discrete variable        | Continuous variable     |
|-----------|--------------------------|-------------------------|
| Color     | Rainbow of colors        | Gradient red to blue    |
| Size      | Discrete size steps      | Mapping radius to value |
| Shape     | Different shape for each | Doesn't work            |

- Example aesthetics

```r
ggplot(mpg, aes(displ, hwy, size = cyl)) + geom_point()
```

- Faceting
- Small multiples displaying different subsets of the data.
- Useful for exploring conditional relationships and for large data.

- Examples to try:

```r
p <- ggplot(mpg, aes(displ, hwy)) + geom_point()
p + facet_grid(. ~ cyl)
p + facet_grid(drv ~ .)
p + facet_grid(drv ~ cyl)
p + facet_wrap(~ class)
```

- Faceting
  - `facet_grid()`: 2d grid, `rows ~ cols`, `.` for no split
  - `facet_wrap()`: 1d ribbon wrapped into 2d
  - the `scales` argument controls whether position scales are fixed or free

- Overplotting: What is wrong here?

```r
p <- ggplot(mpg, aes(cty, hwy))
p + geom_point()
```

- Jitter

```r
p + geom_jitter()
```

- How can we improve this plot?

```r
ggplot(mpg, aes(class, hwy)) + geom_boxplot()
```

- Reorder

```r
p <- ggplot(mpg, aes(reorder(class, hwy), hwy))
p + geom_boxplot()
```

- Multiple geoms

```r
p + geom_jitter() + geom_violin()
```

- Your turn
  - Read the help for `reorder`. Redraw the previous plots with class ordered by median hwy.
  - How would you put the jittered points on top of the violin plots?

## 3 Data Structures

### 3.1 Math in R

- Doing math in R
exp(), log(), log10(), sqrt(), abs(), min(), which.min(), max(), which.max(), pmin(), pmax(), sum(), prod() (for products of multiple factors), round(), floor(), ceiling(), sort(), factorial() … and more.
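A quick sketch exercising a few of these base R functions (my own example values):

```r
v <- c(2.7, -1.3, 5.1, 0.4)
abs(v)        # element-wise absolute value
which.max(v)  # index of the largest element: 3
sum(v)        # 6.9
sort(v)       # -1.3 0.4 2.7 5.1
pmin(v, 1)    # parallel (element-wise) minimum: 1.0 -1.3 1.0 0.4
```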

- Operator precedence (highest to lowest)
  - `$` : subsetting by name
  - `@` : component / slot extraction
  - `[` `[[` : subsetting
  - `^` : exponentiation
  - `:` : sequence operator
  - `%any%` : special operators, including `%%` and `%/%`
  - `* /` : multiply and divide
  - `+ -` : add and subtract
  - `< > <= >= == !=` : comparisons
  - `!` : negation
  - `& &&` : and
  - `| ||` : or
  - `~` : "depends on" in formulae
  - `<-` `<<-` : assignment (local, global)
  - `?` : help

- Order of operations

```r
x <- 3 + 4 * 2
x
```

```
[1] 11
```

```r
x <- (3 + 4) * 2
x
x <- (3 + 4) * sqrt(2)
x
```

```
[1] 14
[1] 9.899495
```

### 3.2 Vector basics

- R data structures
- vectors
- matrices and arrays
- lists
- data frames

And other specialized structures provided by packages (eg tibbles, data.table, phylogenetic trees …)

- Vectors
```r
x <- c(5,12,13)
x
length(x)
mode(x)
```

```
[1]  5 12 13
[1] 3
[1] "numeric"
```

- The ":" operator
Note the ":" operator you used in homework:

```r
5:8
5:1
```

```
[1] 5 6 7 8
[1] 5 4 3 2 1
```

Beware of the operator precedence:

```r
i <- 3
1:i-1
1:(i-1)
```

```
[1] 0 1 2
[1] 1 2
```

- seq()
```r
seq(from=12, to=30, by=3)
seq(from=1.1, to=2, length=10)
```

```
[1] 12 15 18 21 24 27 30
[1] 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0
```

- rep()
```r
rep(8, 4)
rep(c(5,12,13), 3)
rep(c(5,12,13), each=2)
```

- The subsetting operators
  R has three subset operators: `[`, `[[`, and `$`. We'll start with the one you've already used in your homework: `[`

- Vector subsetting
```r
v <- 1:100
v
```

```
 [1]  1  2  3  4  5  6  7  8  9 10 11 12 13 14 15 16 17 18
[19] 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36
[37] 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54
[55] 55 56 57 58 59 60 61 62 63 64 65 66 67 68 69 70 71 72
[73] 73 74 75 76 77 78 79 80 81 82 83 84 85 86 87 88 89 90
[91] 91 92 93 94 95 96 97 98 99 100
```

```r
v <- v * 10
v[1:3]
```

```
[1] 10 20 30
```

- Character vectors (strings)

```r
y <- "abc"
length(y)
mode(y)
y <- c("abc", "1", "2")
length(y)
```

```
[1] 1
[1] "character"
[1] 3
```

- Strings

```r
u <- paste("abc", "de", "f")  # concatenate the strings
u
```

```
[1] "abc de f"
```

```r
v <- strsplit(u, " ")  # split the string on blanks
v
```

```
[[1]]
[1] "abc" "de"  "f"
```

- Some vector concepts
  - Recycling: automatic lengthening of vectors in certain settings
  - Filtering: extracting subsets of vectors
  - Vectorization: where functions are applied element-wise to vectors

- Recycling

```r
x <- c(1, 2, 3, 4)
c(1, 2) + x
```

```
[1] 2 4 4 6
```

```r
c(1, 2, 3) + x
```

```
[1] 2 4 6 5
Warning message:
In c(1, 2, 3) + x : longer object length is not a multiple of shorter object length
```

- Vectorization
  Many functions are vectorized, including operators such as `>`:

```r
u <- c(5, 2, 8)
v <- c(1, 3, 9)
u > v
```

```
[1]  TRUE FALSE FALSE
```

```r
sqrt(1:9)
```

```
[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427
[9] 3.000000
```

- Vectorized if-then-else
x <- 1:10 y <- ifelse(x %% 2 == 0, 0, 2*x) y

.Rprofile: Setting TX repository [1] 2 0 6 0 10 0 14 0 18 0

### 3.3 Subsetting

- Vector indexing (subsetting)
  There are six things that you can use to subset a vector. The first three are used commonly:

  - positive integers
  - negative integers
  - logical vectors
  - nothing (`[]` returns the original vector)
  - zero (`[0]` returns an empty vector)
  - character vectors (only if the vector is named)
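The three less common forms in action, on a named vector (a minimal sketch with my own example values):

```r
y <- c(a = 1.2, b = 3.9, c = 0.4)
y[]             # nothing: returns the whole vector
y[0]            # zero: returns an empty (zero-length) vector
y[c("a", "c")]  # character vector: works because y is named
```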

- Subsetting vectors

```r
y <- c(1.2, 3.9, 0.4, 0.12)
y[c(1,3)]
y[1:3]
```

```
[1] 1.2 0.4
[1] 1.2 3.9 0.4
```

```r
v <- c(1, 4)
y[-v]
```

```
[1] 3.9 0.4
```

- Logical vectors and filtering

```r
v <- c(TRUE, FALSE, FALSE, TRUE)
y[v]
```

```
[1] 1.20 0.12
```

```r
y[y > 1]
```

```
[1] 1.2 3.9
```

```r
y > 1
```

```
[1]  TRUE  TRUE FALSE FALSE
```

- `match` and `%in%`

```r
names <- c("fortis", "spinosa", "macrocarpa", "leucophylla")
samples <- c("spinosa", "leucophylla")
samples %in% names
```

```
[1] TRUE TRUE
```

```r
names %in% samples
```

```
[1] FALSE  TRUE FALSE  TRUE
```

```r
match(samples, names)
```

```
[1] 2 4
```

- any(), all()
  Useful for boolean (logical) vectors

```r
any(y > 1)
```

```
[1] TRUE
```

```r
all(y > 1)
```

```
[1] FALSE
```

- Assignment to a subset

```r
which(y > 1)
```

```
[1] 1 2
```

```r
y[y > 1] <- 0
y
```

```
[1] 0.00 0.00 0.40 0.12
```

- Boolean operators
  - `==` : test for equality
  - `<`, `>` : less than, greater than
  - `<=`, `>=` : less than or equal, greater than or equal
  - `&&` : Boolean AND for scalars
  - `||` : Boolean OR for scalars
  - `&` : Boolean AND for vectors
  - `|` : Boolean OR for vectors
  - `!x` : Boolean negation
  - `xor()` : exclusive OR
  - `any()`, `all()` : Boolean vector reduction

### 3.4 Final words on vectors

- NA and NULL
  - NA: missing data
  - NULL: the null object; the value simply does not exist (eg an impossible value)

```r
x <- c(1, NA, 3, 4, 5, 6)
mean(x)
```

```
[1] NA
```

```r
mean(x, na.rm=TRUE)
```

```
[1] 3.8
```

```r
x <- c(1, NULL, 3, 4, 5, 6)
x  # NULL can't be part of a vector
mean(x)
```

```
[1] 1 3 4 5 6
[1] 3.8
```

- unique()

```r
x <- c(1, 2, 3, 3, 3, 4, 4, 1, 2)
unique(x)
```

```
[1] 1 2 3 4
```

- Counting TRUEs

```r
x < 3
length(which(x < 3))
sum(x < 3)
```

```
[1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
[1] 4
[1] 4
```

- Vector element names

```r
x <- c(1, 2, 4)
names(x)
```

```
NULL
```

```r
names(x) <- c("one", "two", "four")
x
```

```
 one  two four 
   1    2    4 
```

- Subsetting on names

```r
x[c("one", "four")]
```

```
 one four 
   1    4 
```

- c(): type coercion
  - Values are converted to the simplest type required to represent all information.
  - The ordering is roughly logical < integer < numeric < complex < character < list.

```r
c(5, 2, "abc")
c(5, list(a=1, b=4))
```

```
[1] "5"   "2"   "abc"

[[1]]
[1] 5

$a
[1] 1

$b
[1] 4
```

- c(): flattening

```r
c(1, 2, c(100, 200, 300))
```

```
[1]   1   2 100 200 300
```

### 3.5 Matrices and arrays

- Matrices

```r
m <- matrix(data = c(1,3,5,2,4,6), nrow=3, ncol=2)
m
```

```
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
```

```r
m <- matrix(data = c(1,3,5,2,4,6), nrow=3)  # ncol is inferred
m
```

```
     [,1] [,2]
[1,]    1    2
[2,]    3    4
[3,]    5    6
```

- Matrices, cont.

```r
a <- matrix(c(1,2,3,4), nrow=2)
a
```

```
     [,1] [,2]
[1,]    1    3
[2,]    2    4
```

```r
b <- matrix(c(6,7,8,9), nrow=2)
b
```

```
     [,1] [,2]
[1,]    6    8
[2,]    7    9
```

- Matrix addition

```r
a + b
```

```
     [,1] [,2]
[1,]    7   11
[2,]    9   13
```

- Matrix math, dimensions

```r
c <- matrix(c(3,4,5,6), nrow=1)
c
```

```
     [,1] [,2] [,3] [,4]
[1,]    3    4    5    6
```

```r
a + c  # result: Error in a + c : non-conformable arrays
```

- Matrix math, cont.
  `*` and `^` work element-wise:

```r
a * b
```

```
     [,1] [,2]
[1,]    6   24
[2,]   14   36
```

```r
a^3
```

```
     [,1] [,2]
[1,]    1   27
[2,]    8   64
```

- Matrix multiplication

```r
a <- matrix(c(1,2,3), ncol=3)
b <- matrix(c(4,5,6), ncol=1)
a
b
```

```
     [,1] [,2] [,3]
[1,]    1    2    3

     [,1]
[1,]    4
[2,]    5
[3,]    6
```

How do we calculate the dot product (inner product)? Or, more generally, the matrix product?

```r
# a * b  # ERROR
a %*% b
```

```
     [,1]
[1,]   32
```

```r
b %*% a
```

```
     [,1] [,2] [,3]
[1,]    4    8   12
[2,]    5   10   15
[3,]    6   12   18
```

- rbind and cbind

```r
rbind(a, a)
# rbind(a, b)  # ERROR!
cbind(b, b)
```

```
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    1    2    3

     [,1] [,2]
[1,]    4    4
[2,]    5    5
[3,]    6    6
```

- Row and column means
See: `rowMeans()`, `colMeans()`, `rowSums()`, `colSums()`
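A minimal sketch of two of these on a small matrix (my own example):

```r
m <- matrix(1:6, nrow = 2)  # columns are (1,2), (3,4), (5,6)
rowSums(m)   # 9 12
colMeans(m)  # 1.5 3.5 5.5
```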

- Subsetting matrices
```r
m <- matrix(c(1,2,3,4,5,6), nrow=3)
m[1:2, 1:2]
```

```
     [,1] [,2]
[1,]    1    4
[2,]    2    5
```

- Dimension reduction

```r
z <- matrix(1:8, nrow=4)
z
```

```
     [,1] [,2]
[1,]    1    5
[2,]    2    6
[3,]    3    7
[4,]    4    8
```

```r
r <- z[2, ]
r
```

```
[1] 2 6
```

- Dimension reduction result

```r
attributes(z)
attributes(r)
```

```
$dim
[1] 4 2

NULL
```

- Avoiding dimension reduction

```r
r <- z[2, , drop=FALSE]
r
```

```
     [,1] [,2]
[1,]    2    6
```

```r
dim(r)
```

```
[1] 1 2
```

### 3.6 Lists

- Lists
An R list is a container whose contents can be items of different data types, unlike a vector.

```r
x <- list(u=2, v="abc")
x
x$u
```

```
$u
[1] 2

$v
[1] "abc"

[1] 2
```

- Subsetting lists
  - `[` returns a list
  - `[[` returns an element of the list
  - `$` returns an element by name
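A minimal sketch of the three operators on the same list:

```r
x <- list(u = 2, v = "abc")
x["u"]    # `[`  returns a one-element list
x[["u"]]  # `[[` returns the element itself
x$u       # `$`  is shorthand for x[["u"]]
```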

- Lists continued
A common usage of lists is to package the return values of complex statistical functions. For example, the histogram function hist():

```r
hist(Nile)
hn <- hist(Nile)
print(hn)
```

```
$breaks
 [1]  400  500  600  700  800  900 1000 1100 1200 1300 1400

$counts
 [1]  1  0  5 20 25 19 12 11  6  1

$density
 [1] 0.0001 0.0000 0.0005 0.0020 0.0025 0.0019 0.0012 0.0011 0.0006 0.0001

$mids
 [1]  450  550  650  750  850  950 1050 1150 1250 1350

$xname
[1] "Nile"

$equidist
[1] TRUE

attr(,"class")
[1] "histogram"
```

### 3.7 Data Frames

- Data Frames
Like a matrix, but for mixed types. This is what you will use most often!

```r
d <- data.frame(kids=c("Jack","Jill","Igor"), ages=c(12,10,11))
d
```

```
  kids ages
1 Jack   12
2 Jill   10
3 Igor   11
```

- Reading data
For now, we will assume you save your data as a comma-separated values file (say, from Excel), and then read that file into a data frame:

```r
fish <- read.csv("./data/fish-data.csv", stringsAsFactors=FALSE)
head(fish, 3)
```

```
  aquarium nutrient color.score
1        a      low         5.3
2        a      low         4.7
3        a      low         5.0
```

Always use `stringsAsFactors = FALSE` (or use Hadley Wickham's readr package).

## 4 Introduction to functions and more lists

### 4.1 Functions

- Why write functions?
Why not just copy and paste code?

- You can give a function a name that makes your code easier to understand.
- If you need to change the code, you only need to change it in one place.
- You eliminate the chance of making mistakes when you copy and paste.

- When to write a function
  Whenever you find yourself copying and pasting more than twice.

- Functions are objects
So we create them and assign them to a name.

```r
addone <- function(x) {
  return(x + 1)
}
addone(1)
```

```
[1] 2
```

- Function example
```r
num2string <- function(thename, num) {
  if (num > 1) {
    thename <- paste(thename, "s", sep="")
  }
  return(paste(as.character(num), thename))
}
num2string("computer", 44)
num2string("instructor", 1)
```

```
[1] "44 computers"
[1] "1 instructor"
```

### 4.2 In class project

- Function randomIntegers()
  - Problem: convert random real numbers to integers
  - Solution

```r
# RandomIntegers: produce a vector of uniformly
# distributed random integers.
#
# Arguments:
#   n = number of random integers to be generated.
#   min_val, max_val = min, max possible values, inclusive.
# Returns:
#   vector of random integers.
RandomIntegers <- function(n, min_val, max_val) {
  # Extend upper boundary by 1 and use floor()
  result <- runif(n, min_val, max_val + 1)
  result <- floor(result)
  return(result)
}
```

- Other solutions, using round():

```r
RandomIntegers2 <- function(n, min_val, max_val) {
  # Extend each boundary by 0.5 and use round()
  r <- runif(n, min_val - 0.5, max_val + 0.5)
  return(round(r))
}
```

Or … extend the lower bound and use `ceiling()`

- Other solutions, using sample()

```r
RandomIntegers3 <- function(n, min_val, max_val) {
  sample(min_val:max_val, n, replace=TRUE)
}
```

### 4.3 Some notes on functions

- Steps to writing a function
- Decide what information is needed (arguments) and what to return
- Develop the algorithm
- Implement in code

- Potential problems
- Syntax errors (error when you define the function)
- Errors during run time (eg undefined variable names)
- Runs, no errors but output wrong!

- Debugging
More on this later …

- use print()
- use stopifnot()
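For example, `stopifnot()` halts with an informative error as soon as an argument check fails (`rescale` here is my own illustrative function, not from the lecture):

```r
rescale <- function(x) {
  # fail fast on bad arguments
  stopifnot(is.numeric(x), length(x) > 1)
  (x - min(x)) / (max(x) - min(x))
}
rescale(c(2, 4, 6))  # 0.0 0.5 1.0
```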

### 4.4 Lists

- Lists vs vectors
An R list is a container whose contents can be items of diverse data types, as opposed to them being, say, all numbers, as in a vector.

- List indexing

```r
j <- list(name="Joe", salary=39000, permanent=TRUE)
j$salary
j$sal       # partial matching of names
j[["salary"]]
j[[2]]
j[2]
```

```
[1] 39000
[1] 39000
[1] 39000
[1] 39000

$salary
[1] 39000
```

- `[]` vs `[[]]`
  - `j[[1]]` extracts a single item
  - `j[1]` extracts a sublist

- Adding elements

```r
j$rnumber <- "R00001111"
j
```

```
$name
[1] "Joe"

$salary
[1] 39000

$permanent
[1] TRUE

$rnumber
[1] "R00001111"
```

- Adding more elements

```r
length(j)
j[[5]] <- "Extra"
j[6:7] <- c(F, F)
j[3:7]
```

```
[1] 4

$permanent
[1] TRUE

$rnumber
[1] "R00001111"

[[3]]
[1] "Extra"

[[4]]
[1] FALSE

[[5]]
[1] FALSE
```

- Deleting elements

```r
length(j)
j$rnumber <- NULL
length(j)
names(j)
```

```
[1] 7
[1] 6
[1] "name"      "salary"    "permanent" ""          ""          ""
```

- Inserting elements
  Either break apart and re-concatenate with `c()`, or use `append()`:

```r
newj <- append(j, list(birthday="12/10/1970"), after=3)
newj[1:5]
```

```
$name
[1] "Joe"

$salary
[1] 39000

$permanent
[1] TRUE

$birthday
[1] "12/10/1970"

[[5]]
[1] "Extra"
```

- Lists to vectors: `unlist()`

```r
l <- list(a=1, b=10, c=100)
v <- unlist(l)
v
mode(l)
mode(v)
unname(v)
```

```
  a   b   c 
  1  10 100 
[1] "list"
[1] "numeric"
[1]   1  10 100
```

### 4.5 Apply functions

- Apply family of functions
  - `apply`: apply a function to the rows, columns or values in a matrix/array
  - `lapply`: apply a function to every item in a list and return a list
  - `sapply`: like `lapply`, but simplifies the result (eg returns a vector)
  - `mapply`: for when you have multiple data structures and want to hand elements of each as separate arguments to a function

  There are more in this family that we use less commonly. See `??apply` and scroll down.

- Applying functions to lists
- lapply example
  Consider sexing 10,000 shrimp. We'll use a subset. We'd like to know the indices for all females, males and juveniles:

```r
shrimp <- c("M", "F", "J", "J", "F", "F", "M", "F", "F")
shrimp == "F"
which(shrimp == "F")
```

```
[1] FALSE  TRUE FALSE FALSE  TRUE  TRUE FALSE  TRUE  TRUE
[1] 2 5 6 8 9
```

- Example, continued

```r
codes <- c("F", "M", "J")
lapply(codes, function(sex) which(shrimp == sex))
```

```
[[1]]
[1] 2 5 6 8 9

[[2]]
[1] 1 7

[[3]]
[1] 3 4
```

- Whoa, where was the function?
  It was anonymous! An alternative with a named function:

```r
# function to return indices of x in v
indices <- function(x, v) {
  return(which(x == v))
}
# and use sapply for USE.NAMES=TRUE:
sapply(codes, indices, v=shrimp, simplify=FALSE)
```

```
$F
[1] 2 5 6 8 9

$M
[1] 1 7

$J
[1] 3 4
```

## 5 Control Structures

### 5.1 Logistics and homework solutions

- Logistics
- Homework dates all on syllabus now. Note last date is not a Monday
- I will drop lowest homework grade.
- Use this etherpad for sharing code in class: https://public.etherpad-mozilla.org/p/biol6325-2016

- unique()

```r
x <- c(1, 2, 3, 3, 3, 4, 4, 1, 2)
unique(x)
```

```
[1] 1 2 3 4
```

- Counting TRUEs

```r
x < 3
length(which(x < 3))
sum(x < 3)
```

```
[1]  TRUE  TRUE FALSE FALSE FALSE FALSE FALSE  TRUE  TRUE
[1] 4
[1] 4
```

- Solutions Q. 1-3

```r
oak_wp <- read.csv("http://r-research-tool.schwilk.org/lectures/data/oak-water-potentials-simple.csv",
                   stringsAsFactors = FALSE)
n_species <- length(unique(oak_wp$spcode))          # 1
any(oak_wp$md.psi > oak_wp$pd.psi)                  # this answers #2
which_fail <- which(oak_wp$md.psi > oak_wp$pd.psi)  # 2
n_fail_check <- length(which_fail)                  # 2
min_index <- which.min(oak_wp$md.psi)               # 3
oak_wp[min_index, c("md.psi", "date", "spcode")]    # 3
```

- Solution Q. 4

```r
years <- unique(oak_wp$year)
mean_for_year <- function(year) {
  return(mean(oak_wp$md.psi[oak_wp$year == year], na.rm=TRUE))
}
wp_means <- sapply(years, mean_for_year)
years[which.min(wp_means)]
```

- R vocab first two weeks of class:

### 5.2 Apply functions, cont.

- Lists of lists

```r
a <- list(name="Dylan", dept="Biology")
b <- list(name="Gondah", dep="Plant and Soil")
c <- list(name="Kathryn", role="NRM")
course <- list(Schwilk=a, Zulue=b, Brautigam=c)
course
```

```
$Schwilk
$Schwilk$name
[1] "Dylan"

$Schwilk$dept
[1] "Biology"


$Zulue
$Zulue$name
[1] "Gondah"

$Zulue$dep
[1] "Plant and Soil"


$Brautigam
$Brautigam$name
[1] "Kathryn"

$Brautigam$role
[1] "NRM"
```

- `sapply`
  How can we extract a vector of just the first names?

```r
sapply(course, "[[", "name")
```

```
  Schwilk     Zulue Brautigam 
  "Dylan"  "Gondah" "Kathryn" 
```

- Apply functions to matrices: `apply()`

  `y <- apply(X, MARGIN, FUN, ...)`

  - `X`: an array
  - `MARGIN`: subscripts which the function gets applied over
  - `FUN`: the function to be applied
  - `...`: additional arguments to the function

  Returns a vector or array.

- Applying a function to rows or columns
  The `MARGIN` argument to `apply()` is 1 if the function applies to rows, 2 for columns, etc.

```r
z <- matrix(c(1,2,3,4,5,6), ncol=2)
z
```

```
     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6
```

```r
apply(z, 2, mean)
```

```
[1] 2 5
```

- On rows, using our own function:

```r
f <- function(x) x/c(2, 8)
y <- apply(z, 1, f)
y
```

```
     [,1]  [,2] [,3]
[1,]  0.5 1.000 1.50
[2,]  0.5 0.625 0.75
```

- What are "margins"?
- Examples: matrix
- MARGIN = 1: slice up into rows
- MARGIN = 2: slice up into columns
- MARGIN = c(1,2): slice up into individual cells

— Wickham (2011): http://www.jstatsoft.org/v40/i01/

- 3-d arrays
- Example: grid of genotypes
  Let's store genotypes as the 1st dimension on a 2d landscape of organisms. So we will make a 3d array:

```r
makeGenotype <- function() sample(1:100, 25, replace=TRUE)
# simple 10 x 10 landscape
landscape <- array(replicate(100, makeGenotype()), dim=c(25, 10, 10))
# The first genotype in landscape position 1,1:
landscape[, 1, 1]
```

```
[1] 55 22  4 80 57  8 14 30 51  9 89  7 31 33 47 77 62 91 58 34 49 82 20 41  2
```

- Example: apply

```r
## get matrix of fitnesses
testFitness <- function(g, e) {
  return(1 / (sum((e - g)^2) + 1))
}
opt <- makeGenotype()  # "best" genotype
fitnesses <- apply(landscape, c(2,3), testFitness, e=opt)
# let's see the relative fitnesses as percentages
round(fitnesses / max(fitnesses) * 100)
```

- Results:

```
      [,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10]
 [1,]   40   45   27   38   40   36   38   44   36    33
 [2,]   41   34   45   37   51   46   37   49   34    40
 [3,]   35   40   35   35   43   35   41   31   45    51
 [4,]   31   58   24   46   44   38   33   38   40    37
 [5,]   34   36   56   44   45   58   37   39   46    68
 [6,]   39   40   47   51   33   36   54   47   44    42
 [7,]   42   56   44   44   38   35   36   38   40    70
 [8,]   43   53   67   50   36   39   59   38   40    55
 [9,]   34   42   61  100   29   42   37   27   50    35
[10,]   37   58   39   64   53   54   82   42   40    31
```

- Where is the most fit genotype?

```r
max(fitnesses)
indices <- which(fitnesses == max(fitnesses), arr.ind=TRUE)
indices
landscape[, indices[1], indices[2]]
opt
```

```
[1] 4.968944e-05

     row col
[1,]   9   4

 [1] 25 89 23 48 69 84 40 30 19 52 87 35 15 54 30 30 48 85 47 19  5  6 62 20 21
 [1]  9 54 89  2 67 94  8  5 19 96 95 51 15 27  4 10 47 82 91 49  7 38 63 69 12
```

- Heatmap of landscape:
image(fitnesses)

### 5.3 if-else

- if else decisions

```r
tf <- function(r) {
  if (r == 13) {
    print("Lucky!")
  } else {
    print("No luck")
  }
}
tf(5)
tf(13)
```

```
[1] "No luck"
[1] "Lucky!"
```

- if, else if, else

```r
tf <- function(r) {
  if (r == 13) {
    print("Lucky 13!")
  } else if (r < 0) {
    print("Negative!")
  } else {
    print("Just a number")
  }
}
tf(13)
tf(-100)
tf(10)
```

```
[1] "Lucky 13!"
[1] "Negative!"
[1] "Just a number"
```

- `if` needs a single logical value

```r
if (c(TRUE, FALSE)) {}
```

```
NULL
Warning message:
In if (c(TRUE, FALSE)) { :
  the condition has length > 1 and only the first element will be used
```

- `|` vs `||` and `&` vs `&&`
  Use `||` and `&&` in `if` statements. They are "short-circuiting":

```r
x <- 1
if (x > 0 && c(TRUE, FALSE)) {
  print("Yes")
}
```

```
[1] "Yes"
```

## 6 Loops and functions II

### 6.1 Functions: review

- The return() statement
  If you do not include one, R uses the last statement evaluated.

- But an explicit return is clearer
  And can avoid some weird R gotchas:

```r
funnyFunc1 <- function(x) {
  if (x == 10) "TEN"
  else "NOT TEN"
}
funnyFunc2 <- function(x) {
  if (x == 10) r <- "TEN"
  else r <- "NOT TEN"  # invisible return from assignment
}
funnyFunc1(10)
funnyFunc2(10)
print("huh?")
print(funnyFunc2(10))
```

```
[1] "TEN"
[1] "huh?"
[1] "TEN"
```

- Side effects
  "Pure" functions have no side effects, just returned values. But you have used functions entirely for their side effects (eg `library()`, `write.csv()`).

- Luckily R protects you from most side effects
  A function has read access to variables at higher scope, but no write access.

```r
x <- y <- 10
func <- function(x) {
  x <- 0
  y <- y * 2
  return(y)
}
func()
x
y
```

```
[1] 20
[1] 10
[1] 10
```

### 6.2 In-class project 1

- Name the functions
Read the code for each of the following three functions, puzzle out what they do, and then brainstorm better names.

```r
f1 <- function(x) {
  if (length(x) <= 1) return(x)
  return(x[length(x):1])
}

f2 <- function(string, prefix) {
  substr(string, 1, nchar(prefix)) == prefix
}

f3 <- function(x) {
  if (!is.vector(x) || !is.numeric(x)) return(FALSE)
  return(length(x) == 1)
}
```

### 6.3 Loops

- **for** and **while** loops

```r
x <- c(1, 2, 3)
for (i in x) {
  print(i^2)
}
```

```
[1] 1
[1] 4
[1] 9
```

```r
i <- 1
while (i <= 10) {
  i <- i+4
  print(i)
}
```

```
[1] 5
[1] 9
[1] 13
```

- Breaking out of loops

```r
i <- 1
while (TRUE) {
  i <- i+4
  if (i > 10) break
}
i
```

```
[1] 13
```

Or use **repeat**:

```r
i <- 1
repeat {
  i <- i+4
  if (i > 10) break
}
i
```

```
[1] 13
```

- Loop hierarchy
Specific to general: `for`, `while`, `repeat`. So when do we use each type?
  - Use `for` loops when you need to iterate over a vector: there is a known number of iterations.
  - Use `while` loops when the loop must continue until some condition is met. If the condition is met when the loop is first encountered, the loop is skipped completely.
  - Use `repeat` as for `while`, but when the loop must execute at least once or when the best exit point(s) is/are somewhere in the middle of the loop block.
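As a sketch of how the three loop types relate, here is the same sum written with each (the example itself is ours, purely illustrative):

```r
# Sum 1..5 three ways; for() is most natural since the count is known.
total_for <- 0
for (i in 1:5) total_for <- total_for + i

total_while <- 0
i <- 1
while (i <= 5) {
  total_while <- total_while + i
  i <- i + 1
}

total_repeat <- 0
i <- 1
repeat {
  total_repeat <- total_repeat + i
  i <- i + 1
  if (i > 5) break   # repeat always runs at least once
}

c(total_for, total_while, total_repeat)   # all three equal 15
```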

- Two idioms for `for` loops

- Idiom 2: iterate over an indexing vector in order to do something to elements of multiple vectors

```r
v <- c(2, 4, 5, 8)   # v was not defined in the original notes; inferred from the output below
v2 <- c(1, 1, 2, 10)
for (i in seq_along(v)) {
  r <- v[i] * v2[i]
  print(sprintf("Loop %d, Result: %d", i, r))
}
```

```
[1] "Loop 1, Result: 2"
[1] "Loop 2, Result: 4"
[1] "Loop 3, Result: 10"
[1] "Loop 4, Result: 80"
```
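For contrast, the other idiom iterates over the values themselves, with no index; it suffices when only a single vector is involved. The vector `v` here is our own example:

```r
v <- c(2, 4, 5, 8)
squares <- numeric(0)
for (val in v) {          # loop directly over the values
  squares <- c(squares, val^2)
}
squares
```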

## 7 Exercises with functions

### 7.1 Homework Solutions

- euclideanDistance()
Straightforward solution:

```r
euclideanDist <- function(p1, p2) {
  dist <- sqrt((p1[1] - p2[1])^2 + (p1[2] - p2[2])^2)
  return(dist)
}
```

Idiomatic R:

```r
euclideanDist <- function(p1, p2) {
  dist <- sqrt(sum((p2 - p1)^2))
  return(dist)
}
```

- decomposeTime() with modulo arithmetic

```r
## decomposeTime: Given a value in seconds, convert to days, hours, minutes,
## and seconds. If input is negative, all returned components will be negative
## as well.
##
## args:
##   nSeconds = scalar number of seconds.
## ----------------------------------------
## returns:
##   a list containing:
##     days = integer number of days.
##     hours = integer number of hours.
##     minutes = integer number of minutes.
##     seconds = residual seconds.
decomposeTime <- function(nSeconds) {
  sInMinute <- 60
  sInHour <- 60 * 60
  sInDay <- sInHour * 24
  # deal with negative input by recording sign and taking the absolute value
  sign <- 1
  if (nSeconds < 0) {
    sign <- -1
    nSeconds <- -1 * nSeconds
  }
  result <- list(days = nSeconds %/% sInDay,
                 hours = (nSeconds %% sInDay) %/% sInHour,
                 minutes = (nSeconds %% sInHour) %/% sInMinute,
                 seconds = nSeconds %% sInMinute)
  return(lapply(result, '*', sign))
}
```

- decomposeTime() without using %% or %/%

```r
decomposeTime <- function(inputSeconds) {
  # Some constants:
  secondsPerMinute <- 60
  secondsPerHour <- secondsPerMinute * 60
  secondsPerDay <- secondsPerHour * 24
  # calculate each component:
  days <- floor(inputSeconds / secondsPerDay)
  residualSeconds <- inputSeconds - (days * secondsPerDay)
  hours <- floor(residualSeconds / secondsPerHour)
  residualSeconds <- residualSeconds - (hours * secondsPerHour)
  minutes <- floor(residualSeconds / secondsPerMinute)
  seconds <- residualSeconds - (minutes * secondsPerMinute)
  return(list(days = days, hours = hours, minutes = minutes, seconds = seconds))
}
```

### 7.2 In class projects

- Project 1: calculating a probability
Suppose we have `n` independent events, each with a separate probability of occurring. What is the probability of exactly one of those events occurring?

- The algorithm
If the probability of the \(i^{th}\) event is \(p_i\), then P(A and not B and not C) would be \(p_A(1 - p_B)(1 - p_C)\). For general \(n\):

\[ \sum\limits_{i=1}^{n} p_i (1 - p_1) \ldots (1 - p_{i-1}) (1 - p_{i+1}) \ldots{} (1 - p_n) \]

- Your turn: translate this to code
The function takes a vector of probabilities and returns a scalar probability.

```r
exactlyone <- function(p) {
  ## ?
}
```

You will want to loop through the vector `p` while also looping through "not p" (`1-p`)

- The code

```r
exactlyone <- function(p) {
  notp <- 1 - p
  totalProb <- 0.0
  for (i in 1:length(p)) {
    totalProb <- totalProb + p[i] * prod(notp[-i])
  }
  return(totalProb)
}
```

- Project 2: Fibonacci sequence
0, 1, 1, 2, 3, 5, 8, 13 …

- Solution

```r
fib <- function(n) {
  ## Two termination cases
  if (n == 1) {
    return(0)
  } else if (n == 2) {
    return(1)
  }
  # Otherwise use the definition of the sequence:
  return(fib(n-1) + fib(n-2))
}
# example: first 15 values in the sequence:
sapply(1:15, fib)
```

```
[1]   0   1   1   2   3   5   8  13  21  34  55  89 144 233 377
```
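The naive recursion above recomputes the same values many times: fib(n-1) and fib(n-2) each recompute fib(n-3), and so on, giving exponential running time. One common remedy (not part of the original notes) is to cache already-computed values in a closure; a sketch:

```r
# Memoized Fibonacci: results are stored in `cache` inside makeFib's
# environment, so each fib(n) is computed only once.
makeFib <- function() {
  cache <- c(0, 1)        # fib(1) and fib(2), as defined in the lecture
  function(n) {
    if (n <= length(cache)) return(cache[n])
    result <- Recall(n - 1) + Recall(n - 2)
    cache[n] <<- result   # superassignment updates the enclosing cache
    return(result)
  }
}
fib2 <- makeFib()
fib2(15)    # 377, matching sapply(1:15, fib) above
```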

- The recursion stack

### 7.3 More recursion

- "Exploding" dice
In some table-top role playing games, dice are used to provide stochasticity in the simulation of events. In order to produce a more "exponential-like" distribution, some games use "acing" or "exploding dice" in which the die is re-rolled if the first roll results in the maximum value and the new value is added to the first.

- Simulate such a die roll: `rollDie()`
Write a function called `rollDie(x)` that simulates rolling a die with x sides and allows "acing" or "exploding dice". Under this rule, a maximum roll (eg a "6" on a six-sided die) results in a second roll whose value is added to the first. This is open-ended and the second roll can "ace" as well: in other words, the resulting probability distribution is unbounded on the upper end.

- A `rollDie()` Solution

```r
rollDie <- function(die) {
  rolls <- sample(1:die, 1, replace=TRUE)
  if (rolls == die) {
    rolls <- rolls + rollDie(die)
  }
  return(rolls)
}
rollDie(6)
```

```
[1] 4
```

- rollDice()

```r
rollDice <- function(die, n, allow.ace = TRUE) {
  stopifnot(die > 1)  # no such thing as a 1-sided die
  rolls <- sample(1:die, n, replace = TRUE)
  if (!allow.ace) return(rolls)
  aces <- rolls == die
  if (any(aces)) {
    # the recursive call must be to rollDice() (the original notes
    # called rollDie() here, which takes only one argument)
    rolls[aces] <- rolls[aces] + rollDice(die, sum(aces), allow.ace)
  }
  return(rolls)
}
```
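A quick sanity check on the exploding-die distribution: each explosion happens with probability 1/6 and adds another mean roll, so \(E = 3.5 + E/6\), giving \(E = 4.2\) for a d6. A self-contained sketch (it redefines `rollDie` so it runs on its own):

```r
rollDie <- function(die) {
  roll <- sample(1:die, 1)
  if (roll == die) {
    roll <- roll + rollDie(die)   # "ace": roll again and add
  }
  return(roll)
}

set.seed(42)                      # reproducible simulation
rolls <- replicate(100000, rollDie(6))
mean(rolls)                       # should be close to 4.2
```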

## 8 Data Frames and Factors

### 8.1 Homework

- Walk through algorithm:
Example parents

`["A", "B", "C", "D"]`

and`["a", "b", "c", "d"]`

.- Choose one parent at random. Flip a coin
- Read one value from chosen parent and copy to offspring
- Test for recombination, if it occurs, start reading from other parent
- Go to step 2
- Done when length(child) is length(parent)

- Some ideas for testing?
What must be true?

- The allele at each position must be the same as either the mother's or the father's at that same position. You could write a test for that.
- What should the distribution of parental run lengths be?

### 8.2 Creation and merging

- What is a data frame?
Has characteristics of both lists and matrices.

```r
kids <- c("Jack","Jill")
ages <- c(10,12)
d <- data.frame(kids, ages, stringsAsFactors=FALSE)
d
d[[1]]
d$kids
d[,1]
```

```
  kids ages
1 Jack   10
2 Jill   12
[1] "Jack" "Jill"
[1] "Jack" "Jill"
[1] "Jack" "Jill"
```

- Add to data frame
rbind and cbind work as with matrices:

```r
d <- rbind(d, list("Laura", 11))
d
```

```
   kids ages
1  Jack   10
2  Jill   12
3 Laura   11
```

See also `bind_rows` and `bind_cols` from the `dplyr` package. `bind_rows` will concatenate a list of data frames, filling missing columns with **NA**. I use this most of the time.

- Merging

```r
kids <- c("Jack","Joan","Laura")
states <- c("CA","TX","TX")
d2 <- data.frame(kids, states)
merge(d, d2)
```

```
   kids ages states
1  Jack   10     CA
2 Laura   11     TX
```

- Merging (all, all.x, all.y)

```r
d3 <- merge(d, d2, all = TRUE)
names(d3) <- c("name","age","state")
d3
```

```
   name age state
1  Jack  10    CA
2  Jill  12  <NA>
3  Joan  NA    TX
4 Laura  11    TX
```

- Merging (all.x)

```r
d4 <- merge(d, d2, all.x = TRUE)
d4
```

```
   kids ages states
1  Jack   10     CA
2  Jill   12   <NA>
3 Laura   11     TX
```

### 8.3 Subsetting

- Data frames: subsetting
More concise than using `[`. Names are assumed to be columns.

```r
subset(d3, state == "TX")
subset(d3, age > 10)
```

```
   name age state
3  Joan  NA    TX
4 Laura  11    TX
   name age state
2  Jill  12  <NA>
4 Laura  11    TX
```

- Data frames: add new columns

```r
d3$birth.year <- 2012 - d3$age
d3
```

```
   name age state birth.year
1  Jack  10    CA       2002
2  Jill  12  <NA>       2000
3  Joan  NA    TX         NA
4 Laura  11    TX       2001
```

- More on data frames

### 8.4 Factors

- Grouping variables

```r
# head(iris)
class(iris$Species)
levels(iris$Species)
```

```
[1] "factor"
[1] "setosa"     "versicolor" "virginica"
```

- Factors vs strings
  - We often store factors as strings in a csv file
  - I suggest always leaving them as strings until a factor is needed (use `stringsAsFactors=FALSE` in `read.csv()`)
  - Factors are stored as integers internally

- Factors vs strings

```r
v <- c("3", "2", "10")
factor(v)
vf <- as.factor(v)
vf
as.numeric(vf[3])  ## How did "10" become 1?
```

```
[1] 3  2  10
Levels: 10 2 3
[1] 3  2  10
Levels: 10 2 3
[1] 1
```
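The `[1] 1` is the internal integer code for the level "10" (levels sort alphabetically, so "10" comes first). To recover the original numbers from a factor, convert through character first:

```r
v <- c("3", "2", "10")
vf <- as.factor(v)
as.numeric(vf)                  # level codes: 3 2 1
as.numeric(as.character(vf))    # actual values: 3 2 10
```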

- Factors: reassign levels

```r
v <- c("Med", "High", "Low", "Low", "Med")
factor(v)
vf <- factor(v, levels=c("Low", "Med", "High"))
vf
```

```
[1] Med  High Low  Low  Med 
Levels: High Low Med
[1] Med  High Low  Low  Med 
Levels: Low Med High
```

- Factors: rename levels

```r
# vf
levels(vf) <- c("low", "medium", "high")
vf
```

```
[1] medium high   low    low    medium
Levels: low medium high
```

- split()

```r
split(iris$Sepal.Length, iris$Species)
```

```
$setosa
 [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7
[20] 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9
[39] 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0

$versicolor
 [1] 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2
[20] 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3
[39] 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7

$virginica
 [1] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7
[20] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4
[39] 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
```

- tapply()

```r
tapply(iris$Sepal.Length, iris$Species, sd)
```

```
    setosa versicolor  virginica 
 0.3524897  0.5161711  0.6358796 
```

- Subsetting a dataframe

```r
x <- subset(iris, Species == "virginica")
head(x)
```

```
    Sepal.Length Sepal.Width Petal.Length Petal.Width   Species
101          6.3         3.3          6.0         2.5 virginica
102          5.8         2.7          5.1         1.9 virginica
103          7.1         3.0          5.9         2.1 virginica
104          6.3         2.9          5.6         1.8 virginica
105          6.5         3.0          5.8         2.2 virginica
106          7.6         3.0          6.6         2.1 virginica
```

- by()

```r
colMeans(x[ , 1:4])
```

```
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       6.588        2.974        5.552        2.026 
```

```r
by(iris[ , 1:4], iris$Species, colMeans)
```

```
iris$Species: setosa
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       5.006        3.428        1.462        0.246 
------------------------------------------------------------
iris$Species: versicolor
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       5.936        2.770        4.260        1.326 
------------------------------------------------------------
iris$Species: virginica
Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
       6.588        2.974        5.552        2.026 
```

- Shrimp again: split()

```r
shrimp <- c("M", "F", "J", "J", "F", "F", "M", "F", "F")
split(1:length(shrimp), shrimp)
```

```
$F
[1] 2 5 6 8 9
$J
[1] 3 4
$M
[1] 1 7
```

Frequencies:

```r
sapply(split(1:length(shrimp), shrimp), length)
# or use table()
print("using table():")
table(shrimp)
```

```
F J M 
5 2 2 
[1] "using table():"
shrimp
F J M 
5 2 2 
```

### 8.5 Merging example

- Data often stored in separate files
Consider:

```r
oak_wp <- read.csv("http://r-research-tool.schwilk.org/lectures/data/oak-water-potentials-simple.csv",
                   stringsAsFactors = FALSE)
species <- read.csv("http://r-research-tool.schwilk.org/lectures/data/oak-species.csv",
                    stringsAsFactors = FALSE)
head(oak_wp, 3)
head(species, 3)
```

```
   site tag spcode       date pd.psi md.psi year
1 GATEN 110   QUEM 10/24/2010     NA     NA   10
2 GATEN 111   QUEM 10/27/2010     NA     NA   10
3 GATEN 113   QUEM 10/24/2010     NA     NA   10
    genus specific.epithet   family spcode display.name CM DM GM
1 Quercus           emoryi Fagaceae   QUEM    Q. emoryi  1  1 NA
2 Quercus         gambelii Fagaceae   QUGA  Q. gambelii  1  1  1
3 Quercus         gravesii Fagaceae  QUGR2  Q. gravesii  1  1 NA
```

- We want to plot data but use real names

```r
oak_wp <- merge(oak_wp, species)
head(oak_wp, 2)
```

```
  spcode  site tag       date pd.psi md.psi year   genus specific.epithet
1   QUEM GATEN 110 10/24/2010     NA     NA   10 Quercus           emoryi
2   QUEM GATEN 111 10/27/2010     NA     NA   10 Quercus           emoryi
    family display.name CM DM GM
1 Fagaceae    Q. emoryi  1  1 NA
2 Fagaceae    Q. emoryi  1  1 NA
```

- "Merging" = "joining"
See `left_join`, `inner_join`, etc. in `dplyr`. We'll come back to this.

### 8.6 Data frames and the "tidyverse"

- R has a history
  - And that means some aspects have proven effective, and some have not (eg bad defaults like `drop=TRUE`)
  - Hadley Wickham and others have worked on a set of packages that provide a consistent interface. These, too, have a history (eg `plyr`, `reshape2`), but the current best choices are `readr`, `dplyr`, `tidyr`, `tibble`, and `ggplot2`.
  - You can even load all of these at once with the meta-package: `library(tidyverse)`

- `dplyr` and tibbles
  - You can use base R (subset, merge, etc), but dplyr provides a clean interface for working with data frames
  - In fact, `dplyr` will convert your data frames to efficient "tibbles"
  - tibbles are just data frames with nicer, consistent behavior

- Some dplyr examples to get you started

```r
library(readr)
library(dplyr)
oak_wp <- read_csv("http://r-research-tool.schwilk.org/lectures/data/oak-water-potentials-simple.csv")
head(filter(oak_wp, spcode=="QUEM" & year==13), 2)
```

```
Attaching package: ‘dplyr’
The following objects are masked from ‘package:stats’:
    filter, lag
The following objects are masked from ‘package:base’:
    intersect, setdiff, setequal, union
Source: local data frame [2 x 7]
   site   tag spcode       date pd.psi md.psi  year
  <chr> <chr>  <chr>      <chr>  <dbl>  <dbl> <int>
1   MDS   238   QUEM 05/23/2013  -0.85  -1.95    13
2   MDS   237   QUEM 05/23/2013  -0.76  -1.16    13
```

- mutate()

```r
nd <- mutate(oak_wp, proj_year = year - 9)
head(nd, 2)
```

```
Source: local data frame [2 x 8]
   site   tag spcode       date pd.psi md.psi  year proj_year
  <chr> <chr>  <chr>      <chr>  <dbl>  <dbl> <int>     <dbl>
1 GATEN   110   QUEM 10/24/2010     NA     NA    10         1
2 GATEN   111   QUEM 10/27/2010     NA     NA    10         1
```

- select()

```r
nd <- select(oak_wp, spcode, date, md.psi)
tail(nd, 3)
```

```
Source: local data frame [3 x 3]
  spcode       date md.psi
   <chr>      <chr>  <dbl>
1   QUHY 06/25/2013  -1.22
2   QUHY 06/25/2013  -1.87
3   QUHY 06/25/2013  -1.33
```

## 9 Math I

### 9.1 Homework

- Homework
- Solutions are included with this week's assignment
- Remember to cite any sources or collaborators and provide some detail – avoid plagiarism.
- Please read the style guide. Adhere to it.
- Avoid literal values – rarely needed. Example on board.

- Testing example

```r
## test if distribution of run lengths looks like a bernoulli process
## (geometric distribution). As length -> infinity this should be true.
## Mean should be 1 / p
## Variance should be (1 - p) / p^2
mother <- rep(0, 10000)
father <- rep(1, 10000)
child <- mateVectors(mother, father, 0.05)
runs <- rle(child)
runlengths <- runs$lengths
mean(runlengths)  # 1 / 0.05 = 20
var(runlengths)   # (1 - p) / p^2 = 380 when p = 0.05
```

### 9.2 Sorting and sets

- `sort` and `order`
Use `order()` to sort data frames:

```r
d1 <- data.frame(sites = c("MDS", "BGN", "MDN", "LCN"),
                 elevation = c(1900, 2300, 1950, 1750))
d1
```

```
  sites elevation
1   MDS      1900
2   BGN      2300
3   MDN      1950
4   LCN      1750
```

```r
y <- order(d1$elevation)
d1[y, ]
```

```
  sites elevation
4   LCN      1750
1   MDS      1900
3   MDN      1950
2   BGN      2300
```

- Sets in R
A set is a collection of distinct (unique) objects. In R we simply use a vector.

```r
numbers <- sample(1:10, 20, replace=TRUE)
numbers
myset <- unique(numbers)
myset
```

```
 [1]  6  3  1  8  6  1  2  3  6  1  9  1  4  4  5  8  7 10  6  4
 [1]  6  3  1  8  2  9  4  5  7 10
```

- Set operations
  - `union(x,y)`: union of sets x and y
  - `intersect(x,y)`: intersection of sets x and y
  - `setdiff(x,y)`: all elements in x that are not in y
  - `setequal(x,y)`: test for set equality
  - `x %in% y`: test for membership in a set; returns a boolean vector
  - `choose(n,k)`: number of possible subsets of size k chosen from a set of size n
  - `combn(x,m)`: find all combinations of size m taken from x

- Let's write some functions
- Write a function to calculate the symmetric difference of two sets
- Define a subset operator
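One possible sketch of both exercises (the function names `symdiff` and `isSubset` are our own):

```r
# Symmetric difference: elements in exactly one of x or y
symdiff <- function(x, y) {
  union(setdiff(x, y), setdiff(y, x))
}

# A subset test: is every element of x also in y?
isSubset <- function(x, y) {
  all(x %in% y)
}

symdiff(c(1, 2, 3), c(2, 3, 4))   # 1 4
isSubset(c(1, 2), c(3, 2, 1))     # TRUE
```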

### 9.3 Number systems

- A note on integers vs real numbers
R has integers and real numbers (the "numeric" class, stored as type "double"). I have sometimes referred to "integers" when we have created vectors such as c(1,2,3), but usually one ends up with the numeric class, which stores double-precision floating point values. Examples:

```r
typeof(2)
typeof(as.integer(2))
typeof(c(1,2,3))
typeof(1:3)
```

```
[1] "double"
[1] "integer"
[1] "double"
[1] "integer"
```

- Positional number systems
  - These are simply ways to count things; ours is base-10.
  - There is no symbol "10" in base-10 and no symbol "2" in base-2, etc
  - Each column stands for a power of the base (eg powers of 10)

- Binary: base-2
- A single binary number is called a
**B**inary dig**IT**, or**bit**. - Computers perform operations on binary number groups called "words". Today, most computers use 32-, 64-, or 128-bit words.
- Words are subdivided into 8-bit groups called bytes.

- A single binary number is called a
- Hexadecimal: base-16

| Dec | Hex | Binary | Dec | Hex | Binary |
|-----|-----|--------|-----|-----|--------|
| 0 | 0 | 0000 | 8 | 8 | 1000 |
| 1 | 1 | 0001 | 9 | 9 | 1001 |
| 2 | 2 | 0010 | 10 | a | 1010 |
| 3 | 3 | 0011 | 11 | b | 1011 |
| 4 | 4 | 0100 | 12 | c | 1100 |
| 5 | 5 | 0101 | 13 | d | 1101 |
| 6 | 6 | 0110 | 14 | e | 1110 |
| 7 | 7 | 0111 | 15 | f | 1111 |

- Hexadecimal example
- Negative numbers
Negative numbers: signed integers

- Takes an extra bit of information because it doubles the size we must be able to store.
- Multiple ways to represent (Use first bit for sign, top half of range, or 2's complement)
- Last way is what is used in computers — but you don't need to worry about representation.

- Integer Overflow
If we have a 1-byte signed integer in which the first bit stores the sign:

```
  01011101 :  93
+ 01101011 : 107
= 11001000 : -56
```
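R's 32-bit integer type shows the same limitation, though instead of silently wrapping around like the raw binary addition above, R returns NA with a warning:

```r
.Machine$integer.max             # largest 32-bit signed integer: 2147483647
suppressWarnings(.Machine$integer.max + 1L)   # integer overflow: NA
.Machine$integer.max + 1         # adding a double promotes to double: 2147483648
```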

### 9.4 Floating point

- Real numbers on computers: floating point
- Problem:
How to represent real numbers with integers? Binary point?

101.011:

| 1 | 0 | 1 | . | 0 | 1 | 1 |
|---|---|---|---|---|---|---|
| \(2^{2}\) | \(2^{1}\) | \(2^{0}\) | . | \(2^{-1}\) | \(2^{-2}\) | \(2^{-3}\) |

- Solution:
Fix the location of the binary point. Think of scientific notation with fixed precision: \(6.020123 \times 10^{23}\)

In binary: \(\pm\) X.YYYYYY \(\times 2^{\pm N}\)

- Problem:
- IEEE single precision: 32 bits, one word
Simply the "float" type in most languages: \(\pm\) 1.YYYYY \(\times 2^{N-127}\)

- assumes X is always 1 (no leading zeros, trick to get one more bit of precision)
- 1 sign bit
- 8 bit biased exponent (n-127)
- 23 bit mantissa (YYYYY)

- "single precision floating point": 32 bits
- Different floating point types
IEEE Standard for Floating-Point Arithmetic (IEEE 754) (1980s):

- Single precision, called "float" in the C language family, has a 23 bit mantissa (so 24 bits of precision with assumed 1).
- Double precision, called "double" in the C language family, occupies 64 bits (8 bytes) and its significand has a precision of 53 with the assumed 1. Sign is 1 bit, exponent is 11 bits and mantissa is 52 bits.

In R there are only 2 numeric types: integer and numeric (double precision). "Single precision" not used much anywhere nowadays.

- Some issues
  - The double type (R's "numeric" class) cannot represent all 64-bit integers! But since R uses 32-bit integers, a double can store more integers than the integer type can. See `?integer`
  - Rounding error: the only numbers that can be represented exactly in R's numeric type are integers and fractions whose denominator is a power of 2. Other numbers have to be rounded to (typically) 53 binary digits of accuracy.

- Rounding error
64 bits are not enough to store every possible value on the number line!

```r
1.0 - 0.9 - 0.1
```

```
[1] -2.775558e-17
```

- Floating point error
As a result, two floating point numbers will not reliably be equal:

```r
a <- sqrt(2)
a * a == 2
a * a - 2
```

```
[1] FALSE
[1] 4.440892e-16
```

The function `all.equal()` allows you to compare doubles. It basically tests `abs(a - b) < epsilon` for a small tolerance epsilon.

- Testing equality of floating point numbers
Compare `all.equal`, `identical` and `==`:

```r
all.equal(a*a, 2)  # see documentation for using in an if statement
identical(a*a, 2.0)
all.equal(as.double(8), as.integer(8))
identical(as.double(8), as.integer(8))
```

```
[1] TRUE
[1] FALSE
[1] TRUE
[1] FALSE
```
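A caution when using all.equal() inside an if statement (this is what the documentation note refers to): on a mismatch it returns a character description rather than FALSE, so wrap the call in isTRUE():

```r
a <- sqrt(2)
all.equal(a * a, 3)    # not FALSE: a string describing the difference
if (isTRUE(all.equal(a * a, 2))) {
  print("equal within tolerance")
}
```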

## 10 Math II

### 10.1 Math functions

- Built-in functions
  - `exp()`: exponential function, base e
  - `log()`: natural logarithm
  - `log10()`: logarithm base 10
  - `sqrt()`: square root
  - `abs()`: absolute value
  - Trig functions: `sin()`, `cos()`, `tan()`, `atan()`, `atan2()`
  - `min()`, `max()`: min and max within a vector
  - `pmin()`, `pmax()`: element-wise minima and maxima of several vectors
  - `sum()`, `prod()`: sum and product of the elements of a vector
  - `cumsum()`, `cumprod()`: cumulative sum or product of all elements in a vector
  - `round()`: rounding to integers; see also `floor()`, `ceiling()`
  - `factorial()`: factorial function

- Minimization and Maximization
`nlm()` and `optim()` both find minima:

```r
f <- function(x) {
  return(x^2 - sin(x))
}
d <- data.frame(x = seq(-10, 10, 0.01))
d$y <- f(d$x)
```

```r
library(ggplot2)
qplot(x, y, data = d)
```

- Plot of x,y
- Using nlm() (Newton algorithm only)

```r
nlm(f, 5)
```

```
$minimum
[1] -0.2324656
$estimate
[1] 0.4501833
$gradient
[1] 4.501954e-07
$code
[1] 1
$iterations
[1] 6
```

- Using optim()

```r
optim(5, f, method = "Brent", lower = -100, upper = 100)
```

```
$par
[1] 0.4501836
$value
[1] -0.2324656
$counts
function gradient 
      NA       NA 
$convergence
[1] 0
$message
NULL
```

- Symbolic math: differentiation
There are packages to do symbolic math (and other open source alternatives, eg sage), but base R has `D()`:

```r
D(expression( x^2 + 2*x ), "x")
```

```
2 * x + 2
```

```r
D(expression( exp(x^2) ), "x")
```

```
exp(x^2) * (2 * x)
```

```r
D(expression(log(2*x) + x^2), "x")
```

```
2/(2 * x) + 2 * x
```

- D() continued

```r
curve( x^2 + 2*x, 0, 10)
```

- D() continued II

```r
curve(eval(D(expression( x^2 + 2*x ), "x")), 0, 10)
```

- Simple calculus: numeric integration with `integrate()`

```r
f <- function(x) x^2
integrate(f, 0, 1)
```

```
0.3333333 with absolute error < 3.7e-15
```

### 10.2 Statistical distributions

- Distribution functions

| Distribution | pdf | cdf | quantiles | rand. numbers |
|--------------|-----|-----|-----------|---------------|
| Uniform | dunif() | punif() | qunif() | runif() |
| Normal | dnorm() | pnorm() | qnorm() | rnorm() |
| Chi square | dchisq() | pchisq() | qchisq() | rchisq() |
| Binomial | dbinom() | pbinom() | qbinom() | rbinom() |
| Poisson | dpois() | ppois() | qpois() | rpois() |
| Student's t | dt() | pt() | qt() | rt() |
| F dist. | df() | pf() | qf() | rf() |

and more!

For continuous distributions, the most useful functions are the "p" (cumulative distribution) and "q" (quantile = inverse cdf) functions.
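Since "q" is the inverse of "p", composing them round-trips a value; a quick check:

```r
qnorm(pnorm(1.3))    # recovers 1.3
pnorm(qnorm(0.95))   # recovers 0.95
qnorm(0.975)         # the familiar 1.96 critical value
```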

- Using distributions
\(95^{th}\) percentile of the Chi square distribution with 2 degrees of freedom:

```r
qchisq(0.95, df=2)
```

```
[1] 5.991465
```

\(5^{th}\) and \(95^{th}\) percentiles of the normal distribution with mean zero and sd of one:

```r
qnorm( c(0.05, 0.95) )
```

```
[1] -1.644854  1.644854
```

- Example: Binomial Distribution
Probability of exactly 2 heads out of 12 coin flips:

```r
dbinom(2, size=12, prob=0.5)
```

```
[1] 0.01611328
```

Should be the same as the probability of 10 heads:

```r
dbinom(10, size=12, prob=0.5)
```

```
[1] 0.01611328
```

Probability of 10 or fewer heads:

```r
pbinom(10, size=12, prob=0.5)
```

```
[1] 0.9968262
```

### 10.3 Randomization and simulation

- Random variates
Built-in random number generators for many statistical distributions.

- Expectation
Find \(E[\max(X,Y)]\) when X, Y are normal random variables with mean=0, sd=1:

```r
sum <- 0
nreps <- 100000
for (i in 1:nreps) {
  xy <- rnorm(2)
  sum <- sum + max(xy)
}
print(sum/nreps)
```

```
[1] 0.5663903
```

```r
# or store in a vector:
v <- replicate(nreps, max(rnorm(2)))
hist(v)
```

- Plot the results
- Setting the random number seed
Important for debugging!

```r
set.seed(100)
mean(rnorm(10000))
mean(rnorm(10000))
set.seed(100)
mean(rnorm(10000))
```

```
[1] 0.00387737
[1] -0.01183919
[1] 0.00387737
```

### 10.4 Linear Algebra

- Linear Algebra
Matrix multiplication:

```r
a <- matrix(c(1,3,2,4), ncol=2)
b <- matrix(c(1,0,-1,1), ncol=2)
a %*% b
```

```
     [,1] [,2]
[1,]    1    1
[2,]    3    1
```

- Linear algebra
Solve a system of equations:

\begin{array}{lcl} x + y & = & 4 \\ x - y & = & 2 \end{array}

```r
a <- matrix(c(1,1,1,-1), ncol=2)
b <- c(4, 2)
solve(a, b)
```

```
[1] 3 1
```

- Some linear algebra functions
  - `crossprod()`: inner (dot) product of two vectors (`t(x) %*% y`), not the vector cross product
  - `t()`: matrix transpose
  - `qr()`: QR decomposition
  - `chol()`: Cholesky decomposition
  - `det()`: determinant
  - `eigen()`: eigenvalues / eigenvectors
  - `diag()`: extracts the diagonal of a square matrix (useful for obtaining variances from a covariance matrix and for constructing a diagonal matrix)
  - `sweep()`: numerical analysis sweep operations (a more capable and complicated "apply"-type function)

## 11 Environment, Scope and Style

### 11.1 Scope

- Where do names apply?
- Nested functions

```r
w <- 12
f <- function(y) {
  d <- 8
  h <- function() {
    return(d * (w+y))
  }
  print(environment(h))
  return(h())
}
environment(f)
ls()
```

```
<environment: R_GlobalEnv>
[1] "d" "f" "i" "nreps" "sum" "v" "w" "x" "xy"
```

Note: confusingly, we refer to the top environment as .GlobalEnv in code.

- Scope hierarchy
- Scope hierarchy continued

```r
f <- function(y, ftn) {
  d <- 8
  print(environment(ftn))
  return(ftn(d, y))
}
h <- function(dd, yy) {
  return(dd * (w + yy))
}
w <- 12
f(3, h)
```

```
<environment: R_GlobalEnv>
[1] 120
```

### 11.2 Superassignment

- Scope and assignment
In general, R functions do not change variables at a higher scope:

```r
x <- 1
increment.x <- function() {
  print(x)
  x <- x+1
  print(x)
}
increment.x()
x
```

```
[1] 1
[1] 2
[1] 1
```

- Super assignment

```r
increment.x <- function() {
  x <<- x+1
}
x
increment.x()
x
```

```
[1] 1
[1] 2
```
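Superassignment is most often useful inside closures, where `<<-` writes to the enclosing function's environment rather than creating a global variable. A minimal sketch (the counter example is ours):

```r
makeCounter <- function() {
  count <- 0
  function() {
    count <<- count + 1   # updates count in makeCounter's environment
    count
  }
}
counter <- makeCounter()
counter()
counter()         # 2: state persists between calls
exists("count")   # FALSE: nothing leaked into the global environment
```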

- assign()
Write to upper-level variables using character strings for names:

```r
filelist <- list.files(pattern="*.org$")
#filelist
for (s in filelist) {
  assign(s, readLines(file(s)))
}
ls()
```

```
[1] "filelist" "r-course-outline.org" "s"
```

### 11.3 Programming style

- Programming style
- Good style aids maintenance
- Use comments to tell the readers what they need to know
- Use indentation to mark the levels of program control
- Use meaningful names
- Develop a convention for names (global variables vs other variables vs function names)
- Use standard idioms where appropriate
- Avoid unnecessary complexity. Avoid code repetition!

- Good style aids maintenance
- Coding style
- Function Documentation
For example, Google's style:

### 11.4 Craft and organization

- Beyond "style": crafting a program
- Modularization
Modularization is the process programmers use to break up complex programs into smaller more manageable procedures.

- Global variables and constants
It can be useful to put some infrequently changing constants at the top of a script file:

```r
GRID_SIZE <- 10
NUM_ITERATIONS <- 10000
```

Then use these global variables:

```r
makeGrid <- function(gridSize = GRID_SIZE) {
  g <- matrix(1:(gridSize^2), nrow=gridSize)
  return(g)
}
```

## 12 Strings

### 12.1 "Text" on computers

- How to represent characters?
Simple idea: a numeric code for each character (ASCII). This originated in work at Bell Labs around 1960.

- ASCII table
0–31 used for "unprintable" characters (BEEP! page feed, etc)

- All was good
Assuming you were an English speaker.

- ANSI standard
- Same as ASCII for 0-127
- But different "code pages" for upper set

- Unicode
Basic idea: two bytes per character. Then expanded to more (code points up to 3 bytes). Enough to store every character in every language.
  - Originally: \(2^{14} = 16,384\) code points (2 bits reserved)
  - Currently: 110,182 code points
  - Every letter maps to a "code point": a platonic ideal of a letter.
  - "A" is different from "a", but "A" in Times New Roman is the same as "A" in Arial
  - "Hello" = `U+0048 U+0065 U+006C U+006C U+006F`
  - Still the issue of exactly how to *store* the numbers

- Encodings
  - There were multiple standards for STORING the numbers. The original, UCS-2, uses 2 bytes.
  - Americans were lazy, so:

- UTF-8
Idea: use one byte for code points 0-127, and two or more bytes for points above that. UTF-8 is the de facto standard encoding for unicode.

You can read in utf-8 encoded text in R. In fact utf-8 may be your current default ("locale"):

```r
Sys.getlocale()
```

```
[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
```

- Unicode strings

```r
a <- "\U00B5"  # note that "\" is the escape char in R strings
a
b <- c("\U03C0", "π")
b
```

```
[1] "µ"
[1] "π" "π"
```

- You can use unicode (utf-8) in R scripts
But it may be awkward to enter depending on your keyboard and text editor:

```r
π <- 3.14
r <- 2
2*r*π
```

```
[1] 12.56
```

### 12.2 Strings

- Characters in R
- until 2004, R only used ASCII!
- R now supports UTF-8 (make sure "locale" is set correctly on your computer) — works a bit differently on Windows vs Linux
- So you can use Chinese, Russian, etc.
`Sys.setlocale("LC_CTYPE","chinese")`

- Character objects (strings)
A character object is used to represent string values in R. We convert objects into character values with the `as.character()` function:

```r
x <- as.character(pi)
x
typeof(x)
```

```
[1] "3.14159265358979"
[1] "character"
```

- Formatting strings
using paste():

```r
first <- "Dylan"
last <- "Schwilk"
n <- paste(first, last)
n
paste(last, first, sep=", ")
```

```
[1] "Dylan Schwilk"
[1] "Schwilk, Dylan"
```

- Formatting strings
using sprintf():

```r
s <- sprintf("%s has %d dollars", n, 50)
s
sprintf("%s has $%2.2f", n, 50)
```

```
[1] "Dylan Schwilk has 50 dollars"
[1] "Dylan Schwilk has $50.00"
```

- Substrings
`substr()` and `substr<-()`:

```r
substr(s, 7, 13)
```

```
[1] "Schwilk"
```

Assignment:

```r
substr(s, 7, 13) <- "SCHWILK"
s
substr(s, 7, 13) <- "???"
s
```

```
[1] "Dylan SCHWILK has 50 dollars"
[1] "Dylan ???WILK has 50 dollars"
```

- Substitution

```r
sub("???WILK", "S.", s, fixed=TRUE)
```

```
[1] "Dylan S. has 50 dollars"
```

### 12.3 String operations

- Converting strings

```r
tolower(s)
toupper(s)
```

```
[1] "dylan ???wilk has 50 dollars"
[1] "DYLAN ???WILK HAS 50 DOLLARS"
```

- Built-in string functions
  - `grep()`: searches for a substring, like the Linux command of the same name
  - `nchar()`: finds the length of a string
  - `paste()`: assembles a string from parts
  - `sprintf()`: also assembles a string from parts
  - `substr()`: extracts a substring
  - `strsplit()`: splits a string into substrings

- Example: testing for a file suffix

```r
# tests whether the file name fname has the suffix suffix,
# e.g. "abc" in "x.abc"
hasSuffix <- function(fname, suffix) {
  ncf <- nchar(fname)  # nchar() gives the string length
  dotpos <- ncf - nchar(suffix) + 1  # the suffix would start here if present
  # now check that it is at the end of the name
  return(substr(fname, dotpos, ncf) == suffix)
}
hasSuffix("file.csv", "csv")
hasSuffix("file.csv", "txt")
```

```
[1] TRUE
[1] FALSE
```
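The same test can also be written with a regular expression (covered in the next section); `$` anchors the match at the end of the string. This variant (our own `hasSuffix2`) is slightly stricter: it also requires the dot before the suffix:

```r
hasSuffix2 <- function(fname, suffix) {
  grepl(paste0("\\.", suffix, "$"), fname)   # escaped dot, end-of-string anchor
}
hasSuffix2("file.csv", "csv")   # TRUE
hasSuffix2("file.csv", "txt")   # FALSE
```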

- grep() and grepl()
Takes its name from the linux/unix command.

```r
v <- c("beach", "beech", "back", "buster", "abea")
grep("bust", v)
v[grep("bust", v)]
```

```
[1] 4
[1] "buster"
```

```r
grepl("bust", v)
```

```
[1] FALSE FALSE FALSE  TRUE FALSE
```

### 12.4 Regular expressions

- Regular Expressions
Sometimes we need to look for patterns, not specific strings

- "regex"
  - Comes from regular set theory (1950s)
  - A formal way of specifying a set of string matches
  - Really a whole mini computer language
  - Used in many command line tools (eg grep), and available at least as a library in most computer languages
  - Most text editors and word processors allow regex search and replace
- regex basics
  - Single character wildcard: "."
  - Single match from multiple options: `[a-z]` or `[acdf-h]`. Matches **one** of the characters in brackets or implied by the range
  - `[a-zA-Z]` should match any ASCII alphabetic letter

- Positional matching
  - `^` matches beginning of line
  - `$` matches end of line
  - `\b` matches a word boundary (space or punctuation, etc)

- Positional matching example

```r
x <- unlist(strsplit("Out in the west Texas town of El Paso", split=" "))
# capitalized words:
grep("^[A-Z]", x)
```

```
[1] 1 5 8 9
```

- Matching multiples
Modify preceding character or expression

- `*` matches any number of the preceding expression (zero or more!)
- `?` matches one or none
- `+` matches one or more
- `{n}` matches exactly n times
- `{n,}` matches n or more times
- `{n,m}` matches n to m times
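A small sketch of these quantifiers in action (`x` is a made-up vector, not from the slides). Each pattern is anchored with `^` and `$` so it must match the whole string:

```r
x <- c("ct", "cat", "caat", "caaat")
grep("^ca*t$", x)     # zero or more "a": all four match
grep("^ca?t$", x)     # zero or one "a": "ct", "cat"
grep("^ca+t$", x)     # one or more "a": "cat", "caat", "caaat"
grep("^ca{2}t$", x)   # exactly two: "caat"
grep("^ca{2,}t$", x)  # two or more: "caat", "caaat"
```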

- Grouping
- Regular expressions in R
  - R string functions (and functions in the `stringr` package) can handle regular expressions. Use fixed=TRUE to treat the pattern as a literal string instead.
  - R includes some predefined character classes (eg `[:digit:]`, `[:punct:]`). See `?regex`
  - In regexes, one must escape special characters with a backslash. But keep in mind that R will first interpret the string before it hands it to the regex engine. So to match a literal "?" use `\\?`

- Escaping example
```r
v <- c("The first", "firstofall", "first time", "Who's on first?", "bfirstb")
grep("first", v)
grep("first\\b", v)
grep("^first\\b", v)
```

```
[1] 1 2 3 4 5
[1] 1 3 4
[1] 3
```

### 12.5 The `stringr` Package

- Why the need? R string functions are inconsistent. Remedy: Hadley Wickham's `stringr` library.

```r
library(stringr)
s <- c("(806) 777-7777", "(401) 555-5555", "(806) 444-4444")
number_pattern <- "\\(([0-9]{3})\\) ([0-9]{3}-[0-9]{4})"
str_match(s, number_pattern)
```

```
     [,1]             [,2]  [,3]
[1,] "(806) 777-7777" "806" "777-7777"
[2,] "(401) 555-5555" "401" "555-5555"
[3,] "(806) 444-4444" "806" "444-4444"
```
- Basic string operations with stringr
- `str_c()`: like `paste`, but it uses the empty string as the default separator and silently removes zero length arguments
- `str_length()`: equivalent to `nchar`, but it preserves NAs and converts factors to characters (not integers)
- `str_sub()`: equivalent to `substr()`, but it returns a zero length vector if any of its inputs are zero length, and otherwise expands each argument to match the longest. It also accepts negative positions, which are counted from the end of the string. The end position defaults to -1, which corresponds to the last character.
- `str_sub<-`: equivalent to `substr<-`, but like `str_sub` it understands negative indices, and replacement strings do not need to be the same length as the string they are replacing

`str_c()`

```r
str_c("Hello", " ", "world", "!")
v <- c("A", "B", "C")
str_c(v, v)
```

```
[1] "Hello world!"
[1] "AA" "BB" "CC"
```

`str_length()`

```r
v <- factor(c("Aa", "Bb", "C", NA))
nchar(v)
str_length(v)
```

```
Error in nchar(v) : 'nchar()' requires a character vector
[1]  2  2  1 NA
```

`str_sub()`

```r
s <- "What do you do with a drunken sailor"
str_sub(s, -nchar("sailor"), -1) <- "student"
s
```

[1] "What do you do with a drunken student"

## 13 Debugging

### 13.1 Debugging approaches

- Debugging
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."

— Brian W. Kernighan.

- What is a bug?
- Code fails (produces error)
- Code runs but produces the wrong result

- Debugging
Differential diagnosis

- Characterize the error
- Localize the error
- Fix the error (and don't introduce a new bug!)

- Characterizing the bug
- Make the error reproducible
- Can we always get the error when re-running the same code and values?
- If we start the same code in a clean workspace, does the same thing happen?

- Bound the error
- How much can we change the inputs and get the same error? A different error?
- For what inputs (if any) does the bug go away?
- How big is the error?

- Get more information
Add extra output (e.g., number of optimization steps, did the loop converge, final value of optimized function)

- Localizing the bug
Tools:

- traceback (where did an error message come from?); print, warning, stopifnot (messages from the code as it goes)
- Trying controlled inputs
- Interactive debugging

### 13.2 `traceback()`

- traceback
- When an R function fails, an error is printed to the screen. Immediately after the error, you can call traceback() to see in which function the error occurred.
- The traceback() function prints the list of functions that were called before the error occurred.
- The functions are printed in reverse order.

- traceback example
```r
f <- function(x) {
  r <- x - g(x)
  r
}
g <- function(y) {
  r <- y * h(y)
  r
}
h <- function(z) {
  r <- log(z)
  if (r < 10) r^2 else r^3
}
f(-1)
traceback()
```

- traceback results
```
Error in if (r < 10) r^2 else r^3 (from #3) : missing value where TRUE/FALSE needed
In addition: Warning message:
In log(z) : NaNs produced
3: h(y) at #2
2: g(x) at #2
1: f(-1)
```

### 13.3 Interactive Debugging

- Interactive debugging
  `traceback()` does not always tell you where the error originated. To find which line causes the error, you may want to step through the function using `debug()`.
- `debug()` and `browser()`
  - `debug(foo)` flags the function `foo()` for debugging; `undebug(foo)` unflags it.
  - When a function is flagged for debugging, each statement in the function is executed one at a time. After a statement is executed, the function suspends and you can interact with the environment.
  - After you have obtained the necessary information from the environment, you can let the function execute the next statement.
  - You can call `browser()` from any point in your code to enter the debugger. This is useful if you know roughly where the bug is and don't want to start at the beginning of the function.

- debug example
```r
## buggy version:
mateVectors <- function(x, y, r) {
  child <- y
  usingX <- runif(1) <= 0.5
  for (i in x) {
    if (usingX) {
      child[i] <- x[i]
    }
    if (runif(1) <= r) {
      !usingX
    }
  }
  return(child)
}
```

- Trying the buggy version

```r
set.seed(1001)
mateVectors(1:10, 11:20, 0.4)
mateVectors(c("A", "B", "C"), c("a", "b", "c"), 0.3)
```

```
 [1] 11 12 13 14 15 16 17 18 19 20
                A   B   C
"a" "b" "c"  NA  NA  NA
```

- Let's debug this function
- Debugging using RStudio
RStudio includes a visual debugger. Have you noticed the "environment" window? It shows values for variables in the current environment. When debug is active, this shows the function being debugged.

- RStudio’s error inspector and `traceback()`
- RStudio’s "Rerun with Debug" tool and `options(error = browser)`, which open an interactive session where the error occurred
- RStudio’s breakpoints and `browser()`, which open an interactive session at an arbitrary location in the code

## 14 The Grammar of Graphics and ggplot I

### 14.1 Grammar of graphics

- Happy Ada Lovelace Day
- Ada Lovelace day today
- She wrote what is considered the first computer program in 1843.
- Some fun reading: https://www.brainpickings.org/2015/06/15/the-thrilling-adventures-of-lovelace-and-babbage-sydney-padua/

- Grammar of Graphics
  **The Grammar of Graphics.** Leland Wilkinson, 1999, 2005. Springer.
- What is a "plot"?
A plot is

- A data set and a set of mappings from variables to aesthetics
- One or more layers
- One scale for each aesthetic mapping
- A coordinate system (we won't worry about this and just use cartesian coordinates!)
- The faceting specification (how many subplots)

- What is a scatterplot?
A simple dataset. Plot A vs C with symbol shape determined by categorical variable D.

```
A  B  C  D
2  3  4  a
1  2  1  a
4  5 15  b
9 10 80  b
```

- Map variables to aesthetic space
- Next steps
- The parts
- The plot
- Grammar
We can produce grammatically correct plots that are nonsensical, just as in English we can produce grammatically correct nonsense

- Nonsense plot
ggplot(mpg, aes(displ, hwy)) + geom_line()

- Learning ggplot2
- ggplot2 mailing list, http://groups.google.com/group/ggplot2
- stackoverflow, http://stackoverflow.com/tags/ggplot2
- Cookbook for common graphics, http://wiki.stdout.org/rcookbook/Graphs/

- ggplot2 book
*ggplot2: Elegant Graphics for Data Analysis*by Hadley Wickham is free from Springer on TTU network: http://www.springerlink.com/content/978-0-387-98140-6/

### 14.2 Components of a plot

- Components of a plot
- Scales, facets, coordinate system
- We need a scale for each aesthetic in the plot. A scale operates across all data in one plot.
- A scale is a function and its inverse, along with some parameters
- Example: a color gradient scale
- The inverse gives us the axes (for position scales) or the legend (for other scales such as colors or shapes)

- You've seen facets and we will stick with cartesian coordinates for now.

- Layers
composed of five parts:

- data
- aesthetic mapping (what is x, y, color, shape, etc).
`aes()`

command - statistical transformation (
`stat`

) - geometric object (
`geom`

) (geoms have default stats and vice versa) - a position adjustment

### 14.3 Building a plot by layers

- ggplot() function
Use the diamonds data set

```r
p <- ggplot(diamonds, aes(carat, price, color = cut))
# p
```

Blank plot

- Adding layers
```r
p <- p + layer(geom = "point", stat = "identity", position = "identity")
p
```

- More complicated plot
```r
p <- ggplot(diamonds, aes(x = carat))
p <- p + layer(geom = "bar", stat = "bin", position = "identity",
               params = list(fill = "steelblue", binwidth = 1))
p
```

- Result:
- Shortcuts!
Every geom is associated with a default statistic and position, and every statistic with a default geom:

```r
p <- ggplot(diamonds, aes(x = carat))
p <- p + geom_histogram(binwidth = 1, fill = "steelblue")
p
```

- Result:
- ggplot book
http://www.springerlink.com/content/978-0-387-98140-6/

Tables 4.2 and 4.3 have default stats and geoms

- Layer shortcuts
  - Shortcut functions
    - `geom_XXX(mapping, data, ..., stat, position)`
    - `stat_XXX(mapping, data, ..., geom, position)`
  - Arguments:
    - mapping: a set of aesthetic mappings created with `aes()`
    - data: a dataset to override the default one
    - `...`: parameters for the geom or the stat
    - geom or stat: to override the defaults
    - position: method for adjusting overlapping objects
- Layers are objects
```r
bestfit <- geom_smooth(method = "lm", color = "steelblue",
                       size = 2, se = FALSE)
qplot(cty, hwy, data = mpg) + bestfit
```

- Result:

### 14.4 Aesthetics

- Aesthetic mappings
Refer to variables in the dataset only! So, back to our mpg data:

```r
a <- aes(x = displ, y = hwy, color = class)
p <- ggplot(mpg, a) + geom_point()
p
```

- Result:
- Mapping vs setting
Because R can handle arguments by position or name this can be confusing:

```r
p <- ggplot(mpg, aes(cty, displ))
# set point color to dark blue
p + geom_point(color = "darkblue")
# map new unnamed variable with value "darkblue" to color
p + geom_point(aes(color = "darkblue"))
# more useful:
p + geom_point(aes(color = cyl))
```

- Individual vs collective geoms
- Individual: Graphical object for each row (eg points)
- Collective: statistical summary, polygon geom
- Lines and smoothed paths somewhere in between

- Grouping
We'll use the Oxboys data set in nlme

```r
library(nlme)
# ?Oxboys
# head(Oxboys)
# class(Oxboys)
p <- ggplot(Oxboys, aes(age, height, group = Subject)) + geom_line()
p
```

- Result:
- Different groups for different layers
```r
# try adding lm line for all data
p + geom_smooth(method = "lm", se = FALSE)
# same as:
p + geom_smooth(aes(group = Subject), method = "lm", se = FALSE)
# what we want:
p + geom_smooth(aes(group = NULL), method = "lm", se = FALSE)
```

- Default groups on categorical variables
```r
p <- ggplot(Oxboys, aes(Occasion, height)) + geom_boxplot()
p + geom_line(aes(group = Subject), color = "#3366FF")
```

- Result:
- Violin plot:
ggplot(mpg, aes(class,hwy)) + geom_violin()

- Tables of geoms and stats (layer functions)
- geoms and stats: http://docs.ggplot2.org/current/
- aesthetic specifications: http://docs.ggplot2.org/current/vignettes/ggplot2-specs.html

- Stats
Default stat for histogram geom is bin

```r
ggplot(diamonds, aes(carat)) +
  geom_histogram(aes(y = ..density..), binwidth = 0.1)
## in qplot:
qplot(carat, ..density.., data = diamonds, geom = "histogram", binwidth = 0.1)
## with counts
ggplot(diamonds, aes(carat)) +
  geom_histogram(aes(y = ..count..), binwidth = 0.1)
```

- Position adjustment
```
Adjustment  Description
dodge       Adjust position by dodging overlaps on sides
fill        Stack overlapping objects and standardize to equal height
identity    Don't adjust position
jitter      Jitter points to avoid overlap
stack       Stack overlapping objects on top of one another
```

- Position examples
```r
# stack
p <- ggplot(diamonds, aes(carat))
p + stat_bin(aes(ymax = ..count.., fill = cut), binwidth = 0.2,
             position = "stack")
# fill
p + stat_bin(aes(ymax = ..count.., fill = cut), binwidth = 0.2,
             position = "fill")
# dodge
p + stat_bin(aes(ymax = ..count.., fill = cut), binwidth = 0.2,
             position = "dodge")
# identity (use line not area)
p + stat_bin(aes(ymax = ..count.., color = cut), geom = "line",
             binwidth = 0.2, position = "identity")
```

- Writing your own layer functions
```r
stat_sum_single <- function(fun, ...) {
  stat_summary(fun.y = fun, colour = "gray15", geom = "point", size = 4, ...)
}
se_bar <- function() {
  stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.15, size = 1)
}
p <- ggplot(iris, aes(Species, Petal.Length)) + se_bar() + stat_sum_single(mean)
p
p + stat_sum_single(median, size = 3, shape = 5)
```

- Result:

## 15 The Grammar of Graphics and ggplot II

### 15.1 Scales

- Scales
- Control how data is mapped to perceptual properties, and produce guides (axes and legends) which allow us to read the plot.
- Important parameters: name, breaks & labels, limits.
- Naming scheme: `scale_aesthetic_name`. All default scales have "continuous" or "discrete" as the name part.

- Scales
Position scales, color scales, manual scales, identity scale

- name: a text string or mathematical expression (see `?plotmath`). Use `labs()`, `ylab()`, `xlab()`
- limits: the domain of the scale. Continuous scales take a vector of length 2, eg `c(4, 10)`
- breaks: sets the values that appear on the axis or the legend (tick marks)

- Scales examples
```r
p <- qplot(cty, hwy, data = mpg, color = displ)
p
p + scale_x_continuous("City mpg")
p + xlab("City mpg")
p + ylab("Highway mpg")
p + labs(x = "City mpg",
         y = expression(paste("Highway ", frac(miles, gal))),
         color = "Displacement (l)")
```

- Result
- Scale transformations
"transformers":

- Manage the transformation and its inverse for legends and values
- Common ones: log, log10, logit, sqrt, exp, reverse
- Shorthand for common ones: `scale_x_log10()` is equivalent to `scale_x_continuous(trans = "log10")`

- More scales examples
```r
library(scales)   # for math_format()
library(stringr)  # for str_c
p <- qplot(carat, price, data = diamonds)
p + scale_x_log10() + scale_y_log10()
log10breaks <- seq(-0.8, 0.8, 0.2)
p + scale_x_log10(breaks = 10^(log10breaks),
                  labels = parse(text = str_c("10^", log10breaks))) +
  scale_y_log10()
# natural log
p + scale_x_continuous(trans = "log",
                       breaks = trans_breaks("log", "exp"),
                       labels = trans_format("log", math_format(e^.x)))
```

- Result

### 15.2 Themes

- ggplot Themes
```r
p + theme_bw()
theme_nogrid <- theme_bw() +
  theme(axis.title = element_text(size = 16),
        axis.text = element_text(size = 13),
        panel.grid.major = element_blank(),
        panel.grid.minor = element_blank(),
        panel.background = element_blank())
p + theme_nogrid + xlab("Carat") +
  scale_y_continuous("Price ($)", labels = comma)
# ?theme
```

- Result:
- Some more resources for ggplot tweaking
- ggplot2 themes: http://docs.ggplot2.org/dev/vignettes/themes.html
- Tweaking text on ggplot plots: http://www.r-bloggers.com/textual-healing

### 15.3 Maps

- Maps in ggplot
ggmap by David Kahle and Hadley Wickham: http://journal.r-project.org/archive/2013-1/kahle-wickham.pdf

```r
library(maps)
library(ggmap)
library(RColorBrewer)
county_df <- map_data('county')
tx.counties <- subset(county_df, region == "texas")
names(tx.counties)[6] <- "county"
```

- Maps in ggplot: TX population
```r
tx.pop <- read.csv("http://demographics.texas.gov/Resources/TPEPP/Estimates/2014/2014_txpopest_county.csv",
                   stringsAsFactors = FALSE)
tx.pop$county <- tolower(tx.pop$county)
tx.pop <- merge(tx.counties, tx.pop)
head(tx.pop, 3)
```

```
    county      long      lat group order region FIPS census_2010_count
1 anderson -95.75271 31.53560  2492 74520  texas    1             58458
2 anderson -95.76989 31.55852  2492 74521  texas    1             58458
3 anderson -95.76416 31.58143  2492 74522  texas    1             58458
  july1_2014_pop_est jan1_2015_pop_est num_chg_10_14 num_chg_10_15
1              58501             58243            43          -215
2              58501             58243            43          -215
3              58501             58243            43          -215
  pct_chg_10_14 pct_chg_10_15
1           0.1          -0.4
2           0.1          -0.4
3           0.1          -0.4
```

- TX population map
```r
tx.map <- ggplot(tx.pop, aes(x = long, y = lat, group = county))
tx.map <- tx.map + geom_path(color = "gray")
tx.map <- tx.map + geom_polygon(aes(fill = log10(jan1_2015_pop_est)))
tx.map
```

- Result:
- With google maps
```r
goog.tx <- get_map(location = "Texas", zoom = 6)
tx.map2 <- ggmap(goog.tx)
tx.map2 <- tx.map2 + geom_path(aes(x = long, y = lat, group = county),
                               color = "gray", data = tx.pop)
tx.map2 <- tx.map2 + geom_polygon(aes(x = long, y = lat, group = county,
                                      fill = log10(jan1_2015_pop_est)),
                                  alpha = 0.80, data = tx.pop)
llabel <- expression(paste("Population (", log[10], ")"))
tx.map2 + scale_fill_gradientn(llabel, colours = brewer.pal(5, "OrRd"))
```

- Result:

## 16 Shaping data and `dplyr`

### 16.1 Shaping data:

### 16.2 The `dplyr`

package

- dplyr
  `dplyr` includes a family of individually simple functions that all have the same structure and all operate on data frames. They employ a bit of magic for convenience that allows non-standard evaluation.

  Five important verbs (functions):
  - `select` certain columns of data
  - `filter` your data to select specific rows
  - `arrange` the rows of your data into an order
  - `mutate` your data frame to contain new columns
  - `summarize` chunks of your data

- data frames vs "tibbles"
  Packages by Hadley Wickham in the "tidyverse" (or "hadleyverse") use "tibbles" instead of R's traditional data.frame. Tibbles are data frames, but they tweak some older behaviors to make life a little easier.

  You can load the package explicitly with `library(tibble)`, but you rarely need to do this because dplyr and other packages will load it for you.

- Structure
- All verbs take a data frame as the first argument
- Non standard evaluation: reduces typing. But can be dangerous in functions. There are also standard evaluation forms of each function. See https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html

- Some sample data
The Batting data set (in the

`Lahman`

package) contains the batting records for all professional US players. For more information use the command`?Batting`

```r
library(dplyr)
library(Lahman)  # baseball dataset
names(Batting)
```

```
 [1] "playerID" "yearID" "stint" "teamID" "lgID" "G"
 [7] "AB" "R" "H" "X2B" "X3B" "HR"
[13] "RBI" "SB" "CS" "BB" "SO" "IBB"
[19] "HBP" "SH" "SF" "GIDP"
```

- ID variables
Factors by which we can group the data. For example, let's get mean stats (games, at bats, runs and hits, columns 6-9) for Babe Ruth across all years he played.

`colMeans( filter(Batting, playerID == "ruthba01")[6:9] )`

```
        G        AB         R         H
113.77273 381.72727  98.81818 130.59091
```

- Another ID variable: year
Let's get mean stats (games, at bats, runs and hits, columns 6-9) across all players for 1871.

colMeans( filter(Batting, yearID == 1871)[6:9] )

```
       G       AB        R        H
19.96522 94.10435 23.12174 26.96522
```

- Obtain summaries for groups
Finding summaries for one group is easy, but what if we want to find the same summaries for every group?

- use `dplyr`

```r
allyears <- select(Batting, yearID, G, AB, R, H)
years <- group_by(allyears, yearID)
yearstats <- summarize_each(years, funs(mean))
head(yearstats, 3)
```

```
Source: local data frame [3 x 5]

  yearID        G        AB        R        H
   <int>    <dbl>     <dbl>    <dbl>    <dbl>
1   1871 19.96522  94.10435 23.12174 26.96522
2   1872 21.19872 100.50641 21.73077 28.76282
3   1873 28.82400 135.79200 28.64000 39.38400
```

- Split-apply-recombine
When do we need to split-apply-recombine?

- During data preparation, when performing group-wise ranking, normalization, etc
- When creating summaries for display or analysis (marginal means, conditioning tables)
- During modelling, when fitting separate models to subsets of the data

### 16.3 Using dplyr functions

- filter
How many players in 1871?

`y1871 <- filter(Batting, yearID==1871) length(unique(y1871$playerID))`

[1] 115

- mutate and summarise functions
  - `mutate()` allows us to modify an existing data frame: a row of output per row of input
  - `summarise()`: a row of output per group
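A minimal sketch of the contrast on a made-up toy data frame (not from the slides; assumes `dplyr` is loaded):

```r
library(dplyr)
df <- data.frame(g = c("a", "a", "b"), x = c(1, 3, 5))
# mutate(): one row out per row in; the group mean is repeated within each group
mutate(group_by(df, g), gmean = mean(x))     # 3 rows
# summarise(): one row out per group
summarise(group_by(df, g), gmean = mean(x))  # 2 rows
```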

- Strategy for grouped calculations
- Extract a single group
- Solve problem for that one group — perhaps write a wrapper function that takes the group and returns the summary or transformation
- Use dplyr functions and `group_by` to solve it for all groups

- 1-2. Calculate career year for Babe Ruth
Number of years since player began.

```r
baberuth <- filter(Batting, playerID == "ruthba01")
baberuth <- mutate(baberuth, cyear = yearID - min(yearID) + 1)
head(baberuth, 5)[, c(2, 6:9, 23)]
```

```
  yearID  G  AB  R  H cyear
1   1914  5  10  1  2     1
2   1915 42  92 16 29     2
3   1916 67 136 18 37     3
4   1917 52 123 14 40     4
5   1918 95 317 50 95     5
```

- 3. Calculate career year for all players
```r
byplayer <- group_by(Batting, playerID)
Batting <- mutate(byplayer, cyear = yearID - min(yearID) + 1)
Batting <- arrange(Batting, playerID)
head(Batting)[, c(1, 2, 6:9, 23)]
```

```
Source: local data frame [6 x 7]
Groups: playerID [1]

   playerID yearID     G    AB     R     H cyear
      <chr>  <int> <int> <int> <int> <int> <dbl>
1 aardsda01   2004    11     0     0     0     1
2 aardsda01   2006    45     2     0     0     3
3 aardsda01   2007    25     0     0     0     4
4 aardsda01   2008    47     1     0     0     5
5 aardsda01   2009    73     0     0     0     6
6 aardsda01   2010    53     0     0     0     7
```

- More summaries: Teams per player
Total number of teams in the data frame

length(unique(Batting$teamID))

[1] 149

Number of teams for which each player played:

`nteams <- summarise(byplayer, nteams = length(unique(teamID))) head(nteams, 5)`

```
Source: local data frame [5 x 2]

   playerID nteams
      <chr>  <int>
1 aardsda01      8
2 aaronha01      3
3 aaronto01      2
4  aasedo01      5
5  abadan01      3
```

- Do players improve?
- Do players improve? Modeling for one player
Fitting a regression model relating runs per bat (rbi/ab) to career year

```r
mod.br <- lm(RBI/AB ~ cyear, data = baberuth)
mod.br
# summary(mod.br)
```

```
Call:
lm(formula = RBI/AB ~ cyear, data = baberuth)

Coefficients:
(Intercept)        cyear
   0.203200     0.003413
```

- Do players improve? Modeling for all
How to model this for all players?

```r
lmslope <- function(x, y) {
  linmod <- lm(y ~ x)
  coefs <- coef(linmod)
  return(coefs[2])
}
bb <- subset(Batting, AB >= 25)
bb <- mutate(group_by(bb, playerID), include = length(yearID) > 10)
bb <- filter(bb, include)
player_traj <- summarize(bb, slope = lmslope(cyear, RBI/AB))
qplot(slope, data = player_traj, binwidth = 0.005)
```

- Results:
- Maybe players get better, then worse again
```r
modq.br <- lm(RBI/AB ~ cyear + I(cyear^2), data = baberuth)
# mod.br
summary(modq.br)
```

```
Call:
lm(formula = RBI/AB ~ cyear + I(cyear^2), data = baberuth)

Residuals:
     Min       1Q   Median       3Q      Max
-0.10600 -0.01994  0.01035  0.04025  0.06291

Coefficients:
              Estimate Std. Error t value Pr(>|t|)
(Intercept)  0.1267872  0.0372004   3.408  0.00295 **
cyear        0.0225163  0.0074510   3.022  0.00701 **
I(cyear^2)  -0.0008306  0.0003146  -2.640  0.01613 *
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.05295 on 19 degrees of freedom
Multiple R-squared:  0.3592, Adjusted R-squared:  0.2917
F-statistic: 5.325 on 2 and 19 DF,  p-value: 0.01459
```

- Quadratic plot
`brplot + stat_smooth(method="lm", formula = y~x+I(x^2))`

- All players, quadratic term
```r
qtermmodel <- function(df) {
  mod <- lm(RBI/AB ~ cyear + I(cyear^2), data = df)
  coefs <- data.frame(t(coef(mod)))
  names(coefs) <- c("int", "slope", "qterm")
  return(coefs)
}
bcoefs <- do(bb, qtermmodel(.))
qplot(qterm, data = bcoefs, binwidth = 0.0005) + xlim(c(-0.015, 0.015))
```

- Results: all players, quadratic term

### 16.4 US baby names data

- US baby names data
- Download the data
- All data in zipfile: http://www.ssa.gov/oact/babynames/names.zip
- Zipped file contains one txt file (csv) for each year 1880-2013. Eg: "yob1885.txt"
- My script extracts this data, combines it, and saves the top 1000 names for each sex and year: ../assignments/hw10-clean-bnames.R. You don't need to run it.
- Saved data: ../assignments/top-1000-baby-names.csv

- Read data
```r
bnames <- read.csv("http://r-research-tool.schwilk.org/assignments/top-1000-baby-names.csv",
                   stringsAsFactors = FALSE)
bnames$sex <- factor(bnames$sex)
levels(bnames$sex) <- c("girl", "boy")
head(bnames, 3)
```

```
  name  sex year rank  percent
1 Mary girl 1881    1 7.524496
2 Anna girl 1881    2 2.934108
3 Emma girl 1881    3 2.212000
```

### 16.5 Explore data

- Name diversity
Investigate the percentage of all names that the top 1000 names occupy

```r
bn <- group_by(bnames, year, sex)
ofall <- summarize(bn, tprop = sum(percent))
qplot(year, tprop, data = ofall, colour = sex, geom = "line",
      ylab = "Percentage of all names")
```

- Result:
- Look at name trend
```r
a <- aes(year, percent, color = sex)
p <- ggplot(subset(bnames, name == "Dylan"), a) + geom_line()
p
```

- Some helper functions
```r
library(stringr)
# Return letter at position pos
letter <- function(string, pos) {
  return(str_sub(string, pos, pos))
}
# and some consistent axes
axes <- list(
  scale_x_continuous(breaks = c(1900, 1920, 1940, 1960, 1980, 2000)),
  scale_y_continuous("percent")
)
```

- Trend in first letter
```r
bn <- group_by(bnames, year, sex, first = letter(name, 1))
fl <- summarize(bn, percent = sum(percent))
ggplot(fl, a) + geom_line() + facet_wrap(~ first) + axes
```

- Result
- Trend in last letter
```r
bn <- group_by(bnames, year, sex, last = letter(name, -1))
ll <- summarize(bn, percent = sum(percent))
ggplot(ll, a) + geom_line() + facet_wrap(~ last) + axes
```

- Result
- Wow, rise in names that end in n
```r
lastn <- group_by(filter(bnames, letter(name, -1) == "n"), year)
lastn <- summarize(lastn, percent = sum(percent))
ggplot(lastn, aes(year, percent)) + geom_line() + axes
```

- Result

### 16.6 soundex

- What about names that sound alike?
see http://en.wikipedia.org/wiki/Soundex

```r
soundex <- function(name) {
  name <- toupper(name)
  name <- unlist(strsplit(name, ""))  # split into individual chars
  sdx <- name[1]
  dict <- c("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R", "AEIOUHWY")
  codes <- c("1", "2", "3", "4", "5", "6", ".")
  for (i in 2:length(name)) {
    code <- codes[grep(name[i], dict, fixed = TRUE)]
    if (code != str_sub(sdx, -1, -1)) {
      sdx <- str_c(sdx, code)
    }
  }
  sdx <- gsub(".", "", sdx, fixed = TRUE)
  sdx <- str_pad(sdx, 3, pad = "0")
  return(sdx)
}
```

- Get soundex for each name in data
```r
unames <- unique(bnames$name)
sdxs <- sapply(unames, soundex)
bnames$soundex <- sdxs[match(bnames$name, unames)]
bnames$soundend <- str_sub(bnames$soundex, 2)
```

- So what is the deal with names that end in n?
```r
sdylan <- subset(bnames, soundex == soundex("Dylan"))
unique(sdylan$name)
```

```
 [1] "Delma"   "Dillon"  "Delano"  "Delwin"  "Dylan"   "Delaney" "Dillan"
 [8] "Dillion" "Dylon"   "Dyllan"  "Dallin"  "Dilan"   "Daylen"
```

- Plot sound alikes
```r
sdy <- group_by(sdylan, sex, year)
sdylan.sum <- summarize(sdy, percent = sum(percent))
ggplot(sdylan.sum, a) + geom_line() + axes
```

- Result
- Maybe all soundexs that end in "45" + "n"?
```r
sy <- group_by(bnames, year, sex)
s45 <- summarize(subset(sy, soundend == "45" & letter(name, -1) == "n"),
                 percent = sum(percent))
ggplot(s45, a) + geom_line() + axes
```

- Result

## 17 Tidy Data

### 17.1 Tidy data

- First an introduction to the pipe
  The `%>%` pipe is in the magrittr package (required by dplyr and re-exported). It is a bit like the `|` pipe in the unix command line. It simply lets the left hand object replace the first argument of the right hand function.

```r
library(dplyr)
mtcars %>% filter(mpg > 30)
```

```
   mpg cyl disp  hp drat    wt  qsec vs am gear carb
1 32.4   4 78.7  66 4.08 2.200 19.47  1  1    4    1
2 30.4   4 75.7  52 4.93 1.615 18.52  1  1    4    2
3 33.9   4 71.1  65 4.22 1.835 19.90  1  1    4    1
4 30.4   4 95.1 113 3.77 1.513 16.90  1  1    5    2
```

- Causes of messiness
- Example of messy data
- Data storage in "long format"

```
Storage       Meaning
Table / File  Data set
Rows          Observation
Columns       Variables
```

- Common causes of messiness
- column headers are values, not variable names
- multiple variables are stored in one column
- variables are stored in both rows and columns
- multiple types of experimental unit stored in the same table
- one type of experimental unit stored in multiple tables

- Some tools
```r
library(tidyr)
?gather
?separate
?spread
library(stringr)
?str_replace
?str_sub
?str_split_fixed
library(dplyr)
?arrange
```

### 17.2 Long vs wide data

- You've seen long vs wide data formats

```
date   species psi.pd psi.md RH.pd RH.md
110810 JUDE2     -2.0   -4.5    88    50
110810 QUGR      -1.0   -2.2    90    40
110810 PICE      -0.4   -1.7    91    33
```

  Useful for regressing psi.md on psi.pd, but not useful if we want to facet on time of day. So is this tidy? It depends on whether we think of psi.pd and psi.md as separate variables or not.

- Long format
```
date   species time  psi RH
110810 JUDE2   pd   -2.0 88
110810 JUDE2   md   -4.5 90
110810 QUGR    pd   -1.0 91
110810 QUGR    md   -2.2 50
110810 PICE    pd   -0.4 40
110810 PICE    md   -1.7 33
```

- Conceptual framework
- Identifer (id) variables identify the unit that measurements take place on. ID variables are usually discrete, and are typically fixed by design.
- Measured variables represent what is measured on that unit.

- Further step
In truly "long" format there are only id variables and a value column!

```
date   species time variable value
110810 JUDE2   pd   psi      -2.0
110810 JUDE2   md   psi      -4.5
110810 QUGR    pd   psi      -1.0
110810 QUGR    md   psi      -2.2
110810 PICE    pd   psi      -0.4
110810 PICE    md   psi      -1.7
110810 JUDE2   pd   RH       88
110810 JUDE2   md   RH       90
110810 QUGR    pd   RH       91
110810 QUGR    md   RH       50
110810 PICE    pd   RH       40
110810 PICE    md   RH       33
```

- Problem: Column headers are values, not variable names

```r
raw <- read.delim("http://r-research-tool.schwilk.org/lectures/data/pew.txt",
                  check.names = FALSE, stringsAsFactors = FALSE)
head(raw)
```

```
            religion <$10k $10-20k $20-30k $30-40k $40-50k $50-75k $75-100k
1           Agnostic    27      34      60      81      76     137      122
2            Atheist    12      27      37      52      35      70       73
3           Buddhist    27      21      30      34      33      58       62
4           Catholic   418     617     732     670     638    1116      949
5 Don’t know/refused    15      14      15      11      10      35       21
6   Evangelical Prot   575     869    1064     982     881    1486      949
  $100-150k >150k Don't know/refused
1       109    84                 96
2        59    74                 76
3        39    53                 54
4       792   633               1489
5        17    18                116
6       723   414               1529
```

- Problem: Multiple variables in one column
The WHO tuberculosis data:

```r
library(stringr)
raw <- read.csv("http://r-research-tool.schwilk.org/lectures/data/tb-simple-2011.csv",
                stringsAsFactors = FALSE)
raw %>% select(1:7) %>% tail()
```

```
     iso2 year new_sp new_sp_m04 new_sp_m514 new_sp_m014 new_sp_m1524
6801   ZW 2006  12718         NA          NA         215          736
6802   ZW 2007  10583          6         132         138          500
6803   ZW 2008   9830         NA          NA         127          614
6804   ZW 2009  10195         NA          NA         125          578
6805   ZW 2010  11654         20         130         150          710
6806   ZW 2011  12596         30         122         152          784
```

```r
## get rid of the "overall count" for now
raw$new_sp <- NULL
## rename rest of columns for clarity
names(raw) <- str_replace(names(raw), "new_sp_", "")
```

### 17.3 Reshaping with tidyr

- Variables in rows and columns
Example data, photosynthesis and respiration in two species

```r
library(tidyr)
tab1 <- read.csv("http://r-research-tool.schwilk.org/lectures/data/carex.csv")
head(tab1)
```

```
  Species Treatment   A.1    A.2    A.3    A.4    R.1    R.2    R.3    R.4
1      Pa      0.15  8.36  7.632  5.522  6.300 0.7358 0.5870 0.6236 0.4354
2      Pa      3.00 11.06 11.540  7.914  7.914     NA     NA     NA     NA
3      Pa     15.00 22.32 18.640 17.200 18.400 1.1680 0.9600 1.5460     NA
4      Cs      0.15 15.60 10.240 13.100 10.480 1.4840 0.7682 0.6236 1.3060
5      Cs      3.00 13.74 15.240 13.920 10.500     NA     NA     NA     NA
6      Cs     15.00  9.37 18.760 15.600 10.660 1.0360 1.0640 0.5276 0.8990
```

But A.1, A.2, A.3 and A.4 all represent measurements of the same variable A (photosynthesis)

- Gather data (wide to long format)
```r
tab.tidier <- tab1 %>% gather(measurement, value, -Species, -Treatment)
head(tab.tidier)
```

```
  Species Treatment measurement value
1      Pa      0.15         A.1  8.36
2      Pa      3.00         A.1 11.06
3      Pa     15.00         A.1 22.32
4      Cs      0.15         A.1 15.60
5      Cs      3.00         A.1 13.74
6      Cs     15.00         A.1  9.37
```

- Separate replicate id from measurement type
```r
tab.tidy <- tab.tidier %>%
  separate(measurement, into = c("measurement", "rep"), sep = "\\.")
head(tab.tidy)
```

```
  Species Treatment measurement rep value
1      Pa      0.15           A   1  8.36
2      Pa      3.00           A   1 11.06
3      Pa     15.00           A   1 22.32
4      Cs      0.15           A   1 15.60
5      Cs      3.00           A   1 13.74
6      Cs     15.00           A   1  9.37
```

In more complicated cases you may need to use the stringr library and regular expressions to pull apart column data. See also the `convert` argument.

- Spreading data (long to wide)
Why?

```r
tab.spread <- tab.tidy %>% spread(measurement, value)
head(tab.spread)
```

  Species Treatment rep     A      R
1      Cs      0.15   1 15.60 1.4840
2      Cs      0.15   2 10.24 0.7682
3      Cs      0.15   3 13.10 0.6236
4      Cs      0.15   4 10.48 1.3060
5      Cs      3.00   1 13.74     NA
6      Cs      3.00   2 15.24     NA

- What to do with NAs?
- In long data, some NAs can be implicit (missing rows), but when spread to a wider form we need to make them explicit since there is a cell to fill.
- In other cases, an NA in one form can imply a zero in another aggregate form (e.g., oak cover on my Chisos transects!).
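A minimal base R sketch of the implicit-vs-explicit NA point above (using `stats::reshape` rather than tidyr, with a made-up two-measurement data set):

```r
# Hypothetical long data: the "R" measurement for id 2 was never
# recorded, so its absence is only an implicit NA (a missing row).
long <- data.frame(id = c(1, 1, 2),
                   measurement = c("A", "R", "A"),
                   value = c(8.4, 0.74, 11.1))
# Spreading to wide form (base R's reshape(); tidyr::spread is analogous)
# forces the NA to become explicit: there is now a cell that must be filled.
wide <- reshape(long, idvar = "id", timevar = "measurement",
                direction = "wide")
wide$value.R  # the id 2 entry is NA, now explicit
```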

## 18 Dates and Times

### 18.1 How to represent time

- Time is tricky to represent
Time is a deceptively complicated system to work with. Do we want it to behave like a number line, or according to complicated conventions?

- Instants
- An instant refers to a specific moment of time.
- Instants are a combination of year, month, day, hour, minute, second, and time zone. e.g.,
`2011-03-28 08:44:32 CST`

- There are two main time data types in R:

- POSIXct
The POSIXct data type is the number of seconds since the start of January 1, 1970. Negative numbers represent the number of seconds before this time, and positive numbers represent the number of seconds afterwards.

- POSIXlt
The POSIXlt data type is a list, and its entries have the following meanings:

- seconds
- minutes
- hours
- day of month (1-31)
- month of the year (0-11)
- years since 1900
- day of the week (0-6 where 0 represents Sunday)
- day of the year (0-365)
- Daylight savings indicator (positive if it is daylight savings)

### 18.2 lubridate

- lubridate package
More info at http://www.jstatsoft.org/v40/i03/

library(lubridate)
# lubridate turns strings into instants with functions that have
# y, m, and d in their names. Use the function whose name matches
# the order of the elements in the date.
ymd("2011-03-28")
mdy("03/28/2011")
dmy("28032011")
ymd_hms("2011:03:28 08:50:30", tz="CST")  # order matters
dmy("05-03-2011")
mdy("05-03-2011")

- What type of objects do these return?
# what is the object type (class)?
class(mdy("05-03-2011"))

[1] "Date"

- Example use
# example:
soil <- read.csv("http://r-research-tool.schwilk.org/lectures/data/soil-sensors.csv",
                 stringsAsFactors = FALSE,
                 na.strings = c("NA", "#N/A", ""))
names(soil) <- c("time", "p1_psi", "p2_VWC", "p3_VWC", "p4_psi")
head(soil)
class(soil$time)

                  time p1_psi p2_VWC p3_VWC p4_psi
1 24 May 2012 12:00 PM    -36  0.230  0.338    -23
2 24 May 2012 12:30 PM    -36  0.229  0.338    -23
3 24 May 2012 01:00 PM    -36  0.226  0.337    -23
4 24 May 2012 01:30 PM    -37  0.224  0.337    -23
5 24 May 2012 02:00 PM    -37  0.222  0.336    -23
6 24 May 2012 02:30 PM    -38  0.220  0.335    -24
[1] "character"

- Your turn
Convert time column into date-time objects

- Manipulating instants
now() > ymd("2014-11-10")
today() > ymd("2011-03-29")  # date object
# oops
floor_date(now(), "day") > ymd("2011-03-29", tz="CST")

[1] TRUE
[1] TRUE
[1] TRUE

- Rounding times
# round
floor_date(now(), "month")
ceiling_date(now(), "month")
round_date(now(), "month")
# extract
year(now())

[1] "2016-12-01 CST"
[1] "2017-01-01 CST"
[1] "2016-12-01 CST"
[1] 2016

- Accessing components
| Date component | Accessor   |
|----------------|------------|
| Year           | `year()`   |
| Month          | `month()`  |
| Week           | `week()`   |
| Day of year    | `yday()`   |
| Day of month   | `mday()`   |
| Day of week    | `wday()`   |
| Hour           | `hour()`   |
| Minute         | `minute()` |
| Second         | `second()` |
| Time zone      | `tz()`     |

- Manipulating instants
now()
year(now())
hour(now())
day(now())
yday(now())

[1] "2016-12-01 12:04:47 CST"
[1] 2016
[1] 12
[1] 1
[1] 336

- Instants, continued
wday(now())
wday(now(), label = TRUE)
wday(now(), label = TRUE, abbr = FALSE)
month(now(), label = TRUE, abbr = FALSE)

[1] 5
[1] Thurs
Levels: Sun < Mon < Tues < Wed < Thurs < Fri < Sat
[1] Thursday
7 Levels: Sunday < Monday < Tuesday < Wednesday < Thursday < ... < Saturday
[1] December
12 Levels: January < February < March < April < May < June < ... < December

- Upon which day of the week were you born?
- Changing components
# Accessor functions can also be used to change the
# elements of a date-time
ti <- now()
year(ti) <- 1999
hour(ti) <- 23
day(ti) <- 45
tz(ti) <- "UTC"
ti

[1] "2000-01-14 23:04:47 UTC"

- Aggregate
## using dplyr:
library(dplyr)
library(ggplot2)
soil$date <- floor_date(dmy_hm(soil$time), "day")
psi.day <- soil %>% group_by(date) %>%
  summarize(avepsi = mean(p1_psi, na.rm=TRUE))
ggplot(psi.day, aes(date, avepsi)) + geom_line()

- Result

### 18.3 Time Spans

- Time spans
#?dseconds
dseconds(140)
dhours(32)
ddays(2)
class(ddays(1))

[1] "140s (~2.33 minutes)"
[1] "115200s (~1.33 days)"
[1] "172800s (~2 days)"
[1] "Duration"
attr(,"package")
[1] "lubridate"

- Let's use R to set our alarm clock
ymd_hms("2014-01-01 08:30:00", tz = "") + ddays(0:30) # cool

 [1] "2014-01-01 08:30:00 CST" "2014-01-02 08:30:00 CST"
 [3] "2014-01-03 08:30:00 CST" "2014-01-04 08:30:00 CST"
 [5] "2014-01-05 08:30:00 CST" "2014-01-06 08:30:00 CST"
 [7] "2014-01-07 08:30:00 CST" "2014-01-08 08:30:00 CST"
 [9] "2014-01-09 08:30:00 CST" "2014-01-10 08:30:00 CST"
[11] "2014-01-11 08:30:00 CST" "2014-01-12 08:30:00 CST"
[13] "2014-01-13 08:30:00 CST" "2014-01-14 08:30:00 CST"
[15] "2014-01-15 08:30:00 CST" "2014-01-16 08:30:00 CST"
[17] "2014-01-17 08:30:00 CST" "2014-01-18 08:30:00 CST"
[19] "2014-01-19 08:30:00 CST" "2014-01-20 08:30:00 CST"
[21] "2014-01-21 08:30:00 CST" "2014-01-22 08:30:00 CST"
[23] "2014-01-23 08:30:00 CST" "2014-01-24 08:30:00 CST"
[25] "2014-01-25 08:30:00 CST" "2014-01-26 08:30:00 CST"
[27] "2014-01-27 08:30:00 CST" "2014-01-28 08:30:00 CST"
[29] "2014-01-29 08:30:00 CST" "2014-01-30 08:30:00 CST"
[31] "2014-01-31 08:30:00 CST"

- A problem:
## what about: ymd_hms("2014-03-01 08:30:00", tz = "") + ddays(0:30)

 [1] "2014-03-01 08:30:00 CST" "2014-03-02 08:30:00 CST"
 [3] "2014-03-03 08:30:00 CST" "2014-03-04 08:30:00 CST"
 [5] "2014-03-05 08:30:00 CST" "2014-03-06 08:30:00 CST"
 [7] "2014-03-07 08:30:00 CST" "2014-03-08 08:30:00 CST"
 [9] "2014-03-09 09:30:00 CDT" "2014-03-10 09:30:00 CDT"
[11] "2014-03-11 09:30:00 CDT" "2014-03-12 09:30:00 CDT"
[13] "2014-03-13 09:30:00 CDT" "2014-03-14 09:30:00 CDT"
[15] "2014-03-15 09:30:00 CDT" "2014-03-16 09:30:00 CDT"
[17] "2014-03-17 09:30:00 CDT" "2014-03-18 09:30:00 CDT"
[19] "2014-03-19 09:30:00 CDT" "2014-03-20 09:30:00 CDT"
[21] "2014-03-21 09:30:00 CDT" "2014-03-22 09:30:00 CDT"
[23] "2014-03-23 09:30:00 CDT" "2014-03-24 09:30:00 CDT"
[25] "2014-03-25 09:30:00 CDT" "2014-03-26 09:30:00 CDT"
[27] "2014-03-27 09:30:00 CDT" "2014-03-28 09:30:00 CDT"
[29] "2014-03-29 09:30:00 CDT" "2014-03-30 09:30:00 CDT"
[31] "2014-03-31 09:30:00 CDT"

- What went wrong?
- Clock time and the number line
- 3 types of time spans
We can still do math on the time line, as long as we’re specific about how we want to measure time.

- durations (exact times)
- periods (clock times)
- intervals (specific times)
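A base R sketch of the duration/period distinction above (no lubridate needed; this assumes the "America/Chicago" entry in the system time zone database, where DST began on 2014-03-09):

```r
t0 <- as.POSIXct("2014-03-08 08:30:00", tz = "America/Chicago")
# Adding exactly 86400 seconds (a duration) crosses the DST change,
# so the clock time shifts by an hour:
format(t0 + 86400, "%H:%M %Z")  # "09:30 CDT"
# Stepping by calendar days (a period; seq() with by = "DSTday")
# keeps the clock time fixed:
format(seq(t0, by = "DSTday", length.out = 2)[2], "%H:%M %Z")  # "08:30 CDT"
```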

- Durations
Durations measure the exact amount of seconds that pass between two time points.

- Durations
# Use d + the plural of a unit of time to create
# a duration.
dminutes(5)
dhours(278)
#dmonths(4)  # why doesn't this work?

[1] "300s (~5 minutes)"
[1] "1000800s (~1.65 weeks)"

- Periods
Periods measure time spans in units larger than seconds. Periods pay no attention to how many sub-units occur during the unit of measurement.

- Periods
# Use the plural of a time unit to create a period.
minutes(5)
hours(278)
months(4)  # months are not a problem
years(1:10)

[1] "5M 0S"
[1] "278H 0M 0S"
[1] "4m 0d 0H 0M 0S"
 [1] "1y 0m 0d 0H 0M 0S"  "2y 0m 0d 0H 0M 0S"  "3y 0m 0d 0H 0M 0S"
 [4] "4y 0m 0d 0H 0M 0S"  "5y 0m 0d 0H 0M 0S"  "6y 0m 0d 0H 0M 0S"
 [7] "7y 0m 0d 0H 0M 0S"  "8y 0m 0d 0H 0M 0S"  "9y 0m 0d 0H 0M 0S"
[10] "10y 0m 0d 0H 0M 0S"

- Why use periods?
No surprises.

`2010-11-01 00:00:00 + months(1)`

will always equal`2010-12-01 00:00:00`

no matter how many leap seconds, leap days or changes in DST occur in between.

- Intervals
Intervals measure a time span by recording its endpoints. Since we know when the time span occurs, we can calculate the lengths of all the units involved.

- Intervals examples
# To create an interval, use the special %--%
# operator:
int <- ymd("2010-01-01") %--% ymd("2009-01-01")
# Access and set the endpoints with int_start() and
# int_end().
int_start(int)
int_end(int) <- ymd("2010-03-14")
# Intervals can be accurately converted to either
# periods or durations with as.period() and
# as.duration()

[1] "2010-01-01 UTC"

### 18.4 Math with date-times

- Math with date-times
birthday <- ymd("1973-12-10")
life <- birthday %--% now()
life / ddays(1)
life / days(1)
life %/% days(1)

[1] 15697.75
[1] 15697.75
Note: method with signature ‘Timespan#Timespan’ chosen for function ‘%/%’,
 target signature ‘Interval#Period’.
 "Interval#ANY", "ANY#Period" would also be valid
[1] 15697

- Your turn
Calculate your age in minutes. In hours. In days. How old is someone who was born on March 6th, 1942?

## 19 Statistical Models

### 19.1 Statistical models

- Definition
- Data
- A collection of variables, each variable being a vector of measurements of some specific trait/characteristic
- Problem
- How does variable \(Y\) depend upon variables \(X_1, X_2 \dots X_n\)?
- Model
- The model represents the mathematical relationship between X and Y. It aims to represent the true relationship

- First steps in modeling
- Identify the **response variable**
- Identify the **explanatory variables**
- Are the explanatory variables continuous, categorical or a mixture of both?
- What is the response variable? Continuous, a count or proportion, a category, or time-at-death?
- Explanatory variables
- Continuous
- Regression
- Categorical
- Analysis of Variance (ANOVA)
- Mixture
- Analysis of covariance (ANCOVA)

- Response variables
- Continuous
- Normal regression, anova, ancova
- Proportion
- Logistic regression (and variants for counts, binary data)
- Time-at-death
- Survival analysis

### 19.2 Model formulas in R

- Model formulas
- A model formula is a sequence of R variables and symbols that specifies the variables in the model and their relationships
- The formula is a necessary argument to many model-fitting functions in R
- Although it resembles a mathematical formula, the symbols in the "equation" have a different meaning

- Model formula notation
A model formula in R can look like this:

y ~ A + B + A:B

the `~` means: "is modeled as a function of"

- Example: Linear regression
Setup:

library(ggplot2)
x <- runif(1000, 0, 10)
y <- 1.5*x + rnorm(1000, 0, 2) + 0.01
d <- data.frame(x=x, y=y)
qplot(x, y, data=d)

- Example: Linear regression
fit <- lm(y ~ x)
fit

Call:
lm(formula = y ~ x)

Coefficients:
(Intercept)            x
    0.01578      1.50354

- Extracting information from a fitted object
typeof(fit)
class(fit)
names(fit)
coef(fit)

[1] "list"
[1] "lm"
 [1] "coefficients"  "residuals"     "effects"       "rank"
 [5] "fitted.values" "assign"        "qr"            "df.residual"
 [9] "xlevels"       "call"          "terms"         "model"
(Intercept)           x
 0.01577804  1.50353900

- Residuals
head(residuals(fit))

          1           2           3           4           5           6
-2.95736864  0.02948475 -0.95593011 -1.22266288 -1.48669100 -0.85117376
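A quick sketch of how these extracted pieces fit together: the fitted values and residuals decompose the response exactly (made-up data of the same form as the setup earlier):

```r
set.seed(42)                  # reproducible fake data
x <- runif(100, 0, 10)
y <- 1.5 * x + rnorm(100, 0, 2)
fit <- lm(y ~ x)
# residuals(fit) is exactly y minus fitted(fit):
all.equal(unname(fitted(fit) + residuals(fit)), y)  # TRUE
```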

- Intercepts
But wait! R estimated two coefficients and we only had one explanatory variable.

- R assumes that you want to include an intercept term (our call was the same as `y ~ 1 + x`)
- You can exclude the intercept by adding a `0` to the formula (or a `-1`). But you almost never want to do that.
- Example of no intercept
fitnoi <- lm(y ~ 0 + x)
summary(fitnoi)

Call:
lm(formula = y ~ 0 + x)

Residuals:
   Min     1Q Median     3Q    Max
-5.222 -1.393 -0.084  1.317  6.927

Coefficients:
  Estimate Std. Error t value Pr(>|t|)
x  1.50590    0.01069   140.8   <2e-16 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 1.946 on 999 degrees of freedom
Multiple R-squared:  0.952, Adjusted R-squared:  0.952
F-statistic: 1.983e+04 on 1 and 999 DF,  p-value: < 2.2e-16

- Let's build some models
# names(mtcars)
ggplot(mtcars, aes(disp, mpg)) + geom_point()
ggplot(mtcars, aes(wt, mpg)) + geom_point()
ggplot(mtcars, aes(disp, wt)) + geom_point()
fuel.mod <- lm(mpg ~ disp + wt + wt:disp, data=mtcars)
summary(fuel.mod)

- Model results
Call:
lm(formula = mpg ~ disp + wt + wt:disp, data = mtcars)

Residuals:
   Min     1Q Median     3Q    Max
-3.267 -1.677 -0.836  1.351  5.017

Coefficients:
             Estimate Std. Error t value Pr(>|t|)
(Intercept) 44.081998   3.123063  14.115 2.96e-14 ***
disp        -0.056358   0.013239  -4.257  0.00021 ***
wt          -6.495680   1.313383  -4.946 3.22e-05 ***
disp:wt      0.011705   0.003255   3.596  0.00123 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 2.455 on 28 degrees of freedom
Multiple R-squared:  0.8501, Adjusted R-squared:  0.8341
F-statistic: 52.95 on 3 and 28 DF,  p-value: 1.158e-11

- `anova()`
Test the significance of terms:

anova(fuel.mod)

Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value   Pr(>F)
disp       1 808.89  808.89 134.217  3.4e-12 ***
wt         1  70.48   70.48  11.694 0.001942 **
disp:wt    1  77.93   77.93  12.931 0.001227 **
Residuals 28 168.75    6.03
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

- Linear models: `summary()` vs `anova()`
- Ben Bolker:

The p-values you get for `summary()` of each individual model are the p-values for the effects of each of the parameters in each model, conditional on all the other parameters in that model. If your data are perfectly balanced (which is unlikely in a regression design), you should get the same answers from `summary()` and `anova()`, but otherwise the results from `anova()` are generally preferable.
- `anova()` to compare models

fuel.mod2 <- lm(mpg ~ disp + wt, data=mtcars)
fuel.mod3 <- lm(mpg ~ disp, data=mtcars)
anova(fuel.mod3, fuel.mod2, fuel.mod)

Analysis of Variance Table

Model 1: mpg ~ disp
Model 2: mpg ~ disp + wt
Model 3: mpg ~ disp + wt + wt:disp
  Res.Df    RSS Df Sum of Sq      F   Pr(>F)
1     30 317.16
2     29 246.68  1    70.476 11.694 0.001942 **
3     28 168.75  1    77.934 12.931 0.001227 **
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

- Symbols in formulas

| Symbol | Meaning                                           |
|--------|---------------------------------------------------|
| `+`    | separate effects in a formula                     |
| `:`    | interaction (`A:B` is the interaction of A and B) |
| `*`    | main effects plus interactions                    |
| `^`    | crossed                                           |
| `%in%` | nested within                                     |
| `/`    | nested within                                     |
| `\|`   | conditional on (rarer)                            |

`A*B` is equivalent to `A + B + A:B`

- lm() works for categorical variables too
mtcars$cylnum <- factor(mtcars$cyl)
levels(mtcars$cylnum)
fuel.aov <- lm(mpg ~ cylnum, data=mtcars)
anova(fuel.aov)

[1] "4" "6" "8"
Analysis of Variance Table

Response: mpg
          Df Sum Sq Mean Sq F value    Pr(>F)
cylnum     2 824.78  412.39  39.697 4.979e-09 ***
Residuals 29 301.26   10.39
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
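Under the hood, `lm()` expands a factor into dummy (indicator) columns via `model.matrix()`, with the first level as the baseline; a sketch:

```r
mtcars$cylnum <- factor(mtcars$cyl)
mm <- model.matrix(~ cylnum, data = mtcars)
colnames(mm)  # "(Intercept)" "cylnum6" "cylnum8"
# So the intercept of lm(mpg ~ cylnum) is the mean of the baseline
# (4-cylinder) group, and the other coefficients are differences from it:
fit.aov <- lm(mpg ~ cylnum, data = mtcars)
all.equal(unname(coef(fit.aov)[1]), mean(mtcars$mpg[mtcars$cyl == 4]))  # TRUE
```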

### 19.3 Beyond the linear model

- Beyond the linear model
- GLM
- Generalized linear models allow non-normal errors in the response variable via a link function applied to a linear predictor
- Mixed effects models
- relax assumption of independence. Model nested observations, random effects
- GLS
- Generalized least squares relaxes the assumption of homogeneity by modeling the heterogeneity
- GLMM and GAMM
- combine first and second, above.

- Generalized Linear Models (glm)
Consist of:

- Error distributions
The distribution of the response around its mean. For ANOVA and regression this is the normal distribution; for logistic regression it is the binomial.

- Link function
The function `g` that shows how the linear function of the explanatory variables is related to the expected value of the response:

\(g(\mu) = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n\)

- Variance function
This captures how the variance of the response depends upon the mean

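For logistic regression the link `g` is the logit; a base R sketch showing that `plogis()` inverts it:

```r
mu <- 0.75                 # an expected probability (the mean)
eta <- log(mu / (1 - mu))  # logit link: g(mu), on the linear-predictor scale
plogis(eta)                # inverse link recovers mu: 0.75
qlogis(mu)                 # qlogis() is the logit in base R; equals eta
```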
- Example logistic regression
trees <- data.frame(fire.sev = seq(0.01, 1, 0.005))
fire.mort <- runif(length(trees$fire.sev)) < trees$fire.sev
trees$mortality <- ifelse(fire.mort, 1, 0)
qplot(fire.sev, mortality, data=trees)

- Not a linear relationship!
- Logistic regression
trees.glm <- glm(mortality ~ fire.sev, data=trees, family = binomial())
trees.glm
# summary(trees.glm)

Call:  glm(formula = mortality ~ fire.sev, family = binomial(), data = trees)

Coefficients:
(Intercept)     fire.sev
     -2.323        4.627

Degrees of Freedom: 198 Total (i.e. Null);  197 Residual
Null Deviance:     275.9
Residual Deviance: 212.7    AIC: 216.7

- Prediction
trees$pred.mort <- predict(trees.glm, type="response")
qplot(fire.sev, mortality, data=trees) + geom_line(aes(y = pred.mort))

### 19.4 Mixed models

- Example: Mixed model
fish <- read.csv("http://r-research-tool.schwilk.org/lectures/data/fish-data.csv")
head(fish)

  aquarium nutrient color.score
1        a      low         5.3
2        a      low         4.7
3        a      low         5.0
4        a      low         4.3
5        a      low         4.8
6        b      low         4.8

- Look at these data
Is each fish by treatment row independent?

- Mixed model using nlme
library(nlme)
fit1.lme <- lme(color.score ~ 1, random = ~ 1 | aquarium,
                method="ML", data=fish)
fit2.lme <- lme(color.score ~ nutrient, random = ~ 1 | aquarium,
                method="ML", data=fish)
anova(fit1.lme, fit2.lme)

         Model df      AIC      BIC    logLik   Test  L.Ratio p-value
fit1.lme     1  3 139.7463 144.8129 -66.87315
fit2.lme     2  4 122.6974 129.4529 -57.34872 1 vs 2 19.04886  <.0001

- BEWARE! Those p values are wrong.
Beware of p-values in multi-level models. Consider model selection (AIC or BIC, but careful there as well). Consider Markov chain Monte Carlo methods (MCMC) for confidence intervals.

See https://seriousstats.wordpress.com/tag/mcmc-methods/

See package "pbkrtest" for Kenward-Roger approximation

- Using lme4
library(lme4)
library(pbkrtest)
fit1.lmer <- lmer(color.score ~ 1 + (1 | aquarium), data=fish)
fit2.lmer <- lmer(color.score ~ nutrient + (1 | aquarium), data=fish)
KRmodcomp(fit1.lmer, fit2.lmer)

Loading required package: Matrix

Attaching package: ‘Matrix’

The following object is masked from ‘package:tidyr’:

    expand

Attaching package: ‘lme4’

The following object is masked from ‘package:nlme’:

    lmList

F-test with Kenward-Roger approximation; computing time: 0.10 sec.
large : color.score ~ nutrient + (1 | aquarium)
small : color.score ~ 1 + (1 | aquarium)
        stat   ndf   ddf F.scaling   p.value
Ftest 54.912 1.000 6.000         1 0.0003104 ***
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

- Mixed models in R
- Pinheiro, J. C. & Bates, D. Mixed-Effects Models in S and S-Plus Springer-Verlag, New York, 2000.
- Zuur, A.F., Ieno, E.N., Walker, N., Saveliev, A.A., and Smith, G.M. Mixed Effects Models and Extensions in Ecology with R. Springer. 2009.

## 20 Perception and Color

### 20.1 Graphics and color use

- Homeworks
I apologize for my mistakes on the webpage. In light of that, I have been very lenient in grading HW09. But I have some comments:

- When I ask you to "tell me something interesting and show your work", that means explore the data. DO SOME WORK. I am giving you flexibility and choices for a reason.
- My questions are brief so that you can use them as jumping-off points. This is grad school, not undergrad.
- Review the naming convention. I sort and rename your submissions. Some of you are sloppy with case and with given vs surname order. A minor issue, but please be exact.

- Exploratory vs communication graphics
- Exploratory graphics
These are for you (not others). You need to be able to create them rapidly, because your first attempt will never be the most revealing. Iteration is crucial for developing the best display of your data.

- Communication graphics
When you communicate your findings, you need to spend a lot of time polishing your graphics to eliminate distractions and focus on the story. Iteration is crucial to ensure all the small stuff works well: labels, color choices, tick marks, etc.


### 20.2 Perception

### 20.3 Color

- What are the components of color?
RGB:

- RGB triplets

| Notation          | RGB triplet     |
|-------------------|-----------------|
| Arithmetic        | (1.0, 0.0, 0.0) |
| 8-bit per channel | (255, 0, 0)     |
| 8-bit as hex      | `#FF0000`       |

- RGB
- Other color models?
- Luminance / Lightness
- HCL
- Benefit
- Perceptually uniform
- Hue is unordered. Use evenly spaced hues with equal chroma and luminance to make aesthetically pleasing discrete palettes
- Chroma and luminance are ordered. Easy to make perceptually uniform gradients by varying either (or both).
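A base R sketch of both ideas, using `grDevices::hcl()` directly (this mirrors what ggplot2's `scale_color_hue()` does; the specific hue, chroma and luminance values are arbitrary choices):

```r
# Discrete palette: evenly spaced hues at equal chroma and luminance
n <- 4
hues <- seq(15, 375, length.out = n + 1)[1:n]
palette <- hcl(h = hues, c = 100, l = 65)
palette  # four hex colors of equal perceived lightness

# Ordered gradient: hold hue fixed and vary luminance
gradient <- hcl(h = 250, c = 40, l = seq(30, 90, length.out = 5))
```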

- Color blindness
- 7-10% of men are red-green color "blind."
- Solutions: avoid red-green contrasts; use redundant mappings; test.
- Color Oracle: http://colororacle.org/

### 20.4 Colors in R

- Colors in R
The default in R (and ggplot2) is an HCL color scale; one can use scales from other packages (e.g. RColorBrewer) as well.

- Continuous scales
The default continuous scale in ggplot2 is a blue gradient, from low = `#132B43` to high = `#56B1F7`, which can be reproduced as:

library(ggplot2)
scale_color_gradient(low = "#132B43", high = "#56B1F7", space = "Lab")

- Discrete scales
The default discrete scale in ggplot2 is a range of hues from hcl,

scale_color_hue(h = c(0, 360) + 15, c = 100, l = 65, h.start = 0, direction = 1)

- Discrete color scales
p <- qplot(Petal.Length, Sepal.Length, data=iris, color = Species)
p
p + scale_color_hue(l=40, c=30)  # change luminance and chroma
p + scale_color_hue(h = c(90,200), l = 50, c = 50)  # also restrict the hue range
p + scale_color_hue(h = c(90,200), l = 50, c = 50, direction = -1)

- Results:
- Continuous scales
p <- ggplot(mpg, aes(hwy, cty, color = displ)) +
  geom_point(size = 3.5, alpha = 0.6, position = "jitter")
p
p + scale_colour_gradient(low = "tan", high = "red")

- Results:
- Other color palettes
Cynthia Brewer's palettes for discrete scales http://colorbrewer2.org/

These are optimized for maps

library(RColorBrewer)
display.brewer.all()
p <- qplot(carat, price, color = cut, data=diamonds)
p + scale_color_brewer(palette = "Blues")
p <- ggplot(mpg, aes(reorder(class, -cty), cty,
                     color = cut(displ, breaks = 4))) +
  geom_boxplot(position = "dodge")
p + scale_color_brewer(palette="Dark2")

- Result:

### 20.5 Perception and comparisons

- Some comparisons are easier than others
- We perceive relative difference
- It is easier to make nearby comparisons
- Beware of visual illusions
- Dynamite plots
- My recommendations
For comparing univariate responses across discrete groups, my recommendation depends on data set size:

- small to moderate: dotplots, jittered dotplots (a special challenge are small, discrete data sets, where there are lots of repeated values)
- moderate: boxplots
- moderate to large: violin plots, density strips, beanplots, box-percentile plots etc. etc.

## 21 Markup languages

### 21.1 Why plain text?

- Our main goal as scientists
To make our research as reproducible and visible as possible

- The power of plain text
- Ubiquitous
- Usually small in size
- Portable across platforms (and versions)

- Manipulation of text
Most output is based on simple text files. There are two main final formats: HTML and PDF. One is pageless; the other is centered on the idea of the printed page.

### 21.2 Markup

### 21.3 LaTeX

- LaTeX
- A set of macros around Tex, a markup language invented by Donald Knuth
- A document preparation system and document markup language.
- De facto standard in academic publishing. Its formula syntax is used in HTML pages (e.g., Wikipedia)

- A LaTeX example
\documentclass[12pt]{article}
\usepackage{booktabs}
\begin{document}
\section*{My Paper}

I just discovered that:
\begin{equation}
  e=mc^2
\end{equation}

\begin{table}[h]
  \centering
  \caption{My table}
  \begin{tabular}{lrc}
    \toprule
    Left & right & center \\
    \midrule
    data & 0.01  & 2.3 \\
    \bottomrule
  \end{tabular}
\end{table}
\end{document}

- Result
- It isn't as bad as it seems
- Ready for submission

### 21.4 Markdown

- Still, LaTeX is cumbersome
Alternatives?

- Markdown
- Easy to learn http://daringfireball.net/projects/markdown/
- Much more "lightweight" than LaTeX

- Purposes
- LaTeX is for paper — PDF
- Markdown is for HTML (blogs, Wikipedia), but it sneakily uses some LaTeX when needed
- But with pandoc (http://johnmacfarlane.net/pandoc/), Markdown is becoming a language of choice among scientists
- You can even export from markdown to MS Word to keep a collaborator happy.

- Markdown
- Markdown example
## My Paper ##

I just discovered that:

$$e=mc^2$$

### My Table ###

| Left | right | center |
|------|-------|--------|
| data | 0.01  | 2.3    |

- Which renders as
- Text markup
*italic*, **bold**

unordered list:

- item 1
- item 2
- item 3

Ordered list:

1. item 1
2. item 2
3. item 3

- Headings
Heading1
========

# Heading1 #

Heading2
--------

## Heading 2 ##

### Heading 3 ###

- Markdown links
Links:

http://daringfireball.net/projects/markdown/syntax

[Markdown syntax](http://daringfireball.net/projects/markdown/syntax)

[Markdown syntax][mkdn]

[mkdn]: http://daringfireball.net/projects/markdown/syntax

- Markdown math
Just use LateX math!

$$\frac{n!}{k!(n-k)!} = \binom{n}{k}$$

- Markdown math
Inline equations just need a single `$`:

In ecology, "true" species diversity is:
$D_q = \frac{1}{\sqrt[q-1]{\sum_{i=1}^R{p_{i} p_{i}^{q-1}}}}$

### 21.5 Managing references

- BibTeX
A reference database format

- Very versatile and very powerful (most other markup languages work with bibtex as well)
- Free managers, such as BibDesk or Mendeley, are now ubiquitous. See also JabRef. You do not need or want Endnote!

- Bibtex entry
@article{Schwilk+Ackerly-2001,
  title={Flammability and serotiny as strategies: correlated evolution in pines},
  author={Schwilk, Dylan W and Ackerly, David D},
  journal={Oikos},
  volume={94},
  number={2},
  pages={326--336},
  year={2001},
  publisher={Wiley Online Library}
}

- In Jabref
- LaTeX and BibTeX example
\section{Introduction}

Fire has been a powerful disturbance on the global landscape for hundreds
of thousands of years, promoting traits in plants which confer an advantage
in the presence of fire \cite{Keeley+Pausas+etal-2011,He+Pausas+etal-2012}.
Understanding variation in fire response strategies both across and within
fire regimes is a major goal of plant fire ecology
\cite{Bond+vanWilgen-1996, Bellingham-2000, Schwilk+Ackerly-2001,
Keeley+Pausas+etal-2011}.

- Markdown and BibTeX example
Requires pandoc markdown extensions.

# Introduction #

Fire has been a powerful disturbance on the global landscape for hundreds
of thousands of years, promoting traits in plants which confer an advantage
in the presence of fire [@Keeley+Pausas+etal-2011; @He+Pausas+etal-2012].
Understanding variation in fire response strategies both across and within
fire regimes is a major goal of plant fire ecology [@Bond+vanWilgen-1996;
@Bellingham-2000; @Schwilk+Ackerly-2001; @Keeley+Pausas+etal-2011].

### 21.6 Which markup to use?

- Which do I use and when:
- Scientific papers
- LaTeX + BibTex. Outlining in org-mode first. I'm starting to use Markdown for some papers (easy conversion to MS word for collaborators).
- Presentations
- LaTeX + Beamer when I need to control details of format, spacing, placement. Org-mode with export via LaTeX + Beamer otherwise. The source for this lecture is plain org-mode (similar to Markdown)
- Websites
- Markdown with html and css templates (jekyll, eg http://www.schwilk.org/) or org-mode export (eg http://r-research-tool.schwilk.org/)
- Other
- org-mode or markdown (documentation, notes, protocols, literate programming)

- Pandoc
So how do we glue everything together and produce wonderful htmls and pdfs out of thin air?

With pandoc: http://johnmacfarlane.net/pandoc/

### 21.7 Scientific workflow

- An RMarkdown workflow
Figure from Kieran Healy: http://kieranhealy.org/resources/

- RMarkdown in RStudio

## 22 Version Control

### 22.1 Setup

- The Bash Shell
Bash is a commonly-used shell that gives you the power to do simple tasks more quickly.

- Windows
Install Git for Windows by downloading and running the installer. This will provide you with both Git and Bash in the Git Bash program.

- Mac OS X
The default shell in all versions of Mac OS X is bash, so there is no need to install anything. You access bash from the Terminal (found in `/Applications/Utilities`). You may want to keep Terminal in your dock for this workshop.

- Linux
). You may want to keep Terminal in your dock for this workshop. - Linux
The default shell is usually

`bash`

, but if your machine is set up differently you can run it by opening a terminal and typing`bash`

. There is no need to install anything.

- Git
- Windows
Git should be installed on your computer as part of your Bash install (described above).

- Mac OS X
For OS X 10.9 and higher, install Git for Mac by downloading and running the installer. After installing Git, there will not be anything in your `/Applications` folder, as Git is a command-line program.

For older versions of OS X (10.6-10.8) use the most recent installer available for your OS. Use the Leopard installer for 10.5 and the Snow Leopard installer for 10.6-10.8.

- Additional information
See my reproducible research workshop:

### 22.2 The shell

- The shell
- What is the shell?
The shell is a command-line interface which allows you to control your computer.

Tasks are accomplished by entering commands with a keyboard (or by running a plain text script) instead of controlling a graphical user interface.

The most popular shell is Bash ("Bourne Again SHell")

- What is a terminal?
A terminal is a program you run that gives you access to the shell. There are many different terminal programs that vary across operating systems.

- What is the shell?
- Using the shell
Open a terminal (bash)

ls
echo "hello"
echo "hello" > newfile.txt
cat newfile.txt

- Full vs relative paths
pwd
cd ~
pwd
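A short sketch of the difference between absolute and relative paths (`pathdemo` is just a hypothetical scratch directory name):

```shell
cd /tmp                  # absolute path: starts with "/"
mkdir -p pathdemo
cd pathdemo              # relative path: resolved against the current directory
pwd                      # now ends in .../pathdemo
cd ..                    # ".." means the parent directory
pwd
```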

- Shell command options and arguments
ls -l
ls -al
man ls

### 22.3 What is version control?

- Why use version control?
- You already use some sort of version control
- File naming schemes (*e.g.* `my-file-July18-2013.doc`) or copying folders around
- Simple but error-prone
- Does not help with branching, collaboration
- A version control system (VCS) allows you to:
- revert files back to a previous state
- revert the entire project back to a previous state
- review changes made over time
- see who last modified something that might be causing a problem, who introduced an issue and when

- Local version control systems (VCS)
- Centralized VCS
- Distributed VCS

### 22.4 git

- Configuring
git config --global user.name "Dylan Schwilk"
git config --global user.email "dylan@schwilk.org"

- Getting a "repo" (repository)
cd ~
mkdir newrepo
cd newrepo
git init
touch README.md
ls .git -al

- What happened?
A hidden directory called `.git`:

.
├── .git
│   ├── branches
│   ├── config
│   ├── description
│   ├── HEAD
│   ├── hooks
│   │   ├── applypatch-msg.sample
│   │   ├── commit-msg.sample
│   │   ├── post-update.sample
│   │   ├── pre-applypatch.sample
│   │   ├── pre-commit.sample
│   │   ├── prepare-commit-msg.sample
│   │   ├── pre-push.sample
│   │   ├── pre-rebase.sample
│   │   └── update.sample
│   ├── info
│   │   └── exclude
│   ├── objects
│   │   ├── info
│   │   └── pack
│   └── refs
│       ├── heads
│       └── tags
└── README.md

- Checking status
git status

On branch master

Initial commit

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        README.md

nothing added to commit but untracked files present (use "git add" to track)

- git commands
- staging files
git add .
git status

On branch master

Initial commit

Changes to be committed:
  (use "git rm --cached <file>..." to unstage)

        new file:   README.md

- Committing changes
git commit -m "initial commit"
git status

On branch master nothing to commit, working directory clean

- git commit
A commit is a snapshot taken from the index (staging area), **not from the working directory**!

- What exists now?
- Clone repo
git clone https://github.com/schwilklab/protocols.git
cd protocols
ls

./ ../ html/ lab-org/ methods/ README.md safety/

- A basic workflow
- Edit files
- Stage the changes (git add)
- Review the changes (git diff)
- Commit the changes (git commit)
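The four steps above as a runnable sketch, in a throwaway repository (the file name, user identity and commit message are arbitrary):

```shell
# 0. a scratch repository to work in
cd "$(mktemp -d)" && git init -q myrepo && cd myrepo
git config user.name "A User"
git config user.email "a@example.com"
echo "first line" > notes.txt        # 1. edit files
git add notes.txt                    # 2. stage the changes (git add)
git diff --staged                    # 3. review the staged changes (git diff)
git commit -q -m "Add notes"         # 4. commit the changes (git commit)
git log --oneline                    # one commit now in the history
```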

- The git repository
- git log
git log --pretty=oneline --abbrev-commit -10

d2ec17b Reduce markup lecture
e4d405f Add folder for tangled session files
bb2d29e Add shell (Bash) examples and some data
261af83 Clean up birdd files
e289d7e Minor edit
f2820d3 Scatterplot fig in birdd example
75a4be6 Add slides on Bash shell
66c9a7f Add etherpad link for workshop
e23b559 Suggest getting github.com account
7e6a6df Add time and date for workshop

- git diff
`git diff`

- difference between working directory and index (staging area)
`git diff --staged`

- difference between index and last commit

With right options, diff can show changes between any two commits, any two files, etc.

- git diff example
Try:

```shell
git diff HEAD~1
```

- Diffs on GitHub
- Practice
- edit file
- stage
- commit

### 22.5 Working with branches

- Branches
A branch represents an independent line of development.

- Show and create branches
- Using branches
- Switch branches
```shell
git checkout <existing-branch>
```
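A minimal sketch of showing, creating, and switching branches in a scratch repository (the path and branch names are made up):

```shell
rm -rf /tmp/branch-demo && git init -q /tmp/branch-demo && cd /tmp/branch-demo
git config user.email "you@example.org" && git config user.name "Your Name"
git checkout -q -b master
echo "hello" > file.txt && git add file.txt && git commit -q -m "first commit"
git branch                  # list local branches; "*" marks the current one
git branch new-idea         # create a branch (does not switch to it)
git checkout -q new-idea    # switch to an existing branch
git checkout -q -b hotfix   # create and switch in one step
```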

- Merge
Merging is Git's way of putting a branched history back together again.

- Fast-forward merge
- Three-way merge
- Example fast-forward merge
```shell
# Start a new feature
git checkout -b new-feature master
# Edit some files
git add <file>
git commit -m "Start a feature"
# Edit some files
git add <file>
git commit -m "Finish a feature"
# Merge in the new-feature branch
git checkout master
git merge new-feature
git branch -d new-feature
```

- Merge vs rebase
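The difference is easiest to see in a scratch repository: a merge ties the two branch tips together with a new merge commit, while a rebase replays the feature commits on top of master, giving a linear (rewritten) history. A sketch, with invented path, branch, and file names:

```shell
rm -rf /tmp/rebase-demo && git init -q /tmp/rebase-demo && cd /tmp/rebase-demo
git config user.email "you@example.org" && git config user.name "Your Name"
git checkout -q -b master
echo "base" > base.txt && git add base.txt && git commit -q -m "base"
git checkout -q -b feature
echo "f" > feature.txt && git add feature.txt && git commit -q -m "feature work"
git checkout -q master
echo "m" > master.txt && git add master.txt && git commit -q -m "master work"
# `git merge feature` here would create a merge commit with two parents.
# Rebase instead replays the feature commit onto master's tip:
git checkout -q feature
git rebase -q master
git log --oneline        # linear history: base, master work, feature work
```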

### 22.6 Working with remotes

- What is a "remote"?
- A version of the repository that is hosted somewhere on the internet.
- Remotes have a URL (https or ssh), but they can also have names.
- It is common to have only one remote, and by convention it is named "origin".

- GitHub
We use GitHub for hosting, so most of your local repos will be associated with a remote named "origin" and a URL something like

`git@github.com:schwilklab/CSC-sky-island-forest.git`

- Working with remotes
When you clone from a remote, git remembers it and calls that remote "origin". You can look at `.git/config` or, better, use git commands to view it:

```shell
git remote -v
```

```
origin	git@github.com:schwilklab/R-research-tool.git (fetch)
origin	git@github.com:schwilklab/R-research-tool.git (push)
```

- Adding a remote
If you need to add a remote to an existing repository:

```shell
git remote add rem2 git@github.com:dschwilk/software-tools-course.git
```

- Fetching and pulling
- git push
```shell
git push <remote> <branch-name>
# eg:
git push origin master
# or just:
git push
```

This works if no one else has pushed changes to that remote since you last pulled. If there is an error, you will need to pull (fetch and merge) first.
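The rejected-push-then-pull dance can be simulated entirely on one machine, with a bare repository standing in for GitHub. All paths and names below are invented for the sketch:

```shell
rm -rf /tmp/shared.git /tmp/alice /tmp/bob
git init -q --bare /tmp/shared.git               # stand-in for the GitHub remote
git clone -q /tmp/shared.git /tmp/alice 2>/dev/null
cd /tmp/alice
git config user.email "alice@example.org" && git config user.name "Alice"
git checkout -q -b master
echo "draft" > notes.txt && git add notes.txt && git commit -q -m "Start notes"
git push -q -u origin master

git clone -q /tmp/shared.git /tmp/bob            # a second collaborator
cd /tmp/bob
git config user.email "bob@example.org" && git config user.name "Bob"
git checkout -q master
echo "more" >> notes.txt && git commit -q -am "Bob adds more"
git push -q origin master                        # Bob pushes first

cd /tmp/alice
echo "figs" > figs.txt && git add figs.txt && git commit -q -m "Alice adds figs"
git push -q origin master 2>/dev/null || echo "push rejected: remote has moved on"
git pull -q --no-rebase --no-edit origin master  # fetch Bob's commit, then merge
git push -q origin master                        # now the push succeeds
```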

- git and GitHub
The idea of "forking": make your own GitHub copy of someone else's repository, push changes to your copy, and propose them back with a pull request.

## 23 Packages for Statistics

### 23.1 RStudio Project

- RStudio project
- Fork and clone or just clone: https://github.com/schwilklab/BBNP-transects
- eg clone:
`git clone https://github.com/schwilklab/BBNP-transects.git`

- Enable version control in RStudio
- Make RStudio project
- Start rmarkdown file as a report

- Questions
- Were oaks or pines harder hit by 2011 drought? Short vs long term?
- Did the 2011 drought influence species diversity?
- Did the drought effects differ by elevation?

### 23.2 Statistical Modeling

- Stats models
- GLM (Generalized linear models). For modeling distributions with link and variance functions (eg binomial = logistic regression)
- GAM (Generalized Additive Models). gam or mgcv packages.
- GLS (Generalized least squares models). These allow for heterogeneity. nlme library (Pinheiro and Bates)
- Mixed (nested) models: `lme()` in the nlme package or `lmer()` in lme4. Use lme4 (see function `lmer()`) when you can.
- GLMM (Generalized Linear Mixed Models). Multiple packages, still cutting edge: `glmmPQL()` from the MASS package, `glmer()` from the lme4 package, and `glmmML()` from the glmmML package
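As a quick illustration of the GLM bullet above, logistic regression in base R is just `glm()` with a binomial family. This is a hedged sketch: the data frame `plants` and its columns `alive` (0/1) and `elev` are hypothetical.

```r
# Hypothetical data: did each plant survive (alive = 0/1) as a
# function of elevation?
mod <- glm(alive ~ elev, family = binomial, data = plants)
summary(mod)   # coefficients are on the logit (log-odds) scale
```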

- Bayesian models
- rstan (R interface to the Stan language)
- brms (Use lmer-style formulae that are automatically translated to Stan via rstan)

- Multi model inference (AIC) and model averaging
The `MuMIn` package.

- Bioinformatics and genomics
BioConductor: http://www.bioconductor.org/

- smatr package
Standardized major axis regression. Useful for comparing allometries.

http://onlinelibrary.wiley.com/doi/10.1111/j.2041-210X.2011.00153.x/abstract

### 23.3 Packages useful in Ecology

- Packages useful to Ecology
- CRAN task view Ecological and Environmental Data: http://cran.r-project.org/web/views/Environmetrics.html
- http://www.r-bloggers.com/recent-r-packages-for-ecology-and-evolution/

- Species diversity, community ordination
- vegan: species richness, diversity indices, community ordination (dca, nms, etc). Expects data in wide format (species ids as columns, rows are plots, values are abundances)
- rich: some richness rarefaction and bootstrapping procedures
- FD: functional diversity indices (some, there are useful ones it leaves out, see Cornwell, Schwilk and Ackerly 2006)

### 23.4 Phylogenies, comparative methods

- Phylogenies, comparative methods
- CRAN task view Phylogenetics: http://cran.r-project.org/web/views/Phylogenetics.html
- ape: phylogenetic comparative methods

### 23.5 Spatial statistics

## 24 Extra: Plain text and Emacs

### 24.1 Working with plain text

- Why you should love plain text
- Portability. Linux, Windows, Mac, Android, etc. **Standard**.
- Easy to use. Don't let a word processor get in your way when you just need to **write**
- Keep content separate from formatting
- Computers love plain text. Programs are written in plain text. Use one powerful text editing tool for everything you do.
- Version control: VC systems work best on text files (eg git)
- What you want in a text editor
- Syntax highlighting
- Braces, parenthesis matching, automatic indenting
- Ability to work with interpreters, shells
- Spell checking

And some other things you did not know you wanted: TODOs and outlining modes, exporting to HTML or PDF, web browsing, email …

- Editors: free as in speech and beer
- OS agnostic:
  - **Emacs**. My favorite. Runs on everything, super powerful.
  - **Vim**. The "other editor". Also powerful and customizable.
  - **atom**. By the GitHub folks: https://atom.io/. New kid on the block.
  - **RStudio**. For R code but also reasonable for markdown editing.

- Windows:
  - **notepad++**. Lightweight editor. Can syntax highlight R code.
- Mac:
  - **textmate** (textmate2 is open source)
- Some non-free editors
- SublimeText (all platforms): https://www.sublimetext.com/
- TextWrangler and BBEdit (Mac)

### 24.2 Emacs

- Emacs
One editor to rule them all.

- Why use Emacs? Availability
- All platforms, operating systems
- Will Always Exist (cue music)
- Open source and free

- Why use Emacs? Best tools
- Emacs has "major modes" for most computer languages and types of files
- Powerful text management tools:
  - Compare (`diff`) documents or buffers
  - Regular expressions
  - Can deal with rectangular text (columns, tables)
- Can handle very large files

### 24.3 Basics

- Start Emacs
1: Menus, 2: Toolbar, 3: Buffer, 4: Mode line (status bar), 5: Mini buffer

- What I see:
- What you'll see if you use my student starter
- A quick note on terminology
Emacs was first written before modern window managers existed (before MS Windows 3.1 even!). So it uses its own terminology for a few things.

- Frames, windows and buffers
- Frame
  - What we call a "window" on a modern operating system. One Emacs instance can control multiple frames. Try `File -> New Frame`
- Window
  - What we would call a "panel." A subdivided part of an Emacs "frame". A way to view multiple buffers at once (eg R code and R console). Try `File -> Split Window`
- Buffer
  - A name associated with the contents of a window. Can be the contents of a file, a process, an interpreter. Try `Buffers -> ...`

- Emacs "Major Modes"
- Emacs can customize itself based on the language you are working in
- R code will look different (font colors, comments etc) from Python from Markdown.
- Emacs will guess the mode from the file extension (eg `.R`), or you can specify the mode yourself: type M-x (Alt-x), release, then type a valid mode name

- Some examples:
- R (using emacs package ess)
- Bash shell
- Tramp (open files over ssh). Or ssh then open emacs.
- git via package 'magit'
- markdown and Pandoc
- Python and C, C++, etc
- Movement, replace-regexp

- Emacs keyboard commands
- Keyboard nomenclature
- "C-x g"
- hold Control while typing "x", release, then type "g"
- "M-x"
- Hold down Alt then hit "x". This brings you to mini-buffer to type longer commands
- "C-c C-c"
- Hold down Control while typing "c" then do that again.

- That seems complicated
It can seem that way, but you don't have to use those keys if you don't want to!

- The keyboard commands are featured in most Emacs tutorials. And that may scare folks away.
- I have a "cheat sheet" of commands taped on the wall next to my computer
- But you can always use the pull down menus and still do more than with any other editor.

- Just use the menus and keys you are used to
- One problem: Emacs' default cut, copy, and paste keybindings are not the C-x, C-c, C-v you are used to
- Easily fixed
Turn on "CUA mode", multiple ways:

- Menus: `Options -> CUA`
- "M-x cua-mode"
- Add the following to your Emacs startup file: `(cua-mode t)`

Watch out: This overrides some basic Emacs keyboard functionality. For example, you must turn this off for the built-in Emacs tutorial!

- Some shortcuts you **should** use
- M-x
- execute command in the minibuffer — this gives you access to every lisp function
- C-g
- Quit. Exits a function, exits mini-buffer. Useful if you get stuck
- C-s
- Forward search. I use this very frequently
- M-%
- Query replace
- C-M-%
- Regular expression query replace
- M-q
- Rewrap paragraph (does right thing for code comments)
- M-;
- Comment or uncomment selected region; knows which comment character to use

- More on `M-x`

- "M-x" gets the mini-buffer ready for a command
- Tab completion works, so just type the first few letters


### 24.4 Installing

- Installing Emacs
- On linux
`> sudo apt-get install emacs24`

- On Mac
- OSX: aquamacs: http://aquamacs.org/

- On linux
- Install Emacs on Windows
- go to http://ftp.gnu.org/gnu/windows/emacs/
- Download emacs by right-clicking on either emacs-xx.x-bin-i386.tar.gz or emacs-xx.x-fullbin-i386.tar.gz.
- Extract the zip file to `C:\Program Files\emacs`

- Add shortcuts to the executable, and set the $HOME environment variable (My Computer … Properties … Advanced)
- You will need to learn about environment variables so that emacs can find R, etc.

- Customizing Emacs
- Emacs is highly customizable. On startup it reads an initialization file (~/.emacs.d/init.el)
- I have written a set of initialization files for you: https://github.com/schwilklab/emacs-starter

- Installing my emacs student starter
- go to https://github.com/schwilklab/emacs-starter
- read the README.md file
- Download to your .emacs.d directory (`~/.emacs.d`, or `$HOME\.emacs.d\`)
- Learn what I mean by "~" and `$HOME`!

- Set up on linux
```shell
cd ~
git clone https://github.com/schwilklab/emacs-starter.git .emacs.d
```

- Some features and departures from defaults
Note that some things in my configuration are different from the documentation you will find online:

- Uses Windows cut-copy-paste keys (**cua-mode**)
- A dark color theme (the "schwilk" theme under Options … Customize Emacs … Custom Themes)
- C-x g for quick access to the git interface "magit"
- Emacs escape hatch

`C-g`

- A poster for learning emacs

## 25 Example packages

### 25.1 PCA and ordination

- J.A.'s scorpion morphology data
```r
library(dplyr)
scorpion <- read.csv("http://r-research-tool.schwilk.org/lectures/data/Azzinari_raw.csv",
                     stringsAsFactors=FALSE) %>%
  select(-V.W) %>%
  mutate_each(funs(log), -Sample)
head(scorpion, 2)
```

```
  Sample      CaL       I.L      I.W      V.L     FemL      FemW     Ch.L
1      A 2.066863 0.6931472 1.131402 1.856298 1.609438 0.9555114 2.564949
2      A 2.008214 0.7178398 1.098612 1.840550 1.589235 0.7884574 2.459589
       ChW      ChD      FFL      MFL    Ves.L    Ves.W    Ves.D    Tel.L
1 1.458615 1.625311 1.667707 1.960095 1.526056 1.098612 1.098612 1.887070
2 1.386294 1.458615 1.609438 1.916923 1.568616 1.064711 1.081805 1.902108
  Leg.Tib.L Leg.Tib.W
1  1.629241 0.6830968
2  1.568616 0.7419373
```

- Look at correlations
```r
scorpion %>% select(-Sample) %>% pairs()
```

- PCA
```r
# PCA cannot handle NA's:
scorpion <- scorpion %>% filter(complete.cases(.))
sc.complete <- scorpion %>% select(-Sample)
pca <- prcomp(sc.complete)
plot(pca)
```

- Plot of PC axis variances
- Look at the loadings
```r
pca$rotation[, 1:2]
```

```
                PC1         PC2
CaL       0.2210499 -0.01444772
I.L       0.2422791 -0.18468988
I.W       0.2443911 -0.14540620
V.L       0.2305940 -0.12903198
FemL      0.2190032  0.11698766
FemW      0.2322420 -0.22215989
Ch.L      0.2271538  0.20442404
ChW       0.2737845 -0.16207092
ChD       0.2808830 -0.04576147
FFL       0.2227986  0.67095346
MFL       0.2473843  0.43824096
Ves.L     0.2828978 -0.20786541
Ves.W     0.2459926 -0.21128623
Ves.D     0.2566618 -0.18374224
Tel.L     0.2440554 -0.01627861
Leg.Tib.L 0.2300402  0.12883335
Leg.Tib.W 0.2060285  0.11827706
```

- Size differences across samples?
```r
library(ggplot2)
scores <- data.frame(pca$x[, 1:10])
scores$sample <- scorpion$Sample
ggplot(scores, aes(sample, PC1)) + geom_point(size=3)
```

- Results
- Plot PC2 and PC3
```r
ggplot(scores, aes(PC2, PC3, color=sample)) + geom_point(size=4)
```

- Result
- What is PC2? Largely FFL and MFL.
```r
ggplot(scores, aes(sample, PC2)) + geom_point(size=4)
```

- Look at FFL residuals
```r
scorpion$size <- scores$PC1
fflmod <- lm(FFL ~ size, data=scorpion)
scorpion$FFL.sc <- fflmod$residuals
ggplot(scorpion, aes(Sample, FFL.sc)) + geom_point(size=4)
```

- Results
- Better yet, create model with size (PC1) as covariate
```r
fflmod <- lm(FFL ~ Sample + size + Sample:size, data=scorpion)
anova(fflmod)
```

```
Analysis of Variance Table

Response: FFL
            Df  Sum Sq Mean Sq  F value Pr(>F)    
Sample       8 0.89796 0.11224  40.1027 <2e-16 ***
size         1 0.55940 0.55940 199.8638 <2e-16 ***
Sample:size  8 0.03268 0.00409   1.4596 0.1944    
Residuals   53 0.14834 0.00280                    
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
```

- And plot these allometries
```r
ggplot(scorpion, aes(size, FFL, color=Sample)) +
  geom_point(size=4) + geom_smooth(method="lm")
```

- Result
- Discriminant Function Analysis
```r
dfamod <- MASS::lda(sample ~ PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
                    data=scores, CV=TRUE)
ct <- table(scores$sample, dfamod$class)
diag(prop.table(ct, 1))  # proportion classified correctly, by group
```

```
        A         B         C         D         E         F         G         H 
0.3000000 0.2500000 0.2857143 0.2500000 0.5555556 0.4444444 0.3333333 0.3333333 
        I 
0.4000000 
```

- Predict
```r
lda.values <- predict(MASS::lda(sample ~ PC2 + PC3 + PC4 + PC5 + PC6 + PC7 + PC8,
                                data=scores))
ldascores <- data.frame(lda.values$x)
ldascores$sample <- scores$sample
ggplot(ldascores, aes(LD1, LD2, color=sample)) + geom_point(size=5)
```

- Results
- Notes / see also
- pcaMethods for imputed values in PCA (deal with NAs)
- prcomp and princomp have different terminology
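A sketch of that terminology difference, using R's built-in `USArrests` data purely for illustration:

```r
p1 <- prcomp(USArrests, scale. = TRUE)   # SVD-based
p2 <- princomp(USArrests, cor = TRUE)    # eigendecomposition-based
p1$rotation   # prcomp calls the loadings "rotation"; scores are in p1$x
p2$loadings   # princomp calls them "loadings"; scores are in p2$scores
```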

### 25.2 M.F.'s orchid-fungi data

- Mycorrhizal fungi and orchids
```r
myco <- read.csv("http://r-research-tool.schwilk.org/lectures/data/Fernandez_orchid_fungi.csv",
                 stringsAsFactors=FALSE)
head(myco, 2)
```

```
        genus  species sp_code sample_nr site status      long     lat
1 Masdevallia nidifica    mani         1    a common -83.78525 9.75116
2 Masdevallia nidifica    mani         2    a common -83.78525 9.75116
              life_zone elevation rainfall temp host_tree light_.    fungal.ID
1 Premontane Rainforest      1250     4100 23.2  saur_mon      27 Sebacinaceae
2 Premontane Rainforest      1250     4100 23.2  saur_mon      36 Sebacinaceae
```

- Explore by orchid species
```r
library(dplyr)
# most common fungus for each orchid species
myco %>% group_by(sp_code, fungal.ID) %>%
  summarize(count = length(fungal.ID)) %>%
  top_n(n=1, wt=count)
```

```
Source: local data frame [4 x 3]
Groups: sp_code [4]

  sp_code      fungal.ID count
    <chr>          <chr> <int>
1    difr Cantharellales    18
2    mach Cantharellales    15
3    mani   Sebacinaceae    10
4    onkl   Sebacinaceae    12
```

- By site:
```r
# three most common fungi at each site
myco %>% group_by(site, fungal.ID) %>%
  summarize(count = length(fungal.ID)) %>%
  top_n(n=3, wt=count)
```

```
Source: local data frame [11 x 3]
Groups: site [4]

    site      fungal.ID count
   <chr>          <chr> <int>
1      a Cantharellales     9
2      a   Sebacinaceae    10
3      a Tulasnellaceae     1
4      b Cantharellales    10
5      b   Sebacinaceae    10
6      c Atractiellales     4
7      c Cantharellales    11
8      c Tulasnellaceae     3
9      d Atractiellales     2
10     d Cantharellales    12
11     d Tulasnellaceae     4
```