R as a Research Tool: Lectures
Table of Contents
- 1. Introduction
- 2. Plots with ggplot
- 3. Math and Data structures 1
- 4. Data structures 2
- 5. Data structures 3
- 6. Introduction to Functions
- 7. Control structures
- 8. Lists and apply functions
- 9. Loops and more functions
- 10. Function exercises
- 11. Recursion
- 12. Scope and design
- 13. Debugging
- 14. Data Frames and Factors
- 15. Math and numbers
- 16. Strings
- 17. Math II
- 18. The Grammar of Graphics and ggplot2 I
- 19. The Grammar of Graphics and ggplot2 II
- 20. Function exercises
- 21. The Tidyverse
- 22. Shaping data and
dplyr - 23. More dplyr and data exploration
- 24. Tidy Data
- 25. Dates and Times
- 26. Statistical Models
- 27. Perception and Color
- 28. Version Control
1. Introduction
1.1. Why R?
- Why R?
?
- Why use a programming language at all?
- Why R vs Python, C/C++, Rust, Julia, Java, etc?
- Why use a programming language at all?
- Powerful
- Reproducible
- Interoperability
- Why R vs others?
- A language focused on data analysis and and well suited to interactive exploration of data.
- By far the most popular software for statistics and data analyses – although losing ground to python.
- A massive set of packages.
- Strong and friendly community. R is the de facto standard language in statistics and many areas of biology.
- Open source:
- Available to all, not just the rich.
- Contributes to reproducible research.
- Shortcomings of R compared to other languages
- Less "general purpose" than Python and some others.
- There are nutty aspects of the language (eg counting starts at 1, not zero
- There are MANY ways to accomplish the same thing in R. Many R functions use tricks to reduce the amount of typing at the cost of making code that is hard to understand. It is not really elegant.
- R "evolved." Many APIs are inconsistent (especially in base R!)
- Most R programmers are not computer scientists; there is a lot of poorly written and poorly documented R code out there.
- R is not very fast. It uses memory inefficiently. But it is often fast enough.
- Computer Languages
- Interpreted vs compiled languages
- Compiled
higher level language code is converted into machine code and then 'directly' executed by the host CPU
- Faster execution, but needs compilation time
- No dynamic typing or scoping
- Interpreted (eg R)
Programs are indirectly executed ("interpreted") by an interpreter program
- Platform independence
- Reflection (ability to examine and modify the structure and behavior of an object at run time)
- Dynamic typing
- Compiled
- R
- R is a high level computer language for statistical computing
- R is an interpreted language
- FORTRAN, C, C++ are all high level compiled languages
- other interpreted languages used in science: Python, Perl, Matlab/Octave, Julia.
1.2. Logistics
- Website and homework
- Website
- http://r-research-tool.schwilk.org
- Homework
- Email to me each Monday. See webpage for format instructions. dylan.schwilk@ttu.edu
- Syllabus
- http://r-research-tool.schwilk.org/handouts/R-as-a-Research-Tool-Syllabus.html
- Lecture code
All code from my lecture slides is in session files:
http://r-research-tool.schwilk.org
Use these files as your notes templates during lecture! Take notes as R comments.
- Me
- Plant ecologist: fire ecology, evolutionary ecology
- On the intertubes:
1.3. Getting started
- Installing R
See syllabus. You will install R (the open source language) and RStudio which is an integrated development environment many folks use for writing and running R code.
- Running R from the command line
Start R and you'll get a greeting, and then the R prompt, the > sign. The screen will look something like:
R
R version 4.3.2 (2023-10-31) -- "Eye Holes" Copyright (C) 2023 The R Foundation for Statistical Computing Platform: x86_64-pc-linux-gnu (64-bit) [...MORE...] Type 'demo()' for some demos, 'help()' for on-line help, or 'help.start()' for an HTML browser interface to help. Type 'q()' to quit R.
- RStudio
You will probably use this. The Desktop version is open source (AGPLv3) and free. It is an IDE (integrated development environment) that allows one to edit scripts, send code directly to the R process in a console window, debug, edit RMarkdown, etc all in one application.
- Windows: console, workspace, files, current open file (script).
- Projects: RStudio saves a .RProj file (probably hidden by your file browser) in your project directory.
- RStudio: keyboard shortcuts
1.4. Fundamentals of R
- R as a calculator:
4.0 * 3.5 log(10) # natural log log10(10) (3 + 1) * 5 3^-1 1/0[1] 14 [1] 2.302585 [1] 1 [1] 20 [1] 0.3333333 [1] Inf
- Interactive
You can start an R session on the command line or as a "console window" in RStudio. This is an interactive session.
- Or a stand alone script
Rscript run-stats.ROr make an executable script: On linux/mac, just start the text file with a "shebang" line and make the file executable:
#!/usr/bin/Rscript x <- 10 print(x) - Statements
- one statement per line
- but R will keep reading if the statement is "not finished" (open parentheses, hanging operator, etc)
- semicolon can end a statement (rarely needed)
1 + 2 # second statement ends here 3 + 4 + ( 7 - 7)[1] 3 [1] 7
- Assigning values to variables
- R now allows
=for assignment
- Variables that contain many values
- There are no scalars in R
But you can have vectors of length 1. Can you figure out what the
[1], [20]etc output below means?y <- rep("a",50) y[1] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" [20] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" [39] "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a" "a"
- Naming variables (objects)
- 1-63 characters
- first character a letter
- remaining characters: letters or numbers or period (+ underscore)
- case sensitive
- Parenthesis and braces
()Parenthesis group operators and are used in function calls{}braces group statements and return last expression evaluated. This is needed for function definitions, if-then statements and a few other areas.
- Functions
Functions are modules of code that accomplish a specific task. Functions take data as "arguments" and "return" a result. Once a function is written, it can be used over and over and over again. Functions can be "called" from the console or from within other functions.
- Functions in R
Arguments to R functions can be matched by position or by name. So the following calls to
sd()are all equivalentmydata <- rnorm(100) sd(mydata) sd(x = mydata) sd(x = mydata, na.rm = FALSE) sd(na.rm = FALSE, x = mydata) sd(na.rm = FALSE, mydata) - Example: random numbers
runif(5)[1] 0.0705746 0.7757365 0.2578603 0.3275505 0.6414583
[1] 0.0705746 0.7757365 0.2578603 0.3275505 0.6414583
runif(5, 0, 10)[1] 2.9466004 8.9681397 2.7195012 6.7328881 0.2388521
round(runif(5,-0.5,10.5))[1] 10 0 5 10 9
floor(runif(5,0,11))[1] 7 5 8 10 7
- Inspecting functions
sdfunction (x, na.rm = FALSE) sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm)) <bytecode: 0x5d0f98e12650> <environment: namespace:stats>function (x, na.rm = FALSE) sqrt(var(if (is.vector(x) || is.factor(x)) x else as.double(x), na.rm = na.rm)) <bytecode: 0x5ec3b5ea8de0> <environment: namespace:stats> - Comments
The R interpreter ignores anything after the "#" symbol on a lines:
x <- 2 # On the line above, I assigned the value of 2 to x. This comment is for human # readers x <- 3 # This comment is "inline" (use judiciously)
1.5. Getting help
- Help in R
- Help on the internet
- The R Project’s own manuals are available at the R home page and some search engines are listed there, too.
- RSeek search engine: http://www.rseek.org
- DANGER: You can post your R questions to r-help, the R listserve. See http://www.r-project.org/mail.html.
- DANGER: Check out StackOverflow for programming questions tagged "r": http://stackoverflow.com/
- StackOverflow is dying in part because of chatgpt — which was trained on all its answers.
2. Plots with ggplot
2.1. Setup
- Housekeeping and reminders
- Resources at http://r-research-tool.schwilk.org/ : class session (.R) files, weekly vocabulary lists, lecture notes
- Follow along with code and STOP ME if you get an error.
- Homework due on Mondays via email. These must be R scripts (plain text files). Write any answers or explanations asked for as comments (line starts with
#).
- Some advice on setting up your computer
- Set your file browser to show file extensions. Those are part of the file name despite microsoft's defaults. You might set your browser to show "hidden" fiiles as well but note you are not expected to modify those manually.
- Make a separate folder for each assignment and major analysis project. Use plain text for everything (you can edit in RStudio).
2.2. ggplot: introduction by example
- Making pretty figures
- Today we will get ahead of ourselves. Don't worry! I'll come back to the basics after showing you some fun with plots.
- We will start with the
ggplot()function
- Where do functions come from?
Many are part of base R. Others come from packages. Others you will write yourself.
- R packages are loaded with library()
library(ggplot2)
- To install a new package:
install.packages("ggplot2") - Note  B_note
- In the R interpreter console,
install.packageswill install for the current user. Or use the menu in Rstudio. If you want to install system-wide you can use your operating system's package manager. Example on Debian-based linux:apt install r-cran-ggplot2
- In the R interpreter console,
- R packages are loaded with library()
- ggplot2 and built in data sets in R
#install.packages("ggplot2") library(ggplot2) #?mpg head(mpg) #str(mpg) summary(mpg) ggplot()function
Create a plot object to which one can add layers (geoms)
- First plot
ggplot(mpg, aes(x=displ, y=hwy)) + geom_point() - geoms
ggplot()simply sets up the axes and the mapping. To put data on the plot you must add a "geometric object" or "geom"- Use
geom_point()for a scatterplot. See other geoms such asgeom_line(),geom_smooth(),geom_boxplot()
- Additional variables, an example
ggplot(mpg, aes(displ, hwy, color=class)) + geom_point() - Saving plots
You can store the plot object to a variable name then save with
ggsave():p <- ggplot(mpg, aes(displ, hwy)) + geom_point() ggsave("plots/plot1.pdf", p)Save the last plot even if it was not assigned to a variable:
ggplot(mpg, aes(cyl, hwy)) + geom_point() ggsave("plots/plot2.pdf")- You can use the RStudio GUI ("export" button in the plots window)
2.3. Aesthetics and faceting
- Aesthetics in addition to x and y
Aesthetic Discrete variable Continuous variables Color Rainbow of colors Gradient red to blue Size Discrete size steps Mapping radius->value Shape Different shape for each Doesn't work - Example aesthetics
ggplot(mpg, aes(displ, hwy, size = cyl)) + geom_point() - Faceting
- Small multiples displaying different subsets of the data.
- Useful for exploring conditional relationships and for large data.
- Examples to try:
p <- ggplot(mpg, aes(displ, hwy)) + geom_point() p + facet_grid(. ~ cyl) p + facet_grid(drv ~ .) p + facet_grid(drv ~ cyl) p + facet_wrap(~ class) - Faceting
facet_grid()- 2d grid,
rows ~ cols,.for no split facet_wrap()- 1d ribbon wrapped into 2d
scales- argument controls whether position scales are fixed or free
- Overplotting: What is wrong here?
p <- ggplot(mpg, aes(cty, hwy)) p + geom_point()
- Jitter
p + geom_jitter()
- How can we improve this plot?
ggplot(mpg, aes(class, hwy)) + geom_boxplot()
- Reorder
p <- ggplot(mpg, aes(reorder(class, hwy), hwy)) p + geom_boxplot()
- Multiple geoms
p + geom_jitter() + geom_violin()
- Your turn
- Read the help for
reorder. Redraw the previously plots with class ordered by median hwy. - How would you put the jittered points on top of the violin plots?
- Read the help for
3. Math and Data structures 1
3.1. More on getting started in R
- Session data
- .Rprofile
- File loaded on startup which can contain your own custom functions
- .Rhistory
- File that contains the session history of commands
- .RData
- Binary blob of data which contains all the objects created in the session. Watch out for this!
I suggest you write code which does not use or depend upon these files! For your homework, you must turn in code that runs in a "clean" environment. Check
Tools -> Global options -> Generalin RStudio. - Working directory
Set your working directory. You can use the GUI, an RStudio .Rproj file, or interactively on the command line:
getwd() setwd("C:/Users/Username/Desktop") setwd(path.expand("~/projects/flammability") ) - Never set the working directory in code, only in an interactive session or a project file.
- Your code will not be portable across machines.
- Instead, document in your code or elsewhere the assumptions your code makes about paths and working directories.
- For this class, we will always assume that the working directory is the same as the folder containing the current R script.
- Special "hidden files"
.Rdataand.Rhistory
- When you quit a session you are asked if you want to save the workspace
- or you can manually use
save.image() - When R starts it loads all objects from
./.Rdata(Unset this in RStudio!) - Saving your workspace should usually be avoided. Make sure your code runs
from a clean workspace: try
rm(list=ls(all=TRUE))and then rerun your code.
3.2. Math
- Operators
- Operator precedence
$Subsetting by name @component / slot extraction [[[subsetting ^exponentiation :sequence operation %any%special operators, including %% and %/% * /multiply and divide + -add and subtract < > <= >= = !greater, less than !negation &and&&and | and || or ~"depends on" in formulae <-assignment <<-global assignment ?help - Order of operations
x <- 3 + 4 * 2 x[1] 11
x <- (3 + 4) * 2 x x <- (3 + 4) * sqrt(2) x[1] 14 [1] 9.899495
3.3. Vector basics
- R data structures
- vectors
- matrices and arrays
- lists
- data frames
And other specialized structures provided by packages (eg tibbles, data.table, phylogenetic trees …)
- Vectors: the heart of R
x <- 1:20 x y <- seq(1,20) y z <- seq(1,40,2) z[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 [20] 20 [1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 [20] 20 [1] 1 3 5 7 9 11 13 15 17 19 21 23 25 27 29 31 33 35 37 [20] 39
- Vectors: inspecting
x <- c(5,12,13) x length(x) mode(x)[1] 5 12 13 [1] 3 [1] "numeric"
- The ":" operator
5:8 5:1[1] 5 6 7 8 [1] 5 4 3 2 1
- Beware of operator precedence:
i <- 4 1:i-1 1:(i-1)[1] 0 1 2 3 [1] 1 2 3
- seq()
seq(from=12,to=30,by=3) seq(from=1.1,to=2,length=10) seq(from=12,to=30,by=3)[1] 12 15 18 21 24 27 30 [1] 1.1 1.2 1.3 1.4 1.5 1.6 1.7 1.8 1.9 2.0 [1] 12 15 18 21 24 27 30
- rep()
rep(8,4) rep(c(5,12,13),3) rep(c(5,12,13),each=2)[1] 8 8 8 8 [1] 5 12 13 5 12 13 5 12 13 [1] 5 5 12 12 13 13
- The subscripting operators (used to extract a subset)
R has three subscripting operators:
[[[$
We'll start with one you've already used in your homework:
[ - Vector subsetting
v <- 1:100 v[1] 1 2 3 4 5 6 7 8 9 10 11 12 13 14 [15] 15 16 17 18 19 20 21 22 23 24 25 26 27 28 [29] 29 30 31 32 33 34 35 36 37 38 39 40 41 42 [43] 43 44 45 46 47 48 49 50 51 52 53 54 55 56 [57] 57 58 59 60 61 62 63 64 65 66 67 68 69 70 [71] 71 72 73 74 75 76 77 78 79 80 81 82 83 84 [85] 85 86 87 88 89 90 91 92 93 94 95 96 97 98 [99] 99 100
v <- v * 10 v[1:3][1] 10 20 30
- Vector indexing (subsetting)
There are six things that you can use to subset a vector. First three are used commonly:
- positive integers
- negative integers
- logical vectors
- nothing (
[]returns original vector) - zero (
[0]returns empty vector) - character vectors (only if vector is named)
- Subsetting vectors using
[
y <- c(1.2,3.9,0.4,0.12) y[c(1,3)] y[1:3][1] 1.2 0.4 [1] 1.2 3.9 0.4
v <- c(1,4) y[-v] # returns all except 1st and 4th element[1] 3.9 0.4
- Subsetting with logical vectors
y <- c(1.2,3.9,0.4,0.12) l <- y < 2 l y[l][1] TRUE FALSE TRUE TRUE [1] 1.20 0.40 0.12
- Character vectors (aka vectors of "strings")
y <- "abc" length(y) mode(y) y <- c("abc","1","2") length(y)[1] 1 [1] "character" [1] 3
- Strings
u <- paste("abc", "de", "f") # concatenate the strings u[1] "abc de f"
v <- strsplit(u, " ") # split the string according to blanks v[[1]] [1] "abc" "de" "f"
- Note  B_note
Vectors that hold pieces of text are called character vectors in R. Formally, the mode of an object that holds character strings in R is "character". The individual elements of an R "character vector" are character strings, or what most computer languages call "strings".
See the
stringrpackage for useful string functions. I'll introduce this package later in the course.
- Note  B_note
- Some vector concepts
- Recycling
- Automatic lengthening of vectors in certain settings
- Filtering
- Extracting subsets of vectors
- Vectorization
- Where functions are applied element-wise to vectors
- Recycling
x <- c(1, 2, 3, 4)Recycling:
c(1,2) + x[1] 2 4 4 6
c(1, 2, 3) + x[1] 2 4 6 5
- Vectorization
Many functions are vectorized, including operators such as
>. They conduct element-wise operationsu <- c(5,2,8) v <- c(1,3,9) u > v[1] TRUE FALSE FALSE
sqrt(1:9)[1] 1.000000 1.414214 1.732051 2.000000 2.236068 2.449490 2.645751 2.828427 [9] 3.000000
- Learn these functions
exp(), log(), log10(), sqrt(), abs(), min(), which.min(), max(), which.max(), pmin(), pmax(), sum(), prod() (for products of multiple factors), round(), floor(), ceiling(), sort(), factorial()
x <- c(1, 2, 3, 4) y <- c(3, 6, 1, 4) min(y) which.min(y) pmin(x, y)[1] 1 [1] 3 [1] 1 2 1 4
4. Data structures 2
4.1. Logical vectors
- Boolean operators
Operate on TRUE/FALSE values only
Operation Math R ------------ ------------------------ ---------- AND x ∧ y & OR x ∨ y | NOT !x ! ------------ ------------------------ ---------- - Boolean operators
operator function -------- ----------------- == Test for equality < Less than > Greater than <=,>= lt or equal, gt or equal & Boolean AND for vectors | Boolean OR for vectors !x Boolean negation xor() Exclusive OR %in% Set inclusion any(), all() Boolean vector reduction - Logical vectors and filtering (subscripting)
y <- c(1.2, 3.9, 0.4, 0.12) v <- c(TRUE, TRUE, FALSE, FALSE) # unusual to see literal booleans in real code y[v][1] 1.2 3.9
y[ y > 1 ][1] 1.2 3.9
y > 1[1] TRUE TRUE FALSE FALSE
- Practice
# All the values that are evenly divisible by 9 and are greater than 200 v <- 1:500 ch <- as.character(v) ch[v %% 9 == 0 & v > 200][1] "207" "216" "225" "234" "243" "252" "261" "270" "279" [10] "288" "297" "306" "315" "324" "333" "342" "351" "360" [19] "369" "378" "387" "396" "405" "414" "423" "432" "441" [28] "450" "459" "468" "477" "486" "495"
- Practice:
# Get all the elements that evenly divisible by 2 or by 3 v <- 1:100 # ? matchand%in%
names <- c("fortis", "spinosa", "macrocarpa", "leucophylla") samples <- c("spinosa", "leucophylla") samples %in% names[1] TRUE TRUE
names %in% samples[1] FALSE TRUE FALSE TRUE
match(samples, names)[1] 2 4
matchand%in%
matchgives indices of where elements of first argument appear in the second argument%in%returns vector of same length as first argument with TRUE for each element that occurs in second argument.
- Practice
head(LETTERS) match("G", LETTERS) length("G" %in% LETTERS) == 1 length(LETTERS %in% "G") == 26[1] "A" "B" "C" "D" "E" "F" [1] 7 [1] TRUE [1] TRUE
- any(), all()
Reduce boolean (logical) vectors to a scalar.
any(y > 1)[1] TRUE
all(y > 1)[1] FALSE
- Additional logical functions
is.na() is.finite() duplicated() which() ifelse() - Assignment to subset
which(y > 1)[1] 1 2
y[y > 1] <- 0 y[1] 0.00 0.00 0.40 0.12
- Vectorized if-else
x <- 1:10 y <- ifelse(x %% 2 == 0, 0, 2*x) y[1] 2 0 6 0 10 0 14 0 18 0
4.2. More vectors
- NA and NULL
- NA
- Missing data
- NULL
- The null object
x <- c(1,NA,3,4,5,6) mean(x)[1] NA
mean(x, na.rm=TRUE)[1] 3.8
x <- c(1, NULL, 3, 4, 5, 6) x # NULL can't be part of a vector mean(x)[1] 1 3 4 5 6 [1] 3.8
- Testing for NA
x <- c(1, 2, NA) x==NA ## Oops is.na(x) # Correct[1] NA NA NA [1] FALSE FALSE TRUE
- unique()
x <- c(1, 2, 3, 3, 3, 4, 4, 1, 2) unique(x)[1] 1 2 3 4
- Counting TRUES
x < 3 length(which(x < 3)) sum(x < 3)[1] TRUE TRUE FALSE FALSE FALSE FALSE FALSE TRUE TRUE [1] 4 [1] 4
- Vector element names
x <- c(1,2,4) names(x)NULL
names(x) <- c("one","two","four") xone two four 1 2 4
- Subsetting on names
x[c("one", "four")]one four 1 4
- c(): Type coercion:
- When you attempt to combine different types they will be coerced to the most flexible type.
- The ordering from least to most flexible is
logical < integer < numeric < complex < character < list.
c(5,2,"abc") c(5,list(a=1,b=4))[1] "5" "2" "abc" [[1]] [1] 5 $a [1] 1 $b [1] 4
- c(): Flattening:
c()creates vectorsc(1,2,c(100,200,300))[1] 1 2 100 200 300
4.3. Matrices
- Matrices
m <- matrix(data = c(1,3,5,2,4,6), nrow=3, ncol=2) m[,1] [,2] [1,] 1 2 [2,] 3 4 [3,] 5 6m <- matrix(data = c(1,3,5,2,4,6), nrow=3) m[,1] [,2] [1,] 1 2 [2,] 3 4 [3,] 5 6 - Matrices con.
a <- matrix(c(1, 2, 3, 4), nrow=2) a[,1] [,2] [1,] 1 3 [2,] 2 4b <- matrix(c(6, 7, 8, 9), nrow=2) b[,1] [,2] [1,] 6 8 [2,] 7 9 - Matrix addition
a + b[,1] [,2] [1,] 7 11 [2,] 9 13 - Matrices math, dimensions
c <- matrix(c(3,4,5,6), nrow=1) c[,1] [,2] [,3] [,4] [1,] 3 4 5 6a + c # result: Error in a + c : non-conformable arrays
- Matrix math, con.
a * b[,1] [,2] [1,] 6 24 [2,] 14 36a^3[,1] [,2] [1,] 1 27 [2,] 8 64 - Matrix multiplication
a <- matrix(c(1,2,3),ncol=3) b <- matrix(c(4,5,6),ncol=1) a b[,1] [,2] [,3] [1,] 1 2 3 [,1] [1,] 4 [2,] 5 [3,] 6Calculate dot product (inner product)? Or generally, matrix product?
- Matrix multiplication
# a*b # ERROR a %*% b[,1] [1,] 32b %*% a[,1] [,2] [,3] [1,] 4 8 12 [2,] 5 10 15 [3,] 6 12 18 - rbind and cbind
rbind(a, a) # rbind(a,b) # ERROR! cbind(b, b)[,1] [,2] [,3] [1,] 1 2 3 [2,] 1 2 3 [,1] [,2] [1,] 4 4 [2,] 5 5 [3,] 6 6 - Row and column means
See:
rowMeans()colMeans()rowSums()colSums()
- Subsetting matrices
m <- matrix(c(1,2,3,4,5,6),nrow=3) m[1:2,1:2][,1] [,2] [1,] 1 4 [2,] 2 5 - Dimension reduction
z <- matrix(1:8, nrow=4) z[,1] [,2] [1,] 1 5 [2,] 2 6 [3,] 3 7 [4,] 4 8r <- z[2,] r[1] 2 6
- Dimension reduction result
attributes(z) attributes(r)$dim [1] 4 2 NULL
- Avoiding dimension reduction
r <- z[2,, drop=FALSE] r[,1] [,2] [1,] 2 6dim(r)[1] 1 2
5. Data structures 3
5.1. Lists
- Lists
An R list is a container whose contents can be items of different data types, unlike a vector.
x <- list(u=2, v="abc") x x$u$u [1] 2 $v [1] "abc" [1] 2
- Lists vs vectors
An R list is a container whose contents can be items of diverse data types, as opposed to them being, say, all numbers, as in a vector.
- Lists uses
A common usage of lists is to package the return values of complex statistical functions. For example, the histogram function hist():
hist(Nile) hn <- hist(Nile) print(hn)$breaks [1] 400 500 600 700 800 900 1000 1100 1200 1300 1400 $counts [1] 1 0 5 20 25 19 12 11 6 1 $density [1] 0.0001 0.0000 0.0005 0.0020 0.0025 0.0019 0.0012 0.0011 0.0006 0.0001 $mids [1] 450 550 650 750 850 950 1050 1150 1250 1350 $xname [1] "Nile" $equidist [1] TRUE attr(,"class") [1] "histogram"
- Subsetting lists
[- returns a list
[[- returns an element in the list
- (no term)
$returns an element by name
- List subsetting
j <- list(name=c("Joe", "Doe"), salary=39000, permanent=TRUE) j[["salary"]] j$salary # shorthand for above j$sal # "$" does partial matching - dangerous! j[[2]] # extract by position[1] 39000 [1] 39000 [1] 39000 [1] 39000
- List indexing
fieldname <- "salary" j[[fieldname]] ## '[[' most robust way to extract a single element[1] 39000
- List indexing
j[2]$salary [1] 39000
- [] vs [[]]
j[[1]]- Extracts a single item
j[1]- Extracts a sublist
- [] vs [[]]
class(j[["name"]]) class(j["name"])[1] "character" [1] "list"
- Adding elements
j$rnumber <- "R00001111" j$name [1] "Joe" "Doe" $salary [1] 39000 $permanent [1] TRUE $rnumber [1] "R00001111"
- Adding more elements
length(j) j[[5]] <- "Extra" j[6:7] <- c(FALSE,FALSE) length(j)[1] 4 [1] 7
- Adding more elements
j[3:7]$permanent [1] TRUE $rnumber [1] "R00001111" [[3]] [1] "Extra" [[4]] [1] FALSE [[5]] [1] FALSE
- Deleting elements
length(j) j$rnumber <- NULL length(j) names(j)[1] 7 [1] 6 [1] "name" "salary" "permanent" "" [5] "" ""
- Inserting elements
Either break apart and re-concatenate with
c(), or useappend():newj <- append(j, list(birthday="5/17/1990"), after=3) newj[3:5]$permanent [1] TRUE $birthday [1] "5/17/1990" [[3]] [1] "Extra"
- Lists to vectors:
unlist()
l <- list(a=1,b=10,c=100) v <- unlist(l) v mode(l) mode(v) unname(v)a b c 1 10 100 [1] "list" [1] "numeric" [1] 1 10 100
5.2. Data Frames
- Data Frames
Like a matrix, but for mixed types. This is what you will use most often!
d <- data.frame(kids=c("Jack","Jill","Igor"), ages=c(12,10,11)) dkids ages 1 Jack 12 2 Jill 10 3 Igor 11
- Classes
R is an object-oriented language in which objects are instances of classes. Much of R uses "S3" classes.
mode(d) class(d)[1] "list" [1] "data.frame"
- Reading data
For now, we will assume you save your data as a comma separated values file (say, in microsoft excel using "save as"), and then read that file into a dataframe:
fish <- read.csv("../data/fish-data.csv") # or via http: fish <- read.csv("http://r-research-tool.schwilk.org/data/fish-data.csv") head(fish, 3)aquarium nutrient color.score 1 a low 5.3 2 a low 4.7 3 a low 5.0
- Try
?read.csv
read.csv(file, header = TRUE, sep = ",", quote = "\"", dec = ".", fill = TRUE, comment.char = "", ...) - Storing data as character delimited files
Limitations:
- The delim character cannot appear in a field unless it is quoted
- The quote character should not occur at all in a field
comma separated values (csv) and tab-delimited files (tsv) are both fairly common. You might also run into fixed-width data
5.3. Testing, inspecting, coercing
- Vectors and lists relationships
- Testing
- Vector types (internal storage mode)
is.logical(),is.integer(),is.double(), andis.character()- BUT AVOID
is.vector(),is.atomic(), andis.numeric()because they don't do exactly what you expect. mode(): Mutually exclusive classification according to structure with atomic modes named by type and recursive objects like lists and functions separate.typeof()will distinguish numeric types as integer vs double
class()
- Property assigned to an object to determine how some generic functions like
summary()work.
- Property assigned to an object to determine how some generic functions like
- Lecture notes  B_note
Class is just a property — you can change this.
- Vector types (internal storage mode)
- Type coercion
as.logical(), as.integer(), as.double(), or as.character()
as.logical(0:3) as.logical(c("a", "b", "c")) as.character(0:3) as.logical(as.character(0:3)) as.integer(as.character(0:3))[1] FALSE TRUE TRUE TRUE [1] NA NA NA [1] "0" "1" "2" "3" [1] NA NA NA NA [1] 0 1 2 3
6. Introduction to Functions
6.1. Functions
- Advice following HW02
- Don't copy data around unnecessarily. Use good variable names.
- Write code to obtain values
- Don't hard-code indices.
- Read style guide now.
- Why write functions?
Why not just copy and paste code?
- You can give a function a name that makes your code easier to understand.
- If you need to change the code, you only need to change it in one place.
- You eliminate the chance of making mistakes when you copy and paste.
- You can pass functions to other functions
- When to write a function?
Whenever you find yourself copying and pasting more than once or twice.
- Functions are objects
So we create them and assign them to a name.
addone <- function(x) { return(x+1) } addone(1)[1] 2
- Function example
pluralize_if <- function(thenames, nums) { endings <- ifelse(nums > 1, "s", "") result <- paste(thenames, endings, sep="") return(result) } pluralize_if("computer", 44) pluralize_if("instructor", 1) pluralize_if(c("dog", "cat", "parakeet"), c(1,2,3))[1] "computers" [1] "instructor" [1] "dog" "cats" "parakeets"
- Functions in R
Arguments to functions can be required or optional. Optional arguments have default values provided in the function definition:
pluralize_if <- function(thenames, nums=1) { endings <- ifelse(nums > 1, "s", "") result <- paste(thenames, endings, sep="") return(result) } pluralize_if("computer") pluralize_if(rep("cat", 3)) pluralize_if(rep("dog", 3), c(1,2,3))[1] "computer" [1] "cat" "cat" "cat" [1] "dog" "dogs" "dogs"
6.2. In class project 1
- Function randomIntegers()
- Problem: convert random real numbers to integers
- Your turn
# randomIntegers: produce vector of uniformly # distributed random integers. # # Arguments: # n = number of random integers to be generated. # min_val, max_val = min, max possible values, # inclusive. # Returns: # vector of random integers. randomIntegers <- function(n, min_val, max_val) { ## ???? return(result) }
6.3. Notes on functions
- Steps to writing a function
- Decide what information is needed (arguments) and what to return
- Develop the algorithm
- Implement in code
- Potential problems
- Syntax errors (error when you define the function)
- Errors during run time (eg undefined variable names)
- Runs with no errors but the output is wrong!
- Debugging
More on this later … For now:
- use
print() - use
stopifnot()
- use
- Lecture notes  B_note
Default printing only works in certain scope. You can use the print() function to be explicit and guarantee display to screen.
7. Control structures
7.1. if else
- if-else control
In
Rwe can often treat function arguments and other objects with the same code regardless of values. If we need to make a decision based on element values, theifelse()function is often enough.However, sometimes, we need to conditionally execute chunks of code.
- if-else decisions
tf <- function(a) { if (a == 13) { return("Lucky!") } else { return("No luck") } }tf(5) tf(13)[1] "No luck" [1] "Lucky!"
if ()
The value in parentheses after the
ifshould evaluate to a single boolean (logical) value. Note thatifis not a function, we usually put space betweenifand(.- Be careful inverting boolean logic!
A <- -8 if (A>0 & A<10) { print("YES") } else { print("NO") } # If we want to invert the logic: if (! A>0 & !A<10) { print("YES") } else { print("NO") } # Not what we want if (! (A>0 & A<10) ) { print("YES") } else { print("NO") }[1] "NO" [1] "NO" [1] "YES"
- if, else if, else
tf <- function(r) { if (r == 13) { return("Lucky 13!") } else if (r < 0) { return("Negative!") } else { return("Just a number") } }tf(13) tf(-100) tf(10)[1] "Lucky 13!" [1] "Negative!" [1] "Just a number"
ifneeds a single logical value
if (c(TRUE, FALSE)) {return("DONE")}Error in if (c(TRUE, FALSE)) { : the condition has length > 1|vs||and&vs&&
||and&&can be safer iniftests: Require logical(1) arguments. These are "short circuiting"x <- 2; y <- 1:3 if (x > 0 && x > y) { print("Yes") }Error in x > 0 && x > y 'length = 3' in coercion to 'logical(1)'
- Short circuiting
a TRUE || a TRUE | aError: object 'a' not found [1] TRUE Error: object 'a' not found
7.2. Loops
- for and while loops
x <- c(1, 2, 3) for (i in x) { print(i^2) }[1] 1 [1] 4 [1] 9
i <- 1 while (i <= 10) { i <- i+4 print(i) }[1] 5 [1] 9 [1] 13
- Breaking out of loops
i <- 1 while(TRUE) { i <- i+4 if (i>10) break } i[1] 13
Or use repeat:
i <- 1 repeat { i <- i+4 if (i>10) break } i[1] 13
- Loop hierarchy
specific to general:
for,while,repeat- So when do we use each type?
- Use
forloops when you need to iterate over a vector. There are a known number of iterations. - Use
whileloops when the loop must continue until some condition is met. If the condition is met when the loop is first encountered, the loop is skipped completely. - Use
repeatas forwhile, but when the loop must execute at least once or when the best exit point(s) is/are somewhere in the middle of the loop block.
- Use
- So when do we use each type?
- Two idioms for
forloops
Idiom 1: Do something to each element
v <- c(2,4,5,8) for(n in v) { print(n) }[1] 2 [1] 4 [1] 5 [1] 8
- Idiom 2: Iterate over an indexing vector
in order to do something to elements over multiple vectors
v2 <- c(1,1,2,10) for(i in seq_along(v)) { r <- v[i] * v2[i] print(sprintf("Loop %d, Result: %d", i, r)) }[1] "Loop 1, Result: 2" [1] "Loop 2, Result: 4" [1] "Loop 3, Result: 10" [1] "Loop 4, Result: 80"
seq_alongvs:
v <- c(2,4,5) for(i in 1:length(v) ) { r <- v[i] * v2[i] print(c(i, r)) } v <- NULL # Loop should not cycle even once for(i in 1:length(v)) { r <- v[i] * v2[i] print(c(i, r)) } #print(1:0)[1] 1 2 [1] 2 4 [1] 3 10 [1] 1 [1] 0
8. Lists and apply functions
8.1. Functions review
- The return() statement
If you do not include one, R uses the last statement evaluated
- Side effects
"Pure" functions have no side effects, only returned values. But you have probably already used functions entirely for their side effect (eg
library(),write.csv(),ggplot()) - Luckily, R protects you from most accidental side effects
Function has read access to variables at higher scope, but no write access.
x <- y <- 10 func <- function(x) { x <- 0 y <- y*2 return(y) } func() x y[1] 20 [1] 10 [1] 10
8.2. In-class project
8.3. Apply functions
applyfamily of functions
apply- apply a function to the rows, columns or values in a matrix/array
lapply- apply a function to every item in a list and return a list
sapply- lapply but simplifies the result and returns a vector
vapply- Like `sapply` but safer when you know the element types you expect in the output. Use this when possible.
mapply- You have multiple data structures and want to hand elements of each as separate arguments to a function
There are more in this family that we use less commonly. See ??apply and scroll down.
- Applying functions to lists
- lapply example
Consider sexing 10,000 shrimp. We'll use a subset. We'd like to know the indices for all females, males and juveniles:
shrimp <- c("M", "F", "J", "J", "F", "F", "M", "F", "F") shrimp=="F" which(shrimp=="F")[1] FALSE TRUE FALSE FALSE TRUE TRUE FALSE TRUE TRUE [1] 2 5 6 8 9
- Example, continued
codes <- c("F","M","J") lapply(codes, function(sex) which(shrimp==sex))[[1]] [1] 2 5 6 8 9 [[2]] [1] 1 7 [[3]] [1] 3 4
- Whoa, where was the function?
It was unnamed! Alternative with named function:
# function to return indices of x in v indices <- function(x, v) { return(which(x==v)) } # and use sapply for USE.NAMES=TRUE: sapply(codes, indices, v=shrimp, simplify=FALSE)$F [1] 2 5 6 8 9 $M [1] 1 7 $J [1] 3 4
- Let's look at documentation
- Lists of lists
a <- list(name="Dylan", dept="Biology", role="faculty") b <- list(name="Jane", dept="Plant and Soil", role="student") c <- list(name="Kathryn", dept="NRM", role="student") course <- list(Schwilk=a, Doe=b, Smith=c) course[[3]]$role[1] "student"
How can we extract a vector of just the first names?
sapply
sapply(course,"[[","name")Schwilk Doe Smith "Dylan" "Jane" "Kathryn"
vapply
The problem: R is flexible but that can lead to silent bugs
test <- list(a = c(1, 3, 5), b = c(2,4,6), c = c(9,8,7)) sapply(test, max) test$d <- c("a", "b", "c") sapply(test, max)a b c 5 6 9 a b c d "5" "6" "9" "c"
vapply
vapply(test, max, numeric(1)) # error vapply(test[-4], max, numeric(1)) # worksError in vapply(test, max, numeric(1)) : values must be type 'double', but FUN(X[[4]]) result is type 'character' a b c 5 6 9
- Apply functions to matrices:
apply()
y <- apply(X, MARGIN, FUN, ...)X- an array
MARGIN- subscripts which the function gets applied over
FUN- the function to be applied
...- additional arguments to function
Returns a vector or array
- Applying a function to rows or columns
the
apply()function.MARGINargument = 1 if the function applies to rows, 2 for columns, etc.z <- matrix(c(1,2,3,4,5,6), ncol=2) z[,1] [,2] [1,] 1 4 [2,] 2 5 [3,] 3 6apply(z, 2, mean)[1] 2 5
- on rows, using our own function:
f <- function(x) x/c(2, 8) y <- apply(z, 1, f) y[,1] [,2] [,3] [1,] 0.5 1.000 1.50 [2,] 0.5 0.625 0.75 - What are "margins"?
- Examples: matrix
- MARGIN = 1: slice up into rows
- MARGIN = 2: slice up into columns
- MARGIN = c(1,2): slice up into individual cells
— Wickham (2011): http://www.jstatsoft.org/v40/i01/
- 3-d arrays
- Example: grid of genotypes
Let's store genotypes as 1st dimension on a 2d landscape of organisms. So we will make a 3d array:
makeGenotype <- function() sample(1:100, 25, replace=TRUE) # simple 10 x 10 landscape landscape <- array(replicate(100, makeGenotype()), dim =c(25,10,10 )) # The first genotype in landscape position 1,1: landscape[, 1, 1][1] 62 90 35 1 44 76 50 29 96 2 38 98 54 34 [15] 100 2 22 30 92 30 47 8 54 53 34
- Example: apply
## get matrix of fitnesses testFitness <- function(g, e) { return( 1 / ( sum((e-g)^2) + 1)) } opt <- makeGenotype() # "best" genotype fitnesses <- apply(landscape, c(2,3), testFitness, e = opt) # lets see the relative fitnesses as percentages round(fitnesses / max(fitnesses)*100) - Results:
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [1,] 39 48 50 31 47 47 42 49 35 33 [2,] 34 75 30 38 60 48 38 37 65 45 [3,] 57 35 54 38 35 38 42 100 37 30 [4,] 35 39 49 56 47 37 47 47 40 32 [5,] 45 40 68 60 49 52 47 47 39 30 [6,] 52 39 37 44 47 36 36 27 45 42 [7,] 31 54 41 54 41 59 33 32 47 37 [8,] 31 89 51 40 52 36 37 37 26 36 [9,] 25 37 38 34 31 54 51 53 35 65 [10,] 57 50 31 30 45 45 37 45 51 72 - Where is the most fit genotype?
max(fitnesses) indices <- which(fitnesses == max(fitnesses), arr.ind=TRUE) indices landscape[, indices[1], indices[2]] opt[1] 6.573758e-05 row col [1,] 3 8 [1] 78 87 54 73 81 47 44 28 39 56 51 30 36 92 60 7 22 98 51 [20] 93 42 3 54 18 54 [1] 80 93 10 94 64 21 48 51 59 40 20 1 30 92 [15] 100 40 63 57 44 65 56 34 66 29 32 - Heatmap of landscape:
image(fitnesses)
9. Loops and more functions
9.1. In class project
- Project 1: calculating a probability
Suppose we have
nindependent events each with a separate probability of occurring. What is the probability of exactly one of those events occurring? - The algorithm
if the probability of the \(i^{th}\) event is \(p_i\), then P(A and not B and not C) would be be \(p_A(1 - p_B)(1 - p_c)\). For general n:
\[ \sum\limits_{i=1}^{n} p_i (1 - p_1) \ldots (1 - p_{i-1}) (1 - p_{i+1}) \ldots{} (1 - p_n) \]
- Your turn: translate this to code
Function takes vector of probabilities and returns a scalar probability of exactly one event occurring.
exactlyOne <- function(p) { ## ? }You will want to loop through the vector
pwhile also looping through "not p" (1-p)
10. Function exercises
10.1. In class projects
- Converting units
Write a function called
convertInchesthat accepts two arguments. The first is a numeric vector representing measurements in inches, the second is a character vector whose elements have the possible values "cm", "m", "ft". Depending on the second argument, the function converts inches to either centimeters, meters, or feet. Make the second argument an optional argument with a default value.## convertInches: Converts measurements in inches to centimeters, meters, or feet. ## Args: ## x = a numeric vector containing measurements in inches units = a ## character vector with possible values "cm", "m", or "ft" ## Returns: ## a vector representing the measurements in the requested units. convertInches <- function(x, units="cm") { # your code here } - Above approach with lookup table
conversions <- list(cm = 2.54, m = 0.0254, ft = 1/12) getConversionMultiplier <- function(units) { if (!(units %in% names(conversions))) { stop(paste("Unrecognized units. Allowed values are ", names(conversions))) } return(conversions[[units]]) } convertInches3 <- function(x, units="cm") { multipliers <- sapply(units, getConversionMultiplier) return(x*multipliers) } convertInches3(1:3, c("cm", "m", "ft")) - Choose n points within a circle
- Choosing a random point in a circle
Several methods:
- Rejection sampling
- Choose random angle and random distance from origin (need to scale radius to account for area going up as square of distance from the center). Then convert from radians and distance to Cartesian coordinates.
- Archimedes and triangles (not shown, check out StackOverflow)
- Approach #1
- The uniform distribution has the nice property that if we reject a portion of the sampled space, the remaining points are uniform in the remaining space.
- However, if we use rejection sampling, we can't predict how many points we will need ahead of time (we should end up needing about 27% more random points than we use)
- So we need to use a while loop to grab the points, then only keep each point if
sqrt(x^2 + y^2) <= r
- result
11. Recursion
11.1. Recursion
- Project 2: Fibonacci sequence
0,1,1,2,3,5,8,13 …
- The recursion stack
- "Exploding" dice
In some table-top role playing games, dice are used to provide stochasticity in the simulation of events. In order to produce a more "exponential-like" distribution, some games use "acing" or "exploding dice" in which the die is re-rolled if the first roll results in the maximum value and the new value is added to the first.
- Dice
Photo by Clément Bucco-Lechat, CC BY-SA 3.0.
- Simulating such a die roll
rollDie()
Write a function called
rollDie(x)that simulates rolling a dice with x sides and allows "acing" or "exploding dice". Under this rule, a maximum roll (eg a "6" on a six-sided die) results in a second roll whose value is added to the first. This is open-ended and the second roll can "ace" as well: in other words, the resulting probability distribution is unbounded on the upper end.
- rollDice() ?
Similar function but returns result of
nindependent rolls?# RollDice: A dice rolling simulator which returns a vector # of optionally acing die rolls # # args: # n = size of output vector desired # die = An integer representing the die type # allow.ace = a logical value indicating if the rolls # are allowed to "ace" # ----------------------------------------------------- # returns: # a vector of length n containing the simulated # die roll results rollDice <- function(n, die, allow.ace = FALSE) { # ? }- Lecture notes  B_note
Modify the rollDie() function we wrote so that it is similar to the builtin random number functions such as
runif()and returns a vector rather than a scalar.rollDice(n,die)takes an initial argument,n, that gives the number of output values and returns a vector of length n. Remember, this die roll can be "open ended:" a maximum result (eg a 6 on a six-sided die) results in a second roll the result of which is added to the first. This is controlled by the "allow.ace" argument ("acing" is another term for "exploding" dice). So there is no theoretical upper limit to the result.
- Lecture notes  B_note
- Testing
# rollTester: simulates many rolls and returns proportions rollTester <- function(die) { n <- 10000 rolls <- rollDice(n, die, TRUE) # hist(rolls) res <- sapply(1:(die-1), function(x) sum(rolls==x) / n) res[die] <- sum(rolls > die) / n res <- res * die # scale so that all proportions should be 1 not 1/die return(res) } rollTester(die=8) # should all be about 1.0[1] 1.0152 1.0120 0.9792 0.9664 0.9776 1.0328 0.9976 1.0192
11.2. Pros / Cons
- Recursion pros / cons
- Recursion allows a very simple definition of the function
- Can be memory hungry: function is added to stack and values kept until call finishes
- Can result in "stack overflow" if the function is called too many times before the first finally returns
- Can be slower than iterative version (but in some cases, can be faster)
12. Scope and design
12.1. Scope
- Where do names apply?
- Nested functions
w <- 12 f <- function(y) { d <- 8 h <- function() { return(d * (w+y) ) } print(environment(h)) return(h()) } environment(f) ls() f(10)<environment: R_GlobalEnv> [1] "f" "w" <environment: 0x5d0f9a9b8b50> [1] 176
Note: confusingly, we refer to the top environment as .GlobalEnv in code.
- Scope hierarchy
- Scope hierarchy con.
f <- function(y, ftn) { d <- 8 print(environment(ftn)) return(ftn(d, y)) } h <- function(dd, yy) { return(dd * (w + yy)) } w <- 12 f(3, h)<environment: R_GlobalEnv> [1] 120
- Scope and assignment
In general, R functions do not change variables at a higher scope
x <- 1 increment.x <- function() { print(x) x <- x+1 print(x) } increment.x() x[1] 1 [1] 2 [1] 1
- Super assignment
increment.x <- function() { x <<- x+1 } x increment.x() x[1] 1 [1] 2
- assign()
Write to upper level variables using character strings for names
filelist <- list.files(path="../handouts/", pattern="vocab_week_[0-9][0-9].org$") #filelist for(s in filelist) { assign(s, readLines(file(file.path("../handouts/", s))) ) } ls()[1] "f" "filelist" [3] "h" "increment.x" [5] "s" "vocab_week_10.org" [7] "vocab_week_11.org" "vocab_week_12.org" [9] "vocab_week_13.org" "vocab_week_14.org" [11] "w" "x"
12.2. Programming style and organization
- Programming style
"Design" and "style" are related but separate ideas. Design is more important but style matters, too.
- Good style aids maintenance
- Use comments to tell the readers what they need to know
- Use indentation to mark the levels of program control
- Use meaningful names
- Develop a convention for names (global variables vs other variables vs function names)
- Use standard idioms where appropriate
- Avoid unnecessary complexity. Avoid code repetition!
- Good style aids maintenance
- Function Documentation
For example, Google's style:
calculateSampleCovariance <- function(x, y, verbose = TRUE) { # Computes the sample covariance between two vectors. # # Args: # x: One of two vectors whose sample covariance is to be calculated. # y: The other vector. x and y must have the same length, greater than one, # with no missing values. # verbose: If TRUE, prints sample covariance; if not, not. Default is TRUE. # # Returns: # The sample covariance between x and y. n <- length(x) # Error handling if (n <= 1 || n != length(y)) { stop("Arguments x and y have invalid lengths: ", length(x), " and ", length(y), ".") } if (TRUE %in% is.na(x) || TRUE %in% is.na(y)) { stop(" Arguments x and y must not have missing values.") } covariance <- var(x, y) if (verbose) cat("Covariance = ", round(covariance, 4), ".\n", sep = "") return(covariance) - Organizing tasks (design)
- Modularization
Modularization is the process programmers use to break up complex programs into smaller more manageable procedures.
- Constants
- Global variables and constants
It can be useful to put some infrequently changing constants at the top of a script file:
GRID_SIZE <- 10 NUM_ITERATIONS <- 10000Then use these global variables, perhaps as default values to functions:
makeGrid <- function(gridSize = GRID_SIZE) { g <- matrix(1:(gridSize^2), nrow=gridSize) return(g) } - Multiple scripts
In larger projects in can be helpful to break up the code into multiple files. For example:
read_clean_data.Rrun_models.Rms_figs.Rrun_all.R
In this example,
run_all.Rcontains "steering code and might just look likesource("./scripts/read_clean_data.R") source("./run_models.R") # etc
13. Debugging
13.1. Debugging approaches
- Debugging
"Debugging is twice as hard as writing the code in the first place. Therefore, if you write the code as cleverly as possible, you are, by definition, not smart enough to debug it."
— Brian W. Kernighan.
- What is a bug?
- Code fails (produces error)
- Code runs but produces the wrong result
- Debugging
Differential diagnosis
- Characterize the error
- Localize the error
- Fix the error (and don't introduce a new bug!)
- Characterizing the bug
- Make the error reproducible
- Can we always get the error when re-running the same code and values?
- If we start the same code in a clean workspace, does the same thing happen?
- Bound the error
- How much can we change the inputs and get the same error? A different error?
- For what inputs (if any) does the bug go away?
- How big is the error?
- Get more information
Add extra output (e.g., number of optimization steps, did the loop converge, final value of optimized function)
- Make the error reproducible
- Localizing the bug
Tools:
- traceback (where did an error message come from?)
- Our own extra debugging statements: print, warning, stopifnot (messages from the code as it goes)
- Trying controlled inputs
- Interactive debugging
- traceback
- When an R function fails, an error is printed to the screen. Immediately after the error, you can call traceback() to see in which function the error occurred.
- The traceback() function prints the list of functions that were called before the error occurred.
- The functions are printed in reverse order.
- traceback example
f <- function(x) { r <- x - g(x) r } g <- function(y) { r <- y * h(y) r } h <- function(z) { r <- log(z) if(r < 10) r^2 else r^3 } f(-1) traceback() - traceback results
Error in if (r < 10) r^2 else r^3 (from #3) : missing value where TRUE/FALSE needed In addition: Warning message: In log(z) : NaNs produced 3: h(y) at #2 2: g(x) at #2 1: f(-1)
- Interactive debugging
traceback()does not always tell you where the error originated. In order to know which line causes the error, you may want to step through the function using debug(). debug()andbrowser()
debug(foo)flags the functionfoo()for debugging.undebug(foo)unflags the function.- When a function is flagged for debugging, each statement in the function is executed one at a time. After a statement is executed, the function suspends and you can interact with the environment.
- After you obtained necessary information from the environment, you can let the function to execute the next statement.
- You can call
browser()from any point in your code to enter the debugger. This is useful if you know about where the bug is and you don't want to start at the beginning of the function.
- debug example
## buggy version: mateVectors <- function(x, y, r) { child <- y usingX <- runif(1) <= 0.5 for (i in x) { if(usingX) { child[i] <- x[i] } if(runif(1) <= r) { !usingX } } return(child) } - Trying the buggy version
set.seed(1001) mateVectors(1:10, 11:20, 1) mateVectors(c("A","B","C"), c("a","b","c"), 0.3)[1] 11 12 13 14 15 16 17 18 19 20 A B C "a" "b" "c" NA NA NA - Let's debug this function
- Useful debug commands
- 'n' next line
- 'c' Continue to end of context (finish loop, or finish function)
- 'where' print a stack trace of all active function calls
- 'Q' quit the debugger
- Lecture notes  B_note
So: with numeric inputs produces kind of ok result but no recombination. With character input it produces NAs
Two errors: wrong `for` idiom and also does not do actual crossover.
- Useful debug commands
- Buggy
swapHalves()
swapHalves <- function(x) { vlen <- length(x) if(vlen %% 2 == 0) { # even case half <- vlen %/% 2 firstHalf <- 1:half secondHalf <- -1 * 1:half result <- c(x[secondHalf], x[firstHalf]) } else { firstHalf <- 1 : vlen %/% 2 midValue <- vlen %/% 2 + 1 secondHalf <- vlen%/%2 + 2 : vlen result <- c(x[secondHalf], x[midValue], x[firstHalf]) } return(result) } - Buggy
swapHalves()test
print("even:") even <- c("a","b","c","d") print("odd:") odd <- c("a","b","c","d","e") swapHalves(even) swapHalves(odd)[1] "even:" [1] "odd:" [1] "c" "d" "a" "b" [1] "d" "e" NA NA "c" "a" "a" "b" "b"
- Debugging using RStudio
RStudio includes a visual debugger. Have you noticed the "environment" window? It shows values for variables in the current environment. When debug is active, this shows the function being debugged.
- RStudio’s error inspector and
traceback() - RStudio’s "Rerun with Debug" tool and
options(error = browser)which open an interactive session where the error occurred. - RStudio’s breakpoints and
browser()which open an interactive session at an arbitrary location in the code.
- RStudio’s error inspector and
14. Data Frames and Factors
14.1. Creation and operations
- Transition point in course
- Thus far:
- Basics of R syntax
- Manipulating vectors
- Overview main R data types
- Functions, algorithms, debugging
- Next:
- More on data frames and factors
- Math operations and how computers store numbers
- String operations and regular expressions
- Dates and times and computers
- The Grammar of Graphics (more ggplot2)
- Data manipulation tidying and analysis using
tidyversetools.
- Thus far:
- Today: more detail on data frames
Most useful data structure for analyses! Has characteristics of lists and matrices.
kids <- c("Jack","Jill") ages <- c(10,12) d <- data.frame(kids, ages) d d[2,2]kids ages 1 Jack 10 2 Jill 12 [1] 12
- Extract columns 3 ways
d[[1]] d$kids d[,1][1] "Jack" "Jill" [1] "Jack" "Jill" [1] "Jack" "Jill"
- Add to data frame
rbind and cbind work as with matrices
d <- rbind(d, list("Alberita", 11)) dkids ages 1 Jack 10 2 Jill 12 3 Alberita 11 - Merging
kids <- c("Jack","Joan","Alberita") states <- c("CA","TX","TX") d2 <- data.frame(kids,states) merge(d, d2)kids ages states 1 Alberita 11 TX 2 Jack 10 CA - Merging (all, all.x, all.y)
d3 <- merge(d, d2, all = TRUE) names(d3) <- c("name","age","state") d3name age state 1 Alberita 11 TX 2 Jack 10 CA 3 Jill 12 <NA> 4 Joan NA TX - Merging (all.x)
d4 <- merge(d, d2, all.x = TRUE) d4kids ages states 1 Alberita 11 TX 2 Jack 10 CA 3 Jill 12 <NA> - Tool:
str()to examine complex objects
str(d4) #str(hist(cars$speed))'data.frame': 3 obs. of 3 variables: $ kids : chr "Alberita" "Jack" "Jill" $ ages : num 11 10 12 $ states: chr "TX" "CA" NA
- Data frames: subset()
Convenience function more concise than using
[. Names assumed to be columns.subset(d3, state=="TX") subset(d3, age > 10)name age state 1 Alberita 11 TX 4 Joan NA TX name age state 1 Alberita 11 TX 3 Jill 12 <NA> - Data frames: add new columns
d3$birth.year <- 2012 - d3$age d3name age state birth.year 1 Alberita 11 TX 2001 2 Jack 10 CA 2002 3 Jill 12 <NA> 2000 4 Joan NA TX NA - tibbles
Faster, more restrive data frames used by tidyverse functions:
tibble::tibble(d3)# A tibble: 4 × 4 name age state birth.year <chr> <dbl> <chr> <dbl> 1 Alberita 11 TX 2001 2 Jack 10 CA 2002 3 Jill 12 <NA> 2000 4 Joan NA TX NA
- More on data frames
14.2. Merging examples
- Data is often stored in separate files
oak_wp <- read.csv("http://r-research-tool.schwilk.org/data/oak-water-potentials-simple.csv") species <- read.csv("http://r-research-tool.schwilk.org/data/oak-species.csv") head(oak_wp, 3)site tag spcode date pd.psi md.psi year 1 GATEN 110 QUEM 10/24/2010 NA NA 10 2 GATEN 111 QUEM 10/27/2010 NA NA 10 3 GATEN 113 QUEM 10/24/2010 NA NA 10
head(species, 3)genus specific.epithet family spcode display.name CM DM 1 Quercus emoryi Fagaceae QUEM Q. emoryi 1 1 2 Quercus gambelii Fagaceae QUGA Q. gambelii 1 1 3 Quercus gravesii Fagaceae QUGR2 Q. gravesii 1 1 GM 1 NA 2 1 3 NA - Plot data, but use real species names
oak_wp <- merge(oak_wp, species) names(oak_wp)[1] "spcode" "site" "tag" [4] "date" "pd.psi" "md.psi" [7] "year" "genus" "specific.epithet" [10] "family" "display.name" "CM" [13] "DM" "GM"
- Plot data with nice display names
library(ggplot2) ggplot(oak_wp, aes(display.name, md.psi)) + geom_boxplot() + xlab("") + ylab("Mid-day water potental (MPa)")
- Merging is also called "joining"
In base R the function is
merge(). Indplyr(tidyverse), there are several separate functions:left_join(),right_join(),full_join(), etc. We will learn about those later.
14.3. Factors
- Grouping variables
# head(iris) class(iris$Species) levels(iris$Species)[1] "factor" [1] "setosa" "versicolor" "virginica"
- Factors
- stored as integers
- but have a class attribute "factor" and an attribute "levels" which defines the allowed values.
shrimp <- c("M", "J", "J", "M", "M") shrimp.factor <- factor(shrimp, levels = c("J", "F", "M")) table(shrimp.factor)shrimp.factor J F M 2 0 3
- Factors vs strings
- We often store factors as strings in a csv file
- read.csv() will convert them to factors but how can it know the allowed levels?
- Leave factors stored as strings alone until a factor is needed (use
stringsAsFactors=FALSEinread.csv())
- Factors vs strings
Factors are stored as integers with a display name
v <- c("3", "2", "10") factor(v) vf <- as.factor(v) vf as.numeric(vf[3]) ## How did "10" become [1] ?[1] 3 2 10 Levels: 10 2 3 [1] 3 2 10 Levels: 10 2 3 [1] 1
- Factors: reassign levels
v <- c("Med", "High", "Low", "Low", "Med") factor(v) vf <- factor(v, levels=c("Low", "Med", "High")) vf[1] Med High Low Low Med Levels: High Low Med [1] Med High Low Low Med Levels: Low Med High
- Factors: rename levels
# vf levels(vf) <- tolower(levels(vf)) # equivalent to # levels(vf) <- c("low", "medium", "high") vf[1] med high low low med Levels: low med high
- split()
split(iris$Sepal.Length,iris$Species)$setosa [1] 5.1 4.9 4.7 4.6 5.0 5.4 4.6 5.0 4.4 4.9 5.4 4.8 4.8 4.3 5.8 5.7 5.4 5.1 5.7 [20] 5.1 5.4 5.1 4.6 5.1 4.8 5.0 5.0 5.2 5.2 4.7 4.8 5.4 5.2 5.5 4.9 5.0 5.5 4.9 [39] 4.4 5.1 5.0 4.5 4.4 5.0 5.1 4.8 5.1 4.6 5.3 5.0 $versicolor [1] 7.0 6.4 6.9 5.5 6.5 5.7 6.3 4.9 6.6 5.2 5.0 5.9 6.0 6.1 5.6 6.7 5.6 5.8 6.2 [20] 5.6 5.9 6.1 6.3 6.1 6.4 6.6 6.8 6.7 6.0 5.7 5.5 5.5 5.8 6.0 5.4 6.0 6.7 6.3 [39] 5.6 5.5 5.5 6.1 5.8 5.0 5.6 5.7 5.7 6.2 5.1 5.7 $virginica [1] 6.3 5.8 7.1 6.3 6.5 7.6 4.9 7.3 6.7 7.2 6.5 6.4 6.8 5.7 5.8 6.4 6.5 7.7 7.7 [20] 6.0 6.9 5.6 7.7 6.3 6.7 7.2 6.2 6.1 6.4 7.2 7.4 7.9 6.4 6.3 6.1 7.7 6.3 6.4 [39] 6.0 6.9 6.7 6.9 5.8 6.8 6.7 6.7 6.3 6.5 6.2 5.9
- Subsetting a dataframe
Shorthand for
[operator usex <- subset(iris, Species == "virginica") head(x)Sepal.Length Sepal.Width Petal.Length Petal.Width 101 6.3 3.3 6.0 2.5 102 5.8 2.7 5.1 1.9 103 7.1 3.0 5.9 2.1 104 6.3 2.9 5.6 1.8 105 6.5 3.0 5.8 2.2 106 7.6 3.0 6.6 2.1 Species 101 virginica 102 virginica 103 virginica 104 virginica 105 virginica 106 virginica - tapply()
Apply by factor level:
tapply(iris$Sepal.Length, iris$Species, sd)setosa versicolor virginica 0.3524897 0.5161711 0.6358796
- tapply() with multiple factors
oak_wp <- read.csv("http://r-research-tool.schwilk.org//data/oak-water-potentials-simple.csv") tapply(oak_wp$md.psi, list(oak_wp$year, oak_wp$spcode), mean, na.rm=TRUE)QUEM QUGA QUGR2 QUGR3 QUHY 10 NaN -1.818000 -1.900000 NaN -1.632500 11 -2.417674 -2.920455 -4.297826 -3.42000 -2.426383 12 -1.666667 -1.780000 -2.440000 -2.51500 -1.236000 13 -1.587308 -1.400000 -1.686364 -1.94359 -1.215000 - aggregate() will return a dataframe
df <- aggregate(oak_wp$md.psi, list(yr=oak_wp$year, spcode=oak_wp$spcode), mean, na.rm=TRUE) head(df)yr spcode x 1 10 QUEM NaN 2 11 QUEM -2.417674 3 12 QUEM -1.666667 4 13 QUEM -1.587308 5 10 QUGA -1.818000 6 11 QUGA -2.920455
- aggregate() with formula notation
df <- aggregate(md.psi ~ year+spcode, data=oak_wp, FUN=mean, na.rm=TRUE) head(df)year spcode md.psi 1 11 QUEM -2.417674 2 12 QUEM -1.666667 3 13 QUEM -1.587308 4 10 QUGA -1.818000 5 11 QUGA -2.920455 6 12 QUGA -1.780000
- Shrimp again: split() and aggregate()
shrimp <- c("M", "F", "J", "J", "F", "F", "M", "F", "F") split(seq_along(shrimp), shrimp)$F [1] 2 5 6 8 9 $J [1] 3 4 $M [1] 1 7
- Frequencies:
aggregate(shrimp, list(sex=shrimp), length)sex x 1 F 5 2 J 2 3 M 2
- Final thoughts
- You will run into all these base R functions like
tapply,aggregate,split. They are useful - Later in this course we will learn some more consistent (
tidyverse) ways to accomplish these sorts of split-apply-recombine idioms.
- You will run into all these base R functions like
15. Math and numbers
15.1. Number systems
- A note on integers vs "floating point" numbers
typeof(2) typeof(as.integer(2)) typeof(c(1,2,3)) typeof(1:3) typeof(c(1L,2L,3L))[1] "double" [1] "integer" [1] "double" [1] "integer" [1] "integer"
- Note  B_note
R has integers and real numbers. To enter integer numeric literals you must follow numeral with an "L". Otherwise R assumes you want a double.
The information that follows is to help you have some idea of how things work under the hood on the computer. You don't need to memorize or fully understand how numbers are represented in order to use R.
- Note  B_note
- Positional number systems
- Simply ways to count things. Ours is the base-10.
- There is no symbol "10" in base-10 and no symbol "2" in base-2, etc
- Each column stands for a power of the base (eg powers of 10)
- Binary: base-2
- A single binary number is called a B inary dig IT, or bit.
- Computers perform operations on binary number groups called "words". Today, most computers use 32-, 64-, or 128-bit words.
- Words are subdivided into 8-bit groups called bytes.
- Hexadecimal: base-16
Dec Hex Binary Dec Hex Binary 0 0 0000 8 8 1000 1 1 0001 9 9 1001 2 2 0010 10 a 1010 3 3 0011 11 b 1011 4 4 0100 12 c 1100 5 5 0101 13 d 1101 6 6 0110 14 e 1110 7 7 0111 15 f 1111 - Hexadecimal example
- You might have seen HTML colors
written as 6 digit hexidecimal (really three 2-digit values)
#CD5C5C means red=205 (CD), green=92 (5C), blue=92 (5C) - Negative numbers
Negative numbers: signed integers
- Takes an extra bit of information because it doubles the size we must be able to store.
- Multiple ways to represent (Use first bit for sign, top half of range, or 2's complement)
- Last way is what is used in computers — but you don't need to worry about representation.
- Integer Overflow
If we have a 1-byte (8 bit) integer in which the first bit stores the sign:
01011101 : 93 + 01101011 : 107 \= 11001000 : -56
15.2. Floating point
- Real numbers on computers: floating point
- IEEE single precision: 32 bits, one word
Simply "float" type in most languages: \(\pm\) 1.YYYYY \(\times 2^{N-127}\)
- assumes X is always 1 (no leading zeros, trick to get one more bit of precision)
- 1 sign bit
- 8 bit biased exponent (n-127)
- 23 bit mantissa (YYYYY)
- "single precision floating point": 32 bits
- Different floating point types
IEEE Standard for Floating-Point Arithmetic (IEEE 754) (1980s):
- Single precision, 32 bits. Called "float" in the C language family. It has a one sign bit, 23 bit mantissa (so 24 bits of precision with the assumed 1) and 8 bit exponent .
- Double precision, called "double" in the C language family, occupies 64 bits (8 bytes) and its significand has a precision of 53 with the assumed 1. Sign is 1 bit, exponent is 11 bits and mantissa is 52 bits.
In R there are only 2 numeric types: integer and numeric (double precision). "Single precision" aka "float" is not used much anywhere nowadays.
- Note  B_note
- Some issues
- Doubles types ("numeric" class) cannot represent all 64-bit integers! But since R uses 32 (now 53) bit integers, double can store more integers than can the integer type. See =?integer
- Rounding error: The only numbers that can be represented exactly in R's numeric type are integers and fractions whose denominator is a power of 2. Other numbers have to be rounded to (typically) 53 binary digits accuracy.
- Some issues
- In R, numeric literals are treated as floating point
class(c(1,2,3))[1] "numeric"
class(1:3) class(c(2L, 3L))[1] "integer" [1] "integer"
- Rounding error
64 bits are not enough to store every possible value on the number line!
1.0 - 0.9 - 0.1[1] -2.775558e-17
- Floating point error
As a result, two floating point numbers will not reliably be equal:
a <- sqrt(2) a * a == 2 a * a - 2[1] FALSE [1] 4.440892e-16
The function
all.equal()allows you to compare numeric values. It basically testsabs(a - b) < epsilon value. - Testing equality of floating point numbers
Compare
all.equal,identicaland==all.equal(a*a, 2) # see documentation for using in if statement identical(a*a, 2.0) all.equal(as.double(8), as.integer(8)) identical(as.double(8), as.integer(8))[1] TRUE [1] FALSE [1] TRUE [1] FALSE
16. Strings
16.1. Text on computers
- How to represent characters?
Simple idea: numeric code for each character. Originated at Bell labs in 1960.
- ASCII table
0–31 used for "unprintable" characters (BEEP! page feed, etc)
- All was good
Assuming you were an English speaker.
- ANSI standard
MS windows type solution to inconsistency in upper set use: define "code pages"
- Same as ASCII for 0-127
- But different "code pages" for upper set.
- Unicode
Basic idea started in 1980s: two bytes. Then expanded to more (up to 3 bytes) Enough to store every character in every language.
- Originally: \(2^{14} = 16,384\) (2 bits reserved)
- Currently: 154,998 code points (characters)
- Every letter maps to a "code point" — a platonic ideal of a letter.
- "A" is different from "a", but "A" in Times New Roman is the same as "A" in Arial
- Notation is "U+" followed by at least four hexadecimal digits. "Hello" =
U+0048 U+0065 U+006C U+006C U+006F - Since 2010: many emoji code points
- Overseen by Unicode Consortium (part of United Nations).
- Encodings
Still the issue of exactly how to store the numbers:
There were multiple standards. Now:
- UTF-8
Idea: use one byte for code points 0-127, and only use two for points above that. UTF-8 is the de facto standard encoding for unicode in US.
You can read in utf-8 encoded text in R. In fact utf-8 is probably the current default ("locale"):
Sys.getlocale()[1] "LC_CTYPE=en_US.UTF-8;LC_NUMERIC=C;LC_TIME=en_US.UTF-8;LC_COLLATE=en_US.UTF-8;LC_MONETARY=en_US.UTF-8;LC_MESSAGES=en_US.UTF-8;LC_PAPER=en_US.UTF-8;LC_NAME=C;LC_ADDRESS=C;LC_TELEPHONE=C;LC_MEASUREMENT=en_US.UTF-8;LC_IDENTIFICATION=C"
- Lecture notes  B_note
Unicode adoption into email systems has been slow and you might see hiccups since utf8 is not universal.
Most fonts only support a small set of unicode needed for a particular script/language.
- UTF-8
- Unicode strings
a <- c("\U00B5", "µ") # note that "\" is escape char in R strings a[1] "µ" "µ"
- You can use unicode (utf-8) in R scripts
But may be awkward to enter depending on keyboard and text editor
µ <- 0.0001 x <-1000 2*x*µ[1] 0.2
- Note  B_note
pi symbol ("\U03C0") won't show up in my lecture slides because of my export system. So I illustrate with mu
16.2. Strings
- Character objects (strings)
A "character" object is used to represent string values in R. We convert objects into character values with the
as.character()function:x = as.character(pi) x typeof(x)[1] "3.14159265358979" [1] "character"
- Formatting strings
using paste():
first <- "Dylan" last <- "Schwilk" n <- paste(first, last) n paste(last, first, sep=", ")[1] "Dylan Schwilk" [1] "Schwilk, Dylan"
- Formatting strings
using sprintf():
s <- sprintf("%s has %d dollars", n, 50) s sprintf("%s has $%2.2f", n, 50)[1] "Dylan Schwilk has 50 dollars" [1] "Dylan Schwilk has $50.00"
- Substrings
substr()andsubstr<-()substr(s, 7, 13)[1] "Schwilk"
Assignment:
substr(s, 7, 13) <- "SCHWILK" s substr(s, 7, 13) <- "???" s[1] "Dylan SCHWILK has 50 dollars" [1] "Dylan ???WILK has 50 dollars"
- Substitution
sub("???WILK", "S.", s, fixed=TRUE)[1] "Dylan S. has 50 dollars"
16.3. String operations
- Converting strings
tolower(s) toupper(s)[1] "dylan ???wilk has 50 dollars" [1] "DYLAN ???WILK HAS 50 DOLLARS"
- Built-in string functions
grep()- Searches for a substring, like the Linux command of the same name
nchar()- Finds the length of a string
paste()- Assembles a string from parts
sprintf()- Also assembles a string from parts
substr()- Extracts a substring
strsplit()- Splits a string into substrings
- Example: testing for a file suffix
# tests whether the file name `fname` has the suffix `suffix`, # e.g. "abc" in "x.abc" hasSuffix <- function(fname, suffix) { ncf <- nchar(fname) # nchar() gives the string length dotpos <- ncf - nchar(suffix) + 1 # dot would start here if there is one # now check that stuff is at the end of the name return(substr(fname, dotpos, ncf) == suffix) } hasSuffix("file.csv", "csv") hasSuffix("file.csv", "txt") hasSuffix("file.textinfo", "textinfo")[1] TRUE [1] FALSE [1] TRUE
- grep() and grepl()
Takes its name from the linux/unix command.
v <- c("beach", "beech", "back", "buster", "abea") grep("bust", v) v[grep("bust", v)][1] 4 [1] "buster"
grepl("bust", v)[1] FALSE FALSE FALSE TRUE FALSE
16.4. Regular expressions
- Regular Expressions
Sometimes we need to look for patterns, not specific strings
- "regex"
- Come from regular set theory (1950s)
- Formal way of specifying a set string matches
- Really are a whole mini computer language
- Used in many command line tools (eg grep), and implemented as at least a library in most computer languages
- Most text editors and word processors allow regex search and replace
- "regex"
- Regex basics
- Single character wildcard: "."
- Single match from multiple options:
[a-z]or[acdf-h]. Matches one of the characters in brackets or implied by range (hyphen is non-literal here). [a-zA-Z]should match any ASCII alphabetic letter- To match all EXCEPT a set start the set with a
^:[^,]matches any character except a comma.
- Examples
v <- c("A27", "B30", "c56", "123", "33") grep(".[0-9]", v)[1] 1 2 3 4 5
grep(".[0-9][0-9]", v) grep("[A-Z][0-9][0-9]", v) grep("[^0-9][0-9][0-9]", v)[1] 1 2 3 4 [1] 1 2 [1] 1 2 3
- Positional matching
^matches beginning of string (note different meaning than when in brackets)$matches end of string\bmatches word boundary (empty string where non-word char followed by word char)
- Positional matching example
x <- unlist(strsplit("Out in the west Texas town of El Paso", split=" ")) # capitalized words: grep("^[A-Z]", x)[1] 1 5 8 9
- Matching word boundaries
v <- c("dylan Schwilk", "Dylan schwilk", "dylanSchwilk", "dylan.SCHWILK") grep("\\b[A-Z]", v)[1] 1 2 4
- Matching end of string
v <- c("Dylan Schwilk", "dylan schwilk", "Schwilk, Dylan") grep("[Ss]chwilk$", v)[1] 1 2
- Matching multiples
Modify preceding character or expression
- "*" Matches any number of preceding expression (zero or more!)
- "?" matches one or none
- "+" matches one or more
{n}matches n times- ={n,} matches n or more times
- ={n,m} matches n to m times
- Multiples examples
x <- c("999xyz", "000", "A123", "a123", "1way", "WAY1", "a") grep("^[0-9]*[a-z]", x) grep("^[0-9]{3}", x) grep("[0-9]{3}$", x) grep("^[0-5]+", x)[1] 1 4 5 7 [1] 1 2 [1] 2 3 4 [1] 2 5
- Grouping
- Use parentheses to group
v <- c("aba", "abba", "aabba", "accc", "ababa") grep("(ab)+a", v)[1] 1 5
- Boolean OR
v <- c("beach", "beech", "back", "buster", "abea") grep("be(e|a)ch", v, value=TRUE) grep("b(e+|a)c[hk]", v, value=TRUE)[1] "beach" "beech" [1] "beech" "back"
- Referring to groups (back referencing)
Find cases of thrice repeated numerals
x <- c("999xyz", "000", "A123", "a123", "1way", "aaa", "a21113") grep("([0-9])\\1\\1", x)[1] 1 2 7
- Use parentheses to group
- Regular expressions in R
- R string functions (and functions in the
stringrpackage) can handle regular expressions. Use fixed=TRUE to avoid. - R includes some predefined character classes (eg
[:digit:],[:punct:]). See?regex - In regexes, one must escape special characters with a backslash. But keep in mind that R will first interpret the string before it hands it to the regex engine. So to match a literal "?" use
\\?
- R string functions (and functions in the
- Escaping example
v <- c("The first", "firstofall", "first time", "Who's on first?", "bfirstb") grep("first", v) grep("^first\\b", v) grep("\\bfirst\\?", v)[1] 1 2 3 4 5 [1] 3 [1] 4
- R strings
x <- "\\b" ; sprintf("x has %d chars", nchar(x)) x <- "\n" ; sprintf("x has %d chars", nchar(x)) x <- "\U00B5" ; sprintf("x has %d chars", nchar(x)) paste(x)[1] "x has 2 chars" [1] "x has 1 chars" [1] "x has 1 chars" [1] "µ"
- Text searches
library(gutenbergr) oos <- gutenberg_download(1228) # origin of species subset(oos, grepl("Galapagos (Islands?|Archipelago).+bird", text))#+beginexample:
gutenberg_id text <int> <chr> 1228 in the Galapagos Islands nearly every land-bird, but only two ou… 1228 Galapagos Islands reptiles, and in New Zealand gigantic wingless… 1228 genera. In the Galapagos Archipelago, many even of the birds, th…#+endexample
16.5. stringr Package
- Why the need?
- R string functions are inconsistent
Remedy:
stringrpackagelibrary(stringr) s <- c("(806) 777-7777", "(401) 555-5555", "(806) 444-4444") number_pattern <- "\\(([0-9]{3})\\) ([0-9]{3}-[0-9]{4})" str_match(s, number_pattern )[,1] [,2] [,3] [1,] "(806) 777-7777" "806" "777-7777" [2,] "(401) 555-5555" "401" "555-5555" [3,] "(806) 444-4444" "806" "444-4444"
- R string functions are inconsistent
- stringr and stringi
stringris built on top of a fast and efficient set of functions in thestringipackage.stringiprovides tools to do anything we could ever want to do with strings, whereasstringrprovides tools to do the most common 95% of operations.library(stringi) ls("package:stringi") # vs ls("package:stringr") - Basic string operations with stringr
str_c()- like
paste, but it uses the empty string as the default separator and silently removes zero length arguments str_length()- is equivalent to
nchar, but it preserves NAs and converts factors to characters (not integers) str_sub()- is equivalent to
substr()but it returns a zero length vector if any of its inputs are zero length, and otherwise expands each argument to match the longest. It also accepts negative positions, which are calculated from the end of the string. The end position defaults to -1, which corresponds to the last character. str_sub<-- is equivalent to
substr<-, but likestr_subit understands negative indices, and replacement strings not do need to be the same length as the string they are replacing str_trim- Trim leading and trailing whitespace
str_c(),str_length()
str_c("Hello", " ", "world", "!") v <- c("A", "B", "C") str_c(v,v) str_length(c("Hello", " ", "world", "!"))[1] "Hello world!" [1] "AA" "BB" "CC" [1] 5 1 5 1
str_length()also automatically converts factors to strings
# nchar(factor(v)) # produces error str_length(factor(v))[1] 1 1 1
str_sub()
s <- "What do you do with a drunken sailor?" str_sub(s, -str_length("sailor?"), -1 ) <- "student?" s[1] "What do you do with a drunken student?"
str_sort,str_order
Provides option more natural-language friendly sorting than
sort.s <- c("100a10", "100a5", "2b", "2a") str_sort(s, numeric=TRUE)[1] "2a" "2b" "100a5" "100a10"
str_detect(),str_which()
Instead of grepl() and grep()
s <- c("100a10", "100a5", "2b", "2a") str_which(s, "[a-b]$")[1] 3 4
- And more
17. Math II
17.1. Sorting
sortandorder
- Use
order()to sort dataframes
d1 <- data.frame(sites = c("MDS", "BGN", "MDN", "LCN"), elevation = c(1900, 2300, 1950, 1750)) d1sites elevation 1 MDS 1900 2 BGN 2300 3 MDN 1950 4 LCN 1750
- Use
order()to sort dataframes
y <- order(d1$elevation) d1[y, ]sites elevation 4 LCN 1750 1 MDS 1900 3 MDN 1950 2 BGN 2300
- Sets in R
A set is a collection of distinct (unique) objects. In R we simply use a vector.
numbers <- sample(1:10, 20, replace=TRUE) numbers myset <- unique(numbers) myset[1] 10 4 3 1 2 8 2 6 2 7 6 5 2 4 2 6 6 1 8 [20] 6 [1] 10 4 3 1 2 8 6 7 5
- Set operations
union(x,y)- Union of sets x and y
intersect(x,y)- Intersection of sets x and y
setdiff(x,y)- All element in x that are not in y
setequal(x,y)- Test for set equality
c %in% y- Test for membership in set, returns boolean vector
choose(n,k)- Number of possible subsets of size k chosen from set of size n
combn(x,n)- Find all combinations of size m in x
- We can add some functions
- Calculate the symmetric difference of two sets
- Define a subset operator
17.2. Math functions
- Built-in functions
- exp()
- Exponential function, base e
- log()
- Natural logarithm
- log10()
- Logarithm base 10
- sqrt()
- Square root
- abs()
- Absolute value
- sin()
- Trig functions:
sin(),cos(),tan(),atan(),atan2() - min(), max()
- Min and max within a vector
- pmin(), pmax
- Element-wise maxima and minima of several vectors
- sum(), prod()
- Sum and product of elements of vector
- cumsum, cumprod()
- Cumulative sum or product of all elements in a vector
- round()
- Rounding to integers. See
floor(),ceiling() - factorial()
- Factorial function
- Minimization and Maximization
nlm()andoptim()both find minima numericallyf <- function(x) { return(x^2 - sin(x) ) } d <- data.frame(x = seq(-10, 10, 0.01)) d$y <- f(d$x)library(ggplot2) ggplot(d, aes(x, y)) + geom_point() - Plot of x,y
- Using nlm() (Newton algorithm only)
nlm(f, 5)$minimum [1] -0.2324656 $estimate [1] 0.4501833 $gradient [1] 4.501954e-07 $code [1] 1 $iterations [1] 6
- Using optim()
optim(5, f, method = "Brent", lower = -100, upper = 100)$par [1] 0.4501836 $value [1] -0.2324656 $counts function gradient NA NA $convergence [1] 0 $message NULL - Symbolic math: differentiation
There are packages to do symbolic math (and other open source alternatives eg
sage) but base R hasD()D(quote( x^2 + 2*x ), "x")2 * x + 2
D(quote( exp(x^2) ), "x")exp(x^2) * (2 * x)
D(quote(log(2*x) + x^2), "x")2/(2 * x) + 2 * x
- Expressions:
quote()andexpression()
quotereturns its argument as an unevaluated expression,expressionreturns a list of unevaluated expressions.e1 <- quote(sin(x+y)) e2 <- expression(sin(x+y)) str(e1) str(e2) all.equal(e1, e2) all.equal(e1, e2[[1]])language sin(x + y) expression(sin(x + y)) [1] "Modes of target, current: call, expression" [2] "target, current do not match when deparsed" [1] TRUE
- Note  B_note
You can learn more about expressions here: http://adv-r.had.co.nz/Expressions.html
- Note  B_note
- D() continued
# first argument treated as an expression curve(x^2 + 2*x ,0,10)
- D() continued II
curve(eval(D(expression( x^2 + 2*x ), "x") ), 0, 10)
- Simple calculus: numeric integration
integrate()f <- function(x) x^2 integrate(f,0,1)0.3333333 with absolute error < 3.7e-15
17.3. Statistical distributions
- Distribution functions
Distribution pdf cdf quantiles rand. numbers Uniform dunif() punif() qunif() runif() Normal dnorm() pnorm() qnorm() rnorm() Chi square dchisq() pchisq() qchisq() rchisq() Binomial dbinom() pbinom() qbinom() rbinom() Poisson dpois() ppois() qpois() rpois() Student's T dt() pt() qt() rt() F dist. df() pf() qf() rf() and more!
For continuous distributions, the most useful functions are the "p" (cumulative density) and "q" (quantile = inverse cdf) functions.
- Using distributions
\(95^{th}\) percentile of the Chi square distribution with 2 degrees of freedom:
qchisq(0.95,df=2)[1] 5.991465
\(5^{th}\) and \(95^{th}\) percentiles of the normal distribution with mean zero and sd of one:
qnorm( c(0.05, 0.95) )[1] -1.644854 1.644854
- Example: Binomial Distribution
Probability of exactly 2 heads out of 12 coin flips:
dbinom(2, size=12, prob=0.5)[1] 0.01611328
Should be same as probability of 10 heads:
dbinom(10, size=12, prob=0.5)[1] 0.01611328
Probability of 1-10 heads:
pbinom(10, size=12, prob=0.5)[1] 0.9968262
17.4. Randomization and simulation
- Random variates
Built-in random number generators for many statistical distributions.
- Expectation
find \( E[max(X,Y)] \) when X,Y are normal random variables mean=0 sd=1
sum <- 0 nreps <- 100000 for (i in 1:nreps) { xy <- rnorm(2) sum <- sum + max(xy) } print(sum/nreps)[1] 0.5648268
# or store in vector: v <- replicate(nreps, max(rnorm(2))) hist(v) - Plot the results
- Setting the random number seed
Important for debugging!
set.seed(100) mean(rnorm(10000)) mean(rnorm(10000)) set.seed(100) mean(rnorm(10000))[1] 0.00387737 [1] -0.01183919 [1] 0.00387737
17.5. Linear algebra
- Linear Algebra
Matrix multiplication:
a <- matrix(c(1,3,2,4), ncol=2) b <- matrix(c(1,0,-1,1), ncol=2) a %*% b[,1] [,2] [1,] 1 1 [2,] 3 1 - Some linear algebra functions
crossprod()- Inner product (dot product) of two vectors. Not the real crossproduct.
t()- Matrix transpose
qr()- QR decomposition
chol()- Cholesky decomposition
det()- Determinant
eigen()- Eigenvalues / eigenvectors
diag()- Extracts the diagonal of a square matrix (useful for obtaining variances from a covariances matrix and for constructing a diagonal matrix)
sweep()- Numerical analysis sweep operations (a more capable and complicated "apply" type function)
18. The Grammar of Graphics and ggplot2 I
18.1. Grammar of graphics
- Grammer of Graphics
The Grammar of Graphics. Leland Wilkinson, 1999, 2005. Springer.
- What is a "plot"?
A plot is
- A data set and a set of mappings from variables to aesthetics
- One or more layers
- One scale for each aesthetic mapping
- A coordinate system (we won't worry about this and just use cartesian coordinates!)
- The faceting specification (how many subplots)
- Example scatterplot
A simple dataset. Plot A vs C with symbol shape determined by categorical variable, D
A B C D 2 3 4 a 1 2 1 a 4 5 15 b 9 10 80 b - Map variables to aesthetic space
- Next steps
- Grammar
We can produce grammatically correct plots that are nonsensical, just as in English we can produce grammatically correct nonsense
- Nonsense plot
library(ggplot2) ggplot(mpg, aes(displ, hwy)) + geom_line()
- Resources for ggplot2
- Others
- Cookbook for common graphics, http://www.cookbook-r.com/Graphs/
- stackoverflow, http://stackoverflow.com/tags/ggplot2
- ggplot2 mailing list, http://groups.google.com/group/ggplot2
- Others
18.2. Components of a plot
- Components of a plot
- Scales, facets, coordinate system
- We need a scale for each aesthetic in the plot. A scale operates across all data in one plot.
- A scale is a function and its inverse, along with some parameters
- Example: a color gradient scale
- The inverse gives us the axes (for position scales) or the legend (for other scales such as colors or shapes)
- You've seen facets and we will stick with cartesian coordinates for now.
- We need a scale for each aesthetic in the plot. A scale operates across all data in one plot.
- Layers
composed of five parts:
- data
- aesthetic mapping (what is x, y, color, shape, etc).
aes()command - statistical transformation (
stat) - geometric object (
geom) (geoms have default stats and vice versa) - a position adjustment
18.3. Aesthetics
- Aesthetic mappings
Refer to variables in the dataset only! So, back to our oak water potential data:
oaks <- read.csv("http://r-research-tool.schwilk.org//data/oak-water-potentials-simple.csv") oaks <- oaks[complete.cases(oaks) & oaks$pd.psi < 0,] a <- aes(x = pd.psi, y = md.psi, color = spcode) p <- ggplot(oaks, a) + geom_point() p - Result:
- Mapping vs setting
You can set certain aesthetics to fixed values rather than mapping to variables
p <- ggplot(oaks, aes(pd.psi, md.psi)) # set point color to dark blue p + geom_point(color = "darkblue", size=2.5, alpha=0.8) - Result
- But don't map when you mean to set:
# map new unamed variable with value "darkblue" to color p + geom_point(aes(color="darkblue")) - Result
- A more useful mapping
Don't worry about lubdridate::mdy() for now. That converted the date strings into continuous values on a timeline.
# more useful: p + geom_point(aes(color=lubridate::mdy(date))) - Result
- Individual vs collective geoms
- Individual: Graphical object for each row (eg points)
- Collective: statistical summary: polygon geom
- Lines and smoothed paths somewhere in between
- Grouping
Lets look at each individual tree pre-dawn psi over time:
p <- ggplot(oaks, aes(lubridate::mdy(date), pd.psi, group=tag)) + geom_line() p - Result:
- Grouping
We'll use the Oxboys data set in nlme
library(nlme) # ?Oxboys # head(Oxboys) # class(Oxboys) p <- ggplot(Oxboys, aes(age, height, group = Subject)) + geom_line() p - Result:
- Different groups for different layers
# try adding lm line for all data p + geom_smooth(method="lm", se=FALSE) # same as: p + geom_smooth(aes(group = Subject), method="lm", se=FALSE) # what we want: p + geom_smooth(aes(group = NULL), method="lm", se=FALSE) - Default groups on categorical variables
p <- ggplot(Oxboys, aes(Occasion, height)) + geom_boxplot() p + geom_line(aes(group = Subject), color = "#3366FF") - Result:
- Violin plot:
ggplot(mpg, aes(class,hwy)) + geom_violin()
- Tables of geoms and stats (layer functions)
- ggplot2 cheatsheet: https://raw.githubusercontent.com/rstudio/cheatsheets/main/data-visualization.pdf
- ggplot2 reference: https://ggplot2.tidyverse.org/reference/
- Stats
Default stat for histogram geom is bin
ggplot(diamonds, aes(carat)) + geom_histogram(aes(y= ..density..), binwidth=0.1) ## with counts ggplot(diamonds, aes(carat)) + geom_histogram(aes(y= ..count..), binwidth=0.1) - Position adjustment
Adjustment Description dodge Adjust position by dodging overlaps on sides fill Stack overlapping objects and standardize to equal height identity Don't adjust position jitter Jitter points to avoid overlap stack Stack overlapping objects on top of one - Position examples
# stack p <- ggplot(diamonds, aes(carat)) p + stat_bin(aes(y = ..count.., fill = cut), binwidth=0.2, position = "stack") # fill p + stat_bin(aes(y = ..count.., fill = cut), binwidth=0.2, position = "fill") # dodge p + stat_bin(aes(y = ..count.., fill = cut), binwidth=0.2, position = "dodge") # Identity (use line not area) p + stat_bin(aes(y =..count.., color = cut), geom = "line", binwidth=0.2, position = "identity") - Writing your own layer functions
stat_sum_single <- function(fun, ...) { stat_summary(fun = fun, colour="gray15", geom="point", size = 4, ...) } se_bar <- function(){ stat_summary(fun.data = mean_se, geom = "errorbar", width = 0.15) } p <- ggplot(iris, aes(Species, Petal.Length)) + se_bar() + stat_sum_single(mean) p # p + stat_sum_single(median, size = 3, shape = 5) - Result:
19. The Grammar of Graphics and ggplot2 II
19.1. Scales
- Scales
- Control how data is mapped to perceptual properties, and produce guides (axes and legends) which allow us to read the plot.
- Important parameters: name, breaks & labels, limits.
- Naming scheme:
scale_aesthetic_name. All default scales have name continuous or discrete.
- Scales
Position scales, color scales, manual scales, identity scale
- name
- text string, mathematical expressions (see
?plotmath). Uselabs(),ylab(),xlab() - limits
- Domain of the scale, Continuous scales take a vector of length 2, eg
c(4,10). - breaks
- sets the values that appear on the axis or the legend (tick marks).
- Scales examples
p <- ggplot(mpg, aes(cty, hwy, color = displ)) + geom_point() p p + scale_x_continuous("City mpg") p + xlab("City mpg") p + ylab("Highway mpg") p + labs(x="City mpg", y = expression(paste("Highway ",frac(miles,gal))), color= "Displacement (l)" ) - Result
- Scale transformations
"transformers":
- Manage the transformation and the inverse for legends, values
- Common ones: log, log10, logit, sqrt, exp, reverse
- shorthand for common ones:
scale_x_log10()is equivalent toscale_x_continuous(trans = "log10")
- More scales examples
library(scales) # for math_format() library(stringr) # for str_c p <- ggplot(diamonds, aes(carat, price)) + geom_point() p + scale_x_log10() + scale_y_log10() log10breaks <- seq(-0.8,0.8,.2) p + scale_x_log10(breaks = 10^(log10breaks), labels = parse( text=(str_c("10^",log10breaks)))) + scale_y_log10() # natural log p + scale_x_continuous(trans = "log", breaks = trans_breaks("log", "exp"), labels = trans_format("log", math_format(e^.x))) - Result
19.2. Themes
- ggplot Themes
p + theme_bw() theme_nogrid <- theme_bw() + theme( axis.title = element_text(size = 16), axis.text = element_text(size = 13), panel.grid.major = element_blank(), panel.grid.minor = element_blank(), panel.background = element_blank() ) p + theme_nogrid + xlab("Carat") + scale_y_continuous("Price ($)", labels = comma) # ?theme - Result:
- Some more resources for ggplot tweaking
- ggplot2 themes: https://ggplot2.tidyverse.org/reference/theme.html
19.3. Maps
- Maps in ggplot
Let's grab some public map data on Texas county boundaries
library(maps) library(RColorBrewer) county_df <- ggplot2::map_data('county') tx.counties <- subset(county_df, region == "texas") names(tx.counties)[6] <- "county" - The TX county data
head(tx.counties)long lat group order region county 74520 -95.75271 31.53560 2492 74520 texas anderson 74521 -95.76989 31.55852 2492 74521 texas anderson 74522 -95.76416 31.58143 2492 74522 texas anderson 74523 -95.72979 31.58143 2492 74523 texas anderson 74524 -95.74698 31.61008 2492 74524 texas anderson 74525 -95.72405 31.63873 2492 74525 texas anderson - The path geom
tx.map <- ggplot(tx.counties, aes(x=long, y=lat, group=county)) tx.map + geom_path(color="gray")
- Grab population data from texas.gov
Originally at https://demographics.texas.gov/
tx.pop <- read.csv("https://r-research-tool.schwilk.org/data/2023_txpopest_county.csv") tx.pop$county <- tolower(tx.pop$county) tx.pop <- merge(tx.counties, tx.pop) head(tx.pop,3)county long lat group order region FIPS 1 anderson -95.75271 31.53560 2492 74520 texas 1 2 anderson -95.76989 31.55852 2492 74521 texas 1 3 anderson -95.76416 31.58143 2492 74522 texas 1 census_2020_count july1_2023_pop_est jan1_2024_pop_est 1 57922 57501 57465 2 57922 57501 57465 3 57922 57501 57465 num_chg_20_23 num_chg_20_24 pct_chg_20_23 pct_chg_20_24 1 -421 -457 -0.7 -0.8 2 -421 -457 -0.7 -0.8 3 -421 -457 -0.7 -0.8 - TX population map
tx.map <- ggplot(tx.pop, aes(x=long, y=lat, group=county)) tx.map <- tx.map + geom_path(color="gray") tx.map <- tx.map + geom_polygon(aes(fill=log10(jan1_2024_pop_est))) tx.map - Result:
- Add a background map
With Open Street Map data via stadia server (ggmap library). You will need an API key
library(ggmap) bb <- c(min(tx.pop$long), min(tx.pop$lat), max(tx.pop$long), max(tx.pop$lat)) stm.tx <- get_stadiamap(bbox=bb, zoom=6) tx.map2 <- ggmap(stm.tx) tx.map2 <- tx.map2 + geom_path(aes(x=long, y=lat, group = county), color="gray", data=tx.pop) tx.map2 <- tx.map2 + geom_polygon(aes(x=long, y=lat, group = county, fill =jan1_2024_pop_est), alpha = 0.80, data=tx.pop) tx.map2 + scale_fill_gradientn("Population", trans="log10", colours=brewer.pal(5,"OrRd"), labels = scales::number_format(accuracy = 1)) - Result with openstreetmap background
- Other mapping resources
- You can add spatial objects (sf package) directly to plots with the
geom_sffunction. See https://r-spatial.org/r/2018/10/25/ggplot2-sf.html ggspatialadds additional support for map elements toggplot2- See https://sesync-ci.github.io/blog/mapping-with-Mapbox.html for example of using satellite images in custom maps
- You can add spatial objects (sf package) directly to plots with the
20. Function exercises
20.1. Functions
- Exercises 1
Create a function that given two strings (two character vectors each of length one), check if one is an anagram of another. Look at ~sort~() function and think about how to use it in this case (hint: you need to split up the string into characters using `strsplit`).
- Anagram finder
Now what about producing an anagram finder using a corpus of words?
library(qdapDictionaries) wordlist <- GradyAugmented length(wordlist) wordlist[1:5][1] 122806 [1] "cyber" "ceasefire" "it'll" [4] "billionaires" "wheelhouse"
- Approach
- Create a lookup table with sorted words as keys and a list of all anagrams of that sorted (canonical word) as the values.
- Then write a function that takes a word and returns all anagrams of that word by sorting it, then subscripting into the lookup table.
- First: a function to sort a single string
sortstring1 <- function(s) { return(paste(sort(unlist(strsplit(s, ""))), collapse = "")) } # better: sortstring <- function(s) { s <- stri_split_boundaries(s, type = "character") s <- lapply(s, stri_sort) s <- lapply(s, stri_join, collapse = "") unlist(s) } sortstring("goat")[1] "agot"
- Second: let's use a smaller dictionary to save time here
You will see why if you try my first solution on the long list!
shortwordlist <- DICTIONARY[-(1:107),1] # ~20,000 words length(shortwordlist) shortwordlist[1:5][1] 20030 [1] "aback" "abacus" "abaft" "abalone" "abandon"
- Make the lookup table
lookup_table <- list() # list of lists for (word in shortwordlist) { # 1. sort the word and call this the key # 2. Check if this key is already an item name in lookup_table: # (!is.null(lookup_table[sorted_word])) # 3. If key does not exist add make a new list with word as only item and # append it to lookup_table with name 'key' # 4. If key exists, append current word to lookup_table[key] } # Then write function that takes word and returns the anagrams by using that # global lookup table - Test
getAnagrams("goat") getAnagrams("peels") # our wordlist is too small, no plurals[1] "goat" "toga" [1] "sleep"
- Trick for speed
Using a list as a lookup table with tens of thousands of keys will be slow. It would be nice if R had a built in `dictionary` or `lookup table` class. But it does in a way: We can use an R environment as a hash map (lookup table). This is faster for large lookup tables than are lists when order is not important.
lookup <- new.env(hash=TRUE, parent=emptyenv()) lookup[["agot"]] <- list("goat","toga") lookup[["agot"]] - Test
getAnagrams("goat") getAnagrams("estrngi") getAnagrams("peels")[1] "goat" "toga" [1] "resting" "stinger" [1] "peels" "peles" "sleep" "speel"
21. The Tidyverse
21.1. The "tidyverse"
- R has a history
- And that means some aspects have been proven effective, and some not. For example read.csv at one time has bad default
stringsAsFactors=TRUE, matrix operations still havedrop=TRUE) - Hadley Wickham and now a large group of others have worked on a set of packages that provide a consistent interface. This approach is heavily pushed by Posit (the RStudio folks).
- You can even load all of these at once with the meta-package:
library(tidyverse) - However, I recommend only loading what you need. Not all of these packages require committing to the "tidyverse" way of doing things.
- And that means some aspects have been proven effective, and some not. For example read.csv at one time has bad default
- Tidyverse good and bad (well, tricky)
- Good: provides consistent, high-level appraoches to solving common data tidying and shaping tasks
- Tricky: depends heavily on non-standard evaluation and messing with environments. Blurs the line between strings and object names even more than does base R.
readr
Fixes some of the bad defaults in
read.table,read.csv, etc.dplyr
- a grammar of data manipulation
- summarizing by group
- transforming by row or by group
- filtering out specific records
tidyr
- For obtaining tidy data (needed before steps above)
- wide formats to long
- long formats to wide
tibbles
- tibbles are just data frames with nicer, more limited, and consistent behavior.
- You don't need to explicitly load this package, it is loaded when you call library() on any other tidyverse package.
21.2. Reading data using readr
readr
library(readr) fish <- read_csv("http://r-research-tool.schwilk.org/data/fish-data.csv")head(fish)# A tibble: 6 × 3 aquarium nutrient color.score <chr> <chr> <dbl> 1 a low 5.3 2 a low 4.7 3 a low 5 4 a low 4.3 5 a low 4.8 6 b low 4.8
- Column names
readrmakes a tibble and does not automatically convert namesbad_names <- read_csv("http://r-research-tool.schwilk.org/data/bad_names.csv", show_col_types=FALSE)bad_names# A tibble: 4 × 3 `sample id` `1st mass` `2nd mass` <chr> <dbl> <dbl> 1 a 100 200 2 b 100. 155. 3 c 99 106 4 d 98.8 180.
- Referring to non-syntactic column names
bad_names$`2nd mass`[1] 200.0 155.3 106.0 180.3
- Compare with
read.csv
read.csvreplaces spaces with "." and adds "X" before numbers to make syntactically allowed R namesbad_names2 <- read.csv("http://r-research-tool.schwilk.org/data/bad_names.csv") head(bad_names2)sample.id X1st.mass X2nd.mass 1 a 100.0 200.0 2 b 100.5 155.3 3 c 99.0 106.0 4 d 98.8 180.3
- Other formats
- tab delimited:
read_tsv - other delims:
read_delim - fixed width file:
read_fwf
- tab delimited:
- Fixed width files (base R)
readsurvey <- function(filen) { d <- read.fwf(filen, # col_positions = (c(1, 7, 9, 16, 24, 32, 33, 50:100)), widths = c(6, -1, 1, -1, 5, -2, 1, -7, 8, 1, -17, rep(1,50) ), na.strings = c(" ", "*", "")) names(d) <- c("V1", "V2", "V3", "V4", "rnumber", "version", paste("Q", as.character(seq(1:50)), sep="" ) ) d$rnumber <- paste("R", d$rnumber, sep="") return(d) } grades <- readsurvey("http://r-research-tool.schwilk.org/data/fixed_width.txt") grades[1:3, 1:10]V1 V2 V3 V4 rnumber version Q1 Q2 Q3 Q4 1 703001 2 11 1 R11111111 1 3 2 1 3 2 703001 4 21 1 R12345678 2 1 3 3 3 3 703001 6 31 1 R98765432 1 3 2 1 3 - Using
readr:read_fwf()
See documentation. Similar arguments to
read.fwfwith some ability to automatically guess where fields begin and end.
21.3. Tidyverse idioms
- Phylosophy
- High level tools for data science.
- Move away from worrying about details of subsetting, loops, etc when a common pattern is needed such as "split-apply-recombine"
- Less focus on stability than in base R. The Tidyverse changes.
- Everything operates on tibbles (data frames)
- Generally, all functions take a tibble as the first argument and return a tibble as well
- So one can chain together operations
- An introduction to the pipe
Defined in magrittr package (required by dplyr and re-exported). It is a bit like the
|pipe in the unix command line. It simply lets the left hand object replace the first argument of the right hand function.library(dplyr)mtcars %>% filter(mpg > 30) %>% select(-qsec, -gear, -carb) # instead of filter(mtcars, mpg > 30)mpg cyl disp hp drat wt vs am Fiat 128 32.4 4 78.7 66 4.08 2.200 1 1 Honda Civic 30.4 4 75.7 52 4.93 1.615 1 1 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 1 1 Lotus Europa 30.4 4 95.1 113 3.77 1.513 1 1 - As of R 4.1 there is a native pipe
mtcars |> filter(mpg > 30) |> select(-qsec, -gear, -carb)mpg cyl disp hp drat wt vs am Fiat 128 32.4 4 78.7 66 4.08 2.200 1 1 Honda Civic 30.4 4 75.7 52 4.93 1.615 1 1 Toyota Corolla 33.9 4 71.1 65 4.22 1.835 1 1 Lotus Europa 30.4 4 95.1 113 3.77 1.513 1 1 - Also in 4.1: A lambda syntax
myvec <- 1:10 # instead of sapply(myvec, function(x) x+2) sapply(myvec, \(x) x+2) # useful for reordering arguments for use with the native pipe: paste("Dr. Schwilk", "has $50.") |> {\(x) gsub("50", "2", x)}()[1] 3 4 5 6 7 8 9 10 11 12 [1] 3 4 5 6 7 8 9 10 11 12 [1] "Dr. Schwilk has $2."
- Alternatives to using the pipe
# the pipe: result1 <- mtcars %>% filter(mpg > 30) %>% select(-qsec, -gear, -carb) # nested function calls: result2 <- select(filter(mtcars, mpg > 30), -qsec, -gear, -carb) # intermediate objects: intermediate.result <- filter(mtcars, mpg > 30) result3 <- select(intermediate.result, -qsec, -gear, -carb) identical(result1, result2) identical(result1, result3)[1] TRUE [1] TRUE
- Data science workflow
From R for Data Science by Hadley and Grolemund 2017:
readr->tidyr->dplyr(maybelubridate,stringr, others) ->ggplot2and domain specific modeling packages. - Tidyverse design guide
Good information on programming design decisions
- joins vs merge
dplyr provides join functions to replace
merge
22. Shaping data and dplyr
22.1. The dplyr package
- dplyr
dplyrincludes a family of individually simple functions that all have the same structure and all operate on tibbles (will convert data frames to tibbles). Use non-standard evaluation.Five important functions all with verb names
selectcertain columns of datafilteryour data to select specific rows (similar to base Rsubset)arrangethe rows of your data into an ordermutateyour data frame to contain new columns based on calculationssummarizechunks of you data
- also: joins
left_joinlikemergewith all.x=TRUEright_joinlikemergewith all.y=TRUEfull_joinAll x and yanti_joinTry it!
- Structure
- All function "verbs" take a data frame/tibble as the first argument
- Non standard evaluation: reduces typing. But can be dangerous in functions! There are workarounds but that is an advanced topic. See https://cran.r-project.org/web/packages/dplyr/vignettes/nse.html
- Some sample data
The Batting data set (in the
Lahmanpackage) contains the batting records for all professional US players. For more information use the command?Batting#library(dplyr) library(Lahman) # baseball dataset names(Batting)[1] "playerID" "yearID" "stint" "teamID" "lgID" [6] "G" "AB" "R" "H" "X2B" [11] "X3B" "HR" "RBI" "SB" "CS" [16] "BB" "SO" "IBB" "HBP" "SH" [21] "SF" "GIDP"
- ID variables
Factors by which we can group the data. For example, let's get mean stats (games, at bats, runs and hits, columns 6-9) for Babe Ruth across all years he played.
colMeans( filter(Batting, playerID == "ruthba01")[6:9] )G AB R H 113.77273 381.72727 98.81818 130.59091 - Another ID variable: year
Let's get mean stats (games, at bats, runs and hits, columns 6-9) across all players for 1871.
colMeans( filter(Batting, yearID == 1871)[6:9] )G AB R H 19.96522 94.10435 23.12174 26.96522 - Obtain summaries for groups
Finding summaries for one group is easy, but what if we want to find the same summaries for every group?
- In base R
# Split pieces <- split(Batting[,6:9], Batting$yearID) years <- names(pieces) # Apply results <- lapply(pieces, colMeans) # Combine # do.call turns second argument (list) into individual arguments function result <- do.call(rbind, results) result <- data.frame(result, year=years) - Results:
head(result)G AB R H year 1871 19.96522 94.10435 23.12174 26.96522 1871 1872 21.05732 99.77707 21.59236 28.45223 1872 1873 28.83200 135.67200 28.64000 39.40800 1873 1874 34.13821 155.31707 28.21138 42.47154 1874 1875 28.79263 123.65438 19.51152 31.39171 1875 1876 37.87097 162.26613 24.72581 43.04839 1876 - using
dplyr
allyears <- select(Batting, yearID, G, AB, R, H) years <- group_by(allyears, yearID) yearstats <- summarize_all(years, list(mean=mean)) head(yearstats, 3)# A tibble: 3 × 5 yearID G_mean AB_mean R_mean H_mean <int> <dbl> <dbl> <dbl> <dbl> 1 1871 20.0 94.1 23.1 27.0 2 1872 21.1 99.8 21.6 28.5 3 1873 28.8 136. 28.6 39.4
- Split-apply-recombine
When do we need to split-apply-recombine?
- During data preparation, when performing group-wise ranking, normalization, etc
- When creating summaries for display or analysis (marginal means, conditioning tables)
- During modelling, when fitting separate models to subsets of the data
22.2. dplyr functions
- filter
How many players in 1871?
y1871 <- filter(Batting, yearID==1871) length(unique(y1871$playerID))[1] 115
- mutate and summarise functions
mutate()creates a tibble based on an existing data frame; a row of output per row of inputsummarise()A row of output per group
- Strategy for grouped calculations
- Extract a single group
- Solve problem for that one group — perhaps write a wrapper function that takes the group and returns the summary or transformation
- Use
dplyrfunctions andgroup_byto solve it for all groups
- 1-2. Calculate career year for Babe Ruth
career year = number of years since player began.
baberuth <- filter(Batting, playerID == "ruthba01") baberuth <- mutate(baberuth, cyear = yearID - min(yearID) + 1) head(baberuth, 5)[,c(2, 6:9, 23)]yearID G AB R H cyear 1 1914 5 10 1 2 1 2 1915 42 92 16 29 2 3 1916 67 136 18 37 3 4 1917 52 123 14 40 4 5 1918 95 317 50 95 5
- 3. Calculate career year for all players
byplayer <- group_by(Batting, playerID) Batting <- mutate(byplayer, cyear = yearID - min(yearID) + 1) Batting <- arrange(Batting, playerID) head(Batting)[, c(1, 2, 6:9, 23)]# A tibble: 6 × 7 # Groups: playerID [1] playerID yearID G AB R H cyear <chr> <int> <int> <int> <int> <int> <dbl> 1 aardsda01 2004 11 0 0 0 1 2 aardsda01 2006 45 2 0 0 3 3 aardsda01 2007 25 0 0 0 4 4 aardsda01 2008 47 1 0 0 5 5 aardsda01 2009 73 0 0 0 6 6 aardsda01 2010 53 0 0 0 7
- More summaries: Teams per player
Total number of teams in the data frame
length(unique(Batting$teamID))[1] 149
Number of teams for which each player played:
nteams <- summarise(byplayer, nteams = length(unique(teamID))) head(nteams, 5)# A tibble: 5 × 2 playerID nteams <chr> <int> 1 aardsda01 8 2 aaronha01 3 3 aaronto01 2 4 aasedo01 5 5 abadan01 3
- Do players improve?
- Do players improve? Modeling for one player
Fitting a regression model relating runs per bat (rbi/ab) to career year
mod.br <- lm(RBI/AB ~ cyear, data=baberuth) mod.br #summary(mod.br)Call: lm(formula = RBI/AB ~ cyear, data = baberuth) Coefficients: (Intercept) cyear 0.203200 0.003413
- Do players improve? Modeling for all
How to model this for all players?
lmslope <- function(x,y){ linmod <- lm(y~x) coefs <- coef(linmod) return(coefs[2]) } bb <- subset(Batting, AB >= 25) bb <- mutate(group_by(bb, playerID), include = length(yearID) > 10) bb <- filter(bb, include) player_traj <- summarize(bb, slope=lmslope(cyear, RBI/AB)) ggplot(player_traj, aes(x=slope)) + geom_histogram( binwidth=0.005) - Results:
- Maybe players get better than worse again
modq.br <- lm(RBI/AB ~ cyear + I(cyear^2), data=baberuth) #mod.br summary(modq.br)Call: lm(formula = RBI/AB ~ cyear + I(cyear^2), data = baberuth) Residuals: Min 1Q Median 3Q Max -0.10600 -0.01994 0.01035 0.04025 0.06291 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.1267872 0.0372004 3.408 0.00295 ** cyear 0.0225163 0.0074510 3.022 0.00701 ** I(cyear^2) -0.0008306 0.0003146 -2.640 0.01613 * --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 0.05295 on 19 degrees of freedom Multiple R-squared: 0.3592, Adjusted R-squared: 0.2917 F-statistic: 5.325 on 2 and 19 DF, p-value: 0.01459 - Quadratic plot
brplot + stat_smooth(method="lm", formula = y~x+I(x^2))
- All players, quadratic term
qtermmodel <- function(df){ mod <- lm(RBI/AB ~ cyear + I(cyear^2), data = df) coefs <- data.frame(t(coef(mod))) names(coefs) <- c("int", "slope", "qterm") return(coefs) } bcoefs <- do(bb, qtermmodel(.)) qplot(qterm, data=bcoefs, binwidth=0.0005) + xlim(c(-0.015, 0.015)) - Results: all players, quadratic term
23. More dplyr and data exploration
23.1. Baby names
- US baby names data
- Download the data
- All data in zipfile: http://www.ssa.gov/oact/babynames/names.zip
- Zipped file contains one txt file (csv) for each year 1880-2023. Eg: "yob1885.txt"
- My script extracts this data, combines it, and saves the top 1000 names for each sex and year: file:///home/schwilk/teaching/courses/R-research-tool/course-materials/assignments/hw11-clean-bnames.R. You don't need to run it.
- Saved data: http://r-research-tool.schwilk.org/data/top-1000-baby-names.csv You should just use this saved data!
- Read data
bnames <- read.csv("http://r-research-tool.schwilk.org/data/top-1000-baby-names.csv") # gender here is labeled "sex" and "boy/girl" means male/female sex assigned at birth bnames$sex <- factor(bnames$sex) levels(bnames$sex) <- c("girl", "boy") head(bnames, 3)name sex year rank percent 1 Mary girl 1880 1 7.764248 2 Anna girl 1880 2 2.861727 3 Emma girl 1880 3 2.201244
- Name diversity
Investigate the percentage of all names that the top 1000 names occupy
bn <- group_by(bnames, year, sex) ofall <- summarize(bn, tprop = sum(percent)) ggplot(ofall, aes(year, tprop, colour = sex)) + geom_line() + ylab("Percentage of all names") - Result:
- Look at name trend
a <- aes(year, percent, color = sex) p <- ggplot(filter(bnames, name=="Dylan"), a) + geom_line() p
- Some helper functions
# Return letter at position pos library(stringr) letter <- function(string, pos) { return(str_sub(string, pos, pos)) } # and some consistent axes axes <- list( scale_x_continuous(breaks = c(1900, 1920, 1940, 1960, 1980, 2000)), scale_y_continuous("percent") ) - Trend in first letter
bn <- group_by(bnames, year,sex, first = letter(name, 1)) fl <- summarize(bn, percent=sum(percent)) ggplot(fl, a) + geom_line() + facet_wrap(~ first) + axes - Result
- Trend in last letter
bn <- group_by(bnames, year,sex, last = letter(name, -1)) ll <- summarize(bn, percent=sum(percent)) ggplot(ll, a) + geom_line() + facet_wrap(~ last) + axes - Result
- Rise in names that end in n
lastn <- group_by(filter(bnames, letter(name, -1)=="n"), year) lastn <- summarize(lastn, percent=sum(percent)) ggplot(lastn, aes(year, percent)) + geom_line() + axes - Result
- What about names that sound alike?
see http://en.wikipedia.org/wiki/Soundex
soundex <- function(name) { name <- toupper(name) name <- unlist(strsplit(name,"")) # split into individual chars sdx <- name[1] dict <- c("BFPV", "CGJKQSXZ", "DT", "L", "MN", "R", "AEIOUHWY") codes <- c("1","2","3","4","5","6",".") for (i in 2:length(name)) { code = codes[grep(name[i], dict, fixed=TRUE)] if (code != str_sub(sdx,-1,-1)) { sdx <- str_c(sdx, code) } } sdx <- gsub(".", "", sdx, fixed=TRUE) sdx <- str_pad(sdx, 3, pad = "0") return(sdx) } - Get soundex for each name in data
unames <- unique(bnames$name) sdxs <- sapply(unames, soundex) bnames$soundex <- sdxs[ match(bnames$name, unames )] bnames$soundend <- str_sub(bnames$soundex,2) - So what is the deal with names that end in n?
sdylan <- filter(bnames, soundex==soundex("Dylan")) unique(sdylan$name)[1] "Delma" "Dillon" "Delano" "Delwin" "Dylan" [6] "Delaney" "Dillan" "Dillion" "Dylon" "Dyllan" [11] "Dallin" "Dilan" "Daylen"
- Plot sound alikes
sdy <- group_by(sdylan, sex, year) sdylan.sum <- summarize(sdy, percent = sum(percent)) ggplot(sdylan.sum, a) + geom_line() + axes - Result
- Maybe all soundexs that end in "45" + "n"?
sy <- group_by(bnames, year, sex) s45 <- summarize(filter(sy, soundend == "45" & letter(name,-1)=="n"), percent=sum(percent)) ggplot(s45, a) + geom_line() + axes - Result
24. Tidy Data
24.1. Tidy data
- What is tidy data?
- A step along the road to clean data
- Data that is easy to model, visualise and aggregate (i.e. works well with statistical model functions like
lm, works well withggplot, anddplyr)
- Example of messy data
- Data storage in tidy format
Storage Meaning Table / File Data set Rows Observation Columns Variables - Common causes of messiness
- column headers are values, not variable names
- multiple variables are stored in one column
- variables are stored in both rows and columns
- multiple types of experimental unit stored in the same table
- one type of experimental unit stored in multiple tables
- Some tools
library(tidyr) ?pivot_longer ?pivot_wider ?separate ?extract library(stringr) ?str_replace ?str_sub ?str_split_fixed
24.2. Long vs wide data
- You've seen long vs wide data formats
date species psi.pd psi.md RH.pd RH.md 110810 JUDE2 -2.0 -4.5 88 50 110810 QUGR -1.0 -2.2 90 40 110810 PICE -0.4 -1.7 91 33 Useful for regressing md.psi on pd.psi, but not useful if we want to facet on time of day. So is this tidy? Depends if we think of pd.psi and md.psi as separate variables or not.
- Long format
date species time psi RH 110810 JUDE2 pd -2.0 88 110810 JUDE2 md -4.5 90 110810 QUGR pd -1.0 91 110810 QUGR md -2.2 50 110810 PICE pd -0.4 40 110810 PICE md -1.7 33 - Conceptual framework
- Identifer (id) variables identify the unit that measurements take place on. ID variables are usually discrete, and are typically fixed by design.
- Measured variables represent what is measured on that unit.
- Further step
In truly "long" format there are only id variables and a value column! (variable names all in a "variable" column)
date species time variable value 110810 JUDE2 pd psi -2.0 110810 JUDE2 md psi -4.5 110810 QUGR pd psi -1.0 110810 QUGR md psi -2.2 110810 PICE pd psi -0.4 110810 PICE md psi -1.7 110810 JUDE2 pd RH 88 110810 JUDE2 md RH 90 110810 QUGR pd RH 91 110810 QUGR md RH 50 110810 PICE pd RH 40 110810 PICE md RH 33 - Problem: Column headers are values, not variable names
raw <- read.csv("https://r-research-tool.schwilk.org/data/pew.csv", check.names = FALSE) head(raw[,1:6])religion <$10k $10-20k $20-30k $30-40k $40-50k 1 Agnostic 27 34 60 81 76 2 Atheist 12 27 37 52 35 3 Buddhist 27 21 30 34 33 4 Catholic 418 617 732 670 638 5 Don’t know/refused 15 14 15 11 10 6 Evangelical Prot 575 869 1064 982 881 - Problem: Multiple variables in one column
The WHO tuberculosis data:
library(stringr) course_url <- "http://r-research-tool.schwilk.org" raw <- read.csv(file.path(course_url, "data/tb-simple-2020.csv")) raw %>% select(1:7) %>% tail()country iso2 g_whoregion year new_sp new_sp_m04 8487 Zimbabwe ZW AFR 2014 NA NA 8488 Zimbabwe ZW AFR 2015 NA NA 8489 Zimbabwe ZW AFR 2016 NA NA 8490 Zimbabwe ZW AFR 2017 NA NA 8491 Zimbabwe ZW AFR 2018 NA NA 8492 Zimbabwe ZW AFR 2019 NA NA new_sp_m514 8487 NA 8488 NA 8489 NA 8490 NA 8491 NA 8492 NA - Clean a bit
## get rid of the "overall count" for now raw$new_sp <- NULL ## rename rest of columns for clarity names(raw) <- str_replace(names(raw), "new_sp_", "") raw %>% select(1:8, -iso2) %>% tail()country g_whoregion year m04 m514 m014 m1524 8487 Zimbabwe AFR 2014 NA NA NA NA 8488 Zimbabwe AFR 2015 NA NA NA NA 8489 Zimbabwe AFR 2016 NA NA NA NA 8490 Zimbabwe AFR 2017 NA NA NA NA 8491 Zimbabwe AFR 2018 NA NA NA NA 8492 Zimbabwe AFR 2019 NA NA NA NA
24.3. Reshaping
- Variables in rows and columns
Example data, photosynthesis and respiration in two species
library(tidyr)tab1 <- readr::read_csv("http://r-research-tool.schwilk.org/data/carex.csv") head(tab1)Species Treatment A.1 A.2 A.3 A.4 R.1 R.2 R.3 R.4 Pa 0.15 8.36 7.632 5.522 6.3 0.7358 0.587 0.6236 0.4354 Pa 3 11.06 11.54 7.914 7.914 Pa 15 22.32 18.64 17.2 18.4 1.168 0.96 1.546 Cs 0.15 15.6 10.24 13.1 10.48 1.484 0.7682 0.6236 1.306 Cs 3 13.74 15.24 13.92 10.5 Cs 15 9.37 18.76 15.6 10.66 1.036 1.064 0.5276 0.899 - Make data longer (wide to long format)
tab.tidier <- pivot_longer(tab1, cols=c(-Species, -Treatment), names_to="measurement") head(tab.tidier)# A tibble: 6 × 4 Species Treatment measurement value <chr> <dbl> <chr> <dbl> 1 Pa 0.15 A.1 8.36 2 Pa 0.15 A.2 7.63 3 Pa 0.15 A.3 5.52 4 Pa 0.15 A.4 6.3 5 Pa 0.15 R.1 0.736 6 Pa 0.15 R.2 0.587
- Separate replicate id from measurement type
tab.long <- tab.tidier %>% separate(measurement, into = c("measurement", "rep"), sep = "\\.") head(tab.long)# A tibble: 6 × 5 Species Treatment measurement rep value <chr> <dbl> <chr> <chr> <dbl> 1 Pa 0.15 A 1 8.36 2 Pa 0.15 A 2 7.63 3 Pa 0.15 A 3 5.52 4 Pa 0.15 A 4 6.3 5 Pa 0.15 R 1 0.736 6 Pa 0.15 R 2 0.587
In more complicated cases you may need to use regular expressions to pull apart column data. See
names_patternargument, alsotidyr::extract,stringr::str_match - Make data wider (long to wide)
Why?
tab.tidy <- tab.long %>% pivot_wider(names_from=measurement, values_from=value) head(tab.tidy)# A tibble: 6 × 5 Species Treatment rep A R <chr> <dbl> <chr> <dbl> <dbl> 1 Pa 0.15 1 8.36 0.736 2 Pa 0.15 2 7.63 0.587 3 Pa 0.15 3 5.52 0.624 4 Pa 0.15 4 6.3 0.435 5 Pa 3 1 11.1 NA 6 Pa 3 2 11.5 NA
- pivot longer: TB data again
tempwide <- raw %>% select(1:7, -iso2) tail(tempwide)country g_whoregion year m04 m514 m014 8487 Zimbabwe AFR 2014 NA NA NA 8488 Zimbabwe AFR 2015 NA NA NA 8489 Zimbabwe AFR 2016 NA NA NA 8490 Zimbabwe AFR 2017 NA NA NA 8491 Zimbabwe AFR 2018 NA NA NA 8492 Zimbabwe AFR 2019 NA NA NA - pivot longer
temp_long <- tempwide %>% pivot_longer(cols=starts_with("m"), names_to="sexage", values_to="count") # More correctly, use a regex to capture m|f|sexunk temp_long <- tempwide %>% pivot_longer(cols=matches("^[mfs]"), names_to="sexage", values_to="count") head(temp_long)# A tibble: 6 × 5 country g_whoregion year sexage count <chr> <chr> <int> <chr> <int> 1 Afghanistan EMR 1980 m04 NA 2 Afghanistan EMR 1980 m514 NA 3 Afghanistan EMR 1980 m014 NA 4 Afghanistan EMR 1981 m04 NA 5 Afghanistan EMR 1981 m514 NA 6 Afghanistan EMR 1981 m014 NA
- Several choices now for grabbing sex and age:
- Check possible values. I omitted most cols so this is incomplete.
- Then separate into age and sex columns by using either
names_patterninpivot_longerdirectly,tidyr:extract, orstringr::str_match
- What to do with NAs?
- In long data, some NAs can be implicit (missing rows), but when spread to a wider form we need to make them explicit because there is a cell to fill.
- In some cases, an NA in one form (or a missing row) can imply a zero in another aggregate form (for example: oak cover on my Big Bend transects!).
- Things you can do with tidyr
- pivot wider or longer
- split or combine columns (
separate(),extract(), andunite()) - make explicit missing value explicit (
complete()) or vice-versa (drop_na()), or replace with known value (replace_na()).
- Non standard evaluation
Most dplyr verbs use non-standard evaluation and treat some arguments as expressions. In more detail:
arrange(),count(),filter(),group_by(),mutate(), andsummarise()use data masking so that you can use data variables as if they were variables in the environment (i.e. you writemy_variablenotdf$myvariable).across(),relocate(),rename(),select(), andpull()use tidy selection so you can easily choose variables based on their position, name, or type (e.g.starts_with("x")oris.numeric).Look up these terms and look at examples.
25. Dates and Times
25.1. How to represent time
- Time is tricky to represent
Time is a deceptively complicated system to work with. Do we want it to behave like a number line, or according to complicated conventions?
- Instants
- An instant refers to a specific moment of time.
- Instants are combination of years, months, days, hours, minutes, seconds, and timezone. e.g.,
2011-03-28 08:44:32 CST - There are two main time data types in R:
- POSIXct
The POSIXct data type is the number of seconds since the start of January 1, 1970. Negative numbers represent the number of seconds before this time, and positive numbers represent the number of seconds afterwards.
- POSIXlt
The POSIXlt data type is a vector, and the entries in the vector have the following meanings:
- seconds
- minutes
- hours
- day of month (1-31)
- month of the year (0-11)
- years since 1900
- day of the week (0-6 where 0 represents Sunday)
- day of the year (0-365)
- Daylight savings indicator (positive if it is daylight savings)
25.2. lubridate
- lubridate package
See https://lubridate.tidyverse.org/ for cheatsheet.
# lubridate turns strings into instants with functions # that have y, m, and d in their names. ymd("2011-03-28") mdy("03/28/2011") dmy("28032011") ymd_hms("2011:03:28 08:50:30", tz="America/Chicago") dmy("05-03-2011") mdy("05-03-2011")[1] "2011-03-28" [1] "2011-03-28" [1] "2011-03-28" [1] "2011-03-28 08:50:30 CDT" [1] "2011-03-05" [1] "2011-05-03"
- What type of objects do these return?
# what is the object type (class)? class(mdy("05-03-2011")) class(ymd_hms("2011:03:28 08:50:30", tz="America/Chicago"))[1] "Date" [1] "POSIXct" "POSIXt"
- Example use
url <- "http://r-research-tool.schwilk.org/data/soil-sensors.csv" soil <- read.csv(url, na.strings = c("NA", "#N/A", "")) names(soil) <- c("time", "p1_psi", "p2_VWC", "p3_VWC", "p4_psi") head(soil) class(soil$time)time p1_psi p2_VWC p3_VWC p4_psi 1 24 May 2012 12:00 PM -36 0.230 0.338 -23 2 24 May 2012 12:30 PM -36 0.229 0.338 -23 3 24 May 2012 01:00 PM -36 0.226 0.337 -23 4 24 May 2012 01:30 PM -37 0.224 0.337 -23 5 24 May 2012 02:00 PM -37 0.222 0.336 -23 6 24 May 2012 02:30 PM -38 0.220 0.335 -24 [1] "character" - Try:
Convert time column into date-time objects
- Manipulating instants
now() > ymd("2025-04-16") today() == ymd("2025-04-15") # date object now() > ymd("2025-04-15", tz="America/Chicago")[1] TRUE [1] FALSE [1] TRUE
- Timezones
Tricky to specify. Use this list: https://en.wikipedia.org/wiki/List_of_tz_database_time_zones
d <- ymd_hms("2020-10-22 14:30:00", tz="America/Chicago") d with_tz(d, tz="UTC") with_tz(d, tz="America/Los_Angeles") force_tz(d, tz="America/Los_Angeles") force_tz(d, tz="America/Whitehorse")[1] "2020-10-22 14:30:00 CDT" [1] "2020-10-22 19:30:00 UTC" [1] "2020-10-22 12:30:00 PDT" [1] "2020-10-22 14:30:00 PDT" [1] "2020-10-22 14:30:00 PDT"
- Note  B_note
Note that TZ names like "US/Central" are officially considered deprecated by the tz database that is used globally (by many different software). They are unlikely to ever not work but just a heads up that the continent/city format is now the standard.
Timezones are very confusing because daylight savings dates and observance could differ by municipality, and of course over time. And political boundaries cahnge. You might be surprised if you have a time series that crosses 1867 in Alaska, for example (Julian vs Gregorian calendar, international date line all changed with purchase from Russia).
- Note  B_note
- Rounding times
# round floor_date(now(), "month") ceiling_date(now(), "month") round_date(now(), "month") # extract year(now())[1] "2026-04-01 CDT" [1] "2026-05-01 CDT" [1] "2026-05-01 CDT" [1] 2026
- Accessing components
Date component Accessor Year year()Month month()Week week()Day of year yday()Day of month mday()Day of week wday()Hour hour()Minute minute()Second second()Time Zone tz() - Manipulating instants
now() year(now()) hour(now()) day(now()) yday(now())[1] "2026-04-28 10:18:36 CDT" [1] 2026 [1] 10 [1] 28 [1] 118
- Instants, continued
wday(now()) wday(now(), label = TRUE) wday(now(), label = TRUE, abbr = FALSE) month(now(), label = TRUE, abbr = FALSE)[1] 3 [1] Tue Levels: Sun < Mon < Tue < Wed < Thu < Fri < Sat [1] Tuesday 7 Levels: Sunday < Monday < Tuesday < ... < Saturday [1] April 12 Levels: January < February < March < April < ... < December
- Upon which day of the week were you born?
- Changing components
# Accessor functions can also be used to change the # elements of a date-time ti <- now() year(ti) <- 1999 hour(ti) <- 23 day(ti) <- 45 tz(ti) <- "UTC" ti[1] "1999-05-15 23:18:36 UTC"
- Summarizing over time
library(dplyr) library(ggplot2) soil$date <- floor_date(soil$time, "day") # could use dplyr::mutate # or could use date(soil$time) psi.day <- soil %>% group_by(date) %>% summarize(avepsi = mean(p1_psi, na.rm=TRUE)) ggplot(psi.day, aes(date, avepsi)) + geom_line() - Result
25.3. Time spans
- Time spans
#?dseconds dseconds(140) dhours(32) ddays(2) class(ddays(1))[1] "140s (~2.33 minutes)" [1] "115200s (~1.33 days)" [1] "172800s (~2 days)" [1] "Duration" attr(,"package") [1] "lubridate"
- Let's use R to set our alarm clock
x <- ymd_hms("2026-01-01 08:30:00", tz = "") + ddays(0:30) head(x) # cool[1] "2026-01-01 08:30:00 CST" "2026-01-02 08:30:00 CST" [3] "2026-01-03 08:30:00 CST" "2026-01-04 08:30:00 CST" [5] "2026-01-05 08:30:00 CST" "2026-01-06 08:30:00 CST"
- A problem:
## what about: x <- ymd_hms("2026-03-01 08:30:00", tz = "") + ddays(0:30) head(x, 12)[1] "2026-03-01 08:30:00 CST" "2026-03-02 08:30:00 CST" [3] "2026-03-03 08:30:00 CST" "2026-03-04 08:30:00 CST" [5] "2026-03-05 08:30:00 CST" "2026-03-06 08:30:00 CST" [7] "2026-03-07 08:30:00 CST" "2026-03-08 09:30:00 CDT" [9] "2026-03-09 09:30:00 CDT" "2026-03-10 09:30:00 CDT" [11] "2026-03-11 09:30:00 CDT" "2026-03-12 09:30:00 CDT"
- What went wrong?
- 3 types of time spans
We can still do math on the time line, as long as we’re specific about how we want to measure time.
- durations (exact number of seconds)
- periods (clock times)
- intervals (two specific datetimes)
- Durations
# Use d + the plural of of a unit of time to create # a duration. dminutes(5) dhours(278) #dmonths(4) # why doesn't this work?[1] "300s (~5 minutes)" [1] "1000800s (~1.65 weeks)"
- Periods
# Use the plural of a time unit to create a period. minutes(5) hours(278) months(4) # months are not a problem years(1:10)[1] "5M 0S" [1] "278H 0M 0S" [1] "4m 0d 0H 0M 0S" [1] "1y 0m 0d 0H 0M 0S" "2y 0m 0d 0H 0M 0S" [3] "3y 0m 0d 0H 0M 0S" "4y 0m 0d 0H 0M 0S" [5] "5y 0m 0d 0H 0M 0S" "6y 0m 0d 0H 0M 0S" [7] "7y 0m 0d 0H 0M 0S" "8y 0m 0d 0H 0M 0S" [9] "9y 0m 0d 0H 0M 0S" "10y 0m 0d 0H 0M 0S"
- Why use periods?
Appointments that refer to wall clock time once per week, per month, per year, etc.
2020-11-01 00:00:00 + months(1)will always equal2020-12-01 00:00:00no matter how many leap seconds, leap days or changes in DST occur in between - Intervals
# To create an interval, use the special %--% # operator: int <- ymd("2024-01-01") %--% ymd("2024-01-02") # Access and set the endpoints with int_start() and # int_end(). int_start(int) int_end(int) <- ymd("2024-03-14") as.duration(int) as.period(int) # Intervals can be accurately converted to either # periods or durations with as.period() and # as.duration()[1] "2024-01-01 UTC" [1] "6307200s (~10.43 weeks)" [1] "2m 13d 0H 0M 0S"
25.4. Math with date-times
26. Statistical Models
26.1. Statistical models
- Definition
- Data
- A collection of variables, each variable being a vector of measurements of some specific trait/characteristic
- Problem
- How does variable \(Y\) depend upon variables \(X_1, X_2 \dots X_n\)
- Model
- The model represents the mathematical relationship between X and Y. It aims to represent the true relationship
- First steps in modeling
- Identify the response variable
- Identify the explanatory variables
- Are the explanatory variables continuous, categorical or a mixture of both?
- What is the response variable? Continuous, a count or proportion, category, or time-at-death?
- Explanatory variables
- Continuous
- Regression
- Categorical
- Analysis of Variance (ANOVA)
- Mixture
- Analysis of covariance (ANCOVA)
- Response variables
- Continuous
- Linear regression, anova, ancova
- Proportion
- Logistic regression (and variants for counts, binary data)
- Time-at-death
- Survival analysis
26.2. Model formulas in R
- Model formulas
- A model formula is a sequence of R variables and symbols that specifies the variables in the model and their relationships
- The formula is a necessary argument to many model-fitting functions in R (treated as an expression)
- Although it resembles a mathematical formula, the symbols in the "equation" have different meanings
- Model formula notation
A model formula in R can look like this:
y ~ A + B + A:B
the
~means: "is modeled as a function of" - Example: Linear regression
Setup:
library(ggplot2) x <- runif(1000, 0, 10) y <- 1.5*x + 0.09 + rnorm(1000, 0, 2) d <- data.frame(x=x, y=y) ggplot(d, aes(x, y)) + geom_point()
- Example: Linear regression
fit <- lm(y ~ x, data = d) fitCall: lm(formula = y ~ x, data = d) Coefficients: (Intercept) x 0.09578 1.50354 - Example: Linear regression
summary(fit)Call: lm(formula = y ~ x, data = d) Residuals: Min 1Q Median 3Q Max -5.2281 -1.3923 -0.0896 1.3105 6.9257 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 0.09578 0.12165 0.787 0.431 x 1.50354 0.02114 71.106 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.947 on 998 degrees of freedom Multiple R-squared: 0.8352, Adjusted R-squared: 0.835 F-statistic: 5056 on 1 and 998 DF, p-value: < 2.2e-16 - Extracting information from a fitted object
typeof(fit) class(fit) names(fit) coef(fit)[1] "list" [1] "lm" [1] "coefficients" "residuals" "effects" [4] "rank" "fitted.values" "assign" [7] "qr" "df.residual" "xlevels" [10] "call" "terms" "model" (Intercept) x 0.09577804 1.50353900
- Residuals
head(residuals(fit),5) sd(residuals(fit))1 2 3 4 5 -2.95736864 0.02948475 -0.95593011 -1.22266288 -1.48669100 [1] 1.945628 - Intercepts
But wait! R estimated two coefficients and we only had one explanatory variable.
- R assumes that you want to include an intercept term (our call was equivalent to
y ~ 1 + x) - You can exclude it by adding a
0to the formula (or a-1). But you (almost) never want to do that.
- R assumes that you want to include an intercept term (our call was equivalent to
- Example of no intercept
fitnoi <- lm(y ~ 0 + x, data = d) summary(fitnoi)Call: lm(formula = y ~ 0 + x, data = d) Residuals: Min 1Q Median 3Q Max -5.1886 -1.3703 -0.0788 1.3384 6.9321 Coefficients: Estimate Std. Error t value Pr(>|t|) x 1.5179 0.0107 141.9 <2e-16 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 1.946 on 999 degrees of freedom Multiple R-squared: 0.9527, Adjusted R-squared: 0.9527 F-statistic: 2.013e+04 on 1 and 999 DF, p-value: < 2.2e-16 - Let's build some models
# names(mtcars) ggplot(mtcars, aes(disp, mpg)) + geom_point() ggplot(mtcars, aes(wt, mpg)) + geom_point() ggplot(mtcars, aes(disp, wt)) + geom_point() fuel.mod <- lm(mpg ~ disp + wt + wt:disp, data=mtcars) summary(fuel.mod) - Model results
Call: lm(formula = mpg ~ disp + wt + wt:disp, data = mtcars) Residuals: Min 1Q Median 3Q Max -3.267 -1.677 -0.836 1.351 5.017 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 44.081998 3.123063 14.115 2.96e-14 *** disp -0.056358 0.013239 -4.257 0.00021 *** wt -6.495680 1.313383 -4.946 3.22e-05 *** disp:wt 0.011705 0.003255 3.596 0.00123 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 2.455 on 28 degrees of freedom Multiple R-squared: 0.8501, Adjusted R-squared: 0.8341 F-statistic: 52.95 on 3 and 28 DF, p-value: 1.158e-11 anova()Test if added variables reduce residual sum-of-squares
# type I tests (sequential): anova(fuel.mod)Analysis of Variance Table Response: mpg Df Sum Sq Mean Sq F value Pr(>F) disp 1 808.89 808.89 134.217 3.4e-12 *** wt 1 70.48 70.48 11.694 0.001942 ** disp:wt 1 77.93 77.93 12.931 0.001227 ** Residuals 28 168.75 6.03 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1- But with regression, data are likely unbalanced, best to use type II or III SS
library(car)# set contrasts different than system difaults: must sum to 0 options(contrasts = c("contr.sum","contr.poly")) # car::Anova() can use type II and type III sum of squares. # We found a significant interaction in our test with anova() # so we should use type III. This will match SAS output. fuel.mod <- lm(mpg ~ disp + wt + wt:disp, data=mtcars) Anova(fuel.mod, contrasts=list(disp=contr.sum, wt=contr.sum), type=3)- Lecture notes  B_note
https://www.r-bloggers.com/2011/03/anova-%E2%80%93-type-iiiiii-ss-explained/
type I ss is sequential, so order of main effects will change result. for unbalanced data.
type 2 tests each main effect after the other. It is most powerful if there is no signifciant interaction (but if there is an interaction, you must use type 3).
- Lecture notes  B_note
- Results
Anova Table (Type III tests) Response: mpg Sum Sq Df F value Pr(>F) (Intercept) 1200.72 1 199.233 2.956e-14 *** disp 109.22 1 18.123 0.0002102 *** wt 147.42 1 24.461 3.217e-05 *** disp:wt 77.93 1 12.931 0.0012270 ** Residuals 168.75 28 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1- Note  B_note
Usually the hypothesis of interest is about the significance of one factor while controlling for the level of the other factors. This equates to using type II or III sum of squares. In general, if there is no significant interaction effect, then type II is more powerful, and follows the principle of marginality. If an interaction is present, then type II is inappropriate while type III can still be used, but results need to be interpreted with caution (in the presence of interactions, main effects are rarely interpretable).
- Note  B_note
- Do use
anova()to compare two models
fuel.mod2 <- lm(mpg ~ disp + wt, data=mtcars) anova(fuel.mod2, fuel.mod)Analysis of Variance Table Model 1: mpg ~ disp + wt Model 2: mpg ~ disp + wt + wt:disp Res.Df RSS Df Sum of Sq F Pr(>F) 1 29 246.68 2 28 168.75 1 77.934 12.931 0.001227 ** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
- Linear models:
summaryvsanovavscar::Anova
- Linear models:
summaryvsanovavscar::Anova
- Longer
Ben Bolker (https://stackoverflow.com/questions/12863178/comparing-two-linear-models-with-anova-in-r): The p-values you get for
summary()of each individual model are the p-values for the effects of each of the parameters in each model, conditional on all the other parameters in that model. If your data are perfectly balanced (which is unlikely in a regression design), you should get the same answers fromsummary()andanova(), but otherwise the results fromanovaare generally preferable.
- Longer
- Categorical variables and contrasts
If you have N levels of a factor, you have N-1 contrasts (reported as coefficients).
- Default is dummy coding: each level compared to reference level (eg first). You can re-order levels to change reference.
- Reverse Helmert coding (
contr.helmert) compares each level of a categorical variable to the mean of previous levels of the variable. - And others including user-defined
See https://cran.r-project.org/web/packages/codingMatrices/vignettes/codingMatrices.pdf
or https://marissabarlaz.github.io/portfolio/contrastcoding/
- Symbols in formulas
+separate effects in a formula :interaction (A:B is interaction of A and B) *main effects plus interactions ^crossed %in%nested within (but see lme4 and nlme) /nested within | conditional on (rarer) A*Bis equivalent toA + B + A:B - lm() works for categorical variables too
mtcars$cylnum <- factor(mtcars$cyl) levels(mtcars$cylnum) # treatment or dummy coding mtcars$cylnum <- C(mtcars$cylnum, treatment) fuel.aov <- lm(mpg ~ cylnum, data=mtcars) Anova(fuel.aov)[1] "4" "6" "8" Anova Table (Type II tests) Response: mpg Sum Sq Df F value Pr(>F) cylnum 824.78 2 39.697 4.979e-09 *** Residuals 301.26 29 --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 - Coefficients of categorical variables
summary(fuel.aov)Call: lm(formula = mpg ~ cylnum, data = mtcars) Residuals: Min 1Q Median 3Q Max -5.2636 -1.8357 0.0286 1.3893 7.2364 Coefficients: Estimate Std. Error t value Pr(>|t|) (Intercept) 26.6636 0.9718 27.437 < 2e-16 *** cylnum6 -6.9208 1.5583 -4.441 0.000119 *** cylnum8 -11.5636 1.2986 -8.905 8.57e-10 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 Residual standard error: 3.223 on 29 degrees of freedom Multiple R-squared: 0.7325, Adjusted R-squared: 0.714 F-statistic: 39.7 on 2 and 29 DF, p-value: 4.979e-09- Note  B_note
The "contrasts" vector is set through `options()` in your R environment. This vector of length 2 sets contrast types represented as strings determines how categorical variables are handled in your models. The most common scheme in regression is called "treatment contrasts": with treatment contrasts, the first level of the categorical variable is assigned the value 0, and then other levels measure the change from the first level. The contrasts argument to options always takes two arguments. the first refers to the contrasts used for unordered factors and the second for ordered factors. The default for unordered variables is `contr.treatment()`.
in anova, we usually use contrasts.sum. For ordered factors, contrasts.poly.
Look up more on contrasts coding. contrasts() function and cont.treatment() etc:
https://stats.oarc.ucla.edu/r/library/r-library-contrast-coding-systems-for-categorical-variables/ or
For more control, there is the `contrasts` package: https://cran.r-project.org/web/packages/contrast/vignettes/contrast.html
- Note  B_note
26.3. Beyond the linear model
- Beyond the linear model
- GLM
- Generalized linear modeling allow non-normal errors in response variable through a linear link function
- Mixed effects models
- relax assumption of independence. Model nested observations and random effects
- GLS
- Generalized least squares relax assumption of homogeneity, model the heterogeneity
- GLMM
- combine first and second, above.
- GAM and GAMM
- Generalized additive models and mixed versions
- Generalized Linear Models (glm)
Consist of:
- Error distributions
The distribution of the response around its mean. for ANOVA and regression, this is the normal; for logistic regression it is the binomial
- Link function
The function,
gthat shows how the linear function of the explanatory variables is related to the expected value of the response:\(g(\mu) = \beta_0 + \beta_1x_1 + \dots + \beta_nx_n\)
- Variance function
This captures how the variance of the response depends upon the mean
- Error distributions
- Example logistic regression
trees <- data.frame(fire.sev = seq(0.01,1,0.005)) fire.mort <- runif(length(trees$fire.sev)) < trees$fire.sev trees$mortality <- ifelse(fire.mort,1,0) ggplot(trees, aes(fire.sev, mortality)) + geom_point() - Not a linear relationship!
- Logistic regression
trees.glm <- glm(mortality ~ fire.sev, data=trees, family = binomial()) trees.glm # summary(trees.glm)Call: glm(formula = mortality ~ fire.sev, family = binomial(), data = trees) Coefficients: (Intercept) fire.sev -2.323 4.627 Degrees of Freedom: 198 Total (i.e. Null); 197 Residual Null Deviance: 275.9 Residual Deviance: 212.7 AIC: 216.7 - Prediction
trees$pred.mort <- predict(trees.glm, type="response") ggplot(trees, aes(fire.sev, mortality)) + geom_point() + geom_line(aes(y =pred.mort) ) - Prediction: plot
26.4. Mixed models
- Example: Mixed model
fish <- read.csv("http://r-research-tool.schwilk.org/data/fish-data.csv") head(fish)aquarium nutrient color.score 1 a low 5.3 2 a low 4.7 3 a low 5.0 4 a low 4.3 5 a low 4.8 6 b low 4.8
- Look at these data
Is each fish by treatment row independent?
- Mixed model using nlme
library(nlme) fit1.lme <- lme(color.score ~ 1, random = ~ 1 | aquarium, method="ML", data=fish) fit2.lme <- lme(color.score ~ nutrient, random = ~ 1 | aquarium, method="ML", data=fish) anova(fit1.lme, fit2.lme)Model df AIC BIC logLik Test L.Ratio fit1.lme 1 3 168.2086 173.9446 -81.10429 fit2.lme 2 4 147.2213 154.8694 -69.61064 1 vs 2 22.98729 p-value fit1.lme fit2.lme <.0001 - BEWARE! Those p values are wrong in any non classical design.
Beware of p-values in multi-level models. Consider model selection (AIC or BIC, but careful there as well). Consider Markov chain Monte Carlo methods (MCMC) for confidence intervals.
See package lmerTest, and pbkrtest for Kenward-Roger approximation, package afex, and car
- Using lme4
library(lme4) library(pbkrtest)fit1.lmer <- lmer(color.score ~ 1 + (1 | aquarium), data=fish) fit2.lmer <- lmer(color.score ~ nutrient + (1 | aquarium), data=fish) KRmodcomp(fit1.lmer, fit2.lmer)boundary (singular) fit: see help('isSingular') large : color.score ~ nutrient + (1 | aquarium) small : color.score ~ 1 + (1 | aquarium) stat ndf ddf F.scaling p.value Ftest 67.198 1.000 8.000 1 3.662e-05 *** --- Signif. codes: 0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 - Mixed models in R
- Pinheiro, J. C. & Bates, D. Mixed-Effects Models in S and S-Plus Springer-Verlag, New York, 2000.
- Zuur, A.F., Ieno, E.N., Walker, N., Saveliev, A.A., and Smith, G.M. Mixed Effects Models and Extensions in Ecology with R. Springer. 2009.
27. Perception and Color
27.1. Graphics and color use
- Future topics
- Markup languages
- Version control (git) - could spend multiple lectures.
- ? emacs text editor / org mode
- ? Survey of some domain specific R packages, PCA, time series, phylogenies
- ? Many models in R
- Exploratory vs communication graphics
- Exploratory graphics
Are for you (not others). Need to be able to create rapidly because your first attempt will never be the most revealing. Iteration is crucial for developing the best display of your data.
- Communication graphics
When you communicate your findings, you need to spend a lot of time polishing your graphics to eliminate distractions and focus on the story. Iteration is crucial to ensure all the the small stuff works well: labels, color choices, tick marks, etc
- Note  B_note
Check out Kieran Healey's book, Data Visualization
- Exploratory graphics
27.2. Color
- What are the components of color?
RGB:
- RGB triplets
Notation RGB triplet Arithmetic (1.0, 0.0, 0.0) 8-bit per channel (255, 0, 0) 8-bit as hexidecimal #FF0000 - Other color models?
- Luminance / Lightness
- HCL
- Benefit
- Perceptually uniform
- Hue is unordered. Use evenly spaced hues with equal chroma and luminance to make aesthetically pleasing discrete palettes
- Chroma and luminance are ordered. Easy to make perceptually uniform gradients by varying either (or both).
- See https://colorspace.r-forge.r-project.org/reference/hcl_palettes.html
- Color blindness
- 7-10% of men are red-green color "blind."
- Solutions: avoid red-green contrasts; use redundant mappings; test.
- Color Oracle: http://colororacle.org
27.3. Colors in R
- Colors in R
Default in R is an hcl color scale. One can use scales in other packages (eg RColorBRewer) as well.
- Continuous scales
The default continuous scale in ggplot2 is a blue gradient, from low =
#132B43to high =#56B1F7which can be reproduced as:library(ggplot2) scale_color_gradient(low = "#132B43", high = "#56B1F7", space = "Lab") - Discrete scales
The default discrete scale in ggplot2 is a range of hues from hcl,
scale_color_hue(h = c(0, 360) + 15, c = 100, l = 65, h.start = 0, direction = 1)
- Continuous scales
- Discrete color scales
p = ggplot(iris, aes(Petal.Length, Sepal.Length, color = Species)) + geom_point() p p + scale_color_hue(l=40,c=30) # change luminance and chroma p + scale_color_hue(h = c(90,200), l = 50,c = 50) # change luminance and chroma p + scale_color_hue(h = c(90,200), l = 50,c = 50, direction = -1) - Results:
- Continuous scales
p <- ggplot(mpg, aes(hwy, cty, color = displ)) + geom_point(size = 3.5, alpha = 0.6, position = "jitter") p p + scale_colour_gradient(low = "tan", high = "red") - Results:
- Other color palettes
Cynthia Brewer's palettes for discrete scales http://colorbrewer2.org
These are optimized for maps
library(RColorBrewer) display.brewer.all() p <- qplot(carat, price, color = cut, data=diamonds) p + scale_color_brewer(palette = "Blues") p <- ggplot(mpg, aes(reorder(class, -cty), cty, color = cut(displ, breaks = 4))) + geom_boxplot(position = "dodge") p + scale_color_brewer(palette="Dark2") - Result:
colorspacepackage
Can be used wioth base graphics, ggplot2 etc. See https://colorspace.r-forge.r-project.org/articles/colorspace.html
library(colorspace)- Note  B_note
The scales are named via the scheme
scale_<aesthetic>_<datatype>_<colorscale>(), where <aesthetic> is the name of the aesthetic (fill, color, colour), <datatype> is the type of the variable plotted (discrete or continuous) and<colorscale>sets the type of the color scale used (qualitative, sequential, diverging, divergingx).
- Note  B_note
colorspacepalettes with ggplot2
ggplot(iris, aes(x = Sepal.Length, fill = Species)) + geom_density(alpha = 0.6) + scale_fill_discrete_qualitative(palette = "Dark 3")
- Example of sequential palette
dsamp <- diamonds[1 + 1:1000 * 50, ] ggplot(dsamp, aes(carat, price, color = cut)) + geom_point() + scale_color_discrete_sequential(palette = "Purples 3", nmax = 6, order = 2:6)
27.4. Perception and comparisons
- It is easier to make nearby comparisons
- Beware of visual illusions
- Boxplots and relatives
- My recommendations
For comparing univariate responses across discrete groups. According to data set size
- small: dotplots or jittered dotplots (a special challenge are small, discrete data sets, where there are lots of repeated values)
- moderate: boxplots
- moderate to large: violin plots, density strips, beanplots, box-percentile plots etc. etc.
28. Version Control
28.1. Setup
- The Bash Shell
Bash is a commonly-used shell that gives you the power to do simple tasks more quickly.
- Windows
Install Git for Windows by download and running the installer this will provide you with both Git and Bash in the Git Bash program.
- Mac
The default shell in all versions of Mac OS X is bash, so no need to install anything. You access bash from the Terminal (found in
/Applications/Utilities). You may want to keep Terminal in your dock for this workshop. - Linux
The default shell is usually
bash, Just open a terminal. There is no need to install anything.
- Windows
- Git
- The shell
- What is the shell?
The shell a command line interface which allows you to control your computer.
Tasks are accomplished by entering commands with a keyboard (or by running a plain text script) instead of controlling a graphical user interface.
The most popular shell is Bash ("Bourne Again SHell")
- What is a terminal?
A terminal is a program you run that gives you access to the shell. There are many different terminal programs that vary across operating systems.
- What is the shell?
- Using the shell
Open a terminal (bash)
ls echo "hello" echo "hello" > newfile.txt cat newfile.txt - Full vs relative paths
pwd cd ~ pwd - Shell command options and arguments
ls -l ls -al man ls
28.2. Version control
- Why use version control?
- You already use some sort of version control
- File naming schemes (eg
my_file_April18-2023.docx, better asmy_file_20230418.docx) or by copying folders around - Simple but error-prone
- Does not help with branching, collaboration
- File naming schemes (eg
- A version control system (VCS) allows you to:
- revert files back to a previous state
- revert the entire project back to a previous state
- review changes made over time
- see who last modified something that might be causing a problem, who introduced an issue and when
- bisect the history to discover what change caused the error
- collaborate safely on a project without destroying anyone's work.
- You already use some sort of version control
- Local version control systems (VCS)
- Centralized VCS
- Distributed VCS
28.3. git
- git
- A distributed VCS
- Written first by Linus Torvalds specifically for managing the linux kernel project.
- Has become the standard VCS for many software projects, especially open-source projects
- Many academics/scientists use git in part because of the popularity of the hosting site, github.com
- Configuring
git config --global user.name "Dylan Schwilk" git config --global user.email "dylan@schwilk.org" - Making a "repo" (repository)
cd ~ mkdir newrepo cd newrepo git init touch README.md ls .git -al - What happened?
Hidden directory called
.git. ├── .git │ ├── branches │ ├── config │ ├── description │ ├── HEAD │ ├── hooks │ │ ├── applypatch-msg.sample │ │ ├── commit-msg.sample │ │ ├── post-update.sample │ │ ├── pre-applypatch.sample │ │ ├── pre-commit.sample │ │ ├── prepare-commit-msg.sample │ │ ├── pre-push.sample │ │ ├── pre-rebase.sample │ │ └── update.sample │ ├── info │ │ └── exclude │ ├── objects │ │ ├── info │ │ └── pack │ └── refs │ ├── heads │ └── tags └── README.md
- Checking status
git statusOn branch master Initial commit Untracked files: (use "git add <file>..." to include in what will be committed) README.md nothing added to commit but untracked files present (use "git add" to track) - git commands
- staging files
git add . git statusOn branch master Initial commit Changes to be committed: (use "git rm --cached <file>..." to unstage) new file: README.md - Committing changes
git commit -m "initial commit" git statusOn branch master nothing to commit, working directory clean
- git commit
A commit is a snapshot taken from the index (staging area) not from the working directory!
- What exists now?
- Cloning repo that already exists
git clone git@github.com:schwilklab/trait-flam.git cd trait-flam lsdata ms README.md results scripts
- A basic workflow
- Edit files
- Stage the changes (git add)
- Review the changes (git diff)
- Commit the changes (git commit)
- git log
git log --pretty=oneline --abbrev-commit -10d2ec17b Reduce markup lecture e4d405f Add folder for tangled session files bb2d29e Add shell (Bash) examples and some data 261af83 Clean up birdd files e289d7e Minor edit f2820d3 Scatterplot fig in birdd example 75a4be6 Add slides on Bash shell 66c9a7f Add etherpad link for workshop e23b559 Suggest getting github.com account 7e6a6df Add time and date for workshop
- git diff
git diff- difference between working directory and index (staging area)
git diff --staged- difference between index and last commit
With right options, diff can show changes between any two commits, any two files, etc.
- git diff example
Try:
git diff HEAD~1 - Diffs on GitHub
- Practice
- edit file
- stage
- commit
28.4. Working with branches
- Branches
A branch represents an independent line of development.
- Show and create branches
- Using branches
- Switch branches
git checkout <existing-branch>
- Merge
Merging is Git's way of putting a branched history back together again.
- Fast-forward merge
- Three-way merge
- Example fast-foward merge
# Start a new feature git checkout -b new-feature master # Edit some files git add <file> git commit -m "Start a feature" # Edit some files git add <file> git commit -m "Finish a feature" # Merge in the new-feature branch git checkout master git merge new-feature git branch -d new-feature - Merge vs rebase
28.5. Working with remotes
- What is a "remote"
- a version of the repository that hosted somewhere on the internet.
- Remotes have a url (https or ssh), but the can also have names.
- It is common to have only one remote at it is named "origin"
- GitHub
Many foilks use GitHub for hosting. If you do, most of of your local repos will be associated with a remote named "origin" and a url something like
git@github.com:schwilklab/CSC-sky-island-forest.git - Working with remotes
When you clone from a remote, git remembers and calls that remote "origin". You can look at .git/config or, better use git commands to view:
git remote -vorigin git@github.com:schwilklab/R-research-tool.git (fetch) origin git@github.com:schwilklab/R-research-tool.git (push)
- Adding a remote
If you need to add a remote to an existing repository:
git remote add rem2 git@github.com:dschwilk/software-tools-course.git - Fetching and pulling
- git push
git push <remote> <branch name> eg: git push origin master or just: git push
This works if no one else has pushed changes to that remote since you last pulled. If there is an error, you will need to pull (fetch and merge) first.
- git and GitHub
- "forking": creating a repo hosted on github that was cloned from someone else's repo.
- Your local version will then have your "fork" as the remote named "origin" and the original repo will be a remote called "upstream"
- This is overkill for most small groups but essential for open source projects with many contributors
see