Homework 11: Data Reshaping


For this homework, you will tidy up data. I have supplied the csv file tb-simple-2016.csv. This is data on tuberculosis cases downloaded from the World Health Organization: http://www.who.int/tb/country/data/download/en/index.html. The data is the exact data you can download as "case notifications" except that I have deleted a large number of columns so that the data just contains notifications of new positive pulmonary smears. You will want to look at the data dictionary csv file linked to at the above url in order to understand what the column headings mean.

Tidy the data

You will notice that this is not tidy data. In fact, there are id variable values encoded in column names. In order to conduct analyses, it would be much easier to have the data in a long format. There are some obvious id variables in this data, country, iso2 (a country code), fwhoregion (geographic region codes) and year. But this data really has more id variables: sex and age range. Note that new cases are broken up by sex and age range as separate columns in the format new_sp_[SEX]_[AGE RANGE]. So, the new_sp_f3544 contains cases for females 35–44 and new_sp_fu is females of unknown age.

Your first task is to deal with the column new_sp. This appears to hold the total number of new cases, so this should be redundant data that we could throw away, but it turns out that, in many years, the sum of the other new_sp_xxxx columns does not equal this value. So clearly there are some cases that are not reported by sex or age group. I decided to make a new variable that has the same form as the others but uses "u" for sex and age group (for "unknown"). I made a new column that represented just the uncategorized cases (the value in new_sp minus the sum of the columns 6-26). Here is what I did:

tb <- read.csv("http://r-research-tool.schwilk.org/data/tb-simple-2018.csv") 

# get just the unidentified to sex and age cases per row:
tb <- tb %>% mutate(new_sp_uu = new_sp - apply(.[, 6:25], 1, sum, na.rm=TRUE))
tb <- tb %>% mutate(total_cases = apply(.[, 6:26], 1, sum, na.rm=TRUE))

# now remove the unneeded column new_sp
tb$new_sp <- NULL

Now, all of the columns containing values (number of cases that can be detected by pulmonary smears) are named consistently. You will need to reshape this data into a long format and to turn all of those columns into just two id variables (sex and age) and one value variable.

Provide the R code to create a cleaned up long-format data frame with nicely named columns. You will need to use the gather() function from the tidyr package. To extract the id variables from the column headings (which will become values in the variable column) you will need to write a text extraction function.

TB cases by country

I'd like you to prepare a graph showing total reported cases per year over time for the Americas (g_whoregion = "AMR"=).

Back to top | E-mail Schwilk