Homework 10: Data Exploration

Please turn in one script with a section for each of the parts below.

US Baby Names

For this part of the homework, you will need the baby names data that I downloaded from the Social Security Administration and cleaned up for you (top-1000-baby-names.csv). This file has the top 1000 boys' and top 1000 girls' names for each year 1880–2015 and includes the percentage of all boys or girls that year with that name. I've supplied the script I used to read in and clean that data as hw10-clean-bnames.R. You don't need to run that script, because the result has already been saved, but you should read it and see what it does: it downloads the most recent data directly from the Social Security Administration website. You can run it yourself if you want to recreate the top 1000 names file.

Question 1: All boys end in 'n'

Investigate the pattern we saw in boys' names ending with the letter "n" increasing since 1950. Is this driven by one single name gaining popularity since 1960, or is it a consequence of a general increase in the popularity of names that end with "n"? Provide R code and a plot to illustrate your explanation.
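One way to separate the two explanations is to compare the total share of "n" names with the share that remains after removing the most popular "n" name in each year. The sketch below is only a starting point. It uses a tiny made-up data frame in place of top-1000-baby-names.csv, and the column names (year, sex, name, percent) are assumptions, so check them against the real file.

```r
# Toy stand-in for top-1000-baby-names.csv. The column names (year, sex, name,
# percent) and every value below are invented for illustration only.
bnames <- data.frame(
  year    = rep(c(1950, 1980, 2010), each = 4),
  sex     = "boy",
  name    = rep(c("John", "Brian", "Mason", "David"), times = 3),
  percent = c(4.0, 0.2, 0.0, 3.0,
              2.0, 1.5, 0.1, 2.5,
              1.0, 0.3, 1.2, 1.0),
  stringsAsFactors = FALSE
)

# Keep boys only, then keep names whose last letter is "n".
boys   <- bnames[bnames$sex == "boy", ]
ends_n <- boys[grepl("n$", boys$name), ]

# Total share of boys with an n-ending name, per year.
total_n <- aggregate(percent ~ year, data = ends_n, FUN = sum)

# Share of the single most popular n-ending name, per year.
top_n <- aggregate(percent ~ year, data = ends_n, FUN = max)

# Share left after removing that top name. If this also rises over time,
# the trend is not driven by one name alone.
trend <- merge(total_n, top_n, by = "year", suffixes = c("_all", "_top"))
trend$percent_rest <- trend$percent_all - trend$percent_top
print(trend)
```

Plotting percent_all and percent_rest against year on the same axes gives a quick visual check. Counting the number of distinct "n" names per year is another useful angle.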

Question 2: Names similar to yours

Show me the popularity over time of names that are similar to yours. First you must define similarity. You could use a simple regular expression, soundex (see the R script for the dplyr lecture for an implementation), or some other method. Explain how you defined similarity and why. Illustrate and discuss any pattern over time that you uncover.
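As an illustration of the regular expression approach, here is a minimal sketch for a hypothetical student named Katherine. The pattern is just one possible definition of "similar", chosen here to catch spelling variants. It is an assumption for the example, not the expected answer.

```r
# Hypothetical example: treat spelling variants of "Katherine" as similar.
# The pattern encodes one possible definition of similarity:
#   C or K, "at", optional "h", optional "a"/"e", "r", "i" or "y", "n",
#   and an optional final "a"/"e".
pattern <- "^[CK]ath?[ae]?r[iy]n[ae]?$"

# A few candidate names to test the pattern against.
candidates <- c("Katherine", "Catherine", "Kathryn",
                "Katrina", "Kathleen", "Karen")

# grepl() returns TRUE for each name that matches the pattern.
similar <- candidates[grepl(pattern, candidates)]
print(similar)
```

Note that this pattern also accepts "Katrina", which you may or may not consider similar. Checking the matches by eye like this is a good habit before using the pattern to subset the names data.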

Question 3: Old Testament names

Have biblical Old Testament names been increasing or decreasing in popularity?

I've given you a text file with all the names in the Old Testament (old-testament.txt). You can read this in as a character vector using scan:

oldt <- scan("http://r-research-tool.schwilk.org/assignments/old-testament.txt",
             what = character())  # what = character() makes scan return a character vector

What is the pattern for the popularity of Old Testament names over time? Does the pattern differ for boys and girls? Show me a plot or two to illustrate your answer. Also provide me with a table showing the top 20 Old Testament names in the whole data set (averaged over all years). If you created multiple plots to explore the question, guide me through your logic in your comments.
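The matching and averaging steps can be sketched roughly as follows. This uses toy stand-ins for both the names data and the oldt vector, with assumed column names (year, sex, name, percent) and invented values. It also assumes the two sources might capitalize names differently, and it counts a name as zero in any year where it is absent from the top 1000. Both are choices you should check and justify for the real data.

```r
# Toy stand-ins. All column names and values are invented for illustration.
bnames <- data.frame(
  year    = c(1900, 1900, 1900, 2000, 2000, 2000),
  sex     = c("boy", "boy", "girl", "boy", "girl", "girl"),
  name    = c("Noah", "Frank", "Ruth", "Noah", "Ruth", "Emily"),
  percent = c(0.1, 1.0, 0.8, 1.0, 0.2, 1.5),
  stringsAsFactors = FALSE
)
oldt <- c("NOAH", "RUTH", "ABEL")  # stand-in for the scanned old-testament.txt

# Compare case-insensitively in case the two sources capitalize differently.
is_ot <- tolower(bnames$name) %in% tolower(oldt)
ot    <- bnames[is_ot, ]

# Total share of babies with an Old Testament name, per year and sex.
by_year <- aggregate(percent ~ year + sex, data = ot, FUN = sum)

# Average over ALL years in the data, treating years where a name is
# missing from the top 1000 as zero.
n_years <- length(unique(bnames$year))
totals  <- aggregate(percent ~ name + sex, data = ot, FUN = sum)
totals$mean_percent <- totals$percent / n_years

# Sort from most to least popular and keep at most 20 rows.
top <- head(totals[order(-totals$mean_percent),
                   c("name", "sex", "mean_percent")], 20)
print(by_year)
print(top)
```

Dividing by the total number of years, rather than by the number of years in which a name appears, keeps a name that was briefly very popular from outranking one that was steadily common. Either choice is defensible as long as you say which one you made.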
