Problem Set 1: Visualization

Due: 5pm on Friday, January 17.

Note

Want to see how you’ll be evaluated? Check out the rubric

Student identifer: [type your anonymous identifier here]

This problem set involves both data analysis and reading.

Data analysis

This problem set uses the data lifeCourse.csv.

library(tidyverse)
library(scales)
lifeCourse <- read_csv("https://soc114.github.io/data/lifeCourse.csv")

The data contain life course earnings profiles for four cohorts of American workers: those born in 1940, 1950, 1960, and 1970. Each row contains a summary of the annual earnings distribution for a particular birth cohort at a particular age, among the subgroup with a particular level of education. To prepare these data, we aggregated microdata from the Current Population Survey, provided through the Integrated Public Use Microdata Series.

The data contain five variables.

  1. quantity is the metric by which the earnings distribution is summarized: 10th, 50th, or 90th percentile
  2. education is the educational subgroup being summarized: College Degree, Less than College
  3. cohort is the cohort (people with a given birth year) to which these data apply: 1940, 1950, 1960, 1970
  4. age is the age at which earnings were measured: 30–45
  5. income is the value for the given earnings percentile in the given subgroup. Income values are provided in 2022 dollars

1. Visualize (40 points)

Tip

Functions named in this problem are links to helper pages that provide documentation.

Use ggplot to visualize these data. To denote the different trajectories,

  • make your plot using geom_point() or geom_line()
  • use the x-axis for age
  • use the y-axis for income
  • use color for quantity
  • use facet_grid to make a panel of facets where each row is an education value and each column is a cohort value

You should prepare the graph as though you were going to publish it. Modify the axis titles so that a reader would know what is on the axis. Use appropriate capitalization in all labels. Optionally, try using the label_currency() function from the scales package so that the y-axis uses dollar values.

Your code should be well-formatted as defined by R4DS. In your produced PDF, no lines of code should run off the page.

Many different graphs can be equally correct. You will be evaluated by

  • having publication-ready graph aesthetics
  • code that follows style conventions
# your code goes here

2. Understand components of research question

The graph above answered a descriptive research question. Here are a few components of that question:

  1. Annual income in 2022 dollars
  2. American workers in each subgroup defined by birth cohort, age, and education
  3. 10th, 50th, and 90th percentile
  4. A person

For 2.1–2.4, write the letter of the corresponding component of the research question.

2.1 (2.5 points) What is the unit of analysis?

2.2 (2.5 points) What is the outcome variable?

2.3 (2.5 points) What are the target populations?

2.4 (2.5 points) What are the summary statistics?

Recap and connections to your project

The visualization you produced in this problem set is an example of the kind of visualization you will produce in the final project. As in this problem set, your visualization in the project should include well-written labels to make the graph readable. A difference between the two is that in this problem set we provided a dataset in which each row was already a summary statistic. In the final project, you will manipulate data in which each row corresponds to a unit of analysis that you aggregate to produce a summary statistic. This is a skill you will practice on the next problem set.

Back to top