library(tidyverse)
library(scales)
lifeCourse <- read_csv("https://soc114.github.io/data/lifeCourse.csv")
Problem Set 1: Visualization
Due: 5pm on Friday, January 17.
Want to see how you’ll be evaluated? Check out the rubric
Student identifier: [type your anonymous identifier here]
- Use this pset1.qmd to complete the problem set
- In Canvas, you will upload the PDF produced by your .qmd file
- Put your identifier above, not your name! We want anonymous grading to be possible
This problem set involves both data analysis and reading.
Data analysis
This problem set uses the data lifeCourse.csv.
The data contain life course earnings profiles for four cohorts of American workers: those born in 1940, 1950, 1960, and 1970. Each row contains a summary of the annual earnings distribution for a particular birth cohort at a particular age, among the subgroup with a particular level of education. To prepare these data, we aggregated microdata from the Current Population Survey, provided through the Integrated Public Use Microdata Series.
The data contain five variables.
- quantity is the metric by which the earnings distribution is summarized: 10th, 50th, or 90th percentile
- education is the educational subgroup being summarized: College Degree, Less than College
- cohort is the cohort (people with a given birth year) to which these data apply: 1940, 1950, 1960, 1970
- age is the age at which earnings were measured: 30–45
- income is the value for the given earnings percentile in the given subgroup. Income values are provided in 2022 dollars
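To make the row structure concrete, here is a minimal sketch that builds a toy table with the same five column names and checks that each combination of quantity, education, cohort, and age appears exactly once. The object names toy and check, the exact text of the quantity labels, and the placeholder income values are assumptions for illustration, not the contents of lifeCourse.csv.

```r
library(tidyverse)

# Toy stand-in with the same five columns as the problem-set data.
# The quantity labels are assumed spellings, and income is a placeholder
# (not a real estimate).
toy <- expand_grid(
  quantity = c("10th percentile", "50th percentile", "90th percentile"),
  education = c("College Degree", "Less than College"),
  cohort = c(1940, 1950, 1960, 1970),
  age = 30:45
) |>
  mutate(income = NA_real_)

# Count the rows in each cell defined by the four identifying variables.
# A maximum of 1 means each row is a distinct summary statistic.
check <- toy |>
  count(quantity, education, cohort, age) |>
  summarize(max_rows_per_cell = max(n))

check
```

Running the same count() call on the real lifeCourse object is one way to confirm that it follows this structure before plotting.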
1. Visualize (40 points)
Functions named in this problem are links to helper pages that provide documentation.
Use ggplot to visualize these data. To denote the different trajectories,
- make your plot using geom_point() or geom_line()
- use the x-axis for age
- use the y-axis for income
- use color for quantity
- use facet_grid to make a panel of facets where each row is an education value and each column is a cohort value
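As a rough illustration of how facet_grid() arranges panels, the sketch below uses the mpg data that ships with ggplot2 rather than the problem-set data. The variable choices (drv for rows, year for columns) and the object name p_facets are only for demonstration.

```r
library(ggplot2)

# In facet_grid(), the variable left of ~ defines the rows of panels
# and the variable right of ~ defines the columns.
p_facets <- ggplot(mpg, aes(x = displ, y = hwy, color = class)) +
  geom_point() +
  facet_grid(drv ~ year)

p_facets
```

The same rows ~ columns pattern applies to any pair of categorical variables in your own data.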
You should prepare the graph as though you were going to publish it. Modify the axis titles so that a reader would know what is on the axis. Use appropriate capitalization in all labels. Optionally, try using the label_currency() function from the scales package so that the y-axis uses dollar values.
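For reference, here is a small sketch of how label_currency() can be passed to a y-axis scale. The toy data frame, its column names, and the axis titles are hypothetical and unrelated to lifeCourse.csv.

```r
library(ggplot2)
library(scales)

# Hypothetical toy data, made up for illustration only
toy_prices <- data.frame(
  year = 2001:2005,
  price = c(12000, 15500, 18250, 21000, 26750)
)

# label_currency() returns a labeling function, which
# scale_y_continuous() applies to the axis breaks
p_currency <- ggplot(toy_prices, aes(x = year, y = price)) +
  geom_line() +
  scale_x_continuous(name = "Year") +
  scale_y_continuous(
    name = "Price (Dollars)",
    labels = label_currency()
  )

# The labeler can also be called directly on a numeric vector
dollar_labels <- label_currency()(c(12000, 26750))
```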
Your code should be well-formatted as defined by R4DS. In your produced PDF, no lines of code should run off the page.
Many different graphs can be equally correct. You will be evaluated on
- publication-ready graph aesthetics
- code that follows style conventions
# your code goes here
2. Understand components of research question
The graph above answered a descriptive research question. Here are a few components of that question:
- A. Annual income in 2022 dollars
- B. American workers in each subgroup defined by birth cohort, age, and education
- C. 10th, 50th, and 90th percentile
- D. A person
For 2.1–2.4, write the letter of the corresponding component of the research question.
2.1 (2.5 points) What is the unit of analysis?
2.2 (2.5 points) What is the outcome variable?
2.3 (2.5 points) What are the target populations?
2.4 (2.5 points) What are the summary statistics?
Recap and connections to your project
The visualization you produced in this problem set is an example of the kind of visualization you will produce in the final project. As in this problem set, your visualization in the project should include well-written labels to make the graph readable. A difference between the two is that in this problem set we provided a dataset in which each row was already a summary statistic. In the final project, you will manipulate data in which each row corresponds to a unit of analysis that you aggregate to produce a summary statistic. This is a skill you will practice on the next problem set.
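As a preview of that kind of aggregation, the following sketch starts from hypothetical person-level rows and collapses them to one summary statistic per subgroup. The data values, column names, and the choice of the median are assumptions for illustration, not part of the course data.

```r
library(dplyr)

# Hypothetical person-level data: each row is one person (the unit of analysis).
# Values are made up for illustration only.
people <- tibble(
  education = c(
    "College Degree", "College Degree", "College Degree",
    "Less than College", "Less than College", "Less than College"
  ),
  income = c(40000, 60000, 90000, 20000, 35000, 50000)
)

# Aggregate to one row per subgroup, where each row is a summary statistic
summary_by_group <- people |>
  group_by(education) |>
  summarize(median_income = median(income), .groups = "drop")

summary_by_group
```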