library(tidyverse)
library(scales)
library(haven)
Problem Set 2: Data Transformation
Due: 5pm on Friday, Jan 31.
Want to see how you’ll be evaluated? Check out the rubric
Student identifer: [type your anonymous identifier here]
- Use this pset2.qmd template to complete the problem set
- In Canvas, you will upload the PDF produced by your .qmd file
- Put your identifier above, not your name! We want anonymous grading to be possible
This problem set draws on the following paper.
England, Paula, Andrew Levine, and Emma Mishel. 2020. Progress toward gender equality in the United States has slowed or stalled, PNAS 117(13):6990–6997.
A note about sex and gender. Sex typically refers to categories assigned at birth (e.g., female, male). Gender is a performed construct with many possible values: man, woman, nonbinary, etc. The measure in the CPS-ASEC is “sex,” coded male or female. We will use these data to study sex disparities between those identifying as male and female. The paper at times uses “gender” to refer to this construct.
1. Data analysis: Existing question
25 points. Reproduce Figure 1 from the paper.
Visit cps.ipums.org to download data. Include these variables in your cart: sex, age, asecwt, empstat. One way to find them is to click “Select Data” and then use the “Search” button under “Select Variables.”
Look ahead: you will later study a new outcome of your own choosing. You could add it to your cart now if you want.
Once you select your variables, click “Select Samples.” We will download data from the 1962–2023 March Annual Social and Economic Supplement. When choosing samples, you want all samples in the “ASEC” tab and none of the samples in the “Basic Monthly” tab. To achieve this, under the ASEC tab you can click “Select All Samples.” Under the Basic Monthly tab, you can click and then unclick “Select All Samples” to remove these samples that we will not use.
Once you have selected variables and selected samples, you can click “View Cart.” Optionally, reduce the size of your extract by un-checking any variables that have been automatically added which you won’t be using. Note that one automatically-added variable is year
, which you should keep because you will be using it. Then click “Create Data Extract.” On the next page, select cases to those ages 25–54. Before submitting your extract, we recommend changing the data format to “Stata (.dta)” so that you get value labels.
On your computer, analyze these data.
- filter to
asecwt > 0
(see paper footnote on p. 6995 about negative weights) - mutate to create an
employed
variable indicating thatempstat == 10 | empstat == 12
- mutate to convert
sex
to a factor variable usingas_factor
- group by
sex
andyear
- summarize the proportion employed: use
weighted.mean
to take the mean ofemployed
using the weightasecwt
Your figure will be close but not identical to the original. Yours will include some years that the original did not. Feel free to change aesthetics of the plot, such as the words used in labels.
2. A new outcome
25 points. The CPS-ASEC has numerous variables. Pick another variable of your choosing. Add it to your cart in IPUMS, and visualize how that variable has changed over time for those identifying as male and female.
As in the previous plot, year should be on the x-axis and color should represent sex. The y-axis is up to you. You can examine something like median income, proportion holding college degrees, or the 90th percentile of usual weekly work hours. You can restrict to some subset if you want, such as those who are employed.
Your answer should include
- a written statement of what you estimated: the variable you chose, any sample restrictions you made, and how you summarized that variable
- a written interpretation of what you found
- code following style conventions
- your publication-quality visualization
Recap and connections to your project
A few skills relevant to the project were practiced in this problem set. In your project, you should clearly define your unit of analysis and target population. You should use sample weights if appropriate. And you should write readable, well-organized code that carries out your data transformation and visualization.