Problem Set 2: Data Transformation

Due: 5pm on Friday, Jan 31.

Note

Want to see how you’ll be evaluated? Check out the rubric

Student identifer: [type your anonymous identifier here]

This problem set draws on the following paper.

England, Paula, Andrew Levine, and Emma Mishel. 2020. Progress toward gender equality in the United States has slowed or stalled, PNAS 117(13):6990–6997.

A note about sex and gender. Sex typically refers to categories assigned at birth (e.g., female, male). Gender is a performed construct with many possible values: man, woman, nonbinary, etc. The measure in the CPS-ASEC is “sex,” coded male or female. We will use these data to study sex disparities between those identifying as male and female. The paper at times uses “gender” to refer to this construct.

1. Data analysis: Existing question

25 points. Reproduce Figure 1 from the paper.

Visit cps.ipums.org to download data. Include these variables in your cart: sex, age, asecwt, empstat. One way to find them is to click “Select Data” and then use the “Search” button under “Select Variables.”

Tip

Look ahead: you will later study a new outcome of your own choosing. You could add it to your cart now if you want.

Once you select your variables, click “Select Samples.” We will download data from the 1962–2023 March Annual Social and Economic Supplement. When choosing samples, you want all samples in the “ASEC” tab and none of the samples in the “Basic Monthly” tab. To achieve this, under the ASEC tab you can click “Select All Samples.” Under the Basic Monthly tab, you can click and then unclick “Select All Samples” to remove these samples that we will not use.

Once you have selected variables and selected samples, you can click “View Cart.” Optionally, reduce the size of your extract by un-checking any variables that have been automatically added which you won’t be using. Note that one automatically-added variable is year, which you should keep because you will be using it. Then click “Create Data Extract.” On the next page, select cases to those ages 25–54. Before submitting your extract, we recommend changing the data format to “Stata (.dta)” so that you get value labels.

On your computer, analyze these data.

  • filter to asecwt > 0 (see paper footnote on p. 6995 about negative weights)
  • mutate to create an employed variable indicating that empstat == 10 | empstat == 12
  • mutate to convert sex to a factor variable using as_factor
  • group by sex and year
  • summarize the proportion employed: use weighted.mean to take the mean of employed using the weight asecwt

Your figure will be close but not identical to the original. Yours will include some years that the original did not. Feel free to change aesthetics of the plot, such as the words used in labels.

library(tidyverse)
library(scales)
library(haven)

2. A new outcome

25 points. The CPS-ASEC has numerous variables. Pick another variable of your choosing. Add it to your cart in IPUMS, and visualize how that variable has changed over time for those identifying as male and female.

As in the previous plot, year should be on the x-axis and color should represent sex. The y-axis is up to you. You can examine something like median income, proportion holding college degrees, or the 90th percentile of usual weekly work hours. You can restrict to some subset if you want, such as those who are employed.

Your answer should include

  • a written statement of what you estimated: the variable you chose, any sample restrictions you made, and how you summarized that variable
  • a written interpretation of what you found
  • code following style conventions
  • your publication-quality visualization

Recap and connections to your project

A few skills relevant to the project were practiced in this problem set. In your project, you should clearly define your unit of analysis and target population. You should use sample weights if appropriate. And you should write readable, well-organized code that carries out your data transformation and visualization.

Back to top