Problem Set 4: DAGs and Statistical Learning

Due: 5pm on Friday, Feb 28.

Student identifier: [type your anonymous identifier here]

The format of this problem set is different from the others.

This is because we are all busy with the final project! To give you time for the project, there will be no peer review on this problem set. To let the TAs focus on helping with the project, some grading will be done automatically via the BruinLearn quiz.

Here is how to do this problem set:

1. (30 points) DAGs

For 1.1–1.5, answer True or False: \(X\) is a sufficient adjustment set to identify the causal effect of \(A\) on \(Y\). Recall that as you work on these problems, a good strategy is to first list all non-causal paths between \(A\) and \(Y\) and then cross out any that are blocked when conditioning on \(X\).
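The path-listing strategy above can also be checked by software. The following is a minimal sketch that assumes the dagitty package is installed. The DAG in it is a hypothetical three-node example, not one of the DAGs in 1.1–1.5.

```r
library(dagitty)

# Hypothetical DAG for illustration only: X confounds the effect of A on Y
g <- dagitty("dag { X -> A ; X -> Y ; A -> Y }")

# List every path between A and Y and whether it is open when conditioning on X
p <- paths(g, from = "A", to = "Y", Z = "X")
data.frame(path = p$paths, open_given_X = p$open)

# Ask directly whether X is a valid adjustment set for the effect of A on Y
x_is_sufficient <- isAdjustmentSet(g, "X", exposure = "A", outcome = "Y")
x_is_sufficient
```

In this toy DAG, the only non-causal path runs through X, so conditioning on X blocks it and leaves the causal path A -> Y open.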

1.1. [answer here]

1.2. [answer here]

1.3. [answer here]

1.4. [answer here]

1.5. [answer here]

2. Causal inference with statistical modeling

The paragraphs below introduce this part of the problem set. Your work begins at “Prepare your data.”

How does parenthood affect labor market outcomes? For an outcome \(Y\) such as employment, we can imagine that each person \(i\) has a potential outcome as a parent, \(Y_i^1\), and a potential outcome as a non-parent, \(Y_i^0\). Parenthood causally shapes an outcome like employment to the degree that these differ.

The effect of parenthood on labor market outcomes has been the subject of extensive social science research which has revealed a consistent finding: parenthood may improve men’s labor market outcomes while harming women’s labor market outcomes (e.g., Waldfogel 1998, Budig & England 2001, Correll et al. 2007). The disparate effects of parenthood for men and women are thus one source of gender disparities in labor market outcomes.

This problem set estimates the causal effect of motherhood on mothers’ employment, using data simulated to approximate data that exist in the National Longitudinal Survey of Youth 1997 cohort. The NLSY97 interviews people repeatedly across years. We manipulated these data so that each row contains information from a pre- and a post-observation, separated by 21+ months. In the pre-observation, we measure confounding variables. In the post-observation, we measure the outcome (y, employment). Between the pre- and post-observation, some women experience a first birth (treated == TRUE) and others do not (treated == FALSE).

The dataset motherhood_simulated.csv contains the following variables.

  • observation_id is an index for each observation
  • sampling_weight is the weight due to unequal probability sampling
  • treated indicates a first birth (TRUE or FALSE)
    • This occurred between the pre- and post-periods.
  • y is the outcome, coded TRUE if employed or FALSE if not employed.
    • This was measured in the post-period.

The data also include a set of variables measured in the pre-period. We will consider these to be a sufficient adjustment set.

  • race is a categorical variable coded Hispanic, Non-Hispanic Black, and Non-Hispanic Non-Black
  • pre_age is age in years
  • pre_educ is an ordinal variable for educational attainment, coded Less than high school, High school, 2-year degree, and 4-year degree, with higher levels of education also coded in this last category
  • pre_marital is a categorical variable of marital status, coded no_partner, cohabiting, or married
  • pre_employed is a lagged measure of employment in the prior survey wave, coded TRUE and FALSE
  • pre_fulltime indicates full-time employment in the prior survey wave, coded TRUE and FALSE
  • pre_tenure is years of experience with a current employer, as of the prior survey wave
  • pre_experience is total years of full-time work experience, as of the prior survey wave
library(tidyverse)
motherhood_simulated <- read_csv("https://soc114.github.io/data/motherhood_simulated.csv")

Prepare your data

Filter to create two data objects: one with mothers who have treated == TRUE and one with non-mothers who have treated == FALSE.

# your code here
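As an illustration of the filtering step, here is a minimal sketch on a small made-up data frame. The values and the object names toy, toy_treated, and toy_untreated are hypothetical; only the column name treated comes from the dataset description above.

```r
library(dplyr)

# Made-up toy data for illustration; not the problem set data
toy <- data.frame(
  observation_id = 1:6,
  treated = c(TRUE, FALSE, FALSE, TRUE, FALSE, TRUE),
  y = c(FALSE, TRUE, TRUE, TRUE, FALSE, FALSE)
)

# Keep rows where treated is TRUE
toy_treated <- toy %>% filter(treated)

# Keep rows where treated is FALSE
toy_untreated <- toy %>% filter(!treated)

nrow(toy_treated)
nrow(toy_untreated)
```

Because treated is already logical, filter(treated) and filter(treated == TRUE) select the same rows.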

Estimate by linear model predictions

Among non-mothers, model the probability of employment with a linear model. As predictors, use an additive function of the sufficient adjustment set.

Hints:

  • Use the lm() function.
  • Use this model formula: y ~ race + pre_age + pre_educ + pre_marital + pre_employed + pre_fulltime + pre_tenure + pre_experience
  • For the data argument, use your data containing non-mothers.
  • Use the argument weights = sampling_weight to weight the model by the sampling_weight variable.
# your code here
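To show the general shape of a weighted linear probability model, here is a minimal sketch on simulated toy data. The predictors x_cat and x_num, the weight w, and the simulated values are all hypothetical stand-ins, so this is not the model formula for the problem set.

```r
set.seed(114)
n <- 200

# Simulated toy data for illustration; not the problem set data
toy <- data.frame(
  x_cat = sample(c("a", "b", "c"), n, replace = TRUE),  # a categorical predictor
  x_num = rnorm(n),                                     # a numeric predictor
  w = runif(n, 0.5, 2)                                  # a stand-in sampling weight
)
# Binary outcome stored as TRUE / FALSE
toy$y <- rbinom(n, 1, plogis(0.3 + 0.5 * toy$x_num)) == 1

# Additive linear model for a TRUE / FALSE outcome, weighted by w.
# lm() treats TRUE as 1 and FALSE as 0, so fitted values estimate probabilities.
fit <- lm(y ~ x_cat + x_num, data = toy, weights = w)
coef(fit)
```

The weights argument is evaluated inside data, so the weight column can be given as a bare column name.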

2.1. (10 points) Report a predicted value

Using your model estimated among non-mothers, make predictions of \(\hat{Y}^0\) among mothers. Report the predicted value for the first mother in the data.

Hints:

  • Use predict() to make predictions.
  • We suggest you store the predictions in a new variable in your dataset using mutate().
  • To see the first predicted value in your predicted data, one strategy is to use select() to keep only the variable you’ve created that contains your predicted value.
# your code here
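The pattern of fitting a model on one group and predicting for another can be sketched on simulated toy data as follows. The names toy, x, w, fit0, and yhat0 and the simulated values are hypothetical.

```r
library(dplyr)
set.seed(114)
n <- 300

# Simulated toy data for illustration; not the problem set data
toy <- data.frame(
  treated = rep(c(TRUE, FALSE), times = c(100, 200)),
  x = rnorm(n),
  w = runif(n, 0.5, 2)
)
toy$y <- rbinom(n, 1, plogis(0.5 * toy$x - 0.4 * toy$treated)) == 1

# Fit the outcome model among untreated units only
fit0 <- lm(y ~ x, data = filter(toy, !treated), weights = w)

# Predict the untreated outcome for each treated unit and store it as a new column
toy_treated <- filter(toy, treated)
toy_treated <- toy_treated %>%
  mutate(yhat0 = predict(fit0, newdata = toy_treated))

# Look at the first predicted value
toy_treated %>%
  select(yhat0) %>%
  head(1)
```

Predictions from a linear model are not constrained to lie between 0 and 1, which is worth keeping in mind when reading them as probabilities.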

2.2. (10 points) Report an ATT estimate

Across mothers, estimate the Average Treatment Effect on the Treated (ATT) by the weighted mean difference between \(Y\) (observed) and \(\hat{Y}^0\) (predicted from linear regression), weighted by sampling weights.

  • For each mother, take the difference between the observed outcome y and the probability of employment that you predict for her in the absence of motherhood.
  • Then take the weighted mean across mothers weighted by the sampling weight.
  • Report this weighted mean.
# your code here
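The weighted mean difference described above can be sketched on a tiny made-up data frame. The column names yhat0 and w and all values are hypothetical; in the problem set, the predictions and weights come from your own earlier steps.

```r
library(dplyr)

# Made-up toy data for treated units only; not the problem set data
toy_treated <- data.frame(
  y = c(TRUE, FALSE, TRUE, TRUE),   # observed outcome under treatment
  yhat0 = c(0.8, 0.5, 0.6, 0.75),   # predicted outcome without treatment
  w = c(1, 2, 1, 4)                 # stand-in sampling weight
)

# Weighted mean of (observed - predicted) across treated units
att <- toy_treated %>%
  summarize(att = weighted.mean(y - yhat0, w = w)) %>%
  pull(att)
att
```

Here the differences are 0.2, -0.5, 0.4, and 0.25, so the weighted mean is (0.2 - 1.0 + 0.4 + 1.0) / 8 = 0.075.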