Problem Set 4: DAGs and Statistical Learning

Due: 5pm on Friday, Feb 28.

Student identifier: [type your anonymous identifier here]

The format of this problem set is different from the others.

the assignment is a quiz in BruinLearn
you will upload your PDF in that quiz
you will also enter answer values in that quiz

The reason for this is that we are all busy with the final project! So you can have time for the project, there will be no peer review on this problem set. So the TAs can focus on helping with the project, some grading will be done automatically via the BruinLearn quiz.

Here is how to do this problem set:

Use this pset4.qmd to complete the problem set.
When you are finished, complete the quiz on BruinLearn
- you will upload your PDF there
- you will type some answers from your PDF there

1. (30 points) DAGs

For 1.1–1.5, answer True or False: \(X\) is a sufficient adjustment set to identify the causal effect of \(A\) on \(Y\). Recall that as you work on these problems, a good strategy is to first list all non-causal paths between \(A\) and \(Y\) and then cross out any that are blocked when conditioning on \(X\).

1.1. [answer here]

1.2. [answer here]

1.3. [answer here]

1.4. [answer here]

1.5. [answer here]

2. Causal inference with statistical modeling

The paragraphs below introduce this part of the problem set. Your work begins at “Prepare your data.”

How does parenthood affect labor market outcomes? For an outcome \(Y\) such as employment, we can imagine that each person \(i\) has a potential outcome as a parent \(Y_i^1\) and a potential outcome as a non-parent, \(Y_i^0\). Parenthood casually shapes an outcome like employment to the degree that these differ.

The effect of parenthood on labor market outcomes has been the subject of extensive social science research which has revealed a consistent finding: parenthood may improve men’s labor market outcomes while harming women’s labor market outcomes (e.g., Waldfogel 1998, Budig & England 2001, Correll et al. 2007). The disparate effects of parenthood for men and women are thus one source of gender disparities in labor market outcomes.

This problem set estimates the causal effect of motherhood on mothers’ employment, using data simulated to approximate data that exist in the National Longitudinal Survey of Youth 1997 cohort. The NLSY97 interviews people repeatedly across years. We manipulated these data so that each row contains information from a pre- and a post-observation, separated by 21+ months. In the pre-observation, we measure confounding variables. In the post-observation, we measure the outcome (y, employment). Between the pre- and post-observation, some women experience a first birth (treated == TRUE) and others do not (treated == FALSE).

The dataset motherhood_simulated.csv contains the following variables.

observation_id is an index for each observation
sampling_weight is the weight due to unequal probability sampling
treated indicates a first birth (TRUE or FALSE)
- This occurred between the pre- and post-periods.
y is the outcome, coded TRUE if employed or FALSE if not employed.
- This was measured in the post-period.

The data include a set of variables measured in the pre-period. We will consider these to be a sufficient adjustment set. These were measured in the pre-period.

race is a categorical variable coded Hispanic, Non-Hispanic Black, and Non-Hispanic Non-Black
pre_age is age in years
pre_educ is an ordinal variable for educational attainment, coded Less than high school, High school, 2-year degree, and 4-year degree with those with higher levels of education also coded in this last category
pre_marital is a categorical variable of marital status, coded no_partner, cohabiting, or married
pre_employed is a lag measure of employment in the prior survey wave, coded TRUE and FALSE
pre_fulltime indicates full-time employment in the prior survey wave, coded TRUE and FALSE
pre_tenure is years of experience with a current employer, as of the prior survey wave
pre_experience is total years of full-time work experience, as of the prior survey wave

library(tidyverse)
motherhood_simulated <- read_csv("https://soc114.github.io/data/motherhood_simulated.csv")

Prepare your data

Filter to create two data objects: one with mothers who have treated == TRUE and one with non-mothers who have treated == FALSE.

# your code here

Estimate by linear model predictions

Among non-mothers, model the probability of employment with a linear model. As predictors, use an additive function of the sufficient adjustment set.

Hints:

Use the lm() function.
use this model formula: y ~ race + pre_age + pre_educ + pre_marital + pre_employed + pre_fulltime + pre_tenure + pre_experience
for the data argument, use your data containing non-mothers.
you will need the argument weights = sampling_weight to specify to weight the model by the sampling_weight variable

# your code here

2.1. (10 points) Report a predicted value

Using your model estimated among non-mothers, make predictions of \(\hat{Y}^0\) among mothers. Report the predicted value for the first mother in the data.

Hints:

Use predict() to make predictions.
We suggest your store the variables in a new variable in your dataset using mutate().
To see the first predicted value in your predicted data, one strategy is to use select() to keep only the variable you’ve created that contains your predicted value.

# your code here

2.2. (10 points) Report an ATT estimate

Across mothers, estimate the Average Treatment Effect on the Treated (ATT) by the weighted mean difference between \(Y\) (observed) and \(\hat{Y}^0\) (predicted from linear regression), weighted by sampling weights.

For each mother, take the difference between the observed outcome y and the probability of employment that you predict for her in the absence of motherhood.
Then take the weighted mean across mothers weighted by the sampling weight.
Report this weighted mean.

# your code here