Problem Set 9: Outcome Models for Causal Inference

This problem set uses the same data as Problem Set 8. To learn about the data, see that page.

This problem set is entirely coding. You will submit an .R script to a programming assignment in Gradescope. You should start by loading the data.

library(tidyverse)
motherhood_simulated <- read_csv("https://soc114.github.io/data/motherhood_simulated.csv")

Create factual and counterfactual datasets. Use the filter() function to create two data objects: one named mothers which contains all observations with treated == TRUE and one named nonmothers which contains all observations with treated == FALSE. If you need help with filter(), see R4DS 3.2.1.
Estimate a model. Use the lm() function to create a model for the probability of employment among non-mothers. Store your model in an object named model_among_nonmothers.

This model should:

be estimated using the lm() function.
use this model formula: y ~ race + pre_age + pre_educ + pre_marital + pre_employed + pre_fulltime + pre_tenure + pre_experience
use your data containing non-mothers for the data argument.
use the argument weights = sampling_weight so that the model is weighted by the sampling_weight variable.
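
If the mechanics of subsetting and weighted estimation are unfamiliar, the following is a minimal sketch on made-up data. The object and column names (toy, toy_treated, toy_untreated, toy_model, x, w) are hypothetical and are not part of the problem set data, and the one-predictor formula is only for illustration.

```r
library(dplyr)

# Hypothetical toy data (NOT the problem set data): a treatment
# indicator, one covariate, a binary outcome, and a sampling weight.
toy <- tibble(
  treated = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  x       = c(1, 2, 1, 2, 3, 4),
  y       = c(1, 0, 0, 1, 1, 1),
  w       = c(1, 2, 1, 1, 2, 2)
)

# filter() keeps only the rows where the condition is TRUE.
toy_treated   <- filter(toy, treated == TRUE)
toy_untreated <- filter(toy, treated == FALSE)

# Weighted linear model fit to the untreated rows only.
# weights = w is looked up as a column of the data argument.
toy_model <- lm(y ~ x, data = toy_untreated, weights = w)
coef(toy_model)
```

Because the outcome is binary, the fitted values from lm() can be read as predicted probabilities under a linear probability model.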

Predict counterfactuals. Now take the mothers data. Create a new column y_as_nonmother containing the predicted value of employment if this mother were counterfactually a non-mother. You might use mutate() and predict() in this step. Store your new dataset (with this one additional column) in an object called mothers_predicted.
Summarize by a weighted mean. Estimate the factual employment of mothers and their counterfactual employment that would be realized if they were non-mothers. Store your result in an object called estimates, which will be a tibble with 1 row and 2 columns named y and y_as_nonmother. To create this result,

use summarize() to take the mean of the factual outcome y and the predicted counterfactual y_as_nonmother in your mothers_predicted data
make sure to use sampling_weight to account for the fact that mothers are sampled from the population with unequal probabilities
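
The prediction and summary steps can be sketched the same way on made-up data. All names here (toy, toy_model, toy_treated_predicted, y_hat_untreated, toy_estimates, x, w) are hypothetical, and using weighted.mean() inside summarize() is one reasonable way to apply the weights, not the only one.

```r
library(dplyr)

# Hypothetical toy data (NOT the problem set data).
toy <- tibble(
  treated = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE),
  x       = c(1, 2, 1, 2, 3, 4),
  y       = c(1, 0, 0, 1, 1, 1),
  w       = c(1, 2, 1, 1, 2, 2)
)

# Model learned from the untreated rows only.
toy_untreated <- filter(toy, treated == FALSE)
toy_model <- lm(y ~ x, data = toy_untreated, weights = w)

# Predict for the treated rows: newdata supplies their covariates,
# so each prediction is the outcome expected without treatment.
toy_treated <- filter(toy, treated == TRUE)
toy_treated_predicted <- mutate(
  toy_treated,
  y_hat_untreated = predict(toy_model, newdata = toy_treated)
)

# Weighted means of the factual and predicted counterfactual outcomes.
toy_estimates <- summarize(
  toy_treated_predicted,
  y = weighted.mean(y, w = w),
  y_hat_untreated = weighted.mean(y_hat_untreated, w = w)
)
toy_estimates
```

The result is a tibble with one row and one column per summary, which is the same shape the estimates object should have.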