Problem Set 10: Difference in Difference

This problem set involves a conceptual portion and a coding portion, with separate Gradescope submissions. The coding portion is very similar to an example from class.

Conceptual questions

In the figures below, the treated group becomes treated between time 1 and time 2. The control group never becomes treated. Figures are hypothetical scenarios that depict true potential outcomes even if those outcomes would not be observed in an actual study.
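
For reference, one common way to write the parallel trends assumption in the notation of the questions below is sketched here. The expectation operator \(E\) is an assumption about notation and may differ from the version given in class.

```latex
\[
E\left(Y^0_\text{Treated,2}\right) - E\left(Y^0_\text{Treated,1}\right)
=
E\left(Y^0_\text{Control,2}\right) - E\left(Y^0_\text{Control,1}\right)
\]
```

In words: had neither group been treated, the average outcome would have changed by the same amount from time 1 to time 2 in both groups.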

In which setting does the parallel trends assumption hold: A, B, neither, or both?
1. Setting A
2. Setting B
3. Neither setting
4. Both settings
In actual data analysis, can we ever know for certain whether we are in Setting A or Setting B? If the answer is no, then tell us which outcome cannot be observed.
1. We can know for certain
2. We cannot know because \(Y^1_\text{Treated,2}\) is unobserved.
3. We cannot know because \(Y^0_\text{Treated,2}\) is unobserved.
4. We cannot know because \(Y^0_\text{Control,2}\) is unobserved.
A researcher comes to you with the data below, which depict only observed outcomes. That researcher wants to run a difference in difference analysis. Here, we have not depicted the counterfactual outcome because the researcher would not know it.

Which of the following makes the parallel trends assumption doubtful in this setting?

1. Within the treated group, the trend from period \(t = 1\) to \(t = 2\) is not the same as the trend from \(t = 0\) to \(t = 1\).
2. Within the control group, the trend from period \(t = 1\) to \(t = 2\) is not the same as the trend from \(t = 0\) to \(t = 1\).
3. In the period \(t = 0\) to \(t = 1\), the treated and control groups have different trends.
4. In the period \(t = 1\) to \(t = 2\), the treated and control groups have different trends.

Coding portion

You will submit an .R script to a programming assignment in Gradescope for this portion of the assignment. You should start by loading the data. The assignment uses the malesky2014.csv data from the Malesky, Nguyen, & Tran (2014) study of government recentralization in Vietnam, which we used in class as an example of difference in difference.

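library(tidyverse)
# Read the data directly from the course website
data <- read_csv("https://soc114.github.io/data/malesky2014.csv") |>
  mutate(
    # Convert the treatment indicator to a labeled factor
    treatment = factor(treatment, labels = c("Control","Treatment"))
  )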

In class, we studied the education and cultural center outcome. Use the same strategy to carry out analysis for the tap water outcome (tapwater).

One way to carry out difference in difference is by a linear regression model. Another way is by taking means within subgroups. You should use the strategy we learned in class.
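
As a rough illustration of why the two approaches agree, here is a minimal sketch on a small made-up dataset. It does not use the Malesky data, and the names toy, group, post, and outcome are hypothetical and chosen only for this example. With two groups and two periods, the interaction coefficient from the linear model should equal the difference in difference of the subgroup means.

```r
library(dplyr)

# Hypothetical toy data: two units in each group-period cell
# (made up for illustration; not the Malesky data)
toy <- tibble(
  group   = rep(c("Control", "Treatment"), each = 4),
  post    = rep(c(0, 0, 1, 1), times = 2),
  outcome = c(2, 4, 4, 6, 3, 5, 8, 10)
)

# Approach 1: mean outcome within each group-period cell
toy_means <- toy |>
  group_by(group, post) |>
  summarize(mean_outcome = mean(outcome), .groups = "drop")

# Helper to pull one cell mean out of the summary table
cell <- function(g, p) {
  toy_means |>
    filter(group == g, post == p) |>
    pull(mean_outcome)
}

# Change over time in the treated group minus change in the control group
toy_did_means <- (cell("Treatment", 1) - cell("Treatment", 0)) -
  (cell("Control", 1) - cell("Control", 0))

# Approach 2: coefficient on the group-by-period interaction in a linear model
toy_fit <- lm(outcome ~ group * post, data = toy)
toy_did_lm <- unname(coef(toy_fit)["groupTreatment:post"])
```

In this toy example both toy_did_means and toy_did_lm come out to the same number, because the regression with an interaction term fits the four cell means exactly.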

Create an object summary_statistics containing the mean value of the tapwater outcome under each treatment condition at each time period.
Create an object estimate containing the difference in difference estimate.