Difference in Difference

Here are slides

Today we study the effect of a policy change in New Jersey, drawing on evidence from the neighboring state of Pennsylvania.

Difference in difference is an identification strategy to be used when one or more units become treated at some time point while others do not. If we believe an assumption of parallel trends, then we can use the change over time for the never-treated units to estimate the change over time that would have been experienced by the units who become treated, in a counterfactual world where they had not become treated.

Data requirements

To carry out difference in difference, you need

  • One or more units become treated at an intervention time
    • Example: NJ adopts a higher minimum wage
  • One or more units remain untreated
    • Example: PA does not change the minimum wage
  • Data on all the units before and after the intervention time
    • Example: Data on NJ and PA before and after the change

Causal estimand

The causal estimand in DID is typically the average treatment effect on the treated in the post-treatment period. Let \(Y_{it}^a\) be the outcome of unit \(i\) in period \(t\) under treatment value \(a\). The causal estimand is

\[ \tau_\text{ATT} = \frac{1}{\text{Number Treated}}\sum_{i:A_{it}=1}\left(Y_{it}^{1} - Y_{it}^0\right) \]

The fundamental problem of causal inference is that the outcome under control \(Y_{it}^0\) is not observed for unit \(i\) who at time \(t\) has become treated. We do not know NJ’s employment rate after the minimum wage policy, which would have persisted if NJ had not adopted a new minimum wage.

Worked example: Recentralization in Vietnam

This example uses the malesky2014.csv data from Malesky, Nguyen, & Tran 2014 as re-analyzed by Egami & Yamauchi 2023.

Variables in the data

Variables in the data set:

  • treatment: binary variable indicating if the record is from the treatment group (\(1\)) or the control group (\(0\))
  • pro4: educational and cultural program number 4 outcome variable (binary: \(1\) if they participated in the program, \(0\) if not)
  • tapwater: tap water outcome variable (binary: \(1\) if they drink tap water, \(0\) if not)
  • agrext: agricultural center outcome variable (binary: \(1\) if they use the center, \(0\) if not)
  • year: the year the record is from
  • post_treat: binary variable indicating if the record is from the pre-treatment period (\(0\)) or the post-treatment period (\(0\))

The code below loads the data and selects and renames a few variables we will use.

library(tidyverse)
data <- read_csv("https://soc114.github.io/data/malesky2014.csv") |>
  select(treatment, year, education_culture = pro4) |>
  mutate(
    treatment = factor(treatment, labels = c("Control","Treatment"))
  )

Estimate means within group \(\times\) time

We will first estimate the mean value of the outcome (education_culture) within subgroups defined by treatment and year.

summary_statistics <- data |>
  group_by(treatment, year) |>
  summarize(education_culture = mean(education_culture))
`summarise()` has grouped output by 'treatment'. You can override using the
`.groups` argument.

Produce a difference in difference estimate

We can also estimate the difference in difference.

summary_statistics |>
  pivot_wider(
    names_from = c("treatment","year"), 
    values_from = "education_culture"
  ) |>
  mutate(
    estimate = (Treatment_2010 - Treatment_2008) - 
      (Control_2010 - Control_2008)
  ) |>
  select(estimate)
# A tibble: 1 × 1
  estimate
     <dbl>
1   0.0809
Back to top