Problem Set 3: Estimator and Bootstrap

For this problem set, we will summarize wealth gaps using data from the 2022 Survey of Consumer Finances. We accessed these data via the Berkeley Survey Documentation and Analysis website and prepared the data file scf_prepared.csv, which you can load directly with the code below. The read_csv() function comes from the tidyverse, so load that package first.

library(tidyverse)
scf_prepared <- read_csv("https://soc114.github.io/data/scf_prepared.csv")

A row of these data corresponds to a household. These data contain three variables:

race is the race of the household respondent, coded in the categories White, Black, Hispanic, Asian, and Other.
wealth is the net worth of the household (assets minus debts), with values below $10,000 bottom-coded to $10,000.
weight is a sample weight (the inverse probability of sample inclusion).

The next section gives background to get you started. The problem set questions begin in the section "1. Write an estimator function."

Info to get you started

This problem set will summarize wealth by the geometric mean, which is a summary statistic often used for distributions with long right tails. This summary statistic can be understood in steps:

Transform $X$ to $\log(X)$ to pull in the long right tail
Summarize the transformed distribution by its mean
Exponentiate the summary to move back to the scale of $X$ (since $e^{\log(X)} = X$)
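
To make the three steps concrete, here is a minimal sketch on four made-up values (not SCF data). It uses only base R, and the object names are chosen for illustration.

```r
# Toy values with a long right tail (made up for illustration)
x <- c(1, 10, 100, 10000)

# Step 1: transform to the log scale
log_x <- log(x)

# Step 2: summarize the transformed values by their mean
mean_log_x <- mean(log_x)

# Step 3: exponentiate to return to the original scale
geo_mean <- exp(mean_log_x)

# For comparison, the ordinary (arithmetic) mean
arith_mean <- mean(x)

geo_mean    # equals 10^1.75, about 56.2
arith_mean  # 2527.75
```

The single large value pulls the arithmetic mean far above most of the data, while the geometric mean stays closer to the bulk of the values.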

The figure below visualizes these steps to build intuition.

The code below applies these steps to the scf_prepared data to estimate the geometric mean of household wealth.

scf_prepared |>
  # Create a new variable log_wealth,
  # which is the log of the wealth variable
  mutate(
    log_wealth = log(wealth)
  ) |>
  # Summarize log wealth
  # by its weighted mean
  # (weight is the inverse probability
  # of sample inclusion)
  summarize(
    mean_log_wealth = weighted.mean(
      x = log_wealth,
      w = weight
    )
  ) |>
  # Create a new variable geo_mean, which
  # is the exponentiated mean of log wealth
  mutate(
    geo_mean = exp(mean_log_wealth)
  ) |>
  # Pull the result out of the tibble
  # object to return a number
  pull(geo_mean)

[1] 161195.7
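
Written as a formula, the pipeline above computes a weighted geometric mean. Here $x_i$ denotes the wealth of household $i$ and $w_i$ its sample weight; this notation is introduced for illustration and is not taken from the course materials.

\[ \text{geometric mean} = \exp\left( \frac{\sum_i w_i \log(x_i)}{\sum_i w_i} \right) \]

The fraction inside the parentheses is what weighted.mean() returns for log_wealth, and exp() carries that value back to the dollar scale.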

Now it is your turn! The code above will be useful in the problem set parts below.

1. Write an estimator function

We want to use the code above many times, on many different samples. To do this, write a function called geo_mean_estimator. Your function should have one argument named data, which will be a tibble object in the format of scf_prepared. Your function should take data and return a single number: the weighted geometric mean of wealth, calculated as in the code above.

You can use the code above within the body of your function. For help, see R4DS Ch 25.
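
If the shape of such a function is unfamiliar, the sketch below shows the general pattern on a deliberately different statistic (a weighted arithmetic mean), so it is not the answer to this question. The name weighted_mean_estimator and the toy tibble are hypothetical.

```r
library(dplyr)

# Hypothetical toy data in the same three-column layout as scf_prepared
toy <- tibble(
  race = c("White", "Black", "White", "Black"),
  wealth = c(10000, 20000, 30000, 40000),
  weight = c(1, 1, 2, 2)
)

# Hypothetical estimator: takes one argument named data
# and returns a single number
weighted_mean_estimator <- function(data) {
  data |>
    summarize(estimate = weighted.mean(x = wealth, w = weight)) |>
    pull(estimate)
}

# Call the function on the toy data
toy_estimate <- weighted_mean_estimator(data = toy)
toy_estimate
```

Because the function refers only to its data argument, the same function can later be applied to any resampled version of the dataset.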

2. Point estimate

Apply your geo_mean_estimator function with data = scf_prepared. Store the result in an object called geo_mean_estimate.

3. Bootstrap draws

Bootstrap your estimator 1,000 times. Store your bootstrap draws in an object named bootstrap_draws.

How you do this is up to you. There are at least two viable options:

You can initialize a vector bootstrap_draws as in class. Then write a for loop to populate it. See lecture slides on for loops.
You can use the replicate(n, expr) function where the n argument is the number of times to call the expression in the expr argument. See this blog.
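
As a rough illustration of the two options, the sketch below repeats a simple computation on a toy vector using base R only. The initialization with rep(NA_real_, ...) is one common choice and may differ from what was shown in class; the seed, the toy data, and the number of draws are arbitrary.

```r
set.seed(114)  # arbitrary seed so the sketch is reproducible

x <- c(2, 4, 6, 8, 10)  # toy data, not SCF data
n_draws <- 200          # fewer than 1,000 to keep the sketch quick

# Option 1: initialize a vector, then fill it in a for loop
draws_loop <- rep(NA_real_, n_draws)
for (i in seq_len(n_draws)) {
  draws_loop[i] <- mean(sample(x, replace = TRUE))
}

# Option 2: replicate() evaluates the expression n times
# and collects the results in a vector
draws_replicate <- replicate(
  n = n_draws,
  expr = mean(sample(x, replace = TRUE))
)

length(draws_loop)
length(draws_replicate)
```

Both options produce a numeric vector with one entry per draw, so the choice between them is mostly a matter of taste.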

Whatever method you use, you should use slice_sample(prop = 1, replace = TRUE) to resample the dataset with replacement for each bootstrap draw. See lecture slides on how to generate bootstrap samples.
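
To see what a single resampling step does, here is a small sketch on a hypothetical five-row tibble. The id column exists only to make repeated rows visible.

```r
library(dplyr)

set.seed(114)  # arbitrary seed so the sketch is reproducible

# Hypothetical toy data
toy <- tibble(id = 1:5, value = c(2, 4, 6, 8, 10))

# One bootstrap sample: same number of rows as the original,
# drawn with replacement
boot_sample <- toy |>
  slice_sample(prop = 1, replace = TRUE)

# The resample has as many rows as the original data
nrow(boot_sample) == nrow(toy)

# Sampling with replacement lets the same row appear more than once;
# this count of repeated ids will vary from draw to draw
sum(duplicated(boot_sample$id))
```

Each bootstrap draw then applies the estimator function to one such resampled dataset.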

4. Confidence interval

Construct a 95% confidence interval for the geometric mean by taking the 2.5th and 97.5th percentiles of the bootstrap draws. Store this in an object named confidence_interval.
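
One way to take percentiles in base R is quantile(). The sketch below applies it to a hypothetical stand-in vector rather than real bootstrap draws, so the numbers are easy to verify by hand.

```r
# Hypothetical stand-in for a vector of bootstrap draws
draws <- 1:1001

# The 2.5th and 97.5th percentiles bracket the middle 95% of the draws
ci <- quantile(draws, probs = c(0.025, 0.975))
ci
```

With these toy values the result is 26 and 976, returned as a named vector with names 2.5% and 97.5%.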

5. Write a ratio estimator

Now write a new function named ratio_estimator that estimates a ratio:

\[ \frac{\text{geometric mean among White respondent households}}{\text{geometric mean among Black respondent households}} \]

In words, this summary is how many dollars of wealth White households hold per dollar held by Black households, as summarized by the geometric mean.

You might consider using the geo_mean_estimator() function you already wrote within your ratio_estimator() function.
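
The sketch below illustrates the idea of reusing one estimator inside another, on hypothetical toy data with a simpler statistic (an unweighted mean). The names mean_estimator, group_ratio_estimator, group, and value are made up for illustration and are not part of the problem set.

```r
library(dplyr)

# Hypothetical toy data with two groups
toy <- tibble(
  group = c("A", "A", "B", "B"),
  value = c(10, 30, 5, 15)
)

# Hypothetical inner estimator: the mean of value
mean_estimator <- function(data) {
  data |>
    summarize(estimate = mean(value)) |>
    pull(estimate)
}

# Hypothetical outer estimator: apply the inner estimator
# to each subgroup, then divide
group_ratio_estimator <- function(data) {
  numerator <- data |>
    filter(group == "A") |>
    mean_estimator()
  denominator <- data |>
    filter(group == "B") |>
    mean_estimator()
  numerator / denominator
}

toy_ratio <- group_ratio_estimator(data = toy)
toy_ratio  # mean of A is 20, mean of B is 10, so the ratio is 2
```

Because the outer function takes the full dataset and subsets inside, it can be applied directly to a resampled dataset when bootstrapping.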

6. Point estimate for the ratio

Apply ratio_estimator with data = scf_prepared to produce a point estimate. Store the result in ratio_estimate.

7. Bootstrap the ratio estimator

Bootstrap the ratio estimator with 1,000 draws. Store the bootstrap estimates in an object called ratio_bootstrap.

8. Confidence interval for the ratio

Construct a 95% confidence interval for the ratio by taking the 2.5th and 97.5th percentiles of the bootstrap draws. Store this in an object named ratio_confidence_interval.