Final Project

The culmination of the course is a group research project.

You can answer any question using the ideas we learned in this course.

What you will submit

There are three assignments on BruinLearn. Each will take one submission per group.

Draft Writeup. Due 5pm Fri Mar 7. Worth 0 points, but an opportunity to send us your work and get feedback from the teaching team. Format is the same as your writeup (see below).

Slides. Due 5pm Mon Mar 10. A PDF that you will present in discussion. Keep text to a minimum in the slides. You should plan to have all group members speak during the presentation. Your presentation should be 10 minutes or less.

Writeup. Due 5pm Fri Mar 14. A .qmd document compiled into a PDF. It should include all code that produces your results. The writeup must be no more than 1,000 words and must contain at least one visualization.

Group structure

We will form groups of about 5 students each within discussion sections. Groups will be formed during discussion in the middle of the quarter.

Key components of the project

Your goal should be to tell us a story using the data. What do we learn by studying this outcome, aggregated this way, in these subgroups from this population?

A few points should be part of your writeup:

  1. Define your unit of analysis
  2. Define your target population
    • Motivate: why is this population interesting to study?
  3. Describe how your sample was chosen from that population
    • this may be a probability sample, such as those available via IPUMS. If so, tell us a little bit about the sampling design
    • this may be a convenience sample. If so, why does it speak to the population and what are the limitations?
    • this may be data on the entire population, as in our baseball example
  4. Choose an outcome variable, which is defined for each unit in the population.
    • example: annual wage and salary income
  5. Choose one or more variables on which to create population subgroups
    • example: subgroups defined by sex (male and female)
  6. Choose a summary statistic, which aggregates the outcome distribution to one summary per subgroup
    • examples: proportion, mean, median, 90th percentile
  7. Answer one descriptive and one causal question (see below)
  8. Present results using ggplot()
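
As a rough sketch of components 4 through 6, the base R code below aggregates an outcome to one summary statistic per subgroup. The data frame `people`, its columns `sex` and `income`, and all values are invented for illustration; a real project would load its own sample and then pass the resulting summary table to ggplot().

```r
# Hypothetical toy data: one row per unit of analysis (a person).
# Column names and values are invented for illustration only.
people <- data.frame(
  sex    = c("female", "female", "female", "male", "male", "male"),
  income = c(30000, 45000, 60000, 20000, 50000, 80000)
)

# Aggregate the outcome (income) to one summary statistic (the median)
# per subgroup (sex).
by_group <- aggregate(income ~ sex, data = people, FUN = median)
names(by_group)[2] <- "median_income"

print(by_group)
```

Swapping `median` for `mean`, or for a function such as `function(x) quantile(x, 0.9)`, changes the summary statistic without changing the rest of the workflow.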

With these components in mind, you will answer one descriptive and one causal research question.

Descriptive research question. For this question, write your results carefully using only descriptive language. As a heuristic, if your goal is descriptive then you should not use the sentence structure “X [verb] Y”, such as “going to college increases earnings.” This claim suggests a college graduate would have earned less if they had not gone to college, which is a counterfactual outcome we did not see. For truly descriptive claims, phrase your result more like: “There is a difference in earnings between those who did and did not go to college.” A heuristic to recognize a non-causal claim is that it can be phrased as an “among” statement: “Among subgroup A, we find ___. Among subgroup B, we find ___.” Or “There is a disparity in Y across subgroups defined by X.”

Causal research question. Define your potential outcomes. Draw a DAG in which those potential outcomes can be identified by an assumption of conditional exchangeability. If there are unmeasured confounders, discuss how their omission from your DAG may threaten the credibility of your estimates. It is ok if the assumptions are doubtful as long as you write them down clearly.
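
To illustrate what conditional exchangeability buys you once it is assumed, here is a minimal sketch of one common estimation approach (not necessarily the one your project should use): compare treated and untreated units within each level of the measured confounder, then average those differences over the confounder's distribution. The variables `college` (treatment), `parent_college` (the only confounder in this hypothetical DAG), and `earnings` (outcome, in thousands of dollars) are assumptions made up for this example, as are all the numbers.

```r
# Hypothetical data: invented values, with one binary confounder.
d <- data.frame(
  parent_college = c(0, 0, 0, 0, 1, 1, 1, 1),
  college        = c(0, 0, 0, 1, 0, 1, 1, 1),
  earnings       = c(30, 30, 30, 40, 50, 60, 60, 60)
)

# Unadjusted (descriptive) difference in mean earnings by college.
naive_diff <- mean(d$earnings[d$college == 1]) -
  mean(d$earnings[d$college == 0])

# Within each confounder stratum, difference in mean earnings by college.
stratum_diff <- sapply(split(d, d$parent_college), function(s) {
  mean(s$earnings[s$college == 1]) - mean(s$earnings[s$college == 0])
})

# Share of the sample in each stratum (same ordering as split()).
stratum_share <- as.numeric(table(d$parent_college)) / nrow(d)

# Average the stratum-specific differences over the confounder distribution.
# This equals the average causal effect only if the DAG's assumptions hold.
ate_estimate <- sum(stratum_diff * stratum_share)

print(c(naive = naive_diff, adjusted = ate_estimate))
```

In this toy example the unadjusted difference is 20 while the adjusted difference is 10, because units with `parent_college = 1` are both more likely to be treated and have higher earnings. An unmeasured confounder would leave a gap like this uncorrected, which is why the writeup asks you to discuss them.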

Considerations to bear in mind

  • Weights. If your sample is drawn from the population with unequal probabilities, you should use sampling weights.
  • Models. If your question involves many subgroups (e.g., ages) with few observations in each subgroup, you can (but are not required to) use a statistical model to estimate your summary statistic in the subgroup by a predicted value. For example, you could use OLS to predict mean income at each age. If you do this, you should report the predicted value of the summary statistic, not the coefficients of the model.
  • Aggregation. Your data must begin with units (e.g., people) that you aggregate into subgroups (e.g., age groups). Your data might come pre-aggregated, such as data where each row contains data for all students in a particular college or university. Then you would need to aggregate further, such as to produce summaries for private versus public universities.
  • Dropped cases. As you move from raw data to the data that produce your graph, you might drop cases on the way. For example, some cases may have missing values on key predictors. Report how many are dropped, and why. Our goal here is transparent, open science.
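
The considerations above can be sketched together in base R. This is only an illustration under stated assumptions: the data frame `raw`, its columns `age`, `income`, and `weight`, and every value in it are invented, and the restriction steps are hypothetical examples rather than required ones.

```r
# Hypothetical raw sample: values, column names, and weights are invented.
raw <- data.frame(
  age    = c(25, 25, 30, 30, 35, 35, 40, NA),
  income = c(20, 40, 30, NA, 50, 70, 60, 55),
  weight = c(1, 3, 2, 2, 1, 1, 2, 1)
)

# Dropped cases: restrict one step at a time and record how many rows
# are lost at each step, so the counts can be reported in the writeup.
n_raw <- nrow(raw)
step1 <- raw[!is.na(raw$age), ]
step2 <- step1[!is.na(step1$income), ]
dropped <- c(
  missing_age    = n_raw - nrow(step1),
  missing_income = nrow(step1) - nrow(step2)
)
print(dropped)

# Weights: weighted mean income within each age subgroup.
weighted_means <- sapply(split(step2, step2$age), function(g) {
  weighted.mean(g$income, w = g$weight)
})
print(weighted_means)

# Models: weighted OLS of income on age. Report the predicted mean income
# at each age, not the model coefficients.
fit <- lm(income ~ age, data = step2, weights = weight)
predicted <- predict(fit, newdata = data.frame(age = c(25, 30, 35, 40)))
print(predicted)
```

Recording the row count after each restriction, rather than only at the end, is what lets you state the number of cases dropped at each step. Passing sampling weights to `lm()` affects the point estimates in the way intended here; standard errors from a complex survey design would need more care than this sketch gives them.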

Rubrics

Writeup rubric

Here are slides for discussion of the writeup rubric.

Criteria Points
The descriptive research question is well motivated, feasible, and avoids causal language. 10 pts
The causal research question is well motivated, feasible, and states appropriate causal assumptions. 10 pts
Unit of analysis is defined clearly 5 pts
Target population is defined clearly 5 pts
Predictor(s) are defined clearly 5 pts
Outcome is defined clearly 5 pts
Summary statistic is defined clearly (e.g., mean, median, proportion) 5 pts
Data source is described 5 pts
Sample restrictions are precisely stated, with the number of cases dropped at each step 10 pts
Weights are used appropriately, if applicable 5 pts
Figure(s) are clearly labeled 10 pts
Text explains and interprets the figure(s) 10 pts
Writing is concise and grammatically correct 5 pts
Code is well-formatted, well-documented, and easy to follow 5 pts
Code lines fit on the PDF page 5 pts
Total 100 pts

Presentation rubric

Criteria Points
Slides are a PDF submitted on BruinLearn 5 pts
Slides use minimal text 5 pts
Presentation motivates the research question 5 pts
Presentation states the question precisely 5 pts
Presentation walks through how to read the graph 5 pts
All group members speak (excused absences will not be penalized) 5 pts
Presentation finishes on time 5 pts
Group listens to classmates and answers questions clearly 5 pts
Total 40 pts

Within-group peer evaluation

We want to learn about your experience working in your group, for two reasons.

  1. We want to assess for the future how this group project went.
  2. As a secondary goal, in the rare case that a group member was consistently uninvolved, we want to assess some percentage penalty on that group member’s score. We only anticipate doing this in rare cases.

After the project has been submitted, we will ask everyone to complete this within-group peer evaluation form once to tell us about the contributions of each member of their group. We have taken this rubric from a website by the Center for Teaching Innovation at Cornell University.

Have fun

As a teaching team, the project is our favorite part of the course. Preparing you to succeed in the project has been (in some sense) the entire goal of all that precedes the project in the course. We hope you will find joy in answering questions with data, as we do.
