UCLA Soc 114
Say this code in English:
Store c(1,2,3) in the object numbers
Use the length() function to get the length of numbers
Topics: ggplot(), summarize(), |>
How might we visualize the U.S. income distribution?
Here are some data:
# A tibble: 1,000 × 2
id hhincome
<dbl> <dbl>
1 1 19170.
2 2 124474.
3 3 25114.
# ℹ 997 more rows
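One way to look at a distribution like this is a histogram. The sketch below is illustrative only: the course data are not reproduced here, so it simulates a stand-in with the same two columns. The object name incomes, the seed, and the log-normal parameters are all assumptions made for the example.

```r
library(dplyr)    # provides tibble() via re-export
library(ggplot2)

# Hypothetical stand-in for the data printed above (simulated, not the course data)
set.seed(114)
incomes <- tibble(
  id = 1:1000,
  hhincome = rlnorm(1000, meanlog = 11, sdlog = 0.9)
)

# data says which data to use; mapping says which variable goes on the x-axis
income_plot <- ggplot(data = incomes, mapping = aes(x = hhincome)) +
  geom_histogram()  # adds a histogram layer; R prints a note about the default number of bins

income_plot
```

Printing income_plot draws the figure. Income distributions are typically right-skewed, which is why the simulation uses a log-normal shape.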
ggplot()
Two arguments get us started:
data argument contains data
mapping argument maps data to plot elements
Within mapping, aes() defines the aesthetics of the plot
geom_histogram()
The + indicates that a new layer is coming
geom_histogram() is the new layer
The new layer inherits the data and mapping of the plot

| Household | Distribution 1 | Distribution 2 | Distribution 3 |
|---|---|---|---|
| 1 | $10k | $40k | $50k |
| 2 | $60k | $65k | $60k |
| 3 | $150k | $70k | $65k |
Normative question: Which one is better?
A summary statistic aggregates a distribution to one number
For example, the mean \[\text{mean}(\vec{x}) = \frac{x_1 + x_2 + \cdots + x_n}{n}\]
| Household | Distribution 1 | Distribution 2 | Distribution 3 |
|---|---|---|---|
| 1 | $10k | $40k | $50k |
| 2 | $60k | $65k | $60k |
| 3 | $150k | $70k | $65k |
| Mean | $73k | $58k | $58k |
By the mean, Distribution 1 seems the best.
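The means in the table can be checked in a few lines of R. This is a small sketch using the incomes from the table (in thousands of dollars); the object names distribution_1 through distribution_3 are made up for the example.

```r
# Incomes in thousands of dollars, copied from the table above
distribution_1 <- c(10, 60, 150)
distribution_2 <- c(40, 65, 70)
distribution_3 <- c(50, 60, 65)

# By hand: add up the values and divide by how many there are
mean_by_hand <- sum(distribution_1) / length(distribution_1)
mean_by_hand

# The built-in mean() function does the same calculation
mean(distribution_1)  # about 73.3
mean(distribution_2)  # about 58.3
mean(distribution_3)  # about 58.3
```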
| Household | Distribution 1 | Distribution 2 | Distribution 3 |
|---|---|---|---|
| 1 | $10k | $40k | $50k |
| 2 | $60k | $65k | $60k |
| 3 | $150k | $70k | $65k |
| Median | $60k | $65k | $60k |
By the median, Distribution 2 seems the best.
The median is the value in the middle
The same idea generalizes to other percentiles
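As a quick check of the medians in the table, here is a sketch in R. The object names are made up for the example, and quantile() is shown as one way to ask for other percentiles; with only three values per distribution, percentiles other than the median depend on R's interpolation rule, so no expected values are listed for them.

```r
# Incomes in thousands of dollars, copied from the table above
distribution_1 <- c(10, 60, 150)
distribution_2 <- c(40, 65, 70)
distribution_3 <- c(50, 60, 65)

# The median is the middle value once the incomes are sorted
median(distribution_1)  # 60
median(distribution_2)  # 65
median(distribution_3)  # 60

# The median is the 50th percentile; probs can request other percentiles too
quantile(distribution_2, probs = 0.5)
quantile(distribution_2, probs = c(0.1, 0.5, 0.9))
```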
The minimum and maximum summarize the bottom and top of a distribution.
Find the lowest value.
| Household | Distribution 1 | Distribution 2 | Distribution 3 |
|---|---|---|---|
| 1 | $10k | $40k | $50k |
| 2 | $60k | $65k | $60k |
| 3 | $150k | $70k | $65k |
| Minimum | $10k | $40k | $50k |
By the minimum, Distribution 3 seems the best.
Minimum? Median? Mean?
| Household | Distribution 1 | Distribution 2 | Distribution 3 |
|---|---|---|---|
| 1 | $10k | $40k | $50k |
| 2 | $60k | $65k | $60k |
| 3 | $150k | $70k | $65k |
Which summary to choose is not an empirical question.
The value of a chosen summary statistic is empirical.
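To make the contrast concrete: once a summary is chosen, its value is something we can compute. The sketch below computes all three summaries for the three distributions from the tables and reports which distribution ranks highest under each. The object names are made up for the example, and it uses base R helpers (sapply(), apply(), which.max()) only to keep the comparison compact.

```r
# Incomes in thousands of dollars, copied from the tables above
distributions <- list(
  distribution_1 = c(10, 60, 150),
  distribution_2 = c(40, 65, 70),
  distribution_3 = c(50, 60, 65)
)

# One column per distribution, one row per summary statistic
summaries <- sapply(distributions, function(x) {
  c(minimum = min(x), median = median(x), mean = mean(x))
})
summaries

# For each summary (row), find the distribution (column) with the highest value
best <- colnames(summaries)[apply(summaries, 1, which.max)]
names(best) <- rownames(summaries)
best
```

Each summary picks a different distribution, which is why the choice among summaries cannot be settled by the data alone.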
summarize() function
The summarize() function aggregates data to summaries.
Input data:
# A tibble: 1,000 × 2
id hhincome
<dbl> <dbl>
1 1 19170.
2 2 124474.
3 3 25114.
# ℹ 997 more rows
Output data:
# A tibble: 1 × 1
estimated_mean
<dbl>
1 100899.
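A call of roughly the following shape would produce a one-row, one-column result like the one above. This is a sketch under assumptions: the course data are not available here, so incomes is a simulated stand-in (the name, seed, and log-normal parameters are invented), and its mean will not match the 100899 shown.

```r
library(dplyr)

# Hypothetical stand-in for the input data (simulated, not the course data)
set.seed(114)
incomes <- tibble(
  id = 1:1000,
  hhincome = rlnorm(1000, meanlog = 11, sdlog = 0.9)
)

# Aggregate 1,000 rows down to a single row holding one summary
result <- summarize(
  .data = incomes,
  estimated_mean = mean(hhincome)
)
result
```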
.data is input data
estimated_mean is a variable in output data
mean(hhincome) is the mean household income
summarize() function: Several summaries
|>
In x |> length(), the pipe |> passes x as the first argument to the length() function.
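A tiny illustration of the pipe, assuming x is a short numeric vector (the values are arbitrary). The native pipe |> requires R 4.1 or later.

```r
x <- c(1, 2, 3)

# Without the pipe: x is written inside the parentheses
length(x)      # 3

# With the pipe: x is passed along as the first argument
x |> length()  # 3
```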
The pipe is stylistically helpful: steps read in the order they run.
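Putting the pieces together, here is a sketch of several summaries in one summarize() call, written with the pipe. As before, incomes is a simulated stand-in for the course data, and the output column names (estimated_min and so on) are chosen for the example.

```r
library(dplyr)

# Hypothetical stand-in for the course data (simulated)
set.seed(114)
incomes <- tibble(
  id = 1:1000,
  hhincome = rlnorm(1000, meanlog = 11, sdlog = 0.9)
)

# Start with the data, then summarize: each named argument becomes one output column
several <- incomes |>
  summarize(
    estimated_min = min(hhincome),
    estimated_median = median(hhincome),
    estimated_mean = mean(hhincome)
  )
several
```

Because the pipe passes incomes as the first argument, the .data argument does not need to be written out.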