Visualization and Summary Statistics

UCLA Soc 114

Review of last class

Say this code in English:

numbers <- c(1,2,3)
x_length <- length(numbers)

Store the vector c(1,2,3) in the object numbers
Use the length() function to get the length of numbers

Learning goals for today

Reason about distributions
- and visualize with ggplot()
Understand summary statistics
- and construct with summarize()
Write clean code with the pipe |>

How to visualize?

How might we visualize the U.S. income distribution?

Here are some data:

library(tidyverse)
incomeSimulated <- read_csv(
  file = "https://soc114.github.io/data/incomeSimulated.csv"
)

# A tibble: 1,000 × 2
     id hhincome
  <dbl>    <dbl>
1     1   19170.
2     2  124474.
3     3   25114.
# ℹ 997 more rows

Visualize with a histogram

Bins of $25,000
In each bin, count number of people

Learning a new function: `ggplot`

ggplot(
  data = incomeSimulated,
  mapping = aes(x = hhincome)
)

Two arguments get us started:

data argument contains data
mapping argument maps data to plot elements

Within mapping,

aes() defines the aesthetics of the plot
i.e. which variable goes along x-axis

Adding a layer: `geom_histogram()`

ggplot(
  data = incomeSimulated,
  mapping = aes(x = hhincome)
) +
  geom_histogram()

The + indicates that a new layer is coming
geom_histogram() is the new layer
Inherits the data and mapping of the plot

Update axis titles

ggplot(
  data = incomeSimulated,
  mapping = aes(x = hhincome)
) +
  geom_histogram() +
  labs(
    x = "Household Income", 
    y = "Count of Households in Bin"
  )

Learning goals for today

Reason about distributions
- and visualize with ggplot()
Understand summary statistics
- and construct with summarize()
Write clean code with the pipe |>

Imagine 3 income distributions

Household	Distribution 1	Distribution 2	Distribution 3
1	$10k	$40k	$50k
2	$60k	$65k	$60k
3	$150k	$70k	$65k

Normative question: Which one is better?

Summary statistic

A summary statistic aggregates a distribution to one number

For example, the mean \[\text{mean}(\vec{x}) = \frac{x_1 + x_2 +\cdots}{n}\]

Summary statistic: Mean

Household	Distribution 1	Distribution 2	Distribution 3
1	$10k	$40k	$50k
2	$60k	$65k	$60k
3	$150k	$70k	$65k
Mean	$73k	$58k	$58k

By the mean, Distribution 1 seems the best.

Summary statistic: Median

Sort households by income.
Find where 50% of households have higher incomes

Household	Distribution 1	Distribution 2	Distribution 3
1	$10k	$40k	$50k
2	$60k	$65k	$60k
3	$150k	$70k	$65k

Summary statistic: Median

Sort households by income.
Find where 50% of households have higher incomes

Household	Distribution 1	Distribution 2	Distribution 3
1	$10k	$40k	$50k
2	$60k	$65k	$60k
3	$150k	$70k	$65k
Median	$60k	$65k	$60k

By the median, Distribution 2 seems the best.

Aside: Percentiles generalize the median

The median is the value in the middle

50% of people have lower values
Also called the 50th percentile

Generalizes to other percentiles

10th percentile: Value such that 10% are lower
90th percentile: Value such that 90% are lower

These summarize the bottom and top of a distribution.

Summary statistic: Minimum

Find the lowest value.

Household	Distribution 1	Distribution 2	Distribution 3
1	$10k	$40k	$50k
2	$60k	$65k	$60k
3	$150k	$70k	$65k

Summary statistic: Minimum

Find the lowest value.

Household	Distribution 1	Distribution 2	Distribution 3
1	$10k	$40k	$50k
2	$60k	$65k	$60k
3	$150k	$70k	$65k
Minimum	$10k	$40k	$50k

By the minimum, Distribution 3 seems the best.

Which summary statistic to choose?

Minimum? Median? Mean?

Household	Distribution 1	Distribution 2	Distribution 3
1	$10k	$40k	$50k
2	$60k	$65k	$60k
3	$150k	$70k	$65k

Choosing a summary statistic

Which summary to choose is not an empirical question.

Depends on what aspect of the distribution matters to you

The value of a chosen summary statistic is empirical.

Data can tell us a value for the mean, the median, the minimum, etc.

The `summarize()` function

The summarize() function aggregates data to summaries.

Input is a dataset with $n$ rows
Output is a summary with 1 row

The `summarize()` function

incomeSimulated

# A tibble: 1,000 × 2
     id hhincome
  <dbl>    <dbl>
1     1   19170.
2     2  124474.
3     3   25114.
# ℹ 997 more rows

The `summarize()` function

summarize(
  .data = incomeSimulated,
  estimated_mean = mean(hhincome)
)

# A tibble: 1 × 1
  estimated_mean
           <dbl>
1        100899.

.data is input data
estimated_mean is a variable in output data
mean(hhincome) is the mean household income

The `summarize()` function: Several summaries

summarize(
  .data = incomeSimulated,
  estimated_mean = mean(hhincome),
  estimated_median = median(hhincome),
  esitmated_min = min(hhincome)
)

# A tibble: 1 × 3
  estimated_mean estimated_median esitmated_min
           <dbl>            <dbl>         <dbl>
1        100899.           69035.             0

Learning goals for today

Reason about distributions
- and visualize with ggplot()
Understand summary statistics
- and construct with summarize()
Write clean code with the pipe |>

Piping code with `|>`

x <- c(1,2,3)
length(x)

[1] 3

x |> length()

[1] 3

The pipe |> passes x as the first argument to the length() function.

Piping code with `|>`

Stylistically helpful

Data is a different kind of argument
Pipes will help us in the future

incomeSimulated |>
  summarize(
    estimated_mean = mean(hhincome),
    estimated_median = median(hhincome),
    esitmated_min = min(hhincome)
  )

# A tibble: 1 × 3
  estimated_mean estimated_median esitmated_min
           <dbl>            <dbl>         <dbl>
1        100899.           69035.             0

Learning goals for today

Reason about distributions
- and visualize with ggplot()
Understand summary statistics
- and construct with summarize()
Write clean code with the pipe |>

You can now learn more: R4DS Ch 1 and Ch 3

Visualization and Summary Statistics

Review of last class

Learning goals for today

How to visualize?

Visualize with a histogram

Learning a new function: ggplot

Adding a layer: geom_histogram()

Update axis titles

Learning goals for today

Imagine 3 income distributions

Summary statistic

Summary statistic: Mean

Summary statistic: Median

Summary statistic: Median

Aside: Percentiles generalize the median

Summary statistic: Minimum

Summary statistic: Minimum

Which summary statistic to choose?

Choosing a summary statistic

The summarize() function

The summarize() function

The summarize() function

The summarize() function: Several summaries

Learning goals for today

Piping code with |>

Piping code with |>

Learning goals for today

Learning a new function: `ggplot`

Adding a layer: `geom_histogram()`

The `summarize()` function

The `summarize()` function

The `summarize()` function

The `summarize()` function: Several summaries

Piping code with `|>`

Piping code with `|>`