Linear Regression

Here are slides in website and pdf format.

As an example, we will work with U.S. adult income by sex (male, female), age (30–50), and year (2010–2019). We will focus on the target population of those working 35+ hours per week for 50+ weeks per year. Data are simulated based on the 2010–2019 American Community Survey (ACS).

The function below will simulate data. It relies on the tidyverse package, loaded with library(tidyverse).

simulate <- function(n = 100) {
  # Read population parameters, sample n cases with replacement in
  # proportion to their population weights, then draw lognormal incomes
  read_csv("https://ilundberg.github.io/description/assets/truth.csv") |>
    slice_sample(n = n, weight_by = weight, replace = TRUE) |>
    mutate(income = exp(rnorm(n(), meanlog, sdlog))) |>
    select(year, age, sex, income)
}

Below you can see this function in action.

simulated <- simulate(n = 3e4)
Rows: 420 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): sex
dbl (5): year, age, meanlog, sdlog, weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.
# A tibble: 30,000 × 4
   year   age sex    income
  <dbl> <dbl> <chr>   <dbl>
1  2011    48 female 93676.
2  2012    38 female 98805.
3  2013    38 female 52330.
# ℹ 29,997 more rows

Conditional expectation

A key goal with linear regression is the conditional expectation: the mean of an outcome within a population subgroup.

  • expectation refers to taking a mean
  • conditional refers to within a subgroup

Example: Mean income among females age 47 in 2019

Suppose we want to estimate that conditional mean in our data. One way is to first create the subgroup and take the mean among people in that subgroup in our sample.

filter() restricts our data to cases meeting requirements:

  • the sex variable equals the value female
  • the age variable equals the value 47
  • the year variable equals the value 2019
subgroup <- simulated |>
  filter(sex == "female") |>
  filter(age == 47) |>
  filter(year == 2019)

summarize() aggregates to the mean:

subgroup |>
  summarize(conditional_expectation = mean(income))
# A tibble: 1 × 1
  conditional_expectation
                    <dbl>
1                  71530.

Often, we want to study many conditional expectations: the mean outcome in many subgroups. In previous homework, we have seen how group_by and summarize can yield the mean in many subgroups.

simulated |>
  group_by(sex, age, year) |>
  summarize(conditional_expectation = mean(income))
`summarise()` has grouped output by 'sex', 'age'. You can override using the
`.groups` argument.
# A tibble: 420 × 4
# Groups:   sex, age [42]
  sex      age  year conditional_expectation
  <chr>  <dbl> <dbl>                   <dbl>
1 female    30  2010                  45928.
2 female    30  2011                  43688.
3 female    30  2012                  42714.
# ℹ 417 more rows

In math, the conditional expectation function is the subgroup mean of \(Y\) within a subgroup with the predictor values \(\vec{X} = \vec{x}\). We use \(\text{E}\) to denote the expectation operator. For simplicity, we will let \(f()\) refer to the conditional expectation function, which has input \(\vec{x}\) and outputs a conditional mean among those with \(\vec{X} = \vec{x}\).

\[ f(\vec{x}) = \text{E}(Y\mid\vec{X} = \vec{x}) \]

Learning \(f(\vec{x})\) from data is a central task in statistical learning.

Statistical learning by pooling information

A common problem in statistical inference is that we want to study a subgroup, but there are few cases within that subgroup. For example, consider female respondents age 47 in 2019 in our simulated data.

simulated |>
  filter(sex == "female") |>
  filter(year == 2019) |>
  filter(age == 47)
# A tibble: 67 × 4
   year   age sex    income
  <dbl> <dbl> <chr>   <dbl>
1  2019    47 female 15761.
2  2019    47 female 32995.
3  2019    47 female 83967.
# ℹ 64 more rows

Having few cases in a subgroup leads to statistical uncertainty about the mean in that subgroup. How can we better estimate this conditional mean?

One strategy is to pool information using a model. We have many female respondents in 2019. The only problem is that few are age 47.

simulated |>
  filter(sex == "female") |>
  filter(year == 2019)
# A tibble: 1,427 × 4
   year   age sex    income
  <dbl> <dbl> <chr>   <dbl>
1  2019    32 female 52130.
2  2019    46 female 17465.
3  2019    41 female 66012.
# ℹ 1,424 more rows

We might think that these other respondents (e.g., those age 46 and 48) are informative about the outcomes of 47-year-olds. There are many ways to pool information. A linear regression model is one strategy that pools information across people of all ages, to estimate a conditional mean at any particular age.

The model assumes that all of the conditional means fall along a line. Then, it estimates the intercept \(\beta_0\) and slope \(\beta_1\) of this line to best fit those conditional means. Finally, one can use the model to make a prediction at any particular \(X\) value.
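To see concretely what "best fit" means, here is a minimal base-R sketch with toy data (hypothetical numbers, not the ACS simulation): with a single predictor, the least-squares slope is cov(x, y) / var(x), and the fitted line passes through the point of means.

```r
# Toy data, roughly linear in x (hypothetical numbers for illustration)
set.seed(1)
x <- 1:20
y <- 5 + 3 * x + rnorm(20)

# Closed-form least-squares estimates with a single predictor
slope <- cov(x, y) / var(x)
intercept <- mean(y) - slope * mean(x)

# lm() computes the same estimates
fit <- lm(y ~ x)
all.equal(unname(coef(fit)), c(intercept, slope))
```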

Practice question

\[ \text{E}(Y\mid X) = \beta_0 + \beta_1 X \]

Suppose \(\beta_0 = 5\) and \(\beta_1 = 3\)

  1. What is the conditional mean when \(X = 0\)?
  2. What is the conditional mean when \(X = 1\)?
  3. What is the conditional mean when \(X = 2\)?
  4. How much does the conditional mean change for each unit increase in \(X\)?
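You can check your answers numerically. A quick sketch plugging in the given coefficients:

```r
beta_0 <- 5  # intercept
beta_1 <- 3  # slope
conditional_mean <- function(x) beta_0 + beta_1 * x

conditional_mean(0)  # 5
conditional_mean(1)  # 8
conditional_mean(2)  # 11
conditional_mean(1) - conditional_mean(0)  # the mean rises by beta_1 = 3 per unit of X
```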

Coding a linear model

Coding a linear model in R is easy. First, generate some data (get the simulate() function from further up this page).

simulated <- simulate(n = 3e4)
Rows: 420 Columns: 6
── Column specification ────────────────────────────────────────────────────────
Delimiter: ","
chr (1): sex
dbl (5): year, age, meanlog, sdlog, weight

ℹ Use `spec()` to retrieve the full column specification for this data.
ℹ Specify the column types or set `show_col_types = FALSE` to quiet this message.

For our example, restrict to female respondents in 2019.

female_2019 <- simulated |>
  filter(sex == "female") |>
  filter(year == 2019)

Then learn a model from the data with the lm() function.

model <- lm(
  formula = income ~ age, 
  data = female_2019
)

Here is how that code worked:

  • model is an object of class lm for linear model
  • lm() function creates this object
  • formula argument is a model formula
    • outcome ~ predictor is the syntax
  • data is a dataset containing outcome and predictor

We can look at the learned model with the summary() function.

summary(model)

Call:
lm(formula = income ~ age, data = female_2019)

Residuals:
   Min     1Q Median     3Q    Max 
-52689 -29518 -12682  16013 400507 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  47242.6     7518.3   6.284 4.37e-10 ***
age            233.7      185.7   1.259    0.208    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43360 on 1437 degrees of freedom
Multiple R-squared:  0.001102,  Adjusted R-squared:  0.0004064 
F-statistic: 1.585 on 1 and 1437 DF,  p-value: 0.2083

Finally, we might predict at a new X value. First, define the data at which to make the prediction: a person age 47.

to_predict <- tibble(age = 47)

Predict for that subgroup

predict(model, newdata = to_predict)
       1 
58228.57 

To review, our model pooled information:

  • People of all ages contributed to model
  • Then we predicted at a single age
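To see that predict() simply evaluates the fitted line, here is a base-R sketch using toy data (hypothetical numbers, not the ACS simulation):

```r
# Toy data: income roughly linear in age
set.seed(2)
toy <- data.frame(age = sample(30:50, 200, replace = TRUE))
toy$income <- 40000 + 300 * toy$age + rnorm(200, sd = 5000)

fit <- lm(income ~ age, data = toy)

# predict() returns intercept + slope * 47, nothing more
by_hand <- coef(fit)[["(Intercept)"]] + coef(fit)[["age"]] * 47
by_predict <- unname(predict(fit, newdata = data.frame(age = 47)))
all.equal(by_predict, by_hand)
```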

Practice question

Below is the line fit to the population data. Suppose we want to learn \(\text{E}(\log(Y)\mid X = 30)\).

[Figure: linear regression line fit to the population data, income by age]

  1. Why might this model make a misleading estimate?
  2. Why might the model still be useful?

Additive vs interactive models

Below, we visualize two models: one for male and one for female respondents.

There are two equivalent ways to describe these two models. The first is by thinking of them as separate linear regressions.

\[ \begin{aligned} \text{E}(Y\mid X, \text{Female}) &= \beta_0^\text{Female} + \beta_1^\text{Female}\times \text{Age} \\ \text{E}(Y\mid X, \text{Male}) &= \beta_0^\text{Male} + \beta_1^\text{Male}\times \text{Age} \\ \end{aligned} \]

The second way is to think of them as one pooled linear regression that interacts age and sex to allow the slope on age to differ by sex.

\[\text{E}(Y \mid X, \text{Sex}) = \gamma_0 + \gamma_1(\text{Female}) + \gamma_2(\text{Age}) + \gamma_3 (\text{Age} \times \text{Female})\]

Note that both approaches summarize the conditional mean function with 4 parameters. The two approaches are actually equivalent, as you can show with some algebra.

\[\begin{aligned} \gamma_0 &= \beta_0^\text{Male} &\gamma_1 &= \beta_0^\text{Female} - \beta_0^\text{Male} \\ \gamma_2 &= \beta_1^\text{Male} &\gamma_3 &= \beta_1^\text{Female} - \beta_1^\text{Male} \end{aligned}\]
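This equivalence can be verified in code. Below is a base-R sketch with toy data (hypothetical numbers, not the ACS simulation); note that with R's default factor coding the reference category is female, the alphabetically first level, so the coefficients map onto differences from the female equation rather than the male one as in the algebra above.

```r
set.seed(3)
toy <- data.frame(
  sex = rep(c("female", "male"), each = 100),
  age = runif(200, 30, 50)
)
toy$income <- ifelse(toy$sex == "female",
                     45000 + 250 * toy$age,
                     50000 + 400 * toy$age) + rnorm(200, sd = 3000)

# Two separate regressions
fit_f <- lm(income ~ age, data = subset(toy, sex == "female"))
fit_m <- lm(income ~ age, data = subset(toy, sex == "male"))

# One pooled, interacted regression
fit_int <- lm(income ~ sex * age, data = toy)

# Coefficient order: (Intercept), sexmale, age, sexmale:age
all.equal(
  unname(coef(fit_int)),
  unname(c(coef(fit_f)[1],                   # female intercept
           coef(fit_m)[1] - coef(fit_f)[1],  # intercept difference
           coef(fit_f)[2],                   # female slope
           coef(fit_m)[2] - coef(fit_f)[2])) # slope difference
)
```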

Below, we practice writing an interaction in code. First generate data in 2019 that vary in both sex and age.

all_2019 <- simulated |>
  filter(year == 2019)
# A tibble: 3,204 × 4
   year   age sex   income
  <dbl> <dbl> <chr>  <dbl>
1  2019    41 male  50285.
2  2019    45 male  31057.
3  2019    34 male  66166.
# ℹ 3,201 more rows

The * operator allows slopes to differ across groups

model <- lm(
  formula = income ~ sex * age,
  data = all_2019
)


The + operator assumes slopes are the same across groups

model <- lm(
  formula = income ~ sex + age,
  data = all_2019
)

When you have many interactions, the model starts to have lots of terms! This can make interpretation hard. But, you can always use the model to predict any conditional mean you want, even if there are many interactions.

model <- lm(
  formula = income ~ sex * age * year,
  data = simulated
)
summary(model)

Call:
lm(formula = income ~ sex * age * year, data = simulated)

Residuals:
   Min     1Q Median     3Q    Max 
-81158 -33849 -14946  15839 972817 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       1.387e+06  2.343e+06   0.592    0.554
sexmale          -1.943e+06  3.117e+06  -0.623    0.533
age              -6.273e+04  5.760e+04  -1.089    0.276
year             -6.680e+02  1.163e+03  -0.574    0.566
sexmale:age       6.646e+04  7.675e+04   0.866    0.386
sexmale:year      9.519e+02  1.547e+03   0.615    0.538
age:year          3.130e+01  2.859e+01   1.095    0.274
sexmale:age:year -3.247e+01  3.809e+01  -0.852    0.394

Residual standard error: 57790 on 29992 degrees of freedom
Multiple R-squared:  0.03332,   Adjusted R-squared:  0.03309 
F-statistic: 147.7 on 7 and 29992 DF,  p-value: < 2.2e-16

The many terms of the model correspond to many slopes in subgroups, as visualized below.

Penalized Regression

We close with a more advanced data science topic: penalized regression. We will discuss a method known as ridge regression or L2 penalized regression. This data science approach is a linear model just like OLS, but estimates the coefficients slightly differently.

Before defining this type of penalized regression, below we show what happens when we use it. Each line is a regression estimated on a different sample from the population. What similarities and differences do you notice between the penalized and unpenalized regressions?

We could use each approach to predict the mean income among 47-year-olds. Below are those predictions, with one dot from each simulated sample. How do the patterns below align with what you noticed in the graph above?

The reason one uses penalized regression is to reduce the sampling variance of estimates, at the cost of some bias (estimates are generally drawn in toward the overall sample mean). To understand how, it is useful to see some math.

OLS chooses \(\alpha, \vec\beta\) to minimize this function: \[ \begin{aligned} \underbrace{\sum_i\left(Y_i - \hat{Y}_i\right)^2}_\text{Sum of Squared Error} \end{aligned} \] where \(\hat{Y}_i = \hat\alpha + \sum_j X_{ij} \hat\beta_j\)

Penalized (ridge) regression chooses \(\alpha, \vec\beta\) to minimize this function: \[ \begin{aligned} \underbrace{\sum_i\left(Y_i - \hat{Y}_i\right)^2}_\text{Sum of Squared Error} + \underbrace{\lambda \sum_{j} \beta_j^2}_\text{Penalty Term} \end{aligned} \] where \(\hat{Y}_i = \hat\alpha + \sum_j X_{ij} \hat\beta_j\)

The only difference between the two is that penalized regression seeks to avoid having a large value of \(\sum_j \beta_j^2\). Thus, it prefers to estimate models with coefficients near zero. In practice, researchers often mean-center covariates and outcomes for penalized regression so that this pulls all estimates toward the overall mean.
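With mean-centered data and no intercept, the ridge minimizer has the closed form \(\hat{\vec\beta} = (\mathbf{X}^\top\mathbf{X} + \lambda\mathbf{I})^{-1}\mathbf{X}^\top\vec{y}\). Here is a base-R sketch with toy data (one predictor; \(\lambda = 10\) chosen arbitrarily for illustration):

```r
set.seed(4)
n <- 50
x <- rnorm(n)
x <- x - mean(x)                 # mean-center the predictor
y <- 2 * x + rnorm(n)
y <- y - mean(y)                 # mean-center the outcome

X <- matrix(x, ncol = 1)
lambda <- 10

# OLS: (X'X)^{-1} X'y; ridge adds lambda to the diagonal before inverting
beta_ols   <- drop(solve(t(X) %*% X) %*% t(X) %*% y)
beta_ridge <- drop(solve(t(X) %*% X + lambda * diag(1)) %*% t(X) %*% y)

# The penalty shrinks the estimate toward zero
abs(beta_ridge) < abs(beta_ols)
```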

Penalized regression in code

Penalized regression is available through many R packages. Here we illustrate with the glmnet package. First, simulate a large sample.

simulated <- simulate(n = 1e5)

Load the package.

library(glmnet)

Create a model matrix of predictors

  • This converts the predictor data into matrix form
  • Each column will correspond to a coefficient
X <- model.matrix(~ age * sex * year, data = simulated)

Create a vector of the outcomes

y <- simulated |> pull(income)

Use the cv.glmnet() function to fit the penalized regression. For now, we will leave as a black box how it chooses the penalty parameter \(\lambda\).

penalized <- cv.glmnet(
  x = X,    # model matrix we created
  y = y,    # outcome vector we created
  alpha = 0 # penalize sum of beta ^ 2
)

Finally, make predictions from the model at each observed data point.

yhat <- predict(
  penalized,
  newx = X
)
summary(yhat)
   lambda.1se   
 Min.   :60582  
 1st Qu.:62568  
 Median :65476  
 Mean   :65063  
 3rd Qu.:67425  
 Max.   :69405  

When should you use penalized regression? The key reasons are motivated by the original illustration figures above. You should use penalized regression to reduce the variance of your estimates. This may occur in settings where you have many predictors and few observations, for example. However, there is a cost: penalized regression generally yields biased estimates of conditional means, so the model will be wrong on average.

In future classes, we will discuss data-driven ways to choose among the many statistical and machine-learning approaches to estimate conditional mean functions.
