Linear Regression

UCLA Soc 114

Linear regression: Learning goals

Some things you may know

  • How to fit a linear model
  • How to make predictions

Data science ideas

  • Why model at all?
  • Penalized linear regression

Data for illustration

U.S. adult income by

  • sex (male, female)
  • age (30–50)
  • year (2010–2019)

among those working 35+ hours per week for 50+ weeks per year. Data are simulated based on the 2010–2019 American Community Survey (ACS).

Data for illustration

The function below will simulate data

simulate <- function(n = 100) {
  read_csv("https://ilundberg.github.io/description/assets/truth.csv") |>
    slice_sample(n = n, weight_by = weight, replace = T) |>
    mutate(income = exp(rnorm(n(), meanlog, sdlog))) |>
    select(year, age, sex, income)
}

Data for illustration

simulated <- simulate(n = 3e4)
# A tibble: 30,000 × 4
   year   age sex    income
  <dbl> <dbl> <chr>   <dbl>
1  2011    48 female 93676.
2  2012    38 female 98805.
3  2013    38 female 52330.
# ℹ 29,997 more rows

Conditional expectation

Mean of an outcome within a population subgroup.

  • expectation refers to taking a mean
  • conditional refers to within a subgroup

Example: Mean income among females age 47 in 2019

Task. Estimate this in our data.

Code: Find the subgroup

filter() restricts our data to cases meeting requirements:

  • the sex variable equals the value female
  • the age variable equals the value 47
  • the year variable equals the value 2019
subgroup <- simulated |>
  filter(sex == "female") |>
  filter(age == 47) |>
  filter(year == 2019)

Code: Estimate the mean

summarize() aggregates to the mean

subgroup |>
  summarize(conditional_expectation = mean(income))
# A tibble: 1 × 1
  conditional_expectation
                    <dbl>
1                  71530.

Code: Mean in many subgroups

With group_by, you can summarize many subgroups

simulated |>
  group_by(sex, age, year) |>
  summarize(conditional_expectation = mean(income))
# A tibble: 420 × 4
# Groups:   sex, age [42]
  sex      age  year conditional_expectation
  <chr>  <dbl> <dbl>                   <dbl>
1 female    30  2010                  45928.
2 female    30  2011                  43688.
3 female    30  2012                  42714.
# ℹ 417 more rows

Conditional expectation: Math

The conditional expectation function is the subgroup mean of \(Y\) within a subgroup with the predictor values \(\vec{X} = \vec{x}\).

\[ f(\vec{x}) = \text{E}(Y\mid\vec{X} = \vec{x}) \]

To learn \(f(\vec{x})\) from data is a central task in statistical learning.

Statistical Learning by Pooling Information

A subgroup is small

simulated |>
  filter(sex == "female") |>
  filter(year == 2019) |>
  filter(age == 47)
# A tibble: 67 × 4
   year   age sex    income
  <dbl> <dbl> <chr>   <dbl>
1  2019    47 female 15761.
2  2019    47 female 32995.
3  2019    47 female 83967.
# ℹ 64 more rows

Very few cases \(\rightarrow\) statistically uncertain

How to better estimate for 47-year-old females in 2019?

Pooling information across subgroups

We have many female respondents in 2019. Few are age 47.

simulated |>
  filter(sex == "female") |>
  filter(year == 2019)
# A tibble: 1,427 × 4
   year   age sex    income
  <dbl> <dbl> <chr>   <dbl>
1  2019    32 female 52130.
2  2019    46 female 17465.
3  2019    41 female 66012.
# ℹ 1,424 more rows

Could we use them to learn about the 47-year-olds?

Pooling information across subgroups

Pooling information across subgroups

Pooling information across subgroups

Practice question

\[ \text{E}(Y\mid X) = \beta_0 + \beta_1 X \]

Suppose \(\beta_0 = 5\) and \(\beta_1 = 3\)

  1. What is the conditional mean when \(X = 0\)?
  2. What is the conditional mean when \(X = 1\)?
  3. What is the conditional mean when \(X = 2\)?
  4. How much does the conditional mean change for each unit increase in \(X\)?

Code

The next slides explain how to code a model in R.

Code: Simulate data

Generate some data

simulated <- simulate(n = 3e4)

Restrict to female respondents in 2019

female_2019 <- simulated |>
  filter(sex == "female") |>
  filter(year == 2019)

(Below is simulate if you did not copy it before)

simulate <- function(n = 100) {
  read_csv("https://ilundberg.github.io/description/assets/truth.csv") |>
    slice_sample(n = n, weight_by = weight, replace = T) |>
    mutate(income = exp(rnorm(n(), meanlog, sdlog))) |>
    select(year, age, sex, income)
}

Code: Learn a model

model <- lm(
  formula = income ~ age, 
  data = female_2019
)
  • model is an object of class lm for linear model
  • lm() function creates this object
  • formula argument is a model formula
    • outcome ~ predictor is the syntax
  • data is a dataset containing outcome and predictor

Code: Examine the learned model

summary(model)

Call:
lm(formula = income ~ age, data = female_2019)

Residuals:
   Min     1Q Median     3Q    Max 
-52689 -29518 -12682  16013 400507 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  47242.6     7518.3   6.284 4.37e-10 ***
age            233.7      185.7   1.259    0.208    
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 43360 on 1437 degrees of freedom
Multiple R-squared:  0.001102,  Adjusted R-squared:  0.0004064 
F-statistic: 1.585 on 1 and 1437 DF,  p-value: 0.2083

Code: Predict for a new X value

Define X value at which to predict

to_predict <- tibble(age = 47)

Predict for that subgroup

predict(model, newdata = to_predict)
       1 
58228.57 

Recap: Our model pooled information:

  • People of all ages contributed to model
  • Then we predicted at a single age

Code: Three steps

  • Estimate a model
  • Define \(x\) to predict
  • Predict \(\hat{Y} = \hat{\text{E}}(Y\mid X = x)\)

What if you were going to do this many times on different data?

Code: Three steps in a function

estimator <- function(data) {
  # Learn the model from the data
  model <- lm(formula = income ~ age, data = data)
  # Define our target subgroup
  to_predict <- tibble(age = 47)
  # Predict
  estimate <- predict(model, newdata = to_predict)
  # Return the estimate
  return(estimate)
}

Code: All together

estimator(data = female_2019)
       1 
58228.57 
female_2019 |>
  estimator()
       1 
58228.57 
simulated |> 
  filter(sex == "female") |>
  filter(year == 2010) |>
  estimator()
      1 
54637.9 

Practice question

Below is the line fit to the population data. Suppose we want to learn \(\text{E}(\log(Y)\mid X = 30)\).

  1. Why might this model make a misleading estimate?
  2. Why might the model still be useful?

Additive vs Interactive

Two models

Two models

Two models

Two models: Interaction

\[ \begin{aligned} \text{E}(Y\mid X, \text{Female}) &= \beta_0^\text{Female} + \beta_1^\text{Female}\times \text{Age} \\ \text{E}(Y\mid X, \text{Male}) &= \beta_0^\text{Male} + \beta_1^\text{Male}\times \text{Age} \\ \end{aligned} \]

Equivalently, \[\text{E}(Y \mid X, \text{Sex}) = \gamma_0 + \gamma_1(\text{Female}) + \gamma_2(\text{Age}) + \gamma_3 (\text{Age} \times \text{Female})\] . . .

where \[\begin{aligned} \gamma_0 &= \beta_0^\text{Male} &\gamma_1 &= \beta_0^\text{Female} - \beta_0^\text{Male} \\ \gamma_2 &= \beta_1^\text{Male} &\gamma_3 &= \beta_1^\text{Female} - \beta_1^\text{Male} \end{aligned}\]

Two models: Interaction in code

Generate data in 2019 that vary in both sex and age

all_2019 <- simulated |>
  filter(year == 2019)
# A tibble: 3,204 × 4
   year   age sex   income
  <dbl> <dbl> <chr>  <dbl>
1  2019    41 male  50285.
2  2019    45 male  31057.
3  2019    34 male  66166.
# ℹ 3,201 more rows

Two models: Interaction in code

The * operator allows slopes to differ across groups

model <- lm(
  formula = income ~ sex * age,
  data = all_2019
)

Two models: Additive model in R

The + operator assumes slopes are the same across groups

model <- lm(
  formula = income ~ sex + age,
  data = all_2019
)

Interactions make lots of terms

model <- lm(
  formula = income ~ sex * age * year,
  data = simulated
)
summary(model)

Call:
lm(formula = income ~ sex * age * year, data = simulated)

Residuals:
   Min     1Q Median     3Q    Max 
-81158 -33849 -14946  15839 972817 

Coefficients:
                   Estimate Std. Error t value Pr(>|t|)
(Intercept)       1.387e+06  2.343e+06   0.592    0.554
sexmale          -1.943e+06  3.117e+06  -0.623    0.533
age              -6.273e+04  5.760e+04  -1.089    0.276
year             -6.680e+02  1.163e+03  -0.574    0.566
sexmale:age       6.646e+04  7.675e+04   0.866    0.386
sexmale:year      9.519e+02  1.547e+03   0.615    0.538
age:year          3.130e+01  2.859e+01   1.095    0.274
sexmale:age:year -3.247e+01  3.809e+01  -0.852    0.394

Residual standard error: 57790 on 29992 degrees of freedom
Multiple R-squared:  0.03332,   Adjusted R-squared:  0.03309 
F-statistic: 147.7 on 7 and 29992 DF,  p-value: < 2.2e-16

Interactions make lots of terms

Penalized Regression

Penalized regression

OLS is a linear model

\[\text{E}(Y\mid\vec{X}) = \beta_0 + \beta_1 X_1 + \beta_2 X_2 + \cdots\]

There are many linear models beyond OLS.

  • (other ways of estimating the \(\beta\) coefficients)

Penalized regression

Penalized regression

Unpenalized regression: In math

OLS chose \(\alpha, \vec\beta\) to minimize this function: \[ \begin{aligned} \underbrace{\sum_i\left(Y_i - \hat{Y}_i\right)^2}_\text{Sum of Squared Error} \end{aligned} \] where \(\hat{Y}_i = \hat\alpha + \sum_j X_j \hat\beta_j\)

Penalized regression: In math

Penalized (ridge) regression chose \(\alpha, \vec\beta\) to minimize this function: \[ \begin{aligned} \underbrace{\sum_i\left(Y_i - \hat{Y}_i\right)^2}_\text{Sum of Squared Error} + \underbrace{\lambda \sum_{j} \beta_j^2}_\text{Penalty Term} \end{aligned} \] where \(\hat{Y}_i = \hat\alpha + \sum_j X_j \hat\beta_j\)

Penalized regression: Code

simulated <- simulate(n = 1e5)

Penalized regression: Code

The glmnet package supports penalized regression

library(glmnet)

Penalized regression: Code

Create a model matrix of predictors

  • Each column will correspond to a coefficient
X <- model.matrix(~ age * sex * year, data = simulated)

Create a vector of the outcomes

y <- simulated |> pull(income)

Penalized regression: Code

Use the cv.glmnet function

penalized <- cv.glmnet(
  x = X,    # model matrix we created
  y = y,    # outcome vector we created
  alpha = 0 # penalize sum of beta ^ 2
)

Penalized regression: Code

yhat <- predict(
  penalized,
  newx = X
)
summary(yhat)
   lambda.1se   
 Min.   :60582  
 1st Qu.:62568  
 Median :65476  
 Mean   :65063  
 3rd Qu.:67425  
 Max.   :69405  

When to use penalized regression?

  • Many predictors and few observations
    • High-variance estimates
  • When you are willing to accept bias
    • Model will be a bit wrong on average

Linear regression: Learning goals

Some things you may know

  • How to fit a linear model
  • How to make predictions

Data science ideas

  • Why model at all?
  • Penalized linear regression