Logistic Regression

UCLA Soc 114

Logistic regression: Learning goals

Some things you may know

Logistic regression is good for binary outcomes
Coefficients are hard to interpret

Data science ideas

Predicted values make logistic regression easy to use

Logistic regression

A type of model for a binary outcome
- $Y$ taking the values {0,1} or {FALSE,TRUE}
Modeled as a function of predictor variables $\vec{X}$

A data example

baseball_population.csv

population <- read_csv("https://soc114.github.io/data/baseball_population.csv")

# A tibble: 944 × 6
  player               salary position team    team_past_record team_past_salary
  <chr>                 <dbl> <chr>    <chr>              <dbl>            <dbl>
1 Bumgarner, Madison 21882892 LHP      Arizona            0.457         2794887.
2 Marte, Ketel       11600000 2B       Arizona            0.457         2794887.
3 Ahmed, Nick        10375000 SS       Arizona            0.457         2794887.
4 Kelly, Merrill      8500000 RHP      Arizona            0.457         2794887.
5 Walker, Christian   6500000 1B       Arizona            0.457         2794887.
# ℹ 939 more rows

A data example

player is the player name
salary is the 2023 salary
position is the position played (e.g., LHP for left-handed pitcher)
team is the team name
team_past_record was the team’s win percentage in the previous season
team_past_salary was the team’s average salary in the previous season

A binary outcome

You see a player’s salary
Are they a catcher?
- position == "C"

Linear probability model

We can model with lm() for a linear fit.

ols_binary_outcome <- lm(
  position == "C" ~ salary,
  data = population
)

Code

population |>
  mutate(yhat = predict(ols_binary_outcome)) |>
  ggplot(aes(x = salary, y = yhat)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  scale_x_continuous(
    name = "Player Salary",
    labels = label_currency(scale = 1e-6, suffix = "m")
  ) +
  scale_y_continuous(
    name = "Predicted Probability\nof Being a Catcher"
  ) +
  ggtitle("Modeling a binary outcome with a line")

Goal: Avoid illogical predictions

In OLS, there is a linear predictor \[ \mu = \beta_0 + X_1\beta_1 + X_2\beta_2 + \cdots \] that can take any numeric value. Possibly $\mu <0$ or $\mu > 1$.

From $\mu$ to $\pi$

Logistic regression passes the linear predictor \[\mu = \beta_0 + X_1\beta_1 + X_2\beta_2 + \cdots\] through a nonlinear function to force it between 0 and 1.

\[ \pi = \text{logit}^{-1}\left(\beta_0 + X\beta_1\right) = \frac{e^{\beta_0 + X\beta_1}}{1 + e^{\beta_0 + X\beta_1}} \]

From $\mu$ to $\pi$

At linear predictor 0, what is the predicted probability?
At linear predictor 2.5, what is the predicted probability?
At linear predictor $\infty$, what is the predicted probability?

From $\pi$ to $\mu$

You can also think from $\pi$ to $\mu$.

\[ \begin{aligned} \text{logit}(\pi) &= \mu = \beta_0 + X\beta_1 \\ \log\left(\frac{\pi}{1-\pi}\right) &= \mu = \beta_0 + X\beta_1 \end{aligned} \]

From $\pi$ to $\mu$

Logistic regression in R

The glm() function (for logistic regression) works exactly like the lm() function (for linear regression)

Logistic regression in R

logistic_regression <- glm(
  position == "C" ~ salary,
  data = population,
  family = "binomial"
)

position == "C" is our outcome: the binary indicator that the position variable takes the value "C"
salary is a predictor variable
family = "binomial" specifies logistic regression (since “binomial” is a distribution for binary outcomes)

Coefficients: A word of warning

Hard to interpret. Not probabilities. Use predicted values instead.

summary(logistic_regression)


Call:
glm(formula = position == "C" ~ salary, family = "binomial", 
    data = population)

Coefficients:
              Estimate Std. Error z value Pr(>|z|)    
(Intercept) -2.268e+00  1.500e-01 -15.126   <2e-16 ***
salary      -5.599e-08  2.599e-08  -2.154   0.0312 *  
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

(Dispersion parameter for binomial family taken to be 1)

    Null deviance: 508.94  on 943  degrees of freedom
Residual deviance: 502.85  on 942  degrees of freedom
AIC: 506.85

Number of Fisher Scoring iterations: 6

Predicted values

Be sure to use type = "response" predict probabilities (between 0 and 1) instead of log odds

predict(
  logistic_regression,
  type = "response"
)

Predicted values

Code

population |>
  mutate(yhat = predict(logistic_regression, type = "response")) |>
  distinct(salary, yhat) |>
  ggplot(aes(x = salary, y = yhat)) +
  geom_line() +
  geom_hline(yintercept = 0, linetype = "dashed") +
  scale_x_continuous(
    name = "Player Salary",
    labels = label_currency(scale = 1e-6, suffix = "m")
  ) +
  scale_y_continuous(
    name = "Predicted Probability\nof Being a Catcher"
  ) +
  ggtitle("Modeling a binary outcome with logistic regression")

Predicted values with `newdata`

New player: salary is $5 million.
What is the probability that this player is a catcher?

to_predict <- tibble(salary = 5e6)

Make the predicted value.

predict(
  logistic_regression,
  newdata = to_predict,
  type = "response"
)

         1 
0.07255671

Linear and logistic regression

What is the same? What is different?

Same
- Takes $X$ and predicts $Y$
- Involves $\beta_0 + \beta_1 X$
Different
- Logistic regression predicts a probability $0\leq\pi\leq 1$

\[ \log\left(\frac{\pi}{1-\pi}\right) = \beta_0 + X_1\beta_1 \]

Logistic regression: Learning goals

Some things you may know

Logistic regression is good for binary outcomes
Coefficients are hard to interpret

Data science ideas

Predicted values make logistic regression easy to use

Logistic Regression

Logistic regression: Learning goals

Logistic regression

A data example

A data example

A binary outcome

Linear probability model

Goal: Avoid illogical predictions

From \(\mu\) to \(\pi\)

From \(\mu\) to \(\pi\)

From \(\pi\) to \(\mu\)

From \(\pi\) to \(\mu\)

Logistic regression in R

Logistic regression in R

Coefficients: A word of warning

Predicted values

Predicted values

Predicted values with `newdata`

Linear and logistic regression

Logistic regression: Learning goals

Logistic Regression

Logistic regression: Learning goals

Logistic regression

A data example

A data example

A binary outcome

Linear probability model

Goal: Avoid illogical predictions

From \(\mu\) to \(\pi\)

From \(\mu\) to \(\pi\)

From \(\pi\) to \(\mu\)

From \(\pi\) to \(\mu\)

Logistic regression in R

Logistic regression in R

Coefficients: A word of warning

Predicted values

Predicted values

Predicted values with newdata

Linear and logistic regression

Logistic regression: Learning goals

Predicted values with `newdata`