Regression

Steen Flammild Harsted & Søren O’Neill

The Workflow

The Workflow

Rrrrr straight to the Danger Zone

The promise(s)




Today is about


When?
Why?



-> HOW <-



Why? and When? are very important, but those questions are better suited to a statistics course.

Base R - Stability is a two-edged sword

Regression

lm() - linear models.

A typical model has the form response ~ terms

  • first + second: all the terms in first together with all the terms in second, with duplicates removed.
  • first : second: the set of terms obtained by taking the interactions of all terms in first with all terms in second.
  • first * second: the cross of first and second. This is the same as first + second + first:second.
  • A formula has an implied intercept term. To remove the intercept, use either y ~ x - 1 or y ~ 0 + x.


lm(y ~ 0 + x1 + x2 * x3)
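As a quick check, base R's terms() expands a formula into its individual terms; the formula above, for instance, expands like this:

```r
# Expand the formula into its terms: x2 * x3 becomes x2 + x3 + x2:x3,
# and the leading 0 removes the intercept
attr(terms(y ~ 0 + x1 + x2 * x3), "term.labels")
#> [1] "x1"    "x2"    "x3"    "x2:x3"
```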

Regression

lm() - linear models.

mpg_model1 <- lm(cty ~ hwy, data = mpg)

mpg_model1

Call:
lm(formula = cty ~ hwy, data = mpg)

Coefficients:
(Intercept)          hwy  
     0.8442       0.6832  

Regression

lm() - linear models.

The output is a list containing coefficients, residuals, etc.:

str(mpg_model1)
List of 12
 $ coefficients : Named num [1:2] 0.844 0.683
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "hwy"
 $ residuals    : Named num [1:234] -2.658 0.342 -2.024 -0.341 -2.608 ...
  ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
 $ effects      : Named num [1:234] -257.89 -62.1 -1.9 -0.21 -2.46 ...
  ..- attr(*, "names")= chr [1:234] "(Intercept)" "hwy" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:234] 20.7 20.7 22 21.3 18.6 ...
  ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:234, 1:2] -15.2971 0.0654 0.0654 0.0654 0.0654 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:234] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "hwy"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.07 1.06
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 232
 $ xlevels      : Named list()
 $ call         : language lm(formula = cty ~ hwy, data = mpg)
 $ terms        :Classes 'terms', 'formula'  language cty ~ hwy
  .. ..- attr(*, "variables")= language list(cty, hwy)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "cty" "hwy"
  .. .. .. ..$ : chr "hwy"
  .. ..- attr(*, "term.labels")= chr "hwy"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(cty, hwy)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "cty" "hwy"
 $ model        :'data.frame':  234 obs. of  2 variables:
  ..$ cty: int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
  ..$ hwy: int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language cty ~ hwy
  .. .. ..- attr(*, "variables")= language list(cty, hwy)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "cty" "hwy"
  .. .. .. .. ..$ : chr "hwy"
  .. .. ..- attr(*, "term.labels")= chr "hwy"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(cty, hwy)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "cty" "hwy"
 - attr(*, "class")= chr "lm"


Regression

lm() - linear models.

mpg_model1$coefficients
(Intercept)         hwy 
  0.8442016   0.6832191 
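Rather than reaching into the list with $, base R also offers extractor functions, which are generally the safer choice across model types. A sketch (ggplot2 is loaded only for the mpg data set):

```r
library(ggplot2)  # provides the mpg data set
mpg_model1 <- lm(cty ~ hwy, data = mpg)

coef(mpg_model1)     # same values as mpg_model1$coefficients
confint(mpg_model1)  # 95% confidence intervals for the coefficients
head(fitted(mpg_model1))     # first few fitted values
head(residuals(mpg_model1))  # first few residuals
```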

Regression

lm() - linear models.

summary(mpg_model1)

Call:
lm(formula = cty ~ hwy, data = mpg)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9247 -0.7757 -0.0428  0.6965  4.6096 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.84420    0.33319   2.534   0.0119 *  
hwy          0.68322    0.01378  49.585   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.252 on 232 degrees of freedom
Multiple R-squared:  0.9138,    Adjusted R-squared:  0.9134 
F-statistic:  2459 on 1 and 232 DF,  p-value: < 2.2e-16
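The fitted model can also be used for prediction with base R's predict(). A sketch, here on two hypothetical highway values (ggplot2 supplies the mpg data):

```r
library(ggplot2)  # provides the mpg data set
mpg_model1 <- lm(cty ~ hwy, data = mpg)

# Predicted city mpg, with 95% confidence intervals,
# for two hypothetical highway values
predict(mpg_model1,
        newdata  = data.frame(hwy = c(20, 30)),
        interval = "confidence")
```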

The easystats package(s)

The easystats package(s). report::report()

report(mpg_model1)
We fitted a linear model (estimated using OLS) to predict cty with hwy
(formula: cty ~ hwy). The model explains a statistically significant and
substantial proportion of variance (R2 = 0.91, F(1, 232) = 2458.64, p < .001,
adj. R2 = 0.91). The model's intercept, corresponding to hwy = 0, is at 0.84
(95% CI [0.19, 1.50], t(232) = 2.53, p = 0.012). Within this model:

  - The effect of hwy is statistically significant and positive (beta = 0.68, 95%
CI [0.66, 0.71], t(232) = 49.58, p < .001; Std. beta = 0.96, 95% CI [0.92,
0.99])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.

The easystats package(s). report::report()

Extra example model:

mpg_model2 <- lm(cty ~ hwy + drv + displ + cyl, data = mpg)



report(mpg_model2)
We fitted a linear model (estimated using OLS) to predict cty with hwy, drv,
displ and cyl (formula: cty ~ hwy + drv + displ + cyl). The model explains a
statistically significant and substantial proportion of variance (R2 = 0.93,
F(5, 228) = 637.44, p < .001, adj. R2 = 0.93). The model's intercept,
corresponding to hwy = 0, drv = 4, displ = 0 and cyl = 0, is at 4.40 (95% CI
[2.62, 6.18], t(228) = 4.86, p < .001). Within this model:

  - The effect of hwy is statistically significant and positive (beta = 0.64, 95%
CI [0.60, 0.69], t(228) = 26.01, p < .001; Std. beta = 0.90, 95% CI [0.83,
0.97])
  - The effect of drv [f] is statistically significant and negative (beta =
-0.75, 95% CI [-1.20, -0.29], t(228) = -3.25, p = 0.001; Std. beta = -0.18, 95%
CI [-0.28, -0.07])
  - The effect of drv [r] is statistically significant and negative (beta =
-0.97, 95% CI [-1.55, -0.40], t(228) = -3.33, p = 0.001; Std. beta = -0.23, 95%
CI [-0.36, -0.09])
  - The effect of displ is statistically non-significant and negative (beta =
-0.02, 95% CI [-0.36, 0.33], t(228) = -0.09, p = 0.930; Std. beta = -4.65e-03,
95% CI [-0.11, 0.10])
  - The effect of cyl is statistically significant and negative (beta = -0.36,
95% CI [-0.62, -0.11], t(228) = -2.79, p = 0.006; Std. beta = -0.14, 95% CI
[-0.23, -0.04])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.

The easystats package(s). report::report()

report() works on many objects other than regression models

Example with t.test()

data_for_t.test <- starwars |> 
  filter(species == "Human") |> 
  select(gender, height) |>  
  drop_na()  
  
t.test(height ~ gender, data = data_for_t.test) |> report()
Effect sizes were labelled following Cohen's (1988) recommendations.

The Welch Two Sample t-test testing the difference of height by gender (mean in
group feminine = 163.57, mean in group masculine = 182.39) suggests that the
effect is negative, statistically significant, and large (difference = -18.82,
95% CI [-29.92, -7.72], t(7.81) = -3.93, p = 0.005; Cohen's d = -2.81, 95% CI
[-4.73, -0.81])

The easystats package(s). parameters::model_parameters()

model_parameters(mpg_model1)
Parameter   | Coefficient |   SE |       95% CI | t(232) |      p
-----------------------------------------------------------------
(Intercept) |        0.84 | 0.33 | [0.19, 1.50] |   2.53 | 0.012 
hwy         |        0.68 | 0.01 | [0.66, 0.71] |  49.58 | < .001


model_parameters(mpg_model2)
Parameter   | Coefficient |   SE |         95% CI | t(228) |      p
-------------------------------------------------------------------
(Intercept) |        4.40 | 0.91 | [ 2.62,  6.18] |   4.86 | < .001
hwy         |        0.64 | 0.02 | [ 0.60,  0.69] |  26.01 | < .001
drv [f]     |       -0.75 | 0.23 | [-1.20, -0.29] |  -3.25 | 0.001 
drv [r]     |       -0.97 | 0.29 | [-1.55, -0.40] |  -3.33 | 0.001 
displ       |       -0.02 | 0.18 | [-0.36,  0.33] |  -0.09 | 0.930 
cyl         |       -0.36 | 0.13 | [-0.62, -0.11] |  -2.79 | 0.006 

The easystats package(s). parameters::model_parameters()

Add model information

model_parameters(mpg_model1, 
                 summary = TRUE)
Parameter   | Coefficient |   SE |       95% CI | t(232) |      p
-----------------------------------------------------------------
(Intercept) |        0.84 | 0.33 | [0.19, 1.50] |   2.53 | 0.012 
hwy         |        0.68 | 0.01 | [0.66, 0.71] |  49.58 | < .001

Add reference level

model_parameters(mpg_model2, 
                 include_reference = TRUE)
Parameter   | Coefficient |   SE |         95% CI | t(228) |      p
-------------------------------------------------------------------
(Intercept) |        4.40 | 0.91 | [ 2.62,  6.18] |   4.86 | < .001
hwy         |        0.64 | 0.02 | [ 0.60,  0.69] |  26.01 | < .001
drv [4]     |        0.00 |      |                |        |       
drv [f]     |       -0.75 | 0.23 | [-1.20, -0.29] |  -3.25 | 0.001 
drv [r]     |       -0.97 | 0.29 | [-1.55, -0.40] |  -3.33 | 0.001 
displ       |       -0.02 | 0.18 | [-0.36,  0.33] |  -0.09 | 0.930 
cyl         |       -0.36 | 0.13 | [-0.62, -0.11] |  -2.79 | 0.006 

The easystats package(s). parameters::model_parameters()

plot

model_parameters(mpg_model2) |> 
  plot()

The easystats package(s). parameters::compare_parameters()

compare_parameters(mpg_model1, mpg_model2)
Parameter    |        mpg_model1 |           mpg_model2
-------------------------------------------------------
(Intercept)  | 0.84 (0.19, 1.50) |  4.40 ( 2.62,  6.18)
hwy          | 0.68 (0.66, 0.71) |  0.64 ( 0.60,  0.69)
drv [f]      |                   | -0.75 (-1.20, -0.29)
drv [r]      |                   | -0.97 (-1.55, -0.40)
displ        |                   | -0.02 (-0.36,  0.33)
cyl          |                   | -0.36 (-0.62, -0.11)
-------------------------------------------------------
Observations |               234 |                  234

The easystats package(s). parameters::compare_parameters()

compare_parameters(mpg_model1, mpg_model2) |> 
  plot()

The easystats package(s). parameters::model_parameters()

Robust, heteroskedasticity-consistent standard errors

Read more here

Method | Adjustment Formula                                | Best Use Case
--------------------------------------------------------------------------
HC0    | No adjustment                                     | Large samples
HC1    | Inflation by \(\frac{n}{n-k}\) (default in Stata) | Moderate samples
HC2    | Adjustment by \((1 - h_i)\)                       | Small samples
HC3    | Adjustment by \((1 - h_i)^2\)                     | Very small samples or high leverage
HC4    | Further adjustment for extreme leverage           | Severe leverage issues
model_parameters(mpg_model1, 
                 vcov = "HC1") 
Parameter   | Coefficient |   SE |       95% CI | t(232) |      p
-----------------------------------------------------------------
(Intercept) |        0.84 | 0.38 | [0.10, 1.59] |   2.24 | 0.026 
hwy         |        0.68 | 0.02 | [0.65, 0.72] |  40.18 | < .001
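The vcov argument accepts the other estimators from the table as well; for small samples, HC3 is a common recommendation. A sketch (assumes the parameters and sandwich packages are installed; ggplot2 supplies mpg):

```r
library(parameters)  # model_parameters()
library(ggplot2)     # provides the mpg data set
mpg_model1 <- lm(cty ~ hwy, data = mpg)

# HC3 adjusts by (1 - h_i)^2, the usual small-sample choice
model_parameters(mpg_model1, vcov = "HC3")
```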

The easystats package(s). performance::model_performance()

model_performance(mpg_model1) 
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
773.387 | 773.491 | 783.753 | 0.914 |     0.913 | 1.247 | 1.252



model_performance(mpg_model2)
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
721.519 | 722.015 | 745.706 | 0.933 |     0.932 | 1.097 | 1.112


The easystats package(s). performance::compare_performance()

Extra example model:

mpg_model3 <- lm(cty ~ ., data = mpg)



compare_performance(mpg_model1, mpg_model2, mpg_model3) 
# Comparison of Model Performance Indices

Name       | Model | AIC (weights) | AICc (weights) | BIC (weights) |    R2
---------------------------------------------------------------------------
mpg_model1 |    lm | 773.4 (<.001) |  773.5 (<.001) | 783.8 (<.001) | 0.914
mpg_model2 |    lm | 721.5 (<.001) |  722.0 (<.001) | 745.7 (>.999) | 0.933
mpg_model3 |    lm | 609.0 (>.999) |  648.1 (>.999) | 809.4 (<.001) | 0.973

Name       | R2 (adj.) |  RMSE | Sigma
--------------------------------------
mpg_model1 |     0.913 | 1.247 | 1.252
mpg_model2 |     0.932 | 1.097 | 1.112
mpg_model3 |     0.965 | 0.694 | 0.798

The easystats package(s). performance::compare_performance()

compare_performance(mpg_model1, mpg_model2, mpg_model3) |> plot()

The easystats package(s). performance::check_model()

check_model(mpg_model1)

check_model(mpg_model2)

We used lm() to fit linear regression models

  • dependent ~ independent
  • +, *, and : for building your model


We have looked at the following functions from easystats

  • report()
  • model_parameters()
  • model_performance()
  • compare_performance()
  • check_model()


Everything above works with other model types such as

  • glm(, family = binomial) for logistic regression
  • glm(, family = poisson) for Poisson regression
  • lmer() for linear mixed-effects models
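For instance, a logistic regression fits with exactly the same workflow; a minimal sketch using the built-in mtcars data (am is a 0/1 outcome):

```r
# Logistic regression: probability of a manual transmission (am = 1)
# as a function of fuel economy and weight
logit_model <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
summary(logit_model)

# The easystats functions shown above apply unchanged, e.g.
# report(logit_model) or model_parameters(logit_model, exponentiate = TRUE)
```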

Cheatsheets

Stata -> R

SAS -> R

estimatr - an R package for calculating robust SEs, etc.

Honorable mention

The marginaleffects package for R and Python offers a single point of entry to easily interpret the results of over 100 classes of models, using a simple and consistent user interface.

Its benefits include:

  • Powerful: It can compute and plot predictions; comparisons (contrasts, risk ratios, etc.); slopes; and conduct hypothesis and equivalence tests for over 100 different classes of models in R.
  • Simple: All functions share a simple and unified interface.
  • Documented: Thoroughly documented with abundant examples.
  • Efficient: Some operations can be up to 1000 times faster and use 30 times less memory than with the margins package.
  • Valid: When possible, numerical results are checked against alternative software like Stata or other R packages.

link to paper
Free online book

gtsummary

YES! This package can help with regression tables as well

tbl_regression()

tbl_regression()

tbl_regression(mpg_model2)
Characteristic | Beta  | 95% CI       | p-value
-----------------------------------------------
hwy            |  0.64 |  0.60, 0.69  | <0.001
drv            |       |              |
    4          |       |              |
    f          | -0.75 | -1.2, -0.29  | 0.001
    r          | -0.97 | -1.6, -0.40  | 0.001
displ          | -0.02 | -0.36, 0.33  | >0.9
cyl            | -0.36 | -0.62, -0.11 | 0.006
Abbreviation: CI = Confidence Interval

tbl_regression()

tbl_regression(mpg_model2) |> 
  bold_labels() |> 
  bold_levels() |> 
  bold_p() 
Characteristic | Beta  | 95% CI       | p-value
-----------------------------------------------
hwy            |  0.64 |  0.60, 0.69  | <0.001
drv            |       |              |
    4          |       |              |
    f          | -0.75 | -1.2, -0.29  | 0.001
    r          | -0.97 | -1.6, -0.40  | 0.001
displ          | -0.02 | -0.36, 0.33  | >0.9
cyl            | -0.36 | -0.62, -0.11 | 0.006
Abbreviation: CI = Confidence Interval

tbl_regression()

tbl_regression(mpg_model2) |> 
  bold_labels() |> 
  italicize_levels() |> 
  bold_p() |> 
  add_glance_source_note()
Characteristic | Beta  | 95% CI       | p-value
-----------------------------------------------
hwy            |  0.64 |  0.60, 0.69  | <0.001
drv            |       |              |
    4          |       |              |
    f          | -0.75 | -1.2, -0.29  | 0.001
    r          | -0.97 | -1.6, -0.40  | 0.001
displ          | -0.02 | -0.36, 0.33  | >0.9
cyl            | -0.36 | -0.62, -0.11 | 0.006
Abbreviation: CI = Confidence Interval
R² = 0.933; Adjusted R² = 0.932; Sigma = 1.11; Statistic = 637; p-value = <0.001; df = 5; Log-likelihood = -354; AIC = 722; BIC = 746; Deviance = 282; Residual df = 228; No. Obs. = 234

tbl_merge()

t1 <- tbl_regression(mpg_model1) |> 
  bold_labels() |> 
  italicize_levels() |> 
  bold_p() 

t2 <- tbl_regression(mpg_model2) |> 
  bold_labels() |> 
  italicize_levels() |> 
  bold_p() 

tbl_merge()

tbl_merge(
  tbls = list(t1, 
              t2)
)
               |          Table 1            |           Table 2
Characteristic | Beta | 95% CI     | p-value | Beta  | 95% CI       | p-value
------------------------------------------------------------------------------
hwy            | 0.68 | 0.66, 0.71 | <0.001  |  0.64 |  0.60, 0.69  | <0.001
drv            |      |            |         |       |              |
    4          |      |            |         |       |              |
    f          |      |            |         | -0.75 | -1.2, -0.29  | 0.001
    r          |      |            |         | -0.97 | -1.6, -0.40  | 0.001
displ          |      |            |         | -0.02 | -0.36, 0.33  | >0.9
cyl            |      |            |         | -0.36 | -0.62, -0.11 | 0.006
Abbreviation: CI = Confidence Interval

tbl_merge()

Titles on models:

tbl_merge(
  tbls = list(t1, 
              t2), 
  tab_spanner = c("Simple", "WOW model!!!")
)
               |          Simple             |        WOW model!!!
Characteristic | Beta | 95% CI     | p-value | Beta  | 95% CI       | p-value
------------------------------------------------------------------------------
hwy            | 0.68 | 0.66, 0.71 | <0.001  |  0.64 |  0.60, 0.69  | <0.001
drv            |      |            |         |       |              |
    4          |      |            |         |       |              |
    f          |      |            |         | -0.75 | -1.2, -0.29  | 0.001
    r          |      |            |         | -0.97 | -1.6, -0.40  | 0.001
displ          |      |            |         | -0.02 | -0.36, 0.33  | >0.9
cyl            |      |            |         | -0.36 | -0.62, -0.11 | 0.006
Abbreviation: CI = Confidence Interval


tbl_merge()

Titles are read as markdown:

tbl_merge(
  tbls = list(t1, 
              t2), 
  tab_spanner = c("**Simple**", "**WOW model!!!**")
)
               |          Simple             |        WOW model!!!
Characteristic | Beta | 95% CI     | p-value | Beta  | 95% CI       | p-value
------------------------------------------------------------------------------
hwy            | 0.68 | 0.66, 0.71 | <0.001  |  0.64 |  0.60, 0.69  | <0.001
drv            |      |            |         |       |              |
    4          |      |            |         |       |              |
    f          |      |            |         | -0.75 | -1.2, -0.29  | 0.001
    r          |      |            |         | -0.97 | -1.6, -0.40  | 0.001
displ          |      |            |         | -0.02 | -0.36, 0.33  | >0.9
cyl            |      |            |         | -0.36 | -0.62, -0.11 | 0.006
Abbreviation: CI = Confidence Interval


Plot your models

If you use plot() on a gtsummary table you get a ggplot object

t2 |> plot()

Plot your models

You can modify the ggplot object using functions you already know.

t2 |> plot() +
  labs(
    title = "This plot is a regression model"
  )

gtsummary



Read more about regression and gtsummary

tbl_regression() tutorial

Thanks!