Regression

Steen Flammild Harsted & Søren O’Neill

The Workflow

The Workflow

Rrrrr straight to the Danger Zone

The promise(s)




Today is about


When?
Why?



-> HOW <-



Why? and When? are very important, but those questions are better suited to a statistics course.

Base R - Stability is a two-edged sword

Regression

lm() - linear models.

A typical model has the form response ~ terms

  • first + second: all the terms in first together with all the terms in second, with duplicates removed.
  • first : second: the set of terms obtained by taking the interactions of all terms in first with all terms in second.
  • first * second: the cross of first and second. This is the same as first + second + first:second.
  • A formula has an implied intercept term. To remove the intercept, use either y ~ x - 1 or y ~ 0 + x.


lm(y ~ 0 + x1 + x2 * x3)
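As a quick check, base R's terms() expands a formula into its individual terms; the formula above, for instance, expands like this:

```r
# Expand the formula into its terms: x2 * x3 becomes x2 + x3 + x2:x3,
# and the leading 0 removes the intercept
attr(terms(y ~ 0 + x1 + x2 * x3), "term.labels")
#> [1] "x1"    "x2"    "x3"    "x2:x3"
```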

Regression

lm() - linear models.

mpg_model1 <- lm(cty ~ hwy, data = mpg)

mpg_model1

Call:
lm(formula = cty ~ hwy, data = mpg)

Coefficients:
(Intercept)          hwy  
     0.8442       0.6832  

Regression

lm() - linear models.

The output is a list containing coefficients, residuals, etc.:

str(mpg_model1)
List of 12
 $ coefficients : Named num [1:2] 0.844 0.683
  ..- attr(*, "names")= chr [1:2] "(Intercept)" "hwy"
 $ residuals    : Named num [1:234] -2.658 0.342 -2.024 -0.341 -2.608 ...
  ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
 $ effects      : Named num [1:234] -257.89 -62.1 -1.9 -0.21 -2.46 ...
  ..- attr(*, "names")= chr [1:234] "(Intercept)" "hwy" "" "" ...
 $ rank         : int 2
 $ fitted.values: Named num [1:234] 20.7 20.7 22 21.3 18.6 ...
  ..- attr(*, "names")= chr [1:234] "1" "2" "3" "4" ...
 $ assign       : int [1:2] 0 1
 $ qr           :List of 5
  ..$ qr   : num [1:234, 1:2] -15.2971 0.0654 0.0654 0.0654 0.0654 ...
  .. ..- attr(*, "dimnames")=List of 2
  .. .. ..$ : chr [1:234] "1" "2" "3" "4" ...
  .. .. ..$ : chr [1:2] "(Intercept)" "hwy"
  .. ..- attr(*, "assign")= int [1:2] 0 1
  ..$ qraux: num [1:2] 1.07 1.06
  ..$ pivot: int [1:2] 1 2
  ..$ tol  : num 1e-07
  ..$ rank : int 2
  ..- attr(*, "class")= chr "qr"
 $ df.residual  : int 232
 $ xlevels      : Named list()
 $ call         : language lm(formula = cty ~ hwy, data = mpg)
 $ terms        :Classes 'terms', 'formula'  language cty ~ hwy
  .. ..- attr(*, "variables")= language list(cty, hwy)
  .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. ..$ : chr [1:2] "cty" "hwy"
  .. .. .. ..$ : chr "hwy"
  .. ..- attr(*, "term.labels")= chr "hwy"
  .. ..- attr(*, "order")= int 1
  .. ..- attr(*, "intercept")= int 1
  .. ..- attr(*, "response")= int 1
  .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. ..- attr(*, "predvars")= language list(cty, hwy)
  .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. ..- attr(*, "names")= chr [1:2] "cty" "hwy"
 $ model        :'data.frame':  234 obs. of  2 variables:
  ..$ cty: int [1:234] 18 21 20 21 16 18 18 18 16 20 ...
  ..$ hwy: int [1:234] 29 29 31 30 26 26 27 26 25 28 ...
  ..- attr(*, "terms")=Classes 'terms', 'formula'  language cty ~ hwy
  .. .. ..- attr(*, "variables")= language list(cty, hwy)
  .. .. ..- attr(*, "factors")= int [1:2, 1] 0 1
  .. .. .. ..- attr(*, "dimnames")=List of 2
  .. .. .. .. ..$ : chr [1:2] "cty" "hwy"
  .. .. .. .. ..$ : chr "hwy"
  .. .. ..- attr(*, "term.labels")= chr "hwy"
  .. .. ..- attr(*, "order")= int 1
  .. .. ..- attr(*, "intercept")= int 1
  .. .. ..- attr(*, "response")= int 1
  .. .. ..- attr(*, ".Environment")=<environment: R_GlobalEnv> 
  .. .. ..- attr(*, "predvars")= language list(cty, hwy)
  .. .. ..- attr(*, "dataClasses")= Named chr [1:2] "numeric" "numeric"
  .. .. .. ..- attr(*, "names")= chr [1:2] "cty" "hwy"
 - attr(*, "class")= chr "lm"


Regression

lm() - linear models.

mpg_model1$coefficients
(Intercept)         hwy 
  0.8442016   0.6832191 
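Rather than reaching into the list with $, base R also offers extractor functions, which are generally the safer choice across model types. A sketch (ggplot2 is loaded only for the mpg data set):

```r
library(ggplot2)  # provides the mpg data set
mpg_model1 <- lm(cty ~ hwy, data = mpg)

coef(mpg_model1)     # same values as mpg_model1$coefficients
confint(mpg_model1)  # 95% confidence intervals for the coefficients
head(fitted(mpg_model1))     # first few fitted values
head(residuals(mpg_model1))  # first few residuals
```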

Regression

lm() - linear models.

summary(mpg_model1)

Call:
lm(formula = cty ~ hwy, data = mpg)

Residuals:
    Min      1Q  Median      3Q     Max 
-2.9247 -0.7757 -0.0428  0.6965  4.6096 

Coefficients:
            Estimate Std. Error t value Pr(>|t|)    
(Intercept)  0.84420    0.33319   2.534   0.0119 *  
hwy          0.68322    0.01378  49.585   <2e-16 ***
---
Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

Residual standard error: 1.252 on 232 degrees of freedom
Multiple R-squared:  0.9138,    Adjusted R-squared:  0.9134 
F-statistic:  2459 on 1 and 232 DF,  p-value: < 2.2e-16
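The fitted model can also be used for prediction with base R's predict(). A sketch, here on two hypothetical highway values (ggplot2 supplies the mpg data):

```r
library(ggplot2)  # provides the mpg data set
mpg_model1 <- lm(cty ~ hwy, data = mpg)

# Predicted city mpg, with 95% confidence intervals,
# for two hypothetical highway values
predict(mpg_model1,
        newdata  = data.frame(hwy = c(20, 30)),
        interval = "confidence")
```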

The easystats package(s)

The easystats package(s). report::report()

report(mpg_model1)
We fitted a linear model (estimated using OLS) to predict cty with hwy
(formula: cty ~ hwy). The model explains a statistically significant and
substantial proportion of variance (R2 = 0.91, F(1, 232) = 2458.64, p < .001,
adj. R2 = 0.91). The model's intercept, corresponding to hwy = 0, is at 0.84
(95% CI [0.19, 1.50], t(232) = 2.53, p = 0.012). Within this model:

  - The effect of hwy is statistically significant and positive (beta = 0.68, 95%
CI [0.66, 0.71], t(232) = 49.58, p < .001; Std. beta = 0.96, 95% CI [0.92,
0.99])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.

The easystats package(s). report::report()

Extra example model:

mpg_model2 <- lm(cty ~ hwy + drv + displ + cyl, data = mpg)



report(mpg_model2)
We fitted a linear model (estimated using OLS) to predict cty with hwy, drv,
displ and cyl (formula: cty ~ hwy + drv + displ + cyl). The model explains a
statistically significant and substantial proportion of variance (R2 = 0.93,
F(5, 228) = 637.44, p < .001, adj. R2 = 0.93). The model's intercept,
corresponding to hwy = 0, drv = 4, displ = 0 and cyl = 0, is at 4.40 (95% CI
[2.62, 6.18], t(228) = 4.86, p < .001). Within this model:

  - The effect of hwy is statistically significant and positive (beta = 0.64, 95%
CI [0.60, 0.69], t(228) = 26.01, p < .001; Std. beta = 0.90, 95% CI [0.83,
0.97])
  - The effect of drv [f] is statistically significant and negative (beta =
-0.75, 95% CI [-1.20, -0.29], t(228) = -3.25, p = 0.001; Std. beta = -0.18, 95%
CI [-0.28, -0.07])
  - The effect of drv [r] is statistically significant and negative (beta =
-0.97, 95% CI [-1.55, -0.40], t(228) = -3.33, p = 0.001; Std. beta = -0.23, 95%
CI [-0.36, -0.09])
  - The effect of displ is statistically non-significant and negative (beta =
-0.02, 95% CI [-0.36, 0.33], t(228) = -0.09, p = 0.930; Std. beta = -4.65e-03,
95% CI [-0.11, 0.10])
  - The effect of cyl is statistically significant and negative (beta = -0.36,
95% CI [-0.62, -0.11], t(228) = -2.79, p = 0.006; Std. beta = -0.14, 95% CI
[-0.23, -0.04])

Standardized parameters were obtained by fitting the model on a standardized
version of the dataset. 95% Confidence Intervals (CIs) and p-values were
computed using a Wald t-distribution approximation.

The easystats package(s). report::report()

report() works on many objects other than regression models

Example with t.test()

data_for_t.test <- starwars |> 
  filter(species == "Human") |> 
  select(gender, height) |>  
  drop_na()  
  
t.test(height ~ gender, data = data_for_t.test) |> report()
Effect sizes were labelled following Cohen's (1988) recommendations.

The Welch Two Sample t-test testing the difference of height by gender (mean in
group feminine = 163.57, mean in group masculine = 182.39) suggests that the
effect is negative, statistically significant, and large (difference = -18.82,
95% CI [-29.92, -7.72], t(7.81) = -3.93, p = 0.005; Cohen's d = -2.81, 95% CI
[-4.73, -0.81])

The easystats package(s). parameters::model_parameters()

model_parameters(mpg_model1)
Parameter   | Coefficient |   SE |       95% CI | t(232) |      p
-----------------------------------------------------------------
(Intercept) |        0.84 | 0.33 | [0.19, 1.50] |   2.53 | 0.012 
hwy         |        0.68 | 0.01 | [0.66, 0.71] |  49.58 | < .001


model_parameters(mpg_model2)
Parameter   | Coefficient |   SE |         95% CI | t(228) |      p
-------------------------------------------------------------------
(Intercept) |        4.40 | 0.91 | [ 2.62,  6.18] |   4.86 | < .001
hwy         |        0.64 | 0.02 | [ 0.60,  0.69] |  26.01 | < .001
drv [f]     |       -0.75 | 0.23 | [-1.20, -0.29] |  -3.25 | 0.001 
drv [r]     |       -0.97 | 0.29 | [-1.55, -0.40] |  -3.33 | 0.001 
displ       |       -0.02 | 0.18 | [-0.36,  0.33] |  -0.09 | 0.930 
cyl         |       -0.36 | 0.13 | [-0.62, -0.11] |  -2.79 | 0.006 

The easystats package(s). parameters::model_parameters()

Add model information

model_parameters(mpg_model1, 
                 summary = TRUE)
Parameter   | Coefficient |   SE |       95% CI | t(232) |      p
-----------------------------------------------------------------
(Intercept) |        0.84 | 0.33 | [0.19, 1.50] |   2.53 | 0.012 
hwy         |        0.68 | 0.01 | [0.66, 0.71] |  49.58 | < .001

Add reference level

model_parameters(mpg_model2, 
                 include_reference = TRUE)
Parameter   | Coefficient |   SE |         95% CI | t(228) |      p
-------------------------------------------------------------------
(Intercept) |        4.40 | 0.91 | [ 2.62,  6.18] |   4.86 | < .001
hwy         |        0.64 | 0.02 | [ 0.60,  0.69] |  26.01 | < .001
drv [4]     |        0.00 |      |                |        |       
drv [f]     |       -0.75 | 0.23 | [-1.20, -0.29] |  -3.25 | 0.001 
drv [r]     |       -0.97 | 0.29 | [-1.55, -0.40] |  -3.33 | 0.001 
displ       |       -0.02 | 0.18 | [-0.36,  0.33] |  -0.09 | 0.930 
cyl         |       -0.36 | 0.13 | [-0.62, -0.11] |  -2.79 | 0.006 

The easystats package(s). parameters::model_parameters()

plot

model_parameters(mpg_model2) |> 
  plot()

The easystats package(s). parameters::compare_parameters()

compare_parameters(mpg_model1, mpg_model2)
Parameter    |        mpg_model1 |           mpg_model2
-------------------------------------------------------
(Intercept)  | 0.84 (0.19, 1.50) |  4.40 ( 2.62,  6.18)
hwy          | 0.68 (0.66, 0.71) |  0.64 ( 0.60,  0.69)
drv [f]      |                   | -0.75 (-1.20, -0.29)
drv [r]      |                   | -0.97 (-1.55, -0.40)
displ        |                   | -0.02 (-0.36,  0.33)
cyl          |                   | -0.36 (-0.62, -0.11)
-------------------------------------------------------
Observations |               234 |                  234

The easystats package(s). parameters::compare_parameters()

compare_parameters(mpg_model1, mpg_model2) |> 
  plot()

The easystats package(s). parameters::model_parameters()

Robust, heteroskedasticity-consistent standard errors

Read more here

Method | Adjustment Formula                                | Best Use Case
--------------------------------------------------------------------------
HC0    | No adjustment                                     | Large samples
HC1    | Inflation by \(\frac{n}{n-k}\) (default in Stata) | Moderate samples
HC2    | Adjustment by \((1 - h_i)\)                       | Small samples
HC3    | Adjustment by \((1 - h_i)^2\)                     | Very small samples or high leverage
HC4    | Further adjustment for extreme leverage           | Severe leverage issues
model_parameters(mpg_model1, 
                 vcov = "HC1") 
Parameter   | Coefficient |   SE |       95% CI | t(232) |      p
-----------------------------------------------------------------
(Intercept) |        0.84 | 0.38 | [0.10, 1.59] |   2.24 | 0.026 
hwy         |        0.68 | 0.02 | [0.65, 0.72] |  40.18 | < .001
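The vcov argument accepts the other estimators from the table as well; for small samples, HC3 is a common recommendation. A sketch (assumes the parameters and sandwich packages are installed; ggplot2 supplies mpg):

```r
library(parameters)  # model_parameters()
library(ggplot2)     # provides the mpg data set
mpg_model1 <- lm(cty ~ hwy, data = mpg)

# HC3 adjusts by (1 - h_i)^2, the usual small-sample choice
model_parameters(mpg_model1, vcov = "HC3")
```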

The easystats package(s). performance::model_performance()

model_performance(mpg_model1) 
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
773.387 | 773.491 | 783.753 | 0.914 |     0.913 | 1.247 | 1.252



model_performance(mpg_model2)
# Indices of model performance

AIC     |    AICc |     BIC |    R2 | R2 (adj.) |  RMSE | Sigma
---------------------------------------------------------------
721.519 | 722.015 | 745.706 | 0.933 |     0.932 | 1.097 | 1.112


The easystats package(s). performance::compare_performance()

Extra example model:

mpg_model3 <- lm(cty ~ ., data = mpg)



compare_performance(mpg_model1, mpg_model2, mpg_model3) 
# Comparison of Model Performance Indices

Name       | Model | AIC (weights) | AICc (weights) | BIC (weights) |    R2
---------------------------------------------------------------------------
mpg_model1 |    lm | 773.4 (<.001) |  773.5 (<.001) | 783.8 (<.001) | 0.914
mpg_model2 |    lm | 721.5 (<.001) |  722.0 (<.001) | 745.7 (>.999) | 0.933
mpg_model3 |    lm | 609.0 (>.999) |  648.1 (>.999) | 809.4 (<.001) | 0.973

Name       | R2 (adj.) |  RMSE | Sigma
--------------------------------------
mpg_model1 |     0.913 | 1.247 | 1.252
mpg_model2 |     0.932 | 1.097 | 1.112
mpg_model3 |     0.965 | 0.694 | 0.798

The easystats package(s). performance::compare_performance()

compare_performance(mpg_model1, mpg_model2, mpg_model3) |> plot()

The easystats package(s). performance::check_model()

check_model(mpg_model1)

check_model(mpg_model2)

We used lm() to fit linear regression models

  • dependent ~ independent
  • +, *, and : for building your model


We have looked at the following functions from easystats

  • report()
  • model_parameters()
  • model_performance()
  • compare_performance()
  • check_model()


Everything above works with other model types such as

  • glm(, family = binomial) for logistic regression
  • glm(, family = poisson) for Poisson regression
  • lmer() for linear mixed-effects models
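For instance, a logistic regression fits with exactly the same workflow; a minimal sketch using the built-in mtcars data (am is a 0/1 outcome):

```r
# Logistic regression: probability of a manual transmission (am = 1)
# as a function of fuel economy and weight
logit_model <- glm(am ~ mpg + wt, data = mtcars, family = binomial)
summary(logit_model)

# The easystats functions shown above apply unchanged, e.g.
# report(logit_model) or model_parameters(logit_model, exponentiate = TRUE)
```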

Cheatsheets

Stata -> R

SAS -> R

estimatr - an R package for calculating robust SEs, etc.

Honorable mention

The marginaleffects package for R and Python offers a single point of entry to easily interpret the results of over 100 classes of models, using a simple and consistent user interface.

Its benefits include:

  • Powerful: It can compute and plot predictions; comparisons (contrasts, risk ratios, etc.); slopes; and conduct hypothesis and equivalence tests for over 100 different classes of models in R.
  • Simple: All functions share a simple and unified interface.
  • Documented: Thoroughly documented with abundant examples.
  • Efficient: Some operations can be up to 1000 times faster and use 30 times less memory than with the margins package.
  • Valid: When possible, numerical results are checked against alternative software like Stata or other R packages.

link to paper
Free online book

gtsummary

YES! This package can help with regression tables as well

tbl_regression()

tbl_regression()

tbl_regression(mpg_model2)
Characteristic | Beta  | 95% CI       | p-value
-----------------------------------------------
hwy            |  0.64 |  0.60, 0.69  | <0.001
drv            |       |              |
    4          |       |              |
    f          | -0.75 | -1.2, -0.29  | 0.001
    r          | -0.97 | -1.6, -0.40  | 0.001
displ          | -0.02 | -0.36, 0.33  | >0.9
cyl            | -0.36 | -0.62, -0.11 | 0.006
Abbreviation: CI = Confidence Interval

tbl_regression()

tbl_regression(mpg_model2) |> 
  bold_labels() |> 
  bold_levels() |> 
  bold_p() 
Characteristic | Beta  | 95% CI       | p-value
-----------------------------------------------
hwy            |  0.64 |  0.60, 0.69  | <0.001
drv            |       |              |
    4          |       |              |
    f          | -0.75 | -1.2, -0.29  | 0.001
    r          | -0.97 | -1.6, -0.40  | 0.001
displ          | -0.02 | -0.36, 0.33  | >0.9
cyl            | -0.36 | -0.62, -0.11 | 0.006
Abbreviation: CI = Confidence Interval

tbl_regression()

tbl_regression(mpg_model2) |> 
  bold_labels() |> 
  italicize_levels() |> 
  bold_p() |> 
  add_glance_source_note()
Characteristic | Beta  | 95% CI       | p-value
-----------------------------------------------
hwy            |  0.64 |  0.60, 0.69  | <0.001
drv            |       |              |
    4          |       |              |
    f          | -0.75 | -1.2, -0.29  | 0.001
    r          | -0.97 | -1.6, -0.40  | 0.001
displ          | -0.02 | -0.36, 0.33  | >0.9
cyl            | -0.36 | -0.62, -0.11 | 0.006
Abbreviation: CI = Confidence Interval
R² = 0.933; Adjusted R² = 0.932; Sigma = 1.11; Statistic = 637; p-value = <0.001; df = 5; Log-likelihood = -354; AIC = 722; BIC = 746; Deviance = 282; Residual df = 228; No. Obs. = 234

tbl_merge()

t1 <- tbl_regression(mpg_model1) |> 
  bold_labels() |> 
  italicize_levels() |> 
  bold_p() 

t2 <- tbl_regression(mpg_model2) |> 
  bold_labels() |> 
  italicize_levels() |> 
  bold_p() 

tbl_merge()

tbl_merge(
  tbls = list(t1, 
              t2)
)
               |          Table 1            |           Table 2
Characteristic | Beta | 95% CI     | p-value | Beta  | 95% CI       | p-value
------------------------------------------------------------------------------
hwy            | 0.68 | 0.66, 0.71 | <0.001  |  0.64 |  0.60, 0.69  | <0.001
drv            |      |            |         |       |              |
    4          |      |            |         |       |              |
    f          |      |            |         | -0.75 | -1.2, -0.29  | 0.001
    r          |      |            |         | -0.97 | -1.6, -0.40  | 0.001
displ          |      |            |         | -0.02 | -0.36, 0.33  | >0.9
cyl            |      |            |         | -0.36 | -0.62, -0.11 | 0.006
Abbreviation: CI = Confidence Interval

tbl_merge()

Titles on models:

tbl_merge(
  tbls = list(t1, 
              t2), 
  tab_spanner = c("Simple", "WOW model!!!")
)
               |          Simple             |        WOW model!!!
Characteristic | Beta | 95% CI     | p-value | Beta  | 95% CI       | p-value
------------------------------------------------------------------------------
hwy            | 0.68 | 0.66, 0.71 | <0.001  |  0.64 |  0.60, 0.69  | <0.001
drv            |      |            |         |       |              |
    4          |      |            |         |       |              |
    f          |      |            |         | -0.75 | -1.2, -0.29  | 0.001
    r          |      |            |         | -0.97 | -1.6, -0.40  | 0.001
displ          |      |            |         | -0.02 | -0.36, 0.33  | >0.9
cyl            |      |            |         | -0.36 | -0.62, -0.11 | 0.006
Abbreviation: CI = Confidence Interval


tbl_merge()

Titles are read as markdown:

tbl_merge(
  tbls = list(t1, 
              t2), 
  tab_spanner = c("**Simple**", "**WOW model!!!**")
)
               |          Simple             |        WOW model!!!
Characteristic | Beta | 95% CI     | p-value | Beta  | 95% CI       | p-value
------------------------------------------------------------------------------
hwy            | 0.68 | 0.66, 0.71 | <0.001  |  0.64 |  0.60, 0.69  | <0.001
drv            |      |            |         |       |              |
    4          |      |            |         |       |              |
    f          |      |            |         | -0.75 | -1.2, -0.29  | 0.001
    r          |      |            |         | -0.97 | -1.6, -0.40  | 0.001
displ          |      |            |         | -0.02 | -0.36, 0.33  | >0.9
cyl            |      |            |         | -0.36 | -0.62, -0.11 | 0.006
Abbreviation: CI = Confidence Interval


Plot your models

If you use plot() on a gtsummary table you get a ggplot object

t2 |> plot()

Plot your models

You can modify the ggplot object using functions you already know.

t2 |> plot() +
  labs(
    title = "This plot is a regression model"
  )

gtsummary



Read more about regression and gtsummary

tbl_regression() tutorial

Thanks!