---
title: "Regression"
author: "Steen Flammild Harsted & Søren O´Neill"
date: today
format:
html:
toc: true
toc-depth: 2
number-sections: true
number-depth: 2
code-fold: true
code-summary: "Show the code"
code-tools: true
execute:
eval: false
message: false
warning: false
---
<br><br>
# Presentation
You can download the course slides for this section <a href="./presentation_regression.html" download>here</a>
<div>
```{=html}
<iframe class="slide-deck" src="presentation_regression.html" width=90% ></iframe>
```
</div>
```{r setup}
#| include: false
#| echo: false
#| eval: true
library(tidyverse)
library(here)
library(gtsummary)
library(AER)
library(easystats)
library(GGally)
library(gt)
data("PhDPublications")
library(palmerpenguins)
source(here("scripts", "01_import.R"))
```
```{r}
#| eval: true
#| column: margin
#| echo: false
knitr::include_graphics(here::here("img", "sherif.png"))
```
## Getting Started {.unnumbered}
* Make sure that you are working in your course project
* Create a new quarto document and name it "regression.qmd"
* Insert a code chunk and load 2 important libraries
* Insert a new code chunk and write `source(here("scripts", "01_import.R"))` in it (see the sketch below the YAML tip)
* Write a short headline to each code chunk
* Change the YAML header to style your document output.
::: {.callout-tip collapse="true"}
### The YAML header can look like this
````{verbatim}
---
title: "TITLE"
subtitle: "SUBTITLE"
author: "ME"
date: today
format:
html:
toc: true
toc-depth: 2
embed-resources: true
number-sections: true
number-depth: 2
code-fold: true
code-summary: "Show the code"
code-tools: true
execute:
message: false
warning: false
---
````
:::
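A minimal sketch of what the first two code chunks in `regression.qmd` could look like (the headlines are just comments; word them however you like):

```{r}
# Load packages
library(tidyverse) # import, tidy, transform, visualize
library(here)      # filepath control
```

```{r}
# Import data
source(here("scripts", "01_import.R"))
```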
<br><br>
<br>
### Add the `easystats` and `gtsummary` packages to the code chunk where you have your library calls.
If you need to install any of the packages:
* use `install.packages(c("package_name", "package_name"))` to download the packages.
* This is done in the console and NOT in your script.
```{r}
#| eval: false
library(tidyverse) # import, tidy, transform, visualize
library(here) # filepath control
library(gtsummary) # regression tables (with stats)
library(easystats) # everything regression
```
<br><br>
<br><br>
# The `palmerpenguins` dataset
<br><br>
### Add the `palmerpenguins` package to the code chunk where you have your library calls.
If you need to install the package:
* use `install.packages("palmerpenguins")` to download the package.
* This is done in the console and NOT in your script.
```{r}
#| eval: false
library(palmerpenguins) # Has the 'penguins' data
```
If you want to see the `penguins` data in your environment pane, you can add it as an object. You don't need to do this.
```{r}
penguins <- penguins
```
<br><br>
### Read the documentation for `palmerpenguins` [here](https://allisonhorst.github.io/palmerpenguins/)
<br><br>
### Using the `penguins` dataset, let `body_mass_g` be your dependent variable (y) and investigate how changes in the other variables in `penguins` affect `body_mass_g`
* Build at least two models
* Consider making the formula for one of the models `body_mass_g ~ .` - What does this do?
* Store them as (`pen_model1`, `pen_model2`, etc.)
* Remember that we are just practicing coding here. Today is about **How do we code it**. The purpose of this particular exercise is to introduce you to a few very useful functions from the `easystats` package(s), and to get you started with using `lm()`.
```{r}
pen_model1 <- lm(body_mass_g ~ sex + species, data = penguins)
pen_model2 <- lm(body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm, data = penguins)
pen_model3 <- lm(body_mass_g ~ ., data = penguins)
```
<br><br>
## Base R
* Explore the elements in one of your models using `$`
* Type `pen_model1$` and explore the drop down menu
* `coefficients(pen_model1)`
* Print a `summary()` of one of your models
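A minimal sketch of these base R explorations, assuming your first model is stored as `pen_model1` (as in the chunk above):

```{r}
# Access a single element with $ (RStudio's drop-down lists them all)
pen_model1$coefficients

# The accessor function does the same job
coefficients(pen_model1)

# Classic regression summary: estimates, standard errors, t-values, R-squared
summary(pen_model1)
```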
<br><br>
## `easystats`
```{r}
#| echo: false
#| eval: true
#| column: margin
knitr::include_graphics(here("img", "easystats.png"))
```
* Check the assumptions of your models (`check_model`)
- Discuss if/why some assumptions are violated
- What could be done to remedy this?
```{r}
check_model(pen_model1)
check_model(pen_model2)
check_model(pen_model3)
```
\ \
* Investigate the model parameters of your models `model_parameters()`
* add summary info to one or more of the outputs
* Use robust standard errors for one or more of the model outputs
```{r results='hide'}
model_parameters(pen_model1)
model_parameters(pen_model2, vcov = "HC1")
model_parameters(pen_model3, summary = TRUE)
```
\ \
* Compare your models' parameters `compare_parameters()`
```{r}
compare_parameters(pen_model1, pen_model2, pen_model3)
```
* Compare your models' parameters `compare_parameters()` and plot them
```{r}
compare_parameters(pen_model1, pen_model2, pen_model3) |>
plot()
```
\ \
* Compare your models `compare_models()`
```{r results='hide'}
compare_models(pen_model1, pen_model2, pen_model3)
```
\ \
* Compare the performance of your models `compare_performance()`
```{r results='hide'}
compare_performance(pen_model1, pen_model2, pen_model3)
```
\ \
* Plot a comparison of your models `compare_performance() %>% plot()`
```{r results='hide', fig.show='hide'}
compare_performance(pen_model1, pen_model2, pen_model3) %>% plot()
```
\ \
* Make a report of the model you think is the best
```{r results='hide'}
report(pen_model3)
```
<br><br>
```{r}
#| echo: false
#| eval: true
#| column: margin
knitr::include_graphics(here("img", "gtsummary.png"))
```
## `gtsummary`
<br><br>
* Use `tbl_regression()` to build a table for one of your models.
* Style your table
```{r}
pen_t1 <- pen_model1 |>
tbl_regression() |>
bold_labels() |>
italicize_levels() |>
bold_p()
pen_t1
```
<br><br>
* Use `tbl_regression()` to build a table for two more of your models.
* Style them the same way as you did before
```{r}
pen_t2 <- pen_model2 |>
tbl_regression() |>
bold_labels() |>
italicize_levels() |>
bold_p()
pen_t3 <- pen_model3 |>
tbl_regression() |>
bold_labels() |>
italicize_levels() |>
bold_p()
```
<br><br>
* Use `tbl_merge()` to merge your tables
* Add a caption to the merged table
```{r}
pen_t1_t2_t3 <- tbl_merge(
tbls = list(
pen_t1, pen_t2, pen_t3
)
) |>
modify_caption(
caption = "**Three regression models modeling the body mass of penguins**"
)
pen_t1_t2_t3
```
<br><br>
```{r}
#| include: false
#| eval: true
if(file.exists(here("tables", "my_regression_table.docx"))){
file.remove(here("tables", "my_regression_table.docx"))
}
```
* Save your merged table
```{r}
pen_t1_t2_t3 |>
as_gt() |>
gtsave(here("tables", "my_regression_table.docx"))
```
# The `PhDPublications` dataset
* Install the `AER` package and load it (and add the library call to your library code chunk); see the sketch after this list.
* `AER` contains >100 different datasets (to see a list: `data(package = 'AER')`).
* Load the dataset `PhDPublications`, by writing `data(PhDPublications)`
* Read the documentation for `PhDPublications`
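A minimal sketch of these steps (the `install.packages()` call is run once, in the console, not in your script):

```{r}
# install.packages("AER")   # run once, in the console
library(AER)                # add this call to your library code chunk too

data(package = "AER")       # list the >100 datasets shipped with AER
data("PhDPublications")     # load the PhD publications data
?PhDPublications            # open its documentation
```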
<BR><BR>
### Using `PhDPublications`, investigate what influenced the number of `articles` PhD students published during the last 3 years of their PhD.
* We want to predict `articles`. What type of data is this?
- <details><summary>ANSWER</summary>
R will tell you this is an integer, but if you read the documentation you will realize that the numbers represent count (or rate) data: the number of articles within a three-year period. We typically need to model this type of data with an approach other than ordinary linear regression. Let's investigate the distribution of `articles`.
</details>
* Investigate the distribution of the `articles` variable (see plot below).
- Does it look normal?
- Can you recognize the type of distribution?
- <details><summary>ANSWER</summary>
This looks like a Poisson distribution, and you should not model it using ordinary linear regression with `lm()`. Instead you should use `glm(..., family = poisson())`.
</details>
```{r}
PhDPublications %>%
ggplot(aes(x = articles))+
geom_density(bw = 0.5, fill = "lightgrey", linewidth = 1)+
theme_modern()
```
* Build at least two models
* Consider making the formula for one of the models `articles ~ .` - What does this do?
* Store them as (`phd_model1`, `phd_model2`, etc.)
* Remember that we are just practicing coding here. Today is about **How do we code it**. The purpose of this particular exercise is to show you that you can use the same `easystats` functions to work with different model types. The output will automagically adjust. This is a big advantage over other packages and programs.
```{r results='hide'}
phd_model1 <- PhDPublications %>% glm(articles ~ gender, data = ., family = poisson())
phd_model2 <- PhDPublications %>% glm(articles ~ gender * kids + mentor, data = ., family = poisson())
phd_model3 <- PhDPublications %>% glm(articles ~ ., data = ., family = poisson())
```
<br><br>
### Base R
* Explore the elements in one of your models using `$`
* Type `phd_model1$` and explore the drop down menu
* `coefficients(phd_model1)`
* Print a `summary()` of one of your models
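As before, a minimal sketch, assuming your first Poisson model is stored as `phd_model1`:

```{r}
# Access a single element with $ (RStudio's drop-down lists them all)
phd_model1$coefficients

# Same result via the accessor function
coefficients(phd_model1)

# GLM summary: coefficients on the log scale, standard errors, deviance
summary(phd_model1)
```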
<br><br>
```{r}
#| echo: false
#| eval: true
#| column: margin
knitr::include_graphics(here("img", "easystats.png"))
```
### `easystats`
\ \
* Check the assumptions of your models (`check_model`)
- Discuss if/why some assumptions are violated
- What could be done to remedy this?
```{r}
check_model(phd_model1)
check_model(phd_model2)
check_model(phd_model3)
```
\ \
* Investigate the model parameters of your models `model_parameters()`
```{r}
model_parameters(phd_model1)
model_parameters(phd_model2)
model_parameters(phd_model3)
```
\ \
* Compare your models `compare_models()`
```{r results='hide'}
compare_models(phd_model1, phd_model2, phd_model3,
exponentiate = TRUE # To have coefficients as ratios
)
```
\ \
* Compare the performance of your models `compare_performance()`
```{r results='hide'}
compare_performance(phd_model1, phd_model2, phd_model3)
```
\ \
* Plot a comparison of your models `compare_performance() %>% plot()`
```{r results='hide', fig.show='hide'}
compare_performance(phd_model1, phd_model2, phd_model3) %>% plot()
```
\ \
* Make a report of the model you think is the best
```{r results='hide'}
report(phd_model3, exponentiate = TRUE)
```
<br><br>
### Want More `easystats`? {-}
For more information on `easystats` read [here](https://github.com/easystats/easystats).
<br><br>
### Improve your home assignment
* Return to your home assignment and add a regression analysis if it makes sense
* Remember to change project (top right corner in Rstudio)
* Using the menu in the top right corner, you can switch between your course project and your home assignment