---
title: "Regression"
author: "Steen Flammild Harsted & Søren O´Neill"
date: today
format:
html:
toc: true
toc-depth: 2
number-sections: true
number-depth: 2
code-fold: true
code-summary: "Show the code"
code-tools: true
execute:
eval: false
message: false
warning: false
---
<br><br>
# Presentation
You can download the course slides for this section <a href="./presentation_regression.html" download>here</a>
<div>
```{=html}
<iframe class="slide-deck" src="presentation_regression.html" width=90% ></iframe>
```
</div>
```{r setup}
#| include: false
#| echo: false
#| eval: true
library(tidyverse)
library(here)
library(gtsummary)
library(AER)
library(easystats)
library(GGally)
library(gt)
data("PhDPublications")
library(palmerpenguins)
source(here("scripts", "01_import.R"))
```
```{r}
#| eval: true
#| column: margin
#| echo: false
knitr::include_graphics(here::here("img", "sherif.png"))
```
## Getting Started {.unnumbered}
* Make sure that you are working in your course project
* Create a new quarto document and name it "regression.qmd"
* Insert a code chunk and load 2 important libraries
* Insert a new code chunk and write `source(here("scripts", "01_import.R"))` in it (see the sketch below the YAML tip)
* Write a short headline to each code chunk
* Change the YAML header to style your document output.
::: {.callout-tip collapse="true"}
### The YAML header can look like this
````{verbatim}
---
title: "TITLE"
subtitle: "SUBTITLE"
author: "ME"
date: today
format:
html:
toc: true
toc-depth: 2
embed-resources: true
number-sections: true
number-depth: 2
code-fold: true
code-summary: "Show the code"
code-tools: true
execute:
message: false
warning: false
---
````
:::
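A minimal sketch of what the first two code chunks in `regression.qmd` could look like (the headlines are just comments; word them however you like):

```{r}
# Load packages
library(tidyverse) # import, tidy, transform, visualize
library(here)      # filepath control
```

```{r}
# Import data
source(here("scripts", "01_import.R"))
```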
<br><br>
<br>
### Add the `easystats` and `gtsummary` packages to the code chunk where you have your library calls.
If you need to install any of the packages:
* use `install.packages(c("package_name", "package_name"))` to download the packages.
* This is done in the console and NOT in your script.
```{r}
#| eval: false
library(tidyverse) # import, tidy, transform, visualize
library(here) # filepath control
library(gtsummary) # regression tables (with stats)
library(easystats) # everything regression
```
<br><br>
<br><br>
# The `palmerpenguins` dataset
<br><br>
### Add the `palmerpenguins` package to the code chunk where you have your library calls.
If you need to install the package:
* use `install.packages("palmerpenguins")` to download the package.
* This is done in the console and NOT in your script.
```{r}
#| eval: false
library(palmerpenguins) # Has the 'penguins' data
```
If you want to see the `penguins` data in your environment pane, you can add it as an object. You don't need to do this.
```{r}
penguins <- penguins
```
<br><br>
### Read the documentation for `palmerpenguins` [here](https://allisonhorst.github.io/palmerpenguins/)
<br><br>
### Using the `penguins` dataset, let `body_mass_g` be your dependent variable (y) and investigate how changes in the other variables in `penguins` affect `body_mass_g`
* Build at least two models
* Consider making the formula for one of the models `body_mass_g ~ .` - What does this do?
* Store them as (`pen_model1`, `pen_model2`, etc.)
* Remember that we are just practicing coding here. Today is about **How do we code it**. The purpose of this particular exercise is to introduce you to a few very useful functions from the `easystats` package(s), and to get you started with using `lm()`.
```{r}
pen_model1 <- lm(body_mass_g ~ sex + species, data = penguins)
pen_model2 <- lm(body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm, data = penguins)
pen_model3 <- lm(body_mass_g ~ ., data = penguins)
```
<br><br>
## Base R
* Explore the elements in one of your models using `$`
* Type `pen_model1$` and explore the drop down menu
* `coefficients(pen_model1)`
* Print a `summary()` of one of your models
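A minimal sketch of these base R explorations, assuming your first model is stored as `pen_model1` (as in the chunk above):

```{r}
# Access a single element with $ (RStudio's drop-down lists them all)
pen_model1$coefficients

# The accessor function does the same job
coefficients(pen_model1)

# Classic regression summary: estimates, standard errors, t-values, R-squared
summary(pen_model1)
```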
<br><br>
## `easystats`
```{r}
#| echo: false
#| eval: true
#| column: margin
knitr::include_graphics(here("img", "easystats.png"))
```
* Check the assumptions of your models (`check_model`)
- Discuss if/why some assumptions are violated
- What could be done to remedy this?
```{r}
check_model(pen_model1)
check_model(pen_model2)
check_model(pen_model3)
```
\ \
* Investigate the model parameters of your models `model_parameters()`
* add summary info to one or more of the outputs
* Use robust standard errors for one or more of the model outputs
```{r results='hide'}
model_parameters(pen_model1)
model_parameters(pen_model2, vcov = "HC1")
model_parameters(pen_model3, summary = TRUE)
```
\ \
* Compare your models' parameters `compare_parameters()`
```{r}
compare_parameters(pen_model1, pen_model2, pen_model3)
```
* Compare your models' parameters `compare_parameters()` and plot them
```{r}
compare_parameters(pen_model1, pen_model2, pen_model3) |>
plot()
```
\ \
* Compare your models `compare_models()`
```{r results='hide'}
compare_models(pen_model1, pen_model2, pen_model3)
```
\ \
* Compare the performance of your models `compare_performance()`
```{r results='hide'}
compare_performance(pen_model1, pen_model2, pen_model3)
```
\ \
* Plot a comparison of your models `compare_performance() %>% plot()`
```{r results='hide', fig.show='hide'}
compare_performance(pen_model1, pen_model2, pen_model3) %>% plot()
```
\ \
* Make a report of the model you think is the best
```{r results='hide'}
report(pen_model3)
```
<br><br>
```{r}
#| echo: false
#| eval: true
#| column: margin
knitr::include_graphics(here("img", "gtsummary.png"))
```
## `gtsummary`
<br><br>
* Use `tbl_regression()` to build a table for one of your models.
* Style your table
```{r}
pen_t1 <- pen_model1 |>
tbl_regression() |>
bold_labels() |>
italicize_levels() |>
bold_p()
pen_t1
```
<br><br>
* Use `tbl_regression()` to build a table for two more of your models.
* Style them the same way as you did before
```{r}
pen_t2 <- pen_model2 |>
tbl_regression() |>
bold_labels() |>
italicize_levels() |>
bold_p()
pen_t3 <- pen_model3 |>
tbl_regression() |>
bold_labels() |>
italicize_levels() |>
bold_p()
```
<br><br>
* Use `tbl_merge()` to merge your tables
* Add a caption to the merged table
```{r}
pen_t1_t2_t3 <- tbl_merge(
tbls = list(
pen_t1, pen_t2, pen_t3
)
) |>
modify_caption(
caption = "**Three regression models modeling the body mass of penguins**"
)
pen_t1_t2_t3
```
<br><br>
```{r}
#| include: false
#| eval: true
if(file.exists(here("tables", "my_regression_table.docx"))){
file.remove(here("tables", "my_regression_table.docx"))
}
```
* Save your merged table
```{r}
pen_t1_t2_t3 |>
as_gt() |>
gtsave(here("tables", "my_regression_table.docx"))
```
# The `PhDPublications` dataset
* Install the `AER` package and load it (and add the library call to your library code chunk); see the sketch after this list.
* `AER` contains >100 different datasets (to see a list: `data(package = 'AER')`).
* Load the dataset `PhDPublications`, by writing `data(PhDPublications)`
* Read the documentation for `PhDPublications`
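A minimal sketch of these steps (the `install.packages()` call is run once, in the console, not in your script):

```{r}
# install.packages("AER")   # run once, in the console
library(AER)                # add this call to your library code chunk too

data(package = "AER")       # list the >100 datasets shipped with AER
data("PhDPublications")     # load the PhD publications data
?PhDPublications            # open its documentation
```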
<BR><BR>
### Using `PhDPublications`, investigate what influenced the number of `articles` PhD students published during the last 3 years of their PhD.
* We want to predict `articles`. What type of data is this?
- <details><summary>ANSWER</summary>
R will tell you this is an integer, but if you read the documentation you will realize that the numbers represent count (or rate) data: the number of articles within a three-year period. We typically need to model this type of data with an approach other than ordinary linear regression. Let's investigate the distribution of `articles`.
</details>
* Investigate the distribution of the `articles` variable (see plot below).
- Does it look normal?
- Can you recognize the type of distribution?
- <details><summary>ANSWER</summary>
This looks like a Poisson distribution, and you should not model it using ordinary linear regression with `lm()`. Instead you should use `glm(..., family = poisson())`.
</details>
```{r}
PhDPublications %>%
ggplot(aes(x = articles))+
geom_density(bw = 0.5, fill = "lightgrey", linewidth = 1)+
theme_modern()
```
* Build at least two models
* Consider making the formula for one of the models `articles ~ .` - What does this do?
* Store them as (`phd_model1`, `phd_model2`, etc.)
* Remember that we are just practicing coding here. Today is about **How do we code it**. The purpose of this particular exercise is to show you that you can use the same `easystats` functions to work with different model types. The output will automagically adjust. This is a big advantage over other packages and programs.
```{r results='hide'}
phd_model1 <- PhDPublications %>% glm(articles ~ gender, data = ., family = poisson())
phd_model2 <- PhDPublications %>% glm(articles ~ gender * kids + mentor, data = ., family = poisson())
phd_model3 <- PhDPublications %>% glm(articles ~ ., data = ., family = poisson())
```
<br><br>
### Base R
* Explore the elements in one of your models using `$`
* Type `phd_model1$` and explore the drop down menu
* `coefficients(phd_model1)`
* Print a `summary()` of one of your models
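As before, a minimal sketch, assuming your first Poisson model is stored as `phd_model1`:

```{r}
# Access a single element with $ (RStudio's drop-down lists them all)
phd_model1$coefficients

# Same result via the accessor function
coefficients(phd_model1)

# GLM summary: coefficients on the log scale, standard errors, deviance
summary(phd_model1)
```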
<br><br>
```{r}
#| echo: false
#| eval: true
#| column: margin
knitr::include_graphics(here("img", "easystats.png"))
```
### `easystats`
\ \
* Check the assumptions of your models (`check_model`)
- Discuss if/why some assumptions are violated
- What could be done to remedy this?
```{r}
check_model(phd_model1)
check_model(phd_model2)
check_model(phd_model3)
```
\ \
* Investigate the model parameters of your models `model_parameters()`
```{r}
model_parameters(phd_model1)
model_parameters(phd_model2)
model_parameters(phd_model3)
```
\ \
* Compare your models `compare_models()`
```{r results='hide'}
compare_models(phd_model1, phd_model2, phd_model3,
exponentiate = TRUE # To have coefficients as ratios
)
```
\ \
* Compare the performance of your models `compare_performance()`
```{r results='hide'}
compare_performance(phd_model1, phd_model2, phd_model3)
```
\ \
* Plot a comparison of your models `compare_performance() %>% plot()`
```{r results='hide', fig.show='hide'}
compare_performance(phd_model1, phd_model2, phd_model3) %>% plot()
```
\ \
* Make a report of the model you think is the best
```{r results='hide'}
report(phd_model3, exponentiate = TRUE)
```
<br><br>
### Want More `easystats`? {-}
For more information on `easystats` read [here](https://github.com/easystats/easystats).
<br><br>
### Improve your home assignment
* Return to your home assignment and add a regression analysis if it makes sense
* Remember to change project (top right corner in Rstudio)
* Using the menu in the top right corner, you can switch between your course project and your home assignment