---
title: "Regression"
author: "Steen Flammild Harsted & Søren O´Neill"
date: today
format:
  html:
    toc: true
    toc-depth: 2
    number-sections: true
    number-depth: 2
    code-fold: true
    code-summary: "Show the code"
    code-tools: true
execute:
  eval: false
  message: false
  warning: false
---
<br><br>
# Presentation
You can download the course slides for this section <a href="./presentation_regression.html" download>here</a>
<div>
```{=html}
<iframe class="slide-deck" src="presentation_regression.html" width=90% ></iframe>
```
</div>
```{r setup}
#| include: false
#| echo: false
#| eval: true
library(tidyverse)
library(here)
library(gtsummary)
library(AER)
library(easystats)
library(GGally)
library(gt)
data("PhDPublications")
library(palmerpenguins)
source(here("scripts", "01_import.R"))
```
```{r}
#| eval: true
#| column: margin
#| echo: false
knitr::include_graphics(here::here("img", "sherif.png"))
```
## Getting Started {.unnumbered}
* Make sure that you are working in your course project
* Create a new quarto document and name it "regression.qmd"
* Insert a code chunk and load two important libraries (`tidyverse` and `here`)
* Insert a new code chunk and write `source(here("scripts", "01_import.R"))` in it (see the sketch below the tip)
* Write a short headline to each code chunk
* Change the YAML header to style your document output.
::: {.callout-tip collapse="true"}
### The YAML header can look like this
````{verbatim}
---
title: "TITLE"
subtitle: "SUBTITLE"
author: "ME"
date: today
format:
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
    number-sections: true
    number-depth: 2
    code-fold: true
    code-summary: "Show the code"
    code-tools: true
execute:
  message: false
  warning: false
---
````
:::
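The contents of those two code chunks could look something like this (a minimal sketch; the comments are just suggestions):

```{r}
#| eval: false
# Chunk 1: load the libraries
library(tidyverse) # import, tidy, transform, visualize
library(here)      # filepath control

# Chunk 2: import the course data
source(here("scripts", "01_import.R"))
```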
<br><br>
<br>
### Add the `easystats` and `gtsummary` packages to the code chunk where you have your library calls.
If you need to install any of the packages:
* use `install.packages(c("package_name", "package_name"))` to download the packages.
* This is done in the console and NOT in your script.
```{r}
#| eval: false
library(tidyverse) # import, tidy, transform, visualize
library(here) # filepath control
library(gtsummary) # regression tables (with stats)
library(easystats) # everything regression
```
<br><br>
<br><br>
# The `palmerpenguins` dataset
<br><br>
### Add the `palmerpenguins` package to the code chunk where you have your library calls.
If you need to install the package:
* use `install.packages("palmerpenguins")` to download the package.
* This is done in the console and NOT in your script.
```{r}
#| eval: false
library(palmerpenguins) # Has the 'penguins' data
```
If you want to see the `penguins` data in your environment pane you can add it as an object. You don't need to do this.
```{r}
penguins <- penguins
```
<br><br>
### Read the documentation for `palmerpenguins` [here](https://allisonhorst.github.io/palmerpenguins/)
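You can also open the help pages directly from R (a small sketch using base R help functions):

```{r}
#| eval: false
?penguins                          # help page for the penguins dataset
help(package = "palmerpenguins")   # overview of the package
```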
<br><br>
### Using the `penguins` dataset, let `body_mass_g` be your dependent variable (y) and investigate how the other variables in `penguins` affect `body_mass_g`
* Build at least two models
* Consider making the formula for one of the models `body_mass_g ~ .` - What does this do?
* Store them as objects (`pen_model1`, `pen_model2`, etc.)
* Remember that we are just practicing coding here. Today is about **How do we code it**. The purpose of this particular exercise is to introduce you to a few very useful functions from the `easystats` package(s), and to get you started with using `lm()`.
```{r}
pen_model1 <- lm(body_mass_g ~ sex + species, data = penguins)
pen_model2 <- lm(body_mass_g ~ bill_depth_mm + bill_length_mm + flipper_length_mm, data = penguins)
pen_model3 <- lm(body_mass_g ~ ., data = penguins)
```
<br><br>
## Base R
* Explore the elements in one of your models using `$`
  * Type `pen_model1$` and explore the drop-down menu
  * Try `coefficients(pen_model1)`
* Print a `summary()` of one of your models (see the sketch below)
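A minimal sketch of these base R calls, assuming the `pen_model1` object from the solution chunk above:

```{r}
#| eval: false
# Individual elements of the fitted lm object
pen_model1$coefficients       # same values as coefficients(pen_model1)
head(pen_model1$residuals)    # residuals for the first observations

# Coefficients, standard errors, t-values, and R-squared in one printout
summary(pen_model1)
```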
<br><br>
## `easystats`
```{r}
#| echo: false
#| eval: true
#| column: margin
knitr::include_graphics(here("img", "easystats.png"))
```
* Check the assumptions of your models with `check_model()`
  - Discuss if/why some assumptions are violated
  - What could be done to remedy this?
```{r}
check_model(pen_model1)
check_model(pen_model2)
check_model(pen_model3)
```
\ \
* Investigate the parameters of your models with `model_parameters()`
  * Add summary info to one or more of the outputs
  * Use robust standard errors for one or more of the model outputs
```{r results='hide'}
model_parameters(pen_model1)
model_parameters(pen_model2, vcov = "HC1")
model_parameters(pen_model3, summary = TRUE)
```
\ \
* Compare your models' parameters with `compare_parameters()`
```{r}
compare_parameters(pen_model1, pen_model2, pen_model3)
```
* Compare your models' parameters with `compare_parameters()` and plot the comparison
```{r}
compare_parameters(pen_model1, pen_model2, pen_model3) |>
  plot()
```
\ \
* Compare your models with `compare_models()`
```{r results='hide'}
compare_models(pen_model1, pen_model2, pen_model3)
```
\ \
* Compare the performance of your models with `compare_performance()`
```{r results='hide'}
compare_performance(pen_model1, pen_model2, pen_model3)
```
\ \
* Plot a comparison of your models with `compare_performance() %>% plot()`
```{r results='hide', fig.show='hide'}
compare_performance(pen_model1, pen_model2, pen_model3) %>% plot()
```
\ \
* Make a report of the model you think is the best
```{r results='hide'}
report(pen_model3)
```
<br><br>
```{r}
#| echo: false
#| eval: true
#| column: margin
knitr::include_graphics(here("img", "gtsummary.png"))
```
## `gtsummary`
<br><br>
* Use `tbl_regression()` to build a table for one of your models.
* Style your table
```{r}
pen_t1 <- pen_model1 |>
  tbl_regression() |>
  bold_labels() |>
  italicize_levels() |>
  bold_p()

pen_t1
```
<br><br>
* Use `tbl_regression()` to build a table for two more of your models.
* Style them the same way as you did before
```{r}
pen_t2 <- pen_model2 |>
  tbl_regression() |>
  bold_labels() |>
  italicize_levels() |>
  bold_p()

pen_t3 <- pen_model3 |>
  tbl_regression() |>
  bold_labels() |>
  italicize_levels() |>
  bold_p()
```
<br><br>
* Use `tbl_merge()` to merge your tables
* Add a caption to the merged table
```{r}
pen_t1_t2_t3 <- tbl_merge(
  tbls = list(
    pen_t1, pen_t2, pen_t3
  )
) |>
  modify_caption(
    caption = "**Three regression models modeling the body mass of penguins**"
  )

pen_t1_t2_t3
```
<br><br>
```{r}
#| include: false
#| eval: true
if (file.exists(here("tables", "my_regression_table.docx"))) {
  file.remove(here("tables", "my_regression_table.docx"))
}
```
* Save your merged table
```{r}
pen_t1_t2_t3 |>
  as_gt() |>
  gtsave(here("tables", "my_regression_table.docx"))
```
# The `PhDPublications` dataset
* Install the `AER` package and load it (add the library call to your library code chunk)
  * `AER` contains more than 100 different datasets (to see a list: `data(package = 'AER')`).
* Load the dataset `PhDPublications` by writing `data(PhDPublications)`
* Read the documentation for `PhDPublications` (see the sketch below)
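Getting the dataset into your session could look something like this (a sketch; remember that the `library(AER)` call belongs in your library chunk):

```{r}
#| eval: false
library(AER)              # also add this to your library code chunk
# data(package = "AER")   # lists all datasets shipped with AER
data(PhDPublications)     # loads the dataset into your environment
?PhDPublications          # opens the documentation
```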
<BR><BR>
### Using `PhDPublications`, investigate what influenced the number of `articles` PhD students published during the last 3 years of their PhD.
* We want to predict `articles`. What type of data is this?
  - <details><summary>ANSWER</summary>
    R will tell you this is an integer, but if you read the documentation you will realize that the numbers represent count (or rate) data (the number of articles within a three-year period). We typically need to model this type of data with a different approach than linear regression. Let's investigate the distribution of `articles`.
    </details>
* Investigate the distribution of the `articles` variable (see plot below).
  - Does it look normal?
  - Can you recognize the type of distribution?
  - <details><summary>ANSWER</summary>
    This looks like a Poisson distribution, and you should not model it with ordinary linear regression using `lm()`. Instead you should use `glm(..., family = poisson())`.
    </details>
```{r}
#| eval: true
PhDPublications %>%
  ggplot(aes(x = articles)) +
  geom_density(bw = 0.5, fill = "lightgrey", linewidth = 1) +
  theme_modern()
```
* Build at least two models
* Consider making the formula for one of the models `articles ~ .` - What does this do?
* Store them as objects (`phd_model1`, `phd_model2`, etc.)
* Remember that we are just practicing coding here. Today is about **How do we code it**. The purpose of this particular exercise is to show you that you can use the same `easystats` functions to work with different model types. The output will automagically adjust. This is a big advantage over other packages and programs.
```{r results='hide'}
phd_model1 <- PhDPublications %>% glm(articles ~ gender, data = ., family = poisson())
phd_model2 <- PhDPublications %>% glm(articles ~ gender * kids + mentor, data = ., family = poisson())
phd_model3 <- PhDPublications %>% glm(articles ~ ., data = ., family = poisson())
```
<br><br>
### Base R
* Explore the elements in one of your models using `$`
  * Type `phd_model1$` and explore the drop-down menu
  * Try `coefficients(phd_model1)`
* Print a `summary()` of one of your models (see the sketch below)
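As before, a minimal sketch assuming the `phd_model1` object from the solution chunk above:

```{r}
#| eval: false
# Individual elements of the fitted glm object
phd_model1$coefficients       # same values as coefficients(phd_model1)
phd_model1$family             # confirms the poisson family and log link

# Coefficients on the log scale, deviance, and AIC
summary(phd_model1)
```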
<br><br>
```{r}
#| echo: false
#| eval: true
#| column: margin
knitr::include_graphics(here("img", "easystats.png"))
```
### `easystats`
\ \
* Check the assumptions of your models with `check_model()`
  - Discuss if/why some assumptions are violated
  - What could be done to remedy this?
```{r}
check_model(phd_model1)
check_model(phd_model2)
check_model(phd_model3)
```
\ \
* Investigate the parameters of your models with `model_parameters()`
```{r}
model_parameters(phd_model1)
model_parameters(phd_model2)
model_parameters(phd_model3)
```
\ \
* Compare your models with `compare_models()`
```{r results='hide'}
compare_models(phd_model1, phd_model2, phd_model3,
  exponentiate = TRUE # to have coefficients as ratios
)
```
\ \
* Compare the performance of your models with `compare_performance()`
```{r results='hide'}
compare_performance(phd_model1, phd_model2, phd_model3)
```
\ \
* Plot a comparison of your models with `compare_performance() %>% plot()`
```{r results='hide', fig.show='hide'}
compare_performance(phd_model1, phd_model2, phd_model3) %>% plot()
```
\ \
* Make a report of the model you think is the best
```{r results='hide'}
report(phd_model3, exponentiate = TRUE)
```
<br><br>
### Want More `easystats`? {-}
For more information on `easystats` read [here](https://github.com/easystats/easystats).
<br><br>
### Improve your home assignment
* Return to your home assignment and add a regression analysis if it makes sense
* Remember to change projects (top right corner in RStudio)
  * Using the menu in the top right corner, you can switch between your course project and your home assignment