Tidy

dplyr and tidyr

Author

Steen Flammild Harsted & Søren O´Neill

Published

June 17, 2025

1 Presentation

You can download the course slides for this section here

Getting Started

Show the code

knitr::include_graphics(here::here("img", "badges",  "sheriff.png"))

Create a new quarto document and name it “tidy.qmd”
Change the YAML header to style your document output. (see YAML example below)
Insert a code chunk and load 2 important libraries

Show the code

library(tidyverse)
library(here)

YAML example

---
title: "TITLE"
subtitle: "SUBTITLE"
author: "ME"
date: today
format: 
  html:
    toc: true
    toc-depth: 2
    embed-resources: true
    number-sections: true
    number-depth: 2
    code-fold: true
    code-summary: "Show the code"
    code-tools: true
execute:
  message: false
  warning: false
---

Excercises

Load the `soldiers` dataset and assign to an object called `soldiers`

Use the function read_csv2()
The file argument should be here("raw_data", "soldiers.csv")
remember <-

Show the code

soldiers <- read_csv2(here("raw_data", "soldiers.csv"))

1.1 Explore the dataset

There are many ways to explore a dataset in R. The simplest is to use the dataviewer:

click on the table icon next to the soldiers object in the Enviroment plane
write view(soldiers) in the console (This should not be a part of your the script)
glimpse(soldiers)
summary(soldiers)

Discuss with your neighbor:

Nr of rows?
Nr of columns?
Missing values?
Types of variables?
Any fake data? (hint: Yes, for educational purposes we have added some fake data)

2 `dplyr`

The main functions of dplyr that you will practice today are:

select()
filter()
summarise()
group_by()
arrange()
mutate()
count()

2.1 `select()`

select the columns `subjectid`, `sex`, `age`

Show the code

soldiers |> 
  select(subjectid, sex, age)

select the columns 1, 3, 5:7

Show the code

soldiers |> 
  select(1,3,5:7)

remove the columns 3:5

Show the code

soldiers |> 
  select(-(3:5))

select all columns that contains the word “circumference”

HINT

Use one of the tidyselect helper functions.

Show the code

soldiers |> 
  select(contains("circumference"))

remove all columns containing the letter “c”

HINT

Use one of the tidyselect helper functions.
Use a minus sign.

Show the code

soldiers |> 
  select(-contains("c"))

select all columns that contains a “c” OR “x” OR “y” OR “z”

HINT

In R(and many other programming languages) the | sign is used as a logical operator for OR.

Show the code

soldiers |> 
  select(contains("c") | contains("x") | contains("y") | contains("z"))

Time to check your code

Clean your code
Restart R. (Click Session -> Restart R)
Run all your code chunks
Your code should run without errors before your proceed

2.2 `filter()`

Keep all rows where `sex` is Female:

Hint

???? == "Female"

Show the code

soldiers |> filter(sex == "Female")

Keep all rows where `weightkg` is missing (NA value)

Hint

use the is.na() function

Show the code

soldiers |> 
  filter(is.na(weightkg))

Keep all rows where `WritingPreference` is “Left hand” AND `sex` is “Female”

Show the code

soldiers |> 
  filter(WritingPreference == "Left hand" ,  sex == "Female")  # you can use & instead of a ,

Keep all rows where `WritingPreference` is “Left hand” OR `sex` is “Female”

Show the code

soldiers |> 
  filter(WritingPreference == "Left hand" |  sex == "Female")

What is going wrong in this code?

soldiers |> 
  select(1:5) |> 
  filter(WritingPreference == "Left hand" |  sex == "Female")

The error message is:

Error in `filter()`:
ℹ In argument: `WritingPreference == "Left hand" | sex == "Female"`.
Caused by error:
! object 'WritingPreference' not found
Run `rlang::last_error()` to see where the error occurred.

ANSWER

The variable WritingPreference was not selected in the first line.

Keep all rows where `age` is above 30 and the `weightkg` is below 600

Show the code

soldiers |> 
  filter(age > 30, weightkg < 600)

Keep all rows where `Ethnicity` is either “Mexican” OR “Filipino” OR “Caribbean Islander”

Hint

you can use %in% and c() or multiple calls of Ethnicity == xxxx | Ethnicity == zzz | ...

Show the code

soldiers |> 
  filter(Ethnicity %in% c("Mexican", "Filipino", "Caribbean Islander"))

# Alternatively
soldiers |> 
  filter(Ethnicity == "Mexican" | Ethnicity == "Filipino" | Ethnicity == "Caribbean Islander")

Difference between == and %in%

tldr - Use %in% instead of == when you want to filter for multiple values.

Read on if you want to understand why. (You don’t have to)

The code filter(Ethnicity == c("Mexican", "Filipino")) is likely not doing what you expect. The ‘==’ operator does an element-wise comparison, which means it compares the first element of ‘Ethnicity’ to the first element of the vector (“Mexican”), the second element of ‘Ethnicity’ to the second element of the vector (“Filipino”). The short vector is then recycled so now the third element of ‘Ethnicity’ is compared to the first element of the vector (“Mexican”), the fourth element of ‘Ethnicity’ to the second element of the vector (“Filipino”), and so on.

Inspect the differences in how may rows these lines of code produce

soldiers |> 
  filter(Ethnicity %in% c("Mexican", "Filipino"))

soldiers |> 
  filter(Ethnicity == c("Mexican", "Filipino"))

Run this code chunk line by line. Inspect the differences.

# Create a data frame
df <- data.frame(
  Ethnicity = c("Mexican", "Filipino", "Italian", "Mexican", "Italian", "Filipino"),
  Name = c("John", "Maria", "Luigi", "Carlos", "Francesco", "Jose"),
  stringsAsFactors = FALSE
)

# Investigate the data frame
df

# Filter using %in%
df |> filter(Ethnicity %in% c("Mexican", "Filipino"))

# Filter using ==
df |> filter(Ethnicity == c("Mexican", "Filipino"))

Time to check your code

Clean your code
Restart R. (Click Session -> Restart R)
Run all your code chunks
Your code should run without errors before your proceed

2.3 `summarise()`

Calculate the mean and standard deviation of `footlength`

Show the code

soldiers |> summarise(
  footlength_avg = mean(footlength),
  footlength_sd = sd(footlength))

Calculate the median and interquartile range of `earlength`

HINT

use the IQR() function

Show the code

soldiers |> 
  summarise(
    earlength_median = median(earlength),
    earlength_iqr = IQR(earlength))

Count the number of rows where `WritingPreference` is equal to “Right hand”

Show the code

soldiers |>  
  summarise(
    n_righthanded = sum(WritingPreference == "Right hand"))

How old is the oldest soldier?

HINT if you can’t work out why get an NA value

Many Base R functions, including mean(), does not ignore NA values by default. This means that if the vector contains an NA value the result will be NA. Is this a good or bad thing?
You can set the argument na.rm = TRUE, to ignore missing values.

Show the code

soldiers |> 
  summarise(
    age_max = max(age, na.rm = TRUE))

Calculate the mean weight of the Females

HINT if you can’t work out why get an NA value

Show the code

soldiers |> 
  filter(sex == "Female") |> 
  summarise(
    weight_avg = mean(weightkg, na.rm = TRUE))

Calculate the range in weight (max-min) within Males

Show the code

soldiers |> 
  filter(sex == "Male") |> 
  summarise(
    weight_range = max(weightkg, na.rm = TRUE)-min(weightkg, na.rm = TRUE))

Time to check your code

Clean your code
Restart R. (Click Session -> Restart R)
Run all your code chunks (you can use “run all” CTRL+ALT+R)
Your code should run without errors before your proceed

2.4 `group_by()` and `arrange()`

Calculate the mean and sd of `weightkg` and `age` for all `Installation`s

Show the code

soldiers |> 
  group_by(Installation) |> 
  summarise(weight_avg = mean(weightkg, na.rm = TRUE),
            weight_sd = sd(weightkg, na.rm = TRUE),
            age_avg = mean(age, na.rm = TRUE),
            age_sd = sd(age, na.rm = TRUE))

Calculate the mean and sd of `weightkg` and `age` for all `Installation`s for both `sex`es

Show the code

soldiers |> 
  group_by(Installation, sex) |> 
  summarise(weight_avg = mean(weightkg, na.rm = TRUE),
            weight_sd = sd(weightkg, na.rm = TRUE),
            age_avg = mean(age, na.rm = TRUE),
            age_sd = sd(age, na.rm = TRUE))

Calcualate the average height for each `Installation` and count the number of missing values within each `Installation`

Hint

To count missings, use the functions sum() and is.na()

Show the code

soldiers |> 
  group_by(Installation) |> 
  summarise(height_avg = mean(Heightin, na.rm = TRUE),
            height_n_miss = sum(is.na(Heightin)))

As before, but now also add the number of observations (rows) within each `Installation`

Hint

Use n()

Show the code

soldiers |> 
  group_by(Installation) |> 
  summarise(height_avg = mean(Heightin, na.rm = TRUE),
            height_n_miss = sum(is.na(Heightin)),
            n = n())

As before, but now arrange the output after number of soldiers at each `Installation` in descending order.

Show the code

soldiers |> 
  group_by(Installation) |> 
  summarise(height_avg = mean(Heightin, na.rm = TRUE),
            height_n_miss = sum(is.na(Heightin)),
            n = n()) |> 
  arrange(desc(n))

Time to check your code

Clean your code
Restart R. (Click Session -> Restart R)
Run all your code chunks (you can use “run all” CTRL+ALT+R)
Your code should run without errors before your proceed

2.5 `mutate()`

Add a column called `heightcm` with the height of the soldiers in cm

Update the soldiers dataset with the new variable (<-)
place the new variable after Heightin

Show the code

soldiers <- soldiers |> 
  mutate(
    heightcm = Heightin * 2.54,
    .after = Heightin)

# Tip:
# The typical workflow for updating an object is that you first
# write your code WITHOUT updating. Instead you print
# each incremental change in the console. When you are happy with 
# your result you update the object. 
# In this case this would mean that you add:
# "soldiers <-" in the first line as the last step

Update the `weightkg` column to kg instead of kg*10

Update the soldiers dataset with the new weightkg column

Show the code

soldiers <- soldiers |> 
  mutate(
    weightkg = weightkg/10
    )

Add a column called `BMI` with the Body mass index (BMI) of the soldiers

BMI

Update the soldiers dataset with the new variable
place the new variable after weightkg

Show the code

soldiers <- soldiers |> 
  mutate(BMI = weightkg/(heightcm/100)^2,
         .after = weightkg)

Add a column called `obese` that contains the value TRUE if BMI is > 30

Show the code

soldiers |> 
  mutate(
    obese = if_else(BMI > 30, TRUE, FALSE),
    .before = 1 # This line code just places the variable at the front
  )

# OR

soldiers |> 
  mutate(
    obese = BMI > 30,
    .before = 1 # This line code just places the variable at the front
  )

Inspect the below table from Wikipedia

Category	BMI (kg/m²)
Underweight (Severe thinness)	< 16.0
Underweight (Moderate thinness)	16.0 – 16.9
Underweight (Mild thinness)	17.0 – 18.4
Normal range	18.5 – 24.9
Overweight (Pre-obese)	25.0 – 29.9
Obese (Class I)	30.0 – 34.9
Obese (Class II)	35.0 – 39.9
Obese (Class III)	>= 40.0

Create the variable `category` that tells us whether the soldiers are “Underweight”, “Normal range”, “Overweight”, or “Obese”

Update the soldiers dataset with the new variable
place the new variable after BMI

Hint 1

Use case_when()

Hint 2

soldiers |> 
  mutate(
    category = ????
    )

Hint 3

soldiers |> 
  mutate(
    category = case_when(
      #TEST HERE ~ OUTPUT, 
      #TEST HERE ~ OUTPUT,
      #TEST HERE ~ OUTPUT,
      #.default = OUTPUT
    )
    )

Show the code

soldiers <- soldiers |>  
    mutate( 
        category = case_when( 
            BMI < 18.5 ~ "Underweight",         # BMI below 18.5
            BMI < 25   ~ "Normal range",        # BMI between 18.5 and 24.9
            BMI < 30   ~ "Overweight",          # BMI between 25 and 29.9
            BMI >= 30  ~ "Obese",               # BMI 30 and above
            .default = NA
        ), 
        .after = BMI                            # Place the new variable after BMI
    )

Code reading

Consider the two other solution examples in the tabs.

Do you see a problems?

Do you see improvements?

You can run the code in the examples, but try to see if you can predict what the code will do before you run it.

soldiers |>  
    mutate( 
        category = case_when( 
            BMI >= 30  ~ "Obese",               
            BMI < 30   ~ "Overweight",         
            BMI < 25   ~ "Normal range",        
            BMI < 18.5 ~ "Underweight",     
            .default = NA
        ), 
        .after = BMI                            
    )

Answer

soldiers |>  
    mutate( 
        category = case_when( 
            BMI >= 30  ~ "Obese",               # OK!
            BMI < 30   ~ "Overweight",          # ERROR !!! All BMI values below 30 will now be labelled as "Overweight"
            BMI < 25   ~ "Normal range",        # Will never be evaluated
            BMI < 18.5 ~ "Underweight",         # Will never be evaluated
            .default = NA
        ), 
        .after = BMI                            
    )

soldiers |>  
    mutate( 
        category = case_when( 
            BMI < 18.5 ~ "Underweight",         
            BMI < 25   ~ "Normal range",       
            BMI < 30   ~ "Overweight",          
            BMI >= 30  ~ "Obese",              
            is.na(BMI) ~ NA,                  
            .default = "UNCLASSIFIED CHECK YOUR CODE!!!!" 
        ), 
        .after = BMI                           
    )

Answer

# The condition is.na(BMI) is logically valid but redundant in this example, 
# as the .default argument already handles unmatched cases, including NAs.
# In this example, the preceding arguments mathematically account for all 
# possible BMI values except NA. 

# However, explicitly separating NA cases is considered good practice.
# It makes the code more readable and clearly distinguishes missing
# data from unexpected or unclassified combinations.

Time to check your code

Clean your code
Restart R. (Click Session -> Restart R)
Run all your code chunks (you can use “run all” CTRL+ALT+R)
Your code should run without errors before your proceed

2.6 `count()`

For simple counting count() is faster than summarise(n = n()) or mutate(n = n())
In some of the exercises below we are using the diamonds dataset. This is an inbuilt dataset that comes with the tidyverse. diamonds is available even though you cant see it in your Environment pane.

What is this code equivalent to?

diamonds |> 
  count()

ANSWER

count() works like summarise(n = n())

What is this code equivalent to?

diamonds |> 
  count(cut)

Answer

# The code is equivalent to:
diamonds |> group_by(cut) |> summarise(n = n())

# OR

diamonds |> summarise(n = n(), .by = cut)

What is this code equivalent to?

diamonds |> 
  count(cut, color)

Answer

# The code is equivalent to:
diamonds |> group_by(cut, color) |> summarise(n = n())

# OR

diamonds |> summarise(n = n(), .by = c(cut, color))


# However, notice that the first solution returns a grouped tibble

What is this code equivalent to?

diamonds |> 
  add_count()

ANSWER

add_count() works like mutate(n = n())

Return so `soldiers`

Count the number of soldiers of each sex calculate the percentage of each sex

Show the code

soldiers |> 
  count(sex) |> 
  mutate(percent = n/sum(n)*100)

Inspect the output of this code. What is going wrong?

soldiers |> 
  group_by(sex) |> 
  count() |> 
  mutate(percent = n/sum(n)*100)

# A tibble: 4 × 3
# Groups:   sex [4]
  sex        n percent
  <chr>  <int>   <dbl>
1 Female  1452     100
2 Male    3026     100
3 Man     1126     100
4 Woman    604     100

2.7 Next up

Congratulations! You have completed all the dplyrstarter exercises. Return to “Files, Folders, and Projects Discipline” and continue with Setting up your course project (continued)

--- title: "Tidy" subtitle: "dplyr and tidyr" author: "Steen Flammild Harsted & Søren O´Neill" date: today format: html: toc: true toc-depth: 2 number-sections: true number-depth: 2 code-fold: true code-summary: "Show the code" code-tools: true execute: message: false warning: false --- # Presentation You can download the course slides for this section <a href="./presentation_tidy.html" download>here</a> <div> ```{=html} <iframe class="slide-deck" src="presentation_tidy.html" width=90% ></iframe> ``` </div> \ ## Getting Started {.unnumbered} ```{r} knitr::include_graphics(here::here("img", "badges", "sheriff.png")) ``` * Create a new quarto document and name it "tidy.qmd" * Change the YAML header to style your document output. (see YAML example below) * Insert a code chunk and load 2 important libraries ```{r} library(tidyverse) library(here) ``` ::: {.callout-tip collapse="true"} ### YAML example ````{verbatim} --- title: "TITLE" subtitle: "SUBTITLE" author: "ME" date: today format: html: toc: true toc-depth: 2 embed-resources: true number-sections: true number-depth: 2 code-fold: true code-summary: "Show the code" code-tools: true execute: message: false warning: false --- ```` ::: # Excercises {.unnumbered} ### Load the `soldiers` dataset and assign to an object called `soldiers` * Use the function `read_csv2()` \ * The `file` argument should be `here("raw_data", "soldiers.csv")` * remember `<-` ```{r} soldiers <- read_csv2(here("raw_data", "soldiers.csv")) ``` \ ## Explore the dataset There are many ways to explore a dataset in R. The simplest is to use the dataviewer: * click on the table icon next to the `soldiers` object in the Enviroment plane * write `view(soldiers)` in the console (This should not be a part of your the script) * `glimpse(soldiers)` * `summary(soldiers)` Discuss with your neighbor: * Nr of rows? * Nr of columns? * Missing values? * Types of variables? * Any fake data? *(hint: Yes, for educational purposes we have added some fake data)* # `dplyr` The main functions of dplyr that you will practice today are: * `select()` * `filter()` * `summarise()` * `group_by()` * `arrange()` * `mutate()` * `count()` ## `select()` ### select the columns `subjectid`, `sex`, `age` ```{r} #| eval: false soldiers |> select(subjectid, sex, age) ``` \ ### select the columns 1, 3, 5:7 ```{r} #| eval: false soldiers |> select(1,3,5:7) ``` \ ### remove the columns 3:5 ```{r} #| eval: false soldiers |> select(-(3:5)) ``` \ ### select all columns that contains the word "circumference" ::: {.callout-tip collapse=true} #### HINT Use one of the [tidyselect](https://tidyselect.r-lib.org/reference/language.html) helper functions. ::: ```{r soldiers_6} #| eval: false soldiers |> select(contains("circumference")) ``` ### remove all columns containing the letter "c" ::: {.callout-tip collapse=true} #### HINT Use one of the [tidyselect](https://tidyselect.r-lib.org/reference/language.html) helper functions. Use a minus sign. ::: ```{r soldiers_7} #| eval: false soldiers |> select(-contains("c")) ``` ### select all columns that contains a "c" OR "x" OR "y" OR "z" ::: {.callout-tip collapse=true} #### HINT In R(and many other programming languages) the \| sign is used as a logical operator for OR. ::: ```{r} #| eval: false soldiers |> select(contains("c") | contains("x") | contains("y") | contains("z")) ``` ```{r} #| echo: false #| column: margin knitr::include_graphics(here::here("img", "sherif.png")) ``` ### Time to check your code * Clean your code * Restart R. (Click Session -> Restart R)\ * Run all your code chunks * Your code should run without errors before your proceed ## `filter()` \ ### Keep all rows where `sex` is Female: ::: {.callout-tip collapse=true} #### Hint `???? == "Female"` ::: ```{r} #| eval: false soldiers |> filter(sex == "Female") ``` ### Keep all rows where `weightkg` is missing (NA value) ::: {.callout-tip collapse=true} #### Hint use the `is.na()` function ::: ```{r} #| eval: false soldiers |> filter(is.na(weightkg)) ``` ### Keep all rows where `WritingPreference` is "Left hand" AND `sex` is "Female" ```{r} #| eval: false soldiers |> filter(WritingPreference == "Left hand" , sex == "Female") # you can use & instead of a , ``` ### Keep all rows where `WritingPreference` is "Left hand" OR `sex` is "Female" ```{r} #| eval: false soldiers |> filter(WritingPreference == "Left hand" | sex == "Female") ``` ### What is going wrong in this code? ```{r} #| eval: false #| code-fold: false soldiers |> select(1:5) |> filter(WritingPreference == "Left hand" | sex == "Female") ``` The error message is: ```{r} #| eval: false #| code-fold: false Error in `filter()`: ℹ In argument: `WritingPreference == "Left hand" | sex == "Female"`. Caused by error: ! object 'WritingPreference' not found Run `rlang::last_error()` to see where the error occurred. ``` ```{r} #| eval: false #| code-summary: "ANSWER" The variable WritingPreference was not selected in the first line. ``` ### Keep all rows where `age` is above 30 and the `weightkg` is below 600 ```{r} #| eval: false soldiers |> filter(age > 30, weightkg < 600) ``` \ ### Keep all rows where `Ethnicity` is either "Mexican" OR "Filipino" OR "Caribbean Islander" ::: {.callout-tip collapse=true} #### Hint you can use `%in%` and `c()` or multiple calls of `Ethnicity == xxxx | Ethnicity == zzz | ...` ::: ```{r} #| eval: false soldiers |> filter(Ethnicity %in% c("Mexican", "Filipino", "Caribbean Islander")) # Alternatively soldiers |> filter(Ethnicity == "Mexican" | Ethnicity == "Filipino" | Ethnicity == "Caribbean Islander") ``` ::: {.callout-caution collapse=true} #### Difference between `==` and `%in%` tldr - Use `%in%` instead of `==` when you want to filter for multiple values. Read on if you want to understand why. (You don't have to) The code `filter(Ethnicity == c("Mexican", "Filipino"))` is likely not doing what you expect. The '==' operator does an element-wise comparison, which means it compares the first element of 'Ethnicity' to the first element of the vector ("Mexican"), the second element of 'Ethnicity' to the second element of the vector ("Filipino"). The short vector is then recycled so now the third element of 'Ethnicity' is compared to the first element of the vector ("Mexican"), the fourth element of 'Ethnicity' to the second element of the vector ("Filipino"), and so on. Inspect the differences in how may rows these lines of code produce ```{r} #| code-fold: false #| eval: false soldiers |> filter(Ethnicity %in% c("Mexican", "Filipino")) soldiers |> filter(Ethnicity == c("Mexican", "Filipino")) ``` Run this code chunk line by line. Inspect the differences. ```{r} #| code-fold: false #| eval: false # Create a data frame df <- data.frame( Ethnicity = c("Mexican", "Filipino", "Italian", "Mexican", "Italian", "Filipino"), Name = c("John", "Maria", "Luigi", "Carlos", "Francesco", "Jose"), stringsAsFactors = FALSE ) # Investigate the data frame df # Filter using %in% df |> filter(Ethnicity %in% c("Mexican", "Filipino")) # Filter using == df |> filter(Ethnicity == c("Mexican", "Filipino")) ``` ::: ```{r} #| echo: false #| column: margin knitr::include_graphics(here::here("img", "sherif.png")) ``` ### Time to check your code * Clean your code * Restart R. (Click Session -> Restart R)\ * Run all your code chunks * Your code should run without errors before your proceed ## `summarise()` ### Calculate the mean and standard deviation of `footlength` ```{r} #| eval: false soldiers |> summarise( footlength_avg = mean(footlength), footlength_sd = sd(footlength)) ``` ### Calculate the median and interquartile range of `earlength` <details><summary>HINT</summary> use the `IQR()` function </details> ```{r} #| eval: false soldiers |> summarise( earlength_median = median(earlength), earlength_iqr = IQR(earlength)) ``` ### Count the number of rows where `WritingPreference` is equal to "Right hand" ```{r} #| eval: false soldiers |> summarise( n_righthanded = sum(WritingPreference == "Right hand")) ``` \ ### How old is the oldest soldier? <details> <summary> HINT if you can't work out why get an `NA` value </summary> Many Base R functions, including `mean()`, does not ignore NA values by default. This means that if the vector contains an `NA` value the result will be `NA`. Is this a good or bad thing?\ You can set the argument `na.rm = TRUE`, to ignore missing values. </details> ```{r} #| eval: false soldiers |> summarise( age_max = max(age, na.rm = TRUE)) ``` \ ### Calculate the mean weight of the Females <details> <summary> HINT if you can't work out why get an `NA` value </summary> Many Base R functions, including `mean()`, does not ignore NA values by default. This means that if the vector contains an `NA` value the result will be `NA`. Is this a good or bad thing?\ You can set the argument `na.rm = TRUE`, to ignore missing values. </details> ```{r} #| eval: false soldiers |> filter(sex == "Female") |> summarise( weight_avg = mean(weightkg, na.rm = TRUE)) ``` ### Calculate the range in weight (max-min) within Males ```{r} #| eval: false soldiers |> filter(sex == "Male") |> summarise( weight_range = max(weightkg, na.rm = TRUE)-min(weightkg, na.rm = TRUE)) ``` ```{r} #| echo: false #| column: margin knitr::include_graphics(here::here("img", "sherif.png")) ``` ### Time to check your code * Clean your code * Restart R. (Click Session -> Restart R)\ * Run all your code chunks (you can use "run all" CTRL+ALT+R) * Your code should run without errors before your proceed ## `group_by()` and `arrange()` ### Calculate the mean and sd of `weightkg` and `age` for all `Installation`s ```{r} #| eval: false soldiers |> group_by(Installation) |> summarise(weight_avg = mean(weightkg, na.rm = TRUE), weight_sd = sd(weightkg, na.rm = TRUE), age_avg = mean(age, na.rm = TRUE), age_sd = sd(age, na.rm = TRUE)) ``` ### Calculate the mean and sd of `weightkg` and `age` for all `Installation`s for both `sex`es ```{r} #| eval: false soldiers |> group_by(Installation, sex) |> summarise(weight_avg = mean(weightkg, na.rm = TRUE), weight_sd = sd(weightkg, na.rm = TRUE), age_avg = mean(age, na.rm = TRUE), age_sd = sd(age, na.rm = TRUE)) ``` ### Calcualate the average height for each `Installation` and count the number of missing values within each `Installation` ::: {.callout-tip collapse=true} #### Hint To count missings, use the functions `sum()` and `is.na()` ::: ```{r} #| eval: false soldiers |> group_by(Installation) |> summarise(height_avg = mean(Heightin, na.rm = TRUE), height_n_miss = sum(is.na(Heightin))) ``` ### As before, but now also add the number of observations (rows) within each `Installation` ::: {.callout-tip collapse=true} #### Hint Use `n()` ::: ```{r} #| eval: false soldiers |> group_by(Installation) |> summarise(height_avg = mean(Heightin, na.rm = TRUE), height_n_miss = sum(is.na(Heightin)), n = n()) ``` ### As before, but now arrange the output after number of soldiers at each `Installation` in descending order. ```{r} #| eval: false soldiers |> group_by(Installation) |> summarise(height_avg = mean(Heightin, na.rm = TRUE), height_n_miss = sum(is.na(Heightin)), n = n()) |> arrange(desc(n)) ``` ```{r} #| echo: false #| column: margin knitr::include_graphics(here::here("img", "sherif.png")) ``` ### Time to check your code * Clean your code * Restart R. (Click Session -> Restart R)\ * Run all your code chunks (you can use "run all" CTRL+ALT+R) * Your code should run without errors before your proceed ## `mutate()` ### Add a column called `heightcm` with the height of the soldiers in cm * Update the `soldiers` dataset with the new variable (`<-`) * place the new variable after `Heightin` ```{r} #| output: false soldiers <- soldiers |> mutate( heightcm = Heightin * 2.54, .after = Heightin) # Tip: # The typical workflow for updating an object is that you first # write your code WITHOUT updating. Instead you print # each incremental change in the console. When you are happy with # your result you update the object. # In this case this would mean that you add: # "soldiers <-" in the first line as the last step ``` \ ### Update the `weightkg` column to kg instead of kg*10 * Update the `soldiers` dataset with the new `weightkg` column ```{r} #| output: false soldiers <- soldiers |> mutate( weightkg = weightkg/10 ) ``` \ ### Add a column called `BMI` with the Body mass index (BMI) of the soldiers [BMI](https://en.wikipedia.org/wiki/Body_mass_index) * Update the `soldiers` dataset with the new variable * place the new variable after `weightkg` ```{r} #| output: false soldiers <- soldiers |> mutate(BMI = weightkg/(heightcm/100)^2, .after = weightkg) ``` \ ### Add a column called `obese` that contains the value TRUE if BMI is > 30 ```{r} #| eval: false soldiers |> mutate( obese = if_else(BMI > 30, TRUE, FALSE), .before = 1 # This line code just places the variable at the front ) # OR soldiers |> mutate( obese = BMI > 30, .before = 1 # This line code just places the variable at the front ) ``` \ ### Inspect the below table from [Wikipedia](https://en.wikipedia.org/wiki/Body_mass_index) {#sec-BMI} ```{r} #| echo: false tibble::tribble( ~Category, ~`BMI (kg/m^2^)`, "Underweight (Severe thinness)", "< 16.0", "Underweight (Moderate thinness)", "16.0 – 16.9", "Underweight (Mild thinness)", "17.0 – 18.4", "Normal range", "18.5 – 24.9", "Overweight (Pre-obese)", "25.0 – 29.9", "Obese (Class I)", "30.0 – 34.9", "Obese (Class II)", "35.0 – 39.9", "Obese (Class III)", ">= 40.0", ) |> knitr::kable() ``` \ ### Create the variable `category` that tells us whether the soldiers are "Underweight", "Normal range", "Overweight", or "Obese" * Update the `soldiers` dataset with the new variable * place the new variable after `BMI`\ ::: {.callout-tip collapse=true} #### Hint 1 Use `case_when()` ::: ::: {.callout-tip collapse=true} #### Hint 2 ```{r} #| eval: false #| code-fold: false soldiers |> mutate( category = ???? ) ``` ::: ::: {.callout-tip collapse=true} #### Hint 3 ```{r} #| eval: false #| code-fold: false soldiers |> mutate( category = case_when( #TEST HERE ~ OUTPUT, #TEST HERE ~ OUTPUT, #TEST HERE ~ OUTPUT, #.default = OUTPUT ) ) ``` ::: ```{r} #| output: false soldiers <- soldiers |> mutate( category = case_when( BMI < 18.5 ~ "Underweight", # BMI below 18.5 BMI < 25 ~ "Normal range", # BMI between 18.5 and 24.9 BMI < 30 ~ "Overweight", # BMI between 25 and 29.9 BMI >= 30 ~ "Obese", # BMI 30 and above .default = NA ), .after = BMI # Place the new variable after BMI ) ``` ### Code reading ::: {.panel-tabset} ## Task Consider the two other solution examples in the tabs. Do you see a problems? Do you see improvements? You can run the code in the examples, but try to see if you can predict what the code will do before you run it. ## Example 1 ```{r} #| code-fold: false #| eval: false soldiers |> mutate( category = case_when( BMI >= 30 ~ "Obese", BMI < 30 ~ "Overweight", BMI < 25 ~ "Normal range", BMI < 18.5 ~ "Underweight", .default = NA ), .after = BMI ) ``` ```{r} #| eval: false #| code-summary: "Answer" soldiers |> mutate( category = case_when( BMI >= 30 ~ "Obese", # OK! BMI < 30 ~ "Overweight", # ERROR !!! All BMI values below 30 will now be labelled as "Overweight" BMI < 25 ~ "Normal range", # Will never be evaluated BMI < 18.5 ~ "Underweight", # Will never be evaluated .default = NA ), .after = BMI ) ``` ## Example 2 ```{r} #| code-fold: false #| eval: false soldiers |> mutate( category = case_when( BMI < 18.5 ~ "Underweight", BMI < 25 ~ "Normal range", BMI < 30 ~ "Overweight", BMI >= 30 ~ "Obese", is.na(BMI) ~ NA, .default = "UNCLASSIFIED CHECK YOUR CODE!!!!" ), .after = BMI ) ``` ```{r} #| eval: false #| code-summary: "Answer" # The condition is.na(BMI) is logically valid but redundant in this example, # as the .default argument already handles unmatched cases, including NAs. # In this example, the preceding arguments mathematically account for all # possible BMI values except NA. # However, explicitly separating NA cases is considered good practice. # It makes the code more readable and clearly distinguishes missing # data from unexpected or unclassified combinations. ``` ::: ### Time to check your code ```{r} #| echo: false #| column: margin knitr::include_graphics(here::here("img", "sherif.png")) ``` * Clean your code * Restart R. (Click Session -> Restart R)\ * Run all your code chunks (you can use "run all" CTRL+ALT+R) * Your code should run without errors before your proceed ## `count()` For simple counting `count()` is faster than `summarise(n = n())` or `mutate(n = n())` In some of the exercises below we are using the `diamonds` dataset. This is an inbuilt dataset that comes with the `tidyverse`. `diamonds` is available even though you cant see it in your Environment pane. ### What is this code equivalent to? ```{r} #| eval: false #| code-fold: false diamonds |> count() ``` <details><summary>ANSWER</summary> `count()` works like `summarise(n = n())` </details> ### What is this code equivalent to? ```{r} #| eval: false #| code-fold: false diamonds |> count(cut) ``` ```{r} #| code-fold: true #| code-summary: "Answer" #| eval: false # The code is equivalent to: diamonds |> group_by(cut) |> summarise(n = n()) # OR diamonds |> summarise(n = n(), .by = cut) ``` ### What is this code equivalent to? ```{r} #| eval: false #| code-fold: false diamonds |> count(cut, color) ``` ```{r} #| code-fold: true #| code-summary: "Answer" #| eval: false # The code is equivalent to: diamonds |> group_by(cut, color) |> summarise(n = n()) # OR diamonds |> summarise(n = n(), .by = c(cut, color)) # However, notice that the first solution returns a grouped tibble ``` ### What is this code equivalent to? ```{r} #| eval: false #| code-fold: false diamonds |> add_count() ``` <details><summary>ANSWER</summary> `add_count()` works like `mutate(n = n())` </details> ### Return so `soldiers` ### Count the number of soldiers of each sex calculate the percentage of each sex ```{r} #| output: false soldiers |> count(sex) |> mutate(percent = n/sum(n)*100) ``` ### Inspect the output of this code. What is going wrong? ```{r} #| code-fold: false soldiers |> group_by(sex) |> count() |> mutate(percent = n/sum(n)*100) ``` ## Next up ```{r} #| echo: false #| column: margin knitr::include_graphics(here("img", "art.png")) ``` Congratulations! You have completed all the `dplyr`starter exercises. Return to "Files, Folders, and Projects Discipline" and continue with [Setting up your course project (continued)](../files_folders_and_projects_discipline/files_folders_and_project_discipline.qmd)

1 Presentation

Getting Started

Excercises

Load the soldiers dataset and assign to an object called soldiers

1.1 Explore the dataset

2 dplyr

2.1 select()

select the columns subjectid, sex, age

select the columns 1, 3, 5:7

remove the columns 3:5

select all columns that contains the word “circumference”

remove all columns containing the letter “c”

select all columns that contains a “c” OR “x” OR “y” OR “z”

Time to check your code

2.2 filter()

Keep all rows where sex is Female:

Keep all rows where weightkg is missing (NA value)

Keep all rows where WritingPreference is “Left hand” AND sex is “Female”

Keep all rows where WritingPreference is “Left hand” OR sex is “Female”

What is going wrong in this code?

Keep all rows where age is above 30 and the weightkg is below 600

Keep all rows where Ethnicity is either “Mexican” OR “Filipino” OR “Caribbean Islander”

Time to check your code

2.3 summarise()

Calculate the mean and standard deviation of footlength

Calculate the median and interquartile range of earlength

Count the number of rows where WritingPreference is equal to “Right hand”

How old is the oldest soldier?

Calculate the mean weight of the Females

Calculate the range in weight (max-min) within Males

Time to check your code

2.4 group_by() and arrange()

Calculate the mean and sd of weightkg and age for all Installations

Calculate the mean and sd of weightkg and age for all Installations for both sexes

Calcualate the average height for each Installation and count the number of missing values within each Installation

As before, but now also add the number of observations (rows) within each Installation

As before, but now arrange the output after number of soldiers at each Installation in descending order.

Time to check your code

2.5 mutate()

Add a column called heightcm with the height of the soldiers in cm

Update the weightkg column to kg instead of kg*10

Add a column called BMI with the Body mass index (BMI) of the soldiers

Add a column called obese that contains the value TRUE if BMI is > 30

Inspect the below table from Wikipedia

Create the variable category that tells us whether the soldiers are “Underweight”, “Normal range”, “Overweight”, or “Obese”

Code reading

Time to check your code

2.6 count()

What is this code equivalent to?

What is this code equivalent to?

What is this code equivalent to?

What is this code equivalent to?

Return so soldiers

Count the number of soldiers of each sex calculate the percentage of each sex

Inspect the output of this code. What is going wrong?

2.7 Next up

Load the `soldiers` dataset and assign to an object called `soldiers`

2 `dplyr`

2.1 `select()`

select the columns `subjectid`, `sex`, `age`

2.2 `filter()`

Keep all rows where `sex` is Female:

Keep all rows where `weightkg` is missing (NA value)

Keep all rows where `WritingPreference` is “Left hand” AND `sex` is “Female”

Keep all rows where `WritingPreference` is “Left hand” OR `sex` is “Female”

Keep all rows where `age` is above 30 and the `weightkg` is below 600

Keep all rows where `Ethnicity` is either “Mexican” OR “Filipino” OR “Caribbean Islander”

2.3 `summarise()`

Calculate the mean and standard deviation of `footlength`

Calculate the median and interquartile range of `earlength`

Count the number of rows where `WritingPreference` is equal to “Right hand”

2.4 `group_by()` and `arrange()`

Calculate the mean and sd of `weightkg` and `age` for all `Installation`s

Calculate the mean and sd of `weightkg` and `age` for all `Installation`s for both `sex`es

Calcualate the average height for each `Installation` and count the number of missing values within each `Installation`

As before, but now also add the number of observations (rows) within each `Installation`

As before, but now arrange the output after number of soldiers at each `Installation` in descending order.

2.5 `mutate()`

Add a column called `heightcm` with the height of the soldiers in cm

Update the `weightkg` column to kg instead of kg*10

Add a column called `BMI` with the Body mass index (BMI) of the soldiers

Add a column called `obese` that contains the value TRUE if BMI is > 30

Create the variable `category` that tells us whether the soldiers are “Underweight”, “Normal range”, “Overweight”, or “Obese”

2.6 `count()`

Return so `soldiers`