Show the code
::include_graphics(here::here("img", "badges", "sheriff.png")) knitr
dplyr and tidyr
Steen Flammild Harsted & Søren O´Neill
April 25, 2025
You can download the course slides for this section here
soldiers
dataset and assign to an object called soldiers
read_csv2()
file
argument should be here("raw_data", "soldiers.csv")
<-
There are many ways to explore a dataset in R. The simplest is to use the dataviewer:
soldiers
object in the Enviroment planeview(soldiers)
in the console (This should not be a part of your the script)glimpse(soldiers)
summary(soldiers)
Discuss with your neighbor:
dplyr
The main functions of dplyr that you will practice today are:
select()
filter()
summarise()
group_by()
arrange()
mutate()
count()
select()
subjectid
, sex
, age
Use one of the tidyselect helper functions.
Use one of the tidyselect helper functions.
Use a minus sign.
In R(and many other programming languages) the | sign is used as a logical operator for OR.
filter()
sex
is Female:???? == "Female"
weightkg
is missing (NA value)use the is.na()
function
WritingPreference
is “Left hand” AND sex
is “Female”WritingPreference
is “Left hand” OR sex
is “Female”The error message is: ::: {.cell}
Error in `filter()`:
ℹ In argument: `WritingPreference == "Left hand" | sex == "Female"`.
Caused by error:
! object 'WritingPreference' not found
Run `rlang::last_error()` to see where the error occurred.
:::
age
is above 30 and the weightkg
is below 600
Ethnicity
is either “Mexican” OR “Filipino” OR “Caribbean Islander”you can use %in%
and c()
or multiple calls of Ethnicity == xxxx | Ethnicity == zzz | ...
==
and %in%
tldr - Use %in%
instead of ==
when you want to filter for multiple values.
Read on if you want to understand why. (You don’t have to)
The code filter(Ethnicity == c("Mexican", "Filipino"))
is likely not doing what you expect. The ‘==’ operator does an element-wise comparison, which means it compares the first element of ‘Ethnicity’ to the first element of the vector (“Mexican”), the second element of ‘Ethnicity’ to the second element of the vector (“Filipino”). The short vector is then recycled so now the third element of ‘Ethnicity’ is compared to the first element of the vector (“Mexican”), the fourth element of ‘Ethnicity’ to the second element of the vector (“Filipino”), and so on.
Inspect the differences in how may rows these lines of code produce
Run this code chunk line by line. Inspect the differences.
# Create a data frame
df <- data.frame(
Ethnicity = c("Mexican", "Filipino", "Italian", "Mexican", "Italian", "Filipino"),
Name = c("John", "Maria", "Luigi", "Carlos", "Francesco", "Jose"),
stringsAsFactors = FALSE
)
# Investigate the data frame
df
# Filter using %in%
df |> filter(Ethnicity %in% c("Mexican", "Filipino"))
# Filter using ==
df |> filter(Ethnicity == c("Mexican", "Filipino"))
summarise()
footlength
earlength
IQR()
function
WritingPreference
is equal to “Right hand”
HINT if you can’t work out why get an NA
value
Many Base R functions, including mean()
, does not ignore NA values by default. This means that if the vector contains an NA
value the result will be NA
. Is this a good or bad thing?
You can set the argument na.rm = TRUE
, to ignore missing values.
HINT if you can’t work out why get an NA
value
Many Base R functions, including mean()
, does not ignore NA values by default. This means that if the vector contains an NA
value the result will be NA
. Is this a good or bad thing?
You can set the argument na.rm = TRUE
, to ignore missing values.
group_by()
and arrange()
weightkg
and age
for all Installation
sweightkg
and age
for all Installation
s for both sex
esInstallation
and count the number of missing values within each Installation
To count missings, use the functions sum()
and is.na()
Installation
Use n()
Installation
in descending order.mutate()
heightcm
with the height of the soldiers in cmsoldiers
dataset with the new variable (<-
)Heightin
soldiers <- soldiers |>
mutate(
heightcm = Heightin * 2.54,
.after = Heightin)
# Tip:
# The typical workflow for updating an object is that you first
# write your code WITHOUT updating. Instead you print
# each incremental change in the console. When you are happy with
# your result you update the object.
# In this case this would mean that you add:
# "soldiers <-" in the first line as the last step
weightkg
column to kg instead of kg*10soldiers
dataset with the new weightkg
column
BMI
with the Body mass index (BMI) of the soldierssoldiers
dataset with the new variableweightkg
obese
that contains the value TRUE if BMI is > 30
Category | BMI (kg/m2) |
---|---|
Underweight (Severe thinness) | < 16.0 |
Underweight (Moderate thinness) | 16.0 – 16.9 |
Underweight (Mild thinness) | 17.0 – 18.4 |
Normal range | 18.5 – 24.9 |
Overweight (Pre-obese) | 25.0 – 29.9 |
Obese (Class I) | 30.0 – 34.9 |
Obese (Class II) | 35.0 – 39.9 |
Obese (Class III) | >= 40.0 |
category
that tells us whether the soldiers are “Underweight”, “Normal range”, “Overweight”, or “Obese”soldiers
dataset with the new variableBMI
Use case_when()
soldiers <- soldiers |>
mutate(
category = case_when(
BMI < 18.5 ~ "Underweight", # BMI below 18.5
BMI < 25 ~ "Normal range", # BMI between 18.5 and 24.9
BMI < 30 ~ "Overweight", # BMI between 25 and 29.9
BMI >= 30 ~ "Obese", # BMI 30 and above
.default = NA
),
.after = BMI # Place the new variable after BMI
)
Consider the two other solution examples in the tabs.
Do you see a problems?
Do you see improvements?
You can run the code in the examples, but try to see if you can predict what the code will do before you run it.
soldiers |>
mutate(
category = case_when(
BMI >= 30 ~ "Obese", # OK!
BMI < 30 ~ "Overweight", # ERROR !!! All BMI values below 30 will now be labelled as "Overweight"
BMI < 25 ~ "Normal range", # Will never be evaluated
BMI < 18.5 ~ "Underweight", # Will never be evaluated
.default = NA
),
.after = BMI
)
# The condition is.na(BMI) is logically valid but redundant in this example,
# as the .default argument already handles unmatched cases, including NAs.
# In this example, the preceding arguments mathematically account for all
# possible BMI values except NA.
# However, explicitly separating NA cases is considered good practice.
# It makes the code more readable and clearly distinguishes missing
# data from unexpected or unclassified combinations.
count()
For simple counting count()
is faster than summarise(n = n())
or mutate(n = n())
In some of the exercises below we are using the diamonds
dataset. This is an inbuilt dataset that comes with the tidyverse
. diamonds
is available even though you cant see it in your Environment pane.
count()
works like summarise(n = n())
add_count()
works like mutate(n = n())
soldiers
# A tibble: 4 × 3
# Groups: sex [4]
sex n percent
<chr> <int> <dbl>
1 Female 1452 100
2 Male 3026 100
3 Man 1126 100
4 Woman 604 100
Congratulations! You have completed all the dplyr
starter exercises. Return to “Files, Folders, and Projects Discipline” and continue with Setting up your course project (continued)
---
title: "Tidy"
subtitle: "dplyr and tidyr"
author: "Steen Flammild Harsted & Søren O´Neill"
date: today
format:
html:
toc: true
toc-depth: 2
number-sections: true
number-depth: 2
code-fold: true
code-summary: "Show the code"
code-tools: true
execute:
message: false
warning: false
---
<br><br>
# Presentation
You can download the course slides for this section <a href="./presentation_tidy.html" download>here</a>
<div>
```{=html}
<iframe class="slide-deck" src="presentation_tidy.html" width=90% ></iframe>
```
</div>
\
## Getting Started {.unnumbered}
```{r}
knitr::include_graphics(here::here("img", "badges", "sheriff.png"))
```
* Create a new quarto document and name it "tidy.qmd"
* Change the YAML header to style your document output. (see YAML example below)
* Insert a code chunk and load 2 important libraries
```{r}
library(tidyverse)
library(here)
```
::: {.callout-tip collapse="true"}
### YAML example
````{verbatim}
---
title: "TITLE"
subtitle: "SUBTITLE"
author: "ME"
date: today
format:
html:
toc: true
toc-depth: 2
embed-resources: true
number-sections: true
number-depth: 2
code-fold: true
code-summary: "Show the code"
code-tools: true
execute:
message: false
warning: false
---
````
:::
<br><br>
# Excercises {.unnumbered}
### Load the `soldiers` dataset and assign to an object called `soldiers`
* Use the function `read_csv2()` \
* The `file` argument should be `here("raw_data", "soldiers.csv")`
* remember `<-`
```{r}
soldiers <- read_csv2(here("raw_data", "soldiers.csv"))
```
\
## Explore the dataset
There are many ways to explore a dataset in R. The simplest is to use the dataviewer:
* click on the table icon next to the `soldiers` object in the Enviroment plane
* write `view(soldiers)` in the console (This should not be a part of your the script)
* `glimpse(soldiers)`
* `summary(soldiers)`
Discuss with your neighbor:
* Nr of rows?
* Nr of columns?
* Missing values?
* Types of variables?
* Any fake data? *(hint: Yes, for educational purposes we have added some fake data)*
<br><br><br><br><br><br><br><br>
# `dplyr`
The main functions of dplyr that you will practice today are:
* `select()`
* `filter()`
* `summarise()`
* `group_by()`
* `arrange()`
* `mutate()`
* `count()`
<br><br>
## `select()`
### select the columns `subjectid`, `sex`, `age`
```{r}
#| eval: false
soldiers |>
select(subjectid, sex, age)
```
\
### select the columns 1, 3, 5:7
```{r}
#| eval: false
soldiers |>
select(1,3,5:7)
```
\
### remove the columns 3:5
```{r}
#| eval: false
soldiers |>
select(-(3:5))
```
\
### select all columns that contains the word "circumference"
::: {.callout-tip collapse=true}
#### HINT
Use one of the [tidyselect](https://tidyselect.r-lib.org/reference/language.html) helper functions.
:::
```{r soldiers_6}
#| eval: false
soldiers |>
select(contains("circumference"))
```
<br>
### remove all columns containing the letter "c"
::: {.callout-tip collapse=true}
#### HINT
Use one of the [tidyselect](https://tidyselect.r-lib.org/reference/language.html) helper functions.
Use a minus sign.
:::
```{r soldiers_7}
#| eval: false
soldiers |>
select(-contains("c"))
```
<br>
### select all columns that contains a "c" OR "x" OR "y" OR "z"
::: {.callout-tip collapse=true}
#### HINT
In R(and many other programming languages) the \| sign is used as a logical operator for OR.
:::
```{r}
#| eval: false
soldiers |>
select(contains("c") | contains("x") | contains("y") | contains("z"))
```
<br><br>
```{r}
#| echo: false
#| column: margin
knitr::include_graphics(here::here("img", "sherif.png"))
```
### Time to check your code
* Clean your code
* Restart R. (Click Session -> Restart R)\
* Run all your code chunks
* Your code should run without errors before your proceed
<br><br><br><br><br><br><br><br>
## `filter()`
\
### Keep all rows where `sex` is Female:
::: {.callout-tip collapse=true}
#### Hint
`???? == "Female"`
:::
```{r}
#| eval: false
soldiers |> filter(sex == "Female")
```
<br>
### Keep all rows where `weightkg` is missing (NA value)
::: {.callout-tip collapse=true}
#### Hint
use the `is.na()` function
:::
```{r}
#| eval: false
soldiers |>
filter(is.na(weightkg))
```
<br>
### Keep all rows where `WritingPreference` is "Left hand" AND `sex` is "Female"
```{r}
#| eval: false
soldiers |>
filter(WritingPreference == "Left hand" , sex == "Female") # you can use & instead of a ,
```
<br>
### Keep all rows where `WritingPreference` is "Left hand" OR `sex` is "Female"
```{r}
#| eval: false
soldiers |>
filter(WritingPreference == "Left hand" | sex == "Female")
```
<br>
### What is going wrong in this code?
```{r}
#| eval: false
#| code-fold: false
soldiers |>
select(1:5) |>
filter(WritingPreference == "Left hand" | sex == "Female")
```
The error message is:
```{r}
#| eval: false
#| code-fold: false
Error in `filter()`:
ℹ In argument: `WritingPreference == "Left hand" | sex == "Female"`.
Caused by error:
! object 'WritingPreference' not found
Run `rlang::last_error()` to see where the error occurred.
```
```{r}
#| eval: false
#| code-summary: "ANSWER"
The variable WritingPreference was not selected in the first line.
```
<br>
### Keep all rows where `age` is above 30 and the `weightkg` is below 600
```{r}
#| eval: false
soldiers |>
filter(age > 30, weightkg < 600)
```
\
### Keep all rows where `Ethnicity` is either "Mexican" OR "Filipino" OR "Caribbean Islander"
::: {.callout-tip collapse=true}
#### Hint
you can use `%in%` and `c()` or multiple calls of `Ethnicity == xxxx | Ethnicity == zzz | ...`
:::
```{r}
#| eval: false
soldiers |>
filter(Ethnicity %in% c("Mexican", "Filipino", "Caribbean Islander"))
# Alternatively
soldiers |>
filter(Ethnicity == "Mexican" | Ethnicity == "Filipino" | Ethnicity == "Caribbean Islander")
```
::: {.callout-caution collapse=true}
#### Difference between `==` and `%in%`
tldr - Use `%in%` instead of `==` when you want to filter for multiple values.
Read on if you want to understand why. (You don't have to)
The code `filter(Ethnicity == c("Mexican", "Filipino"))` is likely not doing what you expect. The '==' operator does an element-wise comparison, which means it compares the first element of 'Ethnicity' to the first element of the vector ("Mexican"), the second element of 'Ethnicity' to the second element of the vector ("Filipino"). The short vector is then recycled so now the third element of 'Ethnicity' is compared to the first element of the vector ("Mexican"), the fourth element of 'Ethnicity' to the second element of the vector ("Filipino"), and so on.
Inspect the differences in how may rows these lines of code produce
```{r}
#| code-fold: false
#| eval: false
soldiers |>
filter(Ethnicity %in% c("Mexican", "Filipino"))
soldiers |>
filter(Ethnicity == c("Mexican", "Filipino"))
```
Run this code chunk line by line. Inspect the differences.
```{r}
#| code-fold: false
#| eval: false
# Create a data frame
df <- data.frame(
Ethnicity = c("Mexican", "Filipino", "Italian", "Mexican", "Italian", "Filipino"),
Name = c("John", "Maria", "Luigi", "Carlos", "Francesco", "Jose"),
stringsAsFactors = FALSE
)
# Investigate the data frame
df
# Filter using %in%
df |> filter(Ethnicity %in% c("Mexican", "Filipino"))
# Filter using ==
df |> filter(Ethnicity == c("Mexican", "Filipino"))
```
:::
<br><br>
```{r}
#| echo: false
#| column: margin
knitr::include_graphics(here::here("img", "sherif.png"))
```
### Time to check your code
* Clean your code
* Restart R. (Click Session -> Restart R)\
* Run all your code chunks
* Your code should run without errors before your proceed
<br><br><br><br><br><br><br><br>
## `summarise()`
<br><br>
### Calculate the mean and standard deviation of `footlength`
```{r}
#| eval: false
soldiers |> summarise(
footlength_avg = mean(footlength),
footlength_sd = sd(footlength))
```
<br>
### Calculate the median and interquartile range of `earlength`
<details><summary>HINT</summary>
use the `IQR()` function
</details>
```{r}
#| eval: false
soldiers |>
summarise(
earlength_median = median(earlength),
earlength_iqr = IQR(earlength))
```
<br>
### Count the number of rows where `WritingPreference` is equal to "Right hand"
```{r}
#| eval: false
soldiers |>
summarise(
n_righthanded = sum(WritingPreference == "Right hand"))
```
\
### How old is the oldest soldier?
<details>
<summary>
HINT if you can't work out why get an `NA` value
</summary>
Many Base R functions, including `mean()`, does not ignore NA values by default. This means that if the vector contains an `NA` value the result will be `NA`. Is this a good or bad thing?\
You can set the argument `na.rm = TRUE`, to ignore missing values.
</details>
```{r}
#| eval: false
soldiers |>
summarise(
age_max = max(age, na.rm = TRUE))
```
\
### Calculate the mean weight of the Females
<details>
<summary>
HINT if you can't work out why get an `NA` value
</summary>
Many Base R functions, including `mean()`, does not ignore NA values by default. This means that if the vector contains an `NA` value the result will be `NA`. Is this a good or bad thing?\
You can set the argument `na.rm = TRUE`, to ignore missing values.
</details>
```{r}
#| eval: false
soldiers |>
filter(sex == "Female") |>
summarise(
weight_avg = mean(weightkg, na.rm = TRUE))
```
<br>
### Calculate the range in weight (max-min) within Males
```{r}
#| eval: false
soldiers |>
filter(sex == "Male") |>
summarise(
weight_range = max(weightkg, na.rm = TRUE)-min(weightkg, na.rm = TRUE))
```
<br><br>
```{r}
#| echo: false
#| column: margin
knitr::include_graphics(here::here("img", "sherif.png"))
```
### Time to check your code
* Clean your code
* Restart R. (Click Session -> Restart R)\
* Run all your code chunks (you can use "run all" CTRL+ALT+R)
* Your code should run without errors before your proceed
<br><br><br><br><br><br><br><br>
## `group_by()` and `arrange()`
<br><br>
### Calculate the mean and sd of `weightkg` and `age` for all `Installation`s
```{r}
#| eval: false
soldiers |>
group_by(Installation) |>
summarise(weight_avg = mean(weightkg, na.rm = TRUE),
weight_sd = sd(weightkg, na.rm = TRUE),
age_avg = mean(age, na.rm = TRUE),
age_sd = sd(age, na.rm = TRUE))
```
<br>
### Calculate the mean and sd of `weightkg` and `age` for all `Installation`s for both `sex`es
```{r}
#| eval: false
soldiers |>
group_by(Installation, sex) |>
summarise(weight_avg = mean(weightkg, na.rm = TRUE),
weight_sd = sd(weightkg, na.rm = TRUE),
age_avg = mean(age, na.rm = TRUE),
age_sd = sd(age, na.rm = TRUE))
```
<br>
### Calcualate the average height for each `Installation` and count the number of missing values within each `Installation`
::: {.callout-tip collapse=true}
#### Hint
To count missings, use the functions `sum()` and `is.na()`
:::
```{r}
#| eval: false
soldiers |>
group_by(Installation) |>
summarise(height_avg = mean(Heightin, na.rm = TRUE),
height_n_miss = sum(is.na(Heightin)))
```
<br>
### As before, but now also add the number of observations (rows) within each `Installation`
::: {.callout-tip collapse=true}
#### Hint
Use `n()`
:::
```{r}
#| eval: false
soldiers |>
group_by(Installation) |>
summarise(height_avg = mean(Heightin, na.rm = TRUE),
height_n_miss = sum(is.na(Heightin)),
n = n())
```
<br>
### As before, but now arrange the output after number of soldiers at each `Installation` in descending order.
```{r}
#| eval: false
soldiers |>
group_by(Installation) |>
summarise(height_avg = mean(Heightin, na.rm = TRUE),
height_n_miss = sum(is.na(Heightin)),
n = n()) |>
arrange(desc(n))
```
<br><br>
```{r}
#| echo: false
#| column: margin
knitr::include_graphics(here::here("img", "sherif.png"))
```
### Time to check your code
* Clean your code
* Restart R. (Click Session -> Restart R)\
* Run all your code chunks (you can use "run all" CTRL+ALT+R)
* Your code should run without errors before your proceed
<br><br><br><br><br><br><br><br>
## `mutate()`
<br><br>
### Add a column called `heightcm` with the height of the soldiers in cm
* Update the `soldiers` dataset with the new variable (`<-`)
* place the new variable after `Heightin`
```{r}
#| output: false
soldiers <- soldiers |>
mutate(
heightcm = Heightin * 2.54,
.after = Heightin)
# Tip:
# The typical workflow for updating an object is that you first
# write your code WITHOUT updating. Instead you print
# each incremental change in the console. When you are happy with
# your result you update the object.
# In this case this would mean that you add:
# "soldiers <-" in the first line as the last step
```
\
### Update the `weightkg` column to kg instead of kg*10
* Update the `soldiers` dataset with the new `weightkg` column
```{r}
#| output: false
soldiers <- soldiers |>
mutate(
weightkg = weightkg/10
)
```
\
### Add a column called `BMI` with the Body mass index (BMI) of the soldiers
[BMI](https://en.wikipedia.org/wiki/Body_mass_index)
* Update the `soldiers` dataset with the new variable
* place the new variable after `weightkg`
```{r}
#| output: false
soldiers <- soldiers |>
mutate(BMI = weightkg/(heightcm/100)^2,
.after = weightkg)
```
\
### Add a column called `obese` that contains the value TRUE if BMI is > 30
```{r}
#| eval: false
soldiers |>
mutate(
obese = if_else(BMI > 30, TRUE, FALSE),
.before = 1 # This line code just places the variable at the front
)
# OR
soldiers |>
mutate(
obese = BMI > 30,
.before = 1 # This line code just places the variable at the front
)
```
\
### Inspect the below table from [Wikipedia](https://en.wikipedia.org/wiki/Body_mass_index) {#sec-BMI}
```{r}
#| echo: false
tibble::tribble( ~Category, ~`BMI (kg/m^2^)`,
"Underweight (Severe thinness)", "< 16.0",
"Underweight (Moderate thinness)", "16.0 – 16.9",
"Underweight (Mild thinness)", "17.0 – 18.4",
"Normal range", "18.5 – 24.9",
"Overweight (Pre-obese)", "25.0 – 29.9",
"Obese (Class I)", "30.0 – 34.9",
"Obese (Class II)", "35.0 – 39.9",
"Obese (Class III)", ">= 40.0",
) |> knitr::kable()
```
\
### Create the variable `category` that tells us whether the soldiers are "Underweight", "Normal range", "Overweight", or "Obese"
* Update the `soldiers` dataset with the new variable
* place the new variable after `BMI`\
::: {.callout-tip collapse=true}
#### Hint 1
Use `case_when()`
:::
::: {.callout-tip collapse=true}
#### Hint 2
```{r}
#| eval: false
#| code-fold: false
soldiers |>
mutate(
category = ????
)
```
:::
::: {.callout-tip collapse=true}
#### Hint 3
```{r}
#| eval: false
#| code-fold: false
soldiers |>
mutate(
category = case_when(
#TEST HERE ~ OUTPUT,
#TEST HERE ~ OUTPUT,
#TEST HERE ~ OUTPUT,
#.default = OUTPUT
)
)
```
:::
```{r}
#| output: false
soldiers <- soldiers |>
mutate(
category = case_when(
BMI < 18.5 ~ "Underweight", # BMI below 18.5
BMI < 25 ~ "Normal range", # BMI between 18.5 and 24.9
BMI < 30 ~ "Overweight", # BMI between 25 and 29.9
BMI >= 30 ~ "Obese", # BMI 30 and above
.default = NA
),
.after = BMI # Place the new variable after BMI
)
```
### Code reading
::: {.panel-tabset}
## Task
Consider the two other solution examples in the tabs.
Do you see a problems?
Do you see improvements?
You can run the code in the examples, but try to see if you can predict what the code will do before you run it.
## Example 1
```{r}
#| code-fold: false
#| eval: false
soldiers |>
mutate(
category = case_when(
BMI >= 30 ~ "Obese",
BMI < 30 ~ "Overweight",
BMI < 25 ~ "Normal range",
BMI < 18.5 ~ "Underweight",
.default = NA
),
.after = BMI
)
```
```{r}
#| eval: false
#| code-summary: "Answer"
soldiers |>
mutate(
category = case_when(
BMI >= 30 ~ "Obese", # OK!
BMI < 30 ~ "Overweight", # ERROR !!! All BMI values below 30 will now be labelled as "Overweight"
BMI < 25 ~ "Normal range", # Will never be evaluated
BMI < 18.5 ~ "Underweight", # Will never be evaluated
.default = NA
),
.after = BMI
)
```
## Example 2
```{r}
#| code-fold: false
#| eval: false
soldiers |>
mutate(
category = case_when(
BMI < 18.5 ~ "Underweight",
BMI < 25 ~ "Normal range",
BMI < 30 ~ "Overweight",
BMI >= 30 ~ "Obese",
is.na(BMI) ~ NA,
.default = "UNCLASSIFIED CHECK YOUR CODE!!!!"
),
.after = BMI
)
```
```{r}
#| eval: false
#| code-summary: "Answer"
# The condition is.na(BMI) is logically valid but redundant in this example,
# as the .default argument already handles unmatched cases, including NAs.
# In this example, the preceding arguments mathematically account for all
# possible BMI values except NA.
# However, explicitly separating NA cases is considered good practice.
# It makes the code more readable and clearly distinguishes missing
# data from unexpected or unclassified combinations.
```
:::
<br><br>
### Time to check your code
```{r}
#| echo: false
#| column: margin
knitr::include_graphics(here::here("img", "sherif.png"))
```
* Clean your code
* Restart R. (Click Session -> Restart R)\
* Run all your code chunks (you can use "run all" CTRL+ALT+R)
* Your code should run without errors before your proceed
<br><br><br><br><br><br><br><br>
## `count()`
For simple counting `count()` is faster than `summarise(n = n())` or `mutate(n = n())`
In some of the exercises below we are using the `diamonds` dataset. This is an inbuilt dataset that comes with the `tidyverse`. `diamonds` is available even though you cant see it in your Environment pane.
<br><br>
### What is this code equivalent to?
```{r}
#| eval: false
#| code-fold: false
diamonds |>
count()
```
<details><summary>ANSWER</summary>
`count()` works like `summarise(n = n())`
</details>
<br><br>
### What is this code equivalent to?
```{r}
#| eval: false
#| code-fold: false
diamonds |>
count(cut)
```
```{r}
#| code-fold: true
#| code-summary: "Answer"
#| eval: false
# The code is equivalent to:
diamonds |> group_by(cut) |> summarise(n = n())
# OR
diamonds |> summarise(n = n(), .by = cut)
```
<br><br>
### What is this code equivalent to?
```{r}
#| eval: false
#| code-fold: false
diamonds |>
count(cut, color)
```
```{r}
#| code-fold: true
#| code-summary: "Answer"
#| eval: false
# The code is equivalent to:
diamonds |> group_by(cut, color) |> summarise(n = n())
# OR
diamonds |> summarise(n = n(), .by = c(cut, color))
# However, notice that the first solution returns a grouped tibble
```
<br><br>
### What is this code equivalent to?
```{r}
#| eval: false
#| code-fold: false
diamonds |>
add_count()
```
<details><summary>ANSWER</summary>
`add_count()` works like `mutate(n = n())`
</details>
<br><br>
### Return so `soldiers`
### Count the number of soldiers of each sex calculate the percentage of each sex
```{r}
#| output: false
soldiers |>
count(sex) |>
mutate(percent = n/sum(n)*100)
```
<br><br>
### Inspect the output of this code. What is going wrong?
```{r}
#| code-fold: false
soldiers |>
group_by(sex) |>
count() |>
mutate(percent = n/sum(n)*100)
```
<br><br>
## Next up
```{r}
#| echo: false
#| column: margin
knitr::include_graphics(here("img", "art.png"))
```
Congratulations! You have completed all the `dplyr`starter exercises. Return to "Files, Folders, and Projects Discipline" and continue with [Setting up your course project (continued)](../files_folders_and_projects_discipline/files_folders_and_project_discipline.qmd)