tidy

Steen Flammild Harsted and Søren O´Neill

The Workflow

Data transformation with `dplyr`

We will use the dplyr package that loads with the tidyverse.
At current dplyr contains 293 functions.

The main 6 functions in dplyr that we will use today are:

select()
filter()
summarise()
group_by()
mutate()
count()

dplyr cheatsheet

Load the `dplyr` library

library(tidyverse)

dplyr is one of the many packages that are loaded with the tidyverse

Take a look at the `starwars` dataset

starwars is one of several datasets that loads when you load the tidyverse.
The datasets are great for practicing.

We call the dataset by typing its name.

starwars

# A tibble: 87 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 R2-D2        96    32 <NA>       white, bl… red             33   none  mascu…
 4 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 5 Leia Or…    150    49 brown      light      brown           19   fema… femin…
 6 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 7 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 8 R5-D4        97    32 <NA>       white, red red             NA   none  mascu…
 9 Biggs D…    183    84 black      light      brown           24   male  mascu…
10 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

`select()`

Function:
Select a subset of columns (variables). Gets rid of all other columns. You can refer to the columns by the column position (first column) or by name.
The order in which you list the column names/positions is the order that the columns will be displayed.

Example
In the starwars dataset, only keep the columns name and homeworld:

starwars |> 
  select(name, homeworld)

# A tibble: 87 × 2
   name               homeworld
   <chr>              <chr>    
 1 Luke Skywalker     Tatooine 
 2 C-3PO              Tatooine 
 3 R2-D2              Naboo    
 4 Darth Vader        Tatooine 
 5 Leia Organa        Alderaan 
 6 Owen Lars          Tatooine 
 7 Beru Whitesun Lars Tatooine 
 8 R5-D4              Tatooine 
 9 Biggs Darklighter  Tatooine 
10 Obi-Wan Kenobi     Stewjon  
# ℹ 77 more rows

`select()`

Only keep the first five columns:

starwars |> select(1:5)
starwars |> select(1,2,3,4,5)

Keep all of the columns except for name:

starwars |> select(-name)

Keep all of the columns except for name and mass:

starwars |> select(-name, -mass)

Keep all of the columns except for the first five columns:

starwars |> select (-1,-2,-3,-4,-5)
starwars |> select(-(1:5))

`select()` and the tidyselect helpers

starts_with()
ends_with()
contains()

Are three of the tidyselect helper functions.

You can use them inside the select() function.

Example
select all columns that start with “s”

starwars |> 
  select(starts_with("s"))

# A tibble: 87 × 4
   skin_color  sex    species starships
   <chr>       <chr>  <chr>   <list>   
 1 fair        male   Human   <chr [2]>
 2 gold        none   Droid   <chr [0]>
 3 white, blue none   Droid   <chr [0]>
 4 white       male   Human   <chr [1]>
 5 light       female Human   <chr [0]>
 6 light       male   Human   <chr [0]>
 7 light       female Human   <chr [0]>
 8 white, red  none   Droid   <chr [0]>
 9 light       male   Human   <chr [1]>
10 fair        male   Human   <chr [5]>
# ℹ 77 more rows

Lets practice using `select()`

`filter()`

Function:
Only keep specific rows of data that meet the specified requirement(s).

Example
In the starwars dataset, only Keep the rows where height is above 150:

starwars |> 
  filter(height>150)

# A tibble: 69 × 14
   name     height  mass hair_color skin_color eye_color birth_year sex   gender
   <chr>     <int> <dbl> <chr>      <chr>      <chr>          <dbl> <chr> <chr> 
 1 Luke Sk…    172    77 blond      fair       blue            19   male  mascu…
 2 C-3PO       167    75 <NA>       gold       yellow         112   none  mascu…
 3 Darth V…    202   136 none       white      yellow          41.9 male  mascu…
 4 Owen La…    178   120 brown, gr… light      blue            52   male  mascu…
 5 Beru Wh…    165    75 brown      light      blue            47   fema… femin…
 6 Biggs D…    183    84 black      light      brown           24   male  mascu…
 7 Obi-Wan…    182    77 auburn, w… fair       blue-gray       57   male  mascu…
 8 Anakin …    188    84 blond      fair       blue            41.9 male  mascu…
 9 Wilhuff…    180    NA auburn, g… fair       blue            64   male  mascu…
10 Chewbac…    228   112 brown      unknown    blue           200   male  mascu…
# ℹ 59 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
#   vehicles <list>, starships <list>

`filter()`

Works with the logical operators:

Operator	Action
!	NOT
==	EQUAL TO
!=	NOT EQUAL TO
\|	OR
>	GREATER THAN
>=	GREATER THAN OR EQUAL TO
<	LESS THAN
<=	LESS THAN OR EQUAL TO
&	AND

Actually, most functions that output either TRUE or FALSE e.g., is.NA() will work within filter()
Further reading

`filter()`

Keep all rows where gender is masculine:
notice == not =
notice “masculine” not masculine

starwars |> filter(gender == "masculine")

Keep all rows where hair_color is missing (NA value)

starwars |> filter(is.na(hair_color))

Keep all rows where hair_color is “blond” AND eye_color is “blue”

starwars |> filter(hair_color == "blond" &  eye_color == "blue")
starwars |> filter(hair_color == "blond" ,  eye_color == "blue")

Keep all rows where hair_color is “blond” OR eye_color is “blue”

starwars |> filter(hair_color == "blond" |  eye_color == "blue")

Keep all rows where hair_color is either “blond” OR “white”

starwars |> filter(hair_color %in% c("blond", "white"))

Lets Practice using the `filter()` function

`summarise()`

Function:
Collapses all rows and returns a one-row summary.

Example

In the diamonds dataset, calculate the mean price

diamonds |> 
  summarise(
    avg_price = mean(price))

# A tibble: 1 × 1
  avg_price
      <dbl>
1     3933.

`summarise()`

Some functions that often go inside summarise()

mean()
sd()
n()
median()
quantile()
sum()
max()
min()
IQR()

`summarise`

Calculate the mean and standard deviation of price

diamonds |> 
  summarise(
    price_avg = mean(price),
    price_sd = sd(price))

Calculate the median and interquartile range of price

diamonds |> 
  summarise(
    price_median = median(price),
    price_iqr = IQR(price))

Count the number of rows where cut is equal to “Ideal”

diamonds |>
  summarise(
    n_ideal = sum(cut == "Ideal"))

What is the price of the most expensive diamond?

diamonds |> 
  summarise(
    price_max = max(price))

Lets Practice using the `summarise()` function

`group_by()`

Function:
Takes existing data and groups specific variables together for future operations.

Example

Group the diamonds dataset by cut

diamonds |> 
  group_by(cut)

# A tibble: 53,940 × 10
# Groups:   cut [5]
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

`group_by()`

Function:
Takes existing data and groups specific variables together for future operations.

Example

Group the diamonds dataset by cut, then caluclate the mean price

diamonds |> 
  group_by(cut) |> 
  summarise(avg_price = mean(price))

# A tibble: 5 × 2
  cut       avg_price
  <ord>         <dbl>
1 Fair          4359.
2 Good          3929.
3 Very Good     3982.
4 Premium       4584.
5 Ideal         3458.

`group_by()`

You can group by multiple columns

Example

Group the diamonds dataset by cut and clarity, then caluclate the mean and sd of price

diamonds |> 
  group_by(cut, color) |> 
  summarise(avg_price = mean(price),
            sd_price = sd(price))

# A tibble: 35 × 4
# Groups:   cut [5]
   cut   color avg_price sd_price
   <ord> <ord>     <dbl>    <dbl>
 1 Fair  D         4291.    3286.
 2 Fair  E         3682.    2977.
 3 Fair  F         3827.    3223.
 4 Fair  G         4239.    3610.
 5 Fair  H         5136.    3886.
 6 Fair  I         4685.    3730.
 7 Fair  J         4976.    4050.
 8 Good  D         3405.    3175.
 9 Good  E         3424.    3331.
10 Good  F         3496.    3202.
# ℹ 25 more rows

`group_by()` and `ungroup`

You can undo a group_by() with ungroup()

diamonds |> 
  group_by(cut) |> 
  ungroup()

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

`arrange()`

Function:
Arranges (sorts) your tibble based on the values within a variable.

Example

Arrange the diamonds dataset by cut

diamonds |> 
  arrange(cut)

# A tibble: 53,940 × 10
   carat cut   color clarity depth table price     x     y     z
   <dbl> <ord> <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.22 Fair  E     VS2      65.1    61   337  3.87  3.78  2.49
 2  0.86 Fair  E     SI2      55.1    69  2757  6.45  6.33  3.52
 3  0.96 Fair  F     SI2      66.3    62  2759  6.27  5.95  4.07
 4  0.7  Fair  F     VS2      64.5    57  2762  5.57  5.53  3.58
 5  0.7  Fair  F     VS2      65.3    55  2762  5.63  5.58  3.66
 6  0.91 Fair  H     SI2      64.4    57  2763  6.11  6.09  3.93
 7  0.91 Fair  H     SI2      65.7    60  2763  6.03  5.99  3.95
 8  0.98 Fair  H     SI2      67.9    60  2777  6.05  5.97  4.08
 9  0.84 Fair  G     SI1      55.1    67  2782  6.39  6.2   3.47
10  1.01 Fair  E     I1       64.5    58  2788  6.29  6.21  4.03
# ℹ 53,930 more rows

`arrange()` and `desc()`

Function:
arrange() Arranges (sorts) your tibble based on the values within a variable.
desc() is put inside arrange() to make the order descending

Example
Arrange the diamonds dataset by price

diamonds |> 
  arrange(price)

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39
# ℹ 53,930 more rows

Arrange the diamonds dataset in descending order by price

diamonds |> 
  arrange(desc(price))

# A tibble: 53,940 × 10
   carat cut       color clarity depth table price     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl>
 1  2.29 Premium   I     VS2      60.8    60 18823  8.5   8.47  5.16
 2  2    Very Good G     SI1      63.5    56 18818  7.9   7.97  5.04
 3  1.51 Ideal     G     IF       61.7    55 18806  7.37  7.41  4.56
 4  2.07 Ideal     G     SI2      62.5    55 18804  8.2   8.13  5.11
 5  2    Very Good H     SI1      62.8    57 18803  7.95  8     5.01
 6  2.29 Premium   I     SI1      61.8    59 18797  8.52  8.45  5.24
 7  2.04 Premium   H     SI1      58.1    60 18795  8.37  8.28  4.84
 8  2    Premium   I     VS1      60.8    59 18795  8.13  8.02  4.91
 9  1.71 Premium   F     VS2      62.3    59 18791  7.57  7.53  4.7 
10  2.15 Ideal     G     SI2      62.6    54 18791  8.29  8.35  5.21
# ℹ 53,930 more rows

Lets Practice using the `group_by()` and `arrange()` functions

`mutate()`

Function:
Adds new columns or modifies current variables in the tibble.

Example
In the diamonds dataset, create a new variable called tax equal to 25% of price

diamonds |> 
  mutate(
    tax = price*0.25)

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z   tax
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43  81.5
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31  81.5
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31  81.8
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63  83.5
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75  83.8
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48  84  
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47  84  
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53  84.2
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49  84.2
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39  84.5
# ℹ 53,930 more rows

`mutate()`

Function:
Adds new columns or modifies current variables in the tibble.

Example
In the diamonds dataset, create a new variable called tax equal to 25% of price
Place the new variable tax right after price

diamonds |> 
  mutate(
    tax = price*0.25,
    .after = price)

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price   tax     x     y     z
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
 1  0.23 Ideal     E     SI2      61.5    55   326  81.5  3.95  3.98  2.43
 2  0.21 Premium   E     SI1      59.8    61   326  81.5  3.89  3.84  2.31
 3  0.23 Good      E     VS1      56.9    65   327  81.8  4.05  4.07  2.31
 4  0.29 Premium   I     VS2      62.4    58   334  83.5  4.2   4.23  2.63
 5  0.31 Good      J     SI2      63.3    58   335  83.8  4.34  4.35  2.75
 6  0.24 Very Good J     VVS2     62.8    57   336  84    3.94  3.96  2.48
 7  0.24 Very Good I     VVS1     62.3    57   336  84    3.95  3.98  2.47
 8  0.26 Very Good H     SI1      61.9    55   337  84.2  4.07  4.11  2.53
 9  0.22 Fair      E     VS2      65.1    61   337  84.2  3.87  3.78  2.49
10  0.23 Very Good H     VS1      59.4    61   338  84.5  4     4.05  2.39
# ℹ 53,930 more rows

`mutate()`

Add a new colum called dummy with only TRUE values

diamonds |> 
  mutate(dummy = TRUE)

Add a new collum called cubic_volume that multiplies x, y, and z

diamonds |> 
  mutate(cubic_volume = x*y*z)

Add a new column that calculates the difference in price from the mean price

diamonds |> 
  mutate(price_dm = price-mean(price))

`if_else()` often goes inside `mutate()`

Function:
if_else()contains 3 parts:

A test
What to do if the test is TRUE
What to do if the test is FALSE

Example

diamonds |> 
  mutate(
    buy = if_else(cut == "Ideal" & price < mean(price), "YES", "NO"))

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z buy  
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 YES  
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 NO   
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 NO   
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 NO   
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 NO   
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 NO   
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 NO   
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 NO   
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 NO   
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 NO   
# ℹ 53,930 more rows

`case_when()` often goes inside `mutate()`

Function:

case_when() contains a series of tests instead of just one.
The output of a passed test is written after a ~
case_when ends with TRUE ~ (what to do if none of the tests are passed)

Example

diamonds |> 
  mutate(
    buy = case_when(
      cut == "Ideal" & price < mean(price) ~ "YES", 
      cut == "Premium" & price < mean(price) ~ "MAYBE",
      .default ="NO"))

# A tibble: 53,940 × 11
   carat cut       color clarity depth table price     x     y     z buy  
   <dbl> <ord>     <ord> <ord>   <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>
 1  0.23 Ideal     E     SI2      61.5    55   326  3.95  3.98  2.43 YES  
 2  0.21 Premium   E     SI1      59.8    61   326  3.89  3.84  2.31 MAYBE
 3  0.23 Good      E     VS1      56.9    65   327  4.05  4.07  2.31 NO   
 4  0.29 Premium   I     VS2      62.4    58   334  4.2   4.23  2.63 MAYBE
 5  0.31 Good      J     SI2      63.3    58   335  4.34  4.35  2.75 NO   
 6  0.24 Very Good J     VVS2     62.8    57   336  3.94  3.96  2.48 NO   
 7  0.24 Very Good I     VVS1     62.3    57   336  3.95  3.98  2.47 NO   
 8  0.26 Very Good H     SI1      61.9    55   337  4.07  4.11  2.53 NO   
 9  0.22 Fair      E     VS2      65.1    61   337  3.87  3.78  2.49 NO   
10  0.23 Very Good H     VS1      59.4    61   338  4     4.05  2.39 NO   
# ℹ 53,930 more rows

Lets Practice using the `mutate()` function

and if_else() and case_when()

`count()`

count() is very useful for fast counting and to get an overview of the number of observations within a (grouped) dataset

diamonds |> 
  count()

# A tibble: 1 × 1
      n
  <int>
1 53940

diamonds |> 
  count(cut)

# A tibble: 5 × 2
  cut           n
  <ord>     <int>
1 Fair       1610
2 Good       4906
3 Very Good 12082
4 Premium   13791
5 Ideal     21551

diamonds |> 
  count(cut, color)

# A tibble: 35 × 3
   cut   color     n
   <ord> <ord> <int>
 1 Fair  D       163
 2 Fair  E       224
 3 Fair  F       312
 4 Fair  G       314
 5 Fair  H       303
 6 Fair  I       175
 7 Fair  J       119
 8 Good  D       662
 9 Good  E       933
10 Good  F       909
# ℹ 25 more rows

tidy

The Workflow

The Workflow

Data transformation with dplyr

Load the dplyr library

Take a look at the starwars dataset

select()

select()

select()

select() and the tidyselect helpers

Lets practice using select()

filter()

filter()

filter()

filter()

Lets Practice using the filter() function

summarise()

summarise()

summarise()

summarise

Lets Practice using the summarise() function

group_by()

group_by()

group_by()

group_by()

group_by() and ungroup

arrange()

arrange() and desc()

Lets Practice using the group_by() and arrange() functions

mutate()

mutate()

mutate()

mutate()

if_else() often goes inside mutate()

case_when() often goes inside mutate()

Lets Practice using the mutate() function

count()

count()

Thanks!

Data transformation with `dplyr`

Load the `dplyr` library

Take a look at the `starwars` dataset

`select()`

`select()`

`select()`

`select()` and the tidyselect helpers

Lets practice using `select()`

`filter()`

`filter()`

`filter()`

`filter()`

Lets Practice using the `filter()` function

`summarise()`

`summarise()`

`summarise()`

`summarise`

Lets Practice using the `summarise()` function

`group_by()`

`group_by()`

`group_by()`

`group_by()`

`group_by()` and `ungroup`

`arrange()`

`arrange()` and `desc()`

Lets Practice using the `group_by()` and `arrange()` functions

`mutate()`

`mutate()`

`mutate()`

`mutate()`

`if_else()` often goes inside `mutate()`

`case_when()` often goes inside `mutate()`

Lets Practice using the `mutate()` function

`count()`

`count()`