dplyr
We will use the dplyr
package that loads with the tidyverse
.
At current dplyr
contains 293 functions.
The main 6 functions in dplyr
that we will use today are:
select()
filter()
summarise()
group_by()
mutate()
count()
dplyr
cheatsheet
dplyr
librarydplyr
is one of the many packages that are loaded with the tidyverse
starwars
datasetstarwars
is one of several datasets that loads when you load the tidyverse.
The datasets are great for practicing.
We call the dataset by typing its name.
::: {.cell output-location=‘fragment’}
# A tibble: 87 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 R2-D2 96 32 <NA> white, bl… red 33 none mascu…
4 Darth V… 202 136 none white yellow 41.9 male mascu…
5 Leia Or… 150 49 brown light brown 19 fema… femin…
6 Owen La… 178 120 brown, gr… light blue 52 male mascu…
7 Beru Wh… 165 75 brown light blue 47 fema… femin…
8 R5-D4 97 32 <NA> white, red red NA none mascu…
9 Biggs D… 183 84 black light brown 24 male mascu…
10 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
# ℹ 77 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
:::
select()
select()
Function:
Select a subset of columns (variables). Gets rid of all other columns. You can refer to the columns by the column position (first column) or by name.
The order in which you list the column names/positions is the order that the columns will be displayed.
Example
In the starwars dataset, only keep the columns name
and homeworld
:
::: {.cell output-location=‘column-fragment’}
# A tibble: 87 × 2
name homeworld
<chr> <chr>
1 Luke Skywalker Tatooine
2 C-3PO Tatooine
3 R2-D2 Naboo
4 Darth Vader Tatooine
5 Leia Organa Alderaan
6 Owen Lars Tatooine
7 Beru Whitesun Lars Tatooine
8 R5-D4 Tatooine
9 Biggs Darklighter Tatooine
10 Obi-Wan Kenobi Stewjon
# ℹ 77 more rows
:::
select()
Only keep the first five columns: ::: {.cell}
:::
Keep all of the columns except for name
: ::: {.cell}
:::
Keep all of the columns except for name
and mass
: ::: {.cell}
:::
Keep all of the columns except for the first five columns: ::: {.cell}
:::
select()
and the tidyselect helpers
::: {.cell}
:::
Are three of the tidyselect helper functions.
You can use them inside the select()
function.
Example
select all columns that start with “s” ::: {.cell output-location=‘column-fragment’}
# A tibble: 87 × 4
skin_color sex species starships
<chr> <chr> <chr> <list>
1 fair male Human <chr [2]>
2 gold none Droid <chr [0]>
3 white, blue none Droid <chr [0]>
4 white male Human <chr [1]>
5 light female Human <chr [0]>
6 light male Human <chr [0]>
7 light female Human <chr [0]>
8 white, red none Droid <chr [0]>
9 light male Human <chr [1]>
10 fair male Human <chr [5]>
# ℹ 77 more rows
:::
select()
filter()
filter()
Function:
Only keep specific rows of data that meet the specified requirement(s).
Example
In the starwars dataset, only Keep the rows where height
is above 150:
::: {.cell output-location=‘fragment’}
# A tibble: 69 × 14
name height mass hair_color skin_color eye_color birth_year sex gender
<chr> <int> <dbl> <chr> <chr> <chr> <dbl> <chr> <chr>
1 Luke Sk… 172 77 blond fair blue 19 male mascu…
2 C-3PO 167 75 <NA> gold yellow 112 none mascu…
3 Darth V… 202 136 none white yellow 41.9 male mascu…
4 Owen La… 178 120 brown, gr… light blue 52 male mascu…
5 Beru Wh… 165 75 brown light blue 47 fema… femin…
6 Biggs D… 183 84 black light brown 24 male mascu…
7 Obi-Wan… 182 77 auburn, w… fair blue-gray 57 male mascu…
8 Anakin … 188 84 blond fair blue 41.9 male mascu…
9 Wilhuff… 180 NA auburn, g… fair blue 64 male mascu…
10 Chewbac… 228 112 brown unknown blue 200 male mascu…
# ℹ 59 more rows
# ℹ 5 more variables: homeworld <chr>, species <chr>, films <list>,
# vehicles <list>, starships <list>
:::
filter()
Works with the logical operators:
::: {.cell} ::: {.cell-output-display}
Operator | Action |
---|---|
! | NOT |
== | EQUAL TO |
!= | NOT EQUAL TO |
| | OR |
> | GREATER THAN |
>= | GREATER THAN OR EQUAL TO |
< | LESS THAN |
<= | LESS THAN OR EQUAL TO |
& | AND |
::: :::
Actually, most functions that output either TRUE or FALSE e.g., is.NA()
will work within filter()
Further reading
filter()
Keep all rows where gender
is masculine:
notice == not =
notice “masculine” not masculine ::: {.cell}
:::
Keep all rows where hair_color
is missing (NA value) ::: {.cell}
:::
Keep all rows where hair_color
is “blond” AND eye_color
is “blue” ::: {.cell}
starwars |> filter(hair_color == "blond" & eye_color == "blue")
starwars |> filter(hair_color == "blond" , eye_color == "blue")
:::
Keep all rows where hair_color
is “blond” OR eye_color
is “blue” ::: {.cell}
:::
Keep all rows where hair_color
is either “blond” OR “white” ::: {.cell}
:::
filter()
functionsummarise()
summarise()
Function:
Collapses all rows and returns a one-row summary.
Example
In the diamonds
dataset, calculate the mean price ::: {.cell output-location=‘fragment’}
# A tibble: 1 × 1
avg_price
<dbl>
1 3933.
:::
summarise()
Some functions that often go inside summarise()
mean()
sd()
n()
median()
quantile()
sum()
max()
min()
IQR()
summarise
Calculate the mean and standard deviation of price
Calculate the median and interquartile range of price
::: {.cell}
:::
Count the number of rows where cut
is equal to “Ideal” ::: {.cell}
:::
What is the price of the most expensive diamond? ::: {.cell}
:::
summarise()
functiongroup_by()
group_by()
Function:
Takes existing data and groups specific variables together for future operations.
Example
Group the diamonds
dataset by cut
::: {.cell output-location=‘fragment’}
# A tibble: 53,940 × 10
# Groups: cut [5]
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
:::
group_by()
Function:
Takes existing data and groups specific variables together for future operations.
Example
Group the diamonds
dataset by cut
, then caluclate the mean price ::: {.cell output-location=‘fragment’}
# A tibble: 5 × 2
cut avg_price
<ord> <dbl>
1 Fair 4359.
2 Good 3929.
3 Very Good 3982.
4 Premium 4584.
5 Ideal 3458.
:::
group_by()
You can group by multiple columns
Example
Group the diamonds
dataset by cut
and clarity
, then caluclate the mean and sd of price
::: {.cell output-location=‘fragment’}
# A tibble: 35 × 4
# Groups: cut [5]
cut color avg_price sd_price
<ord> <ord> <dbl> <dbl>
1 Fair D 4291. 3286.
2 Fair E 3682. 2977.
3 Fair F 3827. 3223.
4 Fair G 4239. 3610.
5 Fair H 5136. 3886.
6 Fair I 4685. 3730.
7 Fair J 4976. 4050.
8 Good D 3405. 3175.
9 Good E 3424. 3331.
10 Good F 3496. 3202.
# ℹ 25 more rows
:::
group_by()
and ungroup
You can undo a group_by()
with ungroup()
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
arrange()
Function:
Arranges (sorts) your tibble based on the values within a variable.
Example
Arrange the diamonds
dataset by cut
::: {.cell output-location=‘fragment’}
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
2 0.86 Fair E SI2 55.1 69 2757 6.45 6.33 3.52
3 0.96 Fair F SI2 66.3 62 2759 6.27 5.95 4.07
4 0.7 Fair F VS2 64.5 57 2762 5.57 5.53 3.58
5 0.7 Fair F VS2 65.3 55 2762 5.63 5.58 3.66
6 0.91 Fair H SI2 64.4 57 2763 6.11 6.09 3.93
7 0.91 Fair H SI2 65.7 60 2763 6.03 5.99 3.95
8 0.98 Fair H SI2 67.9 60 2777 6.05 5.97 4.08
9 0.84 Fair G SI1 55.1 67 2782 6.39 6.2 3.47
10 1.01 Fair E I1 64.5 58 2788 6.29 6.21 4.03
# ℹ 53,930 more rows
:::
arrange()
and desc()
Function:
arrange()
Arranges (sorts) your tibble based on the values within a variable.
desc()
is put inside arrange()
to make the order descending
Example
Arrange the diamonds
dataset by price
::: {.cell output-location=‘column-fragment’}
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39
# ℹ 53,930 more rows
:::
Arrange the diamonds
dataset in descending order by price
::: {.cell output-location=‘column-fragment’}
# A tibble: 53,940 × 10
carat cut color clarity depth table price x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl>
1 2.29 Premium I VS2 60.8 60 18823 8.5 8.47 5.16
2 2 Very Good G SI1 63.5 56 18818 7.9 7.97 5.04
3 1.51 Ideal G IF 61.7 55 18806 7.37 7.41 4.56
4 2.07 Ideal G SI2 62.5 55 18804 8.2 8.13 5.11
5 2 Very Good H SI1 62.8 57 18803 7.95 8 5.01
6 2.29 Premium I SI1 61.8 59 18797 8.52 8.45 5.24
7 2.04 Premium H SI1 58.1 60 18795 8.37 8.28 4.84
8 2 Premium I VS1 60.8 59 18795 8.13 8.02 4.91
9 1.71 Premium F VS2 62.3 59 18791 7.57 7.53 4.7
10 2.15 Ideal G SI2 62.6 54 18791 8.29 8.35 5.21
# ℹ 53,930 more rows
:::
group_by()
and arrange()
functionsmutate()
mutate()
Function:
Adds new columns or modifies current variables in the tibble.
Example
In the diamonds dataset, create a new variable called tax
equal to 25% of price
::: {.cell output-location=‘fragment’}
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z tax
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 81.5
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 81.5
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 81.8
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 83.5
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 83.8
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 84
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 84
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 84.2
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 84.2
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 84.5
# ℹ 53,930 more rows
:::
mutate()
Function:
Adds new columns or modifies current variables in the tibble.
Example
In the diamonds dataset, create a new variable called tax
equal to 25% of price
Place the new variable tax
right after price
::: {.cell output-location=‘fragment’}
# A tibble: 53,940 × 11
carat cut color clarity depth table price tax x y z
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <dbl>
1 0.23 Ideal E SI2 61.5 55 326 81.5 3.95 3.98 2.43
2 0.21 Premium E SI1 59.8 61 326 81.5 3.89 3.84 2.31
3 0.23 Good E VS1 56.9 65 327 81.8 4.05 4.07 2.31
4 0.29 Premium I VS2 62.4 58 334 83.5 4.2 4.23 2.63
5 0.31 Good J SI2 63.3 58 335 83.8 4.34 4.35 2.75
6 0.24 Very Good J VVS2 62.8 57 336 84 3.94 3.96 2.48
7 0.24 Very Good I VVS1 62.3 57 336 84 3.95 3.98 2.47
8 0.26 Very Good H SI1 61.9 55 337 84.2 4.07 4.11 2.53
9 0.22 Fair E VS2 65.1 61 337 84.2 3.87 3.78 2.49
10 0.23 Very Good H VS1 59.4 61 338 84.5 4 4.05 2.39
# ℹ 53,930 more rows
:::
mutate()
Add a new colum called dummy
with only TRUE
values ::: {.cell}
:::
Add a new collum called cubic_volume
that multiplies x
, y
, and z
::: {.cell}
:::
Add a new column that calculates the difference in price
from the mean price
::: {.cell}
:::
if_else()
often goes inside mutate()
Function:
if_else()
contains 3 parts:
A test
What to do if the test is TRUE
What to do if the test is FALSE
Example
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z buy
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 YES
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 NO
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 NO
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 NO
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 NO
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 NO
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 NO
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 NO
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 NO
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 NO
# ℹ 53,930 more rows
case_when()
often goes inside mutate()
Function:
case_when()
contains a series of tests instead of just one.
The output of a passed test is written after a ~
case_when ends with TRUE ~
(what to do if none of the tests are passed)
Example
::: {.cell output-location=‘fragment’}
diamonds |>
mutate(
buy = case_when(
cut == "Ideal" & price < mean(price) ~ "YES",
cut == "Premium" & price < mean(price) ~ "MAYBE",
.default ="NO"))
# A tibble: 53,940 × 11
carat cut color clarity depth table price x y z buy
<dbl> <ord> <ord> <ord> <dbl> <dbl> <int> <dbl> <dbl> <dbl> <chr>
1 0.23 Ideal E SI2 61.5 55 326 3.95 3.98 2.43 YES
2 0.21 Premium E SI1 59.8 61 326 3.89 3.84 2.31 MAYBE
3 0.23 Good E VS1 56.9 65 327 4.05 4.07 2.31 NO
4 0.29 Premium I VS2 62.4 58 334 4.2 4.23 2.63 MAYBE
5 0.31 Good J SI2 63.3 58 335 4.34 4.35 2.75 NO
6 0.24 Very Good J VVS2 62.8 57 336 3.94 3.96 2.48 NO
7 0.24 Very Good I VVS1 62.3 57 336 3.95 3.98 2.47 NO
8 0.26 Very Good H SI1 61.9 55 337 4.07 4.11 2.53 NO
9 0.22 Fair E VS2 65.1 61 337 3.87 3.78 2.49 NO
10 0.23 Very Good H VS1 59.4 61 338 4 4.05 2.39 NO
# ℹ 53,930 more rows
:::
mutate()
functionand if_else()
and case_when()
count()
count()
count()
is very useful for fast counting and to get an overview of the number of observations within a (grouped) dataset ::: {.cell output-location=‘column-fragment’}
# A tibble: 1 × 1
n
<int>
1 53940
:::
::: {.cell output-location=‘column-fragment’}
# A tibble: 5 × 2
cut n
<ord> <int>
1 Fair 1610
2 Good 4906
3 Very Good 12082
4 Premium 13791
5 Ideal 21551
:::
::: {.cell output-location=‘column-fragment’}
# A tibble: 35 × 3
cut color n
<ord> <ord> <int>
1 Fair D 163
2 Fair E 224
3 Fair F 312
4 Fair G 314
5 Fair H 303
6 Fair I 175
7 Fair J 119
8 Good D 662
9 Good E 933
10 Good F 909
# ℹ 25 more rows
:::