superheroes
dplyr and tidyr
full_join()
and friends[]_join
It’s rare that a data analysis involves only a single table of data. In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them.
inner_join()
semi_join()
left_join()
anti_join()
full_join()
[]_join
- Example dataname | alignment | gender | publisher |
---|---|---|---|
Magneto | bad | male | Marvel |
Storm | good | female | Marvel |
Mystique | bad | female | Marvel |
Batman | good | male | DC |
Joker | bad | male | DC |
Catwoman | bad | female | DC |
Hellboy | good | male | Dark Horse Comics |
Take a good look at the two dataframes
What columns match between the two dataframes?
Are the values in the publisher
column identical in the two dataframes?
Are the values in the publisher
column arranged (sorted) in the same way in the two dataframes?
What kind of information are we missing on the Dark Horse Comics
and Image
publishers?
inner_join()
inner_join(x, y)
returns all rows from x where there are matching values in y, and all columns from x and y.
name | alignment | gender | publisher | yr_founded |
---|---|---|---|---|
Magneto | bad | male | Marvel | 1939 |
Storm | good | female | Marvel | 1939 |
Mystique | bad | female | Marvel | 1939 |
Batman | good | male | DC | 1934 |
Joker | bad | male | DC | 1934 |
Catwoman | bad | female | DC | 1934 |
Results are identical. Only variable order is different.
What information was filtered away?
semi_join()
semi_join(x, y)
returns all rows from x where there are matching values in y, keeping just columns from x.
left_join()
left_join(x, y)
returns all rows from x, and all columns from x and y.
name | alignment | gender | publisher | yr_founded |
---|---|---|---|---|
Magneto | bad | male | Marvel | 1939 |
Storm | good | female | Marvel | 1939 |
Mystique | bad | female | Marvel | 1939 |
Batman | good | male | DC | 1934 |
Joker | bad | male | DC | 1934 |
Catwoman | bad | female | DC | 1934 |
Hellboy | good | male | Dark Horse Comics | NA |
anti_join()
anti_join(x, y)
returns all rows from x where there are not matching values in y. Keeping only columns in x.
full_join()
full_join(x, y)
Return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.
name | alignment | gender | publisher | yr_founded |
---|---|---|---|---|
Magneto | bad | male | Marvel | 1939 |
Storm | good | female | Marvel | 1939 |
Mystique | bad | female | Marvel | 1939 |
Batman | good | male | DC | 1934 |
Joker | bad | male | DC | 1934 |
Catwoman | bad | female | DC | 1934 |
Hellboy | good | male | Dark Horse Comics | NA |
NA | NA | NA | Image | 1992 |
publisher | yr_founded | name | alignment | gender |
---|---|---|---|---|
DC | 1934 | Batman | good | male |
DC | 1934 | Joker | bad | male |
DC | 1934 | Catwoman | bad | female |
Marvel | 1939 | Magneto | bad | male |
Marvel | 1939 | Storm | good | female |
Marvel | 1939 | Mystique | bad | female |
Image | 1992 | NA | NA | NA |
Dark Horse Comics | NA | Hellboy | good | male |
pivot_longer()
and pivot_wider()
pivot_
pivot_longer()
makes datasets longer by increasing the number of rows and decreasing the number of columns.pivot_wider()
makes datasets wider by decreasing the number of rows and increasing the number of columns.pivot_longer()
The relig_income
dataset
# A tibble: 6 × 11
religion `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
<chr> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Agnostic 27 34 60 81 76 137 122
2 Atheist 12 27 37 52 35 70 73
3 Buddhist 27 21 30 34 33 58 62
4 Catholic 418 617 732 670 638 1116 949
5 Don’t kn… 15 14 15 11 10 35 21
6 Evangeli… 575 869 1064 982 881 1486 949
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
# `Don't know/refused` <dbl>
pivot_longer()
# A tibble: 180 × 3
religion name value
<chr> <chr> <dbl>
1 Agnostic <$10k 27
2 Agnostic $10-20k 34
3 Agnostic $20-30k 60
4 Agnostic $30-40k 81
5 Agnostic $40-50k 76
6 Agnostic $50-75k 137
7 Agnostic $75-100k 122
8 Agnostic $100-150k 109
9 Agnostic >150k 84
10 Agnostic Don't know/refused 96
# ℹ 170 more rows
# A tibble: 180 × 3
religion income count
<chr> <chr> <dbl>
1 Agnostic <$10k 27
2 Agnostic $10-20k 34
3 Agnostic $20-30k 60
4 Agnostic $30-40k 81
5 Agnostic $40-50k 76
6 Agnostic $50-75k 137
7 Agnostic $75-100k 122
8 Agnostic $100-150k 109
9 Agnostic >150k 84
10 Agnostic Don't know/refused 96
# ℹ 170 more rows
pivot_wider()
Is less often used, but often comes in handy when tidying data ::: {.cell}
:::
# A tibble: 10 × 10
id hip_x hip_y hip_z knee_x knee_y knee_z ankle_x ankle_y ankle_z
<int> <int> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 1 2 1 1 1 1 1 0 1
2 2 2 2 1 2 1 1 2 0 1
3 3 3 2 1 3 1 1 3 0 1
4 4 4 2 1 4 1 1 4 0 1
5 5 5 2 1 5 1 1 5 0 1
6 6 6 2 1 6 1 1 6 0 1
7 7 7 2 1 7 1 1 7 0 1
8 8 8 2 1 8 1 1 8 0 1
9 9 9 2 1 9 1 1 9 0 1
10 10 10 2 1 10 1 1 10 0 1
*notice: The column names contain two pieces of information ({joint}_{coordinate})*
pivot_wider()
We want to go from this: ::: {.cell} ::: {.cell-output .cell-output-stdout}
# A tibble: 3 × 10
id hip_x hip_y hip_z knee_x knee_y knee_z ankle_x ankle_y ankle_z
<int> <int> <dbl> <dbl> <int> <dbl> <dbl> <int> <dbl> <dbl>
1 1 1 2 1 1 1 1 1 0 1
2 2 2 2 1 2 1 1 2 0 1
3 3 3 2 1 3 1 1 3 0 1
::: :::
To this: ::: {.cell} ::: {.cell-output .cell-output-stdout}
# A tibble: 30 × 5
id Joint x y z
<int> <chr> <dbl> <dbl> <dbl>
1 1 hip 1 2 1
2 1 knee 1 1 1
3 1 ankle 1 0 1
4 2 hip 2 2 1
5 2 knee 2 1 1
6 2 ankle 2 0 1
7 3 hip 3 2 1
8 3 knee 3 1 1
9 3 ankle 3 0 1
10 4 hip 4 2 1
# ℹ 20 more rows
::: :::
pivot_wider()
We start with pivot_longer()
, but it pulls to much together ::: {.cell output-location=‘fragment’}
# A tibble: 90 × 3
id name value
<int> <chr> <dbl>
1 1 hip_x 1
2 1 hip_y 2
3 1 hip_z 1
4 1 knee_x 1
5 1 knee_y 1
6 1 knee_z 1
7 1 ankle_x 1
8 1 ankle_y 0
9 1 ankle_z 1
10 2 hip_x 2
# ℹ 80 more rows
:::
pivot_wider()
We then use extract()
to split the name column ::: {.cell output-location=‘fragment’}
mocap_data |>
pivot_longer(cols = -id) |>
extract(col = name, # What column to extract information from
into = c("Joint", "coord"), # What should the new columns be called
regex = "(.+)_(.+)") # A regular expression
# A tibble: 90 × 4
id Joint coord value
<int> <chr> <chr> <dbl>
1 1 hip x 1
2 1 hip y 2
3 1 hip z 1
4 1 knee x 1
5 1 knee y 1
6 1 knee z 1
7 1 ankle x 1
8 1 ankle y 0
9 1 ankle z 1
10 2 hip x 2
# ℹ 80 more rows
:::
pivot_wider()
Finally we use pivot_wider()
::: {.cell output-location=‘fragment’}
mocap_data |>
pivot_longer(cols = -id) |> # pulls to much together
extract(name,
into = c("Joint", "coord"),
regex = "(.+)_(.+)") |>
pivot_wider(names_from = coord,
values_from = value)
# A tibble: 30 × 5
id Joint x y z
<int> <chr> <dbl> <dbl> <dbl>
1 1 hip 1 2 1
2 1 knee 1 1 1
3 1 ankle 1 0 1
4 2 hip 2 2 1
5 2 knee 2 1 1
6 2 ankle 2 0 1
7 3 hip 3 2 1
8 3 knee 3 1 1
9 3 ankle 3 0 1
10 4 hip 4 2 1
# ℹ 20 more rows
:::
across()
across()
across
is more advanced but very useful when you want to perform the same operations on multiple columns
Read more about across
here and here ::: {.cell output-location=‘fragment’}
diamonds |>
group_by(cut) |>
summarise(
across(.cols = where(is.numeric), # All numeric columns will change
.fns = ~mean(.x, na.rm = T), # .fns is the function you wnat to want to apply to the columns
.names = "{.col}_mean" # How should the new columns be named?
))
# A tibble: 5 × 8
cut carat_mean depth_mean table_mean price_mean x_mean y_mean z_mean
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Fair 1.05 64.0 59.1 4359. 6.25 6.18 3.98
2 Good 0.849 62.4 58.7 3929. 5.84 5.85 3.64
3 Very Good 0.806 61.8 58.0 3982. 5.74 5.77 3.56
4 Premium 0.892 61.3 58.7 4584. 5.97 5.94 3.65
5 Ideal 0.703 61.7 56.0 3458. 5.51 5.52 3.40
:::
across()
across
is more advanced but very useful when you want to perform the same operations on multiple columns
Read more about across
here and here ::: {.cell output-location=‘fragment’}
diamonds |>
group_by(cut) |>
summarise(
across(.cols = where(is.numeric), # All numeric columns will change
.fns = list(mean = ~mean(.x, na.rm = T), # If more than one function you supply them in a list and name the functions
sd = ~sd(.x, na.rm = T)),
.names = "{.col}_{.fn}"
))
# A tibble: 5 × 15
cut carat_mean carat_sd depth_mean depth_sd table_mean table_sd price_mean
<ord> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl> <dbl>
1 Fair 1.05 0.516 64.0 3.64 59.1 3.95 4359.
2 Good 0.849 0.454 62.4 2.17 58.7 2.85 3929.
3 Very G… 0.806 0.459 61.8 1.38 58.0 2.12 3982.
4 Premium 0.892 0.515 61.3 1.16 58.7 1.48 4584.
5 Ideal 0.703 0.433 61.7 0.719 56.0 1.25 3458.
# ℹ 7 more variables: price_sd <dbl>, x_mean <dbl>, x_sd <dbl>, y_mean <dbl>,
# y_sd <dbl>, z_mean <dbl>, z_sd <dbl>
:::