Tidy - Extra Material

dplyr and tidyr

Steen Flammild Harsted and Søren O´Neill

`full_join()` and friends

`[]_join`

It’s rare that a data analysis involves only a single table of data. In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them.

inner_join()
semi_join()
left_join()
anti_join()
full_join()

`[]_join` - Example data

superheroes

name	alignment	gender	publisher
Magneto	bad	male	Marvel
Storm	good	female	Marvel
Mystique	bad	female	Marvel
Batman	good	male	DC
Joker	bad	male	DC
Catwoman	bad	female	DC
Hellboy	good	male	Dark Horse Comics

publishers

publisher	yr_founded
DC	1934
Marvel	1939
Image	1992

Take a good look at the two dataframes

What columns match between the two dataframes?
Are the values in the publisher column identical in the two dataframes?
Are the values in the publisher column arranged (sorted) in the same way in the two dataframes?
What kind of information are we missing on the Dark Horse Comics and Image publishers?

`inner_join()`

inner_join(x, y) returns all rows from x where there are matching values in y, and all columns from x and y.

inner_join(superheroes, publishers)

name	alignment	gender	publisher	yr_founded
Magneto	bad	male	Marvel	1939
Storm	good	female	Marvel	1939
Mystique	bad	female	Marvel	1939
Batman	good	male	DC	1934
Joker	bad	male	DC	1934
Catwoman	bad	female	DC	1934

inner_join(publishers, superheroes)

publisher	yr_founded	name	alignment	gender
DC	1934	Batman	good	male
DC	1934	Joker	bad	male
DC	1934	Catwoman	bad	female
Marvel	1939	Magneto	bad	male
Marvel	1939	Storm	good	female
Marvel	1939	Mystique	bad	female

Results are identical. Only variable order is different.
What information was filtered away?

`semi_join()`

semi_join(x, y) returns all rows from x where there are matching values in y, keeping just columns from x.

semi_join(superheroes, publishers)

name	alignment	gender	publisher
Magneto	bad	male	Marvel
Storm	good	female	Marvel
Mystique	bad	female	Marvel
Batman	good	male	DC
Joker	bad	male	DC
Catwoman	bad	female	DC

semi_join(publishers, superheroes)

publisher	yr_founded
DC	1934
Marvel	1939

`left_join()`

left_join(x, y) returns all rows from x, and all columns from x and y.

left_join(superheroes, publishers)

name	alignment	gender	publisher	yr_founded
Magneto	bad	male	Marvel	1939
Storm	good	female	Marvel	1939
Mystique	bad	female	Marvel	1939
Batman	good	male	DC	1934
Joker	bad	male	DC	1934
Catwoman	bad	female	DC	1934
Hellboy	good	male	Dark Horse Comics	NA

left_join(publishers, superheroes)

publisher	yr_founded	name	alignment	gender
DC	1934	Batman	good	male
DC	1934	Joker	bad	male
DC	1934	Catwoman	bad	female
Marvel	1939	Magneto	bad	male
Marvel	1939	Storm	good	female
Marvel	1939	Mystique	bad	female
Image	1992	NA	NA	NA

`anti_join()`

anti_join(x, y) returns all rows from x where there are not matching values in y. Keeping only columns in x.

anti_join(superheroes, publishers)

name	alignment	gender	publisher
Hellboy	good	male	Dark Horse Comics

anti_join(publishers, superheroes)

publisher	yr_founded
Image	1992

`full_join()`

full_join(x, y) Return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

full_join(superheroes, publishers)

name	alignment	gender	publisher	yr_founded
Magneto	bad	male	Marvel	1939
Storm	good	female	Marvel	1939
Mystique	bad	female	Marvel	1939
Batman	good	male	DC	1934
Joker	bad	male	DC	1934
Catwoman	bad	female	DC	1934
Hellboy	good	male	Dark Horse Comics	NA
NA	NA	NA	Image	1992

full_join(publishers, superheroes)

publisher	yr_founded	name	alignment	gender
DC	1934	Batman	good	male
DC	1934	Joker	bad	male
DC	1934	Catwoman	bad	female
Marvel	1939	Magneto	bad	male
Marvel	1939	Storm	good	female
Marvel	1939	Mystique	bad	female
Image	1992	NA	NA	NA
Dark Horse Comics	NA	Hellboy	good	male

lets practice joins

`pivot_longer()` and `pivot_wider()`

`pivot_`

pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.
pivot_wider() makes datasets wider by decreasing the number of rows and increasing the number of columns.

`pivot_longer()`

The relig_income dataset

Information on religion and annual income:
cell values are counts

head(relig_income)

# A tibble: 6 × 11
  religion  `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
  <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
1 Agnostic       27        34        60        81        76       137        122
2 Atheist        12        27        37        52        35        70         73
3 Buddhist       27        21        30        34        33        58         62
4 Catholic      418       617       732       670       638      1116        949
5 Don’t kn…      15        14        15        11        10        35         21
6 Evangeli…     575       869      1064       982       881      1486        949
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>

`pivot_longer()`

relig_income |> 
  pivot_longer(cols = -religion)

# A tibble: 180 × 3
   religion name               value
   <chr>    <chr>              <dbl>
 1 Agnostic <$10k                 27
 2 Agnostic $10-20k               34
 3 Agnostic $20-30k               60
 4 Agnostic $30-40k               81
 5 Agnostic $40-50k               76
 6 Agnostic $50-75k              137
 7 Agnostic $75-100k             122
 8 Agnostic $100-150k            109
 9 Agnostic >150k                 84
10 Agnostic Don't know/refused    96
# ℹ 170 more rows

relig_income |> 
  pivot_longer(cols = -religion,
               names_to = "income",
               values_to = "count")

# A tibble: 180 × 3
   religion income             count
   <chr>    <chr>              <dbl>
 1 Agnostic <$10k                 27
 2 Agnostic $10-20k               34
 3 Agnostic $20-30k               60
 4 Agnostic $30-40k               81
 5 Agnostic $40-50k               76
 6 Agnostic $50-75k              137
 7 Agnostic $75-100k             122
 8 Agnostic $100-150k            109
 9 Agnostic >150k                 84
10 Agnostic Don't know/refused    96
# ℹ 170 more rows

`pivot_wider()`

Is less often used, but often comes in handy when tidying data

mocap_data

# A tibble: 10 × 10
      id hip_x hip_y hip_z knee_x knee_y knee_z ankle_x ankle_y ankle_z
   <int> <int> <dbl> <dbl>  <int>  <dbl>  <dbl>   <int>   <dbl>   <dbl>
 1     1     1     2     1      1      1      1       1       0       1
 2     2     2     2     1      2      1      1       2       0       1
 3     3     3     2     1      3      1      1       3       0       1
 4     4     4     2     1      4      1      1       4       0       1
 5     5     5     2     1      5      1      1       5       0       1
 6     6     6     2     1      6      1      1       6       0       1
 7     7     7     2     1      7      1      1       7       0       1
 8     8     8     2     1      8      1      1       8       0       1
 9     9     9     2     1      9      1      1       9       0       1
10    10    10     2     1     10      1      1      10       0       1

*notice: The column names contain two pieces of information ({joint}_{coordinate})*

`pivot_wider()`

We want to go from this:

# A tibble: 3 × 10
     id hip_x hip_y hip_z knee_x knee_y knee_z ankle_x ankle_y ankle_z
  <int> <int> <dbl> <dbl>  <int>  <dbl>  <dbl>   <int>   <dbl>   <dbl>
1     1     1     2     1      1      1      1       1       0       1
2     2     2     2     1      2      1      1       2       0       1
3     3     3     2     1      3      1      1       3       0       1

To this:

# A tibble: 30 × 5
      id Joint     x     y     z
   <int> <chr> <dbl> <dbl> <dbl>
 1     1 hip       1     2     1
 2     1 knee      1     1     1
 3     1 ankle     1     0     1
 4     2 hip       2     2     1
 5     2 knee      2     1     1
 6     2 ankle     2     0     1
 7     3 hip       3     2     1
 8     3 knee      3     1     1
 9     3 ankle     3     0     1
10     4 hip       4     2     1
# ℹ 20 more rows

`pivot_wider()`

We start with pivot_longer(), but it pulls to much together

mocap_data |> 
  pivot_longer(cols = -id)

# A tibble: 90 × 3
      id name    value
   <int> <chr>   <dbl>
 1     1 hip_x       1
 2     1 hip_y       2
 3     1 hip_z       1
 4     1 knee_x      1
 5     1 knee_y      1
 6     1 knee_z      1
 7     1 ankle_x     1
 8     1 ankle_y     0
 9     1 ankle_z     1
10     2 hip_x       2
# ℹ 80 more rows

`pivot_wider()`

We then use extract() to split the name column

mocap_data |> 
  pivot_longer(cols = -id) |>  
  extract(col = name,                  # What column to extract information from 
          into = c("Joint", "coord"),  # What should the new columns be called
          regex = "(.+)_(.+)")         # A regular expression

# A tibble: 90 × 4
      id Joint coord value
   <int> <chr> <chr> <dbl>
 1     1 hip   x         1
 2     1 hip   y         2
 3     1 hip   z         1
 4     1 knee  x         1
 5     1 knee  y         1
 6     1 knee  z         1
 7     1 ankle x         1
 8     1 ankle y         0
 9     1 ankle z         1
10     2 hip   x         2
# ℹ 80 more rows

`pivot_wider()`

Finally we use pivot_wider()

mocap_data |> 
  pivot_longer(cols = -id) |>  # pulls to much together
  extract(name, 
          into = c("Joint", "coord"),
          regex = "(.+)_(.+)") |> 
  pivot_wider(names_from = coord,
              values_from = value)

# A tibble: 30 × 5
      id Joint     x     y     z
   <int> <chr> <dbl> <dbl> <dbl>
 1     1 hip       1     2     1
 2     1 knee      1     1     1
 3     1 ankle     1     0     1
 4     2 hip       2     2     1
 5     2 knee      2     1     1
 6     2 ankle     2     0     1
 7     3 hip       3     2     1
 8     3 knee      3     1     1
 9     3 ankle     3     0     1
10     4 hip       4     2     1
# ℹ 20 more rows

`across()`

across is more advanced but very useful when you want to perform the same operations on multiple columns

`across()`

across is more advanced but very useful when you want to perform the same operations on multiple columns

Tidy - Extra Material

full_join() and friends

[]_join

[]_join - Example data

inner_join()

semi_join()

left_join()

anti_join()

full_join()

lets practice joins

pivot_longer() and pivot_wider()

pivot_

pivot_longer()

pivot_longer()

pivot_wider()

pivot_wider()

pivot_wider()

pivot_wider()

pivot_wider()

across()

across()

across()

Thanks!

`full_join()` and friends

`[]_join`

`[]_join` - Example data

`inner_join()`

`semi_join()`

`left_join()`

`anti_join()`

`full_join()`

`pivot_longer()` and `pivot_wider()`

`pivot_`

`pivot_longer()`

`pivot_longer()`

`pivot_wider()`

`pivot_wider()`

`pivot_wider()`

`pivot_wider()`

`pivot_wider()`

`across()`

`across()`

`across()`