Tidy - Extra Material

dplyr and tidyr

Steen Flammild Harsted and Søren O´Neill

full_join() and friends

[]_join

It’s rare that a data analysis involves only a single table of data. In practice, you’ll normally have many tables that contribute to an analysis, and you need flexible tools to combine them.

  • inner_join()

  • semi_join()

  • left_join()

  • anti_join()

  • full_join()

[]_join - Example data

superheroes
name alignment gender publisher
Magneto bad male Marvel
Storm good female Marvel
Mystique bad female Marvel
Batman good male DC
Joker bad male DC
Catwoman bad female DC
Hellboy good male Dark Horse Comics
publishers
publisher yr_founded
DC 1934
Marvel 1939
Image 1992


Take a good look at the two dataframes

What columns match between the two dataframes?
Are the values in the publisher column identical in the two dataframes?
Are the values in the publisher column arranged (sorted) in the same way in the two dataframes?
What kind of information are we missing on the Dark Horse Comics and Image publishers?

inner_join()

inner_join(x, y) returns all rows from x where there are matching values in y, and all columns from x and y.

inner_join(superheroes, publishers)
name alignment gender publisher yr_founded
Magneto bad male Marvel 1939
Storm good female Marvel 1939
Mystique bad female Marvel 1939
Batman good male DC 1934
Joker bad male DC 1934
Catwoman bad female DC 1934
inner_join(publishers, superheroes)
publisher yr_founded name alignment gender
DC 1934 Batman good male
DC 1934 Joker bad male
DC 1934 Catwoman bad female
Marvel 1939 Magneto bad male
Marvel 1939 Storm good female
Marvel 1939 Mystique bad female


Results are identical. Only variable order is different.
What information was filtered away?

semi_join()

semi_join(x, y) returns all rows from x where there are matching values in y, keeping just columns from x.

semi_join(superheroes, publishers)
name alignment gender publisher
Magneto bad male Marvel
Storm good female Marvel
Mystique bad female Marvel
Batman good male DC
Joker bad male DC
Catwoman bad female DC
semi_join(publishers, superheroes)
publisher yr_founded
DC 1934
Marvel 1939

left_join()

left_join(x, y) returns all rows from x, and all columns from x and y.

left_join(superheroes, publishers)
name alignment gender publisher yr_founded
Magneto bad male Marvel 1939
Storm good female Marvel 1939
Mystique bad female Marvel 1939
Batman good male DC 1934
Joker bad male DC 1934
Catwoman bad female DC 1934
Hellboy good male Dark Horse Comics NA
left_join(publishers, superheroes)
publisher yr_founded name alignment gender
DC 1934 Batman good male
DC 1934 Joker bad male
DC 1934 Catwoman bad female
Marvel 1939 Magneto bad male
Marvel 1939 Storm good female
Marvel 1939 Mystique bad female
Image 1992 NA NA NA

anti_join()

anti_join(x, y) returns all rows from x where there are not matching values in y. Keeping only columns in x.

anti_join(superheroes, publishers)
name alignment gender publisher
Hellboy good male Dark Horse Comics
anti_join(publishers, superheroes)
publisher yr_founded
Image 1992

full_join()

full_join(x, y) Return all rows and all columns from both x and y. Where there are not matching values, returns NA for the one missing.

full_join(superheroes, publishers)
name alignment gender publisher yr_founded
Magneto bad male Marvel 1939
Storm good female Marvel 1939
Mystique bad female Marvel 1939
Batman good male DC 1934
Joker bad male DC 1934
Catwoman bad female DC 1934
Hellboy good male Dark Horse Comics NA
NA NA NA Image 1992
full_join(publishers, superheroes)
publisher yr_founded name alignment gender
DC 1934 Batman good male
DC 1934 Joker bad male
DC 1934 Catwoman bad female
Marvel 1939 Magneto bad male
Marvel 1939 Storm good female
Marvel 1939 Mystique bad female
Image 1992 NA NA NA
Dark Horse Comics NA Hellboy good male

lets practice joins

pivot_longer() and pivot_wider()

pivot_


  • pivot_longer() makes datasets longer by increasing the number of rows and decreasing the number of columns.
  • pivot_wider() makes datasets wider by decreasing the number of rows and increasing the number of columns.

pivot_longer()

The relig_income dataset

  • Information on religion and annual income:
  • cell values are counts
head(relig_income)
# A tibble: 6 × 11
  religion  `<$10k` `$10-20k` `$20-30k` `$30-40k` `$40-50k` `$50-75k` `$75-100k`
  <chr>       <dbl>     <dbl>     <dbl>     <dbl>     <dbl>     <dbl>      <dbl>
1 Agnostic       27        34        60        81        76       137        122
2 Atheist        12        27        37        52        35        70         73
3 Buddhist       27        21        30        34        33        58         62
4 Catholic      418       617       732       670       638      1116        949
5 Don’t kn…      15        14        15        11        10        35         21
6 Evangeli…     575       869      1064       982       881      1486        949
# ℹ 3 more variables: `$100-150k` <dbl>, `>150k` <dbl>,
#   `Don't know/refused` <dbl>

pivot_longer()

relig_income |> 
  pivot_longer(cols = -religion)
# A tibble: 180 × 3
   religion name               value
   <chr>    <chr>              <dbl>
 1 Agnostic <$10k                 27
 2 Agnostic $10-20k               34
 3 Agnostic $20-30k               60
 4 Agnostic $30-40k               81
 5 Agnostic $40-50k               76
 6 Agnostic $50-75k              137
 7 Agnostic $75-100k             122
 8 Agnostic $100-150k            109
 9 Agnostic >150k                 84
10 Agnostic Don't know/refused    96
# ℹ 170 more rows
relig_income |> 
  pivot_longer(cols = -religion,
               names_to = "income",
               values_to = "count")
# A tibble: 180 × 3
   religion income             count
   <chr>    <chr>              <dbl>
 1 Agnostic <$10k                 27
 2 Agnostic $10-20k               34
 3 Agnostic $20-30k               60
 4 Agnostic $30-40k               81
 5 Agnostic $40-50k               76
 6 Agnostic $50-75k              137
 7 Agnostic $75-100k             122
 8 Agnostic $100-150k            109
 9 Agnostic >150k                 84
10 Agnostic Don't know/refused    96
# ℹ 170 more rows

pivot_wider()

Is less often used, but often comes in handy when tidying data ::: {.cell}

:::

mocap_data
# A tibble: 10 × 10
      id hip_x hip_y hip_z knee_x knee_y knee_z ankle_x ankle_y ankle_z
   <int> <int> <dbl> <dbl>  <int>  <dbl>  <dbl>   <int>   <dbl>   <dbl>
 1     1     1     2     1      1      1      1       1       0       1
 2     2     2     2     1      2      1      1       2       0       1
 3     3     3     2     1      3      1      1       3       0       1
 4     4     4     2     1      4      1      1       4       0       1
 5     5     5     2     1      5      1      1       5       0       1
 6     6     6     2     1      6      1      1       6       0       1
 7     7     7     2     1      7      1      1       7       0       1
 8     8     8     2     1      8      1      1       8       0       1
 9     9     9     2     1      9      1      1       9       0       1
10    10    10     2     1     10      1      1      10       0       1


*notice: The column names contain two pieces of information ({joint}_{coordinate})*

pivot_wider()

We want to go from this: ::: {.cell} ::: {.cell-output .cell-output-stdout}

# A tibble: 3 × 10
     id hip_x hip_y hip_z knee_x knee_y knee_z ankle_x ankle_y ankle_z
  <int> <int> <dbl> <dbl>  <int>  <dbl>  <dbl>   <int>   <dbl>   <dbl>
1     1     1     2     1      1      1      1       1       0       1
2     2     2     2     1      2      1      1       2       0       1
3     3     3     2     1      3      1      1       3       0       1

::: :::


To this: ::: {.cell} ::: {.cell-output .cell-output-stdout}

# A tibble: 30 × 5
      id Joint     x     y     z
   <int> <chr> <dbl> <dbl> <dbl>
 1     1 hip       1     2     1
 2     1 knee      1     1     1
 3     1 ankle     1     0     1
 4     2 hip       2     2     1
 5     2 knee      2     1     1
 6     2 ankle     2     0     1
 7     3 hip       3     2     1
 8     3 knee      3     1     1
 9     3 ankle     3     0     1
10     4 hip       4     2     1
# ℹ 20 more rows

::: :::

pivot_wider()


We start with pivot_longer(), but it pulls to much together ::: {.cell output-location=‘fragment’}

mocap_data |> 
  pivot_longer(cols = -id) 
# A tibble: 90 × 3
      id name    value
   <int> <chr>   <dbl>
 1     1 hip_x       1
 2     1 hip_y       2
 3     1 hip_z       1
 4     1 knee_x      1
 5     1 knee_y      1
 6     1 knee_z      1
 7     1 ankle_x     1
 8     1 ankle_y     0
 9     1 ankle_z     1
10     2 hip_x       2
# ℹ 80 more rows

:::

pivot_wider()


We then use extract() to split the name column ::: {.cell output-location=‘fragment’}

mocap_data |> 
  pivot_longer(cols = -id) |>  
  extract(col = name,                  # What column to extract information from 
          into = c("Joint", "coord"),  # What should the new columns be called
          regex = "(.+)_(.+)")         # A regular expression
# A tibble: 90 × 4
      id Joint coord value
   <int> <chr> <chr> <dbl>
 1     1 hip   x         1
 2     1 hip   y         2
 3     1 hip   z         1
 4     1 knee  x         1
 5     1 knee  y         1
 6     1 knee  z         1
 7     1 ankle x         1
 8     1 ankle y         0
 9     1 ankle z         1
10     2 hip   x         2
# ℹ 80 more rows

:::

pivot_wider()


Finally we use pivot_wider() ::: {.cell output-location=‘fragment’}

mocap_data |> 
  pivot_longer(cols = -id) |>  # pulls to much together
  extract(name, 
          into = c("Joint", "coord"),
          regex = "(.+)_(.+)") |> 
  pivot_wider(names_from = coord,
              values_from = value)
# A tibble: 30 × 5
      id Joint     x     y     z
   <int> <chr> <dbl> <dbl> <dbl>
 1     1 hip       1     2     1
 2     1 knee      1     1     1
 3     1 ankle     1     0     1
 4     2 hip       2     2     1
 5     2 knee      2     1     1
 6     2 ankle     2     0     1
 7     3 hip       3     2     1
 8     3 knee      3     1     1
 9     3 ankle     3     0     1
10     4 hip       4     2     1
# ℹ 20 more rows

:::

across()

across()

across is more advanced but very useful when you want to perform the same operations on multiple columns

Read more about across here and here ::: {.cell output-location=‘fragment’}

diamonds |>  
  group_by(cut) |>  
  summarise(
    across(.cols = where(is.numeric),      # All numeric columns will change
           .fns = ~mean(.x, na.rm = T),    # .fns is the function you wnat to want to apply to the columns 
           .names = "{.col}_mean"          # How should the new columns be named?
  ))
# A tibble: 5 × 8
  cut       carat_mean depth_mean table_mean price_mean x_mean y_mean z_mean
  <ord>          <dbl>      <dbl>      <dbl>      <dbl>  <dbl>  <dbl>  <dbl>
1 Fair           1.05        64.0       59.1      4359.   6.25   6.18   3.98
2 Good           0.849       62.4       58.7      3929.   5.84   5.85   3.64
3 Very Good      0.806       61.8       58.0      3982.   5.74   5.77   3.56
4 Premium        0.892       61.3       58.7      4584.   5.97   5.94   3.65
5 Ideal          0.703       61.7       56.0      3458.   5.51   5.52   3.40

:::

across()

across is more advanced but very useful when you want to perform the same operations on multiple columns

Read more about across here and here ::: {.cell output-location=‘fragment’}

diamonds |>  
  group_by(cut) |>  
  summarise(
    across(.cols = where(is.numeric),                   # All numeric columns will change
           .fns = list(mean = ~mean(.x, na.rm = T),     # If more than one function you supply them in a list and name the functions
                       sd = ~sd(.x, na.rm = T)),
           .names = "{.col}_{.fn}"
             ))
# A tibble: 5 × 15
  cut     carat_mean carat_sd depth_mean depth_sd table_mean table_sd price_mean
  <ord>        <dbl>    <dbl>      <dbl>    <dbl>      <dbl>    <dbl>      <dbl>
1 Fair         1.05     0.516       64.0    3.64        59.1     3.95      4359.
2 Good         0.849    0.454       62.4    2.17        58.7     2.85      3929.
3 Very G…      0.806    0.459       61.8    1.38        58.0     2.12      3982.
4 Premium      0.892    0.515       61.3    1.16        58.7     1.48      4584.
5 Ideal        0.703    0.433       61.7    0.719       56.0     1.25      3458.
# ℹ 7 more variables: price_sd <dbl>, x_mean <dbl>, x_sd <dbl>, y_mean <dbl>,
#   y_sd <dbl>, z_mean <dbl>, z_sd <dbl>

:::

Thanks!