Factors

forcats package

Steen Flammild Harsted & Søren O’Neill

The Workflow

What's wrong with this plot?

The 5 basic data structures in R

Image Credits: Gauraw Tiwari
https://medium.com/@tiwarigaurav2512/r-data-types-847fffb01d5b

The four basic vector types in R

Image Credits: Hadley Wickham
https://adv-r.hadley.nz/vectors-chap.html

A factor is:

  • A vector that can contain only predefined values.

  • Used to store categorical data.

  • Built on top of an integer vector with two attributes:

    • a class, “factor”, which makes it behave differently from regular integer vectors.

    • levels, which defines the set of allowed values.
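
The structure described above can be inspected directly in base R. This is a minimal sketch; the vector `f` and its values are made up for illustration.

```r
# A small factor with one unused level ("c"); illustrative data only.
f <- factor(c("b", "a", "b"), levels = c("a", "b", "c"))

typeof(f)       # "integer": the underlying storage type
class(f)        # "factor": the class attribute
levels(f)       # "a" "b" "c": the set of allowed values
as.integer(f)   # 2 1 2: each value stored as a position in levels(f)

# Assigning a value outside the levels produces NA (R also emits a warning,
# which is suppressed here to keep the sketch quiet).
g <- f
suppressWarnings(g[1] <- "z")
is.na(g[1])     # TRUE
```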

Image Credits: Hadley Wickham
https://adv-r.hadley.nz/vectors-chap.html

Factors are useful when:

You know the set of possible values but they’re not all present in a given dataset.

library(tidyverse)
tib_example <- tibble(sex_factor = factor(x = c("m", "m", "m"),
                                          levels = c("m", "f"),
                                          labels = c("Male", "Female")))

tib_example |> 
  count(sex_factor,
        .drop = FALSE) # Do not drop levels with 0 observations
# A tibble: 2 × 2
  sex_factor     n
  <fct>      <int>
1 Male           3
2 Female         0
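
The same idea can be sketched in base R with table(), which keeps levels that have no observations. The object names `sex`, `counts_factor` and `counts_character` are chosen here for illustration.

```r
# Same data as in the tibble example, stored as a plain vector.
sex <- factor(c("m", "m", "m"),
              levels = c("m", "f"),
              labels = c("Male", "Female"))

# Tabulating the factor keeps the empty "Female" level.
counts_factor <- table(sex)

# Tabulating the character version only knows about observed values.
counts_character <- table(as.character(sex))

counts_factor
counts_character
```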

Factors are useful when:

You want to display character vectors in a non-alphabetical order.

x1 <- c("Monday", "Sunday", "Thursday", "Tuesday", "Friday") 

sort(x1)
[1] "Friday"   "Monday"   "Sunday"   "Thursday" "Tuesday" 



x1f <- factor(x1, levels = c("Monday", "Tuesday", "Wednesday", 
                             "Thursday", "Friday", "Saturday", "Sunday"))
sort(x1f)
[1] Monday   Tuesday  Thursday Friday   Sunday  
Levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday
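
One thing to watch when supplying levels by hand: factor() silently turns any value that is not among the levels into NA. The sketch below uses a deliberately misspelled value to show one way of checking for this; the names `days`, `x` and `f` are illustrative.

```r
days <- c("Monday", "Tuesday", "Wednesday", "Thursday",
          "Friday", "Saturday", "Sunday")

# The last value is deliberately misspelled.
x <- c("Monday", "Sunday", "Wedensday")

# No warning is given: the unmatched value simply becomes NA.
f <- factor(x, levels = days)
is.na(f)

# Values in the data that do not match any level.
unmatched <- setdiff(x, days)
unmatched
```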

Ordered factors

Ordered factors are a minor variation of factors.
In general, they behave like regular factors, but the order of the levels is meaningful (for example low, medium, high). This property is automatically leveraged by some modelling and visualization functions.

diamonds |> 
  # cut is an ordered variable
  ggplot(aes(x = color, fill = cut))+
  geom_bar()


diamonds |> 
  mutate(cut = as.character(cut)) |> 
  ggplot(aes(x = color, fill = cut))+
  geom_bar()
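
Outside of plotting, the meaningful order also shows up in comparisons. A minimal base R sketch, using a made-up `severity` vector:

```r
# An ordered factor: the order of `levels` defines low < medium < high.
severity <- factor(c("low", "high", "medium", "low"),
                   levels = c("low", "medium", "high"),
                   ordered = TRUE)

is.ordered(severity)   # TRUE

# Comparison operators and max() respect the level order.
below_high <- severity < "high"
highest <- max(severity)

below_high
highest
```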

Factors - forcats

  • The forcats package from the tidyverse contains many useful functions for working with factors.

  • We are going to learn:

    • factor()
    • fct_reorder()
    • fct_infreq()
    • fct_rev()
    • fct_recode()
    • fct_lump()
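
Of the functions listed above, fct_reorder() reorders the levels of a factor by a summary of another variable (the median by default). A small sketch with made-up data, where `f` and `x` are illustrative names:

```r
library(forcats)

f <- factor(c("a", "a", "b", "b", "c", "c"))
x <- c(5, 7, 1, 3, 9, 11)   # medians per level: a = 6, b = 2, c = 10

# Levels sorted by increasing median of x.
ascending <- levels(fct_reorder(f, x))

# Levels sorted by decreasing median of x.
descending <- levels(fct_reorder(f, x, .desc = TRUE))

ascending
descending
```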



fct_infreq() and fct_rev()

fct_infreq() orders levels by decreasing frequency (most frequent first).
fct_rev() reverses the order of the levels.

gss_cat |>
  ggplot(aes(marital)) +
    geom_bar()

gss_cat |>
  mutate(
    marital = marital |> fct_infreq()) |>
  ggplot(aes(marital)) +
    geom_bar()

gss_cat |>
  mutate(
    marital = marital |> fct_infreq() |> fct_rev()) |>
  ggplot(aes(marital)) +
    geom_bar()
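
The effect on the levels is easier to see on a tiny vector than in a plot. A minimal sketch with made-up data:

```r
library(forcats)

# "c" occurs three times, "a" twice, "b" once.
f <- factor(c("a", "a", "b", "c", "c", "c"))

default_order <- levels(f)                         # alphabetical
by_frequency  <- levels(fct_infreq(f))             # most frequent first
reversed      <- levels(fct_rev(fct_infreq(f)))    # least frequent first

default_order
by_frequency
reversed
```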

fct_recode()

x1f
[1] Monday   Sunday   Thursday Tuesday  Friday  
Levels: Monday Tuesday Wednesday Thursday Friday Saturday Sunday


fct_recode() recodes, or changes, the value of each level. Each argument has the form "new" = "old".

x1f |> factor() |> 
  fct_recode(
    "mon" = "Monday",
    "tue" = "Tuesday",
    "thu" = "Thursday",
    "fri" = "Friday",
    "sun" = "Sunday"
  )
[1] mon sun thu tue fri
Levels: mon tue thu fri sun


fct_recode() leaves levels that aren't explicitly mentioned as they are (Tuesday in this example), and warns you if you refer to a level that doesn't exist. Calling factor() on x1f drops its unused levels, so Wednesday and Saturday are unknown here.

x1f |> factor() |> 
  fct_recode(
    "mon" = "Monday",
    # "tue" = "Tuesday",
    "wed" = "Wednesday",  # not a level of factor(x1f)
    "thu" = "Thursday",
    "fri" = "Friday",     
    "sat" = "Saturday",   # not a level of factor(x1f)
    "sun" = "Sunday"
  )
Warning: Unknown levels in `f`: Wednesday, Saturday
[1] mon     sun     thu     Tuesday fri    
Levels: mon Tuesday thu fri sun
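
One way to avoid the warning is to compare the intended recoding against the existing levels first. The sketch below is only one possible approach; the `mapping` vector and the other object names are made up for illustration. It uses `!!!` to splice a named character vector into fct_recode().

```r
library(forcats)

f <- factor(c("Monday", "Sunday", "Thursday"))

# Intended recoding as a named vector: names are new levels, values are old ones.
mapping <- c(mon = "Monday", wed = "Wednesday", sun = "Sunday")

# Old levels in the mapping that do not exist in f.
unknown <- setdiff(mapping, levels(f))

# Keep only the entries that refer to existing levels, then recode.
known <- mapping[mapping %in% levels(f)]
recoded <- fct_recode(f, !!!known)

unknown
recoded
```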

fct_collapse()

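fct_collapse() is a useful variant of fct_recode() for collapsing many levels into a few: each new level is given a character vector of the old levels it replaces.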

x1f |> factor() |> 
  fct_recode(
    "Work" = "Monday",
    "Work" = "Tuesday",
    "Work" = "Wednesday",
    "Work" = "Thursday",
    "Work" = "Friday",
    "Weekend" = "Saturday",
    "Weekend" = "Sunday"
  )
[1] Work    Weekend Work    Work    Work   
Levels: Work Weekend


x1f |> factor() |> 
  fct_collapse(
    Work = c("Monday", "Tuesday", "Wednesday", "Thursday", "Friday"),
    Weekend = c("Saturday", "Sunday")
  )
[1] Work    Weekend Work    Work    Work   
Levels: Work Weekend
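
When only one group needs to be spelled out, the `other_level` argument of fct_collapse() can gather every remaining level under a single name. A minimal sketch, assuming a forcats version that supports `other_level`; the data are made up for illustration.

```r
library(forcats)

f <- factor(c("Monday", "Sunday", "Thursday", "Saturday"))

# Name the weekend days explicitly; everything else becomes "Work".
collapsed <- fct_collapse(
  f,
  Weekend = c("Saturday", "Sunday"),
  other_level = "Work"
)

as.character(collapsed)
```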

fct_lump()

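fct_lump() is another useful variant of fct_recode(). It lumps the least frequent levels together into "Other"; the n argument sets how many of the most frequent levels to keep.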

starwars |>
  filter(!is.na(species)) |> 
  count(species)
# A tibble: 37 × 2
   species       n
   <chr>     <int>
 1 Aleena        1
 2 Besalisk      1
 3 Cerean        1
 4 Chagrian      1
 5 Clawdite      1
 6 Droid         6
 7 Dug           1
 8 Ewok          1
 9 Geonosian     1
10 Gungan        3
# ℹ 27 more rows


starwars |>
  filter(!is.na(species)) |> 
  mutate(
    species = species |> fct_lump(n = 1)
  ) |> 
  count(species)
# A tibble: 2 × 2
  species     n
  <fct>   <int>
1 Human      35
2 Other      48


starwars |> 
  filter(!is.na(species)) |> 
  mutate(
    species = species |> fct_lump(n = 3)
  ) |> 
  count(species)
# A tibble: 4 × 2
  species     n
  <fct>   <int>
1 Droid       6
2 Gungan      3
3 Human      35
4 Other      39
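
The lumping is easy to follow on a tiny vector. This sketch uses fct_lump_n(), which recent forcats versions provide as a more specific alternative to fct_lump(n = ...); whether it is available depends on the installed version, so treat that as an assumption. The data and object names are made up.

```r
library(forcats)

# "a" occurs three times, "b" twice, "c" once.
x <- c("a", "a", "a", "b", "b", "c")

# Keep the single most frequent level; lump the rest into "Other".
lumped <- fct_lump_n(x, n = 1)
table(lumped)

# The name of the lumped level can be changed with other_level.
lumped_rare <- fct_lump_n(x, n = 1, other_level = "Rare")
levels(lumped_rare)
```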

Thanks!

Let's practice