Visualise

An introduction to ggplot2

Steen Flammild Harsted & Søren O’Neill

The Workflow

Data visualization

“The simple graph has brought more information to the data analyst’s mind than any other device.”

— John Tukey

Anscombe’s quartet

Tmisc::quartet is a tidy version of the built-in Anscombe’s Quartet data.

Anscombe’s Quartet data contains four datasets that have nearly identical linear regression properties, yet appear very different when graphed.
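The "nearly identical linear regression properties" can be checked directly. The sketch below is an illustration, not part of the original slides: it uses base R's built-in `anscombe` data (wide format, columns x1..x4 and y1..y4) instead of Tmisc::quartet, and fits one linear model per set.

```r
# Fit y ~ x separately for each of the four sets in base R's `anscombe` data
fits <- lapply(1:4, function(i) {
  lm(reformulate(paste0("x", i), response = paste0("y", i)), data = anscombe)
})

# Collect intercept and slope of each fit into a 4 x 2 matrix
coefs <- t(sapply(fits, coef))
colnames(coefs) <- c("intercept", "slope")
rownames(coefs) <- c("I", "II", "III", "IV")

round(coefs, 2) # every set gives roughly intercept 3 and slope 0.5
```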

Tmisc::quartet |>
  head() |> 
  gt::gt() # print as table
set  x   y
I    10  8.04
I    8   6.95
I    13  7.58
I    9   8.81
I    11  8.33
I    14  9.96

  

Tmisc::quartet |> 
  tail() |> 
  gt::gt() # print as table
set  x   y
IV   8   7.04
IV   8   5.25
IV   19  12.50
IV   8   5.56
IV   8   7.91
IV   8   6.89

Summarising Anscombe’s quartet

library(dplyr) # provides group_by() and summarise()

Tmisc::quartet |>
  group_by(set) |>
  summarise(
    mean_x = mean(x),
    mean_y = mean(y),
    sd_x = sd(x),
    sd_y = sd(y),
    r = cor(x, y)
  ) |>
  gt::gt() # print as table
set  mean_x  mean_y    sd_x      sd_y      r
I    9       7.500909  3.316625  2.031568  0.8164205
II   9       7.500909  3.316625  2.031657  0.8162365
III  9       7.500000  3.316625  2.030424  0.8162867
IV   9       7.500909  3.316625  2.030579  0.8165214

Visualizing Anscombe’s quartet
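One way to draw the four sets side by side is sketched below. It is an assumption about what such a figure could look like, not the original slide's code. To stay self-contained it rebuilds a long-format table (here called `quartet_long`, a name chosen for this example) from base R's `anscombe` data.

```r
library(ggplot2)

# Rebuild a long-format quartet from base R's `anscombe` (columns x1..x4, y1..y4)
quartet_long <- data.frame(
  set = rep(c("I", "II", "III", "IV"), each = nrow(anscombe)),
  x = unlist(anscombe[paste0("x", 1:4)], use.names = FALSE),
  y = unlist(anscombe[paste0("y", 1:4)], use.names = FALSE)
)

# One panel per set: raw points plus the fitted regression line
p_quartet <- ggplot(quartet_long, aes(x, y)) +
  geom_point() +
  geom_smooth(method = "lm", formula = y ~ x, se = FALSE) +
  facet_wrap(~set)

p_quartet
```

The regression lines are almost the same in every panel, while the point patterns clearly differ.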

ggplot2

ggplot2 \(\in\) tidyverse

  • ggplot2 is tidyverse’s data visualization package.

  • gg in “ggplot2” stands for Grammar of Graphics.

  • Inspired by the book The Grammar of Graphics by Leland Wilkinson.

Grammar of Graphics

A grammar of graphics is a tool that enables us to concisely describe the components of a graphic.


Source: BloggoType

Hello ggplot2!

The structure of the code for a ggplot2 plot can be summarized as:

ggplot(data = [dataset], 
       mapping = aes(x = [x-variable], y = [y-variable])) +
   geom_xxx() +
   other options


Every ggplot2 plot has three key components:

  1. Data

  2. A set of aesthetic mappings between variables in the data and visual properties

  3. At least one layer that describes how to render each observation. Layers are usually created with a geom function.

Hello ggplot2!

ggplot2: Elegant Graphics for Data Analysis

Simple example

library(ggplot2)                                 # provides ggplot() and the mpg data

ggplot(data = mpg,                               # Use the mpg data
       mapping = aes(x = displ, y = hwy)) +      # map displ to the x-axis and hwy to the y-axis
  geom_point()                                   # visualize the observations with points

colour, size, shape and other aesthetic attributes

ggplot(mpg, aes(displ, hwy, colour = drv))+ 
  geom_point()

ggplot(mpg, aes(displ, hwy, shape = drv))+ 
  geom_point()

ggplot(mpg, aes(displ, hwy, size = drv))+ 
  geom_point()
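The examples above map an aesthetic to a variable inside aes(). To give every point the same fixed value instead, the aesthetic is set outside aes(), as an argument to the geom. A minimal sketch of the difference (the object names `p_mapped` and `p_fixed` are chosen for this example):

```r
library(ggplot2)

# Mapped: colour varies with the data, so it goes inside aes()
p_mapped <- ggplot(mpg, aes(displ, hwy, colour = drv)) +
  geom_point()

# Fixed: every point is blue, so colour is set outside aes()
p_fixed <- ggplot(mpg, aes(displ, hwy)) +
  geom_point(colour = "blue")

p_mapped
p_fixed
```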

Faceting

facet_wrap()

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  facet_wrap(~class)

Other Plot Geoms


  • geom_smooth() fits a smoother to the data and displays the smooth and its standard error.

  • geom_jitter() works like geom_point() but adds a small random spread to the points to reduce overplotting.

  • geom_boxplot() produces a box-and-whisker plot to summarise the distribution of a set of points.

  • geom_histogram() and geom_freqpoly() show the distribution of continuous variables.

  • geom_bar() shows the distribution of categorical variables.

  • geom_path() draws lines between the data points in the order they appear in the data. Paths can go in any direction.

  • geom_line() draws lines between the data points in order of the x variable, so the line always travels from left to right. Typically used to explore how things change over time.
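The difference between geom_line() and geom_path() is easiest to see on time-ordered data. The sketch below uses the `economics` dataset that ships with ggplot2; the choice of variables is an illustration only.

```r
library(ggplot2)

# geom_line(): points are connected in order of the x variable (date)
p_line <- ggplot(economics, aes(date, uempmed)) +
  geom_line()

# geom_path(): points are connected in the order they appear in the data,
# so the path can double back on itself as both variables change over time
p_path <- ggplot(economics, aes(unemploy / pop, uempmed)) +
  geom_path()

p_line
p_path
```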

Other Plot Geoms - geom_smooth()

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth()

Other Plot Geoms - geom_smooth()

An important argument to geom_smooth() is method, which lets you choose the type of model used to fit the smooth curve.
The default (method = NULL) picks the model from the size of the data: "loess" for fewer than 1,000 observations, otherwise "gam".

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth(method = "loess")

ggplot(mpg, aes(displ, hwy)) + 
  geom_point() + 
  geom_smooth(method = "lm") #linear model

Other Plot Geoms

ggplot(mpg, aes(drv, hwy))+
  geom_jitter()

ggplot(mpg, aes(drv, hwy))+
  geom_boxplot()

ggplot(mpg, aes(drv, hwy))+
  geom_violin()

Other Plot Geoms

geom_histogram() geom_freqpoly()

ggplot(mpg, aes(hwy))+
  geom_histogram()

ggplot(mpg, aes(hwy))+
  geom_freqpoly()

Other Plot Geoms

geom_histogram() geom_freqpoly()

Both geoms bin the data. You should experiment with the binwidth or bins arguments when exploring your data.

ggplot(mpg, aes(hwy)) + 
  geom_freqpoly(binwidth = 10)

ggplot(mpg, aes(hwy)) + 
  geom_freqpoly(binwidth = 1)
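The two calls above set the width of each bin. The bins argument sets the number of bins instead. A minimal sketch, where the value 20 is an arbitrary choice for illustration:

```r
library(ggplot2)

# Ask for 20 bins and let ggplot2 work out the bin width
p_bins <- ggplot(mpg, aes(hwy)) +
  geom_histogram(bins = 20)

p_bins
```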

Other Plot Geoms geom_bar()

ggplot(mpg, aes(x = manufacturer)) + 
  geom_bar()

ggplot(mpg, aes(y = manufacturer)) + 
  geom_bar()

Other Plot Geoms

geom_bar() + fct_infreq()

library(forcats) # provides fct_infreq()

mpg |> 
  ggplot(aes(y = fct_infreq(manufacturer))) + 
  geom_bar()

Other Plot Geoms

geom_bar() + fct_infreq() and aes(fill = [VAR] )

mpg |> 
  ggplot(aes(y = fct_infreq(manufacturer), 
             fill = class)) + 
  geom_bar()

Other Plot Geoms

geom_bar() and position = "fill"

position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.

mpg |> 
  ggplot(aes(y = fct_infreq(manufacturer), 
             fill = class)) + 
  geom_bar(position = "fill")

Other Plot Geoms

geom_bar() and position = "dodge"

position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.

mpg |> 
  ggplot(aes(y = fct_infreq(manufacturer), 
             fill = class)) + 
  geom_bar(position = "dodge")

Annotations

labs() - Title, Subtitle, Legends, and more

labs() can do it all

mpg |> 
  ggplot(aes(x = class, y = cty, fill = class))+
  geom_boxplot()+
  labs(
    title = "Mileage in cities of various car classes",
    subtitle = "Car classes are colored with nice colors using the fill argument!!",
    x = "Class of car",
    y = "Mileage in cities",
    fill = "Class",
    caption = "The data comes from the mpg dataset"
    )

Themes

ggplot2 comes prepackaged with several different themes.

p <- mpg |> 
  ggplot(aes(x = class, y = cty, fill = class))+
  geom_boxplot()

p + theme_bw()

p + theme_classic()

Themes

The ggthemes package has many more themes.

p + ggthemes::theme_stata()
p + ggthemes::theme_economist()
p + ggthemes::theme_excel_new()
p + ggthemes::theme_fivethirtyeight()

color and fill scales

Notice fill is mapped to a categorical variable

p <- mpg |> 
  ggplot(aes(x = cty, y = hwy, fill=drv))+
  geom_point(shape = 21, size = 3)

p

color and fill scales

scale_fill_brewer(type = )

p + scale_fill_brewer(type = "seq")  # sequential (default)

p + scale_fill_brewer(type = "qual") # qualitative

p + scale_fill_brewer(type = "div")  # diverging

color and fill scales

scale_fill_brewer(palette = )

p + scale_fill_brewer(type = "seq", # sequential (default)
                      palette = 2)  

p + scale_fill_brewer(type = "qual", # qualitative
                      palette = 2) 

p + scale_fill_brewer(type = "div", # diverging
                      palette = 2)  
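Besides a number, the palette argument of scale_fill_brewer() also accepts a ColorBrewer palette name. A minimal sketch, assuming the qualitative palette "Set2" as an arbitrary example (the plot is rebuilt here under the name `p_named` so the block runs on its own):

```r
library(ggplot2)

# Same plot as above, with fill mapped to the categorical variable drv
p_named <- ggplot(mpg, aes(x = cty, y = hwy, fill = drv)) +
  geom_point(shape = 21, size = 3) +
  scale_fill_brewer(palette = "Set2") # pick the palette by name

p_named
```

Naming the palette makes the code easier to read than a palette number, since the number only has meaning together with type.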

color and fill scales

Notice color is mapped to a continuous variable

p <- mpg |> 
  ggplot(aes(x = cty, y = hwy, color=displ))+
  geom_point(size = 3)

p

color and fill scales

scale_colour_continuous(type = )

p + 
  scale_colour_continuous(type = "gradient")

# The viridis color scale
p + 
  scale_colour_continuous(type = "viridis")

p + 
  scale_colour_continuous(type = "viridis", 
                          option = "magma")

Arranging plots with the patchwork package

The patchwork package

Arranging plots with the patchwork package

p1 <- ggplot(mpg) + 
  geom_point(aes(x = displ, y = hwy))+
  ggtitle("p1")

p2 <- ggplot(mpg) + 
  geom_bar(aes(x = as.character(year), fill = drv), position = "dodge") + 
  labs(x = "year")+
  ggtitle("p2")

p3 <- ggplot(mpg) + 
  geom_density(aes(x = hwy, fill = drv), colour = NA) + 
  facet_grid(rows = vars(drv))+
  ggtitle("p3")

p4 <- ggplot(mpg) + 
  stat_summary(aes(x = drv, y = hwy, fill = drv), geom = "col", fun.data = mean_se) +
  stat_summary(aes(x = drv, y = hwy), geom = "errorbar", fun.data = mean_se, width = 0.5)+
  ggtitle("p4")

Arranging plots with the patchwork package

  • + does not specify any specific layout, only that the plots should be displayed together.

  • patchwork automagically decides the layout.

  • Adding 3 plots together will create a 1x3 grid.

  • Adding 4 plots together will create a 2x2 grid.

library(patchwork) # enables +, / and | on ggplot objects

p1 + p2

p1 + p2 + p3 + p4

Arranging plots with the patchwork package

plot_layout()

The plot_layout() function gives you fine-grained control over the layout.

p1 + p2 + p3 + p4 + plot_layout(nrow = 1)
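plot_layout() can also control the relative size of rows and columns through its widths and heights arguments. The sketch below is an illustration with two small stand-in plots (named `a` and `b` here, so it does not depend on p1 to p4); the 2:1 ratio is an arbitrary choice.

```r
library(ggplot2)
library(patchwork)

# Two small stand-in plots for this example
a <- ggplot(mpg, aes(displ, hwy)) + geom_point()
b <- ggplot(mpg, aes(drv)) + geom_bar()

# One row, with the first column twice as wide as the second
combined <- a + b + plot_layout(nrow = 1, widths = c(2, 1))

combined
```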

Arranging plots with the patchwork package

/ and |

/ forces a single column.
| forces a single row.

p1/p2

p1|p2 # same as p1 + p2

Arranging plots with the patchwork package

nesting plots

p3 | (p2 /(p1 | p4))

Arranging plots with the patchwork package

guides = "collect"

p1 + p2 + p3 + plot_layout(ncol = 2, guides = "collect")

Arranging plots with the patchwork package

guide_area()

p1 + p2 + p3 + plot_layout(ncol = 2, guides = "collect") + guide_area() 

Arranging plots with the patchwork package

patchwork comes with the powerful & operator, which lets you change settings on all the individual plots at once. With +, additions are not applied to every plot:

p1 + p2 + p3 + 
  plot_layout(ncol = 2, guides = "collect") + 
  guide_area() +
  scale_fill_brewer(type = "qual",
                    palette = 3) +
  theme_classic()

Can you spot the &s in the code below?

p1 + p2 + p3 + 
  plot_layout(ncol = 2, guides = "collect") + 
  guide_area() &   
  scale_fill_brewer(type = "qual",
                    palette = 3) &
  theme_classic()

Thanks!

Several slides are reworked slides from presentations in Data Science in a Box

Most examples come from ggplot2: Elegant Graphics for Data Analysis