An introduction to ggplot2
“The simple graph has brought more information to the data analyst’s mind than any other device.”
— John Tukey
Tmisc::quartet is a tidy version of the built-in Anscombe’s Quartet data.
Anscombe’s Quartet data contains four datasets that have nearly identical linear regression properties, yet appear very different when graphed.
set | x | y |
---|---|---|
I | 10 | 8.04 |
I | 8 | 6.95 |
I | 13 | 7.58 |
I | 9 | 8.81 |
I | 11 | 8.33 |
I | 14 | 9.96 |
set | mean_x | mean_y | sd_x | sd_y | r |
---|---|---|---|---|---|
I | 9 | 7.500909 | 3.316625 | 2.031568 | 0.8164205 |
II | 9 | 7.500909 | 3.316625 | 2.031657 | 0.8162365 |
III | 9 | 7.500000 | 3.316625 | 2.030424 | 0.8162867 |
IV | 9 | 7.500909 | 3.316625 | 2.030579 | 0.8165214 |
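The summary table above can be reproduced along these lines. This is a minimal sketch that uses base R's built-in `anscombe` data frame rather than `Tmisc::quartet`, so it runs without extra packages besides ggplot2; the object names (`quartet_long`, `quartet_summary`, `quartet_plot`) are illustrative and not from the original slides.

```r
library(ggplot2)

# Reshape the built-in wide `anscombe` data (columns x1..x4, y1..y4)
# into one long data frame with a `set` column (I, II, III, IV).
quartet_long <- do.call(rbind, lapply(1:4, function(i) {
  data.frame(
    set = as.character(as.roman(i)),
    x = anscombe[[paste0("x", i)]],
    y = anscombe[[paste0("y", i)]]
  )
}))

# Per-set means, standard deviations and correlation.
quartet_summary <- do.call(rbind, lapply(
  split(quartet_long, quartet_long$set),
  function(d) {
    data.frame(
      set = d$set[1],
      mean_x = mean(d$x), mean_y = mean(d$y),
      sd_x = sd(d$x), sd_y = sd(d$y),
      r = cor(d$x, d$y)
    )
  }
))

# One scatter plot panel per set: the summaries agree, the shapes do not.
quartet_plot <- ggplot(quartet_long, aes(x = x, y = y)) +
  geom_point() +
  facet_wrap(~ set)
```

Printing `quartet_summary` shows the near-identical numbers, while printing `quartet_plot` shows how different the four sets look.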
ggplot2 is the tidyverse’s data visualization package.
The gg in “ggplot2” stands for Grammar of Graphics.
It is inspired by the book The Grammar of Graphics by Leland Wilkinson.
A grammar of graphics is a tool that enables us to concisely describe the components of a graphic.
Source: BloggoType
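The code for a ggplot2 plot follows a common structure: a call to ggplot() with the data and the aesthetic mappings, followed by one or more layers added with +.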
Every ggplot2 plot has three key components:
data
A set of aesthetic mappings between variables in the data and visual properties
At least one layer which describes how to render each observation. Layers are usually created with a geom function
Source: ggplot2: Elegant Graphics for Data Analysis
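As a rough illustration of the three components, the sketch below uses the `mpg` data that ships with ggplot2; the choice of variables (`displ`, `hwy`) and the object name `p` are only examples.

```r
library(ggplot2)

p <- ggplot(
  data = mpg,                          # 1. the data
  mapping = aes(x = displ, y = hwy)    # 2. the aesthetic mappings
) +
  geom_point()                         # 3. a layer, created with a geom function

p
```

Each further layer is added with another `+`.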
colour, size, shape and other aesthetic attributes
facet_wrap()
geom_smooth() fits a smoother to the data and displays the smooth and its standard error.
geom_jitter() works like geom_point() but adds a random spread to the points to prevent overplotting.
geom_boxplot() produces a box-and-whisker plot to summarise the distribution of a set of points.
geom_histogram() and geom_freqpoly() show the distribution of continuous variables.
geom_bar() shows the distribution of categorical variables.
geom_path() draws lines between the data points. Paths can go in any direction.
geom_line() draws lines between the data points. A line plot is constrained to produce lines that travel from left to right. It is typically used to explore how things change over time.
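To make two of these geoms concrete, here is a small sketch on the `mpg` data. The variables and the jitter `width` are arbitrary choices for illustration.

```r
library(ggplot2)

# Highway mileage by drive train: jittered points spread out overlapping values.
# height = 0 keeps the y values exact and only spreads points horizontally.
p_jitter <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_jitter(width = 0.2, height = 0)

# The same data summarised as box-and-whisker plots.
p_box <- ggplot(mpg, aes(x = drv, y = hwy)) +
  geom_boxplot()
```

Swapping the geom changes how the same data and mappings are rendered; nothing else in the call needs to change.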
geom_smooth()
An important argument to geom_smooth() is method, which allows you to choose which type of model is used to fit the smooth curve.
By default the method is chosen from the size of the data: "loess" for fewer than 1,000 observations and "gam" otherwise.
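A short sketch of setting the method explicitly, here a straight-line fit with `method = "lm"`. The data and variables are illustrative, and passing `formula = y ~ x` is just a way to state the model openly.

```r
library(ggplot2)

p_smooth <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  # Fit a linear model instead of the automatically chosen smoother.
  geom_smooth(method = "lm", formula = y ~ x)

p_smooth
```

Setting `se = FALSE` in `geom_smooth()` would hide the standard error band.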
geom_histogram() and geom_freqpoly()
Both geoms bin the data. You should experiment with the binwidth or bins arguments when exploring your data.
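As a sketch of that experimentation, the two plots below bin the same variable in different ways; the values `binwidth = 2` and `bins = 15` are arbitrary starting points, not recommendations.

```r
library(ggplot2)

# Histogram with bins that are 2 mpg wide.
p_hist <- ggplot(mpg, aes(x = hwy)) +
  geom_histogram(binwidth = 2)

# Frequency polygon with the range split into 15 bins.
p_freq <- ggplot(mpg, aes(x = hwy)) +
  geom_freqpoly(bins = 15)
```

Trying several values is worthwhile because different bin settings can reveal or hide features of the distribution.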
geom_bar()
geom_bar() + fct_infreq()
geom_bar() + fct_infreq() and aes(fill = [VAR])
geom_bar() and position = "fill"
position = "fill" works like stacking, but makes each set of stacked bars the same height. This makes it easier to compare proportions across groups.
geom_bar() and position = "dodge"
position = "dodge" places overlapping objects directly beside one another. This makes it easier to compare individual values.
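The bar chart variants above might be written as follows. This sketch assumes the forcats package is installed for `fct_infreq()`; the choice of `class` and `drv` from `mpg` and the object names are illustrative.

```r
library(ggplot2)
library(forcats)

# Bars ordered by frequency with fct_infreq(), coloured by drive train.
base <- ggplot(mpg, aes(x = fct_infreq(class), fill = drv)) +
  labs(x = "class")

p_stack <- base + geom_bar()                    # default: stacked counts
p_fill  <- base + geom_bar(position = "fill") + # every bar scaled to height 1
  labs(y = "proportion")
p_dodge <- base + geom_bar(position = "dodge")  # groups placed side by side
```

With `position = "fill"` the y axis shows proportions rather than counts, so relabelling it avoids confusion.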
labs() - Title, Subtitle, Legends, and more
labs() can do it all
library(ggplot2)

# Box plots of city mileage by car class, with every label set through labs()
mpg |>
  ggplot(aes(x = class, y = cty, fill = class)) +
  geom_boxplot() +
  labs(
    title = "Mileage in cities of various car classes",
    subtitle = "Car classes are colored with nice colors using the fill argument!!",
    x = "Class of car",
    y = "Mileage in cities",
    fill = "Class",
    caption = "The data comes from the mpg dataset"
  )
ggplot2 comes prepackaged with several different themes.
The ggthemes package has many more themes.
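Switching theme is a matter of adding a theme function to the plot. The sketch below uses only themes built into ggplot2; themes from ggthemes are added the same way once that package is loaded.

```r
library(ggplot2)

p <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point()

p_minimal <- p + theme_minimal()          # light theme without a background
p_bw      <- p + theme_bw(base_size = 14) # black-and-white theme, larger text
```

The data and layers stay the same; only the non-data appearance of the plot changes.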
fill is mapped to a categorical variable
scale_fill_brewer(type = )
scale_fill_brewer(palette = )
color is mapped to a continuous variable
scale_colour_continuous(type = )
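A brief sketch of both cases. The palette `"Set2"` and the type `"viridis"` are example values chosen here, not taken from the slides.

```r
library(ggplot2)

# fill mapped to a categorical variable: pick a ColorBrewer palette by name.
p_brewer <- ggplot(mpg, aes(x = class, fill = drv)) +
  geom_bar() +
  scale_fill_brewer(palette = "Set2")

# colour mapped to a continuous variable: switch the gradient type.
p_viridis <- ggplot(mpg, aes(x = displ, y = hwy, colour = cty)) +
  geom_point() +
  scale_colour_continuous(type = "viridis")
```

`scale_fill_brewer(type = "qual")` would instead select the default palette of a given type ("seq", "div" or "qual").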
patchwork package
library(ggplot2)

# Four example plots, p1 to p4, built from the mpg data
p1 <- ggplot(mpg) +
  geom_point(aes(x = displ, y = hwy)) +
  ggtitle("p1")
p2 <- ggplot(mpg) +
  geom_bar(aes(x = as.character(year), fill = drv), position = "dodge") +
  labs(x = "year") +
  ggtitle("p2")
p3 <- ggplot(mpg) +
  geom_density(aes(x = hwy, fill = drv), colour = NA) +
  facet_grid(rows = vars(drv)) +
  ggtitle("p3")
p4 <- ggplot(mpg) +
  stat_summary(aes(x = drv, y = hwy, fill = drv), geom = "col", fun.data = mean_se) +
  stat_summary(aes(x = drv, y = hwy), geom = "errorbar", fun.data = mean_se, width = 0.5) +
  ggtitle("p4")
+ does not specify a particular layout, only that the plots should be displayed together.
patchwork automagically decides the layout.
Adding 3 plots together will create a 1x3 grid.
Adding 4 plots together will create a 2x2 grid.
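A minimal sketch of combining plots with `+`. The three small plots defined here (`a`, `b`, `c`) are stand-ins so the example runs on its own; any ggplot objects would do.

```r
library(ggplot2)
library(patchwork)

a <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
b <- ggplot(mpg, aes(x = drv)) + geom_bar()
c <- ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20)

# Three plots added together: patchwork chooses the grid itself.
three_plots <- a + b + c

# A fourth plot changes the automatically chosen grid.
four_plots <- a + b + c + a
```

Printing `three_plots` or `four_plots` draws the combined figure.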
plot_layout()
You have a very high degree of control over the layout via the plot_layout() function.
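For instance, `plot_layout()` can fix the number of columns and the relative heights of the rows. The plots and the particular values below are illustrative only.

```r
library(ggplot2)
library(patchwork)

a <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
b <- ggplot(mpg, aes(x = drv)) + geom_bar()
c <- ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20)

# One column, with the first plot twice as tall as the other two.
stacked <- a + b + c +
  plot_layout(ncol = 1, heights = c(2, 1, 1))
```

The `widths` argument works the same way for columns.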
/ and |
/ forces a single column.
| forces a single row.
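The two operators can be nested to build up a layout, as in this sketch. The plots are again stand-ins defined only for the example.

```r
library(ggplot2)
library(patchwork)

a <- ggplot(mpg, aes(x = displ, y = hwy)) + geom_point()
b <- ggplot(mpg, aes(x = drv)) + geom_bar()
c <- ggplot(mpg, aes(x = hwy)) + geom_histogram(bins = 20)

# a and b side by side in the top row, c across the full width underneath.
two_rows <- (a | b) / c
```

The parentheses matter: in R, `/` binds more tightly than `|`, so `a | b / c` would place `a` beside a stacked `b` over `c`.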
guides = "collect"
guide_area()
patchwork comes with the powerful & operator, which allows you to change the settings on all the individual plots.
Can you spot the &s in the code below?
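The slide's original code is not reproduced here, so the following is a stand-in sketch of how `&` is typically used. The plots, the theme and the legend position are example choices.

```r
library(ggplot2)
library(patchwork)

a <- ggplot(mpg, aes(x = displ, y = hwy, colour = drv)) + geom_point()
b <- ggplot(mpg, aes(x = cty, y = hwy, colour = drv)) + geom_point()

# guides = "collect" merges the two identical legends into one;
# each & applies its theme setting to every plot in the patchwork.
combined <- (a | b) +
  plot_layout(guides = "collect") &
  theme_minimal() &
  theme(legend.position = "bottom")
```

Had `+ theme_minimal()` been used instead of `&`, only the last plot (`b`) would have received the theme.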
Several slides are reworked from presentations in Data Science in a Box.
Most examples come from ggplot2: Elegant Graphics for Data Analysis.