Reproducible Research with R

…the basics

Søren O’Neill & Steen Flammild Harsted

2023-01-01

Replicable, Robust and Reproducible Research

Replication refers to testing the reliability of a prior finding with different data.
Robustness refers to testing the reliability of a prior finding using the same data and different analysis strategy.
Reproducibility refers to testing the reliability of a prior finding using the same data and same analysis strategy.

..from https://doi.org/10.1146/annurev-psych-020821-114157

What it is …

“Facilitate easy and accurate reproducibility of all steps of research: Results, process and comprehension – from raw data to finished output”

Why

To help yourself not get lost
Documentation (formal/legal)
Reproducibility (from A-Z)
Reusability (different outputs)
Recycling (reassemble for other purposes)

Levels of research reproducibility

Aim for (**)

Basics

Comments in R code and markdown

R code

# This is a single-line comment in R code
# ...there are no multi-line comments
x <- 2 + 2 # ...but inline comments are okay

# NOTE: The hash sign (#) has different meanings
# in R code and markdown!

Markdown

<!--

This is a multi-line comment ..
..in markdown

-->

<!-- This is a single line comment -->

Where to put comments?

R chunk
Output
Good practice

# this is definitely okay
x <- 2 + 2

# ...but is the following okay?
tibble(x=1:10, y=2:11) %>% # ..Is this okay?
  filter(y>5 & # How about breaking in the middle..
  x != 9) # ..of a statement like 'filter'?

# A tibble: 5 × 2
      x     y
  <int> <int>
1     5     6
2     6     7
3     7     8
4     8     9
5    10    11

These comments were syntactically okay … BUT did they make your code easier for humans to read?

Keep it simple!

# Comment lines before the code
# can span multiple lines and do
# not disturb your reading of the
# code ... best practice!

x <- 2 + 2 # use inline comments sparingly
y <- x^2 - c(2, 5, 8, 10:14) # ..they are distracting!

Meaningful comments

What comments would be relevant here?

Suggest comments for this code… think ‘what’ and ‘why’

# 1
d <- read.csv("my_data_file.csv")

# 2
d <- d %>% filter(id != "241269-1212")

# 3
d <- d %>% 
  mutate(s=factor(c("F", "M"))[as.numeric(substr(id,nchar(id),nchar(id))) %% 2 + 1])

Suggestions for meaningful comments

# ?? comments necessary ??
d <- read.csv("my_data_file.csv")

# excluded because participant entered an invalid CPR number
d <- d %>% filter(id != "241269-1212")

# set 's' to F(emale) if the last digit in CPR is even, M(ale) if it is odd
d <- d %>% 
  mutate(s=factor(c("F", "M"))[as.numeric(substr(id,nchar(id),nchar(id))) %% 2 + 1])

Commenting your code

Main points

Always comment your code
Comments should explain your thinking: the ‘why’
With good code the ‘how’ is self-evident
- If your code cannot be made self-evident: explain the ‘how’ in comments

Using white space

In markdown, there's  
an important difference
between '_new-line_' and '_empty-line_'.

...white space matters!

In markdown, there’s
an important difference between ‘new-line’ and ‘empty-line’.

…white space matters!

Any number of spaces between words is rendered as one space
Any number of empty lines between paragraphs is rendered as one empty line
Use one space to separate functions, operators, etc: write a %>% b rather than a%>%b
Use an empty line to start a new paragraph

Check out Soft wrap long lines in the Code menu.

Commenting your project

Maintain a README.md file in each project, at the root level

A simple description of the project purpose (the ‘why’)
People involved
Data sources etc
List most important components (data and files)

Especially important for larger, more complex projects with many data sources, collaborators, etc

README.md template

# Project title

A subtitle that describes your project, e.g., research question

## Motivation

Motivate your research question or business problem. Clearly explain which problem is solved.

## Method and results

First, introduce and motivate your chosen method, and explain how it contributes to solving the research 
question/business problem.

Second, summarize your results concisely. Make use of subheaders where appropriate.


## Repository overview

Provide an overview of the directory structure and files, for example:

├── README.md
├── data
│   ├── my_data.csv  # raw data from CPR register
│   └── exp_data.csv # experimental data register
├── plots
│   ├── plot_1.png   # Boxplot of age
│   ├── plot_2.png   # Pie chart of sex
│   └── plot_3.png   # Bi-plot age vs measurement X
├── main.R           # all analyses in one place
└── manuscript1.Rmd  # for J of RR

## Running instructions

Explain to potential users how to run/replicate your workflow. If necessary, touch upon the required input 
data, which secret credentials are required (and how to obtain them), which software tools are needed 
to run the workflow (including links to the installation instructions), and how to run the workflow.

## More resources

Point interested users to any related literature and/or documentation.

## About

Explain who has contributed to the repository.

Commenting your project

Maintain a data definition (markdown) file for each data file in the project, in the same folder as the data file itself

Where does data come from?
What are the individual variables in the data?
- E.g. ‘weight’ is self-explanatory … but is it kg or lb, is it measured or self reported, etc?
How is data in different files related to each other?
- E.g. the file ‘clean_data.rds’ might be generated by the script ‘clean_my_data.R’ on the basis of data file ‘raw_data.csv’

Especially important for larger, more complex projects with many data sources, collaborators, etc
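
A first draft of a data definition file can be generated from the data itself. The following is only a sketch in base R: the helper name draft_data_definition, the example data frame and the table layout are hypothetical, and the descriptions (units, measured or self-reported, etc) still have to be filled in by hand.

```r
# Hypothetical helper: drafts a markdown table with one row per variable.
draft_data_definition <- function(data, source_note) {
  header <- c(
    paste("Source:", source_note),
    "",
    "| Variable | Type | Description (unit, how measured) |",
    "|----------|------|----------------------------------|"
  )
  # First class of each column, e.g. "numeric" or "character"
  types <- vapply(data, function(column) class(column)[1], character(1))
  rows <- paste0("| ", names(data), " | ", types, " | TODO |")
  c(header, rows)
}

# Made-up example data, for illustration only
example_data <- data.frame(
  id = c("a", "b"),
  weight = c(70.5, 82.1),
  stringsAsFactors = FALSE
)

definition <- draft_data_definition(example_data, "made-up example data")

# A temporary folder is used here; in a project, save next to the data file
definition_file <- file.path(tempdir(), "example_data_definition.md")
writeLines(definition, definition_file)
```

Keeping the definition in markdown means it stays human-readable and can be versioned along with the scripts.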

Naming stuff

There are only two hard things in Computer Science: cache invalidation and naming things. – Phil Karlton

Meaning

Let variable, function and file names convey meaning.

# 1 
d <- d %>% 
  mutate(s=factor(c("M", "F"))[as.numeric(substr(id,nchar(id),nchar(id))) %% 2])

# 2 
d_m <- d %>% filter(s=="M")

Suggest alternative code and variable names for this code

Meaning

Let variable, function and file names convey meaning.

# 1 
data <- data %>% 
  mutate(sex=factor(c("F", "M"))[as.numeric(substr(id,nchar(id),nchar(id))) %% 2 + 1])

# 2 
data_males_only <- data %>% filter(sex=="M")

Making the meaning even clearer…

data <- data %>% 
  mutate(sex=cpr2sex(id))

Alas, the function cpr2sex does not exist in base R or the tidyverse.

Tip

# Requires a custom function like this -- which could be sourced from a file
library(stringr)

cpr2sex <- function(x) {
  # This function takes a character vector (x), presumed to hold valid Danish
  # CPR numbers, and returns "F", "M" or NA for each element, depending on the
  # last character in the string.
  # If the last CPR character is an even number, it indicates female sex, and
  # an odd number indicates male sex.
  # The function is vectorised, so it can be used inside mutate().
  last_char <- str_sub(x, -1, -1)
  sex <- rep(NA_character_, length(x)) # NA: last character is not a digit
  sex[last_char %in% c("0", "2", "4", "6", "8")] <- "F"
  sex[last_char %in% c("1", "3", "5", "7", "9")] <- "M"
  return(sex)
}

We could hide this away in a separate file and ‘source’ it .. or even make a new package…
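
Keeping a function in its own file and loading it with source() could look roughly like this. This is a sketch only: the helper last_character and its file name are hypothetical, the IDs are made up, and a temporary folder stands in for a project subfolder such as custom_functions.

```r
# Write a small helper function to its own file. A temporary folder is used
# here so the sketch runs anywhere; in a project the file would live in a
# subfolder of the project root.
function_file <- file.path(tempdir(), "last_character.R")
writeLines(
  c(
    "# Returns the last character of each string in x",
    "last_character <- function(x) {",
    "  substr(x, nchar(x), nchar(x))",
    "}"
  ),
  function_file
)

# In the analysis script: load the helper, then use it
source(function_file)
last_digits <- last_character(c("010203-1234", "040506-5677"))
```

The analysis script then only shows a well-named function call, and the details stay in one place where they can be documented and reused.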

Compound names

Do use under_scores
Do not use camelCase
Do not use kebab-case

Nouns and verbs

make_larger_by_10 <- function(x) {
  return(x+10)
}

ten_larger <- make_larger_by_10(112)

# For instance:
# selected_data <- data %>% select(..)

Names

Main points

Names should be meaningful
Use under_scores, not camelCase or kebab-case
Function names should be verbs
Variable names should be nouns

Files and folders

R Projects

One project in one folder!

The project root folder

Should contain

RStudio project file (*.Rproj)
README.md
sessionInfo.txt
Your main R scripts
Your R Markdown and Quarto documents
Relevant subfolders (/custom_functions, /gfx, /data, etc)
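
One possible way to produce sessionInfo.txt is to capture the printed output of sessionInfo(). This is a sketch in base R; a temporary folder is used so that it runs anywhere, whereas in a project the file would be written to the project root.

```r
# Save the R version, platform and loaded package versions to a text file.
session_file <- file.path(tempdir(), "sessionInfo.txt")
writeLines(capture.output(sessionInfo()), session_file)
```

Re-running this whenever the analysis is finalised keeps a record of the software versions behind the results.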

Using subfolders

Use the here() function (from the here package) to refer to subfolders.

Using relative paths with ./ and ../ can also work

Do not use absolute filesystem paths like C:/users/Einstein/Documents/
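
A short sketch of the two approaches, assuming the here package is installed. The folder and file names (data, my_data.csv) are placeholders.

```r
library(here) # assumes the 'here' package is installed

# here() builds the path starting from the project root, so it does not
# depend on the current working directory.
data_path <- here("data", "my_data.csv")

# The same file via a relative path. This only points to the right place
# when the working directory is the project root.
relative_path <- file.path(".", "data", "my_data.csv")
```

Either path can then be passed to a reading function such as read.csv(), and neither contains anything specific to one computer.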

The ubiquitous versioning nightmare

Files and folders

Make and use your own folder/sub-folder template for new projects
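
A folder template can be as simple as a small function that creates the same skeleton every time. The sketch below uses base R only. The function name create_project_template and the choice of subfolders are assumptions for illustration, to be adapted to your own needs.

```r
# Hypothetical helper: creates an empty project skeleton under 'root'.
create_project_template <- function(root) {
  subfolders <- c("data", "gfx", "custom_functions")
  for (subfolder in subfolders) {
    dir.create(file.path(root, subfolder), recursive = TRUE, showWarnings = FALSE)
  }
  # Add a README.md stub unless one already exists
  readme <- file.path(root, "README.md")
  if (!file.exists(readme)) {
    writeLines(c("# Project title", "", "## Motivation"), readme)
  }
  invisible(root)
}

# A temporary folder is used here so the sketch runs anywhere
new_project <- file.path(tempdir(), "my_new_project")
create_project_template(new_project)
created <- list.files(new_project)
```

Because existing folders and an existing README.md are left alone, the function can be re-run on a project without overwriting anything.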

File types

Text files are always preferable to binary files
Markup files (HTML, XML, etc) are human readable, but can be complex
Application files (.odt, .docx) are often either binary files or very complex markup files

File types

Stick to simple, human-readable files like R-scripts, markdown, csv files, etc, as far into the process as you can and only generate pdf, word, tiff, jpeg etc files as the final step.

…there is really only one potential issue with text files: the character encoding (charset)

Weird characters?

E.g. SÃ¸ren instead of Søren

It’s probably the character encoding (Microsoft Excel again!). Just stick to UTF-8/UTF-16
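
The garbled name above is what UTF-8 text looks like when its bytes are read as Latin-1. The sketch below reproduces and reverses that mistake with base R's iconv(). It is an illustration of the mechanism, not a general repair tool. When reading files, the encoding can be stated up front, for example with the fileEncoding argument of read.csv().

```r
# "Søren" written with a Unicode escape, so the script itself is encoding-safe
name <- "S\u00f8ren"

# Simulate the mistake: treat the UTF-8 bytes as if they were Latin-1
garbled <- iconv(name, from = "latin1", to = "UTF-8")

# Undo it: convert back to the original bytes and mark them as UTF-8 again
repaired <- iconv(garbled, from = "UTF-8", to = "latin1")
Encoding(repaired) <- "UTF-8"
```

Here garbled holds the eight-byte, doubly encoded version of the name, and repaired has the same bytes as the original.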

Files and folders

Main points

Use a strict folder structure you can handle – make a template!
Use relative folder paths if possible
Use human-readable files only if possible (txt, md, Rmd, csv, etc)
Only use non-human readable files for ‘final output’ (pdf, docx, xlsx, etc)

The end