Reproducible Research with R

…the basics

Søren O’Neill & Steen Flammild Harsted

2023-01-01

Replicable, Robust and Reproducible Research

  • Replication refers to testing the reliability of a prior finding with different data.
  • Robustness refers to testing the reliability of a prior finding using the same data and different analysis strategy.
  • Reproducibility refers to testing the reliability of a prior finding using the same data and same analysis strategy.

..from https://doi.org/10.1146/annurev-psych-020821-114157

What it is …

“Facilitate easy and accurate reproducibility of all steps of research: Results, process and comprehension – from raw data to finished output”

Why

  • To help yourself not get lost
  • Documentation (formal/legal)
  • Reproducibility (from A-Z)
  • Reusability (different outputs)
  • Recycling (reassemble for other purposes)

Levels of research reproducibility


Aim for (**)

Basics

Comments in R code and markdown

R code

# This is a single-line comment in R code
# ...there are no multi-line comments
x <- 2 + 2 # ...but inline comments are okay

# NOTE: The hash sign (#) has different meanings
# in R code and markdown!

Markdown

<!--

This is a multi-line comment ..
..in markdown

-->

<!-- This is a single line comment -->

Where to put comments?

# this is definitely okay
x <- 2 + 2

# ...but is the following okay?
tibble(x=1:10, y=2:11) %>% # ..Is this okay?
  filter(y>5 & # How about breaking in the middle..
  x != 9) # ..of a statement like 'filter'?
# A tibble: 5 × 2
      x     y
  <int> <int>
1     5     6
2     6     7
3     7     8
4     8     9
5    10    11

These comments were syntactically okay … BUT did they make your code easier for humans to read?

Keep it simple!

# Comment lines before the code
# can span multiple lines and do
# not disturb your reading of the
# code ... best practice!

x <- 2 + 2 # use inline comments sparingly
y <- x^2 - c(2, 5, 8, 10:14) # ..they are distracting!

Meaningful comments

What comments would be relevant here?

Suggest comments for this code… think ‘what’ and ‘why’

# 1
d <- read.csv("my_data_file.csv")

# 2
d <- d %>% filter(id != "2321369-1212")

# 3
d <- d %>% 
  mutate(s=factor(c("F", "M"))[as.numeric(substr(id,nchar(id),nchar(id))) %% 2 + 1])

Suggestions for meaningful comments

# ?? comments necessary ??
d <- read.csv("my_data_file.csv")

# excluded because participant entered an invalid CPR number
d <- d %>% filter(id != "2321369-1212")

# set 's' to F(emale) or M(ale) depending on even/odd last digit in CPR
d <- d %>% 
  mutate(s=factor(c("F", "M"))[as.numeric(substr(id,nchar(id),nchar(id))) %% 2 + 1])

Commenting your code

Main points

  • Always comment your code
  • Comments should explain your thinking: the ‘why’
  • With good code the ‘how’ is self-evident
    • If your code cannot be self-evident: explain in comments

Using white space

In markdown, there's  
an important difference
between '_new-line_' and '_empty-line_'.

...white space matters!

In markdown, there’s
an important difference between ‘new-line’ and ‘empty-line’.

…white space matters!

  • any number of spaces between words is equal to one space
  • any number of empty lines between paragraphs is equal to one empty line
  • use one space to separate functions, operators, etc: write a %>% b rather than a%>%b
  • use empty line to indicate new paragraph

Check out ‘Soft wrap long lines’ in the Code menu.

Commenting your project

Maintain a README.md file in each project, at the root level

  • A simple description of the project purpose (the ‘why’)
  • People involved
  • Data sources etc
  • List most important components (data and files)

Especially important for larger, more complex projects with many data sources, collaborators, etc

README.md template

# Project title

A subtitle that describes your project, e.g., research question

## Motivation

Motivate your research question or business problem. Clearly explain which problem is solved.

## Method and results

First, introduce and motivate your chosen method, and explain how it contributes to solving the research 
question/business problem.

Second, summarize your results concisely. Make use of subheaders where appropriate.


## Repository overview

Provide an overview of the directory structure and files, for example:

├── README.md
├── data
│   ├── my_data.csv  # raw data from CPR register
│   └── exp_data.csv # experimental data register
├── plots
│   ├── plot_1.png   # Boxplot of age
│   ├── plot_2.png   # Pie chart of sex
│   └── plot_3.png   # Bi-plot age vs measurement X
├── main.R           # all analyses in one place
└── manuscript1.Rmd  # for J of RR

## Running instructions

Explain to potential users how to run/replicate your workflow. If necessary, touch upon the required input 
data, which secret credentials are required (and how to obtain them), which software tools are needed 
to run the workflow (including links to the installation instructions), and how to run the workflow.

## More resources

Point interested users to any related literature and/or documentation.

## About

Explain who has contributed to the repository.

Commenting your project

Maintain a data definition (markdown) file for each data file in the project, in the same folder as the data file itself

  • Where does data come from?
  • What are the individual variables in the data?
    • E.g. ‘weight’ is self-explanatory … but is it kg or lb, is it measured or self reported, etc?
  • How is data in different files related to each other?
    • E.g. the file ‘clean_data.rds’ might be generated by the script ‘clean_my_data.R’ on the basis of data file ‘raw_data.csv’

Especially important for larger, more complex projects with many data sources, collaborators, etc
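
A data definition file can be started from the data itself. The following is a minimal sketch: the helper name write_data_definition, the example data and the file names are all hypothetical, and the generated file only lists variable names and types. Descriptions, units and provenance still have to be written by hand.

```r
# Hypothetical helper: writes a markdown skeleton describing each variable
write_data_definition <- function(data, data_file, out_file) {
  lines <- c(
    paste("# Data definition for", basename(data_file)),
    "",
    "Source: (describe where the data comes from)",
    "",
    "| Variable | Type | Description (unit, measured or self-reported, etc) |",
    "|---|---|---|",
    # one table row per variable; the description column is left empty
    paste0("| ", names(data), " | ",
           vapply(data, function(col) class(col)[1], character(1)),
           " | |")
  )
  writeLines(lines, out_file)
  invisible(out_file)
}

# Usage with a small made-up data set
example_data <- data.frame(id = c("a", "b"), weight = c(70.5, 82.1))
definition_file <- file.path(tempdir(), "my_data_definition.md")
write_data_definition(example_data, "data/my_data.csv", definition_file)
readLines(definition_file)
```

In a real project the output file would be saved next to the data file rather than in a temporary folder.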

Naming stuff

There are only two hard things in Computer Science: cache invalidation and naming things. – Phil Karlton

Meaning

Let variable, function and file names convey meaning.

# 1 
d <- d %>% 
  mutate(s=factor(c("F", "M"))[as.numeric(substr(id,nchar(id),nchar(id))) %% 2 + 1])

# 2 
d_m <- d %>% filter(s=="M")

Suggest alternative code and variable names for this code

Meaning

Let variable, function and file names convey meaning.

# 1 
data <- data %>% 
  mutate(sex=factor(c("F", "M"))[as.numeric(substr(id,nchar(id),nchar(id))) %% 2 + 1])

# 2 
data_males_only <- data %>% filter(sex=="M")

Making the meaning even clearer…

data <- data %>% 
  mutate(sex=cpr2sex(id))

Alas, the function cpr2sex does not exist in base R or Tidyverse

Tip

# Requires a custom function like this -- which could be sourced from file
library(stringr)

cpr2sex <- function(x) {
  # This function takes a character vector (x), presumed to hold valid Danish
  # CPR numbers, and returns "F", "M" or NA for each element, depending on the
  # last character. If the last CPR character is an even number, it indicates
  # female sex, and an odd number indicates male sex.
  last_char <- str_sub(x, -1, -1)
  ifelse(last_char %in% c("0", "2", "4", "6", "8"), "F",
    ifelse(last_char %in% c("1", "3", "5", "7", "9"), "M",
      NA_character_)) # NA: last character in CPR is not a digit
}

We could hide this away in a separate file and ‘source’ it .. or even make a new package…
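
A minimal sketch of the ‘separate file’ approach follows. It writes the function file to a temporary folder only so that the example can be run as-is; in a real project the file would already exist, for instance as custom_functions/cpr2sex.R (an assumed location). The function body is a simplified base R stand-in and the CPR numbers are made up.

```r
# Create a small function file (stand-in for custom_functions/cpr2sex.R)
function_file <- file.path(tempdir(), "cpr2sex.R")
writeLines(c(
  "cpr2sex <- function(x) {",
  "  last_digit <- suppressWarnings(as.numeric(substr(x, nchar(x), nchar(x))))",
  "  c(\"F\", \"M\")[last_digit %% 2 + 1] # even -> F, odd -> M, non-digit -> NA",
  "}"
), function_file)

# In the analysis script: load the function once, then use it by name
source(function_file)
cpr2sex(c("010190-1234", "010190-1235"))
```

The analysis script now only shows the meaningful name, while the details live in one place that can be reused across scripts.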

Compound names

  • Do use under_scores
  • Do not use camelCase
  • Do not use kebab-case

Nouns and verbs

make_larger_by_10 <- function(x) {
  return(x+10)
}

ten_larger <- make_larger_by_10(112)

# For instance:
# selected_data <- data %>% select(..)

Names

Main points

  • Names should be meaningful
  • Use under_scores, not camelCase or kebab-case
  • Function names should be verbs
  • Variable names should be nouns

Files and folders

R Projects

One project in one folder!

The project root folder

Should contain

  • RStudio project (*.Rproj)
  • README.md
  • sessionInfo.txt
  • Your main R scripts
  • Your RMarkdown and Quarto documents
  • Relevant subfolders (/custom_functions, /gfx, /data, etc)
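
One possible way to produce the sessionInfo.txt file is sketched below. It uses only base R; the temporary folder is an assumption made so the example runs anywhere, and in a project the file would be written to the project root instead.

```r
# Save the R version, platform and attached packages as plain text
# (in a project: session_file <- "sessionInfo.txt")
session_file <- file.path(tempdir(), "sessionInfo.txt")
writeLines(capture.output(sessionInfo()), session_file)

# Inspect the first few lines of the result
head(readLines(session_file))
```

Re-running this at the end of the main script keeps the file in step with the packages actually used.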

Using subfolders

Use the here() function from the here package to refer to subfolders.

Using relative paths with ./ and ../ can also work

Do not use absolute filesystem paths like C:/users/Einstein/Documents/
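
A short sketch of here() in use, assuming the here package is installed and that the project has a data subfolder containing my_data.csv (both names are only examples):

```r
library(here) # assumes the 'here' package is installed

# here() builds the path starting from the project root,
# regardless of the current working directory
data_path <- here("data", "my_data.csv")
data_path

# The path can then be passed to any reading function, e.g.
# d <- read.csv(data_path)
```

Because the path is built from the project root, the same script works unchanged when the project folder is moved or shared with a collaborator.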

The ubiquitous versioning nightmare

Files and folders

Make and use your own folder/sub-folder template for new projects
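
A folder template can also be a small script. The sketch below is one way to do it in base R; the helper name create_project_skeleton and the default subfolders are assumptions to be adapted to your own template.

```r
# Hypothetical helper: creates a standard folder skeleton for a new project
create_project_skeleton <- function(project_root,
                                    subfolders = c("data", "gfx", "custom_functions")) {
  for (folder in file.path(project_root, subfolders)) {
    dir.create(folder, recursive = TRUE, showWarnings = FALSE)
  }
  # Add a README.md stub, but never overwrite an existing one
  readme <- file.path(project_root, "README.md")
  if (!file.exists(readme)) {
    writeLines(c("# Project title", "", "## Motivation"), readme)
  }
  invisible(project_root)
}

# Usage: a throw-away project in a temporary folder
new_project <- file.path(tempdir(), "my_new_project")
create_project_skeleton(new_project)
list.files(new_project)
```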

File types

  • Text files are always preferable to binary files
  • Markup files (HTML, XML, etc) are human readable, but can be complex
  • Application files (.odt, .docx) are often either binary files or very complex markup files

File types

Stick to simple, human-readable files like R-scripts, markdown, csv files, etc, as far into the process as you can and only generate pdf, word, tiff, jpeg etc files as the final step.

…there is really only one potential issue with text files: the character encoding (charset)

Weird characters?

E.g. SÃ¸ren instead of Søren

It’s probably the character encoding (Microsoft Excel again!) – just stick to UTF-8/UTF-16
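
The effect can be reproduced in a few lines. This is a sketch under assumptions: it uses a temporary csv file, writes the name with a unicode escape so the script itself cannot be garbled, and assumes an R session that can display non-ASCII characters (e.g. a UTF-8 locale).

```r
name <- "S\u00f8ren" # the name with a Danish o-slash, written as an escape

# Write a small UTF-8 encoded csv file
csv_file <- tempfile(fileext = ".csv")
writeLines(enc2utf8(c("name", name)), csv_file, useBytes = TRUE)

# Declaring the wrong encoding when reading garbles the non-ASCII character
read_wrong <- read.csv(csv_file, fileEncoding = "latin1")

# Declaring the correct encoding reads the name as intended
read_right <- read.csv(csv_file, fileEncoding = "UTF-8")

read_wrong$name
read_right$name
```

Stating the encoding explicitly when reading and writing text files avoids relying on platform defaults.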

Files and folders

Main points

  • Use a strict folder structure you can handle – make a template!
  • Use relative folder paths if possible
  • Use human-readable files only if possible (txt, md, Rmd, csv, etc)
  • Only use non-human readable files for ‘final output’ (pdf, docx, xlsx, etc)

The end