Reproducible Research with R

…workflow

Søren O’Neill & Steen Flammild Harsted

2023-01-01

Workflow and documentation in RR

Automation

The entire process from raw data to output should be:

  • scripted
  • reproducible
  • reversible

Automation

The entire process from raw data to output should be:

  • scripted
  • reproducible
  • reversible.

E.g.

data <- read.csv("my_data_file.csv")
# excluded as participant entered an invalid CPR number
data <- data %>% filter(id != "2321369-1212")

..but why not just delete that observation in raw data..?

Scripted workflows

Load and clean data

http://www.sthda.com/english/wiki/best-practices-in-preparing-data-files-for-importing-into-r

Structuring workflow

Example project

  • Download ‘RR_example.zip’ from this site
  • Unzip the file in an appropriate folder
  • Create a new R-project in that existing folder

Examine the project files

For now don’t change anything and don’t get bogged down in all the details of the code in each file … just get an overview, see if you can understand the code and how it ties together.

This exercise is best done in pairs.

  • What is the project all about?
  • What do the individual files do?
  • How are the files related to each other?

Overview of the project

Let’s try to:

  • Knit a manuscript to html or work or pdf

Working with the manuscript

A simple problem

  • The data for participant with ID=7 was corrupted (equipment failure)
    • What to do?
    • How to do it?

Another simple problem

The manuscript you are submitting to, has some specific requirements:

  • Tables should be included in text close to first reference
  • Figure should be included in a section ‘Figures’ (just before References) and a placeholder <<FIGURE X ABOUT HERE>> where it is referenced.

How would you change the code?

Working with literature citations

BibTeX, csl, zotero and friends

BibTeX

A simple human readable text file format for references

  • Export directly from PubMED and others
  • Export from Zotero, EndNote(?) and others

Zotero

An open source, free reference manager – well integrated with firefox.

Allows for shared libraries.

Plugin BetterBibTex for … well, better bibTeX.

Alternatives: Mendeley (Elsevier), EndNote, RefManager, etc

Zotero

Let me demonstrate…

  • how to find a new reference
  • include it in zoteros library
  • automatically update bibTeX file (per project)
  • cite in Rmd file

Zotero got style!

Loads of styles … 10.000+ … from APA to Ugeskrift for Læger

…and you can write your own if need be.

https://www.zotero.org/styles

Exercise

(without Zotero)

  • You have decided to include an extra citation in the main_v1.qmd file
    • Go to scholar.google.com
    • Search for ‘What low back pain is and why we need to pay attention’
    • Go to science direct and click ‘Cite’ to get bibtex
    • Copy bibtex content to your reference file
    • Copy the BibTeX identifier and insert it as a citation
    • Re-render your document

Collaboration

Collaborating with other researchers

Three scenarios :

  • Your collaborators do not use the R ecosystem (booh!)
  • Your collaborators use the R ecosystem (woohoo!)
  • You and your collaborators are hardcore nerds (git!)

Collaborating with R-agnostic colleagues

  • Knit your R-markdown code into a Word or OpenDocument file … send it
  • Collaborators can use Track changes in wordprocessor and … return it
  • Manually transfer changes back to Rmd … start over again

Benefits

  • Simple
  • No new tools to learn (?)
  • Anyone can join in (?)
  • Separation of comments and revisions

Collaborating with R-agnostic colleagues

  • Knit your R-markdown code into a Word or OpenDocument file … send it
  • Collaborators can use Track changes in wordprocessor and … return it
  • Manually transfer changes back to Rmd … start over again

Drawbacks

  • Error prone
  • Redundancy
  • Messy versioning nightmare
  • Focus shifts towards typesetting / presentation

Collaborating with R users – the simple way

  • Get your Rmd files and … send it
  • Collaborators edit the Rmd directly and … return it
  • Diff the two version of the Rmd file and … merge them selectively

Diff’ing two (or more) text files

  • Meld
  • Beyond compare
  • Araxis merge
  • P4Merge
  • DeltaWalker
  • (RStudio)
  • ..and others

There are also several online version .. e.g. https://editor.mergely.com/

…you probably should not upload sensitive information though.

Collaborating with R users – the simple way

  • Get your Rmd files and … send it
  • Collaborators edit the Rmd directly and … return it
  • Diff the two version of the Rmd file and … merge them selectively

Benefits

  • Relatively simple
  • New tools are easy learn
  • Anyone can join in
  • Only one file type – no knitting
  • Focus remains on content

Collaborating with R users – the simple way

  • Get your Rmd files and … send it
  • Collaborators edit the Rmd directly and … return it
  • Diff the two version of the Rmd file and … merge them selectively

Drawback

  • Still some redundancy (Rmd file versions)
  • Messy versioning nightmare
  • Comments and revisions are not clearly separated

Documenting process

Collaborating with tech savvy nerds

…enter GIT and renv

R package renv

Not very difficult at all.

  • Maintains a library of R packages along with r code
  • Updates to R and packages does not break your project
  • Permits you to come back in 5 years and re-execute your code

https://rstudio.github.io/renv/articles/renv.html

GIT

It is a bit difficult.

  • Instead of storing multiple versions of the same file …
  • ..it stores just the one file, and a (hidden) version-history
  • Changes are handled by Diff’ing versions
  • All changes (process) are documented
  • All changes (process) are reversible
  • Projects are forkable and mergable
  • indispensable for LARGE projects
  • GIT is a separate program, supported by RStudio.

Summary

Summary

  • Comment your code
  • Comment your project
  • Meaningful names
  • Structured files/folders
  • Human readable files
  • SCRIPTED AUTOMATION !!
  • Split the process into ‘chunks’ (data / analyses / output)
  • Split the code into files
  • Split the code-in-files into chunks

Later in your R journey:

  • Accessing databases directly (RedCAP, ao)
  • Creating interactive websitesand documents
  • Making graphs come alive with animations
  • Live data analyses dashboards
  • ..and much much more (it can even order pizza)

Getting help

  • Google
  • Stackoverflow
  • Stackoverflow – R4PHD