This tutorial was compiled by Clemens Schmid for a workshop on reproducible research and data management at MPI-SHH in January 2020. It’s based on and inspired by two workshops prepared by Ben Marwick for the SAA2019 conference and Brown University Digital Archaeology Lab.

The research compendium

Idea

A research compendium is a manuscript accompanied by code and data files (or persistent links to reputable online repositories) that allows reviewers and readers to reproduce and extend the results without needing any further materials from the original authors […].

– Marwick 2017, 442.

A research compendium contains relevant information to make the scientific process behind a book, book chapter or journal article more transparent and reproducible.

Ideally it contains all data, code and text necessary to compile the published document.

This might be difficult sometimes due to raw data size, publishing restrictions or parts of the workflow that are not quantitative or inherently impossible to reproduce.

The term research compendium therefore covers a whole range of imaginable degrees of completeness.

Data

Research can only be reproduced, checked and expanded when underlying raw data is available.

Data sharing is a moral obligation, but doing it well is non-trivial.

Marwick & Birch 2018 have some recommendations:

Anticipate how your data will be used
Keep raw data raw
Store data in open formats
Data should be structured for analysis (tidy data)
Data should be uniquely identifiable (persistent references)
Provide relevant metadata
Adopt the proper privacy protocols
Use a trustworthy repository
Use an open license
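A few of these recommendations translate directly into code. The following is a minimal sketch and not taken from Marwick & Birch 2018: the file names, column names and values are invented for illustration. It reads a raw file without ever modifying it, derives a tidy table from it and stores the result separately in an open format (CSV).

```r
# Hypothetical raw data file, created here only so the example is self-contained
raw_file <- file.path(tempdir(), "raw_measurements.csv")
writeLines(c(
  "site,length_mm,width_mm",
  "A,12.5,3.1",
  "B,10.0,2.8"
), raw_file)

# Keep raw data raw: only ever read the raw file, never overwrite it
raw <- read.csv(raw_file, stringsAsFactors = FALSE)

# Structure data for analysis: one row per observation (site x variable)
tidy <- data.frame(
  site = rep(raw$site, times = 2),
  variable = rep(c("length_mm", "width_mm"), each = nrow(raw)),
  value = c(raw$length_mm, raw$width_mm)
)

# Store derived data in an open format, in a separate file
derived_file <- file.path(tempdir(), "derived_measurements.csv")
write.csv(tidy, derived_file, row.names = FALSE)
```

Because the derived table is generated by code, it can be deleted and recreated from the raw file at any time.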

Code

In most contexts of quantitative research data analysis can be expressed in the form of a computer script. Other researchers should be able to run our script to obtain the same statistical results and data visualizations.

Computational reproducibility is an important foundation of scientific progress, and it requires a new type of researcher.

Text

Text can be integrated with data and code to become more interlinked, transparent and didactically powerful.

This can precisely answer questions like “106 samples? Where is this number coming from?”. Hopefully not round(runif(1) * 200), but solid algorithms like number_of_samples_with_raikenburg_treatment().
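In R Markdown, numbers like this can be written as inline code: a backtick, the letter r, an R expression and a closing backtick. The following is a minimal sketch, assuming the knitr package is installed; the samples table is invented for illustration.

```r
library(knitr)

# Hypothetical sample table; in a real compendium this would be read from a data file
samples <- data.frame(id = 1:3, treatment = c("a", "b", "a"))

# A sentence with inline R code: the number is computed when the text is rendered
sentence <- "We analysed `r nrow(samples)` samples."

# knit() evaluates the inline code and returns the rendered text
rendered <- knit(text = sentence, quiet = TRUE)
print(rendered)
```

If the data changes, the number in the sentence changes with it the next time the document is rendered.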

In the future interactive documents may become more popular: e.g. Shiny Documents (https://github.com/nevrome/neiman1995)

What we will do today

Research compendia can be compiled in many different ways.

I will present one particular workflow based on our R package rrtools.

The following figure by Marwick 2017 illustrates one implementation of this workflow.

R for reproducible research

R scripting language

R is a scripting language and a framework for statistical data analysis. The way it is used for data analysis fits the research compendium concept well.

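library(magrittr)  # provides the pipe operator %>%
# a regular function: appends an activity to a character vector
eat_carrot <- function(x) {append(x, "eats a carrot")}
# a custom infix operator: functions named %...% are called as x %dance% y
`%dance%` <- function(x, y) {c(x, "dances with", y)}
horse1 <- "Betsy"; horse2 <- "Trapper"
horse1 %>% eat_carrot()  # same as eat_carrot(horse1)
horse1 %dance% horse2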

It is an easy-to-learn programming language with a huge community in science.
It is supported by an advanced working environment: RStudio
It allows for direct integration of code and text: RMarkdown
It offers an established data structure to manage code, data and text: the R package

If you have never worked with R, you can use swirl to learn the basics.

RStudio

RStudio is an integrated development environment (IDE) for R.

RStudio

Open RStudio.

Inspect the RStudio interface.

Open an R code file, add some code (1 + 1), select it and run it in the R console with Ctrl + Enter.

R Markdown

Markdown is a lightweight and easy-to-use markup language for styling your writing.

# Header 1
## Header 2

- Bulleted
- List

**Bold** and _Italic_ and `Code` text or [Link](url) and ![Image](src)

Here’s a cheatsheet that documents all basic Markdown features.

Text written in Markdown can easily be converted to more advanced layout systems like HTML, LaTeX or MS Word. This website is written in Markdown.

RMarkdown is an advanced implementation of Markdown that allows you to combine text and code. It adds the possibility to define chunks of code that run when the document is rendered.
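To illustrate how a code chunk is executed during rendering, here is a minimal sketch that assumes the knitr package is installed. The document content and file names are invented. It writes a tiny R Markdown file to a temporary directory and knits it to plain Markdown.

```r
library(knitr)

# Build a tiny R Markdown document as a character vector
fence <- strrep("`", 3)  # three backticks delimit a code chunk
rmd_lines <- c(
  "# A tiny report",
  "",
  "The chunk below runs when the document is knitted.",
  "",
  paste0(fence, "{r}"),
  "x <- c(2, 4, 6)",
  "mean(x)",
  fence
)

rmd_file <- file.path(tempdir(), "tiny_report.Rmd")
md_file <- file.path(tempdir(), "tiny_report.md")
writeLines(rmd_lines, rmd_file)

# knit() executes the chunk and writes Markdown that contains code and output
knit(rmd_file, output = md_file, quiet = TRUE)
md_lines <- readLines(md_file)
print(md_lines)
```

The Knit button in RStudio goes one step further: it calls rmarkdown::render(), which knits the document and then converts the resulting Markdown to HTML, PDF or Word with pandoc.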

R Markdown

Open an R Markdown file based on the default RStudio template, inspect it and render it with the Knit button.

R Packages

R packages are a core feature of R. They contain ready-to-use functions, documentation and example data in a standard structure.

There are currently more than 15,000 R packages, mostly written and maintained by volunteers, for all sorts of research questions and applications.

mypackage/
|
├── DESCRIPTION         # Package metadata
├── R/                  # R code
├── man/                # Function documentation
├── NAMESPACE           # Exported names
├── vignettes/          # Extended documentation
├── data/               # (Example) data
├── tests/              # Unit tests
├── src/                # Compiled language code
└── inst/               # Arbitrary, additional files

An example: https://github.com/nevrome/bleiglas

Here’s an excellent introduction if you want to create your own package.
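To get a feeling for this structure without installing anything, base R can generate a bare package skeleton with utils::package.skeleton(). This is only a sketch for exploration and not the way rrtools sets up a compendium; the function and directory names below are placeholders.

```r
# Write one function to a file; it becomes the R code of the package
code_file <- file.path(tempdir(), "eat_carrot.R")
writeLines('eat_carrot <- function(x) append(x, "eats a carrot")', code_file)

# Create the skeleton in a fresh temporary directory
parent_dir <- tempfile("pkgdemo_")
dir.create(parent_dir)
utils::package.skeleton(
  name = "mypackage",
  path = parent_dir,
  code_files = code_file
)

# List what was generated
pkg_files <- list.files(file.path(parent_dir, "mypackage"))
print(pkg_files)
```

The listing should include DESCRIPTION, NAMESPACE, R and man, which correspond to the first entries of the tree above.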

rrtools setup

rrtools

rrtools is a mighty wizard package written by Ben Marwick and colleagues that facilitates some of the steps of research compendium creation and maintenance. It has a lot of dependencies and it is strongly opinionated.

Create an R package

If rrtools is installed on your system you can start immediately to use it in R.

Run rrtools::use_compendium("~/test/mycompendium") to create a basic R package with the name mycompendium.

Inspect the command output in both the old and the new RStudio session.

Inspect the compendium file structure.

Edit the DESCRIPTION file (located in your project directory) to include some better metadata.

Going forward, remember to periodically update the Imports: section of the DESCRIPTION file with the names of the packages used in the code we write in R/ and in the Rmd document(s).
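One way to spot a missing entry is to compare the packages referenced with the pkg::function notation against the Imports field. The following base R sketch works on an invented DESCRIPTION file and invented code lines; it is an illustration, not a replacement for the checks that R CMD check performs.

```r
# Hypothetical DESCRIPTION file, written to a temporary location for this example
desc_file <- file.path(tempdir(), "DESCRIPTION")
writeLines(c(
  "Package: mycompendium",
  "Version: 0.0.0.9000",
  "Imports:",
  "    here,",
  "    stringdist"
), desc_file)

# Read the Imports field and split it into package names
imports_field <- read.dcf(desc_file, fields = "Imports")[1, 1]
imports <- trimws(strsplit(imports_field, ",")[[1]])

# Example code lines as they might appear in R/ or in paper.Rmd
code_lines <- c(
  'dist_matrix <- as.matrix(stringdist::stringdistmatrix(x, method = "lv"))',
  'names_vector <- readLines(here::here("analysis/data/raw_data/names.txt"))',
  'knitr::kable(dist_matrix)'
)

# Collect every package name that is used with the pkg::function notation
matches <- regmatches(
  code_lines,
  gregexpr("[A-Za-z][A-Za-z0-9.]*(?=::)", code_lines, perl = TRUE)
)
used <- unique(unlist(matches))

# Packages that are used but not declared in Imports
missing_imports <- setdiff(used, imports)
print(missing_imports)
```

In this made-up case knitr is used but not declared, so it would have to be added to Imports.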

Activate version control

~~Run usethis::use_git_config(user.name = "", user.email = "") to configure git.~~

Run usethis::use_git() to initiate git for this project (git init). Follow the instructions this function prints to the console.

Inspect the new Git tab in the top right RStudio panel.

Run usethis::browse_github_pat() to open the GitHub page where you can create a personal access token (in usethis 2.0.0 and later this function is deprecated in favour of usethis::create_github_token()). This access token is needed to control GitHub remotely. When you generate the token (click “Generate new token”), make sure the “repo” scope is included by checking the “repo” box. Don’t save this token in your project; keep it elsewhere.

Run usethis::use_github(protocol = "ssh", auth_token = "your token") to create a repository for your local project on GitHub.

Test your Git + GitHub setup by editing the DESCRIPTION file once more and pushing the result.

Establish the compendium file structure

Run rrtools::use_analysis() to create the basic files and directory structure and transform this R package into a research compendium.

Inspect the command line output of this function and the resulting file structure.

analysis/
|
├── paper/
│   ├── paper.Rmd       # this is the main document to edit
│   └── references.bib  # this contains the reference list information
│
├── figures/            # location of the figures produced by the Rmd
|
├── data/
│   ├── raw_data/       # data obtained from elsewhere
│   └── derived_data/   # data generated during the analysis
|
└── templates/
    ├── journal-of-archaeological-science.csl
    |                   # this sets the style of citations & reference list
    ├── template.docx   # used to style the output of the paper.Rmd
    └── template.Rmd
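The directory part of this layout can be reproduced with a few lines of base R, which may help when exploring the structure or adapting it to a project that does not use rrtools. This sketch only mimics the folders in a temporary location; it is not how rrtools::use_analysis() is implemented, and it does not create the template files.

```r
# Recreate the directory layout in a temporary location (illustration only)
compendium_root <- tempfile("mycompendium_")
analysis_dirs <- file.path(compendium_root, "analysis", c(
  "paper",
  "figures",
  "data/raw_data",
  "data/derived_data",
  "templates"
))
for (d in analysis_dirs) dir.create(d, recursive = TRUE)

# Show the resulting directories relative to the compendium root
created <- list.dirs(compendium_root, full.names = FALSE, recursive = TRUE)
print(created)
```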

Commit and push the setup.

Inspect the manuscript

Inspect the ./analysis/paper/paper.Rmd file.

Render it with the Knit button and inspect the resulting .docx file.

rrtools workflow

Collect some data

Create a file names.txt in ./analysis/data/raw_data that contains the first names of some of the people around you.

Paul
Hannah
Anne
Maxime
Clemens

Add an R function to analyse and plot this data

Create a file plot_name_dist_matrix.R in ./R to define an R function.

#' Plot a matrix of Levenshtein distances between the strings in the input vector
#'
#' @param x A character vector
#'
#' @return Nothing. Only called for the plot
#' @export
plot_name_dist_matrix <- function(x) {
  # calculate pairwise Levenshtein distances
  dist_matrix <- as.matrix(stringdist::stringdistmatrix(x, method = "lv"))
  # cell positions: image() spreads the cells evenly over [0, 1]
  pos <- seq(0, 1, length.out = length(x))
  # plot the matrix as a heatmap and label both axes with the input strings
  graphics::image(dist_matrix, axes = FALSE)
  graphics::axis(1, at = pos, labels = x)
  graphics::axis(2, at = pos, labels = x)
  # write each distance value into its cell
  graphics::text(expand.grid(pos, pos), labels = dist_matrix)
}
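To look at such a distance matrix without plotting it, base R's utils::adist() can serve as a stand-in: it computes the generalized Levenshtein distance, which with its default costs is the plain edit distance. This is a sketch for exploration, not the method used in the function above, which relies on the stringdist package.

```r
names_vector <- c("Paul", "Hannah", "Anne", "Maxime", "Clemens")

# Pairwise edit distances; adist() is part of the utils package shipped with R
dist_matrix <- utils::adist(names_vector)
dimnames(dist_matrix) <- list(names_vector, names_vector)
print(dist_matrix)
```

The matrix is symmetric with zeros on the diagonal, because every name has distance 0 to itself.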

Make this function available in the manuscript file

The lazy way

Add a call to a function at the beginning of the paper.Rmd file that simply loads every function in the package:

devtools::load_all()

The knights of R way

Build the package documentation (CTRL+Shift+D) and install the package (CTRL+Shift+B).

Incorporate data and code into the manuscript

Edit ./analysis/paper/paper.Rmd to include our analysis in an R code chunk with the chunk options {r, fig.width = 10, fig.height=10}.

names_vector <- readLines(here::here("analysis/data/raw_data/names.txt"))

mycompendium::plot_name_dist_matrix(names_vector)

Render paper.Rmd again to see the result.

rrtools advanced

README, Code of Conduct and Contribution

Beyond the bare code, an established open source software project in the 21st century should have at least the following three things:

A README.md file that gives a minimal description of what this project is about, how it can be used and who made it
A guide on how to contribute to this project: CONTRIBUTING.md
A code of conduct that defines which behaviour we expect from participants: CONDUCT.md

These documentation files are valuable for your research compendium as well.

Run rrtools::use_readme_rmd() to create these files and inspect them.

Render the README.Rmd file to a README.md file. Why is this intermediate step necessary?

Licensing

Code and data (!) in a research repository should come with a license that names the copyright holder and states what can and cannot legally be done with your intellectual property.

Sometimes it’s not easy to decide which of the established licenses fits your purpose best. There are websites that give you an overview, e.g. https://choosealicense.com, but real legal advice is always recommended.

Test one of the following functions and see what it does.

usethis::use_mit_license(name = "your name")
usethis::use_gpl3_license(name = "your name")
usethis::use_lgpl_license(name = "your name")
usethis::use_apl2_license(name = "your name")
usethis::use_cc0_license(name = "your name")
usethis::use_ccby_license(name = "your name")

CI/CD

To test your workflow on another, independent system or to outsource some document processing steps you can work with services that provide on-the-fly virtual machines.

rrtools::use_travis(docker = FALSE) creates a configuration file (.travis.yml) for the TravisCI service, which you can link directly to your GitHub repository.

An example: https://github.com/ISAAKiel/recexcavAAR
And another example: https://github.com/nevrome/neomod_textdev

Virtualisation

To go beyond CI and make our computational workflow completely independent of version changes in the software we use, we have to encapsulate it in a virtual environment that simulates a computer with exactly the right software.

A good solution for this is the Docker container system.

You can start to set this up with rrtools::use_dockerfile(), which creates a default configuration file (Dockerfile), but it requires some further considerations.

An example: https://github.com/nevrome/cultrans.bronzeageburials.article2019

Unit tests

If the code in your compendium becomes more and more complex and you maintain a significant number of functions in the R/ directory, it might be useful to establish unit tests.

Unit tests reduce the number of bugs, force you to structure your code in a better way and make it overall more robust.

usethis::use_testthat() adds the main components to get you started.

An example: https://github.com/ropensci/c14bazAAR
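As an illustration of what such a test can look like, here is a minimal sketch that assumes the testthat package is installed. The helper name_dist_matrix() is hypothetical and not part of the tutorial code: it returns a distance matrix instead of plotting it, which makes the result easy to check, and it uses base R's utils::adist() rather than stringdist.

```r
library(testthat)

# Hypothetical helper (not part of the tutorial code): returns the distance
# matrix instead of plotting it, which makes the result easy to test
name_dist_matrix <- function(x) {
  m <- utils::adist(x)
  dimnames(m) <- list(x, x)
  m
}

test_that("name_dist_matrix returns a symmetric matrix with a zero diagonal", {
  m <- name_dist_matrix(c("Paul", "Hannah", "Anne"))
  expect_equal(dim(m), c(3, 3))
  expect_true(isSymmetric(m))
  expect_true(all(diag(m) == 0))
})

test_that("identical and known string pairs give the expected distances", {
  expect_equal(name_dist_matrix(c("Anne", "Anne"))[1, 2], 0)
  expect_equal(name_dist_matrix(c("kitten", "sitting"))[1, 2], 3)
})
```

In a package, the helper would live in the R/ directory and the test_that() calls in a file under tests/testthat/, where they run together with all other tests of the package.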

Research Compendia with R
