cheatsheet.qmd

---
title: "Reproducible Research in R (and friends)"
subtitle: "Cheatsheet"
author: 
  - name: M. Rolland
    degrees: MSc
    orcid: 0000-0002-8076-5550
    email: matthieu.rolland@univ-grenoble-alpes.fr
date: "2024-06-08"
date-modified: today
format: html
editor: visual
bibliography: bib_mr.bib
nocite: |
  @marwick_packaging_2018
toc: true
number-sections: true
---

This cheatsheet provides essential guidelines and best practices for conducting reproducible research using R and related tools. It covers project organization, version control, data management, documentation, environment management, workflow automation, and more. For any suggestions or feedback, please feel free to email me.

## Project Organization

- Use a consistent folder structure:
  - `data/` - Analysis data files
  - `scripts/` - Analysis scripts
  - `outputs/` - Results (figures, tables)
  - `docs/` - Documentation and reports
- Use RStudio Projects to facilitate project management and environment isolation
- Maintain a project log to document progress, changes, and key decisions throughout the analysis
- Reference:
  - [The concept of research compendium](https://peerj.com/preprints/3192/)
  - [Using RStudio projects](https://support.posit.co/hc/en-us/articles/200526207-Using-RStudio-Projects)

## Version Control

- Use Git to track changes in scripts and documents
- Commit regularly with meaningful messages
- One repository per analysis 
- Make sure your `data/` folder is in the `.gitignore` file 
- Make sure there is no sensitive information in your code
- Reference:
  - [Happy Git with R](https://happygitwithr.com/)

## Data Management

- Store raw data in `data/raw/` and never modify it directly
- Produce a README describing the source data
- Use scripts to clean and process data, save the cleaned data in `data/processed/`
- Document each step of data cleaning
- Keep data cleaning separate from analysis
- Organize your data in a tidy format: each variable is a column, each observation is a row
- Reference:
  - [Principles of tidy data](https://www.jstatsoft.org/article/view/v059i10)

## Documentation

- Comment code extensively to explain steps and logic
- Create README files to explain project structure and instructions for running the analysis
- Document all functions clearly, including input parameters, output, and purpose
- Reference:
  - [Example README file](https://gricad-gitlab.univ-grenoble-alpes.fr/iab-env-epi/rolland_effects_2022)
  - [{roxygen2} for function documentation](https://cran.r-project.org/web/packages/roxygen2/vignettes/roxygen2.html)

## Environment Management

- Use `sessionInfo()` or `devtools::session_info()` to capture the R session information 
- Use `{renv}` to manage package versions
- Reference:
  - [Introduction to renv](https://rstudio.github.io/renv/articles/renv.html)
  - [Example on how to show the session info (scroll to bottom)](https://gricad-gitlab.univ-grenoble-alpes.fr/iab-env-epi/rolland_effects_2022)

## Workflow Automation

- Organize your analysis into a series of numbered and ordered scripts to create a clear and reproducible workflow (e.g., 01-data-cleaning.R, 02-data-analysis.R, 03-visualization.R)
- Create a master script (e.g., run_all.R) that sequentially runs each numbered script

OR

- Use Makefile or `{targets}` package to automate and document the workflow

- Reference:
  - [{targets} Package](https://books.ropensci.org/targets/)  
  - [Example Project using {targets}](https://mjrolland.github.io/ed-neuro-hpa/)  

## Analysis Scripts

- Break analysis into small, reusable functions
- Use meaningful and consistent naming conventions such as provided by the [Tidyverse Naming Conventions](https://style.tidyverse.org/syntax.html#object-names) for variables and functions and by [data carpentry](https://datacarpentry.org/rr-organization1/01-file-naming/index.html) for folders and files 
- Style your code according to standardized recommendations from the [Tidyverse Style Guide](https://style.tidyverse.org/)
- Reference:
  - [Embrace functional programming](https://tidyverse.tidyverse.org/articles/manifesto.html#embrace-functional-programming)
  - [Tidyverse Style Guide](https://style.tidyverse.org/)
  
## Computational reproducibility

- Set seeds to ensure reproducibility when using randomness in your analysis
- Document all warnings
- Reference: 
  - [Random number seed in R](https://makemeanalyst.com/r-programming/random-number-seed/)

## Reporting

- Use RMarkdown (.Rmd) or Quarto (.Qmd) files to combine code, results, and narrative for creating dynamic reports
- Reference:
  - [Quarto Documentation](https://quarto.org/)

## Validation

- Get your code reviewed prior to publication
- Reference:
  - [Code Review Practices](https://mtlynch.io/human-code-reviews-1/)

## Sharing Code And Data

- Use repositories like [GitHub](https://github.com/) or [GitLab](https://gitlab.com/) for sharing code
- Use repositories like [Zenodo](https://zenodo.org/) or [data.gouv.fr](https://www.data.gouv.fr/fr/) if you have data sets to share