Welcome to BCEM!
We’ve compiled this tutorial to share BCEM's reproducibility standards so that we can better document what we do, for the sake of our future selves, our collaborators, and ultimately, a world with better science. Even though reproducibility standards in the field of bioinformatics go well beyond the requirements established here (see a fully reproducible paper as an example), we have decided to take things slow and adopt steps that are manageable and realistic for researchers in our group. As we grow and develop our skills, we will move in that direction, keeping those examples as a North Star. For now, let's delve into the areas we're currently covering.
This tutorial covers the following aspects of research data management:
- Project management
- Data storage
- File structure
- Naming conventions
- Shared resource usage
- Script conventions
- Version control
Every member of the lab needs to set up a project folder in a cloud service of their choice (Google Drive, OneDrive, Dropbox), in agreement with all project collaborators. This folder serves an essential purpose: to store the most important documents related to the project, which allow all parties to understand how the project is developing. There is a suggested file structure and naming convention for all files in this folder (see the sections "File Structure" and "File Naming Conventions" below).
At first glance, you may wonder why this step is necessary, and why it is one of the first ones in the guide. After all, you may have joined us thinking about getting your hands dirty with data. But sometimes it is a good idea to slow down and consider where to look and what to look for. At first, it may seem that nothing is happening. In time, you'll realize this is a helpful roadmap, one that is meant to develop along with your understanding of your project and the hints you receive from your results.
The most important aspect of the mind map is that you identify the key components that will help you find answers to your objectives. Here are some guideline questions to aid that process:
- Data acquisition: How? Where? Is it an experimental project, or are you downloading the data from open databases? In either case, be very explicit on the sources of data.
- Data processing: How? Which tools?
- Data analysis: What are the expected results? Which techniques could help me reach my goals?
- Visualization: Which results are relevant? What are the resulting analyses showing?
- Final results: Where are these stored?
We suggest using a service such as diagrams.net to produce the actual map, but that is not binding; you may use any tool that gets you there. Ideally, at every step of the way, the mind map should be kept up to date in your project folder on the chosen cloud service.
Here's an example for a project based on experimental data collection:
Other examples, for projects based on data acquired from public databases, are Mind map 1 (a genomics project) and Mind map 2 (a viromes project).
This is a mandatory piece of documentation that accompanies all data sets used in the project and details the source and process of data acquisition and processing. We abide by the Minimum Information about a Genome Sequence (MIGS) standard, which is already adopted by specific repositories of genome sequence data such as the European Nucleotide Archive (ENA).
Examples of metadata files, based on experimental data collection and on data obtained from open-source databases, are in the metadata folder.
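As an illustration only, a minimal metadata record might look like the sketch below. The field names follow the general MIGS/MIxS style, but you should use the exact checklist required by the target repository (e.g., ENA), and everything in angle brackets is a placeholder:
field               value
project_name        <short, descriptive project name>
investigation_type  <e.g., metagenome>
collection_date     <YYYY-MM-DD>
geo_loc_name        <country: region>
lat_lon             <latitude longitude>
env_material        <material that was sampled>
seq_meth            <sequencing platform and chemistry>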
Raw data must be stored under our lab's ENA account immediately upon receipt. The guidelines for submission are as follows:
Our lab requires that any process of data acquisition and/or processing be properly documented so that the work is as transparent and reproducible as possible. There are two options for this documentation: an Electronic Lab Notebook (ELN), specifically RSpace (our lab has a centralized account with this provider to manage the projects of all members), or detailed digital logs kept in Markdown (.md) documents. Suggested tools to this end are Jupyter notebooks, Zettlr, and Typora; the platform does not matter as long as the result is a Markdown document.
These documents must (an example entry is sketched after this list):
- Have one document per flowchart (mind map) component
- Contain the following sections for each entry:
  - Date
  - Aim
  - Protocol followed
    - Command lines used, or the methodology followed in the lab
    - Third-party software (description of how it was used and under what parameters, including a link to the tutorial(s))
  - Results
    - Must include relevant tables, graphs, etc. (or links to where these are stored, in the case of large files)
    - Must be commented (interpretations of what has been found)
  - Indication of where the (intermediate) data was deposited (path, link)
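As a reference, an entry in one of these Markdown logs could be sketched as follows; the module name comes from the file structure suggested later in this guide, and everything in angle brackets is a placeholder:
# 02_Trimming log

## <YYYY-MM-DD>

### Aim
<one or two sentences describing the goal of this step>

### Protocol followed
<command lines used, or lab methodology; for third-party software, note how it was run, under which parameters, and link to the tutorial(s)>

### Results
<relevant tables and graphs, or links to where large files are stored, plus comments interpreting what was found>

### Data location
<path or link to where the (intermediate) data was deposited>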
Here are some notebooks which illustrate the previous guidelines:
- Example of notebook in metagenomics (experimental): Experimental notebook
- Example of notebook in genomics (computational): Computational notebook
This is the suggested file structure for the project folder (a command-line sketch for creating it is shown after the tree):
./Folder/
├── README.md
├── 01_Quality
├── 02_Trimming
├── 03_Quality_Trimming
├── 04_Assembly
├── 05_Results_Figures
├── 06_Results_Tables
└── 07_Manuscript
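If you prefer to create this skeleton from the command line, a minimal bash sketch like the one below would do it; the folder names are taken from the tree above, so adjust them to your project:
#!/usr/bin/env bash
# Create the suggested project skeleton shown above.
mkdir -p Folder/{01_Quality,02_Trimming,03_Quality_Trimming,04_Assembly,05_Results_Figures,06_Results_Tables,07_Manuscript}
touch Folder/README.md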
In some cases, this could be organized into folders representing each step of the project. In the case of data:
./Data/
├── README.md
├── reads_raw
│   └── <file_name>.fa
├── reads_quality
├── reads_clean
├── reads_mapping
├── genomes_quality
│   ├── CheckM
│   ├── Quast
│   └── Miga
├── genomes_annotation
├── genomes_orthologs
└── secretion_systems_shrimps
Or in the case of code and scripts:
./Scripts/
├── README.md
└── Bowtie2
    ├── log_files
    ├── output_files
    └── Scripts
        ├── <DataType_TypeProcessing>.sh
        ├── plasmid_quality.sh
        └── genomes_quality.sh
These are the conventions adopted by our lab to ensure, as much as possible, that anyone can understand what is contained in a file. A suggested file naming convention is:
<DataType_TypeProcessing>.sh
Also, there are numerous examples for projects based on experimental data collection, such as the established protocol in QIIME: http://qiime.org/documentation/index.html
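To make the pattern concrete, here are a few names that follow it; plasmid_quality.sh and genomes_quality.sh come from the suggested structure above, while reads_trimming.sh is only a hypothetical illustration:
plasmid_quality.sh   # DataType = plasmid, TypeProcessing = quality
genomes_quality.sh   # DataType = genomes, TypeProcessing = quality
reads_trimming.sh    # DataType = reads,   TypeProcessing = trimming (hypothetical)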
The minimum requirements for a script include:
- The name of the file must be consistent with the function implemented.
- Follow a consistent standard for mnemonics and notation (camel case):
  - For example: CamelCaseNotation
- Name
- Description
- Author
- Institution
- Contact email
- Date: when it was implemented
- Help (input, output): how to run it
- Requirements (dependencies) and their versions
There must be one README per project module, containing (a skeleton README is sketched after this list):
- Version
- Parameters
- Information needed beforehand
- Order in which the script should be run within the workflow
- Data structure (input and output)
- Dependencies (versions)
- Format: Markdown (e.g., written with Typora)
- Result graphs (if applicable)
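As a reference only, a module README covering those points might be sketched as follows; the module name comes from the suggested file structure, and everything in angle brackets is a placeholder:
# 04_Assembly

- Version: <version of the module or its scripts>
- Parameters: <parameters used, and why>
- Information needed beforehand: <inputs or results required from previous modules>
- Order in the workflow: <where this module sits and what must be run before it>
- Data structure: <expected input and produced output, with paths>
- Dependencies: <tools and versions>
- Results: <graphs or tables produced, if applicable>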
Here is an example of an ideal script that has an associated README file:
#!/usr/bin/bash
#SBATCH -p medium
#SBATCH -N 1
#SBATCH -n 1
#SBATCH --cpus-per-task=15
#SBATCH --mem=61440
#SBATCH --time=73:00:00
#SBATCH -o Outputs/trainRF.o%j
#SBATCH --mail-type=ALL
#SBATCH --mail-user=lc.camelo10@uniandes.edu.co
######### How to Run #########
#sbatch scripts/trainRF.sh /hpcfs/home/lc.camelo10/JovenInvestigador/Outputs/Files/Data_PairedRelationsTRAINToTetramers.tsv
######### Description #########
# Train the random forest model
# Written by Laura Carolina Camelo Valera at the Computational Biology and Microbial Ecology lab (BCEM)
# Institution: Los Andes University, Colombia
# Email: lc.camelo10@uniandes.edu.co
######### Parameters #########
#1 Train matrix, phage-bacteria pairs
module load R/3.5.1mro   # load the R version available on the cluster
date
echo "Predicting interactions ..."
Rscript ~/JovenInvestigador/RScripts/Train.R "$1"   # $1: training matrix of phage-bacteria pairs
echo "Model implemented"
date
There are also examples of ideal scripts in R and Python.
Note that this is not a requirement yet and is intended for advanced users only (link to a complete tutorial); a minimal command sketch follows the list:
- Each commit must be adequately described: consistent, without omitting information
- DO NOT commit incomplete or unstable versions of the script
- Teamwork: create as many work branches as needed, work on the corresponding branch, then merge and push
- Execute push only on the main work branch.
- Only push to the master branch once ...(?)
- The main folder will be the project folder, with a README and a workflow
- Within each project there are modules, and each module folder must contain a README file
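For those who choose to adopt it, here is a minimal sketch of the branch-based workflow described above; the branch name, file path, and commit message are only examples:
# Create and switch to your own work branch; never commit directly on the main branch.
git checkout -b genomes-quality

# Commit only complete, stable versions, with a consistent and complete message.
git add Scripts/Bowtie2/Scripts/genomes_quality.sh
git commit -m "Add genome quality script (CheckM, Quast, Miga outputs)"

# Merge the finished branch back into the main work branch and push only from there.
git checkout main
git pull origin main
git merge genomes-quality
git push origin main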