Contributing Guidelines
March 23, 2021
Thank you for your interest!
The SingleCellMultiModal package aims to provide single cell datasets
from several different technologies / modalities for benchmarking and
analysis. We currently provide data from scNMT, scM&T, seqFISH,
CITEseq, and other technologies. Contributions are very much welcome.
For a full list of available datasets, see here: Google Drive Sheet
In order to contribute, we generally require data in Rda or Rds format,
though we also support HDF5 and MTX formats. Aside from the usual
required metadata.csv documentation in the package, contributors are
required to add a name to the DataType column in the metadata table
that indicates the name of the contributed dataset. To illustrate,
here are some DataType names already in the package:
- mouse_gastrulation
- mouse_visual_cortex
- cord_blood
- pbmc_10x
- macrophage_differentiation
- mouse_embryo_8_cell
library(SingleCellMultiModal)
meta <- system.file("extdata", "metadata.csv",
package = "SingleCellMultiModal", mustWork = TRUE)
head(read.csv(meta))
#> DataType ResourceName
#> 1 mouse_gastrulation scnmt_acc_cgi.rda
#> 2 mouse_gastrulation scnmt_acc_CTCF.rda
#> 3 mouse_gastrulation scnmt_acc_DHS.rda
#> 4 mouse_gastrulation scnmt_acc_genebody.rda
#> 5 mouse_gastrulation scnmt_acc_p300.rda
#> 6 mouse_gastrulation scnmt_acc_promoter.rda
#> RDataPath DispatchClass
#> 1 SingleCellMultiModal/mouse_gastrulation/scnmt_acc_cgi.rda Rda
#> 2 SingleCellMultiModal/mouse_gastrulation/scnmt_acc_CTCF.rda Rda
#> 3 SingleCellMultiModal/mouse_gastrulation/scnmt_acc_DHS.rda Rda
#> 4 SingleCellMultiModal/mouse_gastrulation/scnmt_acc_genebody.rda Rda
#> 5 SingleCellMultiModal/mouse_gastrulation/scnmt_acc_p300.rda Rda
#> 6 SingleCellMultiModal/mouse_gastrulation/scnmt_acc_promoter.rda Rda
#> RDataClass Maintainer
#> 1 matrix Marcel Ramos <marcel.ramos@roswellpark.org>
#> 2 matrix Marcel Ramos <marcel.ramos@roswellpark.org>
#> 3 matrix Marcel Ramos <marcel.ramos@roswellpark.org>
#> 4 matrix Marcel Ramos <marcel.ramos@roswellpark.org>
#> 5 matrix Marcel Ramos <marcel.ramos@roswellpark.org>
#> 6 matrix Marcel Ramos <marcel.ramos@roswellpark.org>
#> DataProvider
#> 1 Dept. of Bioinformatics, The Babraham Institute, United Kingdom
#> 2 Dept. of Bioinformatics, The Babraham Institute, United Kingdom
#> 3 Dept. of Bioinformatics, The Babraham Institute, United Kingdom
#> 4 Dept. of Bioinformatics, The Babraham Institute, United Kingdom
#> 5 Dept. of Bioinformatics, The Babraham Institute, United Kingdom
#> 6 Dept. of Bioinformatics, The Babraham Institute, United Kingdom
#> Coordinate_1_based TaxonomyId Species SourceVersion
#> 1 NA 10090 Mus musculus 1.0.0
#> 2 NA 10090 Mus musculus 1.0.0
#> 3 NA 10090 Mus musculus 1.0.0
#> 4 NA 10090 Mus musculus 1.0.0
#> 5 NA 10090 Mus musculus 1.0.0
#> 6 NA 10090 Mus musculus 1.0.0
#> SourceUrl SourceType Genome
#> 1 https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ RDS NA
#> 2 https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ RDS NA
#> 3 https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ RDS NA
#> 4 https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ RDS NA
#> 5 https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ RDS NA
#> 6 https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ RDS NA
#> BiocVersion
#> 1 3.12
#> 2 3.12
#> 3 3.12
#> 4 3.12
#> 5 3.12
#> 6 3.12
#> Description
#> 1 scnmt_acc_cgi data specific to the MOUSE_GASTRULATION project
#> 2 scnmt_acc_CTCF data specific to the MOUSE_GASTRULATION project
#> 3 scnmt_acc_DHS data specific to the MOUSE_GASTRULATION project
#> 4 scnmt_acc_genebody data specific to the MOUSE_GASTRULATION project
#> 5 scnmt_acc_p300 data specific to the MOUSE_GASTRULATION project
#> 6 scnmt_acc_promoter data specific to the MOUSE_GASTRULATION project
#> Title
#> 1 scnmt_acc_cgi
#> 2 scnmt_acc_CTCF
#> 3 scnmt_acc_DHS
#> 4 scnmt_acc_genebody
#> 5 scnmt_acc_p300
#> 6 scnmt_acc_promoter
We associate a version with all datasets. We start with version 1.0.0
using semantic versioning and include data in a corresponding version
folder (v1.0.0). Thus, the recommended folder structure is as follows:
~/data
  └ scmm/
    └ mouse_gastrulation/
      └ v1.0.0/
        └ scnmt_acc_cgi.rda
        └ scnmt_met_genebody.rda
        └ scnmt_met_cgi.rda
        └ scnmt_rna.rda
        └ scnmt_colData.rda
        └ scnmt_sampleMap.rda
In the inst section, we will discuss how to annotate these data
products.
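The folder layout above can be reproduced with base R. The following is a minimal sketch, assuming a temporary directory in place of ~/data and a small random matrix as a stand-in for real assay data. The helper name version_dir is hypothetical and is not part of the package.

```r
# Hypothetical helper (not part of SingleCellMultiModal): build the
# versioned folder path <base>/scmm/<DataType>/v<version>
version_dir <- function(base, DataType, version) {
    # numeric_version() throws an error on malformed version strings
    version <- as.character(numeric_version(version))
    file.path(base, "scmm", DataType, paste0("v", version))
}

# A temporary directory stands in for ~/data
base <- tempfile("data")
outdir <- version_dir(base, "mouse_gastrulation", "1.0.0")
dir.create(outdir, recursive = TRUE)

# Placeholder matrix standing in for a real assay (features x cells)
scnmt_rna <- matrix(
    rpois(20, lambda = 5), nrow = 4,
    dimnames = list(paste0("gene", 1:4), paste0("cell", 1:5))
)

# One .rda file per assay, named after the object it contains
save(scnmt_rna, file = file.path(outdir, "scnmt_rna.rda"))

# Show the resulting layout relative to the base folder
list.files(base, recursive = TRUE)
```

Validating the version string up front keeps folder names such as v1.0.0 consistent with the SourceVersion value recorded later in the metadata.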
It is customary to include one Rda / Rds file per assay or per assay
and region combination of interest (as above). We also highly recommend
including sampleMap and colData datasets for the MultiAssayExperiment
that will be built on the fly. In this example, there are three
modalities in the scNMT dataset: rna (transcriptome), acc (chromatin
accessibility), and met (methylation).
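As a rough illustration of what the sampleMap and colData datasets contain, the sketch below builds both with base R data.frames. The cell names, the group column, and the choice of three assays are made up for the example. The three sampleMap columns (assay, primary, colname) follow the layout that MultiAssayExperiment expects.

```r
# Illustrative cell identifiers; real data would use the actual cell names
cells <- paste0("cell", 1:3)

# colData: one row per cell, row names are the primary identifiers
# (the "group" column is a made-up annotation)
scnmt_colData <- data.frame(
    group = c("group_a", "group_b", "group_c"),
    row.names = cells
)

# sampleMap: links each assay column (colname) to a row of colData (primary)
assays <- c("rna", "acc_cgi", "met_cgi")
scnmt_sampleMap <- data.frame(
    assay = factor(rep(assays, each = length(cells)), levels = assays),
    primary = rep(cells, times = length(assays)),
    colname = rep(cells, times = length(assays)),
    stringsAsFactors = FALSE
)

# Every primary identifier should be present in colData
stopifnot(all(scnmt_sampleMap$primary %in% rownames(scnmt_colData)))

head(scnmt_sampleMap, 4)
```

Here the column names of every assay equal the primary identifiers. When they differ, only the colname column changes.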
Contributors are required to demonstrate user-level functionality via examples in a vignette for each contributed dataset.
Ideally, the interface for the contributed dataset should be similar to
that of scNMT so that users have a sense of consistency in the usage
of the package. This means having one main function that returns a
MultiAssayExperiment object and having options that show the user what
datasets are available for a particular technology. Contributors should
use roxygen2 for documenting datasets and use the @inheritParams scNMT
tag to avoid copying @param documentation.
See the current example for implementation details:
scNMT(
DataType = "mouse_gastrulation",
mode = "*",
version = "1.0.0",
dry.run = TRUE
)
#> snapshotDate(): 2021-01-20
#> ah_id mode file_size rdataclass rdatadateadded rdatadateremoved
#> 1 EH3738 acc_cgi 7 Mb matrix 2020-09-03 <NA>
#> 2 EH3739 acc_CTCF 1.2 Mb matrix 2020-09-03 <NA>
#> 3 EH3740 acc_DHS 0.3 Mb matrix 2020-09-03 <NA>
#> 4 EH3741 acc_genebody 49.6 Mb matrix 2020-09-03 <NA>
#> 5 EH3742 acc_p300 0.2 Mb matrix 2020-09-03 <NA>
#> 6 EH3743 acc_promoter 27.2 Mb matrix 2020-09-03 <NA>
#> 7 EH3745 met_cgi 4.6 Mb matrix 2020-09-03 <NA>
#> 8 EH3746 met_CTCF 0.1 Mb matrix 2020-09-03 <NA>
#> 9 EH3747 met_DHS 0.1 Mb matrix 2020-09-03 <NA>
#> 10 EH3748 met_genebody 26.8 Mb matrix 2020-09-03 <NA>
#> 11 EH3749 met_p300 0.1 Mb matrix 2020-09-03 <NA>
#> 12 EH3750 met_promoter 11.5 Mb matrix 2020-09-03 <NA>
#> 13 EH3751 rna 18.6 Mb matrix 2020-09-03 <NA>
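A skeleton for a new accessor function might look like the sketch below. The function name myTech, the DataType value, and the list of modes are hypothetical. The argument names mirror the scNMT call shown above. Treating mode as a glob pattern is an assumption about how "*" is interpreted, and the body is only a placeholder for the real data retrieval.

```r
#' Hypothetical accessor for a contributed technology
#'
#' @description `myTech` is a placeholder name used for illustration only;
#'   it is not part of SingleCellMultiModal.
#'
#' @inheritParams scNMT
#'
#' @return In this sketch, a character vector of the selected modes; a
#'   real accessor would return a `MultiAssayExperiment`
#'
#' @examples
#' myTech(DataType = "my_tissue", mode = "*", dry.run = TRUE)
#'
#' @export
myTech <- function(
    DataType = "my_tissue", mode = "*", version = "1.0.0", dry.run = TRUE
) {
    # made-up modes; a real function would look these up in the metadata
    available <- c("rna", "protein")
    # translate the glob in 'mode' into a regular expression
    selected <- grep(utils::glob2rx(mode), available, value = TRUE)
    if (dry.run)
        return(selected)
    stop("data retrieval is not implemented in this sketch")
}

myTech(mode = "r*")
```

Because of the @inheritParams tag, the four arguments pick up their descriptions from the scNMT documentation when roxygen2 builds the help pages.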
Note. Contributors should ensure that the documentation is complete and the proper data sources have been attributed.
In the following section we will describe how to annotate and append to
the metadata.csv file. First, we have to ensure that we are accounting
for all of the fields required by ExperimentHub. They are listed here:
- ResourceName
- Title
- Description
- BiocVersion
- Genome
- SourceType
- SourceUrl
- SourceVersion
- Species
- TaxonomyId
- Coordinate_1_based
- DataProvider
- Maintainer
- RDataPath
- RDataClass
- DispatchClass
- DataType (see the note below)
Note. DataType is a field we’ve added to help distinguish multimodal
technologies and is required for SingleCellMultiModal. Some of the
DataTypes already available are mouse_gastrulation,
mouse_visual_cortex, cord_blood, peripheral_blood, etc.
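Before appending to the metadata, it can help to confirm that no required field is missing. The following sketch uses base R only. The helper missing_fields and the deliberately incomplete example table are hypothetical.

```r
# Fields listed above, including the package-specific DataType column
required <- c(
    "ResourceName", "Title", "Description", "BiocVersion", "Genome",
    "SourceType", "SourceUrl", "SourceVersion", "Species", "TaxonomyId",
    "Coordinate_1_based", "DataProvider", "Maintainer", "RDataPath",
    "RDataClass", "DispatchClass", "DataType"
)

# Hypothetical helper: return the required fields absent from a data.frame
missing_fields <- function(metadata) {
    setdiff(required, names(metadata))
}

# A deliberately incomplete table for illustration
partial <- data.frame(
    ResourceName = "scnmt_rna.rda",
    Title = "scnmt_rna",
    DataType = "mouse_gastrulation",
    stringsAsFactors = FALSE
)

missing_fields(partial)
```

An empty result means every required column is present. It does not check that the values themselves are valid.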
To make contributing easier, we’ve provided a mechanism for
documentation based on a data.frame saved to a file that we call a
doc_file. Interested contributors should create a doc_file in the
inst/extdata/docuData folder. Although we do not have a strict naming
convention for the doc_file, we usually name the file
singlecellmultimodalvX.csv where X is the nth dataset added to the
package.
Here is an example of the file from version v1.0.0 of the scNMT
dataset:
doc_file <- system.file("extdata", "docuData", "singlecellmultimodalv1.csv",
package = "SingleCellMultiModal", mustWork = TRUE)
read.csv(doc_file, header = TRUE)
#> DataProvider TaxonomyId
#> 1 Dept. of Bioinformatics, The Babraham Institute, United Kingdom 10090
#> Species SourceUrl
#> 1 Mus musculus https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ
#> SourceType SourceVersion DataType
#> 1 RDS 1.0.0 mouse_gastrulation
#> Maintainer
#> 1 Marcel Ramos <marcel.ramos@roswellpark.org>
Contributors will then use their doc_file to append to the existing
metadata.csv.
To create a doc_file data.frame with the file name
singlecellmultimodalvX.csv, first we create a data.frame object.
Each general annotation or row in this data.frame will be applied to
all files uploaded to ExperimentHub. We take advantage of the
data.frame function to repeat data and create a uniform data.frame
with equal values across the columns.
scmeta <- data.frame(
DataProvider =
"Dept. of Bioinformatics, The Babraham Institute, United Kingdom",
TaxonomyId = "10090",
Species = "Mus musculus",
SourceUrl = "https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ",
SourceType = "RDS",
SourceVersion = "1.0.0",
DataType = "mouse_gastrulation",
Maintainer = "Ricard Argelaguet <ricard@ebi.ac.uk>",
stringsAsFactors = FALSE
)
scmeta
#> DataProvider TaxonomyId
#> 1 Dept. of Bioinformatics, The Babraham Institute, United Kingdom 10090
#> Species SourceUrl
#> 1 Mus musculus https://cloudstor.aarnet.edu.au/plus/s/Xzf5vCgAEUVgbfQ
#> SourceType SourceVersion DataType
#> 1 RDS 1.0.0 mouse_gastrulation
#> Maintainer
#> 1 Ricard Argelaguet <ricard@ebi.ac.uk>
After creating the documentation data.frame (doc_file), the
contributor can save that dataset as a .csv file using write.csv.
write.csv(
scmeta,
file = "inst/extdata/docuData/singlecellmultimodal.csv",
row.names = FALSE
)
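It can be worth reading the saved file back to confirm that it round-trips as intended. The sketch below assumes a temporary file in place of the inst/extdata/docuData location and a shortened table with only a few of the columns. It also shows that read.csv guesses column types by default, so an identifier stored as text can come back as an integer.

```r
# Same kind of one-row table as above, with fewer columns for brevity
doc <- data.frame(
    TaxonomyId = "10090",
    Species = "Mus musculus",
    SourceVersion = "1.0.0",
    DataType = "mouse_gastrulation",
    stringsAsFactors = FALSE
)

# A temporary file stands in for inst/extdata/docuData/
csv <- tempfile(fileext = ".csv")
write.csv(doc, file = csv, row.names = FALSE)

# By default read.csv() guesses column types, so TaxonomyId becomes integer
class(read.csv(csv)$TaxonomyId)

# Reading every column as character reproduces the original table
back <- read.csv(csv, colClasses = "character")
all.equal(back, doc)
```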
In the case that the contributed data is not uniform, meaning that there
are multiple file types from potentially different specimens, the
data.frame will have to account for all contributed data files.
For example, if the contributed data has a number of different source
types, the contributor is required to create a data.frame with the
number of rows equal to the number of files to be uploaded.
In this example, we have two data files from different source types and formats:
data.frame(
DataProvider =
c("Institute of Population Genetics", "Mouse Science Center"),
TaxonomyId = c("9606", "10090"),
Species = c("Homo sapiens", "Mus musculus"),
SourceUrl = c("https://human.science/org", "https://mouse.science/gov"),
SourceType = c("RDS", "XML"),
DataType = c("human_genetics", "mouse_genetics"),
stringsAsFactors = FALSE
)
#> DataProvider TaxonomyId Species
#> 1 Institute of Population Genetics 9606 Homo sapiens
#> 2 Mouse Science Center 10090 Mus musculus
#> SourceUrl SourceType DataType
#> 1 https://human.science/org RDS human_genetics
#> 2 https://mouse.science/gov XML mouse_genetics
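A quick check that the table has one row per file can catch mismatches early. In the sketch below, the file names and the maintainer are placeholders. It also shows that a length-one value is recycled across rows, so only the fields that differ between files need to be spelled out as vectors.

```r
# Hypothetical file names, one per data product to be uploaded
files <- c("human_assay.rds", "mouse_assay.xml")

# Length-one values (here, Maintainer) are recycled across all rows;
# the maintainer shown is a placeholder
doc <- data.frame(
    SourceType = c("RDS", "XML"),
    DataType = c("human_genetics", "mouse_genetics"),
    Maintainer = "Jane Doe <jane.doe@example.org>",
    stringsAsFactors = FALSE
)

# One row of annotation per contributed file
stopifnot(nrow(doc) == length(files))

doc$Maintainer
```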
The individual data products that will eventually come together into a
MultiAssayExperiment can be uploaded as serialized RDA / RDS files,
HDF5, and even MTX files. For examples of how to save data into their
respective file formats, see the make-data folder.
Based on the folder structure described previously, the directory
argument in make_metadata will correspond to the ~/data/scmm folder.
The dataDirs argument will correspond to the DataType / technology
subfolder (e.g., “mouse_gastrulation”). These will be used as inputs to
the make_metadata function.
Once the data is ready, the user can use the function in
make-metadata.R in the scripts folder. A typical call to make_metadata
will either add to the metadata or replace it entirely. The easiest
approach for current contributors is to append rows to the metadata
file.
make_metadata(
directory = "~/data/scmm",
dataDirs = "mouse_gastrulation", # also the name of the DataType
ext_pattern = "\\.[Rr][Dd][Aa]$",
doc_file = "inst/extdata/docuData/singlecellmultimodalv1.csv",
pkg_name = "SingleCellMultiModal",
append = TRUE,
dry.run = TRUE
)
Note that the extraction pattern (ext_pattern) will allow contributors
to match a specific file extension in that folder and ignore any
intermediate files.
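To see which files a given pattern would select, the regular expression can be tried out beforehand. The sketch below uses base R's list.files on a temporary folder with made-up file names. Whether make_metadata lists files in exactly this way internally is an assumption.

```r
# A temporary folder standing in for the dataset subfolder
datadir <- tempfile("mouse_gastrulation")
dir.create(datadir)

# Made-up file names: two Rda data products and two files to be ignored
invisible(file.create(file.path(
    datadir,
    c("scnmt_rna.rda", "scnmt_acc_cgi.RDA", "scnmt_rna.rds", "notes.rda.bak")
)))

# The same pattern as in the make_metadata() call above
ext_pattern <- "\\.[Rr][Dd][Aa]$"

# Only names ending in .rda (in any letter case) are returned
list.files(datadir, pattern = ext_pattern)
```

The trailing $ anchors the match at the end of the file name, which is why notes.rda.bak is skipped, and the character classes make the match case-insensitive.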
The contributor may also wish to run the function with dry.run = TRUE
to see the output data.frame to be added to the metadata.csv file.
Note. The make_metadata function should be run from the base package
directory from a GitHub / git checkout (git clone ...).
It is recommended to run the metadata validation function from
AnnotationHubData:
AnnotationHubData::makeAnnotationHubMetadata("SingleCellMultiModal")
to ensure that some of the metadata fields are properly annotated.
Contributors should update the NEWS.md file with a mention of the
function and data that are being provided. See the NEWS.md for
examples.
The contributor should then create a Pull Request on GitHub.
If you are interested in contributing, I can help you go over the contribution and submission. Please contact me either on the Bioc-community Slack (mramos148) or at marcel {dot} ramos [at] sph (dot) cuny (dot) edu. If you need to sign up to the community Slack channel, follow this link: https://bioc-community.herokuapp.com/
sessionInfo
#> R Under development (unstable) (2020-12-12 r79621)
#> Platform: x86_64-pc-linux-gnu (64-bit)
#> Running under: Ubuntu 20.10
#>
#> Matrix products: default
#> BLAS: /usr/lib/x86_64-linux-gnu/blas/libblas.so.3.9.0
#> LAPACK: /usr/lib/x86_64-linux-gnu/lapack/liblapack.so.3.9.0
#>
#> locale:
#> [1] LC_CTYPE=en_US.UTF-8 LC_NUMERIC=C
#> [3] LC_TIME=en_US.UTF-8 LC_COLLATE=en_US.UTF-8
#> [5] LC_MONETARY=en_US.UTF-8 LC_MESSAGES=en_US.UTF-8
#> [7] LC_PAPER=en_US.UTF-8 LC_NAME=C
#> [9] LC_ADDRESS=C LC_TELEPHONE=C
#> [11] LC_MEASUREMENT=en_US.UTF-8 LC_IDENTIFICATION=C
#>
#> attached base packages:
#> [1] stats4 parallel stats graphics grDevices utils datasets
#> [8] methods base
#>
#> other attached packages:
#> [1] SingleCellMultiModal_1.3.9 MultiAssayExperiment_1.17.6
#> [3] SummarizedExperiment_1.21.1 DelayedArray_0.17.7
#> [5] MatrixGenerics_1.3.0 matrixStats_0.57.0
#> [7] Matrix_1.3-2 Biobase_2.51.0
#> [9] GenomicRanges_1.43.3 GenomeInfoDb_1.27.3
#> [11] IRanges_2.25.6 S4Vectors_0.29.6
#> [13] BiocGenerics_0.37.0
#>
#> loaded via a namespace (and not attached):
#> [1] Rcpp_1.0.6 lattice_0.20-41
#> [3] SingleCellExperiment_1.13.3 assertthat_0.2.1
#> [5] digest_0.6.27 mime_0.9
#> [7] BiocFileCache_1.15.1 R6_2.5.0
#> [9] RSQLite_2.2.2 evaluate_0.14
#> [11] httr_1.4.2 pillar_1.4.7
#> [13] zlibbioc_1.37.0 rlang_0.4.10
#> [15] curl_4.3 blob_1.2.1
#> [17] rmarkdown_2.6 AnnotationHub_2.23.0
#> [19] stringr_1.4.0 RCurl_1.98-1.2
#> [21] bit_4.0.4 shiny_1.5.0
#> [23] compiler_4.1.0 httpuv_1.5.5
#> [25] xfun_0.20 pkgconfig_2.0.3
#> [27] htmltools_0.5.1 tidyselect_1.1.0
#> [29] tibble_3.0.5 GenomeInfoDbData_1.2.4
#> [31] interactiveDisplayBase_1.29.0 codetools_0.2-18
#> [33] withr_2.4.0 crayon_1.3.4
#> [35] dplyr_1.0.3 dbplyr_2.0.0
#> [37] later_1.1.0.1 bitops_1.0-6
#> [39] rappdirs_0.3.1 grid_4.1.0
#> [41] xtable_1.8-4 lifecycle_0.2.0
#> [43] DBI_1.1.1 magrittr_2.0.1
#> [45] stringi_1.5.3 XVector_0.31.1
#> [47] promises_1.1.1 SpatialExperiment_1.1.0
#> [49] ellipsis_0.3.1 filelock_1.0.2
#> [51] generics_0.1.0 vctrs_0.3.6
#> [53] tools_4.1.0 bit64_4.0.5
#> [55] glue_1.4.2 purrr_0.3.4
#> [57] BiocVersion_3.13.1 fastmap_1.0.1
#> [59] yaml_2.2.1 AnnotationDbi_1.53.0
#> [61] ExperimentHub_1.17.0 BiocManager_1.30.10
#> [63] memoise_1.1.0 knitr_1.30