diff --git a/.github/workflows/check-bioc.yml b/.github/workflows/check-bioc.yml index 979b48c..aef96a7 100644 --- a/.github/workflows/check-bioc.yml +++ b/.github/workflows/check-bioc.yml @@ -22,7 +22,8 @@ on: push: - pull_request: + paths-ignore: + - 'README.md' name: R-CMD-check-bioc diff --git a/DESCRIPTION b/DESCRIPTION index fde1bbc..1d2b161 100644 --- a/DESCRIPTION +++ b/DESCRIPTION @@ -1,6 +1,6 @@ Package: MsBackendMetaboLights Title: Retrieve Mass Spectrometry Data from MetaboLights -Version: 0.0.1 +Version: 0.0.3 Authors@R: c(person(given = "Johannes", family = "Rainer", email = "Johannes.Rainer@eurac.edu", @@ -21,13 +21,20 @@ Description: MetaboLights is one of the main public repositories for storage Depends: R (>= 4.2.0) Imports: - curl + curl, + ProtGenerics, + Spectra, + BiocFileCache, + methods Suggests: testthat, rmarkdown, - mzR + mzR, + knitr, + BiocStyle License: Artistic-2.0 Encoding: UTF-8 +VignetteBuilder: knitr BugReports: https://github.com/RforMassSpectrometry/MsBackendMetaboLights/issues URL: https://github.com/RforMassSpectrometry/MsBackendMetaboLights biocViews: Infrastructure, MassSpectrometry, Metabolomics, DataImport, Proteomics diff --git a/NAMESPACE b/NAMESPACE index 759f9fd..d74447e 100644 --- a/NAMESPACE +++ b/NAMESPACE @@ -1,8 +1,23 @@ # Generated by roxygen2: do not edit by hand +export(MsBackendMetaboLights) export(mtbls_ftp_path) export(mtbls_list_files) +exportClasses(MsBackendMetaboLights) +exportMethods(backendInitialize) +exportMethods(backendMerge) +importClassesFrom(Spectra,MsBackendMzR) +importFrom(BiocFileCache,BiocFileCache) +importFrom(ProtGenerics,backendMerge) importFrom(curl,curl) importFrom(curl,handle_setopt) importFrom(curl,new_handle) +importFrom(methods,callNextMethod) +importFrom(methods,new) importFrom(utils,read.table) +importMethodsFrom(BiocFileCache,"bfcmeta<-") +importMethodsFrom(BiocFileCache,bfcmetalist) +importMethodsFrom(BiocFileCache,bfcquery) +importMethodsFrom(BiocFileCache,bfcrpath) 
+importMethodsFrom(ProtGenerics,backendInitialize)
+importMethodsFrom(ProtGenerics,dataOrigin)
diff --git a/NEWS.md b/NEWS.md
index 1da36bc..440d205 100644
--- a/NEWS.md
+++ b/NEWS.md
@@ -1,5 +1,14 @@
 # MsBackendMetabolights 0.0
 
+## Changes in 0.0.3
+
+- Add vignette and `backendMerge,MsBackendMetaboLights` function.
+
+## Changes in 0.0.2
+
+- Add `MsBackendMetaboLights` class, constructor and `backendInitialize()`
+  method.
+
 ## Changes in 0.0.1
 
 - Add utility functions to retrieve information from MetaboLights.
diff --git a/R/MetaboLights.R b/R/MetaboLights.R
deleted file mode 100644
index 90c8c0b..0000000
--- a/R/MetaboLights.R
+++ /dev/null
@@ -1,156 +0,0 @@
-################################################################################
-## Utility functions for MetaboLights
-##
-################################################################################
-
-#' @title Utility functions for the MetaboLights repository
-#'
-#' @name MetaboLights-utils
-#'
-#' @description
-#'
-#' [MetaboLights](https://www.ebi.ac.uk/metabolights/) is one of the main
-#' public repositories for deposition of metabolomics experiments including
-#' (raw) mass spectrometry (MS) and NMR data files and experimental/analysis
-#' results. The experimental metadata and results are stored as plain text
-#' files in ISA-tab format. Each MetaboLights experiment must provide a
-#' file describing the samples analyzed and at least one *assay* file that
-#' links between the experimental samples and the (raw and processed) data
-#' files with quantification of metabolites/features in these samples.
-#'
-#' Each experiment in MetaboLights is identified with its unique identifier,
-#' starting with *MTBLS* followed by a number. The data (metadata files and
-#' MS/NMR data files) of an experiment are available through the repository's
-#' ftp server.
-#'
-#' The functions listed here allow to query and retrieve information of an
-#' data set/experiment from MetaboLights.
-#' -#' - `mtbls_ftp_path`: returns the FTP path for a provided MetaboLights ID. -#' With `mustWork = TRUE` (the default) the function throws an error if -#' the path is not accessible (either because the data set does not exist or -#' no internet connection is available). The function returns a -#' `character(1)` with the FTP path to the data set folder. -#' -#' - `mtbls_list_files`: returns the available files (and directories) for the -#' specified MetaboLights data set (i.e. the FTP directory content of the -#' data set). The function returns a `character` vector with the relative -#' file names to the absolute FTP path (`mtbls_ftp_path()`) of the data set. -#' Parameter `pattern` allows to filter which file names should be returned. -#' -#' @param x `character(1)` with the ID of the MetaboLights data set (usually -#' starting with a *MTBLS* followed by a number). -#' -#' @param mustWork for `mtbls_ftp_path()`: `logical(1)` whether the validity of -#' the path should be verified or not. By default (with `mustWork = TRUE`) -#' the function throws an error if either the data set does not exist or -#' if the folder can not be accessed (e.g. if no internet connection is -#' available). -#' -#' @param pattern for `mtbls_list_files()`: `character(1)` defining a pattern -#' to filter the file names, such as `pattern = "^a_"` to retrieve the -#' file names of all assay files of the data set. This parameter is -#' passed to the [grepl()] function. -#' -#' @author Johannes Rainer, Philippine Louail -#' -#' @examples -#' -#' ## Get the FTP path to the data set MTBLS2 -#' mtbls_ftp_path("MTBLS2") -#' -#' ## Retrieve available files (and directories) for the data set MTBLS2 -#' mtbls_list_files("MTBLS2") -#' -#' ## Retrieve the available assay files (file names starting with "a_"). 
-#' afiles <- mtbls_list_files("MTBLS2", pattern = "^a_") -#' afiles -#' -#' ## Read the content of one file -#' a <- read.table(paste0(mtbls_ftp_path("MTBLS2"), afiles[1L]), -#' header = TRUE, sep = "\t", check.names = FALSE) -#' head(a) -NULL - -#' @rdname MetaboLights-utils -#' -#' @export -mtbls_ftp_path <- function(x = character(), mustWork = TRUE) { - if (length(x) != 1L) - stop("'x' has to be a single ID.") - res <- paste0("ftp://ftp.ebi.ac.uk/pub/databases/metabolights/", - "studies/public/", x, "/") - if (mustWork) - mtbls_list_files(x) - res -} - -#' @importFrom curl new_handle handle_setopt curl -#' -#' @rdname MetaboLights-utils -#' -#' @export -mtbls_list_files <- function(x = character(), pattern = NULL) { - cu <- new_handle() - handle_setopt(cu, ftp_use_epsv = TRUE, dirlistonly = TRUE) - tryCatch({ - con <- curl(url = mtbls_ftp_path(x, mustWork = FALSE), "r", handle = cu) - }, error = function(e) { - stop("Failed to connect to MetaboLights. No internet connection? ", - "Does the data set \"", x, "\" exist?\n - ", e$message, - call. = FALSE) - }) - fls <- readLines(con) - close(con) - if (length(pattern)) - fls[grepl(pattern, fls)] - else fls -} - -#' retrieves and reads the/all assay data file(s) for a given MetaboLights -#' data set and returns it/them as a `list` of `data.frame`s. The file -#' names of the respective assay data file(s) are reported as `names()` of -#' the returned `list`. -#' -#' @param x `character(1)` with the MetaboLights ID of the data set. -#' -#' @importFrom utils read.table -#' -#' @noRd -.mtbls_assay_list <- function(x = character()) { - fpath <- mtbls_ftp_path(x, mustWork = FALSE) - a_fls <- mtbls_list_files(x, pattern = "a_") - res <- lapply(a_fls, function(z) { - read.table(paste0(fpath, z), - sep = "\t", header = TRUE, - check.names = FALSE) - }) - names(res) <- a_fls - res -} - -#' Extract the MS data files from MTBLS assay tables' -#' *Derived Spectral Data File* column(s). 
The function checks all present -#' columns in the provided `data.frame` and returns the content of the first of -#' these columns with files matching the provided `pattern`. -#' -#' @param x `data.frame` representing the content of an *assay* ISA file from -#' MetaboLights. -#' -#' @param pattern `character(1)` with supported data types. -#' -#' @return `character` with the file names matching the provided pattern. -#' -#' @noRd -.mtbls_derived_data_file <- function(x, pattern = "mzML$|CDF$|mzXML$") { - cls <- which(colnames(x) == "Derived Spectral Data File") - res <- character() - for (i in cls) { - keep <- grepl(pattern, x[[i]]) - if (any(keep)) { - res <- x[[i]][keep] - break - } - } - res -} diff --git a/R/MsBackendMetaboLights.R b/R/MsBackendMetaboLights.R new file mode 100644 index 0000000..de8a967 --- /dev/null +++ b/R/MsBackendMetaboLights.R @@ -0,0 +1,417 @@ +#' @title MsBackend representing MS data from MetaboLights +#' +#' @name MsBackendMetaboLights +#' +#' @aliases MsBackendMetaboLights-class +#' +#' @description +#' +#' The `MsBackendMetaboLights` retrieves and represents mass spectrometry (MS) +#' data from metabolomics experiments stored in the +#' [MetaboLights](https://www.ebi.ac.uk/metabolights/) repository. The backend +#' directly extends the [MsBackendMzR] backend from the *Spectra* package and +#' hence supports MS data in mzML, netCDF and mzXML format. Upon initialization +#' with the `backendInitialize()` method, the `MsBackendMetaboLights` backend +#' downloads and caches the MS data files of an experiment locally avoiding +#' hence repeated download of the data. +#' +#' @section Initialization and loading of data: +#' +#' New instances of the class can be created with the `MsBackendMetaboLights()` +#' function. Data is loaded and initialized using the `backendInitialize()` +#' function with parameters `mtblsId`, `assayName` and `filePattern`. `mtblsId` +#' must be the ID of a **single** (existing) MetaboLights data set. 
Parameter
+#' `assayName` allows defining specific *assays* of the MetaboLights data set
+#' from which the data files should be loaded. If provided, it should be the
+#' file names of the respective assays in MetaboLights (use e.g.
+#' `mtbls_list_files(, pattern = "^a_")` to list all available
+#' assay files for a given MetaboLights ID ``). By default,
+#' with `assayName = character()`, MS data files from all assays of a data
+#' set are loaded. The optional parameter `filePattern` defines the pattern
+#' used to filter the file names. It defaults to the file
+#' endings of supported MS data files. By default, `backendInitialize()`
+#' requires an active internet connection, because the function first checks
+#' the remote file content to synchronize possible changes/updates. This can
+#' be skipped with `offline = TRUE`, in which case only locally cached content
+#' is considered.
+#'
+#' @param object an instance of `MsBackendMetaboLights`.
+#'
+#' @param mtblsId `character(1)` with the ID of the MetaboLights data
+#'     set/experiment.
+#'
+#' @param assayName `character` with the file names of assay files of the data
+#'     set. If not provided (`assayName = character()`, the default), MS data
+#'     files of all of the data set's assays are loaded. Use
+#'     `mtbls_list_files(, pattern = "^a_")` to list all
+#'     available assay files of a data set ``.
+#'
+#' @param filePattern `character` with the pattern defining the supported (or
+#'     requested) file types. Defaults to
+#'     `filePattern = "mzML$|CDF$|cdf$|mzXML$"` hence restricting to mzML,
+#'     CDF and mzXML files supported by *Spectra*'s `MsBackendMzR` backend.
+#'
+#' @param offline `logical(1)` whether only locally cached content should be
+#'     evaluated/loaded.
+#'
+#' @param ... additional parameters; currently ignored.
+#'
+#' @details
+#'
+#' Data files are by default extracted from the column `"Derived Spectral
+#' Data File"` of the MetaboLights data set's *assay* table.
If this column
+#' does not contain any supported file names, the assay's column
+#' `"Raw Spectral Data File"` is evaluated.
+#'
+#' The backend uses the
+#' [BiocFileCache](https://bioconductor.org/packages/BiocFileCache) package for
+#' caching of the data files. These are stored in the default local
+#' *BiocFileCache* cache along with additional metadata that includes the
+#' MetaboLights ID and the name of the assay file with which the data file is
+#' associated. Note that at present only MS data files in *mzML*, *CDF* and
+#' *mzXML* format are supported.
+#'
+#' The `MsBackendMetaboLights` backend defines and provides additional spectra
+#' variables `"mtbls_id"`, `"mtbls_assay_name"` and
+#' `"derived_spectral_data_file"` that list the MetaboLights ID, the name of
+#' the assay file and the original data file name on the MetaboLights ftp
+#' server for each individual spectrum. The `"derived_spectral_data_file"` can
+#' be used for the mapping between the samples of the experiment/data set and
+#' the individual data files and their spectra. This mapping is provided
+#' in the respective MetaboLights assay file.
+#'
+#' The `MsBackendMetaboLights` backend is considered *read-only* and thus does
+#' not support changing *m/z* and intensity values directly.
+#'
+#' Also, merging of MS data of `MsBackendMetaboLights` is not supported and
+#' thus `c()` of several `Spectra` with MS data represented by
+#' `MsBackendMetaboLights` will throw an error.
+#' +#' @importClassesFrom Spectra MsBackendMzR +#' +#' @exportClass MsBackendMetaboLights +#' +#' @author Philippine Louail, Johannes Rainer +#' +NULL + +setClass("MsBackendMetaboLights", + contains = "MsBackendMzR") + +#' @rdname MsBackendMetaboLights +#' +#' @importFrom methods new +#' +#' @export +MsBackendMetaboLights <- function() { + new("MsBackendMetaboLights") +} + +#' @rdname MsBackendMetaboLights +#' +#' @importMethodsFrom ProtGenerics backendInitialize +#' +#' @importMethodsFrom ProtGenerics dataOrigin +#' +#' @importFrom methods callNextMethod +#' +#' @exportMethod backendInitialize +setMethod( + "backendInitialize", "MsBackendMetaboLights", + function(object, mtblsId = character(), assayName = character(), + filePattern = "mzML$|CDF$|cdf$|mzXML$", offline = FALSE, ...) { + dots <- list(...) + if (any(names(dots) == "data")) + stop("Parameter 'data' is not supported for ", + "'MsBackendMetaboLights'. A 'MsBackendMetaboLights' object ", + "can only be instantiated using 'backendInitialize()'.") + if (length(mtblsId) != 1) + stop("Parameter 'mtblsId' is required and can only be a single ID ", + "of a MetaboLights data set.") + if (offline) + mdata <- .mtbls_data_files_offline(mtblsId, assayName, filePattern) + else mdata <- .mtbls_data_files(mtblsId, assayName, filePattern) + object <- callNextMethod(object, files = mdata$rpath) + idx <- match(dataOrigin(object), + normalizePath(mdata$rpath, mustWork = FALSE)) + object@spectraData$mtbls_id <- mdata$mtbls_id[idx] + object@spectraData$mtbls_assay_name <- mdata$mtbls_assay_name[idx] + object@spectraData$derived_spectral_data_file <- + mdata$derived_spectral_data_file[idx] + object + }) + +#' @rdname MsBackendMetaboLights +#' +#' @importFrom ProtGenerics backendMerge +#' +#' @exportMethod backendMerge +setMethod( + "backendMerge", "MsBackendMetaboLights", + function(object, ...) { + stop("Merging of backends of type 'MsBackendMetaboLights' is not ", + "supported. 
Use 'setBackend()' to change to a backend that ", + "supports merging, such as the 'MsBackendMemory'.") + }) + +################################################################################ +## Utility functions for MetaboLights +## +################################################################################ + +#' @title Utility functions for the MetaboLights repository +#' +#' @name MetaboLights-utils +#' +#' @description +#' +#' [MetaboLights](https://www.ebi.ac.uk/metabolights/) is one of the main +#' public repositories for deposition of metabolomics experiments including +#' (raw) mass spectrometry (MS) and NMR data files and experimental/analysis +#' results. The experimental metadata and results are stored as plain text +#' files in ISA-tab format. Each MetaboLights experiment must provide a +#' file describing the samples analyzed and at least one *assay* file that +#' links between the experimental samples and the (raw and processed) data +#' files with quantification of metabolites/features in these samples. +#' +#' Each experiment in MetaboLights is identified with its unique identifier, +#' starting with *MTBLS* followed by a number. The data (metadata files and +#' MS/NMR data files) of an experiment are available through the repository's +#' ftp server. +#' +#' The functions listed here allow to query and retrieve information of an +#' data set/experiment from MetaboLights. +#' +#' - `mtbls_ftp_path`: returns the FTP path for a provided MetaboLights ID. +#' With `mustWork = TRUE` (the default) the function throws an error if +#' the path is not accessible (either because the data set does not exist or +#' no internet connection is available). The function returns a +#' `character(1)` with the FTP path to the data set folder. +#' +#' - `mtbls_list_files`: returns the available files (and directories) for the +#' specified MetaboLights data set (i.e. the FTP directory content of the +#' data set). 
The function returns a `character` vector with the relative +#' file names to the absolute FTP path (`mtbls_ftp_path()`) of the data set. +#' Parameter `pattern` allows to filter which file names should be returned. +#' +#' @param x `character(1)` with the ID of the MetaboLights data set (usually +#' starting with a *MTBLS* followed by a number). +#' +#' @param mustWork for `mtbls_ftp_path()`: `logical(1)` whether the validity of +#' the path should be verified or not. By default (with `mustWork = TRUE`) +#' the function throws an error if either the data set does not exist or +#' if the folder can not be accessed (e.g. if no internet connection is +#' available). +#' +#' @param pattern for `mtbls_list_files()`: `character(1)` defining a pattern +#' to filter the file names, such as `pattern = "^a_"` to retrieve the +#' file names of all assay files of the data set. This parameter is +#' passed to the [grepl()] function. +#' +#' @author Johannes Rainer, Philippine Louail +#' +#' @examples +#' +#' ## Get the FTP path to the data set MTBLS2 +#' mtbls_ftp_path("MTBLS2") +#' +#' ## Retrieve available files (and directories) for the data set MTBLS2 +#' mtbls_list_files("MTBLS2") +#' +#' ## Retrieve the available assay files (file names starting with "a_"). 
+#' afiles <- mtbls_list_files("MTBLS2", pattern = "^a_") +#' afiles +#' +#' ## Read the content of one file +#' a <- read.table(paste0(mtbls_ftp_path("MTBLS2"), afiles[1L]), +#' header = TRUE, sep = "\t", check.names = FALSE) +#' head(a) +NULL + +#' @rdname MetaboLights-utils +#' +#' @export +mtbls_ftp_path <- function(x = character(), mustWork = TRUE) { + if (length(x) != 1L) + stop("'x' has to be a single ID.") + res <- paste0("ftp://ftp.ebi.ac.uk/pub/databases/metabolights/", + "studies/public/", x, "/") + if (mustWork) + mtbls_list_files(x) + res +} + +#' @importFrom curl new_handle handle_setopt curl +#' +#' @rdname MetaboLights-utils +#' +#' @export +mtbls_list_files <- function(x = character(), pattern = NULL) { + cu <- new_handle() + handle_setopt(cu, ftp_use_epsv = TRUE, dirlistonly = TRUE) + tryCatch({ + con <- curl(url = mtbls_ftp_path(x, mustWork = FALSE), "r", handle = cu) + }, error = function(e) { + stop("Failed to connect to MetaboLights. No internet connection? ", + "Does the data set \"", x, "\" exist?\n - ", e$message, + call. = FALSE) + }) + fls <- readLines(con) + close(con) + if (length(pattern)) + fls[grepl(pattern, fls)] + else fls +} + +#' retrieves and reads the/all assay data file(s) for a given MetaboLights +#' data set and returns it/them as a `list` of `data.frame`s. The file +#' names of the respective assay data file(s) are reported as `names()` of +#' the returned `list`. +#' +#' @param x `character(1)` with the MetaboLights ID of the data set. +#' +#' @importFrom utils read.table +#' +#' @noRd +.mtbls_assay_list <- function(x = character()) { + fpath <- mtbls_ftp_path(x, mustWork = FALSE) + a_fls <- mtbls_list_files(x, pattern = "^a_") + res <- lapply(a_fls, function(z) { + read.table(paste0(fpath, z), + sep = "\t", header = TRUE, + check.names = FALSE) + }) + names(res) <- a_fls + res +} + +#' Extract the MS data files from MTBLS assay tables' +#' *Derived Spectral Data File* column(s). 
The function checks all present +#' columns in the provided `data.frame` and returns the content of the first of +#' these columns with files matching the provided `pattern`. +#' +#' @param x `data.frame` representing the content of an *assay* ISA file from +#' MetaboLights. +#' +#' @param pattern `character(1)` with supported data types. +#' +#' @return `character` with the file names matching the provided pattern. +#' +#' @noRd +.mtbls_data_file_from_assay <- + function(x, pattern = "mzML$|CDF$|cdf$|mzXML$", + colname = "Derived Spectral Data File") { + cls <- which(colnames(x) == colname) + res <- character() + for (i in cls) { + keep <- grepl(pattern, x[[i]]) + if (any(keep)) { + res <- x[[i]][keep] + break + } + } + res + } + +################################################################################ +## +## File caching utils +## +################################################################################ + +#' Get information on data files for a given MTBLS ID/assay eventually +#' downloading and caching them. This function needs an active internet +#' connection as it queries the MTBLS ftp server for available data files +#' that are then cached. The function returns the **local** file names +#' **from the cache**. +#' +#' The function: +#' - retrieves all "Derived Data Files" for all assays (or for specified assays) +#' for one MetaboLights ID. +#' - uses BiocFileCache to cache these files, i.e. downloading them if they +#' are not yet cached. +#' - returns a `data.frame` with all information. +#' +#' This `data.frame` has one row per data file with columns: +#' - `"rid"`: the BiocFileCache ID of each file. 
+#' - `"mtbls_id"`: the MTBLS ID
+#' - `"mtbls_assay_name"`: the name of the assay file for each data file
+#' - `"derived_spectral_data_file"`: the name of the data file in the assay
+#'   file/table
+#' - `"rpath"`: the name of the cached data file (full local path)
+#'
+#' @importFrom BiocFileCache BiocFileCache
+#'
+#' @importMethodsFrom BiocFileCache bfcrpath bfcmeta<-
+#'
+#' @noRd
+.mtbls_data_files <- function(mtblsId = character(), assayName = character(),
+                              pattern = "mzML$|CDF$|mzXML$") {
+    assays <- .mtbls_assay_list(mtblsId)
+    anames <- names(assays)
+    if (length(assayName)) {
+        if (!all(assayName %in% anames))
+            stop("Not all assay names defined with 'assayName' are available ",
+                 "for ", mtblsId, ". Available assay names are: \n",
+                 paste0(" - \"", anames, "\"", collapse = "\n"), call. = FALSE)
+        assays <- assays[anames %in% assayName]
+    }
+    fpath <- mtbls_ftp_path(mtblsId, mustWork = FALSE)
+    dfiles <- lapply(assays, .mtbls_data_file_from_assay, pattern = pattern)
+    ffiles <- unlist(dfiles, use.names = FALSE)
+    if (!length(ffiles)) {
+        ## Fallback: also evaluate the "Raw Spectral Data File" column
+        dfiles <- lapply(assays, .mtbls_data_file_from_assay, pattern = pattern,
+                         colname = "Raw Spectral Data File")
+        ffiles <- unlist(dfiles, use.names = FALSE)
+        if (!length(ffiles))
+            stop("No files matching the provided file pattern found for ",
+                 "MetaboLights data set ", mtblsId, ".", call.
= FALSE) + else + message("Used data files from the assay's column \"Raw Spectral ", + "Data File\" since none were available in column ", + "\"Derived Spectral Data File\".") + } + ## Cache files + bfc <- BiocFileCache() + lfiles <- bfcrpath(bfc, paste0(fpath, ffiles), fname = "exact") + ## Add and store metadata to the cached files + mdata <- data.frame( + rid = names(lfiles), + mtbls_id = mtblsId, + mtbls_assay_name = rep(names(dfiles), lengths(dfiles)), + derived_spectral_data_file = unlist(dfiles, use.names = FALSE)) + bfcmeta(bfc, name = "MTBLS", overwrite = TRUE) <- mdata + mdata$rpath <- lfiles + mdata +} + +#' Check for a given MTBLS ID and assay IDs/file names if we have cached data +#' files. This function is supposed to work also offline using only previously +#' cached content. In contrast to `.mtbls_data_files()`, this function just +#' queries the BiocFileCache for content and returns a `data.frame` with +#' all cached data files for a given MTBLS ID, assay name and pattern. The +#' returned `data.frame` has the same format as the one returned by +#' `.mtbls_data_files()`. +#' +#' @importMethodsFrom BiocFileCache bfcmetalist bfcquery +#' +#' @noRd +.mtbls_data_files_offline <- function(mtblsId = character(), + assayName = character(), + pattern = "mzML$|CDF$|mzXML$") { + bfc <- BiocFileCache() + if (!any(bfcmetalist(bfc) == "MTBLS")) + stop("No local MetaboLights cache available. Please re-run with ", + "'offline = FALSE' first.", call. = FALSE) + res <- as.data.frame(bfcquery(bfc, mtblsId, field = "mtbls_id")) + if (length(assayName)) { + res <- res[res$mtbls_assay_name %in% assayName, ] + } + res <- res[grepl(pattern, res$derived_spectral_data_file), ] + if (!nrow(res)) + stop("No locally cached data files found for the specified ", + "parameters.", call. 
= FALSE) + res[, c("rid", "mtbls_id", "mtbls_assay_name", + "derived_spectral_data_file", "rpath")] +} diff --git a/README.md b/README.md index 4ced0e8..16df79f 100644 --- a/README.md +++ b/README.md @@ -1,6 +1,9 @@ # Retrieve Mass Spectrometry Data from MetaboLights -[![Project Status: WIP – Initial development is in progress, but there has not yet been a stable, usable release suitable for the public.](https://www.repostatus.org/badges/latest/wip.svg)](https://www.repostatus.org/#wip) +[![Project Status: Active – The project has reached a stable, usable state and is being actively developed.](https://www.repostatus.org/badges/latest/active.svg)](https://www.repostatus.org/#active) +[![R-CMD-check-bioc](https://github.com/RforMassSpectrometry/MsBackendMetaboLights/workflows/R-CMD-check-bioc/badge.svg)](https://github.com/RforMassSpectrometry/MsBackendMetaboLights/actions?query=workflow%3AR-CMD-check-bioc) +[![codecov](https://codecov.io/gh/rformassspectrometry/MsBackendMetaboLights/graph/badge.svg?token=jpxt7OlA2k)](https://codecov.io/gh/rformassspectrometry/MsBackendMetaboLights) +[![:name status badge](https://rformassspectrometry.r-universe.dev/badges/:name)](https://rformassspectrometry.r-universe.dev/) [![license](https://img.shields.io/badge/license-Artistic--2.0-brightgreen.svg)](https://opensource.org/licenses/Artistic-2.0) This repository provides a *backend* for diff --git a/man/MetaboLights-utils.Rd b/man/MetaboLights-utils.Rd index f8d9881..69efb99 100644 --- a/man/MetaboLights-utils.Rd +++ b/man/MetaboLights-utils.Rd @@ -1,5 +1,5 @@ % Generated by roxygen2: do not edit by hand -% Please edit documentation in R/MetaboLights.R +% Please edit documentation in R/MsBackendMetaboLights.R \name{MetaboLights-utils} \alias{MetaboLights-utils} \alias{mtbls_ftp_path} diff --git a/man/MsBackendMetaboLights.Rd b/man/MsBackendMetaboLights.Rd new file mode 100644 index 0000000..2537056 --- /dev/null +++ b/man/MsBackendMetaboLights.Rd @@ -0,0 +1,108 @@ +% Generated 
by roxygen2: do not edit by hand
+% Please edit documentation in R/MsBackendMetaboLights.R
+\name{MsBackendMetaboLights}
+\alias{MsBackendMetaboLights}
+\alias{MsBackendMetaboLights-class}
+\alias{backendInitialize,MsBackendMetaboLights-method}
+\alias{backendMerge,MsBackendMetaboLights-method}
+\title{MsBackend representing MS data from MetaboLights}
+\usage{
+MsBackendMetaboLights()
+
+\S4method{backendInitialize}{MsBackendMetaboLights}(
+  object,
+  mtblsId = character(),
+  assayName = character(),
+  filePattern = "mzML$|CDF$|cdf$|mzXML$",
+  offline = FALSE,
+  ...
+)
+
+\S4method{backendMerge}{MsBackendMetaboLights}(object, ...)
+}
+\arguments{
+\item{object}{an instance of \code{MsBackendMetaboLights}.}
+
+\item{mtblsId}{\code{character(1)} with the ID of the MetaboLights data
+set/experiment.}
+
+\item{assayName}{\code{character} with the file names of assay files of the data
+set. If not provided (\code{assayName = character()}, the default), MS data
+files of all of the data set's assays are loaded. Use
+\verb{mtbls_list_files(, pattern = "^a_")} to list all
+available assay files of a data set \verb{}.}
+
+\item{filePattern}{\code{character} with the pattern defining the supported (or
+requested) file types. Defaults to
+\code{filePattern = "mzML$|CDF$|cdf$|mzXML$"} hence restricting to mzML,
+CDF and mzXML files supported by \emph{Spectra}'s \code{MsBackendMzR} backend.}
+
+\item{offline}{\code{logical(1)} whether only locally cached content should be
+evaluated/loaded.}
+
+\item{...}{additional parameters; currently ignored.}
+}
+\description{
+The \code{MsBackendMetaboLights} retrieves and represents mass spectrometry (MS)
+data from metabolomics experiments stored in the
+\href{https://www.ebi.ac.uk/metabolights/}{MetaboLights} repository. The backend
+directly extends the \link{MsBackendMzR} backend from the \emph{Spectra} package and
+hence supports MS data in mzML, netCDF and mzXML format.
Upon initialization
+with the \code{backendInitialize()} method, the \code{MsBackendMetaboLights} backend
+downloads and caches the MS data files of an experiment locally avoiding
+hence repeated download of the data.
+}
+\details{
+Data files are by default extracted from the column \code{"Derived Spectral Data File"} of the MetaboLights data set's \emph{assay} table. If this column
+does not contain any supported file names, the assay's column
+\code{"Raw Spectral Data File"} is evaluated.
+
+The backend uses the
+\href{https://bioconductor.org/packages/BiocFileCache}{BiocFileCache} package for
+caching of the data files. These are stored in the default local
+\emph{BiocFileCache} cache along with additional metadata that includes the
+MetaboLights ID and the name of the assay file with which the data file is
+associated. Note that at present only MS data files in \emph{mzML}, \emph{CDF} and
+\emph{mzXML} format are supported.
+
+The \code{MsBackendMetaboLights} backend defines and provides additional spectra
+variables \code{"mtbls_id"}, \code{"mtbls_assay_name"} and
+\code{"derived_spectral_data_file"} that list the MetaboLights ID, the name of
+the assay file and the original data file name on the MetaboLights ftp
+server for each individual spectrum. The \code{"derived_spectral_data_file"} can
+be used for the mapping between the samples of the experiment/data set and
+the individual data files and their spectra. This mapping is provided
+in the respective MetaboLights assay file.
+
+The \code{MsBackendMetaboLights} backend is considered \emph{read-only} and thus does
+not support changing \emph{m/z} and intensity values directly.
+
+Also, merging of MS data of \code{MsBackendMetaboLights} is not supported and
+thus \code{c()} of several \code{Spectra} with MS data represented by
+\code{MsBackendMetaboLights} will throw an error.
+}
+\section{Initialization and loading of data}{
+
+
+New instances of the class can be created with the \code{MsBackendMetaboLights()}
+function. Data is loaded and initialized using the \code{backendInitialize()}
+function with parameters \code{mtblsId}, \code{assayName} and \code{filePattern}. \code{mtblsId}
+must be the ID of a \strong{single} (existing) MetaboLights data set. Parameter
+\code{assayName} allows defining specific \emph{assays} of the MetaboLights data set
+from which the data files should be loaded. If provided, it should be the
+file names of the respective assays in MetaboLights (use e.g.
+\verb{mtbls_list_files(, pattern = "^a_")} to list all available
+assay files for a given MetaboLights ID \verb{}). By default,
+with \code{assayName = character()}, MS data files from all assays of a data
+set are loaded. The optional parameter \code{filePattern} defines the pattern
+used to filter the file names. It defaults to the file
+endings of supported MS data files. By default, \code{backendInitialize()}
+requires an active internet connection, because the function first checks
+the remote file content to synchronize possible changes/updates. This can
+be skipped with \code{offline = TRUE}, in which case only locally cached content
+is considered.
+} + +\author{ +Philippine Louail, Johannes Rainer +} diff --git a/tests/testthat/test_MetaboLights.R b/tests/testthat/test_MetaboLights.R deleted file mode 100644 index dc5a1d1..0000000 --- a/tests/testthat/test_MetaboLights.R +++ /dev/null @@ -1,43 +0,0 @@ -alist_ms <- .mtbls_assay_list("MTBLS2") -alist_nmr <- .mtbls_assay_list("MTBLS123") - -test_that("mtbls_ftp_path works", { - res <- mtbls_ftp_path("A", mustWork = FALSE) - expect_true(grepl("^ftp://", res)) - expect_true(grepl("A/$", res)) - - expect_error(mtbls_ftp_path("A", mustWork = TRUE), "Failed to connect") - - res <- mtbls_ftp_path("MTBLS1") - expect_true(grepl("MTBLS1/$", res)) - - expect_error(mtbls_ftp_path(c("A", "B")), "single ID") -}) - -test_that("mtbls_list_files works", { - res <- mtbls_list_files("MTBLS8735", pattern = "^a_") - expect_true(length(res) == 2) - expect_error(mtbls_list_files("AAA"), "Failed to connect") -}) - -test_that(".mtbls_assay_list works", { - res <- .mtbls_assay_list("MTBLS8735") - expect_true(is.list(res)) - expect_true(length(res) == 2L) - expect_true(is.data.frame(res[[1L]])) -}) - -test_that(".mtbls_derived_data_file works", { - res <- .mtbls_derived_data_file(alist_ms[[1L]]) - expect_true(is.character(res)) - expect_true(length(res) == 16) - - ## check the second column containing mzData files - res <- .mtbls_derived_data_file(alist_ms[[1L]], pattern = "mzData") - expect_true(is.character(res)) - expect_true(length(res) == 16) - - res <- .mtbls_derived_data_file(alist_nmr[[1L]]) - expect_true(is.character(res)) - expect_true(length(res) == 0) -}) diff --git a/tests/testthat/test_MsBackendMetaboLights.R b/tests/testthat/test_MsBackendMetaboLights.R index e69de29..716026b 100644 --- a/tests/testthat/test_MsBackendMetaboLights.R +++ b/tests/testthat/test_MsBackendMetaboLights.R @@ -0,0 +1,115 @@ +## Note: clean BiocFileCache with cleanbfc(days = -10, ask = FALSE) +## MTBLS10555: lists files in assay column that don't exist. 
Will result in an +## error when we try to download the data. The data set contains +## small data files. Maybe use pattern sham-1-10.mzML +## MTBLS39: cdf files listed in Raw Spectral Data File column. Maybe use a +## specific pattern to load/cache only some files. +## MTBLS243: mzML.gz files. Could possibly use an additional filter. + +test_that("MsBackendMetaboLights works", { + res <- MsBackendMetaboLights() + expect_s4_class(res, "MsBackendMetaboLights") + expect_true(inherits(res, "MsBackendMzR")) +}) + +test_that("backendInitialize,MsBackendMetaboLights works", { + ## Test errors + expect_error(backendInitialize(MsBackendMetaboLights(), + data = data.frame(a = 3)), + "Parameter 'data' is not supported") + expect_error(backendInitialize(MsBackendMetaboLights(), + mtblsId = c("a", "b")), + "Parameter 'mtblsId' is required and can") + expect_error(backendInitialize(MsBackendMetaboLights(), mtblsId = "a"), + "Failed to connect") + + ## Test NMR data set + expect_error(backendInitialize(MsBackendMetaboLights(), + mtblsId = "MTBLS100"), "No files matching") + ## Test real data set. + res <- backendInitialize(MsBackendMetaboLights(), mtblsId = "MTBLS39", + filePattern = "63A.cdf") + expect_s4_class(res, "MsBackendMetaboLights") + expect_true(all(c("mtbls_id", "mtbls_assay_name", + "derived_spectral_data_file") %in% + Spectra::spectraVariables(res))) + expect_true(all(res$mtbls_id == "MTBLS39")) +}) + +test_that(".mtbls_data_files and .mtbls_data_files_offline works", { + ## error + expect_error(.mtbls_data_files(mtblsId = "MTBLS2", + assayName = "does not exist"), + "Not all assay names") + expect_error(.mtbls_data_files(mtblsId = "MTBLS100"), "No files matching") + + ## Cache the data: MTBLS39 contains small cdf files, but they are listed + ## in the Raw Spectral Data File column. Will use a specific pattern to + ## just load 3 files.
+ a <- .mtbls_data_files("MTBLS39", pattern = "63A.cdf") + expect_true(is.data.frame(a)) + expect_true(nrow(a) == 3) + expect_true(all(a$mtbls_id == "MTBLS39")) + ## Call the function again: the data should now come from the cache. + b <- .mtbls_data_files("MTBLS39", pattern = "63A.cdf") + expect_true(is.data.frame(b)) + expect_true(nrow(b) == 3) + expect_true(all(b$mtbls_id == "MTBLS39")) + expect_equal(a$rpath, b$rpath) + + ## Use offline + d <- .mtbls_data_files_offline("MTBLS39", pattern = "63A.cdf") + expect_true(is.data.frame(d)) + expect_true(nrow(d) == 3) + expect_true(all(d$mtbls_id == "MTBLS39")) + expect_equal(a$rpath, d$rpath) +}) + +test_that("backendMerge,MsBackendMetaboLights fails", { + b <- MsBackendMetaboLights() + expect_error(backendMerge(b, b), "Merging of backends") +}) + +alist_ms <- .mtbls_assay_list("MTBLS2") +alist_nmr <- .mtbls_assay_list("MTBLS123") + +test_that("mtbls_ftp_path works", { + res <- mtbls_ftp_path("A", mustWork = FALSE) + expect_true(grepl("^ftp://", res)) + expect_true(grepl("A/$", res)) + + expect_error(mtbls_ftp_path("A", mustWork = TRUE), "Failed to connect") + + res <- mtbls_ftp_path("MTBLS1") + expect_true(grepl("MTBLS1/$", res)) + + expect_error(mtbls_ftp_path(c("A", "B")), "single ID") +}) + +test_that("mtbls_list_files works", { + res <- mtbls_list_files("MTBLS8735", pattern = "^a_") + expect_true(length(res) == 2) + expect_error(mtbls_list_files("AAA"), "Failed to connect") +}) + +test_that(".mtbls_assay_list works", { + res <- .mtbls_assay_list("MTBLS8735") + expect_true(is.list(res)) + expect_true(length(res) == 2L) + expect_true(is.data.frame(res[[1L]])) +}) + +test_that(".mtbls_data_file_from_assay works", { + res <- .mtbls_data_file_from_assay(alist_ms[[1L]]) + expect_true(is.character(res)) + expect_true(length(res) == 16) + + ## check the second column containing mzData files + res <- .mtbls_data_file_from_assay(alist_ms[[1L]], pattern = "mzData") + expect_true(is.character(res)) + expect_true(length(res) == 16) + + res <-
.mtbls_data_file_from_assay(alist_nmr[[1L]]) + expect_true(is.character(res)) + expect_true(length(res) == 0) +}) diff --git a/vignettes/MsBackendMetaboLights.Rmd b/vignettes/MsBackendMetaboLights.Rmd new file mode 100644 index 0000000..74a62b8 --- /dev/null +++ b/vignettes/MsBackendMetaboLights.Rmd @@ -0,0 +1,186 @@ +--- +title: "Using Mass Spectrometry Data from MetaboLights" +output: + BiocStyle::html_document: + toc_float: true +vignette: > + %\VignetteIndexEntry{Using Mass Spectrometry Data from MetaboLights} + %\VignetteEngine{knitr::rmarkdown} + %\VignetteEncoding{UTF-8} + %\VignettePackage{MsBackendMetaboLights} + %\VignetteDepends{Spectra,BiocStyle} +--- + +```{r style, echo = FALSE, results = 'asis', message=FALSE} +BiocStyle::markdown() +``` + +**Package**: `r Biocpkg("MsBackendMetaboLights")`
+**Authors**: `r packageDescription("MsBackendMetaboLights")[["Author"]] `
+**Last modified:** `r file.info("MsBackendMetaboLights.Rmd")$mtime`
+**Compiled**: `r date()` + +```{r, echo = FALSE, message = FALSE} +library(Spectra) +library(BiocStyle) +``` + +# Introduction + +The `r Biocpkg("Spectra")` package provides a central infrastructure for the +handling of Mass Spectrometry (MS) data. The package supports interchangeable +use of different *backends* to import MS data from a variety of sources and data +formats. The *MsBackendMetaboLights* package allows retrieving MS data files +directly from the [MetaboLights](https://www.ebi.ac.uk/metabolights/) +repository. MetaboLights is one of the main public repositories for deposition +of metabolomics experiments including (raw) MS and/or NMR data files and the +related experimental and analytical results. The *MsBackendMetaboLights* package +downloads and locally caches MS data files for a MetaboLights data set and +enables further analyses of this data directly in R. + + +# Installation + +The package can be installed directly from within R with the commands below: + +```{r, eval = FALSE} +if (!requireNamespace("BiocManager", quietly = TRUE)) + install.packages("BiocManager") + +BiocManager::install("RforMassSpectrometry/MsBackendMetaboLights") +``` + + +# Importing MS Data from MetaboLights + +[MetaboLights](https://www.ebi.ac.uk/metabolights/) is one of the main public +repositories for deposition of metabolomics experiments including (raw) mass +spectrometry (MS) and NMR data files and experimental/analysis results. The +experimental metadata and results are stored as plain text files in ISA-tab +format. Each MetaboLights experiment must provide a file describing the samples +analyzed and at least one *assay* file that links the experimental +samples to the (raw and processed) data files with quantification of +metabolites/features in these samples. + +In this vignette we explore and load MS data files from a small MetaboLights +experiment.
MetaboLights provides information on a data set/experiment as a set +of plain text files in *ISA-tab* format. These can be accessed and read from the +data set's ftp folder. The set of files generally consists of a file with +information on the experiment/investigation (file name starting with *i_*), +a file describing the samples of the data set (file name starting with *s_*), +one or more *assay* files (measurements/analysis, file name starting with *a_*) +and a file with quantified +metabolite abundances (file name starting with *m_*). Note that a data set can +have more than one assay file. + +Below we list all files from the MetaboLights data set with the ID *MTBLS39*. + +```{r} +library(MsBackendMetaboLights) + +#' List files of a MetaboLights data set +all_files <- mtbls_list_files("MTBLS39") +``` + +All these files are directly accessible in the ftp folder associated with the +MetaboLights data set. Below we use the `mtbls_ftp_path()` function to return +the ftp path for our test data set. + +```{r} +mtbls_ftp_path("MTBLS39") +``` + +We could inspect the content of this folder, and download the files, using a +browser that supports the ftp protocol. We can however also access +the files directly from within R. Below we read the *assay* data file +directly using the `read.table()` function. + +```{r} +#' Get the assay files of the data set +grep("^a_", all_files, value = TRUE) + +#' Read the assay file +a <- read.table(paste0(mtbls_ftp_path("MTBLS39"), + grep("^a_", all_files, value = TRUE)), + sep = "\t", header = TRUE, check.names = FALSE) +``` + +Each row in this assay table refers to one measurement (data file) of the data +set, with columns providing information on that measurement. The number and +content of columns can vary between data sets. Below we list the columns +available in the assay file of our test data set.
+ +```{r} +colnames(a) +``` + +MS data files are generally provided in a column named `"Derived Spectral Data +File"` or `"Raw Spectral Data File"`, depending on the data field into which the +submitter of the data set entered the files. Below we list the content of these data columns. + +```{r} +a[, c("Raw Spectral Data File", "Derived Spectral Data File")] +``` + +Note that providing MS data files is not mandatory, so for some +data sets no MS data files might be available. For this particular data set the +MS data files are provided in the `"Raw Spectral Data File"` column. These files +are in CDF format and can hence be loaded using the `MsBackendMetaboLights` +backend into R as a `Spectra` object (`MsBackendMetaboLights` directly extends +*Spectra*'s `MsBackendMzR` backend and therefore supports import of MS data +files in mzML, CDF or mzXML formats). By default, all MS data files of all +assays would be retrieved, but in our example below we restrict the import to a few files to +reduce the amount of data that needs to be downloaded. For that we define a +pattern matching the file name of only some data files using the `filePattern` +parameter. Alternatively, for data sets with more than one assay, it would also +be possible to select MS data files from one particular assay only using the +`assayName` parameter. In our case we load all MS data files that end with +*63A.cdf*. + +```{r} +library(Spectra) + +#' Load MS data files of one data set +s <- Spectra("MTBLS39", filePattern = "63A.cdf", + source = MsBackendMetaboLights()) +s +``` + +This call downloaded the files to the local cache and loaded them as +a `Spectra` object. The downloading and caching of the data is handled by +Bioconductor's `r Biocpkg("BiocFileCache")`. Any subsequent loading of the same +data files will load the locally cached versions, thus avoiding repeated +downloads of the same data.
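The `filePattern` value passed to `Spectra()` deserves a short note. The block below is a minimal sketch, assuming that `filePattern` is interpreted as a regular expression in the same way as the `pattern` argument of `grep()` and `mtbls_list_files()` used earlier; this is an assumption and not confirmed by the text here. Under that assumption, `"63A.cdf"` is unanchored and its `.` matches any single character. The file names are hypothetical and only illustrate the difference from a stricter pattern.

```r
## Hypothetical file names, made up for illustration only; they are not
## taken from the MTBLS39 data set.
files <- c("FILES/AM063A.cdf", "FILES/AM063B.cdf",
           "FILES/AM163A.cdf", "FILES/X63A_cdf.txt")

## Unanchored pattern as in the call above: "." matches any single
## character and the match can occur anywhere in the name.
loose <- grepl("63A.cdf", files)

## Stricter alternative: literal dot, match only at the end of the name.
strict <- grepl("63A\\.cdf$", files)

## Compare both selections side by side.
data.frame(file = files, loose = loose, strict = strict)
```

If `filePattern` is indeed evaluated as a regular expression, a value such as `"63A\\.cdf$"` would restrict the selection to files whose names really end with *63A.cdf*.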
+ +The message shown by the `Spectra()` call above indicates that the MS data files +were not provided in the expected column (`"Derived Spectral Data File"`) but in +the column for raw data files. + +The `Spectra` object with the MS data files of the MetaboLights data set now +enables any subsequent analysis that supports this type of data. On top of the +spectra variables and mass peak data values that are provided by the MS data +files, additional information related to the MetaboLights data set is also +available as specific *spectra variables*. We list all available spectra +variables of the data set below. + +```{r} +spectraVariables(s) +``` + +The MetaboLights-specific variables are `"mtbls_id"`, `"mtbls_assay_name"` and +`"derived_spectral_data_file"`. + +```{r} +spectraData(s, c("mtbls_id", "mtbls_assay_name", + "derived_spectral_data_file")) +``` + +These variables can also be used to link the individual spectra back to the +original sample (e.g. through the *assay* and *sample* tables of the +MetaboLights data set). + + +# Session information + +```{r} +sessionInfo() +```