Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add new class SpectrumList #361

Merged
merged 6 commits into from
Oct 13, 2018
Merged

Add new class SpectrumList #361

merged 6 commits into from
Oct 13, 2018

Conversation

jorainer
Copy link
Collaborator

@jorainer jorainer commented Oct 5, 2018

The SpectrumList class extends S4Vectors::SimpleList and allows to store Spectrum1 and Spectrum2 objects. The advantage over a conventional list:

  • has mcols and thus allows to add arbitrary information to each individual spectrum.
  • supports dedicated functions mz, msLevel etc to extract the respective information from the stored Spectrum objects.

This object will be needed in future as we plan to add functionality in xcms that allows to identify MS2 spectra to chromatographic peaks and features. Thus we need additional annotation fields to be assigned to a spectrum, such as chromatographic peak ID or feature ID. SpectrumList allows this without having to add additional slots to the already existins Spectrum objects.

Johannes Rainer added 3 commits October 5, 2018 13:23
@jorainer jorainer requested review from lgatto and sgibb October 5, 2018 13:35
@jorainer
Copy link
Collaborator Author

jorainer commented Oct 5, 2018

Thinking it over, Spectra might be a better name than SpectrumList. This would be somewhat similar to GRanges and we could in future eventually add a SpectraList (similar to GRangesList). any thoughts @lgatto @sgibb ?

@jorainer
Copy link
Collaborator Author

jorainer commented Oct 5, 2018

Note: the Travis CI builds failed because of a timeout - not an error.

Copy link
Collaborator

@sgibb sgibb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Dear @jotsetung, thanks for the PR and your effort.

I am not sure what the aim of this new class should be. It is somewhere in the middle between a base list of Spectrum objects (e.g. returned by spectra/spectrapply) and an (OnDisk)MSnExp.
I guess an MSnExp object is too heavy and a list contains not enough metadata for your approach?
Currently I don't see a real advantage of SpectrumList. In fact it is just a as(spectra(MSnExpObject), "SimpleList") that provides some accessors to avoid lapply calls (e.g. lapply(list, mz)). In my opinion this creates a lot of overhead and much more code to maintain. An idea would be to replace the environment in MSnExp by such a SpectrumList. But that would be a big change.

Could you please explain a little more in detail why you want this new data structure. I read sneumann/xcms#302 but still don't understand why this could not be handled in an MSnExp/classical list object.

#' @section Accessing spectrum attributes:
#'
#' These methods allow to access the attributes and values of the individual
#' [Spectrum] ([Spectrum1] or [Spectrum2]) objects within the list.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Please use [Spectrum-class] ([Spectrum1-class], [Spectrum2-class]) here (and everywhere else) to force roxygen2 to create \linkS4class{Spectrum} links. Currently the manual page contains a lot of [Spectrum] references (they are not treated/converted as links).

IMHO it is not clear from https://cran.r-project.org/web/packages/roxygen2/vignettes/markdown.html#links that you notation isn't allowed but it seems not to work properly.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the hint. I changed it.

if (is(z, "Spectrum2"))
if (length(z@precursorMz)) z@precursorMz else NA_real_
else NA_real_
}, numeric(1))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't like the nested if call here (and in the methods below). What do you think about (I know it is essential the same):

if (is(z, "Spectrum2") && length(z@precursorMz)) z@precursorMz else NA_real_

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good point. I've changed it.

#' ## Get the number of peaks per spectrum.
#' peaksCount(spl)
setMethod("peaksCount", "SpectrumList", function(object) {
vapply(object, peaksCount, integer(1))
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a reason why you access the slots of Spectrum objects directly in all the methods above (mz, intensity, rtime, ...) and start to use the accessor methods for the slots here/below? I think we should use a consistent way for all of the slots (where possible).

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For performance reasons I tend to directly access slots (when I know it's save). S4 method dispatching is quite slow. In some cases I do however use the accessor methods, especially if they have more functionality than just returning the slot value (I think the peaksCount is such a case, that can calculate the number of peaks based on the number of mz values.

#'
#' @examples
#'
#' ## Extract the collision energy for spectrum.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for each spectrum

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed. Thanks.

#'
#' @examples
#'
#' ## Extract the file index for the spectrum.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

for each spectrum

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fixed too. Thanks.

R/SpectrumList.R Outdated
.show_SpectrumList <- function(x, margin = "", print.classinfo = FALSE) {
cat("SpectrumList with", length(x), "spectra and", length(mcols(x)),
"metadata column(s):\n")
if (length(x) > 0) {
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While if (length(x) > 0) may be easier to understand if(length(x)) would be enough.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

you're right. Changed to if(length(x))

R/SpectrumList.R Outdated
x, .COL2CLASS)
out <- rbind(classinfo, out)
}
if (nrow(out) != 0L)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nrow(out)?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks; I changed that too.

@@ -64,6 +64,7 @@ setMethod("[[", c("MSnSetList", "ANY", "missing"),

setMethod("split", c("MSnSet", "character"),
function(x, f) {
cat("split,MSnSet,character\n")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is a "left-over" from debugging. I think we should remove this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes indeed. Thanks for pointing out.

@@ -78,6 +79,7 @@ setMethod("split", c("MSnSet", "character"),

setMethod("split", c("MSnSet", "factor"),
function(x, f) {
cat("split,MSnSet,factor\n")
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I guess this is a "left-over" from debugging. I think we should remove this.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

yes. I removed it now

@jorainer
Copy link
Collaborator Author

jorainer commented Oct 8, 2018

Dear @sgibb , thanks for your review! I appreciate your critical feedback!

So, why a new class. That's a valid question. For xcms and also for another package I'm currently developing (CompoundDb) I need a way to add arbitrary annotation columns to an Spectrum object. CompoundDb provides an infrastructure to build annotation databases/resources from e.g. hmdb (http://www.hmdb.ca). I'll also add functionality to search/match against MS2 spectra that are provided by hmdb. I need to add annotations to the Spectrum2 objects I get from there (such as spectrum_id but also others). For xcms we want to implement a workflow that allows exporting data for GNPS (https://gnps.ucsd.edu/). In simple words, identify and export Spectrum2 data that is linked to chromatographics peaks. We thus need to add the information to which chromatographic peak a Spectrum2 is associated (i.e. the peak/feature ID).

Can this be done with an MSnExp? In theory I guess yes, because it should be possible to add arbitrary annotations to its fData. In practice users could get confused (why is an MSnExp object returned if all I want is a list of spectra?). For xcms I will need a function that returns all MS2 spectra that are within m/z-rt ranges of chromatographic peaks. Here I will get for each slice 0-n Spectrum2 objects. I find that quite complicated and unintuitive to handle with MSnExp object(s). IMHO it would be more natural and intuitive to have a list-like structure with subsetting, lapply etc being already handled by the base class SimpleList (not so much need for maintainance there).
Also in terms of MS2 spectra as annotation resources (as in CompoundDb) I think it makes more sense to have them as a list-like structure instead of an MSnExp. As mentioned, SimpleList could be somehow a GRanges object for proteomics/metabolimics. For users starting to integrate genomic/transcriptomic and proteomic/metabolomic data it might be more natural to have here and there similar objects.

I would not replace the environment in MSnExp with this object - but think of something else: we do have already much more additional fields read from mzML files than what is covered by the slots of Spectrum objects. Most of these informations get lost once users call spectra or access individual spectrum objects within an OnDiskMSnExp. SpectrumList could help here as it is possible to pass along some/all additional columns in the object's feature data.

In any rate. Happy to discuss further. Eventually we can also get the opinion from Laurent, @lgatto?

@jorainer
Copy link
Collaborator Author

jorainer commented Oct 9, 2018

Another thing worth mentioning: we do also need (for the GNPS workflow) to write MGF files with additional fields (the feature ID or chrom peak ID to which a MS2 spectrum can be associated). I had planned to implement a writeMgfData method for SpectrumList that exports the spectrum data and adds all information from the mcols as additional fields to the mgf. The SpectrumList would very much simplify the export of such data for the user.

Thinking it all over and testing some stuff I still very much believe that using an MSnExp as a simple container for spectra is an overkill. A list is just a more intuitive way to represent a collection of spectra - and simpler to use for the end user, and without the, in this case, unneeded additional information (instrument infos, history etc).

@sgibb
Copy link
Collaborator

sgibb commented Oct 10, 2018

@jotsetung thanks for the great explanation. You are right that an SpectrumList object would be easier to understand for the user. But I am still not convinced and I fear the maintenance effort. While something like lapply etc. is already implemented for SimpleList you maybe want to compareSpectra, plot them, removePeaks, calculateFragments or what ever. We (= you 😉) would have to reimplement each of these functions for the SpectrumList as well.

Thinking it all over and testing some stuff I still very much believe that using an MSnExp as a simple
container for spectra is an overkill.

I don't think it really add much additional useless stuff. In fact it is just an more or less empty pData data.frame and an (empty) MSnProcess object.
But to be honest I don't have a really strong opinion here. If you need a new class I am fine with it. I would say that we wait for @lgatto's comment.

A few points I would like to see if we decide to implement SpectrumList:

  • spectra should return a SpectrumList instead of a list (maybe we could add a drop=FALSE|TRUE argument to control whether mcols should keep/removed.
  • maybe the same for spectrapply
  • as(spectrumList, "MSnExp"), as(msnExp, "SpectrumList")

@jorainer
Copy link
Collaborator Author

@sgibb thanks for the reply. Regarding maintainance and implementation of additional methods such as compareSpectra and alike, I think that should be managable. I'm also OK for adding the additional things you suggest (coercion, spectra returning SpectrumList). Then let's wait for @lgatto's opinion.

@lgatto
Copy link
Owner

lgatto commented Oct 12, 2018

I have to say I'm not convinced (yet?) by the need for a list of spectra class. I certainly agree that a SimpleList can be better than a list, if one wants metadata, but ... that's essentially what an MSnExp is, except for the additional (optional) information, and that the spectra are stored as an environment (but that's just implementation, unknown to the user). If it's really only a matter not confusing the users, one could also create a subclass Spectra from an MSnSet and let users run with it.

@jotsetung - what can you tell me about performance of the SimpleList class? If it does have nice properties, it could be used as an assayData of an MSnExp object.

Also, in your use cases, I understand that the spectra in a SpectraList will be in memory. Will this scale to millions of MS2 spectra?

@jorainer
Copy link
Collaborator Author

Thanks @lgatto for the feedback. So summarizing, we're not going to add the object. There are no real strong arguments in favour of SpectrumList I have to admit. For me it just feels more natural to work with a list of spectra than with an MSnExp/OnDiskMSnExp in my use case.

Regarding performance of SimpleList: it's an object containing a list, so I guess copying etc applies to it (while that's not the case for environment). It could be that Herve et al implemented some fancy C code giving List a better performance over a simple list but I haven't checked.

And yes, SpectraList will be in memory, so at some point it will hit the memory limit.

@lgatto
Copy link
Owner

lgatto commented Oct 12, 2018

we're not going to add the object

No, that's not what I wanted to imply. I'm just trying to get the bigger picture and understand the wider implications.

In proteomics, spectral libraries (i.e. collections of MS2 spectra) can become really big, which is why it might be worth thinking about it already now. Can you read the full GNPS in a SpectrumList for the moment?

And how would you feel about having a Spetra (or SpectrumList) class that is just really an re-branded [OnDisk]MSnExp object. Again, I'm not necessarily wanting this, just trying to understand the needs better.

@jorainer
Copy link
Collaborator Author

Haven's tried to read GNPS yet. But I have implemented the SpectrumList in the CompoundDb package (https://github.com/EuracBiomedicalResearch/CompoundDb) that I am developing when time permits. What I did there is to create a CompDb database from HMDB (www.hmdb.ca) and loading its MS2 spectra from that database into a SpectrumList is no problem. It contains 458,901 MS2 spectra and the size in memory is 1.26GB.

I'm currently testing to use MSnExp/OnDiskMSnExp as an alternative. Only, for CompoundDb the OnDiskMSnExp makes no sense - there are simply no mzML files to load the data from.

Eventually, it helps to explain the history of the SpectrumList (now that I mentioned also CompoundDb. My idea was that CompoundDb could be something similar to ensembldb, but for MS data. It has (for now) a very simple layout with some generic compound annotations (mass, formula, inchi, ...) and allows also to store MS/MS spectra in it. In the end it should aid in the identification of compounds found in untargeted metabolomics experiments. The MS2 spectra can be extracted as Spectrum2 objects, but I need additional fields, such as the spectrum ID, compound ID etc, that I did not want to add as additional slots to e.g. an XSpectrum2 object extending Spectrum2. Here I got inspired by the GRanges object (that you essentially extract from an ensembldb EnsDb database). These allow to add arbitrary annotations as mcols and that's so simple and user friendly that I thought that we could also have something similar for MS data -> Spectra was born. Now, in xcms I extract again MS2 spectra that can be associated to a MS1 feature or peak. Again I need the ID of the feature or peak to be put onto each Spectrum2 object -> again an ideal use case for the Spectra object.

From the design I see also the Spectra at a lower level than the MSnExp/OnDiskMSnExp. The Spectra is only a simple list structure with optional metadata columns. It's a truly one-dimensional vector. The MSnExp is already something more. It adds the information on samples, columns, files and hence represents (as the name suggests) an experiment.

@lgatto
Copy link
Owner

lgatto commented Oct 12, 2018

Only, for CompoundDb the OnDiskMSnExp makes no sense - there are simply no mzML files to load the data from.

That is an important piece of information indeed. What format is the data in, and how, in general, do you create the SpectumList objects?

From the design I see also the Spectra at a lower level than the MSnExp/OnDiskMSnExp. The Spectra is only a simple list structure with optional metadata columns.

Yes, that is also how I see it, hence the idea to be able to use it as a way to populate an experiment assayData, or even have it as a backend. That would be a medium/long-term plan; of course.

@jorainer
Copy link
Collaborator Author

Currently I generate Spectra by passing one or a list of Spectrum objects to it (e.g. in xcms I extract the MS2 spectra from an OnDiskMSnExp and return them). In CompoundDb I have a function that takes a data.frame and creates the object from that. For now I have no import function planned. The Spectra get then exported with MSnbase:::writeMgfDataFile with the mcols added as additional fields (that's what the people from GNPS need).

What could make sense in the long run is to replace the asssayData and featureData with an Spectra. The featureData would then simply be the mcols. The nice thing is that subsetting, lapply etc is all covered by SimpleList. There is not much that has to be implemented. The main disadvantage over having the assayData as an environment is that the data will get copied.

Just realized a potential shortcoming of the MSnExp (possibly related to assayData being an environment) the [ subsetting method does not allow arbitrary ordering. i.e. [c(2, 4, 1, 3)] will subset to c(1, 2, 3, 4). That is however something I would really need/expect from a list of spectra.

@lgatto
Copy link
Owner

lgatto commented Oct 12, 2018

So for now at least, the new class isn't meant to be exposed, only used as part of your CompoundDb development.

You say you create the SpectumList from a data.frame. What's the underlying data? A data.frame sounds like a SQL database. Is that the case?

@jorainer
Copy link
Collaborator Author

For now it was implemented in CompoundDb (and not exposed because I'm the only one kind of using CompoundDb for now). I would however like it to be exposed/exported now. I would need it in xcms. Since CompoundDb is anyway importing Spectrum from MSnbase (and since it's a rather low level infrastructure object) I wanted to add it to MSnbase.

Indeed, the underlying data in CompoundDb is a data.frame (exported from a SQL database). The MSMS data is imported from xml files in the HMDB-specific format (for now; the plan is then to suport also other formats as well).

@lgatto
Copy link
Owner

lgatto commented Oct 12, 2018

Ok, thanks. I'll do a quick code review later today.

R/SpectrumList-methods.R Outdated Show resolved Hide resolved
R/SpectrumList.R Outdated
#' @rdname SpectrumList
NULL

.SpectrumList <- setClass("SpectrumList",
Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What about defining the class in DataClasses.R and the other functions in functions-SpectumList.R, as is generally the convention (although I'm sure I must have broken it at some point)? This might also save us from having a collate field in the DESCRIPTION file.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Of course we will add it to DataClasses.R. I simply copied the files over from CompoundDb.R.

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the end it depends if we agree to add the object or not. If so I would add it to DataClasses.R (I would also call it Spectra instead of SpectrumList) and would address also all points raised by Sebastian.

Copy link
Owner

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Fine then. Go ahead.

@jorainer
Copy link
Collaborator Author

With the latest commit I have reorganized the files, renamed SpectrumList into Spectra and addressed all of @sgibb's comments.

@lgatto lgatto merged commit 077bdf4 into master Oct 13, 2018
@jorainer jorainer deleted the SpectrumList branch October 13, 2018 17:26
@LaurentGatto
Copy link

Just pushed this to Bioc.

@lgatto
Copy link
Owner

lgatto commented Oct 15, 2018

@jotsetung - I see you added S4Vectors to Depends; could it be moved to imports?

@jorainer
Copy link
Collaborator Author

I tried that, but then lapply, [ [[ etc don't work for the Spectra class - until the S4Vectors package is loaded.

@lgatto
Copy link
Owner

lgatto commented Oct 15, 2018

ok, thanks.

@sgibb
Copy link
Collaborator

sgibb commented Oct 17, 2018

I tried that, but then lapply, [ [[ etc don't work for the Spectra class - until the S4Vectors package is loaded.

Really? I just moved S4Vectors to Imports and lapply, [, [ work as they should. These methods are exported from BiocGenerics (which is already in Depends). The only method that is not working is mcols because we don't export it (and it isn't exported by BiocGenerics either (but it should IMHO)). After exporting mcols/mcols<- in our NAMESPACE everything of example("Spectra") work, except the following:

spl <- Spectra(sp1, sp2, elementMetadata = DataFrame(id = c("a", "b")))
Error in DataFrame(id = c("a", "b")) : 
  could not find function "DataFrame"

We could also export the DataFrame (but then we have to document it ...) or keep S4Vectors in Depends.

@jorainer
Copy link
Collaborator Author

I just find it strange that we export functions from other packages that we're only importing. If it helps we could indeed move S4Vectors from Imports to Depends.

@lgatto
Copy link
Owner

lgatto commented Oct 19, 2018

Well, documenting doesn't really mean writing much about it. An alias and a short line to mention that once can use mcols[<-] with a DataFrame on a Spectra object is enough, I believe.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants