Enable chunk-wise processing for all peaks data functions #306
Conversation
- Refactor the code that decides how to split `Spectra` for parallel processing: this is no longer done automatically based on `dataStorage`.
- Add a slot to `Spectra` allowing to set a processing chunk size (issue #304).
- With chunk-wise processing only the data of one chunk is realized in memory in each iteration. This also enables processing the data in parallel (see the short sketch after this list).
- Add and modify functions to enable default chunk-wise processing of peaks data (issue #304).
- Split the documentation for chunk-wise (parallel) processing into a separate documentation entry.
- Use `processingChunkFactor` instead of `dataStorage` in functions to define splitting and processing.
- Remove unnecessary functions.
- Add a vignette on parallel/chunk-wise processing.
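As a brief illustration of what this enables, a minimal sketch (assuming `sps` is an existing `Spectra` object with an on-disk backend; the chunk size and number of workers are arbitrary example values, not defaults):

```r
library(Spectra)
library(BiocParallel)

## Realize at most 1000 spectra per iteration whenever peaks data
## (m/z and intensity values) is accessed or processed.
processingChunkSize(sps) <- 1000

## The individual chunks can then also be processed in parallel;
## parallel processing is configured through BiocParallel, e.g. by
## registering two worker processes as the default backend.
register(MulticoreParam(workers = 2))
```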
@philouail, I've fixed some more things, could you please have another careful look - any questions, concerns, comments or change requests are highly welcome! I've now also added a vignette describing the parallel processing settings. Please have a look at that too.
This looks super good to me.
The `processingChunkSize` and `processingChunkFactor` description is really, really good.
The vignette with the tips for large data sets is super good and straight to the point, I like it.
I made very few comments and I had one thing I was confused about.
Seems very good to me!
vignettes/Spectra.Rmd (outdated):

@@ -41,7 +41,9 @@ This vignette provides general examples and descriptions for the *Spectra*
 package. Additional information and tutorials are available, such as
 [SpectraTutorials](https://jorainer.github.io/SpectraTutorials/),
 [MetaboAnnotationTutorials](https://jorainer.github.io/MetaboAnnotationTutorials),
-or also in [@rainer_modular_2022].
+or also in [@rainer_modular_2022]. For information how to handle and (parallel)
"For information on how" maybe
thanks!
Thanks for the reviews @andreavicini and @philouail! I will merge now after having another look myself.
@jorainer sorry, I didn't review the code, but a small suggestion anyway: if we add a new slot to the `Spectra` class we should increment the class version.
Hi Sebastian @sgibb, thanks for the suggestion. I'll increment the class version. I ensured backward compatibility through the accessor function, which checks whether the object has the slot and, if not, returns the default `Inf`. I would maybe not bump the minor version of the package, to not interfere with the Bioconductor versioning?
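For illustration, a minimal sketch (not the actual package code) of the backward-compatibility pattern described above, i.e. an accessor that also works for objects serialized before the new slot existed:

```r
## Hypothetical accessor sketch: return the chunk size if the (new) slot
## is present, otherwise fall back to the default Inf (no chunking).
chunkSizeCompat <- function(object) {
    if (methods::.hasSlot(object, "processingChunkSize"))
        object@processingChunkSize
    else Inf
}
```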
This PR fixes issue #304. In brief: it adds the possibility for the user to set and define chunk-wise processing of a `Spectra`. This will affect all functions working on peaks data (e.g. even `lengths`, `mz`, `peaksData`) and ensures that even large-scale data can be handled, reducing out-of-memory errors.

What this PR adds:

- `Spectra` gains a new slot `@processingChunkSize`.
- `processingChunkSize` and `processingChunkSize<-` to get or set the size of chunks for chunk-wise processing. The default is `Inf`, hence no chunk-wise processing is performed (important e.g. for small data sets or in-memory backends).
- A `backendParallelFactor,MsBackend` method: this allows backends to suggest a preferred splitting of the data into chunks. The default is to return `factor()` (i.e. no preferred splitting); `MsBackendMzR` on the other hand returns a `factor` based on the `"dataStorage"` spectra variable (hence suggesting a split by original data file).
- The `peaksapply` function uses either the chunks defined through `processingChunkSize` for chunk-wise processing or, if that is not set, the splitting suggested by the backend (through `backendParallelFactor`).
- `Spectra` objects are split using the `processingChunkFactor` function, which returns a `factor` representing the chunks (defined through `processingChunkSize`) or, if that is not set, the suggested splitting (through `backendParallelFactor`) or `factor()`, in which case no chunk-wise processing is performed.

This processing is used for all `Spectra` methods accessing (or processing) peaks data. To avoid performance loss for small data sets or in-memory backends it is not performed by default. If enabled by the user, it allows processing even large data sets. A short usage sketch is included below.

I think this is a very important improvement allowing the analysis of large (on-disk) data, for which we ran into unexpected issues (see #304).
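A short usage sketch, assuming `sps` is a `Spectra` object with an `MsBackendMzR` backend and that the chosen chunk size is just an example value:

```r
## Without a chunk size set, splitting falls back to the backend's
## suggestion (backendParallelFactor(), i.e. one chunk per data file
## for MsBackendMzR).
head(table(processingChunkFactor(sps)))

## Enable chunk-wise processing with at most 5000 spectra per chunk.
processingChunkSize(sps) <- 5000

## Peaks data functions such as lengths(), mz() or peaksData() now
## realize only one chunk in memory at a time.
npeaks <- lengths(sps)
```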
Happy to discuss @sgibb @lgatto @philouail.