
Enable chunk-wise processing for all peaks data functions #306

Merged
jorainer merged 10 commits into main on Nov 30, 2023

Conversation

jorainer (Member)

This PR fixes issue #304. In brief, it lets the user configure chunk-wise processing of a Spectra object. This affects all functions working on peaks data (e.g. even lengths, mz or peaksData) and ensures that even large-scale data can be handled, reducing out-of-memory errors.

What this PR adds:

  • Spectra gains a new slot @processingChunkSize.
  • Add functions processingChunkSize and processingChunkSize<- to get or set the size of the chunks for chunk-wise processing. The default is Inf, hence no chunk-wise processing is performed (important e.g. for small data sets or in-memory backends).
  • Add a backendParallelFactor,MsBackend method: this allows backends to suggest a preferred splitting of the data into chunks. The default is to return factor() (i.e. no preferred splitting); MsBackendMzR, on the other hand, returns a factor based on the "dataStorage" spectra variable (hence suggesting to split by original data file).
  • The internal peaksapply function uses either the chunks defined through processingChunkSize for chunk-wise processing or, if that is not set, the splitting suggested by the backend (through backendParallelFactor).
  • The user can check whether and how a Spectra will be split using the processingChunkFactor function, which returns a factor representing the chunks (defined through processingChunkSize), or, if that is not set, the suggested splitting (through backendParallelFactor), or factor(), in which case no chunk-wise processing is performed.

This processing is used for all Spectra methods accessing (or processing) peaks data. To avoid a performance loss for small data sets or in-memory backends it is not performed by default. If enabled by the user, it also allows processing large data.
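
For illustration, usage could look roughly like this (a minimal sketch based on the description above; the msdata example files and variable names are only illustrative, not part of this PR):

```r
library(Spectra)

## Example mzML files from the msdata package, creating a Spectra with an
## on-disk MsBackendMzR backend.
fls <- dir(system.file("sciex", package = "msdata"), full.names = TRUE)
sps <- Spectra(fls, source = MsBackendMzR())

processingChunkSize(sps)        # Inf: no chunk-wise processing by default

## Enable chunk-wise processing: functions working on peaks data (such as
## lengths(), mz() or peaksData()) then load and process the data in chunks
## of at most 1000 spectra, keeping memory use low.
processingChunkSize(sps) <- 1000

## Inspect how the data would be split.
table(processingChunkFactor(sps))

mzs <- mz(sps)
```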

I think this is a very important improvement, allowing the analysis of large (on-disk) data, for which we ran into unexpected issues (see #304).

Happy to discuss @sgibb @lgatto @philouail .

- Refactor the code that decides how to split `Spectra` for parallel processing:
  it is no longer done automatically by `dataStorage` (see the backend sketch
  after this list).
- Add a slot to `Spectra` allowing to set a processing chunk size. Issue #304
- With chunk-wise processing, only the data of one chunk is realized in memory
  in each iteration. This also enables processing the data in parallel.
- Add and modify functions to enable default chunk-wise processing of peaks
  data (issue #304).
- Split the documentation for chunk-wise (parallel) processing into a separate
  documentation entry.
- Use `processingChunkFactor` instead of `dataStorage` in functions to define
  splitting and processing.
- Remove unnecessary functions.
- Add a vignette on parallel/chunk-wise processing.
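
On the backend side, a backend that prefers a particular splitting can define its own backendParallelFactor method. A minimal sketch, assuming a hypothetical backend class `MsBackendMyFormat` (not part of this PR):

```r
library(Spectra)

## Hypothetical backend class "MsBackendMyFormat": suggest splitting the data
## by the file each spectrum is stored in, similar to what MsBackendMzR does
## with the "dataStorage" spectra variable. The default MsBackend method
## returns factor(), i.e. no preferred splitting.
setMethod("backendParallelFactor", "MsBackendMyFormat", function(object, ...) {
    factor(dataStorage(object), levels = unique(dataStorage(object)))
})
```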
@jorainer (Member, Author)

@philouail , I've fixed some more things. Could you please take another careful look? Any questions, concerns, comments or change requests are highly welcome!

I've now also added a vignette describing the parallel processing settings. Please have a look at that too.
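
For example, combining chunk-wise processing with a parallel backend could look roughly like this (a sketch, not the vignette content itself; `sps` is assumed to be a Spectra object with an on-disk backend, e.g. from the sketch further above):

```r
library(Spectra)
library(BiocParallel)

## Register a parallel backend; chunks can then be processed in parallel
## (consider SnowParam() instead of MulticoreParam() on Windows).
register(MulticoreParam(2))

## With a chunk size set, only one chunk of peaks data is realized in
## memory at a time while e.g. extracting intensity values.
processingChunkSize(sps) <- 5000
ints <- intensity(sps)
```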

@philouail (Collaborator) left a comment


This looks super good to me.
The processingChunkSize and processingChunkFactor description is really good.
The vignette with the tips for large data sets is great and straight to the point; I like it.

I made very few comments, and there was one thing I was confused about.

@jorainer marked this pull request as ready for review on November 24, 2023.
@andreavicini (Collaborator) left a comment


Seems very good to me!

@@ -41,7 +41,9 @@
  This vignette provides general examples and descriptions for the *Spectra*
  package. Additional information and tutorials are available, such as
  [SpectraTutorials](https://jorainer.github.io/SpectraTutorials/),
  [MetaboAnnotationTutorials](https://jorainer.github.io/MetaboAnnotationTutorials),
- or also in [@rainer_modular_2022].
+ or also in [@rainer_modular_2022]. For information how to handle and (parallel)
Collaborator comment on this line:

"For information on how" maybe

@jorainer (Member, Author)

thanks!

@jorainer (Member, Author)

Thanks for the reviews @andreavicini and @philouail ! I will merge now, after taking another careful look myself.

@jorainer merged commit 9ac911b into main on Nov 30, 2023. 6 checks passed.
@sgibb (Member) commented on Nov 30, 2023

@jorainer sorry, I didn't review the code, but a small suggestion anyway: if we add a new slot to the Spectra class, this breaks backward compatibility. So I would suggest incrementing the "version" of the class and the minor number of the Version field in the DESCRIPTION file.

@jorainer (Member, Author)

Hi Sebastian @sgibb , thanks for the suggestion. I'll increment the class version. I ensured backward compatibility through the accessor function, which checks whether the object has the slot and, if not, returns Inf (the default slot value). Also, Spectra methods will (automatically) call updateObject if required. So backward compatibility should be guaranteed.

I would maybe not bump the minor version of the package, though, to avoid interfering with the Bioconductor versioning?
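
For illustration, the backward-compatible access described above could look roughly like this (a sketch of the idea with a hypothetical helper name, not the actual package code):

```r
## Return the chunk size also for Spectra objects serialized before the
## @processingChunkSize slot was added, falling back to the default Inf
## instead of throwing an error.
.chunk_size <- function(object) {
    if (methods::.hasSlot(object, "processingChunkSize"))
        object@processingChunkSize
    else Inf
}
```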
