-
Notifications
You must be signed in to change notification settings - Fork 25
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Memory issues and (potential) chunk-wise processing of peaks data #304
Comments
A possibility would be to add a specific parameter for this to any of the user-facing functions: A parameter |
An example for the high memory requirement: determine the number of peaks per spectrum from a > peakRAM(tmp <- lengths(sps[1:500000]))
Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 tmp<-lengths(sps[1:5e+05]) 63.094 6.4 32650.3 Thus, to complete this simple operation 32GB of memory are required, because the full peaks data is first loaded into memory before the number of peaks is determined. |
After some tests and evaluating the code: passing parameters to individual functions might not be the ideal solution (also because some functions don't allow that, such as An alternative would be to have a global setting processingChunkSize(sps) <- 10000 any subsequent operation that calculates something on the peaks data will then load and process the data in chunks of 10000. We just have to ensure that functions like |
For some functions the chunk-wise processing needs to be disabled:
|
- Refactor the code to decide how to split `Spectra` for parallel processing: it's no longer done automatically by `dataStorage`. - Add a slot to `Spectra` allowing to set a processing chunk size. Issue #304 - With chunk-wise processing only the data of one chunk is realized to memory in each iteration. This also enables to process data in parallel.
- Add and modify functions to enable default chunk-wise processing of peaks data (issue #304). - Split the documentation for chunk-wise (parallel) processing into a separate documentation entry.
This is currently being implemented in the processingChunkSize branch. |
Seems impossible to check/benchmark the memory usage properly from within R (at least I did not find any option). So, I'll perform some more tests locally and check if/when I'm running out of memory and how to prevent that. |
Evaluation and performance comparison of chunk-wise processingHere we evaluate how the chunk-wise processing works on a large data set (300 samples, 516,299 spectra). Memory estimation might not be correct, still we report here run time and (approximated/estimated) memory usage using Performance on two > length(a)
[1] 516299
> b <- a
> processingChunkSize(b) <- 100000 Next we filter the data removing peaks with an intensity lower 300 - this operation is just added to the processing queue. > library(peakRAM)
> peakRAM(
+ a <- filterIntensity(a, intensity = 300),
+ b <- filterIntensity(b, intensity = 300)
+ )
Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 a<-filterIntensity(a,intensity=300) 0.127 0 0.1
2 b<-filterIntensity(b,intensity=300) 0.062 0 0.1 Next we > peakRAM(
+ a <- bin(a, binSize = 0.1),
+ b <- bin(b, binSize = 0.2)
+ )
Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 a<-bin(a,binSize=0.1) 507.889 4.1 58706.8
2 b<-bin(b,binSize=0.2) 301.128 0.8 35783.2 Here, the At last we calculate the tic for each spectrum. This call will require loading the data and executing the processing queue. > peakRAM(
+ ionCount(a),
+ ionCount(b)
+ )
Function_Call Elapsed_Time_sec Total_RAM_Used_MiB Peak_RAM_Used_MiB
1 ionCount(a) 552.087 3.8 55898.9
2 ionCount(b) 366.852 3.9 26741.5 Chunk-wise processing thus reduced both run-time and memory demand. As a side effect, chunk-wise processing would enable also parallel processing, as a set of chunks can then be processed in parallel. |
This is now implemented and activated by default. Closing issue. |
By using a
MsBackendSql
for a very large MS experiment we ran into memory issues (see issue #303 ) processing the data. All operations working on peaks data use the internal.peaksapply
function that by default splits the processing based ondataStorage(x)
and performs the operation in parallel (even if processing might only mean loading the data from the raw data files). For in memory backends and the e.g.MsBackendSql
,dataStorage
will only have a single value, thus no splitting/parallel processing is performed.Advantages of splitting and parallel processing
These advantages are good for very large data sets. The downside of splitting and parallel processing is however also the overhead of splitting and combining the data. Thus, in many cases (e.g. also for in-memory backends) it might actually be good to not do it.
Maybe it would make sense to add a parameter or let the user choose to enable parallel/split processing, even if backends other than
MsBackendMzR
are used.The text was updated successfully, but these errors were encountered: