Performance issue with matching functions #93
As a side note, here is a benchmark I did regarding the different backends:
Hi Adriano,
yes, it should be - but it would still be nice to know where exactly the bottleneck is...
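One way to locate the bottleneck would be to profile a single `matchSpectra()` call with base R's `Rprof()`; a minimal sketch, assuming `query`, `target` and `csp` are defined as in the benchmark code further below:

```r
## Profile one matchSpectra() run to see where the time is spent
## (object names are taken from the benchmark code further below).
Rprof("matchSpectra.prof")
res <- matchSpectra(query, target, param = csp)
Rprof(NULL)

## Summarise time per function; the "by.total" table shows which
## internal calls dominate the run time.
head(summaryRprof("matchSpectra.prof")$by.total)
```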
Just tried both options and observed no difference.
Just a general observation: you always use fully qualified calls (`package::function()`). Also, a question: the data that you provide has 7897 query spectra and 289429 target spectra, right? Is this already the subset you mentioned above? I'm just asking because it takes me less than 5 minutes to do the matching.
I simply find it easier to read; there might indeed be some additional performance improvement from removing it. For now I just tried to make things as explicit as possible, so mapping where functions come from also looked like a good idea. Yes, this is the number of spectra. Are you matching all of them in < 5 min?
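On the side question of whether the fully qualified `package::function()` calls cost anything: the `::` lookup does add a small per-call overhead, which can be checked with a quick sketch like this (the example function is arbitrary):

```r
library(microbenchmark)

x <- rnorm(1000)

## The `::` lookup adds a small, constant overhead per call; it only
## matters for functions called many thousands of times in a tight loop.
microbenchmark(
    stats::median(x),
    median(x),
    times = 1000
)
```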
Well, yes, I'm running the 7897 query spectra against all 289429 target spectra in a little less than 5 minutes - that's why I was asking if that is what you considered slow. How long does it take for you to run the same data set?
Hmmm, interesting. Well, I guess we should split things: with such a big MGF, loading it is what takes the most time. In the example above, loading the target MGF took 5 minutes. For this reason, what I was doing was converting it to SQL once, so I could then always access it faster. The matching part I just re-did also takes about 3 minutes (with params_1) and 12 minutes (with params_2), so something around 17 minutes in total.

When doing it with https://github.com/mandelbrot-project/spectral_lib_matcher, the first thing I do is the same: saving the big MGF once as pickle to access it faster later on, and the whole process (loading + matching) takes 2 minutes (approx. 1 minute loading + cleaning and 1 minute matching).

I somehow got the feeling there is some kind of "sweet spot" between loading slowly with MGFBackend but then matching faster, and loading with SQLBackend but matching slower. I was probably overlooking things as I did not think of mixing both. Might it be that the most efficient solution is a hybrid: loading as SQL, MsBackendMemory as suggested above (which I did not think about until now), and then matching?
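A rough sketch of that hybrid idea (assuming a Spectra version that provides `MsBackendMemory`; file names are placeholders): import the MGF once, move the data into memory, and cache the result so later sessions skip the slow MGF parsing.

```r
library(Spectra)
library(MsBackendMgf)

## One-off: slow import from MGF, then move all data into memory.
target <- Spectra("target.mgf", source = MsBackendMgf())
target_mem <- setBackend(target, MsBackendMemory())

## Cache the in-memory object; reading the RDS in later sessions is
## much faster than re-parsing the MGF file.
saveRDS(target_mem, "target_mem.rds")

## Later sessions: load the cached object and match directly.
target_mem <- readRDS("target_mem.rds")
```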
Ah, well, let's not mix things. I would first focus here on the matching itself - once the data is loaded. Yes, importing from MGF takes a long time, but let's fix that later. Also, I would first focus on one matching parameter setting only. The different versions have different performance, but that's obvious (e.g. the forward-reverse matching needs to process each query twice). I will try to tune a bit and then post the results here.
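For context on the "process each query twice" point: `MatchForwardReverseParam` computes, in addition to the forward similarity, a reverse score, so it is expected to be slower than a plain `CompareSpectraParam`. A sketch of the two settings (thresholds are only examples, assuming both constructors accept the same core arguments):

```r
library(MetaboAnnotation)

## Forward-only similarity.
csp <- CompareSpectraParam(
    ppm = 10,
    tolerance = 0.02,
    requirePrecursor = TRUE,
    THRESHFUN = function(x) which(x >= 0.6)
)

## Also computes the reverse score, hence roughly double the work per query.
mfr <- MatchForwardReverseParam(
    ppm = 10,
    tolerance = 0.02,
    requirePrecursor = TRUE,
    THRESHFUN = function(x) which(x >= 0.6)
)
```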
OK, first performance check. Below the data import (query and target were downloaded first with the scripts above) and the settings.

```r
library(MetaboAnnotation)
library(MsCoreUtils)
library(MsBackendMgf)
library(microbenchmark)

query <- Spectra("query.mgf", source = MsBackendMgf())
target <- Spectra("target.mgf", source = MsBackendMgf())

csp <- CompareSpectraParam(
    ppm = 10,
    tolerance = 0.02,
    requirePrecursor = TRUE,
    THRESHFUN = function(x) {
        which(x >= 0.6)
    }
)

microbenchmark(
    matchSpectra(query, target, param = csp),
    times = 5
)
```

```
Unit: seconds
                                      expr      min       lq    mean   median       uq      max neval
 matchSpectra(query, target, param = csp) 361.4349 364.0958 374.273 368.2352 388.1526 389.4467     5
```

Thus it takes about 6 minutes - not ideal. The same with a different setup:

```
Unit: seconds
                                      expr      min       lq     mean   median       uq      max neval
 matchSpectra(query, target, param = csp) 113.8469 116.7591 116.9893 117.0349 117.1121 120.1934     5
```

Next I'll check some different backends.
**Parallel processing**

Parallel processing can also help to improve the performance of `matchSpectra()`.

```r
library(BiocParallel)

p2 <- MulticoreParam(2)
p3 <- MulticoreParam(3)
p4 <- MulticoreParam(4)

microbenchmark(
    matchSpectra(query_mem, target_mem, param = csp, BPPARAM = SerialParam()),
    matchSpectra(query_mem, target_mem, param = csp, BPPARAM = p2),
    matchSpectra(query_mem, target_mem, param = csp, BPPARAM = p3),
    matchSpectra(query_mem, target_mem, param = csp, BPPARAM = p4),
    times = 5
)
```

```
Unit: seconds
                                                                       expr      min       lq     mean   median       uq      max neval cld
 matchSpectra(query_mem, target_mem, param = csp, BPPARAM = SerialParam()) 37.31365 38.12915 38.62866 39.01555 39.06048 39.62445     5 a
            matchSpectra(query_mem, target_mem, param = csp, BPPARAM = p2) 22.31075 22.33289 22.92925 22.73671 23.46847 23.79745     5  b
            matchSpectra(query_mem, target_mem, param = csp, BPPARAM = p3) 16.65630 16.99816 17.71939 18.04966 18.38344 18.50940     5   c
            matchSpectra(query_mem, target_mem, param = csp, BPPARAM = p4) 13.57029 13.57957 14.35026 14.30086 14.76527 15.53531     5    d
```

Thus, parallel processing can improve performance.
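One portability note on the `BPPARAM` choice: forking via `MulticoreParam()` is not available on Windows, so a sketch that picks a multi-process backend per platform could look like this (the worker count is an assumption):

```r
library(BiocParallel)

## MulticoreParam() relies on forking, which Windows does not support;
## SnowParam() is the portable multi-process alternative.
p <- if (.Platform$OS.type == "windows") SnowParam(4) else MulticoreParam(4)

res <- matchSpectra(query_mem, target_mem, param = csp, BPPARAM = p)
```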
**Summary**

Next to the settings used for the spectral matching, using a more efficient backend (and parallel processing) can considerably improve the matching performance.
I'm keeping this issue open because I think it provides some nice examples of how to improve matching performance.
Hi,
I am trying to regularly match 1000+ query spectra to 100,000+ target spectra.
Currently, this is easily doable in minutes using matchms in Python. I never obtained such results using MetaboAnnotation in R.
Here below is a tentative script trying to describe what I would like to achieve.
It currently uses MsBackendMgf, but this is not the issue. I had to subset the data to remain within reasonable run times (approx. 5 minutes).
Happy to further develop and exchange if needed.
CODE:
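A minimal illustrative sketch of such a query-vs-target matching run with MetaboAnnotation (file names, parameters and the 0.6 similarity threshold are assumptions, not the original script):

```r
library(Spectra)
library(MsBackendMgf)
library(MetaboAnnotation)

## Import query and target libraries from MGF (file names are placeholders).
query  <- Spectra("query.mgf",  source = MsBackendMgf())
target <- Spectra("target.mgf", source = MsBackendMgf())

## Keep only matches with a matching precursor m/z and a score >= 0.6.
csp <- CompareSpectraParam(
    ppm = 10,
    tolerance = 0.02,
    requirePrecursor = TRUE,
    THRESHFUN = function(x) which(x >= 0.6)
)

mtches <- matchSpectra(query, target, param = csp)

## Extract the matching results as a table.
matchedData(mtches)
```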