-
Notifications
You must be signed in to change notification settings - Fork 27
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
compareSpectra to report also the number of matching peaks #350
Comments
A function that returns such a 3d array could be: #' report results as array with 3 dimensions, the 3rd being the number of
#' matching peaks
.peaks_compare_npeaks <- function(x, y, MAPFUN = joinPeaks, tolerance = 0,
ppm = 20, FUN = ndotproduct, xPrecursorMz,
yPrecursorMz, ...) {
mat <- array(NA_real_, dim = c(length(x), length(y), 2),
dimnames = list(names(x), names(y), c("score", "peak_count")))
for (i in seq_along(x)) {
for (j in seq_along(y)) {
peak_map <- MAPFUN(x[[i]], y[[j]],
tolerance = tolerance, ppm = ppm,
xPrecursorMz = xPrecursorMz[i],
yPrecursorMz = yPrecursorMz[j],
.check = FALSE, ...)
mat[i, j, 1L] <- FUN(peak_map[[1L]], peak_map[[2L]],
xPrecursorMz = xPrecursorMz[i],
yPrecursorMz = yPrecursorMz[j],
ppm = ppm, tolerance = tolerance, ...)
mat[i, j, 2L] <- sum(!is.na(peak_map[[1L]][, 1L] +
peak_map[[2L]][, 1L]))
}
}
mat
} This is essentially the same as .peaks_compare() but it reports in addition the number of mapped peaks. |
And the equivalent (updated) function to calculate similarities (modified from .compare_spectra_chunk()) could be: .compare_spectra_chunk_npeaks <- function(x, y = NULL, MAPFUN = joinPeaks,
tolerance = 0, ppm = 20,
FUN = ndotproduct,
chunkSize = 10000, ...,
BPPARAM = bpparam(),
peakCount = FALSE) {
nx <- length(x)
ny <- length(y)
if (peakCount) {
compare_fun <- .peaks_compare_npeaks
dn <- list(spectraNames(x), spectraNames(y),
c("score", "peak_count"))
dims <- c(nx, ny, 2L)
z_chunk <- 1:2
} else {
compare_fun <- Spectra:::.peaks_compare
dn <- list(spectraNames(x), spectraNames(y))
dims <- c(nx, ny, 1L)
z_chunk <- 1L
}
bppx <- backendBpparam(x@backend, BPPARAM)
if (is(bppx, "SerialParam"))
fx <- NULL
else fx <- backendParallelFactor(x@backend)
pqvx <- Spectra:::.processingQueueVariables(x)
bppy <- backendBpparam(y@backend, BPPARAM)
if (is(bppy, "SerialParam"))
fy <- NULL
else fy <- backendParallelFactor(y@backend)
pqvy <- Spectra:::.processingQueueVariables(y)
if (nx <= chunkSize && ny <= chunkSize) {
mat <- compare_fun(
Spectra:::.peaksapply(x, f = fx, spectraVariables = pqvx, BPPARAM = bppx),
Spectra:::.peaksapply(y, f = fy, spectraVariables = pqvy, BPPARAM = bppy),
MAPFUN = MAPFUN, tolerance = tolerance, ppm = ppm,
FUN = FUN, xPrecursorMz = precursorMz(x),
yPrecursorMz = precursorMz(y), ...)
dimnames(mat) <- dn
return(mat)
}
x_idx <- seq_len(nx)
x_chunks <- split(x_idx, ceiling(x_idx / chunkSize))
rm(x_idx)
y_idx <- seq_len(ny)
y_chunks <- split(y_idx, ceiling(y_idx / chunkSize))
rm(y_idx)
mat <- array(NA_real_, dim = dims, dimnames = dn)
for (x_chunk in x_chunks) {
x_buff <- x[x_chunk]
if (length(fx))
fxc <- backendParallelFactor(x_buff@backend)
else fxc <- NULL
x_peaks <- Spectra:::.peaksapply(x_buff, f = fxc, spectraVariables = pqvx,
BPPARAM = bppx)
for (y_chunk in y_chunks) {
y_buff <- y[y_chunk]
if (length(fy))
fyc <- backendParallelFactor(y_buff@backend)
else fyc <- NULL
mat[x_chunk, y_chunk, z_chunk] <- compare_fun(
x_peaks, Spectra:::.peaksapply(
y_buff, f = fyc, spectraVariables = pqvy,
BPPARAM = bppy),
MAPFUN = MAPFUN, tolerance = tolerance, ppm = ppm,
FUN = FUN, xPrecursorMz = precursorMz(x_buff),
yPrecursorMz = precursorMz(y_buff), ...)
}
}
if (peakCount)
mat
else as.matrix(mat[, , 1L])
} i.e. the function would report an |
Example use: library(msdata)
library(Spectra)
library(MsCoreUtils)
s <- Spectra(system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML",
package = "msdata"))
s <- setBackend(s, MsBackendMemory())
s <- filterMsLevel(s, 2L) The result array: > res <- .compare_spectra_chunk_npeaks(s[1:5], s[1:5], tolerance = 0.1,
peakCount = TRUE)
> res
, , score
58 106 108 186 201
58 1.00000000 0.03261724 0.12798950 0 0
106 0.03261724 1.00000000 0.07398183 0 0
108 0.12798950 0.07398183 1.00000000 0 0
186 0.00000000 0.00000000 0.00000000 1 0
201 0.00000000 0.00000000 0.00000000 0 1
, , peak_count
58 106 108 186 201
58 3 1 1 0 0
106 1 3 1 0 0
108 1 1 3 0 0
186 0 0 0 3 0
201 0 0 0 0 1
so, |
The performance of the new function is comparable to the library(microbenchmark)
microbenchmark(
.compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1),
.compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1, peakCount = TRUE),
compareSpectra(s[1:50], s[1:50], tolerance = 0.1)
)
Unit: milliseconds
expr min lq mean
.compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1) 29.71174 30.12260 33.59158
.compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1, peakCount = TRUE) 33.33027 33.69567 35.15999
compareSpectra(s[1:50], s[1:50], tolerance = 0.1) 29.65729 30.13398 32.43487
median uq max neval cld
30.39748 31.26644 205.87626 100 a
34.02357 34.85042 45.53969 100 a
30.45295 34.80112 41.20724 100 a So, calculating in addition to the score also the number of matching peaks is a bit slower. And using chunk-wise processing with microbenchmark(
.compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1, chunkSize = 10),
.compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1, chunkSize = 10, peakCount = TRUE),
compareSpectra(s[1:50], s[1:50], tolerance = 0.1, chunkSize = 10)
)
Unit: milliseconds
expr min
.compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1, chunkSize = 10) 36.86689
.compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1, chunkSize = 10, peakCount = TRUE) 40.69225
compareSpectra(s[1:50], s[1:50], tolerance = 0.1, chunkSize = 10) 36.73441
lq mean median uq max neval cld
37.34624 39.27973 37.73616 39.02049 50.96747 100 a
41.08257 43.19608 41.53456 43.22486 57.66129 100 b
37.13597 39.88893 37.87106 44.12141 54.28043 100 a Again, doing the additional peak counts is slower, but the performance of the default method is comparable. |
@mar-garcia , @michaelwitting @Adafede @philouail : would that be something you would find helpful (i.e. reporting in addition the number of matching peaks)? |
I often use the number of matched peaks as an additional metric for quality control of match results. Therefore, thumbs up from my side! |
looks great to me, and being able to later on filter the similarity scores based on the number of matched peaks sounds good ! I would maybe rename the |
Sounds good also for me! I also think it's a good/interesting point to be able to consider the number of matched peaks in addition to the similarity score itself. |
Thank you for the ping @jorainer! I totally support it! I already added the matched peaks in addition to similarity in https://github.com/taxonomicallyinformedannotation/tima/blob/main/R/calculate_entropy_and_similarity.R, also with the entropy of the spectra as (another) QC. |
- Add parameter `matchedPeaksCount` to `compareSpectra()` to allow reporting of the number of matching peaks between the compared spectra (issue #350).
compareSpectra()
performs the spectra similarity calculations in a two-step approach: first it matches (maps) peaks between the compared spectra and then it calculates the similarity of these. As a result,compareSpectra()
returns anumeric
matrix
with the similarity score for all pairs of spectra. It might however also be interesting to report in addition the number of matched peaks between the compared spectra. The result could then be anumeric
array
with 3 dimensions, so thatsim[ , , 1L]
would be the similarity scores andsim[, , 2L]
the number of common peaks between the matched spectra.The text was updated successfully, but these errors were encountered: