compareSpectra to report also the number of matching peaks #350

jorainer · 2025-02-11T08:23:27Z

compareSpectra() performs the spectra similarity calculations in a two-step approach: first it matches (maps) peaks between the compared spectra and then it calculates the similarity of these. As a result, compareSpectra() returns a numeric matrix with the similarity score for all pairs of spectra. It might however also be interesting to report in addition the number of matched peaks between the compared spectra. The result could then be a numeric array with 3 dimensions, so that sim[ , , 1L] would be the similarity scores and sim[, , 2L] the number of common peaks between the matched spectra.

The text was updated successfully, but these errors were encountered:

jorainer · 2025-02-11T08:25:38Z

A function that returns such a 3d array could be:

#' report results as array with 3 dimensions, the 3rd being the number of
#' matching peaks
.peaks_compare_npeaks <- function(x, y, MAPFUN = joinPeaks, tolerance = 0,
                                  ppm = 20, FUN = ndotproduct, xPrecursorMz,
                                  yPrecursorMz, ...) {
    mat <- array(NA_real_, dim = c(length(x), length(y), 2),
                 dimnames = list(names(x), names(y), c("score", "peak_count")))
    for (i in seq_along(x)) {
        for (j in seq_along(y)) {
            peak_map <- MAPFUN(x[[i]], y[[j]],
                               tolerance = tolerance, ppm = ppm,
                               xPrecursorMz = xPrecursorMz[i],
                               yPrecursorMz = yPrecursorMz[j],
                               .check = FALSE, ...)
            mat[i, j, 1L] <- FUN(peak_map[[1L]], peak_map[[2L]],
                                 xPrecursorMz = xPrecursorMz[i],
                                 yPrecursorMz = yPrecursorMz[j],
                                 ppm = ppm, tolerance = tolerance, ...)
            mat[i, j, 2L] <- sum(!is.na(peak_map[[1L]][, 1L] +
                                        peak_map[[2L]][, 1L])) 
        }
    }
    mat
}

This is essentially the same as .peaks_compare() but it reports in addition the number of mapped peaks.

jorainer · 2025-02-11T08:28:05Z

And the equivalent (updated) function to calculate similarities (modified from .compare_spectra_chunk()) could be:

.compare_spectra_chunk_npeaks <- function(x, y = NULL, MAPFUN = joinPeaks,
                                          tolerance = 0, ppm = 20,
                                          FUN = ndotproduct,
                                          chunkSize = 10000, ...,
                                          BPPARAM = bpparam(),
                                          peakCount = FALSE) {
    nx <- length(x)
    ny <- length(y)
    if (peakCount) {
        compare_fun <- .peaks_compare_npeaks
        dn <- list(spectraNames(x), spectraNames(y),
                   c("score", "peak_count"))
        dims <- c(nx, ny, 2L)
        z_chunk <- 1:2
    } else {
        compare_fun <- Spectra:::.peaks_compare
        dn <- list(spectraNames(x), spectraNames(y))
        dims <- c(nx, ny, 1L)
        z_chunk <- 1L
    }
    bppx <- backendBpparam(x@backend, BPPARAM)
    if (is(bppx, "SerialParam"))
        fx <- NULL
    else fx <- backendParallelFactor(x@backend)
    pqvx <- Spectra:::.processingQueueVariables(x)
    bppy <- backendBpparam(y@backend, BPPARAM)
    if (is(bppy, "SerialParam"))
        fy <- NULL
    else fy <- backendParallelFactor(y@backend)
    pqvy <- Spectra:::.processingQueueVariables(y)
    if (nx <= chunkSize && ny <= chunkSize) {
        mat <- compare_fun(
            Spectra:::.peaksapply(x, f = fx, spectraVariables = pqvx, BPPARAM = bppx),
            Spectra:::.peaksapply(y, f = fy, spectraVariables = pqvy, BPPARAM = bppy),
            MAPFUN = MAPFUN, tolerance = tolerance, ppm = ppm,
            FUN = FUN, xPrecursorMz = precursorMz(x),
            yPrecursorMz = precursorMz(y), ...)
            dimnames(mat) <- dn
        return(mat)
    }

    x_idx <- seq_len(nx)
    x_chunks <- split(x_idx, ceiling(x_idx / chunkSize))
    rm(x_idx)

    y_idx <- seq_len(ny)
    y_chunks <- split(y_idx, ceiling(y_idx / chunkSize))
    rm(y_idx)

    mat <- array(NA_real_, dim = dims, dimnames = dn)
    for (x_chunk in x_chunks) {
        x_buff <- x[x_chunk]
        if (length(fx))
            fxc <- backendParallelFactor(x_buff@backend)
        else fxc <- NULL
        x_peaks <- Spectra:::.peaksapply(x_buff, f = fxc, spectraVariables = pqvx,
                               BPPARAM = bppx)
        for (y_chunk in y_chunks) {
            y_buff <- y[y_chunk]
            if (length(fy))
                fyc <- backendParallelFactor(y_buff@backend)
            else fyc <- NULL
            mat[x_chunk, y_chunk, z_chunk] <- compare_fun(
                x_peaks, Spectra:::.peaksapply(
                                       y_buff, f = fyc, spectraVariables = pqvy,
                                       BPPARAM = bppy),
                MAPFUN = MAPFUN, tolerance = tolerance, ppm = ppm,
                FUN = FUN, xPrecursorMz = precursorMz(x_buff),
                yPrecursorMz = precursorMz(y_buff), ...)
        }
    }
    if (peakCount)
        mat
    else as.matrix(mat[, , 1L])
}

i.e. the function would report an array if peakCount = TRUE and a matrix for peakCount = FALSE.

jorainer · 2025-02-11T09:25:11Z

Example use:

library(msdata)
library(Spectra)
library(MsCoreUtils)
s <- Spectra(system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML",
                         package = "msdata"))
s <- setBackend(s, MsBackendMemory())
s <- filterMsLevel(s, 2L)

The result array:

> res <- .compare_spectra_chunk_npeaks(s[1:5], s[1:5], tolerance = 0.1,
                                     peakCount = TRUE)
> res
, , score

            58        106        108 186 201
58  1.00000000 0.03261724 0.12798950   0   0
106 0.03261724 1.00000000 0.07398183   0   0
108 0.12798950 0.07398183 1.00000000   0   0
186 0.00000000 0.00000000 0.00000000   1   0
201 0.00000000 0.00000000 0.00000000   0   1

, , peak_count

    58 106 108 186 201
58   3   1   1   0   0
106  1   3   1   0   0
108  1   1   3   0   0
186  0   0   0   3   0
201  0   0   0   0   1

so, res[ , , 1] has the scores and res[, , 2] the number of matching peaks.

jorainer · 2025-02-11T09:30:35Z

The performance of the new function is comparable to the compareSpectra():

library(microbenchmark)
microbenchmark(
    .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1),
    .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1, peakCount = TRUE),
    compareSpectra(s[1:50], s[1:50], tolerance = 0.1)
)
Unit: milliseconds
                                                                                    expr      min       lq     mean
                        .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1) 29.71174 30.12260 33.59158
 .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1,      peakCount = TRUE) 33.33027 33.69567 35.15999
                                       compareSpectra(s[1:50], s[1:50], tolerance = 0.1) 29.65729 30.13398 32.43487
   median       uq       max neval cld
 30.39748 31.26644 205.87626   100   a
 34.02357 34.85042  45.53969   100   a
 30.45295 34.80112  41.20724   100   a

So, calculating in addition to the score also the number of matching peaks is a bit slower.

And using chunk-wise processing with chunkSize = 10:

microbenchmark(
    .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1, chunkSize = 10),
    .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1, chunkSize = 10, peakCount = TRUE),
    compareSpectra(s[1:50], s[1:50], tolerance = 0.1, chunkSize = 10)
)
Unit: milliseconds
                                                                                                    expr      min
                   .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1,      chunkSize = 10) 36.86689
 .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1,      chunkSize = 10, peakCount = TRUE) 40.69225
                                       compareSpectra(s[1:50], s[1:50], tolerance = 0.1, chunkSize = 10) 36.73441
       lq     mean   median       uq      max neval cld
 37.34624 39.27973 37.73616 39.02049 50.96747   100  a 
 41.08257 43.19608 41.53456 43.22486 57.66129   100   b
 37.13597 39.88893 37.87106 44.12141 54.28043   100  a

Again, doing the additional peak counts is slower, but the performance of the default method is comparable.

jorainer · 2025-02-11T09:35:02Z

@mar-garcia , @michaelwitting @Adafede @philouail : would that be something you would find helpful (i.e. reporting in addition the number of matching peaks)?

michaelwitting · 2025-02-11T09:56:11Z

I often use the number of matched peaks as an additional metric for quality control of match results. Therefore, thumbs up from my side!

philouail · 2025-02-11T09:56:27Z

looks great to me, and being able to later on filter the similarity scores based on the number of matched peaks sounds good !

I would maybe rename the peakCount = variable to something more specific, such as matchedPeakCount = ?

mar-garcia · 2025-02-11T12:00:28Z

Sounds good also for me! I also think it's a good/interesting point to be able to consider the number of matched peaks in addition to the similarity score itself.

Adafede · 2025-02-11T21:21:49Z

Thank you for the ping @jorainer!

I totally support it! I already added the matched peaks in addition to similarity in https://github.com/taxonomicallyinformedannotation/tima/blob/main/R/calculate_entropy_and_similarity.R, also with the entropy of the spectra as (another) QC.

- Add parameter `matchedPeaksCount` to `compareSpectra()` to allow reporting of the number of matching peaks between the compared spectra (issue #350).

jorainer self-assigned this Feb 11, 2025

jorainer added the enhancement New feature or request label Feb 11, 2025

jorainer added a commit that referenced this issue Feb 19, 2025

refactor: report also matching peak count in compareSpectra

c1b3016

- Add parameter `matchedPeaksCount` to `compareSpectra()` to allow reporting of the number of matching peaks between the compared spectra (issue #350).

jorainer mentioned this issue Feb 19, 2025

Add fillCoreSpectraVariables utility function and report number of matching peaks with compareSpectra #351

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

compareSpectra to report also the number of matching peaks #350

compareSpectra to report also the number of matching peaks #350

jorainer commented Feb 11, 2025

jorainer commented Feb 11, 2025

jorainer commented Feb 11, 2025

jorainer commented Feb 11, 2025

jorainer commented Feb 11, 2025

jorainer commented Feb 11, 2025 •

edited

Loading

michaelwitting commented Feb 11, 2025

philouail commented Feb 11, 2025

mar-garcia commented Feb 11, 2025

Adafede commented Feb 11, 2025

compareSpectra to report also the number of matching peaks #350

compareSpectra to report also the number of matching peaks #350

Comments

jorainer commented Feb 11, 2025

jorainer commented Feb 11, 2025

jorainer commented Feb 11, 2025

jorainer commented Feb 11, 2025

jorainer commented Feb 11, 2025

jorainer commented Feb 11, 2025 • edited Loading

michaelwitting commented Feb 11, 2025

philouail commented Feb 11, 2025

mar-garcia commented Feb 11, 2025

Adafede commented Feb 11, 2025

jorainer commented Feb 11, 2025 •

edited

Loading