Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

compareSpectra to report also the number of matching peaks #350

Open
jorainer opened this issue Feb 11, 2025 · 9 comments
Open

compareSpectra to report also the number of matching peaks #350

jorainer opened this issue Feb 11, 2025 · 9 comments
Assignees
Labels
enhancement New feature or request

Comments

@jorainer
Copy link
Member

compareSpectra() performs the spectra similarity calculations in a two-step approach: first it matches (maps) peaks between the compared spectra and then it calculates the similarity of these. As a result, compareSpectra() returns a numeric matrix with the similarity score for all pairs of spectra. It might however also be interesting to report in addition the number of matched peaks between the compared spectra. The result could then be a numeric array with 3 dimensions, so that sim[ , , 1L] would be the similarity scores and sim[, , 2L] the number of common peaks between the matched spectra.

@jorainer jorainer self-assigned this Feb 11, 2025
@jorainer jorainer added the enhancement New feature or request label Feb 11, 2025
@jorainer
Copy link
Member Author

A function that returns such a 3d array could be:

#' report results as array with 3 dimensions, the 3rd being the number of
#' matching peaks
.peaks_compare_npeaks <- function(x, y, MAPFUN = joinPeaks, tolerance = 0,
                                  ppm = 20, FUN = ndotproduct, xPrecursorMz,
                                  yPrecursorMz, ...) {
    mat <- array(NA_real_, dim = c(length(x), length(y), 2),
                 dimnames = list(names(x), names(y), c("score", "peak_count")))
    for (i in seq_along(x)) {
        for (j in seq_along(y)) {
            peak_map <- MAPFUN(x[[i]], y[[j]],
                               tolerance = tolerance, ppm = ppm,
                               xPrecursorMz = xPrecursorMz[i],
                               yPrecursorMz = yPrecursorMz[j],
                               .check = FALSE, ...)
            mat[i, j, 1L] <- FUN(peak_map[[1L]], peak_map[[2L]],
                                 xPrecursorMz = xPrecursorMz[i],
                                 yPrecursorMz = yPrecursorMz[j],
                                 ppm = ppm, tolerance = tolerance, ...)
            mat[i, j, 2L] <- sum(!is.na(peak_map[[1L]][, 1L] +
                                        peak_map[[2L]][, 1L])) 
        }
    }
    mat
}

This is essentially the same as .peaks_compare() but it reports in addition the number of mapped peaks.

@jorainer
Copy link
Member Author

And the equivalent (updated) function to calculate similarities (modified from .compare_spectra_chunk()) could be:

.compare_spectra_chunk_npeaks <- function(x, y = NULL, MAPFUN = joinPeaks,
                                          tolerance = 0, ppm = 20,
                                          FUN = ndotproduct,
                                          chunkSize = 10000, ...,
                                          BPPARAM = bpparam(),
                                          peakCount = FALSE) {
    nx <- length(x)
    ny <- length(y)
    if (peakCount) {
        compare_fun <- .peaks_compare_npeaks
        dn <- list(spectraNames(x), spectraNames(y),
                   c("score", "peak_count"))
        dims <- c(nx, ny, 2L)
        z_chunk <- 1:2
    } else {
        compare_fun <- Spectra:::.peaks_compare
        dn <- list(spectraNames(x), spectraNames(y))
        dims <- c(nx, ny, 1L)
        z_chunk <- 1L
    }
    bppx <- backendBpparam(x@backend, BPPARAM)
    if (is(bppx, "SerialParam"))
        fx <- NULL
    else fx <- backendParallelFactor(x@backend)
    pqvx <- Spectra:::.processingQueueVariables(x)
    bppy <- backendBpparam(y@backend, BPPARAM)
    if (is(bppy, "SerialParam"))
        fy <- NULL
    else fy <- backendParallelFactor(y@backend)
    pqvy <- Spectra:::.processingQueueVariables(y)
    if (nx <= chunkSize && ny <= chunkSize) {
        mat <- compare_fun(
            Spectra:::.peaksapply(x, f = fx, spectraVariables = pqvx, BPPARAM = bppx),
            Spectra:::.peaksapply(y, f = fy, spectraVariables = pqvy, BPPARAM = bppy),
            MAPFUN = MAPFUN, tolerance = tolerance, ppm = ppm,
            FUN = FUN, xPrecursorMz = precursorMz(x),
            yPrecursorMz = precursorMz(y), ...)
            dimnames(mat) <- dn
        return(mat)
    }

    x_idx <- seq_len(nx)
    x_chunks <- split(x_idx, ceiling(x_idx / chunkSize))
    rm(x_idx)

    y_idx <- seq_len(ny)
    y_chunks <- split(y_idx, ceiling(y_idx / chunkSize))
    rm(y_idx)

    mat <- array(NA_real_, dim = dims, dimnames = dn)
    for (x_chunk in x_chunks) {
        x_buff <- x[x_chunk]
        if (length(fx))
            fxc <- backendParallelFactor(x_buff@backend)
        else fxc <- NULL
        x_peaks <- Spectra:::.peaksapply(x_buff, f = fxc, spectraVariables = pqvx,
                               BPPARAM = bppx)
        for (y_chunk in y_chunks) {
            y_buff <- y[y_chunk]
            if (length(fy))
                fyc <- backendParallelFactor(y_buff@backend)
            else fyc <- NULL
            mat[x_chunk, y_chunk, z_chunk] <- compare_fun(
                x_peaks, Spectra:::.peaksapply(
                                       y_buff, f = fyc, spectraVariables = pqvy,
                                       BPPARAM = bppy),
                MAPFUN = MAPFUN, tolerance = tolerance, ppm = ppm,
                FUN = FUN, xPrecursorMz = precursorMz(x_buff),
                yPrecursorMz = precursorMz(y_buff), ...)
        }
    }
    if (peakCount)
        mat
    else as.matrix(mat[, , 1L])
}

i.e. the function would report an array if peakCount = TRUE and a matrix for peakCount = FALSE.

@jorainer
Copy link
Member Author

Example use:

library(msdata)
library(Spectra)
library(MsCoreUtils)
s <- Spectra(system.file("TripleTOF-SWATH", "PestMix1_DDA.mzML",
                         package = "msdata"))
s <- setBackend(s, MsBackendMemory())
s <- filterMsLevel(s, 2L)

The result array:

> res <- .compare_spectra_chunk_npeaks(s[1:5], s[1:5], tolerance = 0.1,
                                     peakCount = TRUE)
> res
, , score

            58        106        108 186 201
58  1.00000000 0.03261724 0.12798950   0   0
106 0.03261724 1.00000000 0.07398183   0   0
108 0.12798950 0.07398183 1.00000000   0   0
186 0.00000000 0.00000000 0.00000000   1   0
201 0.00000000 0.00000000 0.00000000   0   1

, , peak_count

    58 106 108 186 201
58   3   1   1   0   0
106  1   3   1   0   0
108  1   1   3   0   0
186  0   0   0   3   0
201  0   0   0   0   1

so, res[ , , 1] has the scores and res[, , 2] the number of matching peaks.

@jorainer
Copy link
Member Author

The performance of the new function is comparable to the compareSpectra():

library(microbenchmark)
microbenchmark(
    .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1),
    .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1, peakCount = TRUE),
    compareSpectra(s[1:50], s[1:50], tolerance = 0.1)
)
Unit: milliseconds
                                                                                    expr      min       lq     mean
                        .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1) 29.71174 30.12260 33.59158
 .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1,      peakCount = TRUE) 33.33027 33.69567 35.15999
                                       compareSpectra(s[1:50], s[1:50], tolerance = 0.1) 29.65729 30.13398 32.43487
   median       uq       max neval cld
 30.39748 31.26644 205.87626   100   a
 34.02357 34.85042  45.53969   100   a
 30.45295 34.80112  41.20724   100   a

So, calculating in addition to the score also the number of matching peaks is a bit slower.

And using chunk-wise processing with chunkSize = 10:

microbenchmark(
    .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1, chunkSize = 10),
    .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1, chunkSize = 10, peakCount = TRUE),
    compareSpectra(s[1:50], s[1:50], tolerance = 0.1, chunkSize = 10)
)
Unit: milliseconds
                                                                                                    expr      min
                   .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1,      chunkSize = 10) 36.86689
 .compare_spectra_chunk_npeaks(s[1:50], s[1:50], tolerance = 0.1,      chunkSize = 10, peakCount = TRUE) 40.69225
                                       compareSpectra(s[1:50], s[1:50], tolerance = 0.1, chunkSize = 10) 36.73441
       lq     mean   median       uq      max neval cld
 37.34624 39.27973 37.73616 39.02049 50.96747   100  a 
 41.08257 43.19608 41.53456 43.22486 57.66129   100   b
 37.13597 39.88893 37.87106 44.12141 54.28043   100  a

Again, doing the additional peak counts is slower, but the performance of the default method is comparable.

@jorainer
Copy link
Member Author

jorainer commented Feb 11, 2025

@mar-garcia , @michaelwitting @Adafede @philouail : would that be something you would find helpful (i.e. reporting in addition the number of matching peaks)?

@michaelwitting
Copy link

I often use the number of matched peaks as an additional metric for quality control of match results. Therefore, thumbs up from my side!

@philouail
Copy link
Collaborator

looks great to me, and being able to later on filter the similarity scores based on the number of matched peaks sounds good !

I would maybe rename the peakCount = variable to something more specific, such as matchedPeakCount = ?

@mar-garcia
Copy link
Contributor

Sounds good also for me! I also think it's a good/interesting point to be able to consider the number of matched peaks in addition to the similarity score itself.

@Adafede
Copy link

Adafede commented Feb 11, 2025

Thank you for the ping @jorainer!

I totally support it! I already added the matched peaks in addition to similarity in https://github.com/taxonomicallyinformedannotation/tima/blob/main/R/calculate_entropy_and_similarity.R, also with the entropy of the spectra as (another) QC.

jorainer added a commit that referenced this issue Feb 19, 2025
- Add parameter `matchedPeaksCount` to `compareSpectra()` to allow reporting
  of the number of matching peaks between the compared spectra (issue #350).
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

No branches or pull requests

5 participants