Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Remove Peaks #320

Merged
merged 4 commits into from
Mar 21, 2018
Merged

Remove Peaks #320

merged 4 commits into from
Mar 21, 2018

Conversation

sgibb
Copy link
Collaborator

@sgibb sgibb commented Mar 18, 2018

This is a work-in-progress PR.
After IRanges were replaced by IRangesList in utils.removePeaks in d519ec5 I recognized a large slow down of this function. So I started to rewrite this function without IRanges (in favour of #201 I removed IRanges dependency completely, avoiding IRanges is faster than the recently merged solution by @hpages but uses a for loop again, see below).

library("MSnbase")
library("IRanges")
library("rbenchmark")


utils.removePeaksIRangesList <- function(int, t) {
    peakRanges <- as(IRanges(int > 0L), "IRangesList")
    sapply(peakRanges, function(x) {
        idx <- start(x):end(x)
        ## we get the indices of every peak in int
        if(all(int[idx]<=t))
            int[idx] <<- 0
        })
    int
}

utils.removePeaksIRanges <- function(int, t) {
    peakRanges <- IRanges(int > 0L)
    s <- start(peakRanges)
    e <- end(peakRanges)

    for (i in seq_along(s)) {
        idx <- s[[i]]:e[[i]]
        if (all(int[idx] <= t)) {
            int[idx] <- 0L
        }
    }
    int
}

utils.removePeaksIRangesOverlap <- function(int, t) {
    peakRanges <- IRanges(int > 0L)
    aboveThresh <- IRanges(int > t)
    belowThresh <- peakRanges %outside% aboveThresh
    s <- start(peakRanges[belowThresh])
    e <- end(peakRanges[belowThresh])
    int[unlist(mapply(seq, s, e))] <- 0L
    int
}

utils.removePeaksIRangesExtractList <- function(int, t) {
    peakRanges <- as(int > 0, "IRanges")
    peak_is_too_low <- max(extractList(int, peakRanges)) <= t
    replaceROWS(int, peakRanges[peak_is_too_low], 0L)
}

utils.removePeaksDiff <- function(int, t) {
    d <- diff(int > 0L)
    s <- which(d == 1L)
    e <- which(d == -1L)
    for (i in seq_along(s)) {
        j <- s[[i]]:e[[i]]
        if (all(int[j] <= t)) {
            int[j] <- 0L
        }
    }
    int
}

data(itraqdata)

int <- intensity(itraqdata)
thr <- 2e5


all.equal(utils.removePeaksIRangesList(int[[1]], thr),
          utils.removePeaksIRanges(int[[1]], thr))
# [1] TRUE
all.equal(utils.removePeaksIRanges(int[[1]], thr),
          utils.removePeaksIRangesOverlap(int[[1]], thr))
# [1] TRUE
all.equal(utils.removePeaksIRangesOverlap(int[[1]], thr),
          utils.removePeaksIRangesExtractList(int[[1]], thr))
# [1] TRUE
all.equal(utils.removePeaksIRangesExtractList(int[[1]], thr),
          utils.removePeaksDiff(int[[1]], thr))
# [1] TRUE

b <- lapply(int[1:5], function(i) {
    benchmark(IRangesList=utils.removePeaksIRangesList(i, thr),
              IRanges=utils.removePeaksIRanges(i, thr),
              Overlap=utils.removePeaksIRangesOverlap(i, thr),
              ExtractList=utils.removePeaksIRangesExtractList(i, thr),
              diff=utils.removePeaksDiff(i, thr),
              columns=c("test", "elapsed", "relative"),
              replications=10,
              order="relative")
})
b
# $X1
#          test elapsed relative
# 5        diff   0.002      1.0
# 2     IRanges   0.004      2.0
# 3     Overlap   0.028     14.0
# 4 ExtractList   0.029     14.5
# 1 IRangesList   0.937    468.5
# 
# $X10
#          test elapsed relative
# 5        diff   0.002      1.0
# 2     IRanges   0.004      2.0
# 3     Overlap   0.026     13.0
# 4 ExtractList   0.028     14.0
# 1 IRangesList   0.833    416.5
# 
# $X11
#          test elapsed relative
# 5        diff   0.002      1.0
# 2     IRanges   0.004      2.0
# 3     Overlap   0.025     12.5
# 4 ExtractList   0.029     14.5
# 1 IRangesList   0.781    390.5
# 
# $X12
#          test elapsed relative
# 5        diff   0.002      1.0
# 2     IRanges   0.005      2.5
# 3     Overlap   0.027     13.5
# 4 ExtractList   0.028     14.0
# 1 IRangesList   1.162    581.0
# 
# $X13
#          test elapsed relative
# 5        diff   0.003    1.000
# 2     IRanges   0.004    1.333
# 3     Overlap   0.029    9.667
# 4 ExtractList   0.029    9.667
# 1 IRangesList   1.264  421.333

I also wanted to address the issue #252 with this PR but unfortunately I won't have time the next days (maybe weeks) to finish this PR. If anybody wants to take over please feel free to do so.

@lgatto
Copy link
Owner

lgatto commented Mar 18, 2018

That's great, thanks. I won't have time either, but no hurry. Let me know if you prefer this PR to be merged now wait for #252.

@jorainer
Copy link
Collaborator

Nice @sgibb 👍

@sgibb
Copy link
Collaborator Author

sgibb commented Mar 21, 2018

@lgatto Because I won't be able to finish this in the near future and the current PR is an (non-destructive) improvement I merge it. But I am afraid nobody will work on #252 then ...

@sgibb sgibb changed the title [WIP] Remove Peaks Remove Peaks Mar 21, 2018
@sgibb sgibb merged commit 1360d3e into master Mar 21, 2018
@sgibb sgibb deleted the removePeaks-issue252 branch March 21, 2018 20:22
sgibb added a commit that referenced this pull request Mar 21, 2018
@lgatto
Copy link
Owner

lgatto commented Mar 21, 2018

Thanks @sgibb - I won't be able to commit time to #252 at the moment. Hopefully later though.

@hpages
Copy link
Contributor

hpages commented Mar 22, 2018

Be aware that utils.removePeaksDiff() doesn't exactly preserve the semantic of the original utils.removePeaks():

## Original utils.removePeaks() (slightly modified to work with BioC 3.7):
utils.removePeaks <- function(int, t) {
  peakRanges <- IRanges(sapply(int,">",0))
  sapply(seq_along(peakRanges),function(i) {
    x <- IRanges:::unlist_as_integer(peakRanges[i])
    ## we get the indices of every peak in int
    if(all(int[x]<=t))
      int[x] <<- 0
  })
  return(int)
}
> int1 <- c(2.5, -0.3, 0, 3.1, 4.2, -1, 2.2, 0)
> utils.removePeaks(int1, 2.5)
[1]  0.0 -0.3  0.0  3.1  4.2 -1.0  0.0  0.0
> utils.removePeaksDiff(int1, 2.5)
[1]  0.0  0.0  0.0  3.1  4.2 -1.0  2.2  0.0

Also it sometimes fails (the original utils.removePeaks() would fail only when the input is a zero-length numeric vector):

> int2 <- c(0, 0, 2.5, -0.3, 3.1, 4.2, -1, 2.2)
> utils.removePeaks(int2, 0.1)
[1]  0.0  0.0  2.5 -0.3  3.1  4.2 -1.0  2.2
> utils.removePeaksDiff(int2, 0.1)
Error in e[[i]] : subscript out of bounds

Note that with utils.removePeaksIRangesExtractList() you pay a fixed overhead for using S4 objects. That overhead penalizes any S4-based solution vs a non-S4-based solution when the input is small to medium-size. However it becomes neglectable when the input is big:

> system.time(res3a <- utils.removePeaksDiff(int3, 0.75))
   user  system elapsed 
 10.148   1.036  11.184 
> system.time(res3b <- utils.removePeaksIRangesExtractList(int3, 0.75))
   user  system elapsed 
  1.998   0.564   2.563 

H.

@lgatto
Copy link
Owner

lgatto commented Mar 22, 2018

Thanks @hpages for following up!

sgibb added a commit that referenced this pull request Mar 26, 2018
Revokes:
- c41e4c4
- 501c22d
- 53ba540

Co-authored-by: Hervé Pagès <hpages@fredhutch.org>
@sgibb
Copy link
Collaborator Author

sgibb commented Mar 26, 2018

@hpages thanks for this great comment! Indeed the utils.removePeaksDiff implementation contains a bug. A fixed version would be

utils.removePeaks <- function(int, t) {
    int <- c(0L, int, 0L)
    d <- diff(int > 0L)
    s <- which(d == 1L) + 1L
    e <- which(d == -1L)
    for (i in seq_along(s)) {
        j <- s[[i]]:e[[i]]
        if (all(int[j] <= t)) {
            int[j] <- 0L
        }
    }
    int[seq_len(length(int) - 2L) + 1L]
}

But you are right that the diff/for loop concept is just faster because it doesn't need to do S4-method-dispatch. I revoke all my chances (because of all the merge from bioc/github etc. it is easier to add another commit: ) and reapply your IRanges based implementation (but I removed the snake_case_variable_name 😉). Thanks for enlighten me.

@lgatto and @jotsetung sorry for this stupid PR and all its inconvenience (and I said it would be a non-destructive PR ...)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

4 participants