Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add n.expected.cells estimation to emptyDropsCellRanger #88

Open
jashapiro opened this issue Jul 13, 2022 · 4 comments
Open

Add n.expected.cells estimation to emptyDropsCellRanger #88

jashapiro opened this issue Jul 13, 2022 · 4 comments

Comments

@jashapiro
Copy link

As of Cell Ranger 7.0, the initial expected number of cells is estimated by a fairly simple OrdMag algorithm if not specified by the user, as described in the 10X Gene Expression documentation.

It would be nice if a version of this feature were added to emptyDropsCellRanger() to preserve comparability with Cell Ranger, as well as to improve default performance with libraries of varying sizes.

Tagging @DongzeHE and @rob-p as the primary authors of the existing emptyDropsCellRanger function.

@LTLA
Copy link
Collaborator

LTLA commented Jul 13, 2022

Happy to take a PR from whoever's interested in this. The plain-text description in the 10X docs is incomprehensible to me, but maybe someone else can figure out what is happening there and why.

@jashapiro
Copy link
Author

My read of the algorithm is something like this:

ordMagLoss <- function(x, ordered_totals){
  top <- ordered_totals[1:x]
  cutoff <- quantile(top, 0.99)/10
  ordmag <- sum(ordered_totals > cutoff)
  loss <- (ordmag - x)^2/x
}

Apply that over a range for x and pick the value with the minimum loss.

How to choose and space the range for coverage across a broad range of possible values without wasting too much time seems to be left as an exercise to the reader.

@LTLA
Copy link
Collaborator

LTLA commented Jul 14, 2022

Ah. I figured it was trying to check some kind of self-consistency.

Anyway, a linear-time algorithm should be fairly straightforward, given that the cutoff can only decrease.

ordMagExpected <- function(ordered_totals) {
    cutoff_point <- 1
    loss <- numeric(length(ordered_totals))

    for (i in seq_along(ordered_totals)) {
        # Computing the type-7 quantile.
        quantile_point <- (i - 1) * 0.01 + 1
        left <- floor(quantile_point)
        right <- ceiling(quantile_point)

        if (left == right) {
            quantile_val <- ordered_totals[left]
        } else {
            leftval <- ordered_totals[left]
            rightval <- ordered_totals[right]
            leftgap <- quantile_point - left
            rightgap <- right - quantile_point
            quantile_val <- leftval * rightgap + rightval * leftgap 
        }

        # Finding the cut-off. Currently only accepting cells > cutoff,
        # but this could be changed to >= if desired.
        cutoff <- quantile_val / 10
        while (cutoff_point <= length(ordered_totals) && ordered_totals[cutoff_point] > cutoff) {
             cutoff_point <- cutoff_point + 1
        }

        # Number of cells is one less than the cutoff_point because we're 1-indexed.
        num_cells <- cutoff_point - 1

        loss[i] <- (i - num_cells)^2
    }

    which.min(loss)
}

Not tested on real data at all. Could be faster in C++, given that we have a src/. But maybe it's already fast enough.

@DongzeHE
Copy link
Contributor

Hi all,

I will work on this during Thanksgiving. Thanks so much for pointing this out and coming up with the solutions!

Best,
Dongze

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants