Refactor and expand onepass model #300

Draft
wants to merge 18 commits into
base: main

Conversation

sjfleming
Contributor

@sjfleming sjfleming commented Feb 5, 2025

Closes #163
Closes #296

This is a refactor of onepass to make it more extensible. It implements the Welford algorithm for online variance calculation, and it also implements a gene-gene covariance computation via an online algorithm similar to Welford (https://en.wikipedia.org/wiki/Algorithms_for_calculating_variance#Covariance subsection "Online"). The covariance computation can be run on the raw data or on ranks, which allows gene-gene Spearman correlations to be computed.
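For readers unfamiliar with the two online updates mentioned above, here is a minimal pure-Python sketch of the scalar versions as described on the linked Wikipedia page. The class names `WelfordVariance` and `OnlineCovariance` are hypothetical and are not the classes added in this PR, which operate on batches of cells and full gene-gene matrices.

```python
class WelfordVariance:
    """Running mean and variance of a scalar stream (hypothetical helper)."""

    def __init__(self):
        self.n = 0
        self.mean = 0.0
        self.m2 = 0.0  # sum of squared deviations from the current mean

    def update(self, x):
        self.n += 1
        delta = x - self.mean          # deviation from the OLD mean
        self.mean += delta / self.n
        self.m2 += delta * (x - self.mean)  # times deviation from the NEW mean

    def variance(self, ddof=1):
        return self.m2 / (self.n - ddof)


class OnlineCovariance:
    """Running covariance of a stream of (x, y) pairs (hypothetical helper)."""

    def __init__(self):
        self.n = 0
        self.mean_x = 0.0
        self.mean_y = 0.0
        self.c = 0.0  # running co-moment: sum of (x - mean_x) * (y - mean_y)

    def update(self, x, y):
        self.n += 1
        dx = x - self.mean_x           # uses the OLD mean of x
        self.mean_x += dx / self.n
        self.mean_y += (y - self.mean_y) / self.n
        self.c += dx * (y - self.mean_y)  # uses the NEW mean of y

    def covariance(self, ddof=1):
        return self.c / (self.n - ddof)


# Toy data standing in for two genes' values across five cells.
xs = [2.0, 4.0, 4.0, 5.0, 7.0]
ys = [1.0, 3.0, 2.0, 6.0, 8.0]

var_x = WelfordVariance()
cov_xy = OnlineCovariance()
for x, y in zip(xs, ys):
    var_x.update(x)
    cov_xy.update(x, y)
```

Feeding rank-transformed values instead of raw values into the same covariance update, then normalizing by the standard deviations of the ranks, is what yields a Spearman correlation.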

Currently, the Welford implementation lives in a separate cellarium class, as does the covariance implementation. I thought this might be cleaner than having one huge class with more input arguments, but I'm open to opposing views. The class hierarchy also worked a lot better with Welford as a separate class, since Welford keeps track of different sufficient statistics than the naive/shifted algorithms, whereas the Welford-like gene-gene covariance keeps track of the same sufficient statistics as Welford (plus more).

Todo:

  • the refactor of the existing naive and shifted algorithms must pass existing tests
  • add tests for the Welford variance calculation
  • add tests for the Welford-like covariance calculation
  • add tests for the Welford-like correlation calculation
  • add tests for the Welford-like Spearman correlation calculation
  • ensure Welford works on more than one device
  • create CLI for Welford
  • create CLI for Welford-like covariance
  • add CLI tests for Welford
  • add CLI tests for Welford-like covariance
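On the multi-device item: one common way to combine Welford statistics computed independently on separate shards is the pairwise merge of (count, mean, M2) triples due to Chan et al. The sketch below assumes that approach; whether this PR reduces across devices this way is not stated here, and `batch_stats` and `merge` are hypothetical names. Empty shards are handled explicitly so they merge as a no-op.

```python
def batch_stats(values):
    """Return (n, mean, m2) for one shard; an empty shard yields zeros."""
    n = len(values)
    if n == 0:
        return 0, 0.0, 0.0
    mean = sum(values) / n
    m2 = sum((v - mean) ** 2 for v in values)
    return n, mean, m2


def merge(a, b):
    """Combine two (n, mean, m2) triples into the triple for the union."""
    n_a, mean_a, m2_a = a
    n_b, mean_b, m2_b = b
    n = n_a + n_b
    if n == 0:
        return 0, 0.0, 0.0
    delta = mean_b - mean_a
    mean = mean_a + delta * n_b / n
    # The cross term corrects for the two shards having different means.
    m2 = m2_a + m2_b + delta * delta * n_a * n_b / n
    return n, mean, m2


# Two "devices" each see part of the data; merging recovers the full result.
shard_0 = [2.0, 4.0, 4.0]
shard_1 = [5.0, 7.0]
n, mean, m2 = merge(batch_stats(shard_0), batch_stats(shard_1))
variance = m2 / (n - 1)
```

The same merge shape extends to co-moments for covariance, with the squared `delta` replaced by the product of the two per-variable mean differences.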

@sjfleming
Contributor Author

sjfleming commented Feb 6, 2025

Welford covariance computation on all 33k genes (using a batch size of 10k cells) spikes to roughly 31 GB of memory usage on my laptop (eyeballing Activity Monitor). Just wanted to make a note of this as a ballpark figure. I am guessing that's why the GitHub Actions runner went OOM on 5495ce0. The CLI test now uses a Filter and only computes covariance using a handful of genes.

For speed purposes, the rule seems to be "make the batch as big as you can accommodate in memory".
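As a rough back-of-the-envelope check on the figures above, the sizes of the dense arrays involved can be estimated directly. This sketch assumes exactly 33,000 genes and dense float32 or float64 storage; the actual gene count, dtype, and number of temporaries in the implementation are not stated in this thread, and `dense_bytes` is a hypothetical helper.

```python
def dense_bytes(rows, cols, bytes_per_element):
    """Bytes needed to hold one dense rows-by-cols array."""
    return rows * cols * bytes_per_element


GIB = 1024 ** 3
n_genes = 33_000      # assumption: "33k" taken as exactly 33,000
batch_size = 10_000   # cells per batch, as in the comment above

for label, width in (("float32", 4), ("float64", 8)):
    cov = dense_bytes(n_genes, n_genes, width)       # gene-by-gene matrix
    batch = dense_bytes(batch_size, n_genes, width)  # one dense batch
    print(f"{label}: covariance {cov / GIB:.1f} GiB, batch {batch / GIB:.1f} GiB")
```

A single gene-by-gene matrix is therefore about 4 GiB in float32 and about 8 GiB in float64, so a running co-moment matrix plus a few same-shaped temporaries during an update would plausibly reach the tens of GB.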

@sjfleming
Contributor Author

Ensure batch is not empty in update()
