[r] add tf-idf and log normalization functions #168

immanuelazn · 2024-12-12T22:50:17Z

Details

As discussed, we are looking to add in normalization functions to allow for passing in transformations into orchestrator functions like LSI/iterative LSI. I add in two functions, normalize_tfidf(), and normalize_log() (shown in #167).

There were a few departures from the design doc, in order to provide a little bit more flexibility. Particularly, I was thinking about the case where the feature means are not ordered in the same way. To add in a little bit of safety, I added some logic for index invariant matching for feature means to matrix features.

Other than that, I also provided an option to do a traditional log transform by boolean flag, rather than log1p. As we don't directly expose a log function in BPCells C++ side, I just added a - 1. However, I'm noticing that this isn't translated into a -Inf like in dgCMatrix/generic matrix, and is instead a very small number. Might need to evaluate if this is something we would want to support

immanuelazn · 2024-12-12T23:17:36Z

r/tests/testthat/test-matrix_transforms.R

+    # Test that removing the add_one works
+    # log of 0 is -inf, but we don't do that on the c side, and just have really large negative numbers.
+    res_3 <- as(normalize_log(m2, add_one = FALSE), "dgCMatrix")
+    res_3@x[res_3@x < -60] <- -Inf


Any better way of doing this?

As suggested above, I think we just get rid of the add_one option

bnprks

This overall looks reasonable, though I'm not sure if the formulas implemented are quite what we want.

For documenting the formulas, I'd suggest taking advantage of the fact that we can render latex inside \eqn{} blocks from docstrings. Due to a current pkgdown bug we'll need to edit _pkgdown.yml to un-break our equation rendering:

includes:
    in_header: |
       <script defer data-domain="benparks.net" src="https://plausible.benparks.net/js/visit-counts.js"></script>
       <link rel="stylesheet" href="https://cdn.jsdelivr.net/npm/katex@0.16.15/dist/katex.min.css" integrity="sha384-Htz9HMhiwV8GuQ28Xr9pEs1B4qJiYu/nYLLwlDklR53QibDfmQzi7rYxXhMH/5/u" crossorigin="anonymous">
       <!-- The loading of KaTeX is deferred to speed up page rendering -->
       <script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.15/dist/katex.min.js" integrity="sha384-bxmi2jLGCvnsEqMuYLKE/KsVCxV3PqmKeK6Y6+lmNXBry6+luFkEOsmp5vD9I/7+" crossorigin="anonymous"></script>
       <!-- To automatically render math in text elements, include the auto-render extension: -->
       <script defer src="https://cdn.jsdelivr.net/npm/katex@0.16.15/dist/contrib/auto-render.min.js" integrity="sha384-hCXGrW6PitJEwbkoStFjeJxv+fSOOQKOPbJxSfM6G5sWZjAyWhXiTIIAmQqnlLlh" crossorigin="anonymous" onload="renderMathInElement(document.body);"></script>

I'd suggest documenting the transformation equation for a single matrix element, like this for tfidf: eqn{\tilde{x}_{ij} = \log(\frac{x_{ij} \cdot \text{scaleFactor}}{ \text{rowMean}_i\cdot \text{colSum}_j} + 1)}

This would render as:

bnprks · 2024-12-13T01:04:18Z

r/R/transforms.R

+#' @param add_one (logical) Add one to the matrix before log normalization
+#' @returns log normalized matrix.
+#' @export
+normalize_log <- function(mat, scale_factor = 1e4, add_one = TRUE) {


We should remove the add_one parameter and always just do log1p. Every time I've seen this normalization it's done with a log1p, as otherwise the zero values would become -Inf (dgCMatrix actually messes this up)

We should divide by the colSums prior to multiplying by scale_factor. It might make sense to use colMeans from matrix_stats() so we can do multi-threading

We should give the specific normalization formula in the docs (perhaps in the returns section)

Agree on removing the add_one, and providing formulas in docs

bnprks · 2024-12-13T03:18:54Z

r/R/transforms.R

+#' Else, map each feature name to its mean value.
+#' @returns tf-idf normalized matrix.
+#' @export
+normalize_tfidf <- function(mat, feature_means = NULL, threads = 1L) {


Nice touch thinking of the name-matching option for feature_means

I think we might want to follow the ArchR/Signac TF-IDF default formulas which include some logarithms. Perhaps you were thinking users would combine normalize_log with normalize_tfidf but I think it's better to keep each standalone

We should also give the specific normalization formula here

bnprks · 2024-12-13T03:19:47Z

r/tests/testthat/test-matrix_transforms.R

+    # Test that removing the add_one works
+    # log of 0 is -inf, but we don't do that on the c side, and just have really large negative numbers.
+    res_3 <- as(normalize_log(m2, add_one = FALSE), "dgCMatrix")
+    res_3@x[res_3@x < -60] <- -Inf


As suggested above, I think we just get rid of the add_one option

immanuelazn · 2024-12-13T19:41:53Z

Thanks for the review Ben! I agree with the styling changes. I believe a small discussion on the logic of these functions would be a good idea to make sure we're aligned.

My initial thought was that normalize_log() and normalize_tfidf() would be to the tee of what these functions are named as. In the case of log normalization we would have the division by cell sums as built in the wrapper function (such as in select_features_by_variance() instead). What I'm worried about is in cases where we would want to pass the function, but we don't want a separate cell sum normalization as well. select_features_by_dispersion() is an example in the design docs, which has an optional log normalization without doing cell normalization This reflects the design decisions in ArchR here.

Similarly, from what I recall, tf-idf normalization doesn't necessarily indicate that a log normalization will take place after it. I was thinking in the case of LSI/iterative LSI, we would again just have the log normalization as a required step post tf-idf normalization.

I think having a boolean flag for cell normalization for normalize_log(), as well as having a boolean flag for log normalization in normalize_tfidf() would provide a little bit more flexibility. What do you think?

bnprks · 2024-12-13T21:59:21Z

I'm thinking of normalize_log() and normalize_tfidf() as trying to simplify single cell workflows by implementing common normalizations. We don't need these to cater to every possible need, just focus on making common cases easy (hard cases can be implemented by hand using our existing math operators). I think the versions I've suggested are the common cases

I think the ArchR example you link is actually an example of ArchR performing no normalization rather than doing a log without normalizing by read depth. That's a use-case we can easily support by allowing normalize=NULL or something to that effect.
I don't think I've ever seen a case of log normalization not include normalizing by colSums first. (e.g. this is what Seurat does, when it says "logNormalize"). I really don't think people will be confused here, and we can include the formula in the docs for clarity. I don't think we need a boolean flag unless we have a known example of a common analysis approach it would enable.
For tfidf, it would be okay to implement method argument from Signac, though just providing the default method is okay, as probably 99% of people just use the default that both ArchR/Signac use. Annoyingly Signac and ArchR disagree on method and LSIMethod numbering scheme, so we might need to use string naming if we implement this. (again, optional for now to implement the other variants. Implementing the ArchR/Signac default is most important)

immanuelazn · 2024-12-13T22:08:26Z

Okay, my bad on the ArchR thing! You're right, I will address these changes

[r] add tf-idf and log normalization functions

1ef6091

immanuelazn force-pushed the ia/normalizations branch 2 times, most recently from a93f2c4 to f1c33cd Compare December 12, 2024 23:11

[r] fix normalization tests

98675d0

immanuelazn force-pushed the ia/normalizations branch from f1c33cd to 98675d0 Compare December 12, 2024 23:13

immanuelazn commented Dec 12, 2024

View reviewed changes

bnprks reviewed Dec 13, 2024

View reviewed changes

immanuelazn added 2 commits December 13, 2024 17:09

[r] add in requested changes

2f83ae6

[r] removed unused variable

6381f74

immanuelazn force-pushed the ia/normalizations branch 2 times, most recently from d8d9ed2 to 6381f74 Compare December 15, 2024 03:20

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[r] add tf-idf and log normalization functions #168

[r] add tf-idf and log normalization functions #168

immanuelazn commented Dec 12, 2024 •

edited

Loading

immanuelazn Dec 12, 2024

bnprks Dec 13, 2024

bnprks left a comment

bnprks Dec 13, 2024

immanuelazn Dec 13, 2024

bnprks Dec 13, 2024

bnprks Dec 13, 2024

immanuelazn commented Dec 13, 2024

bnprks commented Dec 13, 2024

immanuelazn commented Dec 13, 2024

[r] add tf-idf and log normalization functions #168

Are you sure you want to change the base?

[r] add tf-idf and log normalization functions #168

Conversation

immanuelazn commented Dec 12, 2024 • edited Loading

Details

immanuelazn Dec 12, 2024

Choose a reason for hiding this comment

bnprks Dec 13, 2024

Choose a reason for hiding this comment

bnprks left a comment

Choose a reason for hiding this comment

bnprks Dec 13, 2024

Choose a reason for hiding this comment

immanuelazn Dec 13, 2024

Choose a reason for hiding this comment

bnprks Dec 13, 2024

Choose a reason for hiding this comment

bnprks Dec 13, 2024

Choose a reason for hiding this comment

immanuelazn commented Dec 13, 2024

bnprks commented Dec 13, 2024

immanuelazn commented Dec 13, 2024

immanuelazn commented Dec 12, 2024 •

edited

Loading