Add function to sweep clustering parameters #765

Merged

sjspielman merged 33 commits into AlexsLemonade:feature/ropenscpca from sjspielman:755-sweep-clustering

Sep 19, 2024

Member

sjspielman commented Sep 18, 2024

Closes #755

This PR adds a function and tests to perform clustering across a set of parameters. Implementation details:

- First, I am hardly enamored of my function and file naming. Do you agree, and if so, what thoughts do you have for me?
- The function ends up creating a data frame with an indicator column, cluster_set, that records which round of clustering each value pertains to and makes splitting easier later. Note that I could instead leave this as a list, remove the dplyr::bind_rows(), and let users take that step themselves if they prefer. A list might also be preferable since this function will end up getting used in a template Rmd to perform/evaluate clustering parameters, and for a lot of that we'd have to revert to a list anyway.
- In part because of the need to keep these columns consistent, and also because it felt like overkill/beyond the scope of this function's specific purpose, I do not allow users to vary the algorithm. You get to pick 1 algorithm and vary parameters for that algorithm.
- I ended up adding a threads argument, since this can get time consuming with real SCEs.
- Speaking of real SCEs, how long does it take to extract a matrix from an SCE or Seurat object (a real big one, not simulated!)? It's essentially instantaneous. I still extract the matrix up front to save some time, but it's small potatoes either way.
- Any other tests we'd like to see?

sjspielman added 8 commits

September 17, 2024 16:11


          add function to sweep parameters


          run document

7e10ed6


          add tests for sweep function

952acfb


          update description

93ca2af


          add argument for threads

8adb36f


          document threads

523550d


          actually use threads

3b03fe2


          fix docs typo

e52227a

sjspielman requested a review from jaclyn-taroni as a code owner

September 18, 2024 14:03

sjspielman removed the request for review from jaclyn-taroni

September 18, 2024 14:04

jashapiro reviewed

Member

jashapiro left a comment

Overall this looks good.

I think my biggest comment is that I would want to support multiple algorithms, but I can see your argument for not doing so. It definitely does introduce some complexity, but that will have to be handled somewhere anyway... I guess the strongest argument would be that sometimes parameter ranges should be different for different algorithms. But supporting it doesn't mean you have to use it!

Part of that reasoning is that I think you do want to check the parameters against the algorithm anyway: if there are unused parameters given they should not be passed along to the sweep, as that will result in duplicate runs.

I think leaving the output as a list probably makes sense, as it makes it easier to combine different runs later. Once you put in an id column in the data frame, then you have to worry about it being unique.

I didn't evaluate testing yet, as I think it will change if the output is changing!

packages/rOpenScPCA/R/sweep-parameters.R Outdated

+              #' Calculate clusters across a set of parameters
+              #'
+              #' This function can be used to perform reproducible clustering while varying a set of parameters.
+              #' A single clustering algorithm is required, but multiple values can be provided for any of:

Member

jashapiro Sep 18, 2024

I'm curious why you made this decision? Why not allow comparison across clustering algorithms? I assume it is because other parameter variations were annoying to handle, but I certainly imagine that such comparisons might be desired, so the effort might be worth it.

packages/rOpenScPCA/R/sweep-parameters.R Outdated

Comment on lines 83 to 84

		# Even parameters that won't be used can be included
		# since calculate_clusters() will ignore them anyways

Member

jashapiro Sep 18, 2024

If this is true, then why not allow variation in algorithms too?

Member

jashapiro Sep 18, 2024

Thinking about this a bit more: we should probably discard/convert to NA any unused parameters. Otherwise there could be a lot of extra clustering when an unused parameter is set to vary. Again, I think this is worth the effort to support multiple clustering algorithms: a mutate() with a few ifelse() substitutions followed by a distinct() and we should be all set.

Member Author

sjspielman Sep 18, 2024

Ah I see this is relevant to this comment - #765 (comment)

Otherwise there could be a lot of extra clustering when an unused parameter is set to vary.

I didn't see this at first, but yeah I see it now...

packages/rOpenScPCA/R/sweep-parameters.R Outdated

packages/rOpenScPCA/R/sweep-parameters.R Outdated

+                      )
+                    }
+                  ) |>
+                  dplyr::bind_rows(.id = "cluster_set")

Member

jashapiro Sep 18, 2024

Is this id just an integer? If so, do we need it? Or would it be better to generate this from unique combinations of the parameters? Or at least add the algorithm as a prefix? Something to make it more likely to remain unique if multiple tables are joined later.

Update: yes, I see that it is needed in the current case to handle the case where an unused parameter is given with variation.

Member

jashapiro Sep 18, 2024

Looking back at your introductory comments, I think I would probably leave it as a list. But I will note that bind_rows handles mismatched columns just fine, so you shouldn't need to worry about that!
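To illustrate the point about mismatched columns and ids, here is a minimal sketch. The list names, the cell_id and cluster columns, and the values are made up for illustration and are not taken from the package. It assumes dplyr is installed and shows that dplyr::bind_rows() fills absent columns with NA and, for a named list, uses the names as the .id values:

```r
library(dplyr)

# Hypothetical per-run results; only the leiden run carries objective_function
cluster_results <- list(
  louvain_nn10 = data.frame(
    cell_id = c("cell1", "cell2"),
    cluster = c(1, 2),
    nn = 10
  ),
  leiden_nn10 = data.frame(
    cell_id = c("cell1", "cell2"),
    cluster = c(1, 1),
    nn = 10,
    objective_function = "modularity"
  )
)

# Names of the list become the cluster_set values;
# the missing objective_function column is filled with NA for louvain rows
combined <- dplyr::bind_rows(cluster_results, .id = "cluster_set")
print(combined)
```

Naming list elements from the algorithm and parameters, rather than relying on integer positions, is one way to keep the id meaningful if several tables are joined later.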

packages/rOpenScPCA/R/sweep-parameters.R Outdated

Member

jashapiro Sep 18, 2024

since you asked, I would call this cluster-sweep.R or sweep-clustering.R

packages/rOpenScPCA/R/sweep-parameters.R Outdated

Member Author

sjspielman commented Sep 18, 2024

Here's a question for you regarding allowing multiple algorithms: What is the right way to handle varying parameters that are shared across algorithms?

For example, here,

objective_function will apply to only leiden since it's algorithm-specific
but nn would apply to both louvain and leiden.

sweep_clusters(
  sce, 
  algorithm = c("louvain", "leiden"), 
  objective_function = "modularity", 
  nn = c(10, 15)
)

Alternatively, we could have some very fine control. Taking a stab at this conceptually but it feels a bit gnarly?

sweep_clusters(
  sce, 
  list(
    "louvain" = list(nn = c(10, 15)),
    "leiden" = list(nn = c(20, 25), objective_function = "modularity")
  )
)

Member

jashapiro commented Sep 18, 2024

Here's a question for you regarding allowing multiple algorithms: What is the right way to handle varying parameters that are shared across algorithms?

For example, here,

objective_function will apply to only leiden since it's algorithm-specific

but nn would apply to both louvain and leiden.
sweep_clusters(
  sce, 
  algorithm = c("louvain", "leiden"), 
  objective_function = "modularity", 
  nn = c(10, 15)
)
Alternatively, we could have some very fine control. Taking a stab at this conceptually but it feels a bit gnarly?
sweep_clusters(
  sce, 
  list(
    "louvain" = list(nn = c(10, 15)),
    "leiden" = list(nn = c(20, 25), objective_function = "modularity")
  )
)

I would do the former, but then handle it in the function with mutate(objective_function = ifelse(algorithm == "leiden", objective_function, NA_character_)) after the expand_grid.

Member Author

sjspielman commented Sep 18, 2024

I would do the former, but then handle it in the function with mutate(objective_function = ifelse(algorithm == "leiden", objective_function, NA_character_))

I had basically this exact code in there yesterday before I locked down the algorithm! But in the end, I didn't think it was actually needed since calculate_clusters will ignore irrelevant parameters. I suspect we can just toss everything into the parameters list and the situation will sort itself out, but I look forward to finding out what I might be missing 😄


          Apply suggestions from code review

589fd1f

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

Member

jashapiro commented Sep 18, 2024

I had basically this exact code in there yesterday before I locked down the algorithm! But in the end, I didn't think it was actually needed since calculate_clusters will ignore irrelevant parameters. I suspect we can just toss everything into the parameters list and the situation will sort itself out, but I look forward to finding out what I might be missing 😄

The reason we can't just let calculate_clusters ignore it is the multiple runs problem. The mutate function should be followed by a distinct to remove repeats.
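A minimal sketch of the multiple-runs problem and the suggested fix, assuming tidyr and dplyr are installed. The parameter values are illustrative (including "CPM" as a second objective function), and this is not the code that ended up in sweep_clusters():

```r
library(dplyr)
library(tidyr)

# Hypothetical sweep inputs: every combination of the three parameters
sweep_params <- tidyr::expand_grid(
  algorithm = c("louvain", "leiden"),
  nn = c(10, 15),
  objective_function = c("CPM", "modularity")
)

# 2 algorithms x 2 nn values x 2 objective functions = 8 rows,
# but louvain ignores objective_function, so 2 of its 4 rows are repeats
nrow(sweep_params)

deduplicated_params <- sweep_params |>
  dplyr::mutate(
    # Blank out the parameter wherever the algorithm does not use it
    objective_function = ifelse(
      algorithm == "leiden",
      objective_function,
      NA_character_
    )
  ) |>
  # Rows that differed only in the unused parameter are now identical
  dplyr::distinct()

# 2 louvain rows + 4 leiden rows = 6 clustering runs
nrow(deduplicated_params)
```

Without the distinct() step, each louvain setting would be clustered once per objective_function value even though the results would be the same.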

sjspielman added 7 commits

September 18, 2024 14:48


          renames

3ef6db5


          Update function for multiple algorithms and redocument

4122ff4


          can't use NA since match.args isn't here for it. use calculate_cluster default instead

a51ff78


          update existing tests

dd4fd31


          Fixed a bug: don't include cluster_args if it's empty

0e47c81


          more comment for future us

0d3773d


          more tests

fa0c887

Member Author

sjspielman commented Sep 18, 2024

This is getting closer, but is definitely not there yet. Some updates:

- While NA makes sense for avoiding duplicate runs when irrelevant params are varied, match.arg() doesn't accept it. So, instead of NA, I used the values that represent the defaults for those parameters, which maybe seemed like a decent middle ground (maybe? decent? middle? i'm not hedging, you're hedging!). Did you have a different thought there?
- Note that I caught a bug where we can't be adding cluster_args to the final data frame if it's empty, so I fixed (and tested) that in 0e47c81
- Speaking of cluster_args - this is a problem! We don't check that these are sane for the given algorithm, and I tend to think we indeed should not be in the business of doing so since it depends on igraph. That said, bluster will fail if an irrelevant parameter is passed in, and that's very much a possibility with the sweep function. Here's what I've thought of (noting I prefer the latter) -
  - We could cancel cluster_args for the sweep function
  - We could have users supply this argument as a nested list per algorithm, e.g. cluster_args = list(louvain = ..., walktrap = ...).

sjspielman requested a review from jashapiro

September 18, 2024 20:06

jashapiro reviewed

packages/rOpenScPCA/R/sweep-clusters.R Outdated

Member

jashapiro commented Sep 18, 2024

Speaking of cluster_args - this is a problem! We don't check that these are sane for the given algorithm, and I tend to think we indeed should not be in the business of doing so since it depends on igraph. That said, bluster will fail if an irrelevant parameter is passed in, and that's very much a possibility with the sweep function. Here's what I've thought of (noting I prefer the latter) -

We could cancel cluster_args for the sweep function

We could have users supply this argument as a nested list per algorithm, e.g. cluster_args = list(louvain = ..., walktrap = ...).

I would vote for no cluster_args in the sweep function. If you really need it, you can write your own sweep. And if there are particular cluster_args worth adding, we can add support as needed.

sjspielman added 3 commits

September 18, 2024 16:26


          no more cluster_args in sweep function

604e1ea


          one more associated docs update

0baaa75


          check NA for objective_function before match.arg'ing

e93c779


          back to NA values, char for objective_function and real for resolution

9cc7d96

jashapiro reviewed

Member

jashapiro left a comment

Mostly style comments here, but also I think you can revert the last change (setting and dealing with NA). I hadn't appreciated how well we throw out the default values, so I think we can leave your previous solution which did what I was worried it did not!

packages/rOpenScPCA/tests/testthat/test-sweep-clusters.R

packages/rOpenScPCA/tests/testthat/test-sweep-clusters.R Outdated

packages/rOpenScPCA/tests/testthat/test-sweep-clusters.R Outdated

packages/rOpenScPCA/tests/testthat/test-sweep-clusters.R Outdated

packages/rOpenScPCA/R/sweep-clusters.R

packages/rOpenScPCA/tests/testthat/test-sweep-clusters.R Outdated

packages/rOpenScPCA/tests/testthat/test-sweep-clusters.R Outdated

packages/rOpenScPCA/tests/testthat/test-sweep-clusters.R Outdated

packages/rOpenScPCA/R/calculate-clusters.R Outdated

Comment on lines 89 to 101

-                objective_function <- match.arg(objective_function)
+                # this might be NA if it came from the sweep_clusters() function
+                if (is.na(objective_function)) {
+                  objective_function <- NULL
+                } else {
+                  objective_function <- match.arg(objective_function)
+                }

Member

jashapiro Sep 18, 2024

On reflection, since this parameter only shows up in the output table if it is not NA, the previous solution should solve this just fine, and you can revert this change! I thought it would end up included with the default value, but we already don't do that because of the way we (properly!) handle cluster_args. (If you do want to keep it though, you will need to wrap is.na with an all() or check the length, I realized.)
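The caveat about wrapping is.na() can be sketched as follows. The helper name and its choices vector are hypothetical stand-ins, not the code in calculate-clusters.R. When the argument is left at its default, it arrives as the full vector of choices, so is.na() returns more than one value; in R 4.2.0 and later, a condition of length greater than one in if() is an error:

```r
# Hypothetical helper for illustration only
resolve_objective_function <- function(objective_function = c("CPM", "modularity")) {
  # all() collapses is.na() to a single TRUE/FALSE, so the check is safe
  # even when the unevaluated default (a length-2 vector) reaches this point
  if (all(is.na(objective_function))) {
    return(NULL)
  }
  # match.arg() picks the first choice when the default vector is passed as-is
  match.arg(objective_function)
}

resolve_objective_function(NA)           # NULL: parameter not relevant
resolve_objective_function()             # first choice
resolve_objective_function("modularity") # explicit choice
```

Checking length(objective_function) == 1 before is.na() would be an alternative guard with the same effect for these inputs.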

packages/rOpenScPCA/tests/testthat/test-sweep-clusters.R

jashapiro reviewed

packages/rOpenScPCA/tests/testthat/test-sweep-clusters.R

sjspielman and others added 8 commits

September 19, 2024 09:39


          back to defaults, not NA, for additional parameters

0f5b286


          Apply suggestions from code review

3ad8c09

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>


          better alg checking

3a044ce


          tests styling/spacing

7b3b575


          use map instead

b613668


          update comment'

ee2af5b


          one more spot to remove unique

7b86bdf


          remove redundant tests

99fd963

sjspielman requested a review from jashapiro

September 19, 2024 14:33

jashapiro approved these changes

Member

jashapiro left a comment

LGTM

I suggested two more quick tests (Seurat and matrix inputs), but I don't think I need to see this again.

packages/rOpenScPCA/R/sweep-clusters.R Outdated

packages/rOpenScPCA/tests/testthat/test-sweep-clusters.R Outdated

packages/rOpenScPCA/tests/testthat/test-sweep-clusters.R

+              test_that("sweep_clusters works as expected with default algorithm & weighting", {
+                sweep_list <- sweep_clusters(
+                  sce,

Member

jashapiro Sep 19, 2024

since there is logic in the sweep function for the conversion, it is probably worth including a quick test that this function works for matrices and Seurat objects as well. I don't think you need to do anything but check that those functions run and produce output, so using full defaults (one clustering) should be fine.

sjspielman and others added 3 commits

September 19, 2024 13:09


          Apply suggestions from code review

b53f16d

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>


          test sweep with seurat and matrix

021a365


          remove a .gitkeep straggler

ff083f7

jashapiro reviewed

packages/rOpenScPCA/tests/testthat/test-sweep-clusters.R Outdated

jashapiro reviewed

packages/rOpenScPCA/tests/testthat/test-sweep-clusters.R Outdated

sjspielman and others added 2 commits

September 19, 2024 13:55


          Apply suggestions from code review

01628d2

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>


          simplify

64d82b2

sjspielman merged commit 6f67c73 into AlexsLemonade:feature/ropenscpca

3 checks passed

sjspielman deleted the 755-sweep-clustering branch

September 19, 2024 19:05

sjspielman mentioned this pull request

rOpenScpCA: Add function to sweep clustering parameters #755

Closed
