Add function for cluster stability #779

sjspielman · 2024-09-24T18:13:14Z

Closes #773

This PR adds a function to bootstrap clusters and calculate ARI for a given number of replicates. I ended up writing a function that mostly wraps calculate_clusters(), thereby letting that function handle argument checking (on the first bootstrap iteration). This function differs from the other evaluation functions in that it takes a vector of clusters (hence, I check that it's not a data frame; I do that b/c, as I learned today, is.vector(df$column) can be FALSE, e.g. for a factor column). The function returns a data frame of ARI results and clustering parameters, as returned by calculate_clusters().

Note also that I updated examples across function docs to use a seed; we want to encourage seeds!
(I'll also note that at one point I had the cute idea of actually providing cluster_df to this stability function and grabbing cluster parameters directly from the df, but I changed my mind, mainly because we won't necessarily know all the parameter columns that could be in that df because of cluster_args, and users may have added their own.)
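
For readers following along, here is a minimal base-R sketch of the bootstrap-and-ARI idea described above. It is not the package implementation: `sketch_stability()` and `adjusted_rand_index()` are hypothetical names, `stats::kmeans()` stands in for `calculate_clusters()`, and sampling cells with replacement is an assumption about how the bootstrap is drawn.

```r
# Adjusted Rand index computed from the contingency table of two label vectors
adjusted_rand_index <- function(x, y) {
  tab <- table(x, y)
  sum_cells <- sum(choose(as.vector(tab), 2))
  sum_rows <- sum(choose(rowSums(tab), 2))
  sum_cols <- sum(choose(colSums(tab), 2))
  expected <- sum_rows * sum_cols / choose(length(x), 2)
  max_index <- (sum_rows + sum_cols) / 2
  (sum_cells - expected) / (max_index - expected)
}

# Hypothetical stand-in for the stability function: resample cells,
# recluster, and compare to the original labels of the sampled cells
sketch_stability <- function(pc_mat, clusters, replicates = 20,
                             centers = 2, seed = NULL) {
  stopifnot(nrow(pc_mat) == length(clusters))
  if (!is.null(seed)) set.seed(seed) # set the seed once, before all replicates
  do.call(rbind, lapply(seq_len(replicates), function(i) {
    idx <- sample(nrow(pc_mat), replace = TRUE)
    # kmeans is only a placeholder for the real clustering call
    boot_clusters <- stats::kmeans(pc_mat[idx, , drop = FALSE], centers = centers)$cluster
    data.frame(replicate = i, ari = adjusted_rand_index(clusters[idx], boot_clusters))
  }))
}

# Simulated two-group data, purely for illustration
set.seed(2024)
pc_mat <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 5), ncol = 2)
)
clusters <- stats::kmeans(pc_mat, centers = 2)$cluster
stability_df <- sketch_stability(pc_mat, clusters, replicates = 5, seed = 2024)
```

The result has one row per replicate, which mirrors the shape of the returned data frame described above, minus the clustering parameter columns.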

sjspielman · 2024-09-24T18:14:20Z

packages/rOpenScPCA/tests/testthat/test-evaluate-clusters.R

+
+
+test_that("calculate_stability works as expected with different replicates", {
+  suppressWarnings({


fyi, suppressing warnings since all these calculations done on test data give warnings about ties from the ARI calculation.

Worth making that a comment in the code.

jashapiro

Overall, I think this is doing what we want, but I have two concerns:

The small one is that I think you can reduce a lot of duplication in code and documentation by using ....

The big one is that I think this might not always do the replications we need: there is a strong risk of repeated replicates.

As I see the flow, it seems to go like this:

1. set seed
2. sample 1st set of cells
3. run clustering, which sets the seed again, then performs clustering calculations
4. sample 2nd set of cells. These will be different because the clustering calculation has called the RNG, advancing the seed
5. run clustering on second set of cells, which sets the seed again.
6. After this point, we may be in a different place from step 4, depending on the number of calls to the RNG, but there is a strong possibility (since the matrices are the same size) that we will be in exactly the same place. If we are, then every sampling/replicate after this will have the same values.

The solution, I think, is just not to set the seed in the internal calls to calculate_clusters(). We can just let the RNG advance as it will through the full set of replicates, setting the seed only the one time. (This does not affect the resetting in the sweep, which seems justified to ensure that the same parameter set will always result in the same clustering with the same seed, regardless of how many other parameter sets are being tested at the same time)
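
The repeated-replicate risk can be reproduced in a few lines of base R. This is a sketch under assumptions: `cluster_with_seed()` and `draw_replicates()` are hypothetical stand-ins, and the `runif()` call is only a placeholder for whatever RNG use happens during clustering.

```r
# Hypothetical stand-in for a clustering call that optionally resets the seed
cluster_with_seed <- function(x, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  runif(length(x)) # placeholder: consumes the same amount of RNG on every call
  invisible(NULL)
}

# Draw bootstrap indices for each replicate, clustering after each draw
draw_replicates <- function(n_cells, replicates, inner_seed) {
  set.seed(2024) # seed set once before the replicates start
  lapply(seq_len(replicates), function(i) {
    idx <- sample(n_cells, replace = TRUE)
    cluster_with_seed(idx, seed = inner_seed)
    idx
  })
}

reset_each_time <- draw_replicates(50, 4, inner_seed = 2024)
seed_set_once <- draw_replicates(50, 4, inner_seed = NULL)

# With the inner reset, every sample after the first starts from the same RNG state
identical(reset_each_time[[2]], reset_each_time[[3]]) # TRUE
# Without it, the RNG simply keeps advancing across replicates
identical(seed_set_once[[2]], seed_set_once[[3]]) # FALSE
```

Both versions are reproducible from run to run; only the second gives distinct replicates.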

packages/rOpenScPCA/R/evaluate-clusters.R

jashapiro · 2024-09-24T18:49:57Z

packages/rOpenScPCA/R/evaluate-clusters.R

+          ) |>
+          dplyr::distinct() |>
+          dplyr::mutate(
+            replicate = i,


I would probably not do this here, but would instead use bind_rows(.id = "replicate") at the end.

Noting I tried this, but turns out bind_rows() will force this column to be character, and I think we'd really like this to be numeric. In this case we'd need a mutate anyways to coerce, so 1 mutate to create it within the map seemed better.
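
A small sketch of the type difference discussed here, assuming dplyr is installed and R is at least 4.1 for the native pipe. The ARI values are made-up placeholders, not results from this PR.

```r
library(dplyr)

# Placeholder per-replicate results (values are invented for illustration)
replicate_results <- list(
  data.frame(ari = 0.91),
  data.frame(ari = 0.87)
)

# Option 1: let bind_rows() create the id column; it comes back as character
by_id <- dplyr::bind_rows(replicate_results, .id = "replicate")
class(by_id$replicate) # "character"

# Option 2: create the column inside the loop so it keeps its numeric type
in_loop <- lapply(seq_along(replicate_results), function(i) {
  dplyr::mutate(replicate_results[[i]], replicate = i, .before = ari)
}) |>
  dplyr::bind_rows()
class(in_loop$replicate) # "integer"
```

Either way a mutate() is involved, either to coerce the id column afterwards or to create it up front.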

packages/rOpenScPCA/R/evaluate-clusters.R

jashapiro · 2024-09-24T19:01:22Z

packages/rOpenScPCA/R/evaluate-clusters.R

+          objective_function = objective_function,
+          cluster_args = cluster_args,
+          threads = threads,
+          seed = seed


I think there is a subtle but major error here. The issue is that every time you rerun calculate_clusters() you are resetting the seed to the same value, so the next time you sample(), you will get the same values for every run after the first. I don't think there should actually be any need to set the seed in the call to calculate_clusters(), as setting it before the replicates start will be sufficient for consistency.

packages/rOpenScPCA/R/evaluate-clusters.R


sjspielman · 2024-09-25T13:14:44Z

Should be ready for another look! Note also that I had to keep the pc_name argument since it's not one of the arguments passed into calculate_clusters() (it could be, but I'd prefer to extract the matrix, if needed, once and not on each iteration).

jashapiro

Looks good. A few minor additional comments, but I love the simplicity of a good application of ...!

packages/rOpenScPCA/R/evaluate-clusters.R

jashapiro · 2024-09-25T13:35:18Z

packages/rOpenScPCA/R/evaluate-clusters.R

+            replicate = i, # define this variable here to ensure it's numeric
+            ari = ari


You can keep the .before here; the only reason I commented was the assumption that replicate was moving to bind_rows, which would have made this a one-liner.

packages/rOpenScPCA/R/evaluate-clusters.R

packages/rOpenScPCA/tests/testthat/test-evaluate-clusters.R


sjspielman · 2024-09-25T14:02:04Z

Thought of one more edge case: the nrow/length check between the matrix and clusters will pass even if clusters is not a vector but, e.g., a data frame with as many columns as the matrix has rows. This is really unlikely to happen but can't hurt to catch. I updated here if you want to have a look: c38ff0a

jashapiro · 2024-09-25T14:12:23Z

Thought of one more edge case: the nrow/length check between the matrix and clusters will pass even if clusters is not a vector but, e.g., a data frame with as many columns as the matrix has rows. This is really unlikely to happen but can't hurt to catch. I updated here if you want to have a look: c38ff0a

I can imagine this failing in other ways you don't expect that would otherwise be fine (any object with an attribute other than names will fail is.vector, not just factors; for example, I think a list of clusters would actually work in the function), so I personally would not have bothered with this. If people do horrible things that result in failures down the line, we can't always stop them.
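
To make the trade-off concrete, here is a short base-R sketch of how is.vector() and length() behave on the kinds of input discussed in this thread. All object names are hypothetical, and the length comparison is an assumption about how the function's check is written.

```r
plain <- c(1, 2, 1)
named <- c(a = 1, b = 2, c = 1)
fct <- factor(c("1", "2", "1"))
tagged <- structure(c(1, 2, 1), note = "any attribute other than names")
cluster_list <- list(1, 2, 1)
cluster_df <- data.frame(a = 1:3, b = 4:6, c = 7:9)
cluster_df$fct <- NULL # no-op; the df stays at three plain integer columns

is.vector(plain) # TRUE
is.vector(named) # TRUE: names are the one attribute is.vector() tolerates
is.vector(fct) # FALSE: factors carry class and levels attributes
is.vector(tagged) # FALSE
is.vector(cluster_list) # TRUE: lists are vectors too
is.vector(cluster_df) # FALSE
is.vector(cluster_df$a) # TRUE for a plain column

# length() of a data frame is its number of columns, so a length check
# against the matrix rows can pass by coincidence
pc_mat <- matrix(rnorm(6), nrow = 3) # 3 cells x 2 PCs
nrow(pc_mat) == length(cluster_df) # TRUE
is.data.frame(cluster_df) # TRUE: the narrower check catches this case
```

An is.data.frame() check rejects only the data frame case, while an is.vector() check would also reject factors and other attributed objects that might otherwise work.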

sjspielman added 11 commits September 24, 2024 12:16

add ari stability function, and use pdfCluster package for ARI

3fe6803

use seeds in examples

5011a36

add docs for stability function

941c51c

replicates, not iterations

cde0fb0

more replicates

645d0e3

syntax

4e0856e

singular colname

ade885f

fix bug and add TODO

dfecce1

can't test for vector since indexed df columns are not vectors. inste…

1bbbdf3

…ad, check that it's not a data frame as this differs from other evaluation functions

tests for stability function

a9127c2

update tests accordingly

da7f38b

sjspielman requested a review from jaclyn-taroni as a code owner September 24, 2024 18:13

sjspielman requested review from jashapiro and removed request for jaclyn-taroni September 24, 2024 18:13

sjspielman commented Sep 24, 2024

jashapiro reviewed Sep 24, 2024

sjspielman and others added 5 commits September 25, 2024 08:40

Apply suggestions from code review

8fcecd5

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

simplify docs with ..., and dont actually use bind_rows for replicate…

9823689

… since we want a numeric

test comment

499c8bf

use slice instead of head for checks

e6637b0

Need a pc_name argument since it's used before calculate_clusters. Ad…

ac6929c

…d test for this case

sjspielman requested a review from jashapiro September 25, 2024 13:14

jashapiro approved these changes Sep 25, 2024

sjspielman and others added 2 commits September 25, 2024 09:46

Apply suggestions from code review

9b90c0a

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>

one more check for clusters to ensure we fail if clusters is not a ve…

c38ff0a

…ctor

remove that check/test

52e1fe1

sjspielman merged commit e2c71af into AlexsLemonade:feature/ropenscpca Sep 25, 2024
3 checks passed

sjspielman deleted the 773-cluster-stability branch September 25, 2024 14:47

sjspielman mentioned this pull request Sep 25, 2024

rOpenScPCA: function to calculate stability #773

Closed
