Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add function for cluster stability #779

Merged

Conversation

sjspielman
Copy link
Member

Closes #773

This PR adds a function to bootstrap clusters and calculate ARI for a given number of reps. I ended up writing a that mostly wraps calculate_clusters(), thereby letting that function handle argument checking (on the first bootstrap iteration). This function differs from the other evaluation functions in that it takes a vector of clusters (hence, I check it's not a data frame; I do that b/c, as I learned today, is.vector(df$column) is FALSE). The function returns a data frame of ari results and clustering parameters, as returned by calculate_clusters().

Note also that I updated examples across function docs to use a seed; we want to encourage seeds!
(I'll also note, at one point I had the cute idea of actually providing cluster_df to this stability function and grabbing cluster parameters directly from the df, but changed my mind because, mainly, we won't necessarily know all the parameter columns that could be in that df because of cluster_args, and users may have added their own.).

@sjspielman sjspielman requested review from jashapiro and removed request for jaclyn-taroni September 24, 2024 18:13


test_that("calculate_stability works as expected with different replicates", {
suppressWarnings({
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

fyi, suppressing warnings since all these calculations done on test data give warnings about ties from the ARI calculation.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Worth making that a comment in the code.

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Overall, I think this seems like this is doing what we want, but I have two concerns:

The small one is that I think you can reduce a lot of duplication in code and documentation by using ....

The big one is I this might not always do the replications we need: there is a string risk of repeated replicates:

As I see the flow, it seems like it goes like this:

  1. set seed
  2. sample 1st set of cells
  3. run clustering, which sets the seed again, then performs clustering calculations
  4. sample 2nd set of cells. These will be different because the clustering calculation has called the RNG, advancing the seed
  5. run clustering on second set of cells, which sets the seed again.
    After this point, we may be in a different place from step 4, depending on the number of calls to the RNG, but there is a strong possibility (since the matrices are the same size) that we will be in exactly the same place. If we are, then every sampling/replicate after this will have the same values.

The solution, I think, is just not to set the seed in the internal calls to calculate_clusters(). We can just let the RNG advance as it will through the full set of replicates, setting the seed only the one time. (This does not affect the resetting in the sweep, which seems justified to ensure that the same parameter set will always result in the same clustering with the same seed, regardless of how many other parameter sets are being tested at the same time)

packages/rOpenScPCA/R/evaluate-clusters.R Outdated Show resolved Hide resolved
packages/rOpenScPCA/R/evaluate-clusters.R Outdated Show resolved Hide resolved
) |>
dplyr::distinct() |>
dplyr::mutate(
replicate = i,
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would probably not do this here, but would instead use bind_rows(.id = "replicate") at the end.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Noting I tried this, but turns out bind_rows() will force this column to be character, and I think we'd really like this to be numeric. In this case we'd need a mutate anyways to coerce, so 1 mutate to create it within the map seemed better.

packages/rOpenScPCA/R/evaluate-clusters.R Show resolved Hide resolved
packages/rOpenScPCA/R/evaluate-clusters.R Show resolved Hide resolved
objective_function = objective_function,
cluster_args = cluster_args,
threads = threads,
seed = seed
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think there is a subtle but major error here. The issue is that every time you rerun calculate_clusters() you are resetting the seed to the same value, so the next time you sample(), you will get the same values for every run after the first. I don't think there should actually be any need to set the seed in the call to calculate_clusters(), as setting it before the replicates start will be sufficient for consistency.

packages/rOpenScPCA/R/evaluate-clusters.R Outdated Show resolved Hide resolved
packages/rOpenScPCA/R/evaluate-clusters.R Outdated Show resolved Hide resolved
packages/rOpenScPCA/R/evaluate-clusters.R Outdated Show resolved Hide resolved
packages/rOpenScPCA/R/evaluate-clusters.R Outdated Show resolved Hide resolved
@sjspielman
Copy link
Member Author

Should be ready for another look! Note also that I had to keep the pc_name argument since it's not one of the arguments pass into calculate_clusters() (it could be, but I'd prefer to extract the matrix, if needed, once and not each iteration).

Copy link
Member

@jashapiro jashapiro left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good. A few minor additional comments, but I love the simplicity of a good application of ...!

packages/rOpenScPCA/R/evaluate-clusters.R Outdated Show resolved Hide resolved
Comment on lines 235 to 236
replicate = i, # define this variable here to ensure it's numeric
ari = ari
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can keep the .before here; the only reason I commented was the assumption that replicate was moving to bind_rows, which would have made this a one-liner.

packages/rOpenScPCA/R/evaluate-clusters.R Outdated Show resolved Hide resolved
sjspielman and others added 2 commits September 25, 2024 09:46
@sjspielman
Copy link
Member Author

Thought of one more edge case - the nrow/length check between the matrix and clusters will pass even if clusters is not a vector, but eg a data frame with the same number of columns as the matrix. This is really unlikely to happen but can't hurt to catch. I updated here if you want to have a look: c38ff0a

@jashapiro
Copy link
Member

Thought of one more edge case - the nrow/length check between the matrix and clusters will pass even if clusters is not a vector, but eg a data frame with the same number of columns as the matrix. This is really unlikely to happen but can't hurt to catch. I updated here if you want to have a look: c38ff0a

I can imagine this failing in other ways you don't expect that would otherwise be fine (any object with an attribute will fail is.vector, not just factors; for example I think a list of clusters would actually work in the function), so I personally would not have bothered with this. If people do horrible things that result in failures down the line, we can't always stop them.

@sjspielman sjspielman merged commit e2c71af into AlexsLemonade:feature/ropenscpca Sep 25, 2024
3 checks passed
@sjspielman sjspielman deleted the 773-cluster-stability branch September 25, 2024 14:47
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants