Add function for cluster stability #779
…ad, check that it's not a data frame as this differs from other evaluation functions
test_that("calculate_stability works as expected with different replicates", {
  suppressWarnings({
fyi, suppressing warnings since all these calculations done on test data give warnings about ties from the ARI calculation.
Worth making that a comment in the code.
Overall, I think this is doing what we want, but I have two concerns:
The small one is that I think you can reduce a lot of duplication in code and documentation by using ... (the dots argument).
The big one is that this might not always do the replications we need: there is a strong risk of repeated replicates.
As I see the flow, it goes like this:
1. set seed
2. sample 1st set of cells
3. run clustering, which sets the seed again, then performs clustering calculations
4. sample 2nd set of cells. These will be different because the clustering calculation has called the RNG, advancing the seed
5. run clustering on second set of cells, which sets the seed again.
After this point, we may be in a different place from step 4, depending on the number of calls to the RNG, but there is a strong possibility (since the matrices are the same size) that we will be in exactly the same place. If we are, then every sampling/replicate after this will have the same values.
The solution, I think, is just not to set the seed in the internal calls to calculate_clusters(). We can just let the RNG advance as it will through the full set of replicates, setting the seed only the one time. (This does not affect the resetting in the sweep, which seems justified to ensure that the same parameter set will always result in the same clustering with the same seed, regardless of how many other parameter sets are being tested at the same time.)
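The flow described above can be reproduced in a minimal base R sketch. Everything here is hypothetical: cluster_with_seed() is a stand-in for a clustering call that optionally resets the seed and then consumes a fixed number of RNG draws, and run_replicates() is a stand-in for the bootstrap loop. Neither is the package's actual code.

```r
# Hypothetical stand-in for a clustering call that may reset the seed internally
cluster_with_seed <- function(x, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # stand-in for stochastic clustering work: same number of RNG draws every call
  stats::runif(length(x))
  invisible(NULL)
}

# Hypothetical bootstrap loop: seed once, then sample and "cluster" each replicate
run_replicates <- function(n_reps, inner_seed = NULL, seed = 2024) {
  set.seed(seed)
  lapply(seq_len(n_reps), function(i) {
    cells <- sample(100, size = 100, replace = TRUE)
    cluster_with_seed(cells, seed = inner_seed)
    cells
  })
}

# Resetting the seed inside every clustering call
reseeded <- run_replicates(4, inner_seed = 2024)
# Setting the seed only once, before the replicates start
single_seed <- run_replicates(4)

# With the inner reset, replicates 2 and 3 draw exactly the same cells
identical(reseeded[[2]], reseeded[[3]])
# Without it, the RNG keeps advancing and the samples differ
identical(single_seed[[2]], single_seed[[3]])
```

In this sketch the first comparison is TRUE because the RNG is in the same state after every reseeded clustering call, so every sample after the first repeats. Dropping the inner reset keeps the run reproducible while letting each replicate differ.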
) |>
dplyr::distinct() |>
dplyr::mutate(
replicate = i,
I would probably not do this here, but would instead use bind_rows(.id = "replicate") at the end.
Noting I tried this, but it turns out bind_rows() will force this column to be character, and I think we'd really like this to be numeric. In this case we'd need a mutate anyway to coerce, so one mutate to create it within the map seemed better.
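A small sketch of the trade-off discussed here, assuming dplyr is installed. The replicate_results list and its ari values are made up purely for illustration.

```r
library(dplyr)

# Hypothetical per-replicate results; the ari values are illustrative only
replicate_results <- list(
  data.frame(ari = 0.91),
  data.frame(ari = 0.88)
)

# Option 1: let bind_rows() create the id column; it comes back as character
bound <- bind_rows(replicate_results, .id = "replicate")
class(bound$replicate)

# Option 2: create the column inside the loop so it stays numeric
mapped <- lapply(seq_along(replicate_results), function(i) {
  mutate(replicate_results[[i]], replicate = i, .before = 1)
}) |>
  bind_rows()
class(mapped$replicate)
```

Option 1 would still need a follow-up mutate() to coerce the column, so both routes cost one mutate() call.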
objective_function = objective_function,
cluster_args = cluster_args,
threads = threads,
seed = seed
I think there is a subtle but major error here. The issue is that every time you rerun calculate_clusters() you are resetting the seed to the same value, so the next time you call sample(), you will get the same values for every run after the first. I don't think there should actually be any need to set the seed in the call to calculate_clusters(), as setting it before the replicates start will be sufficient for consistency.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
… since we want a numeric
…d test for this case
Should be ready for another look! Note also that I had to keep the |
Looks good. A few minor additional comments, but I love the simplicity of a good application of ...!
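For readers unfamiliar with the pattern being praised, here is a minimal base R sketch of forwarding arguments through the dots. cluster_stub() and stability_stub(), along with their algorithm and resolution parameters, are hypothetical stand-ins and not the package's real signatures.

```r
# Hypothetical inner function standing in for the clustering call
cluster_stub <- function(x, algorithm = "louvain", resolution = 1) {
  data.frame(
    n_cells = length(x),
    algorithm = algorithm,
    resolution = resolution
  )
}

# Hypothetical wrapper: it documents and checks only its own arguments,
# and forwards everything else untouched via the dots
stability_stub <- function(x, replicates = 3, ...) {
  results <- vector("list", replicates)
  for (i in seq_len(replicates)) {
    results[[i]] <- cluster_stub(x, ...)
  }
  do.call(rbind, results)
}

# resolution is never named in stability_stub(), yet it reaches cluster_stub()
out <- stability_stub(1:10, replicates = 2, resolution = 0.5)
out
```

Because the wrapper never names the inner arguments, their defaults, documentation, and validation live in one place only.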
replicate = i, # define this variable here to ensure it's numeric
ari = ari
You can keep the .before here; the only reason I commented was the assumption that replicate was moving to bind_rows, which would have made this a one-liner.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
Thought of one more edge case - the |
I can imagine this failing in other ways you don't expect that would otherwise be fine (any object with an attribute will fail |
Closes #773
This PR adds a function to bootstrap clusters and calculate ARI for a given number of reps. I ended up writing a function that mostly wraps calculate_clusters(), thereby letting that function handle argument checking (on the first bootstrap iteration). This function differs from the other evaluation functions in that it takes a vector of clusters (hence, I check that it's not a data frame; I do that because, as I learned today, is.vector(df$column) is FALSE). The function returns a data frame of ARI results and clustering parameters, as returned by calculate_clusters().
Note also that I updated examples across function docs to use a seed; we want to encourage seeds!
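As background on the is.vector() remark: in base R, is.vector() returns FALSE for any object that carries attributes other than names, so a factor column fails the check while a plain numeric vector passes. A short sketch, with made-up example objects:

```r
plain <- c(1, 2, 3)
named <- c(a = 1, b = 2)
fct <- factor(c("1", "2", "1"))
tagged <- structure(c(1, 2, 3), note = "extra attribute")

is.vector(plain)   # TRUE
is.vector(named)   # TRUE: names are the one attribute that is allowed
is.vector(fct)     # FALSE: factors carry levels and class attributes
is.vector(tagged)  # FALSE: any other attribute fails the check

# A factor column pulled from a data frame therefore fails is.vector(),
# whereas checking that the input is not a data frame behaves as intended
df <- data.frame(cluster = fct)
is.vector(df$cluster)      # FALSE
is.data.frame(df)          # TRUE
is.data.frame(df$cluster)  # FALSE
```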
(I'll also note, at one point I had the cute idea of actually providing cluster_df to this stability function and grabbing cluster parameters directly from the df, but changed my mind because, mainly, we won't necessarily know all the parameter columns that could be in that df because of cluster_args, and users may have added their own.)