-
Notifications
You must be signed in to change notification settings - Fork 18
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Add function to sweep clustering parameters #765
Add function to sweep clustering parameters #765
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall this looks good.
I think my biggest comment is that I would want to support multiple algorithms, but I can see your argument for not doing so. It definitely does introduce some complexity, but that will have to be handled somewhere anyway... I guess the stongest argument would be that sometimes parameter ranges should be different for different algorithms. But supporting it doesn't mean you have to use it!
Part of that reasoning is that I think you do want to check the parameters against the algorithm anyway: if there are unused parameters given they should not be passed along to the sweep, as that will result in duplicate runs.
I think leaving the output as a list probably makes sense, as it makes it easier to combine different runs later. Once you put in an id column in the data frame, then you have to worry about it being unique.
I didn't evaluate testing yet, as I think it will change if the output is changing!
#' Calculate clusters across a set of parameters | ||
#' | ||
#' This function can be used to perform reproducible clustering while varying a set of parameters. | ||
#' A single clustering algorithm is required, but multiple values can be provided for any of: |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm curious why you made this decision? Why not allow comparison across clustering algorithms? I assume it is because other parameter variations were annoying to handle, but I certainly imagine that such comparisons might be desired, so the effort might be worth it.
# Even parameters that won't be used can be included | ||
# since calculate_clusters() will ignore them anyways |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is true, then why not allow variation in algorithms too?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thinking about this a bit more: we should probably discard/convert to NA
any unused parameters. Otherwise there could be a lot of extra clustering when an unused parameter is set to vary. Again, I think this is worth the effort to support multiple clustering algorithms, amutate()
with a few ifelse()
substitutions followed by a distinct()
and we should be all set.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ah I see this is relevant to this comment - #765 (comment)
Otherwise there could be a lot of extra clustering when an unused parameter is set to vary.
I didn't see this at first, but yeah I see it now...
) | ||
} | ||
) |> | ||
dplyr::bind_rows(.id = "cluster_set") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this id just an integer? If so, do we need it? Or would it bet better to generate this from unique combinations of the parameters? Or at least add the algorithm as a prefix? Something to make it more likely to remain unique if multiple tables are joined later.
Update: yes, I see that it is needed in the current case to handle the case where an unused parameter is given with variation.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looking back at your introductory comments, I think I would probably leave it a list. But I will note that bind_rows
handles mismatch columns just fine, so you shouldn't need to worry about that!
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since you asked, I would call this cluster-sweep.R
or sweep-clustering.R
Here's a question for you regarding allowing multiple algorithms: What is the right way to handle varying parameters that are shared across algorithms? For example, here,
Alternatively, we could have some very fine control. Taking a stab at this conceptually but it feels a bit gnarly?
|
I would do the former, but then handle it in the function with |
I had basically this exact this code in there yesterday before I locked down the algorithm! But in the end, I didn't think it was actually needed since |
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
The reason we can't just let |
…r default instead
This is getting closer, but is definitely not there yet. Some updates:
|
I would vote for no |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Mostly style comments here, but also I think you can revert the last change (setting and dealing with NA). I hadn't appreciated how well we throw out the default values, so I think we can leave your previous solution which did what I was worried it did not!
objective_function <- match.arg(objective_function) | ||
|
||
# this might be NA if it came from the sweep_clusters() function | ||
if (is.na(objective_function)) { | ||
objective_function <- NULL | ||
} else { | ||
objective_function <- match.arg(objective_function) | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
On reflection, since this parameter only shows up in the output table if it is not NA
, the previous solution should solve this just fine, and you can revert this change! I thought it would end up included with the default value, but we already don't do that because of the way we (properly!) handle cluster_args. (If you do want to keep it though, you will need to wrap is.na
with an all()
or check the length, I realized.)
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM
I suggested two more quick tests (Seurat and matrix inputs), but I don't think I need to see this again.
|
||
test_that("sweep_clusters works as expected with default algorithm & weighting", { | ||
sweep_list <- sweep_clusters( | ||
sce, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
since there is logic in the sweep function for the conversion, it is probably worth including a quick test that this function works for matrices and Seurat objects as well. I don't think you need to do anything but check that those functions run and produce output, so using full defaults (one clustering) should be fine.
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
Closes #755
This PR adds a function and tests to perform clustering across a set of parameters. Implementation details:
cluster_set
to indicate which round of clustering the values pertain to, and for easiersplit
ing in the future. Note that I could also leave this as a list and remove thedplyr::bind_rows()
and allow users to take this step themselves if they prefer. A list might also be preferable since this function will end up getting used in a template Rmd to perform/evaluate clustering parameters, so for a lot of that we'd have to revert to a list anyways.