Add function to sweep clustering parameters #765

Merged

Conversation

sjspielman
Member

Closes #755

This PR adds a function and tests to perform clustering across a set of parameters. Implementation details:

  • First, I am hardly enamored of my function and file naming. Do you agree, and if so, what thoughts do you have for me?
  • The function ends up creating a data frame with an indicator column cluster_set to indicate which round of clustering the values pertain to, and for easier splitting in the future. Note that I could also leave this as a list and remove the dplyr::bind_rows() and allow users to take this step themselves if they prefer. A list might also be preferable since this function will end up getting used in a template Rmd to perform/evaluate clustering parameters, so for a lot of that we'd have to revert to a list anyways.
    • In part because of the need to keep these columns consistent, and also because it felt overkill/beyond the scope of this function's specific purpose, I do not allow users to vary the algorithm. You get to pick 1 algorithm and vary parameters for that algorithm.
  • I ended up adding a threads argument, since this can get time consuming with real SCEs
    • Speaking of real SCEs, how long does it take to extract a matrix from an SCE or Seurat object (a real big one, not simulated!)? It's essentially instantaneous. I still extract the matrix to begin with to save some time, but it's small potatoes either way.
  • Any other tests we'd like to see?

@sjspielman sjspielman removed the request for review from jaclyn-taroni September 18, 2024 14:04
Member

@jashapiro jashapiro left a comment

Overall this looks good.

I think my biggest comment is that I would want to support multiple algorithms, but I can see your argument for not doing so. It definitely does introduce some complexity, but that will have to be handled somewhere anyway... I guess the strongest argument would be that sometimes parameter ranges should be different for different algorithms. But supporting it doesn't mean you have to use it!

Part of that reasoning is that I think you do want to check the parameters against the algorithm anyway: if there are unused parameters given they should not be passed along to the sweep, as that will result in duplicate runs.

I think leaving the output as a list probably makes sense, as it makes it easier to combine different runs later. Once you put in an id column in the data frame, then you have to worry about it being unique.

I didn't evaluate testing yet, as I think it will change if the output is changing!

#' Calculate clusters across a set of parameters
#'
#' This function can be used to perform reproducible clustering while varying a set of parameters.
#' A single clustering algorithm is required, but multiple values can be provided for any of:
Member

I'm curious why you made this decision? Why not allow comparison across clustering algorithms? I assume it is because other parameter variations were annoying to handle, but I certainly imagine that such comparisons might be desired, so the effort might be worth it.

Comment on lines 83 to 84
# Even parameters that won't be used can be included
# since calculate_clusters() will ignore them anyways
Member

If this is true, then why not allow variation in algorithms too?

Member

Thinking about this a bit more: we should probably discard/convert to NA any unused parameters. Otherwise there could be a lot of extra clustering when an unused parameter is set to vary. Again, I think this is worth the effort to support multiple clustering algorithms: a mutate() with a few ifelse() substitutions followed by a distinct() and we should be all set.

Member Author

Ah I see this is relevant to this comment - #765 (comment)

Otherwise there could be a lot of extra clustering when an unused parameter is set to vary.

I didn't see this at first, but yeah I see it now...

packages/rOpenScPCA/R/sweep-parameters.R (outdated)
)
}
) |>
dplyr::bind_rows(.id = "cluster_set")
Member

Is this id just an integer? If so, do we need it? Or would it be better to generate this from unique combinations of the parameters? Or at least add the algorithm as a prefix? Something to make it more likely to remain unique if multiple tables are joined later.

Update: yes, I see that it is needed in the current case to handle the case where an unused parameter is given with variation.

Member

Looking back at your introductory comments, I think I would probably leave it as a list. But I will note that bind_rows() handles mismatched columns just fine, so you shouldn't need to worry about that!
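
As a rough illustration of both points (mismatched columns and id uniqueness), here is a minimal sketch using dplyr::bind_rows(). The data frames, their values, and the list names such as louvain_nn10 are made up for this example and are not taken from the PR.

```r
library(dplyr)

# Two made-up clustering results; only the leiden table has objective_function
louvain_df <- data.frame(
  cell_id = c("a", "b"), cluster = c(1, 2), algorithm = "louvain"
)
leiden_df <- data.frame(
  cell_id = c("a", "b"), cluster = c(1, 1), algorithm = "leiden",
  objective_function = "CPM"
)

# Unnamed list: the id column is just the position in the list,
# so ids from two separate sweeps would collide if joined later
combined <- bind_rows(list(louvain_df, leiden_df), .id = "cluster_set")

# The column missing from louvain_df is filled with NA rather than erroring
combined$objective_function

# Named list: the names become the ids, which are easier to keep unique
named <- bind_rows(
  list(louvain_nn10 = louvain_df, leiden_nn10 = leiden_df),
  .id = "cluster_set"
)
unique(named$cluster_set)
```

Keeping the output as a list leaves the choice of names, and therefore of ids, to whoever combines the runs.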

Member

since you asked, I would call this cluster-sweep.R or sweep-clustering.R

@sjspielman
Member Author

Here's a question for you regarding allowing multiple algorithms: What is the right way to handle varying parameters that are shared across algorithms?

For example, here,

  • objective_function will apply to only leiden since it's algorithm-specific
  • but nn would apply to both louvain and leiden.
sweep_clusters(
  sce, 
  algorithm = c("louvain", "leiden"), 
  objective_function = "modularity", 
  nn = c(10, 15)
)

Alternatively, we could have some very fine control. Taking a stab at this conceptually but it feels a bit gnarly?

sweep_clusters(
  sce,
  list(
    "louvain" = list(nn = c(10, 15)),
    "leiden" = list(nn = c(20, 25), objective_function = "modularity")
  )
)
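
To make the second, fine-grained option more concrete, here is a minimal base R sketch of how such a nested specification could be expanded into one parameter grid per algorithm. The spec object and the expansion logic are hypothetical and only illustrate the idea; nothing here is the PR's implementation.

```r
# Hypothetical nested specification, mirroring the second call above
spec <- list(
  louvain = list(nn = c(10, 15)),
  leiden = list(nn = c(20, 25), objective_function = "modularity")
)

# Expand each algorithm's parameters into its own grid of combinations
grids <- lapply(names(spec), function(alg) {
  grid <- expand.grid(spec[[alg]], stringsAsFactors = FALSE)
  data.frame(algorithm = alg, grid, stringsAsFactors = FALSE)
})
names(grids) <- names(spec)

# Each grid only carries the columns relevant to its algorithm
names(grids$louvain)
names(grids$leiden)
```

The cost of this design is that the per-algorithm grids have different columns, so they are easier to keep as a list than to combine up front.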

@jashapiro
Member

I would do the former, but then handle it in the function with mutate(objective_function = ifelse(algorithm == "leiden", objective_function, NA_character_)) after the expand_grid().

@sjspielman
Member Author

I would do the former, but then handle it in the function with mutate(objective_function = ifelse(algorithm == "leiden", objective_function, NA_character_))

I had basically this exact code in there yesterday before I locked down the algorithm! But in the end, I didn't think it was actually needed since calculate_clusters will ignore irrelevant parameters. I suspect we can just toss everything into the parameters list and the situation will sort itself out, but I look forward to finding out what I might be missing 😄

Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@jashapiro
Member

The reason we can't just let calculate_clusters ignore it is the multiple runs problem. The mutate function should be followed by a distinct to remove repeats.
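
The multiple runs problem can be sketched with a small grid. This is only an illustration using tidyr::expand_grid() with dplyr::mutate() and dplyr::distinct(); the parameter values (including "CPM" as a second objective function) are assumptions chosen for the example, not values from the PR.

```r
library(dplyr)
library(tidyr)

# Made-up sweep inputs: every combination of the three parameters
param_grid <- expand_grid(
  algorithm = c("louvain", "leiden"),
  objective_function = c("CPM", "modularity"),
  nn = c(10, 15)
)

# Blank out objective_function where the algorithm does not use it,
# then drop the rows that have become identical
deduplicated <- param_grid |>
  mutate(
    objective_function = ifelse(
      algorithm == "leiden", objective_function, NA_character_
    )
  ) |>
  distinct()

nrow(param_grid)   # 8 combinations before deduplication
nrow(deduplicated) # 6 remain: 2 louvain runs + 4 leiden runs
```

Without the distinct() step, louvain would be clustered twice for each nn value, once per objective_function value it never uses.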

@sjspielman
Member Author

This is getting closer, but is definitely not there yet. Some updates:

  • While NA makes sense for avoiding duplicate runs when irrelevant params are varied, match.arg() doesn't accept it. So, instead of NA, I used the values that represent the defaults for those parameters, which maybe seemed like a decent middle ground (maybe? decent? middle? i'm not hedging, you're hedging!). Did you have a different thought there?
  • Note that I caught a bug where we can't be adding in cluster_args to the final data frame if it's empty, so I fixed (and tested) here 0e47c81
  • Speaking of cluster_args - this is a problem! We don't check that these are sane for the given algorithm, and I tend to think we indeed should not be in the business of doing so since it depends on igraph. That said, bluster will fail if an irrelevant parameter is passed in, and that's very much a possibility with the sweep function. Here's what I've thought of (noting I prefer the latter) -
    • We could cancel cluster_args for the sweep function
    • We could have users supply this argument as a nested list per algorithm, e.g. cluster_args = list(louvain = ..., walktrap = ...).
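
On the first point in the list above, the match.arg() behavior can be reproduced in base R. This is a small sketch; the choices vector c("CPM", "modularity") is an assumption for illustration and may not match the choices used in the package.

```r
# Assumed set of valid values, for illustration only
choices <- c("CPM", "modularity")

# A valid value passes through unchanged
match.arg("modularity", choices)

# NA is not one of the choices, so match.arg() raises an error;
# tryCatch() captures it here instead of stopping the script
result <- tryCatch(
  match.arg(NA_character_, choices),
  error = function(e) e
)
inherits(result, "error")
```

This is why an NA placeholder needs to be handled before the argument reaches match.arg(), or replaced by a valid value as described above.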

@jashapiro
Member

I would vote for no cluster_args in the sweep function. If you really need it, you can write your own sweep. And if there are particular cluster_args worth adding, we can add support as needed.

Member

@jashapiro jashapiro left a comment

Mostly style comments here, but also I think you can revert the last change (setting and dealing with NA). I hadn't appreciated how well we throw out the default values, so I think we can leave your previous solution which did what I was worried it did not!

Comment on lines 89 to 101
objective_function <- match.arg(objective_function)

# this might be NA if it came from the sweep_clusters() function
if (is.na(objective_function)) {
  objective_function <- NULL
} else {
  objective_function <- match.arg(objective_function)
}
Member

On reflection, since this parameter only shows up in the output table if it is not NA, the previous solution should solve this just fine, and you can revert this change! I thought it would end up included with the default value, but we already don't do that because of the way we (properly!) handle cluster_args. (If you do want to keep it though, you will need to wrap is.na with an all() or check the length, I realized.)
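
The parenthetical about wrapping is.na() with all() can be sketched as follows. The function resolve_objective() is a hypothetical stand-in written for this example, and its default choices are assumed; it is not the code from the PR.

```r
# Hypothetical stand-in for the argument handling under review
resolve_objective <- function(objective_function = c("CPM", "modularity")) {
  # is.na() is element-wise, so the untouched default (length 2) would give
  # a length-2 condition; all() collapses it to a single TRUE or FALSE
  if (all(is.na(objective_function))) {
    return(NULL)
  }
  match.arg(objective_function)
}

resolve_objective()             # default vector: first choice is used
resolve_objective("modularity") # explicit valid value
resolve_objective(NA)           # NA placeholder becomes NULL
```

Recent versions of R treat a condition of length greater than one in if () as an error, which is the situation the all() wrapper avoids when the default vector is left in place.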

Member

@jashapiro jashapiro left a comment

LGTM

I suggested two more quick tests (Seurat and matrix inputs), but I don't think I need to see this again.


test_that("sweep_clusters works as expected with default algorithm & weighting", {
sweep_list <- sweep_clusters(
sce,
Member

since there is logic in the sweep function for the conversion, it is probably worth including a quick test that this function works for matrices and Seurat objects as well. I don't think you need to do anything but check that those functions run and produce output, so using full defaults (one clustering) should be fine.

sjspielman and others added 2 commits September 19, 2024 13:55
Co-authored-by: Joshua Shapiro <josh.shapiro@ccdatalab.org>
@sjspielman sjspielman merged commit 6f67c73 into AlexsLemonade:feature/ropenscpca Sep 19, 2024
3 checks passed
@sjspielman sjspielman deleted the 755-sweep-clustering branch September 19, 2024 19:05