Large data mode needs more granular checkpointing during clustering #178

Closed
jason-c-kwan opened this issue Jul 8, 2021 · 1 comment · Fixed by #207
Labels
enhancement New feature or request python Python related issues/code

Comments

@jason-c-kwan
Collaborator

This issue records Brian Couger's request that large data mode checkpoint during clustering, so that little time is wasted if the pipeline has to be resumed in an HPC environment. Below are the notes I took during the recent Autometa strategy meeting, outlining one potential way of implementing this.

We start with a set of datapoints that is to be clustered.

  1. Cycle through EPS values
  2. Decide upon the “best” EPS value
  3. Take out “good” bins
  4. Go back to 1 (until no more “good” bins are yielded).
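The loop above can be sketched roughly as follows. This is only an illustration of the four steps, not Autometa's implementation: `iterative_binning`, `gap_clusters`, the `is_good` callback, and the rule for picking the "best" EPS (the value that places the most contigs in good clusters) are all assumptions made for the example, and a toy 1-D gap clusterer stands in for the real clustering algorithm.

```python
from typing import Callable, Dict, List, Sequence, Tuple

Clusters = List[List[str]]


def gap_clusters(points: Dict[str, float], eps: float) -> Clusters:
    """Toy stand-in for the real clusterer: split sorted 1-D points at gaps > eps."""
    ordered = sorted(points, key=points.get)
    clusters: Clusters = []
    for name in ordered:
        if clusters and points[name] - points[clusters[-1][-1]] <= eps:
            clusters[-1].append(name)
        else:
            clusters.append([name])
    return clusters


def iterative_binning(
    contigs: Dict[str, float],
    eps_values: Sequence[float],
    cluster_at_eps: Callable[[Dict[str, float], float], Clusters],
    is_good: Callable[[List[str]], bool],
) -> Tuple[Clusters, Dict[str, float]]:
    """Repeat steps 1-4 until a round yields no good bins."""
    good_bins: Clusters = []
    remaining = dict(contigs)
    while remaining:
        # Step 1: cycle through EPS values.
        results = {eps: cluster_at_eps(remaining, eps) for eps in eps_values}

        # Step 2: decide upon the "best" EPS value (assumed scoring rule:
        # the most contigs placed in good clusters).
        def contigs_in_good_clusters(eps: float) -> int:
            return sum(len(c) for c in results[eps] if is_good(c))

        best_eps = max(eps_values, key=contigs_in_good_clusters)

        # Step 3: take out the good bins.
        new_bins = [c for c in results[best_eps] if is_good(c)]
        if not new_bins:
            break  # Step 4: stop once no more good bins are yielded.
        good_bins.extend(new_bins)
        for cluster in new_bins:
            for name in cluster:
                del remaining[name]
    return good_bins, remaining
```

The end of each pass through the `while` loop and the end of each EPS value inside step 1 are the two natural places to checkpoint.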

Checkpointing:

Every time step 4 finishes, record the good bins somewhere and note which contigs are left.
Also delete all EPS tables when step 4 finishes.
To resume, start step 1 with the contigs that are left.
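
A minimal sketch of this round-level checkpoint, assuming a JSON state file and per-EPS tables named `eps_*.tsv` in a checkpoint directory. The file names and the functions `save_round_checkpoint` and `load_round_checkpoint` are hypothetical and are not taken from the Autometa code base.

```python
import json
import os
from pathlib import Path
from typing import List, Sequence, Tuple


def save_round_checkpoint(
    ckpt_dir: Path, good_bins: List[List[str]], remaining: List[str]
) -> None:
    """Record the good bins and the leftover contigs, then drop stale EPS tables."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    tmp = ckpt_dir / "round.json.tmp"
    tmp.write_text(json.dumps({"good_bins": good_bins, "remaining": remaining}))
    # os.replace swaps the file in one step, so an interrupted job does not
    # leave a half-written checkpoint behind.
    os.replace(tmp, ckpt_dir / "round.json")
    # The per-EPS tables belonged to the round that just finished.
    for table in ckpt_dir.glob("eps_*.tsv"):
        table.unlink()


def load_round_checkpoint(
    ckpt_dir: Path, all_contigs: Sequence[str]
) -> Tuple[List[List[str]], List[str]]:
    """Return (good_bins, remaining); start from scratch if no checkpoint exists."""
    path = ckpt_dir / "round.json"
    if not path.exists():
        return [], list(all_contigs)
    state = json.loads(path.read_text())
    return state["good_bins"], state["remaining"]
```

On resume, the contigs returned as `remaining` would be fed back into step 1.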

Every time step 1 finishes an EPS value, record a table of the groupings for that EPS value and remember which EPS value is next.
To resume, note which tables already exist, read them into memory, and start the clustering algorithm at the next EPS value.
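
The per-EPS checkpoint could look something like the sketch below. It assumes one tab-separated table per EPS value with `contig` and `cluster` columns; the naming scheme and the functions `write_eps_table` and `resume_eps_sweep` are hypothetical. The "next" EPS value is not stored separately here: it is inferred as the first value without a table.

```python
import csv
import os
from pathlib import Path
from typing import Dict, List, Sequence, Tuple


def eps_table_path(ckpt_dir: Path, eps: float) -> Path:
    """Hypothetical naming scheme: one table per EPS value."""
    return ckpt_dir / f"eps_{eps:g}.tsv"


def write_eps_table(ckpt_dir: Path, eps: float, labels: Dict[str, int]) -> None:
    """Record the contig-to-cluster groupings produced at one EPS value."""
    ckpt_dir.mkdir(parents=True, exist_ok=True)
    final = eps_table_path(ckpt_dir, eps)
    tmp = final.with_name(final.name + ".tmp")
    with tmp.open("w", newline="") as fh:
        writer = csv.writer(fh, delimiter="\t")
        writer.writerow(["contig", "cluster"])
        writer.writerows(labels.items())
    # Rename only after the table is complete, so a table that exists
    # can be trusted on resume.
    os.replace(tmp, final)


def resume_eps_sweep(
    ckpt_dir: Path, eps_values: Sequence[float]
) -> Tuple[Dict[float, Dict[str, int]], List[float]]:
    """Read the tables that already exist and list the EPS values still to run."""
    done: Dict[float, Dict[str, int]] = {}
    pending: List[float] = []
    for eps in eps_values:
        path = eps_table_path(ckpt_dir, eps)
        if path.exists():
            with path.open(newline="") as fh:
                reader = csv.DictReader(fh, delimiter="\t")
                done[eps] = {row["contig"]: int(row["cluster"]) for row in reader}
        else:
            pending.append(eps)
    return done, pending
```

After a restart, `pending[0]` (if any) is the EPS value at which clustering would pick up again.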

@jason-c-kwan jason-c-kwan added the enhancement New feature or request label Jul 8, 2021
@evanroyrees evanroyrees linked a pull request Aug 3, 2021 that will close this issue
@evanroyrees
Collaborator

The checkpointing behavior you have outlined above is not quite the same as the checkpointing behavior implemented in the linked PR, but the linked PR behavior will allow resuming from where binning stopped (at the taxon iteration level).

@evanroyrees evanroyrees added the python Python related issues/code label Sep 28, 2021
@evanroyrees evanroyrees linked a pull request Jan 19, 2022 that will close this issue