Large data mode needs more granular checkpointing during clustering #178

jason-c-kwan · 2021-07-08T20:45:30Z

This issue records Brian Couger's request that large data mode checkpoint during clustering, so that less time is wasted if the pipeline has to be resumed in an HPC environment. Below are the notes I took during the recent Autometa strategy meeting, outlining a potential implementation.

We start with a set of datapoints that is to be clustered.

1. Cycle through EPS values.
2. Decide upon the “best” EPS value.
3. Take out “good” bins.
4. Go back to step 1 (until no more “good” bins are yielded).
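
For reference, the loop above could be sketched roughly as follows. This is only an illustration of the four steps, not Autometa's actual code: `iterative_binning`, `cluster_at_eps`, `score` and `is_good` are hypothetical names, and the clustering, scoring and bin-quality logic are passed in as callables because the notes do not specify them. It assumes `eps_values` is non-empty.

```python
from typing import Callable, Dict, List, Sequence, Set

Clustering = Dict[int, Set[str]]  # cluster label -> contig IDs


def iterative_binning(
    contigs: Set[str],
    eps_values: Sequence[float],
    cluster_at_eps: Callable[[Set[str], float], Clustering],  # hypothetical
    score: Callable[[Clustering], float],  # hypothetical: higher is better
    is_good: Callable[[Set[str]], bool],  # hypothetical bin-quality check
) -> List[Set[str]]:
    """Repeat steps 1-4 until a round yields no more "good" bins."""
    remaining = set(contigs)
    good_bins: List[Set[str]] = []
    while remaining:
        # Step 1: cycle through EPS values on the contigs that are left.
        results = {eps: cluster_at_eps(remaining, eps) for eps in eps_values}
        # Step 2: decide upon the "best" EPS value.
        best_eps = max(results, key=lambda eps: score(results[eps]))
        # Step 3: take out "good" bins.
        new_bins = [m for m in results[best_eps].values() if is_good(m)]
        if not new_bins:
            break  # Step 4: stop once a round yields no "good" bins.
        for members in new_bins:
            good_bins.append(set(members))
            remaining -= members
        # Step 4: go back to step 1 with the remaining contigs.
    return good_bins
```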

Checkpointing:

Every time step 4 finishes, record the good bins somewhere and note which contigs are left.
Also delete all EPS tables when step 4 finishes.
For resume: just start step 1 with the contigs you have left.

Every time step 1 finishes an EPS value, record a table of the groupings for that EPS value and remember which EPS value is “next”.
For resume: note that tables already exist, read them into memory, and start the clustering algorithm at the next EPS value.
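
A minimal sketch of how these two checkpoint levels might fit together is below. It is not what the linked PR implements. Everything here is an assumption made for illustration: the function name `checkpointed_binning`, the JSON file layout (`state.json`, `round<N>_eps<i>.json`), integer cluster labels, and the pluggable `cluster_at_eps`, `score` and `is_good` callables. The round number in each table's file name is an addition to the notes, so that tables left behind by a job killed mid-cleanup are ignored on resume.

```python
import json
from pathlib import Path
from typing import Callable, Dict, List, Sequence, Set

Clustering = Dict[int, Set[str]]  # cluster label -> contig IDs


def _write_json(path: Path, payload: object) -> None:
    """Write to a temp file, then rename, so a killed job leaves no partial file."""
    tmp = path.with_name(path.name + ".tmp")
    tmp.write_text(json.dumps(payload))
    tmp.replace(path)


def checkpointed_binning(
    contigs: Set[str],
    eps_values: Sequence[float],
    cluster_at_eps: Callable[[Set[str], float], Clustering],  # hypothetical
    score: Callable[[Clustering], float],  # hypothetical: higher is better
    is_good: Callable[[Set[str]], bool],  # hypothetical bin-quality check
    ckpt_dir: str,
) -> List[Set[str]]:
    ckpt = Path(ckpt_dir)
    ckpt.mkdir(parents=True, exist_ok=True)
    state_path = ckpt / "state.json"
    state = {"round": 0, "remaining": sorted(contigs), "good_bins": [], "finished": False}
    if state_path.exists():
        # Resume: start step 1 with the contigs that were left.
        state = json.loads(state_path.read_text())
    remaining = set(state["remaining"])
    good_bins = [set(b) for b in state["good_bins"]]
    round_no, finished = state["round"], state["finished"]

    while remaining and not finished:
        results: Dict[float, Clustering] = {}
        for i, eps in enumerate(eps_values):
            table = ckpt / f"round{round_no}_eps{i}.json"
            if table.exists():
                # Resume: this EPS table already exists, so read it into memory.
                raw = json.loads(table.read_text())
                results[eps] = {int(k): set(v) for k, v in raw.items()}
            else:
                results[eps] = cluster_at_eps(remaining, eps)
                # Record the groupings for this EPS value; the first missing
                # table marks which EPS value is "next".
                _write_json(table, {str(k): sorted(v) for k, v in results[eps].items()})
        best_eps = max(results, key=lambda e: score(results[e]))
        new_bins = [m for m in results[best_eps].values() if is_good(m)]
        for members in new_bins:
            good_bins.append(set(members))
            remaining -= members
        finished = not new_bins
        round_no += 1
        # Step 4 finished: record good bins and remaining contigs first ...
        _write_json(state_path, {
            "round": round_no,
            "remaining": sorted(remaining),
            "good_bins": [sorted(b) for b in good_bins],
            "finished": finished,
        })
        # ... then delete all EPS tables.
        for table in ckpt.glob("round*_eps*.json"):
            table.unlink()
    return good_bins
```

Calling the function again with the same `ckpt_dir` after an interrupted run picks up at the first EPS value that has no table yet, or at the next round if the previous one completed.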

evanroyrees · 2021-08-03T20:05:57Z

The checkpointing behavior you have outlined above is not quite the same as the checkpointing behavior implemented in the linked PR, but the linked PR behavior will allow resuming from where binning stopped (at the taxon iteration level).

jason-c-kwan added the enhancement New feature or request label Jul 8, 2021

evanroyrees linked a pull request Aug 3, 2021 that will close this issue

Large data mode #182

Closed

evanroyrees added the python Python related issues/code label Sep 28, 2021

evanroyrees linked a pull request Jan 19, 2022 that will close this issue

🐍🐎 Large data mode #207

Merged

evanroyrees closed this as completed Jan 24, 2022
