
Better snakemake cluster submission support #129

Merged
AroneyS merged 38 commits into dev from add-resources on Sep 7, 2023

Conversation

AroneyS
Collaborator

@AroneyS AroneyS commented Aug 29, 2023

Add rule to localrules if simple, otherwise add resources with estimated cpu/mem/time.

Not sure about all the long-reads assemblies resources.

  • Add multiplier for resources with retry (if submitted job fails)
  • Add proper arguments to support snakemake args
    • e.g. --profile qsub --retries 3, where profile points to ~/.config/snakemake/qsub/config.yaml which contains cluster, cluster-status, jobs, cluster-cancel and other snakemake arguments
  • Add logs for each rule (stdout/stderr captured by cluster software)
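The profile mentioned above (`~/.config/snakemake/qsub/config.yaml`) might look something like the following sketch; the `qsub` flags and the status/cancel script names are illustrative assumptions for a PBS-style cluster, not Aviary defaults:

```yaml
# ~/.config/snakemake/qsub/config.yaml -- illustrative values only
cluster: "qsub -l ncpus={threads},mem={resources.mem_gb}gb,walltime={resources.runtime} -j oe -o {log}"
cluster-status: "qsub_status.sh"   # hypothetical script reporting running/success/failed
cluster-cancel: "qdel"
jobs: 50
retries: 3
```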

better snakemake cluster submission
@AroneyS AroneyS changed the base branch from main to dev August 29, 2023 00:22
@AroneyS AroneyS requested review from rhysnewell and wwood August 29, 2023 00:22
rhysnewell previously approved these changes Aug 29, 2023
@wwood
Collaborator

wwood commented Aug 29, 2023 via email

@rhysnewell rhysnewell self-requested a review August 31, 2023 01:45
@rhysnewell rhysnewell dismissed their stale review August 31, 2023 01:46

More commits added, needs re-review

@rhysnewell
Owner

Let me know when you're ready for this to be reviewed

run all other independent jobs when job errors
@AroneyS
Collaborator Author

AroneyS commented Aug 31, 2023

@rhysnewell @wwood Found a bug in snakemake where it solves the DAG and then exits with no error. This only happened with >1 binner with cluster submission. After much debugging, it turns out that it's because the refinery rules don't have the group: keyword. So, since cluster submission tries to combine groups, it couldn't find anything to submit.

Anyway, solution is to either add the refinery rules to group: 'binning', or remove groups entirely. I'm leaning towards the latter (or to replace them with specific grouped rules) to make the submitted jobs smaller. Was there a reason that the groups were added in the first place? I see they are split into binning/assembly/annotation/isolate...
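The two options sketched in snakemake-rule form (the rule body is elided; only the group name comes from the discussion, the rest is illustrative):

```snakemake
# Option 1: attach the refinery rules to the existing group so cluster
# submission can bundle them with the other binning jobs
rule refine_rosella:
    group: 'binning'
    ...

# Option 2: remove the group: keyword entirely, so each rule is
# submitted as its own (smaller) cluster job
rule refine_rosella:
    ...
```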

@wwood
Collaborator

wwood commented Aug 31, 2023

Who wrote them in according to git blame? I thought they were only for cluster/cloud submission, so probably best to delete, I think, but maybe there's a reason I don't appreciate.

@AroneyS
Collaborator Author

AroneyS commented Sep 1, 2023

@rhysnewell @wwood Ran integration tests: test_short_read_recovery, test_long_read_recovery and test_short_read_recovery_queue_submission all pass.

The only issue I see now is that the refinery logs (from rosella_refinery.py) are all empty. Not sure why, since I used the same logging method as run_checkm.py, etc., and their logs are populated.

@rhysnewell
Owner

I just had a quick look, and stdout is being written to the log file for both the run_checkm and refine_rosella rules. All logging info for rosella comes from stderr, so you'd need to pipe stderr to the log file. Stdout should generally remain empty for a rosella run. I could be incorrect though, just from a glance

@AroneyS
Collaborator Author

AroneyS commented Sep 3, 2023

The stdout and stderr should both be going to log files since stderr is redirected to STDOUT. I did that since some of the tools aren't well-behaved...

import os
import subprocess

# bin_folder, bin_ext, output_folder, threads and the open log file
# handle logf are defined earlier in the script
subprocess.run(
    f"checkm2 predict -i {bin_folder}/ -x {bin_ext} -o {output_folder} -t {threads} --force".split(),
    env=os.environ,
    stdout=logf,
    stderr=subprocess.STDOUT,
)

@rhysnewell
Owner

Ah gotcha, hmm. I guess you'd have to check firstly that the log output isn't just being piped somewhere, and secondly that the refinement is even running? Like it might be getting skipped in your tests due to lack of bins or something

@AroneyS
Collaborator Author

AroneyS commented Sep 3, 2023

Hah, yeh of course. The input CheckM files are empty. I'll log that refinement is being skipped.

@AroneyS
Collaborator Author

AroneyS commented Sep 4, 2023

@rhysnewell @wwood I think it's ready. Good luck reading +1191/-676 lines; at least most of them are just removing groups and adding resources/logs.

Collaborator

@wwood wwood left a comment


Hi Sam, thanks for all this work. I didn't have time to go through each and every line, but the ones I looked at were good.

Happy to see there's further testing going on too.

I've only really requested a bit more documentation, I think a short example on using the cluster profile would work well enough.

Beyond this, I think it would be worth gathering a list of advantages of Aviary over other workflows, and listing those on the README. Can you please make a start on that, maybe in another PR?

Ta

base_group.add_argument(
'--snakemake-profile',
help='Snakemake profile (see https://snakemake.readthedocs.io/en/stable/executing/cli.html#profiles)\n'
'Create profile as `~/.config/snakemake/[CLUSTER_PROFILE]/config.yaml`. \n'
Collaborator

Unclear whether you mean the user or aviary does the creating. I think an example yml in the doco would go a long way here.

Collaborator Author

I added some stuff under Advanced Usage. Is that what you wanted?

@wwood
Collaborator

wwood commented Sep 5, 2023

Sounds good, but maybe add a subtitle rather than just advanced usage.

Also, is this still true? From that same doc. Would be worth adding a few words to say that it is only sorta true, e.g. reruns in cluster mode are requeued asking for increasingly more RAM:

When performing assembly, users are required to estimate how much RAM they will need to use via -m, --max-memory, --max_memory

@AroneyS
Collaborator Author

AroneyS commented Sep 5, 2023

@wwood
The memory arg is used as a cap on job submission memory. So the job memory will increase from a set amount with each retry but stop at the cap.
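A minimal sketch of that behaviour (the constant names and base/multiplier values are hypothetical illustrations, not Aviary's actual defaults; the callable follows Snakemake's convention that `attempt` starts at 1 and increments on each retry):

```python
# Illustrative: retry-scaled memory request capped at a user-supplied maximum.
BASE_MEM_GB = 32    # hypothetical starting request
MULTIPLIER = 2      # hypothetical per-retry multiplier
MAX_MEM_GB = 250    # e.g. the cap a user passed via --max-memory

def mem_gb(wildcards, attempt):
    """Memory to request for a given attempt, never exceeding the cap."""
    return min(BASE_MEM_GB * MULTIPLIER ** (attempt - 1), MAX_MEM_GB)
```

So a job would ask for 32 GB on its first try, 64 GB after one failure, and so on, but never more than the 250 GB cap.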

@AroneyS AroneyS merged commit de1a48a into dev Sep 7, 2023
@AroneyS AroneyS deleted the add-resources branch September 7, 2023 04:07
@AroneyS AroneyS mentioned this pull request Sep 13, 2023