Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

aws-batch: support Snakemake --report #373

Closed
joverlee521 opened this issue Jun 17, 2024 · 7 comments · Fixed by #374 or nextstrain/docker-base#220
Closed

aws-batch: support Snakemake --report #373

joverlee521 opened this issue Jun 17, 2024 · 7 comments · Fixed by #374 or nextstrain/docker-base#220
Assignees
Labels
enhancement New feature or request

Comments

@joverlee521
Copy link
Contributor

Context

Snakemake has removed the --stats option in v8, so I'm looking into the --report option for long term workflow stats.

The Snakemake report must be generated after the workflow has finished. I thought this would be as simple as attaching/downloading an old AWS Batch job then running nextstrain build . --report.

When I did this for ncov-ingest, I saw a bunch of warnings along the lines of:

Missing metadata for file data/gisaid/metadata.tsv. Maybe metadata was deleted or it was created using an older version of Snakemake. This is a non critical warning.

I then realized we are explicitly excluding Snakemake state in the downloads from AWS Batch:

# We don't want the remote Snakemake state to interfere locally…
".snakemake/",
# Ignore Python bytecode
"*.pyc",
"__pycache__/",
])
included = path_matcher([
# But we do want the Snakemake logs to come over.
".snakemake/log/",
])

Possible solutions

  1. Include .snakemake/metadata in the downloads from AWS Batch so that users can generate the Snakemake report locally.
  2. Automatically generate the Snakemake report within the AWS Batch job so that users can download the rendered report

[2] definitely seems like the nicer option and maybe should be applied across all runtimes for nextstrain build?

@joverlee521 joverlee521 added the enhancement New feature or request label Jun 17, 2024
@tsibley
Copy link
Member

tsibley commented Jun 17, 2024

Hmm. Downloading the Snakemake state locally may fix this problem, but it can/will cause other problems. I don't know if it'd be ok if we scope it down to not all of .snakemake/ but just .snakemake/metadata/ as you suggest in 1.

What's the effect of the warnings? Is there useful information missing from the report? Or is just wanting to suppress the noise from Snakemake?

@joverlee521
Copy link
Contributor Author

What's the effect of the warnings? Is there useful information missing from the report? Or is just wanting to suppress the noise from Snakemake?

The generated report does not include any runtime info:

Screenshot 2024-06-17 at 2 21 33 PM

@tsibley
Copy link
Member

tsibley commented Jun 17, 2024

Ah, looking more closely at the contents of .snakemake/metadata/, I do think we want to start downloading it by default. It's mostly info used to determine if Snakemake needs to re-run rules based on inputs/outputs, and thus is akin to the file mtimes which we already preserve on download.

@tsibley tsibley self-assigned this Jun 17, 2024
@tsibley
Copy link
Member

tsibley commented Jun 17, 2024

  1. Include .snakemake/metadata in the downloads from AWS Batch so that users can generate the Snakemake report locally.

We should do this, per above. I'll open a PR.

2. Automatically generate the Snakemake report within the AWS Batch job so that users can download the rendered report

[2] definitely seems like the nicer option and maybe should be applied across all runtimes for nextstrain build?

We could also do this as well, but it requires a little more consideration about how/where/when. Would you open it as a separate issue if you'd like to see it?

@tsibley
Copy link
Member

tsibley commented Jun 17, 2024

This will also need a new docker-base image, as the same exclusions of .snakemake/ are recapitulated there:

https://github.com/nextstrain/docker-base/blob/ccac0787cbc6118d0e39518e2220f650e121b8bd/entrypoint-aws-batch#L43-L44

tsibley added a commit to nextstrain/docker-base that referenced this issue Jun 17, 2024
Snakemake stores state information per-input/output here and uses it to
determine if it needs to re-run rules or not.  It seems akin to the file
mtimes which we already take care to preserve on upload/download.
Additionally, the metadata recorded is used in Snakemake's report
generation and is generally useful for looking at workflow statistics.

Continue to not upload all of .snakemake/ en masse because it can
potentially contain files that interfere with local usage and/or are
large and unnecessary.

Related-to: <nextstrain/cli#373>
tsibley added a commit that referenced this issue Jun 17, 2024
Snakemake stores state information per input/output here and uses it to
determine if it needs to re-run rules or not.  It seems akin to the file
mtimes which we already take care to preserve on upload/download.
Additionally, the metadata recorded is used in Snakemake's report
generation and is generally useful for looking at workflow statistics.

Continue to not download all of .snakemake/ en masse because it can
potentially contain files that interfere with local usage and/or are
large and unnecessary.

Resolves: <#373>
Related-to: <nextstrain/docker-base#220>
tsibley added a commit to nextstrain/docker-base that referenced this issue Jun 17, 2024
Snakemake stores state information per input/output here and uses it to
determine if it needs to re-run rules or not.  It seems akin to the file
mtimes which we already take care to preserve on upload/download.
Additionally, the metadata recorded is used in Snakemake's report
generation and is generally useful for looking at workflow statistics.

Continue to not upload all of .snakemake/ en masse because it can
potentially contain files that interfere with local usage and/or are
large and unnecessary.

Resolves: <nextstrain/cli#373>
Related-to: <nextstrain/cli#374>
tsibley added a commit that referenced this issue Jun 17, 2024
Snakemake stores state information per input/output here and uses it to
determine if it needs to re-run rules or not.  It seems akin to the file
mtimes which we already take care to preserve on upload/download.
Additionally, the metadata recorded is used in Snakemake's report
generation and is generally useful for looking at workflow statistics.

Continue to not download all of .snakemake/ en masse because it can
potentially contain files that interfere with local usage and/or are
large and unnecessary.

Resolves: <#373>
Related-to: <nextstrain/docker-base#220>
tsibley added a commit that referenced this issue Jun 17, 2024
Snakemake stores state information per input/output here and uses it to
determine if it needs to re-run rules or not.  It seems akin to the file
mtimes which we already take care to preserve on upload/download.
Additionally, the metadata recorded is used in Snakemake's report
generation and is generally useful for looking at workflow statistics.

Continue to not download all of .snakemake/ en masse because it can
potentially contain files that interfere with local usage and/or are
large and unnecessary.

Resolves: <#373>
Related-to: <nextstrain/docker-base#220>
@joverlee521
Copy link
Contributor Author

  1. Automatically generate the Snakemake report within the AWS Batch job so that users can download the rendered report
    [2] definitely seems like the nicer option and maybe should be applied across all runtimes for nextstrain build?

We could also do this as well, but it requires a little more consideration about how/where/when. Would you open it as a separate issue if you'd like to see it?

Hmm, maybe this doesn't need to be built into the Nextsrain CLI. It could just be a separate step in the pathogen-repo-build workflow so we have reports for our automated pathogens.

@tsibley
Copy link
Member

tsibley commented Jun 17, 2024

It could just be a separate step in the pathogen-repo-build workflow so we have reports for our automated pathogens.

Totally.

tsibley added a commit that referenced this issue Jun 17, 2024
Snakemake stores state information per input/output here and uses it to
determine if it needs to re-run rules or not.  It seems akin to the file
mtimes which we already take care to preserve on upload/download.
Additionally, the metadata recorded is used in Snakemake's report
generation and is generally useful for looking at workflow statistics.

Continue to not download all of .snakemake/ en masse because it can
potentially contain files that interfere with local usage and/or are
large and unnecessary.

Resolves: <#373>
Related-to: <nextstrain/docker-base#220>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
2 participants