Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Use Snakemake HTTP remote to download starting points #15

Merged
merged 5 commits into from
Mar 22, 2022
Merged

Conversation

trvrb
Copy link
Member

@trvrb trvrb commented Mar 22, 2022

Here, I chose to use snakemake.remote.HTTP as it reads from the Cloudfront-backed https://data.nextstrain.org rather than the S3 bucket nextstrain-data. I did this for two reasons:

  1. Cloud-front backed data.nextstrain.org should be more efficient to download from and sets a good pattern
  2. I liked surfacing the simple URLs data.nextstrain.org/files/zika/sequences.fasta.xz and data.nextstrain.org/files/zika/metadata.tsv.gz directly in the workflow.

I also used metadata.tsv.gz rather than metadata.tsv.xz to mirror what we do for ncov.

@trvrb trvrb requested a review from j23414 March 22, 2022 01:16
@j23414
Copy link
Contributor

j23414 commented Mar 22, 2022

Ah agree with your changes to use HTTP.remote instead of s3!

I like having the same compression method for metadata and sequences, but am willing to give this a pass so that metadata.tsv.gz is consistent with ncov.

Optional: Do we want to modify the .travis test to run on a smaller dataset (split out the example_data/zika.fasta into its own metadata.tsv.gz and sequences.fasta.xz)? (I'm happy to add this but no pressure.)

Right now the test runs on the full dataset, which is perfectly okay if we're adding a final deploy step at some point.

@trvrb
Copy link
Member Author

trvrb commented Mar 22, 2022

I like having the same compression method for metadata and sequences, but am willing to give this a pass so that metadata.tsv.gz is consistent with ncov.

We can revisit with larger group, but I thought there was a reason that .gz was chosen for metadata in this case, but I don't remember what it is. In splitting between .xz and .gz I was assuming there was some rationale. Maybe let's start with .gz for metadata and follow up.

Optional: Do we want to modify the .travis test to run on a smaller dataset (split out the example_data/zika.fasta into its own metadata.tsv.gz and sequences.fasta.xz)? (I'm happy to add this but no pressure.)
Right now the test runs on the full dataset, which is perfectly okay if we're adding a final deploy step at some point.

Thanks for catching this. I think Travis should run quickly on small example data. This is also how ncov works. If you could update PR that would be great.

@j23414
Copy link
Contributor

j23414 commented Mar 22, 2022

Finished connecting the smaller dataset, and checks passed (down to 4 minutes).

Looks good to merge on my end! Although feel free to make changes and/or suggestions.

trvrb added 3 commits March 22, 2022 15:27
This swaps to downloading via "curl" rather than the Snakemake remote input through HTTP provider.

This is more straight forward and avoids issue with identification of gzip encoding by HTTP provider.
Switching to uncompressed example data to make it easier for someone to understand file format via GitHub inspection.
@trvrb
Copy link
Member Author

trvrb commented Mar 22, 2022

This is now working and documented. I'm going to merge this PR.

@trvrb trvrb merged commit d98d4c7 into master Mar 22, 2022
@trvrb trvrb deleted the http-download branch March 22, 2022 23:21
Copy link
Member

@tsibley tsibley left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

A few post-merge notes.

Comment on lines +53 to +55
[https://data.nextstrain.org/files/zika/sequences.fasta.xz](data.nextstrain.org/files/zika/sequences.fasta.xz)
and metadata from
[https://data.nextstrain.org/files/zika/metadata.tsv.gz](data.nextstrain.org/files/zika/metadata.tsv.gz).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These links are broken because the URL-part doesn't include the scheme (https://). When the link text should be the same as the URL, I'd suggest relying on auto-linking of base URLs.

Suggested change
[https://data.nextstrain.org/files/zika/sequences.fasta.xz](data.nextstrain.org/files/zika/sequences.fasta.xz)
and metadata from
[https://data.nextstrain.org/files/zika/metadata.tsv.gz](data.nextstrain.org/files/zika/metadata.tsv.gz).
https://data.nextstrain.org/files/zika/sequences.fasta.xz
and metadata from
https://data.nextstrain.org/files/zika/metadata.tsv.gz.

Comment on lines +57 to +59
from NCBI GenBank via ViPR and performing additional bespoke curation. Our
curation is described
[here](https://github.com/nextstrain/fauna/blob/master/builds/ZIKA.md).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Links like "here" and "click here" are an anti-pattern. Among their issues is that they impede the accessibility of the links. I'd suggest linking the previous reference to curation instead:

Suggested change
from NCBI GenBank via ViPR and performing additional bespoke curation. Our
curation is described
[here](https://github.com/nextstrain/fauna/blob/master/builds/ZIKA.md).
from NCBI GenBank via ViPR and performing
[additional bespoke curation](https://github.com/nextstrain/fauna/blob/master/builds/ZIKA.md).

Snakefile Show resolved Hide resolved
Comment on lines +44 to +45
gzip --decompress --keep {input.metadata}
xz --decompress --keep {input.sequences}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Was there a reason you chose to a) decompress separately and b) keep the compressed copies around? My instinct would be decompress on the fly during download and thus make this whole decompress rule unnecessary and avoid the double disk space usage (which while insignificant for Zika, sets what I think is a bad precedent).

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I vote to drop --keep in order to remove unnessary intermediate files. Although I also vote for xz over gzip for a smaller memory footprint. ;) I have no strong opinions on "decompress on the fly during download", will follow the group decision. @trvrb feel free to comment on your flag decisions

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
No open projects
Development

Successfully merging this pull request may close these issues.

3 participants