Use Snakemake HTTP remote to download starting points #15
Ah, agree with your changes to use [...]. I like having the same compression method for metadata and sequences, but am willing to give this a pass so that [...]. Optional: do we want to modify the [...]? Right now the test runs on the full dataset, which is perfectly okay if we're adding a final deploy step at some point.
We can revisit with the larger group, but I thought there was a reason that [...]
Thanks for catching this. I think Travis should run quickly on small example data. This is also how [...]
Finished connecting the smaller dataset, and checks passed (down to 4 minutes). Looks good to merge on my end! Although feel free to make changes and/or suggestions.
This swaps to downloading via "curl" rather than the Snakemake remote input through the HTTP provider. This is more straightforward and avoids an issue with how the HTTP provider identifies gzip encoding.
Switching to uncompressed example data to make it easier for someone to understand file format via GitHub inspection.
This is now working and documented. I'm going to merge this PR.
A few post-merge notes.
[https://data.nextstrain.org/files/zika/sequences.fasta.xz](data.nextstrain.org/files/zika/sequences.fasta.xz)
and metadata from
[https://data.nextstrain.org/files/zika/metadata.tsv.gz](data.nextstrain.org/files/zika/metadata.tsv.gz).
These links are broken because the URL part doesn't include the scheme (`https://`). When the link text should be the same as the URL, I'd suggest relying on auto-linking of bare URLs.
Suggested change:
- [https://data.nextstrain.org/files/zika/sequences.fasta.xz](data.nextstrain.org/files/zika/sequences.fasta.xz)
- and metadata from
- [https://data.nextstrain.org/files/zika/metadata.tsv.gz](data.nextstrain.org/files/zika/metadata.tsv.gz).
+ https://data.nextstrain.org/files/zika/sequences.fasta.xz
+ and metadata from
+ https://data.nextstrain.org/files/zika/metadata.tsv.gz.
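To illustrate the failure mode described in this comment, here is a rough Python sketch that flags Markdown links whose text is an absolute URL but whose target lacks a scheme. The function name `find_schemeless_links` and the regular expression are assumptions made for illustration and are not part of this repository; the pattern only handles simple inline links.

```python
import re
from urllib.parse import urlparse

# Matches simple inline Markdown links of the form [text](target).
# Assumption: no nested brackets or link titles, which is enough for the
# README links discussed in this review.
INLINE_LINK = re.compile(r"\[([^\]]+)\]\(([^)\s]+)\)")


def find_schemeless_links(markdown: str) -> list[tuple[str, str]]:
    """Return (text, target) pairs where text is an absolute URL but target has no scheme."""
    flagged = []
    for text, target in INLINE_LINK.findall(markdown):
        text_url = urlparse(text)
        # A target without a scheme is resolved relative to the current page,
        # so "data.nextstrain.org/..." is treated as a path, not a host.
        if text_url.scheme and text_url.netloc and not urlparse(target).scheme:
            flagged.append((text, target))
    return flagged


if __name__ == "__main__":
    readme = (
        "[https://data.nextstrain.org/files/zika/metadata.tsv.gz]"
        "(data.nextstrain.org/files/zika/metadata.tsv.gz)"
    )
    print(find_schemeless_links(readme))
```

A check like this would only catch the specific pattern seen here; ordinary relative links such as `[docs](docs/README.md)` are left alone because their link text is not a URL.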
from NCBI GenBank via ViPR and performing additional bespoke curation. Our
curation is described
[here](https://github.com/nextstrain/fauna/blob/master/builds/ZIKA.md).
Links like "here" and "click here" are an anti-pattern. Among their issues is that they impede the accessibility of the links. I'd suggest linking the previous reference to curation instead:
Suggested change:
- from NCBI GenBank via ViPR and performing additional bespoke curation. Our
- curation is described
- [here](https://github.com/nextstrain/fauna/blob/master/builds/ZIKA.md).
+ from NCBI GenBank via ViPR and performing
+ [additional bespoke curation](https://github.com/nextstrain/fauna/blob/master/builds/ZIKA.md).
gzip --decompress --keep {input.metadata}
xz --decompress --keep {input.sequences}
Was there a reason you chose to a) decompress separately and b) keep the compressed copies around? My instinct would be to decompress on the fly during download and thus make this whole `decompress` rule unnecessary and avoid the double disk space usage (which, while insignificant for Zika, sets what I think is a bad precedent).
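A minimal sketch of "decompress on the fly during download", written in Python with only the standard library rather than the `curl`-based rule this PR uses. The function name `download_decompressed`, the suffix-to-decompressor mapping, and the output path in the usage comment are assumptions for illustration; none of them are taken from the workflow itself.

```python
import gzip
import lzma
import shutil
import urllib.parse
import urllib.request
from pathlib import Path

# Map a URL suffix to a function that wraps a binary stream in a decompressor.
OPENERS = {
    ".gz": gzip.open,
    ".xz": lzma.open,
}


def download_decompressed(url: str, dest: str) -> Path:
    """Stream `url` into `dest`, decompressing based on the URL's file suffix."""
    suffix = Path(urllib.parse.urlparse(url).path).suffix
    opener = OPENERS.get(suffix)
    if opener is None:
        raise ValueError(f"unsupported compression suffix: {suffix!r}")

    dest_path = Path(dest)
    with urllib.request.urlopen(url) as response:
        # The decompressor reads from the response as bytes arrive, so the
        # compressed copy is never written to disk.
        with opener(response, "rb") as decompressed, open(dest_path, "wb") as out:
            shutil.copyfileobj(decompressed, out)
    return dest_path


# Hypothetical usage (the output path is an assumption):
# download_decompressed(
#     "https://data.nextstrain.org/files/zika/metadata.tsv.gz",
#     "data/metadata.tsv",
# )
```

Because the compressed bytes pass straight through the decompressor, only the uncompressed file lands on disk, which addresses the double disk usage raised in this comment and would make a separate decompress step unnecessary.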
I vote to drop `--keep` in order to remove unnecessary intermediate files. Although I also vote for `xz` over `gzip` for a smaller memory footprint. ;) I have no strong opinions on "decompress on the fly during download", will follow the group decision. @trvrb feel free to comment on your flag decisions.
Here, I chose to use `snakemake.remote.HTTP` as it reads from the CloudFront-backed https://data.nextstrain.org rather than the S3 bucket `nextstrain-data`. I did this for two reasons:
I also used `metadata.tsv.gz` rather than `metadata.tsv.xz` to mirror what we do for `ncov`.