Merge VAT TSV files into single bgzipped file [VS-304] #7848
Codecov Report
@@ Coverage Diff @@
## ah_var_store #7848 +/- ##
================================================
Coverage ? 86.304%
Complexity ? 35189
================================================
Files ? 2170
Lines ? 164837
Branches ? 17775
================================================
Hits ? 142261
Misses ? 16252
Partials ? 6324
done

echo_date "making header.gz"
echo "vid transcript contig position ref_allele alt_allele gvs_all_ac gvs_all_an gvs_all_af gvs_all_sc gvs_max_af gvs_max_ac gvs_max_an gvs_max_sc gvs_max_subpop gvs_afr_ac gvs_afr_an gvs_afr_af gvs_afr_sc gvs_amr_ac gvs_amr_an gvs_amr_af gvs_amr_sc gvs_eas_ac gvs_eas_an gvs_eas_af gvs_eas_sc gvs_eur_ac gvs_eur_an gvs_eur_af gvs_eur_sc gvs_mid_ac gvs_mid_an gvs_mid_af gvs_mid_sc gvs_oth_ac gvs_oth_an gvs_oth_af gvs_oth_sc gvs_sas_ac gvs_sas_an gvs_sas_af gvs_sas_sc gene_symbol transcript_source aa_change consequence dna_change_in_transcript variant_type exon_number intron_number genomic_location dbsnp_rsid gene_id gene_omim_id is_canonical_transcript gnomad_all_af gnomad_all_ac gnomad_all_an gnomad_failed_filter gnomad_max_af gnomad_max_ac gnomad_max_an gnomad_max_subpop gnomad_afr_ac gnomad_afr_an gnomad_afr_af gnomad_amr_ac gnomad_amr_an gnomad_amr_af gnomad_asj_ac gnomad_asj_an gnomad_asj_af gnomad_eas_ac gnomad_eas_an gnomad_eas_af gnomad_fin_ac gnomad_fin_an gnomad_fin_af gnomad_nfr_ac gnomad_nfr_an gnomad_nfr_af gnomad_sas_ac gnomad_sas_an gnomad_sas_af gnomad_oth_ac gnomad_oth_an gnomad_oth_af revel splice_ai_acceptor_gain_score splice_ai_acceptor_gain_distance splice_ai_acceptor_loss_score splice_ai_acceptor_loss_distance splice_ai_donor_gain_score splice_ai_donor_gain_distance splice_ai_donor_loss_score splice_ai_donor_loss_distance omim_phenotypes_id omim_phenotypes_name clinvar_classification clinvar_last_updated clinvar_phenotype" | gzip > header.gz
Not for this PR, for the future: I hate that this is hardcoded, but I don't see a way around it since it's also hardcoded for the export query (also not good). Maybe run the query twice, once with a limit of 0, and just grab the header?! I dunno.
Can we generate the TSVs with a header line in the EXPORT command, and then you can get this header from the first TSV instead (and grep it out of the others when you concatenate)?
I could, but it was added complexity that I wasn't sure would buy us much. If you have the bash code on hand to do it, I'd be happy to add it 😉.
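For reference, a minimal sketch of what the suggested approach could look like. It assumes every exported shard is a gzipped TSV whose first line is the same header; the function name merge_tsv_shards and the shard file names are hypothetical and not part of this PR. Rather than grepping the header out, it drops line 1 from every shard after the first, which avoids accidentally removing a data row that happens to match.

```shell
# Hypothetical helper (not part of this PR): merge gzipped TSV shards that each
# start with the same header line into one uncompressed stream that carries the
# header exactly once.
# Usage: merge_tsv_shards shard1.tsv.gz shard2.tsv.gz ... > merged.tsv
merge_tsv_shards() {
  local keep_header=1 shard
  for shard in "$@"; do
    # Line 1 is printed only for the first shard; lines 2 onward are always
    # printed. awk reads each shard to the end, so gzip never sees a closed pipe.
    gzip -dc "$shard" | awk -v keep="$keep_header" 'NR > 1 || keep == 1'
    keep_header=0
  done
}
```

The stream could then be piped into bgzip the same way this PR already does, e.g. merge_tsv_shards shard_*.tsv.gz | bgzip > vat_complete.bgz.tsv.gz, which would make the hardcoded header.gz step unnecessary.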
echo_date "concatenating $files"
cat $(echo $files) > vat_complete.tsv.gz
echo_date "bgzipping concatenated file"
cat vat_complete.tsv.gz | gunzip | bgzip > vat_complete.bgz.tsv.gz
Should the output be named 'vat_complete.tsv.bgz'?
I thought so, too, but everything I saw that showed the handling of bgzipped files had them with a .gz suffix.
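Since the suffix alone does not say whether a .gz file is plain gzip or BGZF, here is a rough sketch of a byte-level check. It relies on the BGZF block header layout (gzip magic 1f 8b, deflate 08, the FEXTRA flag 04, and a "BC" extra subfield at byte offsets 12 and 13). The function name is_bgzf is hypothetical, and the check assumes "BC" is the first extra subfield, which is how bgzip writes it but is not strictly required by the format.

```shell
# Hypothetical helper (not part of this PR): succeed if the file starts with a
# BGZF block header, fail for plain gzip or anything else.
# Assumption: the "BC" subfield is the first entry in the gzip extra field.
is_bgzf() {
  # First 14 bytes of the file as one lowercase hex string (28 characters).
  head_hex=$(od -An -tx1 -N14 "$1" | tr -d ' \n')
  case "$head_hex" in
    # 1f8b0804 = gzip magic + deflate + FEXTRA; 16 hex chars cover bytes 4-11;
    # 4243 = "BC" at bytes 12-13.
    1f8b0804????????????????4243) return 0 ;;
    *) return 1 ;;
  esac
}
```

For example, is_bgzf vat_complete.bgz.tsv.gz && echo "BGZF" would be expected to print BGZF for the file produced here, while a file written by plain gzip would not match. Because each BGZF block is itself a valid gzip member, gunzip can still read such a file, which fits with the .gz suffix being used for bgzipped files.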
echo_date "concatenating $files"
cat $(echo $files) > vat_complete.tsv.gz
echo_date "bgzipping concatenated file"
cat vat_complete.tsv.gz | gunzip | bgzip > vat_complete.bgz.tsv.gz
I would have thought this would end up vat_complete.tsv.bgz?
I thought so, too, but everything I saw that showed the handling of bgzipped files had them with a .gz suffix.
Closes https://broadworkbench.atlassian.net/browse/VS-304