Merge VAT TSV files into single bgzipped file [VS-304] #7848

Merged: 16 commits, May 13, 2022
1 change: 1 addition & 0 deletions .dockstore.yml
@@ -143,6 +143,7 @@ workflows:
branches:
- master
- ah_var_store
- rsa_merge_vat_tsvs
- name: GvsCreateVATAnnotations
subclass: WDL
primaryDescriptorPath: /scripts/variantstore/wdl/GvsCreateVATAnnotations.wdl
90 changes: 89 additions & 1 deletion scripts/variantstore/wdl/GvsCreateVAT.wdl
@@ -15,6 +15,7 @@ workflow GvsCreateVAT {
File? genes_schema_json_file = "gs://broad-dsp-spec-ops/scratch/rcremer/Nirvana/schemas/genes_schema.json"
String output_path

Int? merge_vcfs_disk_size_override
String? service_account_json_path
File ancestry_file
}
@@ -87,6 +88,20 @@ workflow GvsCreateVAT {
validate_jsons_done = BigQuerySmokeTest.done
}
}

call MergeVatTSVs {
input:
export_done = BigQueryExportVat.done,
contig_array = contig_array,
output_path = output_path,
project_id = project_id,
merge_vcfs_disk_size_override = merge_vcfs_disk_size_override,
service_account_json_path = service_account_json_path
}

output {
File final_tsv_file = MergeVatTSVs.tsv_file
}
}

################################################################################
@@ -478,7 +493,7 @@ task BigQueryExportVat {
format="CSV",
compression="GZIP",
overwrite=true,
- header=true,
+ header=false,
field_delimiter=" ") AS
SELECT
vid,
@@ -604,3 +619,76 @@ task BigQueryExportVat {
}
}

task MergeVatTSVs {
input {
Array[Boolean] export_done
Array[String] contig_array
String project_id
String output_path

Int? merge_vcfs_disk_size_override
String? service_account_json_path
}

# going large with the default to make gsutil -m cp really zippy
Int disk_size = select_first([merge_vcfs_disk_size_override, 250])
String has_service_account_file = if (defined(service_account_json_path)) then 'true' else 'false'
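# defined() is evaluated here in WDL, so the bash script below receives a plain 'true'/'false' string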

command <<<
apt-get update
apt-get install -y tabix

# custom function to prepend the current datetime to an echo statement "borrowed" from ExtractAnAcAfFromVCF
echo_date () { echo "`date "+%Y/%m/%d %H:%M:%S"` $1"; }

if [ ~{has_service_account_file} = 'true' ]; then
gsutil cp ~{service_account_json_path} local.service_account.json
gcloud auth activate-service-account --key-file=local.service_account.json
gcloud config set project ~{project_id}
fi

mkdir TSVs
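# ~{sep=' ' contig_array} is WDL interpolation: it expands the input array into
# a space-separated list, so contigs becomes a bash array of contig names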
contigs=( ~{sep=' ' contig_array} )
files="header.gz"

echo_date "looping over contigs: $contigs"
for i in "${contigs[@]}"
do
echo_date "copying files from ~{output_path}export/$i/*.tsv.gz"
gsutil -m cp ~{output_path}export/$i/*.tsv.gz TSVs/
echo_date "concatenating local tsv.gz files"

# bash expands * in sorted (collated) order, so the concatenation is deterministic; see https://serverfault.com/questions/122737/in-bash-are-wildcard-expansions-guaranteed-to-be-in-order
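# concatenating .gz files is itself valid: the result is a multi-member gzip
# stream that gunzip reads back as one continuous file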
cat TSVs/*.tsv.gz > vat_$i.tsv.gz

echo_date "removing now concatenated files"
rm TSVs/*.tsv.gz
files="$files vat_$i.tsv.gz"
done

echo_date "making header.gz"
echo "vid transcript contig position ref_allele alt_allele gvs_all_ac gvs_all_an gvs_all_af gvs_all_sc gvs_max_af gvs_max_ac gvs_max_an gvs_max_sc gvs_max_subpop gvs_afr_ac gvs_afr_an gvs_afr_af gvs_afr_sc gvs_amr_ac gvs_amr_an gvs_amr_af gvs_amr_sc gvs_eas_ac gvs_eas_an gvs_eas_af gvs_eas_sc gvs_eur_ac gvs_eur_an gvs_eur_af gvs_eur_sc gvs_mid_ac gvs_mid_an gvs_mid_af gvs_mid_sc gvs_oth_ac gvs_oth_an gvs_oth_af gvs_oth_sc gvs_sas_ac gvs_sas_an gvs_sas_af gvs_sas_sc gene_symbol transcript_source aa_change consequence dna_change_in_transcript variant_type exon_number intron_number genomic_location dbsnp_rsid gene_id gene_omim_id is_canonical_transcript gnomad_all_af gnomad_all_ac gnomad_all_an gnomad_failed_filter gnomad_max_af gnomad_max_ac gnomad_max_an gnomad_max_subpop gnomad_afr_ac gnomad_afr_an gnomad_afr_af gnomad_amr_ac gnomad_amr_an gnomad_amr_af gnomad_asj_ac gnomad_asj_an gnomad_asj_af gnomad_eas_ac gnomad_eas_an gnomad_eas_af gnomad_fin_ac gnomad_fin_an gnomad_fin_af gnomad_nfr_ac gnomad_nfr_an gnomad_nfr_af gnomad_sas_ac gnomad_sas_an gnomad_sas_af gnomad_oth_ac gnomad_oth_an gnomad_oth_af revel splice_ai_acceptor_gain_score splice_ai_acceptor_gain_distance splice_ai_acceptor_loss_score splice_ai_acceptor_loss_distance splice_ai_donor_gain_score splice_ai_donor_gain_distance splice_ai_donor_loss_score splice_ai_donor_loss_distance omim_phenotypes_id omim_phenotypes_name clinvar_classification clinvar_last_updated clinvar_phenotype" | gzip > header.gz
Contributor: Not for this PR, but for the future: I hate that this is hardcoded, and I don't see a way around it, since it's also hardcoded in the export query (also not good). Maybe run the query twice, once with a LIMIT of 0, and just grab the header? I don't know.

Contributor: Can we generate the TSVs with a header line in the EXPORT command, and then get this header from the first TSV instead (and grep it out of the others when you concatenate)?

Author: I could, but it was complexity that I wasn't sure would add much. If you have the bash code on hand to do it, I'd be happy to add it 😉.
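A minimal sketch of that approach, assuming the EXPORT DATA statement wrote header=true (so every shard begins with the same header line) and that all shards sit on local disk at once:

first_shard=$(ls TSVs/*.tsv.gz | head -n 1)
gunzip -c "$first_shard" | head -n 1 | gzip > header.gz
for f in TSVs/*.tsv.gz; do
  # drop the header line from each shard before concatenating
  gunzip -c "$f" | tail -n +2
done | gzip > vat_body.tsv.gz
cat header.gz vat_body.tsv.gz > vat_complete.tsv.gz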


echo_date "concatenating $files"
cat $files > vat_complete.tsv.gz
echo_date "bgzipping concatenated file"
gunzip -c vat_complete.tsv.gz | bgzip > vat_complete.bgz.tsv.gz
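# bgzip re-compresses into BGZF (blocked gzip): still a valid gzip stream, but
# made of fixed-size blocks so tools like tabix can index and seek into it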
Collaborator: Should the output be named 'vat_complete.tsv.bgz'?

Author: I thought so, too, but everything I saw that showed the handling of bgzipped files had them with a .gz suffix.
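# note: BGZF is a valid gzip stream, hence the .gz suffix; if the rows were
# coordinate-sorted, the file could in principle be indexed with, e.g.,
# tabix -S 1 -s 3 -b 4 -e 4 vat_complete.bgz.tsv.gz
# (illustrative only, not run here; column numbers taken from the header above)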

echo_date "copying bgzipped file to ~{output_path}"
gsutil -m cp vat_complete.bgz.tsv.gz ~{output_path}
>>>
# ------------------------------------------------
# Runtime settings:
runtime {
docker: "us.gcr.io/broad-gatk/gatk:4.2.6.1"
memory: "2 GB"
preemptible: 3
cpu: "1"
disks: "local-disk ~{disk_size} HDD"
}
# ------------------------------------------------
# Outputs:
output {
File tsv_file = "vat_complete.bgz.tsv.gz"
}
}