Optimize WGS VCF File Annotation for Improved Performance and Speed #1769

Open
Ananya-swi opened this issue Oct 18, 2024 · 7 comments

@Ananya-swi
Ananya-swi commented Oct 18, 2024

Hi,

I am working on annotating large datasets, specifically Whole Genome Sequencing (WGS) VCF files, using the Variant Effect Predictor (VEP). However, the annotation process is taking significantly longer than expected. For example, annotating a 1.8GB VCF file took approximately 15 hours.

Environment Details:

  • Platform: Azure
  • VM Configuration: 32 vCPUs, 64GB RAM
  • VEP Setup: Running within a Docker container

I am seeking guidance on how to optimize VEP for faster annotation. Could you provide recommendations on:

  1. Configuring the VM or container for better performance.
  2. Any VEP parameters or caching strategies that could improve processing times.
  3. Alternative VM sizes or architectures that might be better suited for WGS annotations.

Thank you for your support and insights.

Best regards,
Ananya Saji

@jamie-m-a jamie-m-a self-assigned this Oct 18, 2024
@jamie-m-a
Contributor

Hi @Ananya-swi, thanks for reaching out to us.

It would be useful to know the full VEP command you are using, so that we can try to identify potential speed-ups. Even without that, one option I can suggest is our Nextflow VEP, which offers a degree of parallelisation to speed up the processing of large datasets.

I should say that we haven't tested it on cloud compute yet - so if you do decide to try it, let us know if you encounter any challenges.
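
The general idea behind this kind of parallelisation is to split the input into chunks that can be annotated independently and merged afterwards. The following is only a minimal sketch of a splitting step, assuming one chunk per chromosome; it is not how Nextflow VEP itself is implemented, and the function name `split_vcf_by_chrom` is hypothetical.

```python
from typing import Dict, Iterable, List


def split_vcf_by_chrom(lines: Iterable[str]) -> Dict[str, List[str]]:
    """Group VCF lines by chromosome, copying the header into every chunk.

    Assumption (not from the thread): header lines start with '#' and
    precede all records, as in a standard VCF file.
    """
    header: List[str] = []
    chunks: Dict[str, List[str]] = {}
    for line in lines:
        if line.startswith("#"):
            header.append(line)
            continue
        if not line.strip():
            continue  # ignore blank lines
        chrom = line.split("\t", 1)[0]
        if chrom not in chunks:
            # Each chunk starts with a copy of the full header so that it
            # is a valid VCF on its own.
            chunks[chrom] = list(header)
        chunks[chrom].append(line)
    return chunks
```

Each chunk could then be written to its own file and annotated in a separate VEP run, with the outputs concatenated at the end. Whether this helps depends on where the time is actually being spent.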

@Ananya-swi
Author

Ananya-swi commented Oct 18, 2024

Hi @jamie-m-a,

Thank you for the recommendation! I’m sharing the full VEP command I used below for your reference:

docker run -i -v /data:/opt/vep/.vep ensemblorg/ensembl-vep:release_106.1 vep --cache --refseq --CACHE_VERSION 106 --dir_plugins /opt/vep/.vep/Plugins --no_stats -i input.vcf -o output.txt --symbol --hgvs --hgvsg --variant_class --gene_phenotype --flag_pick_allele_gene --canonical --appris --ccds --numbers --total_length --mane --sift p --polyphen p --fasta /opt/vep/.vep/homo_sapiens_refseq/106_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz --species homo_sapiens --assembly GRCh37 --af --af_gnomad --no_escape --custom /opt/vep/.vep/GRCh37/clinvar_GRCh37.vcf.gz,ClinVar,vcf,exact,0,CLNSIG,CLNREVSTAT,CLNDN, --custom /opt/vep/.vep/GRCh37/gnomad.genomes.r2.1.1.sites.1-22X.GRCh37_MinInfo.vcf.gz,gnomad.genomes,vcf,exact,0,controls_AC,controls_AN,controls_nhomalt,AF,AF_afr,AF_sas,AF_amr,AF_eas,AF_nfe,AF_fin,AF_asj,AF_oth --custom /opt/vep/.vep/GRCh37/gnomad.exomes.r2.1.1.sites.1-22XY.GRCh37_MinInfo.vcf.gz,gnomad.exomes,vcf,exact,0,controls_AC,controls_AN,controls_nhomalt --custom /opt/vep/.vep/GRCh37/hg19_rmsk.bed.gz,repeats,bed --plugin CADD,/opt/vep/.vep/GRCh37/whole_genome_SNVs.tsv.gz,/opt/vep/.vep/GRCh37/InDels.tsv.gz --custom /opt/vep/.vep/GRCh37/hg19.phyloP100way.bw,phyloP100way,bigwig --custom /opt/vep/.vep/GRCh37/gerp_conservation_scores.homo_sapiens.GRCh37.bw,GERP_vep_all,bigwig --plugin SpliceAI,snv=/opt/vep/.vep/GRCh37/spliceai_scores.raw.snv.hg19.vcf.gz,indel=/opt/vep/.vep/GRCh37/spliceai_scores.raw.indel.hg19.vcf.gz --plugin dbscSNV,/opt/vep/.vep/GRCh37/dbscSNV1.1_GRCh37.txt.gz --plugin NMD --dir_plugins /opt/vep/.vep/Plugins --tab --offline --buffer_size 100000 --fork 32 --force_overwrite

I’ll also explore the Nextflow VEP option to see if it speeds up the annotation process. If any issues arise on the cloud platform, I’ll follow up accordingly.

Best regards,
Ananya

@jamie-m-a
Contributor

Hi @Ananya-swi

No problem! Now that I can see your command, I notice you're not using forks, which can have a significant speed impact. Some general instructions for speeding up Ensembl VEP can be found here.

Let us know how you get on.

@Ananya-swi
Author

Hi @jamie-m-a,

Thank you for the feedback! I wanted to clarify that I did use the --fork option, setting it to 32. However, I still observed long runtimes, with the process taking around 15 hours for a 1.8GB input VCF.

It would be great if you could share any additional insights or optimization tips, particularly regarding other parameters that could improve performance. I’ll also explore the general recommendations provided in the link you shared.

Looking forward to hearing from you!

Thanks again,
Ananya

@jamie-m-a
Contributor

Apologies @Ananya-swi - I missed the fork flag. The other easy thing to check is whether your input VCF is properly sorted. Your run time does seem long for a file that size. Can you advise how many variants are in your input?
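
One way to check the sorting is a short script along these lines. This is a minimal sketch and not part of VEP: the names `is_vcf_sorted` and `open_vcf` are hypothetical, and "sorted" is assumed here to mean that each chromosome's records are contiguous and that positions never decrease within a chromosome.

```python
import gzip
from typing import Iterable


def is_vcf_sorted(lines: Iterable[str]) -> bool:
    """Return True if records are grouped by chromosome and ordered by POS."""
    seen = set()          # chromosomes already encountered
    current_chrom = None
    last_pos = 0
    for line in lines:
        if line.startswith("#") or not line.strip():
            continue  # skip header and blank lines
        chrom, pos_text = line.split("\t", 2)[:2]
        pos = int(pos_text)
        if chrom != current_chrom:
            if chrom in seen:
                return False  # chromosome block is not contiguous
            seen.add(chrom)
            current_chrom, last_pos = chrom, pos
        elif pos < last_pos:
            return False      # position went backwards within a chromosome
        else:
            last_pos = pos
    return True


def open_vcf(path: str):
    """Open a plain or gzip/bgzip-compressed VCF in text mode."""
    return gzip.open(path, "rt") if path.endswith(".gz") else open(path)
```

For example, `with open_vcf("input.vcf") as fh: print(is_vcf_sorted(fh))` reports whether the file passes the check, using the input file name from the command earlier in the thread.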

@Ananya-swi
Author

Hi @jamie-m-a,

Thank you for your response! I appreciate the suggestion about checking the sorting of my input VCF. I have confirmed that the VCF file is sorted correctly.

Regarding your question, the input VCF contains 6,139,369 variants.

Thanks again for your help!

Best,
Ananya
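
For reference, the figures reported in this thread (6,139,369 variants, about 15 hours, `--fork 32`) imply the throughput worked out below. This is a back-of-the-envelope calculation that assumes the 15 hours is wall-clock time for the whole run and that work is spread evenly across forks, which may not hold in practice.

```python
def throughput(n_variants: int, hours: float, forks: int) -> tuple:
    """Return (variants per second overall, variants per second per fork)."""
    seconds = hours * 3600
    per_second = n_variants / seconds
    return per_second, per_second / forks


# Figures taken from the thread: 6,139,369 variants, ~15 hours, 32 forks.
overall, per_fork = throughput(6_139_369, 15, 32)
print(f"{overall:.1f} variants/s overall, {per_fork:.2f} variants/s per fork")
# prints: 113.7 variants/s overall, 3.55 variants/s per fork
```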

@jamie-m-a
Contributor

Thanks for the update, @Ananya-swi. The run time does seem slow. I'll try running some tests on a similarly sized input and get back to you.
