Optimize WGS VCF File Annotation for Improved Performance and Speed #1769

Ananya-swi · 2024-10-18T11:26:30Z

Hi,

I am working on annotating large datasets, specifically Whole Genome Sequencing (WGS) VCF files, using the Variant Effect Predictor (VEP). However, the annotation process is taking significantly longer than expected. For example, annotating a 1.8GB VCF file took approximately 15 hours.

Environment Details:

Platform: Azure
VM Configuration: 32 vCPUs, 64GB RAM
VEP Setup: Running within a Docker container

I am seeking guidance on how to optimize VEP for faster annotation. Could you provide recommendations on:

Configuring the VM or container for better performance.
Any VEP parameters or caching strategies that could improve processing times.
Alternative VM sizes or architectures that might be better suited for WGS annotations.

Thank you for your support and insights.

Best regards,
Ananya Saji

jamie-m-a · 2024-10-18T12:37:43Z

Hi @Ananya-swi, thanks for reaching out to us.

It would be useful to know the full VEP command you are using, so we can try to identify potential speed-ups. However, even without that, I can suggest one option: our Nextflow VEP, which offers a degree of parallelisation to speed up the processing of large datasets.

I should say that we haven't tested it on cloud compute yet - so if you do decide to try it, let us know if you encounter any challenges.

Ananya-swi · 2024-10-18T13:15:32Z

Hi @jamie-m-a,

Thank you for the recommendation! I’m sharing the full VEP command I used below for your reference:

docker run -i -v /data:/opt/vep/.vep ensemblorg/ensembl-vep:release_106.1 vep --cache --refseq --CACHE_VERSION 106 --dir_plugins /opt/vep/.vep/Plugins --no_stats -i input.vcf -o output.txt --symbol --hgvs --hgvsg --variant_class --gene_phenotype --flag_pick_allele_gene --canonical --appris --ccds --numbers --total_length --mane --sift p --polyphen p --fasta /opt/vep/.vep/homo_sapiens_refseq/106_GRCh37/Homo_sapiens.GRCh37.dna.toplevel.fa.gz --species homo_sapiens --assembly GRCh37 --af --af_gnomad --no_escape --plugin NMD --dir_plugins /opt/vep/.vep/Plugins --tab --offline --buffer_size 100000 --fork 32 --force_overwrite

I’ll also explore the Nextflow VEP option to see if it speeds up the annotation process. If any issues arise on the cloud platform, I’ll follow up accordingly.

Best regards,
Ananya

jamie-m-a · 2024-10-18T14:12:45Z

Hi @Ananya-swi

No problem! Now that I can see your command, I notice you're not using forks, which can have a significant speed impact. Some general instructions for speeding up Ensembl VEP can be found here.

Let us know how you get on.

Ananya-swi · 2024-10-20T12:02:19Z

Hi @jamie-m-a,

Thank you for the feedback! I wanted to clarify that I did use the --fork option, setting it to 32. However, I still observed long runtimes, with the process taking around 15 hours for a 1.8GB input VCF.

It would be great if you could share any additional insights or optimization tips, particularly regarding other parameters that could improve performance. I’ll also explore the general recommendations provided in the link you shared.

Looking forward to hearing from you!

Thanks again,
Ananya

jamie-m-a · 2024-10-21T07:41:20Z

Apologies @Ananya-swi - I missed the fork flag. The other easy thing to check is whether your input VCF is properly sorted. Your run time does seem long for a file that size. Can you advise how many variants are in your input?
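
For anyone wanting to run the sort check mentioned above, here is a minimal sketch in Python. It assumes a plain-text VCF with tab-separated CHROM and POS as the first two columns, and it treats "sorted" as meaning that each contig appears as one contiguous block with non-decreasing positions. The function name `first_unsorted_record` is hypothetical and is not part of VEP.

```python
from typing import Iterable, Optional, Tuple


def first_unsorted_record(lines: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Return (line_number, reason) for the first out-of-order record, or None."""
    seen_contigs = set()   # contigs whose block has already started
    current_contig = None  # contig of the previous data line
    last_pos = 0           # position of the previous data line

    for line_number, line in enumerate(lines, start=1):
        # Skip blank lines and header lines ("##meta" and "#CHROM ...").
        if not line.strip() or line.startswith("#"):
            continue

        chrom, pos_text = line.split("\t", 2)[:2]
        pos = int(pos_text)

        if chrom != current_contig:
            # A contig that comes back after another one means the file is not grouped.
            if chrom in seen_contigs:
                return line_number, f"contig {chrom} reappears after another contig"
            seen_contigs.add(chrom)
            current_contig, last_pos = chrom, pos
        elif pos < last_pos:
            return line_number, f"position {pos} follows {last_pos} on {chrom}"
        else:
            last_pos = pos

    return None
```

To use it on a real file, pass an open file handle, for example `first_unsorted_record(open("input.vcf"))`; for a bgzipped file, `gzip.open("input.vcf.gz", "rt")` should work the same way. A result of `None` means no ordering problem was found under the assumptions above.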

Ananya-swi · 2024-10-21T10:53:15Z

Hi @jamie-m-a,

Thank you for your response! I appreciate the suggestion about checking the sorting of my input VCF. I have confirmed that the VCF file is sorted correctly.

Regarding your question, the input VCF contains 6,139,369 variants.

Thanks again for your help!

Best,
Ananya

jamie-m-a · 2024-10-21T15:22:29Z

Thanks for the update @Ananya-swi. The run time does seem slow, so I'll try running some tests on a similarly sized input and get back to you.
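
For scale, the figures reported in this thread (6,139,369 variants, roughly 15 hours, `--fork 32`) can be turned into a throughput estimate. The sketch below is only back-of-the-envelope arithmetic: the 15-hour runtime is approximate, the per-fork figure assumes all 32 forks were evenly busy, and the function name `throughput` is hypothetical.

```python
def throughput(variants: int, runtime_hours: float, forks: int) -> tuple[float, float]:
    """Return (variants per second overall, variants per second per fork)."""
    seconds = runtime_hours * 3600
    overall = variants / seconds
    # Idealised: assumes the work was spread evenly across all forks.
    per_fork = overall / forks
    return overall, per_fork


overall, per_fork = throughput(variants=6_139_369, runtime_hours=15, forks=32)
print(f"{overall:.1f} variants/s overall, {per_fork:.2f} variants/s per fork")
```

With these inputs the estimate comes to about 113.7 variants per second overall, or about 3.55 per second per fork.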

jamie-m-a self-assigned this Oct 18, 2024
