High memory usage due to 368 BRCA1 RefSeq transcripts - Transcript blocklist / allowlists or option for max transcripts #1732

davmlaw · 2024-08-01T05:55:31Z

The current GRCh38 RefSeq annotation (GCF_000001405.40-RS_2023_10) has 368 transcripts for BRCA1 - up from 6 the previous version.

I can't see anything obviously wrong with the new sequences but I am going to take it up with RefSeq and see what's going on.

In the meantime, this is causing a lot of trouble for VEP annotation runs as this is a commonly sequenced disease gene, and in my pipeline a single variant eg 17:43093010 G/A has a CSQ size of 475294 bytes

Perl was killed OutOfMemory after exceeding 16G of RAM as the only job. I had to drop down the buffer size from 5000 to 2000 top stop it being killed. It hasn't crashed yet but the run time and amount of annotations are extremely large

An option here of eg "--max_transcripts=50" would be great, but the trick would be not to have to calculate them all

If pick is based on something to do with the transcript (MANE, length etc) that's easy, but if using "most damaging", you'd ideally want to calculate just the damaging effects, then sort, then only proceed with further annotation (eg custom etc) on the ones you keep

nakib103 · 2024-08-07T15:29:00Z

Hi @davmlaw,

Sorry for the delay in response and thanks for pointing us to this interesting case.

The solution you mentioned is not achievable with the current vep options. But we can add a option such as --pick_count to say number of transcripts to pick while using --pick option. Currently it only return one transcript (per variant, allele, or gene). Together with these options you can use the --pick_order to set rank as the first criteria so you get the transcripts that faces the more severe consequence first.

This will need changing how we filter affected transcripts for --pick (it is optimised to get a single one). Unfortunately, I cannot guarantee this will be in our priority list. I will create a ticket to look into it and let you know once we have any updates in place.

Best regards,
Nakib

davmlaw · 2024-08-08T04:23:12Z

I aksed RefSeq how on 17th of March 2024 one of the most intensely studied genes of all time went from 6 to 368 isoforms overnight, without them being reported in other species, and only citing references from the 1990s (latest being paper from 2003) and they said:

It is not a mistake. There are 368 transcript variants for BRCA1 and 147 isoforms (some transcript variants encode the same isoform) represented in RefSeq.
You can see these in table format from the Datasets page. https://www.ncbi.nlm.nih.gov/gene/672/ortholog/?scope=7742

davmlaw · 2024-08-12T02:28:44Z

Another option would be transcript allow/blocklists?

Presumably that would allow quick rejects before all the calculations are done?

This could also be generally useful as people often have their choice of transcript per gene, that isn't always the MANE one

nakib103 self-assigned this Aug 2, 2024

davmlaw changed the title ~~Option for max transcripts~~ Transcript blocklist / allowlists or option for max transcripts Aug 12, 2024

nakib103 added the enhancement label Aug 20, 2024

davmlaw changed the title ~~Transcript blocklist / allowlists or option for max transcripts~~ RefSeq memory usage due to 368 BRCA1 transcripts - Transcript blocklist / allowlists or option for max transcripts Sep 4, 2024

davmlaw changed the title ~~RefSeq memory usage due to 368 BRCA1 transcripts - Transcript blocklist / allowlists or option for max transcripts~~ High memory usage due to 368 BRCA1 RefSeq transcripts - Transcript blocklist / allowlists or option for max transcripts Sep 4, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

High memory usage due to 368 BRCA1 RefSeq transcripts - Transcript blocklist / allowlists or option for max transcripts #1732

High memory usage due to 368 BRCA1 RefSeq transcripts - Transcript blocklist / allowlists or option for max transcripts #1732

davmlaw commented Aug 1, 2024 •

edited

Loading

nakib103 commented Aug 7, 2024 •

edited

Loading

davmlaw commented Aug 8, 2024 •

edited

Loading

davmlaw commented Aug 12, 2024 •

edited

Loading

High memory usage due to 368 BRCA1 RefSeq transcripts - Transcript blocklist / allowlists or option for max transcripts #1732

High memory usage due to 368 BRCA1 RefSeq transcripts - Transcript blocklist / allowlists or option for max transcripts #1732

Comments

davmlaw commented Aug 1, 2024 • edited Loading

nakib103 commented Aug 7, 2024 • edited Loading

davmlaw commented Aug 8, 2024 • edited Loading

davmlaw commented Aug 12, 2024 • edited Loading

davmlaw commented Aug 1, 2024 •

edited

Loading

nakib103 commented Aug 7, 2024 •

edited

Loading

davmlaw commented Aug 8, 2024 •

edited

Loading

davmlaw commented Aug 12, 2024 •

edited

Loading