Ensembl transcript ENST00000617537.5 sequence is genomic not cdna #75

davmlaw · 2024-10-01T07:20:01Z

https://asia.ensembl.org/Homo_sapiens/Transcript/Summary?db=core;g=ENSG00000136250;r=7:36512941-36724494;t=ENST00000617537

Web page Reports that the sequence is 2385 bases long

Ensembl API is in agreement:

In [25]: import requests
In [26]: url = "https://rest.ensembl.org/sequence/id/ENST00000617537?type=cdna"
In [27]: r = requests.get(url, headers={"Content-Type": "application/json"}, timeout=60)
In [28]: len(r.json()["seq"])
Out[28]: 2385

SeqRepo returns much longer sequence:

In [20]: from hgvs.dataproviders.seqfetcher import SeqFetcher

In [21]: sf = SeqFetcher()

In [22]: seq = sf.fetch_seq("ENST00000617537.5")

In [23]: len(seq)
Out[23]: 211554

The text was updated successfully, but these errors were encountered:

reece · 2024-10-02T22:23:02Z

Well, that's very surprising. I'll need to investigate.

FWIW, UCSC appears to have fallen into the same issue:

(from https://genome.cse.ucsc.edu/cgi-bin/hgGene?hgg_gene=UC032ZJW.2, taken today)

reece · 2024-10-02T22:41:29Z

Okay, found it. Not a bug. Ensembl itself returns this sequence:

❯ curl -s -H "accept: application/json" 'https://rest.ensembl.org/sequence/id/ENST00000617537' | jq -r .seq | wc  -c
211555

(+1 char for the newline)

Also, the code path that you're using is bioutils.seqfetcher, via hgvs dataproviders.

I double checked seqrepo and it's fine:

❯ seqrepo  -r /usr/local/share/seqrepo  start-shell -i latest 
Python 3.12.3 (main, Sep 11 2024, 14:17:37) [GCC 13.2.0]
Type 'copyright', 'credits' or 'license' for more information
IPython 8.27.0 -- An enhanced Interactive Python. Type '?' for help.


seqrepo (https://github.com/biocommons/biocommons.seqrepo/)
version: 0.6.9
instance path: /usr/local/share/seqrepo/latest

>>> q=sr["ENST00000617537.5"]
>>> len(q)
2385

>>> q = bioutils.seqfetcher.fetch_seq("ENST00000617537.5")
>>> len(q)
211554

davmlaw · 2024-10-03T02:27:13Z

@reece the Ensembl REST API endpoint https://rest.ensembl.org/documentation/info/sequence_id

has an argument type which:

Defaults to genomic where applicable, i.e. not translations. cdna refers to the spliced transcript sequence with UTR; cds refers to the spliced transcript sequence without UTR.

If you look, I'm using type=cdna - while when you leave it off, you're using genomic - which explains the difference.

So should SeqRepo should use the cdna rather than genomic? That looks like how it's done with RefSeq?

reece · 2024-10-03T04:15:52Z

This issue is totally unrelated to SeqRepo. SeqFetcher is part of bioutils. I've just moved this issue from biocommons.seqrepo to bioutils.

The underlying cause is that an ENST id refers to a family of sequences, not a single sequence. The cDNA sequence is what I consider to be the main sequence, and that's what's in SeqRepo. It's also what's shown on Ensembl web pages for a given ENST. And it's what corresponds to RefSeq transcripts. It is very unfortunate that a single id refers to three sequences.

Honestly, this all makes me think that we should remove support for fetching ENSTs with SeqFetcher (here) since it can't be done unambiguously. It's better to not have this support than to try to chase Ensembl's intentions with identifiers. Alternatively, we could assume cdna and add type=cdna to the http get.

❯ curl -s -H "accept: application/json" 'https://rest.ensembl.org/sequence/id/ENST00000617537?type=cds' | jq '{id: .id, version: .version, seq_length: (.seq | length)}'
{
  "id": "ENST00000617537",
  "version": 5,
  "seq_length": 1728
}

❯ curl -s -H "accept: application/json" 'https://rest.ensembl.org/sequence/id/ENST00000617537?type=cdna' | jq '{id: .id, version: .version, seq_length: (.seq | length)}'
{
  "id": "ENST00000617537",
  "version": 5,
  "seq_length": 2385
}

❯ curl -s -H "accept: application/json" 'https://rest.ensembl.org/sequence/id/ENST00000617537' | jq '{id: .id, version: .version, seq_length: (.seq | length)}'     
{
  "id": "ENST00000617537",
  "version": 5,
  "seq_length": 211554
}

davmlaw · 2024-10-03T04:20:47Z

OK with you moving issue to wherever you think is best.

I've re-opened it as I think it's an open question on what to do regarding ENST's - hope that's ok

reece · 2024-10-04T13:55:08Z

I'm fine with you reopening it if you think it needs more discussion.

The whole point of seqfetcher is to create a simple interface for fetching a sequence for a given identifier (and just the identifier). Identifiers by definition should map to only one value. ENSTs are not actually identifiers at all in this sense.

So, my hot take is that we should just remove support for ENSTs from seqfetcher.

If we're going to leave this open, would you please suggest what outcome you would like to see from this issue?

jsstevenson · 2024-10-04T14:15:55Z

Just pitching in: I'd be a little bummed if we lost support for Ensembl transcripts. Could we just assume cDNA and log a warning indicating as much every time an ENST is requested? That way, behavior is at least consistent and somewhat transparent.

reece · 2024-10-04T15:13:23Z

Yep we could assume cDNA. That's definitely the best solution.

davmlaw · 2024-10-05T21:52:40Z

I also vote for cdna by default.

My main goal is I want Ensembl HGVS to resolve correctly, or failing that, not at all (better to not support)

davmlaw added the bug Something isn't working label Oct 1, 2024

davmlaw changed the title ~~Ensembl transcript sequence is wrong~~ Ensembl transcript ENST00000617537.5 sequence is wrong Oct 1, 2024

davmlaw changed the title ~~Ensembl transcript ENST00000617537.5 sequence is wrong~~ Ensembl transcript ENST00000617537.5 sequence is wrong (88x larger than it should be) Oct 1, 2024

reece closed this as completed Oct 2, 2024

reece transferred this issue from biocommons/biocommons.seqrepo Oct 3, 2024

davmlaw reopened this Oct 3, 2024

davmlaw changed the title ~~Ensembl transcript ENST00000617537.5 sequence is wrong (88x larger than it should be)~~ Ensembl transcript ENST00000617537.5 sequence is genomic not cdna Oct 3, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Ensembl transcript ENST00000617537.5 sequence is genomic not cdna #75

Ensembl transcript ENST00000617537.5 sequence is genomic not cdna #75

davmlaw commented Oct 1, 2024

reece commented Oct 2, 2024 •

edited

Loading

reece commented Oct 2, 2024

davmlaw commented Oct 3, 2024

reece commented Oct 3, 2024

davmlaw commented Oct 3, 2024

reece commented Oct 4, 2024

jsstevenson commented Oct 4, 2024

reece commented Oct 4, 2024

davmlaw commented Oct 5, 2024 •

edited

Loading

Ensembl transcript ENST00000617537.5 sequence is genomic not cdna #75

Ensembl transcript ENST00000617537.5 sequence is genomic not cdna #75

Comments

davmlaw commented Oct 1, 2024

reece commented Oct 2, 2024 • edited Loading

reece commented Oct 2, 2024

davmlaw commented Oct 3, 2024

reece commented Oct 3, 2024

davmlaw commented Oct 3, 2024

reece commented Oct 4, 2024

jsstevenson commented Oct 4, 2024

reece commented Oct 4, 2024

davmlaw commented Oct 5, 2024 • edited Loading

reece commented Oct 2, 2024 •

edited

Loading

davmlaw commented Oct 5, 2024 •

edited

Loading