Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensembl fasta files no long usable as references #1295

Open
bwlang opened this issue Feb 20, 2019 · 9 comments
Open

Ensembl fasta files no long usable as references #1295

bwlang opened this issue Feb 20, 2019 · 9 comments

Comments

@bwlang
Copy link

bwlang commented Feb 20, 2019

As a result of this change, it's now longer possible to use gencode's transcript fasta as a reference database with mark duplicates because parentheses are common in their transcript names.

I read a bit on the background and I'm not sure I understand why the sam spec should be changed to disallow parentheses and brackets. I think @tfenne 's comment is right on. It's too late to make these changes without downstream breakage.

Maybe the @HD-VN checking will help with this?
I switched back to picard 2.18.23 (just before 2.18.24 update htsjdk)

Your environment

  • version of htsjdk - 2.18.2
  • version of java 11.01
  • version of linux: ubuntu 16.04

Steps to reproduce

attempt to mark duplicates of a bam file aligned to an encode list of transcripts

Expected behaviour

should mark duplicates

Actual behaviour

complains that contig names do not match regex

@lbergelson
Copy link
Member

@bwlang Is this the ensemble human transcripts? As far as I understood they all looked something like this: ENST00000514109.1

@bwlang
Copy link
Author

bwlang commented Feb 20, 2019

I noticed this on mouse:

ENSMUST00000159544.2|ENSMUSG00000086429.9|OTTMUSG00000034748.3|OTTMUST00000088412.2|Gt(ROSA)26Sor-203|Gt(ROSA)26Sor|822|antisense|

@lbergelson
Copy link
Member

Ah, I don't see any in the gencode human transcripts.

@lbergelson
Copy link
Member

Yeah, I see 4 transcripts in the v20M with () in them. That's unfortunate. We'll have to figure out how to deal with that.

@lbergelson
Copy link
Member

We've sent a message to the people at ensemble, hopefully we can work to fix the transcript names in future versions. Until then we have to decide how we want to enable people to continue working with things that have pathological reference names.

@bwlang
Copy link
Author

bwlang commented Feb 22, 2019

I think it's a bad idea to tighten the SAM spec a this point. The goal is to reduce pain - but I suspect net happiness is going to go down as a result.
If that decision is immutable, perhaps applying strictness only on those files that have been marked with a later SAM version is a practical approach.

@jmarshall
Copy link
Member

I think it's a bad idea to tighten the SAM spec a this point. The goal is to reduce pain

In the case of commas, the goal was to make the spec self-consistent and the format parsable. In the case of the various brackets, statistics were gathered (see samtools/hts-specs#333 et al) and a grand total of zero instances of () were observed over a period of months.

However teething problems like this were not unanticipated. Please raise an issue over at hts-specs if you would like the SAM specification folks to consider relaxing the spec to allow parenthesis characters here.

What exact file is this from? gencode.vM20.transcripts.fa.gz?

ENSMUST00000159544.2|ENSMUSG00000086429.9|OTTMUSG00000034748.3|OTTMUST00000088412.2|Gt(ROSA)26Sor-203|Gt(ROSA)26Sor|822|antisense|

I am curious though. Do you really enjoy having all that as a reference sequence name? Do people doing this sort of thing ever truncate the FASTA headers at the first | and work with ENSMUST00000159544.2 etc instead, which seems rather less unwieldy?

perhaps applying strictness only on those files that have been marked with a later SAM version is a practical approach.

In fact that is what the spec recommends.

(OTOH htsjdk not doing so is what has brought your issue to light so quickly…)

@bwlang
Copy link
Author

bwlang commented Feb 22, 2019

I think the statistics gathering approaching did not gather information covering the diverse world of reference databases.

There is utility in using transcript names unmodified from sources like gencode. All that extra information does turn out to be useful sometimes.

Here's the link to the data:
ftp://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_mouse/release_M20/gencode.vM20.transcripts.fa.gz

@jmarshall
Copy link
Member

Clearly what we need is more statistics 😄

Thanks for bringing this mouse transcript file to people's attention. Are there other examples of the diverse world of reference databases that you'd like to bring to our attention?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants