Are zero-based positions non-negative? Context: Circular genomes #132
Comments
Hi @pgrosu, |
@pcingola, I wrote a simple Python program to illustrate this:
To me it's fairly straightforward, but I'm sure others might also want to chime in :) |
Hi Paul,
/** /**
So one of the correct representations would be:
getSubSeq( 5, 8, "ACGTACGT") = "CGTA" // This seems to be the preferred representation (is it the only one?)
It is not clear whether [-3, 1) is allowed or not:
getSubSeq( -3, 1, "ACGTACGT") = "CGTA" // I'm not sure if this is correct
Looking at the comment in #93, I saw this quote: "I am assuming that we permit start > end to imply that the interval should be considered in the reverse direction". So the only conclusion I made was that using [5, 1) was incorrect:
What I'm proposing is simply to add a comment to GAVariant / GAInterval that reads as follows: /** This would help to eliminate the current redundancy issue, by only allowing getSubSeq( 5, 8, "ACGTACGT") = "CGTA". What do you think? |
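For readers following along, the redundancy can be made concrete with a minimal sketch. It is not the program or the getSubSeq definition from this thread: `sub_seq` is a hypothetical helper that assumes a 0-based, half-open [start, end) interval on a circular sequence and resolves every index modulo the sequence length.

```python
def sub_seq(start: int, end: int, seq: str) -> str:
    """Return the bases in the 0-based half-open interval [start, end).

    Assumptions (not taken from the thread): the sequence is circular, so
    each index is taken modulo len(seq); start may be negative and end may
    exceed len(seq), but start <= end is required.
    """
    n = len(seq)
    if n == 0:
        raise ValueError("empty sequence")
    if start > end:
        raise ValueError("start must not exceed end")
    if end - start > n:
        raise ValueError("interval is longer than the sequence")
    # For a positive n, Python's % always yields a value in [0, n),
    # so negative indices wrap around the origin.
    return "".join(seq[i % n] for i in range(start, end))


print(sub_seq(-3, 1, "ACGTACGT"))  # CGTA
print(sub_seq(5, 9, "ACGTACGT"))   # CGTA
```

Under these assumptions [-3, 1) and [5, 9) name the same four bases, which is the kind of redundancy a single permitted representation would remove. The getSubSeq calls quoted in the thread may follow a different argument convention.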
Hi Pablo, I wrote the proof-of-concept program in the most concise way I could for a 0-based [start, end) closed-open interval, so that it would cover most cases and stay flexible. Yes, I agree that tweaks would be appropriate for different cases, and you raised some valid points. But before I add to it, I would like others to comment as well. Paul P.S. For |
Most aligners don't support circular genomes. Most big short-read data |
We only need to formalize that negative positions are not allowed in |
I am back from vacation and think we should On 26 Aug 2014, at 19:49, maximilianh notifications@github.com wrote:
Great. (a) is perfectly clear from the norm, so I'll add (b) and (c) as comments to the GAInterval record (and maybe GAVariant) in order to close the issue. |
NCBI sra format rule for circular references:
This matches what we observe in BAMs submitted to SRA. Complete Genomics definitely does it for MT, but I believe we've seen the same in some other aligners. |
@pcingola - did you have time to make that PR? is this issue ready to close? |
It was incorporated into PR #126 (GAInterval). |
We can leave this open until the PR goes up (so that we don't forget or lose the context) Thanks! |
Hi @cassiedoll, I've created a PR #158 to solve and close this issue. |
Closed by #158 |
The schema definition says
So some people may assume these are non-negative (unsigned) numbers. Nevertheless,
negative genomic coordinates are often used for circular genomes, as shown in this example
from ENSEMBL's Escherichia_coli_o104_h4_str_2011c_3493:
$ zcat GCA_000299455.1.22/genes.gtf.gz | cut -f 2-5 | grep -e "-"
protein_coding exon -1567 1477
protein_coding CDS -1564 1477
protein_coding stop_codon -1567 -1565
protein_coding exon -236 240
protein_coding CDS -236 237
protein_coding start_codon -236 -234
How should we handle these cases? The word "offset" suggests that negative coordinates should be converted to non-negative by adding the total length, but I didn't see any explicit comment about it (sorry if I missed it). Should we clarify how to handle negative coordinates for circular genomes? If so, what's the preferred way?
i) Using negative coordinates is a pain
ii) Using positive coordinates introduces problems due to intervals having 'start' position after 'end'
So I don't have a strong bias for/against either option.
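The trade-off between (i) and (ii) can be sketched in a few lines. This is only an illustration under stated assumptions: `convert_interval` and `add_length_if_negative` are hypothetical names, coordinates are treated as zero-based, and the length of 8 is a toy value, not the length of the E. coli genome above.

```python
def add_length_if_negative(pos: int, length: int) -> int:
    """Convert a negative position to a non-negative one by adding the length."""
    if pos < -length:
        raise ValueError("position lies more than one full turn before the origin")
    return pos + length if pos < 0 else pos


def convert_interval(start: int, end: int, length: int) -> tuple[int, int, bool]:
    """Convert both ends of an interval and flag whether start now exceeds end.

    Assumption: `length` is the total length of the circular sequence.
    """
    if length <= 0:
        raise ValueError("length must be positive")
    new_start = add_length_if_negative(start, length)
    new_end = add_length_if_negative(end, length)
    # When only the start was negative, the converted start lands after the end.
    start_after_end = new_start > new_end
    return new_start, new_end, start_after_end


print(convert_interval(-3, 1, 8))   # (5, 1, True)
print(convert_interval(-3, -1, 8))  # (5, 7, False)
```

An interval that lies entirely before the origin converts cleanly, while one that crosses the origin ends up with 'start' after 'end', which is the problem described in (ii).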