Skip to content
This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Are zero-based positions non-negative? Context: Circular genomes #132

Closed
pcingola opened this issue Aug 25, 2014 · 15 comments
Closed

Are zero-based positions non-negative? Context: Circular genomes #132

pcingola opened this issue Aug 25, 2014 · 15 comments

Comments

@pcingola
Copy link
Contributor

The schema definition says

/**
  The 0-based offset from the start of the forward strand for that reference.
*/
long position;

So some people may assume these are non-negative (unsigned) numbers. Nevertheless
negative genomic coordinates are used often for circular genomes as shown in this example
from ENSEMBL's Escherichia_coli_o104_h4_str_2011c_3493:

$ zcat GCA_000299455.1.22/genes.gtf.gz | cut -f 2-5 | grep -e "-"
protein_coding exon -1567 1477
protein_coding CDS -1564 1477
protein_coding stop_codon -1567 -1565
protein_coding exon -236 240
protein_coding CDS -236 237
protein_coding start_codon -236 -234

How should we handle these cases? The word "offset" suggests that negative coordinates should be converted to non-negative by adding the total length, but I didn't see any explicit comment about it (sorry if I missed it). Should we clarify how to handle negative coordinates for circular genomes? If so, what's the preferred way?

i) Using negative coordinates is a pain

ii) Using positive coordinates introduces problems due intervals having 'start' position after 'end'

So I don't have a strong bias for/against either option.

@pgrosu
Copy link
Contributor

pgrosu commented Aug 26, 2014

Hi Pablo,

I like your nice example and no worries :) Here are links to the previous discussions on 0- and 1-based positions, but not specific to circular and negative genomic coordinates:

#121
#5
#93
#49

Paul

@pcingola
Copy link
Contributor Author

Hi @pgrosu,
Thank you. I've read the links, but I still don't see how they answer
my specific question. Do you mind explaining a little bit more how to
express a negative position (or an interval spanning across
coordinate 0)? I'm probably missing something obvious here...

@pgrosu
Copy link
Contributor

pgrosu commented Aug 26, 2014

@pcingola, I wrote a simple Python program to illustrate this:

def interval( start, end, seq_len ):

  if start < 0:
    start = start + seq_len
  if end < 0:
    end = end + seq_len 

  if start > end:
    return( range(start, seq_len) + range(0, end) )
  else:
    return( range(start, end) )

def getSubSeq( start, end, sequence ):

  seq_in_interval = ""

  for i in interval( start, end, len(sequence) ):
    seq_in_interval = seq_in_interval + sequence[i]

  return( seq_in_interval )

getSubSeq( 2, 4, "ACGTACGT") will return GT as @richarddurbin posted in #93
getSubSeq( 7, 3, "ACGTACGT") will return TACG
getSubSeq( -3, 1, "ACGTACGT") will return CGTA like you suggested :)

To me it's fairly straightforward, but I'm sure others might also want to chime in :)

@pcingola
Copy link
Contributor Author

Hi Paul,
Indeed, I agree that it is simple, but the correct representation for this particular case, is not clear from the norm/comments. The problem is that it seems to be more than one way of representing them, which can be a headache when implementing the API.
According to GAVariant:

/**
The start position at which this variant occurs (0-based).
This corresponds to the first base of the string of reference bases.
*/
long start;

/**
The end position (exclusive), resulting in [start, end) closed-open interval.
This is typically calculated by start + referenceBases.length.
*/
long end;

So one of the correct representations would be

getSubSeq( 5, 8, "ACGTACGT") = "CGTA" // This seems to be the preferred representation (is it the only one?)

It is not clear whether [-3, 1) is allowed or not:

getSubSeq( -3, 1, "ACGTACGT") = "CGTA" // I'm not sure if this is correct

Looking at the comment in #93, I saw this quote: "I am assuming that we permit start > end to imply that the interval should be considered in the reverse direction". So the only conclusion I made was that using [5, 1) was incorrect:

 getSubSeq( 5, 1, "ACGTACGT") = "CGTA"   // This is not allowed

What I'm proposing it to simply add a comment to GAVariant / GAInterval that reads as follows:

/**
In case of circular genomes, 'end' can be greater than the reference's length (end = start + length), denoting intervals that span across reference's zero position .
Coordinates must be non-negative numbers.
*/

This would help to eliminate the current redundancy issue, by only allowing getSubSeq( 5, 8, "ACGTACGT") = "CGTA". What do you think?

@pgrosu
Copy link
Contributor

pgrosu commented Aug 26, 2014

Hi Pablo,

I wrote the proof-of-concept program the most concise way I could for a 0-based [start, end) closed-open interval, that would cover most cases and would be flexible.

Yes, I agree that tweaks would be appropriate for different cases, and you raised some valid points. But before I add to it, I would like others to comment as well.

Paul

P.S. For getSubSeq( 5, 8, "ACGTACGT") I get CGT with my program.

@maximilianh
Copy link
Contributor

Most aligners don't support circular genomes. Most big short-read data
files we will have to deal with are most likely from human. I don't think
we need to support circular genomes, at least in this version.

@pcingola
Copy link
Contributor Author

We only need to formalize that negative positions are not allowed in
order to have full support for circular genomes. Why avoid it?

@richarddurbin
Copy link
Contributor

I am back from vacation and think we should
(a) stick with 0-based coordinates
(b) not allow negative coordinates
(c) handle circular genomes in the graph framework, with subsequences spanning the join being made of two segments, one on each side of the join

On 26 Aug 2014, at 19:49, maximilianh notifications@github.com wrote:

Most aligners don't support circular genomes. Most big short-read data
files we will have to deal with are most likely from human. I don't think
we need to support circular genomes, at least in this version.

Reply to this email directly or view it on GitHub.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@pcingola
Copy link
Contributor Author

Great. (a) is perfectly clear from the norm, so I'll add (b) and (c) as comments to GAInterval record (and may be GAVariant) in order to close the issue.

@yaschenk
Copy link

yaschenk commented Sep 3, 2014

NCBI sra format rule for circular references:

  1. Circular references are explicitly marked (all INDSC records are)
  2. Negative start positions are not allowed
  3. CIGAR is allowed to go beyond the end of the sequence which is not an error for circular reference

This matches what we observe in bams submitted to SRA. Complete Genomics definitively does it for MT, but I believe we've seen the same in some other aligners.

@cassiedoll
Copy link
Member

@pcingola - did you have time to make that PR? is this issue ready to close?

@pcingola
Copy link
Contributor Author

It was incorporated into PR #126 (GAInterval).
I will create a separate PR to add this into the other files.
You can close the issue.

@cassiedoll
Copy link
Member

We can leave this open until the PR goes up (so that we don't forget or lose the context)
Separating out the new comments from #126 would be great though - as it'll be a much smaller PR and will most likely merge more quickly.

Thanks!

@pcingola
Copy link
Contributor Author

Hi @cassiedoll, I've created a PR #158 to solve and close this issue.

@cassiedoll
Copy link
Member

Closed by #158

Sign up for free to subscribe to this conversation on GitHub. Already have an account? Sign in.
Projects
None yet
Development

No branches or pull requests

6 participants