-
Notifications
You must be signed in to change notification settings - Fork 111
What's the GA4GH position on 0 vs 1 based positions? #121
Comments
A lot of bio formats and databases use 0-based coordinates, including most UCSC tables/formats (e.g. BED and BLAT) and most binary formats (e.g. BAM, BCF and CRAM). |
@calbach, stick with 0-based. You'll begin to love it as the API becomes more complex, and if you'll have to build customized pipelines in the future :) |
@lh3 I think you are most closely advocating for (2); correct me if I'm wrong. From my point of view a file format is just an implementation detail. A GA4GH implementation can of course store the position however they wish internally, I think the interface/library used to access these files makes for a better comparison when discussing an API. I guess for BED your point is valid, as the file format appears also to be the API. However, most sane code I've seen which interacts with BAM parses it as Picard Regardless, I think an API must be held to a higher standard for being broadly consumable than does a file format. |
But also, point taken that there is significant precedent for 0-based bio data formats. |
Thanks for reminding me that the schema is for APIs, which I frequently forget. That being said, 0-based and 1-based coordinate systems have coexisted for over 10 years. Most UCSC users know the coordinates they get from the table browser are 0-based. Note that I am neutral on 0-based or 1-based coordinates in APIs. Just point out that 0-based coordinates are also frequently seen in web interface, though not as often as 1-based. PS: in the C BAM/SAM/VCF/BCF APIs, coordinates directly accessed by programmers are all 0-based. |
@calbach, I know what you mean in that you want to make this perfect, and in this case you can have your cake and eat it too :) So internally if you manipulate sequences directly it would probably be more 0-based, but if you shift to methods/classes (i.e. getOverlaps(ListOfSequences)) then 1-based will become just as easy to work with. In the end, to help out the end-users you can have a number-line (or 0/1-indicator) in the browser/exported file to help them interpret the data properly. If necessary users can just perform a quick update to the data, as shown below. So I generated a very simplistic pipeline to demonstrate this - the little mouse symbol indicates a right-click on the pipeline module, with the associated context menu: |
IMHO the correct position for GA4GH to take is to recognise that there are two forms of communication going on here: there is mechanical communication between programs via the API, and there is communication when a program displays output to a human. These aren't the same: the former might be binary while the latter will certainly be textual or graphical; so they needn't be the same in the 0-based v. 1-based detail either. There is a boatload of previous discussion in #49 too.
The correct representation for doing arithmetic on intervals is 0-based half-open (see #49 discussion). So the correct representation for mechanical communication for intervals is 0-based half-open and by extension for positions it is 0-based. So IMHO the entire API should be 0-based, because the API is for mechanical communication between programs. When it comes to presenting results to biological users, many programs will probably choose to do so by presenting them in the familiar (to this human audience) 1-based form in human-readable output. For example, by looking at the source code and table dumps (as @lh3 noted) it is clear that the UCSC browser is internally 0-based but presents web pages to users in a 1-based view. So I suppose that the position I'm advocating is your (2), except that your characterisation doesn't distinguish between clients producing machine-readable output (cf BED files) that will likely want 0-based coordinates as is, and clients when they are producing human-readable output that will likely want to add 1 to positions and interval start positions just once, in their model-to-view layer.
I suppose there is a trade-off in error-proneness here. Representing indels, computing interval lengths, reverse-complementing coordinates, and interval arithmetic in general are all very error-prone in anything other than 0-based half-open coordinates. IMHO this source of bugs greatly outweighs the smaller source for bugs from forgetting to add 1 in a well-defined model->view translation layer.
I don't think this problem will eventuate. Output suitable for piping into another GA4GH tool will naturally be machine-readable output, and will thus be machine-orientated 0-based. For flexibility, where practical, tools would probably be able to read human-orientated output from other tools too, but such output would identify itself as 1-based and as needing conversion on input. We have this today: BAM files have 0-based representation, while human-readable SAM files have an externally 1-based representation but are typically converted to their BAM equivalent on input. Programs like samtools read BAM or SAM and convert SAM on I/O as appropriate. |
Thanks @jmarshall for the response.
First and foremost the API exists for human (programmer) comprehension. Defining the API in a language like Avro IDL means we can generate language-specific constructs and can convert to a wire protocol for communication between programs. A machine will not care if the contents of the field are 0 or 1 based, as it will make little impact on performance, unless someone can produce a specific convincing example of why it matters for ga4gh specifically (it generally doesn't). The contents of this field should be optimized for consumption by a programmer; perhaps this is what you were saying. If you accept this, the question can now be framed as follows: Note that with (2), the programmer must make a subjective decision to emit the coordinate as 0-based or 1-based given the context. As a programmer I'm happier, more productive, and introduce fewer bugs when I don't have to make such decisions.
Thanks, I hadn't seen #49. I agree that all other things equal, 0-based half-open intervals are ideal. In this case I just happen to think that all other things are not equal.
All other things equal, I would say that an environment where you have to make this distinction is more complicated than one where you don't.
Okay, I buy that they implemented it this way - why is this superior to using 1 throughout? Also, as I've noted before an internal implementation should be free to choose whatever 0 or 1 or -1 based positioning it desires so long as the API is uniform.
I don't see a reason why 1-based half-open intervals would be more prone to these errors than 0-based half-open intervals. I'm pretty sure there's a cost to (2) and I think you agree from your comments. I'm just still failing to see major benefits to the 0-based approach, aside from the fact that I would have made it 0-based if I were designing it from scratch, and it will slightly bother me when I have to use 1-based coordinates. |
Then you should try implementing some interval arithmetic! Suppose you have some feature on an exon, represented as (This is a trivial example. It gets a lot less fun when you have to negate and reverse things due to strandedness.) |
Thanks, I see it now - I was previously considering only intervals in chromosome space. This is a good tangible downside to (1). Certainly nested relative 1-based intervals are cumbersome to work with. |
Here are some helpful visuals and discussion from Biostars that seem very relevant to this discussion: |
The UCSC coordinates system represents the first 100 bases of a chromosome as 0,100 and the next 100 as 100,200, and so forth. The advantages of this are many. Perhaps most importantly you can represent something (such as a cut or an insertion) that happens between bases by using the same number for start and end. Less important the calculation to get the size of the region is simple. In general it resorts to a lot less + and - 1's in the processing code, though admittedly you have to |
Thanks @JimKent. These are properties of closed vs half-open intervals; I believe both should hold true of 1 or 0 based coordinates, right? |
By convention, the 0-based coordinate system uses half-close-half-open intervals; the 1-based uses closed intervals. The two systems have caused enough confusions. Certainly we wouldn't want to see 0-based closed intervals or other combinations (there are 8 in total). |
Yes, you're right. 1-based exclusive doesn't really make sense and indeed there are additional downsides to an inclusive range. I've yet to hear from a proponent of 1-based intervals on this thread which makes me a bit uneasy as I'm confident such people exist, but the arguments around interval arithmetic are fairly convincing so I'll stop playing devil's advocate. I think the next step here is to document a disclaimer and given example scenarios where a client might want to 'add one' before presenting the position to a user. One example scenario I have in my head is where a user wants to build a quick weekend project which visualizes some data in a ga4gh repository (say, reads). If the user doesn't know any better, positions in this visualization will wind up being off-by-one. To be concrete, should the ga4gh documentation recommend adding 1 for commandline I/O and/or GUIs? Or should it be silent on the issue? I will send a pull request with the documentation change on |
I would just keep silent. Most users understand the implication of 0-based coordinates and know how to deal with them. 1-based coordinates are more popular, but they are not really "typical". |
I agree with @lh3. Best to fully explain what is there, and stay mute on other systems. |
@macieksmuga is temporarily unavailable, so I'm merging my own changes.
All coordinates of the GA4GH schema are currently 0-based. Some concern came up recently when I was discussing this with a colleague who is very familiar with the genetic research field. My impression and his, is that nearly all biologists (and many existing databases/tools/repositories) communicate in terms of 1-based genomic coordinates. For example, scientific papers typically refer to a specific locus by a position which is implicitly 1-based. I didn’t see a convincing amount of discussion of this topic in past GitHub issues (#5, #93), and I’m unsure as to what message the choice of a 0-based coordinate is sending. I can imagine the GA4GH taking one of three positions on the matter:
Computer scientists including myself dislike (1) for reasons of consistency. (2) seems very error-prone to me, as all clients must remember to add one before presenting results to the user. Off by one errors will be rampant; for instance, it might be difficult to pipe one GA4GH tool into another, as both would assume 0-based input and 1-based output. (3) would require a large amount of migration and community buy-in; I’m unqualified to assess the impact.
Currently, the GA4GH is either in camp (2) or (3). I’m trying to understand which.
The text was updated successfully, but these errors were encountered: