This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Should GA4GH.cigar be something other than a string? #8

Closed
cassiedoll opened this issue Apr 2, 2014 · 12 comments

@cassiedoll

(Issue split out from #3)

We might consider an alternative representation of the cigar that requires less regex usage while still being compact.

(maybe a more formal structure, like [{type: deletion, count: 10}, {type: match, count: 30}]? or anything else that might help with parsing and be easy for api providers to support)

@massie

massie commented Apr 2, 2014

This makes sense. Here's a snippet of schema from an email I sent to the read task team mailing list...

enum CigarEventType {
  ALIGNMENT_MATCH,
  INSERTION_TO_REFERENCE,
  DELETION_FROM_REFERENCE,
  SKIPPED_REGION,
  SOFT_CLIPPING,
  HARD_CLIPPING,
  PADDING,
  SEQUENCE_MATCH,
  SEQUENCE_MISMATCH
}

record CigarEvent {
  CigarEventType eventType;
  long eventLength;
}

record Cigar {
  array<CigarEvent> events;
}
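
To illustrate how an existing CIGAR string would map onto a structure like this, here is a minimal Python sketch. The mapping from SAM operation characters to the enum names is an assumption based on the SAM spec's operation list, and `cigar_to_events` is a hypothetical helper, not part of any GA4GH API.

```python
from typing import List, Tuple

# Assumed mapping from SAM CIGAR operation characters to the
# enum names proposed in the schema snippet.
OP_TO_EVENT = {
    "M": "ALIGNMENT_MATCH",
    "I": "INSERTION_TO_REFERENCE",
    "D": "DELETION_FROM_REFERENCE",
    "N": "SKIPPED_REGION",
    "S": "SOFT_CLIPPING",
    "H": "HARD_CLIPPING",
    "P": "PADDING",
    "=": "SEQUENCE_MATCH",
    "X": "SEQUENCE_MISMATCH",
}


def cigar_to_events(cigar: str) -> List[Tuple[str, int]]:
    """Convert a SAM CIGAR string into (eventType, eventLength) pairs.

    Hypothetical helper: scans the string once, character by character,
    so no regular expression is needed.
    """
    events = []
    digits = ""
    for ch in cigar:
        if ch in "0123456789":
            digits += ch  # accumulate the run length
        elif ch in OP_TO_EVENT and digits:
            events.append((OP_TO_EVENT[ch], int(digits)))
            digits = ""
        else:
            raise ValueError(f"unexpected character {ch!r} in CIGAR {cigar!r}")
    if digits:
        raise ValueError(f"CIGAR {cigar!r} ends with a length but no operation")
    return events


print(cigar_to_events("10D30M"))
```

A client receiving the structured form would skip this step entirely and iterate over the events directly, which is the parsing benefit being discussed.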

@massie

massie commented Apr 2, 2014

Btw, this format would require only 1 + 2n bytes of space, where n is the number of events. That is just as compact as a string-based CIGAR and faster to parse.
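
A rough way to check this estimate is to count bytes by hand. The sketch below assumes the schema is serialized with Avro's binary encoding: enum values and longs as zigzag variable-length integers, and an array as one block (an item count followed by the items) closed by a zero count. Under those assumptions the total comes to roughly 2 + 2n bytes when every event length is below 64, with an extra byte per event once a length reaches 64. The function names are hypothetical.

```python
def zigzag_varint_size(value: int) -> int:
    """Bytes needed to write a 64-bit value as a zigzag varint."""
    # Zigzag maps small magnitudes (positive or negative) to small unsigned values.
    encoded = (value << 1) ^ (value >> 63)
    size = 1
    while encoded >= 0x80:  # each output byte carries 7 payload bits
        encoded >>= 7
        size += 1
    return size


def cigar_encoded_size(events) -> int:
    """Estimated size of a Cigar record; events are (enum_index, length) pairs.

    Assumes a single array block followed by a zero-count terminator.
    """
    if not events:
        return 1  # an empty array is just the zero terminator
    size = zigzag_varint_size(len(events))  # block item count
    for enum_index, length in events:
        size += zigzag_varint_size(enum_index) + zigzag_varint_size(length)
    return size + 1  # zero count that ends the array


# 10D30M: DELETION_FROM_REFERENCE is enum index 2, ALIGNMENT_MATCH is index 0.
print(cigar_encoded_size([(2, 10), (0, 30)]))  # 6 bytes, same as the 6-character string
print(cigar_encoded_size([(0, 100)]))          # 5 bytes, versus 4 characters for "100M"
```

So the structured form stays in the same size range as the string, which is the main point of the comparison.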

@cassiedoll

That looks good for a v1. I hope that at a later point we could consolidate some of the types though. (perhaps get rid of alignment_match and force sequence match/mismatch to be used or vice versa, etc)

(passing should probably be padding btw)

@massie

massie commented Apr 2, 2014

The sequence (mis)match semantics are rarely used because the MD tag provides that information (and more) in a more compact space. The MD tag also allows for calculation of edit distance (which the CIGAR string does not).

From the SAM spec...

The MD field aims to achieve SNP/indel calling without looking at the reference. For example, a
string `10A5^AC6' means from the leftmost reference base in the alignment, there are 10 matches
followed by an A on the reference which is different from the aligned read base; the next 5
reference bases are matches followed by a 2bp deletion from the reference; the deleted sequence
is AC; the last 6 bases are matches. The MD field ought to match the CIGAR string.
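
The example in that passage can be walked through in code. This is a minimal sketch of an MD parser, assuming the grammar described in the quote (runs of matches, single mismatched reference bases, and `^`-prefixed deletions); `parse_md` is a hypothetical helper. Note that MD alone only yields mismatched and deleted reference bases; inserted bases do not appear in it and would have to come from the CIGAR.

```python
import re

# One token is a match count, a '^'-prefixed deletion, or a single mismatched base.
MD_TOKEN = re.compile(r"(\d+)|\^([A-Za-z]+)|([A-Za-z])")


def parse_md(md: str):
    """Split an MD string into ('match', n), ('mismatch', base), ('deletion', bases)."""
    ops = []
    pos = 0
    while pos < len(md):
        m = MD_TOKEN.match(md, pos)
        if m is None:
            raise ValueError(f"cannot parse MD string {md!r} at offset {pos}")
        if m.group(1) is not None:
            count = int(m.group(1))
            if count:  # a 0 only separates adjacent mismatches/deletions
                ops.append(("match", count))
        elif m.group(2) is not None:
            ops.append(("deletion", m.group(2)))
        else:
            ops.append(("mismatch", m.group(3)))
        pos = m.end()
    return ops


ops = parse_md("10A5^AC6")
print(ops)

# Mismatched plus deleted reference bases: 1 (the A) + 2 (the AC deletion).
changed = sum(len(v) for kind, v in ops if kind != "match")
print(changed)  # 3
```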

@cassiedoll

Right so, we could get rid of those two types :)

@massie

massie commented Apr 2, 2014

Sorry, I misread your comment to mean we drop ALIGNMENT_MATCH and use SEQUENCE_(MIS)MATCH exclusively. I see the vice versa now. :)

I guess this raises the issue -- we haven't specified an MD tag analog.

@cassiedoll

Sounds like a new read field to me - we should probably make a new bug.

@massie

massie commented Apr 2, 2014

+1

Created #9

@fnothaft fnothaft self-assigned this Apr 17, 2014
@dglazer

dglazer commented Apr 17, 2014

On yesterday's Reads task team call, Frank Nothaft volunteered to submit a pull request and drive this to resolution. (Thanks Frank!)

@dglazer

dglazer commented Apr 26, 2014

@fnothaft just submitted #30, which addresses this -- closing. (We can discuss implementation details there.)

@dglazer dglazer closed this as completed Apr 26, 2014
@dglazer

dglazer commented Apr 27, 2014

Re-opening, since #30 was closed after discussion, in favor of a less ambitious proposal (maybe similar to what @massie suggested above?)

@dglazer

dglazer commented May 1, 2014

Closing (addressed in #33)
