-
Notifications
You must be signed in to change notification settings - Fork 112
Replaced CIGAR with descriptive nested alignment structure. #30
Conversation
There are a few issues:
I would suggest the following change if we are unhappy with a CIGAR string:
I don't know if it is a good idea to give the full chimeric alignment (using an array) and to describe the full mateAlignment in GARead. I am neutral. |
@lh3 thanks for the comments. Here are my thoughts; I would appreciate further feedback.
Your point about soft vs. hard clipping is correct. However, this could be addressed by adding a flag to distinguish hard vs. soft clips. I don't intend for deletions and reference skips to be represented the same way. There is no analog for a reference skip (CIGAR N) in this schema. Instead, two sequential alignment blocks have discontinuous positions on the same reference chromosome.
From a schema perspective, we can change the schema to mappingQuality instead of alignmentQuality. However, I would be interested in hearing the original rationale for deciding to assign a mapping quality to linear alignments vs. an entire chimeric alignment. A possible compromise is to maintain a mapping quality per alignment block. I think that this would have the following advantages:
I am not sure that I entirely understand the difference between mapping quality vs. alignment quality. Is the distinction essentially that mapping quality scores the global uniqueness of the alignment of the read, while an alignment quality would score the (possibly weighted) edit distance between the read and the reference sequence it is aligned against?
I suppose that I do not understand how that is used, or how that is supported in the current CIGAR representation. Could you give an example? If this is an important use case, per alignment block, we can have a read start index that describes the position of this alignment block in the read sequence.
I don't think this is a problem, per se. I propose this schema for the mapping, as it puts fewer constraints on mapping, provides cleaner support for split read alignments (which is important for both DNA and RNA reads), and can flexibly be extended to future proposals for describing reference genomes (e.g., the graph schema proposed by @haussler @adamnovak et al). Read stores that implement the GA4GH schemas will just need to provide a translation between this schema and their underlying data format. |
There is approximately 20 years of experience and thought with CIGAR, which has been used by many people for many billions of alignments, not just of sequencing I advise this group in building its first release interface to adopt that experience, as Heng proposes, and not to try to invent something to replace it from scratch. Mapping quality is the confidence that this piece of the read maps to this place in the reference. Alignment quality is the confidence in the local alignment conditional Richard On 26 Apr 2014, at 20:58, Frank Austin Nothaft notifications@github.com wrote:
The Wellcome Trust Sanger Institute is operated by Genome Research |
@richarddurbin Thanks for the thoughts. Indeed, CIGARs have been used successfully for 20+ years. However, CIGARs are are optimized for a text-based world, and there are use cases for which they are not a natural choice. Given our move to a schema, I think we have a great opportunity to rethink how we express alignments. This may lead to a novel data structure, as I have proposed, it may involve a binary implementation of CIGARs, as @lh3 has proposed, or perhaps it will lead to an implementation that is a fusion of the two. There is a rich design space surrounding all of these implementations. In any case, I do not argue against CIGAR, as much as I argue in favor of my implementation, as I feel that it clarifies several vague corner cases when using current CIGARs, and that it is more extensible for future changes in the reference genome or mapping techniques. At the present moment, I believe that it is possible to fully convert CIGAR to my representation (possibly with some small changes, like the soft/hard clip disambiguation discussed above), and that most cases of my representation have a matching CIGAR analog. |
This adds another field, additional complexity.
You are right. It is not necessary to distinguish N and D with your schema.
Say we have a 100bp read bridging an MEI (mobile element insertion) break point. The first 50bp is mapped uniquely (high mapQ) and the second half to an ALT repeat (mapQ=0). We cannot sufficiently describe the mapping confidence with one mapping quality score.
That is a little overkilling and wastes space. For a PacBio read alignment, there will be hundreds of alignmentBlocks. It is not necessary to repeat the same mapping quality, the same alignmentCongtigName and similar alignmentPositions hundreds of times. The right unit to assign a mapping quality and position is a linear alignment.
Traditional aligners such as blast, blat, etc. output local hits that may have overlaps with each other. Suppose we have one 100bp read bridging a translocation break point. If we run blast/blat/bwa-mem/bwa-sw, we will get two hits possibly overlapping on the query sequence. In SAM, this chimeric alignment is described with two linear alignments, for example, {chr1:100,55M45S,mapQ=50} and {chr2:200,40S60M,mapQ=30} (this is why in my schema GARead::alignment is an array). Perhaps in an ideal world we should disallow overlaps in such cases, but mappers are producing them right now.
This adds complexity and increases the chance to produce inconsistent alignment information.
In principle, we can keep adding fields until we achieve all that SAM/my proposal can describe. However, this is becoming too complex. Imagine an alignment parser that has to compare alignmentContigName, alignmentPosition and readPosition across blocks to reconstruct linear alignments, and has to deal with potential unordered and inconsistent cases such as |
In my opinion, this isn't higher complexity than needing to parse for an operator. After more thought, in the approach I have presented, I do not think that the hard clip operator needs to be preserved. Would it be correct to say that the hard clip operator is only used to "mask" bases that are part of a separate alignment in a read which is aligned with a non-linear alignment? This is necessary as CIGARs cannot express non-linear alignment. However, since we can express non-linear alignments with this proposed schema, hard clipping is not needed.
The space overhead of storing the mapping quality per block is generally negligible, especially when we consider the relative size of the mapping quality (~4 bytes per alignment block) with the sum size of the base quality score and base per read (~2 bytes each per base in the read). I would note that we are defining an interchange format, instead of a specific implementation. In ADAM for example, we would pay a negligible penalty for storing mapping quality in this fashion, as we store data in a columnar store with run length encoding. For APIs that are layered on top of BAM, the mapping quality is already O(1) per record on disk anyways.
As I understand in SAM, these overlapping alignments would be emitted as 1 primary and n secondary alignments, no? If my understanding is correct, I don't think it'd be accurate to say that CIGARs currently handle this anyways, and I think they're somewhat out of scope for storing in the alignment.
I think the fundamental difference between our approaches stems around how we express non-linear alignments. I would be interested in hearing more opinions about this. I feel that taking an approach that is more tolerant of non-linear mappings will be necessary as reference genomes start to contain more information that describes variation across populations. |
Anyway, as Richard said, it is probably not a good idea for GA to move away from the proven best practice and to adopt radical changes that lack substantial practical evaluation and are actually questionable. |
I'll have another go at making my point. It is a misrepresentation to say that CIGAR comes from an era of text representation. We should separate the text string from the data model behind it. It is entirely different to want to redesign the data model. There are reasons why each of the operators have been included. Many will not be If we want to introduce a new software package that does something cool, and think it needs a different logical representation, then fine to do so, and Richard On 27 Apr 2014, at 03:41, Frank Austin Nothaft notifications@github.com wrote:
The Wellcome Trust Sanger Institute is operated by Genome Research |
-0 on this specific proposal (but still supportive of the general idea). @fnothaft, thanks for making the discussion concrete -- as always, putting it in code helps make sure we're all talking about the same thing. It sounds like this pull request is doing two things at once: I'm a big fan of (a) in the short term, and would like to see us proceed with that. And I'm a fan of (b) in the long term -- I agree with @fnothaft that the new thinking on graph schemas for variation creates an opportunity for new thinking on alignment. But I don't think that conversation should be rushed. (I'm not qualified to comment on the detailed pros and cons of the proposed new data model -- thanks @lh3 and @richarddurbin for doing so.) I suggest we: |
+1 I support David's suggestions.
|
Thanks all for the feedback; I am coming to see that my proposal may be a little premature. I haven't much experience with PacBio or other alternative read technologies, so I haven't as nuanced of an understanding of the issues that @lh3 and @richarddurbin have brought up. I will table this PR for now. Perhaps the next step moving forward should be for @lh3 to open a PR with the schema he provided above. I'll open an issue to discuss how we can revise alignment representations for the graph based reference, and try to bring @adamnovak and @haussler in on that issue. |
Thanks David and Frank. I would support Heng's proposal. One issue is how to place in the spec the enumeration of the possible operations. If there is not an enum type then I agree that thinking about how to represent alignments against a graph reference is more open, and allows for new approaches. Still I think we Richard On 27 Apr 2014, at 17:23, Frank Austin Nothaft notifications@github.com wrote:
The Wellcome Trust Sanger Institute is operated by Genome Research |
@richarddurbin Avro does support enums, so the intermediate representation should be straightforward. Thanks for the link to the SQG format; I'll review that! The proposal that we've been considering for graph alignment descriptions considers the concept of "runs" in a graph, which makes the graph alignments look more similar to traditional linear alignments. I'm fairly heavily loaded this next week, but I will shoot to get a brief writeup posted next week, that discusses our current thoughts about expressing graph alignments. |
Possible Maven build fix; doc updates; new tests; bug fixed
This pull request replaces the CIGAR and originalBases fields, with a nested mapping structure. There are several advantages to this mapping structure:
This pull request also adds the alignment contig to the schema, which appears to have been missing from previous versions of the schema.
Additionally, I have updated the .gitignore file for emacs users.