This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Replaced CIGAR with descriptive nested alignment structure. #30

Closed
wants to merge 1 commit into from


fnothaft
Contributor

This pull request replaces the CIGAR and originalBases fields with a nested mapping structure. There are several advantages to this mapping structure:

  • Since each alignment block contains its start position and reference name, it is easy to express alignments for reads that are split mapped.
  • This structure addresses an ambiguity in the CIGAR/alignment spec, which does not clearly document what the alignment start position should be for a read that starts with an insertion.

This pull request also adds the alignment contig to the schema, which appears to have been missing from previous versions of the schema.

Additionally, I have updated the .gitignore file for emacs users.

@lh3
Member

lh3 commented Apr 26, 2014

There are a few issues:

  1. It seems that we cannot describe all CIGAR operations: soft clipping (S) and hard clipping (H) seem to be the same; deletions (D) and reference skips (N) are not distinguished.
  2. The quality in SAM is mapping quality, not alignment quality. Mapping and alignment are different. For now, mappers usually assign a mapping quality to each linear alignment, not to an entire chimeric or fusion alignment (see the SAM spec about linear and chimeric alignments).
  3. The schema requires blocks to have no overlaps on the read as I understand, which is not flexible enough.
  4. It is not so straightforward to convert between CIGAR and this representation. This representation also describes cases not representable with CIGAR. For example, if the first block is "chr1:100,blockLength=20" and the next block is "chr1:200" or "chr1:90", I am not sure what the alignment should be. We should try to avoid a representation allowing inconsistencies, if possible.

I would suggest the following change if we are unhappy with a CIGAR string:

record GACIGARUnit {
    int operation; // can be replaced with enum;
    int operationLength;
    union {null, string} contigSequence = null; // the contig seq at mismatches (X) and dels (D)
}
record GALinearAlignment {
    string contig;
    long leftPosition;
    int mappingQuality;
    array<GACIGARUnit> cigar = []; 
}
record GARead {
    array<GALinearAlignment> alignment = [];
    array<GALinearAlignment> mateAlignment = [];
    ...
}

I don't know if it is a good idea to give the full chimeric alignment (using an array) and to describe the full mateAlignment in GARead. I am neutral.
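For readers who want to experiment with the record layout sketched above, here is a minimal Python rendering of it. This is only an illustration, not part of the proposal: the snake_case field names, the choice of a one-letter string for `operation` (the proposal uses an int), and the helpers `parse_cigar` and `format_cigar` are assumptions made for this sketch.

```python
import re
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class GACIGARUnit:
    operation: str                         # one-letter CIGAR operator (assumption: the proposal uses an int)
    operation_length: int
    contig_sequence: Optional[str] = None  # reference bases at mismatches (X) and deletions (D)


@dataclass
class GALinearAlignment:
    contig: str
    left_position: int
    mapping_quality: int
    cigar: List[GACIGARUnit] = field(default_factory=list)


# One CIGAR unit is a decimal length followed by one operator letter.
_CIGAR_UNIT = re.compile(r"(\d+)([MIDNSHP=X])")


def format_cigar(units: List[GACIGARUnit]) -> str:
    """Serialize structured units back into a CIGAR string."""
    return "".join(f"{u.operation_length}{u.operation}" for u in units)


def parse_cigar(text: str) -> List[GACIGARUnit]:
    """Split a CIGAR string into structured units, rejecting anything that does not round-trip."""
    units = [GACIGARUnit(op, int(n)) for n, op in _CIGAR_UNIT.findall(text)]
    if format_cigar(units) != text:
        raise ValueError(f"malformed CIGAR string: {text!r}")
    return units
```

With these definitions, `GALinearAlignment("chr1", 100, 50, parse_cigar("55M45S"))` holds the same information as the SAM fields RNAME, POS, MAPQ and CIGAR for one linear alignment, and `format_cigar` recovers the original string.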

@fnothaft
Contributor Author

@lh3 thanks for the comments. Here are my thoughts; I would appreciate further feedback.

It seems that we cannot describe all CIGAR operations: soft clipping (S) and hard clipping (H) seem to be the same; deletions (D) and reference skips (N) are not distinguished.

Your point about soft vs. hard clipping is correct. However, this could be addressed by adding a flag to distinguish hard vs. soft clips.

I don't intend for deletions and reference skips to be represented the same way. There is no analog for a reference skip (CIGAR N) in this schema. Instead, two sequential alignment blocks have discontinuous positions on the same reference chromosome.
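As a rough illustration of the idea in the previous paragraph, the sketch below unfolds a spliced CIGAR into blocks with discontinuous reference positions. The `Block` type and its field names are hypothetical (the pull request's actual schema is not reproduced in this thread), and only M, =, X and N are handled; how indels and clips would map into blocks is left open.

```python
import re
from typing import List, NamedTuple


class Block(NamedTuple):
    contig: str     # hypothetical field names, chosen for this sketch only
    ref_start: int
    length: int


def blocks_from_spliced_cigar(contig: str, start: int, cigar: str) -> List[Block]:
    """Turn a CIGAR made of M/=/X and N operators into reference-anchored blocks."""
    blocks: List[Block] = []
    ref_pos = start
    for n_text, op in re.findall(r"(\d+)([A-Z=])", cigar):
        n = int(n_text)
        if op in ("M", "=", "X"):
            if blocks and blocks[-1].ref_start + blocks[-1].length == ref_pos:
                # Contiguous with the previous block: extend it instead of starting a new one.
                last = blocks.pop()
                blocks.append(last._replace(length=last.length + n))
            else:
                blocks.append(Block(contig, ref_pos, n))
            ref_pos += n
        elif op == "N":
            # A reference skip leaves no block; the next block simply starts further along.
            ref_pos += n
        else:
            raise NotImplementedError(f"operator {op!r} is outside the scope of this sketch")
    return blocks
```

For example, `blocks_from_spliced_cigar("chr1", 100, "50M1000N50M")` yields one block at chr1:100 and one at chr1:1150, each of length 50.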

The quality in SAM is mapping quality, not alignment quality. Mapping and alignment are different. For now, mappers usually assign a mapping quality to each linear alignment, not to an entire chimeric or fusion alignment (see the SAM spec about linear and chimeric alignments).

From a schema perspective, we can change the schema to mappingQuality instead of alignmentQuality. However, I would be interested in hearing the original rationale for deciding to assign a mapping quality to linear alignments vs. an entire chimeric alignment. A possible compromise is to maintain a mapping quality per alignment block. I think that this would have the following advantages:

  • Aligners that implement the current mapping quality scoring approach would assign the same quality score to all alignment blocks that are part of a continuous linear alignment
  • Future mappers would have the flexibility to implement this scoring scheme, or to implement a scheme that assigns a mapping quality to the entire alignment.

I am not sure that I entirely understand the difference between mapping quality vs. alignment quality. Is the distinction essentially that mapping quality scores the global uniqueness of the alignment of the read, while an alignment quality would score the (possibly weighted) edit distance between the read and the reference sequence it is aligned against?

The schema requires blocks to have no overlaps on the read as I understand, which is not flexible enough.

I suppose that I do not understand how that is used, or how that is supported in the current CIGAR representation. Could you give an example? If this is an important use case, per alignment block, we can have a read start index that describes the position of this alignment block in the read sequence.

It is not so straightforward to convert between CIGAR and this representation. This representation also describes cases not representable with CIGAR. For example, if the first block is "chr1:100,blockLength=20" and the next block is "chr1:200" or "chr1:90", I am not sure what the alignment should be. We should try to avoid a representation allowing inconsistencies, if possible.

I don't think this is a problem, per se. I propose this schema for the mapping, as it puts fewer constraints on mapping, provides cleaner support for split read alignments (which is important for both DNA and RNA reads), and can flexibly be extended to future proposals for describing reference genomes (e.g., the graph schema proposed by @haussler @adamnovak et al). Read stores that implement the GA4GH schemas will just need to provide a translation between this schema and their underlying data format.

@richarddurbin
Contributor

There is approximately 20 years of experience and thought with CIGAR, which has been used by many people for many billions of alignments, not just of sequencing
reads but also of other biological sequences to each other.

I advise this group in building its first release interface to adopt that experience, as Heng proposes, and not to try to invent something to replace it from scratch.
At a minimum you are going to want to translate from and back to CIGAR so as to talk to the rest of the world.

Mapping quality is the confidence that this piece of the read maps to this place in the reference. Alignment quality is the confidence in the local alignment conditional
on the position being correct. They are separable. With nearby indels and mismatches anchored by exact match flanks you may have low alignment confidence but high
mapping confidence. With a short exact match that looks like other places in the genome you might have low mapping confidence but high alignment confidence.
There is a per-base measure of alignment quality, known by its BAM tag name BAQ (introduced by Heng) which can be used to adjust call confidence in hard to align regions
(though another approach to the same problem is taken by callers that reassemble alternate haplotypes).
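To make the mapping-quality half of this distinction concrete: the SAM specification defines MAPQ on a Phred scale, as -10 log10 of the probability that the mapping position is wrong, rounded to the nearest integer, with 255 meaning "not available". The conversion can be sketched as below. The function names and the default cap of 60 are assumptions of this sketch; aligners differ in the maximum value they emit.

```python
import math


def mapq_to_error_probability(mapq: int) -> float:
    """Probability that the mapping position is wrong, given a Phred-scaled MAPQ."""
    return 10 ** (-mapq / 10)


def error_probability_to_mapq(p_wrong: float, cap: int = 60) -> int:
    """Phred-scale a mapping error probability; `cap` is an assumed upper bound."""
    if p_wrong <= 0:
        return cap
    return min(cap, round(-10 * math.log10(p_wrong)))
```

A MAPQ of 0 therefore corresponds to an error probability of 1, which is why it is commonly used for reads that map equally well to several places.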

Richard

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@fnothaft
Contributor Author

@richarddurbin Thanks for the thoughts. Indeed, CIGARs have been used successfully for 20+ years. However, CIGARs are optimized for a text-based world, and there are use cases for which they are not a natural choice. Given our move to a schema, I think we have a great opportunity to rethink how we express alignments. This may lead to a novel data structure, as I have proposed, it may involve a binary implementation of CIGARs, as @lh3 has proposed, or perhaps it will lead to an implementation that is a fusion of the two. There is a rich design space surrounding all of these implementations.

In any case, I do not argue against CIGAR, as much as I argue in favor of my implementation, as I feel that it clarifies several vague corner cases when using current CIGARs, and that it is more extensible for future changes in the reference genome or mapping techniques. At the present moment, I believe that it is possible to fully convert CIGAR to my representation (possibly with some small changes, like the soft/hard clip disambiguation discussed above), and that most cases of my representation have a matching CIGAR analog.

@lh3
Member

lh3 commented Apr 27, 2014

Your point about soft vs. hard clipping is correct. However, this could be addressed by adding a flag to distinguish hard vs. soft clips.

This adds another field, additional complexity.

I don't intend for deletions and reference skips to be represented the same way. There is no analog for a reference skip (CIGAR N) in this schema. Instead, two sequential alignment blocks have discontinuous positions on the same reference chromosome.

You are right. It is not necessary to distinguish N and D with your schema.

From a schema perspective, we can change the schema to mappingQuality instead of alignmentQuality. However, I would be interested in hearing the original rationale for deciding to assign a mapping quality to linear alignments vs. an entire chimeric alignment.

Say we have a 100bp read bridging an MEI (mobile element insertion) break point. The first 50bp is mapped uniquely (high mapQ) and the second half to an ALT repeat (mapQ=0). We cannot sufficiently describe the mapping confidence with one mapping quality score.

A possible compromise is to maintain a mapping quality per alignment block.

That is a little overkill and wastes space. For a PacBio read alignment, there will be hundreds of alignmentBlocks. It is not necessary to repeat the same mapping quality, the same alignmentContigName and similar alignmentPositions hundreds of times. The right unit to assign a mapping quality and position is a linear alignment.

I suppose that I do not understand how that is used, or how that is supported in the current CIGAR representation. Could you give an example?

Traditional aligners such as blast, blat, etc. output local hits that may have overlaps with each other. Suppose we have one 100bp read bridging a translocation break point. If we run blast/blat/bwa-mem/bwa-sw, we will get two hits possibly overlapping on the query sequence. In SAM, this chimeric alignment is described with two linear alignments, for example, {chr1:100,55M45S,mapQ=50} and {chr2:200,40S60M,mapQ=30} (this is why in my schema GARead::alignment is an array). Perhaps in an ideal world we should disallow overlaps in such cases, but mappers are producing them right now.
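The overlap in this example can be checked mechanically. The sketch below computes which part of the read each CIGAR covers; `query_span` and `query_overlap` are names invented for this illustration. It assumes both CIGARs are expressed against the same orientation of the read, and it counts hard-clipped bases so that coordinates refer to the original, unclipped read.

```python
import re
from typing import Tuple


def query_span(cigar: str) -> Tuple[int, int]:
    """Half-open [start, end) interval of the read covered by the aligned part of a CIGAR."""
    ops = [(int(n), op) for n, op in re.findall(r"(\d+)([MIDNSHP=X])", cigar)]
    start = 0
    for n, op in ops:
        if op in ("S", "H"):
            start += n       # leading clips shift the start of the aligned part
        else:
            break
    # M, I, = and X consume read bases between the clips.
    end = start + sum(n for n, op in ops if op in ("M", "I", "=", "X"))
    return start, end


def query_overlap(cigar_a: str, cigar_b: str) -> int:
    """Number of read bases covered by both alignments."""
    a_start, a_end = query_span(cigar_a)
    b_start, b_end = query_span(cigar_b)
    return max(0, min(a_end, b_end) - max(a_start, b_start))
```

Here `query_span("55M45S")` is `(0, 55)` and `query_span("40S60M")` is `(40, 100)`, so read bases 40 to 54 (15 bases) are claimed by both linear alignments.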

If this is an important use case, per alignment block, we can have a read start index that describes the position of this alignment block in the read sequence.

This adds complexity and increases the chance to produce inconsistent alignment information.

I propose this schema for the mapping, as it puts fewer constraints on mapping, provides cleaner support for split read alignments, and can flexibly be extended to future proposals for describing reference genomes.

In principle, we can keep adding fields until we achieve all that SAM/my proposal can describe. However, this is becoming too complex. Imagine an alignment parser that has to compare alignmentContigName, alignmentPosition and readPosition across blocks to reconstruct linear alignments, and has to deal with potential unordered and inconsistent cases such as [{chr1:100,blockLength=20,readPos=0}, {chr2:200,blockLength=10,readPos=10}, {chr1:10,blockLength=10,readPos=5}]. This is not necessary. I believe my proposal can represent all meaningful alignments that can be represented by your proposal. It is closer to our current practice and is more compact. Note that my proposal is essentially the way SAM represents complex alignments.
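The kind of consistency checking described above might look like the following sketch. The `Block` fields mirror the names used in this discussion (contig, position, blockLength, readPos) but are otherwise hypothetical, and the sketch assumes that blockLength counts read bases.

```python
from typing import List, NamedTuple, Tuple


class Block(NamedTuple):
    contig: str        # alignmentContigName in the discussion
    ref_pos: int       # alignmentPosition
    block_length: int  # assumed to count read bases
    read_pos: int      # readPosition


def read_overlaps(blocks: List[Block]) -> List[Tuple[int, int]]:
    """Index pairs (i, j), i < j, of blocks whose intervals on the read overlap."""
    pairs = []
    for i in range(len(blocks)):
        for j in range(i + 1, len(blocks)):
            a, b = blocks[i], blocks[j]
            if a.read_pos < b.read_pos + b.block_length and b.read_pos < a.read_pos + a.block_length:
                pairs.append((i, j))
    return pairs


def is_read_ordered(blocks: List[Block]) -> bool:
    """True when blocks are listed in non-decreasing order of read position."""
    return all(x.read_pos <= y.read_pos for x, y in zip(blocks, blocks[1:]))


example = [
    Block("chr1", 100, 20, 0),
    Block("chr2", 200, 10, 10),
    Block("chr1", 10, 10, 5),
]
```

For `example`, which encodes the three blocks listed above, every pair of blocks overlaps on the read and the list is not in read order, so a parser would have to decide how to interpret it before any linear alignment could be reconstructed.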

@fnothaft
Contributor Author

Your point about soft vs. hard clipping is correct. However, this could be addressed by adding a flag to distinguish hard vs. soft clips.
This adds another field, additional complexity.

In my opinion, this isn't higher complexity than needing to parse for an operator. After more thought, in the approach I have presented, I do not think that the hard clip operator needs to be preserved. Would it be correct to say that the hard clip operator is only used to "mask" bases that are part of a separate alignment in a read which is aligned with a non-linear alignment? This is necessary as CIGARs cannot express non-linear alignment. However, since we can express non-linear alignments with this proposed schema, hard clipping is not needed.

I don't intend for deletions and reference skips to be represented the same way. There is no analog for a reference skip (CIGAR N) in this schema. Instead, two sequential alignment blocks have discontinuous positions on the same reference chromosome.
You are right. It is not necessary to distinguish N and D with your schema.

From a schema perspective, we can change the schema to mappingQuality instead of alignmentQuality. However, I would be interested in hearing the original rationale for deciding to assign a mapping quality to linear alignments vs. an entire chimeric alignment.
Say we have a 100bp read bridging an MEI (mobile element insertion) break point. The first 50bp is mapped uniquely (high mapQ) and the second half to an ALT repeat (mapQ=0). We cannot sufficiently describe the mapping confidence with one mapping quality score.

A possible compromise is to maintain a mapping quality per alignment block.
That is a little overkill and wastes space. For a PacBio read alignment, there will be hundreds of alignmentBlocks. It is not necessary to repeat the same mapping quality, the same alignmentContigName and similar alignmentPositions hundreds of times. The right unit to assign a mapping quality and position is a linear alignment.

The space overhead of storing the mapping quality per block is generally negligible, especially when we compare the size of the mapping quality (~4 bytes per alignment block) with the combined size of the bases and base quality scores (~2 bytes each per base in the read). I would note that we are defining an interchange format, not a specific implementation. In ADAM, for example, we would pay a negligible penalty for storing mapping quality in this fashion, as we store data in a columnar store with run-length encoding. For APIs that are layered on top of BAM, the mapping quality is already O(1) per record on disk anyway.
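To illustrate why run-length encoding makes a repeated per-block value cheap to store, here is a toy encoder. It is a generic sketch, not ADAM's actual storage code; the function names and the example values are made up for illustration.

```python
from itertools import groupby
from typing import List, Tuple


def run_length_encode(values: List[int]) -> List[Tuple[int, int]]:
    """Collapse consecutive equal values into (value, run_length) pairs."""
    return [(value, sum(1 for _ in group)) for value, group in groupby(values)]


def run_length_decode(runs: List[Tuple[int, int]]) -> List[int]:
    """Expand (value, run_length) pairs back into the original list."""
    return [value for value, count in runs for _ in range(count)]


# Hypothetical long read split into 302 blocks: 300 share one mapping quality, 2 share another.
per_block_mapq = [60] * 300 + [0] * 2
encoded = run_length_encode(per_block_mapq)
```

The 302 per-block values collapse to two runs, `(60, 300)` and `(0, 2)`, so the stored size depends on how often the mapping quality changes rather than on the number of blocks.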

I suppose that I do not understand how that is used, or how that is supported in the current CIGAR representation. Could you give an example?
Traditional aligners such as blast, blat, etc. output local hits that may have overlaps with each other. Suppose we have one 100bp read bridging a translocation break point. If we run blast/blat/bwa-mem/bwa-sw, we will get two hits possibly overlapping on the query sequence. In SAM, this chimeric alignment is described with two linear alignments, for example, {chr1:100,55M45S,mapQ=50} and {chr2:200,40S60M,mapQ=30} (this is why in my schema GARead::alignment is an array). Perhaps in an ideal world we should disallow overlaps in such cases, but mappers are producing them right now.

If this is an important use case, per alignment block, we can have a read start index that describes the position of this alignment block in the read sequence.
This adds complexity and increases the chance to produce inconsistent alignment information.

As I understand it, in SAM these overlapping alignments would be emitted as 1 primary and n secondary alignments, no? If my understanding is correct, I don't think it'd be accurate to say that CIGARs currently handle this anyway, and I think they're somewhat out of scope for storing in the alignment.

I propose this schema for the mapping, as it puts fewer constraints on mapping, provides cleaner support for split read alignments, and can flexibly be extended to future proposals for describing reference genomes.
In principle, we can keep adding fields until we achieve all that SAM/my proposal can describe. However, this is becoming too complex. Imagine an alignment parser that has to compare alignmentContigName, alignmentPosition and readPosition across blocks to reconstruct linear alignments, and has to deal with potential unordered and inconsistent cases such as [{chr1:100,blockLength=20,readPos=0}, {chr2:200,blockLength=10,readPos=10}, {chr1:10,blockLength=10,readPos=5}]. This is not necessary. I believe my proposal can represent all meaningful alignments that can be represented by your proposal. It is closer to our current practice and is more compact. Note that my proposal is essentially the way SAM represents complex alignments.

I think the fundamental difference between our approaches lies in how we express non-linear alignments. I would be interested in hearing more opinions about this. I feel that taking an approach that is more tolerant of non-linear mappings will be necessary as reference genomes start to contain more information that describes variation across populations.

@lh3
Member

lh3 commented Apr 27, 2014

Anyway, as Richard said, it is probably not a good idea for GA to move away from the proven best practice and to adopt radical changes that lack substantial practical evaluation and are actually questionable.

@richarddurbin
Contributor

I'll have another go at making my point.

It is a misrepresentation to say that CIGAR comes from an era of text representation. We should separate the text string from the data model behind it.
As Heng illustrates there are already binary representations of the same data model used within software systems, including for example within aligners,
packages such as samtools and gatk, and database systems such as Ensembl or the UCSC browser, and these binary representations have existed for
a long time. The text string is only a way of exchanging the information. Fine to develop another such way that is more natural for the web interface world.
That is what I thought the goal of this effort was.

It is entirely different to want to redesign the data model. There are reasons why each of the operators has been included. Many will not be
apparent from looking at a limited number of Illumina BAMs. More complex scenarios with DNA include long reads with lots of indel errors such as PacBio,
plus structural variation, plus cDNA alignments. For Complete Genomics data the read may match overlapping fragments on the genome. More
general CIGAR operations not used in BAM support protein alignments to DNA. CIGAR was developed from a previous era where each program/system
had its own internal binary representation of alignments, and in some cases a textual exchange format; e.g. BLAST - you can look at the formats for that in
binary, ASN.1 and text - and BLAT from Jim Kent, which was the most efficient aligner for DNA for some time, both had blocks somewhat like your new
proposal. In the context of genome databases that interacted with many different software systems CIGAR was progressively developed to capture the
types of relationship that are required in any case/by any software. I know I am not on top of all the arguments myself, and so I would be wary of getting
rid of things without doing a lot of reading and evaluation and consulting. For example consider the difference between hard and soft clipping. At a basic
level I can think of wanting to hard clip where there is data that really shouldn't be there with respect to the alignment, e.g. when there are bases that come from
a clean-room tag used when building an ancient DNA sequencing library. Then I might soft clip when the data quality gets low at the end of a read so I
don't want to make a statement about the alignment. If you look at the output of bwa-mem, the latest version of BWA, with 250bp reads that span
structural variants then you will see quite subtle use of both soft and hard clipping.

If we want to introduce a new software package that does something cool, and think it needs a different logical representation, then fine to do so, and
if people take it up they will work out how to interface to other things. But if our aim is to provide a modern web style interface for the existing world
of DNA alignments then I think we should work to understand that and represent it faithfully, rather than tell that world that they should tear up their
data model and adopt a different one that doesn't have a track record.

Richard


@dglazer
Member

dglazer commented Apr 27, 2014

-0 on this specific proposal (but still supportive of the general idea). @fnothaft, thanks for making the discussion concrete -- as always, putting it in code helps make sure we're all talking about the same thing.

It sounds like this pull request is doing two things at once:
a) syntax: shifting from a text-centric representation to a web-API-friendly structured representation
b) semantics: introducing a new data model for alignments

I'm a big fan of (a) in the short term, and would like to see us proceed with that. And I'm a fan of (b) in the long term -- I agree with @fnothaft that the new thinking on graph schemas for variation creates an opportunity for new thinking on alignment. But I don't think that conversation should be rushed. (I'm not qualified to comment on the detailed pros and cons of the proposed new data model -- thanks @lh3 and @richarddurbin for doing so.)

I suggest we:
a) come up with a new pull request that proposes a new representation for today's model
b) continue the discussion of new data models at a more thoughtful pace, probably tied to the discussion of the graph schema proposed by @haussler @adamnovak et al., and therefore probably in a different venue.

@delagoya
Contributor

+1 I support David's suggestions.


@fnothaft
Contributor Author

Thanks all for the feedback; I am coming to see that my proposal may be a little premature. I don't have much experience with PacBio or other alternative read technologies, so my understanding of the issues that @lh3 and @richarddurbin have brought up is not as nuanced as theirs.

I will table this PR for now. Perhaps the next step should be for @lh3 to open a PR with the schema he provided above. I'll open an issue to discuss how we can revise alignment representations for the graph-based reference, and try to bring @adamnovak and @haussler in on that issue.

@fnothaft fnothaft closed this Apr 27, 2014
@richarddurbin
Contributor

Thanks David and Frank.

I would support Heng's proposal. One issue is how to place in the spec the enumeration of the possible operations. If there is not an enum type then I guess this should be in a comment, explaining how each integer corresponds to a CIGAR operator and briefly what it means.
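
As a rough illustration of the integer-to-operator mapping described above, here is a minimal Python sketch. The integer codes follow the BAM encoding of CIGAR operators in the SAM specification (`MIDNSHP=X` maps to 0 through 8); the class name `CigarOp`, the member names, and the two helper functions are hypothetical and are not taken from any schema in this repository.

```python
from enum import IntEnum


class CigarOp(IntEnum):
    """CIGAR operators, numbered as in the BAM encoding (hypothetical names)."""
    ALIGNMENT_MATCH = 0    # M: aligned position, base may match or mismatch
    INSERTION = 1          # I: bases present in the read but not the reference
    DELETION = 2           # D: reference bases absent from the read
    SKIP = 3               # N: skipped reference region (e.g. an intron)
    SOFT_CLIP = 4          # S: clipped bases that are kept in the read sequence
    HARD_CLIP = 5          # H: clipped bases that are dropped from the read sequence
    PAD = 6                # P: silent deletion from a padded reference
    SEQUENCE_MATCH = 7     # =: aligned position where the bases match
    SEQUENCE_MISMATCH = 8  # X: aligned position where the bases differ


# Position in this string equals the integer code of the operator.
CIGAR_CHARS = "MIDNSHP=X"


def op_to_char(op: CigarOp) -> str:
    """Return the single-character CIGAR symbol for an operator."""
    return CIGAR_CHARS[int(op)]


def char_to_op(char: str) -> CigarOp:
    """Return the operator for a CIGAR symbol; raises ValueError if unknown."""
    return CigarOp(CIGAR_CHARS.index(char))
```

With a real enum type, soft and hard clipping, and deletions and reference skips, remain distinct values, so the per-integer comment becomes documentation rather than the only source of the mapping.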

I agree that thinking about how to represent alignments against a graph reference is more open, and allows for new approaches. Still I think we should build on what we have learnt for linear references. One open issue is whether one needs separate blocks for each graph component/contig, or can proceed through a sequence of adjacent components in a single contiguous alignment block, somehow. It feels like one should be able to differentiate between blocks that are adjacent and consistent, and ones that are not. (I am using the word "component" because I can't remember whether in Ben and David's proposal they are nodes or edges - there are dual representations possible.) I had some comments related to this in the record-based SQG format proposed a couple of years ago at ftp://ftp.sanger.ac.uk/pub/rd/111128.SQG_specification_1.0.docx, but this never got a useful implementation and never caught on.
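
To make the "adjacent and consistent" distinction concrete for the linear case only, here is a small Python sketch. It assumes one possible definition: two blocks are adjacent and consistent when they lie on the same contig and the second starts exactly where the first ends on the reference. The `Block` type, its field names, and the function name are hypothetical; a graph reference would need a notion of adjacency between components instead.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Block:
    """A hypothetical alignment block on a linear reference."""
    contig: str      # reference sequence name, e.g. "chr1"
    ref_start: int   # start position of the block on the reference
    ref_length: int  # number of reference bases the block spans


def is_adjacent_and_consistent(prev: Block, nxt: Block) -> bool:
    """True if nxt continues prev on the same contig with no gap or overlap.

    This is an assumed definition for illustration, not one from the proposal.
    """
    same_contig = prev.contig == nxt.contig
    contiguous = nxt.ref_start == prev.ref_start + prev.ref_length
    return same_contig and contiguous


first = Block("chr1", 100, 20)
print(is_adjacent_and_consistent(first, Block("chr1", 120, 30)))  # True
print(is_adjacent_and_consistent(first, Block("chr1", 200, 30)))  # False: gap
print(is_adjacent_and_consistent(first, Block("chr1", 90, 30)))   # False: overlap
```

The last two cases correspond to the "chr1:200" and "chr1:90" follow-on blocks raised earlier in the thread; under this assumed definition both would be flagged as not continuing the first block.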

Richard

@fnothaft
Contributor Author

@richarddurbin Avro does support enums, so the intermediate representation should be straightforward.

Thanks for the link to the SQG format; I'll review that! The proposal that we've been considering for graph alignment descriptions uses the concept of "runs" in a graph, which makes graph alignments look more similar to traditional linear alignments. I'm fairly heavily loaded this next week, but I will shoot to get a brief writeup posted next week that discusses our current thoughts about expressing graph alignments.

dcolligan pushed a commit to dcolligan/ga4gh-schemas that referenced this pull request Jul 20, 2016
Possible Maven build fix; doc updates; new tests; bug fixed