-
Notifications
You must be signed in to change notification settings - Fork 112
Conversation
-1. Having implementation optional fields in the schema breaks cross-implementation compatibility. I tend to favor an implementation in the style of #38, as it is semantically richer. I find it moderately preferable to apply sorting as a transformation done lazily on top of the data, instead of expecting data to always come sorted by alignment. In the processing systems that I am working on, I expect to invoke a shuffle once I have my data loaded anyways, so having it come in already sorted isn't a big deal.
In a columnar system with predicate pushdown, this is reasonably efficient even if the fragments aren't sorted/indexed. |
Specific to this pull request:
On #38: I feel like giving reads IDs could solve some of the issues described, I'll update my request to try it out. @fnothaft - sorting can be handy for lighter weight clients (like GABrowse) when there is pagination involved in the response |
We cannot avoid implementation differences. Some information trivially available to one implementation may be hard for others. If we really want to avoid optional fields, we can introduce simplified GALinearAlignment for otherAlignments, but in that case, users have to do a two-pass search to retrieve full mate information even if a column-oriented back can get the info efficiently. I am okay with that if this is the consensus.
Firstly, the performance has not been fully evaluated. Secondly, we need to think about other implementations.
Interesting idea. But then users need a two-pass search even for basic mate information. I don't like two-pass search.
To me, technical correctness is very important to me. :) |
I'm not sure I see why having IDs is preferable (for either reads or alignments). Can you explain this a bit more?
Agreed; my point is that the sorting can be implemented as an intermediate translation between querying the data and receiving the data.
I'm -1 on this; the naming isn't critical when you're talking among people who are very familiar with the implementation, but there are a lot of end users who will be misled by the GARead name. |
From the position of someone who would write software that would pull data from an API like this, I'd seriously rather not have a cross platform API, than to have a cross platform API that is not actually consistent across platforms. If you make a "cross platform" API with core parts that are optional, then the developer using the API needs to add grievously obnoxious exception handling code. I realize that inevitably implementation differences will occur, but I'm strongly against writing implementation differences into our spec. |
Okay, I am proposing a modified spec in the following. If more of us think it is better, I will commit it or its improved form.
You can get all these information from one SAM line, reducing the possibility of implementation differences. |
@fnothaft - an ID allows you to look up a mate precisely, whereas GAMappingPosition or its equivalent forces fuzzy searching on contig/position/name. The fuzzy searching has worked for SAM so far, but an ID eliminates ambiguity. To be clear, I don't think IDs are required here - just an idea. It does get rid of the chimeric lookup issues. @lh3 - that code you put in the comment is now looking much more similar to #51. re naming: then why do alignment records belong to readgroups? that would be confusing to me as an outside user. if we really do want to go with something like alignment record, we should at least change readgroup to be alignmentgroup or something like that. |
@cassiedoll I agree in the read case, but reads already have IDs via the read name (at least, in SAM/BAM and ADAM). My question is more centered around the IDs for alignment; I don't see why that is a useful stage of indirection.
Realistically, linear alignments belong to reads/fragments, which belong to read groups. The middle translation step is implied in current formats; it's a bit of a leaky abstraction. Read groups are built around the lineage of the read --> typically all the reads in a single read group came from the same sequencing lane. |
|
Ah! Good point. This is probably a good question for @lh3, as I'm not sure what the semantics here are. Not all aligners do paired end alignment, so we may not expect the number of secondary alignments for a read to match with it's mate pair. For aligners that do paired end alignment and that emit secondary alignments, do they emit the secondary alignments per read pair? Etc...
A fragment is a longer section of sequence that paired end reads are sequenced from. At least for Illumina, the relationship is read group (sequencing lane) --> fragment (piece of DNA that paired end reads are sequenced from) --> read (single sequence read from a fragment) --> alignment --> linear alignment. |
Both this PR and #51 are attempts to mirror a line in a SAM file. They should be similar. However, they are very different conceptually. #51 is modeling a read, where "GARead::alignments" is an array. A GARead object may represent multiple SAM lines. You will have problems with indexing and sorting in implementation. This PR is modeling a linear alignment, where "GAAlignmentRecord::alignment" is a single alignment. Additional components in your GARead::alignments are described in the "GAAlignmentRecord::otherAlignments" array. A GAAlignmentRecord object always mirrors one line in a SAM file. The current practices on BAM files, which I admit are not always optimal, can be applied to to the model in this PR.
This is not a well defined corner. Bwa does not produce multiple paired hits, though I am sure some other mappers do. I don't know how they report such hits. For multiple secondary pairs, we might have to have an ID field to separate them out.
Yes, I agree. This is the price of following the SAM model. #38 is much cleaner.
I believe the hierarchy is also true for all the popular sequencing technologies so far. |
With respect to naming, what about GAReadAlignment as the name for Heng's GAAlignmentRecord? I am a bit confused. Is there only one of these AlignmentRecords for each read, or is there one for each LinearAlignment, with that one being the I think Cassie is correct that if we go this route we need a read id that is a true id unique in the repository, not just the name. BAM has that within a file I had another alternative, but on reflection I think it is likely to be similar to one of the other proposals, so should read them before commenting on that. Richard On 21 May 2014, at 04:45, Heng Li notifications@github.com wrote:
The Wellcome Trust Sanger Institute is operated by Genome Research |
Sorry. I just read Heng's last post in response to Frank after I sent my last post. Apart from the naming idea I proposed, which still has merit I think, the rest is probably Richard On 21 May 2014, at 19:06, Heng Li notifications@github.com wrote:
The Wellcome Trust Sanger Institute is operated by Genome Research |
I am okay with that.
Yes, one SAM record for each LinearAlignment and one GAAlignmentRecord for each SAM record. If "otherAlignments" is causing confusion, I can remove it and let users decode the SA tag to get other alignments - probably this is better for now.
Yes, SAM only keeps the next mate position, not all mates. I am fine to change matePrimaryPositions[] to nextMatePosition as I have already made GAAlignmentRecord almost identical to SAM, which seems the consensus. |
@lh3 can you update your code with the changes you have proposed? (I'm having a hard time following everything) |
Updated. |
Thank you! I like this much better. |
union { null, boolean } properPlacement = false; // extension of SAM flag 0x2 | ||
union { null, boolean } duplicateFragment = false; // SAM flag 0x400 | ||
int numberReads; // number of reads in the fragment; extension of SAM flag 0x1 | ||
union { null, int } templateLength = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can you bring back the comment for templateLength?
also lets try to use /** */
style comments for all new comments
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In SAM, the precise definition of template length is still in question (blame me for that). I only say "equivalent to TLEN in SAM" in case TLEN is changed in SAM in future.
Conflicts: src/main/resources/avro/reads.avdl
@lh3 , this is feeling good to me -- I think it strikes the right balance between existing SAM and new elegance, and between cost/complexity for backend implementors vs. frontend callers. I'm still trying to make sure I understand a few subtleties of secondary/supplement alignments; assuming no surprises there I'll be +1. |
Hi David, To understand the difference between primaryAligment, secondaryAlignment and supplementAlignment, I will extend Heng’s description in #63 using the following diagram (but the length of the reads in this case would be 300bp rather than 200bp): Suppose Read_1 maps to chr1:10000 (which means chromosome 1 at position 10000), and Read_2 to chr1:10500, and has the optimal alignment and thus best mappingQuality score. Since there is no primaryAlignement flag, one denotes that in
This now is your primary line of the read, and contains the full read sequence and quality score. In SAM the primary line of a read is also used for other purposes. These would be for instance to mark duplicates (MarkDuplicates), convert to Sanger fastq format (SamToFastq), to ensure that information between mate-pairs is synchronized (FixMateInformation), among other things. Secondary Alignment (SAM Flag 0x100)These are other alignments of the same read but might be suboptimal compared to the primary. For instance, the first 100 basepairs (bp) of Read_1 are mapped to chr2:20000, and that the last 100bp of Read_1 are mapped to chr3:30000. You will also notice that Read_2 is mapped to chr3:30500, though it might not be optimal as the one mapped to chr1:10500. Note: Turning this flag on (to true) would alert other tools to not use this read in their analysis. Supplementary Alignment (SAM Flag 0x800)For the supplement alignment, notice that Read_1’s last 100 basepairs (bp) maps in an inverted way around the chromosome 3 position 5000 (chr3:5000) as a set of linear alignments with little overlap, as compared to the one mapped at chr3:30000. This type alignment is useful in denoting chimeric alignments. In SAM it is referred to as the supplementary line. Reference LinksBelow are some reference links that might help: http://seqanswers.com/forums/showthread.php?t=40239 |
// The number of reads in the fragment (extension to SAM flag 0x1) | ||
union { null, int } numberReads = null; | ||
|
||
union { null, int } templateLength = null; // equivalent to TLEN in SAM |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We - as part of 0.5 - should decide on a definition for this field. Even if the SAM spec is kinda ambiguous, we should really strive not to be. (like how we changed species in GAReferenceSequence to be an ncbi_taxon_id)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
suggestion: call this fragmentLength
, and put it next to fragmentName
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for fragmentLength next to fragmentName
On 24 May 2014, at 14:04, David Glazer notifications@github.com wrote:
In src/main/resources/avro/reads.avdl:
- // fragment attributes
- // The fragment name. Equivalent to QNAME (query template name) in SAM.
- string fragmentName;
- // The orientation and the distance between reads from the fragment are
- // consistent with the sequencing protocol (extension to SAM flag 0x2)
- union { null, boolean } properPlacement = false;
- // The fragment is a PCR or optical duplicate (SAM flag 0x400)
- union { null, boolean } duplicateFragment = false;
- // The number of reads in the fragment (extension to SAM flag 0x1)
- union { null, int } numberReads = null;
- union { null, int } templateLength = null; // equivalent to TLEN in SAM
suggestion: call this fragmentLength, and put it next to fragmentName—
Reply to this email directly or view it on GitHub.
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
+1 - as I said earlier, I think this strikes the right balance between existing SAM and new elegance, and between cost/complexity for backend implementors vs. frontend callers. There are a few small things we may want to refine in a later pull request, but this is a big step in the right direction. |
|
||
// The mapping of the primary alignment of the (readNumber+1)%numberReads | ||
// read in the fragment. It replaces mate position and mate strand in SAM. | ||
union { null, GAMappingPosition } nextMatePosition = null; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm a bit worried about pulling positions up this high in the chain of abstractions. If we were to add another alignment type (say, GAGraphAlignment to the fancy new reference graph types from the RefVar schema), we could just have that optionally replace GALinearAlignment wherever it appears. But as soon as we bring (contig, base, strand) positions up into the actual record objects, it makes graph compatibility more complicated. It would be easier for me if this were an ID reference.
A solution where GAGraphAlignment also came with a GAGraphPosition type (and maybe also had GAMappingPosition renamed to GALinearPosition to match GALinearAlignment?) would be another option, but that seems more complex.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
IIUC, in today's world the way you find mates is you look for reads that share a qname and contig, and match the specified position. Not very elegant from first principles, but changing it adds a bunch of burden to repository implementors that I'm very leery of doing in v0.5. (Since they'd have to build a whole new index.)
I'm inclined to leave it as proposed for now, and know we'll be changing it (e.g. by adding alternate options for specifying mate pairs) down the road. I'm open to other suggestions, but every time we tried it got hairy; this approach by @lh3 feels like a nice balance between the past and the future.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
+1 for nextMatePosition from Heng for the reasons given by David Glazer
On 27 May 2014, at 20:19, David Glazer notifications@github.com wrote:
In src/main/resources/avro/reads.avdl:
- // By convention, each read has one and only one alignment with both of the
- // two following flags being flase. The full read sequence and quality
- // should be present in this alignment.
- union { null, boolean } secondaryAlignment = false; // SAM flag 0x100
- union { null, boolean } supplementAlignment = false; // SAM flag 0x800
- // The portion of the read sequence and quality in the alignment. In a
- // supplementary or seconday alignment, alignedSequence and alignedQuality
- // may be shorter than the read sequence and quality, or even absent.
- union { null, string } alignedSequence = null;
- array alignedQuality = [];
- // The mapping of the primary alignment of the (readNumber+1)%numberReads
- // read in the fragment. It replaces mate position and mate strand in SAM.
- union { null, GAMappingPosition } nextMatePosition = null;
IIUC, in today's world the way you find mates is you look for reads that share a qname and contig, and match the specified position. Not very elegant from first principles, but changing it adds a bunch of burden to repository implementors that I'm very leery of doing in v0.5. (Since they'd have to build a whole new index.)I'm inclined to leave it as proposed for now, and know we'll be changing it (e.g. by adding alternate options for specifying mate pairs) down the road. I'm open to other suggestions, but every time we tried it got hairy; this approach by @lh3 feels like a nice balance between the past and the future.
—
Reply to this email directly or view it on GitHub.
The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
-0, I agree nextMatePosition is rather implementation specific, in that it likely assumes you have an efficient way to go from reference positions to reads, as you do in reference sorted read file. One can imagine other ways of storing reads that don't use this reference ordering, but not sure if anyone cares about this.
If mapping to a reference graph structure rather than reference sequence, I think we can imagine how we'd translate the nextMatePosition field to a node within the graph.
As a general comment, I agree with Adam, the more the reads data structures promote the reference in pure sequence terms, the harder any fork will be to move to a more general reference structure.
long position; | ||
boolean reverseStrand; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we move reverseStrand
out of here and into GALinearAlignment
instead?
Rationale: GAMappingPosition
is only used in two places -- GALinearAlignment
(for whom this move would be a nop) and the nextMatePosition
field (where afaik it doesn't matter, since that field is just used to find mates).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This isn't a good abstraction; reverse strandedness is a property of the read itself, not the alignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Strand is the property of an alignment, not a property of a read. It is in GAMappingPosition because the strand of the mate is important and is kept in the SAM file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps I misunderstand, but doesn't the reverse strand flag mean that the read was sequenced from the complimentary strand of a DNA fragment? This would imply that the strand is a property of a read, not an alignment.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@lh3, when you say:
It is in GAMappingPosition because the strand of the mate is important
That's the key question to me -- is it common for someone looking at a particular alignment to need to know its mate's strand, without also needing to know full info about that mate (e.g. flags, bases, qualities)? If so then I withdraw my suggestion (although I would like to understand more about that use case). But if not, then I think it's cleaner to strip down GAMappingPosition
to only contain the information necessary to find the mate.
Re SAM file -- yes, I see that SAM flags contain
0x20: SEQ of the next segment in the template being reversed
And as I say above, if it's commonly used, no problem. But if not, then I think it's cleaner to drop it.
`
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@fnothaft, so since the alignment is in relation to a |
+1 once lh3 confirms that he has tidied up (within 2 hours) On 28 May 2014, at 09:14, Paul Grosu notifications@github.com wrote:
The Wellcome Trust Sanger Institute is operated by Genome Research |
I think this PR is ready for merge. Someone else may do it. |
Replace GARead with GAReadAlignment
Done! |
Added script to convert test data from text to binary
(This is a refinement of #47, which has been closed because I mistakenly created a branch on the GA4GH repository and sent the pull request from that branch.)
In a SAM file, each record line describes a linear alignment annotated with the fragment and read attributes as well as limited and duplicated information about the mate and other linear alignments if the alignment is chimeric. GAAlignmentRecord in this PR closely mirrors and extends a SAM record. It optionally allows more information, such as the mate cigar and mate sequence, to be retrieved when the read store backend supports such operations efficiently.
#38 and #51 are two alternatives to this PR. #38 explicitly describes the concept hierarchy: Fragment <- Read <- Alignment <- LinearAlignment. However, in #38, a Fragment may contain >1 alignments that may be distant from each other. Retrieving a full Fragment object may either require to duplicate data in large scale or incur frequent random access which is slow. I couldn't think of an efficient implementation. #51 takes Read as the primary object. It unfortunately has the same problem with #38: a Read may contain >1 alignments distant from each others. #38 and #51 are also incompatible with readmethods.avdl which requires sorted GARead - reads cannot be sorted; only alignments can.