Replace GARead with GAReadAlignment #60

lh3 · 2014-05-21T02:50:42Z

(This is a refinement of #47, which has been closed because I mistakenly created a branch on the GA4GH repository and sent the pull request from that branch.)

In a SAM file, each record line describes a linear alignment annotated with the fragment and read attributes as well as limited and duplicated information about the mate and other linear alignments if the alignment is chimeric. GAAlignmentRecord in this PR closely mirrors and extends a SAM record. It optionally allows more information, such as the mate cigar and mate sequence, to be retrieved when the read store backend supports such operations efficiently.
#38 and #51 are two alternatives to this PR. #38 explicitly describes the concept hierarchy: Fragment <- Read <- Alignment <- LinearAlignment. However, in #38, a Fragment may contain >1 alignments that may be distant from each other. Retrieving a full Fragment object may either require to duplicate data in large scale or incur frequent random access which is slow. I couldn't think of an efficient implementation. #51 takes Read as the primary object. It unfortunately has the same problem with #38: a Read may contain >1 alignments distant from each others. #38 and #51 are also incompatible with readmethods.avdl which requires sorted GARead - reads cannot be sorted; only alignments can.

fnothaft · 2014-05-21T03:03:40Z

It optionally allows more information, such as the mate cigar and mate sequence, to be retrieved when the read store backend supports such operations efficiently.

-1. Having implementation optional fields in the schema breaks cross-implementation compatibility.

I tend to favor an implementation in the style of #38, as it is semantically richer. I find it moderately preferable to apply sorting as a transformation done lazily on top of the data, instead of expecting data to always come sorted by alignment. In the processing systems that I am working on, I expect to invoke a shuffle once I have my data loaded anyways, so having it come in already sorted isn't a big deal.

Retrieving a full Fragment object may either require to duplicate data in large scale or incur frequent random access which is slow. I couldn't think of an efficient implementation.

In a columnar system with predicate pushdown, this is reasonably efficient even if the fragments aren't sorted/indexed.

cassiedoll · 2014-05-21T03:13:29Z

Specific to this pull request:

I'd rather give your alignments object an ID and have array<array<GALinearAlignment>> otherAlignments change to array<string> otherAlignmentIds.
I also like the name GARead better than GAAlignmentRecord - even if alignmentRecord is technically correct :)

On #38: I feel like giving reads IDs could solve some of the issues described, I'll update my request to try it out.

@fnothaft - sorting can be handy for lighter weight clients (like GABrowse) when there is pagination involved in the response

lh3 · 2014-05-21T03:22:07Z

-1. Having implementation optional fields in the schema breaks cross-implementation compatibility.

We cannot avoid implementation differences. Some information trivially available to one implementation may be hard for others. If we really want to avoid optional fields, we can introduce simplified GALinearAlignment for otherAlignments, but in that case, users have to do a two-pass search to retrieve full mate information even if a column-oriented back can get the info efficiently. I am okay with that if this is the consensus.

In a columnar system with predicate pushdown, this is reasonably efficient even if the fragments aren't sorted/indexed.

Firstly, the performance has not been fully evaluated. Secondly, we need to think about other implementations.

I'd rather give your alignments object an ID and have array<array> otherAlignments change to array otherAlignmentIds.

Interesting idea. But then users need a two-pass search even for basic mate information. I don't like two-pass search.

I also like the name GARead better than GAAlignmentRecord - even if alignmentRecord is technically correct :)

To me, technical correctness is very important to me. :)

fnothaft · 2014-05-21T03:24:43Z

On #38: I feel like giving reads IDs could solve some of the issues described, I'll update my request to try it out.

I'm not sure I see why having IDs is preferable (for either reads or alignments). Can you explain this a bit more?

@fnothaft - sorting can be handy for lighter weight clients (like GABrowse) when there is pagination involved in the response

Agreed; my point is that the sorting can be implemented as an intermediate translation between querying the data and receiving the data.

I also like the name GARead better than GAAlignmentRecord - even if alignmentRecord is technically correct :)

I'm -1 on this; the naming isn't critical when you're talking among people who are very familiar with the implementation, but there are a lot of end users who will be misled by the GARead name.

fnothaft · 2014-05-21T03:32:05Z

We cannot avoid implementation differences. Some information trivially available to one implementation may be hard for others. If we really want to avoid optional fields, we can introduce simplified GALinearAlignment for otherAlignments, but in that case, users have to do a two-pass search to retrieve full mate information even if a column-oriented back can get the info efficiently. I am okay with that if this is the consensus.

From the position of someone who would write software that would pull data from an API like this, I'd seriously rather not have a cross platform API, than to have a cross platform API that is not actually consistent across platforms. If you make a "cross platform" API with core parts that are optional, then the developer using the API needs to add grievously obnoxious exception handling code.

I realize that inevitably implementation differences will occur, but I'm strongly against writing implementation differences into our spec.

lh3 · 2014-05-21T03:45:11Z

From the position of someone who would write software that would pull data from an API like this, I'd seriously rather not have a cross platform API, than to have a cross platform API that is not actually consistent across platforms. If you make a "cross platform" API with core parts that are optional, then the developer using the API needs to add grievously obnoxious exception handling code.

Okay, I am proposing a modified spec in the following. If more of us think it is better, I will commit it or its improved form.

record GAMappingPosition {
    string contigName;
    long position;
    boolean reverseStrand; // alignment is on the reverse strand.
}
record GALinearAlignment { // a linear alignment can be represented by one CIGAR string
    GAMappingPosition position;
    union { null, int } mappingQuality = null;
    array<GACigarUnit> cigar = []; 
}
record GAAlignmentRecord {
    ... // <- other fields are here 
    union { null, GALinearAlignment } alignment = null; // null if unmapped
    union { null, string } alignedSequence = null;
    union { null, string } alignedQuality = null;
    array<GALinearAlignment> otherAlignments = []; // other alignments of this read (SA tag)
    array<GAMappingPosition> primaryMatePositions = []; // pos of the primary aln of the mate(s)
}

You can get all these information from one SAM line, reducing the possibility of implementation differences.

cassiedoll · 2014-05-21T15:11:08Z

@fnothaft - an ID allows you to look up a mate precisely, whereas GAMappingPosition or its equivalent forces fuzzy searching on contig/position/name. The fuzzy searching has worked for SAM so far, but an ID eliminates ambiguity. To be clear, I don't think IDs are required here - just an idea. It does get rid of the chimeric lookup issues.

@lh3 - that code you put in the comment is now looking much more similar to #51.

re naming: then why do alignment records belong to readgroups? that would be confusing to me as an outside user. if we really do want to go with something like alignment record, we should at least change readgroup to be alignmentgroup or something like that.

fnothaft · 2014-05-21T15:26:13Z

@fnothaft - an ID allows you to look up a mate precisely, whereas GAMappingPosition or its equivalent forces fuzzy searching on contig/position/name. The fuzzy searching has worked for SAM so far, but an ID eliminates ambiguity. To be clear, I don't think IDs are required here - just an idea. It does get rid of the chimeric lookup issues.

@cassiedoll I agree in the read case, but reads already have IDs via the read name (at least, in SAM/BAM and ADAM). My question is more centered around the IDs for alignment; I don't see why that is a useful stage of indirection.

re naming: then why do alignment records belong to readgroups? that would be confusing to me as an outside user. if we really do want to go with something like alignment record, we should at least change readgroup to be alignmentgroup or something like that.

Realistically, linear alignments belong to reads/fragments, which belong to read groups. The middle translation step is implied in current formats; it's a bit of a leaky abstraction. Read groups are built around the lineage of the read --> typically all the reads in a single read group came from the same sequencing lane.

cassiedoll · 2014-05-21T15:41:40Z

re IDs: for alignments, you still need to look up your mates, right? if so, looking up by contig/position is still messy. for the higher coverage data I've seen there is more than one alignment at a given contig/position. so to get the mate bases lets say, a user has to search for alignments at a position, and iterate through the results to find the one that has the same qname.

not saying it doesn't work aok. its just in most apis that aren't based on files, you use IDs to represent this kind of relationship. we can just skip over this issue though and not use IDs :)
re naming: so why does say Added GAFragment type #38 have a GAFragment contain multiple GAReads with multiple GALinearAlignments on them? Does fragment == read? or alignmentRecord == read? The 2 pull requests seem to disagree a bit on that.

fnothaft · 2014-05-21T15:56:39Z

re IDs: for alignments, you still need to look up your mates, right? if so, looking up by contig/position is still messy. for the higher coverage data I've seen there is more than one alignment at a given contig/position. so to get the mate bases lets say, a user has to search for alignments at a position, and iterate through the results to find the one that has the same qname.

Ah! Good point. This is probably a good question for @lh3, as I'm not sure what the semantics here are. Not all aligners do paired end alignment, so we may not expect the number of secondary alignments for a read to match with it's mate pair. For aligners that do paired end alignment and that emit secondary alignments, do they emit the secondary alignments per read pair? Etc...

re naming: so why does say #38 have a GAFragment contain multiple GAReads with multiple GALinearAlignments on them? Does fragment == read? or alignmentRecord == read? The 2 pull requests seem to disagree a bit on that.

A fragment is a longer section of sequence that paired end reads are sequenced from. At least for Illumina, the relationship is read group (sequencing lane) --> fragment (piece of DNA that paired end reads are sequenced from) --> read (single sequence read from a fragment) --> alignment --> linear alignment.

lh3 · 2014-05-21T18:06:15Z

that code you put in the comment is now looking much more similar to #51.

Both this PR and #51 are attempts to mirror a line in a SAM file. They should be similar. However, they are very different conceptually. #51 is modeling a read, where "GARead::alignments" is an array. A GARead object may represent multiple SAM lines. You will have problems with indexing and sorting in implementation. This PR is modeling a linear alignment, where "GAAlignmentRecord::alignment" is a single alignment. Additional components in your GARead::alignments are described in the "GAAlignmentRecord::otherAlignments" array. A GAAlignmentRecord object always mirrors one line in a SAM file. The current practices on BAM files, which I admit are not always optimal, can be applied to to the model in this PR.

Ah! Good point. This is probably a good question for @lh3, as I'm not sure what the semantics here are. Not all aligners do paired end alignment, so we may not expect the number of secondary alignments for a read to match with it's mate pair. For aligners that do paired end alignment and that emit secondary alignments, do they emit the secondary alignments per read pair? Etc...

This is not a well defined corner. Bwa does not produce multiple paired hits, though I am sure some other mappers do. I don't know how they report such hits. For multiple secondary pairs, we might have to have an ID field to separate them out.

Realistically, linear alignments belong to reads/fragments, which belong to read groups. The middle translation step is implied in current formats; it's a bit of a leaky abstraction.

Yes, I agree. This is the price of following the SAM model. #38 is much cleaner.

A fragment is a longer section of sequence that paired end reads are sequenced from. At least for Illumina, the relationship is read group (sequencing lane) --> fragment (piece of DNA that paired end reads are sequenced from) --> read (single sequence read from a fragment) --> alignment --> linear alignment.

I believe the hierarchy is also true for all the popular sequencing technologies so far.

richarddurbin · 2014-05-21T18:13:15Z

With respect to naming, what about GAReadAlignment as the name for Heng's GAAlignmentRecord?

I am a bit confused. Is there only one of these AlignmentRecords for each read, or is there one for each LinearAlignment, with that one being the
alignment (perhaps it should be primaryAlignment?) and the others being in otherAlignments? I think there is one SAM record for each LinearAlignment.

I think Cassie is correct that if we go this route we need a read id that is a true id unique in the repository, not just the name. BAM has that within a file
defined by the names in SAM being unique within a file. But we can't guarantee that these names from SAM/BAM will be unique in our repository since
we want to put multiple BAMs into a single repository. Also from my memory in BAM you just have the MappingPosition of the next mate, not of all mates,
so you need a sequence of requests if there are more than two reads from a fragment.

I had another alternative, but on reflection I think it is likely to be similar to one of the other proposals, so should read them before commenting on that.
Logically I think we want Fragment { readGroupId ; array ; }, Read { fragmentId ; array ; } LinearAlignment { readId ; position ; }
with each level being one-to-many. We are discussing options to default collapse some of these levels, both because in most cases there will just be
one (or two in the case of read mate pairs) of the level below, and to ensure efficient implementation and interchange with SAM/BAM/CRAM.

Richard

On 21 May 2014, at 04:45, Heng Li notifications@github.com wrote:

From the position of someone who would write software that would pull data from an API like this, I'd seriously rather not have a cross platform API, than to have a cross platform API that is not actually consistent across platforms. If you make a "cross platform" API with core parts that are optional, then the developer using the API needs to add grievously obnoxious exception handling code.

Okay, I am proposing a modified spec in the following. If more of us think it is better, I will commit it or its improved form.

record GAMappingPosition {
string contigName;
long position;
boolean reverseStrand; // alignment is on the reverse strand.
}
record GALinearAlignment { // a linear alignment can be represented by one CIGAR string
GAMappingPosition position;
union { null, int } mappingQuality = null;
array cigar = [];
}
record GAAlignmentRecord {
// <- other fields are here
union { null, GALinearAlignment } alignment = null; // null if unmapped
union { null, string } alignedSequence = null;
union { null, string } alignedQuality = null;
array otherAlignments = []; // from the SA tag if present
array matePositions = [];
}
You can get all these information from one SAM line, reducing the possibility of implementation differences.

—
Reply to this email directly or view it on GitHub.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

richarddurbin · 2014-05-21T18:17:40Z

Sorry. I just read Heng's last post in response to Frank after I sent my last post. Apart from the naming idea I proposed, which still has merit I think, the rest is probably
redundant/less clear than this exchange between Heng and Frank.

Richard

On 21 May 2014, at 19:06, Heng Li notifications@github.com wrote:

that code you put in the comment is now looking much more similar to #51.

Both this PR and #51 are attempts to mirror a line in a SAM file. They should be similar. However, they are very different conceptually. #51 is modeling a read, where "GARead::alignments" is an array. A GARead object may represent multiple SAM lines. You will have problems with indexing and sorting in implementation. This PR is modeling a linear alignment, where "GAAlignmentRecord::alignment" is a single alignment. Additional components in your GARead::alignments are described in the "GAAlignmentRecord::otherAlignments" array. A GAAlignmentRecord object always mirrors one line in a SAM file. The current practices on BAM files, which I admit are not always optimal, can be applied to to the model in this PR.

Ah! Good point. This is probably a good question for @lh3, as I'm not sure what the semantics here are. Not all aligners do paired end alignment, so we may not expect the number of secondary alignments for a read to match with it's mate pair. For aligners that do paired end alignment and that emit secondary alignments, do they emit the secondary alignments per read pair? Etc...

This is not a well defined corner. Bwa does not produce multiple paired hits, though I am sure some other mappers do. I don't know how they report such hits. For multiple secondary pairs, we might have to have an ID field to separate them out.

Realistically, linear alignments belong to reads/fragments, which belong to read groups. The middle translation step is implied in current formats; it's a bit of a leaky abstraction.

Yes, I agree. This is the price of following the SAM model. #38 is much cleaner.

A fragment is a longer section of sequence that paired end reads are sequenced from. At least for Illumina, the relationship is read group (sequencing lane) --> fragment (piece of DNA that paired end reads are sequenced from) --> read (single sequence read from a fragment) --> alignment --> linear alignment.

I believe the hierarchy is also true for all the popular sequencing technologies so far.

—
Reply to this email directly or view it on GitHub.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

lh3 · 2014-05-21T18:23:14Z

With respect to naming, what about GAReadAlignment as the name for Heng's GAAlignmentRecord?

I am okay with that.

I am a bit confused. Is there only one of these AlignmentRecords for each read, or is there one for each LinearAlignment, with that one being the alignment (perhaps it should be primaryAlignment?) and the others being in otherAlignments? I think there is one SAM record for each LinearAlignment.

Yes, one SAM record for each LinearAlignment and one GAAlignmentRecord for each SAM record. If "otherAlignments" is causing confusion, I can remove it and let users decode the SA tag to get other alignments - probably this is better for now.

Also from my memory in BAM you just have the MappingPosition of the next mate, not of all mates,
so you need a sequence of requests if there are more than two reads from a fragment.

Yes, SAM only keeps the next mate position, not all mates. I am fine to change matePrimaryPositions[] to nextMatePosition as I have already made GAAlignmentRecord almost identical to SAM, which seems the consensus.

cassiedoll · 2014-05-21T18:39:00Z

@lh3 can you update your code with the changes you have proposed? (I'm having a hard time following everything)

lh3 · 2014-05-21T18:52:25Z

Updated.

cassiedoll · 2014-05-21T19:46:07Z

Thank you! I like this much better.

cassiedoll · 2014-05-21T19:52:35Z

src/main/resources/avro/reads.avdl

+    union { null, boolean } properPlacement = false; // extension of SAM flag 0x2
+    union { null, boolean } duplicateFragment = false; // SAM flag 0x400
+    int numberReads; // number of reads in the fragment; extension of SAM flag 0x1
+    union { null, int } templateLength = null;


can you bring back the comment for templateLength?
also lets try to use /** */ style comments for all new comments

In SAM, the precise definition of template length is still in question (blame me for that). I only say "equivalent to TLEN in SAM" in case TLEN is changed in SAM in future.

Conflicts: src/main/resources/avro/reads.avdl

dglazer · 2014-05-23T12:45:25Z

@lh3 , this is feeling good to me -- I think it strikes the right balance between existing SAM and new elegance, and between cost/complexity for backend implementors vs. frontend callers. I'm still trying to make sure I understand a few subtleties of secondary/supplement alignments; assuming no surprises there I'll be +1.

pgrosu · 2014-05-23T17:10:27Z

Hi David,

To understand the difference between primaryAligment, secondaryAlignment and supplementAlignment, I will extend Heng’s description in #63 using the following diagram (but the length of the reads in this case would be 300bp rather than 200bp):

Suppose Read_1 maps to chr1:10000 (which means chromosome 1 at position 10000), and Read_2 to chr1:10500, and has the optimal alignment and thus best mappingQuality score. Since there is no primaryAlignement flag, one denotes that in GAReadAlignment by setting both of secondaryAlignment and supplementAlignment to false:

secondaryAlignment = false;
supplementAlignment = false;

This now is your primary line of the read, and contains the full read sequence and quality score. In SAM the primary line of a read is also used for other purposes. These would be for instance to mark duplicates (MarkDuplicates), convert to Sanger fastq format (SamToFastq), to ensure that information between mate-pairs is synchronized (FixMateInformation), among other things.

Secondary Alignment (SAM Flag 0x100)

These are other alignments of the same read but might be suboptimal compared to the primary. For instance, the first 100 basepairs (bp) of Read_1 are mapped to chr2:20000, and that the last 100bp of Read_1 are mapped to chr3:30000. You will also notice that Read_2 is mapped to chr3:30500, though it might not be optimal as the one mapped to chr1:10500. Note: Turning this flag on (to true) would alert other tools to not use this read in their analysis.

Supplementary Alignment (SAM Flag 0x800)

For the supplement alignment, notice that Read_1’s last 100 basepairs (bp) maps in an inverted way around the chromosome 3 position 5000 (chr3:5000) as a set of linear alignments with little overlap, as compared to the one mapped at chr3:30000. This type alignment is useful in denoting chimeric alignments. In SAM it is referred to as the supplementary line.

Reference Links

Below are some reference links that might help:

http://seqanswers.com/forums/showthread.php?t=40239
http://sourceforge.net/p/samtools/mailman/message/30853577/
http://genome.sph.umich.edu/wiki/Mapping_Quality_Scores

cassiedoll · 2014-05-23T18:13:16Z

src/main/resources/avro/reads.avdl

+    // The number of reads in the fragment (extension to SAM flag 0x1)
+    union { null, int } numberReads = null;
+
+    union { null, int } templateLength = null; // equivalent to TLEN in SAM


We - as part of 0.5 - should decide on a definition for this field. Even if the SAM spec is kinda ambiguous, we should really strive not to be. (like how we changed species in GAReferenceSequence to be an ncbi_taxon_id)

suggestion: call this fragmentLength, and put it next to fragmentName

+1 for fragmentLength next to fragmentName

On 24 May 2014, at 14:04, David Glazer notifications@github.com wrote:

In src/main/resources/avro/reads.avdl:

// fragment attributes

// The fragment name. Equivalent to QNAME (query template name) in SAM.

string fragmentName;

// The orientation and the distance between reads from the fragment are

// consistent with the sequencing protocol (extension to SAM flag 0x2)

union { null, boolean } properPlacement = false;

// The fragment is a PCR or optical duplicate (SAM flag 0x400)

union { null, boolean } duplicateFragment = false;

// The number of reads in the fragment (extension to SAM flag 0x1)

union { null, int } numberReads = null;

union { null, int } templateLength = null; // equivalent to TLEN in SAM
suggestion: call this fragmentLength, and put it next to fragmentName

—
Reply to this email directly or view it on GitHub.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

dglazer · 2014-05-27T17:20:09Z

+1 - as I said earlier, I think this strikes the right balance between existing SAM and new elegance, and between cost/complexity for backend implementors vs. frontend callers. There are a few small things we may want to refine in a later pull request, but this is a big step in the right direction.

adamnovak · 2014-05-27T17:54:08Z

src/main/resources/avro/reads.avdl

+
+    // The mapping of the primary alignment of the (readNumber+1)%numberReads
+    // read in the fragment. It replaces mate position and mate strand in SAM.
+    union { null, GAMappingPosition } nextMatePosition = null;


I'm a bit worried about pulling positions up this high in the chain of abstractions. If we were to add another alignment type (say, GAGraphAlignment to the fancy new reference graph types from the RefVar schema), we could just have that optionally replace GALinearAlignment wherever it appears. But as soon as we bring (contig, base, strand) positions up into the actual record objects, it makes graph compatibility more complicated. It would be easier for me if this were an ID reference.

A solution where GAGraphAlignment also came with a GAGraphPosition type (and maybe also had GAMappingPosition renamed to GALinearPosition to match GALinearAlignment?) would be another option, but that seems more complex.

IIUC, in today's world the way you find mates is you look for reads that share a qname and contig, and match the specified position. Not very elegant from first principles, but changing it adds a bunch of burden to repository implementors that I'm very leery of doing in v0.5. (Since they'd have to build a whole new index.)

I'm inclined to leave it as proposed for now, and know we'll be changing it (e.g. by adding alternate options for specifying mate pairs) down the road. I'm open to other suggestions, but every time we tried it got hairy; this approach by @lh3 feels like a nice balance between the past and the future.

+1 for nextMatePosition from Heng for the reasons given by David Glazer

On 27 May 2014, at 20:19, David Glazer notifications@github.com wrote:

In src/main/resources/avro/reads.avdl:

// By convention, each read has one and only one alignment with both of the

// two following flags being flase. The full read sequence and quality

// should be present in this alignment.

union { null, boolean } secondaryAlignment = false; // SAM flag 0x100

union { null, boolean } supplementAlignment = false; // SAM flag 0x800

// The portion of the read sequence and quality in the alignment. In a

// supplementary or seconday alignment, alignedSequence and alignedQuality

// may be shorter than the read sequence and quality, or even absent.

union { null, string } alignedSequence = null;

array alignedQuality = [];

// The mapping of the primary alignment of the (readNumber+1)%numberReads

// read in the fragment. It replaces mate position and mate strand in SAM.

union { null, GAMappingPosition } nextMatePosition = null;
IIUC, in today's world the way you find mates is you look for reads that share a qname and contig, and match the specified position. Not very elegant from first principles, but changing it adds a bunch of burden to repository implementors that I'm very leery of doing in v0.5. (Since they'd have to build a whole new index.)

I'm inclined to leave it as proposed for now, and know we'll be changing it (e.g. by adding alternate options for specifying mate pairs) down the road. I'm open to other suggestions, but every time we tried it got hairy; this approach by @lh3 feels like a nice balance between the past and the future.

—
Reply to this email directly or view it on GitHub.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

-0, I agree nextMatePosition is rather implementation specific, in that it likely assumes you have an efficient way to go from reference positions to reads, as you do in reference sorted read file. One can imagine other ways of storing reads that don't use this reference ordering, but not sure if anyone cares about this.

If mapping to a reference graph structure rather than reference sequence, I think we can imagine how we'd translate the nextMatePosition field to a node within the graph.

As a general comment, I agree with Adam, the more the reads data structures promote the reference in pure sequence terms, the harder any fork will be to move to a more general reference structure.

dglazer · 2014-05-28T04:11:06Z

src/main/resources/avro/reads.avdl

    long position;
+    boolean reverseStrand;


Can we move reverseStrand out of here and into GALinearAlignment instead?

Rationale: GAMappingPosition is only used in two places -- GALinearAlignment (for whom this move would be a nop) and the nextMatePosition field (where afaik it doesn't matter, since that field is just used to find mates).

This isn't a good abstraction; reverse strandedness is a property of the read itself, not the alignment.

Strand is the property of an alignment, not a property of a read. It is in GAMappingPosition because the strand of the mate is important and is kept in the SAM file.

Perhaps I misunderstand, but doesn't the reverse strand flag mean that the read was sequenced from the complimentary strand of a DNA fragment? This would imply that the strand is a property of a read, not an alignment.

@lh3, when you say:

It is in GAMappingPosition because the strand of the mate is important

That's the key question to me -- is it common for someone looking at a particular alignment to need to know its mate's strand, without also needing to know full info about that mate (e.g. flags, bases, qualities)? If so then I withdraw my suggestion (although I would like to understand more about that use case). But if not, then I think it's cleaner to strip down GAMappingPosition to only contain the information necessary to find the mate.

Re SAM file -- yes, I see that SAM flags contain
0x20: SEQ of the next segment in the template being reversed
And as I say above, if it's commonly used, no problem. But if not, then I think it's cleaner to drop it.
`

@fnothaft, the reverseStrand indicates the strand of an alignment relative to the reference genome. @dglazer, strandness is as important as the coordinate. The use of 0x20 is as frequent as the use of mate positions.

pgrosu · 2014-05-28T08:14:19Z

@fnothaft, so since the alignment is in relation to a GAReferenceSequence then that should match in relation to that as denoted in SAM, as @lh3 mentioned. Though by cascading through the structures (GAReadAlignment -> GALinearAlignment -> GAMappingPosition) you should be able to determine strandedness fairly quickly. In fact in #66 it's completely gone :)

richarddurbin · 2014-05-28T19:22:22Z

+1 once lh3 confirms that he has tidied up (within 2 hours)

On 28 May 2014, at 09:14, Paul Grosu notifications@github.com wrote:

@fnothaft, so since the alignment is in relation to a GAReferenceSequence then that should match in relation to that as denoted in SAM, as @lh3 mentioned. Though by cascading through the structures (GAReadAlignment -> GALinearAlignment -> GAMappingPosition) you should be able to determine strandedness fairly quickly. In fact in #66 it's completely gone :)

—
Reply to this email directly or view it on GitHub.

The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

lh3 · 2014-05-28T20:54:46Z

I think this PR is ready for merge. Someone else may do it.

Replace GARead with GAReadAlignment

dglazer · 2014-05-28T22:55:52Z

Done!

Added script to convert test data from text to binary

lh3 added 2 commits May 20, 2014 21:46

Replace GARead with GAAlignmentRecord

e79ae0a

added a few more fields and comments

04ecc1b

lh3 mentioned this pull request May 21, 2014

Alternative fragment proposal - read field reorganization #51

Closed

lh3 mentioned this pull request May 21, 2014

Added GAFragment type #38

Closed

cassiedoll added the ReadTaskTeam label May 21, 2014

lh3 added 2 commits May 21, 2014 14:45

renamed GAAlignmentRecord and made closer to SAM

870fd51

TAB => SPACEs

72a5e8e

cassiedoll reviewed May 21, 2014
View reviewed changes

dglazer mentioned this pull request May 22, 2014

document fragments, reads, alignments, etc. #63

Closed

lh3 changed the title ~~Replace GARead with GAAlignmentRecord~~ Replace GARead with GAReadAlignment May 23, 2014

lh3 added 2 commits May 22, 2014 22:29

More comments

7033d69

Merge remote-tracking branch 'upstream/master' into AlignmentRecord

c180d78

Conflicts: src/main/resources/avro/reads.avdl

cassiedoll reviewed May 23, 2014
View reviewed changes

dglazer added this to the v0.5 (Reads) milestone May 25, 2014

pgrosu mentioned this pull request May 26, 2014

add a header comment giving an overview of objects in this file #67

Merged

adamnovak reviewed May 27, 2014
View reviewed changes

dglazer mentioned this pull request May 27, 2014

how unique should GA4GH.Read id's be? #18

Closed

dglazer reviewed May 28, 2014
View reviewed changes

Renamed a few fields; added one comment

3ec3348

dglazer added a commit that referenced this pull request May 28, 2014

Merge pull request #60 from lh3/AlignmentRecord

b8644b9

Replace GARead with GAReadAlignment

dglazer merged commit b8644b9 into ga4gh:master May 28, 2014

richarddurbin mentioned this pull request May 29, 2014

VCF-like variants proposal #49

Merged

pgrosu mentioned this pull request Mar 13, 2015

look up all reads in fragment #212

Open

dcolligan pushed a commit to dcolligan/ga4gh-schemas that referenced this pull request Jul 20, 2016

Merge pull request ga4gh#60 from hjellinek/compliance_redux

ba8e446

Added script to convert test data from text to binary

Replace GARead with GAReadAlignment #60

Replace GARead with GAReadAlignment #60

Conversation

lh3 commented May 21, 2014

fnothaft commented May 21, 2014

cassiedoll commented May 21, 2014

lh3 commented May 21, 2014

fnothaft commented May 21, 2014

fnothaft commented May 21, 2014

lh3 commented May 21, 2014

cassiedoll commented May 21, 2014

fnothaft commented May 21, 2014

cassiedoll commented May 21, 2014

fnothaft commented May 21, 2014

lh3 commented May 21, 2014

richarddurbin commented May 21, 2014

richarddurbin commented May 21, 2014

lh3 commented May 21, 2014

cassiedoll commented May 21, 2014

lh3 commented May 21, 2014

cassiedoll commented May 21, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dglazer commented May 23, 2014

pgrosu commented May 23, 2014

Secondary Alignment (SAM Flag 0x100)

Supplementary Alignment (SAM Flag 0x800)

Reference Links

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

dglazer commented May 27, 2014

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

pgrosu commented May 28, 2014

richarddurbin commented May 28, 2014

lh3 commented May 28, 2014

dglazer commented May 28, 2014