This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

Replace GARead with GAReadAlignment #60

Merged
merged 7 commits into from
May 28, 2014

Conversation

@lh3
Member

lh3 commented May 21, 2014

(This is a refinement of #47, which has been closed because I mistakenly created a branch on the GA4GH repository and sent the pull request from that branch.)

In a SAM file, each record line describes a linear alignment annotated with the fragment and read attributes as well as limited and duplicated information about the mate and other linear alignments if the alignment is chimeric. GAAlignmentRecord in this PR closely mirrors and extends a SAM record. It optionally allows more information, such as the mate cigar and mate sequence, to be retrieved when the read store backend supports such operations efficiently.
#38 and #51 are two alternatives to this PR. #38 explicitly describes the concept hierarchy: Fragment <- Read <- Alignment <- LinearAlignment. However, in #38, a Fragment may contain >1 alignments that may be distant from each other. Retrieving a full Fragment object may require either duplicating data on a large scale or incurring frequent random access, which is slow. I couldn't think of an efficient implementation. #51 takes Read as the primary object. It unfortunately has the same problem as #38: a Read may contain >1 alignments distant from each other. #38 and #51 are also incompatible with readmethods.avdl, which requires sorted GARead objects: reads cannot be sorted; only alignments can.

@fnothaft
Contributor

It optionally allows more information, such as the mate cigar and mate sequence, to be retrieved when the read store backend supports such operations efficiently.

-1. Having implementation optional fields in the schema breaks cross-implementation compatibility.

I tend to favor an implementation in the style of #38, as it is semantically richer. I find it moderately preferable to apply sorting as a transformation done lazily on top of the data, instead of expecting data to always come sorted by alignment. In the processing systems that I am working on, I expect to invoke a shuffle once I have my data loaded anyways, so having it come in already sorted isn't a big deal.

Retrieving a full Fragment object may require either duplicating data on a large scale or incurring frequent random access, which is slow. I couldn't think of an efficient implementation.

In a columnar system with predicate pushdown, this is reasonably efficient even if the fragments aren't sorted/indexed.

@cassiedoll
Member

Specific to this pull request:

  • I'd rather give your alignments object an ID and have array<array<GALinearAlignment>> otherAlignments change to array<string> otherAlignmentIds.
  • I also like the name GARead better than GAAlignmentRecord - even if alignmentRecord is technically correct :)

On #38: I feel like giving reads IDs could solve some of the issues described, I'll update my request to try it out.

@fnothaft - sorting can be handy for lighter weight clients (like GABrowse) when there is pagination involved in the response

@lh3
Member Author

lh3 commented May 21, 2014

-1. Having implementation optional fields in the schema breaks cross-implementation compatibility.

We cannot avoid implementation differences. Some information trivially available to one implementation may be hard for others. If we really want to avoid optional fields, we can introduce a simplified GALinearAlignment for otherAlignments, but in that case, users have to do a two-pass search to retrieve full mate information even if a column-oriented backend can get the info efficiently. I am okay with that if this is the consensus.

In a columnar system with predicate pushdown, this is reasonably efficient even if the fragments aren't sorted/indexed.

Firstly, the performance has not been fully evaluated. Secondly, we need to think about other implementations.

I'd rather give your alignments object an ID and have array<array<GALinearAlignment>> otherAlignments change to array<string> otherAlignmentIds.

Interesting idea. But then users need a two-pass search even for basic mate information. I don't like two-pass search.

I also like the name GARead better than GAAlignmentRecord - even if alignmentRecord is technically correct :)

To me, technical correctness is very important. :)

@fnothaft
Contributor

On #38: I feel like giving reads IDs could solve some of the issues described, I'll update my request to try it out.

I'm not sure I see why having IDs is preferable (for either reads or alignments). Can you explain this a bit more?

@fnothaft - sorting can be handy for lighter weight clients (like GABrowse) when there is pagination involved in the response

Agreed; my point is that the sorting can be implemented as an intermediate translation between querying the data and receiving the data.

I also like the name GARead better than GAAlignmentRecord - even if alignmentRecord is technically correct :)

I'm -1 on this; the naming isn't critical when you're talking among people who are very familiar with the implementation, but there are a lot of end users who will be misled by the GARead name.

@fnothaft
Contributor

We cannot avoid implementation differences. Some information trivially available to one implementation may be hard for others. If we really want to avoid optional fields, we can introduce a simplified GALinearAlignment for otherAlignments, but in that case, users have to do a two-pass search to retrieve full mate information even if a column-oriented backend can get the info efficiently. I am okay with that if this is the consensus.

From the position of someone who would write software that would pull data from an API like this, I'd seriously rather not have a cross platform API, than to have a cross platform API that is not actually consistent across platforms. If you make a "cross platform" API with core parts that are optional, then the developer using the API needs to add grievously obnoxious exception handling code.

I realize that inevitably implementation differences will occur, but I'm strongly against writing implementation differences into our spec.

@lh3
Member Author

lh3 commented May 21, 2014

From the position of someone who would write software that would pull data from an API like this, I'd seriously rather not have a cross platform API, than to have a cross platform API that is not actually consistent across platforms. If you make a "cross platform" API with core parts that are optional, then the developer using the API needs to add grievously obnoxious exception handling code.

Okay, I am proposing a modified spec in the following. If more of us think it is better, I will commit it or its improved form.

record GAMappingPosition {
    string contigName;
    long position;
    boolean reverseStrand; // alignment is on the reverse strand.
}
record GALinearAlignment { // a linear alignment can be represented by one CIGAR string
    GAMappingPosition position;
    union { null, int } mappingQuality = null;
    array<GACigarUnit> cigar = []; 
}
record GAAlignmentRecord {
    ... // <- other fields are here 
    union { null, GALinearAlignment } alignment = null; // null if unmapped
    union { null, string } alignedSequence = null;
    union { null, string } alignedQuality = null;
    array<GALinearAlignment> otherAlignments = []; // other alignments of this read (SA tag)
    array<GAMappingPosition> primaryMatePositions = []; // pos of the primary aln of the mate(s)
}

You can get all this information from one SAM line, reducing the possibility of implementation differences.
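
To illustrate the claim that everything in the proposed record can be read off a single SAM line, here is a minimal Python sketch. It is only an illustration under stated assumptions: the class and function names (`MappingPosition`, `LinearAlignment`, `AlignmentRecord`, `parse_sa_tag`, `parse_sam_line`) are hypothetical and not part of the schema, positions are converted to 0-based (the PR does not fix a convention here), and the CIGAR is kept as a raw string rather than an array of units.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MappingPosition:
    contig_name: str
    position: int            # assumption: 0-based (SAM text is 1-based)
    reverse_strand: bool


@dataclass
class LinearAlignment:
    position: MappingPosition
    mapping_quality: Optional[int]
    cigar: str               # raw CIGAR string, kept unparsed for brevity


@dataclass
class AlignmentRecord:
    fragment_name: str
    alignment: Optional[LinearAlignment]          # None if unmapped
    aligned_sequence: Optional[str]
    aligned_quality: Optional[str]
    other_alignments: List[LinearAlignment] = field(default_factory=list)
    next_mate_position: Optional[MappingPosition] = None


def parse_sa_tag(value: str) -> List[LinearAlignment]:
    """Decode an SA:Z value: 'rname,pos,strand,CIGAR,mapQ,NM;' repeated."""
    result = []
    for entry in value.split(";"):
        if not entry:
            continue
        contig, pos, strand, cigar, mapq, _nm = entry.split(",")
        result.append(LinearAlignment(
            MappingPosition(contig, int(pos) - 1, strand == "-"),
            int(mapq), cigar))
    return result


def parse_sam_line(line: str) -> AlignmentRecord:
    f = line.rstrip("\n").split("\t")
    flag, rname, pos, mapq, cigar = int(f[1]), f[2], int(f[3]), int(f[4]), f[5]
    rnext, pnext, seq, qual = f[6], int(f[7]), f[9], f[10]

    alignment = None
    if not flag & 0x4:  # 0x4: read unmapped
        alignment = LinearAlignment(
            MappingPosition(rname, pos - 1, bool(flag & 0x10)),  # 0x10: reverse strand
            None if mapq == 255 else mapq,                       # 255: MAPQ unavailable
            cigar)

    mate = None
    if flag & 0x1 and not flag & 0x8 and rnext != "*":  # paired and mate mapped
        mate = MappingPosition(rname if rnext == "=" else rnext,
                               pnext - 1, bool(flag & 0x20))     # 0x20: mate reverse

    others: List[LinearAlignment] = []
    for tag in f[11:]:
        if tag.startswith("SA:Z:"):
            others = parse_sa_tag(tag[5:])

    return AlignmentRecord(f[0], alignment,
                           None if seq == "*" else seq,
                           None if qual == "*" else qual,
                           others, mate)
```

Note that the mate's CIGAR and sequence are not recoverable this way, which is the point of restricting the record to what one SAM line carries.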

@cassiedoll
Member

@fnothaft - an ID allows you to look up a mate precisely, whereas GAMappingPosition or its equivalent forces fuzzy searching on contig/position/name. The fuzzy searching has worked for SAM so far, but an ID eliminates ambiguity. To be clear, I don't think IDs are required here - just an idea. It does get rid of the chimeric lookup issues.

@lh3 - that code you put in the comment is now looking much more similar to #51.

re naming: then why do alignment records belong to readgroups? that would be confusing to me as an outside user. if we really do want to go with something like alignment record, we should at least change readgroup to be alignmentgroup or something like that.

@fnothaft
Contributor

@fnothaft - an ID allows you to look up a mate precisely, whereas GAMappingPosition or its equivalent forces fuzzy searching on contig/position/name. The fuzzy searching has worked for SAM so far, but an ID eliminates ambiguity. To be clear, I don't think IDs are required here - just an idea. It does get rid of the chimeric lookup issues.

@cassiedoll I agree in the read case, but reads already have IDs via the read name (at least, in SAM/BAM and ADAM). My question is more centered around the IDs for alignment; I don't see why that is a useful stage of indirection.

re naming: then why do alignment records belong to readgroups? that would be confusing to me as an outside user. if we really do want to go with something like alignment record, we should at least change readgroup to be alignmentgroup or something like that.

Realistically, linear alignments belong to reads/fragments, which belong to read groups. The middle translation step is implied in current formats; it's a bit of a leaky abstraction. Read groups are built around the lineage of the read --> typically all the reads in a single read group came from the same sequencing lane.

@cassiedoll
Member

  • re IDs: for alignments, you still need to look up your mates, right? if so, looking up by contig/position is still messy. for the higher coverage data I've seen there is more than one alignment at a given contig/position. so to get the mate bases, let's say, a user has to search for alignments at a position, and iterate through the results to find the one that has the same qname.

    not saying it doesn't work aok. it's just that in most apis that aren't based on files, you use IDs to represent this kind of relationship. we can just skip over this issue though and not use IDs :)

  • re naming: so why does say #38 have a GAFragment contain multiple GAReads with multiple GALinearAlignments on them? Does fragment == read? or alignmentRecord == read? The 2 pull requests seem to disagree a bit on that.
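
The two lookup styles contrasted in the first bullet can be sketched in Python as follows. This is a hedged illustration only: the `Aln` tuple, its `aln_id` field, and the helper names are hypothetical, and an in-memory dictionary stands in for whatever position index a real read store would provide.

```python
from collections import defaultdict
from typing import Dict, Iterable, List, NamedTuple, Optional, Tuple


class Aln(NamedTuple):
    aln_id: str          # hypothetical repository-unique ID
    fragment_name: str   # QNAME in SAM
    contig: str
    position: int
    mate_contig: str
    mate_position: int


def build_position_index(alns: Iterable[Aln]) -> Dict[Tuple[str, int], List[Aln]]:
    """Group alignments by (contig, position), as a position-sorted store would."""
    index: Dict[Tuple[str, int], List[Aln]] = defaultdict(list)
    for a in alns:
        index[(a.contig, a.position)].append(a)
    return index


def find_mate_by_position(aln: Aln, index: Dict[Tuple[str, int], List[Aln]]) -> Optional[Aln]:
    """SAM-style lookup: fetch everything at the mate position, then filter by qname."""
    for candidate in index.get((aln.mate_contig, aln.mate_position), []):
        if candidate.fragment_name == aln.fragment_name and candidate.aln_id != aln.aln_id:
            return candidate
    return None


def find_by_id(aln_id: str, by_id: Dict[str, Aln]) -> Optional[Aln]:
    """ID-style lookup: a single exact-match fetch, no filtering needed."""
    return by_id.get(aln_id)
```

With high coverage, the list returned for one (contig, position) key can hold many unrelated alignments, which is the iteration cost described above; the ID variant avoids it at the price of the store maintaining an extra index.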

@fnothaft
Contributor

re IDs: for alignments, you still need to look up your mates, right? if so, looking up by contig/position is still messy. for the higher coverage data I've seen there is more than one alignment at a given contig/position. so to get the mate bases, let's say, a user has to search for alignments at a position, and iterate through the results to find the one that has the same qname.

Ah! Good point. This is probably a good question for @lh3, as I'm not sure what the semantics here are. Not all aligners do paired end alignment, so we may not expect the number of secondary alignments for a read to match that of its mate. For aligners that do paired end alignment and that emit secondary alignments, do they emit the secondary alignments per read pair? Etc...

re naming: so why does say #38 have a GAFragment contain multiple GAReads with multiple GALinearAlignments on them? Does fragment == read? or alignmentRecord == read? The 2 pull requests seem to disagree a bit on that.

A fragment is a longer section of sequence that paired end reads are sequenced from. At least for Illumina, the relationship is read group (sequencing lane) --> fragment (piece of DNA that paired end reads are sequenced from) --> read (single sequence read from a fragment) --> alignment --> linear alignment.

@lh3
Member Author

lh3 commented May 21, 2014

that code you put in the comment is now looking much more similar to #51.

Both this PR and #51 are attempts to mirror a line in a SAM file. They should be similar. However, they are very different conceptually. #51 is modeling a read, where "GARead::alignments" is an array. A GARead object may represent multiple SAM lines. You will have problems with indexing and sorting in implementation. This PR is modeling a linear alignment, where "GAAlignmentRecord::alignment" is a single alignment. Additional components in your GARead::alignments are described in the "GAAlignmentRecord::otherAlignments" array. A GAAlignmentRecord object always mirrors one line in a SAM file. The current practices on BAM files, which I admit are not always optimal, can be applied to the model in this PR.

Ah! Good point. This is probably a good question for @lh3, as I'm not sure what the semantics here are. Not all aligners do paired end alignment, so we may not expect the number of secondary alignments for a read to match that of its mate. For aligners that do paired end alignment and that emit secondary alignments, do they emit the secondary alignments per read pair? Etc...

This is not a well defined corner. Bwa does not produce multiple paired hits, though I am sure some other mappers do. I don't know how they report such hits. For multiple secondary pairs, we might have to have an ID field to separate them out.

Realistically, linear alignments belong to reads/fragments, which belong to read groups. The middle translation step is implied in current formats; it's a bit of a leaky abstraction.

Yes, I agree. This is the price of following the SAM model. #38 is much cleaner.

A fragment is a longer section of sequence that paired end reads are sequenced from. At least for Illumina, the relationship is read group (sequencing lane) --> fragment (piece of DNA that paired end reads are sequenced from) --> read (single sequence read from a fragment) --> alignment --> linear alignment.

I believe the hierarchy is also true for all the popular sequencing technologies so far.

@richarddurbin
Contributor

With respect to naming, what about GAReadAlignment as the name for Heng's GAAlignmentRecord?

I am a bit confused. Is there only one of these AlignmentRecords for each read, or is there one for each LinearAlignment, with that one being the alignment (perhaps it should be primaryAlignment?) and the others being in otherAlignments? I think there is one SAM record for each LinearAlignment.

I think Cassie is correct that if we go this route we need a read id that is a true id unique in the repository, not just the name. BAM has that within a file, defined by the names in SAM being unique within a file. But we can't guarantee that these names from SAM/BAM will be unique in our repository since we want to put multiple BAMs into a single repository. Also from my memory in BAM you just have the MappingPosition of the next mate, not of all mates, so you need a sequence of requests if there are more than two reads from a fragment.

I had another alternative, but on reflection I think it is likely to be similar to one of the other proposals, so I should read them before commenting on that.
Logically I think we want Fragment { readGroupId ; array<Read> ; }, Read { fragmentId ; array<LinearAlignment> ; }, LinearAlignment { readId ; position ; } with each level being one-to-many. We are discussing options to collapse some of these levels by default, both because in most cases there will just be one (or two in the case of read mate pairs) of the level below, and to ensure efficient implementation and interchange with SAM/BAM/CRAM.
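
The logical one-to-many model and its collapse into SAM-like records can be sketched in Python as below. This is an illustrative sketch, not any of the proposed schemas: the dataclass and function names are hypothetical, and the flattening shows only why a position sort applies naturally to alignments rather than to reads or fragments.

```python
from dataclasses import dataclass, field
from typing import Iterable, Iterator, List, Tuple


@dataclass
class LinearAlignment:
    contig: str
    position: int


@dataclass
class Read:
    sequence: str
    alignments: List[LinearAlignment] = field(default_factory=list)


@dataclass
class Fragment:
    name: str
    read_group_id: str
    reads: List[Read] = field(default_factory=list)


# One flattened record per linear alignment: (fragment name, read number, alignment).
Record = Tuple[str, int, LinearAlignment]


def flatten(fragment: Fragment) -> Iterator[Record]:
    """Collapse Fragment -> Read -> LinearAlignment into per-alignment records."""
    for read_number, read in enumerate(fragment.reads):
        for aln in read.alignments:
            yield fragment.name, read_number, aln


def sorted_records(fragments: Iterable[Fragment]) -> List[Record]:
    """Sort by (contig, position); a fragment's records may end up far apart."""
    records = [rec for frag in fragments for rec in flatten(frag)]
    return sorted(records, key=lambda rec: (rec[2].contig, rec[2].position))
```

A fragment whose reads align to different contigs has no single sort key, whereas each flattened record does.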

Richard


The Wellcome Trust Sanger Institute is operated by Genome Research
Limited, a charity registered in England with number 1021457 and a
company registered in England with number 2742969, whose registered
office is 215 Euston Road, London, NW1 2BE.

@richarddurbin
Contributor

Sorry. I just read Heng's last post in response to Frank after I sent my last post. Apart from the naming idea I proposed, which still has merit I think, the rest is probably redundant/less clear than this exchange between Heng and Frank.

Richard

@lh3
Member Author

lh3 commented May 21, 2014

With respect to naming, what about GAReadAlignment as the name for Heng's GAAlignmentRecord?

I am okay with that.

I am a bit confused. Is there only one of these AlignmentRecords for each read, or is there one for each LinearAlignment, with that one being the alignment (perhaps it should be primaryAlignment?) and the others being in otherAlignments? I think there is one SAM record for each LinearAlignment.

Yes, one SAM record for each LinearAlignment and one GAAlignmentRecord for each SAM record. If "otherAlignments" is causing confusion, I can remove it and let users decode the SA tag to get other alignments - probably this is better for now.

Also from my memory in BAM you just have the MappingPosition of the next mate, not of all mates, so you need a sequence of requests if there are more than two reads from a fragment.

Yes, SAM only keeps the next mate position, not all mates. I am fine with changing primaryMatePositions[] to nextMatePosition, as I have already made GAAlignmentRecord almost identical to SAM, which seems to be the consensus.

@cassiedoll
Member

@lh3 can you update your code with the changes you have proposed? (I'm having a hard time following everything)

@lh3
Member Author

lh3 commented May 21, 2014

Updated.

@cassiedoll
Member

Thank you! I like this much better.

union { null, boolean } properPlacement = false; // extension of SAM flag 0x2
union { null, boolean } duplicateFragment = false; // SAM flag 0x400
int numberReads; // number of reads in the fragment; extension of SAM flag 0x1
union { null, int } templateLength = null;
Member

can you bring back the comment for templateLength?
also let's try to use /** */ style comments for all new comments

Member Author

In SAM, the precise definition of template length is still in question (blame me for that). I only say "equivalent to TLEN in SAM" in case TLEN is changed in SAM in future.

@lh3 changed the title from "Replace GARead with GAAlignmentRecord" to "Replace GARead with GAReadAlignment" on May 23, 2014
@dglazer
Member

dglazer commented May 23, 2014

@lh3 , this is feeling good to me -- I think it strikes the right balance between existing SAM and new elegance, and between cost/complexity for backend implementors vs. frontend callers. I'm still trying to make sure I understand a few subtleties of secondary/supplement alignments; assuming no surprises there I'll be +1.

@pgrosu
Contributor

pgrosu commented May 23, 2014

Hi David,

To understand the difference between primaryAlignment, secondaryAlignment and supplementAlignment, I will extend Heng’s description in #63 using the following diagram (but the length of the reads in this case would be 300bp rather than 200bp):

image

Suppose Read_1 maps to chr1:10000 (which means chromosome 1 at position 10000) and Read_2 maps to chr1:10500, and that this is the optimal alignment with the best mappingQuality score. Since there is no primaryAlignment flag, one denotes a primary alignment in GAReadAlignment by setting both secondaryAlignment and supplementAlignment to false:

secondaryAlignment = false;
supplementAlignment = false;

This now is your primary line of the read, and contains the full read sequence and quality score. In SAM the primary line of a read is also used for other purposes. These would be for instance to mark duplicates (MarkDuplicates), convert to Sanger fastq format (SamToFastq), to ensure that information between mate-pairs is synchronized (FixMateInformation), among other things.

Secondary Alignment (SAM Flag 0x100)

These are other alignments of the same read that might be suboptimal compared to the primary. For instance, the first 100 base pairs (bp) of Read_1 are mapped to chr2:20000, and the last 100bp of Read_1 are mapped to chr3:30000. You will also notice that Read_2 is mapped to chr3:30500, though that mapping might not be as optimal as the one at chr1:10500. Note: Turning this flag on (to true) would alert other tools to not use this read in their analysis.

Supplementary Alignment (SAM Flag 0x800)

For the supplementary alignment, notice that Read_1’s last 100 base pairs (bp) map in an inverted way around chromosome 3 position 5000 (chr3:5000), as a set of linear alignments with little overlap, as compared to the one mapped at chr3:30000. This type of alignment is useful in denoting chimeric alignments. In SAM it is referred to as the supplementary line.
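
As a small illustration of the flag convention described above, the following Python sketch decodes the two SAM flag bits into the booleans used in GAReadAlignment. The function name `classify_alignment` and the extra `primary` key are hypothetical conveniences; only the bit values 0x100 and 0x800 come from SAM.

```python
SECONDARY = 0x100       # SAM flag: secondary alignment
SUPPLEMENTARY = 0x800   # SAM flag: supplementary alignment


def classify_alignment(flag: int) -> dict:
    """Map a SAM FLAG value to the two booleans discussed in this thread.

    There is no dedicated primary bit: a line is the primary line of a read
    exactly when both the secondary and supplementary bits are unset.
    """
    secondary = bool(flag & SECONDARY)
    supplementary = bool(flag & SUPPLEMENTARY)
    return {
        "secondaryAlignment": secondary,
        "supplementAlignment": supplementary,
        "primary": not (secondary or supplementary),
    }
```

For example, a FLAG of 2064 (0x800 + 0x10) describes a supplementary alignment on the reverse strand, so it is not the primary line of its read.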

Reference Links

Below are some reference links that might help:

http://seqanswers.com/forums/showthread.php?t=40239
http://sourceforge.net/p/samtools/mailman/message/30853577/
http://genome.sph.umich.edu/wiki/Mapping_Quality_Scores

// The number of reads in the fragment (extension to SAM flag 0x1)
union { null, int } numberReads = null;

union { null, int } templateLength = null; // equivalent to TLEN in SAM
Member

We - as part of 0.5 - should decide on a definition for this field. Even if the SAM spec is kinda ambiguous, we should really strive not to be. (like how we changed species in GAReferenceSequence to be an ncbi_taxon_id)

Member

suggestion: call this fragmentLength, and put it next to fragmentName

Contributor

+1 for fragmentLength next to fragmentName

On 24 May 2014, at 14:04, David Glazer notifications@github.com wrote:

In src/main/resources/avro/reads.avdl:

  • // fragment attributes
  • // The fragment name. Equivalent to QNAME (query template name) in SAM.
  • string fragmentName;
  • // The orientation and the distance between reads from the fragment are
  • // consistent with the sequencing protocol (extension to SAM flag 0x2)
  • union { null, boolean } properPlacement = false;
  • // The fragment is a PCR or optical duplicate (SAM flag 0x400)
  • union { null, boolean } duplicateFragment = false;
  • // The number of reads in the fragment (extension to SAM flag 0x1)
  • union { null, int } numberReads = null;
  • union { null, int } templateLength = null; // equivalent to TLEN in SAM

@dglazer
Member

dglazer commented May 27, 2014

+1 - as I said earlier, I think this strikes the right balance between existing SAM and new elegance, and between cost/complexity for backend implementors vs. frontend callers. There are a few small things we may want to refine in a later pull request, but this is a big step in the right direction.


// The mapping of the primary alignment of the (readNumber+1)%numberReads
// read in the fragment. It replaces mate position and mate strand in SAM.
union { null, GAMappingPosition } nextMatePosition = null;
Contributor

I'm a bit worried about pulling positions up this high in the chain of abstractions. If we were to add another alignment type (say, GAGraphAlignment to the fancy new reference graph types from the RefVar schema), we could just have that optionally replace GALinearAlignment wherever it appears. But as soon as we bring (contig, base, strand) positions up into the actual record objects, it makes graph compatibility more complicated. It would be easier for me if this were an ID reference.

A solution where GAGraphAlignment also came with a GAGraphPosition type (and maybe also had GAMappingPosition renamed to GALinearPosition to match GALinearAlignment?) would be another option, but that seems more complex.

Member

IIUC, in today's world the way you find mates is you look for reads that share a qname and contig, and match the specified position. Not very elegant from first principles, but changing it adds a bunch of burden to repository implementors that I'm very leery of doing in v0.5. (Since they'd have to build a whole new index.)

I'm inclined to leave it as proposed for now, and know we'll be changing it (e.g. by adding alternate options for specifying mate pairs) down the road. I'm open to other suggestions, but every time we tried it got hairy; this approach by @lh3 feels like a nice balance between the past and the future.

Contributor

+1 for nextMatePosition from Heng for the reasons given by David Glazer

On 27 May 2014, at 20:19, David Glazer notifications@github.com wrote:

In src/main/resources/avro/reads.avdl:

  • // By convention, each read has one and only one alignment with both of the
  • // two following flags being false. The full read sequence and quality
  • // should be present in this alignment.
  • union { null, boolean } secondaryAlignment = false; // SAM flag 0x100
  • union { null, boolean } supplementAlignment = false; // SAM flag 0x800
  • // The portion of the read sequence and quality in the alignment. In a
  • // supplementary or secondary alignment, alignedSequence and alignedQuality
  • // may be shorter than the read sequence and quality, or even absent.
  • union { null, string } alignedSequence = null;
  • array alignedQuality = [];
  • // The mapping of the primary alignment of the (readNumber+1)%numberReads
  • // read in the fragment. It replaces mate position and mate strand in SAM.
  • union { null, GAMappingPosition } nextMatePosition = null;

Contributor

-0, I agree nextMatePosition is rather implementation specific, in that it likely assumes you have an efficient way to go from reference positions to reads, as you do in a reference-sorted read file. One can imagine other ways of storing reads that don't use this reference ordering, but I'm not sure if anyone cares about this.

If mapping to a reference graph structure rather than reference sequence, I think we can imagine how we'd translate the nextMatePosition field to a node within the graph.

As a general comment, I agree with Adam, the more the reads data structures promote the reference in pure sequence terms, the harder any fork will be to move to a more general reference structure.
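
The `(readNumber+1)%numberReads` rule in the nextMatePosition comment under review can be made concrete with a short Python sketch. It assumes readNumber is 0-based, which the quoted schema comment implies but does not state; the function names are hypothetical.

```python
from typing import List


def next_read_number(read_number: int, number_reads: int) -> int:
    """Index of the read whose primary alignment nextMatePosition points to."""
    if number_reads <= 0:
        raise ValueError("number_reads must be positive")
    return (read_number + 1) % number_reads


def walk_mates(read_number: int, number_reads: int) -> List[int]:
    """Follow next-mate links from one read until the cycle returns to it.

    Each step corresponds to one lookup by position, so a fragment with more
    than two reads needs a sequence of requests to reach every mate.
    """
    order: List[int] = []
    current = next_read_number(read_number, number_reads)
    while current != read_number:
        order.append(current)
        current = next_read_number(current, number_reads)
    return order
```

For an ordinary pair the walk is a single step in each direction; for a three-read fragment, starting from read 0 visits read 1 and then read 2.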

long position;
boolean reverseStrand;
Member

Can we move reverseStrand out of here and into GALinearAlignment instead?

Rationale: GAMappingPosition is only used in two places -- GALinearAlignment (for which this move would be a no-op) and the nextMatePosition field (where, AFAIK, it doesn't matter, since that field is just used to find mates).

Contributor

This isn't a good abstraction; reverse strandedness is a property of the read itself, not the alignment.

Member Author

Strand is a property of an alignment, not of a read. It is in GAMappingPosition because the strand of the mate is important and is kept in the SAM file.

Contributor

Perhaps I misunderstand, but doesn't the reverse strand flag mean that the read was sequenced from the complementary strand of a DNA fragment? This would imply that the strand is a property of a read, not an alignment.

Member

@lh3, when you say:

It is in GAMappingPosition because the strand of the mate is important

That's the key question to me -- is it common for someone looking at a particular alignment to need to know its mate's strand, without also needing to know full info about that mate (e.g. flags, bases, qualities)? If so then I withdraw my suggestion (although I would like to understand more about that use case). But if not, then I think it's cleaner to strip down GAMappingPosition to only contain the information necessary to find the mate.

Re SAM file -- yes, I see that SAM flags contain
`0x20: SEQ of the next segment in the template being reversed`
And as I say above, if it's commonly used, no problem. But if not, then I think it's cleaner to drop it.

Member Author

@fnothaft, reverseStrand indicates the strand of an alignment relative to the reference genome. @dglazer, strandedness is as important as the coordinate. The 0x20 flag is used as frequently as mate positions.
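For readers less familiar with the SAM flags under discussion, the following Python sketch decodes the two strand bits from a SAM FLAG value: 0x10 (this segment's SEQ is reverse complemented) and 0x20 (the next segment's SEQ is reverse complemented). The function and constant names are invented for this example and do not come from the PR.

```python
from typing import Tuple

# Bit values as defined in the SAM specification's FLAG field.
SAM_FLAG_REVERSE = 0x10       # SEQ of this segment is reverse complemented
SAM_FLAG_MATE_REVERSE = 0x20  # SEQ of the next segment is reverse complemented


def strands_from_flag(flag: int) -> Tuple[bool, bool]:
    """Return (read_is_reverse, mate_is_reverse) for a SAM FLAG value."""
    return bool(flag & SAM_FLAG_REVERSE), bool(flag & SAM_FLAG_MATE_REVERSE)


# FLAG 99 = 0x1 + 0x2 + 0x20 + 0x40: first in pair, forward, mate reversed.
first = strands_from_flag(99)
# FLAG 147 = 0x1 + 0x2 + 0x10 + 0x80: second in pair, reversed, mate forward.
second = strands_from_flag(147)
```

Both bits are stated relative to the reference, which matches the reading that strand belongs to the alignment and that a record carries its mate's strand alongside its mate's position.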

@pgrosu
Contributor

pgrosu commented May 28, 2014

@fnothaft, since the alignment is defined relative to a GAReferenceSequence, the strand should be interpreted relative to that reference too, as it is in SAM and as @lh3 mentioned. By cascading through the structures (GAReadAlignment -> GALinearAlignment -> GAMappingPosition) you should be able to determine strandedness fairly quickly. In fact, in #66 it's completely gone :)
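The cascade mentioned above can be sketched with plain Python dataclasses. Only the record names, `position`, `reverseStrand`, and `nextMatePosition` appear in this thread; the linking field names (`alignment` on the read record, `position` on the linear alignment) are assumptions for illustration and may not match the merged schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class GAMappingPosition:
    position: int        # 'long position' in the schema under review
    reverseStrand: bool  # strand relative to the reference


@dataclass
class GALinearAlignment:
    position: GAMappingPosition  # field name assumed for this sketch


@dataclass
class GAReadAlignment:
    alignment: Optional[GALinearAlignment] = None         # field name assumed
    nextMatePosition: Optional[GAMappingPosition] = None  # null when there is no mate


def is_reverse_strand(read: GAReadAlignment) -> Optional[bool]:
    """Walk GAReadAlignment -> GALinearAlignment -> GAMappingPosition."""
    if read.alignment is None:
        return None  # unaligned read: no strand relative to a reference
    return read.alignment.position.reverseStrand


read = GAReadAlignment(
    alignment=GALinearAlignment(GAMappingPosition(position=1000, reverseStrand=True)),
    nextMatePosition=GAMappingPosition(position=1250, reverseStrand=False),
)
strand = is_reverse_strand(read)
```

Under these assumptions, an unaligned read has no strand at all, which is one way of reading the argument that strand is a property of the alignment.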

@richarddurbin
Contributor

+1 once lh3 confirms that he has tidied up (within 2 hours)


@lh3
Member Author

lh3 commented May 28, 2014

I think this PR is ready for merge. Someone else may do it.

dglazer added a commit that referenced this pull request May 28, 2014
Replace GARead with GAReadAlignment
@dglazer dglazer merged commit b8644b9 into ga4gh:master May 28, 2014
@dglazer
Member

dglazer commented May 28, 2014

Done!

dcolligan pushed a commit to dcolligan/ga4gh-schemas that referenced this pull request Jul 20, 2016
Added script to convert test data from text to binary