This repository has been archived by the owner on Oct 28, 2022. It is now read-only.

replace GARead with GAFlattenedAlignment #47

Closed
wants to merge 4 commits into from

lh3 (Member) commented May 16, 2014

Major difference: a GARead may consist of multiple linear alignments but a GAFlattenedAlignment describes at most one linear alignment and is equivalent to a SAM line. Also extend to allow >2 reads per fragment.

This PR is the opposite of #38. It stuffs fragment, read and alignment attributes in one object.

PS: I don't like the name "GAFlattenedAlignment". Open to better ones.
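
To make the "one object per linear alignment" idea concrete, here is a minimal sketch in Python rather than in the schema language. Only `fragmentId` and `otherAlignments` are names taken from this discussion; every other field name, the `LinearAlignmentRef` helper, and the 0-based coordinate convention are assumptions for illustration, not the fields defined in this PR.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class LinearAlignmentRef:
    """Hypothetical minimal pointer to another linear alignment of the same read."""
    reference_name: str
    position: int                 # assumed 0-based leftmost coordinate
    reverse_strand: bool = False


@dataclass
class FlattenedAlignment:
    """One record per linear alignment, roughly equivalent to one SAM line.

    Fragment, read and alignment attributes all live in the same object.
    """
    fragment_id: str                        # fragment-level attribute
    read_number: int                        # index of this read in the fragment
    number_reads: int                       # reads per fragment, may exceed 2
    sequence: Optional[str] = None          # read-level attribute
    reference_name: Optional[str] = None    # alignment-level; None if unmapped
    position: Optional[int] = None
    cigar: Optional[str] = None
    other_alignments: List[LinearAlignmentRef] = field(default_factory=list)

    def is_mapped(self) -> bool:
        """A record describes at most one linear alignment, possibly none."""
        return self.reference_name is not None and self.position is not None
```

Under this sketch a chimeric read with two linear alignments becomes two `FlattenedAlignment` records, each listing the other in `other_alignments`.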

lh3 added 3 commits May 16, 2014 12:38
Major difference: a GARead may consist of multiple linear alignments but a
GAFlattenedAlignment describes at most one linear alignment and is equivalent
to a SAM line. Also extend to allow >2 reads per fragment.
... because there is no GARead any more.
dglazer (Member) commented May 16, 2014

Interesting -- thanks for exploring the opposite end of the design spectrum. This addresses my concern about too many moving parts, and it keeps the method signature nice and clean. But of course there are tradeoffs. Two initial comments:

  • major: to implement range search efficiently, I think the backend has to be indexed by linear alignment (since those are the things that have coordinates), and wants to store all LA's that are near each other in the genome near each other on disk. But then before returning results, the backend has to find all the related otherAlignments[] and read their data into the response. We can check with folks who have built these backends, but I'm worried those goals are in tension.
  • minor: what is fragmentId? I don't see it used in any methods, or any other objects
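
The tension in the first point can be sketched with a toy in-memory store. This is only an illustration under assumed names (`AlignmentStore`, the record keys `id`, `read_id`, `ref`, `start`, `end`), not how any real backend is built: the range query walks records sorted by coordinate, while filling in the other alignments needs a second lookup keyed by read, which on disk would mean jumping to unrelated locations.

```python
import bisect
from collections import defaultdict


class AlignmentStore:
    """Toy store: linear alignments indexed by (reference, start)."""

    def __init__(self, records):
        # Each record is a dict with keys: id, read_id, ref, start, end
        # (half-open interval [start, end); all names are hypothetical).
        self._by_ref = defaultdict(list)
        self._by_read = defaultdict(list)
        for rec in records:
            self._by_ref[rec["ref"]].append(rec)
            self._by_read[rec["read_id"]].append(rec)
        self._starts = {}
        for ref, recs in self._by_ref.items():
            recs.sort(key=lambda r: r["start"])
            self._starts[ref] = [r["start"] for r in recs]

    def range_search(self, ref, start, end):
        """Return records on `ref` overlapping [start, end)."""
        recs = self._by_ref.get(ref, [])
        # Records at index >= hi start at or after `end`, so cannot overlap.
        hi = bisect.bisect_left(self._starts.get(ref, []), end)
        # Simple linear filter on the left; a real index would bound this too.
        return [r for r in recs[:hi] if r["end"] > start]

    def with_other_alignments(self, ref, start, end):
        """Range search, then one extra per-read lookup for every hit."""
        out = []
        for rec in self.range_search(ref, start, end):
            others = [o["id"] for o in self._by_read[rec["read_id"]]
                      if o["id"] != rec["id"]]
            out.append({**rec, "other_alignments": others})
        return out
```

The first method touches only neighbouring records; the second has to reach records on other references, which is the part that does not follow the on-disk ordering.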

lh3 (Member, Author) commented May 16, 2014

> to implement range search efficiently, I think the backend has to be indexed by linear alignment (since those are the things that have coordinates), and wants to store all LA's that are near each other in the genome near each other on disk. But then before returning results, the backend has to find all the related otherAlignments[] and read their data into the response. We can check with folks who have built these backends, but I'm worried those goals are in tension.

For an efficient implementation, we had better duplicate information. In SAM, the position of a read in a pair appears twice: in its own record and in its mate's record. This way we don't need to seek to the mate to get the basic mate information. Similarly, SAM has an SA tag that keeps the other alignments of the same read to avoid seeking. Sequence and quality are also partially duplicated for a chimeric alignment that consists of multiple linear alignments.
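
As a rough illustration of that duplication, the sketch below reads the mate position (the RNEXT and PNEXT columns) and the SA tag from a single SAM line, without touching any other record. The function name and the example line are made up for this sketch; the column order and the `SA:Z:rname,pos,strand,CIGAR,mapQ,NM;` layout follow the SAM specification.

```python
def parse_mate_and_sa(sam_line):
    """Extract mate position and SA-tag alignments from one SAM line."""
    fields = sam_line.rstrip("\n").split("\t")
    rname, pos = fields[2], int(fields[3])       # this record (1-based POS)
    rnext, pnext = fields[6], int(fields[7])     # duplicated mate info

    if rnext == "*":
        mate = None                              # mate information unavailable
    else:
        # "=" is shorthand for "same reference as this record".
        mate = (rname if rnext == "=" else rnext, pnext)

    others = []
    for tag in fields[11:]:                      # optional fields
        if not tag.startswith("SA:Z:"):
            continue
        for entry in tag[len("SA:Z:"):].split(";"):
            if not entry:                        # trailing ";" leaves an empty piece
                continue
            sa_rname, sa_pos, strand, cigar, mapq, nm = entry.split(",")
            others.append({
                "ref": sa_rname,
                "pos": int(sa_pos),
                "reverse": strand == "-",
                "cigar": cigar,
                "mapq": int(mapq),
                "nm": int(nm),
            })
    return {"self": (rname, pos), "mate": mate, "other_alignments": others}


# Made-up example record; SEQ and QUAL are "*" (not stored).
example = "\t".join([
    "r1", "97", "chr1", "100", "60", "50M50S", "=", "400", "350", "*", "*",
    "SA:Z:chr5,9000,-,50S50M,60,0;",
])
info = parse_mate_and_sa(example)
```

Everything in `info` comes from the one line, which is the point: no seek to the mate or to the other linear alignment is needed.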

The downside of this approach is obvious: it increases file/data size, can lead to inconsistencies across records, and only allows the duplicated information to be retrieved efficiently. I think we should not require all information to be present. An implementation may choose to fill in the fields it has available and set the rest to missing values. It is up to the user to decide what to do next.
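
A small sketch of what the inconsistency risk and the "missing values" policy could look like on the consumer side. The record layout (`name`, `read_number`, `ref`, `pos`, `mate_ref`, `mate_pos`) is hypothetical, and the check assumes exactly two reads per fragment, numbered 1 and 2, with one record per read; `None` stands for a field the implementation chose not to fill.

```python
def find_mate_inconsistencies(records):
    """Return (name, read_number) for records whose duplicated mate
    position disagrees with the mate's own record."""
    own = {(r["name"], r["read_number"]): r for r in records}
    problems = []
    for r in records:
        if r["mate_ref"] is None or r["mate_pos"] is None:
            continue                      # missing value: nothing to verify
        # Assumes reads in a pair are numbered 1 and 2.
        mate = own.get((r["name"], 3 - r["read_number"]))
        if mate is None:
            continue                      # mate record not in this batch
        if (mate["ref"], mate["pos"]) != (r["mate_ref"], r["mate_pos"]):
            problems.append((r["name"], r["read_number"]))
    return problems
```

A record with missing mate fields is simply skipped, so an implementation that fills in less can never be flagged as inconsistent; it just offers less to the user.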

Generally, there are no perfect solutions. One advantage of this PR is that it is very close to the current practice with the SAM/BAM format.

@massie massie closed this May 19, 2014
@massie massie deleted the flattenedAlignment branch May 19, 2014 19:39
dcolligan pushed a commit to dcolligan/ga4gh-schemas that referenced this pull request Jul 20, 2016
Make Maven enforce minimum Maven version (3.3.3) and Java version (JDK 1.8)
3 participants