Update Fragment type to support more general "read-bucketing" #54
laserson added a commit to laserson/bdg-formats that referenced this issue on May 5, 2015
What I think is an excellent idea is storing reads after a "groupby" operation has occurred, as the reads will likely always be analyzed as a group. Having multiple reads from a single fragment of DNA is one such use case, but there are others. Droplet-seq is one that I am interested in. Incorporating random barcodes is another.

Here is a summary of my proposal, based in part on @ilveroluca's work:

* Support a fastq-like object `Sequence`, though I don't think this is strictly necessary.
* Rename to `Bucket`, as it sounds more general to my ears.
* Move run-specific or instrument-specific metadata into separate objects, as they don't necessarily make sense as top-level objects.
* Remove `fragmentSize`, as it's specific to one use case and it's rather easily computable.
* Support for multiple types of grouped objects. What's the best way to deal with this? `union` somehow? I envision that we may add more types in the future that we'll want to persist as grouped objects. At the moment, there is just a set of arrays for the type of objects that could be grouped. This could be extended as we desire the ability to group other object types.
* Sequence and quality information from alignments should be retrieved from `AlignmentRecord`s.
* I don't think platform-specific information should be propagated through the entire chain of data types. Why don't we include it in `Genotype`, then? In my mind, any platform-specific analysis happens very early on, generally even before the fastq stage. Therefore, I've moved platform-specific metadata into the `Sequence` object.

Fixes bigdatagenomics#54.
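To make the proposal above concrete, here is a minimal Python sketch of the "one array per groupable type" idea. It is not the bdg-formats schema: the field names, the simplified `Sequence` and `AlignmentRecord` classes, and the helpers `bucket_by` and `fragment_size` are all hypothetical and exist only for illustration. It also shows why `fragmentSize` could plausibly be dropped: when the grouped alignments are present, the span can be derived from them.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List, Optional


@dataclass
class Sequence:
    """Hypothetical fastq-like record; platform metadata lives here."""
    name: str
    bases: str
    qualities: str
    platform: Optional[str] = None


@dataclass
class AlignmentRecord:
    """Hypothetical, heavily simplified stand-in for an aligned read."""
    read_name: str
    contig: str
    start: int  # 0-based, inclusive
    end: int    # 0-based, exclusive


@dataclass
class Bucket:
    """Records sharing one bucketing key, with one array per grouped type."""
    key: str
    sequences: List[Sequence] = field(default_factory=list)
    alignments: List[AlignmentRecord] = field(default_factory=list)


def bucket_by(
    sequences: Iterable[Sequence],
    alignments: Iterable[AlignmentRecord],
    seq_key: Callable[[Sequence], str],
    aln_key: Callable[[AlignmentRecord], str],
) -> Dict[str, Bucket]:
    """Group records by an arbitrary key (fragment name, barcode, droplet)."""
    buckets: Dict[str, Bucket] = {}
    for seq in sequences:
        key = seq_key(seq)
        buckets.setdefault(key, Bucket(key)).sequences.append(seq)
    for aln in alignments:
        key = aln_key(aln)
        buckets.setdefault(key, Bucket(key)).alignments.append(aln)
    return buckets


def fragment_size(bucket: Bucket) -> Optional[int]:
    """Derive the span covered by a bucket's alignments.

    Returns None when there are no alignments or they sit on
    different contigs, where a single span is not meaningful.
    """
    if not bucket.alignments:
        return None
    if len({aln.contig for aln in bucket.alignments}) != 1:
        return None
    leftmost = min(aln.start for aln in bucket.alignments)
    rightmost = max(aln.end for aln in bucket.alignments)
    return rightmost - leftmost


# Usage: bucket a read pair by fragment name (the part before the "/").
reads = [
    Sequence("frag1/1", "ACGT", "IIII", platform="ILLUMINA"),
    Sequence("frag1/2", "TTGA", "IIII", platform="ILLUMINA"),
]
alns = [
    AlignmentRecord("frag1/1", "chr1", 100, 150),
    AlignmentRecord("frag1/2", "chr1", 300, 350),
]
buckets = bucket_by(
    reads,
    alns,
    seq_key=lambda s: s.name.split("/")[0],
    aln_key=lambda a: a.read_name.split("/")[0],
)
```

Swapping the key functions for ones that extract a random barcode or droplet identifier would give the other groupings mentioned above without changing the `Bucket` shape. Adding a new groupable type in this sketch means adding another list field, which is the extension cost the `union` question is asking about.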
laserson added a commit to laserson/bdg-formats that referenced this issue on Sep 14, 2015
Closing as WontFix.
To support things like random barcodes, drop-seq, etc.
Followup to #44.