Update Fragment type to support more general "read-bucketing" #54

laserson · 2015-04-13T16:55:44Z

To support things like random barcodes, drop-seq, etc.

Followup to #44.

@ilveroluca

What I think is an excellent idea is storing reads after a "groupby" operation has occurred, as the reads will likely always be analyzed as a group. Having multiple reads from a single fragment of DNA is one such use case, but there are others. Droplet-seq is one that I am interested in. Incorporating random barcodes is another.

Here is a summary of my proposal, based in part on @ilveroluca's work:

* Support a fastq-like object `Sequence`, though I don't think this is strictly necessary.
* Rename to `Bucket`, as it sounds more general to my ears.
* Move run-specific or instrument-specific metadata into separate objects, as they don't necessarily make sense as top-level objects.
* Remove `fragmentSize`, as it's specific to one use case and it's rather easily computable.
* Support for multiple types of grouped objects. What's the best way to deal with this? `union` somehow? I envision that we may add more types in the future that we'll want to persist as grouped objects. At the moment, there is just a set of arrays for the type of objects that could be grouped. This could be extended as we desire the ability to group other object types.
* Sequence and quality information from alignments should be retrieved from `AlignmentRecord`s.
* I don't think platform-specific information should be propagated through the entire chain of data types. Why don't we include it in `Genotype`, then? In my mind, any platform-specific analysis happens very early on, generally even before the fastq stage. Therefore, I've moved platform-specific metadata into the `Sequence` object.

Fixes bigdatagenomics#54.
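To make the proposal concrete, here is a minimal Python sketch of the "groupby, then persist the group" idea. It is not the bdg-formats schema: the class layouts, every field name (`platform_metadata`, `sequences`, `alignments`, `key`), the `bucket_sequences` helper, and the `barcode` metadata key are assumptions made up for illustration. It only shows a `Bucket` holding one array per groupable object type, with platform-specific metadata kept on `Sequence`.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, Iterable, List


@dataclass
class Sequence:
    """Hypothetical fastq-like record; field names are assumptions."""
    name: str
    bases: str
    qualities: str
    # Platform-specific metadata stays here instead of propagating downstream.
    platform_metadata: Dict[str, str] = field(default_factory=dict)


@dataclass
class AlignmentRecord:
    """Hypothetical stand-in, not the real AlignmentRecord schema."""
    read_name: str
    contig: str
    start: int
    bases: str
    qualities: str


@dataclass
class Bucket:
    """A persisted group: one array per type of object that can be grouped."""
    key: str
    sequences: List[Sequence] = field(default_factory=list)
    alignments: List[AlignmentRecord] = field(default_factory=list)


def bucket_sequences(
    reads: Iterable[Sequence], key_fn: Callable[[Sequence], str]
) -> Dict[str, Bucket]:
    """Group reads into buckets using a caller-supplied key function."""
    buckets: Dict[str, Bucket] = {}
    for read in reads:
        key = key_fn(read)
        if key not in buckets:
            buckets[key] = Bucket(key=key)
        buckets[key].sequences.append(read)
    return buckets


# Example: group by a random barcode stored in the (assumed) metadata map.
reads = [
    Sequence("r1", "ACGT", "IIII", {"barcode": "AAAC"}),
    Sequence("r2", "TTGA", "IIII", {"barcode": "GGTT"}),
    Sequence("r3", "ACGA", "IIII", {"barcode": "AAAC"}),
]
buckets = bucket_sequences(reads, lambda r: r.platform_metadata["barcode"])
```

Because the grouping key is just a function, the same structure would cover fragments (key on read name), droplets, or random barcodes. Adding another groupable type in this sketch means adding another list field to `Bucket`, which mirrors the "set of arrays" approach described above rather than a `union`.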

heuermh · 2019-07-02T04:42:06Z

Closing as WontFix

laserson mentioned this issue May 5, 2015

[BDG-FORMATS-54] Generalizing the Fragment type #56

Closed

heuermh mentioned this issue May 27, 2016

Add sequence, slice, and read schema #83

Merged

heuermh added this to the 0.14.0 milestone Jul 2, 2019

heuermh closed this as completed Jul 2, 2019
