Continuous Data #769
@ejacox @david4096 please add the key points from your slack discussion to this issue. |
Thanks for rebooting this effort @ejacox! Being able to efficiently interchange wiggle data has been a desired feature for a while!
Adding new messages to the data model
While trying to sidestep the issue of how to name it, I believe that we can interchange "continuous-type" data using the current Although this speaks to the flexibility of the data model, there are two problems with this approach. The first is that there is in some cases an order of magnitude (or more) increase in the transfer size for high resolution data, since we are repeating the The second problem is that these data often include missing values; for example, coverage counts of the exome will not present values for intergenic regions. Empty values and missing values may have different meanings in the data model, and returning a feature that says no value has been set is confusing.
From a non-technical perspective, the Feature message derives from the sequence ontology for representing the type of a feature. Wiggles are usually results of analyses performed on genomic data, so calling a wiggle Now to your questions. First, I think that merging this with sequence annotations seems appropriate.
Naming and continuity
Second, although I understand the distaste for "wiggle", I don't rush to the term "continuous". Continuous, to me, should be reserved for truly continuous values. Say, for example, I wanted to model the horizontal charge distribution along the genome to predict coiling patterns. In that case, I would specify a continuous function across the length of the genome that has values at every point. Although I don't expect these analytical functions to be present in the API, I would need to be able to construct interbase requests to properly model them. In a mathematical sense, Wiggle files are always discrete and usually discontinuous.
@ejacox offered the heuristic of thinking of Wiggles as streams of values in genome space. This is a useful way of thinking and leads me to the idea of a trace. Genomic data can be analyzed into a stream of scalars that define their own semantic meaning. A primary use case of wiggles is to describe the coverage of some genomic data: a stream of counts made in genome space. In this way, I can think of the wiggle as a trace of that stream of counts. Requesting a range replays the stream of counts as data, and so these wiggles are traces of streams.
Trace Set
Although it doesn't roll off the tongue, I'll propose With that in mind I'd like to consider how to associate Trace Sets with other message types. Standing alone, a trace set loses its semantic meaning and it's unclear from the outside how it relates to the remainder of the API. For example, if I upload a trace set that describes the coverage of my read group set, this relationship could be made more clear to the client. Although trace sets are not necessarily derived data, providing some optional fields for linking them to other types seems important.
Open questions
|
Thank you @david4096 for adding the comments. Some of your other comments: " It seems to me that to get the right interchange fidelity you need the |
If step is 2 and span is 1, then every other base is sampled. Step and span aren't part of the search range. Continuous isn't just chip-seq. Isn't the info field for experimental details? |
We can definitely just put anything else into the If there is a semantic relationship between Traces and some other data type present in the API, we should make that as clear and useful as possible. |
This is the latest. I removed step and span in favor of just values with NaN for unsampled values. I also added reference_name and continuous_set_id, which just echo back what was in the query. ContinuousSet did not change. It is still just a copy of FeatureSet. I created the service with: I added them to sequence_annotation protos, though they are completely separate. I also implemented it in the server. I did not copy all of the FeatureSet tests yet, as these are copies of existing code. I will add them once the schema is established. Various points:
|
|
Thanks for the comments @david4096.
|
For mango related interest @akmorrow13 |
@maximilianh would you mind taking a few minutes to comment on our approach to adding BigWig data support to the GA4GH schemas? |
Looks good, though I'm not always sure what the final result after the discussion above is.
Naming: GBrowse calls wiggle "xyplot". Ensembl calls it continuous data. http://uswest.ensembl.org/info/website/upload/wig.html
Spec: Can your format transmit the wiggle "variableStep" case? variableStep chrom=chr21 span=5 |
Thank you for your comments @maximilianh. The final format is just a starting position and an array of values (plus the reference name and the dataset id). The format can transmit any wiggle variant: step and span are converted to array values. So, at positions 9411191 to 9411195 there would be 50. Any gaps are filled with NaN. We might add span or step into the format later. |
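As a rough illustration of the conversion described above, here is a minimal Python sketch of how variableStep records with a span might be expanded into one starting position plus a NaN-padded array. The function name `expand_variable_step` and the second record in the usage example are hypothetical, and the sketch keeps positions exactly as given (it does not convert between wiggle's 1-based coordinates and the 0-based coordinates used in the draft schema).

```python
import math


def expand_variable_step(records, span=1):
    """Expand (position, value) records into a single dense array.

    records: list of (position, value) pairs, sorted by position.
    span: number of consecutive bases each value covers.
    Returns (start, values), where values[i] is the value at base
    start + i and unsampled bases hold NaN.
    """
    if not records:
        return None, []
    start = records[0][0]
    end = max(pos + span for pos, _ in records)
    values = [math.nan] * (end - start)
    for pos, value in records:
        for offset in range(span):
            values[pos - start + offset] = value
    return start, values


# First record matches the example in this thread; the second is made up
# to show a gap being filled with NaN.
start, values = expand_variable_step([(9411191, 50.0), (9411201, 40.0)], span=5)
```

With these inputs, the first five entries are 50, the next five are NaN, and the last five are 40.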
There can be multiple starting positions per reference, right? Some wiggle
files are very sparse and cover only the exons, so having one starting
position per exon would save many NaNs.
|
Indeed. We are working with the assumption that most queries will be for shorter regions. We can look at performance and add compression (e.g. span and step) later as needed. Another possibility is to send a series of messages with arrays covering only the non-NaN values. This can be handled by the proposed message format already. It was kept simple so that we can move forward with actually using it, with the idea of expanding the format for better performance later after we have thought through any issues. |
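The "series of messages" idea could look something like the following Python sketch, which splits one NaN-padded array into chunks that each carry their own start position and contain no NaN. The helper name `split_into_runs` and the tuple representation of a message are assumptions for illustration, not part of the proposed schema.

```python
import math


def split_into_runs(start, values):
    """Split a NaN-padded array into (start, values) chunks without NaN.

    Each chunk covers one contiguous run of sampled bases, so sparse data
    (for example, exon-only coverage) is sent without the padding.
    """
    chunks = []
    current = None
    for offset, value in enumerate(values):
        if math.isnan(value):
            current = None  # a gap ends the current run
        else:
            if current is None:
                current = (start + offset, [])
                chunks.append(current)
            current[1].append(value)
    return chunks


chunks = split_into_runs(100, [1.0, 2.0, math.nan, math.nan, 3.0])
```

Here `chunks` holds two runs, one starting at 100 and one starting at 104.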
@dzerbino please review |
Hello all,
This looks mainly reasonable, and I agree with the way most threads of conversation converged. I think the Continuous message could do with an 'end' variable. The values array could then be mapped uniformly over a region, even if length(array) != (end - start). This scaled projection would be very useful if the client (typically a web browser) does not want full-resolution data. It would therefore want to request the signal over, say, chrom1:0-1,000,000 but in only 1000 bins, because there are only so many pixels on the screen. This ability to request a desired resolution is a key element of BigWig's success, and should be embedded in the SearchContinuousRequest function.
HTH, Daniel |
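To make the "mapped uniformly over a region" idea concrete, here is a small Python sketch of how a client might recover the genomic interval behind each array entry when length(array) != (end - start). The name `bin_intervals` is hypothetical, and rounding bin boundaries down to integer positions is an assumption; the thread does not specify how uneven divisions would be handled.

```python
def bin_intervals(start, end, num_values):
    """Return the half-open (bin_start, bin_end) interval for each value.

    The region [start, end) is divided as evenly as integer positions
    allow into num_values consecutive bins.
    """
    width = end - start
    return [
        (start + i * width // num_values, start + (i + 1) * width // num_values)
        for i in range(num_values)
    ]


# The request from the comment above: 1,000,000 bases shown in 1000 bins.
bins = bin_intervals(0, 1_000_000, 1000)
```

In this case every bin is 1000 bases wide, with the first covering 0 to 1000 and the last covering 999000 to 1000000.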
Thanks for the comment @dzerbino. Would another variable called window_size or step be more explicit? How would you do the binning? Average value in the bin? |
window_size would be a good explicit parameter, as step had a slightly
different meaning in BigWig-speak (in BigWig, step and span are not
necessarily equal, so you could define non-contiguous bins).
I would want one of (or all of) (mean, standard deviation, max, min) per
bin. These can all be extracted on the command line using the Kent code in
/kent/src/utils/bigWigSummary/.
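A minimal Python sketch of per-bin summaries with a window_size parameter follows. It is not the Kent implementation: the function name `summarize_bins`, skipping NaN (unsampled) bases, returning None for an empty bin, and using the population standard deviation are all assumptions made for illustration.

```python
import math
import statistics


def summarize_bins(values, window_size):
    """Return one summary per window of `window_size` bases.

    Each summary is a dict with mean, std, min and max of the sampled
    (non-NaN) values in the window, or None if the window has none.
    """
    summaries = []
    for i in range(0, len(values), window_size):
        sampled = [v for v in values[i:i + window_size] if not math.isnan(v)]
        if not sampled:
            summaries.append(None)
            continue
        summaries.append({
            "mean": statistics.fmean(sampled),
            "std": statistics.pstdev(sampled),  # population std (assumption)
            "min": min(sampled),
            "max": max(sampled),
        })
    return summaries


summaries = summarize_bins([1.0, 3.0, math.nan, math.nan, 2.0], window_size=2)
```

The last window is shorter than window_size when the array length is not a multiple of it; a real service would need to decide whether that is acceptable.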
|
This is a great idea. I would have said exactly the same thing, only didn't
want to add too much to your plate. But Daniel is entirely right, without
summary levels the API cannot be used for browsing.
|
Thanks @maximilianh and @dzerbino, your points are well taken! Practically speaking, having control over zoom level is important, and we would like to make the BigWig tools useful via the API. We would like to work from the basis @ejacox has provided and begin to evaluate "zoom" features. As noted by @dzerbino this schema may work nicely for some projections by simply adding an The general problem, how to deal with sparse data in a region, is one that has in part been solved in mango with the addition of a binned coverage count. Since BigWig values are real-valued, the analogy might not hold, but I believe that more modeling will result in an understandable and useful access pattern that allows us to bin data across the API. #739 Please add your further suggestions about zoom levels and summary statistics for Continuous Data here so that we might close this issue by merging #802. |
This issue was created in order to define a continuous data message type.
Continuous, or signal data, associates a value with sequence base pairs. This can be raw experimental data, such as ChIP-Seq data, or calculated values, such as conservation scores. There can be gaps (unsampled bases) in the values, so more than a simple array of values with a start point is needed. Such data is commonly stored in wiggle, bigwig, and bedgraph formats. In addition to just listing values, the formats can do some simple compression: either 'step', which can more compactly represent regularly sampled data, or 'span', which can more compactly represent windowed data.
There are at least a couple of points to discuss:
From PR #591
"The name was chosen because SequenceAnnotations will include more than Features; The second data type is continuous (wiggle) was not include in this PR to speed implementation."
From this discussion, it seems that it should be a part of the sequence_annotations. However, most references to "sequence annotations" refer to genomic annotations, i.e. those associated with genes, which does not apply here. The continuous data could be its own service in that case.
Wiggle doesn't seem appropriate. 'Continuous' has been used as a placeholder.
A draft schema follows:
`// This protocol defines a format for exchanging continuous-valued signal data,
// such as those produced experimentally (e.g. ChIP-Seq data) or through
// calculations (e.g. conservation scores). It can be used, for example,
// to share data stored in Wiggle, BigWig, and BedGraph formats.
//
// Only bases with a signal are represented. Gaps in the values indicate bases
// with no signal.
//
// Step and span from the wiggle format are used for a simple compression.

// A chunk of continuous data. Due to gaps in the signal, the values
// cannot be represented by a single array, but require a set of arrays.
message Continuous {
  // The start position at which this signal occurs (0-based). This
  // corresponds to the first base of the string of reference bases. Genomic
  // positions are non-negative integers less than reference length.
  int64 start = 1;

  // If not one, values are defined for every 'step' base, leaving
  // gaps of undefined regions.
  int32 step = 2;

  // The number of bases each value spans. For example if span is 5,
  // then the first value is defined over start plus the next 4 bases, the
  // second value over the following 5 bases, ...
  int32 span = 3;

  // The data values.
  repeated double values = 4;
}

// A set of continuous data.
message ContinuousSet {
  // The ID of this continuous data set.
  string id = 1;

  // The ID of the dataset this set belongs to.
  string dataset_id = 2;

  // The ID of the reference set which defines the coordinate-space for this
  // set of data.
  string reference_set_id = 3;

  // The display name for this dataset.
  string name = 4;

  // The source URI describing the file from which this set was
  // generated, if any.
  string source_uri = 5;

  // Remaining structured metadata key-value pairs.
  map<string, google.protobuf.ListValue> info = 6;
}`
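For readers trying to picture how a client would decode the draft Continuous message, here is a small Python sketch. It assumes the wiggle fixedStep interpretation, where value i begins at start + i * step and covers span bases; the draft comments do not pin this down precisely, so treat it as one possible reading. The dataclass merely mirrors the draft fields and is not generated protobuf code, and `covered_intervals` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from typing import List, Tuple


@dataclass
class Continuous:
    """Plain-Python stand-in for the draft Continuous message."""
    start: int
    step: int = 1
    span: int = 1
    values: List[float] = field(default_factory=list)


def covered_intervals(msg: Continuous) -> List[Tuple[int, int, float]]:
    """Return half-open (begin, end, value) intervals, one per value.

    Assumes value i begins at start + i * step and covers `span` bases.
    """
    return [
        (msg.start + i * msg.step, msg.start + i * msg.step + msg.span, value)
        for i, value in enumerate(msg.values)
    ]


# step=2, span=1: every other base is sampled.
msg = Continuous(start=10, step=2, span=1, values=[0.5, 0.7, 0.9])
intervals = covered_intervals(msg)
```

With step equal to span the intervals tile the region without gaps, which matches the span example given in the schema comments.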