time varying bins notes
This wiki page is for discussion of https://github.com/hapi-server/data-specification/issues/71
agenda
- summary of the issues
- make sure we all understand the 4 use cases
- connect each use case with real-world problems to be solved
- decide which use cases to try and implement (if no real-world case connects, then why solve it?)
- talk about implementation ideas - staying consistent with existing conventions, as in CDFs, etc.
- explanation of $ref in JSON
- set up time for virtual hack-a-thon for implementations (this is likely to be a smaller subset of people)
discussion points:
5th use case: ability to handle changing number of bins; spec really only handles fixed number of bins; need impl. note to explain how to work around this using FILL for bin ranges to indicate absence of bin (given that the max number of bins was specified initially)
about option 3: a value in the data stream is replaced by a constant value in the header.
This is an optimization that allows bin ranges that don't change very often to be indicated as a non-varying, time varying parameter; it has lots of implications (for caching, etc) and the re-sending of the same bin ranges might not be too bad (because of automated server-side compression), so perhaps we can deal with this one later, after talking with Jeremy on implications
consider a setting that could be sent to the server to indicate whether the server should do all de-referencing; the default would be to do all de-refs on the server, and clients could indicate if they want the references left as is; a more sophisticated server could indicate in its capabilities whether it can take a request to leave references as is (with no dereferencing)
use case 4 - this would allow linking uncertainties to their value parameters; not likely to collide with solution of 1-3, so also deal with this one later; other possible linkage - quality flags; linking these together would be beneficial so that when subsetting, the support variables would get subsetted in the same way
use cases 1 and 2 are the focus; 3 is questionable; 4 is deferred
implementation ideas: use case 1: $ref mechanism in JSON is sufficient, possibly with some arrangement constraints to keep un-dereferenced JSON headers clean and/or uniform; emphasize that it is just for header substitution, not assembly of complex structures / objects with linkages, etc.
use case 2: might need special syntax to capture how to associate parameter values with time-varying bin values, but we should stay close to or reuse existing $ref syntax as much as possible - don't re-invent unnecessarily!
The most pressing issue is case 2, and this is not solvable with $ref techniques alone. There are implications for the client data model, that now must manage time-varying bins.
We will have a follow-up virtual hack-a-thon session on Thursday, August 15, from 11am to 1pm, Eastern.
The focus of the hack-a-thon session will be to discuss implementation options for supporting use cases 1 and 2.
- use case 1: create a way to re-use values in the header via some kind of reference mechanism
- use case 2: allow for the bin centers and/or ranges to be time varying and therefore not be present in the header, but have the header indicate which parameter columns to use for determining the bin ranges and/or centers at each time step
- use case 3 was deferred. This involves allowing what Bob called "ghost parameters" that appear in the header as constants, and are assumed to be present on every row, but are not repeated in the stream since they are constant over the time range requested. This is an optimization, and could be added later.
- use case 4 seems separate and is also deferred -- this option would allow uncertainty parameters to be explicitly linked to their value parameters
The reference mechanism for use case 1 should re-use as fully as possible the $ref mechanism already possible for JSON content. Whether the HAPI spec needs to constrain this $ref capability with additional guidelines is something to be determined. Perhaps we require all values to be referenced to appear inside a "constants" block in the header just to keep the reference values from appearing in random places.
Please come to the hack-a-thon with existing, preliminary proposals for what to do -- starting points for our discussion. We will explore the suggested implementations and hopefully come up with something we all can agree on moving forward.
There are 3 types of references:
- A header value is a reference to another element's value in the header
- A header value is a reference to one or more columns in the data response
- One or more columns in data response are implied by a value in the header
As noted in the type 2 discussion below, the previously discussed type 4 should be considered a part of type 2, and the type 5 mentioned in the telecon notes above ("5th use case") is not independent of type 2.
1. A header value is a reference to another element's value in the header
Motivation: One or more parameters have bin ranges that are identical to another parameter. Referencing can reduce the metadata size, simplify its maintenance, and reduce the "diff" size when metadata changes.
Proposal:
Use a constrained version of JSON schema instead of developing our own referencing syntax. Ordinary JSON schema parsers will work with our syntax, since our syntax is a subset of regular JSON. However, the constraints are such that it would not be hard for us to write our own parser. Also, if we later end up using more of the complex features of JSON reference syntax, a custom parser would be harder, but since we plan to stick with standard JSON syntax, we could then use regular JSON schema parsers to handle the complex features.
Use the $ref notation but constrain:
1. anything referenced must appear in a node called definitions,
2. the definitions node may not contain references,
3. do not allow referencing by id, and
4. size may not be a reference.
By default, a server resolves these references unless the request includes resolve=false.
Note the following code block is not valid JSON because it includes comments and duplicate elements. A valid version is given below this code block.
{
"HAPI": "3.0",
"status": {"code": 1200, "message": "OK"},
"startDate": "2016-01-01T00:00:00.000Z",
"stopDate": "2016-01-31T24:00:00.000Z",
"definitions": {
"spectrum_units": "particles/(sec ster cm^2 keV)",
"spectrum_centers": [ 15, 25, 35 ],
"spectrum_bins": {
"$id": "spectrum_bins_id", // Not allowed by constraint 3.
"name": "energy",
"units": "keV",
"centers": [ 15, 25, 35 ],
"centers": {"$ref": "#/definitions/centers"} // Not allowed by constraint 2.
}
},
"parameters":
[
{
"name": "Time",
"type": "isotime",
"units": "UTC",
"fill": null,
"length": 24
},
{
"name": "proton_spectrum",
"type": "double",
"size": [16],
"units": {"$ref": "#/definitions/spectrum_units"},
"fill": "-1e31",
"bins":
[
{
"name": "energy",
"units": "keV",
"centers": {"$ref": "#/definitions/spectrum_centers"}
}
]
},
{
"name": "proton_spectrum2",
"type": "double",
"size": [16],
"units": {"$ref": "#/definitions/spectrum_units"},
"fill": "-1e31",
"fill": {"$ref": "#/parameters/proton_spectrum/fill"}, // Not allowed by constraint 1.
"bins": [ {"$ref": "#/definitions/spectrum_bins"} ],
"bins": [ {"$ref": "#spectrum_bins_id"} ] // Not allowed by constraint 3.
}
]
}
Valid JSON for testing is given below. To test in Python, save it as a.json and then run
# Requires the third-party jsonref package
from pprint import pprint
from jsonref import JsonRef
import json

with open('a.json') as json_file:
    data = json.load(json_file)

pprint(data)  # header as written, with references
print("---")
pprint(JsonRef.replace_refs(data))  # header with references resolved
a.json
{
"HAPI": "3.0",
"status": {
"code": 1200,
"message": "OK"
},
"startDate": "2016-01-01T00:00:00.000Z",
"stopDate": "2016-01-31T24:00:00.000Z",
"definitions": {
"spectrum_units": "particles/(sec ster cm^2 keV)",
"spectrum_centers": [15, 25, 35],
"spectrum_bins": {
"$id": "spectrum_bins_id",
"name": "energy",
"units": "keV",
"centers": [15, 25, 35]
}
},
"parameters": [{
"name": "Time",
"type": "isotime",
"units": "UTC",
"fill": null,
"length": 24
},
{
"name": "proton_spectrum",
"type": "double",
"size": [16],
"units": {
"$ref": "#/definitions/spectrum_units"
},
"fill": "-1e31",
"bins": [{
"name": "energy",
"units": "keV",
"centers": {
"$ref": "#/definitions/spectrum_centers"
}
}]
},
{
"name": "proton_spectrum2",
"type": "double",
"size": [16],
"units": {
"$ref": "#/definitions/spectrum_units"
},
"bins": [{
"$ref": "#/definitions/spectrum_bins"
}]
}
]
}
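The proposal notes that the constraints make it feasible to write our own parser. A minimal sketch of such a resolver is given here, assuming the four constraints listed in the proposal; the function names `contains_ref` and `resolve_refs` and the small demo header are hypothetical and not part of any HAPI library.

```python
import copy

PREFIX = "#/definitions/"  # constraint 1: every reference points into definitions


def contains_ref(node):
    """Return True if a "$ref" key appears anywhere inside node."""
    if isinstance(node, dict):
        return "$ref" in node or any(contains_ref(v) for v in node.values())
    if isinstance(node, list):
        return any(contains_ref(v) for v in node)
    return False


def resolve_refs(header):
    """Return a copy of header with references replaced and definitions dropped."""
    definitions = header.get("definitions", {})
    if contains_ref(definitions):  # constraint 2
        raise ValueError("the definitions node may not contain references")

    def walk(node, key=None):
        if isinstance(node, dict):
            if "$ref" in node:
                target = node["$ref"]
                if key == "size":  # constraint 4
                    raise ValueError("size may not be a reference")
                # Constraints 1 and 3: only "#/definitions/NAME" is accepted,
                # so references by id or to other header nodes are rejected.
                if not isinstance(target, str) or not target.startswith(PREFIX):
                    raise ValueError(f"not a reference into definitions: {target!r}")
                name = target[len(PREFIX):]
                if name not in definitions:
                    raise ValueError(f"undefined reference: {target!r}")
                return copy.deepcopy(definitions[name])
            return {k: walk(v, k) for k, v in node.items()}
        if isinstance(node, list):
            return [walk(v, key) for v in node]
        return node

    return {k: walk(v, k) for k, v in header.items() if k != "definitions"}


# Hypothetical two-entry header used only to exercise the sketch.
header = {
    "definitions": {"spectrum_units": "keV", "spectrum_centers": [15, 25, 35]},
    "parameters": [{
        "name": "proton_spectrum",
        "size": [3],
        "units": {"$ref": "#/definitions/spectrum_units"},
        "bins": [{"name": "energy",
                  "centers": {"$ref": "#/definitions/spectrum_centers"}}],
    }],
}
resolved = resolve_refs(header)
print(resolved["parameters"][0]["bins"][0]["centers"])
```

Because nothing inside definitions may itself be a reference, a single pass is enough and no cycle detection is needed, which is the main simplification the constraints buy.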
2. A header value is a reference to one or more columns in the data response
Motivation: Bin centers and/or ranges vary with time. The HAPI 2 schema requires their values to be specified in the header. If their values change with time, one would either need to create new parameters (if they only change a few times) or specify nominal bin centers and/or ranges.
Note that previously there was a fourth type of reference "a value present in one variable's definition is actually another variable in the data". I have removed this as a type of reference we need to consider because it is not independent of this (2nd) type of reference.
A related issue is how to handle the case where the number of columns (the size) for a parameter changes with time. For example, an instrument switches between making measurements in 16 energy channels and 64 energy channels. In this case, the size of the variable is time-dependent. This is complicated and gets more complicated if we implement reference type 3.
Proposal:
Define a $paramref that refers to one or more columns. Use the #/ notation to emphasize that the reference is to a parameter in this dataset (in HTML/XML, #/ is used to refer to a location or section in the current document).
Constraints
- Only certain entities may use it (centers and ranges)
- A $paramref'ed parameter may not have a fill value
- A $paramref'ed parameter must exist in the dataset
Example:
{
"HAPI": "3.0",
"status": {"code": 1200, "message": "OK"},
"startDate": "2016-01-01T00:00:00.000Z",
"stopDate": "2016-01-31T24:00:00.000Z",
"parameters":
[
{
"name": "Time",
"type": "isotime",
"units": "UTC",
"fill": null,
"length": 24
},
{
"name": "proton_spectrum",
"type": "double",
"size": [16],
"units": "particles/(sec ster cm^2 keV)",
"fill": "-1e31",
"bins":
[
{
"name": "energy",
"units": "keV",
"centers": {"$paramref": "#/proton_spectrum_centers"},
"ranges": {"$paramref": "#/proton_spectrum_ranges"}
}
]
},
{
"name": "proton_spectrum_centers",
"type": "double",
"size": [16], // Must match product of elements in #/proton_spectrum/size
"units": "keV", // Must match the units of the energy bins of #/proton_spectrum
"fill": "-1e31" // Not allowed
},
{
"name": "proton_spectrum_ranges",
"type": "double",
"size": [32], // Must match 2 x (product of elements in #/proton_spectrum/size)
"units": "keV", // Must match the units of the energy bins of #/proton_spectrum
"fill": "-1e31" // Not allowed
}
]
}
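To make the client-side implication concrete, here is a rough sketch of how a client might look up the bin centers and ranges for one time step. It assumes the data record has already been parsed into a flat list of column values in header order; the helper names `column_slices` and `bins_at` and the two-channel demo data are hypothetical.

```python
from math import prod


def column_slices(parameters):
    """Map each parameter name to its slice of columns in a flattened record."""
    slices, start = {}, 0
    for p in parameters:
        width = prod(p.get("size", [1]))  # parameters without size use one column
        slices[p["name"]] = slice(start, start + width)
        start += width
    return slices


def bins_at(record, parameters, name):
    """Return the bins of parameter `name` with $paramref values filled in."""
    slices = column_slices(parameters)
    by_name = {p["name"]: p for p in parameters}
    result = []
    for b in by_name[name].get("bins", []):
        entry = dict(b)
        for key in ("centers", "ranges"):  # only these may use $paramref
            value = b.get(key)
            if isinstance(value, dict) and "$paramref" in value:
                ref = value["$paramref"]
                ref = ref[2:] if ref.startswith("#/") else ref
                if ref not in slices:
                    raise KeyError(f"$paramref to unknown parameter: {ref!r}")
                entry[key] = record[slices[ref]]
        result.append(entry)
    return result


# Hypothetical two-channel dataset and one parsed record.
parameters = [
    {"name": "Time", "type": "isotime", "length": 24},
    {"name": "proton_spectrum", "type": "double", "size": [2],
     "bins": [{"name": "energy", "units": "keV",
               "centers": {"$paramref": "#/proton_spectrum_centers"},
               "ranges": {"$paramref": "#/proton_spectrum_ranges"}}]},
    {"name": "proton_spectrum_centers", "type": "double", "size": [2]},
    {"name": "proton_spectrum_ranges", "type": "double", "size": [4]},
]
record = ["2016-01-01T00:00:00.000Z", 1.0, 2.0, 15.0, 25.0, 10.0, 20.0, 20.0, 30.0]
print(bins_at(record, parameters, "proton_spectrum"))
```

The ranges come back as a flat list because the notes do not yet fix how the low and high edges are ordered within the referenced columns; a real client would reshape them once that convention is settled.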
3. One or more columns in data response are implied by a value in the header
Motivation: An instrument has bin ranges that vary with time (but rarely change) so that reference mechanism 2. must be used to provide the bin ranges. For most data requests, the columns associated with the bin ranges will be time-invariant. To reduce data volume, the header could indicate that, for the selected time range, the bin ranges are constant and so the bin columns are not provided in the data response.
This is an optimization and has many implications to consider (for referencing proposed in 2., caching, client- and server-complexity). In addition, the benefit of the optimization needs to have been shown with real datasets.
I missed last week's meeting, so I met with Jon to go over things, and we went through Bob's document above. This captures that meeting.
Regarding (1) [https://github.com/hapi-server/data-specification/wiki/time-varying-bins-notes#1-a-header-value-is-a-reference-to-another-elements-value-in-the-header], this all looks good, but regarding (2) [https://github.com/hapi-server/data-specification/wiki/time-varying-bins-notes#2-a-header-value-is-a-reference-to-one-or-more-columns-in-the-data-response], $paramref should not be used because it confuses things with JSON macros. We control the name space where $paramref is used, so just "paramref" (or something similar) should be used for this case. Likewise its value need not be prefixed with #/, though I'm not sure what this refers to and there may be some reason for it. The assertion is that the value must match the name of another parameter, so I assume this would be a semantic check.
Regarding (3) [https://github.com/hapi-server/data-specification/wiki/time-varying-bins-notes#3-one-or-more-columns-in-data-response-are-implied-by-a-value-in-the-header], we spoke a bit about having a header which might be included in all responses, data with or without the header, which would indicate that constants should be used for what could be time-varying values. This would be of non-JSON syntax so that it could follow the info part of the data response.
I proposed "maxim 1" (maybe it should be "axiom 1") that the info response is non-time-varying. I also proposed "maxim 2" which is that the info part of the data response is a subset of the info response, for example describing just three of five parameters.
(A few more comments by Jon V): In the case of time-varying energy ranges, there is no need to use JSON reference syntax, since that risks confusing JSON parsers (or even human readers) since we are not asking JSON to be involved at all with understanding this syntax. If we prefix the "paramref" with a "$" then it looks like we are asking the JSON parser to make a special note or treat this value differently, when in fact, all the special treatment has to be after JSON is done and our code interprets the linkage defined by the "paramref."
In terms of allowing what Bob has called "ghost parameters" (present only in the header and having a constant value, so not present in the stream, since they would just be repeated each time), I agree with Jeremy's "axiom 1" that the full info header should really be a fixed entity that does not change with time. So if we want to completely disconnect the header references mechanism (and indeed the entire header) from any attempt to optimize the stream (by not sending values that are constant), this means that any "ghost parameter" mechanism would be only a property of the data stream. In other words, the data stream itself would need to have some kind of leading directive indicating the presence (and column position and value) of a non-time-varying parameter. This greatly simplifies many issues. It would be an optional server capability (can this server compress the stream by not repeating columns that are constant over time?), it keeps the info header as fixed for a dataset, and if data is requested without the header, all info needed to interpret that data is still present (since it's embedded as directives at the start of the data stream). Jeremy suggests that we not use JSON syntax for these directives, but the main requirement I think is that they be separate from (and certainly after) any info header that may be present with the data.
This is a lot of discussion for a feature for which we are deferring implementation.
all refs must be within a "definitions" block (otherwise a generic reference might be to a part of the header that is not present when a subset of parameters is requested)
by default, servers do the dereferencing (resolve_references=true); this is compatible with previous versions of HAPI that did not have refs in headers
To get the more complex info header that contains references: resolve_references=false
Presence and content of "definitions"
- if everything is already dereferenced, do not include the "definitions" block
- the definitions block should only contain references to entities that are used in the info request for all parameters. If a subset of parameters is requested, the definitions block can still contain the full set of definitions, although servers are allowed to reduce the definitions block to the minimum items needed for fulfilling all references. This implies that there are no user-defined entities allowed in the definitions block
- most items in the header can be replaced with references, except for "size" and "name."
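The option of reducing the definitions block for a parameter subset can be sketched as follows. This is only an illustration under the rules above: `used_definitions` and `subset_header` are hypothetical helper names, and the sketch assumes the first parameter is the time parameter, which is always kept.

```python
PREFIX = "#/definitions/"


def used_definitions(node):
    """Collect the names of all definitions referenced anywhere inside node."""
    names = set()
    if isinstance(node, dict):
        ref = node.get("$ref")
        if isinstance(ref, str) and ref.startswith(PREFIX):
            names.add(ref[len(PREFIX):])
        for value in node.values():
            names |= used_definitions(value)
    elif isinstance(node, list):
        for value in node:
            names |= used_definitions(value)
    return names


def subset_header(header, wanted):
    """Keep the requested parameters and only the definitions they reference."""
    params = [p for i, p in enumerate(header["parameters"])
              if i == 0 or p["name"] in wanted]  # index 0 assumed to be Time
    out = {k: v for k, v in header.items()
           if k not in ("parameters", "definitions")}
    out["parameters"] = params
    needed = used_definitions(params)
    definitions = {k: v for k, v in header.get("definitions", {}).items()
                   if k in needed}
    if definitions:  # omit the block entirely when nothing is referenced
        out["definitions"] = definitions
    return out


# Hypothetical header with two definitions, each used by one parameter.
header = {
    "HAPI": "3.0",
    "definitions": {"flux_units": "particles/(sec ster cm^2 keV)",
                    "angle_units": "degrees"},
    "parameters": [
        {"name": "Time", "type": "isotime", "length": 24},
        {"name": "flux", "units": {"$ref": "#/definitions/flux_units"}},
        {"name": "angle", "units": {"$ref": "#/definitions/angle_units"}},
    ],
}
print(subset_header(header, {"flux"})["definitions"])
```

Since definitions may not contain references themselves, the set of needed entries can be found by scanning the selected parameters once.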
Case 2.
constraints:
- parameter references may be used only for centers and ranges
- The centers and ranges pointed to by a parameter reference can contain FILL to indicate absence of a bin at a given time step
- parameter references must point to a parameter existing within the same dataset
Note: the size for the "ranges" must be [2,N] where N is the size of bins in the given dimension.
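As an illustration of the FILL convention for a changing number of bins, the sketch below picks out the bins that are present at one time step. It assumes the referenced centers parameter declares a fill value as a string, as other HAPI parameters do; the function name `active_bins` and the sample values are hypothetical.

```python
def active_bins(centers, fill):
    """Return the indices of bins whose center is not the fill value."""
    fill_value = float(fill)  # fill is given as a string in the header
    return [i for i, c in enumerate(centers) if c != fill_value]


# One time step of a parameter declared with 4 bins, only 2 of them in use.
centers = [15.0, 25.0, -1e31, -1e31]
values = [1.2, 3.4, -1e31, -1e31]

present = active_bins(centers, "-1e31")
print(present)
print([values[i] for i in present])
```

The same index list can be used to drop the unused columns of the data parameter and of the ranges for that time step.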
Do we want nominal values that can be static in the header? The risk of misuse is high if defaults are present. One option (to consider for the future): allow on any parameter
{
"name": "proton_spectrum",
"type": "double",
"size": [16,3],
"units": "particles/(sec ster cm^2 keV)",
"fill": "-1e31",
"bins":
[
{
"name": "energy",
"units": "keV",
"centers": "proton_d1_energy_centers",
"ranges": "proton_d1_energy_ranges"
},
{
"name": "PA",
"units": "degrees",
"centers": "proton_d2_pitchangle_centers",
"ranges": "proton_d2_pitchangle_ranges"
}
]
},
{
"name": "proton_d1_energy_centers",
"type": "double",
"size": [16], // Must match the first element of #/proton_spectrum/size
"units": "keV", // Must match the units of the energy bins of #/proton_spectrum
"fill": "-1e31" // Not allowed
},
{
"name": "proton_d1_energy_ranges",
"type": "double",
"size": [2,16], // Must be [2, N], where N is the first element of #/proton_spectrum/size
"units": "keV", // Must match the units of the energy bins of #/proton_spectrum
"fill": "-1e31" // Not allowed
},
{
"name": "proton_d2_pitchangle_centers",
"type": "double",
"size": [3], // Must match the second element of #/proton_spectrum/size
"units": "degrees", // Must match the units of the PA bins of #/proton_spectrum
"fill": "-1e31" // Not allowed
},
{
"name": "proton_d2_pitchangle_ranges",
"type": "double",
"size": [2,3], // Must be [2, N], where N is the second element of #/proton_spectrum/size
"units": "degrees", // Must match the units of the PA bins of #/proton_spectrum
"fill": "-1e31" // Not allowed
}
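The size and existence constraints for this string-valued form of parameter reference could be checked with something like the sketch below. It assumes that centers must have size [N] and ranges size [2, N], where N is the size of the corresponding dimension; `check_bin_references` and the short parameter names in the demo are hypothetical.

```python
def check_bin_references(parameters):
    """Return a list of problems found in string-valued centers/ranges references."""
    by_name = {p["name"]: p for p in parameters}
    problems = []
    for p in parameters:
        for dim, b in enumerate(p.get("bins", [])):
            n = p["size"][dim]  # number of bins in this dimension
            for key, expected in (("centers", [n]), ("ranges", [2, n])):
                ref = b.get(key)
                if not isinstance(ref, str):
                    continue  # literal values or absent: nothing to check
                target = by_name.get(ref)
                if target is None:
                    problems.append(
                        f"{p['name']}: {key} refers to missing parameter {ref!r}")
                elif target.get("size") != expected:
                    problems.append(
                        f"{p['name']}: {ref!r} has size {target.get('size')}, "
                        f"expected {expected}")
    return problems


# Hypothetical dataset; the last parameter name is deliberately misspelled.
parameters = [
    {"name": "spectrum", "size": [16, 3], "bins": [
        {"name": "energy", "centers": "e_centers", "ranges": "e_ranges"},
        {"name": "PA", "centers": "pa_centers", "ranges": "pa_ranges"},
    ]},
    {"name": "e_centers", "size": [16]},
    {"name": "e_ranges", "size": [2, 16]},
    {"name": "pa_centers", "size": [3]},
    {"name": "pa_rangez", "size": [2, 3]},
]
for problem in check_bin_references(parameters):
    print(problem)
```

A check like this is the kind of semantic validation that a JSON schema alone cannot express, since it relates one parameter's bins to another parameter's name and size.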