Decoupling data schema from data format #196

Closed
sffc opened this issue Aug 5, 2020 · 5 comments · Fixed by #405

Labels: A-design (Area: Architecture or design), C-data-infra (Component: provider, datagen, fallback, adapters), T-core (Type: Required functionality)

Comments


sffc commented Aug 5, 2020

To parse a JSON blob with Serde, one needs to know the data schema (the struct definition).

Currently, the data provider passes Rust structs (encoded as Anys) in the Response objects through the pipeline.

Together, these two statements mean that the source data provider, the one reading the JSON blob from the file system, needs to know ahead of time the mapping from JSON files to structs.

This is undesirable because:

  1. It requires some form of "key-to-struct" dispatch in the source data provider, whether a match statement, table lookup, etc. This dispatch needs to be maintained and could be a source of failures or performance bottlenecks.
  2. Having that dispatch means that a lot of struct parsing code needs to be carried in the source data provider, even if it is not being used downstream. This goes against the grain of having dead code elimination also eliminate unused data structs.

I've considered a few solutions:

  1. Make data providers return a blob instead of a parsed struct. Parsing of the blob would occur at the end of the chain when the type is known (see the sketch at the end of this comment).
    • We'd need to either declare that the data blobs are always in one specific format (e.g. bincode), or have the source data provider also pass a deserializer function.
    • @mihnita suggested a Content-Type approach where the requestor can say that they support JSON or Bincode, and the provider needs to provide a blob in one of those two formats.
  2. Send the struct definition along with the request, such that the struct can be used at the point where the data is loaded from the filesystem.
    • Essentially the opposite of option 1.
  3. Use a JSON Value or some other self-describing format instead of JSON through the full stack of the data provider.
    • We lose type safety and, more importantly, data validation.
    • @kpozin pointed out that this option is basically equivalent to putting the parsing code into application logic.
  4. Schema converter: pass around a blob until it needs to be parsed to a struct, and then pass around the struct (see Kubernetes apiserver)
    • Hybrid approach based on option 1 (thanks @filmil).
  5. Stick with the runtime struct dispatch living in the data provider.

I think it may be useful to pass around structs, because if we get to a point where we can pre-build data into *.rs files (#78), we'd like to pass those verbatim through the pipeline.
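
To make option 1 concrete, here is a rough sketch of a blob-carrying response with a Content-Type-style tag. Everything here is illustrative: DataFormat, BlobResponse, and the use of serde_json/bincode (1.x API) are assumptions, not actual ICU4X types.

use serde::de::DeserializeOwned;

// Hypothetical wire-format tag, playing the role of a Content-Type header.
enum DataFormat {
    Json,
    Bincode,
}

// Hypothetical response type: the provider only hands back bytes plus their format.
struct BlobResponse {
    format: DataFormat,
    blob: Vec<u8>,
}

impl BlobResponse {
    /// The sink, which knows the concrete struct, deserializes at the end of the
    /// chain. Error handling is collapsed to a String for brevity.
    fn deserialize<T: DeserializeOwned>(&self) -> Result<T, String> {
        match self.format {
            DataFormat::Json => serde_json::from_slice(&self.blob).map_err(|e| e.to_string()),
            DataFormat::Bincode => bincode::deserialize(&self.blob).map_err(|e| e.to_string()),
        }
    }
}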

@sffc sffc added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters discuss Discuss at a future ICU4X-SC meeting labels Aug 5, 2020
@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Aug 14, 2020
@sffc sffc self-assigned this Aug 14, 2020
@sffc sffc added the A-design Area: Architecture or design label Aug 19, 2020

sffc commented Aug 19, 2020

CC @markusicu @macchiati


sffc commented Aug 28, 2020

Adding additional comments from @Manishearth to this thread, originally shared by @zbraniecki in #198 (review):

manish: it seems okay, but might be nice to have a more performant set of APIs that aren't using trait objects
zibi: what would they use?
manish: yeah i'm also not sure if it's possible
but you could have a direct API that's get_response<T>(Request) -> Result<T>
load<T>(&self, req: &DataRequest) -> Result<T, DataError>
and use that to write one that uses trait objects
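
For illustration, a statically dispatched API along those lines might look roughly like the following; DataRequest and DataError are placeholder stand-ins here, not the real definitions:

use serde::de::DeserializeOwned;

// Placeholder types standing in for the crate's real request/error types.
pub struct DataRequest { pub key: String }
pub struct DataError { pub message: String }

/// A monomorphized entry point: the caller picks the concrete payload type T
/// at compile time, so no trait objects or downcasts are involved on this path.
pub trait TypedDataProvider {
    fn load_typed<T: DeserializeOwned>(&self, req: &DataRequest) -> Result<T, DataError>;
}

A trait with a generic method like this is not object-safe, so the trait-object API would be a separate layer written in terms of it, which is the second half of the suggestion quoted above.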


sffc commented Sep 4, 2020

@dtolnay, thanks for maintaining serde_json and erased_serde. Do you have any thoughts on the design puzzle outlined above?


dtolnay commented Sep 4, 2020

I don't completely follow the scenario and the existing setup of data providers and requests, since I don't know anything about this crate. Would you be able to put together a minimized compilable code snippet that shows the problem being solved and the traits/components involved?


sffc commented Sep 4, 2020

Thanks! I'll do my best at a minimal explanation. If you want more color, see data-pipeline.md.

The DataProvider trait is defined as:

pub trait DataProvider<'d> {
    /// Query the provider for data. Returns Ok if the request successfully loaded data. If data
    /// failed to load, returns an Error with more information.
    fn load<'a>(&'a self, req: &DataRequest) -> Result<DataResponse<'d>, Error>;
}

In other words, it's a pretty basic request-response pattern. The Request carries information on the resource to fetch, and the Response carries the resource. Right now, the resource is stored in the Response as a boxed (Cow'd) Any (actually CloneableAny, my superset of Any that adds a clone function and a few other things), and the caller gets the data out with the help of downcast-rs:

#[derive(Debug, Clone)]
pub struct DataResponse<'d> {
    payload: Cow<'d, dyn CloneableAny>,
}
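
For illustration, the sink-side downcast looks roughly like this in a much-simplified form, using plain std::any::Any and ignoring the Cow, lifetimes, and the CloneableAny/downcast-rs machinery:

use std::any::Any;

// Simplified stand-in for DataResponse with a type-erased payload.
struct SimplifiedResponse {
    payload: Box<dyn Any>,
}

impl SimplifiedResponse {
    /// The caller names the concrete struct it expects; if the provider stored
    /// a different type under this key, the downcast fails at runtime.
    fn take_payload<T: Any>(self) -> Result<Box<T>, Box<dyn Any>> {
        self.payload.downcast::<T>()
    }
}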

Multiple DataProviders can be chained together, each providing specific functionality like filtering, caching, routing, etc. Each chain has a source (the upstream data provider that ultimately receives and fulfills the request) and a sink (the downstream agent that initiated the request).
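
A pass-through adapter written against the DataProvider trait above gives the flavor of how a link in the chain composes (the wrapper type is hypothetical; a real adapter would filter, cache, or route rather than just forward):

/// Hypothetical adapter that wraps another provider and forwards requests.
struct ForwardingProvider<P> {
    inner: P,
}

impl<'d, P: DataProvider<'d>> DataProvider<'d> for ForwardingProvider<P> {
    fn load<'a>(&'a self, req: &DataRequest) -> Result<DataResponse<'d>, Error> {
        // Filtering, caching, or routing logic would go here.
        self.inner.load(req)
    }
}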

The problem is that the source knows the data format (e.g., JSON, Bincode, CBOR), while the sink knows the data structure (the thing implementing Serde Deserialize). Both of those pieces of information need to converge somewhere in order for Serde to do its job.

The options from the OP are to:

  1. Change Response to pass a blob instead of an Any, along with information on how to decode the blob (e.g., a Content-Type header).
  2. Change Request to pass the struct definition, probably in the form of a lambda function that takes an erased_serde::Deserializer and returns the boxed Any.
  3. Change Response to pass a schema-less type like JSON Value.
  4. Hybrid: Change Response to allow either a blob or an Any to be passed.
  5. Encode all possible struct definitions into the source data provider, such that it can map the Request to the correct deserializer function on the spot.

Does this make more sense?

I think I'm leaning toward option 2.
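
For option 2, the Request could carry roughly this kind of "receiver" function, sketched here with erased_serde; the shape of DataRequest and the helper names are hypothetical:

use std::any::Any;
use serde::de::DeserializeOwned;

// Hypothetical: a function that drives an erased deserializer and returns the
// concrete struct, type-erased again as a boxed Any.
type ReceiveFn = for<'de> fn(
    &mut dyn erased_serde::Deserializer<'de>,
) -> Result<Box<dyn Any>, erased_serde::Error>;

// Hypothetical request shape; the real DataRequest fields are omitted.
struct DataRequest {
    key: &'static str,
    receive: ReceiveFn,
}

// The sink instantiates this for its concrete type when it builds the request;
// the source provider later calls `receive` with whatever format it has
// (JSON, Bincode, ...), erased behind the trait object.
fn receive_struct<T: DeserializeOwned + Any>(
    deserializer: &mut dyn erased_serde::Deserializer<'_>,
) -> Result<Box<dyn Any>, erased_serde::Error> {
    let value: T = erased_serde::deserialize(deserializer)?;
    Ok(Box::new(value))
}

// Sink side, naming the concrete type (DecimalSymbolsV1 is a made-up struct):
//     DataRequest { key: "decimal/symbols", receive: receive_struct::<DecimalSymbolsV1> }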
