Decoupling data schema from data format #196

Closed
sffc opened this issue Aug 5, 2020 · 5 comments · Fixed by #405

Labels: A-design (Area: Architecture or design), C-data-infra (Component: provider, datagen, fallback, adapters), T-core (Type: Required functionality)

Comments


sffc commented Aug 5, 2020

To parse a JSON blob with Serde, one needs to know the data schema (the struct definition).

Currently, the data provider passes Rust structs (encoded as Anys) in the Response objects through the pipeline.

Together, these two statements mean that the source data provider, the one reading the JSON blob from the file system, needs to know ahead of time the mapping from JSON files to structs.

This is undesirable because:

  1. It requires some form of "key-to-struct" dispatch in the source data provider, whether a match statement, table lookup, etc. This dispatch needs to be maintained and could be a source of failures or performance bottlenecks.
  2. Having that dispatch means that a lot of struct parsing code needs to be carried in the source data provider, even if it is not being used downstream. This goes against the grain of having dead code elimination also eliminate unused data structs.

I've considered a few solutions:

  1. Make data providers return a blob instead of a parsed struct. Parsing of the blob would occur at the end of the chain when the type is known (see the sketch at the end of this comment).
    • We'd need to either declare that the data blobs are always in one specific format (e.g. bincode), or have the source data provider also pass a deserializer function.
    • @mihnita suggested a Content-Type approach where the requestor can say that they support JSON or Bincode, and the provider needs to provide a blob in one of those two formats.
  2. Send the struct definition along with the request, such that the struct can be used at the point where the data is loaded from the filesystem.
    • Essentially the opposite of option 1.
  3. Use a JSON Value or some other self-describing format instead of JSON through the full stack of the data provider.
    • We lose type safety and, more importantly, data validation.
    • @kpozin pointed out that this option is basically equivalent to putting the parsing code into application logic.
  4. Schema converter: pass around a blob until it needs to be parsed to a struct, and then pass around the struct (see Kubernetes apiserver)
    • Hybrid approach based on option 1 (thanks @filmil).
  5. Stick with the runtime struct dispatch living in the data provider.

I think it may be useful to pass around structs, because if we get to a point where we can pre-build data into *.rs files (#78), we'd like to pass those verbatim through the pipeline.
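
To make option 1 concrete, here is a rough sketch of a blob-carrying response with a Content-Type-style tag. Everything here is illustrative: DataFormat, BlobResponse, and the use of serde_json/bincode (1.x API) are assumptions, not actual ICU4X types.

use serde::de::DeserializeOwned;

// Hypothetical wire-format tag, playing the role of a Content-Type header.
enum DataFormat {
    Json,
    Bincode,
}

// Hypothetical response type: the provider only hands back bytes plus their format.
struct BlobResponse {
    format: DataFormat,
    blob: Vec<u8>,
}

impl BlobResponse {
    /// The sink, which knows the concrete struct, deserializes at the end of the
    /// chain. Error handling is collapsed to a String for brevity.
    fn deserialize<T: DeserializeOwned>(&self) -> Result<T, String> {
        match self.format {
            DataFormat::Json => serde_json::from_slice(&self.blob).map_err(|e| e.to_string()),
            DataFormat::Bincode => bincode::deserialize(&self.blob).map_err(|e| e.to_string()),
        }
    }
}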

@sffc sffc added T-core Type: Required functionality C-data-infra Component: provider, datagen, fallback, adapters discuss Discuss at a future ICU4X-SC meeting labels Aug 5, 2020
@sffc sffc removed the discuss Discuss at a future ICU4X-SC meeting label Aug 14, 2020
@sffc sffc self-assigned this Aug 14, 2020
@sffc sffc added the A-design Area: Architecture or design label Aug 19, 2020

sffc commented Aug 19, 2020

CC @markusicu @macchiati


sffc commented Aug 28, 2020

Adding additional comments from @Manishearth to this thread, originally shared by @zbraniecki in #198 (review):

manish: it seems okay, but might be nice to have a more performant set of APIs that aren't using trait objects
zibi: what would they use?
manish: yeah i'm also not sure if it's possible
but you could have a direct API that's get_response<T>(Request) -> Result<T>
load<T>(&self, req: &DataRequest) -> Result<T, DataError>
and use that to write one that uses trait objects
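
For illustration, a statically dispatched API along those lines might look roughly like the following; DataRequest and DataError are placeholder stand-ins here, not the real definitions:

use serde::de::DeserializeOwned;

// Placeholder types standing in for the crate's real request/error types.
pub struct DataRequest { pub key: String }
pub struct DataError { pub message: String }

/// A monomorphized entry point: the caller picks the concrete payload type T
/// at compile time, so no trait objects or downcasts are involved on this path.
pub trait TypedDataProvider {
    fn load_typed<T: DeserializeOwned>(&self, req: &DataRequest) -> Result<T, DataError>;
}

A trait with a generic method like this is not object-safe, so the trait-object API would be a separate layer written in terms of it, which is the second half of the suggestion quoted above.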


sffc commented Sep 4, 2020

@dtolnay, thanks for maintaining serde_json and erased_serde. Do you have any thoughts on the design puzzle outlined above?


dtolnay commented Sep 4, 2020

I don't completely follow the scenario and the existing setup of data providers and requests, since I don't know anything about this crate. Would you be able to put together a minimized compilable code snippet that shows the problem being solved and the traits/components involved?


sffc commented Sep 4, 2020

Thanks! I'll do my best at a minimal explanation. If you want more color, see data-pipeline.md.

The DataProvider trait is defined as:

pub trait DataProvider<'d> {
    /// Query the provider for data. Returns Ok if the request successfully loaded data. If data
    /// failed to load, returns an Error with more information.
    fn load<'a>(&'a self, req: &DataRequest) -> Result<DataResponse<'d>, Error>;
}

In other words, it's a pretty basic request-response pattern. The Request carries information on the resource to fetch, and the Response carries the resource. Right now, the resource is stored in the Response as a boxed (Cow'd) Any (actually CloneableAny, my superset of Any that adds a clone function and a few other things), and the caller gets the data out with the help of downcast-rs:

#[derive(Debug, Clone)]
pub struct DataResponse<'d> {
    payload: Cow<'d, dyn CloneableAny>,
}
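
For illustration, the sink-side downcast looks roughly like this in a much-simplified form, using plain std::any::Any and ignoring the Cow, lifetimes, and the CloneableAny/downcast-rs machinery:

use std::any::Any;

// Simplified stand-in for DataResponse with a type-erased payload.
struct SimplifiedResponse {
    payload: Box<dyn Any>,
}

impl SimplifiedResponse {
    /// The caller names the concrete struct it expects; if the provider stored
    /// a different type under this key, the downcast fails at runtime.
    fn take_payload<T: Any>(self) -> Result<Box<T>, Box<dyn Any>> {
        self.payload.downcast::<T>()
    }
}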

Multiple DataProviders can be chained together, each providing specific functionality like filtering, caching, routing, etc. Each chain has a source (the upstream data provider that ultimately receives and fulfills the request) and a sink (the downstream agent that initiated the request).
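
A pass-through adapter written against the DataProvider trait above gives the flavor of how a link in the chain composes (the wrapper type is hypothetical; a real adapter would filter, cache, or route rather than just forward):

/// Hypothetical adapter that wraps another provider and forwards requests.
struct ForwardingProvider<P> {
    inner: P,
}

impl<'d, P: DataProvider<'d>> DataProvider<'d> for ForwardingProvider<P> {
    fn load<'a>(&'a self, req: &DataRequest) -> Result<DataResponse<'d>, Error> {
        // Filtering, caching, or routing logic would go here.
        self.inner.load(req)
    }
}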

The problem is that the source knows the data format (e.g., JSON, Bincode, CBOR), while the sink knows the data structure (the thing implementing Serde Deserialize). Both of those pieces of information need to converge somewhere in order for Serde to do its job.

The options from the OP are to:

  1. Change Response to pass a blob instead of an Any, along with information on how to decode the blob (e.g., a Content-Type header).
  2. Change Request to pass the struct definition, probably in the form of a lambda function that takes an erased_serde::Deserializer and returns the boxed Any.
  3. Change Response to pass a schema-less type like JSON Value.
  4. Hybrid: Change Response to allow either a blob or an Any to be passed.
  5. Encode all possible struct definitions into the source data provider, such that it can map the Request to the correct deserializer function on the spot.

Does this make more sense?

I think I'm leaning toward option 2.
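
For option 2, the Request could carry roughly this kind of "receiver" function, sketched here with erased_serde; the shape of DataRequest and the helper names are hypothetical:

use std::any::Any;
use serde::de::DeserializeOwned;

// Hypothetical: a function that drives an erased deserializer and returns the
// concrete struct, type-erased again as a boxed Any.
type ReceiveFn = for<'de> fn(
    &mut dyn erased_serde::Deserializer<'de>,
) -> Result<Box<dyn Any>, erased_serde::Error>;

// Hypothetical request shape; the real DataRequest fields are omitted.
struct DataRequest {
    key: &'static str,
    receive: ReceiveFn,
}

// The sink instantiates this for its concrete type when it builds the request;
// the source provider later calls `receive` with whatever format it has
// (JSON, Bincode, ...), erased behind the trait object.
fn receive_struct<T: DeserializeOwned + Any>(
    deserializer: &mut dyn erased_serde::Deserializer<'_>,
) -> Result<Box<dyn Any>, erased_serde::Error> {
    let value: T = erased_serde::deserialize(deserializer)?;
    Ok(Box::new(value))
}

// Sink side, naming the concrete type (DecimalSymbolsV1 is a made-up struct):
//     DataRequest { key: "decimal/symbols", receive: receive_struct::<DecimalSymbolsV1> }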
