[RMP] Establish a metadata standard for serializing information about Merlin components #489

nv-alaiacano · 2022-07-29T16:13:53Z

Problem:

In Systems, we need to know information about the various ops that go into an Ensemble. The most important piece is the input/output schemas for the data and models that make up the ensemble.

We are able to infer some of this from, e.g., a TensorFlow model, but are not able to do so from more flexible frameworks like PyTorch or XGBoost.

Saving an NVTabular workflow produces a small amount of info in metadata.json, and I propose we expand that concept to record any expected metadata about models, nvt workflows, and other components that we expect to load in a Systems ensemble.

Goal:

Systems can load any model produced by Models and know the input/output schema information and any additional critical metadata TBD.
Systems can load any workflow produced by NVTabular and know the input/output schema information and any additional critical metadata TBD.

Constraints:

This information should be serialized to disk along with the artifact (model, workflow). We should not rely on any kind of service for keeping track of this information. (yet??)
The metadata format should be consistent, with some standard fields (library versions, input_schema, output_schema).
Input/output schemas should serialize to and deserialize from the Schema Python class.
It should also be extensible in order to provide optional information about the artifact.
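
To make the constraints above concrete, here is a minimal sketch of what such a metadata file could look like. It is only an illustration under stated assumptions: the `ArtifactMetadata` class, the `extra` field, and the list-of-dicts schema representation are hypothetical stand-ins, not the Merlin `Schema` class or any agreed format. Only the standard fields (library versions, input_schema, output_schema) and the `metadata.json` file name come from this issue.

```python
import json
from dataclasses import asdict, dataclass, field
from pathlib import Path
from typing import Any, Dict, List

# Hypothetical stand-in for a serialized schema: one dict per column.
# A real implementation would convert to and from the Schema class instead.
ColumnList = List[Dict[str, Any]]

REQUIRED_FIELDS = {"library_versions", "input_schema", "output_schema"}


@dataclass
class ArtifactMetadata:
    """Hypothetical metadata record stored on disk next to an artifact."""

    library_versions: Dict[str, str]
    input_schema: ColumnList
    output_schema: ColumnList
    # Optional, free-form information so the format stays extensible.
    extra: Dict[str, Any] = field(default_factory=dict)

    def save(self, artifact_dir) -> Path:
        """Write metadata.json into the artifact directory and return its path."""
        path = Path(artifact_dir) / "metadata.json"
        path.parent.mkdir(parents=True, exist_ok=True)
        path.write_text(json.dumps(asdict(self), indent=2))
        return path

    @classmethod
    def load(cls, artifact_dir) -> "ArtifactMetadata":
        """Read metadata.json back, failing loudly if a standard field is missing."""
        raw = json.loads((Path(artifact_dir) / "metadata.json").read_text())
        missing = REQUIRED_FIELDS - raw.keys()
        if missing:
            raise ValueError(f"metadata.json is missing fields: {sorted(missing)}")
        return cls(
            library_versions=raw["library_versions"],
            input_schema=raw["input_schema"],
            output_schema=raw["output_schema"],
            extra=raw.get("extra", {}),
        )
```

Because everything lives in a file beside the artifact, loading needs no external service, and anything beyond the standard fields can go under `extra` without breaking older readers.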

Starting Point:

Core repo:

Define a metadata format and schema that will be shared among all of the Merlin libraries (live in core).

Models repo:

Override the Model.save method in Models to include a metadata file including the required fields.
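
As a rough sketch of the save override described here: the subclass calls the parent's save and then writes a metadata file into the same directory. `BaseModel`, `MetadataSavingModel`, the `weights.bin` placeholder, and the constructor arguments are all hypothetical and exist only for illustration; this is not the actual Models API, and the real schema objects would need their own serialization.

```python
import json
from pathlib import Path


class BaseModel:
    """Hypothetical stand-in for a framework model that has a save method."""

    def save(self, path):
        out = Path(path)
        out.mkdir(parents=True, exist_ok=True)
        # Placeholder for whatever the framework actually writes.
        (out / "weights.bin").write_bytes(b"")


class MetadataSavingModel(BaseModel):
    """Hypothetical subclass whose save also writes metadata.json."""

    def __init__(self, input_schema, output_schema, library_versions):
        # Schemas are plain lists of column dicts here (an assumption).
        self.input_schema = input_schema
        self.output_schema = output_schema
        self.library_versions = library_versions

    def save(self, path):
        # Save the model itself first, then add the metadata next to it.
        super().save(path)
        metadata = {
            "library_versions": self.library_versions,
            "input_schema": self.input_schema,
            "output_schema": self.output_schema,
        }
        (Path(path) / "metadata.json").write_text(json.dumps(metadata, indent=2))
```

With this shape, anything that loads the saved directory can read `metadata.json` to get the schemas without importing the model framework at all.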

Systems repo:

Begin using the


EvenOldridge · 2022-08-17T22:39:20Z

@nv-alaiacano Seems like this would solve a number of issues for us on the systems side. Is this beneficial for session-based or should we postpone until after that work is done? Trying to figure out where to slot this in.

karlhigley · 2022-08-18T13:55:25Z

Since this work isn't a customer-facing Merlin-level feature that makes sense for product to prioritize, we'll likely go ahead and do this work whenever it's necessary. I expect that at least the schemas part will be beneficial for sequence models, so we're having a chat about that part this morning.

nv-alaiacano added the roadmap label Jul 29, 2022

nv-alaiacano assigned EvenOldridge Jul 29, 2022

oliverholworthy mentioned this issue Aug 2, 2022: Add Forest Inference Operator NVIDIA-Merlin/systems#118 (Merged)

sararb mentioned this issue Aug 3, 2022: [RMP] Tensorflow support for session based recommendations integration in Merlin #433 (Closed, 37 tasks)

oliverholworthy mentioned this issue Aug 22, 2022: [FEA] Save input and output schema when .save methods are called on models NVIDIA-Merlin/models#669 (Open)
