Add a new `shape` field to `ColumnSchema` #195

karlhigley · 2023-01-06T20:20:50Z

This creates a place to store shape information for all dimensions of the data across both array/tensor and dataframe formats. In contrast to the existing "value_count" property (which only records the value counts of the lists in list field, this attribute is intended to capture the size of all dimensions of the data (the batch dimension, the list lengths, embedding sizes, etc.)

A few significant design elements:

Defining a Shape dataclass allows us to make them immutable, which we wouldn't get if we sub-classed tuple like some other frameworks do. For compatibility with those other frameworks (and just general ease of use), the Shape constructor accepts a tuple of dimensions, so it should be relatively straightforward to (for example) create a Merlin shape from a Tensorflow shape.
Shapes are stored in dtypes, since shape information is how we know a particular column is a list, and several frameworks we use (e.g. cuDF, Feast) have either a general List dtype or specific list dtypes like int32_list for each element type. The API has been designed to hide this from calling code as much as possible, but it does influence how shapes work to some extent.
The is_list and is_ragged flags are now computed from the shape and can no longer be set by providing them as constructor args. We tried to find a way to maintain that, but it's not possible to use dataclass init vars with the same name as a defined property. The best workaround for this we could come up with was to allow is_list and is_ragged to be set from the properties dictionary, in which case we do our best to infer a shape from the values of these flags.

github-actions · 2023-01-06T20:30:06Z

Documentation preview

https://nvidia-merlin.github.io/core/review/pr-195

This test was brittle to the addition of new fields in the `ColumnSchema` dataclass, but minor rework avoids that issue.

This creates a place to store shape information for all dimensions of the data across both array/tensor and dataframe formats. In contrast to the existing "value_count" property (which only records the value counts of the lists in list field, this attribute is intended to capture the size of _all_ dimensions of the data (the batch dimension, the list lengths, embedding sizes, etc.)

Since `is_list` and `is_ragged` have become derived properties computed from the shape, it's no longer possible to directly set them from the constructor. They can be smuggled in through the properties, after which they'll be used to determine an appropriate shape that results in the same `is_list` and `is_ragged` values on the other side. (This is a first step toward capturing and using more comprehensive shape information, with the goal of putting `Shape` in place while breaking as little as possible. There will be subsequent changes to directly capture more shape information, but this gets us part-way there.) Depends on NVIDIA-Merlin/core#195

EvenOldridge

I like the level of detail into the error messaging and the comprehensive tests. Assuming the merge of shapes is meant to be future work this looks good.

merlin/schema/schema.py

merlin/dtypes/base.py

merlin/dtypes/shape.py

For now, since all existing dtype translations rely on exact matching, we can drop the shape. In the future, when we add translations that need to know whether to use a list dtype or not, we'll have the information available here in the translation code.

karlhigley · 2023-02-03T18:20:22Z

@oliverholworthy I updated this PR to keep is_list and is_ragged as properties that are set in the __post_init__ to match the shape, which really does clean things up quite a bit. I think it still ends up being a breaking change that requires coordination with the other repos though, because validating that the provided flags are consistent with the shape turns up places where that isn't currently true.

PR changed significantly in response to feedback after approval

oliverholworthy · 2023-02-06T15:01:50Z

I think it still ends up being a breaking change that requires coordination with the other repos though, because validating that the provided flags are consistent with the shape turns up places where that isn't currently true.

Looks like we'll be setting a requirement for the shape to be specified if is_ragged=False is specified, the fixed dimension will be required to be passed. Would it be possible or helpful with the migration to loosen that restriction further and only check the shape if it's used explictly (either with the dims passed to the ColumnSchema constructor or with_shape)

This changes the way validation is done so that only the new shape info that's provided gets validated for consistency, and the rest gets inferred and filled in based on what was provided (assuming it's valid.)

karlhigley · 2023-02-06T16:47:02Z

Would it be possible or helpful with the migration to loosen that restriction further and only check the shape if it's used explictly?

I shifted the validation in this general direction, so that only information that's explicitly provided gets validated. I'm not sure about lifting the restriction that some kind of shape info has to be provided (through the dtype's shape, the dims, or the value counts) when is_ragged=False because that will put us in a position where we sometimes don't have a shape. That may complicate the downstream changes as we try to migrate over. 🤔

This is now handled by the shape validation

oliverholworthy · 2023-02-06T18:59:58Z

because that will put us in a position where we sometimes don't have a shape. That may complicate the downstream changes as we try to migrate over

I suppose we're already in the position where we sometimes don't have a shape, since there's nothing in here currently to enforce having a value count when is_ragged=False. Adding this requirement in this PR forces us to address that now. I support that choice if we're commited to making the changes to the downstream libs right now. If it's mostly a question of how we'll identify which parts of the code need to change, we could achieve that with an optional strict mode or something along those lines.

And closely related to this: to what extent will this change make to loading previously saved schemas (e.g. from an NVTabular workflow and saved schema file). Will we be able to load schemas saved with a prior version after merging this? We'd presumably prefer to be backwards compatible if possible, however if it's a necessary breaking change across the saved file format, then we might need some extra docs about this when we release the next version.

karlhigley · 2023-02-06T19:50:13Z

Adding this requirement in this PR forces us to address that now. I support that choice if we're commited to making the changes to the downstream libs right now. If it's mostly a question of how we'll identify which parts of the code need to change, we could achieve that with an optional strict mode or something along those lines.

We're thinking along similar lines then; we've disabled the is_ragged=False without shape info validation for now, and will use it to identify the places where changes are needed in further dev work. 👍🏻

to what extent will this change make to loading previously saved schemas (e.g. from an NVTabular workflow and saved schema file). Will we be able to load schemas saved with a prior version after merging this? We'd presumably prefer to be backwards compatible if possible, however if it's a necessary breaking change across the saved file format, then we might need some extra docs about this when we release the next version.

I believe the change is backward but maybe not forward compatible, since the allowed values of value counts are slightly different in a more permissive direction after this change.

oliverholworthy

Looks like this is in good shape! ✨

karlhigley self-assigned this Jan 6, 2023

karlhigley added clean up chore Maintenance for the repository labels Jan 6, 2023

karlhigley added 2 commits January 17, 2023 14:51

Improve robustness of test for converting schemas to dataframes

724bebc

This test was brittle to the addition of new fields in the `ColumnSchema` dataclass, but minor rework avoids that issue.

karlhigley force-pushed the feature/comprehensive-shapes branch from 60a0f33 to ee8c804 Compare January 18, 2023 19:15

karlhigley requested review from oliverholworthy, angmc, bschifferer, nv-alaiacano, rjzamora, rnyak and sararb January 19, 2023 16:52

karlhigley marked this pull request as ready for review January 19, 2023 16:52

karlhigley assigned jperez999 Jan 19, 2023

karlhigley requested a review from marcromeyn January 19, 2023 16:53

karlhigley mentioned this pull request Jan 19, 2023

Update NVTabular for ColumnSchema API changes from Merlin Core NVIDIA-Merlin/NVTabular#1751

Closed

karlhigley added the breaking Breaking change label Jan 20, 2023

karlhigley mentioned this pull request Jan 20, 2023

Update Models for ColumnSchema API changes from Merlin Core NVIDIA-Merlin/models#957

Closed

karlhigley mentioned this pull request Jan 20, 2023

Update Dataloader for ColumnSchema API changes from Merlin Core NVIDIA-Merlin/dataloader#95

Closed

This was referenced Jan 20, 2023

Update T4R for ColumnSchema API changes from Merlin Core NVIDIA-Merlin/Transformers4Rec#603

Closed

refactor systems to handle shape in columns NVIDIA-Merlin/systems#278

Closed

karlhigley requested a review from EvenOldridge January 26, 2023 14:53

EvenOldridge previously approved these changes Jan 27, 2023

View reviewed changes

merlin/schema/schema.py Outdated Show resolved Hide resolved

karlhigley commented Jan 27, 2023

View reviewed changes

merlin/dtypes/base.py Outdated Show resolved Hide resolved

karlhigley commented Jan 27, 2023

View reviewed changes

merlin/dtypes/shape.py Outdated Show resolved Hide resolved

karlhigley added 8 commits February 3, 2023 12:47

Compute is_list and is_ragged from ColumnSchema.shape

9123161

Make the default Shape() represent unknown shapes

f45d2c4

Ignore shapes when validating operator output dtypes

ffa9d19

Fall back to the existing shape if there is one

9399742

Remove Shape.fixed property

c491a63

Insert missing f-string

74f0436

Use DType.without_shape

327c62a

karlhigley force-pushed the feature/comprehensive-shapes branch from 890a24b to 327c62a Compare February 3, 2023 17:58

Merge branch 'main' into feature/comprehensive-shapes

2df78c0

karlhigley added 2 commits February 6, 2023 11:25

Make None shorthand for a dimension with unknown or unbounded min/max

423a9b9

Use whatever shape info is provided to fill in the rest

da2f677

This changes the way validation is done so that only the new shape info that's provided gets validated for consistency, and the rest gets inferred and filled in based on what was provided (assuming it's valid.)

karlhigley added 3 commits February 6, 2023 11:52

Remove the value count min/max test

eae9882

This is now handled by the shape validation

Fix stray linter error

00c7be2

Merge branch 'main' into feature/comprehensive-shapes

967a43e

karlhigley added 6 commits February 6, 2023 14:37

Minor test fix

6e8160f

Disable validation that shape info is provided when is_ragged=False

c05f8ba

Add few convenience methods to Shape

5f8ebc5

Update ColumnSchema.with_* methods to clear existing shape info

8f3b4f6

Drop shapes from dtypes in ColumnSchema constructor

44faf95

Fix with_dtype so dtypes don't overwrite the shape

b4291c1

oliverholworthy approved these changes Feb 6, 2023

View reviewed changes

karlhigley merged commit c41f23d into NVIDIA-Merlin:main Feb 7, 2023

oliverholworthy mentioned this pull request Feb 8, 2023

Support partially specified value_count when used with is_ragged=False #213

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add a new `shape` field to `ColumnSchema` #195

Add a new `shape` field to `ColumnSchema` #195

karlhigley commented Jan 6, 2023 •

edited

Loading

github-actions bot commented Jan 6, 2023

EvenOldridge left a comment

karlhigley commented Feb 3, 2023

oliverholworthy commented Feb 6, 2023

karlhigley commented Feb 6, 2023

oliverholworthy commented Feb 6, 2023

karlhigley commented Feb 6, 2023

oliverholworthy left a comment

Add a new shape field to ColumnSchema #195

Add a new shape field to ColumnSchema #195

Conversation

karlhigley commented Jan 6, 2023 • edited Loading

github-actions bot commented Jan 6, 2023

Documentation preview

EvenOldridge left a comment

Choose a reason for hiding this comment

karlhigley commented Feb 3, 2023

oliverholworthy commented Feb 6, 2023

karlhigley commented Feb 6, 2023

oliverholworthy commented Feb 6, 2023

karlhigley commented Feb 6, 2023

oliverholworthy left a comment

Choose a reason for hiding this comment

Add a new `shape` field to `ColumnSchema` #195

Add a new `shape` field to `ColumnSchema` #195

karlhigley commented Jan 6, 2023 •

edited

Loading