Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add a new shape field to ColumnSchema #195

Merged

Conversation

karlhigley
Copy link
Contributor

@karlhigley karlhigley commented Jan 6, 2023

This creates a place to store shape information for all dimensions of the data across both array/tensor and dataframe formats. In contrast to the existing "value_count" property (which only records the value counts of the lists in list field, this attribute is intended to capture the size of all dimensions of the data (the batch dimension, the list lengths, embedding sizes, etc.)

A few significant design elements:

  • Defining a Shape dataclass allows us to make them immutable, which we wouldn't get if we sub-classed tuple like some other frameworks do. For compatibility with those other frameworks (and just general ease of use), the Shape constructor accepts a tuple of dimensions, so it should be relatively straightforward to (for example) create a Merlin shape from a Tensorflow shape.
  • Shapes are stored in dtypes, since shape information is how we know a particular column is a list, and several frameworks we use (e.g. cuDF, Feast) have either a general List dtype or specific list dtypes like int32_list for each element type. The API has been designed to hide this from calling code as much as possible, but it does influence how shapes work to some extent.
  • The is_list and is_ragged flags are now computed from the shape and can no longer be set by providing them as constructor args. We tried to find a way to maintain that, but it's not possible to use dataclass init vars with the same name as a defined property. The best workaround for this we could come up with was to allow is_list and is_ragged to be set from the properties dictionary, in which case we do our best to infer a shape from the values of these flags.

@karlhigley karlhigley self-assigned this Jan 6, 2023
@karlhigley karlhigley added clean up chore Maintenance for the repository labels Jan 6, 2023
@github-actions
Copy link

github-actions bot commented Jan 6, 2023

Documentation preview

https://nvidia-merlin.github.io/core/review/pr-195

This test was brittle to the addition of new fields in the `ColumnSchema` dataclass, but minor rework avoids that issue.
This creates a place to store shape information for all dimensions of the data across both array/tensor and dataframe formats. In contrast to the existing "value_count" property (which only records the value counts of the lists in list field, this attribute is intended to capture the size of _all_ dimensions of the data (the batch dimension, the list lengths, embedding sizes, etc.)
@karlhigley karlhigley force-pushed the feature/comprehensive-shapes branch from 60a0f33 to ee8c804 Compare January 18, 2023 19:15
@karlhigley karlhigley marked this pull request as ready for review January 19, 2023 16:52
@karlhigley karlhigley requested a review from marcromeyn January 19, 2023 16:53
@karlhigley karlhigley added the breaking Breaking change label Jan 20, 2023
karlhigley added a commit to NVIDIA-Merlin/dataloader that referenced this pull request Jan 20, 2023
Since `is_list` and `is_ragged` have become derived properties computed from the shape, it's no longer possible to directly set them from the constructor. They can be smuggled in through the properties, after which they'll be used to determine an appropriate shape that results in the same `is_list` and `is_ragged` values on the other side.

(This is a first step toward capturing and using more comprehensive shape information, with the goal of putting `Shape` in place while breaking as little as possible. There will be subsequent changes to directly capture more shape information, but this gets us part-way there.)

Depends on NVIDIA-Merlin/core#195
karlhigley added a commit to NVIDIA-Merlin/dataloader that referenced this pull request Jan 20, 2023
Since `is_list` and `is_ragged` have become derived properties computed from the shape, it's no longer possible to directly set them from the constructor. They can be smuggled in through the properties, after which they'll be used to determine an appropriate shape that results in the same `is_list` and `is_ragged` values on the other side.

(This is a first step toward capturing and using more comprehensive shape information, with the goal of putting `Shape` in place while breaking as little as possible. There will be subsequent changes to directly capture more shape information, but this gets us part-way there.)

Depends on NVIDIA-Merlin/core#195
karlhigley added a commit to karlhigley/Transformers4Rec that referenced this pull request Jan 20, 2023
Since `is_list` and `is_ragged` have become derived properties computed from the shape, it's no longer possible to directly set them from the constructor. They can be smuggled in through the properties, after which they'll be used to determine an appropriate shape that results in the same `is_list` and `is_ragged` values on the other side.

(This is a first step toward capturing and using more comprehensive shape information, with the goal of putting `Shape` in place while breaking as little as possible. There will be subsequent changes to directly capture more shape information, but this gets us part-way there.)

Depends on NVIDIA-Merlin/core#195
EvenOldridge
EvenOldridge previously approved these changes Jan 27, 2023
Copy link
Member

@EvenOldridge EvenOldridge left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I like the level of detail into the error messaging and the comprehensive tests. Assuming the merge of shapes is meant to be future work this looks good.

merlin/schema/schema.py Outdated Show resolved Hide resolved
merlin/dtypes/base.py Outdated Show resolved Hide resolved
merlin/dtypes/shape.py Outdated Show resolved Hide resolved
For now, since all existing dtype translations rely on exact matching, we can drop the shape. In the future, when we add translations that need to know whether to use a list dtype or not, we'll have the information available here in the translation code.
@karlhigley karlhigley force-pushed the feature/comprehensive-shapes branch from 890a24b to 327c62a Compare February 3, 2023 17:58
@karlhigley
Copy link
Contributor Author

@oliverholworthy I updated this PR to keep is_list and is_ragged as properties that are set in the __post_init__ to match the shape, which really does clean things up quite a bit. I think it still ends up being a breaking change that requires coordination with the other repos though, because validating that the provided flags are consistent with the shape turns up places where that isn't currently true.

@karlhigley karlhigley dismissed EvenOldridge’s stale review February 6, 2023 14:37

PR changed significantly in response to feedback after approval

@oliverholworthy
Copy link
Member

I think it still ends up being a breaking change that requires coordination with the other repos though, because validating that the provided flags are consistent with the shape turns up places where that isn't currently true.

Looks like we'll be setting a requirement for the shape to be specified if is_ragged=False is specified, the fixed dimension will be required to be passed. Would it be possible or helpful with the migration to loosen that restriction further and only check the shape if it's used explictly (either with the dims passed to the ColumnSchema constructor or with_shape)

This changes the way validation is done so that only the new shape info that's provided gets validated for consistency, and the rest gets inferred and filled in based on what was provided (assuming it's valid.)
@karlhigley
Copy link
Contributor Author

Would it be possible or helpful with the migration to loosen that restriction further and only check the shape if it's used explictly?

I shifted the validation in this general direction, so that only information that's explicitly provided gets validated. I'm not sure about lifting the restriction that some kind of shape info has to be provided (through the dtype's shape, the dims, or the value counts) when is_ragged=False because that will put us in a position where we sometimes don't have a shape. That may complicate the downstream changes as we try to migrate over. 🤔

@oliverholworthy
Copy link
Member

because that will put us in a position where we sometimes don't have a shape. That may complicate the downstream changes as we try to migrate over

I suppose we're already in the position where we sometimes don't have a shape, since there's nothing in here currently to enforce having a value count when is_ragged=False. Adding this requirement in this PR forces us to address that now. I support that choice if we're commited to making the changes to the downstream libs right now. If it's mostly a question of how we'll identify which parts of the code need to change, we could achieve that with an optional strict mode or something along those lines.

And closely related to this: to what extent will this change make to loading previously saved schemas (e.g. from an NVTabular workflow and saved schema file). Will we be able to load schemas saved with a prior version after merging this? We'd presumably prefer to be backwards compatible if possible, however if it's a necessary breaking change across the saved file format, then we might need some extra docs about this when we release the next version.

@karlhigley
Copy link
Contributor Author

Adding this requirement in this PR forces us to address that now. I support that choice if we're commited to making the changes to the downstream libs right now. If it's mostly a question of how we'll identify which parts of the code need to change, we could achieve that with an optional strict mode or something along those lines.

We're thinking along similar lines then; we've disabled the is_ragged=False without shape info validation for now, and will use it to identify the places where changes are needed in further dev work. 👍🏻

to what extent will this change make to loading previously saved schemas (e.g. from an NVTabular workflow and saved schema file). Will we be able to load schemas saved with a prior version after merging this? We'd presumably prefer to be backwards compatible if possible, however if it's a necessary breaking change across the saved file format, then we might need some extra docs about this when we release the next version.

I believe the change is backward but maybe not forward compatible, since the allowed values of value counts are slightly different in a more permissive direction after this change.

Copy link
Member

@oliverholworthy oliverholworthy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this is in good shape! ✨

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
breaking Breaking change chore Maintenance for the repository clean up
Projects
None yet
Development

Successfully merging this pull request may close these issues.

5 participants