Add a new shape field to ColumnSchema #195
This test was brittle to the addition of new fields in the `ColumnSchema` dataclass, but minor rework avoids that issue.
This creates a place to store shape information for all dimensions of the data across both array/tensor and dataframe formats. In contrast to the existing "value_count" property (which only records the value counts of the lists in list fields), this attribute is intended to capture the size of _all_ dimensions of the data (the batch dimension, the list lengths, embedding sizes, etc.).
60a0f33 to ee8c804
Since `is_list` and `is_ragged` have become derived properties computed from the shape, it's no longer possible to directly set them from the constructor. They can be smuggled in through the properties, after which they'll be used to determine an appropriate shape that results in the same `is_list` and `is_ragged` values on the other side. (This is a first step toward capturing and using more comprehensive shape information, with the goal of putting `Shape` in place while breaking as little as possible. There will be subsequent changes to directly capture more shape information, but this gets us part-way there.) Depends on NVIDIA-Merlin/core#195
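As a rough illustration of the inference described above, here is a minimal sketch of turning legacy `is_list` / `is_ragged` flags from a properties dictionary into dimensions that yield the same flags when read back. The helper names (`infer_dims`, `flags_from_dims`), the `(min, max)` tuple representation, and the `"min"`/`"max"` keys under `value_count` are assumptions made for this example, not the code in the PR.

```python
from typing import Dict, Optional, Tuple

# Each dimension is a (min, max) pair; max=None means unbounded/unknown.
Bounds = Tuple[int, Optional[int]]
BATCH: Bounds = (0, None)


def infer_dims(properties: Dict) -> Tuple[Bounds, ...]:
    """Hypothetical helper: build dims from flags stored in a properties dict."""
    is_list = bool(properties.get("is_list", False))
    is_ragged = bool(properties.get("is_ragged", False))
    counts = properties.get("value_count", {})
    low, high = counts.get("min"), counts.get("max")

    if is_ragged and not is_list:
        raise ValueError("is_ragged=True only makes sense when is_list=True")
    if not is_list:
        return (BATCH,)
    if is_ragged:
        if high is not None and low == high:
            raise ValueError("equal min and max describe a fixed-length list")
        return (BATCH, (low or 0, high))
    # A fixed-length list needs a known length to produce a usable shape.
    if high is None or (low is not None and low != high):
        raise ValueError("fixed-length lists need value_count with min == max")
    return (BATCH, (high, high))


def flags_from_dims(dims: Tuple[Bounds, ...]) -> Tuple[bool, bool]:
    """Derive (is_list, is_ragged) back from the dims."""
    is_list = len(dims) > 1
    is_ragged = is_list and any(lo != hi for lo, hi in dims[1:])
    return is_list, is_ragged


ragged_dims = infer_dims({"is_list": True, "is_ragged": True})
fixed_dims = infer_dims({"is_list": True, "value_count": {"min": 3, "max": 3}})
```

In this sketch the flags round-trip: `flags_from_dims(ragged_dims)` gives back a ragged list and `flags_from_dims(fixed_dims)` gives back a fixed-length list.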
I like the level of detail in the error messaging and the comprehensive tests. Assuming the merge of shapes is meant to be future work, this looks good.
For now, since all existing dtype translations rely on exact matching, we can drop the shape. In the future, when we add translations that need to know whether to use a list dtype or not, we'll have the information available here in the translation code.
890a24b to 327c62a
@oliverholworthy I updated this PR to keep |
PR changed significantly in response to feedback after approval
Looks like we'll be setting a requirement for the shape to be specified if |
This changes the way validation is done so that only the new shape info that's provided gets validated for consistency, and the rest gets inferred and filled in based on what was provided (assuming it's valid.)
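To make the "validate what was provided, infer the rest" idea concrete, here is a small sketch. The function `resolve_shape`, its arguments, and the `(min, max)` tuple representation are hypothetical and only meant to show the control flow, not the validation code in this PR.

```python
from typing import Dict, Optional, Tuple

# Each dimension is a (min, max) pair; max=None means unbounded/unknown.
Bounds = Tuple[int, Optional[int]]


def resolve_shape(
    dims: Optional[Tuple[Bounds, ...]] = None,
    value_count: Optional[Dict] = None,
):
    """Hypothetical: check explicit inputs for consistency, fill in the rest."""
    if dims is not None and value_count is not None:
        # Both were provided explicitly, so they must agree with each other.
        if len(dims) < 2:
            raise ValueError("value_count given, but dims describe a scalar column")
        expected = (value_count.get("min", 0), value_count.get("max"))
        if dims[1] != expected:
            raise ValueError(f"dims {dims[1]} disagree with value_count {expected}")
    elif value_count is not None:
        # Only value counts were provided: infer a two-dimensional shape.
        dims = ((0, None), (value_count.get("min", 0), value_count.get("max")))
    elif dims is not None and len(dims) > 1:
        # Only dims were provided: infer the value counts from the list dimension.
        low, high = dims[1]
        value_count = {"min": low, "max": high}
    # If nothing was provided, nothing is validated and the shape stays unknown.
    return dims, value_count


from_counts = resolve_shape(value_count={"min": 1, "max": 5})
from_dims = resolve_shape(dims=((0, None), (4, 4)))
```

Here nothing is rejected merely for being absent; an error is raised only when two explicitly supplied pieces of shape information contradict each other.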
I shifted the validation in this general direction, so that only information that's explicitly provided gets validated. I'm not sure about lifting the restriction that some kind of shape info has to be provided (through the dtype's shape, the dims, or the value counts) when |
This is now handled by the shape validation
I suppose we're already in the position where we sometimes don't have a shape, since there's nothing in here currently to enforce having a value count when

And closely related to this: to what extent will this change affect loading previously saved schemas (e.g. from an NVTabular workflow and saved schema file)? Will we be able to load schemas saved with a prior version after merging this? We'd presumably prefer to be backwards compatible if possible; however, if it's a necessary breaking change to the saved file format, then we might need some extra docs about this when we release the next version.
We're thinking along similar lines then; we've disabled the
I believe the change is backward but maybe not forward compatible, since the allowed values of value counts are slightly different in a more permissive direction after this change. |
Looks like this is in good shape! ✨
This creates a place to store shape information for all dimensions of the data across both array/tensor and dataframe formats. In contrast to the existing "value_count" property (which only records the value counts of the lists in list fields), this attribute is intended to capture the size of all dimensions of the data (the batch dimension, the list lengths, embedding sizes, etc.).
A few significant design elements:

- `Shape` dataclass allows us to make them immutable, which we wouldn't get if we sub-classed `tuple` like some other frameworks do. For compatibility with those other frameworks (and just general ease of use), the `Shape` constructor accepts a tuple of dimensions, so it should be relatively straightforward to (for example) create a Merlin shape from a Tensorflow shape.
- `List` dtype or specific list dtypes like `int32_list` for each element type. The API has been designed to hide this from calling code as much as possible, but it does influence how shapes work to some extent.
- `is_list` and `is_ragged` flags are now computed from the shape and can no longer be set by providing them as constructor args. We tried to find a way to maintain that, but it's not possible to use dataclass init vars with the same name as a defined property. The best workaround for this we could come up with was to allow `is_list` and `is_ragged` to be set from the `properties` dictionary, in which case we do our best to infer a shape from the values of these flags.
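The design points above can be sketched with a frozen dataclass whose constructor accepts a plain tuple of dimensions and whose `is_list` / `is_ragged` flags are derived rather than stored. This is a simplified stand-in written for illustration: the `Dimension` class, the `_to_dimension` helper, and the exact rules for the flags are assumptions, not the `Shape` implementation added in this PR.

```python
from dataclasses import dataclass
from typing import Optional, Tuple


@dataclass(frozen=True)
class Dimension:
    """Hypothetical size bounds for one dimension; max=None means unbounded."""

    min: int = 0
    max: Optional[int] = None

    @property
    def is_fixed(self) -> bool:
        return self.max is not None and self.min == self.max


def _to_dimension(value) -> Dimension:
    """Accept a Dimension, None (unknown), an int (fixed), or a (min, max) pair."""
    if isinstance(value, Dimension):
        return value
    if value is None:
        return Dimension(0, None)
    if isinstance(value, int):
        return Dimension(value, value)
    low, high = value
    return Dimension(low, high)


@dataclass(frozen=True)
class Shape:
    """Hypothetical immutable shape built from a tuple of dimensions."""

    dims: Tuple[Dimension, ...]

    def __post_init__(self):
        # Frozen dataclasses block normal assignment, so normalization
        # goes through object.__setattr__.
        normalized = tuple(_to_dimension(d) for d in self.dims)
        object.__setattr__(self, "dims", normalized)

    @property
    def is_list(self) -> bool:
        # Any dimension beyond the batch dimension makes this a list column.
        return len(self.dims) > 1

    @property
    def is_ragged(self) -> bool:
        # Ragged: some non-batch dimension does not have a fixed size.
        return self.is_list and any(not d.is_fixed for d in self.dims[1:])


scalar = Shape((None,))         # batch dimension only
fixed = Shape((None, 16))       # e.g. a fixed-size embedding per row
ragged = Shape((None, (1, 5)))  # lists of between 1 and 5 values per row
```

Because the flags are properties, they cannot drift out of sync with the dimensions, and attempting to assign to `fixed.dims` raises an error since the dataclass is frozen.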