Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update T4R for ColumnSchema API changes from Merlin Core #603

Closed

Conversation

karlhigley
Copy link
Contributor

Since is_list and is_ragged have become derived properties computed from the shape, it's no longer possible to directly set them from the constructor. They can be smuggled in through the properties, after which they'll be used to determine an appropriate shape that results in the same is_list and is_ragged values on the other side.

(This is a first step toward capturing and using more comprehensive shape information, with the goal of putting Shape in place while breaking as little as possible. There will be subsequent changes to directly capture more shape information, but this gets us part-way there.)

Depends on NVIDIA-Merlin/core#195

Since `is_list` and `is_ragged` have become derived properties computed from the shape, it's no longer possible to directly set them from the constructor. They can be smuggled in through the properties, after which they'll be used to determine an appropriate shape that results in the same `is_list` and `is_ragged` values on the other side.

(This is a first step toward capturing and using more comprehensive shape information, with the goal of putting `Shape` in place while breaking as little as possible. There will be subsequent changes to directly capture more shape information, but this gets us part-way there.)

Depends on NVIDIA-Merlin/core#195
@karlhigley karlhigley added the chore Maintenance for the repository label Jan 20, 2023
@karlhigley karlhigley added this to the Merlin 23.02 milestone Jan 20, 2023
@karlhigley karlhigley requested a review from sararb January 20, 2023 17:10
@karlhigley karlhigley self-assigned this Jan 20, 2023
Copy link
Contributor

@sararb sararb left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you for the PR @karlhigley! I left one question regarding how the is_ragged property is inferred.
I think, the model's property output_schema needs also to be updated as it sets the values of is_list and is_ragged (here).

@@ -739,19 +739,15 @@ def input_schema(self):

dtype = {0: np.float32, 2: np.int64, 3: np.float32}[column.type]
tags = column.tags
is_list = column.value_count.max > 0
value_counts = {"min": column.value_count.min, "max": column.value_count.max}
Copy link
Contributor

@sararb sararb Jan 23, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I just wanted to confirm if my understanding is correct: The is_ragged property is automatically inferred from the condition value_counts.min != value_counts.max? If this is the case, the user needs to ensure that the schema used to define the T4Rec model includes all the List features with value_counts.min == value_counts.max.

I am asking because the current implementation of t4rec does not require input schemas to have all list columns with the value "is_ragged=False", despite the requirement that all t4rec input blocks use fixed-length dense tensors for input. This means that a user can provide a schema with list columns described as 'is_ragged=True', and it's the T4Rec data loader that overwrites the value_counts and is_ragged properties accordingly (via _augment_schema method ). 

I am not really convinced with the way T4Rec handles input schema properties, and requiring those properties to be set correctly and aligned with T4Rec input blocks in the schema passed by the user makes more sense to me. So, I am wondering if there is any util method in the Schema class that allows the user to modify the properties of a schema returned by NVTabular workflow? 

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The is_ragged property is automatically inferred from the condition value_counts.min != value_counts.max

Yeah, essentially. It's technically inferred from the shape, which itself is inferred from the value counts. If value counts are provided, they're treated as the second dimension of the shape, so value counts without further shape information turn into something like ((0,None),(count_min,count_max)). Shapes with a second dimension have is_list=True, and if the minimum bound and maximum bound are not equal then is_ragged=True.

the current implementation of t4rec does not require input schemas to have all list columns with the value "is_ragged=False", despite the requirement that all t4rec input blocks use fixed-length dense tensors for input

I think it probably should, which aligns with what you're saying. I'm trying really hard to avoid writing code where the schema "lies" about what the data actually is, because every place we're tempted to do that is an opportunity to improve the correctness of either the schema or the data instead.

I am wondering if there is any util method in the Schema class that allows the user to modify the properties of a schema returned by NVTabular workflow?

There are changes in Core to make this easier that are currently awaiting review by the teams that work on each of the libraries. In particular, I added this with_shape() method, which allows updating the shape on a ColumnSchema. I'd use that in the dataloader where the padding actually takes place, so that the schema coming out NVT accurately reflects what's in the Parquet files, and the schema produced by the dataloader reflects the batches it produces. (Since the padding is happening in the dataloader, you'll probably be able to capture the batch size in the first dimension of the shape too.)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @karlhigley for your answer! The with_shape() method makes sense to me. I definitely need to deep dive into your PR in core to get the full picture of the proposed solution 😁
For the model’s input_schema, we could maybe update it with the schema returned by the data loader (as the latter matches exactly what is expected by the model’s forward method)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That sounds exactly right to me, and if it doesn't work smoothly, that means there's something wrong with my proposed code in Merlin Core. Happy to help try it out and discover where I messed it up, because the proposed code is almost certainly not the final version that will work for all the libraries—and the sooner we discover exactly how it's wrong the better. I think it's maybe 70% of the way there, but I'm making adjustments as I try to use it in the various libraries and hit cases I hadn't thought of yet (which is happening several times per library and to which I expect T4R will not be an exception.) It's not done until it works for all of the libraries and all of us.

@karlhigley karlhigley closed this Feb 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chore Maintenance for the repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants