Update T4R for `ColumnSchema` API changes from Merlin Core #603

karlhigley · 2023-01-20T17:10:35Z

Since is_list and is_ragged have become derived properties computed from the shape, it's no longer possible to directly set them from the constructor. They can be smuggled in through the properties, after which they'll be used to determine an appropriate shape that results in the same is_list and is_ragged values on the other side.

(This is a first step toward capturing and using more comprehensive shape information, with the goal of putting Shape in place while breaking as little as possible. There will be subsequent changes to directly capture more shape information, but this gets us part-way there.)

Depends on NVIDIA-Merlin/core#195

Since `is_list` and `is_ragged` have become derived properties computed from the shape, it's no longer possible to directly set them from the constructor. They can be smuggled in through the properties, after which they'll be used to determine an appropriate shape that results in the same `is_list` and `is_ragged` values on the other side. (This is a first step toward capturing and using more comprehensive shape information, with the goal of putting `Shape` in place while breaking as little as possible. There will be subsequent changes to directly capture more shape information, but this gets us part-way there.) Depends on NVIDIA-Merlin/core#195

sararb

Thank you for the PR @karlhigley! I left one question regarding how the is_ragged property is inferred.
I think, the model's property output_schema needs also to be updated as it sets the values of is_list and is_ragged (here).

sararb · 2023-01-23T14:44:47Z

transformers4rec/torch/model/base.py

@@ -739,19 +739,15 @@ def input_schema(self):

            dtype = {0: np.float32, 2: np.int64, 3: np.float32}[column.type]
            tags = column.tags
-            is_list = column.value_count.max > 0
+            value_counts = {"min": column.value_count.min, "max": column.value_count.max}


I just wanted to confirm if my understanding is correct: The is_ragged property is automatically inferred from the condition value_counts.min != value_counts.max? If this is the case, the user needs to ensure that the schema used to define the T4Rec model includes all the List features with value_counts.min == value_counts.max.

I am asking because the current implementation of t4rec does not require input schemas to have all list columns with the value "is_ragged=False", despite the requirement that all t4rec input blocks use fixed-length dense tensors for input. This means that a user can provide a schema with list columns described as 'is_ragged=True', and it's the T4Rec data loader that overwrites the value_counts and is_ragged properties accordingly (via _augment_schema method ).

I am not really convinced with the way T4Rec handles input schema properties, and requiring those properties to be set correctly and aligned with T4Rec input blocks in the schema passed by the user makes more sense to me. So, I am wondering if there is any util method in the Schema class that allows the user to modify the properties of a schema returned by NVTabular workflow?

The is_ragged property is automatically inferred from the condition value_counts.min != value_counts.max

Yeah, essentially. It's technically inferred from the shape, which itself is inferred from the value counts. If value counts are provided, they're treated as the second dimension of the shape, so value counts without further shape information turn into something like ((0,None),(count_min,count_max)). Shapes with a second dimension have is_list=True, and if the minimum bound and maximum bound are not equal then is_ragged=True.

the current implementation of t4rec does not require input schemas to have all list columns with the value "is_ragged=False", despite the requirement that all t4rec input blocks use fixed-length dense tensors for input

I think it probably should, which aligns with what you're saying. I'm trying really hard to avoid writing code where the schema "lies" about what the data actually is, because every place we're tempted to do that is an opportunity to improve the correctness of either the schema or the data instead.

I am wondering if there is any util method in the Schema class that allows the user to modify the properties of a schema returned by NVTabular workflow?

There are changes in Core to make this easier that are currently awaiting review by the teams that work on each of the libraries. In particular, I added this with_shape() method, which allows updating the shape on a ColumnSchema. I'd use that in the dataloader where the padding actually takes place, so that the schema coming out NVT accurately reflects what's in the Parquet files, and the schema produced by the dataloader reflects the batches it produces. (Since the padding is happening in the dataloader, you'll probably be able to capture the batch size in the first dimension of the shape too.)

Thank you @karlhigley for your answer! The with_shape() method makes sense to me. I definitely need to deep dive into your PR in core to get the full picture of the proposed solution 😁
For the model’s input_schema, we could maybe update it with the schema returned by the data loader (as the latter matches exactly what is expected by the model’s forward method)

That sounds exactly right to me, and if it doesn't work smoothly, that means there's something wrong with my proposed code in Merlin Core. Happy to help try it out and discover where I messed it up, because the proposed code is almost certainly not the final version that will work for all the libraries—and the sooner we discover exactly how it's wrong the better. I think it's maybe 70% of the way there, but I'm making adjustments as I try to use it in the various libraries and hit cases I hadn't thought of yet (which is happening several times per library and to which I expect T4R will not be an exception.) It's not done until it works for all of the libraries and all of us.

karlhigley added the chore Maintenance for the repository label Jan 20, 2023

karlhigley added this to the Merlin 23.02 milestone Jan 20, 2023

karlhigley requested a review from sararb January 20, 2023 17:10

karlhigley self-assigned this Jan 20, 2023

sararb reviewed Jan 23, 2023

View reviewed changes

Merge branch 'main' into chore/comprehensive-shapes

518e0fb

karlhigley closed this Feb 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update T4R for `ColumnSchema` API changes from Merlin Core #603

Update T4R for `ColumnSchema` API changes from Merlin Core #603

karlhigley commented Jan 20, 2023

sararb left a comment

sararb Jan 23, 2023 •

edited

Loading

karlhigley Jan 23, 2023

sararb Jan 23, 2023

karlhigley Jan 24, 2023

Update T4R for ColumnSchema API changes from Merlin Core #603

Update T4R for ColumnSchema API changes from Merlin Core #603

Conversation

karlhigley commented Jan 20, 2023

sararb left a comment

Choose a reason for hiding this comment

sararb Jan 23, 2023 • edited Loading

Choose a reason for hiding this comment

karlhigley Jan 23, 2023

Choose a reason for hiding this comment

sararb Jan 23, 2023

Choose a reason for hiding this comment

karlhigley Jan 24, 2023

Choose a reason for hiding this comment

Update T4R for `ColumnSchema` API changes from Merlin Core #603

Update T4R for `ColumnSchema` API changes from Merlin Core #603

sararb Jan 23, 2023 •

edited

Loading