Update T4R for ColumnSchema API changes from Merlin Core #603

Status: Closed

merlin_standard_lib/utils/misc_utils.py (2 additions, 2 deletions)

```diff
@@ -265,14 +265,14 @@ def _augment_schema(
     for col in sparse_names or []:
         cs = schema[col]
         properties = cs.properties
-        properties["is_list"] = True
-        properties["is_ragged"] = True
         if sparse_max and col in sparse_max:
             properties["value_count"] = {"max": sparse_max[col]}
         schema[col] = ColumnSchema(
             name=cs.name,
             tags=cs.tags,
             dtype=cs.dtype,
+            is_list=True,
+            is_ragged=not sparse_as_dense,
             properties=properties,
         )
```

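For reference, here is a small illustration of what the new constructor-level flags produce. The column name and counts are hypothetical, but `merlin.schema.ColumnSchema` does accept `is_list`/`is_ragged` as constructor arguments:

```python
from merlin.schema import ColumnSchema

# Hypothetical column mirroring the new code path above: with
# sparse_as_dense=True the dataloader emits fixed-length dense lists,
# so the column is a list but not ragged.
cs = ColumnSchema(
    "item_id_seq",                            # illustrative column name
    properties={"value_count": {"max": 20}},  # as set from sparse_max above
    is_list=True,
    is_ragged=False,                          # i.e. not sparse_as_dense
)
assert cs.is_list and not cs.is_ragged
```
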
transformers4rec/torch/model/base.py (2 additions, 6 deletions)

```diff
@@ -739,19 +739,15 @@ def input_schema(self):
         dtype = {0: np.float32, 2: np.int64, 3: np.float32}[column.type]
         tags = column.tags
-        is_list = column.value_count.max > 0
+        value_counts = {"min": column.value_count.min, "max": column.value_count.max}
```

The review thread below is attached to the `value_counts` line.
@sararb (Contributor) commented on Jan 23, 2023:

I just wanted to confirm that my understanding is correct: the `is_ragged` property is automatically inferred from the condition `value_counts.min != value_counts.max`? If so, the user needs to ensure that the schema used to define the T4Rec model declares all of its list features with `value_counts.min == value_counts.max`.

I am asking because the current implementation of T4Rec does not require input schemas to mark all list columns with `is_ragged=False`, despite the requirement that all T4Rec input blocks take fixed-length dense tensors as input. This means a user can provide a schema whose list columns are described as `is_ragged=True`, and it is the T4Rec data loader that overwrites the `value_counts` and `is_ragged` properties accordingly (via the `_augment_schema` method).

I am not really convinced by the way T4Rec handles input schema properties; requiring those properties to be set correctly in the user-provided schema, aligned with what the T4Rec input blocks expect, makes more sense to me. So I am wondering: is there any util method in the Schema class that lets the user modify the properties of a schema returned by an NVTabular workflow?

@karlhigley (Contributor, Author) replied:

> The `is_ragged` property is automatically inferred from the condition `value_counts.min != value_counts.max`

Yeah, essentially. It's technically inferred from the shape, which itself is inferred from the value counts. If value counts are provided, they're treated as the second dimension of the shape, so value counts without further shape information turn into something like `((0, None), (count_min, count_max))`. Shapes with a second dimension have `is_list=True`, and if a dimension's minimum and maximum bounds are not equal then `is_ragged=True`.
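As a minimal sketch of that inference rule (illustrative pseudocode of the behavior described above, not the actual Merlin Core implementation):

```python
from typing import Optional, Tuple

def infer_list_flags(
    count_min: Optional[int], count_max: Optional[int]
) -> Tuple[tuple, bool, bool]:
    """Mimics the inference rule described above, for illustration only."""
    if count_min is None and count_max is None:
        shape = ((0, None),)  # no value counts: one row-level dimension
    else:
        # Value counts become the bounds of the second dimension.
        shape = ((0, None), (count_min, count_max))
    is_list = len(shape) > 1
    is_ragged = is_list and shape[1][0] != shape[1][1]
    return shape, is_list, is_ragged

assert infer_list_flags(1, 20)[1:] == (True, True)         # ragged list
assert infer_list_flags(20, 20)[1:] == (True, False)       # fixed-length list
assert infer_list_flags(None, None)[1:] == (False, False)  # scalar column
```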

> the current implementation of T4Rec does not require input schemas to mark all list columns with `is_ragged=False`, despite the requirement that all T4Rec input blocks take fixed-length dense tensors as input

I think it probably should, which aligns with what you're saying. I'm trying really hard to avoid writing code where the schema "lies" about what the data actually is, because every place we're tempted to do that is an opportunity to improve the correctness of either the schema or the data instead.

> is there any util method in the Schema class that lets the user modify the properties of a schema returned by an NVTabular workflow?

There are changes in Core to make this easier that are currently awaiting review by the teams that work on each of the libraries. In particular, I added a `with_shape()` method, which allows updating the shape on a `ColumnSchema`. I'd use that in the dataloader, where the padding actually takes place, so that the schema coming out of NVT accurately reflects what's in the Parquet files, and the schema produced by the dataloader reflects the batches it produces. (Since the padding happens in the dataloader, you'll probably be able to capture the batch size in the first dimension of the shape too.)
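A sketch of how the dataloader side might use that. The `with_shape()` signature is an assumption here (the method was still under review at the time of this thread), reusing the bounds-tuple notation from the comment above:

```python
from merlin.schema import ColumnSchema, Schema

# Hypothetical schema as produced by the NVTabular workflow: the list column
# is still ragged because no padding has happened yet.
nvt_output_schema = Schema([
    ColumnSchema("item_id_seq", is_list=True, is_ragged=True),
])

padded_len = 20  # the fixed length the dataloader pads/truncates to

# After padding, the second dimension is fixed, so is_ragged becomes False;
# the first dimension could also capture the batch size.
loader_output_schema = Schema([
    col.with_shape(((0, None), (padded_len, padded_len))) if col.is_list else col
    for col in nvt_output_schema.column_schemas.values()
])
```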

@sararb (Contributor) replied:

Thank you @karlhigley for your answer! The `with_shape()` method makes sense to me. I definitely need to take a deep dive into your PR in Core to get the full picture of the proposed solution 😁
For the model's input_schema, we could perhaps update it with the schema returned by the data loader (as the latter matches exactly what the model's forward method expects).

@karlhigley (Contributor, Author) replied:

That sounds exactly right to me, and if it doesn't work smoothly, that means there's something wrong with my proposed code in Merlin Core. Happy to help try it out and discover where I messed it up: the proposed code is almost certainly not the final version that will work for all the libraries, and the sooner we discover exactly how it's wrong, the better. I think it's maybe 70% of the way there, but I'm making adjustments as I try to use it in the various libraries and hit cases I hadn't thought of yet (which is happening several times per library, and I expect T4R will be no exception). It's not done until it works for all of the libraries and all of us.
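Sketching the idea from the exchange above (deriving the model's input schema from the dataloader's output schema); every name below (the loader class usage, the `from_schema` signature, the `output_schema` attribute, and the `schema`/`train_paths` variables) is assumed for illustration rather than a confirmed T4R API:

```python
import transformers4rec.torch as tr
from transformers4rec.torch.utils.data_utils import MerlinDataLoader

# Hypothetical wiring: build the loader from the NVT output schema, then take
# the model-facing schema from the loader, which knows about padding.
loader = MerlinDataLoader.from_schema(  # signature assumed
    schema, train_paths, batch_size=128, max_sequence_length=20
)
model_schema = loader.output_schema  # assumed attribute: padded batch schema

# The input block is then built from the schema that matches what forward()
# actually receives.
input_module = tr.TabularSequenceFeatures.from_schema(
    model_schema, max_sequence_length=20
)
```
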

The hunk continues:

```diff
         int_domain = {"min": column.int_domain.min, "max": column.int_domain.max}
-        properties = {
-            "int_domain": int_domain,
-        }
+        properties = {"int_domain": int_domain, "value_counts": value_counts}

         col_schema = ColumnSchema(
             name,
             dtype=dtype,
             tags=tags,
             properties=properties,
-            is_list=is_list,
-            is_ragged=False,
         )
         core_schema[name] = col_schema
     return core_schema
```