Update T4R for ColumnSchema API changes from Merlin Core #603 (Closed)

karlhigley wants to merge 2 commits into NVIDIA-Merlin:main from karlhigley:chore/comprehensive-shapes
I just wanted to confirm whether my understanding is correct: the `is_ragged` property is automatically inferred from the condition `value_counts.min != value_counts.max`? If this is the case, the user needs to ensure that the schema used to define the T4Rec model includes all list features with `value_counts.min == value_counts.max`.

I am asking because the current implementation of T4Rec does not require input schemas to have all list columns marked `is_ragged=False`, despite the requirement that all T4Rec input blocks use fixed-length dense tensors as input. This means a user can provide a schema with list columns described as `is_ragged=True`, and it is the T4Rec data loader that overwrites the `value_counts` and `is_ragged` properties accordingly (via the `_augment_schema` method).

I am not really convinced by the way T4Rec handles input schema properties; requiring those properties to be set correctly and aligned with the T4Rec input blocks in the schema passed by the user makes more sense to me. So I am wondering: is there any util method in the `Schema` class that allows the user to modify the properties of a schema returned by an NVTabular workflow?
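To make the fixed-length requirement concrete, here is a minimal pure-Python sketch (not the actual T4Rec or Merlin API; the function name and the dict-based schema layout are illustrative) of a check that every list column satisfies `value_counts.min == value_counts.max`:

```python
# Hypothetical helper: flag list columns whose value counts are ragged,
# i.e. columns that would violate T4Rec's fixed-length dense-tensor
# requirement described above. Schema is modeled as a plain dict here.

def fixed_length_violations(schema):
    """Return names of list columns where value_count min != max."""
    violations = []
    for name, props in schema.items():
        counts = props.get("value_count")
        if counts is None:
            continue  # scalar column: no list-length constraint to check
        if counts["min"] != counts["max"]:
            violations.append(name)
    return violations

example_schema = {
    "item_id": {"value_count": {"min": 20, "max": 20}},  # fixed-length list
    "category": {"value_count": {"min": 1, "max": 20}},  # ragged list
    "user_age": {},                                      # scalar
}
print(fixed_length_violations(example_schema))  # → ['category']
```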
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, essentially. It's technically inferred from the shape, which itself is inferred from the value counts. If value counts are provided, they're treated as the second dimension of the shape, so value counts without further shape information turn into something like `((0, None), (count_min, count_max))`. Shapes with a second dimension have `is_list=True`, and if the minimum bound and maximum bound are not equal then `is_ragged=True`.

I think it probably should, which aligns with what you're saying. I'm trying really hard to avoid writing code where the schema "lies" about what the data actually is, because every place we're tempted to do that is an opportunity to improve the correctness of either the schema or the data instead.
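The inference rule described above can be sketched in a few lines of plain Python (this models the behavior, not Merlin Core's actual implementation; the function names are illustrative):

```python
# Sketch of the shape-inference rule: value counts become the second
# dimension of the shape, with an unknown row count as the first.

def infer_shape(count_min, count_max):
    # (0, None) means "any number of rows"; the list dimension carries
    # the value-count bounds.
    return ((0, None), (count_min, count_max))

def is_list(shape):
    # A second dimension means each row holds a list of values.
    return len(shape) > 1

def is_ragged(shape):
    # Ragged when the list dimension's bounds disagree.
    lo, hi = shape[1]
    return lo != hi

fixed = infer_shape(20, 20)
ragged = infer_shape(1, 20)
print(is_list(fixed), is_ragged(fixed))    # → True False
print(is_list(ragged), is_ragged(ragged))  # → True True
```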
There are changes in Core to make this easier that are currently awaiting review by the teams that work on each of the libraries. In particular, I added a `with_shape()` method, which allows updating the shape on a `ColumnSchema`. I'd use that in the dataloader where the padding actually takes place, so that the schema coming out of NVT accurately reflects what's in the Parquet files, and the schema produced by the dataloader reflects the batches it produces. (Since the padding is happening in the dataloader, you'll probably be able to capture the batch size in the first dimension of the shape too.)
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you @karlhigley for your answer! The `with_shape()` method makes sense to me. I definitely need to deep dive into your PR in Core to get the full picture of the proposed solution 😁

For the model's `input_schema`, we could maybe update it with the schema returned by the data loader (as the latter matches exactly what is expected by the model's forward method).
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That sounds exactly right to me, and if it doesn't work smoothly, that means there's something wrong with my proposed code in Merlin Core. Happy to help try it out and discover where I messed it up, because the proposed code is almost certainly not the final version that will work for all the libraries, and the sooner we discover exactly how it's wrong, the better. I think it's maybe 70% of the way there, but I'm making adjustments as I try to use it in the various libraries and hit cases I hadn't thought of yet (which is happening several times per library, and I expect T4R will not be an exception). It's not done until it works for all of the libraries and all of us.