Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update Models for ColumnSchema API changes from Merlin Core #957

Closed

Conversation

karlhigley
Copy link
Contributor

Since is_list and is_ragged have become derived properties computed from the shape, it's no longer possible to directly set them from the constructor. They can be smuggled in through the properties, after which they'll be used to determine an appropriate shape that results in the same is_list and is_ragged values on the other side.

(This is a first step toward capturing and using more comprehensive shape information, with the goal of putting Shape in place while breaking as little as possible. There will be subsequent changes to directly capture more shape information, but this gets us part-way there.)

Depends on NVIDIA-Merlin/core#195

Since `is_list` and `is_ragged` have become derived properties computed from the shape, it's no longer possible to directly set them from the constructor. They can be smuggled in through the properties, after which they'll be used to determine an appropriate shape that results in the same `is_list` and `is_ragged` values on the other side.

(This is a first step toward capturing and using more comprehensive shape information, with the goal of putting `Shape` in place while breaking as little as possible. There will be subsequent changes to directly capture more shape information, but this gets us part-way there.)
@github-actions
Copy link

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-957

@@ -40,12 +40,12 @@ def _augment_schema(
properties = cs.properties
if sparse_max and col in sparse_max:
properties["value_count"] = {"max": sparse_max[col]}
properties["is_list"] = True
properties["is_ragged"] = not sparse_as_dense
Copy link
Contributor

@rnyak rnyak Jan 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I am thinking out loud. if we set sparse_as_dense to False (default), does that mean we do not have a fixed length list, so is_ragged becomes True. Similarly, if someone sets sparse_to_dense to True, that means, we have sparse input representation, so that we have list columns where these list columns could be fixed or variable length. Am I wrong? But due to this line properties["is_ragged"] = not sparse_as_dense is_ragged becomes False. So my question, should sparse_to_dense not be set to True if the list columns are variable length (ragged)?

also wouldnt be sparse_to_dense become False if we dont have list columns at all? in that case why is_ragged becomes True?
sorry If I am misunderstanding the relation between sparse_as_dense arg and list columns.

Copy link
Contributor Author

@karlhigley karlhigley Jan 27, 2023

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

sorry If I am misunderstanding the relation between sparse_as_dense arg and list columns

I'm not as familiar with the history of sparse_as_dense as you are, so it's just as likely to be me misunderstanding the relationship. Let's try to figure it out together though! 😄

if we set sparse_as_dense to False (default), does that mean we do not have a fixed length list, so is_ragged becomes True

If I understand what sparse_as_dense does (and I may not), it converts ragged tensors to fixed length tensors by padding them, yeah? If that's true, I think sparse_as_dense should probably be called either ragged_as_dense or pad, which would make this easier to think about. If sparse/ragged_as_dense/pad is False, that means we're not padding, which means that any ragged list remains ragged.

if someone sets sparse_to_dense to True, that means, we have sparse input representation, so that we have list columns where these list columns could be fixed or variable length

If sparse/ragged_to_dense/pad is True, I think that means we'd pad any ragged list inputs, so ragged lists would be converted to fixed length, after which all lists would be fixed length (i.e. is_ragged=False.)

due to this line properties["is_ragged"] = not sparse_as_dense is_ragged becomes False. So my question, should sparse_to_dense not be set to True if the list columns are variable length (ragged)?

The way I understood this was sparse/ragged_as_dense/pad = False implies that ragged lists are not padded and therefore stay ragged, while sparse/ragged_as_dense/pad = True implies that ragged lists are padded and therefore become fixed length. I'm not sure if I have that right though...

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes @karlhigley, the sparse_as_dense API of the older NVT dataloader was intended to work like that, padding list columns by converting them from Ragged to Dense when sparse_as_dense=True
It makes sense to me this change in _augment_schema(), as it only sets is_ragged property to sparse/list columns, so that is_ragged needs to be either False or True.

@rnyak rnyak requested review from oliverholworthy and removed request for marcromeyn and sararb February 1, 2023 17:10
Copy link
Member

@gabrielspmoreira gabrielspmoreira left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@karlhigley I found this logic in ProcessList() that processes all features and converts the tuple representation of sparse features to either Ragged or Dense tensors. It converts all tuple representations to Ragged, unless it finds value_count.min == value_count.max, making the tensor dense in such case.
What should be the new standard to determine if a tuple representation needs to be converted to ragged or not? Will is_ragged property be always set in the schema?
If not, should we check the tuple shape to infer?

@karlhigley
Copy link
Contributor Author

@gabrielspmoreira I think this PR will end up being superseded by revisions in the Core PR based on @oliverholworthy's feedback, but re: converting tuple representations, I think we should get rid of the tuples altogether (since they significantly complicate the way inputs end up being named) and instead supply the values and offsets with separate dictionary keys. The is_ragged property still exists on ColumnSchema though, it's just computed from the shape instead of set independently, so it's still fine to use it to determine how to handle the data for particular columns.

@karlhigley karlhigley closed this Feb 6, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
chore Maintenance for the repository
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants