Update Models for `ColumnSchema` API changes from Merlin Core #957

karlhigley · 2023-01-20T16:34:46Z

Since is_list and is_ragged have become derived properties computed from the shape, it's no longer possible to directly set them from the constructor. They can be smuggled in through the properties, after which they'll be used to determine an appropriate shape that results in the same is_list and is_ragged values on the other side.

(This is a first step toward capturing and using more comprehensive shape information, with the goal of putting Shape in place while breaking as little as possible. There will be subsequent changes to directly capture more shape information, but this gets us part-way there.)

Depends on NVIDIA-Merlin/core#195

Since `is_list` and `is_ragged` have become derived properties computed from the shape, it's no longer possible to directly set them from the constructor. They can be smuggled in through the properties, after which they'll be used to determine an appropriate shape that results in the same `is_list` and `is_ragged` values on the other side. (This is a first step toward capturing and using more comprehensive shape information, with the goal of putting `Shape` in place while breaking as little as possible. There will be subsequent changes to directly capture more shape information, but this gets us part-way there.)

github-actions · 2023-01-20T16:42:48Z

Documentation preview

https://nvidia-merlin.github.io/models/review/pr-957

rnyak · 2023-01-27T19:04:43Z

merlin/models/loader/backend.py

@@ -40,12 +40,12 @@ def _augment_schema(
        properties = cs.properties
        if sparse_max and col in sparse_max:
            properties["value_count"] = {"max": sparse_max[col]}
+        properties["is_list"] = True
+        properties["is_ragged"] = not sparse_as_dense


I am thinking out loud. if we set sparse_as_dense to False (default), does that mean we do not have a fixed length list, so is_ragged becomes True. Similarly, if someone sets sparse_to_dense to True, that means, we have sparse input representation, so that we have list columns where these list columns could be fixed or variable length. Am I wrong? But due to this line properties["is_ragged"] = not sparse_as_dense is_ragged becomes False. So my question, should sparse_to_dense not be set to True if the list columns are variable length (ragged)?

also wouldnt be sparse_to_dense become False if we dont have list columns at all? in that case why is_ragged becomes True?
sorry If I am misunderstanding the relation between sparse_as_dense arg and list columns.

sorry If I am misunderstanding the relation between sparse_as_dense arg and list columns

I'm not as familiar with the history of sparse_as_dense as you are, so it's just as likely to be me misunderstanding the relationship. Let's try to figure it out together though! 😄

if we set sparse_as_dense to False (default), does that mean we do not have a fixed length list, so is_ragged becomes True

If I understand what sparse_as_dense does (and I may not), it converts ragged tensors to fixed length tensors by padding them, yeah? If that's true, I think sparse_as_dense should probably be called either ragged_as_dense or pad, which would make this easier to think about. If sparse/ragged_as_dense/pad is False, that means we're not padding, which means that any ragged list remains ragged.

if someone sets sparse_to_dense to True, that means, we have sparse input representation, so that we have list columns where these list columns could be fixed or variable length

If sparse/ragged_to_dense/pad is True, I think that means we'd pad any ragged list inputs, so ragged lists would be converted to fixed length, after which all lists would be fixed length (i.e. is_ragged=False.)

due to this line properties["is_ragged"] = not sparse_as_dense is_ragged becomes False. So my question, should sparse_to_dense not be set to True if the list columns are variable length (ragged)?

The way I understood this was sparse/ragged_as_dense/pad = False implies that ragged lists are not padded and therefore stay ragged, while sparse/ragged_as_dense/pad = True implies that ragged lists are padded and therefore become fixed length. I'm not sure if I have that right though...

Yes @karlhigley, the sparse_as_dense API of the older NVT dataloader was intended to work like that, padding list columns by converting them from Ragged to Dense when sparse_as_dense=True
It makes sense to me this change in _augment_schema(), as it only sets is_ragged property to sparse/list columns, so that is_ragged needs to be either False or True.

gabrielspmoreira

@karlhigley I found this logic in ProcessList() that processes all features and converts the tuple representation of sparse features to either Ragged or Dense tensors. It converts all tuple representations to Ragged, unless it finds value_count.min == value_count.max, making the tensor dense in such case.
What should be the new standard to determine if a tuple representation needs to be converted to ragged or not? Will is_ragged property be always set in the schema?
If not, should we check the tuple shape to infer?

karlhigley · 2023-02-06T14:21:54Z

@gabrielspmoreira I think this PR will end up being superseded by revisions in the Core PR based on @oliverholworthy's feedback, but re: converting tuple representations, I think we should get rid of the tuples altogether (since they significantly complicate the way inputs end up being named) and instead supply the values and offsets with separate dictionary keys. The is_ragged property still exists on ColumnSchema though, it's just computed from the shape instead of set independently, so it's still fine to use it to determine how to handle the data for particular columns.

karlhigley added the chore Maintenance for the repository label Jan 20, 2023

karlhigley added this to the Merlin 23.02 milestone Jan 20, 2023

karlhigley self-assigned this Jan 20, 2023

karlhigley requested review from sararb and marcromeyn January 20, 2023 16:36

karlhigley mentioned this pull request Jan 20, 2023

Update NVTabular for ColumnSchema API changes from Merlin Core NVIDIA-Merlin/NVTabular#1751

Closed

Merge branch 'main' into chore/comprehensive-shapes

8b7e441

rnyak reviewed Jan 27, 2023

View reviewed changes

rnyak requested review from oliverholworthy and removed request for marcromeyn and sararb February 1, 2023 17:10

gabrielspmoreira reviewed Feb 6, 2023

View reviewed changes

karlhigley closed this Feb 6, 2023

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Update Models for `ColumnSchema` API changes from Merlin Core #957

Update Models for `ColumnSchema` API changes from Merlin Core #957

karlhigley commented Jan 20, 2023

github-actions bot commented Jan 20, 2023

rnyak Jan 27, 2023 •

edited

Loading

karlhigley Jan 27, 2023 •

edited

Loading

gabrielspmoreira Feb 6, 2023

gabrielspmoreira left a comment •

edited

Loading

karlhigley commented Feb 6, 2023

Update Models for ColumnSchema API changes from Merlin Core #957

Update Models for ColumnSchema API changes from Merlin Core #957

Conversation

karlhigley commented Jan 20, 2023

github-actions bot commented Jan 20, 2023

Documentation preview

rnyak Jan 27, 2023 • edited Loading

Choose a reason for hiding this comment

karlhigley Jan 27, 2023 • edited Loading

Choose a reason for hiding this comment

gabrielspmoreira Feb 6, 2023

Choose a reason for hiding this comment

gabrielspmoreira left a comment • edited Loading

Choose a reason for hiding this comment

karlhigley commented Feb 6, 2023

Update Models for `ColumnSchema` API changes from Merlin Core #957

Update Models for `ColumnSchema` API changes from Merlin Core #957

rnyak Jan 27, 2023 •

edited

Loading

karlhigley Jan 27, 2023 •

edited

Loading

gabrielspmoreira left a comment •

edited

Loading