Update Models for ColumnSchema
API changes from Merlin Core
#957
Conversation
Since `is_list` and `is_ragged` have become derived properties computed from the shape, it's no longer possible to directly set them from the constructor. They can be smuggled in through the properties, after which they'll be used to determine an appropriate shape that results in the same `is_list` and `is_ragged` values on the other side. (This is a first step toward capturing and using more comprehensive shape information, with the goal of putting `Shape` in place while breaking as little as possible. There will be subsequent changes to directly capture more shape information, but this gets us part-way there.)

Depends on NVIDIA-Merlin/core#195
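A minimal sketch of the round-trip described above, assuming the `merlin.schema.ColumnSchema` import path (the column name is hypothetical, and the exact mechanics may change as the Core PR is revised):

```python
from merlin.schema import ColumnSchema

# Sketch only: is_list / is_ragged are no longer constructor arguments,
# but they can be smuggled in through the properties dict, from which
# a compatible shape is derived.
col = ColumnSchema(
    "item_ids",  # hypothetical column name
    properties={"is_list": True, "is_ragged": True},
)

# On the other side, both flags are computed from the derived shape
assert col.is_list
assert col.is_ragged
```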
```diff
@@ -40,12 +40,12 @@ def _augment_schema(
     properties = cs.properties
     if sparse_max and col in sparse_max:
         properties["value_count"] = {"max": sparse_max[col]}
+    properties["is_list"] = True
+    properties["is_ragged"] = not sparse_as_dense
```
I am thinking out loud. If we set `sparse_as_dense` to False (the default), does that mean we do not have a fixed-length list, so `is_ragged` becomes True? Similarly, if someone sets `sparse_as_dense` to True, that means we have a sparse input representation, so we have list columns that could be either fixed or variable length. Am I wrong? But due to this line, `properties["is_ragged"] = not sparse_as_dense`, `is_ragged` becomes False. So my question: should `sparse_as_dense` not be set to True if the list columns are variable length (ragged)?

Also, wouldn't `sparse_as_dense` be False if we don't have list columns at all? In that case, why does `is_ragged` become True?

Sorry if I am misunderstanding the relation between the `sparse_as_dense` arg and list columns.
> Sorry if I am misunderstanding the relation between the `sparse_as_dense` arg and list columns.

I'm not as familiar with the history of `sparse_as_dense` as you are, so it's just as likely to be me misunderstanding the relationship. Let's try to figure it out together though! 😄

> If we set `sparse_as_dense` to False (the default), does that mean we do not have a fixed-length list, so `is_ragged` becomes True?

If I understand what `sparse_as_dense` does (and I may not), it converts ragged tensors to fixed-length tensors by padding them, yeah? If that's true, I think `sparse_as_dense` should probably be called either `ragged_as_dense` or `pad`, which would make this easier to think about. If `sparse/ragged_as_dense/pad` is `False`, that means we're not padding, which means that any ragged list remains ragged.

> If someone sets `sparse_as_dense` to True, that means we have a sparse input representation, so we have list columns that could be either fixed or variable length.

If `sparse/ragged_as_dense/pad` is True, I think that means we'd pad any ragged list inputs, so ragged lists would be converted to fixed length, after which all lists would be fixed length (i.e. `is_ragged=False`).

> But due to this line, `properties["is_ragged"] = not sparse_as_dense`, `is_ragged` becomes False. So my question: should `sparse_as_dense` not be set to True if the list columns are variable length (ragged)?

The way I understood this was that `sparse/ragged_as_dense/pad = False` implies that ragged lists are not padded and therefore stay ragged, while `sparse/ragged_as_dense/pad = True` implies that ragged lists are padded and therefore become fixed length. I'm not sure I have that right though...
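To make the padding behavior concrete, here's a tiny sketch using TensorFlow's `RaggedTensor` API (illustrative only; the dataloader's actual internals may differ):

```python
import tensorflow as tf

# A ragged list column: rows have different lengths (is_ragged=True)
ragged = tf.ragged.constant([[1, 2], [3], [4, 5, 6]])

# sparse/ragged_as_dense/pad = True: pad every row to a fixed length,
# after which the column is fixed-length (is_ragged=False)
dense = ragged.to_tensor(default_value=0, shape=[None, 3])

print(dense.numpy())
# [[1 2 0]
#  [3 0 0]
#  [4 5 6]]
```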
Yes @karlhigley, the `sparse_as_dense` API of the older NVT dataloader was intended to work like that, padding list columns by converting them from ragged to dense when `sparse_as_dense=True`.

This change in `_augment_schema()` makes sense to me, as it only sets the `is_ragged` property on sparse/list columns, so `is_ragged` needs to be either `False` or `True`.
@karlhigley I found this logic in `ProcessList()`, which processes all features and converts the tuple representation of sparse features to either ragged or dense tensors. It converts all tuple representations to ragged, unless it finds `value_count.min == value_count.max`, in which case it makes the tensor dense.

What should be the new standard for determining whether a tuple representation needs to be converted to ragged or not? Will the `is_ragged` property always be set in the schema? If not, should we check the tuple shape to infer it?
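For illustration, the rule described above could be sketched like this (a hypothetical helper, not the actual `ProcessList()` code):

```python
def infer_is_ragged(col_schema) -> bool:
    # Hypothetical sketch of the described rule: a tuple-represented list
    # column is treated as ragged unless the schema pins its length,
    # i.e. value_count.min == value_count.max.
    value_count = col_schema.properties.get("value_count", {})
    min_len = value_count.get("min")
    max_len = value_count.get("max")
    return min_len is None or min_len != max_len
```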
@gabrielspmoreira I think this PR will end up being superseded by revisions in the Core PR based on @oliverholworthy's feedback, but re: converting tuple representations, I think we should get rid of the tuples altogether (since they significantly complicate the way inputs end up being named) and instead supply the values and offsets with separate dictionary keys.
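For illustration, the difference between the two representations might look like this (the key naming is a hypothetical sketch, not a settled convention):

```python
import numpy as np

# Tuple representation: one key carrying (values, offsets) together,
# which complicates how inputs end up being named
batch = {"item_ids": (np.array([1, 2, 3, 4, 5]), np.array([0, 2, 3, 5]))}

# Proposed alternative: separate dictionary keys for values and offsets
batch = {
    "item_ids__values": np.array([1, 2, 3, 4, 5]),
    "item_ids__offsets": np.array([0, 2, 3, 5]),
}
```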