Update dataloader to provide new output structure #101
Conversation
@jperez999 @oliverholworthy would be interested in early feedback :)
Based on the TF RaggedTensor docs, it seems like we could handle this with a multi-dimensional array of offsets in the
Thanks for the documentation - although I posted it, I haven't seen it. I was wondering how to deal with that case. I am not sure a multi-dimensional `__offsets` column will work. We will need to be able to feed the same structure to Triton. I think the idea is that we provide only 1-D tensors in every output column. In the TensorFlow documentation, the two offsets vectors have different lengths:
I think we should provide multiple offsets columns, e.g. `__offsets_0`, `__offsets_1`, etc. What do you think?
Multiple arrays' worth of offsets sounds reasonable, but I think we should ignore the case with multiple ragged dimensions for now, since we don't currently have a use case for it.
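To make the multiple-offsets idea concrete, here is a minimal pure-Python sketch (the `__offsets_0`/`__offsets_1` naming follows the proposal above; the helper itself is illustrative, not dataloader code). A doubly-ragged batch is encoded as one flat 1-D values array plus one 1-D offsets array per ragged dimension:

```python
# Hypothetical encoding of the doubly-ragged batch [[[1, 2], [3]], [[4, 5, 6]]]
values = [1, 2, 3, 4, 5, 6]
offsets_1 = [0, 2, 3, 6]  # "col__offsets_1": splits values into inner lists
offsets_0 = [0, 2, 3]     # "col__offsets_0": splits inner lists into rows

def unflatten(items, offsets):
    """Slice `items` into sublists using consecutive offset pairs."""
    return [items[start:end] for start, end in zip(offsets[:-1], offsets[1:])]

inner = unflatten(values, offsets_1)  # [[1, 2], [3], [4, 5, 6]]
rows = unflatten(inner, offsets_0)    # [[[1, 2], [3]], [[4, 5, 6]]]
```

Note that the two offsets arrays have different lengths (matching the TF RaggedTensor docs), which is why they cannot be merged into a single rectangular multi-dimensional column.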
merlin/dataloader/loader_base.py
Outdated
X.update(lists)

for column_name in self.sparse_names:
    if column_name in self.sparse_max:
        # raise ValueError(
        #     f"Did not convert {column_name} to sparse due to missing sparse_max entry"
        # )
        X[column_name] = self._to_sparse_tensor(X[column_name], column_name)
        if self.lists_as_tuple:
            tensor = (X[column_name][0], X[column_name][1][:-1])
It seems like we want the full set of offsets here, not the set with the last one dropped (`[:-1]`).
Currently, the PyTorch dataloader does NOT provide the last offset (the batch length). See comment: #101 (comment)
The function `_to_sparse_tensor` (actually, I think the result is a dense tensor) does not work when the last offset (the total length) is part of the offsets tensor.
Ohhhhh, I get it now 👍🏻
I noticed the issue was that the torch dataloader adds the last offset in the _to_sparse_tensor:
https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/torch.py#L148-L151
I will remove these lines in my updated PR
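For reference, a small sketch of the convention discussed here: when a column's offsets exclude the final boundary (as the PyTorch dataloader currently emits them), a consumer has to append the total length before it can compute row lengths. The helper below is illustrative, not part of the dataloader API:

```python
def offsets_to_row_lengths(offsets, total_length):
    """Illustrative helper: `offsets` excludes the final boundary, matching
    the current PyTorch dataloader output, so append `total_length` first."""
    full = list(offsets) + [total_length]
    return [end - start for start, end in zip(full[:-1], full[1:])]

offsets_to_row_lengths([0, 3, 5], 9)  # -> [3, 2, 4]
```

If the last offset were already included (as `_to_sparse_tensor` was adding it), appending the length again would produce a spurious zero-length row, which matches the breakage described above.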
merlin/dataloader/tensorflow.py
Outdated
@@ -116,6 +116,9 @@ def __init__(
    drop_last=False,
    transforms=None,
    device=None,
    use_row_lengths=False,
    tensors_as_1d=True,
    lists_as_tuple=False,
Not sure how others feel about this, but in the interest of simplifying the code I wonder if we can live without the extra configuration flags and just make the new behavior the default/standard
Unless we make the default values preserve the existing behaviour, it's going to be a breaking change that forces some action. That said, I agree that not having the flags seems like a reasonable option in this case.
merlin/dataloader/tensorflow.py
Outdated
def _reshape_dim(self, tensor):
    if self.tensors_as_1d:
        return tf.reshape(tensor, shape=[-1])
👍🏻
In the course of researching this change, I learned that `reshape` is essentially free because it just changes how the data in memory is interpreted, but `transpose` actually copies the data, so it's more expensive. In theory, either one would work here, but `reshape` is definitely the better choice.
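A quick NumPy analogy for the point about `reshape` (just an illustration; `tf.reshape` is similar in that it reinterprets the existing buffer rather than copying it):

```python
import numpy as np

# Reshaping a contiguous array is zero-copy: the result shares the
# same underlying buffer, it only changes how the data is interpreted.
a = np.arange(6)
flat = a.reshape(-1)     # shape (6,)
col = a.reshape(-1, 1)   # shape (6, 1)

assert np.shares_memory(a, flat)
assert np.shares_memory(a, col)
```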
merlin/dataloader/tensorflow.py
Outdated
    else:
        return tf.reshape(tensor, shape=[-1, 1])

def _add_last_offset(self, index, value):
I'm kind of surprised we're not already getting the length as one of the offsets 🤔
I think the issue is how we create batches from a chunk of dataframe:
https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/loader_base.py#L411
If we have an offsets tensor [0, 3, 5, 9, 10, 15, 20, 25, 27, 30] and batch_size 3, it is split into [0, 3, 5] and [9, 10, 15, 20, 25, 27, 30]: the last offset of one batch carries over as the initial offset of the next. The second iteration yields [9, 10, 15] and [20, 25, 27, 30].
The offsets are then adjusted in this line: https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/loader_base.py#L438
I thought about changing the split, but that would leave the new batch without its initial offset:
[0, 3, 5, 9], [10, 15, 20, 25, 27, 30].
Note: currently, the PyTorch dataloader does NOT return the last offset value of a batch.
I think we need to duplicate the boundary offsets (9 and 20 in the example). I am not sure if there is a cleaner way to achieve it. Short of refactoring the full logic ( https://github.com/NVIDIA-Merlin/dataloader/blob/main/merlin/dataloader/loader_base.py#L388-L440 ), this felt like the best approach.
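A sketch of the boundary-duplication idea (a hypothetical helper, not the loader's actual code): each batch of n rows takes n + 1 offsets, so adjacent batches share their boundary offset, and each batch is rebased to start at 0.

```python
def split_and_rebase(offsets, batch_size):
    """Split a 1-D offsets array into per-batch offsets, duplicating the
    boundary offset between batches and rebasing each batch to start at 0."""
    batches = []
    # len(offsets) - 1 rows total; each batch spans batch_size + 1 offsets
    for start in range(0, len(offsets) - 1, batch_size):
        chunk = offsets[start:start + batch_size + 1]
        batches.append([o - chunk[0] for o in chunk])
    return batches

split_and_rebase([0, 3, 5, 9, 10, 15, 20, 25, 27, 30], 3)
# -> [[0, 3, 5, 9], [0, 1, 6, 11], [0, 5, 7, 10]]
```

The boundary offsets 9 and 20 from the example appear in two batches each (before rebasing), which is exactly the duplication described above.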
Yeah, that makes sense. Thanks for the explanation! I can't think of a cleaner way to do it yet either.
I am not sure, if there is a cleaner way to achieve it. Without trying to refactor the full logic
Working on refactoring the full logic you referenced here in #104, which may remove the requirement for this method.
That PR (#104) is currently incomplete, since it returns only row_lengths for now (I need to add a final row_lengths -> offsets conversion). The row_lengths representation turned out to be easier to work with when splitting into batches, since the batch split array corresponds to the row lengths rather than the offsets.
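A sketch of why row_lengths are easier to split (helper names are illustrative, not from #104): splitting needs no shared boundaries or rebasing, and the final conversion back to offsets is just a cumulative sum with a leading zero.

```python
import itertools

def split_row_lengths(row_lengths, batch_size):
    """Plain chunking: no boundary duplication or rebasing needed."""
    return [row_lengths[i:i + batch_size]
            for i in range(0, len(row_lengths), batch_size)]

def row_lengths_to_offsets(row_lengths):
    """Convert row_lengths back to offsets: cumsum with a leading 0."""
    return [0] + list(itertools.accumulate(row_lengths))

row_lengths = [3, 2, 4, 1, 5, 5, 5, 2, 3]  # same data as the offsets example
split_row_lengths(row_lengths, 3)   # -> [[3, 2, 4], [1, 5, 5], [5, 2, 3]]
row_lengths_to_offsets([3, 2, 4])   # -> [0, 3, 5, 9]
```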
merlin/dataloader/torch.py
Outdated
if self.tensors_as_1d:
    return tensor.view(-1)
else:
    return tensor.view(-1, 1)
`view()` seems good here 👍🏻
I tried `squeeze`, but it was incorrect for a single-element batch:
`{"cat": [1]}` got squeezed to `{"cat": 1}`.
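A NumPy analogy for the `squeeze` pitfall described above (illustrative; the same logic applies to `torch.Tensor.squeeze` vs `tensor.view(-1)`):

```python
import numpy as np

# A batch containing a single value, shape (1, 1).
single = np.array([[1]])

# squeeze drops every size-1 axis, collapsing to a 0-d scalar --
# the bug described above ({"cat": [1]} becoming {"cat": 1}).
assert single.squeeze().shape == ()

# reshape(-1) (like torch's view(-1)) reliably keeps a 1-D tensor.
assert single.reshape(-1).shape == (1,)
```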
print(col)
print(feature_tensor)
if tensors_as_1d or "__values" in col or "__offsets" in col or col in spa_lst:
    assert len(list(feature_tensor.shape)) == 1
Might be good to add test assertions about the offsets either in this test or in a new test, just to double-check that the resulting offsets are what we expect
I had to modify some unit tests, which I think were incorrect or did not work. Updates based on the discussion on Slack and in our meeting:
I reviewed @oliverholworthy's changes. It looks like he modified the logic so that certain functions are no longer required (adding the last offset, reshaping the tensors). Since the unit tests pass, I think the code works correctly.
Co-authored-by: Karl Higley <kmhigley@gmail.com>
Currently, the dataloaders and Merlin Models use values/row_lengths, which is incompatible both with Triton’s ragged batching support
We want to update to:
{“col_name__values”: [1, 2, …], “col_name__offsets”: [0, 4, 7]}
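A sketch of producing that target structure from a values/row_lengths pair (the helper name is illustrative, not the dataloader's API):

```python
import itertools

def to_offsets_format(col_name, values, row_lengths):
    """Illustrative: build the {col__values, col__offsets} output dict
    from the current values/row_lengths representation."""
    offsets = [0] + list(itertools.accumulate(row_lengths))
    return {
        f"{col_name}__values": values,
        f"{col_name}__offsets": offsets,
    }

to_offsets_format("col_name", [1, 2, 3, 4, 5, 6, 7], [4, 3])
# -> {"col_name__values": [1, 2, 3, 4, 5, 6, 7],
#     "col_name__offsets": [0, 4, 7]}
```

Both output columns are plain 1-D arrays, which is the property needed for Triton compatibility discussed above.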