Speedup data reader #2217
Conversation
@mr-1993 looks like style checks are failing; we use an auto-formatter, so this should be fixable by running it and committing the changes it applies.
Thanks @lostella! Should be fixed by now.
Some tests fail at the moment, but they are not related to our implementation (numerical instabilities).
src/gluonts/dataset/arrow/dec.py

```python
for column_name in self.columns:
    value = raw_row[column_name]
    shape = raw_row.get(f"{column_name}._np_shape")
```
I don't think we want to do this check for every entry. Rather, I would keep storing ndarray_columns and then just iterate over them.
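The suggestion above could be sketched roughly as follows. This is a hypothetical illustration, not the actual gluonts implementation: the class and method names (RowDecoder, decode_row) and the sidecar "._np_shape" layout are assumptions for the example.

```python
import numpy as np

# Hypothetical sketch: record which columns hold ndarrays once, when the
# schema is built, and iterate over only those in the per-row hot path
# instead of probing every entry. Names are illustrative, not gluonts API.
class RowDecoder:
    def __init__(self, columns, ndarray_columns):
        self.columns = columns
        # known up front, e.g. derived from the schema
        self.ndarray_columns = ndarray_columns

    def decode_row(self, raw_row):
        row = {name: raw_row[name] for name in self.columns}
        for name in self.ndarray_columns:  # only the array columns
            value = np.asarray(row[name])
            shape = raw_row.get(f"{name}._np_shape")
            if shape is not None:
                value = value.reshape(shape)
            row[name] = value
        return row

decoder = RowDecoder(columns=["id", "target"], ndarray_columns=["target"])
row = decoder.decode_row(
    {"id": "a", "target": [1, 2, 3, 4, 5, 6], "target._np_shape": (2, 3)}
)
```

The trade-off discussed here is that the per-entry `raw_row.get` probe disappears from columns that never hold arrays.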
We would have to run some additional experiments to see whether using ndarray_columns really speeds things up significantly; the codebase seems more elegant without it. We will provide benchmarks as soon as we have them.
Thanks for the PR.
Do you have any comparison numbers on how much this improves decoding?
src/gluonts/dataset/arrow/dec.py

```python
shape = raw_row.get(f"{column_name}._np_shape")

if shape is not None:
    value = np.stack(value).reshape(shape)
```
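The fragment above relies on a sidecar "<column>._np_shape" entry to restore multi-dimensional arrays. A minimal sketch of that mechanism, where the example values and the nested-list layout are assumptions for illustration:

```python
import numpy as np

# Assumed example: a 2-D array stored as a nested-list column, with its
# original shape kept in a sidecar "<column>._np_shape" entry.
raw_row = {
    "target": [[1, 2, 3], [4, 5, 6]],
    "target._np_shape": (2, 3),
}

value = raw_row["target"]
shape = raw_row.get("target._np_shape")
if shape is not None:
    # np.stack turns the nested lists into one array before reshaping
    value = np.stack(value).reshape(shape)
```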
Why stack the value here?
Thanks for the comment; it does not seem to be necessary, I will delete it.
src/gluonts/dataset/arrow/dec.py

```python
if shape is not None:
    value = np.stack(value).reshape(shape)
if (
    isinstance(value, np.ndarray)
```
Couldn't we just check the shape of the array?
Hi @jaheba,
The problem is in line with what we discussed last time. What comes out of the pandas conversion for a 2D array is something like

```python
np.array([
    np.array([1, 2, 3]),
    np.array([4, 5, 6]),
])
```

so if we check the shape, we only get a 1D array. This is why we have `and len(value.shape) == 1` here. But we would be glad to adapt if you have a better idea.
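The situation described here can be reproduced in a few lines. Note that the example below is a minimal reconstruction under the stated assumption (equal-length rows, object dtype), not a dump from the actual converter:

```python
import numpy as np

# What the pandas conversion reportedly yields for a 2-D column: a 1-D
# object array whose elements are themselves arrays. Building it via
# np.empty(..., dtype=object) guarantees the nested (ragged-style) form.
nested = np.empty(2, dtype=object)
nested[0] = np.array([1, 2, 3])
nested[1] = np.array([4, 5, 6])

print(nested.shape)  # (2,) -- the outer shape hides the second dimension
print(nested.dtype)  # object
```

So a plain `value.shape` check sees a one-dimensional array, which is why the extra `len(value.shape) == 1` condition was needed.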
To the best of our knowledge this does not work. The problem lies in the pandas conversion of multi-dimensional arrays: one obtains arrays of arrays instead of multi-dimensional arrays.
Oh, I see. Thanks for the clarification.
I think neither arrow nor pandas supports arrays with 2 or more dimensions, which results in these nested arrays with dtype=object. That's why the re-shaping approach makes things so much easier.
Playing around with it, it looks like np.stack is indeed able to handle nested arrays just fine, even with more than 2 dimensions. However, I think we can enable stacking only on arrays whose dtype == object.
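The gating idea mentioned here can be sketched as follows. The helper name `densify` is hypothetical, chosen for the example; the dtype check itself is what the comment proposes:

```python
import numpy as np

# Only stack values whose dtype is object (i.e. nested arrays-of-arrays),
# leaving ordinary dense arrays untouched.
def densify(value):
    if isinstance(value, np.ndarray) and value.dtype == object:
        return np.stack(value)  # rebuilds the full multi-dim array
    return value

# A nested object array, as produced by the pandas conversion:
nested = np.empty(2, dtype=object)
nested[0] = np.array([1, 2, 3])
nested[1] = np.array([4, 5, 6])

dense = densify(nested)
print(dense.shape)  # (2, 3)
```

An already-dense array passes through unchanged, so the stacking cost is only paid where the nesting actually occurred.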
Thanks! These are massive improvements. Can you share a bit about what shape the data has? I assume the larger each row is, the bigger the performance difference gets. If I remember correctly, I saw a 2x improvement in some simpler case I tested, but this looks much more impressive!
Our dataset contains 500 rows with 12 columns that contain strings or integers and 11 columns that contain arrays. Six of these array columns are one-dimensional and five are two-dimensional. They vary in size: the largest one-dimensional arrays contain around 270 elements, and the largest two-dimensional arrays around 15x270 entries. The file itself is 2.5 MB.
@mr-1993 I went ahead and implemented the changes myself. Hope you don't feel overlooked! Thanks a lot for the PR 🎉
Thanks @mr-1993 @emptymalei !
Co-authored-by: Mones Raslan <mones.raslan@zalando.de> Co-authored-by: Jasper Zschiegner <schjaspe@amazon.de>
Issue #2195
Description of changes: @emptymalei, @julian-sieber and I have implemented a significant speedup of the data reader. @jaheba, please have a look. We significantly simplified the from_schema method in the decoder by using dictionaries directly, instead of needing to infer the indices of the ndarray columns. (Maybe you have further ideas to simplify the code? We did not pursue this yet in order not to break anything.) For the moment, the implementation only covers data frames whose rows contain 1D or 2D arrays.
By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.
Please tag this PR with at least one of these labels to make our release process faster: other change