
Speedup data reader #2217

Merged: 8 commits, merged into awslabs:dev on Aug 23, 2022
Conversation

@mr-1993 (Contributor) commented Aug 15, 2022

Issue #2195, if available:

Description of changes: @emptymalei, @julian-sieber, and I have implemented a significant speedup of the data reader. @jaheba, please have a look. We significantly simplified the from_schema method in the decoder by using dictionaries directly, instead of having to infer the index of the ndarray columns (maybe you have further ideas to simplify the code? We did not pursue them yet, in order not to break anything). For the moment, the implementation only covers data frames whose rows contain 1D or 2D arrays.
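The dictionary-based decoding described above can be sketched roughly as follows (a minimal illustration, not the actual gluonts implementation; `decode_row` is a hypothetical helper name, while the `"<column>._np_shape"` sidecar-column convention is taken from the snippets discussed in this review):

```python
import numpy as np

def decode_row(raw_row: dict, columns: list) -> dict:
    """Decode one flattened row back into array-valued fields.

    Hypothetical sketch: assumes the encoder flattened each ndarray
    column and stored its original shape under '<column>._np_shape',
    as in the diff discussed in this PR.
    """
    decoded = {}
    for column_name in columns:
        value = raw_row[column_name]
        shape = raw_row.get(f"{column_name}._np_shape")
        if shape is not None:
            # Restore the original dimensionality from the sidecar entry
            value = np.asarray(value).reshape(shape)
        decoded[column_name] = value
    return decoded

row = decode_row(
    {"id": "abc", "target": [1, 2, 3, 4, 5, 6], "target._np_shape": (2, 3)},
    columns=["id", "target"],
)
print(row["target"].shape)  # (2, 3)
```

The point of the simplification is that each row is decoded by plain dictionary lookups, with no per-schema index bookkeeping for the ndarray columns.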

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Please tag this PR with at least one of these labels to make our release process faster: other change

@lostella added the labels "pending v0.10.x backport" (This contains a fix to be backported to the v0.10.x branch) and "bug fix (one of pr required labels)" on Aug 15, 2022
@lostella (Contributor) commented:

@mr-1993 looks like style checks are failing; we use black to enforce coding style. You can fix the issue by installing it and running

black src test

and committing the changes it applies.

@lostella lostella requested a review from jaheba August 15, 2022 11:32
@mr-1993 (Contributor, Author) commented Aug 15, 2022

Thanks @lostella! Should be fixed by now

@mr-1993 (Contributor, Author) commented Aug 15, 2022

Some tests fail at the moment, but they are unrelated to our implementation (numerical instabilities).


for column_name in self.columns:
value = raw_row[column_name]
shape = raw_row.get(f"{column_name}._np_shape")
Review comment (Contributor):

I don't think we want to do this check for every entry.

Rather, I would keep storing ndarray_columns and then just iterate over them.

Reply (Contributor, Author):

We would have to run some additional experiments to see whether using ndarray_columns really speeds things up significantly. The codebase without it seems to be more elegant. We will provide you with benchmarks as soon as we have them.
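One way such a comparison could look (a hypothetical micro-benchmark sketch, not the measurements promised above; the column names and sizes are made up):

```python
import timeit
import numpy as np

# Made-up row layout: 20 columns, one of which is a flattened 2D array.
columns = [f"col{i}" for i in range(20)]
raw_row = {c: 1 for c in columns}
raw_row["col0"] = list(range(6))
raw_row["col0._np_shape"] = (2, 3)

ndarray_columns = ["col0"]  # precomputed once, e.g. from the schema

def check_every_column():
    # Variant under discussion: probe every column for a shape sidecar.
    for c in columns:
        shape = raw_row.get(f"{c}._np_shape")
        if shape is not None:
            np.asarray(raw_row[c]).reshape(shape)

def check_known_columns():
    # Alternative: only visit columns known to hold ndarrays.
    for c in ndarray_columns:
        np.asarray(raw_row[c]).reshape(raw_row[f"{c}._np_shape"])

t_all = timeit.timeit(check_every_column, number=10_000)
t_known = timeit.timeit(check_known_columns, number=10_000)
print(f"probe all columns: {t_all:.4f}s, precomputed set: {t_known:.4f}s")
```

Whether the difference matters in practice depends on the ratio of scalar to ndarray columns, which is why real benchmarks on representative data are needed before deciding.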

@jaheba (Contributor) left a comment:

Thanks for the PR.

Do you have any comparison numbers on how much this improves decoding?

shape = raw_row.get(f"{column_name}._np_shape")

if shape is not None:
value = np.stack(value).reshape(shape)
Review comment (Contributor):

Why stack the value here?

Reply (Contributor, Author):

Thanks for the comment; it does not seem to be necessary, we will delete it.

if shape is not None:
value = np.stack(value).reshape(shape)
if (
isinstance(value, np.ndarray)
Review comment (Contributor):

Couldn't we just check the shape of the array?

Reply (Contributor, Author):

Hi @jaheba,

The problem is in line with what we discussed last time. What comes out of the pandas conversion for a 2D array is something like

np.array([
    np.array([1,2,3]),
    np.array([4,5,6])
])

Then if we check the shape, we would only see a 1D array. This is why we have

and len(value.shape) == 1

here.

But would be glad to adapt if you have a better idea.

Reply (Contributor, Author):

To the best of our knowledge this does not work. The problem lies in the pandas conversion of multi-dimensional arrays: one obtains arrays of arrays (instead of multi-dimensional arrays).

Reply (Contributor):

Oh, I see. Thanks for the clarification.

I think neither arrow nor pandas support arrays with 2 or more dimensions, thus resulting in these nested arrays, with dtype=object.

That's why the re-shaping approach makes things so much easier.

Playing around, it looks like np.stack is indeed able to handle nested arrays just fine, even with more than two dimensions.

However, I think we can enable stacking only on arrays whose dtype == object.
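That behaviour can be checked with a small experiment (a sketch; `maybe_stack` is a hypothetical helper illustrating the dtype == object condition suggested above, and the nested array is built by hand to mimic what the pandas conversion produces):

```python
import numpy as np

# pandas/arrow round-trips tend to yield a 1D object array of row arrays
# instead of a true 2D array; build one by hand to mimic that.
nested = np.empty(2, dtype=object)
nested[0] = np.array([1, 2, 3])
nested[1] = np.array([4, 5, 6])
print(nested.shape, nested.dtype)  # (2,) object

def maybe_stack(value):
    # Stack only nested object arrays; leave regular arrays untouched.
    if isinstance(value, np.ndarray) and value.dtype == object:
        return np.stack(value)
    return value

stacked = maybe_stack(nested)
print(stacked.shape)  # (2, 3)

regular = np.zeros((2, 3))
assert maybe_stack(regular) is regular  # already dense, passed through
```

The dtype check keeps already-dense arrays on the fast path while still repairing the nested object arrays coming out of the pandas conversion.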

@mr-1993 (Contributor, Author) commented Aug 15, 2022

> Thanks for the PR.
>
> Do you have any comparison numbers on how much this improves decoding?

At the moment, we have only tested it on our internal data, which we are not allowed to share (but which contains a lot of 2D arrays). We will run some follow-up experiments on open-source data; for benchmarking, we already provide this screenshot with the results:
[screenshot: benchmark results]
@jaheba (Contributor) commented Aug 16, 2022

Thanks! These are massive improvements. Can you share a bit what shape the data has? I assume the larger each row is, the bigger the performance difference gets. If I remember correctly, I saw a 2x improvement on some simpler case I tested, but this looks much more impressive!

@mr-1993 (Contributor, Author) commented Aug 16, 2022

> Thanks! These are massive improvements. Can you share a bit what shape the data has? I assume the larger each row is, the bigger the performance difference gets. If I remember correctly, I saw a 2x improvement on some simpler case I tested, but this looks much more impressive!

Our dataset contains 500 rows with 12 columns that contain strings or integers and 11 columns that contain arrays. Six of these array columns are one-dimensional and five are two-dimensional. They have varying sizes: the largest one-dimensional arrays contain around 270 elements, and the largest two-dimensional arrays contain around 15x270 entries. The file itself is 2.5 MB.

@jaheba added the label "enhancement" (New feature or request) and removed the label "bug fix (one of pr required labels)" on Aug 22, 2022
@jaheba (Contributor) commented Aug 23, 2022

@mr-1993 I went ahead and implemented the changes myself. Hope you don't feel overlooked!

Thanks a lot for the PR 🎉

@lostella (Contributor) left a comment.

@mr-1993 (Contributor, Author) commented Aug 23, 2022

@jaheba @lostella Thank you very much for the small collaboration! Hope to contribute other things as well! Also, thank you for implementing the last changes, and sorry for our delay in that regard; we were blocked last week.

@jaheba jaheba merged commit e142230 into awslabs:dev Aug 23, 2022
lostella pushed a commit to lostella/gluonts that referenced this pull request Aug 26, 2022
Co-authored-by: Mones Raslan <mones.raslan@zalando.de>
Co-authored-by: Jasper Zschiegner <schjaspe@amazon.de>
@lostella lostella mentioned this pull request Aug 26, 2022
lostella pushed a commit that referenced this pull request Aug 26, 2022
Co-authored-by: Mones Raslan <mones.raslan@zalando.de>
Co-authored-by: Jasper Zschiegner <schjaspe@amazon.de>
@lostella lostella added performance improvement This item contains performance improvements and removed pending v0.10.x backport This contains a fix to be backported to the v0.10.x branch labels Aug 27, 2022
@lostella lostella removed the enhancement New feature or request label Aug 30, 2022
Labels: performance improvement (This item contains performance improvements)

4 participants