
Speedup data reader #2217

Merged: 8 commits, merged into awslabs:dev on Aug 23, 2022
Conversation

@mr-1993 (Contributor) commented Aug 15, 2022

Issue #2195, if available:

Description of changes: @emptymalei, @julian-sieber, and I have implemented a significant speedup of the data reader. @jaheba, please have a look. We significantly simplified the from_schema method in the decoder by using dictionaries directly, instead of having to infer the index of the ndarray columns (maybe you have further ideas to simplify the code? We did not pursue them yet, in order not to break anything). For the moment, the implementation only covers data frames whose rows contain 1D or 2D arrays.
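The dictionary-based decoding described above can be sketched roughly as follows (a minimal illustration, not the actual gluonts implementation; `decode_row` is a hypothetical helper name, while the `"<column>._np_shape"` sidecar-column convention is taken from the snippets discussed in this review):

```python
import numpy as np

def decode_row(raw_row: dict, columns: list) -> dict:
    """Decode one flattened row back into array-valued fields.

    Hypothetical sketch: assumes the encoder flattened each ndarray
    column and stored its original shape under '<column>._np_shape',
    as in the diff discussed in this PR.
    """
    decoded = {}
    for column_name in columns:
        value = raw_row[column_name]
        shape = raw_row.get(f"{column_name}._np_shape")
        if shape is not None:
            # Restore the original dimensionality from the sidecar entry
            value = np.asarray(value).reshape(shape)
        decoded[column_name] = value
    return decoded

row = decode_row(
    {"id": "abc", "target": [1, 2, 3, 4, 5, 6], "target._np_shape": (2, 3)},
    columns=["id", "target"],
)
print(row["target"].shape)  # (2, 3)
```

The point of the simplification is that each row is decoded by plain dictionary lookups, with no per-schema index bookkeeping for the ndarray columns.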

By submitting this pull request, I confirm that you can use, modify, copy, and redistribute this contribution, under the terms of your choice.

Please tag this PR with at least one of these labels to make our release process faster: other change

@lostella added the labels "pending v0.10.x backport" (This contains a fix to be backported to the v0.10.x branch) and "bug fix (one of pr required labels)" on Aug 15, 2022
@lostella (Contributor) commented:

@mr-1993 looks like style checks are failing; we use black to enforce coding style. You can fix the issue by installing it and running

black src test

and committing the changes it applies.

@lostella lostella requested a review from jaheba August 15, 2022 11:32
@mr-1993 (Contributor, Author) commented Aug 15, 2022

Thanks @lostella! Should be fixed by now

@mr-1993 (Contributor, Author) commented Aug 15, 2022

Some tests fail at the moment, but they are unrelated to our implementation (numerical instabilities).


for column_name in self.columns:
value = raw_row[column_name]
shape = raw_row.get(f"{column_name}._np_shape")
Review comment (Contributor):

I don't think we want to do this check for every entry.

Rather, I would keep storing ndarray_columns and then just iterate over them.

Reply (Contributor, Author):

We would have to run some additional experiments to see whether using ndarray_columns really speeds things up significantly. The codebase without it seems to be more elegant. We will provide you with benchmarks as soon as we have them.
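One way such a comparison could look (a hypothetical micro-benchmark sketch, not the measurements promised above; the column names and sizes are made up):

```python
import timeit
import numpy as np

# Made-up row layout: 20 columns, one of which is a flattened 2D array.
columns = [f"col{i}" for i in range(20)]
raw_row = {c: 1 for c in columns}
raw_row["col0"] = list(range(6))
raw_row["col0._np_shape"] = (2, 3)

ndarray_columns = ["col0"]  # precomputed once, e.g. from the schema

def check_every_column():
    # Variant under discussion: probe every column for a shape sidecar.
    for c in columns:
        shape = raw_row.get(f"{c}._np_shape")
        if shape is not None:
            np.asarray(raw_row[c]).reshape(shape)

def check_known_columns():
    # Alternative: only visit columns known to hold ndarrays.
    for c in ndarray_columns:
        np.asarray(raw_row[c]).reshape(raw_row[f"{c}._np_shape"])

t_all = timeit.timeit(check_every_column, number=10_000)
t_known = timeit.timeit(check_known_columns, number=10_000)
print(f"probe all columns: {t_all:.4f}s, precomputed set: {t_known:.4f}s")
```

Whether the difference matters in practice depends on the ratio of scalar to ndarray columns, which is why real benchmarks on representative data are needed before deciding.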

@jaheba (Contributor) left a comment:

Thanks for the PR.

Do you have any comparison numbers on how much this improves decoding?

shape = raw_row.get(f"{column_name}._np_shape")

if shape is not None:
value = np.stack(value).reshape(shape)
Review comment (Contributor):

Why stack the value here?

Reply (Contributor, Author):

Thanks for the comment; it does not seem to be necessary, we will delete it.

if shape is not None:
value = np.stack(value).reshape(shape)
if (
isinstance(value, np.ndarray)
Review comment (Contributor):

Couldn't we just check the shape of the array?

Reply (Contributor, Author):

Hi @jaheba,

The problem is in line with what we discussed last time. What comes out of the pandas conversion for a 2D array is something like

np.array([
    np.array([1,2,3]),
    np.array([4,5,6])
])

Then if we check the shape, we would only see a 1D array. This is why we have

and len(value.shape) == 1

here.

But would be glad to adapt if you have a better idea.

Reply (Contributor, Author):

To the best of our knowledge this does not work. The problem lies in the pandas conversion of multi-dimensional arrays: one obtains arrays of arrays (instead of multi-dimensional arrays).

Reply (Contributor):

Oh, I see. Thanks for the clarification.

I think neither arrow nor pandas support arrays with 2 or more dimensions, thus resulting in these nested arrays, with dtype=object.

That's why the re-shaping approach makes things so much easier.

Playing around, it looks like np.stack is indeed able to handle nested arrays just fine, even with more than two dimensions.

However, I think we can enable stacking only on arrays whose dtype == object.
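That behaviour can be checked with a small experiment (a sketch; `maybe_stack` is a hypothetical helper illustrating the dtype == object condition suggested above, and the nested array is built by hand to mimic what the pandas conversion produces):

```python
import numpy as np

# pandas/arrow round-trips tend to yield a 1D object array of row arrays
# instead of a true 2D array; build one by hand to mimic that.
nested = np.empty(2, dtype=object)
nested[0] = np.array([1, 2, 3])
nested[1] = np.array([4, 5, 6])
print(nested.shape, nested.dtype)  # (2,) object

def maybe_stack(value):
    # Stack only nested object arrays; leave regular arrays untouched.
    if isinstance(value, np.ndarray) and value.dtype == object:
        return np.stack(value)
    return value

stacked = maybe_stack(nested)
print(stacked.shape)  # (2, 3)

regular = np.zeros((2, 3))
assert maybe_stack(regular) is regular  # already dense, passed through
```

The dtype check keeps already-dense arrays on the fast path while still repairing the nested object arrays coming out of the pandas conversion.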

@mr-1993 (Contributor, Author) commented Aug 15, 2022

> Thanks for the PR.
>
> Do you have any comparison numbers on how much this improves decoding?

At the moment, we have only tested it on our internal data, which we are not allowed to share (but which contains a lot of 2D arrays). We will run some follow-up experiments on open-source data; for benchmarking, we already provide this screenshot with the results:
[screenshot: benchmark results]
@jaheba (Contributor) commented Aug 16, 2022

Thanks! These are massive improvements. Can you share a bit what shape the data has? I assume the larger each row is, the bigger the performance difference gets. If I remember correctly, I saw a 2x improvement on some simpler case I tested, but this looks much more impressive!

@mr-1993 (Contributor, Author) commented Aug 16, 2022

> Thanks! These are massive improvements. Can you share a bit what shape the data has? I assume the larger each row is, the bigger the performance difference gets. If I remember correctly, I saw a 2x improvement on some simpler case I tested, but this looks much more impressive!

Our dataset contains 500 rows with 12 columns that contain strings or integers and 11 columns that contain arrays. Six of these array columns are one-dimensional and five are two-dimensional. They have varying sizes: the largest one-dimensional arrays contain around 270 elements, and the largest two-dimensional arrays contain around 15x270 entries. The file itself is 2.5 MB.

@jaheba added the label "enhancement" (New feature or request) and removed the label "bug fix (one of pr required labels)" on Aug 22, 2022
@jaheba (Contributor) commented Aug 23, 2022

@mr-1993 I went ahead and implemented the changes myself. Hope you don't feel overlooked!

Thanks a lot for the PR 🎉

@lostella (Contributor) left a comment.

@mr-1993 (Contributor, Author) commented Aug 23, 2022

@jaheba @lostella Thank you very much for the small collaboration! Hope to contribute other things as well! Also, thank you for implementing the last changes, and sorry for our delay in that regard; we were blocked last week.

@jaheba jaheba merged commit e142230 into awslabs:dev Aug 23, 2022
lostella pushed a commit to lostella/gluonts that referenced this pull request Aug 26, 2022
Co-authored-by: Mones Raslan <mones.raslan@zalando.de>
Co-authored-by: Jasper Zschiegner <schjaspe@amazon.de>
@lostella lostella mentioned this pull request Aug 26, 2022
lostella pushed a commit that referenced this pull request Aug 26, 2022
Co-authored-by: Mones Raslan <mones.raslan@zalando.de>
Co-authored-by: Jasper Zschiegner <schjaspe@amazon.de>
@lostella lostella added performance improvement This item contains performance improvements and removed pending v0.10.x backport This contains a fix to be backported to the v0.10.x branch labels Aug 27, 2022
@lostella lostella removed the enhancement New feature or request label Aug 30, 2022
Labels: performance improvement (This item contains performance improvements)

4 participants