Improve (de-)serialization performance for scalar arrays #517

Closed
wants to merge 3 commits into from

124C41p
Contributor

@124C41p 124C41p commented Aug 6, 2023

Fixes #515

Since (de-)serialization is implemented purely in Python, it is quite slow compared to native implementations. I try to work around this by not deserializing repeated scalar fields immediately, but wrapping their byte representation in the ScalarArray[T] class instead. This class acts like a list: you can call len(a), a[i], and list(a) for any ScalarArray a, and the actual deserialization happens only at that point (which is still very slow for big arrays).
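To illustrate the lazy-wrapper idea, here is a minimal sketch. It is not the code from this PR: `LazyFixedArray` is a hypothetical stand-in for ScalarArray, and it assumes the payload is a packed run of fixed-width little-endian values (doubles by default), decoded with the standard `struct` module only when an element is accessed.

```python
import struct
from collections.abc import Sequence


class LazyFixedArray(Sequence):
    """Hypothetical stand-in for ScalarArray: stores the packed bytes and
    decodes an element only when it is requested."""

    def __init__(self, payload: bytes, fmt: str = "d"):
        self._payload = payload
        # "<" selects little-endian, standard-size encoding.
        self._struct = struct.Struct("<" + fmt)
        if len(payload) % self._struct.size:
            raise ValueError("payload length is not a multiple of the item size")

    def __len__(self):
        # No decoding needed: the length follows from the byte count.
        return len(self._payload) // self._struct.size

    def __getitem__(self, index):
        if isinstance(index, slice):
            return [self[i] for i in range(*index.indices(len(self)))]
        if index < 0:
            index += len(self)
        if not 0 <= index < len(self):
            raise IndexError("index out of range")
        offset = index * self._struct.size
        return self._struct.unpack_from(self._payload, offset)[0]


# Three doubles in packed little-endian form.
payload = struct.pack("<3d", 1.0, 2.5, -4.0)
a = LazyFixedArray(payload)
decoded = list(a)  # iteration decodes element by element, in pure Python
```

Inheriting from `Sequence` supplies iteration, `in`, and `index()` on top of `__len__` and `__getitem__`, which is why `list(a)` works but stays slow for large arrays.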

On the other hand, when using numpy, you can call np.asarray(a) on any ScalarArray a to turn it into a numpy array almost instantly. Conversely, any numpy array b can be turned into a ScalarArray by calling ScalarArray.from_numpy(b), and the result can be passed to a betterproto dataclass field (instead of a list) for faster serialization.
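A rough sketch of why the numpy path can be fast, under the same assumptions as before: `PackedDoubles` is a hypothetical class, not the PR's ScalarArray, and it handles little-endian doubles only. It relies on `np.frombuffer`, which reinterprets an existing byte buffer as an array instead of decoding values one at a time, and on `ndarray.tobytes()` for the reverse direction.

```python
import numpy as np


class PackedDoubles:
    """Hypothetical stand-in for ScalarArray, restricted to doubles."""

    def __init__(self, payload: bytes):
        self._payload = payload

    def __array__(self, dtype=None, copy=None):
        # Reinterpret the bytes as little-endian float64 without a Python loop.
        arr = np.frombuffer(self._payload, dtype="<f8")
        if dtype is not None:
            arr = arr.astype(dtype)  # astype returns a converted copy
        elif copy:
            arr = arr.copy()
        return arr

    @classmethod
    def from_numpy(cls, arr):
        # Normalize to contiguous little-endian float64, then take the raw bytes.
        return cls(np.ascontiguousarray(arr, dtype="<f8").tobytes())

    def to_bytes(self) -> bytes:
        return self._payload


b = np.array([1.0, 2.5, -4.0])
sa = PackedDoubles.from_numpy(b)  # numpy array -> packed bytes
restored = np.asarray(sa)         # packed bytes -> numpy array
```

Note that an array created by `np.frombuffer` from a `bytes` object is read-only, which matches the immutable design described in this PR.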

I tried to be as non-breaking as possible: you can use lists everywhere you used them before. However, it was necessary to generate Sequence[T] type hints where List[T] hints were generated before. Also note that ScalarArray is an immutable data structure, so you might not be able to call .append() or .insert() on repeated fields as before (although it should be possible to make ScalarArray mutable if really needed).
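For calling code, the practical consequence of a Sequence[T] hint can be sketched as follows. `Samples` is a hypothetical plain dataclass, not generated betterproto code, and a tuple stands in for an immutable array: instead of appending in place, build a new list from the existing field.

```python
from dataclasses import dataclass, field
from typing import Sequence


@dataclass
class Samples:  # hypothetical message, not generated code
    values: Sequence[float] = field(default_factory=tuple)


def with_appended(msg: Samples, value: float) -> Samples:
    # Works for a list, a tuple, or any read-only Sequence:
    # copy the elements into a new list rather than calling .append().
    return Samples(values=[*msg.values, value])


frozen = Samples(values=(1.0, 2.0))  # tuple stands in for an immutable array
extended = with_appended(frozen, 3.0)
```

Code that only reads the field (indexing, iterating, `len`) needs no change under this scheme.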

What do you think about this approach?

@124C41p 124C41p marked this pull request as ready for review August 11, 2023 09:00
124C41p and others added 3 commits August 12, 2023 14:12
This increases (de-)serialization speed of
repeated scalar fields (of fixed length)
drastically in the case they are used as
numpy arrays.
unreachable code removed

generic type parameter removed from ScalarArray
for compatibility with Python 3.7 and 3.8

code (auto-)reformatted
@cetanu cetanu self-assigned this Oct 16, 2023
@cetanu cetanu added the enhancement (New feature or request) and low priority labels Oct 16, 2023
@Gobot1234
Collaborator

Superseded by #545
