Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Ensure writable frames #3967

Merged
merged 13 commits into from
Jul 17, 2020
Merged

Ensure writable frames #3967

merged 13 commits into from
Jul 17, 2020

Conversation

jakirkham
Copy link
Member

@jakirkham jakirkham commented Jul 17, 2020

Fixes #1978
Fixes #3943

Closes #3918 (superseded)

User code often expects NumPy and Pandas objects to be writable. However this fails in various cases. After some investigation this appears to be due to some frames being read-only (as they are coerced to bytes via compression, logic in merge_frames, etc.). To fix this we update the logic in merge_frames to check frames are writable. If frames are read-only, we copy them to a writable object (like bytearray). We include tests of merge_frames and NumPy array serialization to make sure this is working correctly.

Should make it easier to extend these tests to more cases.
User code working with NumPy or Pandas objects often expects the objects
to be mutable. However if read-only frames (like `bytes`) objects are
used, this is not true. So add a test to check for this so that we can
make sure this is true and we can catch and fix cases where that may not
be true.
@jakirkham jakirkham force-pushed the ensure_writable_frames branch from 38a7d14 to 2fb24e5 Compare July 17, 2020 19:43
This makes it easier to operate with the frames in later parts of the
code.
Instead of having a fast path, guard against going through the slow
path. Should make it easier to do some common post-processing work.
Instead of using `b""`, create the `bytes` object explicitly.
Instead of using a `bytes` object to merge frames together, use a
`bytearray`. This ensures the result is mutable, which downstream
objects from serialization may care about.
As users expect objects that are deserialized like NumPy arrays and
Pandas objects are mutable, make sure the frames used to build them up
are also mutable. Do this by checking if they are read-only (as is the
case for `bytes` as an example). If they are, copy them into a
`bytearray` backed `memoryview`. Otherwise pass them through as-is.
@mrocklin
Copy link
Member

Thanks for handling this @jakirkham . I'm only seeing tests in this PR so far. Is that correct?

@jakirkham
Copy link
Member Author

Yeah I'm just trying to demonstrate the failure on CI. Will push the rest of the fix after.

@jakirkham
Copy link
Member Author

Alright have pushed the changes to fix the tests. Please let me know what you think 🙂

out.append(memoryview(bytearray().join(L)))
frames = out

frames = [memoryview(bytearray(f)) if f.readonly else f for f in frames]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For fun I decided to micro-benchmark this:

In [1]: L = [b'0' * i for i in [10, 100, 10, 100, 1000] * 2]                                                                                                                                                      

In [2]: %timeit b"".join(L)                                                                                                                                                                                       
203 ns ± 1.6 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

In [3]: %timeit memoryview(bytearray().join(L))                                                                                                                                                                   
463 ns ± 5.9 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)

260ns doesn't really matter today I don't think

@mrocklin
Copy link
Member

In principle this looks good to me. Thank you for identifying and resolving this @jakirkham !

@mrocklin mrocklin merged commit c67705f into dask:master Jul 17, 2020
@jakirkham jakirkham deleted the ensure_writable_frames branch July 17, 2020 23:21
@jakirkham
Copy link
Member Author

Thanks Matt! 😄

@jakirkham
Copy link
Member Author

Adding a test for a Pandas Series as well in PR ( #3995 ).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
2 participants