Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Track mutable frames #4004

Merged
merged 34 commits into from
Aug 4, 2020
Merged

Track mutable frames #4004

merged 34 commits into from
Aug 4, 2020

Conversation

jakirkham
Copy link
Member

@jakirkham jakirkham commented Jul 31, 2020

Fixes #3994
Closes #3995 (as it's included)

Instead of making all frames writeable as was done in PR ( #3967 ), this tracks which frames were writeable and only makes sure those are writeable. This also fixes a performance regression. Thanks @jsignell for the suggestion 🙂

In [1]: import cupy
   ...: import cudf
   ...: import rmm
   ...: 
   ...: from distributed.protocol import serialize_bytes, deserialize_bytes

In [2]: rmm.reinitialize(pool_allocator=True,
   ...:                  initial_pool_size=int(30 * 2**30))

In [3]: df = cudf.DataFrame({
   ...:     k: cupy.random.random(1_000_000)
   ...:     for i, k in enumerate(map(chr, range(ord("A"), ord("K"))))
   ...: })

In [4]: b = serialize_bytes(df)

In [5]: %timeit deserialize_bytes(b)
10.1 ms ± 50.6 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)

Edit: Had a typo in my benchmark before. It has been updated.

Note: The merge_frames code in this PR is now closer to what it looked like in 2.21.0 (thanks to some simplifications :). So it may be easier to compare what is here against the code in that version for a cleaner diff.

cc @quasiben

Keeps logic simple here in exchange for potentially copying later if
this is writeable.
We don't need to bother making these frames writeable as they will be
transferred to GPU anyways where they will then be writeable. As a
result this saves us a copy that we don't otherwise need.
@jakirkham jakirkham force-pushed the track_mutable_frames branch from e3d3d8c to 7ffa2c2 Compare July 31, 2020 21:30
jakirkham added 12 commits July 31, 2020 15:20
Only try to figure out `writeable` if it is not otherwise specified.
This is important for CUDA objects for example, which are not
`memoryview` coercible.
Make sure all frames are `bytes` as they are read-only. This should test
whether the deserialization logic is smart enough to coerce them back
into something that is writeable.
As we are only using `"dask"` or `"pickle"` serialization when spilling
here, we can be confident that there will not be any CUDA frames. So
drop this unneeded check.
@jakirkham
Copy link
Member Author

Additionally have managed to back out the requirement that all frames be memoryviews. Also included tests to make sure NumPy arrays and Pandas objects built from read-only frames are reconstructed with writeable frames (if they were originally).

As we now guarantee frames are readonly or writeable as appropriate when
splitting/merging frames in all cases, there is no need to run this code
in all cases. We simply need it for the fast path. So add it there.
Make sure we force readonly frames to be readonly.
@jakirkham jakirkham force-pushed the track_mutable_frames branch from 9f83aaa to e36b592 Compare August 1, 2020 01:06
This special case of `None` will bypass both cases.
@jakirkham jakirkham force-pushed the track_mutable_frames branch from e36b592 to 09c1ce7 Compare August 1, 2020 01:12
This is very slightly faster than `.append()`

```python
In [1]: %timeit L = []
17.4 ns ± 0.0668 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

In [2]: L2 = [5]

In [3]: %timeit L = []; L.extend(L2)
73.7 ns ± 1.72 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [4]: %timeit L = []; L.append(L2[0])
84.5 ns ± 0.211 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
```
@jakirkham jakirkham force-pushed the track_mutable_frames branch from ba49387 to 7012d85 Compare August 1, 2020 01:49
@quasiben
Copy link
Member

quasiben commented Aug 3, 2020

Thanks for doing continuing to push here @jakirkham. I have two comments:

  1. I think updating a docstring somewhere on how writeable is used would be helpful. Maybe in core.py -- this is only used for GPU data correct ?
  2. Would you be willing to confirm test_cupy/numba/rmm/test_collection_cuda/test_ucx pass as well.

Try to coerce the frame through `memoryview` and check the `readonly`
attribute. If the frame cannot be coerced through `memoryview`, use
`None` to indicate the frame is ambiguous, which means we will bypass
any special handling of the frame.
@jakirkham
Copy link
Member Author

  1. I think updating a docstring somewhere on how writeable is used would be helpful. Maybe in core.py -- this is only used for GPU data correct ?

Hopefully is_writeable addresses this. Though happy to expand on this further if there are particular points that remain unclear

  1. Would you be willing to confirm test_cupy/numba/rmm/test_collection_cuda/test_ucx pass as well.

I've been running the full protocol test suite on a DGX-1, which has worked well. Haven't been testing comms though. However it seems there are other issues without this change as evidenced by PR ( rapidsai/ucx-py#567 ).

@jakirkham
Copy link
Member Author

  1. Would you be willing to confirm test_cupy/numba/rmm/test_collection_cuda/test_ucx pass as well.

I've been running the full protocol test suite on a DGX-1, which has worked well. Haven't been testing comms though. However it seems there are other issues without this change as evidenced by PR ( rapidsai/ucx-py#567 ).

Ok the UCX comms tests pass for me locally too. Though we should still figure out what is up with UCX-Py's CI.

@jakirkham
Copy link
Member Author

  1. Would you be willing to confirm test_cupy/numba/rmm/test_collection_cuda/test_ucx pass as well.

I've been running the full protocol test suite on a DGX-1, which has worked well. Haven't been testing comms though. However it seems there are other issues without this change as evidenced by PR ( rapidsai/ucx-py#567 ).

Ok the UCX comms tests pass for me locally too. Though we should still figure out what is up with UCX-Py's CI.

Turns out to be related to how the test was written. PR ( rapidsai/ucx-py#568 ) should fix this.

@quasiben
Copy link
Member

quasiben commented Aug 4, 2020

Thanks John. Merging

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Copy for mutable frames can introduce a slowdown
2 participants