Track mutable frames #4004
Conversation
Force-pushed from 1607d5e to e3d3d8c
Keeps logic simple here in exchange for potentially copying later if this is writeable.
We don't need to bother making these frames writeable as they will be transferred to the GPU anyway, where they will then be writeable. As a result this saves us a copy that we don't otherwise need.
Force-pushed from e3d3d8c to 7ffa2c2
Only try to figure out `writeable` if it is not otherwise specified. This is important for CUDA objects for example, which are not `memoryview` coercible.
Make sure all frames are `bytes` as they are read-only. This should test whether the deserialization logic is smart enough to coerce them back into something that is writeable.
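As a minimal illustration (not the project's test code) of why read-only `bytes` frames exercise this path: a deserializer that needs a mutable result has to copy them into something writeable.

```python
import numpy as np

frame = bytes(8)                     # a read-only buffer, like a frame read back from storage
assert memoryview(frame).readonly

# A deserializer wanting a mutable result must copy into a writable buffer first.
writable = bytearray(frame)          # copy; mutation is now allowed
arr = np.frombuffer(writable, dtype="u1")
arr[0] = 255                         # np.frombuffer(frame, ...) would be read-only and raise on write
```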
As we are only using `"dask"` or `"pickle"` serialization when spilling here, we can be confident that there will not be any CUDA frames. So drop this unneeded check.
Additionally, have managed to back out the requirement that all frames be
As we now guarantee frames are readonly or writeable as appropriate when splitting/merging frames, there is no need to run this code in all cases. We only need it for the fast path, so add it there.
Make sure we force readonly frames to be readonly.
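A small sketch of what forcing read-only can look like at the buffer level, assuming Python 3.8+ `memoryview.toreadonly`; not necessarily the exact mechanism used in this PR:

```python
buf = bytearray(b"abc")              # writeable to begin with
ro = memoryview(buf).toreadonly()    # read-only view over the same memory (Python 3.8+)
assert ro.readonly

try:
    ro[0] = 0                        # writes through the read-only view are rejected
except TypeError:
    pass
```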
Force-pushed from 5b82157 to 9f83aaa
Force-pushed from 9f83aaa to e36b592
This special case of `None` will bypass both cases.
Force-pushed from e36b592 to 09c1ce7
This is very slightly faster than `.append()`

```python
In [1]: %timeit L = []
17.4 ns ± 0.0668 ns per loop (mean ± std. dev. of 7 runs, 100000000 loops each)

In [2]: L2 = [5]

In [3]: %timeit L = []; L.extend(L2)
73.7 ns ± 1.72 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)

In [4]: %timeit L = []; L.append(L2[0])
84.5 ns ± 0.211 ns per loop (mean ± std. dev. of 7 runs, 10000000 loops each)
```
Force-pushed from ba49387 to 7012d85
Thanks for continuing to push here @jakirkham. I have two comments:
Try to coerce the frame through `memoryview` and check the `readonly` attribute. If the frame cannot be coerced through `memoryview`, use `None` to indicate the frame is ambiguous, which means we will bypass any special handling of the frame.
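A minimal sketch of that logic (the helper name is hypothetical, not necessarily the function in this PR):

```python
def frame_writeable(frame):
    """Return True/False for buffer-protocol frames, or None when the frame
    cannot be coerced through ``memoryview`` (e.g. CUDA device objects),
    in which case any read-only/writeable handling is bypassed."""
    try:
        return not memoryview(frame).readonly
    except TypeError:
        return None
```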
Hopefully
I've been running the full protocol test suite on a DGX-1, which has worked well. Haven't been testing comms though. However it seems there are other issues without this change as evidenced by PR ( rapidsai/ucx-py#567 ).
Ok the UCX comms tests pass for me locally too. Though we should still figure out what is up with UCX-Py's CI.
Turns out to be related to how the test was written. PR ( rapidsai/ucx-py#568 ) should fix this.
Thanks John. Merging
Fixes #3994
Closes #3995 (as it's included)
Instead of making all frames writeable as was done in PR ( #3967 ), this tracks which frames were writeable and only makes sure those are writeable. This also fixes a performance regression. Thanks @jsignell for the suggestion 🙂
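A rough sketch of the idea, with hypothetical names, assuming frames are flattened to read-only `bytes` on the way down and only originally-writeable ones are copied back into mutable buffers:

```python
def _writeable(frame):
    # None means "could not tell" (e.g. not memoryview-coercible); leave such frames alone
    try:
        return not memoryview(frame).readonly
    except TypeError:
        return None

def pack(frames):
    # remember which frames were writeable before flattening them to read-only bytes
    flags = [_writeable(f) for f in frames]
    payload = [bytes(f) if w is not None else f for w, f in zip(flags, frames)]
    return flags, payload

def unpack(flags, payload):
    # restore mutability only for frames that were writeable to begin with
    return [bytearray(f) if w else f for w, f in zip(flags, payload)]

flags, payload = pack([bytearray(b"ab"), b"cd"])
restored = unpack(flags, payload)
assert isinstance(restored[0], bytearray) and isinstance(restored[1], bytes)
```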
Edit: Had a typo in my benchmark before. It has been updated.
Note: The `merge_frames` code in this PR is now closer to what it looked like in 2.21.0 (thanks to some simplifications :). So it may be easier to compare what is here against the code in that version for a cleaner diff.

cc @quasiben