Most efficient way of returning large byte array from Rust to Python #1385

Closed
cswinter opened this issue Jan 14, 2021 · 17 comments

@cswinter

I'm looking for some guidance on the best way to return a large byte array to Python code and/or share access to raw memory between Rust and Python.

Basically, my application transfers large amounts of data over the network in Rust and then hands the data over to Python (but in some cases I still want concurrent access to the data on the Rust side as well).
One of my ideas is to allocate a numpy array and have the networking code write into it directly, though in that case I don't think the Rust code can still safely access the array, because it might be modified concurrently. (Related question: if I store a reference to a numpy array object on the Rust side, will GC work properly, or could the numpy array get deallocated once all references on the Python side disappear?)

I think PyBytes::from_ptr might be what I want but I haven't found any documentation on what invariants it expects to be maintained without triggering unsafety.
If I have a Vec<u8> on the Rust side and return direct access to it with PyBytes::from_ptr, I assume this is fine as long as I don't deallocate the Vec and the Python code doesn't do anything funky? Is there a way to ever safely/automatically deallocate the Vec, or would that require something like exposing a free method that has to be called explicitly by the Python code after it stops accessing the bytes object?

@cswinter
Author

I see that PyBytes has a get_refcnt function; if I keep one copy of the PyBytes object on the Rust side, is it then safe to automatically deallocate it and the underlying storage once the reference count reaches 1?

@davidhewitt
Member

I think there's no way to do this without giving Python control of the allocation. Most of the PyBytes functions will copy the bytes data into a new allocation, PyBytes::from_ptr included.

I think PyBytes::new_with is the closest you can get to what you want with the PyBytes type. You can pre-allocate Python bytes of a fixed size, and you can get access to that as a mutable slice.

You could also consider PyByteArray, which is a mutable bytes buffer.
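For readers less familiar with the Python side: PyByteArray corresponds to Python's bytearray, which is mutable in place, while bytes is immutable. A minimal pure-Python sketch of the difference (no pyo3 required):

```python
# bytes is immutable: any "modification" allocates a new object.
immutable = bytes(b"hello")

# bytearray is a mutable buffer: writes happen in place, with no copy.
buf = bytearray(b"hello")
view = memoryview(buf)   # zero-copy view into the same storage
view[0] = ord("H")       # mutate through the view

assert bytes(buf) == b"Hello"
```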

@althonos
Member

Sounds like a job for memoryview, but there is no stable interface in pyo3 for that!

@davidhewitt
Member

Ahhh yes that looks like it! https://docs.python.org/3/c-api/memoryview.html

We have some ffi definitions at src/ffi/memoryobject.rs which look slightly out of date. They could probably use an update (see #1289). It would also be completely appropriate to add this as a new &PyMemoryView native type, I guess, in src/types/memoryview.rs.
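For reference, a memoryview on the Python side wraps any buffer-protocol object without copying, and slicing the view is also zero-copy. A small pure-Python illustration:

```python
data = bytes(range(16))
mv = memoryview(data)

# Slicing a memoryview does not copy the underlying bytes.
half = mv[8:]
assert half.obj is data               # still backed by the original object
assert half.tolist() == list(range(8, 16))

# Only an explicit conversion materialises a copy.
copied = bytes(half)
assert copied == data[8:]
```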

@cswinter
Author

Thanks for the pointers, looks like memoryview might be the way to go. Will do some more reading.

@cswinter
Author

Still learning, but the key to making this work seems to be the buffer protocol which makes it possible for objects to give access to an underlying memory region by implementing a bf_getbuffer method.
So I could have an object that wraps an Arc<Vec> and uses the buffer protocol to give Python/cpython extensions direct access to its memory.
By implementing the buffer protocol it is then eligible to be wrapped in a memoryview.

My understanding is that pointers to the bf_getbuffer and bf_releasebuffer methods required by the buffer protocol would have to be added to some buffer struct on the Python type object for the class. Or something like that; I'm still unsure about how this part works.
PyO3 does have a PyBufferProtocol trait, is it enough to implement that trait on a #[pyclass] struct to implement the buffer protocol?
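The getbuffer/releasebuffer pairing is what keeps the memory pinned while a consumer holds it. This is observable from pure Python, because a bytearray refuses to resize while any buffer export is outstanding (a sketch of the protocol's semantics, not of the PyO3 API):

```python
buf = bytearray(b"pinned")
mv = memoryview(buf)       # bf_getbuffer: export count goes to 1

# While any buffer export is outstanding, the storage may not move.
try:
    buf.extend(b" more")   # would need to reallocate -> forbidden
    raised = False
except BufferError:
    raised = True
assert raised

mv.release()               # bf_releasebuffer: export count back to 0
buf.extend(b" more")       # resizing is allowed again
assert bytes(buf) == b"pinned more"
```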

@davidhewitt
Member

Yes, implement the PyBufferProtocol trait as a #[pyproto] to implement the buffer protocol.

I'm still not exactly sure how the lifetimes for memoryview work so that you don't get into a use-after-free situation. That would be worth understanding carefully.

@davidhewitt
Member

davidhewitt commented Jan 15, 2021

I think PyMemoryView_FromMemory could be used on your Arc<Vec> directly without going through the buffer protocol, for example, but I haven't thought about how to keep the lifetimes safe in that case.

@kngwyu
Member

kngwyu commented Jan 15, 2021

BTW, you can make your array readonly: https://docs.rs/numpy/0.13.0/numpy/struct.PyReadonlyArray.html

@programmerjake
Contributor

I think PyMemoryView_FromMemory could be used on your Arc<Vec> directly without going through the buffer protocol, for example, but I haven't thought about how to keep the lifetimes safe in that case.

sounds like a memory safety nightmare from some C programmers who didn't think about the consequences -- the buffer would have to be &'static [u8] or &'static mut [u8]

@davidhewitt
Member

sounds like a memory safety nightmare from some C programmers who didn't think about the consequences -- the buffer would have to be &'static [u8] or &'static mut [u8]

I think it might be possible to use a weakref to the memoryview in order to deallocate the buffer / decrease Arc reference count when the memoryview is garbage collected. Haven't tried this though.
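Memoryview objects do support weak references in CPython, so a cleanup callback along those lines is at least expressible. A hedged pure-Python sketch (assumes CPython's prompt refcount-based collection):

```python
import weakref

backing = bytearray(b"payload")
mv = memoryview(backing)

released = []
# Cleanup hook: runs when the memoryview is collected. In the Rust case this
# is where you would drop the Arc reference to the backing storage.
finalizer = weakref.finalize(mv, released.append, True)

del mv  # CPython frees the view as soon as its refcount hits zero
assert released == [True]
```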

@nw0
Contributor

nw0 commented Jan 15, 2021

So I could have an object that wraps an Arc<Vec> and uses the buffer protocol to give Python/cpython extensions direct access to its memory.
By implementing the buffer protocol it is then eligible to be wrapped in a memoryview.

My understanding is that pointers to the bf_getbuffer and bf_releasebuffer methods required by the buffer protocol would have to added to some buffer struct on the Python type object for the class. Or something like that, I'm still unsure about how this part works.

This sounds like what you want. When the MemoryView gets dropped, bf_releasebuffer will get called (see CPython's memoryobject.c and abstract.c). You'd decrement the refcount on the actual buffer here. It looks like the Py_buffer object itself gets handled by Python's GC, though.
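The release path can also be driven explicitly from Python, and a released view fails cleanly rather than dangling, which illustrates the lifetime guarantee bf_releasebuffer provides:

```python
buf = bytearray(b"abc")
mv = memoryview(buf)
assert mv[0] == ord("a")

mv.release()  # triggers bf_releasebuffer now, same as dropping the view

# A released memoryview refuses all access instead of dangling.
try:
    mv[0]
    raised = False
except ValueError:
    raised = True
assert raised
```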

@cswinter
Author

I've figured out the remaining details and implemented ArcVec, which wraps an Arc<Vec<u8>> and allows it to be used as a bytes-like object by Python. The majority of my implementation is based on the existing buffer protocol test. I removed the drop_called field since I don't think it provides that much protection; if rogue C extensions decide to hang on to pointers that have been freed by the Python GC and the Rust memory allocator, then you're probably going to have a bad time either way 😛

Basic example/test case:

```rust
// Rust
#[pyfunction]
fn unicode_buffer() -> arc_buffer::ArcVec {
    arc_buffer::ArcVec::new(Arc::new(vec![
        80, 121, 79, 51, 32, 70, 84, 87, 33, 32, 240, 159, 166, 128, 240, 159, 166, 128, 240,
        159, 166, 128,
    ]))
}
```

```python
# Python
from pyo3lib import unicode_buffer
import gc
import numpy as np

print("Creating new buffer")
buf = unicode_buffer()

print("Decoding buffer as unicode")
print(bytes(buf).decode('utf-8'))

print("Creating numpy array from buffer")
array = np.frombuffer(buf, np.uint8)

print("Deleting original buffer variable and running gc")
del buf
gc.collect()

print("Deleting numpy array and running gc")
del array
gc.collect()
```

```
# Output
Creating new buffer
Decoding buffer as unicode
getbuffer 0
releasebuffer 0
PyO3 FTW! 🦀🦀🦀
Creating numpy array from buffer
getbuffer 0
getbuffer 0
releasebuffer 0
Deleting original buffer variable and running gc
Deleting numpy array and running gc
Dropped 0
```

From what I understand this implementation should be sound, though I may well have missed something. When a C extension wants to obtain access to the memory buffer, it calls PyObject_GetBuffer (bf_getbuffer) and we increment the reference count of our Python object, which prevents it from being garbage collected by Python and dropped by Rust. Once the C extension is done accessing the buffer, it calls PyBuffer_Release and Python automatically decreases the reference count of our object, allowing it to be garbage collected once all other outstanding buffers and references to the object are released. Alternatively, a C extension might return some kind of view that includes a reference to our object and which is garbage collected through the normal mechanisms.

@aeshirey

aeshirey commented Feb 3, 2021

I'd like to chime in with my use case (and my admittedly much simpler understanding) here that is causing me grief:

I'm doing parallel processing in Rust and returning 5x ~150MiB PyBytes to Python which then get sent over the wire to be aggregated elsewhere, at which point those chunks of memory are passed from Python to Rust. I'm seeing significant latency in passing the data back to Python. The Rust code looks roughly like:

```rust
#[pyfunction]
pub fn do_stuff<'a>(py: Python<'a>) -> PyResult<&'a PyBytes> {
    let start = Instant::now();
    let data: Vec<u8> = do_some_stuff();
    let result = PyBytes::new(py, &data[..]);
    println!("Rust took {} seconds", start.elapsed().as_secs_f32());
    Ok(result)
}
```

And I call it from Python:

```python
import time
start = time.time()
result = MyLibrary.do_stuff()
end = time.time()
print(f'Python took {end - start} seconds')
```

These times vary by a significant margin: around 70 seconds of Rust time to 78 seconds of Python time, quite consistently. Passing 5×150 MiB from Python to Rust is instantaneous.

In my case, once I pass the memory from Rust to Python or vice-versa, I don't need to maintain the original data. I'm surprised that there's a ~7-8 sec delta between when Rust completes and Python receives the data. Is this something that would be addressed by memoryview? Or is this more likely PEBCAK?

@davidhewitt
Member

Judging by the example code you've written, the timing in Rust is measured after the creation of the PyBytes object, so I don't think that the latency you see is caused by returning the bytes.

Some random stabs in the dark which might help you locate the issue:

  • you mention multithreading, are you releasing the Python GIL? Maybe some time is spent waiting to re-acquire the Python GIL from other threads which are doing stuff.
  • if your algorithm is allocating a lot of complicated structures in Rust, it could take some time to drop it all. Try dropping everything explicitly before measuring the timing?
  • you're definitely running with --release ?
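On the second point: in Rust the deallocation cost of a large structure is paid wherever it is dropped, which by default is the end of scope, after any timing taken inside the function. A small pure-Rust sketch of moving that cost into the measured region with an explicit drop:

```rust
use std::time::{Duration, Instant};

// Returns the time spent deallocating a large nested structure.
fn measure_drop_cost() -> Duration {
    // Many small heap allocations: each inner Vec is freed individually,
    // so the drop is not free.
    let data: Vec<Vec<u8>> = (0..100_000).map(|i| vec![(i % 256) as u8; 64]).collect();
    let start = Instant::now();
    drop(data); // pay the deallocation cost here, inside the timed region
    start.elapsed()
}

fn main() {
    // Without the explicit drop, the same cost would be paid silently at
    // the end of scope, after any timing measurement taken in the function.
    println!("dropping took {:?}", measure_drop_cost());
}
```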

@aeshirey

aeshirey commented Feb 6, 2021

@davidhewitt - your second point was the key. I assumed that the data structures were being dropped right away when not needed. Manually calling drop addresses this, and now instead of ~7 sec difference between Rust and Python's timing calculations, it's about 0.005 - 0.02 sec. Thanks!

@davidhewitt
Member

This question has been stale for some time now, so I'm going to close it. Thanks for the discussion.
