Most efficient way of returning large byte array from Rust to Python #1385

Closed
cswinter opened this issue Jan 14, 2021 · 17 comments

@cswinter

I'm looking for some guidance on the best way to return a large byte array to Python code and/or share access to raw memory between Rust and Python.

Basically, my application transfers large amounts of data over the network in Rust and then hands the data over to Python (but in some cases I still want concurrent access to the data on the Rust side as well).
One of my ideas is to allocate a numpy array and have the networking code write into it directly, though in that case I don't think the Rust code can still safely access the array, because it might be modified concurrently. (Related question: if I store a reference to a numpy array object on the Rust side, will GC work properly, or could the numpy array get deallocated once all references on the Python side disappear?)

I think PyBytes::from_ptr might be what I want but I haven't found any documentation on what invariants it expects to be maintained without triggering unsafety.
If I have a Vec<u8> on the Rust side and return direct access to it with PyBytes::from_ptr, I assume this is fine as long as I don't deallocate the Vec and the Python code doesn't do anything funky? Is there a way to ever safely/automatically deallocate the Vec, or would that require something like exposing a free method that has to be called explicitly by the Python code after it stops accessing the bytes object?

@cswinter
Author

I see that PyBytes has a get_refcnt function; if I keep one copy of the PyBytes object on the Rust side, is it then safe to automatically deallocate it and the underlying storage once the reference count reaches 1?

@davidhewitt
Member

I think there's no way to do this without giving Python control of the allocation. Most of the PyBytes functions will copy the bytes data into a new allocation, PyBytes::from_ptr included.

I think PyBytes::new_with is the closest you can get to what you want with the PyBytes type. You can pre-allocate Python bytes of a fixed size, and you can get access to that as a mutable slice.

You could also consider PyByteArray, which is a mutable bytes buffer.
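For readers less familiar with the Python side: PyByteArray corresponds to Python's bytearray, which is mutable in place, while bytes is immutable. A minimal pure-Python sketch of the difference (no pyo3 required):

```python
# bytes is immutable: any "modification" allocates a new object.
immutable = bytes(b"hello")

# bytearray is a mutable buffer: writes happen in place, with no copy.
buf = bytearray(b"hello")
view = memoryview(buf)   # zero-copy view into the same storage
view[0] = ord("H")       # mutate through the view

assert bytes(buf) == b"Hello"
```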

@althonos
Member

Sounds like a job for memoryview, but there is no stable interface in pyo3 for that!

@davidhewitt
Member

Ahhh yes that looks like it! https://docs.python.org/3/c-api/memoryview.html

We have some ffi definitions at src/ffi/memoryobject.rs which look slightly out of date. They could probably use an update (see #1289). It would also be completely appropriate to add this as a new &PyMemoryView native type, I guess, in src/types/memoryview.rs.
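For reference, a memoryview on the Python side wraps any buffer-protocol object without copying, and slicing the view is also zero-copy. A small pure-Python illustration:

```python
data = bytes(range(16))
mv = memoryview(data)

# Slicing a memoryview does not copy the underlying bytes.
half = mv[8:]
assert half.obj is data               # still backed by the original object
assert half.tolist() == list(range(8, 16))

# Only an explicit conversion materialises a copy.
copied = bytes(half)
assert copied == data[8:]
```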

@cswinter
Author

Thanks for the pointers, looks like memoryview might be the way to go. Will do some more reading.

@cswinter
Author

Still learning, but the key to making this work seems to be the buffer protocol which makes it possible for objects to give access to an underlying memory region by implementing a bf_getbuffer method.
So I could have an object that wraps an Arc<Vec> and uses the buffer protocol to give Python/cpython extensions direct access to its memory.
By implementing the buffer protocol it is then eligible to be wrapped in a memoryview.

My understanding is that pointers to the bf_getbuffer and bf_releasebuffer methods required by the buffer protocol would have to be added to some buffer struct on the Python type object for the class. Or something like that; I'm still unsure about how this part works.
PyO3 does have a PyBufferProtocol trait, is it enough to implement that trait on a #[pyclass] struct to implement the buffer protocol?
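The getbuffer/releasebuffer pairing is what keeps the memory pinned while a consumer holds it. This is observable from pure Python, because a bytearray refuses to resize while any buffer export is outstanding (a sketch of the protocol's semantics, not of the PyO3 API):

```python
buf = bytearray(b"pinned")
mv = memoryview(buf)       # bf_getbuffer: export count goes to 1

# While any buffer export is outstanding, the storage may not move.
try:
    buf.extend(b" more")   # would need to reallocate -> forbidden
    raised = False
except BufferError:
    raised = True
assert raised

mv.release()               # bf_releasebuffer: export count back to 0
buf.extend(b" more")       # resizing is allowed again
assert bytes(buf) == b"pinned more"
```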

@davidhewitt
Member

Yes, implement the PyBufferProtocol trait as a #[pyproto] to implement the buffer protocol.

I'm still not exactly sure how the lifetimes for memoryview work so that you don't get into a use-after-free situation. That would be worth understanding carefully.

@davidhewitt
Member

davidhewitt commented Jan 15, 2021

I think PyMemoryView_FromMemory could be used on your Arc<Vec> directly without going through the buffer protocol, for example, but I haven't thought about how to keep the lifetimes safe in that case.

@kngwyu
Member

kngwyu commented Jan 15, 2021

BTW, you can make your array readonly: https://docs.rs/numpy/0.13.0/numpy/struct.PyReadonlyArray.html

@programmerjake
Contributor

I think PyMemoryView_FromMemory could be used on your Arc<Vec> directly without going through the buffer protocol, for example, but I haven't thought about how to keep the lifetimes safe in that case.

sounds like a memory safety nightmare from some C programmers who didn't think about the consequences -- the buffer would have to be &'static [u8] or &'static mut [u8]

@davidhewitt
Member

sounds like a memory safety nightmare from some C programmers who didn't think about the consequences -- the buffer would have to be &'static [u8] or &'static mut [u8]

I think it might be possible to use a weakref to the memoryview in order to deallocate the buffer / decrease Arc reference count when the memoryview is garbage collected. Haven't tried this though.
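Memoryview objects do support weak references in CPython, so a cleanup callback along those lines is at least expressible. A hedged pure-Python sketch (assumes CPython's prompt refcount-based collection):

```python
import weakref

backing = bytearray(b"payload")
mv = memoryview(backing)

released = []
# Cleanup hook: runs when the memoryview is collected. In the Rust case this
# is where you would drop the Arc reference to the backing storage.
finalizer = weakref.finalize(mv, released.append, True)

del mv  # CPython frees the view as soon as its refcount hits zero
assert released == [True]
```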

@nw0
Contributor

nw0 commented Jan 15, 2021

So I could have an object that wraps an Arc<Vec> and uses the buffer protocol to give Python/cpython extensions direct access to its memory.
By implementing the buffer protocol it is then eligible to be wrapped in a memoryview.

My understanding is that pointers to the bf_getbuffer and bf_releasebuffer methods required by the buffer protocol would have to added to some buffer struct on the Python type object for the class. Or something like that, I'm still unsure about how this part works.

This sounds like what you want. When the MemoryView gets dropped, bf_releasebuffer will get called (see CPython's memoryobject.c and abstract.c). You'd decrement the refcount on the actual buffer here. It looks like the Py_buffer object itself gets handled by Python's GC, though.
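The release path can also be driven explicitly from Python, and a released view fails cleanly rather than dangling, which illustrates the lifetime guarantee bf_releasebuffer provides:

```python
buf = bytearray(b"abc")
mv = memoryview(buf)
assert mv[0] == ord("a")

mv.release()  # triggers bf_releasebuffer now, same as dropping the view

# A released memoryview refuses all access instead of dangling.
try:
    mv[0]
    raised = False
except ValueError:
    raised = True
assert raised
```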

@cswinter
Author

I've figured out the remaining details and implemented ArcVec, which wraps an Arc<Vec<u8>> and allows it to be used as a bytes-like object by Python. The majority of my implementation is based on the existing buffer protocol test. I removed the drop_called field since I don't think it provides that much protection; if rogue C extensions decide to hang on to pointers that have been freed by the Python GC and the Rust memory allocator, then you're probably going to have a bad time either way 😛

Basic example/test case:

```rust
// Rust
#[pyfunction]
fn unicode_buffer() -> arc_buffer::ArcVec {
    arc_buffer::ArcVec::new(Arc::new(vec![
        80, 121, 79, 51, 32, 70, 84, 87, 33, 32, 240, 159, 166, 128, 240, 159, 166, 128, 240,
        159, 166, 128,
    ]))
}
```

```python
# Python
from pyo3lib import unicode_buffer
import gc
import numpy as np

print("Creating new buffer")
buf = unicode_buffer()

print("Decoding buffer as unicode")
print(bytes(buf).decode('utf-8'))

print("Creating numpy array from buffer")
array = np.frombuffer(buf, np.uint8)

print("Deleting original buffer variable and running gc")
del buf
gc.collect()

print("Deleting numpy array and running gc")
del array
gc.collect()
```

```
# Output
Creating new buffer
Decoding buffer as unicode
getbuffer 0
releasebuffer 0
PyO3 FTW! 🦀🦀🦀
Creating numpy array from buffer
getbuffer 0
getbuffer 0
releasebuffer 0
Deleting original buffer variable and running gc
Deleting numpy array and running gc
Dropped 0
```

From what I understand this implementation should be sound, though I may well have missed something. When a C extension wants to obtain access to the memory buffer, it calls PyObject_GetBuffer (bf_getbuffer) and we increment the reference count of our Python object, which prevents it from being garbage collected by Python and dropped by Rust. Once the C extension is done accessing the buffer, it calls PyBuffer_Release and Python automatically decreases the reference count of our object, allowing it to be garbage collected once all other outstanding buffers and references to the object are released. Alternatively, a C extension might return some kind of view that includes a reference to our object and which is garbage collected through the normal mechanisms.

@aeshirey

aeshirey commented Feb 3, 2021

I'd like to chime in with my use case (and my admittedly much simpler understanding) here that is causing me grief:

I'm doing parallel processing in Rust and returning 5x ~150MiB PyBytes to Python which then get sent over the wire to be aggregated elsewhere, at which point those chunks of memory are passed from Python to Rust. I'm seeing significant latency in passing the data back to Python. The Rust code looks roughly like:

```rust
#[pyfunction]
pub fn do_stuff<'a>(py: Python<'a>) -> PyResult<&'a PyBytes> {
    let start = Instant::now();
    let data: Vec<u8> = do_some_stuff();
    let result = PyBytes::new(py, &data[..]);
    println!("Rust took {} seconds", start.elapsed().as_secs_f32());
    Ok(result)
}
```

And I call it from Python:

```python
import time
start = time.time()
result = MyLibrary.do_stuff()
end = time.time()
print(f'Python took {end - start} seconds')
```

These times vary by a significant margin: around 70 seconds of Rust time to 78 seconds of Python time, quite consistently. Passing 5×150 MiB from Python to Rust is instantaneous.

In my case, once I pass the memory from Rust to Python or vice-versa, I don't need to maintain the original data. I'm surprised that there's a ~7-8 sec delta between when Rust completes and Python receives the data. Is this something that would be addressed by memoryview? Or is this more likely PEBCAK?

@davidhewitt
Member

Judging by the example code you've written, the timing in Rust is measured after the creation of the PyBytes object, so I don't think that the latency you see is caused by returning the bytes.

Some random stabs in the dark which might help you locate the issue:

  • you mention multithreading, are you releasing the Python GIL? Maybe some time is spent waiting to re-acquire the Python GIL from other threads which are doing stuff.
  • if your algorithm is allocating a lot of complicated structures in Rust, it could take some time to drop it all. Try dropping everything explicitly before measuring the timing?
  • you're definitely running with --release ?
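On the second point: in Rust the deallocation cost of a large structure is paid wherever it is dropped, which by default is the end of scope, after any timing taken inside the function. A small pure-Rust sketch of moving that cost into the measured region with an explicit drop:

```rust
use std::time::{Duration, Instant};

// Returns the time spent deallocating a large nested structure.
fn measure_drop_cost() -> Duration {
    // Many small heap allocations: each inner Vec is freed individually,
    // so the drop is not free.
    let data: Vec<Vec<u8>> = (0..100_000).map(|i| vec![(i % 256) as u8; 64]).collect();
    let start = Instant::now();
    drop(data); // pay the deallocation cost here, inside the timed region
    start.elapsed()
}

fn main() {
    // Without the explicit drop, the same cost would be paid silently at
    // the end of scope, after any timing measurement taken in the function.
    println!("dropping took {:?}", measure_drop_cost());
}
```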

@aeshirey

aeshirey commented Feb 6, 2021

@davidhewitt - your second point was the key. I assumed that the data structures were being dropped right away when not needed. Manually calling drop addresses this, and now instead of ~7 sec difference between Rust and Python's timing calculations, it's about 0.005 - 0.02 sec. Thanks!

@davidhewitt
Member

This question has been stale for some time now, so I'm going to close it. Thanks for the discussion.
