Fixed error in reading a non-finished IPC stream. #302
Conversation
Codecov Report

```
@@           Coverage Diff           @@
##             main     #302   +/-  ##
=======================================
  Coverage   80.55%   80.56%
=======================================
  Files         324      324
  Lines       21399    21404     +5
=======================================
+ Hits        17239    17244     +5
  Misses       4160     4160
```

Continue to review full report at Codecov.
Hey, I can confirm that this works as intended for me! TBH, I still had core dumps due to memory allocations, but once I switched over to sockets this resolved itself. The nice thing about the new

With respect to docs - it's always hard to convince the Arrow mailing list that something is unclear in their docs for some reason. Perhaps you could add a warning here that data should not be simultaneously written and read from simple files, but only from sockets? It's kinda obvious, but the Arrow docs hint that it is possible using their (Pythonic) file format. Thanks again!
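The pitfall described above (simultaneously writing and reading the same plain file) can be illustrated without Arrow at all: a reader polling a growing file can observe a partially written record and must treat it as "wait", not as an error or end-of-stream. A minimal, hypothetical sketch using length-prefixed records rather than the actual Arrow IPC format:

```python
import struct

# Simulate a writer that has flushed only part of a length-prefixed record:
# a 4-byte little-endian length header, then the payload.
payload = b"hello world"
complete = struct.pack("<I", len(payload)) + payload
partial = complete[:7]  # writer was interrupted mid-record

def read_record(buf: bytes):
    """Try to decode one record; return None if the data is still incomplete."""
    if len(buf) < 4:
        return None  # header not fully written yet: caller should wait
    (length,) = struct.unpack("<I", buf[:4])
    if len(buf) < 4 + length:
        return None  # payload not fully written yet: caller should wait
    return buf[4:4 + length]

assert read_record(partial) is None       # mid-write: must wait, not error
assert read_record(complete) == payload   # full record: safe to decode
```

A socket gives the reader a blocking byte stream instead, so it never observes a half-flushed file in the first place.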
Awesome! Could you share or PR the socket solution? I added an example in this PR demonstrating files, but if we change the example to use sockets like you are doing, maybe it is more obvious what users should use?
Not sure what's the best way to "merge" my proposed changes, so I'll just add them here:

`examples/ipc_pyarrow/main.py`:

```python
import socket
from time import sleep

import pyarrow as pa

# Set up the data exchange socket
HOST = "127.0.0.1"
PORT = 12989
sk = socket.socket()
sk.bind((HOST, PORT))
sk.listen()

data = [
    pa.array([1, 2, 3, 4]),
    pa.array(["foo", "bar", "baz", None]),
    pa.array([True, None, False, True]),
]
batch = pa.record_batch(data, names=["f0", "f1", "f2"])

# Accept incoming connection and stream the data away
connection, address = sk.accept()
dummy_socket_file = connection.makefile('wb')
writer = pa.ipc.new_stream(dummy_socket_file, batch.schema)
while True:
    for _ in range(10):
        writer.write(batch)
    sleep(1)
```

`examples/ipc_pyarrow/src/main.rs`:

```rust
use std::net::TcpStream;
use std::thread;
use std::time::Duration;

use arrow2::error::Result;
use arrow2::io::ipc::read;

fn main() -> Result<()> {
    const ADDRESS: &str = "127.0.0.1:12989";

    let mut reader = TcpStream::connect(ADDRESS)?;
    let metadata = read::read_stream_metadata(&mut reader)?;
    let mut stream = read::StreamReader::new(&mut reader, metadata);

    let mut idx = 0;
    loop {
        match stream.next() {
            Some(x) => match x {
                Ok(read::StreamState::Some(_batch)) => {
                    idx += 1;
                    println!("batch: {:?}", idx)
                }
                Ok(read::StreamState::Waiting) => thread::sleep(Duration::from_millis(4000)),
                Err(e) => println!("{:?} ({})", e, idx),
            },
            None => break,
        };
    }
    Ok(())
}
```
Awesome, I incorporated it in the example; thank you for taking the time. I aim to release this in v0.4, so that you do not need to depend on the github version.
The example is available in
When the stream reader gets called with `.next()` and the stream has not finished, we should not error nor return `None` (end of stream); instead we should offer the state `Waiting`, so that the user can decide what to do if no new batches have been observed but no finished state has been found either.

This PR changes the return type of the stream reader to `Option<Result<State>>`, where:

- `None` describes the end of the stream
- `Some(Err)` describes an error
- `Some(Ok(State::Waiting))` describes that the stream has not finished, but that no new data is available to read
- `Some(Ok(State::Some(_)))` describes a new batch

Thanks a lot to @HagaiHargil for clarifying the limitations of the stream reader.

Backwards incompatible changes:

- `StreamReader` no longer implements `RecordBatchReader`
- The `Iterator` implementation of `StreamReader` no longer returns `Option<Result<RecordBatch>>` and instead returns `Option<Result<State>>`. Use `State::unwrap` if you are certain that the stream contains a batch (and is not waiting).

Closes #301
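For readers migrating, the new return shape is typically consumed with a match over all four cases. The sketch below uses a self-contained stand-in enum that mirrors the shape of the stream state (it is not the crate's actual type, which wraps a `RecordBatch` rather than an `i32`):

```rust
// Stand-in mirroring the shape of the stream reader's state (illustrative only).
enum StreamState<T> {
    Some(T), // a new batch was read
    Waiting, // the stream has not finished, but no new data is available yet
}

// Drain a sequence of Option<Result<StreamState<T>>> items into the batches read,
// skipping Waiting states (a real consumer would back off and poll again).
fn drain<T>(states: Vec<Option<Result<StreamState<T>, String>>>) -> Vec<T> {
    let mut batches = vec![];
    for state in states {
        match state {
            Some(Ok(StreamState::Some(batch))) => batches.push(batch),
            Some(Ok(StreamState::Waiting)) => { /* back off, then poll again */ }
            Some(Err(e)) => panic!("stream error: {}", e),
            None => break, // end of stream
        }
    }
    batches
}

fn main() {
    let states = vec![
        Some(Ok(StreamState::Some(1))),
        Some(Ok(StreamState::Waiting)), // the writer has not produced a batch yet
        Some(Ok(StreamState::Some(2))),
        None,
    ];
    assert_eq!(drain(states), vec![1, 2]);
    println!("ok");
}
```

The key difference from the old API is the middle arm: previously `Waiting` did not exist, so an unfinished stream surfaced as an error or a premature end.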