Panic using record_batches_to_json_rows: PrimitiveArray out of bounds #1277

Closed
cmackenzie1 opened this issue Apr 13, 2023 · 7 comments
Labels
bug Something isn't working

@cmackenzie1
Contributor

Environment

Delta-rs version: 0.8.0

Binding:

Environment:

  • Cloud provider: N/A
  • OS: macOS Ventura 13.3.1
  • Other: rustc 1.68.2 (9eb3afe9e 2023-03-27)

Bug

What happened: Serializing query results (a slice of RecordBatch) to JSON with arrow_json::LineDelimitedWriter panics.

let ctx = SessionContext::new();
ctx.register_table(dataset_name.as_str(), Arc::new(table)).unwrap();

let df = ctx
    .sql(&format!("SELECT * FROM {}", dataset_name))
    .await
    .unwrap();

// Collect all result batches into memory.
let results = df.collect().await.unwrap();

// Write the batches as newline-delimited JSON into an in-memory buffer.
let buf = Vec::new();
let mut writer = arrow_json::LineDelimitedWriter::new(buf);
writer.write_batches(results.as_slice()).unwrap(); // <- panic happens in here
writer.finish().unwrap();

What you expected to happen:
Result is serialized to JSON without panicking

How to reproduce it:

  • Create a partitioned table using delta-rs
  • Query the partitioned table using delta-rs with datafusion and serialize the results to JSON

More details:

Stacktrace

thread 'tokio-runtime-worker' panicked at 'Trying to access an element at index 27 from a PrimitiveArray of length 27', /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-array-33.0.0/src/array/primitive_array.rs:351:9
stack backtrace:
   0: rust_begin_unwind
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/std/src/panicking.rs:575:5
   1: core::panicking::panic_fmt
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/panicking.rs:64:14
   2: arrow_array::array::primitive_array::PrimitiveArray<T>::value
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-array-33.0.0/src/array/primitive_array.rs:351:9
   3: <&arrow_array::array::primitive_array::PrimitiveArray<T> as arrow_array::array::ArrayAccessor>::value
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-array-33.0.0/src/array/primitive_array.rs:724:9
   4: <&arrow_array::array::primitive_array::PrimitiveArray<arrow_array::types::TimestampMicrosecondType> as arrow_cast::display::DisplayIndexState>::write
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-cast-33.0.0/src/display.rs:465:29
   5: <arrow_cast::display::ArrayFormat<F> as arrow_cast::display::DisplayIndex>::write
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-cast-33.0.0/src/display.rs:361:9
   6: <arrow_cast::display::ValueFormatter as core::fmt::Display>::fmt
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-cast-33.0.0/src/display.rs:162:15
   7: <T as alloc::string::ToString>::to_string
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/alloc/src/string.rs:2536:9
   8: arrow_json::writer::set_column_for_json_rows::{{closure}}
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-json-33.0.0/src/writer.rs:309:25
   9: core::iter::traits::iterator::Iterator::for_each::call::{{closure}}
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/iter/traits/iterator.rs:834:29
  10: <core::iter::adapters::enumerate::Enumerate<I> as core::iter::traits::iterator::Iterator>::fold::enumerate::{{closure}}
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/iter/adapters/enumerate.rs:106:27
  11: core::iter::traits::iterator::Iterator::fold
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/iter/traits/iterator.rs:2438:21
  12: <core::iter::adapters::enumerate::Enumerate<I> as core::iter::traits::iterator::Iterator>::fold
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/iter/adapters/enumerate.rs:112:9
  13: core::iter::traits::iterator::Iterator::for_each
             at /rustc/9eb3afe9ebe9c7d2b84b71002d44f4a0edac95e0/library/core/src/iter/traits/iterator.rs:837:9
  14: arrow_json::writer::set_column_for_json_rows
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-json-33.0.0/src/writer.rs:305:13
  15: arrow_json::writer::record_batches_to_json_rows
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-json-33.0.0/src/writer.rs:422:17
  16: arrow_json::writer::Writer<W,F>::write_batches
             at /Users/cole/.cargo/registry/src/github.com-1ecc6299db9ec823/arrow-json-33.0.0/src/writer.rs:581:20
  17: loghouse::handlers::query::query::{{closure}}
             at ./src/handlers/query.rs:93:21
@cmackenzie1 cmackenzie1 added the bug Something isn't working label Apr 13, 2023
@cmackenzie1
Contributor Author

cmackenzie1 commented Apr 13, 2023

It is unclear whether this bug is introduced when writing or when reading the table, and whether it has already been fixed in later versions of datafusion or arrow.

I have attached a DeltaTable that reproduces this error.

issue-1277.zip

$ sha256sum < issue-1277.zip
7fa3d9df7367400749b8d694174cadaebf62b1893bc7521c34f1e2f2ed265db2  -

$ tree http_requests
http_requests
├── _delta_log
│   ├── 00000000000000000000.json
│   └── 00000000000000000001.json
├── date=2022-11-01
│   └── part-00000-2f186249-30c2-400d-9212-3a69941eeb3a-c000.snappy.parquet
└── date=2022-11-02
    └── part-00000-bcd22ccd-28ce-41ed-9f42-e53d2c14e9fc-c000.snappy.parquet

@cmackenzie1
Contributor Author

After some more experimenting, this looks like an "off-by-one" error when going from Parquet to RecordBatches, and it manifests when using partitioned tables. However, I am not sure whether this is a datafusion bug or a delta-rs bug.
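
For illustration, here is a toy sketch of the failure mode the panic message points at: a column is asked for row 27 while it only holds 27 values (valid indices 0 to 26). This is not the arrow-rs code; `Column`, `value_at`, and `first_missing_row` are hypothetical names, and the 28-row count is an assumption chosen to match the message.

```rust
// Toy model only: `Column` is a hypothetical stand-in, not an arrow-rs type.
struct Column {
    values: Vec<i64>,
}

impl Column {
    /// Bounds-checked access: returns None instead of panicking.
    fn value_at(&self, row: usize) -> Option<i64> {
        self.values.get(row).copied()
    }
}

/// Returns the first row index that the column cannot serve, if any.
fn first_missing_row(column: &Column, num_rows: usize) -> Option<usize> {
    (0..num_rows).find(|&row| column.value_at(row).is_none())
}

fn main() {
    // Assumption: the batch claims 28 rows, but the column only holds 27 values.
    let column = Column { values: (0..27).collect() };
    let num_rows = 28;

    match first_missing_row(&column, num_rows) {
        Some(row) => println!(
            "row {} requested from a column of length {}",
            row,
            column.values.len()
        ),
        None => println!("all rows present"),
    }
}
```

An unchecked accessor would panic at exactly this row, which is consistent with the frames in the stack trace where the JSON writer iterates over rows and calls `value` on the array.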

@roeap
Collaborator

roeap commented Apr 14, 2023

Just in case this is an arrow / datafusion issue that may already be fixed: would it be possible to try again with main once #1249 is merged?

@gruuya
Contributor

gruuya commented Apr 14, 2023

Fwiw, I think this is the same problem as in splitgraph/seafowl#349. The underlying issue in arrow-rs has been fixed in v36.0.0, so #1249 should resolve this.

@cmackenzie1
Contributor Author

Thanks @roeap, @gruuya - it does look like the bug in arrow-rs you mentioned. I tried with rev e6df734 and it works!

Is there an ETA for 0.9.0 release?

@roeap
Collaborator

roeap commented Apr 14, 2023

> Is there an ETA for 0.9.0 release?

Yes, we are currently preparing it. Hopefully today :)

@cmackenzie1
Contributor Author

This is resolved now by #1249
