
Panic occurring when using streaming and limit with Parquet #18181

Closed
bchalk101 opened this issue Aug 14, 2024 · 0 comments · Fixed by #18202
Labels: A-io-cloud (reading/writing to cloud storage), A-io-parquet (reading/writing Parquet files), accepted (ready for implementation), bug, P-high (priority: high), python (related to Python Polars), regression (issue introduced by a new release)

Comments

@bchalk101
Contributor

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import boto3

session = boto3.session.Session()
credentials = session.get_credentials().get_frozen_credentials()
storage_options = {
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "aws_session_token": credentials.token,
    "aws_region": session.region_name,
}

df = pl.scan_parquet(<s3_path>, storage_options=storage_options)
df = df.limit(1)

df = df.collect(streaming=True)
print(df[0])

`<s3_path>` is a path in S3 containing multiple Parquet files.

Log output

POLARS PREFETCH_SIZE: 20
RUN STREAMING PIPELINE
[parquet -> ordered_sink]
STREAMING CHUNK SIZE: 3571 rows
thread '<unnamed>' panicked at crates/polars-pipe/src/executors/sources/parquet.rs:127:50:
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/boruchc/work/angie-shuffle-service/./scipts/read_with_polars.py", line 12, in <module>
    df = df.collect(streaming=True)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/angie-shuffle-service-dev/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 2027, in collect
    return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }

Issue description

I believe the issue is that Polars is defaulting to init_next_reader_sync when it should be doing an async read.

I believe the change to this if statement is causing the problem: https://github.com/pola-rs/polars/blob/main/crates/polars-pipe/src/executors/sources/parquet.rs#L273

It should be reverted to this:

if self.run_async {
    #[cfg(not(feature = "async"))]
    panic!("activate 'async' feature");
    #[cfg(feature = "async")]
    {
        let range = range
            .zip(&mut self.iter)
            .map(|(_, index)| index)
            .collect::<Vec<_>>();
        let init_iter = range.into_iter().map(|index| self.init_reader_async(index));
        let batched_readers = polars_io::pl_async::get_runtime()
            .block_on_potential_spawn(async {
                futures::future::try_join_all(init_iter).await
            })?;
        for r in batched_readers {
            self.finish_init_reader(r)?;
        }
    }
} else {
    for _ in 0..self.prefetch_size - self.batched_readers.len() {
        self.init_next_reader()?
    }
}
Expected behavior

Polars should read the Parquet files and limit the result to 1 row.

Installed versions

--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.6.1
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.9.1
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             1.10.17
pyiceberg:            <not installed>
sqlalchemy:           2.0.31
torch:                2.3.1
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@bchalk101 bchalk101 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Aug 14, 2024
@nameexhaustion nameexhaustion self-assigned this Aug 14, 2024
@nameexhaustion nameexhaustion added P-high Priority: high A-io-cloud Area: reading/writing to cloud storage A-io-parquet Area: reading/writing Parquet files accepted Ready for implementation and removed needs triage Awaiting prioritization by a maintainer labels Aug 14, 2024
@nameexhaustion nameexhaustion added the regression Issue introduced by a new release label Aug 15, 2024