
Panic occurring when using streaming and limit with Parquet #18181

Closed
bchalk101 opened this issue Aug 14, 2024 · 0 comments · Fixed by #18202
Labels: A-io-cloud (reading/writing to cloud storage), A-io-parquet (reading/writing Parquet files), accepted (ready for implementation), bug, P-high (priority: high), python (related to Python Polars), regression (issue introduced by a new release)

Comments

@bchalk101
Contributor

Checks

  • I have checked that this issue has not already been reported.
  • I have confirmed this bug exists on the latest version of Polars.

Reproducible example

import polars as pl
import boto3

session = boto3.session.Session()
credentials = session.get_credentials().get_frozen_credentials()
storage_options = {
    "aws_access_key_id": credentials.access_key,
    "aws_secret_access_key": credentials.secret_key,
    "aws_session_token": credentials.token,
    "aws_region": session.region_name,
}

df = pl.scan_parquet(<s3_path>, storage_options=storage_options)
df = df.limit(1)

df = df.collect(streaming=True)
print(df[0])

`<s3_path>` is a path in S3 containing multiple Parquet files.

Log output

POLARS PREFETCH_SIZE: 20
RUN STREAMING PIPELINE
[parquet -> ordered_sink]
STREAMING CHUNK SIZE: 3571 rows
thread '<unnamed>' panicked at crates/polars-pipe/src/executors/sources/parquet.rs:127:50:
called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }
note: run with `RUST_BACKTRACE=1` environment variable to display a backtrace
Traceback (most recent call last):
  File "/Users/boruchc/work/angie-shuffle-service/./scipts/read_with_polars.py", line 12, in <module>
    df = df.collect(streaming=True)
  File "/opt/homebrew/Caskroom/miniconda/base/envs/angie-shuffle-service-dev/lib/python3.10/site-packages/polars/lazyframe/frame.py", line 2027, in collect
    return wrap_df(ldf.collect(callback))
pyo3_runtime.PanicException: called `Result::unwrap()` on an `Err` value: Os { code: 2, kind: NotFound, message: "No such file or directory" }

Issue description

I believe the issue is that Polars is defaulting to init_next_reader_sync when it should be doing an async read.

I believe the change to this if statement is causing the problem: https://github.com/pola-rs/polars/blob/main/crates/polars-pipe/src/executors/sources/parquet.rs#L273

It should be reverted to this:

if self.run_async {
    #[cfg(not(feature = "async"))]
    panic!("activate 'async' feature");
    #[cfg(feature = "async")]
    {
        let range = range
            .zip(&mut self.iter)
            .map(|(_, index)| index)
            .collect::<Vec<_>>();
        let init_iter = range.into_iter().map(|index| self.init_reader_async(index));
        let batched_readers = polars_io::pl_async::get_runtime()
            .block_on_potential_spawn(async {
                futures::future::try_join_all(init_iter).await
            })?;
        for r in batched_readers {
            self.finish_init_reader(r)?;
        }
    }
} else {
    for _ in 0..self.prefetch_size - self.batched_readers.len() {
        self.init_next_reader()?
    }
}
Expected behavior

Polars should read the Parquet files and limit the result to 1 row.

Installed versions

--------Version info---------
Polars:               1.4.1
Index type:           UInt32
Platform:             macOS-14.5-arm64-arm-64bit
Python:               3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:51:49) [Clang 16.0.6 ]

----Optional dependencies----
adbc_driver_manager:  <not installed>
cloudpickle:          <not installed>
connectorx:           <not installed>
deltalake:            <not installed>
fastexcel:            <not installed>
fsspec:               2024.6.1
gevent:               <not installed>
great_tables:         <not installed>
hvplot:               <not installed>
matplotlib:           3.9.1
nest_asyncio:         1.6.0
numpy:                1.26.4
openpyxl:             <not installed>
pandas:               2.2.2
pyarrow:              17.0.0
pydantic:             1.10.17
pyiceberg:            <not installed>
sqlalchemy:           2.0.31
torch:                2.3.1
xlsx2csv:             <not installed>
xlsxwriter:           <not installed>
@bchalk101 bchalk101 added bug Something isn't working needs triage Awaiting prioritization by a maintainer python Related to Python Polars labels Aug 14, 2024
@nameexhaustion nameexhaustion self-assigned this Aug 14, 2024
@nameexhaustion nameexhaustion added P-high Priority: high A-io-cloud Area: reading/writing to cloud storage A-io-parquet Area: reading/writing Parquet files accepted Ready for implementation and removed needs triage Awaiting prioritization by a maintainer labels Aug 14, 2024
@nameexhaustion nameexhaustion added the regression Issue introduced by a new release label Aug 15, 2024