Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(python): wrong batch size #2314

Merged
merged 8 commits into from
Mar 26, 2024

Conversation

ion-elgreco
Copy link
Collaborator

@ion-elgreco ion-elgreco commented Mar 22, 2024

Description

Was passing the wrong param

@github-actions github-actions bot added binding/python Issues for the Python package binding/rust Issues for the Rust crate labels Mar 22, 2024
@ion-elgreco ion-elgreco enabled auto-merge (squash) March 22, 2024 11:11
@github-actions github-actions bot removed the binding/rust Issues for the Rust crate label Mar 22, 2024
@rtyler
Copy link
Member

rtyler commented Mar 22, 2024

I'm struggling to understand this change and how it relates to the issue, I don't see how WriterProperties are sent in the Rust writer, or how removing this argument fixes things, maybe it just clears up confusion?

The write_deltalake function still has this parameter but I'm also not understanding why 😆

@ion-elgreco
Copy link
Collaborator Author

ion-elgreco commented Mar 22, 2024

@rtyler when I built the rust engine on the python side, I misinterpreted the max row group size as the batch size.

The actual max_batch_size and max_row_group_size can be set with the python WriterProperies class.
It's confusing because we are still supporting both pyarrow and rust engines

Copy link
Collaborator

@wjones127 wjones127 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there a test on the current Rust writer for max_rows_per_group? Is it being set through writer_properties: Option<HashMap<String, Option<String>>>,?

@ion-elgreco
Copy link
Collaborator Author

@wjones127 There are some tests here: https://github.com/delta-io/delta-rs/blob/main/python/tests/test_writerproperties.py.

Setting the max_row_group_size inside WriterProperties works fine, see here: #2309 (comment).

I just by accident pass max_rows_per_group which is used for the pyarrow writer to parquet::write_batch_size, which is something different : P

@ion-elgreco ion-elgreco merged commit 5b404e2 into delta-io:main Mar 26, 2024
23 checks passed
@ion-elgreco ion-elgreco deleted the fix/wrong_batch_size branch March 26, 2024 08:11
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
binding/python Issues for the Python package
Projects
None yet
Development

Successfully merging this pull request may close these issues.

delta_rs don't seems to respect the row group size
3 participants