fix: Iceberg fixes for reading table metadata #2810

Merged
merged 9 commits into from
Apr 11, 2024

Contributor

@vrongmeal vrongmeal commented Mar 21, 2024

Fixes #2603

@vrongmeal vrongmeal changed the title from "[WIP] Iceberg stuff" to "fix: Iceberg fixes for reading table metadata" Mar 25, 2024
@tychoish
Contributor

What more do we need here? (I think there's some more test coverage that we could use here?)

@vrongmeal
Contributor Author

> What more do we need here? (I think there's some more test coverage that we could use here?)

Added test coverage. There is one remaining issue, though: if we overwrite the existing data (say, by running generate_pyiceberg.py again without deleting the existing data), GlareDB returns twice the number of rows, whereas according to Iceberg the data is correctly overwritten.
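
The doubled row count is consistent with a reader picking up data files that the table's current snapshot no longer references. The thread does not confirm that this is what happens in GlareDB. The sketch below is only a toy model of snapshot-style overwrite semantics, not Iceberg's real metadata layout, and every name in it (`ToyTable`, `read_current`, `read_all_files`) is hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class ToyTable:
    """Toy stand-in for a snapshot-based table (an assumption, not Iceberg's format)."""

    # Every data file ever written: path -> rows stored in that file.
    files: dict[str, list[int]] = field(default_factory=dict)
    # One entry per snapshot: the file paths that are live in that snapshot.
    snapshots: list[list[str]] = field(default_factory=list)

    def overwrite(self, rows: list[int]) -> None:
        # An overwrite writes a new data file and commits a snapshot that
        # references only that file. The old file stays in storage.
        path = f"data/{len(self.files):05d}.parquet"
        self.files[path] = list(rows)
        self.snapshots.append([path])

    def read_current(self) -> list[int]:
        # Snapshot-aware read: follow only the files listed by the latest snapshot.
        if not self.snapshots:
            return []
        return [row for path in self.snapshots[-1] for row in self.files[path]]

    def read_all_files(self) -> list[int]:
        # Naive read: scan every data file in storage, including stale ones.
        return [row for rows in self.files.values() for row in rows]


table = ToyTable()
table.overwrite([1, 2, 3])
table.overwrite([1, 2, 3])  # second run of the generator over the same data

print(len(table.read_current()))    # 3
print(len(table.read_all_files()))  # 6
```

In this model the snapshot-aware read sees the overwritten data once, while the naive scan sees it twice, which matches the symptom described above.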

@vrongmeal vrongmeal marked this pull request as ready for review March 29, 2024 09:33
Comment on lines 2 to 9
How to run this script:
======================

$ python3 -m venv venv
$ source venv/bin/activate
$ pip install "pyiceberg[pyarrow,sql-sqlite]"
$ pip install botocore
$ python ./testdata/generate_pyiceberg.py
Contributor

The data set ends up being pretty big. While I don't think it really matters, it'd be good to avoid committing lots of test data to the repo just because, when we have other options (e.g. uploading the data to the test bucket, or generating it as part of a fixture in one of the pytests).

Contributor Author

Yeah, thinking the same thing.

Signed-off-by: Vaibhav <vrongmeal@gmail.com>
@vrongmeal
Contributor Author

@tychoish I've made the required changes for pytest.

@vrongmeal vrongmeal requested a review from tychoish April 10, 2024 09:55
Signed-off-by: Vaibhav <vrongmeal@gmail.com>
Contributor

@tychoish tychoish left a comment

looks great, one concern about the size of test data.

the current test failure is orthogonal to this PR and I can push through it.

is there anything that needs to be done here?

Contributor

45 MB feels big for a test file.

Contributor Author

Yeah, I'll upload it to GCS. We have a script to handle big data files.

Contributor Author

It was a simple HTTP URL, so I added it to the fixture to download the data.
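
A download step of this kind could be sketched as below. This is not the code from the PR: the helper name `fetch_test_data`, the single-file layout, and the caching behavior are assumptions for illustration, and the actual URL is not given in the thread.

```python
import shutil
import urllib.request
from pathlib import Path


def fetch_test_data(url: str, dest_dir: Path) -> Path:
    """Download `url` into `dest_dir` once and return the local path.

    Hypothetical helper; the name and caching behavior are assumptions.
    """
    dest_dir.mkdir(parents=True, exist_ok=True)
    target = dest_dir / url.rsplit("/", 1)[-1]
    if target.exists():
        # Already downloaded earlier; reuse the cached copy.
        return target
    # Write to a temporary name first so an interrupted download is never
    # mistaken for a complete file on the next run.
    partial = target.with_name(target.name + ".part")
    with urllib.request.urlopen(url) as resp, open(partial, "wb") as out:
        shutil.copyfileobj(resp, out)
    partial.replace(target)
    return target
```

In a pytest suite, a helper like this would typically be called from a session-scoped fixture with a directory obtained from `tmp_path_factory.mktemp(...)`, so the data is fetched once per test run rather than once per test.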

@vrongmeal vrongmeal merged commit 15cd5dc into main Apr 11, 2024
26 checks passed
@vrongmeal vrongmeal deleted the vrongmeal/iceberg branch April 11, 2024 13:22
Development

Successfully merging this pull request may close these issues.

error reading iceberg table
2 participants