0.1.2 Testing pyiceberg 0.8.1 feature requests #1

Closed
11 tasks
tusharchou opened this issue Sep 13, 2024 · 5 comments · Fixed by #2 · May be fixed by #83
Assignees
Labels
good first issue Good for newcomers

Comments

@tusharchou
Owner

tusharchou commented Sep 13, 2024

How to contribute to pyiceberg-0.9.0

Step 1: Find a problem statement

  1. fork or sync repo
  2. clone or pull locally
  3. run setup and tests
  4. find scope for improvement in the issues or on Slack
  5. reach out to the community for help

A. Scope: Issue-1223 on version 0.7.1 Oct 8

The user wants to count rows as a metadata-only operation.

The iceberg-python repository released version 0.8.1 on Nov 19, 2024.
The library supports a feature called inspect that helps a user quickly get insights from the table.metadata.

  • test partition row count using inspect
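
To make the idea concrete, here is a minimal sketch of what a metadata-only row count per partition amounts to. It assumes each manifest entry records the partition and the record count of one data file; `DataFileEntry` and `partition_row_counts` are hypothetical names for illustration, not PyIceberg classes. In PyIceberg itself, `table.inspect.partitions()` should expose similar per-partition figures, but this sketch does not call the library.

```python
from collections import defaultdict
from dataclasses import dataclass


@dataclass(frozen=True)
class DataFileEntry:
    """Hypothetical stand-in for one manifest entry (not a PyIceberg class)."""
    partition: str     # e.g. "dt=2024-10-08"
    record_count: int  # row count stored in the manifest for this data file


def partition_row_counts(entries):
    """Sum the stored record counts per partition without opening any data file."""
    totals = defaultdict(int)
    for entry in entries:
        totals[entry.partition] += entry.record_count
    return dict(totals)


# Made-up sample entries, for illustration only.
entries = [
    DataFileEntry("dt=2024-10-07", 120),
    DataFileEntry("dt=2024-10-07", 80),
    DataFileEntry("dt=2024-10-08", 50),
]
counts = partition_row_counts(entries)
total_rows = sum(counts.values())
```

If the table has row-level delete files, a sum like this only gives an upper bound, so a real test should state whether deletes are in scope.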

Step 2: Understand the root cause

  1. Write a test case to recreate the issue locally
  2. Find the source of the exception or the missing function
  3. Handle the test in pytest
  4. Propose a solution to the community on the issue and on Slack
  5. Create a PR on your fork
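
The first three steps can be sketched as a small test that pins down a missing function. This is only an illustration: `TableStub` and the `count()` method name are hypothetical and are not part of PyIceberg.

```python
class TableStub:
    """Hypothetical stand-in for a table object (not a PyIceberg class)."""

    def __init__(self, file_record_counts):
        self.file_record_counts = file_record_counts


def reproduce_missing_count(table):
    """Call the requested method; return the error message, or None if it works."""
    try:
        table.count()
    except AttributeError as exc:
        return str(exc)
    return None


def test_count_is_missing():
    # Recreates the gap locally: the stub holds per-file counts but has no count().
    message = reproduce_missing_count(TableStub([3, 5]))
    assert message is not None
    assert "count" in message
```

Once the feature exists, the same test can be flipped to assert the expected total instead of the error.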

B. Create a use case to understand the issue

The 0.8.1 release integrated a new feature that inspects the table using metadata only.

Inspects

0.7.1: Fix delete to trace existing manifests when a data file is partially rewritten

So even when we are rewriting the data partially, we still need to add the new manifest entries as "existing" entries in order to track the new data files that are rewritten.
These files are unaffected by the delete and should be kept in the manifest as existing entries.
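
A simplified model of that bookkeeping is sketched below. The status codes follow the Iceberg table spec (0 existing, 1 added, 2 deleted). Everything else is an assumption made for illustration: `Entry`, `rewrite_manifest` and `live_rows` are hypothetical helpers rather than PyIceberg code, and marking the rewritten file as ADDED is one reading of the fix.

```python
from dataclasses import dataclass
from enum import IntEnum


class Status(IntEnum):
    """Manifest entry status codes as defined in the Iceberg table spec."""
    EXISTING = 0
    ADDED = 1
    DELETED = 2


@dataclass(frozen=True)
class Entry:
    """Hypothetical, simplified manifest entry (not a PyIceberg class)."""
    path: str
    record_count: int
    status: Status


def rewrite_manifest(old_entries, replaced, new_files):
    """Build the entries of a new manifest after a partial rewrite.

    replaced:  paths of data files whose rows were partly deleted
    new_files: (path, record_count) pairs holding the surviving rows
    """
    result = []
    for entry in old_entries:
        if entry.status is Status.DELETED:
            continue  # already removed by an earlier snapshot
        if entry.path in replaced:
            result.append(Entry(entry.path, entry.record_count, Status.DELETED))
        else:
            # Untouched by the delete: carry it forward as EXISTING.
            result.append(Entry(entry.path, entry.record_count, Status.EXISTING))
    # Assumption: the rewritten files enter the manifest as ADDED.
    result.extend(Entry(path, count, Status.ADDED) for path, count in new_files)
    return result


def live_rows(entries):
    """Rows still visible: every entry that is not marked DELETED."""
    return sum(e.record_count for e in entries if e.status is not Status.DELETED)


# Made-up example: a.parquet loses 30 of its 100 rows, b.parquet is untouched.
old = [
    Entry("a.parquet", 100, Status.ADDED),
    Entry("b.parquet", 40, Status.ADDED),
]
new = rewrite_manifest(old, {"a.parquet"}, [("a-rewritten.parquet", 70)])
statuses = {e.path: e.status.name for e in new}
```

If the untouched file were dropped instead of being carried forward as EXISTING, its 40 rows would silently disappear from the table, which is the kind of bug the fix guards against.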

0.7.1 pytest: tests/integration/test_writes/test_writes.py
  • test_delete_threshold()

  • load minio catalog

  • create schema

  • partition specification

  • clean environment for testing

  • exception handling

  • create table

  • generate test data

  • design test

  • Source Issue

Let's try it out and understand the root cause of this issue.

@tusharchou
Owner Author

tusharchou commented Oct 1, 2024

How to contribute

Hi iceberg community,

The goal of this issue is to create a sandbox platform for open source enthusiasts to learn how to contribute to Apache projects like iceberg-python. We get to learn new libraries and share that learning with the community.

Data lakehouse formats have a huge impact on cloud cost, and understanding how to optimize them is very important for scaling in production.

I believe that if we use a real-world use case to break down the problem, it will become easier to solve.

Explain the problem better

Who is facing the problem?

The Python developer facing this problem is probably working at a data product company in a production environment.

What is the problem?

Interacting with Iceberg tables programmatically using Python

When does the problem occur?

Accessing the Iceberg table while a Spark Job is updating the underlying Table.

Where does the user encounter the problem?

To replicate the cloud setup locally, we can use the Tabular Spark Docker container.

Why is the problem existing?

Managing Iceberg tables from Python makes them very approachable.

@tusharchou tusharchou reopened this Oct 1, 2024
@tusharchou tusharchou modified the milestones: 0.0.1, 0.1.0 Oct 1, 2024
@tusharchou tusharchou changed the title Testing pyiceberg 0.7.1 limitations 0.1.1 Testing pyiceberg 0.7.1 limitations Oct 1, 2024
@tusharchou tusharchou changed the title 0.1.1 Testing pyiceberg 0.7.1 limitations 0.1.2 Testing pyiceberg 0.7.1 limitations Oct 7, 2024
@tusharchou
Owner Author

Write a pytest for this feature request

iceberg-python
Count rows as a metadata-only operation

@tusharchou tusharchou changed the title 0.1.2 Testing pyiceberg 0.7.1 limitations 0.1.2 Testing pyiceberg 0.8.1 feature requests Nov 1, 2024
@tusharchou tusharchou linked a pull request Nov 4, 2024 that will close this issue
@tusharchou
Owner Author

@rakhioza07

@tusharchou
Owner Author

@rakhioza07 I am trying to close this issue with PR #83.

The problem is that the row count of each partition should be accessible through the table metadata class.

I have attempted to solve this in local_data_platfom.format.iceberg.manifest.py

The test I have written is in tests/test_manifest.py

Before raising a PR to PyIceberg, I want to make sure my understanding is correct.

@tusharchou
Owner Author

Merged in iceberg python!
