Use ArrowScan.to_table to replace project_table #1180

Merged: 4 commits merged into apache:main on Sep 20, 2024
Conversation

@JE-Chen (Contributor) commented Sep 17, 2024

Follow-up to PR #1119.

  • Use ArrowScan.to_table to replace project_table in these files (a sketch of the call change follows the list):
    • pyiceberg\table\__init__.py
    • pyiceberg\io\pyarrow.py
    • pyiceberg\test_pyarrow.py
* Use ArrowScan.to_table to replace project_table in these files:
** pyiceberg\table\__init__.py
** pyiceberg\io\pyarrow.py
** pyiceberg\test_pyarrow.py
Replace all remaining uses of project_table with ArrowScan.to_table
@sungwy (Collaborator) commented Sep 17, 2024

Hi @JE-Chen, thank you for putting together this PR! It looks like these were missed in the initial refactoring. Running the CI now.

Fix format
Modify by ruff
@JE-Chen (Contributor, Author) commented Sep 17, 2024

Hi @sungwy, I have already fixed the problem that ruff found.

@kevinjqliu (Contributor) left a review comment

Thanks for the contribution!

Looks like this captures all the project_table usage: https://github.com/search?q=repo%3Aapache%2Ficeberg-python%20project_table&type=code

Let's double-check the logs for make test and make test-integration to make sure there are no deprecation warnings related to project_table.
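One way to run that check locally, offered only as a rough sketch rather than anything in this PR: escalate any DeprecationWarning that mentions project_table into an error before exercising the read path, so a missed call site fails immediately instead of only appearing in the pytest warnings summary.

import warnings

# Turn any DeprecationWarning whose message mentions project_table into an error.
warnings.filterwarnings("error", message=".*project_table.*", category=DeprecationWarning)

# ...then run the scan / to_arrow() code paths being tested...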

@JE-Chen (Contributor, Author) commented Sep 18, 2024

Here are the make test-integration logs generated by the CI run.
No warnings were found for pyiceberg/table/__init__.py or tests/io/test_pyarrow.py.

=============================== warnings summary ===============================
tests/integration/test_add_files.py: 42 warnings
tests/integration/test_deletes.py: 27 warnings
tests/integration/test_inspect_table.py: 24 warnings
tests/integration/test_partition_evolution.py: 140 warnings
tests/integration/test_reads.py: 54 warnings
tests/integration/test_rest_schema.py: 91 warnings
tests/integration/test_snapshot_operations.py: 4 warnings
tests/integration/test_writes/test_partitioned_writes.py: 175 warnings
tests/integration/test_writes/test_writes.py: 128 warnings
== 809 passed, 7 skipped, 2840 deselected, 1441 warnings in 295.87s (0:04:55) ==

For make test, no warnings were found for pyiceberg/table/__init__.py or tests/io/test_pyarrow.py either.

=============================== warnings summary ===============================
tests/integration/test_add_files.py: 42 warnings
tests/integration/test_deletes.py: 27 warnings
tests/integration/test_inspect_table.py: 24 warnings
tests/integration/test_partition_evolution.py: 140 warnings
tests/integration/test_reads.py: 54 warnings
tests/integration/test_rest_schema.py: 91 warnings
tests/integration/test_snapshot_operations.py: 4 warnings
tests/integration/test_writes/test_partitioned_writes.py: 175 warnings
tests/integration/test_writes/test_writes.py: 128 warnings
== 809 passed, 7 skipped, 2840 deselected, 1441 warnings in 283.39s (0:04:43) ==

The make test-integration log on my VM has many errors, though. Any ideas? (It succeeds in the CI tests.)

$ make test-integration
docker compose -f dev/docker-compose-integration.yml kill
WARN[0000] /home/jeffrey/Desktop/GItHub_Project/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
[+] Killing 5/5
 ✔ Container pyiceberg-minio  Killed                                                                                                                                                         0.3s 
 ✔ Container pyiceberg-mc     Killed                                                                                                                                                         0.1s 
 ✔ Container pyiceberg-rest   Killed                                                                                                                                                         0.2s 
 ✔ Container hive             Killed                                                                                                                                                         0.2s 
 ✔ Container pyiceberg-spark  Killed                                                                                                                                                         0.3s 
docker compose -f dev/docker-compose-integration.yml rm -f
WARN[0000] /home/jeffrey/Desktop/GItHub_Project/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
Going to remove pyiceberg-spark, pyiceberg-mc, pyiceberg-rest, hive, pyiceberg-minio
[+] Removing 5/0
 ✔ Container pyiceberg-minio  Removed                                                                                                                                                        0.0s 
 ✔ Container pyiceberg-spark  Removed                                                                                                                                                        0.0s 
 ✔ Container pyiceberg-rest   Removed                                                                                                                                                        0.0s 
 ✔ Container pyiceberg-mc     Removed                                                                                                                                                        0.0s 
 ✔ Container hive             Removed                                                                                                                                                        0.0s 
docker compose -f dev/docker-compose-integration.yml up -d
WARN[0000] /home/jeffrey/Desktop/GItHub_Project/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
[+] Running 5/5
 ✔ Container pyiceberg-minio  Started                                                                                                                                                        0.5s 
 ✔ Container hive             Started                                                                                                                                                        0.4s 
 ✔ Container pyiceberg-rest   Started                                                                                                                                                        0.7s 
 ✔ Container pyiceberg-mc     Started                                                                                                                                                        0.9s 
 ✔ Container pyiceberg-spark  Started                                                                                                                                                        1.0s 
sleep 10
docker compose -f dev/docker-compose-integration.yml cp ./dev/provision.py spark-iceberg:/opt/spark/provision.py
WARN[0000] /home/jeffrey/Desktop/GItHub_Project/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
[+] Copying 1/0
 ✔ pyiceberg-spark copy ./dev/provision.py to pyiceberg-spark:/opt/spark/provision.py Copied                                                                                                 0.0s 
docker compose -f dev/docker-compose-integration.yml exec -T spark-iceberg ipython ./provision.py
WARN[0000] /home/jeffrey/Desktop/GItHub_Project/iceberg-python/dev/docker-compose-integration.yml: the attribute `version` is obsolete, it will be ignored, please remove it to avoid potential confusion 
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
24/09/18 11:02:25 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable
Could not initialize FileIO: pyiceberg.io.pyarrow.PyArrowFileIO
Could not initialize FileIO: pyiceberg.io.pyarrow.PyArrowFileIO                 
Could not initialize FileIO: pyiceberg.io.pyarrow.PyArrowFileIO
Could not initialize FileIO: pyiceberg.io.pyarrow.PyArrowFileIO
poetry run pytest tests/ -v -m integration 

====================================================================================== test session starts =======================================================================================
platform linux -- Python 3.11.9, pytest-7.4.4, pluggy-1.5.0 -- /home/jeffrey/Desktop/GItHub_Project/iceberg-python/venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jeffrey/Desktop/GItHub_Project/iceberg-python
configfile: pyproject.toml
plugins: lazy-fixture-0.6.3, mock-3.14.0, requests-mock-1.12.1, checkdocs-2.10.1
collected 3640 items / 2840 deselected / 800 selected 

... tests ...
====================================================== 501 passed, 7 skipped, 2840 deselected, 731 warnings, 292 errors in 99.24s (0:01:39) ======================================================

@sungwy (Collaborator) commented Sep 18, 2024

The make test-integration log on my VM has many errors, though. Any ideas? (It succeeds in the CI tests.)

What kind of errors are you running into, @JE-Chen? Could you provide us with an example?

@JE-Chen (Contributor, Author) commented Sep 18, 2024

Steps:

  • Open PyCharm
  • Open the iceberg-python project
  • Create a new venv using Python 3.11.9
  • pip install poetry
  • make install
  • make test-integration
  • Get this message four times: Could not initialize FileIO: pyiceberg.io.pyarrow.PyArrowFileIO
  • The tests start running (all Docker services are up) and produce many errors (= 38 failed, 755 passed, 7 skipped, 2840 deselected, 1431 warnings in 409.59s (0:06:49) =)
platform linux -- Python 3.11.9, pytest-7.4.4, pluggy-1.5.0 -- /home/jeffrey/Desktop/GItHub_Project/iceberg-python/venv/bin/python
cachedir: .pytest_cache
rootdir: /home/jeffrey/Desktop/GItHub_Project/iceberg-python
configfile: pyproject.toml
plugins: lazy-fixture-0.6.3, mock-3.14.0, requests-mock-1.12.1, checkdocs-2.10.1

... tests ...

== 38 failed, 755 passed, 7 skipped, 2840 deselected, 1431 warnings in 409.59s (0:06:49) ==

@sungwy (Collaborator) commented Sep 18, 2024

Thank you for sending over the reproduction steps, @JE-Chen. I do not use PyCharm for my development (I'm a VS Code user), so it might take a while for me to reproduce your setup.

Could not initialize FileIO: pyiceberg.io.pyarrow.PyArrowFileIO

except ModuleNotFoundError:
    logger.warning("Could not initialize FileIO: %s", io_impl)
    return None

This is a warning message, and I'm familiar with seeing it in my integration tests as well, but it didn't result in any test failures in my case. Are you able to share the verbose error trace for a failing test case, so we can investigate what it's failing on?
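For context, here is a minimal self-contained sketch of the mechanism behind that warning, assuming pyiceberg resolves the configured FileIO implementation by importing it dynamically and falling back when the module is missing; the function name load_file_io and its exact shape are illustrative, not the library's actual code:

import importlib
import logging

logger = logging.getLogger(__name__)

def load_file_io(io_impl: str):
    # io_impl is a dotted path such as "pyiceberg.io.pyarrow.PyArrowFileIO".
    module_name, _, class_name = io_impl.rpartition(".")
    try:
        module = importlib.import_module(module_name)
        return getattr(module, class_name)
    except ModuleNotFoundError:
        # A missing optional dependency (e.g. pyarrow) is logged and skipped
        # rather than raised, which is why the message is only a warning.
        logger.warning("Could not initialize FileIO: %s", io_impl)
        return None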

@JE-Chen (Contributor, Author) commented Sep 18, 2024

@kevinjqliu (Contributor) commented Sep 18, 2024

	Suppressed: org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by query. Memory leaked: (3227648)
Allocator(toArrowBatchIterator) 0/3227648/3227648/9223372036854775807 (res/actual/peak/limit)

What VM are you running the integration tests on?

I have a theory that a previous change is causing memory leaks (see #1167). Can you try to revert it and run the integration test?
1971fcf

git revert 1971fcfe0875eeb200dbcb66f385e504cfad6609

@JE-Chen (Contributor, Author) commented Sep 18, 2024

	Suppressed: org.apache.spark.util.TaskCompletionListenerException: Memory was leaked by query. Memory leaked: (3227648)
Allocator(toArrowBatchIterator) 0/3227648/3227648/9223372036854775807 (res/actual/peak/limit)

What VM are you running the integration tests on?

I have a theory that a previous change is causing memory leaks (see #1167). Can you try to revert it and run the integration test? 1971fcf

git revert 1971fcfe0875eeb200dbcb66f385e504cfad6609

The memory leak still occurred after reverting.

VMware Workstation 17 Pro (Ubuntu 24.04.1) with these settings:
[screenshot of VM settings]

@kevinjqliu (Contributor) commented

Gotcha, thanks! I think the CI runs on Ubuntu, so it should mirror your setup.

@JE-Chen (Contributor, Author) commented Sep 20, 2024

I tried removing all Docker images and re-running make test-integration, and I noticed this:
I think the tests failed because the python-integration image can't be pulled (access denied).

[screenshot of the docker pull error]

@kevinjqliu (Contributor) commented Sep 20, 2024

Seems like it's an issue with your Docker installation.

image: python-integration

Here's what I found for "VMware Workstation 17 Pro":
https://www.reddit.com/r/vmware/comments/167rurt/easy_way_to_run_docker_container_in_vmware/

@JE-Chen (Contributor, Author) commented Sep 20, 2024

My Windows environment has the same pull issue.

(Docker plugin in the PyCharm IDE)
[screenshot]

(Run in PowerShell)
[screenshot]

@kevinjqliu (Contributor) commented Sep 20, 2024

@JE-Chen I was able to run the integration tests for this PR locally. And given that CI also passes, let's merge this and debug your VM environment issue separately.
How does that sound?

@JE-Chen (Contributor, Author) commented Sep 20, 2024

@JE-Chen I was able to run the integration tests for this PR locally. And given that CI also passes, let's merge this and debug your VM environment issue separately. How does that sound?

Sounds great.

@kevinjqliu (Contributor) left a review comment

LGTM, confirmed no more warnings.

@kevinjqliu merged commit b8b2f66 into apache:main on Sep 20, 2024
8 checks passed
@kevinjqliu (Contributor) commented

Thank you @JE-Chen for the contribution and @sungwy for the review.

sungwy pushed a commit to sungwy/iceberg-python that referenced this pull request on Dec 7, 2024:

* Use ArrowScan.to_table to replace project_table

* Use ArrowScan.to_table to replace project_table in these files:
** pyiceberg\table\__init__.py
** pyiceberg\io\pyarrow.py
** pyiceberg\test_pyarrow.py

* Replace all remaining uses of project_table with ArrowScan.to_table

* Fix format

* Modify by ruff