-
Notifications
You must be signed in to change notification settings - Fork 700
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
fix: Index columns removed on s3.to_parquet #2655
Conversation
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
return _table_to_df(pa.concat_tables(tables, promote=True), kwargs=kwargs) | ||
promote_kwargs: dict[str, bool | str] = {"promote": True} | ||
if version.parse(pa.__version__) >= version.parse("14.0.0"): | ||
promote_kwargs = {"promote_options": "default"} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can you clarify please - this was needed because schemas of the individual tables were different due to different index column types or empty values?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This particular change wasn't needed for the fix to work. I just noticed a FutureWarning
where the promote
option had been replaced by promote_options
in pyarrow=14
.
The overall issue was that the index was dropped on write completely.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Ok understood, thanks.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
This comment was marked as outdated.
Hey all, thanks for keeping this PR up to date. If there's anything on my end that I can do to help make the build pass -- do let me know! |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
AWS CodeBuild CI Report
Powered by github-codebuild-logs, available on the AWS Serverless Application Repository |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thank you!
[![Mend Renovate](https://app.renovatebot.com/images/banner.svg)](https://renovatebot.com) This PR contains the following updates: | Package | Change | Age | Adoption | Passing | Confidence | |---|---|---|---|---|---| | [aws-lambda-powertools](https://togithub.com/aws-powertools/powertools-lambda-python) ([changelog](https://togithub.com/aws-powertools/powertools-lambda-python/releases)) | `2.34.2` -> `2.38.1` | [![age](https://developer.mend.io/api/mc/badges/age/pypi/aws-lambda-powertools/2.38.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/pypi/aws-lambda-powertools/2.38.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/pypi/aws-lambda-powertools/2.34.2/2.38.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/aws-lambda-powertools/2.34.2/2.38.1?slim=true)](https://docs.renovatebot.com/merge-confidence/) | | [awswrangler](https://aws-sdk-pandas.readthedocs.io/) ([source](https://togithub.com/aws/aws-sdk-pandas)) | `3.6.0` -> `3.7.3` | [![age](https://developer.mend.io/api/mc/badges/age/pypi/awswrangler/3.7.3?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/pypi/awswrangler/3.7.3?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/pypi/awswrangler/3.6.0/3.7.3?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/awswrangler/3.6.0/3.7.3?slim=true)](https://docs.renovatebot.com/merge-confidence/) | | [pandas](https://pandas.pydata.org) ([source](https://togithub.com/pandas-dev/pandas)) | `2.2.1` -> `2.2.2` | [![age](https://developer.mend.io/api/mc/badges/age/pypi/pandas/2.2.2?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/pypi/pandas/2.2.2?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/pypi/pandas/2.2.1/2.2.2?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/pandas/2.2.1/2.2.2?slim=true)](https://docs.renovatebot.com/merge-confidence/) | | [pytest-mock](https://togithub.com/pytest-dev/pytest-mock) ([changelog](https://pytest-mock.readthedocs.io/en/latest/changelog.html)) | `3.12.0` -> `3.14.0` | [![age](https://developer.mend.io/api/mc/badges/age/pypi/pytest-mock/3.14.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/pypi/pytest-mock/3.14.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/pypi/pytest-mock/3.12.0/3.14.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/pytest-mock/3.12.0/3.14.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | | [tenacity](https://togithub.com/jd/tenacity) | `8.2.3` -> `8.3.0` | [![age](https://developer.mend.io/api/mc/badges/age/pypi/tenacity/8.3.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![adoption](https://developer.mend.io/api/mc/badges/adoption/pypi/tenacity/8.3.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![passing](https://developer.mend.io/api/mc/badges/compatibility/pypi/tenacity/8.2.3/8.3.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | [![confidence](https://developer.mend.io/api/mc/badges/confidence/pypi/tenacity/8.2.3/8.3.0?slim=true)](https://docs.renovatebot.com/merge-confidence/) | --- ### Release Notes <details> <summary>aws-powertools/powertools-lambda-python (aws-lambda-powertools)</summary> ### [`v2.38.1`](https://togithub.com/aws-powertools/powertools-lambda-python/blob/HEAD/CHANGELOG.md#v2381---2024-05-17) [Compare Source](https://togithub.com/aws-powertools/powertools-lambda-python/compare/v2.38.0...v2.38.1) #### \[v2.38.1] - 2024-05-17 ### [`v2.38.0`](https://togithub.com/aws-powertools/powertools-lambda-python/blob/HEAD/CHANGELOG.md#v2380---2024-05-17) [Compare Source](https://togithub.com/aws-powertools/powertools-lambda-python/compare/v2.37.0...v2.38.0) #### \[v2.38.0] - 2024-05-17 ### [`v2.37.0`](https://togithub.com/aws-powertools/powertools-lambda-python/blob/HEAD/CHANGELOG.md#v2370---2024-04-18) [Compare Source](https://togithub.com/aws-powertools/powertools-lambda-python/compare/v2.36.0...v2.37.0) #### \[v2.37.0] - 2024-04-18 ### [`v2.36.0`](https://togithub.com/aws-powertools/powertools-lambda-python/blob/HEAD/CHANGELOG.md#v2360---2024-03-27) [Compare Source](https://togithub.com/aws-powertools/powertools-lambda-python/compare/v2.35.1...v2.36.0) #### \[v2.36.0] - 2024-03-27 ### [`v2.35.1`](https://togithub.com/aws-powertools/powertools-lambda-python/blob/HEAD/CHANGELOG.md#v2351---2024-03-08) [Compare Source](https://togithub.com/aws-powertools/powertools-lambda-python/compare/v2.35.0...v2.35.1) #### \[v2.35.1] - 2024-03-08 ### [`v2.35.0`](https://togithub.com/aws-powertools/powertools-lambda-python/blob/HEAD/CHANGELOG.md#v2350---2024-03-06) [Compare Source](https://togithub.com/aws-powertools/powertools-lambda-python/compare/v2.34.2...v2.35.0) #### \[v2.35.0] - 2024-03-06 </details> <details> <summary>aws/aws-sdk-pandas (awswrangler)</summary> ### [`v3.7.3`](https://togithub.com/aws/aws-sdk-pandas/releases/tag/3.7.3): AWS SDK for pandas 3.7.3 [Compare Source](https://togithub.com/aws/aws-sdk-pandas/compare/3.7.2...3.7.3) #### Bug fixes 🐛 - Iceberg schema evolution fails for map, array and struct types by [@​LeonLuttenberger](https://togithub.com/LeonLuttenberger) in [#​2755](https://togithub.com/aws/aws-sdk-pandas/issues/2755) - trickle down `s3_output` in `athena.to_iceberg` by [@​jaidisido](https://togithub.com/jaidisido) in [#​2767](https://togithub.com/aws/aws-sdk-pandas/issues/2767) - respect order of columns in `to_iceberg` by [@​jaidisido](https://togithub.com/jaidisido) in [#​2768](https://togithub.com/aws/aws-sdk-pandas/issues/2768) - add PyArrow `fixed_size_binary` dtype support by [@​jaidisido](https://togithub.com/jaidisido) in [#​2775](https://togithub.com/aws/aws-sdk-pandas/issues/2775) - Opensearch serverless vector search collections - remove default `_id` by [@​kukushking](https://togithub.com/kukushking) in [#​2784](https://togithub.com/aws/aws-sdk-pandas/issues/2784) - missing keys in `list_to_arrow_table` by [@​kukushking](https://togithub.com/kukushking) in [#​2778](https://togithub.com/aws/aws-sdk-pandas/issues/2778) - prevent `athena.to_iceberg` overwrite to delete table in order to preserve Iceberg transactions history by [@​erwan-simon](https://togithub.com/erwan-simon) in [#​2776](https://togithub.com/aws/aws-sdk-pandas/issues/2776) #### Documentation 📚 - simplify README, remove AWS Glue for Ray references by [@​jaidisido](https://togithub.com/jaidisido) in [#​2750](https://togithub.com/aws/aws-sdk-pandas/issues/2750) - fix YAML formatting in Ray Remote tutorial by [@​LeonLuttenberger](https://togithub.com/LeonLuttenberger) in [#​2770](https://togithub.com/aws/aws-sdk-pandas/issues/2770) #### Security Dependency Updates 🛡️ - bump `idna` from 3.6 to 3.7 by [@​dependabot](https://togithub.com/dependabot) in [#​2772](https://togithub.com/aws/aws-sdk-pandas/issues/2772) - bump `aiohttp` from 3.9.3 to 3.9.4 by [@​dependabot](https://togithub.com/dependabot) in [#​2777](https://togithub.com/aws/aws-sdk-pandas/issues/2777) #### New Contributors 👋 - [@​erwan-simon](https://togithub.com/erwan-simon) made their first contribution in [#​2776](https://togithub.com/aws/aws-sdk-pandas/issues/2776) **Full Changelog**: aws/aws-sdk-pandas@3.7.2...3.7.3 ### [`v3.7.2`](https://togithub.com/aws/aws-sdk-pandas/releases/tag/3.7.2): AWS SDK for pandas 3.7.2 [Compare Source](https://togithub.com/aws/aws-sdk-pandas/compare/3.7.1...3.7.2) #### Features/Enhancements 🚀 - Add support for DeltaLake's DynamoDB lock mechanism by [@​LeonLuttenberger](https://togithub.com/LeonLuttenberger) in [#​2705](https://togithub.com/aws/aws-sdk-pandas/issues/2705) #### Bug fixes 🐛 - `wr.athena.to_iceberg` - Insert query has mismatched column types [#​2678](https://togithub.com/aws/aws-sdk-pandas/issues/2678) by [@​GalvFionic](https://togithub.com/GalvFionic) in [#​2715](https://togithub.com/aws/aws-sdk-pandas/issues/2715) - allow `s3_output` in `athena.to_iceberg` by [@​jaidisido](https://togithub.com/jaidisido) in [#​2727](https://togithub.com/aws/aws-sdk-pandas/issues/2727) - replace deprecated `np.split_array` by [@​jaidisido](https://togithub.com/jaidisido) in [#​2735](https://togithub.com/aws/aws-sdk-pandas/issues/2735) - Athena `to_iceberg` fails with non-lowercase column names by [@​LeonLuttenberger](https://togithub.com/LeonLuttenberger) in [#​2736](https://togithub.com/aws/aws-sdk-pandas/issues/2736) - Support Ray 2.10 by [@​kukushking](https://togithub.com/kukushking) in [#​2741](https://togithub.com/aws/aws-sdk-pandas/issues/2741) #### New Contributors - [@​GalvFionic](https://togithub.com/GalvFionic) made their first contribution in [#​2715](https://togithub.com/aws/aws-sdk-pandas/issues/2715) **Full Changelog**: aws/aws-sdk-pandas@3.7.1...3.7.2 ### [`v3.7.1`](https://togithub.com/aws/aws-sdk-pandas/releases/tag/3.7.1): AWS SDK for pandas 3.7.1 [Compare Source](https://togithub.com/aws/aws-sdk-pandas/compare/3.7.0...3.7.1) #### Bug fixes 🐛 - fix breaking change in `_create_table` by [@​jaidisido](https://togithub.com/jaidisido) in [https://github.com/aws/aws-sdk-pandas/pull/2711](https://togithub.com/aws/aws-sdk-pandas/pull/2711) - pin pyarrow to version 8 and above by [@​jaidisido](https://togithub.com/jaidisido) in [https://github.com/aws/aws-sdk-pandas/pull/2709](https://togithub.com/aws/aws-sdk-pandas/pull/2709) #### Documentation 📚 - fix `redshift.to_sql` doc indentation error by [@​LeonLuttenberger](https://togithub.com/LeonLuttenberger) in [https://github.com/aws/aws-sdk-pandas/pull/2706](https://togithub.com/aws/aws-sdk-pandas/pull/2706) **Full Changelog**: aws/aws-sdk-pandas@3.7.0...3.7.1 ### [`v3.7.0`](https://togithub.com/aws/aws-sdk-pandas/releases/tag/3.7.0): AWS SDK for pandas 3.7.0 [Compare Source](https://togithub.com/aws/aws-sdk-pandas/compare/3.6.0...3.7.0) #### Breaking changes 💥 Lake Formation Governed tables are being phased out and we are dropping support ([#​2692](https://togithub.com/aws/aws-sdk-pandas/issues/2692)). #### Features/Enhancements 🚀 - support parquet client encryption ([#​2642](https://togithub.com/aws/aws-sdk-pandas/issues/2642)) by [@​Marwen94](https://togithub.com/Marwen94) in [https://github.com/aws/aws-sdk-pandas/pull/2674](https://togithub.com/aws/aws-sdk-pandas/pull/2674) #### Bug fixes 🐛 - Index columns removed on s3.to_parquet by [@​robert-schmidtke](https://togithub.com/robert-schmidtke) in [https://github.com/aws/aws-sdk-pandas/pull/2655](https://togithub.com/aws/aws-sdk-pandas/pull/2655) - Missing timezone metadata by [@​kukushking](https://togithub.com/kukushking) in [https://github.com/aws/aws-sdk-pandas/pull/2682](https://togithub.com/aws/aws-sdk-pandas/pull/2682) - remove enforced openpyxl engine constraint by [@​jaidisido](https://togithub.com/jaidisido) in [https://github.com/aws/aws-sdk-pandas/pull/2696](https://togithub.com/aws/aws-sdk-pandas/pull/2696) - Iceberg partitioning not working with partition transform functions by [@​LeonLuttenberger](https://togithub.com/LeonLuttenberger) in [https://github.com/aws/aws-sdk-pandas/pull/2694](https://togithub.com/aws/aws-sdk-pandas/pull/2694) - remove awswrangler README from `site-packages` folder by [@​AlJohri](https://togithub.com/AlJohri) in [https://github.com/aws/aws-sdk-pandas/pull/2698](https://togithub.com/aws/aws-sdk-pandas/pull/2698) - indent categories in pyarrow_additional_kwargs correctly by [@​jaidisido](https://togithub.com/jaidisido) in [https://github.com/aws/aws-sdk-pandas/pull/2701](https://togithub.com/aws/aws-sdk-pandas/pull/2701) #### New Contributors - [@​Marwen94](https://togithub.com/Marwen94) made their first contribution in [https://github.com/aws/aws-sdk-pandas/pull/2674](https://togithub.com/aws/aws-sdk-pandas/pull/2674) - [@​AlJohri](https://togithub.com/AlJohri) made their first contribution in [https://github.com/aws/aws-sdk-pandas/pull/2698](https://togithub.com/aws/aws-sdk-pandas/pull/2698) **Full Changelog**: aws/aws-sdk-pandas@3.6.0...3.7.0 </details> <details> <summary>pandas-dev/pandas (pandas)</summary> ### [`v2.2.2`](https://togithub.com/pandas-dev/pandas/compare/v2.2.1...v2.2.2) [Compare Source](https://togithub.com/pandas-dev/pandas/compare/v2.2.1...v2.2.2) </details> <details> <summary>pytest-dev/pytest-mock (pytest-mock)</summary> ### [`v3.14.0`](https://togithub.com/pytest-dev/pytest-mock/blob/HEAD/CHANGELOG.rst#3140-2024-03-21) [Compare Source](https://togithub.com/pytest-dev/pytest-mock/compare/v3.13.0...v3.14.0) - `#​415 <https://github.com/pytest-dev/pytest-mock/pull/415>`\_: `MockType` and `AsyncMockType` can be imported from `pytest_mock` for type annotation purposes. - `#​420 <https://github.com/pytest-dev/pytest-mock/issues/420>`\_: Fixed a regression which would cause `mocker.patch.object` to not being properly cleared between tests. ### [`v3.13.0`](https://togithub.com/pytest-dev/pytest-mock/blob/HEAD/CHANGELOG.rst#3130-2024-03-21) [Compare Source](https://togithub.com/pytest-dev/pytest-mock/compare/v3.12.0...v3.13.0) - `#​417 <https://github.com/pytest-dev/pytest-mock/pull/417>`\_: `spy` now has `spy_return_list`, which is a list containing all the values returned by the spied function. - `pytest-mock` now requires `pytest>=6.2.5`. - `#​410 <https://github.com/pytest-dev/pytest-mock/pull/410>`*: pytest-mock's `setup.py` file is removed. If you relied on this file, e.g. to install pytest using `setup.py install`, please see `Why you shouldn't invoke setup.py directly <https://blog.ganssle.io/articles/2021/10/setup-py-deprecated.html#summary>`* for alternatives. </details> <details> <summary>jd/tenacity (tenacity)</summary> ### [`v8.3.0`](https://togithub.com/jd/tenacity/releases/tag/8.3.0) [Compare Source](https://togithub.com/jd/tenacity/compare/8.2.3...8.3.0) ### New Features - Added a new stop function: `stop_before_delay`, which will stop execution if the next sleep time would cause overall delay to exceed the specified delay. Useful for use cases where you have some upper bound on retry times that you must not exceed, so returning before that timeout is preferable than returning after that timeout. ### Bug Fixes - Preserve **defaults** and **kwdefaults** through retry decorator ### Other Notes - Add a "test" extra </details> --- ### Configuration 📅 **Schedule**: Branch creation - "before 4am on the first day of the month" (UTC), Automerge - At any time (no schedule defined). 🚦 **Automerge**: Disabled by config. Please merge this manually once you are satisfied. ♻ **Rebasing**: Whenever PR becomes conflicted, or you tick the rebase/retry checkbox. 👻 **Immortal**: This PR will be recreated if closed unmerged. Get [config help](https://togithub.com/renovatebot/renovate/discussions) if that's undesired. --- - [ ] <!-- rebase-check -->If you want to rebase/retry this PR, check this box --- This PR has been generated by [Mend Renovate](https://www.mend.io/free-developer-tools/renovate/). View repository job log [here](https://developer.mend.io/github/sawyerh/highlights). <!--renovate-debug:eyJjcmVhdGVkSW5WZXIiOiIzNy4yNjkuMiIsInVwZGF0ZWRJblZlciI6IjM3LjM2My41IiwidGFyZ2V0QnJhbmNoIjoibWFpbiJ9--> Co-authored-by: renovate[bot] <29139614+renovate[bot]@users.noreply.github.com>
Feature or Bugfix
Bugfix
Detail
Fixes index columns not being written on a second write. The issue was that they were ignored as part of processing columns types. I have aligned the handling of index columns with regular columns. I have added a test for that.
Because I touched code that introduced the ability to partition by index columns, I added a test for that as well, asserting the current behavior. Please do check that test and see if it codifies the assumptions and expected behavior for partitioning on index columns correctly.
Relates
#2652
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.