fix: Update the pyarrow to latest v14.0.1 regarding the CVE-2023-47248. #3835

Closed
shuchu wants to merge 14 commits

Collaborator

@shuchu shuchu commented Nov 14, 2023

What this PR does / why we need it:
Update pyarrow to the latest version, v14.0.1, which includes the fix for CVE-2023-47248.

Fixes #3832

Signed-off-by: Shuchu Han <shuchu.han@gmail.com>
Collaborator Author

shuchu commented Nov 14, 2023

I am a little worried about the unit test coverage.

Please be aware that I unpinned the pyarrow version.

py3.8-requirements.txt and py3.8-ci-requirements.txt were updated manually because of the Dask version issue on Python 3.8.

@@ -1,5 +1,5 @@
 #
-# This file is autogenerated by pip-compile with Python 3.10
+# This file is autogenerated by pip-compile with Python 3.9
Member

I think this is incorrect

Collaborator Author

You are right. I need to create a Python 3.10 venv and run the command from the Makefile. Let me fix this.

Collaborator Author

Fixed. Let's see the test results.

Collaborator Author

It seems the integration tests failed.

Collaborator Author

21 tests fail. The most frequent error concerns the timestamp format: google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Invalid timestamp microseconds value 1700011424237000000 of logical type NONE; in column 'created'

Let me dig into it and find the root cause.
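
A quick way to see why the value in the error message is rejected: it fits a nanosecond epoch timestamp but is far out of range when read as microseconds. The sketch below uses only the Python standard library; the variable names are illustrative and are not taken from the Feast code base.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)
raw = 1700011424237000000  # value quoted in the BigQuery error message

# Interpreted as nanoseconds since the epoch. datetime only supports
# microsecond precision, so the value is divided by 1000 first.
as_ns = EPOCH + timedelta(microseconds=raw // 1000)

# Interpreted as microseconds since the epoch, the date lands far beyond
# year 9999, which datetime cannot represent.
try:
    as_us = EPOCH + timedelta(microseconds=raw)
except OverflowError:
    as_us = None

print(as_ns.isoformat())  # a date in November 2023
print(as_us)              # None: not representable as a microsecond timestamp
```

The nanosecond reading gives a plausible "created" time close to the test run, which suggests the column was written with nanosecond resolution and then read as microseconds.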

Collaborator Author

@shuchu shuchu Nov 16, 2023

1. Google's BigQuery API accepts timestamps with at most microsecond ("us") resolution, as the error message above indicates, while pyarrow.parquet.write_table() preserves the original resolution of the data, which is "ns" by default.
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html

Signed-off-by: Shuchu Han <shuchu.han@gmail.com>
…le write to temporary parquet file.

Signed-off-by: Shuchu Han <shuchu.han@gmail.com>
…ng pyarrow v10.0.1

Signed-off-by: Shuchu Han <shuchu.han@gmail.com>
Collaborator Author

shuchu commented Nov 16, 2023

I ran into a very interesting problem. I only updated the pyarrow version and the Snowflake API, yet the integration tests report a timestamp range error when running the Redshift SQL query.

{ error:  Timestamp out of range.\n  
              code:      8001\n  }

It happens while running "get_historical_features()", when the timestamp range is inferred from the "entity_df",
as in redshift.py::_get_entity_df_event_timestamp_range():
f"SELECT MIN({entity_df_event_timestamp_col}) AS min, MAX({entity_df_event_timestamp_col}) AS max "

Signed-off-by: Shuchu Han <shuchu.han@gmail.com>
Collaborator Author

shuchu commented Nov 17, 2023

Please do not merge this PR, @sudohainguyen.
It is in a messy state and is only for debugging right now.

Signed-off-by: Shuchu Han <shuchu.han@gmail.com>
@sudohainguyen
Collaborator

No worries, looking forward to seeing this work.

Signed-off-by: Shuchu Han <shuchu.han@gmail.com>
Collaborator Author

shuchu commented Nov 17, 2023

I finally found the fix. It comes down to the "coerce_timestamps" parameter of "pyarrow.parquet.write_table".

Let me close this PR and create a clean new one.

@sudohainguyen
Collaborator

Great @shuchu !!

Successfully merging this pull request may close these issues.

Security vulnerability of python package: pyarrow (CVE-2023-47248)
3 participants