fix: Update the pyarrow to latest v14.0.1 regarding the CVE-2023-47248. #3835
Conversation
Signed-off-by: Shuchu Han <shuchu.han@gmail.com>
A little bit worried about the unit test coverage. Please be aware that I unpinned the pyarrow version. py3.8-requirements.txt and py3.8-ci-requirements.txt were updated manually (because of the Dask version issue for Python 3.8).
@@ -1,5 +1,5 @@
 #
-# This file is autogenerated by pip-compile with Python 3.10
+# This file is autogenerated by pip-compile with Python 3.9
I think this is incorrect
You are right. I need to create a Python 3.10 venv and rerun the command from the Makefile. Let me fix this.
Fixed; let's see the test results.
It seems the integration tests failed...
21 failures; the most frequent error is about a wrongly formatted timestamp: google.api_core.exceptions.BadRequest: 400 Error while reading data, error message: Invalid timestamp microseconds value 1700011424237000000 of logical type NONE; in column 'created'
Let me dig into it and find the root cause.
1. Google's BigQuery API only accepts up to microsecond resolution for timestamps (the error above complains about an invalid microseconds value), while pyarrow.parquet.write_table() preserves the exact original resolution, which is "ns" by default for pandas data.
https://arrow.apache.org/docs/python/generated/pyarrow.parquet.write_table.html
…le write to temporary parquet file. Signed-off-by: Shuchu Han <shuchu.han@gmail.com>
…ng pyarrow v10.0.1 Signed-off-by: Shuchu Han <shuchu.han@gmail.com>
I met a very interesting problem. I only updated the pyarrow version and the Snowflake API, yet the integration test results show that the timestamp range is wrong while running the Redshift SQL query.
It happens while running get_historical_features(), where the timestamp range is inferred from the entity_df:
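For context, the inferred range in question boils down to the min/max of the entity dataframe's timestamp column. This is a hypothetical illustration, not Feast's actual code (the column name "event_timestamp" and the driver_id values are assumptions for the example):

```python
import pandas as pd

# A toy entity_df like the ones passed to get_historical_features().
entity_df = pd.DataFrame({
    "driver_id": [1001, 1002],
    "event_timestamp": pd.to_datetime(["2023-11-14", "2023-11-15"]),
})

# The point-in-time join scans features between the min and max timestamps.
start = entity_df["event_timestamp"].min()
end = entity_df["event_timestamp"].max()
```

If the timestamps were silently reinterpreted at a different resolution on the way into the warehouse, this range would be wildly wrong, which matches the Redshift symptom above.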
Please do not merge this PR yet. @sudohainguyen
No worries, looking forward to seeing this work.
Finally, I found the fix. It's the "coerce_timestamps" parameter of "pyarrow.parquet.write_table". Let me close this PR and create a clean new one.
Great @shuchu !!
What this PR does / why we need it:
Update pyarrow to the latest version, v14.0.1, which contains the fix for CVE-2023-47248.
Fixes #3832