-
Notifications
You must be signed in to change notification settings - Fork 2.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Python: GCS Support #8207
Python: GCS Support #8207
Conversation
Co-authored-by: Fokko Driesprong <fokko@apache.org>
GCS_SESSION_KWARGS = "gcs.session-kwargs" | ||
GCS_ENDPOINT_URL = "gcs.endpoint-url" | ||
GCS_DEFAULT_LOCATION = "gcs.default-bucket-location" | ||
GCS_VERSION_AWARE = "gcs.version-aware" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
These look consistent with Java, but only the first 3 are shared.
python/pyiceberg/io/__init__.py
Outdated
GCS_CACHE_TIMEOUT = "gcs.cache-timeout" | ||
GCS_REQUESTER_PAYS = "gcs.requester-pays" | ||
GCS_SESSION_KWARGS = "gcs.session-kwargs" | ||
GCS_ENDPOINT_URL = "gcs.endpoint-url" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In other places, we have used just "endpoint" rather than "endpoint-url" to point the client to a different base URI. Should we do that here for consistency? Other examples are glue.endpoint
and s3.endpoint
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Great catch, I like that very much
GCS_REQUESTER_PAYS = "gcs.requester-pays" | ||
GCS_SESSION_KWARGS = "gcs.session-kwargs" | ||
GCS_ENDPOINT_URL = "gcs.endpoint-url" | ||
GCS_DEFAULT_LOCATION = "gcs.default-bucket-location" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What does this control? Does it modify URIs to fill in a bucket if it isn't present? I'm not sure that's a good idea.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is an option to create a new bucket if it doesn't exist. This doesn't do much for PyIceberg currently until it has write support.
python/pyiceberg/io/pyarrow.py
Outdated
gcs_kwargs: Dict[str, Any] = {} | ||
if access_token := self.properties.get(GCS_TOKEN): | ||
gcs_kwargs["access_token"] = access_token | ||
if expiration := self.properties.get(GCS_TOKEN_EXPIRES_AT): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If this is interpreted as ms, should the property include -ms
at the end?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Good call
python/tests/io/test_pyarrow.py
Outdated
@@ -1326,7 +1327,7 @@ def test_pyarrow_wrap_fsspec(example_task: FileScanTask, table_schema_simple: Sc | |||
partition_specs=[PartitionSpec()], | |||
), | |||
metadata_location=metadata_location, | |||
io=load_file_io(properties={"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}, location=metadata_location), | |||
io=load_file_io(properties={"py-io-impl": "pyiceberg.io.fsspec.PyArrowFileIO"}, location=metadata_location), |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why set this to PyArrow?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good overall. Just a few minor comments. This is also blocked on apache/arrow#36993 right?
Thanks for the review @rdblue
Not necessarily. I've tested it locally, and it works fine with GCP/GCS, I only get this error locally with Minio (and therefore also in the integration tests). |
Revival of #6906 @Buktoria, I hope you don't mind cherry-picking your commits and doing the final mile. We're very much looking forward having GCS support in PyIceberg