Python: GCS Support #8207

Merged
merged 10 commits into from
Aug 14, 2023

Conversation

Fokko
Contributor

@Fokko Fokko commented Aug 2, 2023

Revival of #6906. @Buktoria, I hope you don't mind me cherry-picking your commits and doing the final mile. We're very much looking forward to having GCS support in PyIceberg.

@github-actions github-actions bot added the python label Aug 2, 2023
GCS_SESSION_KWARGS = "gcs.session-kwargs"
GCS_ENDPOINT_URL = "gcs.endpoint-url"
GCS_DEFAULT_LOCATION = "gcs.default-bucket-location"
GCS_VERSION_AWARE = "gcs.version-aware"
Contributor

These look consistent with Java, but only the first 3 are shared.

GCS_CACHE_TIMEOUT = "gcs.cache-timeout"
GCS_REQUESTER_PAYS = "gcs.requester-pays"
GCS_SESSION_KWARGS = "gcs.session-kwargs"
GCS_ENDPOINT_URL = "gcs.endpoint-url"
Contributor

In other places, we have used just "endpoint" rather than "endpoint-url" to point the client to a different base URI. Should we do that here for consistency? Other examples are glue.endpoint and s3.endpoint.

Contributor Author

Great catch, I like that very much
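
To illustrate the rename discussed above: a lookup could prefer the shorter `gcs.endpoint` key (matching `s3.endpoint` and `glue.endpoint`) while still accepting the `gcs.endpoint-url` spelling from the diff. This is a minimal sketch rather than the code in this PR; the helper `resolve_endpoint` and the fallback behaviour are assumptions made for illustration.

```python
from typing import Dict, Optional

# "gcs.endpoint" follows the reviewer's suggestion; "gcs.endpoint-url" is the
# name shown in the diff. Keeping both is an assumption of this sketch.
GCS_ENDPOINT = "gcs.endpoint"
GCS_ENDPOINT_URL_LEGACY = "gcs.endpoint-url"


def resolve_endpoint(properties: Dict[str, str]) -> Optional[str]:
    """Hypothetical helper: prefer the new key, fall back to the old one."""
    return properties.get(GCS_ENDPOINT) or properties.get(GCS_ENDPOINT_URL_LEGACY)
```

With this shape, a catalog configured with either key would resolve to the same endpoint, and the new key wins when both are set.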

GCS_REQUESTER_PAYS = "gcs.requester-pays"
GCS_SESSION_KWARGS = "gcs.session-kwargs"
GCS_ENDPOINT_URL = "gcs.endpoint-url"
GCS_DEFAULT_LOCATION = "gcs.default-bucket-location"
Contributor

What does this control? Does it modify URIs to fill in a bucket if it isn't present? I'm not sure that's a good idea.

Contributor Author

There is an option to create a new bucket if it doesn't exist. This doesn't do much for PyIceberg currently until it has write support.
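
As a rough picture of what the reply describes: the property is passed through as a filesystem option and does not rewrite URIs. The sketch below assumes the values are handed to fsspec's GCS filesystem as keyword arguments named `default_location` and `requester_pays`; the helper `gcs_kwargs_from_properties` is hypothetical and is not taken from the PR.

```python
from typing import Any, Dict

GCS_DEFAULT_LOCATION = "gcs.default-bucket-location"
GCS_REQUESTER_PAYS = "gcs.requester-pays"


def gcs_kwargs_from_properties(properties: Dict[str, str]) -> Dict[str, Any]:
    """Hypothetical helper: translate catalog properties into filesystem kwargs."""
    kwargs: Dict[str, Any] = {}
    # Only forwarded when set; it names the location used if a bucket is created.
    if location := properties.get(GCS_DEFAULT_LOCATION):
        kwargs["default_location"] = location
    # Property values are strings, so the boolean is parsed explicitly.
    if (requester_pays := properties.get(GCS_REQUESTER_PAYS)) is not None:
        kwargs["requester_pays"] = requester_pays.strip().lower() == "true"
    return kwargs
```

Because the location only matters when a bucket is created, a read-only client would never exercise it, which is consistent with the remark about write support.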

gcs_kwargs: Dict[str, Any] = {}
if access_token := self.properties.get(GCS_TOKEN):
    gcs_kwargs["access_token"] = access_token
if expiration := self.properties.get(GCS_TOKEN_EXPIRES_AT):
Contributor

If this is interpreted as ms, should the property include -ms at the end?

Contributor Author

Good call
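
The unit question matters because the raw property is a string and only the suffix tells the reader how to interpret it. A minimal sketch of the conversion, assuming the value is milliseconds since the Unix epoch; the key name `gcs.oauth2.token-expires-at-ms` and the helper `token_expiration` are hypothetical and only illustrate the suggested `-ms` suffix.

```python
from datetime import datetime, timezone
from typing import Dict, Optional

# Hypothetical key name carrying the "-ms" suffix suggested in the review.
GCS_TOKEN_EXPIRES_AT_MS = "gcs.oauth2.token-expires-at-ms"


def token_expiration(properties: Dict[str, str]) -> Optional[datetime]:
    """Parse an epoch-milliseconds property into a timezone-aware datetime."""
    raw = properties.get(GCS_TOKEN_EXPIRES_AT_MS)
    if raw is None:
        return None
    # Divide by 1000 because datetime.fromtimestamp expects seconds.
    return datetime.fromtimestamp(int(raw) / 1000, tz=timezone.utc)
```

If the same number were read as seconds instead, the resulting expiry would land tens of thousands of years in the future, which is the kind of silent mistake an explicit suffix helps avoid.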

@@ -1326,7 +1327,7 @@ def test_pyarrow_wrap_fsspec(example_task: FileScanTask, table_schema_simple: Sc
partition_specs=[PartitionSpec()],
),
metadata_location=metadata_location,
-io=load_file_io(properties={"py-io-impl": "pyiceberg.io.fsspec.FsspecFileIO"}, location=metadata_location),
+io=load_file_io(properties={"py-io-impl": "pyiceberg.io.fsspec.PyArrowFileIO"}, location=metadata_location),
Contributor

Why set this to PyArrow?
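
For readers unfamiliar with the `py-io-impl` property: its value is a dotted path naming the class to instantiate, so the module part and the class part both have to match something importable. The sketch below shows the general technique with `importlib`; it is not PyIceberg's actual loader, and `load_class` is a hypothetical name.

```python
import importlib
from typing import Any


def load_class(dotted_path: str) -> Any:
    """Resolve 'package.module.ClassName' to the class object it names."""
    module_name, _, class_name = dotted_path.rpartition(".")
    module = importlib.import_module(module_name)
    # Raises AttributeError if the module does not define the class.
    return getattr(module, class_name)


# Usage with a standard-library class, so the example runs anywhere.
ordered_dict_cls = load_class("collections.OrderedDict")
```

Under this scheme a path resolves only if the named class is actually defined in (or re-exported from) the named module.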

Contributor

@rdblue rdblue left a comment

Looks good overall. Just a few minor comments. This is also blocked on apache/arrow#36993 right?

Contributor Author

Fokko commented Aug 14, 2023

Thanks for the review @rdblue

> This is also blocked on apache/arrow#36993 right?

Not necessarily. I've tested it locally, and it works fine with GCP/GCS. I only get this error locally with Minio (and therefore also in the integration tests).

@Fokko Fokko merged commit 2477ae7 into apache:master Aug 14, 2023
7 checks passed
@Fokko Fokko deleted the fd-gcs branch August 14, 2023 12:43
Contributor Author

Fokko commented Aug 14, 2023

Thanks @rdblue for the review, and @Buktoria for most of the work! 🙌🏻

3 participants