Parquet S3 Client Side Encryption #2642
Labels
enhancement
New feature or request
Comments
Hi @Marwen94 thanks, that's a useful feature. A PR is very welcome.
Marwen94 pushed a commit to Marwen94/aws-sdk-pandas that referenced this issue (Feb 13, 2024)
Hello @kukushking, I have opened a PR for this issue. Please take a look :)
Marwen94 pushed a commit to Marwen94/aws-sdk-pandas that referenced this issue (Feb 13, 2024)
This is very much needed. Excited to see that the PR is close to completion. Thank you, @Marwen94!!
Is your idea related to a problem? Please describe.
The problem is that awswrangler does not support client-side encryption for the Parquet format, although PyArrow supports this feature: https://arrow.apache.org/docs/python/parquet.html#kms-connection-configuration
Supporting it is very important when writing sensitive data to S3.
Describe the solution you'd like
Since PyArrow supports this feature, I don't think it would be very costly to implement in awswrangler.
Writing client-side encrypted Parquet to S3
The s3.to_parquet method already exposes the pyarrow_additional_kwargs parameter. Through this parameter we can pass an encryption_properties entry built with a custom implementation of the PyArrow KmsClient (https://arrow.apache.org/docs/python/generated/pyarrow.parquet.encryption.KmsClient.html#). I already tested this, and it works when writing the dataframe to a single file. When writing concurrently, it throws the error
OSError: Re-using encryption properties for another file
because the same writer with the same encryption configuration is used to write all chunks (https://github.com/aws/aws-sdk-pandas/blob/main/awswrangler/s3/_write_parquet.py#L116), and PyArrow does not permit this.
Reading client-side encrypted Parquet from S3
By the same logic, PyArrow exposes a decryption configuration that can be passed to the PyArrow reader (https://arrow.apache.org/docs/python/parquet.html#decryption-configuration). The pyarrow_additional_kwargs parameter is exposed in awswrangler.s3.read_parquet; however, it is only forwarded to the to_pandas method, so it cannot currently carry decryption properties to the reader.
An example of a PyArrow KmsClient implementation using AWS KMS:
I could propose a PR to address this if you agree with my investigation and with adding this feature to awswrangler.s3.
Thank you!