-
Notifications
You must be signed in to change notification settings - Fork 395
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[#5996] feat(python-client): Using credentail in python GVFS client. #5997
Conversation
…s_support_credential
@FANNG1 @jerryshao |
clients/client-python/gravitino/api/credential/adls_token_credential.py
Outdated
Show resolved
Hide resolved
|
||
cls.fs = GCSFileSystem(token=cls.key_file) | ||
|
||
def test_mkdir(self): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
why not move these tests to TestGvfsWithHDFS
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
TestGvfsWithHDFS already has the method and test_gvfs_with_gcs_credential.py just overwrite it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could you provide the reason?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
GCS has different behaviour compared to hdfs, so we to overwrite it. OSS and S3 are similar.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
could you add detailed reason in the code about why adding the test in separate file.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I can add some description about the detailed reason, not only mkdir
, other operation like makedirs
, ls
also have similar issues.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added some details on why we need to overwrite it.
clients/client-python/tests/integration/test_gvfs_with_gcs_credential.py
Show resolved
Hide resolved
importlib.import_module("s3fs").S3FileSystem( | ||
key=aws_access_key_id, | ||
secret=aws_secret_access_key, | ||
endpoint_url=aws_endpoint_url, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
according to the pydoc, aws_endpoint_url
seems not required?
endpoint_url : string (None)
Use this endpoint_url, if specified. Needed for connecting to non-AWS
S3 buckets. Takes precedence over `endpoint_url` in client_kwargs.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For Python client, the endpoint is optional, but for Java clients, this is a required value to ensure consistency.
java.lang.IllegalArgumentException: Aliyun OSS endpoint should not be null or empty. Please set proper endpoint with 'fs.oss.endpoint'.
at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystemStore.initialize(AliyunOSSFileSystemStore.java:150)
at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.initialize(AliyunOSSFileSystem.java:349)
at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
at org.apache.gravitino.filesystem.hadoop.integration.test.GravitinoVirtualFileSystemIT.testDelete(GravitinoVirtualFileSystemIT.java:238)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, aliyun OSS is required, but this is aws
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Using S3FileSystem does not need endpoint parameters, but when applying GVFS to padas (or other third-party application), it's required, this has been tested by jerry.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the following code doesn't need aws_endpoint_url
?
if isinstance(credential, S3SecretKeyCredential):
fs = importlib.import_module("s3fs").S3FileSystem(
key=credential.access_key_id(),
secret=credential.secret_access_key(),
)
return (expire_time, fs)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
), | ||
) | ||
|
||
oss_endpoint_url = self._options.get(GVFSConfig.GVFS_FILESYSTEM_OSS_ENDPOINT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the difference of oss_endpoint_url
and oss_endpoint
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
They are the same thing, what do you mean here, oss_endpoint_url
is not a good name?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If they are the same, why use two variables?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The former one is from the catalog (fetching from Gravitino server) and the other is set by the user from the client.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The action is a little wired, you only use oss endpoint only when credential is enabled in server side. seems the client side doesn't need to configuration it .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
done
importlib.import_module("s3fs").S3FileSystem( | ||
key=aws_access_key_id, | ||
secret=aws_secret_access_key, | ||
endpoint_url=aws_endpoint_url, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why the following code doesn't need aws_endpoint_url
?
if isinstance(credential, S3SecretKeyCredential):
fs = importlib.import_module("s3fs").S3FileSystem(
key=credential.access_key_id(),
secret=credential.secret_access_key(),
)
return (expire_time, fs)
), | ||
) | ||
|
||
oss_endpoint_url = self._options.get(GVFSConfig.GVFS_FILESYSTEM_OSS_ENDPOINT) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The action is a little wired, you only use oss endpoint only when credential is enabled in server side. seems the client side doesn't need to configuration it .
def _get_expire_time_by_ratio(self, expire_time: int): | ||
if expire_time <= 0: | ||
return TIME_WITHOUT_EXPIRATION | ||
return time.time() * 1000 + (expire_time - time.time() * 1000) * 0.9 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
0.9
seems too high, since the filesystem is used out of the control of GVFS, how about make it configurable with default 0.5
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you mean the credential should be refreshed within 30 minutes, even if the expiration time is an hour?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another point, I think you have similar comments in the Java implementation, from 0.5 to 0.9. What are your considerations regarding this point?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
what's the the influence if we adjust the value from 0.5 to 0.9?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If change to 0.9, Only 6 mins are left to use the filesystem.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have added a new configuration item as suggested.
clients/client-python/tests/integration/test_gvfs_with_gcs_credential.py
Outdated
Show resolved
Hide resolved
For S3, I have tested again. Both Java and Python clients can omit the endpoint, I'm hesitant to change this old behaviour, maybe another PR to optimize it will be proper. |
I agree OSS endpoint is required, What I doubt is in same cases you use OSS endpoint from the server side, in another condition you use OOS endpoint from the client side. |
After this PR, does client side need to configuration some catalog properties like |
All have been resolved. @FANNG1 please help to take a look again, thanks. |
# This configuration marks the expired time of the credential. For instance, if the credential | ||
# fetched from Gravitino server has expired time of 3600 seconds, and the credential_expired_time_ration is 0.5 | ||
# then the credential will be considered as expired after 1800 seconds and will try to fetch a new credential. | ||
GVFS_FILESYSTEM_CREDENTIAL_EXPIRED_TIME_RATIO = "credential_expired_time_ratio" |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
credential_expire_ratio?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It's also okay for me, I can make changes. Could you please help verify the accuracy of the description above?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
credential will be considered as expired
seems not correct, it's something like remove credential from the cache, use credential_cache_expire_ratio
like sever side?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is already a configuration item cache_expired_time
to control the expiration time of the value in the cache, I'm not very in favor of credential_cache_expire_ratio
.
LGTM except minor comments, @jerryshao do you have time to review again? |
You can make a final call if you think it's OK. @FANNG1 |
…ient. (apache#5997) ### What changes were proposed in this pull request? Support using credentail in GVFS python client for cloud storage. ### Why are the changes needed? It's need. Fix: apache#5996 ### Does this PR introduce _any_ user-facing change? N/A ### How was this patch tested? New it and test locally
What changes were proposed in this pull request?
Support using credentail in GVFS python client for cloud storage.
Why are the changes needed?
It's need.
Fix: #5996
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
New it and test locally