Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#5996] feat(python-client): Using credentail in python GVFS client. #5997

Merged
merged 25 commits into from
Jan 8, 2025

Conversation

yuqi1129
Copy link
Contributor

@yuqi1129 yuqi1129 commented Dec 26, 2024

What changes were proposed in this pull request?

Support using credentail in GVFS python client for cloud storage.

Why are the changes needed?

It's need.

Fix: #5996

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

New it and test locally

@yuqi1129 yuqi1129 marked this pull request as draft December 26, 2024 02:09
@yuqi1129 yuqi1129 marked this pull request as ready for review December 27, 2024 06:12
@yuqi1129 yuqi1129 self-assigned this Dec 27, 2024
@yuqi1129 yuqi1129 requested review from FANNG1 and jerryshao December 27, 2024 13:39
@yuqi1129
Copy link
Contributor Author

@FANNG1 @jerryshao
Please help to review this PR.

@yuqi1129 yuqi1129 requested a review from FANNG1 January 3, 2025 11:43

cls.fs = GCSFileSystem(token=cls.key_file)

def test_mkdir(self):
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why not move these tests to TestGvfsWithHDFS ?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

TestGvfsWithHDFS already has the method and test_gvfs_with_gcs_credential.py just overwrite it.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you provide the reason?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

GCS has different behaviour compared to hdfs, so we to overwrite it. OSS and S3 are similar.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

could you add detailed reason in the code about why adding the test in separate file.

Copy link
Contributor Author

@yuqi1129 yuqi1129 Jan 7, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I can add some description about the detailed reason, not only mkdir, other operation like makedirs, ls also have similar issues.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added some details on why we need to overwrite it.

clients/client-python/gravitino/filesystem/gvfs.py Outdated Show resolved Hide resolved
importlib.import_module("s3fs").S3FileSystem(
key=aws_access_key_id,
secret=aws_secret_access_key,
endpoint_url=aws_endpoint_url,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

according to the pydoc, aws_endpoint_url seems not required?

    endpoint_url : string (None)
        Use this endpoint_url, if specified. Needed for connecting to non-AWS
        S3 buckets. Takes precedence over `endpoint_url` in client_kwargs.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For Python client, the endpoint is optional, but for Java clients, this is a required value to ensure consistency.

java.lang.IllegalArgumentException: Aliyun OSS endpoint should not be null or empty. Please set proper endpoint with 'fs.oss.endpoint'.
	at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystemStore.initialize(AliyunOSSFileSystemStore.java:150)
	at org.apache.hadoop.fs.aliyun.oss.AliyunOSSFileSystem.initialize(AliyunOSSFileSystem.java:349)
	at org.apache.hadoop.fs.FileSystem.createFileSystem(FileSystem.java:3469)
	at org.apache.hadoop.fs.FileSystem.access$300(FileSystem.java:174)
	at org.apache.hadoop.fs.FileSystem$Cache.getInternal(FileSystem.java:3574)
	at org.apache.hadoop.fs.FileSystem$Cache.get(FileSystem.java:3521)
	at org.apache.hadoop.fs.FileSystem.get(FileSystem.java:540)
	at org.apache.hadoop.fs.Path.getFileSystem(Path.java:365)
	at org.apache.gravitino.filesystem.hadoop.integration.test.GravitinoVirtualFileSystemIT.testDelete(GravitinoVirtualFileSystemIT.java:238)

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, aliyun OSS is required, but this is aws

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Using S3FileSystem does not need endpoint parameters, but when applying GVFS to padas (or other third-party application), it's required, this has been tested by jerry.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the following code doesn't need aws_endpoint_url?

           if isinstance(credential, S3SecretKeyCredential):
                fs = importlib.import_module("s3fs").S3FileSystem(
                    key=credential.access_key_id(),
                    secret=credential.secret_access_key(),
                )
                return (expire_time, fs)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

clients/client-python/gravitino/filesystem/gvfs.py Outdated Show resolved Hide resolved
),
)

oss_endpoint_url = self._options.get(GVFSConfig.GVFS_FILESYSTEM_OSS_ENDPOINT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the difference of oss_endpoint_url and oss_endpoint?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

They are the same thing, what do you mean here, oss_endpoint_url is not a good name?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If they are the same, why use two variables?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The former one is from the catalog (fetching from Gravitino server) and the other is set by the user from the client.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The action is a little wired, you only use oss endpoint only when credential is enabled in server side. seems the client side doesn't need to configuration it .

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

clients/client-python/gravitino/filesystem/gvfs.py Outdated Show resolved Hide resolved
clients/client-python/gravitino/filesystem/gvfs.py Outdated Show resolved Hide resolved
importlib.import_module("s3fs").S3FileSystem(
key=aws_access_key_id,
secret=aws_secret_access_key,
endpoint_url=aws_endpoint_url,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Why the following code doesn't need aws_endpoint_url?

           if isinstance(credential, S3SecretKeyCredential):
                fs = importlib.import_module("s3fs").S3FileSystem(
                    key=credential.access_key_id(),
                    secret=credential.secret_access_key(),
                )
                return (expire_time, fs)

),
)

oss_endpoint_url = self._options.get(GVFSConfig.GVFS_FILESYSTEM_OSS_ENDPOINT)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The action is a little wired, you only use oss endpoint only when credential is enabled in server side. seems the client side doesn't need to configuration it .

clients/client-python/gravitino/filesystem/gvfs.py Outdated Show resolved Hide resolved
def _get_expire_time_by_ratio(self, expire_time: int):
if expire_time <= 0:
return TIME_WITHOUT_EXPIRATION
return time.time() * 1000 + (expire_time - time.time() * 1000) * 0.9
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

0.9 seems too high, since the filesystem is used out of the control of GVFS, how about make it configurable with default 0.5?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you mean the credential should be refreshed within 30 minutes, even if the expiration time is an hour?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Another point, I think you have similar comments in the Java implementation, from 0.5 to 0.9. What are your considerations regarding this point?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

what's the the influence if we adjust the value from 0.5 to 0.9?

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If change to 0.9, Only 6 mins are left to use the filesystem.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I have added a new configuration item as suggested.

@yuqi1129
Copy link
Contributor Author

yuqi1129 commented Jan 7, 2025

@FANNG1

image

You can comment L100 in GravitinoVirtualFileSystemOSSIT and see what happens

image

I have said that I just want to keep the configuration for OSS in Java and Python consistent 0.7.0

@yuqi1129
Copy link
Contributor Author

yuqi1129 commented Jan 7, 2025

For S3, I have tested again. Both Java and Python clients can omit the endpoint, I'm hesitant to change this old behaviour, maybe another PR to optimize it will be proper.

@FANNG1
Copy link
Contributor

FANNG1 commented Jan 7, 2025

I have said that I just want to keep the configuration for OSS in Java and Python consistent 0.7.0

I agree OSS endpoint is required, What I doubt is in same cases you use OSS endpoint from the server side, in another condition you use OOS endpoint from the client side.

@FANNG1
Copy link
Contributor

FANNG1 commented Jan 7, 2025

For S3, I have tested again. Both Java and Python clients can omit the endpoint, I'm hesitant to change this old behaviour, maybe another PR to optimize it will be proper.

After this PR, does client side need to configuration some catalog properties like region, endpoint?

@yuqi1129
Copy link
Contributor Author

yuqi1129 commented Jan 8, 2025

All have been resolved. @FANNG1 please help to take a look again, thanks.

@yuqi1129 yuqi1129 requested a review from FANNG1 January 8, 2025 06:13
# This configuration marks the expired time of the credential. For instance, if the credential
# fetched from Gravitino server has expired time of 3600 seconds, and the credential_expired_time_ration is 0.5
# then the credential will be considered as expired after 1800 seconds and will try to fetch a new credential.
GVFS_FILESYSTEM_CREDENTIAL_EXPIRED_TIME_RATIO = "credential_expired_time_ratio"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

credential_expire_ratio?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's also okay for me, I can make changes. Could you please help verify the accuracy of the description above?

Copy link
Contributor

@FANNG1 FANNG1 Jan 8, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

credential will be considered as expired seems not correct, it's something like remove credential from the cache, use credential_cache_expire_ratio like sever side?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There is already a configuration item cache_expired_time to control the expiration time of the value in the cache, I'm not very in favor of credential_cache_expire_ratio.

clients/client-python/gravitino/filesystem/gvfs.py Outdated Show resolved Hide resolved
@FANNG1
Copy link
Contributor

FANNG1 commented Jan 8, 2025

LGTM except minor comments, @jerryshao do you have time to review again?

@jerryshao
Copy link
Contributor

You can make a final call if you think it's OK. @FANNG1

docs/how-to-use-gvfs.md Outdated Show resolved Hide resolved
@FANNG1 FANNG1 merged commit 598bc05 into apache:main Jan 8, 2025
28 checks passed
Abyss-lord pushed a commit to Abyss-lord/gravitino that referenced this pull request Jan 9, 2025
…ient. (apache#5997)

### What changes were proposed in this pull request?

Support using credentail in GVFS python client for cloud storage.

### Why are the changes needed?

It's need.

Fix: apache#5996 

### Does this PR introduce _any_ user-facing change?

N/A

### How was this patch tested?

New it and test locally
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Support using credential when using fileset with cloud storage in python GVFS
3 participants