[#5139] feat(python-client): Support GCS fileset in the python GVFS client #5160
Conversation
@jerryshao @xloya
@xloya, could you please help review this code? Thanks.
@@ -27,6 +27,8 @@
from fsspec.implementations.arrow import ArrowFSWrapper
from fsspec.utils import infer_storage_options
from pyarrow.fs import HadoopFileSystem
from pyarrow.fs import GcsFileSystem
In the current implementation, all storage Python lib dependencies will be introduced into PyGVFS. Although there will be no conflicts in most cases, it may be better for users to only load the underlying storage dependencies they need. I wonder if there is a better way to introduce the other FileSystem dependencies on demand; could you take some time to research this?
Okay, I will do it today.
@xloya I have updated the code and now use importlib to dynamically import the file system classes on demand.
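A minimal sketch of what such on-demand loading can look like with importlib (the function name and error message are illustrative, not the PR's exact code):

import importlib


def _load_gcs_filesystem_class():
    """Import pyarrow's GcsFileSystem only when a GCS fileset is accessed."""
    try:
        fs_module = importlib.import_module("pyarrow.fs")
    except ImportError as exc:
        raise ImportError(
            "pyarrow is required to access GCS filesets; please install it first."
        ) from exc
    return getattr(fs_module, "GcsFileSystem")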
data_dir = os.path.join(cls.gravitino_home, "data")
if os.path.exists(data_dir):
    logger.info("Remove Gravitino data directory: %s", data_dir)
    subprocess.run(["rm", "-rf", data_dir], check=False)
It's better to use shutil.rmtree to delete the dir recursively.
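A minimal sketch of that suggestion (the gravitino_home value is assumed to come from the surrounding test setup):

import logging
import os
import shutil

logger = logging.getLogger(__name__)


def remove_data_dir(gravitino_home: str) -> None:
    data_dir = os.path.join(gravitino_home, "data")
    if os.path.exists(data_dir):
        logger.info("Remove Gravitino data directory: %s", data_dir)
        # Recursively delete the directory without shelling out to "rm -rf".
        shutil.rmtree(data_dir, ignore_errors=True)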
done
data_dir = os.path.join(gravitino_home, "data")
if os.path.exists(data_dir):
    logger.info("Remove Gravitino data directory: %s", data_dir)
    subprocess.run(["rm", "-rf", data_dir], check=False)
Same as above: use shutil.rmtree.
done
# Disable the following test case as it is not working for GCS
# >>> gcs.mkdir('example_qazwsx/catalog/schema/fileset3')
# >>> r = gcs.modified('example_qazwsx/catalog/schema/fileset3')
I think you should throw an exception when the modified method is called for a fileset on GCS, rather than disable the tests here.
I don't believe we need to raise an exception here when we call gcs.modified('example_qazwsx/catalog/schema/fileset3'). GCS does not support fetching the modification time of a directory; it only supports getting the modification time of an object:
>>> print(gcs.modified('example_qazwsx/test_gvfs_catalog7222/test_gvfs_schema/robots.txt'))
2024-10-21 07:14:27.628000+00:00
>>> print(gcs.modified('example_qazwsx/test_gvfs_catalog7222/test_gvfs_schema'))
None
I will change the test code.
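A hedged sketch of how the disabled check could be rewritten as an explicit assertion instead; the gcs handle and bucket path are taken from the conversation above and are shown purely for illustration:

gcs.mkdir('example_qazwsx/catalog/schema/fileset3')

# GCS records modification times for objects only, not for directories,
# so calling modified() on the directory path is expected to yield None.
assert gcs.modified('example_qazwsx/catalog/schema/fileset3') is None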
cls._get_gravitino_home()

cls.hadoop_conf_path = f"{cls.gravitino_home}/catalogs/hadoop/conf/hadoop.conf"
# append the hadoop conf to server
It seems this will not append the GCS conf for the server?
Yes, I will remove it.
done.
@xloya
def setUpClass(cls):
    cls._get_gravitino_home()

    cls.hadoop_conf_path = f"{cls.gravitino_home}/catalogs/hadoop/conf/hadoop.conf"
If we do not need to append any Hadoop configurations on the server side, is it still necessary to reset the Hadoop conf here?
I'm afraid things may not be so ideal. I've encountered several cases where the conf file already contained configuration keys added by the HDFS test cases, for example when the tests were terminated halfway and the file content couldn't be properly reset, so I always rewrite it to ensure that the configuration file is clean.
I see.
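A hypothetical sketch of that "always rewrite the conf file" idea in the test setup; the path comes from the snippet above, while the reset content and helper methods are assumptions about the surrounding test class, not the PR's exact code:

@classmethod
def setUpClass(cls):
    cls._get_gravitino_home()
    cls.hadoop_conf_path = f"{cls.gravitino_home}/catalogs/hadoop/conf/hadoop.conf"
    # Rewrite the file from scratch instead of appending, so stale keys left
    # behind by an interrupted HDFS test run cannot leak into these tests.
    with open(cls.hadoop_conf_path, mode="w", encoding="utf-8") as conf_file:
        conf_file.write("")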
Overall LGTM, just one minor issue.
What changes were proposed in this pull request?
- Support GCS fileset in the Python GVFS client.
- Add an IT for the GCS fileset.
Why are the changes needed?
It is needed by users.
Fix: #5139
Does this PR introduce any user-facing change?
Modify the Python GVFS client.
How was this patch tested?
Tested locally and added an IT that can't run automatically.
Change the test mode as shown in the attached screenshots and execute
./gradlew :clients:client-python:test -PskipDockerTests=false
successfully.