Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[#5139] feat(python-client): Support GCS fileset in the python GVFS client #5160

Merged
merged 110 commits into from
Oct 21, 2024

Conversation

yuqi1129
Copy link
Contributor

@yuqi1129 yuqi1129 commented Oct 16, 2024

What changes were proposed in this pull request?

  • Support GCS fileset in the Python GVFS client
  • Add an IT about the GCS fileset.

Why are the changes needed?

It's user needs.

Fix: #5139

Does this PR introduce any user-facing change?

Modify the Python GVFS client.

How was this patch tested?

Test locally and add an IT that can't run automatically.

Modify mode as the following picture and execute ./gradlew :clients:client-python:test -PskipDockerTests=false success.

image image

@yuqi1129 yuqi1129 requested a review from xloya October 17, 2024 13:42
@yuqi1129 yuqi1129 self-assigned this Oct 17, 2024
@yuqi1129
Copy link
Contributor Author

@jerryshao @xloya
This PR is ready for review, please help to take a look.

@jerryshao
Copy link
Contributor

@xloya would you please help to review this code, thanks.

clients/client-python/gravitino/filesystem/gvfs.py Outdated Show resolved Hide resolved
@@ -27,6 +27,8 @@
from fsspec.implementations.arrow import ArrowFSWrapper
from fsspec.utils import infer_storage_options
from pyarrow.fs import HadoopFileSystem
from pyarrow.fs import GcsFileSystem
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In the current implementation, that all storage Python lib dependencies will be introduced into PyGVFS. Although there will be no conflicts in most cases, it may be better for users to only load the underlying storage dependencies they need. I wonder if there is a better way to introduce other FileSystem dependencies on demand, could you take a time to research this?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Okay, I will do it today.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@xloya
I have updated the code and used importlib to dynamically import file system classes according to needs.

data_dir = os.path.join(cls.gravitino_home, "data")
if os.path.exists(data_dir):
logger.info("Remove Gravitino data directory: %s", data_dir)
subprocess.run(["rm", "-rf", data_dir], check=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It's better to use shutil.rmtree to delete the dir recursively.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done

data_dir = os.path.join(gravitino_home, "data")
if os.path.exists(data_dir):
logger.info("Remove Gravitino data directory: %s", data_dir)
subprocess.run(["rm", "-rf", data_dir], check=False)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same as above to use shutil.rmtree.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done


# Disable the following test case as it is not working for GCS
# >>> gcs.mkdir('example_qazwsx/catalog/schema/fileset3')
# >>> r = gcs.modified('example_qazwsx/catalog/schema/fileset3')
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should throw an exception when call modified method for fileset on gcs, rather than disable tests here.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't believe we need to make exceptions here as we call gcs.modified('example_qazwsx/catalog/schema/fileset3'). GCP does not support fetching the modification time of a directory, it only support getting the modification time of a object

>>> print(gcs.modified('example_qazwsx/test_gvfs_catalog7222/test_gvfs_schema/robots.txt'))
2024-10-21 07:14:27.628000+00:00
>>> print(gcs.modified('example_qazwsx/test_gvfs_catalog7222/test_gvfs_schema'))
None

I will change the test code.

cls._get_gravitino_home()

cls.hadoop_conf_path = f"{cls.gravitino_home}/catalogs/hadoop/conf/hadoop.conf"
# append the hadoop conf to server
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

It seems will not append gcs conf for the server?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, I will remove it.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

done.

@yuqi1129
Copy link
Contributor Author

@xloya
All comments have been resolved. Please help to take a look again.

def setUpClass(cls):
cls._get_gravitino_home()

cls.hadoop_conf_path = f"{cls.gravitino_home}/catalogs/hadoop/conf/hadoop.conf"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do not need to append some hadoop configurations in the server side, is there having any necessity to reset hadoop conf here?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we do not need to append some hadoop configurations in the server side, is there having any necessity to reset hadoop conf here?

I'm afraid things may not be so ideal, I've encountered several cases where there are already some configuration keys in the conf file added by HDFS test cases, for example, we terminated the tests halfway and the file content can't be properly reset, so I will always rewrite it to ensure that the configuration file is clean.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I see.

@xloya
Copy link
Contributor

xloya commented Oct 21, 2024

Overall LGTM, but a minor issue.

@jerryshao jerryshao changed the title [#5139] feat: (python-client): Support GCS fileset in the python GVFS client [#5139] feat(python-client): Support GCS fileset in the python GVFS client Oct 21, 2024
@jerryshao jerryshao merged commit ef44021 into apache:main Oct 21, 2024
26 checks passed
mplmoknijb pushed a commit to mplmoknijb/gravitino that referenced this pull request Nov 6, 2024
…GVFS client (apache#5160)

### What changes were proposed in this pull request?

- Support GCS fileset in the Python GVFS client
- Add an IT about the GCS fileset.

### Why are the changes needed?

It's user needs.

Fix: apache#5139

### Does this PR introduce _any_ user-facing change?

Modify the Python GVFS client. 

### How was this patch tested?

Test locally and add an IT that can't run automatically. 

Modify mode as the following picture and execute `./gradlew
:clients:client-python:test -PskipDockerTests=false` success.

<img width="1332" alt="image"
src="https://github.com/user-attachments/assets/e3b781fa-3ac1-458a-9224-d7e01107828b">


<img width="1530" alt="image"
src="https://github.com/user-attachments/assets/f53c2d20-c253-4eb9-a84f-4c13569b513b">
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

[FEATURE] Support GCS for python GVFS client
3 participants