
[#5585] improvement(bundles): Refactor bundle jars and provide core jars that do not contain hadoop-{aws,gcp,aliyun,azure} #5806

Merged
merged 39 commits into apache:main on Dec 27, 2024

Conversation

yuqi1129
Contributor

@yuqi1129 yuqi1129 commented Dec 9, 2024

What changes were proposed in this pull request?

Provide another kind of bundle jar that does not contain hadoop-{aws,gcp,aliyun,azure}, such as aws-mini and gcp-mini.

Why are the changes needed?

To make it work across a wide range of Hadoop versions.

Fix: #5585

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

Existing UTs and ITs

@yuqi1129
Contributor Author

yuqi1129 commented Dec 9, 2024

@jerryshao
This is the POC change for S3. Please take a look and see whether it is acceptable; if so, I will continue working on the other three cloud storage systems.

@jerryshao
Contributor

Think about a different name, not "mini"...

@jerryshao
Contributor

Generally, I'm OK with this solution.

@yuqi1129 yuqi1129 changed the title from "[#5585] improvement(bundles): Refactor bundle jars and provide mini jars that does not contains hadoop-{aws,gcp,aliyun}" to "[#5585] improvement(bundles): Refactor bundle jars and provide mini jars that does not contains hadoop-{aws,gcp,aliyun,azure}" on Dec 10, 2024
@yuqi1129 yuqi1129 self-assigned this Dec 11, 2024
@yuqi1129
Contributor Author

Generally, I'm OK with this solution.

@jerryshao

After discussing with @FANNG1 offline: he suggested we include hadoop-common in the bundle jars. I have not added it to the bundle jars in this PR, and I'm not very sure about this point; could you please help verify it?

@jerryshao
Contributor

@yuqi1129 Please fix the conflicts and move this PR forward as we discussed offline.

@yuqi1129 yuqi1129 requested review from FANNG1 and jerryshao December 17, 2024 12:08
fileset_name = "example"

## this is for S3
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /Users/yuqi/project/gravitino/bundles/aws-bundle/build/libs/gravitino-aws-bundle-{gravitino-version}.jar,/Users/yuqi/project/gravitino/clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-{gravitino-version}-SNAPSHOT.jar,/Users/yuqi/Downloads/hadoop-jars/hadoop-aws-3.2.0.jar,/Users/yuqi/Downloads/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell"
Contributor

It's not good to use a path that includes your name.

Contributor Author

fix
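
For reference, a runnable variant of the snippet above (a minimal sketch: the jar paths, server URI, metalake, and fileset names are hypothetical placeholders, and the `fs.gravitino.*` keys follow the GVFS docs):

```python
import os

from pyspark.sql import SparkSession

# Jars must be set before the JVM starts; all paths are placeholders.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /path/to/gravitino-aws-bundle-{gravitino-version}.jar,"
    "/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,"
    "/path/to/hadoop-aws-3.2.0.jar,"
    "/path/to/aws-java-sdk-bundle-1.11.375.jar "
    "--master local[1] pyspark-shell"
)

spark = (
    SparkSession.builder.appName("gvfs-fileset-example")
    # Register the GVFS scheme; property names follow the GVFS documentation.
    .config("spark.hadoop.fs.gvfs.impl",
            "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")
    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090")
    .config("spark.hadoop.fs.gravitino.client.metalake", "example_metalake")
    .getOrCreate()
)

# Read through the virtual path: gvfs://fileset/<catalog>/<schema>/<fileset>/
df = spark.read.text("gvfs://fileset/s3_catalog/example_schema/example/")
df.show()
```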

@jerryshao
Contributor

I think the current issue is that we have several documents describing similar things about Fileset, with overlaps and repetition; we should refactor the docs to make them easier to read from the user's perspective.

@yuqi1129
Contributor Author

I have changed the other documents slightly; it's quite time-consuming to change them all, and I would rather optimize them later.

@@ -137,8 +137,13 @@ You can configure these properties in two ways:
```

:::note
If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle jar in the Hadoop environment.
For example if you want to access the S3 fileset, you need to place the S3 bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the Hadoop environment(typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`) or add it to the classpath.
If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle/core jar in the Hadoop environment.
Contributor

I think the description "core" is not correct anymore. Please revisit the document to make sure the content is correct at least.

object storage like S3, GCS, Azure Blob Storage and OSS, you can put the hadoop object store jar like
`gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the `$GRAVITINO_HOME/catalogs/hadoop/libs` directory to enable the support.
Gravitino itself hasn't yet tested the object storage support, so if you have any issue,
please create an [issue](https://github.com/apache/gravitino/issues).
Contributor

Why do we need to use the hadoop bundle jar? I assume we already have the Hadoop dependencies in the Hadoop catalog, right?

Contributor Author

@yuqi1129 yuqi1129 Dec 26, 2024

The hadoop bundle does not contain hadoop-aws, though the Gravitino server already provides a Hadoop environment (hadoop-client-api & hadoop-client-api-runtime).

Contributor Author

@yuqi1129 yuqi1129 Dec 26, 2024

aws-hadoop-bundle = aws-bundle + hadoop-client-* + hadoop-aws.

@@ -52,7 +50,7 @@ Apart from the above properties, to access fileset like HDFS, S3, GCS, OSS or cu
| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating |
| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating |

At the same time, you need to place the corresponding bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
At the same time, you need to place the corresponding bundle jar [`gravitino-aws-hadoop-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-hadoop-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
Contributor

Also here.
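
For context, a minimal sketch of creating such a catalog from the Python client, assuming the `create_catalog` API shown in the Gravitino docs; the endpoint, all names, and the property values are placeholders, and the server-side jar must already be in `${GRAVITINO_HOME}/catalogs/hadoop/libs` as the doc above says:

```python
from gravitino import Catalog, GravitinoClient

# Connect to an existing metalake; URI and names are placeholders.
client = GravitinoClient(uri="http://localhost:8090",
                         metalake_name="example_metalake")

# Create a Hadoop (fileset) catalog carrying the S3 properties from the
# table above; property keys follow the Hadoop catalog documentation.
catalog = client.create_catalog(
    name="s3_catalog",
    catalog_type=Catalog.Type.FILESET,
    provider="hadoop",
    comment="fileset catalog backed by S3",
    properties={
        "filesystem-providers": "s3",
        "s3-endpoint": "https://s3.us-west-2.amazonaws.com",
        "s3-access-key-id": "<access-key>",
        "s3-secret-access-key": "<secret-key>",
    },
)
```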

@@ -77,16 +77,15 @@ Apart from the above properties, to access fileset like S3, GCS, OSS and custom
| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating |
| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating |

At the same time, you need to place the corresponding bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the Hadoop environment(typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).

At the same time, you need to place the corresponding bundle jar [`gravitino-aws-hadooop-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-hadoop-bundle/) in the Hadoop environment(typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).
Contributor

Why use the hadoop bundle, not the bundle only?
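
On the client side, a minimal sketch of using the Python GVFS client with these properties, assuming the fsspec-based `gravitino.gvfs` API from the Gravitino docs; whether the Python client accepts these exact dash-separated option keys is an assumption:

```python
from gravitino import gvfs

# Option keys mirror the S3 properties in the table above; the exact keys
# accepted by the Python client are an assumption here.
options = {
    "s3-access-key-id": "<access-key>",
    "s3-secret-access-key": "<secret-key>",
}

fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="example_metalake",
    options=options,
)

# List files through the virtual fileset path.
print(fs.ls("gvfs://fileset/s3_catalog/example_schema/example/"))
```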

@jerryshao
Contributor

Since our bundled jar does not actually bundle anything, I would rethink the jar/module name; take aws as an example:

  • aws: jar without the aws and hadoop dependencies.
  • aws-bundle: jar with the aws and hadoop dependencies.

One consideration is compatibility: if we rename to aws-hadoop-bundle, then compatibility will not be guaranteed. WDYT?

@yuqi1129
Contributor Author

Since our bundled jar does not actually bundle anything.

To a certain extent, it does bundle something, such as credential vending (it contains all the jars needed to provide credential services); when it comes to the Hadoop filesystem, xxx-bundle.jar contains nothing but the FileSystemProvider interface needed by Gravitino.

One consideration is compatibility: if we rename to aws-hadoop-bundle, then compatibility will not be guaranteed. WDYT?

Yeah, versions 0.7 and 0.8 are not compatible on this point.

Considering the aspects mentioned before, I also prefer to keep the name xxx-bundle.jar as it is and not introduce xxx-hadoop-bundle. @FANNG1, do you have any thoughts on it?

@yuqi1129
Contributor Author

Since our bundled jar does not actually bundle anything, I would rethink the jar/module name; take aws as an example:

  • aws: jar without the aws and hadoop dependencies.
  • aws-bundle: jar with the aws and hadoop dependencies.

One consideration is compatibility: if we rename to aws-hadoop-bundle, then compatibility will not be guaranteed. WDYT?

done.

docs/how-to-use-gvfs.md (outdated; resolved)
For example if you want to access the S3 fileset, you need to place the S3 bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the Hadoop environment(typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`) or add it to the classpath.
If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle jars in the Hadoop environment.
For example, if you want to access the S3 fileset, you need to place
1. The S3 hadoop bundle jar [`gravitino-aws-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/)
Contributor

Why "S3 hadoop bundle jar"? Please use the correcrt name.

Contributor Author

done


#### GCS fileset

| Configuration item | Description | Default value | Required | Since version |
|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|---------------------------|------------------|
| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset.| 0.7.0-incubating |

In the meantime, you need to place the corresponding bundle jar [`gravitino-gcp-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gcp-bundle/) in the Hadoop environment(typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).
In the meantime, you need to place the corresponding bundle jar [`gravitino-gcp-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp-bundle/) in the Hadoop environment(typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).
Contributor

We don't have to put the jar into the Hadoop location `${HADOOP_HOME}/share/hadoop/common/lib/`; I think that is typically not allowed, nor is it good practice. Just adding the jars to the classpath (wherever it is) is enough, right?

Contributor Author

Yeah, adding the jar to the classpath is more accurate.

Contributor Author

done
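
The GCS case would be analogous on the client side; a brief sketch under the same assumptions as the S3 example above, using the property from the table:

```python
from gravitino import gvfs

# GCS counterpart of the S3 example; the option key mirrors the table above
# and whether the Python client accepts it verbatim is an assumption.
fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="example_metalake",
    options={"gcs-service-account-file": "/path/to/service-account.json"},
)
print(fs.ls("gvfs://fileset/gcs_catalog/example_schema/example/"))
```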

@@ -77,7 +77,9 @@ Apart from the above properties, to access fileset like S3, GCS, OSS and custom
| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating |
| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating |

At the same time, you need to place the corresponding bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the Hadoop environment(typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).
At the same time, you need to add the corresponding bundle jar
1. [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/) in the classpath if there already have hadoop environment, or
Contributor

IIUC, the bundle jar should be used when the Hadoop environment is not ready, right? What you mentioned is the opposite.
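
To make the distinction concrete, a minimal sketch of the two client-side choices for PySpark (paths and versions are placeholders; jar names follow the aws / aws-bundle naming discussed above):

```python
import os

# Option 1: no usable Hadoop/AWS jars on the client -> use the full bundle
# jar, which ships the hadoop-aws and AWS SDK dependencies itself.
bundle_jars = "/path/to/gravitino-aws-bundle-{version}.jar"

# Option 2: the existing Hadoop environment (or explicit --jars entries)
# already supplies hadoop-aws and the AWS SDK -> the slim aws jar is enough.
slim_jars = "/path/to/gravitino-aws-{version}.jar"

# Pick whichever matches the deployment before starting the JVM.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    f"--jars {bundle_jars} --master local[1] pyspark-shell"
)
```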

@jerryshao jerryshao merged commit 07cdcba into apache:main Dec 27, 2024
26 checks passed

Successfully merging this pull request may close these issues.

[Improvement] Verify and test whether PySpark can access fileset with cloud storage.