[#5585] improvement(bundles): Refactor bundle jars and provide core jars that do not contain hadoop-{aws,gcp,aliyun,azure} #5806
Conversation
@jerryshao:
Think about a different name, not "mini"...
Generally, I'm OK with this solution.
After discussing with @FANNG1 offline, he suggests we need to include …
@yuqi1129 Please fix the conflicts and move this PR forward as we discussed offline.
```python
fileset_name = "example"

## this is for S3
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /Users/yuqi/project/gravitino/bundles/aws-bundle/build/libs/gravitino-aws-bundle-{gravitino-version}.jar,/Users/yuqi/project/gravitino/clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-{gravitino-version}-SNAPSHOT.jar,/Users/yuqi/Downloads/hadoop-jars/hadoop-aws-3.2.0.jar,/Users/yuqi/Downloads/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell"
```
It's not good to use a path that has your name in it.
Fixed.
I think the current issue is that we have several documents describing similar things about Fileset, with overlaps and repetition; we should refactor the docs to make them easier to read from the user's perspective.
I have changed the other documents slightly; it's quite time-consuming to change them all, so I would rather optimize them later.
docs/how-to-use-gvfs.md
Outdated
@@ -137,8 +137,13 @@ You can configure these properties in two ways:

:::note
If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle jar in the Hadoop environment.
For example, if you want to access the S3 fileset, you need to place the S3 bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the Hadoop environment (typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`) or add it to the classpath.
If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle/core jar in the Hadoop environment.
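As a sketch of what adding such a jar to the classpath can look like for a PySpark user (the jar paths and version below are illustrative assumptions, mirroring the snippet quoted earlier in this thread):

```python
import os

# Hypothetical locations of the downloaded jars; substitute your own paths.
jars = ",".join([
    "/path/to/gravitino-aws-bundle-0.8.0-incubating.jar",
    "/path/to/gravitino-filesystem-hadoop3-runtime-0.8.0-incubating.jar",
])

# PySpark reads PYSPARK_SUBMIT_ARGS before launching the JVM, so the bundle
# ends up on the driver and executor classpaths without touching ${HADOOP_HOME}.
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--jars {jars} --master local[1] pyspark-shell"
```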
I think the description "core" is not correct anymore. Please revisit the document to make sure the content is at least correct.
docs/hadoop-catalog.md
Outdated
Gravitino itself hasn't yet tested the object storage support, so if you have any issue,
please create an [issue](https://github.com/apache/gravitino/issues).
object storage like S3, GCS, Azure Blob Storage and OSS, you can put the hadoop object store jar like
`gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the `$GRAVITINO_HOME/catalogs/hadoop/libs` directory to enable the support.
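A minimal sketch of that placement step, assuming `GRAVITINO_HOME` is set and the bundle jar has already been downloaded (both paths below are illustrative):

```python
import os
import shutil

# Illustrative download location; adjust to wherever the jar actually lives.
bundle_jar = "/tmp/gravitino-aws-hadoop-bundle-0.8.0-incubating.jar"
libs_dir = os.path.join(os.environ["GRAVITINO_HOME"], "catalogs", "hadoop", "libs")

# Copy the bundle into the Hadoop catalog's libs directory; the catalog
# classloader picks it up on the next server restart.
shutil.copy(bundle_jar, libs_dir)
```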
Why do we need to use the hadoop bundle jar? I assume we already have the Hadoop dependencies in the Hadoop catalog, right?
The hadoop bundle does not contain hadoop-aws, though the Gravitino server already provides a Hadoop environment (hadoop-client-api & hadoop-client-runtime).
aws-hadoop-bundle = aws-bundle + hadoop-client-* + hadoop-aws.
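Read concretely, that composition suggests a simple selection rule. The helper below is purely illustrative (not part of any Gravitino API), with jar names taken from this thread:

```python
def pick_aws_jar(hadoop_env_ready: bool) -> str:
    """Illustrative only: choose which published jar to add to the classpath."""
    if hadoop_env_ready:
        # hadoop-client-* (and hadoop-aws) are already provided by the
        # environment, so the slim bundle is enough.
        return "gravitino-aws-bundle-{version}.jar"
    # No Hadoop environment: the fat bundle also ships hadoop-client-* and hadoop-aws.
    return "gravitino-aws-hadoop-bundle-{version}.jar"
```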
docs/hadoop-catalog.md
Outdated
@@ -52,7 +50,7 @@ Apart from the above properties, to access fileset like HDFS, S3, GCS, OSS or cu
| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating |
| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating |

At the same time, you need to place the corresponding bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
At the same time, you need to place the corresponding bundle jar [`gravitino-aws-hadoop-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-hadoop-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
Also here.
docs/how-to-use-gvfs.md
Outdated
@@ -77,16 +77,15 @@ Apart from the above properties, to access fileset like S3, GCS, OSS and custom
| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating |
| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating |

At the same time, you need to place the corresponding bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the Hadoop environment (typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).

At the same time, you need to place the corresponding bundle jar [`gravitino-aws-hadoop-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-hadoop-bundle/) in the Hadoop environment (typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).
Why use the hadoop bundle, not the bundle only?
Since our bundled jar does not actually bundle anything, I would rethink the jar/module name, taking aws as an example:
One consideration is compatibility: if we rename to aws-hadoop-bundle, it will not guarantee compatibility. WDYT?
To a certain extent, it did bundle something, like credential vending (it contains all jars needed to provide credential services); when it comes to the Hadoop filesystem, …
Yeah, versions 0.7 and 0.8 are not compatible on this point. Considering the aspect mentioned before, I also prefer to keep the name …
Done.
docs/how-to-use-gvfs.md
Outdated
For example if you want to access the S3 fileset, you need to place the S3 bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the Hadoop environment (typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`) or add it to the classpath.
If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle jars in the Hadoop environment.
For example, if you want to access the S3 fileset, you need to place
1. The S3 hadoop bundle jar [`gravitino-aws-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/)
Why "S3 hadoop bundle jar"? Please use the correcrt name.
Done.
docs/how-to-use-gvfs.md
Outdated
#### GCS fileset

| Configuration item | Description | Default value | Required | Since version |
|----------------------------|---------------------------------------------|---------------|---------------------------|------------------|
| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset.| 0.7.0-incubating |

In the meantime, you need to place the corresponding bundle jar [`gravitino-gcp-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gcp-bundle/) in the Hadoop environment (typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).
In the meantime, you need to place the corresponding bundle jar [`gravitino-gcp-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp-bundle/) in the Hadoop environment (typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).
We don't have to put the jar into the Hadoop location `${HADOOP_HOME}/share/hadoop/common/lib/`; I think typically that is not allowed, and it is not a good practice either. Just adding the jars to the classpath (wherever it is) is enough, right?
Yeah, adding the jar to the classpath is more accurate.
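For instance, a PySpark user could put the GCS bundle on the classpath like this (a sketch only; the jar path and version are assumptions):

```python
import os

# Hypothetical download location of the GCP bundle jar.
gcp_bundle = "/opt/jars/gravitino-gcp-bundle-0.8.0-incubating.jar"

# --jars places the bundle on the driver and executor classpaths,
# so nothing under ${HADOOP_HOME} needs to change.
os.environ["PYSPARK_SUBMIT_ARGS"] = f"--jars {gcp_bundle} --master local[1] pyspark-shell"
```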
Done.
docs/how-to-use-gvfs.md
Outdated
@@ -77,7 +77,9 @@ Apart from the above properties, to access fileset like S3, GCS, OSS and custom
| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating |
| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating |

At the same time, you need to place the corresponding bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the Hadoop environment (typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).
At the same time, you need to add the corresponding bundle jar
1. [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/) to the classpath if there is already a Hadoop environment, or
IIUC, the bundle jar should be used when the Hadoop environment is not ready, right? What you mentioned is the opposite.
What changes were proposed in this pull request?
Provide another kind of bundle jar that does not contain hadoop-{aws,gcp,aliyun,azure}, such as aws-mini and gcp-mini.
Why are the changes needed?
To make it work with a wide range of Hadoop versions.
Fix: #5585
Does this PR introduce any user-facing change?
N/A
How was this patch tested?
Existing UTs and ITs