
[#5585] improvement(bundles): Refactor bundle jars and provide core jars that do not contain hadoop-{aws,gcp,aliyun,azure} #5806

Merged
merged 39 commits into apache:main on Dec 27, 2024

Conversation

yuqi1129
Contributor

@yuqi1129 yuqi1129 commented Dec 9, 2024

What changes were proposed in this pull request?

Provide another kind of bundle jar that does not contain hadoop-{aws,gcp,aliyun,azure}, such as aws-mini and gcp-mini.

Why are the changes needed?

To make it work across a wide range of Hadoop versions.

Fix: #5585

Does this PR introduce any user-facing change?

N/A

How was this patch tested?

Existing UTs and ITs

@yuqi1129
Contributor Author

yuqi1129 commented Dec 9, 2024

@jerryshao
This is the POC change for S3. Please take a look and see whether it is acceptable; if so, I will continue working on the other three cloud storage systems.

@jerryshao
Contributor

Think about a different name, not "mini"...

@jerryshao
Contributor

Generally, I'm OK with this solution.

@yuqi1129 yuqi1129 changed the title from "[#5585] improvement(bundles): Refactor bundle jars and provide mini jars that does not contains hadoop-{aws,gcp,aliyun}" to "[#5585] improvement(bundles): Refactor bundle jars and provide mini jars that does not contains hadoop-{aws,gcp,aliyun,azure}" on Dec 10, 2024
@yuqi1129 yuqi1129 self-assigned this Dec 11, 2024
@yuqi1129
Contributor Author

Generally, I'm OK with this solution.

@jerryshao

After discussing with @FANNG1 offline: he suggested we include hadoop-common in the bundle jars. I have not added it to the bundle jars in this PR, and I'm not very sure about this point; could you please help verify it?

@jerryshao
Contributor

@yuqi1129 Please fix the conflicts and move this PR forward as we discussed offline.

@yuqi1129 yuqi1129 requested review from FANNG1 and jerryshao December 17, 2024 12:08
fileset_name = "example"

## this is for S3
os.environ["PYSPARK_SUBMIT_ARGS"] = "--jars /Users/yuqi/project/gravitino/bundles/aws-bundle/build/libs/gravitino-aws-bundle-{gravitino-version}.jar,/Users/yuqi/project/gravitino/clients/filesystem-hadoop3-runtime/build/libs/gravitino-filesystem-hadoop3-runtime-{gravitino-version}-SNAPSHOT.jar,/Users/yuqi/Downloads/hadoop-jars/hadoop-aws-3.2.0.jar,/Users/yuqi/Downloads/aws-java-sdk-bundle-1.11.375.jar --master local[1] pyspark-shell"
Contributor

It's not good to use a path that includes your name.

Contributor Author

fix
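
For reference, a runnable variant of the snippet above (a minimal sketch: the jar paths, server URI, metalake, and fileset names are hypothetical placeholders, and the `fs.gravitino.*` keys follow the GVFS docs):

```python
import os

from pyspark.sql import SparkSession

# Jars must be set before the JVM starts; all paths are placeholders.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    "--jars /path/to/gravitino-aws-bundle-{gravitino-version}.jar,"
    "/path/to/gravitino-filesystem-hadoop3-runtime-{gravitino-version}.jar,"
    "/path/to/hadoop-aws-3.2.0.jar,"
    "/path/to/aws-java-sdk-bundle-1.11.375.jar "
    "--master local[1] pyspark-shell"
)

spark = (
    SparkSession.builder.appName("gvfs-fileset-example")
    # Register the GVFS scheme; property names follow the GVFS documentation.
    .config("spark.hadoop.fs.gvfs.impl",
            "org.apache.gravitino.filesystem.hadoop.GravitinoVirtualFileSystem")
    .config("spark.hadoop.fs.gravitino.server.uri", "http://localhost:8090")
    .config("spark.hadoop.fs.gravitino.client.metalake", "example_metalake")
    .getOrCreate()
)

# Read through the virtual path: gvfs://fileset/<catalog>/<schema>/<fileset>/
df = spark.read.text("gvfs://fileset/s3_catalog/example_schema/example/")
df.show()
```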

@jerryshao
Contributor

I think the current issue is that we have several documents describing similar things about Fileset, with overlaps and repetition; we should refactor the docs to make them easier to read from the user's perspective.

@yuqi1129
Contributor Author

I have changed the other documents slightly; it's quite time-consuming to change them all, and I would rather optimize them later.

@@ -137,8 +137,13 @@ You can configure these properties in two ways:
```

:::note
If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle jar in the Hadoop environment.
For example if you want to access the S3 fileset, you need to place the S3 bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the Hadoop environment(typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`) or add it to the classpath.
If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle/core jar in the Hadoop environment.
Contributor

I think the description "core" is not correct anymore. Please revisit the document to make sure the content is correct at least.

object storage like S3, GCS, Azure Blob Storage and OSS, you can put the hadoop object store jar like
`gravitino-aws-hadoop-bundle-{gravitino-version}.jar` into the `$GRAVITINO_HOME/catalogs/hadoop/libs` directory to enable the support.
Gravitino itself hasn't yet tested the object storage support, so if you have any issue,
please create an [issue](https://github.com/apache/gravitino/issues).
Contributor

Why do we need to use the hadoop bundle jar? I assume we already have the Hadoop dependencies in the Hadoop catalog, right?

Contributor Author

@yuqi1129 yuqi1129 Dec 26, 2024

The hadoop bundle does not contain hadoop-aws, though the Gravitino server already provides a Hadoop environment (hadoop-client-api & hadoop-client-api-runtime).

Contributor Author

@yuqi1129 yuqi1129 Dec 26, 2024

aws-hadoop-bundle = aws-bundle + hadoop-client-* + hadoop-aws.

@@ -52,7 +50,7 @@ Apart from the above properties, to access fileset like HDFS, S3, GCS, OSS or cu
| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating |
| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset. | 0.7.0-incubating |

At the same time, you need to place the corresponding bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
At the same time, you need to place the corresponding bundle jar [`gravitino-aws-hadoop-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-hadoop-bundle/) in the directory `${GRAVITINO_HOME}/catalogs/hadoop/libs`.
Contributor

Also here.
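
For context, a minimal sketch of creating such a catalog from the Python client, assuming the `create_catalog` API shown in the Gravitino docs; the endpoint, all names, and the property values are placeholders, and the server-side jar must already be in `${GRAVITINO_HOME}/catalogs/hadoop/libs` as the doc above says:

```python
from gravitino import Catalog, GravitinoClient

# Connect to an existing metalake; URI and names are placeholders.
client = GravitinoClient(uri="http://localhost:8090",
                         metalake_name="example_metalake")

# Create a Hadoop (fileset) catalog carrying the S3 properties from the
# table above; property keys follow the Hadoop catalog documentation.
catalog = client.create_catalog(
    name="s3_catalog",
    catalog_type=Catalog.Type.FILESET,
    provider="hadoop",
    comment="fileset catalog backed by S3",
    properties={
        "filesystem-providers": "s3",
        "s3-endpoint": "https://s3.us-west-2.amazonaws.com",
        "s3-access-key-id": "<access-key>",
        "s3-secret-access-key": "<secret-key>",
    },
)
```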

@@ -77,16 +77,15 @@ Apart from the above properties, to access fileset like S3, GCS, OSS and custom
| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating |
| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating |

At the same time, you need to place the corresponding bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the Hadoop environment(typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).

At the same time, you need to place the corresponding bundle jar [`gravitino-aws-hadooop-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-hadoop-bundle/) in the Hadoop environment(typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).
Contributor

Why use the hadoop bundle, not the bundle only?
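
On the client side, a minimal sketch of using the Python GVFS client with these properties, assuming the fsspec-based `gravitino.gvfs` API from the Gravitino docs; whether the Python client accepts these exact dash-separated option keys is an assumption:

```python
from gravitino import gvfs

# Option keys mirror the S3 properties in the table above; the exact keys
# accepted by the Python client are an assumption here.
options = {
    "s3-access-key-id": "<access-key>",
    "s3-secret-access-key": "<secret-key>",
}

fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="example_metalake",
    options=options,
)

# List files through the virtual fileset path.
print(fs.ls("gvfs://fileset/s3_catalog/example_schema/example/"))
```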

@jerryshao
Contributor

Since our bundled jar does not actually bundle anything, I would rethink the jar/module name; take aws as an example:

  • aws: jar without the aws and hadoop dependencies.
  • aws-bundle: jar with the aws and hadoop dependencies.

One consideration is compatibility: if we rename to aws-hadoop-bundle, then compatibility will not be guaranteed. WDYT?

@yuqi1129
Contributor Author

Since our bundled jar does not actually bundle anything.

To a certain extent, it does bundle something, such as credential vending (it contains all the jars needed to provide credential services); when it comes to the Hadoop filesystem, xxx-bundle.jar contains nothing but the FileSystemProvider interface needed by Gravitino.

One consideration is compatibility: if we rename to aws-hadoop-bundle, then compatibility will not be guaranteed. WDYT?

Yeah, versions 0.7 and 0.8 are not compatible on this point.

Considering the aspects mentioned before, I also prefer to keep the name xxx-bundle.jar as it is and not introduce xxx-hadoop-bundle. @FANNG1, do you have any thoughts on it?

@yuqi1129
Contributor Author

Since our bundled jar does not actually bundle anything, I would rethink the jar/module name; take aws as an example:

  • aws: jar without the aws and hadoop dependencies.
  • aws-bundle: jar with the aws and hadoop dependencies.

One consideration is compatibility: if we rename to aws-hadoop-bundle, then compatibility will not be guaranteed. WDYT?

done.

docs/how-to-use-gvfs.md (outdated; resolved)
For example if you want to access the S3 fileset, you need to place the S3 bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the Hadoop environment(typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`) or add it to the classpath.
If you want to access the S3, GCS, OSS or custom fileset through GVFS, apart from the above properties, you need to place the corresponding bundle jars in the Hadoop environment.
For example, if you want to access the S3 fileset, you need to place
1. The S3 hadoop bundle jar [`gravitino-aws-bundle-${gravitino-version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/)
Contributor

Why "S3 hadoop bundle jar"? Please use the correcrt name.

Contributor Author

done


#### GCS fileset

| Configuration item | Description | Default value | Required | Since version |
|--------------------------------|------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------|---------------|---------------------------|------------------|
| `gcs-service-account-file` | The path of GCS service account JSON file. | (none) | Yes if it's a GCS fileset.| 0.7.0-incubating |

In the meantime, you need to place the corresponding bundle jar [`gravitino-gcp-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gcp-bundle/) in the Hadoop environment(typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).
In the meantime, you need to place the corresponding bundle jar [`gravitino-gcp-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-gcp-bundle/) in the Hadoop environment(typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).
Contributor

We don't have to put the jar into the Hadoop location `${HADOOP_HOME}/share/hadoop/common/lib/`; I think that is typically not allowed, nor is it good practice. Just adding the jars to the classpath (wherever it is) is enough, right?

Contributor Author

Yeah, adding the jar to the classpath is more accurate.

Contributor Author

done
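
The GCS case would be analogous on the client side; a brief sketch under the same assumptions as the S3 example above, using the property from the table:

```python
from gravitino import gvfs

# GCS counterpart of the S3 example; the option key mirrors the table above
# and whether the Python client accepts it verbatim is an assumption.
fs = gvfs.GravitinoVirtualFileSystem(
    server_uri="http://localhost:8090",
    metalake_name="example_metalake",
    options={"gcs-service-account-file": "/path/to/service-account.json"},
)
print(fs.ls("gvfs://fileset/gcs_catalog/example_schema/example/"))
```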

@@ -77,7 +77,9 @@ Apart from the above properties, to access fileset like S3, GCS, OSS and custom
| `s3-access-key-id` | The access key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating |
| `s3-secret-access-key` | The secret key of the AWS S3. | (none) | Yes if it's a S3 fileset.| 0.7.0-incubating |

At the same time, you need to place the corresponding bundle jar [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/aws-bundle/) in the Hadoop environment(typically located in `${HADOOP_HOME}/share/hadoop/common/lib/`).
At the same time, you need to add the corresponding bundle jar
1. [`gravitino-aws-bundle-${version}.jar`](https://repo1.maven.org/maven2/org/apache/gravitino/gravitino-aws-bundle/) in the classpath if there already have hadoop environment, or
Contributor

IIUC, the bundle jar should be used when the Hadoop environment is not ready, right? What you mentioned is the opposite.
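
To make the distinction concrete, a minimal sketch of the two client-side choices for PySpark (paths and versions are placeholders; jar names follow the aws / aws-bundle naming discussed above):

```python
import os

# Option 1: no usable Hadoop/AWS jars on the client -> use the full bundle
# jar, which ships the hadoop-aws and AWS SDK dependencies itself.
bundle_jars = "/path/to/gravitino-aws-bundle-{version}.jar"

# Option 2: the existing Hadoop environment (or explicit --jars entries)
# already supplies hadoop-aws and the AWS SDK -> the slim aws jar is enough.
slim_jars = "/path/to/gravitino-aws-{version}.jar"

# Pick whichever matches the deployment before starting the JVM.
os.environ["PYSPARK_SUBMIT_ARGS"] = (
    f"--jars {bundle_jars} --master local[1] pyspark-shell"
)
```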

@jerryshao jerryshao merged commit 07cdcba into apache:main Dec 27, 2024
26 checks passed

Successfully merging this pull request may close these issues.

[Improvement] Verify and test whether PySpark can access fileset with cloud storage.