This issue is for discussing a standardized way to ship new catalogs. We plan to ship GlueCatalog and NessieCatalog in the upcoming release, and new catalogs such as JDBC are also in progress.
In the last community sync meeting, we discussed 2 ways:
- for each new catalog (if in a new module), add a new runtime module that bundles all additional dependencies.
- directly add the module to existing runtimes, such as iceberg-spark3-runtime, as long as the jar size increase is reasonable.
For approach 1, building a runtime jar for shared use does not seem ideal. It can cause duplicate-class conflicts on the user's classpath, because the AWS client version bundled here may differ from the one in the user's application. It would also add a large number of runtime modules to Iceberg as more catalogs are added, which is not desirable.
For approach 2, I ran some experiments with the current aws module, and the result was not good. When it was added to the spark3 runtime, the jar size increased from 18.9MB to 34.7MB, almost double. I checked all AWS dependencies, and even with all non-AWS dependencies excluded, the added size was still over 10MB. I would expect a very similar situation if support for GCS and Azure is added in the future.
So the best option for the open source project seems to be not bundling these dependencies in any runtime jar. Instead, a user can start the Spark session by specifying the additional dependencies in the --packages flag:
spark-sql --packages org.apache.iceberg:iceberg-spark3-runtime:0.11.0,org.apache.iceberg:iceberg-aws:0.11.0,software.amazon.awssdk:bundle:2.15.40 \
--conf spark.sql.catalog.test=org.apache.iceberg.spark.SparkCatalog \
--conf spark.sql.catalog.test.catalog-impl=org.apache.iceberg.aws.glue.GlueCatalog \
--conf spark.sql.catalog.test.warehouse=s3://some-bucket
This way the user can freely choose the version of the AWS client (2.15.40 in this example). Users who are sensitive to jar size can cherry-pick the individual AWS client packages they need, instead of using the 250MB bundle.
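As a rough sketch of the cherry-picking option, the --packages value can be assembled from individual AWS SDK v2 modules rather than the full bundle. The module list below (glue, s3, kms, sts, dynamodb, url-connection-client) is an assumption about what a Glue and S3 setup might need, not a verified minimal set, and should be adjusted to the features actually in use.

```shell
#!/bin/sh
set -eu

# Versions taken from the example above.
ICEBERG_VERSION=0.11.0
AWS_SDK_VERSION=2.15.40

# Iceberg artifacts that are always required for this setup.
PACKAGES="org.apache.iceberg:iceberg-spark3-runtime:${ICEBERG_VERSION}"
PACKAGES="${PACKAGES},org.apache.iceberg:iceberg-aws:${ICEBERG_VERSION}"

# Assumed subset of AWS SDK v2 modules, used instead of software.amazon.awssdk:bundle.
for module in glue s3 kms sts dynamodb url-connection-client; do
  PACKAGES="${PACKAGES},software.amazon.awssdk:${module}:${AWS_SDK_VERSION}"
done

# Print the comma-separated coordinate list.
echo "${PACKAGES}"
```

The resulting string can then be passed as spark-sql --packages "$PACKAGES" together with the same --conf options as in the command above.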
Any thoughts? @rdblue @rymurr @yyanyy @giovannifumarola