
[SPARK-24252][SQL] Add v2 catalog plugin system #23915

Closed

Conversation


@rdblue rdblue commented Feb 28, 2019

What changes were proposed in this pull request?

This adds a v2 API for adding new catalog plugins to Spark.

  • Catalog implementations extend CatalogPlugin and are loaded via reflection, similar to data sources
  • Catalogs loads and initializes catalogs using configuration from a SQLConf
  • CaseInsensitiveStringMap is used to pass configuration to CatalogPlugin via initialize

Catalogs are configured by adding config properties starting with spark.sql.catalog.(name). The name property must specify a class that implements CatalogPlugin. Other properties under the namespace (spark.sql.catalog.(name).(prop)) are passed to the provider during initialization along with the catalog name.
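For illustration, here is a minimal sketch of the loading flow just described, with the SQLConf lookup replaced by a plain map so the snippet stands alone. The catalog name, plugin class, and the use of a public CaseInsensitiveStringMap(Map) constructor are assumptions for the example, not the PR's exact code.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Simplified sketch: look up spark.sql.catalog.<name>, instantiate the class
// reflectively, collect spark.sql.catalog.<name>.<prop> entries as options,
// and call initialize. Assumes the CatalogPlugin interface added by this PR
// is on the classpath.
final class CatalogLoadingSketch {
  static CatalogPlugin load(String name, Map<String, String> conf) throws Exception {
    String className = conf.get("spark.sql.catalog." + name);

    String prefix = "spark.sql.catalog." + name + ".";
    Map<String, String> options = new HashMap<>();
    for (Map.Entry<String, String> entry : conf.entrySet()) {
      if (entry.getKey().startsWith(prefix)) {
        options.put(entry.getKey().substring(prefix.length()), entry.getValue());
      }
    }

    CatalogPlugin plugin =
        (CatalogPlugin) Class.forName(className).getDeclaredConstructor().newInstance();
    plugin.initialize(name, new CaseInsensitiveStringMap(options));
    return plugin;
  }
}
```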

This replaces #21306, which will be implemented in multiple parts: the catalog plugin system (this commit) and specific catalog APIs, like TableCatalog.

How was this patch tested?

Added test suites for CaseInsensitiveStringMap and for catalog loading.


rdblue commented Feb 28, 2019

@mccheah, @cloud-fan, could you review?

Matt, I've updated this to use CatalogPlugin from our discussion on #21306.

* @throws SparkException If the plugin class cannot be found or instantiated
*/
public static CatalogPlugin load(String name, SQLConf conf) throws SparkException {
String pluginClassName = conf.getConfString("spark.sql.catalog." + name, null);
Contributor

Should this configuration allow for loading multiple catalogs with the same name but with different contexts? For example, say I want to load a different catalog plugin for functions vs. tables, but I want them to be named the same.

My intuition is that we shouldn't allow that as it makes the behavior quite ambiguous.

Contributor Author

I think it is reasonable to go with what is here: a name has just one implementation class. That class can implement multiple catalog interfaces, which do not conflict.

@@ -620,6 +622,12 @@ class SparkSession private(
*/
@transient lazy val catalog: Catalog = new CatalogImpl(self)

@transient private lazy val catalogs = new mutable.HashMap[String, CatalogPlugin]()
Contributor

Should there be some best-effort lifecycle for stopping or closing these catalog plugins? I'm wondering, for example, if we want to close connections to databases when the SparkSession shuts down. Maybe add a stop() API and call it from SparkSession#stop?

Contributor Author

We can check for Closeable and clean them up. Do you think this is needed?

Contributor

Using an explicit stop API is more consistent with other parts of Spark, such as the SparkSession itself and the scheduler components.

Contributor Author

Is there a use case for stopping catalogs? HiveExternalCatalog doesn't have a stop method.

Member

Possibly: a job-server type thing might want to "update" or "refresh" a catalog. I don't know if it's super important for starters.

Contributor Author

I would say let's build it when we have a use case for it and avoid unnecessary additions.

Contributor

I was thinking JDBC would be a significant one, enough so that it's worth putting in the API up front and showing that we support it. We'd like to be able to shut down connection pools when the catalog is discarded.
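For reference, a minimal sketch of the Closeable-based cleanup discussed in this thread, assuming the session keeps a map of loaded plugins; the helper class and method are hypothetical and not part of this PR.

```java
import java.io.Closeable;
import java.util.Map;

// Hypothetical best-effort cleanup: close any loaded catalog plugin that also
// implements java.io.Closeable (e.g. to release JDBC connection pools) when
// the owning session shuts down. Assumes the CatalogPlugin interface from this PR.
final class CatalogCleanup {
  static void closeAll(Map<String, CatalogPlugin> loadedCatalogs) {
    for (CatalogPlugin plugin : loadedCatalogs.values()) {
      if (plugin instanceof Closeable) {
        try {
          ((Closeable) plugin).close();
        } catch (Exception e) {
          // Best effort: ignore failures and keep closing the remaining catalogs.
        }
      }
    }
    loadedCatalogs.clear();
  }
}
```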


mccheah commented Feb 28, 2019

Overall it looks reasonable; it's about what I would expect for a pluggable system that relies on classloading and reflective instantiation. We do this in other places too, for example in SchedulerBackend bootstrapping.

* @param name the name used to identify and load this catalog
* @param options a case-insensitive string map of configuration
*/
void initialize(String name, CaseInsensitiveStringMap options);
Contributor

It's weird that we ask the catalog plugin to report its name when initialize already takes a name.

Contributor

Should this be void initialize(CaseInsensitiveStringMap options)?

Contributor Author

The catalog is passed the name that was used to identify it. For example, say I have a REST-based catalog endpoint that I'm configuring in two cases, like this:

spark.sql.catalog.prod = com.example.MyCatalogPlugin
spark.sql.catalog.prod.connuri = http://prod.catalog.example.com:80/
spark.sql.catalog.test = com.example.MyCatalogPlugin
spark.sql.catalog.test.connuri = http://test.catalog.example.com:80/

MyCatalogPlugin is instantiated and configured twice, and each instance is passed the name it is configured with: prod and test.

Adding a getter for name just makes it easy to identify the catalog without Spark keeping track of name -> catalog instance everywhere.
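A minimal sketch of what the hypothetical MyCatalogPlugin above could look like, assuming the interface proposed here exposes initialize(String, CaseInsensitiveStringMap) and a name() getter; the connuri option is the illustrative property from the config above.

```java
import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Hypothetical REST-backed catalog. The same class is configured twice
// ("prod" and "test"), and each instance remembers the name and endpoint
// it was initialized with. Assumes the CatalogPlugin interface from this PR.
public class MyCatalogPlugin implements CatalogPlugin {
  private String name;
  private String connUri;

  @Override
  public void initialize(String name, CaseInsensitiveStringMap options) {
    this.name = name;                       // "prod" or "test"
    this.connUri = options.get("connuri");  // per-catalog endpoint
  }

  @Override
  public String name() {
    return name;
  }
}
```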

Contributor

In this case, how would MyCatalogPlugin report its name? prod or test?

Member

That depends on which catalog instance: one would say prod, the other would say test.

@cloud-fan

IIRC we want to follow Presto and support config files as well. Is it still on the roadmap?


SparkQA commented Feb 28, 2019

Test build #102846 has finished for PR 23915 at commit c584e35.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class CaseInsensitiveStringMap implements Map<String, String>
  • public class Catalogs


mccheah commented Feb 28, 2019

Could config files be implementation-specific? For example, I'd want my data source implementation to be given a YAML file, but another data source implementation wants a properties file. Deserializing the file doesn't make sense on the Spark side; it belongs in the internal implementation of the data source. So I'd imagine one of the properties passed to the data source implementation would be the path to the configuration file to deserialize, and we leave it up to the implementation to decide how to handle that path in initialize.

So I don't think we need to make configuration files an explicit part of the API here.
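As a sketch of that approach, a plugin could read a file path from an ordinary catalog property and parse the file itself; the config-path key and class name below are illustrative, not part of the proposed API.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.Properties;

import org.apache.spark.sql.util.CaseInsensitiveStringMap;

// Hypothetical plugin that loads extra configuration from a file whose path is
// passed as a regular catalog property, e.g.
//   spark.sql.catalog.mycat.config-path = /etc/spark/mycat.properties
// The file format is entirely up to the implementation.
public class FileConfiguredCatalog implements CatalogPlugin {
  private String name;
  private final Properties fileConfig = new Properties();

  @Override
  public void initialize(String name, CaseInsensitiveStringMap options) {
    this.name = name;
    String path = options.get("config-path");
    if (path != null) {
      try (InputStream in = Files.newInputStream(Paths.get(path))) {
        fileConfig.load(in);
      } catch (IOException e) {
        throw new RuntimeException("Failed to read catalog config: " + path, e);
      }
    }
  }

  @Override
  public String name() {
    return name;
  }
}
```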


rdblue commented Mar 1, 2019

> IIRC we want to follow Presto and support config files as well. Is it still in the roadmap?

I hadn't considered this before. Can we add it in a follow-up if it is something that we want to do?


rdblue commented Mar 1, 2019

@mccheah, @cloud-fan, I've updated the PR and replied to your comments. Please have a look. Thank you!


SparkQA commented Mar 1, 2019

Test build #102891 has finished for PR 23915 at commit 17e6de1.

  • This patch fails Java style tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


} catch (ClassNotFoundException e) {
throw new SparkException(String.format(
"Cannot find catalog plugin class for catalog '%s': %s", name, pluginClassName));
Member

why not pass e for consistency?

Contributor Author

I don't think it is useful in this case. The stack trace would be from Class.forName into the Java runtime, and all of the information from the message, like the class name, is included in this one. The stack above this call is also included in the thrown exception.

Contributor

Nit: Then we should only wrap the Class.forName call in the try-catch - if anything else in the block throws ClassNotFoundException it will not be obvious where it was thrown from. And while ClassNotFoundException can't be thrown by any other code currently, future contributors adding code in this block can get their exceptions swallowed up.

I do think in general it's best practice to pass along the exception. Prevents us from losing any state, even if that state is noise 99.9% of the time.
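A minimal sketch of the narrower structure being suggested, with the ClassNotFoundException chained as the cause; the helper class and method are illustrative, not the PR's actual code.

```java
import org.apache.spark.SparkException;

final class CatalogClassLookup {
  // Only the reflective lookup sits inside the try, so exceptions thrown by
  // other code are not swallowed, and the original exception is kept as the cause.
  static Class<?> findPluginClass(String name, String pluginClassName) throws SparkException {
    try {
      return Class.forName(pluginClassName);
    } catch (ClassNotFoundException e) {
      throw new SparkException(String.format(
          "Cannot find catalog plugin class for catalog '%s': %s", name, pluginClassName), e);
    }
  }
}
```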


@rdblue rdblue force-pushed the SPARK-24252-add-v2-catalog-plugins branch from 17e6de1 to 6edb880 on March 1, 2019 17:06
@dongjoon-hyun dongjoon-hyun changed the title from "SPARK-24252: Add v2 catalog plugin system." to "[SPARK-24252][SQL] Add v2 catalog plugin system" on Mar 1, 2019

SparkQA commented Mar 1, 2019

Test build #102918 has finished for PR 23915 at commit 6edb880.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


rdblue commented Mar 5, 2019

Retest this please.


SparkQA commented Mar 6, 2019

Test build #103071 has finished for PR 23915 at commit 6edb880.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


rdblue commented Mar 6, 2019

@cloud-fan, I've updated the docs as you requested. There should be no need to wait for tests because the change was to docs only.


SparkQA commented Mar 6, 2019

Test build #103102 has finished for PR 23915 at commit 7e50ca6.

  • This patch fails Spark unit tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


cloud-fan commented Mar 7, 2019

The change itself LGTM (except one comment), but I need a little more information to justify the design:

  1. Why does the initialize method take the name parameter? Since one table catalog can be registered more than once with different names, end users would expect the table catalog not to be sensitive to its name. I think it's better if Spark can force the table catalog to be insensitive to the name, i.e. not pass the name when initializing the table catalog.
  2. Why does the table catalog need to report its name? Spark has a map of string to table catalog, so when Spark gets a table catalog, the name should already be known.

Interestingly, the name doesn't exist in the original PR: https://github.com/apache/spark/pull/21306/files#diff-81c54123a7549b07a9d627353d9cbf95R49 . I'm wondering what has been changed recently.

EDIT:
We discussed this problem in the DS v2 community meeting. The conclusion is: theoretically it's not needed, but it makes it easier to use the table catalog in Spark. We will justify it after we implement the table catalog. We may decide to remove the name parameter and create a wrapper class in Spark, or leave it unchanged.
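For reference, a minimal sketch of the wrapper-class alternative mentioned in the EDIT, in which Spark would track the name itself rather than passing it to initialize; this is purely illustrative and not what the PR implements.

```java
// Hypothetical wrapper: Spark keeps the (name, plugin) pair together so the
// plugin itself never needs to know or report the name it was registered under.
// Assumes the CatalogPlugin interface from this PR.
final class NamedCatalog {
  private final String name;
  private final CatalogPlugin plugin;

  NamedCatalog(String name, CatalogPlugin plugin) {
    this.name = name;
    this.plugin = plugin;
  }

  String name() { return name; }
  CatalogPlugin plugin() { return plugin; }
}
```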

@cloud-fan

retest this please

@transient private lazy val catalogs = new mutable.HashMap[String, CatalogPlugin]()

private[sql] def catalog(name: String): CatalogPlugin = synchronized {
catalogs.getOrElseUpdate(name, Catalogs.load(name, sessionState.conf))
Member

Looks like we don't support unloading or reloading a catalog plugin?

Contributor

Do you have any use case for unloading/reloading?

Member

A possible case is loading a catalog plugin with a different config. Currently you have to load it with a different catalog name. For example, for a REST-based catalog endpoint that requires some authentication, we might want to reconnect the catalog with different authentication.

Contributor Author

I would be fine adding this feature if it is needed, but we don't currently support reconfiguration in SessionCatalog so I think it makes sense to get this in and consider adding it later.


SparkQA commented Mar 7, 2019

Test build #103118 has finished for PR 23915 at commit 7e50ca6.

  • This patch fails due to an unknown error code, -9.
  • This patch merges cleanly.
  • This patch adds no public classes.

@dilipbiswal

retest this please


SparkQA commented Mar 7, 2019

Test build #103128 has finished for PR 23915 at commit 7e50ca6.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.

@rdblue rdblue force-pushed the SPARK-24252-add-v2-catalog-plugins branch from f118a3d to 7c64a26 on March 7, 2019 20:38

SparkQA commented Mar 8, 2019

Test build #103162 has finished for PR 23915 at commit f118a3d.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds no public classes.


SparkQA commented Mar 8, 2019

Test build #103165 has finished for PR 23915 at commit 7c64a26.

  • This patch passes all tests.
  • This patch merges cleanly.
  • This patch adds the following public classes (experimental):
  • public class Catalogs
  • public class CaseInsensitiveStringMap implements Map<String, String>

@cloud-fan

thanks, merging to master!

@cloud-fan cloud-fan closed this in 6170e40 Mar 8, 2019
mccheah pushed a commit to palantir/spark that referenced this pull request May 15, 2019

Closes apache#23915 from rdblue/SPARK-24252-add-v2-catalog-plugins.

Authored-by: Ryan Blue <blue@apache.org>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
@juliuszsompolski

What's the difference between the existing org.apache.spark.sql.catalyst.util.CaseInsensitiveMap in sql/api and org.apache.spark.sql.util.CaseInsensitiveStringMap added here in sql/catalyst?
E.g. DataFrameReader and DataFrameWriter then have to convert with val dsOptions = new CaseInsensitiveStringMap(finalOptions.asJava).

@cloud-fan

It's just a Java-friendly version of CaseInsensitiveMap.
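A minimal usage sketch from the Java side, assuming the public constructor that takes a java.util.Map (as in the DataFrameReader conversion quoted above); the keys and values are illustrative.

```java
import java.util.HashMap;
import java.util.Map;

import org.apache.spark.sql.util.CaseInsensitiveStringMap;

public class CaseInsensitiveStringMapExample {
  public static void main(String[] args) {
    // Lookups ignore key casing, so plugin code does not depend on how the
    // user spelled an option name in the configuration.
    Map<String, String> raw = new HashMap<>();
    raw.put("connUri", "http://test.catalog.example.com:80/");

    CaseInsensitiveStringMap options = new CaseInsensitiveStringMap(raw);
    System.out.println(options.get("connuri"));  // same entry, different key casing
  }
}
```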
