[SPARK-35242][SQL] Support change catalog default database for spark #32364
Conversation
Can one of the admins verify this patch?

cc @aokolnychyi and @cloud-fan FYI
Every catalog is free to define its own default database/namespace, see `CatalogPlugin.defaultNamespace`.
What we need here is a config to change the default database for the session catalog. How about `spark.sql.catalog.$SESSION_CATALOG_NAME.defaultDatabase`?
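For illustration, such a static config entry might look roughly like the following in `StaticSQLConf` (a sketch only; the entry name, doc text, and default are assumptions based on this thread, not merged code):

```scala
// Hypothetical sketch of a static conf entry. buildStaticConf comes from
// SQLConf; SESSION_CATALOG_NAME ("spark_catalog") comes from CatalogManager.
val CATALOG_DEFAULT_DATABASE =
  buildStaticConf(s"spark.sql.catalog.$SESSION_CATALOG_NAME.defaultDatabase")
    .doc("The default database for the session catalog.")
    .stringConf
    .createWithDefault("default")
```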
cc @yaooqinn
Can we read the conf in class `SessionCatalog` so that it can be changed per session?
Can you detail the context or operation that leads to this exception? It looks to have nothing to do with the …
Can we test this in CliSuite? AFAIK, Spark never actually gets a chance to create the default database if it does not exist; that happens during Hive metastore client initialization. If it is configured to `default2`, for example, Spark will now get the opportunity to create it, and there might then be two default databases.
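A CliSuite-style check might look something like the sketch below (hedged: it assumes CliSuite's `runCliWithin` helper in sql/hive-thriftserver and the conf key proposed above, and has not been run against a real metastore):

```scala
import scala.concurrent.duration._

// Sketch only: assumes CliSuite's runCliWithin(timeout, extraArgs = ...)
// helper and the proposed conf key.
test("SPARK-35242: default database can be changed for the session catalog") {
  runCliWithin(
    2.minutes,
    extraArgs = Seq("--conf", "spark.sql.catalog.spark_catalog.defaultDatabase=default2"))(
    // If Spark creates the configured database when it is missing, the
    // session should land in it rather than in "default".
    "SELECT current_database();" -> "default2")
}
```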
IMO, the database needs to exist when we are not connecting to `default`. Currently, spark-shell (and spark-submit) always needs read permission on `default` during initialization.
update format
@cloud-fan @yaooqinn: thanks for your review.

@cloud-fan @yaooqinn: please review this PR again when you are free.
```diff
 object SessionCatalog {
-  val DEFAULT_DATABASE = "default"
+  val DEFAULT_DATABASE = SQLConf.get.defaultDatabase
```
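The review comments below push back on this pattern. As a minimal, self-contained demo of the hazard (plain Scala, not Spark code): an object's `val` is evaluated once, so whatever the active conf happened to be at initialization time is frozen forever.

```scala
// Standalone demo: an object's val is evaluated once, on first access,
// so later changes to the underlying "conf" are never observed. This is
// the risk of val DEFAULT_DATABASE = SQLConf.get.defaultDatabase.
object ObjectValPitfall {
  var activeConf: String = "default" // stand-in for the active SQLConf value

  object Holder {
    val captured: String = activeConf // frozen at first access
  }

  def main(args: Array[String]): Unit = {
    println(Holder.captured) // prints "default"
    activeConf = "other_db"  // e.g. another session changes the conf
    println(Holder.captured) // still prints "default"
  }
}
```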
This makes it look like it's a runtime config. Let's write `getConf(StaticSQLConf.CATALOG_DEFAULT_DATABASE)`.
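For illustration, the suggested read could look like this inside the `SessionCatalog` object (a sketch; it still goes through `SQLConf.get`, which the next comment flags as problematic):

```scala
// Sketch: read the proposed static conf entry explicitly, making it
// obvious this is a static (not runtime) configuration.
val DEFAULT_DATABASE: String =
  SQLConf.get.getConf(StaticSQLConf.CATALOG_DEFAULT_DATABASE)
```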
BTW, it's a bit tricky to access the active SQLConf in a Scala object. Can we read the conf in `BaseSessionStateBuilder` and pass it to `SessionCatalog`?
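As a rough, self-contained model of that wiring (stand-in types only; the names mirror but do not reproduce Spark's internals):

```scala
// Minimal model of the suggested wiring: the session state builder reads
// the static conf once and passes the value into the catalog constructor,
// so the catalog never reaches for a thread-local SQLConf itself.
object DefaultDatabaseWiring {
  final class SessionCatalog(val defaultDatabase: String)

  final class BaseSessionStateBuilder(staticConf: Map[String, String]) {
    // Stand-in for conf.getConf(StaticSQLConf.CATALOG_DEFAULT_DATABASE).
    private def catalogDefaultDatabase: String =
      staticConf.getOrElse("spark.sql.catalog.spark_catalog.defaultDatabase", "default")

    lazy val catalog: SessionCatalog = new SessionCatalog(catalogDefaultDatabase)
  }

  def main(args: Array[String]): Unit = {
    val builder = new BaseSessionStateBuilder(
      Map("spark.sql.catalog.spark_catalog.defaultDatabase" -> "other_db"))
    println(builder.catalog.defaultDatabase) // other_db
  }
}
```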
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
### What changes were proposed in this pull request?

This PR is a follow-up to #32364, which was closed by github-actions because it hadn't been updated in a while. The previous PR added a new configuration parameter (spark.sql.catalog.$SESSION_CATALOG_NAME.defaultDatabase) to configure the session catalog's default database, which is required by some use cases where the user does not have access to the default database. Therefore I have created a new PR based on it and added these changes in addition:

- Rebased / updated the previous PR to the latest master branch version
- Deleted the DEFAULT_DATABASE static member from sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala and refactored the code regarding this

### Why are the changes needed?

If our user does not have any permissions for the Hive default database in Ranger, it will fail with the following error:

```
22/08/26 18:36:21 INFO metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=hrt_10ROOT.HWX.SITE (auth:KERBEROS) retries=1 delay=1 lifetime=0
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [hrt_10] does not have [USE] privilege on [default])
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
  at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:223)
  at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
  at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:144)
```

The idea is that we introduce a new configuration parameter where we can set a different database name for the default database. Our user has enough permissions for this in Ranger. For example:

```
spark-shell --conf spark.sql.catalog.spark_catalog.defaultDatabase=other_db
```

### Does this PR introduce _any_ user-facing change?

There will be a new configuration parameter as mentioned above, but the default value is "default", as it was previously.

### How was this patch tested?

1) With GitHub Actions (all tests passed): https://github.com/roczei/spark/actions/runs/2934863118
2) Manually tested with Ranger + Hive

Scenario a) hrt_10 does not have access to the default database in Hive:

```
[hrt_10quasar-thbnqr-2 ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/26 18:14:18 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:14:30 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: [dispatcher-event-loop-17]: Attempted to request executors before the AM has registered!
...

scala> spark.sql("use other")
22/08/26 18:18:47 INFO conf.HiveConf: [main]: Found configuration file file:/etc/hive/conf/hive-site.xml
22/08/26 18:18:48 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:18:48 WARN client.HiveClientImpl: [main]: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic
Hive Session ID = 2188764e-d0dc-41b3-b714-f89b03cb3d6d
22/08/26 18:18:48 INFO SessionState: [main]: Hive Session ID = 2188764e-d0dc-41b3-b714-f89b03cb3d6d
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: HMS client filtering is enabled.
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: Trying to connect to metastore with URI thrift://quasar-thbnqr-4.quasar-thbnqr.root.hwx.site:9083
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: HMSC::open(): Could not find delegation token. Creating KERBEROS-based thrift connection.
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: Opened a connection to metastore, current connections: 1
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: Connected to metastore.
22/08/26 18:18:50 INFO metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=hrt_10ROOT.HWX.SITE (auth:KERBEROS) retries=1 delay=1 lifetime=0
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [hrt_10] does not have [USE] privilege on [default])
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
  at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:223)
  at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
  at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:144)
  at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:179)
```

This is the expected behavior, because it uses the "default" db name.

Scenario b) Use the "other" database, where the hrt_10 user has proper permissions:

```
[hrt_10quasar-thbnqr-2 ~]$ spark-shell --conf spark.sql.catalog.spark_catalog.defaultDatabase=other
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/26 18:27:03 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:27:14 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: [dispatcher-event-loop-15]: Attempted to request executors before the AM has registered!
...

scala> spark.sql("use other")
22/08/26 18:29:22 INFO conf.HiveConf: [main]: Found configuration file file:/etc/hive/conf/hive-site.xml
22/08/26 18:29:22 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:29:22 WARN client.HiveClientImpl: [main]: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic
Hive Session ID = 47721693-dbfe-4760-80f6-d4a76a3b37d2
22/08/26 18:29:22 INFO SessionState: [main]: Hive Session ID = 47721693-dbfe-4760-80f6-d4a76a3b37d2
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: HMS client filtering is enabled.
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: Trying to connect to metastore with URI thrift://quasar-thbnqr-4.quasar-thbnqr.root.hwx.site:9083
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: HMSC::open(): Could not find delegation token. Creating KERBEROS-based thrift connection.
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: Opened a connection to metastore, current connections: 1
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: Connected to metastore.
22/08/26 18:29:24 INFO metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=hrt_10ROOT.HWX.SITE (auth:KERBEROS) retries=1 delay=1 lifetime=0
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from employee").show()
+---+----+------+-----------+
|eid|name|salary|destination|
+---+----+------+-----------+
| 12| Ram|    10|     Szeged|
| 13| Joe|    20|   Debrecen|
+---+----+------+-----------+

scala>
```

Closes #37679 from roczei/SPARK-35242.

Lead-authored-by: Gabor Roczei <roczei@gmail.com>
Co-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Co-authored-by: hongdd <jn_hdd@163.com>
Co-authored-by: Gabor Roczei <roczei@cloudera.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

Support changing the catalog default database for Spark.

### Why are the changes needed?

The Spark catalog default database can only be `default`. When we cannot access `default`, we get an exception. We should support changing the default database for the catalog, like `jdbc`/`thrift` does.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added