[SPARK-35242][SQL] Support change catalog default database for spark #32364
Conversation
Can one of the admins verify this patch?

cc @aokolnychyi and @cloud-fan FYI
Every catalog is free to define its own default database/namespace, see `CatalogPlugin.defaultNamespace`.
What we need here is a config to change the default database for the session catalog. How about `spark.sql.catalog.$SESSION_CATALOG_NAME.defaultDatabase`?
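For illustration, such a static config entry might look roughly like the following in `StaticSQLConf` (a sketch only; the entry name, doc text, and default are assumptions based on this thread, not merged code):

```scala
// Hypothetical sketch of a static conf entry. buildStaticConf comes from
// SQLConf; SESSION_CATALOG_NAME ("spark_catalog") comes from CatalogManager.
val CATALOG_DEFAULT_DATABASE =
  buildStaticConf(s"spark.sql.catalog.$SESSION_CATALOG_NAME.defaultDatabase")
    .doc("The default database for the session catalog.")
    .stringConf
    .createWithDefault("default")
```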
cc @yaooqinn
Can we read the conf in class `SessionCatalog` so that it can be changed per session?
Can you detail the context or operation that leads to this exception? It looks to have nothing to do with the …
Can we test this in CliSuite? AFAIK, Spark never actually gets a chance to create the default database if it does not exist; that happens during Hive metastore client initialization. If it is configured to `default2`, for example, Spark will now get the opportunity to create it, and there might then be two default databases.
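A CliSuite-style check might look something like the sketch below (hedged: it assumes CliSuite's `runCliWithin` helper in sql/hive-thriftserver and the conf key proposed above, and has not been run against a real metastore):

```scala
import scala.concurrent.duration._

// Sketch only: assumes CliSuite's runCliWithin(timeout, extraArgs = ...)
// helper and the proposed conf key.
test("SPARK-35242: default database can be changed for the session catalog") {
  runCliWithin(
    2.minutes,
    extraArgs = Seq("--conf", "spark.sql.catalog.spark_catalog.defaultDatabase=default2"))(
    // If Spark creates the configured database when it is missing, the
    // session should land in it rather than in "default".
    "SELECT current_database();" -> "default2")
}
```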
IMO, the database needs to exist when we are not connecting to `default`. Currently, spark-shell (and spark-submit) always needs read permission on `default` during initialization.
update format
@cloud-fan @yaooqinn: thanks for your review.

@cloud-fan @yaooqinn: please review this PR again when you are free.
```diff
 object SessionCatalog {
-  val DEFAULT_DATABASE = "default"
+  val DEFAULT_DATABASE = SQLConf.get.defaultDatabase
```
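The review comments below push back on this pattern. As a minimal, self-contained demo of the hazard (plain Scala, not Spark code): an object's `val` is evaluated once, so whatever the active conf happened to be at initialization time is frozen forever.

```scala
// Standalone demo: an object's val is evaluated once, on first access,
// so later changes to the underlying "conf" are never observed. This is
// the risk of val DEFAULT_DATABASE = SQLConf.get.defaultDatabase.
object ObjectValPitfall {
  var activeConf: String = "default" // stand-in for the active SQLConf value

  object Holder {
    val captured: String = activeConf // frozen at first access
  }

  def main(args: Array[String]): Unit = {
    println(Holder.captured) // prints "default"
    activeConf = "other_db"  // e.g. another session changes the conf
    println(Holder.captured) // still prints "default"
  }
}
```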
This makes it look like it's a runtime config. Let's write `getConf(StaticSQLConf.CATALOG_DEFAULT_DATABASE)`.
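For illustration, the suggested read could look like this inside the `SessionCatalog` object (a sketch; it still goes through `SQLConf.get`, which the next comment flags as problematic):

```scala
// Sketch: read the proposed static conf entry explicitly, making it
// obvious this is a static (not runtime) configuration.
val DEFAULT_DATABASE: String =
  SQLConf.get.getConf(StaticSQLConf.CATALOG_DEFAULT_DATABASE)
```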
BTW, it's a bit tricky to access the active SQLConf in a Scala object. Can we read the conf in `BaseSessionStateBuilder` and pass it to `SessionCatalog`?
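As a rough, self-contained model of that wiring (stand-in types only; the names mirror but do not reproduce Spark's internals):

```scala
// Minimal model of the suggested wiring: the session state builder reads
// the static conf once and passes the value into the catalog constructor,
// so the catalog never reaches for a thread-local SQLConf itself.
object DefaultDatabaseWiring {
  final class SessionCatalog(val defaultDatabase: String)

  final class BaseSessionStateBuilder(staticConf: Map[String, String]) {
    // Stand-in for conf.getConf(StaticSQLConf.CATALOG_DEFAULT_DATABASE).
    private def catalogDefaultDatabase: String =
      staticConf.getOrElse("spark.sql.catalog.spark_catalog.defaultDatabase", "default")

    lazy val catalog: SessionCatalog = new SessionCatalog(catalogDefaultDatabase)
  }

  def main(args: Array[String]): Unit = {
    val builder = new BaseSessionStateBuilder(
      Map("spark.sql.catalog.spark_catalog.defaultDatabase" -> "other_db"))
    println(builder.catalog.defaultDatabase) // other_db
  }
}
```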
We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
### What changes were proposed in this pull request?

This PR is a follow-up to #32364, which was closed by github-actions because it hadn't been updated in a while. The previous PR added a new configuration parameter (spark.sql.catalog.$SESSION_CATALOG_NAME.defaultDatabase) to configure the session catalog's default database, which is required by some use cases where the user does not have access to the default database. Therefore I have created a new PR based on it and added these changes in addition:

- Rebased / updated the previous PR to the latest master branch version
- Deleted the DEFAULT_DATABASE static member from sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala and refactored the code regarding this

### Why are the changes needed?

If our user does not have any permissions for the Hive default database in Ranger, it will fail with the following error:

```
22/08/26 18:36:21 INFO metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=hrt_10ROOT.HWX.SITE (auth:KERBEROS) retries=1 delay=1 lifetime=0
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [hrt_10] does not have [USE] privilege on [default])
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
  at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:223)
  at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
  at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:144)
```

The idea is that we introduce a new configuration parameter where we can set a different database name for the default database. Our user has enough permissions for this in Ranger. For example:

```
spark-shell --conf spark.sql.catalog.spark_catalog.defaultDatabase=other_db
```

### Does this PR introduce _any_ user-facing change?

There will be a new configuration parameter as mentioned above, but the default value is "default", as it was previously.

### How was this patch tested?

1) With GitHub Actions (all tests passed): https://github.com/roczei/spark/actions/runs/2934863118
2) Manually tested with Ranger + Hive

Scenario a) hrt_10 does not have access to the default database in Hive:

```
[hrt_10quasar-thbnqr-2 ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/26 18:14:18 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:14:30 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: [dispatcher-event-loop-17]: Attempted to request executors before the AM has registered!
...

scala> spark.sql("use other")
22/08/26 18:18:47 INFO conf.HiveConf: [main]: Found configuration file file:/etc/hive/conf/hive-site.xml
22/08/26 18:18:48 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:18:48 WARN client.HiveClientImpl: [main]: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic
Hive Session ID = 2188764e-d0dc-41b3-b714-f89b03cb3d6d
22/08/26 18:18:48 INFO SessionState: [main]: Hive Session ID = 2188764e-d0dc-41b3-b714-f89b03cb3d6d
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: HMS client filtering is enabled.
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: Trying to connect to metastore with URI thrift://quasar-thbnqr-4.quasar-thbnqr.root.hwx.site:9083
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: HMSC::open(): Could not find delegation token. Creating KERBEROS-based thrift connection.
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: Opened a connection to metastore, current connections: 1
22/08/26 18:18:50 INFO metastore.HiveMetaStoreClient: [main]: Connected to metastore.
22/08/26 18:18:50 INFO metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=hrt_10ROOT.HWX.SITE (auth:KERBEROS) retries=1 delay=1 lifetime=0
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [hrt_10] does not have [USE] privilege on [default])
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
  at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:223)
  at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
  at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:144)
  at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:179)
```

This is the expected behavior, because it uses the "default" db name.

Scenario b) Use the "other" database, where the hrt_10 user has proper permissions:

```
[hrt_10quasar-thbnqr-2 ~]$ spark-shell --conf spark.sql.catalog.spark_catalog.defaultDatabase=other
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/26 18:27:03 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:27:14 WARN cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: [dispatcher-event-loop-15]: Attempted to request executors before the AM has registered!
...

scala> spark.sql("use other")
22/08/26 18:29:22 INFO conf.HiveConf: [main]: Found configuration file file:/etc/hive/conf/hive-site.xml
22/08/26 18:29:22 WARN conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:29:22 WARN client.HiveClientImpl: [main]: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic
Hive Session ID = 47721693-dbfe-4760-80f6-d4a76a3b37d2
22/08/26 18:29:22 INFO SessionState: [main]: Hive Session ID = 47721693-dbfe-4760-80f6-d4a76a3b37d2
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: HMS client filtering is enabled.
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: Trying to connect to metastore with URI thrift://quasar-thbnqr-4.quasar-thbnqr.root.hwx.site:9083
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: HMSC::open(): Could not find delegation token. Creating KERBEROS-based thrift connection.
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: Opened a connection to metastore, current connections: 1
22/08/26 18:29:24 INFO metastore.HiveMetaStoreClient: [main]: Connected to metastore.
22/08/26 18:29:24 INFO metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=hrt_10ROOT.HWX.SITE (auth:KERBEROS) retries=1 delay=1 lifetime=0
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from employee").show()
+---+----+------+-----------+
|eid|name|salary|destination|
+---+----+------+-----------+
| 12| Ram|    10|     Szeged|
| 13| Joe|    20|   Debrecen|
+---+----+------+-----------+

scala>
```

Closes #37679 from roczei/SPARK-35242.

Lead-authored-by: Gabor Roczei <roczei@gmail.com>
Co-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Co-authored-by: hongdd <jn_hdd@163.com>
Co-authored-by: Gabor Roczei <roczei@cloudera.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
### What changes were proposed in this pull request?

Support changing the catalog default database for Spark.

### Why are the changes needed?

The Spark catalog default database can only be `default`. When we cannot access `default`, we get an exception. We should support changing the default database for the catalog, like `jdbc`/`thrift` does.

### Does this PR introduce any user-facing change?

No

### How was this patch tested?

Added