
Conversation

@hddong (Contributor) commented Apr 27, 2021

What changes were proposed in this pull request?

Support changing the catalog default database for Spark.

Why are the changes needed?

The Spark catalog default database can only be `default`. When we cannot access `default`, we get an exception:

Permission denied: user [spark] does not have [SELECT] privilege on [spark_test]

We should support changing the default database for the catalog, as JDBC/Thrift already does.

Does this PR introduce any user-facing change?

No

How was this patch tested?

Added tests.

@github-actions github-actions bot added the SQL label Apr 27, 2021
@AmplabJenkins commented

Can one of the admins verify this patch?

@HyukjinKwon HyukjinKwon changed the title [SPARK-35242][SQL]support change catalog default database for spark [SPARK-35242][SQL] Support change catalog default database for spark Apr 28, 2021
@HyukjinKwon (Member) commented

cc @aokolnychyi and @cloud-fan FYI

A Contributor commented

Every catalog is free to define its own default database/namespace; see `CatalogPlugin.defaultNamespace`.

What we need here is a config to change the default database for the session catalog. How about `spark.sql.catalog.$SESSION_CATALOG_NAME.defaultDatabase`?

@cloud-fan (Contributor) commented

cc @yaooqinn

A Contributor commented

Can we read the conf in the `SessionCatalog` class so that it can be changed per session?

@yaooqinn (Member) commented

> Permission denied: user [spark] does not have [SELECT] privilege on [spark_test]

Can you detail the context or operation that leads to this exception?

It does not look related to the default database.

A Member commented

Can we test this in CliSuite? AFAIK, Spark never actually gets a chance to create the default database if it does not exist; that is done during Hive metastore client initialization. If it is configured to `default2`, for example, Spark will now get the opportunity to create it, and there might then be two default databases.

@hddong (Contributor, Author) commented May 7, 2021

IMO, the database needs to exist when we are not connecting to `default`. Currently, spark-shell (and spark-submit) always needs read permission on `default` during initialization.

@hddong pushed a commit: update format

@hddong (Contributor, Author) commented May 7, 2021

@cloud-fan @yaooqinn: thanks for your review.
In my case, Hive permissions are managed by Ranger, and no user has read access to `default`.
Please review again.

@hddong (Contributor, Author) commented May 26, 2021

@cloud-fan @yaooqinn: please review this PR again when you have time.


```diff
 object SessionCatalog {
-  val DEFAULT_DATABASE = "default"
+  val DEFAULT_DATABASE = SQLConf.get.defaultDatabase
```
A Contributor commented

This makes it look like it's a runtime config. Let's write `getConf(StaticSQLConf.CATALOG_DEFAULT_DATABASE)`.

A Contributor commented

BTW, it's a bit tricky to access the active `SQLConf` in a scala object. Can we read the conf in `BaseSessionStateBuilder` and pass it to `SessionCatalog`?

@github-actions bot commented Sep 4, 2021

We're closing this PR because it hasn't been updated in a while. This isn't a judgement on the merit of the PR in any way. It's just a way of keeping the PR queue manageable.
If you'd like to revive this PR, please reopen it and ask a committer to remove the Stale tag!

@github-actions github-actions bot added the Stale label Sep 4, 2021
@github-actions github-actions bot closed this Sep 5, 2021
cloud-fan pushed a commit that referenced this pull request Sep 27, 2022
### What changes were proposed in this pull request?

This PR is a follow-up for #32364, which was closed by github-actions because it had not been updated in a while. The previous PR added a new configuration parameter (`spark.sql.catalog.$SESSION_CATALOG_NAME.defaultDatabase`) to configure the session catalog's default database, which is required by some use cases where the user does not have access to the `default` database.

Therefore I have created a new PR based on it and added these changes on top:

- Rebased / updated the previous PR onto the latest master branch
- Deleted the DEFAULT_DATABASE static member from sql/catalyst/src/main/scala/org/apache/spark/sql/catalyst/catalog/SessionCatalog.scala and refactored the code accordingly

### Why are the changes needed?

If our user does not have any permissions on the Hive default database in Ranger, Spark fails with the following error:

```
22/08/26 18:36:21 INFO  metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=hrt_10@ROOT.HWX.SITE (auth:KERBEROS) retries=1 delay=1 lifetime=0
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [hrt_10] does not have [USE] privilege on [default])
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
  at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:223)
  at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
  at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:144)
```
The idea is to introduce a new configuration parameter with which we can set a different database name as the default database. Our user has sufficient permissions for this database in Ranger.

For example:

```
spark-shell --conf spark.sql.catalog.spark_catalog.defaultDatabase=other_db
```

### Does this PR introduce _any_ user-facing change?

There will be a new configuration parameter, as mentioned above, but its default value is "default", as it was previously.

### How was this patch tested?

1) With GitHub Actions (all tests passed)

https://github.com/roczei/spark/actions/runs/2934863118

2) Manually tested with Ranger + Hive

Scenario a) hrt_10 does not have access to the default database in Hive:

```
[hrt_10@quasar-thbnqr-2 ~]$ spark-shell
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/26 18:14:18 WARN  conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:14:30 WARN  cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: [dispatcher-event-loop-17]: Attempted to request executors before the AM has registered!

...

scala> spark.sql("use other")
22/08/26 18:18:47 INFO  conf.HiveConf: [main]: Found configuration file file:/etc/hive/conf/hive-site.xml
22/08/26 18:18:48 WARN  conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:18:48 WARN  client.HiveClientImpl: [main]: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic
Hive Session ID = 2188764e-d0dc-41b3-b714-f89b03cb3d6d
22/08/26 18:18:48 INFO  SessionState: [main]: Hive Session ID = 2188764e-d0dc-41b3-b714-f89b03cb3d6d
22/08/26 18:18:50 INFO  metastore.HiveMetaStoreClient: [main]: HMS client filtering is enabled.
22/08/26 18:18:50 INFO  metastore.HiveMetaStoreClient: [main]: Trying to connect to metastore with URI thrift://quasar-thbnqr-4.quasar-thbnqr.root.hwx.site:9083
22/08/26 18:18:50 INFO  metastore.HiveMetaStoreClient: [main]: HMSC::open(): Could not find delegation token. Creating KERBEROS-based thrift connection.
22/08/26 18:18:50 INFO  metastore.HiveMetaStoreClient: [main]: Opened a connection to metastore, current connections: 1
22/08/26 18:18:50 INFO  metastore.HiveMetaStoreClient: [main]: Connected to metastore.
22/08/26 18:18:50 INFO  metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=hrt_10@ROOT.HWX.SITE (auth:KERBEROS) retries=1 delay=1 lifetime=0
org.apache.spark.sql.AnalysisException: org.apache.hadoop.hive.ql.metadata.HiveException: MetaException(message:Permission denied: user [hrt_10] does not have [USE] privilege on [default])
  at org.apache.spark.sql.hive.HiveExternalCatalog.withClient(HiveExternalCatalog.scala:110)
  at org.apache.spark.sql.hive.HiveExternalCatalog.databaseExists(HiveExternalCatalog.scala:223)
  at org.apache.spark.sql.internal.SharedState.externalCatalog$lzycompute(SharedState.scala:150)
  at org.apache.spark.sql.internal.SharedState.externalCatalog(SharedState.scala:144)
  at org.apache.spark.sql.internal.SharedState.globalTempViewManager$lzycompute(SharedState.scala:179)
```

This is the expected behavior because it still uses the "default" database name.

Scenario b) Use the "other" database where the hrt_10 user has proper permissions

```
[hrt_10@quasar-thbnqr-2 ~]$ spark-shell --conf spark.sql.catalog.spark_catalog.defaultDatabase=other
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/08/26 18:27:03 WARN  conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:27:14 WARN  cluster.YarnSchedulerBackend$YarnSchedulerEndpoint: [dispatcher-event-loop-15]: Attempted to request executors before the AM has registered!

...

scala> spark.sql("use other")
22/08/26 18:29:22 INFO  conf.HiveConf: [main]: Found configuration file file:/etc/hive/conf/hive-site.xml
22/08/26 18:29:22 WARN  conf.HiveConf: [main]: HiveConf of name hive.masking.algo does not exist
22/08/26 18:29:22 WARN  client.HiveClientImpl: [main]: Detected HiveConf hive.execution.engine is 'tez' and will be reset to 'mr' to disable useless hive logic
Hive Session ID = 47721693-dbfe-4760-80f6-d4a76a3b37d2
22/08/26 18:29:22 INFO  SessionState: [main]: Hive Session ID = 47721693-dbfe-4760-80f6-d4a76a3b37d2
22/08/26 18:29:24 INFO  metastore.HiveMetaStoreClient: [main]: HMS client filtering is enabled.
22/08/26 18:29:24 INFO  metastore.HiveMetaStoreClient: [main]: Trying to connect to metastore with URI thrift://quasar-thbnqr-4.quasar-thbnqr.root.hwx.site:9083
22/08/26 18:29:24 INFO  metastore.HiveMetaStoreClient: [main]: HMSC::open(): Could not find delegation token. Creating KERBEROS-based thrift connection.
22/08/26 18:29:24 INFO  metastore.HiveMetaStoreClient: [main]: Opened a connection to metastore, current connections: 1
22/08/26 18:29:24 INFO  metastore.HiveMetaStoreClient: [main]: Connected to metastore.
22/08/26 18:29:24 INFO  metastore.RetryingMetaStoreClient: [main]: RetryingMetaStoreClient proxy=class org.apache.hadoop.hive.ql.metadata.SessionHiveMetaStoreClient ugi=hrt_10@ROOT.HWX.SITE (auth:KERBEROS) retries=1 delay=1 lifetime=0
res0: org.apache.spark.sql.DataFrame = []

scala> spark.sql("select * from employee").show()
+---+----+------+-----------+
|eid|name|salary|destination|
+---+----+------+-----------+
| 12| Ram|    10|     Szeged|
| 13| Joe|    20|   Debrecen|
+---+----+------+-----------+

scala>
```

Closes #37679 from roczei/SPARK-35242.

Lead-authored-by: Gabor Roczei <roczei@gmail.com>
Co-authored-by: hongdongdong <hongdongdong@cmss.chinamobile.com>
Co-authored-by: hongdd <jn_hdd@163.com>
Co-authored-by: Gabor Roczei <roczei@cloudera.com>
Signed-off-by: Wenchen Fan <wenchen@databricks.com>
