Conversation

@LantaoJin
Contributor

@LantaoJin LantaoJin commented May 22, 2018

What changes were proposed in this pull request?

Since SPARK-23639, using --proxy-user to impersonate a user invokes obtainDelegationTokens(). However, if the current configuration connects to the DB directly via JDBC instead of talking to the metastore over RPC, it fails with:

WARN HiveConf: HiveConf of name hive.server2.enable.impersonation does not exist
Exception in thread "main" java.lang.IllegalArgumentException: requirement failed: Hive metastore uri undefined
at scala.Predef$.require(Predef.scala:224)
at org.apache.spark.sql.hive.thriftserver.HiveCredentialProvider.obtainCredentials(HiveCredentialProvider.scala:73)
at org.apache.spark.sql.hive.thriftserver.SparkSQLEnv$.init(SparkSQLEnv.scala:56)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.<init>(SparkSQLCLIDriver.scala:288)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver$.main(SparkSQLCLIDriver.scala:137)
at org.apache.spark.sql.hive.thriftserver.SparkSQLCLIDriver.main(SparkSQLCLIDriver.scala)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(NativeMethodAccessorImpl.java:62)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(DelegatingMethodAccessorImpl.java:43)
at java.lang.reflect.Method.invoke(Method.java:498)
at org.apache.spark.deploy.SparkSubmit$.org$apache$spark$deploy$SparkSubmit$$runMain(SparkSubmit.scala:738)
at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:169)
at org.apache.spark.deploy.SparkSubmit$$anon$1.run(SparkSubmit.scala:167)
at java.security.AccessController.doPrivileged(Native Method)
at javax.security.auth.Subject.doAs(Subject.java:422)
at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1698)
at org.apache.spark.deploy.SparkSubmit$.doRunMain$1(SparkSubmit.scala:167)
at org.apache.spark.deploy.SparkSubmit$.submit(SparkSubmit.scala:212)
at org.apache.spark.deploy.SparkSubmit$.main(SparkSubmit.scala:126)
at org.apache.spark.deploy.SparkSubmit.main(SparkSubmit.scala)
18/05/22 05:24:16 INFO ShutdownHookManager: Shutdown hook called
18/05/22 05:24:16 INFO ShutdownHookManager: Deleting directory /tmp/spark-b63ad788-1a47-4326-9972-c4fde1dc19c3

How was this patch tested?

Remove or comment out the hive.metastore.uris configuration in hive-site.xml (so JDBC is used to connect to the DB directly). The command below then fails:

bin/spark-sql --proxy-user x_user --master local

@LantaoJin
Contributor Author

Hi @vanzin @jerryshao, could you help review this?

@AmplabJenkins

Can one of the admins verify this patch?

  require(principal.nonEmpty, s"Hive principal $principalKey undefined")
  val metastoreUri = conf.getTrimmed("hive.metastore.uris", "")
- require(metastoreUri.nonEmpty, "Hive metastore uri undefined")
+ if (metastoreUri.isEmpty) {
Contributor

How is the code getting past the check in delegationTokensRequired? It's basically checking for the same thing.

Contributor Author

require() throws an IllegalArgumentException and exits the JVM here. Letting delegationTokensRequired return false when metastoreUri is undefined (i.e. JDBC is used to connect to the DB directly) only finishes this method without setting a token in the Credentials.
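To make the difference concrete, here is a minimal, self-contained Scala sketch, not the actual Spark code; the config map and method names are illustrative. A hard require aborts the caller with an IllegalArgumentException, while an early return simply skips token acquisition and lets the job proceed.

```scala
// Illustrative sketch only: a plain Map stands in for HiveConf.
object TokenCheckSketch {

  // Mirrors the "require" style: a missing metastore uri aborts with an exception.
  def obtainWithRequire(conf: Map[String, String]): Unit = {
    val metastoreUri = conf.getOrElse("hive.metastore.uris", "").trim
    require(metastoreUri.nonEmpty, "Hive metastore uri undefined") // throws IllegalArgumentException
    println(s"would obtain a delegation token from $metastoreUri")
  }

  // Mirrors the "early return" style: a missing metastore uri just means no token is added.
  def obtainWithEarlyReturn(conf: Map[String, String]): Unit = {
    val metastoreUri = conf.getOrElse("hive.metastore.uris", "").trim
    if (metastoreUri.isEmpty) return
    println(s"would obtain a delegation token from $metastoreUri")
  }

  def main(args: Array[String]): Unit = {
    val noMetastore = Map.empty[String, String]
    obtainWithEarlyReturn(noMetastore) // silently skips, job can proceed
    obtainWithRequire(noMetastore)     // throws, submission fails
  }
}
```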

Contributor

Same question as @vanzin: delegationTokensRequired already checks whether hive.metastore.uris is empty, so the DT will not be obtained if hive.metastore.uris is not configured.

  override def delegationTokensRequired(
      sparkConf: SparkConf,
      hadoopConf: Configuration): Boolean = {
    // Delegation tokens are needed only when:
    // - trying to connect to a secure metastore
    // - either deploying in cluster mode without a keytab, or impersonating another user
    //
    // Other modes (such as client with or without keytab, or cluster mode with keytab) do not need
    // a delegation token, since there's a valid kerberos TGT for the right user available to the
    // driver, which is the only process that connects to the HMS.
    val deployMode = sparkConf.get("spark.submit.deployMode", "client")
    UserGroupInformation.isSecurityEnabled &&
      hiveConf(hadoopConf).getTrimmed("hive.metastore.uris", "").nonEmpty &&
      (SparkHadoopUtil.get.isProxyUser(UserGroupInformation.getCurrentUser()) ||
        (deployMode == "cluster" && !sparkConf.contains(KEYTAB)))
  }

Contributor Author

Yes, I know. If the metastore is undefined, there is no need to obtain the DT. Am I right?

Contributor

Before getting the DT, we check whether delegationTokensRequired returns true or false; if it is false, we do not get the DT. Since "hive.metastore.uris" is not configured here, it should return false.
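For readers following the thread, a hedged sketch of the guard pattern described above; the trait and object names are illustrative, not the actual Spark internals. The caller consults delegationTokensRequired first, so when hive.metastore.uris is unset no token is requested and no error is raised.

```scala
// Illustrative sketch of "check before obtain"; not the actual Spark classes.
trait CredentialProviderSketch {
  def delegationTokensRequired(conf: Map[String, String]): Boolean
  def obtainDelegationTokens(conf: Map[String, String]): Unit
}

object HiveProviderSketch extends CredentialProviderSketch {
  // Requires a token only when a remote metastore uri is configured.
  override def delegationTokensRequired(conf: Map[String, String]): Boolean =
    conf.getOrElse("hive.metastore.uris", "").trim.nonEmpty

  override def obtainDelegationTokens(conf: Map[String, String]): Unit =
    println("would fetch a Hive delegation token here")
}

object TokenManagerSketch {
  // With hive.metastore.uris unset, the provider reports false and is skipped.
  def obtainIfRequired(provider: CredentialProviderSketch, conf: Map[String, String]): Unit =
    if (provider.delegationTokensRequired(conf)) provider.obtainDelegationTokens(conf)

  def main(args: Array[String]): Unit =
    obtainIfRequired(HiveProviderSketch, Map.empty) // prints nothing, no failure
}
```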

Contributor Author

Oh, my fault, delegationTokensRequired is already checked in SparkSQLCLIDriver.scala.

@LantaoJin
Contributor Author

LantaoJin commented May 23, 2018

#20784 and #21343 did the same thing, but #21343 is more readable. Both fix the problem of using a proxy user to access the metastore (#17335 only considers YARN mode). However, if the current configuration connects to the DB directly instead of via RPC to the metastore, we shouldn't block Spark job execution after #20784.

@jerryshao
Contributor

Can you please describe your scenario, @LantaoJin?

@LantaoJin
Contributor Author

LantaoJin commented May 23, 2018

@jerryshao
Simply speaking, in a secure environment, if we use JDBC to connect to MySQL directly instead of accessing the Hive metastore, the current implementation blocks job execution.

Why we don't access the metastore is a tricky story: there is a firewall issue between Spark and the metastore, and that should be resolved on our side. But in the code path we can still choose whether or not to enable the metastore, and after #20784 the direct-DB-connect approach was blocked.
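As a rough illustration of where that choice surfaces, here is a small hedged Scala helper; it assumes a typical Hive setup and hadoop-common on the classpath, and is not taken from this PR. An empty hive.metastore.uris means Hive runs an embedded metastore and reaches the backing database directly over JDBC, so there is no HMS to issue a delegation token.

```scala
import org.apache.hadoop.conf.Configuration

object MetastoreModeSketch {
  // Treat a non-empty hive.metastore.uris as "remote metastore over RPC";
  // an empty value means the DB is reached directly via JDBC (embedded metastore).
  def usesRemoteMetastore(hadoopConf: Configuration): Boolean =
    hadoopConf.getTrimmed("hive.metastore.uris", "").nonEmpty

  def main(args: Array[String]): Unit = {
    val conf = new Configuration() // picks up *-site.xml files on the classpath, if any
    println(s"remote metastore configured: ${usesRemoteMetastore(conf)}")
  }
}
```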

@LantaoJin
Contributor Author

LantaoJin commented May 23, 2018

Also, the reason #20784 or #21343 is still needed to extend #17335 may be:

  1. Some DDL operations in local mode are much faster than launching an AM in YARN.
  2. Nodes in the YARN cluster have a firewall issue with the metastore :)

@LantaoJin
Contributor Author

In our current setup, when we onboard a new cluster, the default is to connect to the DB directly; it's much simpler than accessing the metastore. We are going to switch to accessing the metastore by default, but I think Spark shouldn't block the old approach.

@LantaoJin LantaoJin closed this May 23, 2018