Allow presto to connect to kerberized Hive clusters (v3) #4867

Closed
wants to merge 20 commits into from

arhimondr
Member

Supersedes: #4576

This implementation is meant to minimize the performance impact of Subject.doAs when using a kerberized cluster. Instead of wrapping the entire Hive connector in Subject.doAs, only the specific places where a FileSystem is created are wrapped.

Product tests have been added for all the formats supported by Presto, although some obscure formats that are implicitly supported by GenericRecordReader might still fail.

package com.facebook.presto.tests.hive;

/**
* Created by andrii on 21.03.16.
Contributor

Don't add comments like this.

Member Author

It is auto-generated by my IDE, and I just forgot to remove it. This is lame, I know. Will remove.

@dain
Contributor

dain commented Mar 25, 2016

Commit "Make HiveHdfsConfiguration immutable" (61d4bb) has a typo in the commit message: "Callse".

@@ -25,7 +25,20 @@
public class HiveHdfsConfiguration
Contributor

@electrum, please review this commit.

@arhimondr
Member Author

@martint We've finished iterating on this PR. Could you please do the final review and merge this?

P.S.
The build is passing. The single failure seems to be intermittent and related to the task scheduler.

Andrii Rosa and others added 20 commits May 9, 2016 19:59
New version contains `hive-shims` classes within.
`hive-shims` is required by presto-product-tests to access
kerberized hive.

Test INSERT and SELECT paths for all the storage formats
supported by the Hive connector. This test is going to be used
for kerberized HDFS access verification.

Initially all the formats are commented out because none of them
is supported yet. We will uncomment them format by format together with
the Kerberos support implementation patches, so we can better track
what changes are needed to ensure that a particular format
works with kerberized Hadoop.

Make INITIAL_CONFIGURATION in HiveHdfsConfiguration immutable.

Because of the specifics of the Configuration implementation, it can
be modified when additional Hadoop modules are loaded, such as
DistributedFileSystem, MapReduce, etc.

Consider the following flow:

1. Client1 calls HdfsConfiguration.getConfiguration()
2. Client2 calls FileSystem.getFileSystem() which implicitly loads the DistributedFileSystem
3. Client3 calls HdfsConfiguration.getConfiguration()

In this case Client1 and Client3 will obtain Configurations with different property sets.

To solve this issue we must load the HDFS-related configuration during
HiveHdfsConfiguration initialization and store those values in the unmodifiable
INITIAL_CONFIGURATION.
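The flow above can be illustrated with a toy model. This is a minimal sketch, not the actual HiveHdfsConfiguration code: `SharedDefaults`, `FrozenConfiguration`, and the property names are hypothetical stand-ins for Hadoop's process-wide default resources and for the eagerly loaded, unmodifiable snapshot.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;

public class ConfigSnapshotSketch {
    /** Hypothetical stand-in for process-wide defaults that grow as modules are loaded. */
    static final class SharedDefaults {
        private final Map<String, String> values = new HashMap<>();

        SharedDefaults() {
            values.put("fs.defaultFS", "file:///");
        }

        /** Simulates the side effect of loading a module such as DistributedFileSystem. */
        void loadHdfsModule() {
            values.put("dfs.replication", "3");
        }

        /** Returns a copy of whatever the defaults hold right now. */
        Map<String, String> current() {
            return new HashMap<>(values);
        }
    }

    /** Loads the HDFS-related defaults eagerly, then freezes them. */
    static final class FrozenConfiguration {
        private final Map<String, String> initialConfiguration;

        FrozenConfiguration(SharedDefaults defaults) {
            defaults.loadHdfsModule();
            this.initialConfiguration = Collections.unmodifiableMap(defaults.current());
        }

        /** Every caller gets a fresh copy of the same frozen property set. */
        Map<String, String> getConfiguration() {
            return new HashMap<>(initialConfiguration);
        }
    }

    public static void main(String[] args) {
        // Problem: Client1 and Client3 read the live defaults around Client2's module load.
        SharedDefaults live = new SharedDefaults();
        Map<String, String> client1 = live.current();
        live.loadHdfsModule();
        Map<String, String> client3 = live.current();
        System.out.println("live defaults agree: " + client1.equals(client3));

        // Fix: the snapshot is taken after the module load and never changes afterwards.
        SharedDefaults defaults = new SharedDefaults();
        FrozenConfiguration frozen = new FrozenConfiguration(defaults);
        Map<String, String> first = frozen.getConfiguration();
        defaults.loadHdfsModule();
        System.out.println("frozen snapshot agrees: " + first.equals(frozen.getConfiguration()));
    }
}
```

In the toy model the two live reads differ because the module load lands between them, while the frozen snapshot returns the same property set regardless of when it is read.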

Instead of using reflection to access the private
methods of UserGroupInformation, we are going to
leverage a thin shim.

This commit is going to be replaced with updated
versions of the Hadoop libraries once they are released.

Support both KERBEROS and SIMPLE Hadoop authentication,
with and without impersonation.

Pass session user as a parameter to HdfsEnvironment.getFileSystem

Creating the FileSystem within UserGroupInformation.doAs is enough
to make it authenticate HDFS requests with Kerberos.
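The idea of wrapping only FileSystem creation can be sketched with a toy model. Nothing below is Hadoop or Presto API: `doAs` is a hypothetical ThreadLocal-based stand-in for UserGroupInformation.doAs, `ToyFileSystem` stands in for a FileSystem that binds its identity at creation time, and the `getFileSystem` signature is an assumption made for illustration.

```java
import java.util.function.Supplier;

public class NarrowDoAsSketch {
    // Toy replacement for the identity that UserGroupInformation.doAs would establish.
    private static final ThreadLocal<String> CURRENT_USER =
            ThreadLocal.withInitial(() -> "presto-server");

    /** Toy stand-in for doAs: runs the action with `user` as the current identity. */
    static <T> T doAs(String user, Supplier<T> action) {
        String previous = CURRENT_USER.get();
        CURRENT_USER.set(user);
        try {
            return action.get();
        } finally {
            CURRENT_USER.set(previous);
        }
    }

    /** Toy file system that captures the current identity once, when it is created. */
    static final class ToyFileSystem {
        private final String authenticatedAs = CURRENT_USER.get();

        String open(String path) {
            return authenticatedAs + ":" + path;
        }
    }

    /** Hypothetical analogue of a getFileSystem method that takes the session user. */
    static ToyFileSystem getFileSystem(String sessionUser) {
        // Only the creation is wrapped; later calls on the handle need no wrapping.
        return doAs(sessionUser, ToyFileSystem::new);
    }

    public static void main(String[] args) {
        ToyFileSystem fs = getFileSystem("alice");
        // The read happens outside doAs but still carries the identity bound at creation.
        System.out.println(fs.open("/warehouse/t1"));
    }
}
```

The design point is that the cost of switching identity is paid once per file system handle instead of once per connector call.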

Use HdfsEnvironment.getFileSystem in custom readers instead
of plain FileSystem.get().

Add --krb5-disable-remote-service-hostname-canonicalization presto-cli option.
With this option, canonicalization of the Presto service hostname via reverse
DNS lookup can be disabled.
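A switch like this could look roughly as follows. This is a sketch under assumptions, not the presto-cli code: the class and method names are hypothetical, and only the JDK's `InetAddress` calls are real API.

```java
import java.net.InetAddress;
import java.net.UnknownHostException;

public class HostnameCanonicalizationSketch {
    /**
     * Returns the hostname to use for the remote service.
     * With canonicalization enabled, the name is resolved and then mapped back to a
     * fully qualified name via a reverse lookup; with it disabled, the configured
     * name is used exactly as given.
     */
    static String serviceHostname(String configuredHost, boolean canonicalize) {
        if (!canonicalize) {
            return configuredHost;
        }
        try {
            return InetAddress.getByName(configuredHost).getCanonicalHostName();
        } catch (UnknownHostException e) {
            throw new IllegalArgumentException("Cannot resolve host: " + configuredHost, e);
        }
    }

    public static void main(String[] args) {
        // No DNS traffic at all when canonicalization is disabled.
        System.out.println(serviceHostname("presto.example.com", false));
    }
}
```

Disabling the lookup is useful when reverse DNS returns a name that does not match the one the service is registered under.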

Add `singlenode-hdfs-impersonation` and `singlenode-kerberos-hdfs-no-impersonation`
product test environments. Rename the `singlenode-kerberized` environment to
`singlenode-kerberos-hdfs-impersonation` to keep the names consistent.

The Hive connector supports 4 types of HDFS authentication, and we have to be able
to test them all. The basic `singlenode` product test environment covers
simple HDFS authentication with no impersonation. `singlenode-hdfs-impersonation`
is intended to test simple HDFS authentication with impersonation.

Kerberos authentication with impersonation is covered by running product tests
on the `singlenode-kerberos-hdfs-impersonation` environment. To verify
Kerberos authentication without impersonation, product tests must be run on
`singlenode-kerberos-hdfs-no-impersonation`.
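The four combinations and the environment that covers each can be written down as a small lookup table. The environment names come from the commit message; the enum, its fields, and the `covering` helper are hypothetical and exist only to make the matrix explicit.

```java
import java.util.Arrays;

public class HdfsAuthMatrixSketch {
    /** One product test environment per (authentication, impersonation) combination. */
    enum Environment {
        SINGLENODE("singlenode", false, false),
        SINGLENODE_HDFS_IMPERSONATION("singlenode-hdfs-impersonation", false, true),
        SINGLENODE_KERBEROS_HDFS_IMPERSONATION("singlenode-kerberos-hdfs-impersonation", true, true),
        SINGLENODE_KERBEROS_HDFS_NO_IMPERSONATION("singlenode-kerberos-hdfs-no-impersonation", true, false);

        final String environmentName;
        final boolean kerberos;
        final boolean impersonation;

        Environment(String environmentName, boolean kerberos, boolean impersonation) {
            this.environmentName = environmentName;
            this.kerberos = kerberos;
            this.impersonation = impersonation;
        }
    }

    /** Finds the environment that exercises the given combination. */
    static Environment covering(boolean kerberos, boolean impersonation) {
        return Arrays.stream(Environment.values())
                .filter(e -> e.kerberos == kerberos && e.impersonation == impersonation)
                .findFirst()
                .orElseThrow(IllegalStateException::new);
    }

    public static void main(String[] args) {
        for (Environment e : Environment.values()) {
            System.out.println(e.environmentName
                    + " kerberos=" + e.kerberos
                    + " impersonation=" + e.impersonation);
        }
    }
}
```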

Add product tests that verify that HDFS impersonation is either enabled or
disabled.

To verify HDFS impersonation, a table is created using the Hive connector.
If HDFS impersonation is enabled, the table data should belong to the Presto JDBC
user; otherwise it should belong to the Hadoop user defined
in the Presto configuration.

These tests are profile specific and can't be run simultaneously on the
product test environment. To exclude such tests from the regular
test suite, which runs on all environments, the `profile_specific`
test group has been introduced. This group should be explicitly excluded
from regular test runs, along with the `quarantine` and `big_query` groups.
Then either the `hdfs_impersonation` or the `hdfs_no_impersonation` group should
be included, based on the configuration of the environment the
product tests are going to run on.
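The selection rules and the ownership check described above can be sketched in plain Java. This is not the product test framework's API: the helper names are hypothetical, and the sketch assumes each impersonation test is tagged with both `profile_specific` and its own profile group.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

public class TestGroupSelectionSketch {
    // Groups excluded from the regular suite that runs on every environment.
    static final Set<String> REGULAR_EXCLUDES =
            new HashSet<>(Arrays.asList("quarantine", "big_query", "profile_specific"));

    /** A test runs in the regular suite only if it belongs to none of the excluded groups. */
    static boolean runsInRegularSuite(Set<String> testGroups) {
        return Collections.disjoint(testGroups, REGULAR_EXCLUDES);
    }

    /** A test runs in the profile-specific pass only if it belongs to the included group. */
    static boolean runsInProfilePass(Set<String> testGroups, String includedGroup) {
        return testGroups.contains(includedGroup);
    }

    /** Owner the table data is expected to have, depending on the impersonation setting. */
    static String expectedOwner(boolean impersonationEnabled, String jdbcUser, String configuredHadoopUser) {
        return impersonationEnabled ? jdbcUser : configuredHadoopUser;
    }

    public static void main(String[] args) {
        Set<String> impersonationTest =
                new HashSet<>(Arrays.asList("hdfs_impersonation", "profile_specific"));
        System.out.println(runsInRegularSuite(impersonationTest));
        System.out.println(runsInProfilePass(impersonationTest, "hdfs_impersonation"));
        System.out.println(expectedOwner(true, "alice", "hdfs"));
    }
}
```

On an environment with impersonation enabled, the second pass would include `hdfs_impersonation`; on one without it, `hdfs_no_impersonation`.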

.. note::

If your ``krb5.conf`` location is different than ``/etc/krb5.conf`` you must set it
Contributor

This is not ideal because we lose the ability to validate options that are set in this manner. I guess it's fine for now, but we should figure out a way to make the option required by the hive connector and the one required by Presto coexist.

@martint
Contributor

martint commented May 9, 2016

Merged, thanks!
