Add flag to execute Hive queries as the user submitting the query #4382
Conversation
```java
if (HiveSessionProperties.getReadAsQueryUser(session)) {
    UserGroupInformation ugi = UgiUtils.getUgi(session.getUser());
    try {
        return ugi.doAs((PrivilegedExceptionAction<ConnectorPageSource>) () ->
```
Is it sufficient to only doAs for the initial creation of the page source? Historically, we have had problems with Hadoop APIs and the multithreaded nature of Presto. In a standard query this page source will be accessed from many threads, so if in the middle of the query a thread performs a new "read", the UGI will not be set on that thread.
@dain if a single query accesses this method repeatedly from multiple threads we should be OK, since the query will always be executed by the same user and the method will always doAs that user. We'd have problems if the ConnectorPageSource returned by this method were then shared across queries/users. Would that ever be the case?
In an initial version of this patch we ran into user/threading issues in the BackgroundHiveSplitLoader because we were calling doAs at too high a level, before kicking off tasks in the thread pool. This had the effect of binding the subject to the thread across queries. The fix was to do doAs within the thread, not outside of it.
We've been successfully running this patch in production for a week or so without issue, but that doesn't mean there couldn't be a corner case that we've missed.
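The fix described above can be sketched in plain Java. This is not the actual Presto/Hadoop code: a `ThreadLocal` stands in for the Subject/UGI binding that Hadoop's `doAs` performs, and the class name is hypothetical. The point is that the binding is entered inside the unit of work running on the pool thread and undone before the thread is reused, so it cannot leak across queries.

```java
import java.util.concurrent.Callable;

// Sketch only: ThreadLocal stands in for the Subject/UGI binding that
// Hadoop's UserGroupInformation.doAs performs on the calling thread.
final class ScopedDoAs
{
    static final ThreadLocal<String> CURRENT_USER = new ThreadLocal<>();

    private ScopedDoAs() {}

    static <T> T doAs(String user, Callable<T> action)
            throws Exception
    {
        CURRENT_USER.set(user);           // bind to this thread only
        try {
            return action.call();
        }
        finally {
            CURRENT_USER.remove();        // unbind before the pool reuses the thread
        }
    }
}
```

The "too high a level" bug is then the difference between calling `doAs` once and submitting the result versus submitting a task that itself calls `doAs`, e.g. `executor.submit(() -> ScopedDoAs.doAs(user, work))`.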
I was thinking about when we make further calls to the Hive record reader (e.g., recordReader.next). I would assume most readers cache the open file system input stream, but I would bet there are record readers that open new files during execution (maybe reading a sidecar file), and those would fail if the UGI is not on the thread. Maybe we just ignore that until the problem actually happens.
Right, I see your point. An implementation that opens new files after the page source is created would run into issues. We could wait and see if that's really an issue.
```java
// Every instance of a UserGroupInformation object for a given user has a unique hashcode, due
// to the hashCode() impl. If we don't cache the UGI per-user here, there will be a memory leak
// in the PrestoFileSystemCache.
```
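The interning the comment describes can be illustrated with a minimal stand-in. This is a sketch, not the patch's `UgiUtils`: `Object` replaces `UserGroupInformation` so there is no Hadoop dependency, and the real code would create the UGI (e.g. via `UserGroupInformation.createRemoteUser`) inside the loader. Returning the same instance per user means downstream identity-hashed caches see one key per user instead of one per call.

```java
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Sketch: intern one UGI-like object per user so identity-hashed caches
// (such as PrestoFileSystemCache) stay bounded at O(number of users).
final class UgiCache
{
    private static final ConcurrentMap<String, Object> CACHE = new ConcurrentHashMap<>();

    private UgiCache() {}

    static Object getUgi(String user)
    {
        // Object stands in for UserGroupInformation.createRemoteUser(user)
        return CACHE.computeIfAbsent(user, u -> new Object());
    }
}
```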
What is the leak? Specifically, if I have a lot of users over the uptime of my system, do I have a "large" permanent allocation for every user seen, or can we "expire" them over time?
Without this cache the specific leak would occur in PrestoFileSystemCache.map (https://github.com/prestodb/presto-hadoop-apache2/blob/master/src/main/java/org/apache/hadoop/fs/PrestoFileSystemCache.java#L62), since the Key would always be unique, even when the same user accesses the same filesystem in multiple queries.
Yes, the caching bounds the map size to O(number of users) by hashing UGIs consistently, but as implemented PrestoFileSystemCache.map still has a "leak" in that it caches all user keys for the lifetime of the JVM. To give a sense of the number of users required to cause memory issues: running our coordinator with a 64G heap, we ran into trouble once we reached O(100k) map entries.
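The expiry @dain asks about could be added with a bounded LRU map. This is a hypothetical variant, not anything in the patch or in PrestoFileSystemCache (which, as noted above, has no eviction): it caps the cache at a fixed number of users and evicts the least recently used entry, trading the O(users-for-JVM-lifetime) growth for occasional re-creation of a UGI.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical bounded per-user cache: LRU eviction keeps the map from
// growing without bound on a long-lived coordinator serving many users.
final class BoundedUgiCache<V>
{
    private final Map<String, V> cache;

    BoundedUgiCache(int maxEntries)
    {
        // accessOrder=true makes iteration (and eviction) least-recently-used first
        this.cache = new LinkedHashMap<String, V>(16, 0.75f, true)
        {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, V> eldest)
            {
                return size() > maxEntries;
            }
        };
    }

    synchronized V get(String user, Function<String, V> loader)
    {
        return cache.computeIfAbsent(user, loader);
    }

    synchronized int size()
    {
        return cache.size();
    }
}
```

The trade-off is that an evicted user's next query allocates a fresh UGI, which would again be a new key in any downstream identity-hashed cache, so eviction would need to be coordinated with that cache to fully close the leak.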
We would need to do that to support (1) a per-user temp directory: https://github.com/twitter-forks/presto/blob/twitter-master/presto-hive/src/main/java/com/facebook/presto/hive/HiveWriteUtils.java#L378
@billonahill: Teradata is actively working on supporting Kerberos in the Hive connector. As part of doing this, we get non-Kerberos impersonation basically for free. We have a branch in our fork that we'd love to have somebody else who has been working on this stuff take a look at. I'll open an issue and add a link to our branch there, but in the meantime, you can check it out here: Edited to add: It looks like there's an issue open already: #3380
Thanks for sharing that Eric, that's great. I do have a few comments, but …
Thanks @petroav and @ebd2, Teradata#105 looks good. We should focus on that PR instead of what I propose in this one.
#4867 has been merged, which it sounds like addresses this. If I misunderstood please re-open!
Adding the hive.read-as-query-user flag (default: false). When set to true, Presto will read from HDFS as the user submitting the query. When set to false, the query will read from HDFS as the Presto daemon user.
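Assuming the flag is a Hive connector configuration property, enabling it would look like the following in the Hive catalog's properties file. The file path, connector name, and metastore URI here are illustrative, not from the patch; only hive.read-as-query-user is from this PR.

```properties
# etc/catalog/hive.properties (illustrative)
connector.name=hive-hadoop2
hive.metastore.uri=thrift://metastore.example.com:9083

# Read from HDFS as the user submitting the query
# instead of as the Presto daemon user (default: false)
hive.read-as-query-user=true
```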