-
Notifications
You must be signed in to change notification settings - Fork 3.3k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
HBASE-26666 Add native TLS encryption support to RPC server/client #4125
Conversation
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
cef6e3b
to
4debffc
Compare
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
5996d63
to
8fe9b62
Compare
🎊 +1 overall
This message was automatically generated. |
💔 -1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
LGTM. please give another try and see if we can have a clean build.
[nit] I may miss something but I'm wondered if we have any test or manual that has use HBASE_NETTY_RPCSERVER_TLS_ENABLED=true
? this is not a blocker, and I think adding the ssl header should be as straightforward as it tested in TestX509Util
hbase-server/src/main/java/org/apache/hadoop/hbase/ipc/NettyRpcServer.java
Show resolved
Hide resolved
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall, LGTM. Have you reviewed all latest UT failures? Could those be related?
@@ -297,11 +295,10 @@ public void operationComplete(ChannelFuture future) throws Exception { | |||
established(ch); | |||
} | |||
} | |||
}).channel(); | |||
}).sync().channel(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why do we have sync()
here? It converts the asynchronous connection to synchronous.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
HBase shell is not able to connect without this. I'm sure I've seen errors in RS-Master communication as well, but it works in most cases.
The issue is:
ERROR: Failed contacting masters after 1 attempts.
Exceptions:
java.io.IOException: Call to address=Andors-MacBook-Pro.local:16000 failed on local exception:
java.io.IOException: java.lang.UnsupportedOperationException: unsupported message type: Call (expected: ByteBuf, FileRegion)
My thinking is that connect()
is having a race with sendRequest()
, becaue NettyRpcConnection is trying to establish the connection in lazy way: doesn't do anything until the first request (Call
) comes in. Because of this we have to make sure that the Netty pipeline is in a usable state before letting sendRequest() to push the Call into and seems like sync()
is enough. It waits for the connection to be established before we give back the channel.
This is still not ideal IMHO: we could implement a latch to wait for the SSL handshake to happen before doing anything else - like ZooKeeper does -, but it might be overkill in this case given that it already works fine with sync().
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your explanation certainly makes sense to me. I think keeping the Channel async is probably going to give us some performance gains.
If this only affected the hbase client when TLS was on, it's ok to leave this sync()
for now and come back to it later. However, this is going to affect all clients so we may be introducing a performance regression for this optional feature which isn't good.
You mentioned a latch for when the TLS handshake is done. I vaguely remember a similar latch for the SASL handshake (though, it might have be just a synchronous check rather than asynchronous). It's also OK for us to commit this patch onto the feature branch and address it later. We just have to remember that this would block our merge back to master
.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There is a BufferCallBeforeInitHandler to handle the case for writing rpc call out before initializing the connection.
That's why we can avoid calling sync in this method.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@Apache9 Yep, I saw that class piles up the Calls until it receives the green signal, but for some reason SSL doesn't work if the handler precedes the SSL handler. I'll take another look.
@joshelser I'd like to commit this independently from the JWT feature if it's feasible. The PR is against the master branch intentionally.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In the netty world, I think you should use callback instead of block waiting. If a feature can be implemented by a sync call, I think it should also be possible to be implemented by a ChannelFuture.addListener or something else.
channelRef.compareAndSet(null, connect()); | ||
channel = channelRef.get(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You discarded the return value of compareAndSet
. If there are multiple connection attempts at the same time, one of it is leaked.
if (in.readableBytes() >= 5) { | ||
super.decode(context, in, out); | ||
} else if (in.readableBytes() > 0) { | ||
// It requires 5 bytes to detect a proper ssl connection. In the |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If 5 bytes are required why >= 5
in the previous if?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
That's copied from the base class.
if (TLS_HANDSHAKE_RECORD_TYPE != in.getByte(0)) { | ||
LOG.debug("first byte {} does not match TLS handshake, failing to plaintext", | ||
in.getByte(0)); | ||
handleNonSsl(context); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It reconfigures the pipeline once if we have only 1-4 incoming bytes. So if at some point of time we receive only 2 bytes of data for example, it will remove itself and the ssl handler if I understand it correctly.
I think we should check the first X bytes of incoming data and make the decision to keep or remove ssl handler.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I ended up removing this class. That's something which was needed for ZooKeeper 4LW commands which is obviously not the case for HBase.
@meszibalu Please take another look. I addressed your comments. |
Out of curiosity, does anybody know why HBase not using Epoll (native transport) for client sockets? |
@taklwu Yes, I think an e2e test would make perfect sense here like we did for the JWT. We might also need a combined test for encryption + JWT. Let me work on this. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
111f518
to
5b0cc1c
Compare
@bbeaudreault I've squased and rebased the patch. |
💔 -1 overall
This message was automatically generated. |
What's wrong with this build again? |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
Okay, two +1, one -1 in the latest iteration. What should I do? Shall I restart and hope the best? |
public static final String TLS_CONFIG_CLR = CONFIG_PREFIX + "clr"; | ||
public static final String TLS_CONFIG_OCSP = CONFIG_PREFIX + "ocsp"; | ||
|
||
public static String HBASE_CLIENT_NETTY_TLS_ENABLED = "hbase.client.netty.tls.enabled"; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i think you need to change this and below to static final
-- i think that's why checkstyle is complaining
i asked on slack about the failure. unfortunately I dont really know. one thing I notice is the failure is actually for while waiting, I also noticed 2 small things. one of which is contributing to the warnings above. not quite sure what's up with the Why don't you fix those 2 small things and then we'll give someone in slack a day to check in and if no one has any ideas we can look to merge. |
Not sure what's the real problem, we do not touch the code in the PR here and ImmutableEnumCheck is not a blocker which should not fail compilation I believe. I tried locally, I could build the master branch without any failure. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
🎊 +1 overall
This message was automatically generated. |
We cannot merge this commit, because it conflicts with PR 4624. |
6ca13b4
to
6c9dbd0
Compare
I'm done with the rebase, there was only one conflicting file. Let's hope the best with the build. |
🎊 +1 overall
This message was automatically generated. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I do not think this PR is qualified enough to commit.
When reviewing the first class(NettyRpcConnection), I saw a bunch of changes which I can not understand...
Please hold on committing this...
@@ -66,12 +75,6 @@ | |||
|
|||
/** | |||
* RPC connection implementation based on netty. | |||
* <p/> |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why removing these comments?
channel.pipeline().fireUserEventTriggered(new CallEvent(TIMEOUT, call)); | ||
} | ||
}); | ||
Channel channel = getChannel(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why this change?
} | ||
|
||
private void established(Channel ch) throws IOException { | ||
assert eventLoop.inEventLoop(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why removing this assert? I'm totally confused why we need to change so much assumptions on the implementation of this class? Is it really necessary?
Seems the works need to be done here is very straight forward? Check the configuration, if TLS is enabled, we add a SslHandler to the pipeline, it will handle all the things and we even do not need to care whether the handshake is finished before writing to the channel? Then I really do not understand why we need so much changes on the client side NettyRpcConnection? Just use a new ChannelInitializer to add a SslHandler and then a BufferCallBeforeInitHandler is enough? And I also do not understand why we need to synchronized on the initSSL method, both server side and client side? The method will only be called inside the event loop, there is no race at all... |
} | ||
|
||
public SslContext createNettyJdkSslContext(SSLContext sslContext, boolean isClientSocket) { | ||
return new JdkSslContext(sslContext, isClientSocket, cipherSuitesAsList, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Please use SslContextBuilder instead of hard coded JdkSslContext here.
SslContextBuilder will try to use open ssl if possible, where the performance is much better than the jdk one.
@@ -214,4 +228,21 @@ public int getNumOpenConnections() { | |||
// allChannels also contains the server channel, so exclude that from the count. | |||
return channelsCount > 0 ? channelsCount - 1 : channelsCount; | |||
} | |||
|
|||
private synchronized void initSSL(ChannelPipeline p, boolean supportPlaintext) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Oh, this one is shared in NettyRpcServer, not per connection, so synchronized is needed if we share some common fields which are not thread safe(but seems not...)
I think we should share the SslContext stuff? That means, when constructing the NettyRpcServer, we check whether TLS is enabled, if so, we create the SslContext instance. So then when initializing a Channel, we just need to call the createSSLEngine method, which should be thread safe so we do not need synchronized here.
// Basically we only need to create it once. | ||
private synchronized void initSSL(ChannelPipeline pipeline) | ||
throws X509Exception.SSLContextException { | ||
if (sslContext == null || sslEngine == null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think here we should share the SslContext instance. That means, we'd better create a SslContext instance per RpcClient, and pass it in when creating a NettyRpcConnection, and then just call its createSSLEngine method when initializing a Channel, where we do not need synchronized any more.
💔 -1 overall
This message was automatically generated. |
With this approach I could get the TestTLSIPC pass. But probably we should try to add the SslHandler before connected if we want to support tcp fast open in the future. Will dig more on why adding before connected will hang. And I noticed that, in the X509Util you create a javax's SSLContext, instead of netty's SslContext, this is not good. You should try to use netty's SslContextBuilder to create a SslContext as netty will try its best to make use of open ssl if possible, which has a much better performance. Thanks. |
💔 -1 overall
This message was automatically generated. |
OK, I found the problem. The problem is in BufferCallBeforeInitHandler, we just stop the propagation of the write event, but still propagate the flush event, this messes up the logic in SslHandler so the initialization is hang. Please check the newest commit here I ran TestTLSIPC and TestTlsWithKerberos(why a TLS and then a Tls?) locally, they could both pass. Thanks. |
HBASE-26666 Address bearer token being sent over wire before RPC encryption is enabled
because it's needed for JWT authentication. The bearer token must be transmitted on an encrypted channel.
https://issues.apache.org/jira/browse/ZOOKEEPER-2120
Still working on unit tests and documentation.