-
Notifications
You must be signed in to change notification settings - Fork 280
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Adds a check to skip serialization-deserialization if request is for same node #2765
Adds a check to skip serialization-deserialization if request is for same node #2765
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Confirmed working behavior by running some test suites locally. Waiting for GH runner to run complete test suite.
if(getThreadContext().getTransient(ConfigConstants.OPENDISTRO_SECURITY_REMOTE_ADDRESS) == null) { | ||
getThreadContext().putTransient(ConfigConstants.OPENDISTRO_SECURITY_REMOTE_ADDRESS, request.remoteAddress()); | ||
} | ||
if(skipSecurityIfDualMode) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
fixed indentation to promote better readability, as it was misaligned in the file
@@ -127,6 +128,8 @@ public <T extends TransportResponse> void sendRequestDecorate(AsyncSender sender | |||
final String origCCSTransientMf = getThreadContext().getTransient(ConfigConstants.OPENDISTRO_SECURITY_MASKED_FIELD_CCS); | |||
|
|||
final boolean isDebugEnabled = log.isDebugEnabled(); | |||
final boolean isDirectRequest = HeaderHelper.isDirectRequest(getThreadContext()); | |||
|
|||
try (ThreadContext.StoredContext stashedContext = getThreadContext().stashContext()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Could not yet find a solution to not stash the context for same node.
Tried with these changes:
try (ThreadContext.StoredContext stashedContext = (isDirectRequest ? null: getThreadContext().stashContext())) {
final TransportResponseHandler<T> restoringHandler = isDirectRequest ? handler : new RestoringTransportResponseHandler<T>(handler, stashedContext);
But the test fails with ClusterManagernotDiscoveredException
everytime on test start.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is perhaps good idea to continue stashing from security point of view too. That way, any existing transients assumed to be cleared are not carry forwarded.
For example let us assume current thread context had a,b,c entries and we explicitly passed around only a & b. Now carry forwarding c may be a security concern.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
But the test fails with ClusterManagernotDiscoveredException everytime on test start.
I would guess that this is related to the wrong check for isDirectRequest
. But, yes, I would also recommend to stash anyway and just restore the transients which you want to keep. That would be the most conservative and most minimal change.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we say isLocalNodeRequest
to add more readability? or are we categorizing local node request to direct
channel through out the code-base ?
final String injectedRolesHeader = getThreadContext().getHeader(ConfigConstants.OPENDISTRO_SECURITY_INJECTED_ROLES_HEADER); | ||
final String injectedUserHeader = getThreadContext().getHeader(ConfigConstants.OPENDISTRO_SECURITY_INJECTED_USER_HEADER); | ||
|
||
if(Strings.isNullOrEmpty(userHeader)) { | ||
if(user != null) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Why remove the empty check here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Because for direct requests we don't pass headers (which would then be serialized and put as a transient), rather we directly put the transient object, User, in the thread context via sendMessageDecorate
in SecurityInterceptor and fetch it as an object from thread context here. By doing so we skip serialization and deserialization of direct requests.
@@ -127,6 +128,8 @@ public <T extends TransportResponse> void sendRequestDecorate(AsyncSender sender | |||
final String origCCSTransientMf = getThreadContext().getTransient(ConfigConstants.OPENDISTRO_SECURITY_MASKED_FIELD_CCS); | |||
|
|||
final boolean isDebugEnabled = log.isDebugEnabled(); | |||
final boolean isDirectRequest = HeaderHelper.isDirectRequest(getThreadContext()); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep, that is the wrong end.
Trying to think of a correct check:
In SecurityInterceptor
we need to check whether connection.getNode()
is equal to the local node, similar to how TransportService
is doing it here:
The local node could be retrieved by clusterService.localNode()
. I am not sure, though, whether this is a 100% correct approach, as TransportService
as a slightly more involved way to get the local node. See:
Alternatively, one might try to get ahold of a reference to TransportService
. However, this only works with hacks in a non-Guice context:
final TransportService remoteClusterService, IndicesService indicesService, PitService pitService, ExtensionsManager extensionsManager) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
i originally started out with localNode and connection node comparison, however interpreted the discussion on the GH issue as proposing to use headerhelper and so changed it to that. I have now changed it back to the original comparison.
if(isDirectRequest) { | ||
// if request is going to be handled by same node, we directly put transient value. | ||
getThreadContext().putTransient(ConfigConstants.OPENDISTRO_SECURITY_REMOTE_ADDRESS, transportAddress); | ||
} else { | ||
getThreadContext().putHeader(ConfigConstants.OPENDISTRO_SECURITY_REMOTE_ADDRESS_HEADER, Base64Helper.serializeObject(transportAddress.address())); | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Per entry if-else clauses can be prone to error as it needs matching checks on receiver end. Can we generalize to either use transients vs header for all config entries? That way, there is consistent symmetric code for passing configs.
Something along the lines of
if (isDirect) {
putTransient(remoteAddress..);
putTransient(securityUser..);
putTransient(injectedRoles..);
putTransient(injectedUser..);
} else {
putHeader(remoteAddress..);
putHeader(securityUser..);
putHeader(injectedRoles..);
putHeader(injectedUser..);
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FYI direct !== same node - @DarshitChanpura and I figured this out today after debugging the failing tests:
In OpenSearch, the channelType is a parameter that indicates the type of communication channel used for inter-node communication. There are two types of channels: direct and transport.
direct: This type of channel is used for communication between nodes within the same JVM (Java Virtual Machine). It's used for local node-to-node communication, which means it's used when the nodes are on the same machine or in the same OpenSearch process.
transport: This type of channel is for communication between nodes that are not in the same JVM. It's used for remote node-to-node communication, which means it's used when the nodes are on different machines or in different OpenSearch processes.
In summary, the direct channelType is used for local communication within the same JVM, while the transport channelType is used for remote communication between different JVMs.
@@ -127,6 +128,8 @@ public <T extends TransportResponse> void sendRequestDecorate(AsyncSender sender | |||
final String origCCSTransientMf = getThreadContext().getTransient(ConfigConstants.OPENDISTRO_SECURITY_MASKED_FIELD_CCS); | |||
|
|||
final boolean isDebugEnabled = log.isDebugEnabled(); | |||
final boolean isDirectRequest = HeaderHelper.isDirectRequest(getThreadContext()); | |||
|
|||
try (ThreadContext.StoredContext stashedContext = getThreadContext().stashContext()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It is perhaps good idea to continue stashing from security point of view too. That way, any existing transients assumed to be cleared are not carry forwarded.
For example let us assume current thread context had a,b,c entries and we explicitly passed around only a & b. Now carry forwarding c may be a security concern.
3d72b11
to
d1198d9
Compare
CI is currently failing with following to change introduced in core: opensearch-project/OpenSearch#7465
|
Seems like a stale artifact issue that will be fixed on its own. |
@@ -142,17 +144,15 @@ public <T extends TransportResponse> void sendRequestDecorate(AsyncSender sender | |||
|| k.equals(ConfigConstants.OPENDISTRO_SECURITY_DOC_ALLOWLIST_HEADER) | |||
|| k.equals(ConfigConstants.OPENDISTRO_SECURITY_FILTER_LEVEL_DLS_DONE) | |||
|| k.equals(ConfigConstants.OPENDISTRO_SECURITY_DLS_MODE_HEADER) | |||
|| k.equals(ConfigConstants.OPENDISTRO_SECURITY_DLS_FILTER_LEVEL_QUERY_HEADER) | |||
|| k.equals(ConfigConstants.OPENDISTRO_SECURITY_DLS_FILTER_LEVEL_QUERY_HEADER) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
possible to keep formatting changes separate from this PR?
@@ -127,6 +128,8 @@ public <T extends TransportResponse> void sendRequestDecorate(AsyncSender sender | |||
final String origCCSTransientMf = getThreadContext().getTransient(ConfigConstants.OPENDISTRO_SECURITY_MASKED_FIELD_CCS); | |||
|
|||
final boolean isDebugEnabled = log.isDebugEnabled(); | |||
final boolean isDirectRequest = HeaderHelper.isDirectRequest(getThreadContext()); | |||
|
|||
try (ThreadContext.StoredContext stashedContext = getThreadContext().stashContext()) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
should we say isLocalNodeRequest
to add more readability? or are we categorizing local node request to direct
channel through out the code-base ?
String userHeader = getThreadContext().getHeader(ConfigConstants.OPENDISTRO_SECURITY_USER_HEADER); | ||
|
||
if(userHeader == null) { | ||
if(isDirectRequest) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
can be merge with above if..else..
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Can we add tests to ensure both direct and non-direct paths are explicitly validated?
} | ||
else if(StringUtils.isNotEmpty(injectedRolesString)) { | ||
} else { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do we not intend to retain the existing userHeader == null
check for this else
code block, or are we assuming that this scenario used to arise only for direct requests?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Left a comment for a query I have regarding the changes.
Codecov Report
@@ Coverage Diff @@
## main #2765 +/- ##
=========================================
Coverage 62.31% 62.31%
- Complexity 3337 3342 +5
=========================================
Files 266 266
Lines 19650 19658 +8
Branches 3329 3336 +7
=========================================
+ Hits 12244 12250 +6
Misses 5779 5779
- Partials 1627 1629 +2
|
boolean isSameNodeRequest = false; | ||
try { | ||
isSameNodeRequest = cs.localNode().equals(connection.getNode()); // using DiscoveryNode equals comparison here | ||
} catch (AssertionError e) { | ||
// do nothing | ||
} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This doesn't seem right - which assertion is failing?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure but in my research, I found that in some instances the ClusterState was not initialized (which is accessed via state()) and this caused a bunch of tests to fail. My understanding was that since cs is initiated upon node spin-up, the state info will always be available, but that doesn't seem to be the case.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In order to form the initial cluster state, it is already necessary to exchange transport messages which pass through here. TransportService uses a different approach to retrieve the reference to the local node. As this also needs to pass through the early transport messages, it is likely that here the local node information is already available:
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@nibix Apologies, but I'm not super clear on whether you want the implementation to be changed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Well, I think catching an Error, especially an AssertionError is a code smell. It is problematic for two reasons:
- You could be also catching other AssertionErrors which are more serious and not meant to be ignored.
- Assertions are usually only enabled in dev environments. However, that does not mean that the error will just go away in a prod environment. It is possible that it is just deferred to a different code place which then bails with an NPE or similar thing. Thus, code that works in dev could break in prod.
Thus, I would recommend to use an approach which does not require catching AssertionErrors.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Hm, I guess that will be also the very early initialization phase when localNode in TransportService is null, right?
I would still believe that TransportService will know earlier about localNode than ClusterService. Also, if we use the localNode from TransportService, we could use a simple null check instead of catching the AssertionError.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Will the null check not defeat the purpose? How will it be different from catching an AssertionError when cluster state is not accessible from this class?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think it will defeat the purpose. With the null check, you have full control, you know exactly what you are checking, that is the localNode value.
When catching an AssertionError, you have less control:
- The source of the assertion error can be also another code place, thus by ignoring it you could obscure other severe issues.
- Also, as discussed above the assertion is not performed on prod systems, which gives you a QA liability if you are relying on it.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
thank you for the explanation!. I'll update the condition here. If you have any resources handy that I can take a look for assertions and their usage in Prod vs staging, it would be helpful for me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I have not found good resources on this so far, but I am still looking. I can hopefully update you later on this.
src/main/java/org/opensearch/security/transport/SecurityInterceptor.java
Outdated
Show resolved
Hide resolved
a9e31cf
to
de26a79
Compare
@nibix After updating GuiceHolder to contain an instance of TransportService and use that instance to fetch the localNode information, I'm getting different results:
cs.localNode() (the localNode value from ClusterState) matches the value of the node the connection originated from (resulting in passing tests), but localNode value (retrieved from TransportService instance) is not matching connection.getNode() resulting in test failures. |
Oh, hm, interesting. I think I have an idea where this is coming from. Am I right in the assumption that you get that error when testing in a junit based test with a cluster embedded in the JVM? If yes: This indeed collides with the GuiceHolder. Unfortunately, the GuiceHolder is at closer consideration a quite dirty hack to get ahold of dependencies which area only available by deprndecy injection. It uses for this a static attribute in the SecurityPlugin class. While in an JVM embedded cluster the nodes and also the instances of the SecurityPlugin can coexist very well together, this static class is a collision. The different nodes write in an undetermined fashion references to their own TransportService into it. That's why you see other results there, at least if my assumption is correct. Of course the more important part of the consideration would be "how can this be fixed". To be honest I also don't know at the moment. I will have to look at the code and think about it for a while. 😔 |
Yes, I saw this while running one of the failing unit tests.
If that is indeed the case, then do you think it might be worthwhile to look into using cluster state() as an alternative for getting localNode reference, cause that seems to be available at almost all time (except during cluster bootstrap using it Zen Discover mechanism, although I'm not a 100% sure whether the request will flow through security with some cluster state info during bootstrap).
Thank you for helping with this. |
Signed-off-by: Craig Perkins <cwperx@amazon.com>
Signed-off-by: Craig Perkins <cwperx@amazon.com>
Update logic in receiver to look for either transient headers or non-transient headers
@@ -142,7 +141,31 @@ protected void messageReceivedDecorate( | |||
} | |||
|
|||
// bypass non-netty requests | |||
if (channelType.equals("direct")) { | |||
if (getThreadContext().getTransient(ConfigConstants.OPENDISTRO_SECURITY_USER) != null |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This logic in the receiver was simplified using @krishna-ggk 's suggestion here: #2765 (comment)
The receiver does not need any logic to determine where the request came from, if the transient headers are present then there is no need to deserialize. If the serialized (non-transient) headers are present then it will deserialize them.
@opensearch-project/security Would you please re-review this? |
if (origUser != null) { | ||
getThreadContext().putHeader(ConfigConstants.OPENDISTRO_SECURITY_USER_HEADER, Base64Helper.serializeObject(origUser)); | ||
// if request is going to be handled by same node, we directly put transient value as the thread context is not going to be | ||
// stah. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
// stah. | |
// stashed. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good to me, one comment where I think we want to remove a comment from above some code you changed.
@@ -272,58 +284,10 @@ protected void messageReceivedDecorate( | |||
|
|||
// network intercluster request or cross search cluster request | |||
// CS-SUPPRESS-SINGLE: RegexpSingleline Used to allow/disallow TLS connections to extensions | |||
if (HeaderHelper.isInterClusterRequest(getThreadContext()) | |||
if (!(HeaderHelper.isInterClusterRequest(getThreadContext()) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We either need to update the comment above this or remove the exclamation point I think.
@hardik-k-shah @krishna-ggk @parasjain1 Any final thoughts? |
Merging now. Any comments can be addressed in follow-ups. |
The backport to
To backport manually, run these commands in your terminal: # Fetch latest updates from GitHub
git fetch
# Create a new working tree
git worktree add .worktrees/backport-2.x 2.x
# Navigate to the new working tree
cd .worktrees/backport-2.x
# Create a new branch
git switch --create backport/backport-2765-to-2.x
# Cherry-pick the merged commit of this pull request and resolve the conflicts
git cherry-pick -x --mainline 1 8d636c4ea366fca88f28c3226d3997f91992f55e
# Push it to GitHub
git push --set-upstream origin backport/backport-2765-to-2.x
# Go back to the original working tree
cd ../..
# Delete the working tree
git worktree remove .worktrees/backport-2.x Then, create a pull request where the |
…same node (opensearch-project#2765) Signed-off-by: Darshit Chanpura <dchanp@amazon.com> Signed-off-by: Craig Perkins <cwperx@amazon.com> Co-authored-by: Craig Perkins <cwperx@amazon.com> (cherry picked from commit 8d636c4)
… is for same node (opensearch-project#2765) (opensearch-project#2973)" This reverts commit ddbe517.
…same node (opensearch-project#2765) Signed-off-by: Darshit Chanpura <dchanp@amazon.com> Signed-off-by: Craig Perkins <cwperx@amazon.com> Co-authored-by: Craig Perkins <cwperx@amazon.com> (cherry picked from commit 8d636c4)
… is for same node (opensearch-project#2765) (opensearch-project#2973)" This reverts commit ddbe517.
… is for same node (opensearch-project#2765) (opensearch-project#2973)" This reverts commit ddbe517. Signed-off-by: Darshit Chanpura <dchanp@amazon.com>
… is for same node (opensearch-project#2765)" This reverts commit 8d636c4.
Description
Category:
Bug Fix
Why these changes are required?
To improve performance of large requests handled on same node
Issues Resolved
Testing
Current test suite should pass.
Check List
By submitting this pull request, I confirm that my contribution is made under the terms of the Apache 2.0 license.
For more information on following Developer Certificate of Origin and signing off your commits, please check here.