Fix security index auto-create and state recovery race #39582
Conversation
Pinging @elastic/es-security
This looks good to me, the build failures are because of an ambitious line that wanted to grow to 141 chars. I'd prefer to revisit this Monday morning with a clear head though.
This can possibly happen on rolling upgrades from 6.7 to 7.x so it's important this makes it to 7.0. I'll add it as a blocker in https://github.com/elastic/dev/issues/1141 but given the schedule I think this can wait for more and fresher 👀 on Monday.
@@ -281,7 +290,10 @@ private static Version readMappingVersion(String indexName, MappingMetaData mapp
     */
    public void checkIndexVersionThenExecute(final Consumer<Exception> consumer, final Runnable andThen) {
        final State indexState = this.indexState; // use a local copy so all checks execute against the same state!
        if (indexState.indexExists && indexState.isIndexUpToDate == false) {
        if (indexState.concreteIndexName == null) {
use stateRecovered() == false?
I went with your suggestion, and then reverted. I have kept on checking indexState because this is the "frozen" index state that is used in all if-tests.
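The "frozen" state idiom referred to here can be illustrated with a minimal sketch (class and field names are hypothetical, not the actual SecurityIndexManager code): reading the volatile field once into a local variable guarantees that all subsequent checks see the same snapshot, even if another thread swaps the state in between.

```java
// Minimal sketch of the "frozen state" idiom (illustrative names only).
class IndexStateHolder {
    static final class State {
        final boolean indexExists;
        final String concreteIndexName; // null until cluster state is recovered
        State(boolean indexExists, String concreteIndexName) {
            this.indexExists = indexExists;
            this.concreteIndexName = concreteIndexName;
        }
    }

    private volatile State indexState = new State(false, null);

    void update(State newState) {
        this.indexState = newState;
    }

    String describe() {
        // Local copy: concurrent updates cannot make the checks below
        // observe two different versions of the state.
        final State indexState = this.indexState;
        if (indexState.concreteIndexName == null) {
            return "state not recovered";
        }
        return indexState.indexExists ? "exists:" + indexState.concreteIndexName : "missing";
    }
}
```

Without the local copy, the field could change between the null check and the later reads, reintroducing exactly the kind of race this PR is fixing.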
@@ -297,14 +309,17 @@ public void checkIndexVersionThenExecute(final Consumer<Exception> consumer, fin
    public void prepareIndexIfNeededThenExecute(final Consumer<Exception> consumer, final Runnable andThen) {
        final State indexState = this.indexState; // use a local copy so all checks execute against the same state!
        // TODO we should improve this so we don't fire off a bunch of requests to do the same thing (create or update mappings)
        if (indexState.indexExists && indexState.isIndexUpToDate == false) {
        if (indexState.concreteIndexName == null) {
use stateRecovered() == false?
ditto.
Thanks for the review and care for this one @jkakavas!
@@ -161,14 +168,17 @@ public void clusterChanged(ClusterChangedEvent event) {
        final Version mappingVersion = oldestIndexMappingVersion(event.state());
        final ClusterHealthStatus indexStatus = indexMetaData == null ? null :
            new ClusterIndexHealth(indexMetaData, event.state().getRoutingTable().index(indexMetaData.getIndex())).getStatus();
        // index name non-null iff state recovered
        final String concreteIndexName = indexMetaData == null ? INTERNAL_SECURITY_INDEX : indexMetaData.getIndex().getName();
Normally, I would've opted for a new state flag isStateRecovered, but given the amount of flags already existing I think concreteIndexName is a fitting substitute, because the index name is also semantically (not only practically) linked to the state recovery status - it's easy to reason that we can't name the security index until the state containing all the index names has been recovered.
I would prefer a new flag.

I think it's too easy to miss that the concreteIndexName variable has additional, non-obvious semantics, and change the default constructor to assign INTERNAL_SECURITY_INDEX to that field. It's really random that I didn't do that when I introduced the index name field.

As the code stands, you are hanging this protection off the fact that we:
- happen to set that variable to null at construction,
- don't (currently) assign a value to it until we get past the STATE_NOT_RECOVERED_BLOCK,
- always set it to a non-null value once we've recovered.

I don't think you can be sure that we will never change this implementation to behave differently and break one of those pre-conditions.

If we're worried about the number of fields here, we can switch all the booleans to a BitSet, but I don't think it's actually necessary to re-use fields for a purpose other than what they were intended for.
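The reviewer's preference could look roughly like the sketch below: an explicit isStateRecovered flag carried in the immutable state snapshot, so the recovery check no longer piggybacks on concreteIndexName being null. All names here are illustrative assumptions, not the actual PR code.

```java
// Sketch: carry the recovery status as its own flag instead of overloading
// the meaning of concreteIndexName == null. Illustrative names only.
final class SecurityIndexState {
    final boolean isStateRecovered;
    final boolean indexExists;
    final String concreteIndexName;

    SecurityIndexState(boolean isStateRecovered, boolean indexExists, String concreteIndexName) {
        this.isStateRecovered = isStateRecovered;
        this.indexExists = indexExists;
        this.concreteIndexName = concreteIndexName;
    }

    boolean stateRecovered() {
        // An explicit flag survives refactors that change the default value
        // or assignment order of concreteIndexName.
        return isStateRecovered;
    }
}
```

The cost is one more field, but the pre-conditions the reviewer lists above no longer need to hold implicitly for the guard to stay correct.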
If you add a blocker to the description on a release issue, please also make a comment on the issue explaining that you added it so that everyone gets notified.
This PR seems to me to be a big change that looks like a small change.
Based on the fact that we're past beta and feature freeze, my inclination is that we should simply fix the race condition and continue to let requests fail if they're made before recovery completes.
I could be convinced that it's worth doing if we've taken the necessary steps to be sure this is safe, but I don't think we have.
        // point in time iterator
        final Iterator<BiConsumer<State, State>> stateListenerIterator = stateChangeListeners.iterator();
        while (stateListenerIterator.hasNext()) {
            stateListenerIterator.next().accept(previousState, newState);
As far as I can tell this change doesn't actually do anything. A for loop is just syntactic sugar over iterator() and next(); you haven't changed the semantics. The stateListenerIterator is still backed by the same List and is at risk of ConcurrentModificationException issues if a listener is added/removed.

Assuming that's the problem you're trying to solve, then the options come down to:
1. synchronize whenever we use the list
2. switch the list implementation to CopyOnWriteArrayList
3. make a copy of the list before iterating over it

(2) is probably the best option. I don't think we add listeners very often, so making a list copy each time is probably OK, but you'd need to audit the places we modify the list.
"As far as I can tell this change doesn't actually do anything."

It is a rewrite that unwraps the syntactic sugar, because I find it confusing to use that when the list is concurrently modified. The list is already of the CopyOnWriteArrayList type.
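CopyOnWriteArrayList iterators operate on a snapshot of the backing array taken when iterator() is called, so concurrent adds and removes never throw ConcurrentModificationException; they are simply invisible to an in-flight iteration. A small self-contained demonstration (not code from the PR):

```java
import java.util.List;
import java.util.concurrent.CopyOnWriteArrayList;

public class CowIterationDemo {
    public static int iterateWhileMutating() {
        List<String> listeners = new CopyOnWriteArrayList<>();
        listeners.add("a");
        listeners.add("b");
        int seen = 0;
        for (String l : listeners) {      // iterator snapshot is taken here
            listeners.add(l + "-extra");  // safe: no CME, not seen by this loop
            seen++;
        }
        return seen; // only the two original elements were visited
    }
}
```

This is why the for-each and the explicit while/iterator rewrite in the diff above behave identically here: both iterate the same snapshot.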
        if (indexState.indexExists && indexState.isIndexUpToDate == false) {
        if (indexState.concreteIndexName == null) {
            // state not yet recovered
            delayUntilStateRecovered(consumer, () -> checkIndexVersionThenExecute(consumer, andThen));
This worries me.
We're suddenly adding in a queue of pending tasks to run when the recovery block is removed, but I don't see any analysis on how big that queue could get, and what the impact is on the thread pool when it happens.
We don't do that now: we don't queue if the security index is red, and we currently fail if the index is not recovered.
There's not enough information in this PR to know whether or not it's a sensible thing to do, and how we'd protect ourselves from getting into trouble with it (e.g. too many queued tasks, or delayed read + writes because recovery was slow).
From a usability point of view it's a nice idea, but it feels like we're jumping in without thinking it through.
Okay, understood! I just wonder if there is some alternative for "delaying requests" that would be acceptable beyond feature freeze?
Alright, I will refit it to simply reject any request before state recovered. I understand how this might seem like a big change in the end.
What if the implementation had been throwing listeners over to the generic threadpool? We don't have any queuing rules or limits for that practice, and this implementation is not so very different from it. I am not being pushy, but I obviously pondered the alternatives, and I wonder what a sketch implementation for the delay-requests option would look like that would be LGTM'd after feature freeze, if any. Again, the goal is for me to adjust, so that next time I propose the alternative with the best chance of being approved. Thanks for the feedback Tim!
I'll walk through my thinking. Bear in mind it was about midnight for me, and we're trying to get this sorted quickly, so I didn't have time to do any experiments and I defaulted to the conservative approach. If we had more time to throw around ideas and test things we might be able to find something we were confident in.

Concern 1 - unbounded task queue

Problem A: Is that a reasonable use of the thread pool? Will it starve resources off of other tasks?

Possible Solution: I think this would be better if it had its own queue rather than reusing the listener list. That queue could be bounded, and we could reject new tasks if that queue got too long. It would also mean that we could pull the

Concern 2 - unbounded delay

Problem A: Is it reasonable to hang on to a task for that long? Is it appropriate to still execute it if it's been queued for a minute or longer? How do we know it's still relevant & safe?

Possible Solution: The queue of tasks should have some time limit attached. If we decide to queue them up while the cluster is recovering, they need to have some expiry that eventually responds with SERVICE UNAVAILABLE.

I think that would work, but it raises the question of whether it's really worth it. Since the PR doesn't indicate that those concerns have been considered, and it doesn't really include tests for any of them (the 1 new test covers a very simplistic scenario), and it's fixing an issue that we only found a few days ago and haven't had time to discuss and really think about, it just seemed too risky and a bit rushed.

I think we could come up with a solution that's stable and safe enough. But I think it would need to be more complex than the one proposed here, and at that point I'd question whether that's where we want to spend our engineering efforts & whether the ongoing maintenance cost is justified.
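The two "possible solutions" sketched in that comment (a bounded queue, plus a per-task expiry that eventually answers SERVICE UNAVAILABLE) could look roughly like this. The class name, capacity, timeout, and injected clock are all illustrative assumptions, not code from the PR:

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.function.LongSupplier;

// Sketch: bounded queue of tasks pending cluster-state recovery, with expiry.
class PendingRecoveryQueue {
    private static final class Pending {
        final Runnable task;
        final Runnable onUnavailable; // e.g. respond with HTTP 503
        final long enqueuedMillis;
        Pending(Runnable task, Runnable onUnavailable, long enqueuedMillis) {
            this.task = task;
            this.onUnavailable = onUnavailable;
            this.enqueuedMillis = enqueuedMillis;
        }
    }

    private final Deque<Pending> queue = new ArrayDeque<>();
    private final int maxSize;
    private final long maxAgeMillis;
    private final LongSupplier clock; // injectable for tests

    PendingRecoveryQueue(int maxSize, long maxAgeMillis, LongSupplier clock) {
        this.maxSize = maxSize;
        this.maxAgeMillis = maxAgeMillis;
        this.clock = clock;
    }

    // Bounded: returns false so the caller can answer 503 immediately when full.
    synchronized boolean enqueue(Runnable task, Runnable onUnavailable) {
        if (queue.size() >= maxSize) {
            return false;
        }
        queue.add(new Pending(task, onUnavailable, clock.getAsLong()));
        return true;
    }

    // Expire tasks that waited too long; run periodically or on cluster-state updates.
    synchronized int expireOverdue() {
        final long now = clock.getAsLong();
        int expired = 0;
        while (!queue.isEmpty() && now - queue.peekFirst().enqueuedMillis > maxAgeMillis) {
            queue.pollFirst().onUnavailable.run();
            expired++;
        }
        return expired;
    }

    // Drain and run everything once the state-not-recovered block is lifted.
    synchronized int drainAndRun() {
        int run = 0;
        Pending p;
        while ((p = queue.pollFirst()) != null) {
            p.task.run();
            run++;
        }
        return run;
    }
}
```

Even this sketch shows why the reviewer calls the safe version "more complex than the one proposed here": it needs a capacity policy, an expiry policy, and a drain trigger, each of which would need tests.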
Certainly it's not something that I think makes sense to rush in just because there's a deadline coming up, when we can fix the blocking bug much more simply.
Thank you for the write-up @tvernum! I appreciate it!
LGTM
Previously, the security index could be wrongfully recreated. This might happen if the index was interpreted as missing, as in the case of a fresh install, while the index actually existed but the cluster state had not yet been recovered. This fix returns HTTP SERVICE_UNAVAILABLE (503) for requests that try to write to the security index before the state has been recovered.
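The merged fix amounts to a guard at the front of every security-index operation. A minimal sketch of that shape (hypothetical names; the real code lives in SecurityIndexManager and uses Elasticsearch's own exception types), failing the consumer with an "unavailable" error while the state is unrecovered:

```java
import java.util.function.Consumer;

// Sketch: reject work with an "unavailable" error until cluster state recovers.
class RecoveryGuard {
    static final class UnavailableException extends RuntimeException {
        UnavailableException(String msg) { super(msg); } // would map to HTTP 503
    }

    private volatile String concreteIndexName; // null iff state not recovered

    void onStateRecovered(String indexName) {
        this.concreteIndexName = indexName;
    }

    void checkStateThenExecute(Consumer<Exception> consumer, Runnable andThen) {
        final String name = this.concreteIndexName; // frozen local copy
        if (name == null) {
            consumer.accept(new UnavailableException("cluster state not recovered yet, try again later"));
        } else {
            andThen.run();
        }
    }
}
```

Callers get a retryable 503 instead of the old behaviour, where a not-yet-recovered state could be mistaken for a fresh install and trigger index auto-creation.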
EDITED:
The security index could be wrongfully recreated because it was interpreted as missing (as in the case of a fresh install), while it actually existed in the cluster state, which had not yet been recovered.
This fix delays all requests (reads and writes) for the security index until the state has been recovered, as there is no other response we could return apart from "try-again-later". A cluster with a "state-not-recovered" block is not really usable, therefore this state should be transient; hence the option to delay the security requests instead of rejecting them.
Excerpt from https://kibana-ci.elastic.co/job/elastic+kibana+pull-request/3156/JOB=x-pack-ciGroup5,node=immutable/consoleFull
cc @jbudz