Interaction between flood stage and system indices #64251
Labels: :Core/Infra/Core, >enhancement, :Security/Authentication, Team:Core/Infra, Team:Security
When a node hits the flood stage watermark, all indices on that node get the index.blocks.read_only_allow_delete setting applied with a value of true. This currently applies to system indices as well as data indices. When this happens, system operations that require writes will begin to fail, which is acceptable for certain non-critical actions but for critical actions we need to consider whether failure is the right thing to do. In an effort to reduce the scope of actions that could bypass the flood stage read only block, I have attempted to enumerate what I believe we should consider as critical operations that would otherwise fail.
Critical Actions
Authentication
One thing that would fail once the flood stage is hit is the ability to authenticate using SAML, OpenID Connect, or delegated PKI authentication and, to a certain extent, Kerberos authentication. SAML, OpenID Connect, and delegated PKI authentication result in the generation of an access token and a refresh token that are used for subsequent access to Elasticsearch; if the token document cannot be written to the security index then the authentication will fail. For Kerberos, Elasticsearch itself does not require the use of tokens for subsequent authentication, but there is a significant performance impact if tokens are not used. Kerberos authentication through Kibana requires the token service to be enabled, so to users accessing Elasticsearch through Kibana it will appear that they cannot authenticate using Kerberos.
A workaround could be to use the built-in users or a file realm user. However, built-in users can be disabled via the API, so unless we allow enabling/disabling of a user to bypass the watermark we cannot rely on built-in users. Additionally, there is a setting that completely disables our reserved realm, which contains the built-in users, and that is another reason why we should not rely on them being available. A file-based realm is our recommendation for recovery, but we do not require one to be enabled and we should not make recovering from being over the flood stage more difficult than it needs to be.
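To make the dependency concrete, here is a minimal sketch of which login paths described above need a write to the security index. It is a simplified model, not Elasticsearch code: the class `LoginUnderFloodStage`, the `LoginPath` enum, and `canLogIn` are hypothetical names, and the classification only reflects what this issue states (token-producing paths need a write, a file realm user does not).

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical model for illustration only; none of these names exist in Elasticsearch.
class LoginUnderFloodStage {
    enum LoginPath { SAML, OPENID_CONNECT, DELEGATED_PKI, KERBEROS_DIRECT, KERBEROS_VIA_KIBANA, FILE_REALM }

    // Paths that, per the description above, must store a token document
    // in the security index before the login can succeed.
    static final Set<LoginPath> NEEDS_TOKEN_WRITE = EnumSet.of(
        LoginPath.SAML,
        LoginPath.OPENID_CONNECT,
        LoginPath.DELEGATED_PKI,
        LoginPath.KERBEROS_VIA_KIBANA);

    // A login works if the security index accepts writes, or if the path needs no write at all.
    static boolean canLogIn(LoginPath path, boolean securityIndexWritable) {
        return securityIndexWritable || !NEEDS_TOKEN_WRITE.contains(path);
    }

    public static void main(String[] args) {
        // Simulate the security index being read-only because of the flood stage block.
        for (LoginPath path : LoginPath.values()) {
            System.out.println(path + " with read-only security index: " + canLogIn(path, false));
        }
    }
}
```

In this model only direct Kerberos and the file realm keep working once the block is applied, which is why the file realm is the recovery path of last resort.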
Credential Invalidation / Logout
If the security system index becomes read only, invalidation of API keys and tokens fails. We should do our best to keep these operations available, as they may be needed to stop an influx of data that is pushing the cluster to the flood stage uncontrollably.
SAML and OpenID Connect logout also need the ability to write data to an index as the tokens used are invalidated as part of the logout operation.
Disabling a user
Along the same lines as above, it may become necessary to disable a user temporarily while attempting to get a cluster back up and running, as a means to stop data from coming in until the cluster can be rebalanced and given any additional resources that may be needed.
Identity Provider operations
There are probably some actions within this plugin that we may want to allow bypassing a watermark, but I am not familiar enough with the details of this to truly provide a recommendation. @tvernum @jkakavas any thoughts?
Proposal
I'd like to propose that we allow a system index plugin to opt in actions that would be allowed to bypass the flood stage so that they can write data to an index. I've only identified security components as those that would bypass the flood stage (as of writing) and currently believe that the Security plugin would opt in the following transport actions:
An item worth consideration is a limit on how far we should allow critical operations to push past the flood stage. I don't think we should allow critical operations to run the disk out of space, but if the configuration uses byte values, how far past the flood stage do we allow the critical operations to go?
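As a rough illustration of the proposal and the open question about a limit, the sketch below combines an opt-in set of actions with a cap on how far those actions may dip below the flood stage. Everything in it is an assumption made for illustration: the class `FloodStageBypass`, the `maxOvershootBytes` cap, and the `example:` action names are hypothetical, and the flood stage is modeled as a minimum number of free bytes, which is how byte-valued watermarks are expressed.

```java
import java.util.Set;

// Hypothetical sketch of an opt-in bypass with a hard limit; not an existing Elasticsearch API.
class FloodStageBypass {
    private final Set<String> optedInActions;  // action names a plugin has opted in (hypothetical)
    private final long floodStageFreeBytes;    // flood stage as a minimum amount of free disk space
    private final long maxOvershootBytes;      // assumed cap on how far critical writes may go below it

    FloodStageBypass(Set<String> optedInActions, long floodStageFreeBytes, long maxOvershootBytes) {
        if (maxOvershootBytes < 0 || maxOvershootBytes > floodStageFreeBytes) {
            throw new IllegalArgumentException("overshoot must be between 0 and the flood stage value");
        }
        this.optedInActions = Set.copyOf(optedInActions);
        this.floodStageFreeBytes = floodStageFreeBytes;
        this.maxOvershootBytes = maxOvershootBytes;
    }

    // Decide whether a write of the given size may proceed with the given free space.
    boolean allowWrite(String action, long freeBytes, long writeSizeBytes) {
        long freeAfter = freeBytes - writeSizeBytes;
        if (freeAfter >= floodStageFreeBytes) {
            return true; // free space stays at or above the flood stage: no block applies
        }
        if (!optedInActions.contains(action)) {
            return false; // ordinary writes stay blocked once past the flood stage
        }
        // Opted-in actions may continue, but only down to the hard limit.
        return freeAfter >= floodStageFreeBytes - maxOvershootBytes;
    }

    public static void main(String[] args) {
        FloodStageBypass bypass =
            new FloodStageBypass(Set.of("example:security/token/invalidate"), 1_000L, 200L);
        System.out.println(bypass.allowWrite("example:data/write/bulk", 1_000L, 100L));
        System.out.println(bypass.allowWrite("example:security/token/invalidate", 1_000L, 100L));
        System.out.println(bypass.allowWrite("example:security/token/invalidate", 850L, 100L));
    }
}
```

A fixed byte cap is only one possible answer to the question above; a percentage of the flood stage value or a per-action budget would fit the same check.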