Perform more startup consistency checks before writing anything to disk #44624
Comments
Pinging @elastic/es-distributed
Pinging @elastic/es-core-infra
I agree this is something important we should aim for, but there are some caveats we need to consider regarding the keystore. The keystore is loaded very early in startup, before we have even installed the security manager. This is important, as our security policy only allows reading from the config directory, not writing. We could add this write permission, but that doesn't seem worth it, and would mean any plugin could overwrite the keystore. The important thing about security manager installation is that it happens before we have loaded any plugin code. Can the checks on node metadata and cluster metadata be done independently of any services being loaded?
It's going to be tricky to do much with the cluster metadata before loading plugins. Do we need to write to the keystore at all during startup? Could we instead move the responsibility for upgrading an old-format keystore to the That said, on reflection the keystore might not matter so much here anyway because it's something that a user can create again themselves. It could be enough to emit a more actionable error message telling them to delete and re-create it if it can't be loaded.
This commit changes the version bounds of keystore reading to give better error messages when a user has a too new or too old format. relates elastic#44624
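As a rough illustration of the kind of version-bound check this commit message describes, here is a minimal sketch. It is not the actual Elasticsearch keystore code: the class name `KeystoreFormatCheck`, the constants `MIN_FORMAT_VERSION` and `CURRENT_FORMAT_VERSION`, their values, and the message wording are all assumptions made for the example.

```java
final class KeystoreFormatCheck {

    // Hypothetical bounds; the real format versions are not stated in this thread.
    static final int MIN_FORMAT_VERSION = 1;
    static final int CURRENT_FORMAT_VERSION = 4;

    /**
     * Rejects a keystore whose format version is outside the supported range,
     * with a message that tells the user which direction the mismatch is in.
     */
    static void checkFormatVersion(int onDiskVersion) {
        if (onDiskVersion < MIN_FORMAT_VERSION) {
            throw new IllegalStateException(
                "keystore format version [" + onDiskVersion + "] is older than the minimum supported version ["
                    + MIN_FORMAT_VERSION + "]; it was probably written by a release that is too old to upgrade from");
        }
        if (onDiskVersion > CURRENT_FORMAT_VERSION) {
            throw new IllegalStateException(
                "keystore format version [" + onDiskVersion + "] is newer than the latest supported version ["
                    + CURRENT_FORMAT_VERSION + "]; it was probably written by a newer release, so a downgrade may have been attempted");
        }
    }

    public static void main(String[] args) {
        checkFormatVersion(CURRENT_FORMAT_VERSION); // within bounds: no exception
        try {
            checkFormatVersion(CURRENT_FORMAT_VERSION + 1);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

Distinguishing "too old" from "too new" lets the message hint at the likely cause instead of reporting a generic read failure.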
If the keystore does not exist, we auto-create it. This is because we always need the keystore.seed value, and do not want archive users to need additional setup before running Elasticsearch. We also do the format upgrade, as you mention, which is again important so as not to require additional setup on upgrade, and would not change the difficulty in downgrading.
I agree. I opened #46291
We made progress towards fixing this in #50907, which delays writing the node metadata file until after validating and upgrading the cluster metadata. Unfortunately that's not enough: #42489 also moves the contents of the data path around in a way that makes it incompatible with 7.x, and does so before looking at the cluster metadata.
Today when a node starts up after an upgrade it might write upgraded versions of at least these separate structures to disk:
None of these have forwards-compatible representations, and all of them are loaded, checked, and then rewritten independently. This can potentially get a node completely stuck in an upgrade:
- If the node metadata file is invalid (e.g. comes from a version that is too old to support an in-place upgrade) then we do not discover this until after upgrading the keystore to the latest version. This version of the node cannot start up due to the invalid node metadata file, but an attempt to downgrade to the previous working version will also fail because of the upgraded keystore.
- If the cluster metadata is invalid (e.g. contains an index from an unsupported version) then we do not discover this until after upgrading the keystore and the node metadata files to the latest versions. Again, this version of the node cannot start up due to the invalid cluster metadata, but an attempt to downgrade to the previous working version will also fail because of the upgraded keystore and node metadata files.
One common way to end up in this situation is to upgrade without first getting a clean bill of health from the upgrade assistant.
We can make this experience better by performing more consistency checks before writing anything to disk at startup, to avoid blocking a subsequent downgrade in cases where the upgrade is obviously infeasible.
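The idea can be sketched as a two-phase startup: a read-only pass that checks every structure, followed by a write pass that runs only if the first pass found no problems. This is a minimal sketch and not Elasticsearch code: `StartupUpgradeSketch`, `Structure`, `validate` and `upgradeAll` are hypothetical names, and an in-memory integer stands in for each on-disk format version.

```java
import java.util.ArrayList;
import java.util.List;

final class StartupUpgradeSketch {

    /** Hypothetical stand-in for one on-disk structure, e.g. the keystore or the node metadata file. */
    static final class Structure {
        final String name;
        int onDiskVersion;              // stands in for the format currently on disk
        final int minUpgradableVersion; // oldest format that can be upgraded in place

        Structure(String name, int onDiskVersion, int minUpgradableVersion) {
            this.name = name;
            this.onDiskVersion = onDiskVersion;
            this.minUpgradableVersion = minUpgradableVersion;
        }
    }

    /** Phase 1: read-only. Collects every problem found and writes nothing. */
    static List<String> validate(List<Structure> structures, int currentVersion) {
        List<String> problems = new ArrayList<>();
        for (Structure s : structures) {
            if (s.onDiskVersion < s.minUpgradableVersion) {
                problems.add(s.name + " is at version " + s.onDiskVersion + ", too old to upgrade in place");
            } else if (s.onDiskVersion > currentVersion) {
                problems.add(s.name + " is at version " + s.onDiskVersion + ", newer than this node supports");
            }
        }
        return problems;
    }

    /** Phase 2: rewrites the structures, but only after all of them have passed phase 1. */
    static void upgradeAll(List<Structure> structures, int currentVersion) {
        List<String> problems = validate(structures, currentVersion);
        if (!problems.isEmpty()) {
            // Nothing has been rewritten yet, so the previous version can still start.
            throw new IllegalStateException("refusing to upgrade: " + String.join("; ", problems));
        }
        for (Structure s : structures) {
            s.onDiskVersion = currentVersion; // stands in for rewriting the file in the new format
        }
    }

    public static void main(String[] args) {
        List<Structure> structures = List.of(
            new Structure("keystore", 4, 3),
            new Structure("node metadata", 4, 3),
            new Structure("cluster metadata", 1, 3)); // too old to upgrade
        try {
            upgradeAll(structures, 5);
        } catch (IllegalStateException e) {
            System.out.println(e.getMessage());
        }
    }
}
```

In this sketch the keystore and node metadata are left at their old versions when the cluster metadata check fails, which is the property that keeps a downgrade possible.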