Handle Upgrades and Alloc.TaskResources modification #6922
Do we need this check anymore? It seems that if AllocatedResources is somehow `nil` here, we'll end up with an incomplete task environment, which would be harder to detect in testing than a panic.
Sadly we don't have a logger or error return value here, so our options are limited. If you think risking an incomplete environment is better than risking a panic, let's just add that as a comment here, as the code will look kind of strange down the road.
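To make the trade-off concrete, here is a minimal sketch of a guard that carries such a comment. The types and the `buildTaskEnv` function are simplified, hypothetical stand-ins and not the actual taskenv code; only the shape of the nil check and its explanatory comment is the point.

```go
package main

import (
	"fmt"
	"strconv"
)

// Hypothetical, simplified stand-ins for the real alloc structs.
type AllocatedTaskResources struct {
	CPUShares int64
	MemoryMB  int64
}

type AllocatedResources struct {
	Tasks map[string]*AllocatedTaskResources
}

type Allocation struct {
	ID                 string
	AllocatedResources *AllocatedResources
}

// buildTaskEnv is a hypothetical helper. Like the code under review, it has
// no logger and no error return, so it cannot report a malformed alloc.
func buildTaskEnv(alloc *Allocation, task string) map[string]string {
	env := map[string]string{"NOMAD_ALLOC_ID": alloc.ID}

	// Deliberate guard: a nil AllocatedResources yields an incomplete
	// environment instead of a panic. Callers that validate the alloc are
	// expected to surface the error; this function only avoids crashing.
	if alloc.AllocatedResources == nil {
		return env
	}

	// Reading a missing key (or a nil map) is safe in Go and returns ok == false.
	if tr, ok := alloc.AllocatedResources.Tasks[task]; ok && tr != nil {
		env["NOMAD_CPU_LIMIT"] = strconv.FormatInt(tr.CPUShares, 10)
		env["NOMAD_MEMORY_LIMIT"] = strconv.FormatInt(tr.MemoryMB, 10)
	}
	return env
}

func main() {
	invalid := &Allocation{ID: "a1"} // AllocatedResources left nil
	fmt.Println(buildTaskEnv(invalid, "web"))
}
```

With the comment in place, a later reader can see that the incomplete environment is an accepted outcome rather than an oversight.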
I kept this conditional because `TestClient_AddAllocError` asserts that when an invalid alloc is passed, taskenv doesn't panic and NewTaskRunner returns an error. I'm not sure what conditions the test actually checks for.
Why can't this be `COMPAT(0.12): Remove in 0.12`? If both clients and servers upgrade-allocs-on-restore in 0.11, what could still have `AllocatedResources==nil` in 0.12 (and therefore need these fields to populate AllocRes)? It seems like the only problem would be if someone used 0.10 or earlier agents with 0.12 code.
(Not that we have to remove it in 0.12, I'm just curious if we could while maintaining our +/-1 Y version safety.)
If both clients and servers upgrade the on-disk representation, then yes. But we currently don't do that, and we won't with this PR either. Consider the case where a cluster starts with Nomad 0.8; the operator then upgrades in rapid succession through 0.9, 0.10 (with this PR), 0.11, and then to 0.12, so fast that Raft doesn't generate a snapshot during these upgrades. In this case, Nomad 0.12 will read the representation that was persisted by 0.8, which lacks `AllocatedResources`.
To be able to fully remove it, we must augment the recommended upgrade path to ensure the on-disk representation gets upgraded before a user can do a subsequent upgrade.
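As an illustration of why the compatibility path has to stay, here is a rough sketch of upgrading an old-format alloc in memory when it is read back. The struct shapes and the `Canonicalize` body are simplified assumptions for this example and do not reproduce the PR's actual code; the point is that the legacy `TaskResources` data must remain readable for as long as an old persisted entry can still be replayed.

```go
package main

import "fmt"

// Resources is a hypothetical stand-in for the legacy per-task shape
// persisted by old versions.
type Resources struct {
	CPU      int
	MemoryMB int
}

// AllocatedTaskResources and AllocatedResources are simplified stand-ins
// for the newer representation.
type AllocatedTaskResources struct {
	CPUShares int64
	MemoryMB  int64
}

type AllocatedResources struct {
	Tasks map[string]*AllocatedTaskResources
}

type Allocation struct {
	TaskResources      map[string]*Resources // legacy field
	AllocatedResources *AllocatedResources   // newer field; nil in old data
}

// Canonicalize fills AllocatedResources from the legacy field when the alloc
// was persisted by a version that predates it. It is idempotent: allocs that
// already carry AllocatedResources are left untouched.
func (a *Allocation) Canonicalize() {
	if a.AllocatedResources != nil || len(a.TaskResources) == 0 {
		return
	}
	tasks := make(map[string]*AllocatedTaskResources, len(a.TaskResources))
	for name, r := range a.TaskResources {
		if r == nil {
			continue
		}
		tasks[name] = &AllocatedTaskResources{
			CPUShares: int64(r.CPU),
			MemoryMB:  int64(r.MemoryMB),
		}
	}
	a.AllocatedResources = &AllocatedResources{Tasks: tasks}
}

func main() {
	// Simulates an alloc written by an old version and read by a new one.
	old := &Allocation{
		TaskResources: map[string]*Resources{"web": {CPU: 500, MemoryMB: 256}},
	}
	old.Canonicalize()
	fmt.Println(old.AllocatedResources.Tasks["web"].CPUShares)
}
```

Note that this only upgrades the in-memory copy. Unless the upgraded form is written back (for example through a snapshot), the next version still sees the old on-disk shape, which is the scenario described above.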
Should we Raft Snapshot on server agent startup?
We can consider it, but we'll need to do some vetting and testing before going there. Agent startup is critical to cluster recovery, and I'd be nervous about adding a blocking call that may fail there; if done asynchronously, we'd need to properly indicate to operators when it's safe to upgrade, and cope with operators potentially ignoring the warning. Maybe consider it as part of 0.12?
Hm, a blocking `nomad operator ...` command might be sufficient, plus upgrade instructions to use it if "rapidly" upgrading from 0.N to 0.N+2.
Definitely seems best to leave it out of scope for this effort, but could you file an issue describing how it's currently impossible to safely remove deprecated fields that are persisted to Raft? It doesn't seem like anything we need to rush to fix, but I can see it mattering a lot more post-1.0, when people are much slower to upgrade (Consul struggles with this) and may want options to upgrade from 1.N to 1.N+x (where `x > 1`) quickly and easily.
I think this is the right thing to do: `nil` and Jobs are already canonicalized on statestore restore.
So I think we might waste a couple CPU instructions, but it seems necessary on clients at least.
We may want to just inline this code into Canonicalize as it seems easy to get orphaned here if we are able to remove the canonicalization in the future.