Unmanaged pods should fail fast when evicted/preempted/deleted #353
Conversation
```diff
@@ -18,7 +18,7 @@ def test_run_without_verify_result_succeeds_as_soon_as_pod_is_successfully_creat
       "Result: SUCCESS",
       "Result verification is disabled for this task",
       "The following status was observed immediately after pod creation:",
-      %r{Pod/task-runner-\w+\s+Pending},
+      %r{Pod/task-runner-\w+\s+(Pending|Running)},
```
This change and the one below are to decrease test flakiness. It's not important to these two test cases whether the pod manages to start running in the run window.
```diff
@@ -500,7 +498,7 @@ def confirm_namespace_exists
     st, err = nil
     with_retries(2) do
       _, err, st = kubectl.run("get", "namespace", @namespace, use_namespace: false, log_failure: true)
-      st.success? || err.include?(NOT_FOUND_ERROR)
+      st.success? || err.include?(KubernetesDeploy::Kubectl::NOT_FOUND_ERROR_TEXT)
```
I didn't refactor this to use the new `raise_on_404` because I have a WIP branch that moves this check to use Kubeclient, which would be better.
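For reference, a Kubeclient-based existence check could look roughly like this (a sketch only; the helper name and client setup are assumptions, not code from this PR):

```ruby
require 'kubeclient'

# Hypothetical helper: true if the namespace exists, false on a 404.
# Assumes a Kubeclient::Client already configured for the target cluster.
def namespace_exists?(client, namespace)
  client.get_namespace(namespace)
  true
rescue Kubeclient::ResourceNotFoundError
  false
end
```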
LGTM, thanks for owning this!
lib/kubernetes-deploy/kubectl.rb
Outdated
```diff
@@ -17,7 +20,7 @@ def initialize(namespace:, context:, logger:, log_failure_by_default:, default_t
     raise ArgumentError, "context is required" if context.blank?
   end
 
-  def run(*args, log_failure: nil, use_context: true, use_namespace: true)
+  def run(*args, log_failure: nil, use_context: true, use_namespace: true, raise_on_404: false)
```
nitpick: `raise_on_404` seems like a bit of a leaky abstraction (the fact that a missing resource is signaled via a 404 error code is an implementation detail that the consumer of this API shouldn't need to know about). How about `raise_on_missing` or `raise_on_resource_not_found`?
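Either name keeps the HTTP detail out of the call site. For illustration, with `raise_if_not_found` (the name the thread later settles on), a caller would read like this; the specific call is hypothetical:

```ruby
# The caller states its intent ("blow up if the resource is missing")
# without needing to know that kubectl signals this via a 404 / NotFound error.
_, _err, st = kubectl.run("get", "pod", pod_name, raise_if_not_found: true)
```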
```ruby
    end

    def after_sync
    end

    def deleted?
      @instance_data.dig('metadata', 'deletionTimestamp').present?
```
So this basically means that k8s has asked the pod to "please go away", but in theory the pod might still exist and the process might still be running, right? If so then I find "deleted?" slightly misleading (since it might still exist and might even terminate successfully, i.e. with a zero exit code, right?).

What's the reasoning for not checking whether the pod has actually already been deleted? Because this will catch deletion requests earlier?
> So this basically means that k8s has asked the pod to "please go away", but in theory the pod might still exist and the process might still be running, right?

Yes. I should probably call this `terminating?` -- that's the terminology kubectl uses.

> What's the reasoning for not checking whether the pod has actually already been deleted? Because this will catch deletion requests earlier?

Yes. We've also seen a k8s bug in the past where resources would get stuck in the terminating state indefinitely, even after the underlying container was gone (though it hasn't been reported in recent versions afaik). (Note that we are also checking that they have actually been deleted; if so, `disappeared?` will be true.)
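To make the distinction concrete, the two predicates being discussed might look roughly like this (a sketch based on the thread, not the exact implementation):

```ruby
# deletionTimestamp is set: k8s has asked the pod to go away, but the
# container may still be running (it should at least have received SIGTERM).
def terminating?
  @instance_data.dig('metadata', 'deletionTimestamp').present?
end

# The pod is actually gone: a sync after the deploy started returned a 404.
def disappeared?
  @disappeared
end
```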
```diff
@@ -69,9 +69,18 @@ def timeout_message
     header + probe_failure_msgs.join("\n") + "\n"
   end
 
+  def permanent_failed_phase?
```
"permanently"? dunno ESL
nitpick, maybe something like this would be easier to follow:

```ruby
def permanently_failed?
  failed_phase? && (unmanaged? || non_transient_error?)
end
```
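For context, the helpers that suggestion leans on could be defined along these lines (`permanently_failed?` is from the suggestion above; the other bodies are assumptions inferred from the surrounding diff and discussion):

```ruby
def failed_phase?
  phase == FAILED_PHASE_NAME
end

def non_transient_error?
  !TRANSIENT_FAILURE_REASONS.include?(reason)
end

def unmanaged?
  @parent.blank? # assumption: an unmanaged pod has no owning resource
end

def permanently_failed?
  failed_phase? && (unmanaged? || non_transient_error?)
end
```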
```diff
-      @instance_data = mediator.get_instance(kubectl_resource_type, name)
+      @instance_data = mediator.get_instance(kubectl_resource_type, name, raise_on_404: true)
+    rescue KubernetesDeploy::Kubectl::ResourceNotFoundError
+      @disappeared = true if deploy_started?
```
`deploy_started?` is here because if the pod hasn't been created yet then the 404 is actually expected, right?
I think we need to prioritize the state tracking refactor. We're effectively adding a new state here.
> `deploy_started?` is here because if the pod hasn't been created yet then the 404 is actually expected, right?

Exactly.

> I think we need to prioritize the state tracking refactor. We're effectively adding a new state here.

Are we? This is of course new data about the state of the resource in a general sense, but it isn't a new end state for the resource, which is what that refactor was about (the new state that triggered it was "ignored"). In other words, our end states are still succeeded, failed and timed out, and this is just a new way that resources can fail.
I still disagree. The refactor is about mutually exclusive states, not just terminal ones.
```ruby
      end

      cached_instance = @cache[kind].fetch(resource_name, {})
```
Previously this used `@cache.dig(kind, ...)`. That makes me believe someone thought that `@cache[kind]` might be `nil`. Is that not a concern anymore? The new code here would break in that case. Or is the only way for `@cache[kind]` to be `nil` if `@cache.key?(kind)` is false? i.e. can the key exist but the value legitimately be `nil`?
The only place that can set it is this:

```ruby
@cache[kind] = JSON.parse(raw_json)["items"].each_with_object({}) do |r, instances|
  instances[r.dig("metadata", "name")] = r
end
```

That's way too densely written. If I rewrite it like this, it is clearer that it shouldn't be possible to have a `nil` value:

```ruby
instances = {}
JSON.parse(raw_json)["items"].each do |resource|
  resource_name = resource.dig("metadata", "name")
  instances[resource_name] = resource
end
@cache[kind] = instances
```
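If the `nil` case ever did need guarding, a defensive lookup would be cheap anyway (a sketch, not something committed in this PR):

```ruby
# Falls back to an empty hash both when the kind was never cached and
# when the resource name is missing from the cached list.
cached_instance = (@cache[kind] || {}).fetch(resource_name, {})
```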
```ruby
    task_runner = build_task_runner
    deleter_thread = Thread.new do
      loop do
```
This is really scrappy. Good job! 😆
```ruby
  def test_deploy_failed_is_true_for_deleted_unmanaged_pods
    template = build_pod_template
    template["metadata"] = template["metadata"].merge("deletionTimestamp" => "2018-04-13T22:43:23Z")
```
nitpick: `template["metadata"]["deletionTimestamp"] = "2018-04-13T22:43:23Z"` seems simpler
```ruby
  def test_deploy_failed_is_false_for_deleted_managed_pods
    template = build_pod_template
    template["metadata"] = template["metadata"].merge("deletionTimestamp" => "2018-04-13T22:43:23Z")
```
same
```ruby
    assert_predicate pod, :disappeared?
    assert_predicate pod, :deploy_failed?
    assert_equal "Pod status: Disappeared. ", pod.failure_message
```
this trailing space is a bit hacky 😄
Improvement: same as above, but handles the window where the deletion request has been made successfully but the pod isn't actually gone yet (i.e. the deletion timestamp is set). Reviewers, do you agree with this? The containers should at least have received SIGTERM when we see this, but they may still be running.

I think it's fine to say the deploy has failed. If you manually killed the pod, I think it's OK not to show any more logs either, since they should be able to get to them some other way. My big concern here is using exceptions as control flow. But I don't see a better way.
```diff
   def failure_message
-    if phase == FAILED_PHASE_NAME && !TRANSIENT_FAILURE_REASONS.include?(reason)
-      phase_problem = "Pod status: #{status}. "
+    phase_problem = if permanent_failed_phase?
```
Should this entire if block be wrapped in `if unmanaged?`?
```ruby
          retry
        end
      end
      sleep 0.1
```
What is this sleep for?
It's a tiny throttle that takes effect between when we start the thread and when the pod name is generated. The one on L102 takes effect between when the pod name is generated and when the pod has been created.
Should this be in an else block?
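For clarity, the `else`-block variant being suggested would look something like this (a rough reconstruction of the test helper; the `pod_name` accessor, the delete call, and the error class are assumptions):

```ruby
deleter_thread = Thread.new do
  loop do
    if (pod_name = task_runner.pod_name) # hypothetical accessor on the task runner
      begin
        kubeclient.delete_pod(pod_name, @namespace)
        break
      rescue Kubeclient::ResourceNotFoundError
        sleep 0.1 # the name is known, but the pod hasn't been created yet
        retry
      end
    else
      sleep 0.1 # throttle until the task runner has generated a pod name
    end
  end
end
```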
test/integration/runner_task_test.rb
Outdated
```ruby
      /Pod status\: (Terminating|Disappeared)/,
    ])
  ensure
    if deleter_thread
```
Why not just `deleter_thread&.kill` in the ensure block? If we've gotten here, I don't see how it's important that the thread finish running.
Good question. This is for the case where the test is failing ultimately because of a problem in the thread (which I originally had when I wrote this). If the thread raises and you never join it, you'll never see the error and the test will suck to debug.
Can you leave a comment like "join to ensure the error message is printed"? Also, do we want to set a limit on how long we'll wait for the join?
Added the comment. Thinking about this more, I think the better thing to do is to set `abort_on_exception` on the thread, and not join here.
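A sketch of that approach (assuming the same `deleter_thread` variable from the test):

```ruby
deleter_thread = Thread.new do
  # ... poll for the pod and delete it ...
end
# Re-raise any exception from the background thread in the main thread, so a
# failure in the deleter is visible without joining it in the ensure block.
deleter_thread.abort_on_exception = true

# The ensure block can then simply clean up:
#   deleter_thread&.kill
```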
I've pushed a commit to address the comments. Please take a look.
Are you talking about […]? tl;dr: I haven't thought of an alternative I like better either. Can you be more specific about your concern? Having a resource fetcher raise an exception when the target is missing doesn't seem all that weird to me, tbh.
```ruby
    end

    def after_sync
    end

    def terminating?
```
Thanks for giving this a clearer name
```diff
@@ -137,6 +137,22 @@ def test_version_info_raises_if_command_fails
     end
   end
 
+  def test_run_with_raise_err_on_404_raises_the_correct_thing
```
Forgot to rename the `raise_err_on_404` to `raise_if_not_found` here (and on line 149 too).
```diff
@@ -86,7 +82,7 @@ def failure_message
       container_problems += "> #{red_name}: #{c.doom_reason}\n"
     end
   end
-  "#{phase_problem}#{container_problems}".presence
+  "#{phase_failure_message} #{container_problems}".lstrip.presence
```
Why `lstrip` and not `strip`? Otherwise you still have trailing spaces (like the tests below show).
You're right, it should be `strip` 🤦‍♀️
That turned out to open a small can of worms. My last commit centralizes the chunking of the debug message (into that method itself) and adds a test displaying/proving the desired output.
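The general shape of that centralization might be something like this (a sketch of the idea only, assuming `phase_failure_message` and `container_problems` return strings or nil; not the exact commit):

```ruby
def failure_message
  # Strip each chunk, drop empty ones, and only then join, so no chunk can
  # leave a stray leading or trailing space in the final message.
  chunks = [phase_failure_message, container_problems].compact.map(&:strip)
  chunks.reject(&:empty?).join("\n").presence
end
```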
Purpose

Things I don't like:
- `Pod` should probably be split into `Pod` and `ManagedPod`. Many of its methods do different things based on whether or not it has a parent. I don't think that is necessary for this PR though.
- The `disappeared?` thing should be implemented for all resources, not just this one (Reconsider treatment of "!exists?" in KubernetesResource classes #347). I do think this is a good start though, and it is a lot more important for pods (because of kubernetes-run).
- The mocking needed for the `raise_on_404` option. We could probably improve the mocking strategy to make this less painful, but I just brute-forced it for now.

cc @Shopify/cloudx