
Unmanaged pods should fail fast when evicted/preempted/deleted #353

Merged

KnVerey merged 6 commits into master from unmanaged_pod_fail_faster on Oct 24, 2018

Conversation

@KnVerey (Contributor) commented Oct 16, 2018:

Purpose

  • Bugfix: Unmanaged pods should fail fast if they're evicted or preempted. Previously, we made pods' failure condition ignore these two statuses because, very occasionally, a parent's deploy would fail incorrectly when it was unlucky enough to have all of its children hit one of those conditions (which are recoverable for managed pods, because the parent spawns replacement children). For unmanaged pods, these states are not recoverable and should not be ignored.
  • Improvement: We should not wait around forever if a pod is deleted out of band. If we've deployed it and it doesn't exist, it isn't magically going to appear in the future. This is especially important for the runner. Note that I'm continuing to treat non-404 errors as transient. Related to #347 ("Reconsider treatment of "!exists?" in KubernetesResource classes").
  • Improvement: Same as above, but handles the window where the deletion request has been made successfully but the pod isn't actually gone yet (i.e. the deletion timestamp is set). Reviewers, do you agree with this? The containers should at least have received SIGTERM by the time we see this, but they may still be running. (A rough sketch of the combined fail-fast behaviour follows this list.)
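
To make the intended behaviour concrete, here is a minimal, illustrative sketch of the combined fail-fast conditions for unmanaged pods. It is not the literal diff: the predicate names follow the changes discussed in this PR, but the class shape and the transient-reason list are simplified assumptions.

# Illustrative sketch only (condensed from the PR description, not the real Pod class).
class Pod
  FAILED_PHASE_NAME = "Failed"
  # Assumed stand-in for the eviction/preemption reasons that are only
  # recoverable when a parent controller will spawn replacement pods.
  TRANSIENT_FAILURE_REASONS = %w(Evicted Preempting)

  attr_reader :phase, :reason

  def initialize(phase:, reason: nil, unmanaged: false, terminating: false, disappeared: false)
    @phase = phase
    @reason = reason
    @unmanaged = unmanaged
    @terminating = terminating
    @disappeared = disappeared
  end

  def unmanaged?
    @unmanaged
  end

  def terminating?
    @terminating
  end

  def disappeared?
    @disappeared
  end

  def deploy_failed?
    failed_phase? || (unmanaged? && (terminating? || disappeared?))
  end

  private

  def failed_phase?
    return false unless phase == FAILED_PHASE_NAME
    # Managed pods ignore transient reasons (the parent replaces the child);
    # unmanaged pods must treat them as permanent failures.
    unmanaged? || !TRANSIENT_FAILURE_REASONS.include?(reason)
  end
end

For example, an unmanaged pod in the Failed phase with reason "Evicted" is now deploy_failed?, while a managed pod in the same state is not.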

Things I don't like

  • I think we should try splitting Pod into Pod and ManagedPod. Many of its methods do different things based on whether or not it has a parent. I don't think that is necessary for this PR though.
  • The disappeared? thing should be implemented for all resources, not just this one (see #347, "Reconsider treatment of "!exists?" in KubernetesResource classes"). I do think this is a good start though, and it is a lot more important for pods (because of kubernetes-run).
  • Because we're using kubectl to get these resources (which is really difficult to change at this point), I can't check the actual response status. The "404" check is actually looking for a non-zero exit and an error string.
  • I had to change a ton of the unit tests to accommodate the new raise_on_404 option. We could probably improve the mocking strategy to make this less painful, but I just brute-forced it for now.

cc @Shopify/cloudx

@@ -18,7 +18,7 @@ def test_run_without_verify_result_succeeds_as_soon_as_pod_is_successfully_creat
"Result: SUCCESS",
"Result verification is disabled for this task",
"The following status was observed immediately after pod creation:",
- %r{Pod/task-runner-\w+\s+Pending},
+ %r{Pod/task-runner-\w+\s+(Pending|Running)},
Contributor Author:

This change and the one below are to decrease test flakiness. It's not important to these two test cases whether the pod manages to start running in the run window.

@@ -500,7 +498,7 @@ def confirm_namespace_exists
st, err = nil
with_retries(2) do
_, err, st = kubectl.run("get", "namespace", @namespace, use_namespace: false, log_failure: true)
- st.success? || err.include?(NOT_FOUND_ERROR)
+ st.success? || err.include?(KubernetesDeploy::Kubectl::NOT_FOUND_ERROR_TEXT)
Contributor Author:

I didn't refactor this to use the new raise_on_404 because I have a WIP branch that moves this check to use Kubeclient, which would be better.

@fw42 (Contributor) left a comment:

LGTM, thanks for owning this!

@@ -17,7 +20,7 @@ def initialize(namespace:, context:, logger:, log_failure_by_default:, default_t
raise ArgumentError, "context is required" if context.blank?
end

- def run(*args, log_failure: nil, use_context: true, use_namespace: true)
+ def run(*args, log_failure: nil, use_context: true, use_namespace: true, raise_on_404: false)
Contributor:

nitpick: raise_on_404 seems like a bit of a leaky abstraction (the fact that a missing resource is signaled via a 404 error code is an implementation detail that the consumer of this API shouldn't need to know about). How about raise_on_missing or raise_on_resource_not_found?

end

def after_sync
end

def deleted?
@instance_data.dig('metadata', 'deletionTimestamp').present?
Contributor:

So this basically means that k8s has asked the pod to "please go away", but in theory the pod might still exist and the process might still be running, right? If so then I find "deleted?" slightly misleading (since it might still exist and might even terminate successfully, i.e. with a zero exit status, right?).

What's the reasoning for not checking whether the pod has actually already been deleted? Because this will catch deletion requests earlier?

Contributor Author:

So this basically means that k8s has asked the pod to "please go away", but in theory the pod might still exist and the process might still be running, right?

Yes. I should probably call this terminating? -- that's the terminology kubectl uses.

What's the reasoning for not checking whether the pod has actually already been deleted? Because this will catch deletion requests earlier?

Yes. We've also seen a k8s bug in the past where resources would get stuck in the terminating state indefinitely, even after the underlying container was gone (though it hasn't been reported in recent versions, afaik). (Note that we are also checking whether they have actually been deleted; if so, disappeared? will be true.)
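
For illustration, the two situations under discussion could be exposed as separate predicates, roughly as in the diffs in this PR (a sketch; @instance_data and @disappeared are populated during sync):

# Termination requested: the API has set a deletion timestamp and SIGTERM has
# been (or will shortly be) sent, but the containers may still be running.
def terminating?
  @instance_data.dig("metadata", "deletionTimestamp").present?
end

# Actually gone: a GET for a pod we already created returned 404 during sync.
def disappeared?
  @disappeared
end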

@@ -69,9 +69,18 @@ def timeout_message
header + probe_failure_msgs.join("\n") + "\n"
end

def permanent_failed_phase?
Contributor:

"permanently"? dunno ESL

Contributor:

nitpick, maybe something like this would be easier to follow:

def permanently_failed?
  failed_phase? && (unmanaged? || non_transient_error?)
end

- @instance_data = mediator.get_instance(kubectl_resource_type, name)
+ @instance_data = mediator.get_instance(kubectl_resource_type, name, raise_on_404: true)
+ rescue KubernetesDeploy::Kubectl::ResourceNotFoundError
+   @disappeared = true if deploy_started?
Contributor:

deploy_started? is here because if the pod hasn't been created yet then the 404 is actually expected, right?

Contributor:

I think we need to prioritize the state tracking refactor. We're effectively adding a new state here.

Contributor Author:

deploy_started? is here because if the pod hasn't been created yet then the 404 is actually expected, right?

Exactly.

I think we need to prioritize the state tracking refactor. We're effectively adding a new state here.

Are we? This is of course new data about the state of the resource in a general sense, but it isn't a new end state for the resource, which is what that refactor was about (the new state that triggered it was "ignored"). In other words, our end states are still succeeded, failed and timed out, and this is just a new way that resources can fail.

Contributor:

I still disagree. The refactor is about mutually exclusive states, not just terminal ones.

end

cached_instance = @cache[kind].fetch(resource_name, {})
Contributor:

Previously this used @cache.dig(kind, ...). That makes me believe someone thought that @cache[kind] might be nil. Is that not a concern anymore? The new code here would break in that case. Or is the only way for @cache[kind] to be nil if @cache.key?(kind) is false? i.e. can the key exist but the value legitimately be nil?

Contributor Author:

The only place that can set it is this:

      @cache[kind] = JSON.parse(raw_json)["items"].each_with_object({}) do |r, instances|
        instances[r.dig("metadata", "name")] = r
      end

That's way too densely written. If I rewrite it like this, it's clearer that it shouldn't be possible to have a nil value:

      instances = {}
      JSON.parse(raw_json)["items"].each do |resource|
        resource_name = resource.dig("metadata", "name")
        instances[resource_name] = resource
      end
      @cache[kind] = instances


task_runner = build_task_runner
deleter_thread = Thread.new do
loop do
Contributor:

This is really scrappy. Good job! 😆


def test_deploy_failed_is_true_for_deleted_unmanaged_pods
template = build_pod_template
template["metadata"] = template["metadata"].merge("deletionTimestamp" => "2018-04-13T22:43:23Z")
Contributor:

nitpick:

template["metadata"]["deletionTimestamp"] = "2018-04-13T22:43:23Z"

seems simpler


def test_deploy_failed_is_false_for_deleted_managed_pods
template = build_pod_template
template["metadata"] = template["metadata"].merge("deletionTimestamp" => "2018-04-13T22:43:23Z")
Contributor:

same


assert_predicate pod, :disappeared?
assert_predicate pod, :deploy_failed?
assert_equal "Pod status: Disappeared. ", pod.failure_message
Contributor:

this trailing space is a bit hacky 😄

@dturn (Contributor) left a comment:

Improvement: same as above, but handles the window where the deletion request has been made successfully but the pod isn't actually gone yet (i.e. deletion timestamp is set). Reviewers, do you agree with this? The containers should at least have received SIGTERM when we see this, but they may still be running.

I think it's fine to say the deploy has failed. If you manually killed the pod, I think it's also OK not to show any more logs, since you should be able to get to them some other way.

My big concern here is using exceptions as control flow. But, I don't see a better way.


def failure_message
- if phase == FAILED_PHASE_NAME && !TRANSIENT_FAILURE_REASONS.include?(reason)
-   phase_problem = "Pod status: #{status}. "
+ phase_problem = if permanent_failed_phase?
Contributor:

Should this entire if block be wrapped in if unmanaged?

retry
end
end
sleep 0.1
Contributor:

What is this sleep for?

Contributor Author:

It's a tiny throttle that takes effect between when we start the thread and when the pod name is generated. The one on L102 takes effect between when the pod name is generated and when the pod has been created.
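
For context, the deleter thread in this test has roughly this shape (paraphrased; find_unmanaged_pod_name and delete_pod are hypothetical stand-ins for the actual kubectl calls in the test):

deleter_thread = Thread.new do
  loop do
    pod_name = find_unmanaged_pod_name # hypothetical helper; returns nil until the runner pod exists
    if pod_name
      delete_pod(pod_name) # hypothetical helper wrapping `kubectl delete pod`
      break
    end
    sleep 0.1 # throttle the polling between thread start and pod name generation
  end
end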

Contributor:

Should this be in an else block?

/Pod status\: (Terminating|Disappeared)/,
])
ensure
if deleter_thread
Contributor:

Why not just deleter_thread&.kill in the ensure block? If we've gotten here I don't see how its important that the thread finish running.

Contributor Author:

Good question. This is for the case where the test is ultimately failing because of a problem in the thread (which I originally had when I wrote this). If the thread raises and you never join it, you'll never see the error and the test will suck to debug.

Contributor:

Can you leave a comment like "join to ensure the error message is printed"? Also, do we want to set a limit on how long we'll wait for the join?

Contributor Author:

Added the comment. Thinking about this more, I think the better thing to do is to set abort_on_exception on the thread, and not join here.
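
A sketch of that alternative (not the final diff): with Thread#abort_on_exception set, an exception inside the deleter thread re-raises in the main thread, so the ensure block can simply kill the thread instead of joining it.

deleter_thread = Thread.new do
  # ... poll for the runner pod and delete it, as above ...
end
# Surface any failure inside the thread immediately in the test output.
deleter_thread.abort_on_exception = true

begin
  # ... run the task and assert on the logs ...
ensure
  deleter_thread&.kill
end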

@KnVerey (Contributor, Author) commented Oct 23, 2018:

I've pushed a commit to address the comments. Please take a look.

My big concern here is using exceptions as control flow. But, I don't see a better way.

Are you talking about ResourceNotFoundError? It's not really changing the control flow so much as passing a piece of data; the consequence is setting @disappeared, not changing what we do next. Another way to do that would be to make both Kubectl#run and SyncMediator.get_instance able to pass back enhanced information (e.g. multiple values, or some new type of object). That would be a huge change, and I don't really see why it would be better. It would also be possible to have it return nil instead of {} and have that mean "not found" (other errors would still have to return {}), but that feels a bit sketchy to me and handling those possible nils everywhere would not be fun.

tl;dr I haven't thought of an alternative I like better either. Can you be more specific about your concern? Having a resource fetcher raise an exception when the target is missing doesn't seem all that weird to me tbh.
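
For reference, the overall shape of the pattern being discussed is roughly the following. This is condensed and partly assumed, not the literal diff: NOT_FOUND_ERROR_TEXT and ResourceNotFoundError are real names from this PR, shell_out_to_kubectl is a hypothetical stand-in for the existing command execution, and raise_on_404 is the keyword later renamed raise_if_not_found.

# Kubectl#run (condensed): when the caller opts in, surface "not found" as a
# dedicated error class instead of just a failed status.
def run(*args, raise_on_404: false, **opts)
  out, err, st = shell_out_to_kubectl(*args, **opts) # hypothetical stand-in
  if !st.success? && raise_on_404 && err.include?(NOT_FOUND_ERROR_TEXT)
    raise ResourceNotFoundError, err
  end
  [out, err, st]
end

# The resource's sync then treats the error as a piece of data about the
# resource rather than as a change in control flow:
def sync(mediator)
  @instance_data = mediator.get_instance(kubectl_resource_type, name, raise_on_404: true)
rescue KubernetesDeploy::Kubectl::ResourceNotFoundError
  @disappeared = true if deploy_started?
  @instance_data = {}
end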

end

def after_sync
end

def terminating?
Contributor:

Thanks for giving this a clearer name

@@ -137,6 +137,22 @@ def test_version_info_raises_if_command_fails
end
end

def test_run_with_raise_err_on_404_raises_the_correct_thing
Contributor:

forgot to rename the raise_err_on_404 to raise_if_not_found here (and on line 149 too)

@@ -86,7 +82,7 @@ def failure_message
container_problems += "> #{red_name}: #{c.doom_reason}\n"
end
end
"#{phase_problem}#{container_problems}".presence
"#{phase_failure_message} #{container_problems}".lstrip.presence
Contributor:

Why lstrip and not strip? Otherwise you still have trailing spaces (like the tests below show).

Contributor Author:

You're right, it should be strip 🤦‍♀️
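
A quick illustration of the difference (values assumed; container_problems is empty when only the phase failed):

phase_failure_message = "Pod status: Disappeared."
container_problems = ""

"#{phase_failure_message} #{container_problems}".lstrip # => "Pod status: Disappeared. " (trailing space remains)
"#{phase_failure_message} #{container_problems}".strip  # => "Pod status: Disappeared."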

Contributor Author:

That turned out to open a small can of worms. My last commit centralizes the chunking of the debug message (into that method itself) and adds a test displaying/proving the desired output.


@KnVerey merged commit 401fa6b into master on Oct 24, 2018
@KnVerey deleted the unmanaged_pod_fail_faster branch on October 24, 2018 at 19:31
@fw42 temporarily deployed to rubygems on October 25, 2018 at 17:57