Ensure instance data is synced with pod state before reporting DS success #617
Conversation
In our test, can't we mock the daemonset being out of sync with the pods?
As discussed in Slack (and now changed in your PR), I think this should be `>=`: the DS-reported count should always be equal to or greater than the number of pods considered, since we are only considering pods for potentially a subset of nodes.
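A toy illustration of the direction of that comparison (the values here are made up; the real variables come from the diff below):

```ruby
# Stand-in values for the real rollout_data and considered_pods:
rollout_data = { "numberReady" => 3 }  # the DS counts ready pods on ALL nodes
considered_pods = [:pod_a, :pod_b]     # we only consider pods on relevant nodes

# So the server-reported count should be >= the considered count:
puts rollout_data["numberReady"].to_i >= considered_pods.length # => true
```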
```ruby
def test_deploy_waits_for_daemonset_status_to_converge_to_pod_states
  status = {
    "desiredNumberScheduled": 1,
    "updatedNumberScheduled": 1,
    "numberReady": 0,
  }
  ds_template = build_ds_template(filename: 'daemon_set.yml', status: status)
  ready_pod_template = load_fixtures(filenames: ['daemon_set_pods.yml']).first # should be a pod in `Ready` state
  node_templates = load_fixtures(filenames: ['nodes.yml'])
  ds = build_synced_ds(ds_template: ds_template, pod_templates: [ready_pod_template], node_templates: node_templates)
  refute_predicate(ds, :deploy_succeeded?)
end
```
Testing this against master yields a failure, proving my hypothesis that the divergent state between the pods and the DS was the root of the error.
The race condition here is essentially that (due to controller update cycles) a pod becomes ready between fetching the DS from the server and fetching the pods?
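A sketch of that timeline (purely illustrative, with toy values standing in for real API responses):

```ruby
# t0: fetch the DaemonSet; its status lags behind the controller
ds_status = { "desiredNumberScheduled" => 1, "numberReady" => 0 }

# t1: between the two fetches, the controller marks the pod Ready

# t2: fetch the pods; the pod now reports Ready
pods_ready = [true]

# Without checking ds_status against the observed pods, success would be
# reported here even though the DS still shows numberReady == 0.
puts pods_ready.all?            # => true
puts ds_status["numberReady"]   # => 0
```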
```diff
@@ -58,8 +58,11 @@ def relevant_pods_ready?
     return true if rollout_data["desiredNumberScheduled"].to_i == rollout_data["numberReady"].to_i # all pods ready
     relevant_node_names = @nodes.map(&:name)
     considered_pods = @pods.select { |p| relevant_node_names.include?(p.node_name) }
     @logger.debug("Considered #{considered_pods.size} pods out of #{@pods.size} for #{@nodes.size} nodes")
     considered_pods.present? && considered_pods.all?(&:deploy_succeeded?)
+    @logger.debug("DaemonSet is reporting #{rollout_data['numberReady']} pods ready")
```
Should we consolidate the two debugging statements? Also, do we need to name the DS in case we have a deployment with multiple DSs?
Yeah, originally I didn't want to do that to avoid an overly long log line, but since it's debug-level that's not really a big concern. Fixed to make it a single statement.
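Something like the following consolidated line (hypothetical wording, assuming the variables from the diff above and a `name` method on the DS resource; the actual merged statement may differ):

```ruby
# One debug line covering both the considered-pod count and the DS's own
# reported readiness, named so multiple DSs in one deploy stay distinguishable.
@logger.debug("DaemonSet #{name}: considered #{considered_pods.size}/#{@pods.size} pods " \
  "across #{@nodes.size} nodes; server reports #{rollout_data['numberReady']} ready")
```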
thanks!
What are you trying to accomplish with this PR?
#580 added more detailed handling of daemon set rollouts. In particular, the addition of new nodes during a deployment should not interfere with convergence logic. However, this has resulted in a flaky test where the DS deploy is considered successful, but is reporting:
1 `updatedNumberScheduled`, 1 `desiredNumberScheduled`, 0 `numberReady`
After some digging, it appears that we fail to ensure that the state of the target DS pods is reflected in the instance data of the DS itself when checking `deploy_succeeded?`, leading to a situation where the deploy is successful (e.g. the pod(s) is/are up) but the DS itself hasn't converged to this reality.

How is this accomplished?
Add an extra check in `deploy_succeeded?` to ensure `considered_pods.length == rollout_data["numberReady"]` (changed to `>=` during review, per the discussion above); see the sketch below.
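As a rough sketch, the resulting method could look like this (based on the diff above plus the described check, assuming the instance context `@nodes`, `@pods`, and `rollout_data` from that diff; not a verbatim copy of the PR):

```ruby
def relevant_pods_ready?
  # Trivially done if the DS's own status already reports everything ready.
  return true if rollout_data["desiredNumberScheduled"].to_i == rollout_data["numberReady"].to_i
  relevant_node_names = @nodes.map(&:name)
  considered_pods = @pods.select { |p| relevant_node_names.include?(p.node_name) }
  # New: don't report success until the DS status has caught up with the pods
  # we observed (>= because we may only be considering a subset of nodes).
  considered_pods.present? &&
    considered_pods.all?(&:deploy_succeeded?) &&
    rollout_data["numberReady"].to_i >= considered_pods.length
end
```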
What could go wrong?
In the worst case, this adds additional sync loops while we wait for the DS to converge, but that's the real definition of success, so it was an error to ignore this in the first place.
It's also impossible to test this deterministically, since it relies on the timing of sync loops between the pods and the DS :(