Ensure instance data is synced with pod state before reporting DS success #617
Conversation
In our test, can't we mock the daemonset being out of sync with the pods?
As discussed in Slack (and now changed in your PR), I think this should be `>=`: the DS-reported count should always be equal to or greater than the number of pods considered, since we are only considering pods for potentially a subset of nodes.
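A toy illustration of the direction of that comparison (the values here are made up; the real variables come from the diff below):

```ruby
# Stand-in values for the real rollout_data and considered_pods:
rollout_data = { "numberReady" => 3 }  # the DS counts ready pods on ALL nodes
considered_pods = [:pod_a, :pod_b]     # we only consider pods on relevant nodes

# So the server-reported count should be >= the considered count:
puts rollout_data["numberReady"].to_i >= considered_pods.length # => true
```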
```ruby
def test_deploy_waits_for_daemonset_status_to_converge_to_pod_states
  status = {
    "desiredNumberScheduled": 1,
    "updatedNumberScheduled": 1,
    "numberReady": 0,
  }
  ds_template = build_ds_template(filename: 'daemon_set.yml', status: status)
  ready_pod_template = load_fixtures(filenames: ['daemon_set_pods.yml']).first # should be a pod in `Ready` state
  node_templates = load_fixtures(filenames: ['nodes.yml'])
  ds = build_synced_ds(ds_template: ds_template, pod_templates: [ready_pod_template], node_templates: node_templates)
  refute_predicate(ds, :deploy_succeeded?)
end
```
Testing this against master yields a failure, proving my hypothesis that the divergent state between the pods and the DS was the root of the error.
The race condition here is essentially that (due to controller update cycles) a pod becomes ready between fetching the DS from the server and fetching the pods?
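A sketch of that timeline (purely illustrative, with toy values standing in for real API responses):

```ruby
# t0: fetch the DaemonSet; its status lags behind the controller
ds_status = { "desiredNumberScheduled" => 1, "numberReady" => 0 }

# t1: between the two fetches, the controller marks the pod Ready

# t2: fetch the pods; the pod now reports Ready
pods_ready = [true]

# Without checking ds_status against the observed pods, success would be
# reported here even though the DS still shows numberReady == 0.
puts pods_ready.all?            # => true
puts ds_status["numberReady"]   # => 0
```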
```diff
@@ -58,8 +58,11 @@ def relevant_pods_ready?
     return true if rollout_data["desiredNumberScheduled"].to_i == rollout_data["numberReady"].to_i # all pods ready
     relevant_node_names = @nodes.map(&:name)
     considered_pods = @pods.select { |p| relevant_node_names.include?(p.node_name) }
     @logger.debug("Considered #{considered_pods.size} pods out of #{@pods.size} for #{@nodes.size} nodes")
     considered_pods.present? && considered_pods.all?(&:deploy_succeeded?)
+    @logger.debug("DaemonSet is reporting #{rollout_data['numberReady']} pods ready")
```
Should we consolidate the two debugging statements? Also, do we need to name the DS in case we have a deployment with multiple DSs?
Yeah, originally I didn't want to do that to avoid an overly long log line, but since it's debug-level that's not really a big concern. Fixed to make it a single statement.
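Something like the following consolidated line (hypothetical wording, assuming the variables from the diff above and a `name` method on the DS resource; the actual merged statement may differ):

```ruby
# One debug line covering both the considered-pod count and the DS's own
# reported readiness, named so multiple DSs in one deploy stay distinguishable.
@logger.debug("DaemonSet #{name}: considered #{considered_pods.size}/#{@pods.size} pods " \
  "across #{@nodes.size} nodes; server reports #{rollout_data['numberReady']} ready")
```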
thanks!
What are you trying to accomplish with this PR?
#580 added more detailed handling of daemon set rollouts. In particular, the addition of new nodes during a deployment should not interfere with convergence logic. However, this has resulted in a flaky test where the DS deploy is considered successful, but is reporting:
1 `updatedNumberScheduled`, 1 `desiredNumberScheduled`, 0 `numberReady`
After some digging, it appears that we fail to ensure that the state of the target DS pods is reflected in the instance data of the DS itself when checking `deploy_succeeded?`, leading to a situation where the deploy is successful (e.g. the pod(s) is/are up) but the DS itself hasn't converged to this reality.

How is this accomplished?
Add an extra check in `deploy_succeeded?` to ensure `considered_pods.length == rollout_data["numberReady"]` (changed to `>=` during review, per the discussion above); see the sketch below.
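As a rough sketch, the resulting method could look like this (based on the diff above plus the described check, assuming the instance context `@nodes`, `@pods`, and `rollout_data` from that diff; not a verbatim copy of the PR):

```ruby
def relevant_pods_ready?
  # Trivially done if the DS's own status already reports everything ready.
  return true if rollout_data["desiredNumberScheduled"].to_i == rollout_data["numberReady"].to_i
  relevant_node_names = @nodes.map(&:name)
  considered_pods = @pods.select { |p| relevant_node_names.include?(p.node_name) }
  # New: don't report success until the DS status has caught up with the pods
  # we observed (>= because we may only be considering a subset of nodes).
  considered_pods.present? &&
    considered_pods.all?(&:deploy_succeeded?) &&
    rollout_data["numberReady"].to_i >= considered_pods.length
end
```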
What could go wrong?
In the worst case, this adds additional sync loops while we wait for the DS to converge, but that's the real definition of success, so it was an error to ignore this in the first place.
It's also impossible to test this deterministically, since it relies on the timing of sync loops between the pods and the DS :(