Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

fix(activator): Correctly return noop value from probePodIPs based on changes #14347

Merged
merged 2 commits into from
Sep 25, 2023

Conversation

arsenetar
Copy link
Contributor

Proposed Changes

NOTE: This builds off of #14303 and includes those commits as well.

Update the activator probePodIPs to handle cases where the healthy pods are updated, but the update is not propagated due to the noop return value being set to true. This issue is specific to certain combinations of activator/revision options such as the mesh mode and probe optimization. Instead of specific unchanged logic for just the probes this uses the set Equal operation to check for unchanged which covers all possible code paths for updates.

Tests have been expanded to include more explicit coverage of different combinations of options and states to probePodIPs to help cover this behavior moving forward. There is one test case no changes without probes that now shows different behavior than prior code. Prior code would return a false for noop. After reviewing the calling code this did not seem to make sense
for a non-probing non-updating call to update the endpoints (especially given the other non-probe changes are now accounted for). So this was left as it is now with the simple unchanged value logic.

This patch has been running in a production environment without issues for awhile now and has not shown the prior problematic behavior from #14200 thus far.

Ref #14200

Release Note

Activator correctly propagates pod health when triggered by changes other than pod probes.

@knative-prow knative-prow bot added area/autoscale area/networking needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files. labels Sep 8, 2023
@knative-prow
Copy link

knative-prow bot commented Sep 8, 2023

Hi @arsenetar. Thanks for your PR.

I'm waiting for a knative member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work. Regular contributors should join the org to skip this step.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@nak3 nak3 added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Sep 11, 2023
@codecov
Copy link

codecov bot commented Sep 11, 2023

Codecov Report

All modified lines are covered by tests ✅

Comparison is base (18b3f4f) 86.09% compared to head (5a6a187) 86.08%.
Report is 50 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main   #14347      +/-   ##
==========================================
- Coverage   86.09%   86.08%   -0.02%     
==========================================
  Files         196      196              
  Lines       14783    14871      +88     
==========================================
+ Hits        12728    12802      +74     
- Misses       1749     1759      +10     
- Partials      306      310       +4     
Files Coverage Δ
pkg/activator/net/revision_backends.go 92.75% <100.00%> (-0.07%) ⬇️

... and 11 files with indirect coverage changes

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

The probePodIPs sometimes (depending on configuration) will return a
true for noop when in fact there are changes.  This is due to changes to
the healthy endpoints being possible outside of probing.

- Change the unchanged value to just compare the existing healthy set
  with the new one.
- Add tests to cover most of the different cases of behavior for the
  probePodIPs function.

NOTE: There is one test case `no changes without probes` that now shows
different behavior than prior code.  Prior code would return a false for
noop.  After reviewing the calling code this did not seem to make sense
for a non-probing non-updating call to update the endpoints (given the
other non-probe changes are now accounted for).  So this was left as it
is now with the simple unchanged value logic.
Copy link
Member

@dprotaso dprotaso left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

thanks for the PR

looks good - just some minor stuff. Can you also fix the deprecation warnings the linter noted

pkg/activator/net/revision_backends_test.go Outdated Show resolved Hide resolved
pkg/activator/net/revision_backends_test.go Outdated Show resolved Hide resolved
},
},
{
name: "only ready pods healthy without probe optimization", // NOTE: prior test is effectively this one with probe optimization
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non blocking question:

I guess to confirm - probe optimization=false doesn't have an effect and probing happens

I thought probe optimization when disabled would not probe at all - (since the revision has a container with an exec probe and the queue proxy can't intercept the container's healthy state)

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The probe optimization in this probePodIPs context just allows non-ready pods to be marked as healthy after successful probe. Without the optimization enabled the pod needs to have a successful probe and be ready to be marked healthy. Interestingly enough it still probes the pod regardless. Definitely could update to just only probe the ones it could actually mark healthy... but I don't really want to start adding changes unrelated to the fix here.

I think part of the current behavior has to do possibly with the mesh mode Auto behavior for Mesh detection. Since it needs to probe to detect.

@arsenetar
Copy link
Contributor Author

@dprotaso The sets.String linter error is due to matching the types used in the code for the tests, if we wanted to fix the linter issue there then the usage of those values which propagate throughout the codebase would also have to migrate to the new sets.Set[string] type. That seems a bit beyond the intention of this PR.

- Rename one test case
- Move both rw and fakeRT inside test function
@dprotaso
Copy link
Member

That seems a bit beyond the intention of this PR.

Fair enough - let me know if you want to tackle this otherwise i'll make a new issue

/lgtm
/approve

@knative-prow knative-prow bot added the lgtm Indicates that a PR is ready to be merged. label Sep 25, 2023
@knative-prow
Copy link

knative-prow bot commented Sep 25, 2023

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: arsenetar, dprotaso

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@knative-prow knative-prow bot added the approved Indicates a PR has been approved by an approver from all required OWNERS files. label Sep 25, 2023
@dprotaso
Copy link
Member

/override "style / Golang / Lint"

@knative-prow
Copy link

knative-prow bot commented Sep 25, 2023

@dprotaso: Overrode contexts on behalf of dprotaso: style / Golang / Lint

In response to this:

/override "style / Golang / Lint"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@dprotaso
Copy link
Member

/override "test (v1.27.x, istio-ambient, e2e)"

@knative-prow
Copy link

knative-prow bot commented Sep 25, 2023

@dprotaso: Overrode contexts on behalf of dprotaso: test (v1.27.x, istio-ambient, e2e)

In response to this:

/override "test (v1.27.x, istio-ambient, e2e)"

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow knative-prow bot merged commit 9e7956f into knative:main Sep 25, 2023
63 of 65 checks passed
arsenetar added a commit to coreweave/serving that referenced this pull request Dec 19, 2023
- Backport activator fixes from
  knative#14303 and
  knative#14347 from 1.12
- Add custom patches for logs and probe durations
- Update to go 1.20
- Add patch from knative#14022
- Add custom CI workflows
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
approved Indicates a PR has been approved by an approver from all required OWNERS files. area/autoscale area/networking lgtm Indicates that a PR is ready to be merged. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants