-
Notifications
You must be signed in to change notification settings - Fork 1.2k
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[WIP] Use user-defined readiness probes through queue-proxy #4600
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@joshrider: 3 warnings.
In response to this:
Fixes #4014
TODO
- check defaults and validations => lots of questions about desired behaviour here. Make a list.
- investigate flaky TestAutoScaleUpDownUp
- lots of general cleanup
- pkg/reconciler/revision/resources/deploy_test.go only remaining user of
queueReadinessProbe
- pkg/reconciler/revision/resources/queue_test.go probe stuff into table tests?
- make sure zap errors don't produce lengthy errors that destroy readability of event log
- does activator still keep using
probeUserContainer
?Proposed Changes
- move user-defined readinessProbes from kubelet to the queue-proxy
- show default probe in Revision spec if no probe provided
- allow the possibility of either our existing, aggressive probe retry semantics (which should speed up cold start), or standard k8s probe semantics
Release Note
Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.
So there's good news and bad news. 👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there. 😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request. Note to project maintainer: This is a terminal state, meaning the ℹ️ Googlers: Go here for more info. |
[APPROVALNOTIFIER] This PR is NOT APPROVED This pull-request has been approved by: joshrider The full list of commands accepted by this bot can be found here.
Needs approval from an approver in each of these files:
Approvers can indicate their approval by writing |
cmd/queue/main.go
Outdated
mux.HandleFunc(queue.RequestQueueDrainPath, healthState.DrainHandler()) | ||
|
||
return mux | ||
} | ||
|
||
func probeQueueHealthPath(port int, timeout time.Duration) error { | ||
func probeQueueHealthPath(port int, timeout int) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Since we are now dealing with an int can be useful to make clear that we are dealing with seconds.
func probeQueueHealthPath(port int, timeout int) error { | |
func probeQueueHealthPath(port int, timeoutSeconds int) error { |
cmd/queue/main.go
Outdated
url := fmt.Sprintf(healthURLTemplate, port) | ||
|
||
// use aggressive retries |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit:
// use aggressive retries | |
// Use aggressive sub-second retries. |
res, err := httpClient.Get(url) | ||
|
||
if err != nil { | ||
return errors.Wrap(err, "failed to probe") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If the original error does not return the URL this could be valuable for debugging.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not sure I understand this comment. Is the suggestion to include the URL in the error message? We build the URL on the fly, but (at present) it will always be http://127.0.0.1:8022/health
.
Maybe something better than "failed to probe" is called for?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep all I meant was fail to probe: http://127.0.0.1:8022/health
provides a little bit more detail to someone looking at the error in the logs without having to trace through the code.
cmd/queue/main.go
Outdated
} | ||
|
||
var lastErr error | ||
|
||
// The 25 millisecond retry interval is an unscientific compromise between wanting to get | ||
// started as early as possible while still wanting to give the container some breathing | ||
// room to get up and running. | ||
timeoutErr := wait.PollImmediate(25*time.Millisecond, timeout, func() (bool, error) { | ||
timeoutErr := wait.PollImmediate(25*time.Millisecond, readiness.PollTimeout, func() (bool, error) { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: Make the 25ms a constant rather than a magic value here
cmd/queue/main.go
Outdated
@@ -353,9 +395,16 @@ func main() { | |||
StatChan: statChan, | |||
}, time.Now()) | |||
|
|||
coreProbe, err := parseProbe(*ucProbe) | |||
if err != nil { | |||
logger.Fatalf("Queue container failed to parse readiness probe %#v", err) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
logger.Fatalf("Queue container failed to parse readiness probe %#v", err) | |
logger.Fatalw("Queue container failed to parse readiness probe", zap.Error(err)) |
cmd/queue/main.go
Outdated
@@ -293,11 +325,21 @@ func probeQueueHealthPath(port int, timeout time.Duration) error { | |||
return nil | |||
} | |||
|
|||
// parseProbe takes a json serialised *corev1.Probe and returns a Probe or an error |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
nit: comments for declarations should be sentences and end in periods.
// parseProbe takes a json serialised *corev1.Probe and returns a Probe or an error | |
// parseProbe takes a json serialised *corev1.Probe and returns a Probe or an error. |
pkg/queue/readiness/probe.go
Outdated
} | ||
|
||
func (p *Probe) timeout() time.Duration { | ||
if p.IsStandardProbe() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this supposed to be inversed?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think it's correct, but defaultProbeTimeout
is poorly named.
This represents the maximum length of time waited for a TCP connection/HTTP response. For a user-supplied probe, we are using the value given to us for our single attempt. For our aggressive retries, we timeout much more quickly and just try again.
cmd/queue/main.go
Outdated
mux := http.NewServeMux() | ||
mux.HandleFunc(requestQueueHealthPath, healthState.HealthHandler(probeUserContainer)) | ||
|
||
if p.IsStandardProbe() { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
There seems to be lots of commonality between HealthHandler and HealthHandlerKProbe. Why the switch here rather than just having a HealthHandler
function that configures based on StandardProbe flag?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Made a change. LMK what you think.
} | ||
|
||
errs := validateProbe(p) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks like we have previously not validated the bounds of periodSeconds and let the K8s API server handle the validation. Since we are using the periodSeconds in our controllers now we probably should add the validation that this is a positive number.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably also split the changes to pkg/apis
into it's own change, since it's fairly independent.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added a check to make sure periodSeconds is non-negative (allowing 0 to signal the aggressive probe) and checks to make sure timeoutSeconds and failureThreshold are non-negative in the case where we're not using the aggressive probe.
pkg/queue/readiness/probe.go
Outdated
) | ||
|
||
const ( | ||
defaultProbeTimeout = 100 * time.Millisecond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is defined here and in main.go
. Is this supposed to stay in sync? If so we should make this public and share it between the two places.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The probe timeout in probe.go
and main.go
are used for different things that don't necessary need to sync.
defaultProbeTimeout
in main.go
is used in probeUserContainer
which is used as a signal to the activator that the pod is up and ready to accept traffic. If we want to keep using the pre-existing, hard-coded TCP connect as the activator's test for readiness, it seems like the constants are two similar, but different, things.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Should the activator's test be the same as the one that's set off from kubelet?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@vagababov @mattmoor Any thoughts here?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the change @joshrider
Left a few comments throughout, but mainly this is a big change. I think it'll be clearer to tackle in pieces now that you seem to have things working. I made a couple suggestions on places this might split nicely.
@@ -134,8 +134,13 @@ func makePodSpec(rev *v1alpha1.Revision, loggingConfig *logging.Config, observab | |||
userContainer.TerminationMessagePolicy = corev1.TerminationMessageFallbackToLogsOnError | |||
} | |||
|
|||
if userContainer.ReadinessProbe != nil { | |||
if userContainer.ReadinessProbe.HTTPGet != nil || userContainer.ReadinessProbe.TCPSocket != nil { | |||
userContainer.ReadinessProbe = nil |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This is worth a comment for those without the context of this code review.
pkg/queue/readiness/probe.go
Outdated
} | ||
|
||
// IsKProbe indicates whether the default Knative probe with aggressive retries should be used. | ||
func (p *Probe) IsKProbe() bool { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Perhaps use a more descriptive name like IsAggressive
or HasUnrestrictedInterval
?
} else if p.TCPSocket != nil { | ||
err = p.tcpProbe() | ||
} else if p.Exec != nil { | ||
// Assumes the true readinessProbe is being executed directly on user container. See (#4086). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I wonder if we should still TCP probe in this case, since we otherwise have no verification of the networking path from the queue to the user-container.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
cc @dgerd
pkg/queue/readiness/probe.go
Outdated
return true | ||
} else { | ||
// using Fprintf for a concise error message in the event log | ||
fmt.Fprintf(os.Stderr, "unimplemented probe type.") |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is there any state we can add to this to surface what type we did get?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can print the whole corev1.Probe
here, but all this really means at this point is that all 3 of the extant probe types are nil. Is it worth printing the whole object to catch future cases where there may be an additional probe type?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While leads me to the possibility that the error message may be less precise than it could be...
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Also fmt.Fprint
, since no formatting.
// TCPProbe checks that a TCP socket to the address can be opened. | ||
// Did not reuse k8s.io/kubernetes/pkg/probe/tcp to not create a dependency | ||
// on klog. | ||
func TCPProbe(addr string, socketTimeout time.Duration) error { | ||
conn, err := net.DialTimeout("tcp", addr, socketTimeout) | ||
func TCPProbe(config TCPProbeConfigOptions) error { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I would also consider splitting just the changes needed for this package into its own PR.
if container.ReadinessProbe == nil { | ||
container.ReadinessProbe = &corev1.Probe{} | ||
} | ||
if container.ReadinessProbe.TCPSocket == nil && container.ReadinessProbe.HTTPGet == nil { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What if it is an exec probe?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
good catch 👍
} | ||
|
||
errs := validateProbe(p) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd probably also split the changes to pkg/apis
into it's own change, since it's fairly independent.
|
||
probeJSON, err := json.Marshal(rp) | ||
if err != nil { | ||
probeJSON = []byte{} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
🤔 I don't like swallowing errors :(
logger.Fatalw("Queue container failed to parse readiness probe", zap.Error(err)) | ||
} | ||
|
||
rp := readiness.NewProbe(coreProbe, logger.With(zap.String(logkey.Key, "readinessProbe"))) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Consider exposing readiness.EncodeProbe()
and readiness.DecodeProbe
to [de]serializing the probe.
cmd/queue/main.go
Outdated
// The 25 millisecond retry interval is an unscientific compromise between wanting to get | ||
// started as early as possible while still wanting to give the container some breathing | ||
// room to get up and running. | ||
kProbePollInterval = 25 * time.Millisecond |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
the naming here seems non-unified.
// Do not use the cached connection | ||
DisableKeepAlives: true, | ||
}, | ||
Timeout: time.Duration(timeoutSeconds) * time.Second, |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Do you really need this cast?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think so?
Timeout: timeoutSeconds * time.Second,
cmd/queue/main.go:278:27: invalid operation: timeoutSeconds * time.Second (mismatched types int and time.Duration)
cmd/queue/main.go
Outdated
// parseProbe takes a json serialised *corev1.Probe and returns a Probe or an error. | ||
func parseProbe(ucProbe string) (*corev1.Probe, error) { | ||
p := &corev1.Probe{} | ||
err := json.Unmarshal([]byte(ucProbe), p) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if err := ...; err != nil {...}
is nicer.
- Refactor http probe into health package & add tests - Refactor probe function for simplicity
- reduced the timings on the unit test from 39s-> 27s
- Add support of http probe scheme - Add support for parsing json probe in queue
Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
Co-authored-by: Shash Reddy <shashwathireddy@gmail.com>
The following is the coverage report on pkg/.
|
Fixes #4014
TODO
probeUserContainer
?Proposed Changes
Release Note