Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[WIP] Use user-defined readiness probes through queue-proxy #4600

Closed
wants to merge 32 commits into from

Conversation

joshrider
Copy link
Contributor

@joshrider joshrider commented Jul 2, 2019

Fixes #4014

TODO

  • check defaults and validations => lots of questions about desired behaviour here. Make a list.
  • lots of general cleanup
  • make sure zap errors don't produce lengthy errors that destroy readability of event log
  • does activator still keep using probeUserContainer?

Proposed Changes

  • move user-defined readinessProbes from kubelet to the queue-proxy
  • show default probe in Revision spec if no probe provided
  • allow the possibility of either our existing, aggressive probe retry semantics (which should speed up cold start), or standard k8s probe semantics

Release Note


@knative-prow-robot knative-prow-robot added the do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. label Jul 2, 2019
@knative-prow-robot knative-prow-robot added the size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files. label Jul 2, 2019
Copy link
Contributor

@knative-prow-robot knative-prow-robot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@joshrider: 3 warnings.

In response to this:

Fixes #4014

TODO

  • check defaults and validations => lots of questions about desired behaviour here. Make a list.
  • investigate flaky TestAutoScaleUpDownUp
  • lots of general cleanup
  • pkg/reconciler/revision/resources/deploy_test.go only remaining user of queueReadinessProbe
  • pkg/reconciler/revision/resources/queue_test.go probe stuff into table tests?
  • make sure zap errors don't produce lengthy errors that destroy readability of event log
  • does activator still keep using probeUserContainer?

Proposed Changes

  • move user-defined readinessProbes from kubelet to the queue-proxy
  • show default probe in Revision spec if no probe provided
  • allow the possibility of either our existing, aggressive probe retry semantics (which should speed up cold start), or standard k8s probe semantics

Release Note


Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes/test-infra repository.

@knative-prow-robot knative-prow-robot added area/API API objects and controllers area/networking labels Jul 2, 2019
@googlebot
Copy link

So there's good news and bad news.

👍 The good news is that everyone that needs to sign a CLA (the pull request submitter and all commit authors) have done so. Everything is all good there.

😕 The bad news is that it appears that one or more commits were authored or co-authored by someone other than the pull request submitter. We need to confirm that all authors are ok with their commits being contributed to this project. Please have them confirm that here in the pull request.

Note to project maintainer: This is a terminal state, meaning the cla/google commit status will not change from this state. It's up to you to confirm consent of all the commit author(s), set the cla label to yes (if enabled on your project), and then merge this pull request when appropriate.

ℹ️ Googlers: Go here for more info.

@googlebot googlebot added the cla: no Indicates the PR's author has not signed the CLA. label Jul 2, 2019
@joshrider
Copy link
Contributor Author

/assign @mattmoor @dgerd

@knative-prow-robot
Copy link
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: joshrider
To complete the pull request process, please assign mattmoor
You can assign the PR to them by writing /assign @mattmoor in a comment when ready.

The full list of commands accepted by this bot can be found here.

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

mux.HandleFunc(queue.RequestQueueDrainPath, healthState.DrainHandler())

return mux
}

func probeQueueHealthPath(port int, timeout time.Duration) error {
func probeQueueHealthPath(port int, timeout int) error {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Since we are now dealing with an int can be useful to make clear that we are dealing with seconds.

Suggested change
func probeQueueHealthPath(port int, timeout int) error {
func probeQueueHealthPath(port int, timeoutSeconds int) error {

url := fmt.Sprintf(healthURLTemplate, port)

// use aggressive retries
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit:

Suggested change
// use aggressive retries
// Use aggressive sub-second retries.

res, err := httpClient.Get(url)

if err != nil {
return errors.Wrap(err, "failed to probe")
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If the original error does not return the URL this could be valuable for debugging.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'm not sure I understand this comment. Is the suggestion to include the URL in the error message? We build the URL on the fly, but (at present) it will always be http://127.0.0.1:8022/health.

Maybe something better than "failed to probe" is called for?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yep all I meant was fail to probe: http://127.0.0.1:8022/health provides a little bit more detail to someone looking at the error in the logs without having to trace through the code.

}

var lastErr error

// The 25 millisecond retry interval is an unscientific compromise between wanting to get
// started as early as possible while still wanting to give the container some breathing
// room to get up and running.
timeoutErr := wait.PollImmediate(25*time.Millisecond, timeout, func() (bool, error) {
timeoutErr := wait.PollImmediate(25*time.Millisecond, readiness.PollTimeout, func() (bool, error) {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: Make the 25ms a constant rather than a magic value here

@@ -353,9 +395,16 @@ func main() {
StatChan: statChan,
}, time.Now())

coreProbe, err := parseProbe(*ucProbe)
if err != nil {
logger.Fatalf("Queue container failed to parse readiness probe %#v", err)
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
logger.Fatalf("Queue container failed to parse readiness probe %#v", err)
logger.Fatalw("Queue container failed to parse readiness probe", zap.Error(err))

@@ -293,11 +325,21 @@ func probeQueueHealthPath(port int, timeout time.Duration) error {
return nil
}

// parseProbe takes a json serialised *corev1.Probe and returns a Probe or an error
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nit: comments for declarations should be sentences and end in periods.

Suggested change
// parseProbe takes a json serialised *corev1.Probe and returns a Probe or an error
// parseProbe takes a json serialised *corev1.Probe and returns a Probe or an error.

}

func (p *Probe) timeout() time.Duration {
if p.IsStandardProbe() {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this supposed to be inversed?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think it's correct, but defaultProbeTimeout is poorly named.

This represents the maximum length of time waited for a TCP connection/HTTP response. For a user-supplied probe, we are using the value given to us for our single attempt. For our aggressive retries, we timeout much more quickly and just try again.

mux := http.NewServeMux()
mux.HandleFunc(requestQueueHealthPath, healthState.HealthHandler(probeUserContainer))

if p.IsStandardProbe() {
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

There seems to be lots of commonality between HealthHandler and HealthHandlerKProbe. Why the switch here rather than just having a HealthHandler function that configures based on StandardProbe flag?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Made a change. LMK what you think.

}

errs := validateProbe(p)

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like we have previously not validated the bounds of periodSeconds and let the K8s API server handle the validation. Since we are using the periodSeconds in our controllers now we probably should add the validation that this is a positive number.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably also split the changes to pkg/apis into it's own change, since it's fairly independent.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Added a check to make sure periodSeconds is non-negative (allowing 0 to signal the aggressive probe) and checks to make sure timeoutSeconds and failureThreshold are non-negative in the case where we're not using the aggressive probe.

)

const (
defaultProbeTimeout = 100 * time.Millisecond
Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is defined here and in main.go. Is this supposed to stay in sync? If so we should make this public and share it between the two places.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The probe timeout in probe.go and main.go are used for different things that don't necessary need to sync.

defaultProbeTimeout in main.go is used in probeUserContainer which is used as a signal to the activator that the pod is up and ready to accept traffic. If we want to keep using the pre-existing, hard-coded TCP connect as the activator's test for readiness, it seems like the constants are two similar, but different, things.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Should the activator's test be the same as the one that's set off from kubelet?

Copy link

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

@vagababov @mattmoor Any thoughts here?

Copy link
Member

@mattmoor mattmoor left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the change @joshrider

Left a few comments throughout, but mainly this is a big change. I think it'll be clearer to tackle in pieces now that you seem to have things working. I made a couple suggestions on places this might split nicely.

@@ -134,8 +134,13 @@ func makePodSpec(rev *v1alpha1.Revision, loggingConfig *logging.Config, observab
userContainer.TerminationMessagePolicy = corev1.TerminationMessageFallbackToLogsOnError
}

if userContainer.ReadinessProbe != nil {
if userContainer.ReadinessProbe.HTTPGet != nil || userContainer.ReadinessProbe.TCPSocket != nil {
userContainer.ReadinessProbe = nil
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is worth a comment for those without the context of this code review.

}

// IsKProbe indicates whether the default Knative probe with aggressive retries should be used.
func (p *Probe) IsKProbe() bool {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Perhaps use a more descriptive name like IsAggressive or HasUnrestrictedInterval?

} else if p.TCPSocket != nil {
err = p.tcpProbe()
} else if p.Exec != nil {
// Assumes the true readinessProbe is being executed directly on user container. See (#4086).
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I wonder if we should still TCP probe in this case, since we otherwise have no verification of the networking path from the queue to the user-container.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

cc @dgerd

return true
} else {
// using Fprintf for a concise error message in the event log
fmt.Fprintf(os.Stderr, "unimplemented probe type.")
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is there any state we can add to this to surface what type we did get?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We can print the whole corev1.Probe here, but all this really means at this point is that all 3 of the extant probe types are nil. Is it worth printing the whole object to catch future cases where there may be an additional probe type?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While leads me to the possibility that the error message may be less precise than it could be...

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Also fmt.Fprint, since no formatting.

// TCPProbe checks that a TCP socket to the address can be opened.
// Did not reuse k8s.io/kubernetes/pkg/probe/tcp to not create a dependency
// on klog.
func TCPProbe(addr string, socketTimeout time.Duration) error {
conn, err := net.DialTimeout("tcp", addr, socketTimeout)
func TCPProbe(config TCPProbeConfigOptions) error {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would also consider splitting just the changes needed for this package into its own PR.

if container.ReadinessProbe == nil {
container.ReadinessProbe = &corev1.Probe{}
}
if container.ReadinessProbe.TCPSocket == nil && container.ReadinessProbe.HTTPGet == nil {
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What if it is an exec probe?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch 👍

}

errs := validateProbe(p)

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd probably also split the changes to pkg/apis into it's own change, since it's fairly independent.


probeJSON, err := json.Marshal(rp)
if err != nil {
probeJSON = []byte{}
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

🤔 I don't like swallowing errors :(

logger.Fatalw("Queue container failed to parse readiness probe", zap.Error(err))
}

rp := readiness.NewProbe(coreProbe, logger.With(zap.String(logkey.Key, "readinessProbe")))
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Consider exposing readiness.EncodeProbe() and readiness.DecodeProbe to [de]serializing the probe.

// The 25 millisecond retry interval is an unscientific compromise between wanting to get
// started as early as possible while still wanting to give the container some breathing
// room to get up and running.
kProbePollInterval = 25 * time.Millisecond
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

the naming here seems non-unified.

// Do not use the cached connection
DisableKeepAlives: true,
},
Timeout: time.Duration(timeoutSeconds) * time.Second,
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Do you really need this cast?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think so?

Timeout: timeoutSeconds * time.Second,

cmd/queue/main.go:278:27: invalid operation: timeoutSeconds * time.Second (mismatched types int and time.Duration)

// parseProbe takes a json serialised *corev1.Probe and returns a Probe or an error.
func parseProbe(ucProbe string) (*corev1.Probe, error) {
p := &corev1.Probe{}
err := json.Unmarshal([]byte(ucProbe), p)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

if err := ...; err != nil {...}
is nicer.

joshrider and others added 9 commits July 8, 2019 10:00
- Refactor http probe into health package & add tests
- Refactor probe function for simplicity
- reduced the timings on the unit test from 39s-> 27s
- Add support of http probe scheme
- Add support for parsing json probe in queue
@knative-metrics-robot
Copy link

The following is the coverage report on pkg/.
Say /test pull-knative-serving-go-coverage to re-run this coverage report

File Old Coverage New Coverage Delta
pkg/apis/serving/k8s_validation.go 98.9% 99.0% 0.1
pkg/apis/serving/v1beta1/revision_defaults.go 87.5% 89.5% 2.0
pkg/queue/health/probe.go 100.0% 90.9% -9.1
pkg/queue/readiness/probe.go Do not exist 100.0%
pkg/queue/readiness/probe_encoding.go Do not exist 87.5%

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
area/API API objects and controllers area/networking cla: no Indicates the PR's author has not signed the CLA. do-not-merge/work-in-progress Indicates that a PR should not merge because it is a work in progress. size/XXL Denotes a PR that changes 1000+ lines, ignoring generated files.
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Apply default livenessProbe and readinessProbe to the user container
9 participants