
discovery: fix syncqueue retries #141

Merged (3 commits) Jun 7, 2018

Conversation

alexbrand (Contributor)

Updates #134

I propose we open another issue for spec'ing a new metric that will keep track of the number of retries.

The client-go workqueue expects the user to mark any item that needs to
be retried as "dirty". That is, on failure, the item must be re-added to
the queue for reprocessing at a later time.

The item is added to the queue via AddRateLimited, which will add the
item once the rate limiter says it's okay.

Signed-off-by: Alexander Brand <alexbrand09@gmail.com>
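For illustration, here is a minimal sketch of the retry pattern described above, using the client-go workqueue API. It is not the code from this PR; the worker function and the always-failing handler are hypothetical placeholders.

package main

import (
	"fmt"
	"time"

	"k8s.io/client-go/util/workqueue"
)

const maxRetries = 3

// worker drains the queue, retrying failed items via AddRateLimited until
// maxRetries is reached, then drops them with Forget.
func worker(queue workqueue.RateLimitingInterface, process func(interface{}) error) {
	for {
		obj, shutdown := queue.Get()
		if shutdown {
			return
		}

		err := process(obj)
		switch {
		case err == nil:
			// Success: tell the rate limiter to stop tracking this item.
			queue.Forget(obj)
		case queue.NumRequeues(obj) < maxRetries:
			// Failure: mark the item "dirty" by re-adding it with
			// AddRateLimited; it is re-queued once the rate limiter says it's okay.
			fmt.Printf("error handling item, requeuing: %v\n", err)
			queue.AddRateLimited(obj)
		default:
			// Too many failures: give up and stop tracking the item.
			fmt.Printf("dropping item after %d retries: %v\n", maxRetries, err)
			queue.Forget(obj)
		}

		// Done tells the queue we finished processing this item.
		queue.Done(obj)
	}
}

func main() {
	queue := workqueue.NewRateLimitingQueue(workqueue.DefaultControllerRateLimiter())
	queue.Add("service/foo")

	// A handler that always fails, to exercise the retry path.
	go worker(queue, func(obj interface{}) error {
		return fmt.Errorf("simulated failure for %v", obj)
	})

	time.Sleep(2 * time.Second) // allow a few retries to happen
	queue.ShutDown()
}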

@alexbrand alexbrand requested a review from stevesloka June 5, 2018 15:05

// If there was an error handling the item, we will retry up to
// queueMaxRetries times.
if sq.Workqueue.NumRequeues(obj) < queueMaxRetries {
Member

I think we determined that Forget() didn't do much, yet we still call it in the error-retry case but not in the happy path.

Contributor Author

Forget() does have a function after all. The default rate limiter that we are using has two rate limiters. One of them is the BucketRateLimiter, for which Forget() has an empty implementation. The second is the ItemExponentialFailureRateLimiter, which does perform an operation when calling Forget().

My understanding is that Forget() does not have much to do with the queue itself, but with the rate limiters which are tracking each object in the queue, and determining when they are ready to be retried. By calling Forget(), we tell the rate limiter to stop tracking the object.

Thus, we call Forget() when we no longer want to retry an object. Or, on the flip side, the only place that we don't call Forget() is when we want the rate limiter to continue tracking the object because we are going to retry.

The updated (?) example in client-go talks a bit more about this: https://github.com/kubernetes/client-go/blob/master/examples/workqueue/main.go
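For reference, a sketch of how client-go composes that default rate limiter (constructor parameters may differ between versions), plus a tiny demonstration of how Forget() resets only the per-item backoff:

package main

import (
	"fmt"
	"time"

	"golang.org/x/time/rate"
	"k8s.io/client-go/util/workqueue"
)

// Mirrors client-go's default controller rate limiter: a per-item
// exponential-backoff limiter combined with an overall token bucket.
// Forget() is a no-op on the BucketRateLimiter but clears the failure
// count in the ItemExponentialFailureRateLimiter.
func defaultRateLimiter() workqueue.RateLimiter {
	return workqueue.NewMaxOfRateLimiter(
		workqueue.NewItemExponentialFailureRateLimiter(5*time.Millisecond, 1000*time.Second),
		&workqueue.BucketRateLimiter{Limiter: rate.NewLimiter(rate.Limit(10), 100)},
	)
}

func main() {
	rl := defaultRateLimiter()
	item := "service/foo"
	for i := 0; i < 4; i++ {
		// Each call to When counts as a failure, so the per-item delay grows.
		fmt.Println("next retry after:", rl.When(item))
	}
	rl.Forget(item) // stop tracking the item; its backoff resets
	fmt.Println("after Forget:", rl.When(item))
}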

Member

👍

actionAdd = "add"
actionUpdate = "update"
actionDelete = "delete"
queueMaxRetries = 5
Member
5 might be a bit high of a retry count. Maybe split this out and allow it to be configurable?

Contributor Author

Making it configurable is probably best. Any suggestions for default num retries? 3?

Member

Yeah maybe 3 is better. @rosskukulinski any thoughts?

Contributor

I've read through this PR and #134 and I'm not sure I understand this well enough to explain the impact this value might have on a user. What are the tradeoffs of making this 4 or 6?

Is this something we expect users to have to configure frequently? Does it depend on the size of their backend clusters (services)?

Member

The question is: if a service or endpoint fails to insert/update/delete, how many times should the discoverer retry the operation before giving up? In larger, busier clusters, I could see setting this number lower, since retries add to the size of the queue.

I think 3 is a good default, then LGTM.

Contributor

Ok! I don't see a reason to make this configurable for general users at this time, but if it's helpful for @alexbrand's performance testing/tuning, then I'm +1 for adding configuration now.

Otherwise, LGTM for 3 based on @stevesloka's input.

alexbrand added 2 commits June 7, 2018 08:58
@alexbrand (Contributor Author)

Updated the retry count to 3 instead of 5. I think we can make it configurable in the future if someone has a need. I personally don't have a need right now for perf testing. Thoughts?

@stevesloka (Member) left a comment

Small nit, otherwise LGTM.

// If there was an error handling the item, we will retry up to
// queueMaxRetries times.
if sq.Workqueue.NumRequeues(obj) < queueMaxRetries {
sq.Logger.Infof("Error handling %s: %v. Requeuing.", action, err)
Member

nit: Should this be an Error-level log?

Member

Also might be nice to log the retry count.
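A hypothetical version of that branch applying both suggestions (Error-level logging, assuming the logger exposes Errorf, plus the retry count) might look like this; it is a sketch, not the change that was merged:

// Sketch only: log at Error level and include the retry count before requeuing.
if retries := sq.Workqueue.NumRequeues(obj); retries < queueMaxRetries {
	sq.Logger.Errorf("Error handling %s (retry %d of %d): %v. Requeuing.",
		action, retries+1, queueMaxRetries, err)
	sq.Workqueue.AddRateLimited(obj)
}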

@alexbrand (Contributor Author)

Thanks @stevesloka! Fixed the log call. PTAL

@stevesloka stevesloka merged commit cc9ce58 into projectcontour:master Jun 7, 2018
@alexbrand alexbrand added this to the v0.3 milestone Jun 8, 2018