Retry initial backend configuration #3230
Conversation
@coreypobrien this makes sense but please use this approach https://github.com/kubernetes/ingress-nginx/blob/master/cmd/nginx/main.go#L205-227

/lgtm

@coreypobrien thanks!

/hold
```go
// start listening on the configured port (default 18080)
// For large configurations it might take a while so we loop
// and back off
steps = 10
time.Sleep(1 * time.Second)
```
I was hoping we could get rid of this with this PR.
We definitely could, but there will always be at least 1 failure without the initial sleep. To avoid always failing on the first attempt, I think we should leave in the initial 1 second sleep.
@coreypobrien what do you think about getting rid of isFirstSync completely? We can unconditionally set steps to 10.
IMO it's OK to have configureDynamically fail a few times during the first sync.
That way we can simplify this code and also make sure we have retry logic for subsequent reconfigurations as well.
```go
}
glog.Warningf("Dynamic reconfiguration failed: %v", err)
return false, nil
})
```
When this fails after steps are exhausted, can you make sure we log an error, so that people can set up their alerts accordingly?
```go
// start listening on the configured port (default 18080)
// For large configurations it might take a while so we loop
// and back off
steps = 10
```
With 10 steps (and other retry configuration below), what's the longest time it can potentially take to finally give up and fail?
```
try 1
wait for 0 + rand(1, 1+0.1) * 1.5 < 1.65
try 2
wait for prev_duration + rand(1, 1+0.1) * 1.5 < 1.65 + rand(1, 1+0.1) * 1.5 < 1.65 * 2
try 3
wait for < 1.65 * 3
try 4
wait for < 1.65 * 4
== by this time it would have taken < 1.65 + 1.65 * 2 + 1.65 * 3 ==
...
try 9
wait will be < 1.65 * 9
try last time
exit
```
So by the time it's trying the n'th time it would have waited 1.65 * (1 + 2 + ... + (n - 1)) seconds in the worst case. Setting n to 10, that's 1.65 * 45 ≈ 74 seconds before it fails.
Can you confirm it will work like above? Is this realistic? It seems to me setting steps to 10 is a bit too much. It can mask a real problem instead of failing fast.
@ElvinEfendi here:
```go
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func main() {
	retry := wait.Backoff{
		Steps:    10,
		Duration: 1 * time.Second,
		Factor:   1.5,
		Jitter:   0.1,
	}
	start := time.Now()
	wait.ExponentialBackoff(retry, func() (bool, error) {
		fmt.Printf("%s\n", time.Now().Sub(start).Round(time.Millisecond))
		return false, nil
	})
	fmt.Printf("Total time: %v\n", time.Now().Sub(start).Round(time.Millisecond))
}
```
```
$ go run main.go
0s
1.061s
2.702s
5.102s
8.625s
13.902s
22.018s
33.483s
50.837s
1m16.715s
Total time: 1m16.715s
```
I think five steps is more than enough
Or use a factor of 1.1
```
$ go run main.go
0s
1.061s
2.264s
3.555s
4.944s
6.471s
8.192s
9.975s
11.955s
14.119s
Total time: 14.119s
```
thanks @aledbf
/hold cancel

we will address the above suggestions in a subsequent PR

/lgtm
[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: aledbf, coreypobrien, ElvinEfendi. The full list of commands accepted by this bot can be found here. The pull request process is described here.

Needs approval from an approver in each of these files.
Approvers can indicate their approval by writing `/approve` in a comment.
Initially when nginx starts, it isn't configured. After loading the initial configuration it will start listening for dynamic configuration. To avoid sending the initial list of backends before nginx is actually listening, there is currently a static 1 second sleep during the initial setup.

For larger or more complex configurations, the delay before nginx begins listening for backend configurations can exceed 1 second. In that case there is no retry, so the configuration is never loaded, nginx health checks fail forever, and the pod goes into a crash loop.

This PR adds a retry with backoff during the initial configuration to handle these larger delays. For most deployments, this should behave identically to the current code: sleep 1s, then succeed. For the slower ones, if the initial call to `/configuration/backends` fails, the subsequent retries should eventually apply the backend configuration.