Watching logs fails when etcd event history is cleared #7174
Starting a watch from a build's resource version might not work if enough other changes have occurred... if a watch fails for that reason, I think we have to re-list, filtered to that build, and take the list's resourceVersion as our watch resourceVersion.
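A minimal sketch of that re-list-and-re-watch idea in Go; the `lister` and its `List` signature are assumptions for illustration, not code from this repository, and the fragment is meant to live inside a function like `WaitForRunningBuild`:

```go
// Sketch only: when a watch fails because the build's ResourceVersion is too
// old, relist (filtered to this build) and restart the watch from the list's
// ResourceVersion, which reflects current state rather than old history.
fieldSelector := fields.OneTermEqualSelector("metadata.name", build.Name)

obj, err := lister.List(ctx, &kapi.ListOptions{FieldSelector: fieldSelector}) // hypothetical lister
if err != nil {
	return nil, false, err
}
list := obj.(*api.BuildList)

options := &kapi.ListOptions{
	FieldSelector:   fieldSelector,
	ResourceVersion: list.ResourceVersion, // watch from "now", not from the stale build version
}
w, err := watcher.Watch(ctx, options)
```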
FWIW, I believe this is not only affecting builds, but deployments as well, as the code is pretty much the same.
I'm working on a fix for this. Will detect the error code and retry.
Make sure to limit the retry.
The retry is limited by a timeout. We can also have some kind of throttling.
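A rough sketch of bounding the retries with an overall deadline plus a short sleep between attempts; `tryWatchOnce` is a hypothetical helper standing in for one watch attempt, not part of the actual patch:

```go
// Illustrative fragment: stop retrying once the overall deadline passes,
// and throttle attempts with a short sleep in between.
deadline := time.Now().Add(timeout)
for attempt := 1; ; attempt++ {
	done, err := tryWatchOnce() // hypothetical: one watch attempt, true once the build is running
	if err == nil && done {
		break
	}
	if time.Now().After(deadline) {
		return fmt.Errorf("gave up after %v (%d attempts): %v", timeout, attempt, err)
	}
	time.Sleep(500 * time.Millisecond) // crude throttling between retries
}
```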
FWIW, it seems that just retrying with a new watcher is not a solution. What the test runs in Jenkins showed is that every new watch expires as well. BTW, I might have put the retry logic in the wrong place.

```diff
diff --git a/pkg/build/registry/rest.go b/pkg/build/registry/rest.go
index 3647344..69c319c 100644
--- a/pkg/build/registry/rest.go
+++ b/pkg/build/registry/rest.go
@@ -6,6 +6,7 @@ import (
 	kapi "k8s.io/kubernetes/pkg/api"
 	"k8s.io/kubernetes/pkg/api/rest"
+	"k8s.io/kubernetes/pkg/api/unversioned"
 	"k8s.io/kubernetes/pkg/fields"

 	"github.com/openshift/origin/pkg/build/api"
@@ -20,33 +21,44 @@ var ErrUnknownBuildPhase = fmt.Errorf("unknown build phase")
 func WaitForRunningBuild(watcher rest.Watcher, ctx kapi.Context, build *api.Build, timeout time.Duration) (*api.Build, bool, error) {
 	fieldSelector := fields.OneTermEqualSelector("metadata.name", build.Name)
 	options := &kapi.ListOptions{FieldSelector: fieldSelector, ResourceVersion: build.ResourceVersion}
-	w, err := watcher.Watch(ctx, options)
-	if err != nil {
-		return nil, false, err
-	}
-	defer w.Stop()
-	ch := w.ResultChan()
-	observed := build
 	expire := time.After(timeout)
-	for {
-		select {
-		case event := <-ch:
-			obj, ok := event.Object.(*api.Build)
-			if !ok {
-				return observed, false, fmt.Errorf("received unknown object while watching for builds")
-			}
-			observed = obj
-			switch obj.Status.Phase {
-			case api.BuildPhaseRunning, api.BuildPhaseComplete, api.BuildPhaseFailed, api.BuildPhaseError, api.BuildPhaseCancelled:
-				return observed, true, nil
-			case api.BuildPhaseNew, api.BuildPhasePending:
-			default:
-				return observed, false, ErrUnknownBuildPhase
+watcherLoop:
+	for {
+		w, err := watcher.Watch(ctx, options)
+		if err != nil {
+			return nil, false, err
+		}
+		defer w.Stop()
+		ch := w.ResultChan()
+	eventLoop:
+		for {
+			select {
+			case event := <-ch:
+				switch observed := event.Object.(type) {
+				case *api.Build:
+					switch observed.Status.Phase {
+					case api.BuildPhaseRunning, api.BuildPhaseComplete, api.BuildPhaseFailed, api.BuildPhaseError, api.BuildPhaseCancelled:
+						// Build has started; return it.
+						return observed, true, nil
+					case api.BuildPhaseNew, api.BuildPhasePending:
+						// Build hasn't started yet; keep waiting for more events.
+						continue eventLoop
+					default:
+						// Build has an unknown phase.
+						return observed, false, ErrUnknownBuildPhase
+					}
+				case *unversioned.Status:
+					if observed.Reason == unversioned.StatusReasonExpired {
+						// The watch expired; start over with a new watcher.
+						continue watcherLoop
+					}
+				}
+				return build, false, fmt.Errorf("received unknown object while watching for builds: %v", event.Object)
+			case <-expire:
+				return build, false, nil
 			}
-		case <-expire:
-			return observed, false, nil
 		}
 	}
 }
```
The retry shall go in
Right, if the watch is expired you have to relist the builds (probably filtered to that particular build) to get a newly watchable version (which would be the resourceVersion of the BuildsList)
I'd rather it be server side... that's where we're turning the list into a watch.
Could you please have a look at the patch above? Does it make sense? I thought it would do what you described, but in practice all new watches keep expiring until we hit the timeout... In start-build we already detect timeouts and try again... so if we had --wait I guess we could retry fetching the logs regardless of whether we get timeouts or any other error (in this case, the expired watch).
No, you are using the same build that was passed in, with the same ResourceVersion, to start your watch inside the loop. You need to relist the builds inside the loop (probably filtered to that particular build) to get a newly watchable version (which would be the resourceVersion of the BuildsList). The error you're getting ("The event in requested index is outdated and cleared") means the resource version of the build is too old to start a watch from.
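Concretely, the expired-watch branch in the patch above would need to refresh `options.ResourceVersion` from a fresh list before continuing. A sketch of that branch, again assuming a hypothetical `lister` is available to `WaitForRunningBuild` (its current signature does not provide one):

```go
case *unversioned.Status:
	if observed.Reason == unversioned.StatusReasonExpired {
		// Relist (filtered to this build) so the next watch starts from a
		// ResourceVersion etcd can still serve. `lister` is hypothetical here.
		obj, listErr := lister.List(ctx, &kapi.ListOptions{FieldSelector: fieldSelector})
		if listErr != nil {
			return build, false, listErr
		}
		options.ResourceVersion = obj.(*api.BuildList).ResourceVersion
		continue watcherLoop
	}
```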
@liggitt thanks. The build is being fetched in pkg/build/registry/buildlog/rest.go, so that's the whole portion we need to repeat. Should we limit the retry within a timeout, a max number of tries, or both? Right now
Not sure... feels odd for something outside this function to be retrying on very specific watch errors.
Hmmm, either that or retry on the client side, no?
-1 for client-side retrying for things like this; it means we have to reimplement it in all our clients (we have 3 today and will have more in the future).
Closing this. Most likely the problem was a wrong expectation. We were trying to start builds in
According to @smarterclayton, the logs endpoint should handle expired watches.
From #6715 / https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/10489/console:
The watcher is called in pkg/build/registry/rest.go (in WaitForRunningBuild, shown in the patch above).
The error is coming from coreos/etcd/store/event_history.go.
Most likely there's some timing problem that causes the index we pass to be less than the EventHistory's StartIndex.
The error message in Jenkins printed StartIndex/index = 1332/1317, so we're "off by 16 events". Other runs were off by a different number of events, but a rather small number anyway.
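For reference, etcd v2 keeps its watch history in a bounded ring buffer, and the relevant check in store/event_history.go is roughly the following (a paraphrase reconstructed from memory, not the exact upstream source):

```go
// Paraphrased sketch, not the exact upstream code: a watch that asks for an
// index older than the history's StartIndex is rejected with
// EcodeEventIndexCleared, which clients see as "The event in requested index
// is outdated and cleared", printed with the pair StartIndex/index.
if index < eh.StartIndex {
	return nil, etcdErr.NewError(etcdErr.EcodeEventIndexCleared,
		fmt.Sprintf("the requested history has been cleared [%v/%v]", eh.StartIndex, index), 0)
}
```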