
Watching logs fails when etcd event history is cleared #7174

Closed
rhcarvalho opened this issue Feb 10, 2016 · 15 comments
@rhcarvalho
Contributor

According to @smarterclayton, the logs endpoint should handle expired watches.

From #6715 / https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/10489/console:

FAILURE after 1801.720s: hack/../test/cmd/builds.sh:82: executing 'oc start-build --wait --follow busybox' expecting success and text 'hello world': the command returned the wrong error code; the output content test failed
Standard output from the command:
busybox-1
Standard error from the command:
error getting logs: unable to wait for build busybox-1 to run: received unknown object while watching for builds: &unversioned.Status{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ListMeta:unversioned.ListMeta{SelfLink:"", ResourceVersion:""}, Status:"Failure", Message:"401: The event in requested index is outdated and cleared (the requested history has been cleared [1332/1317]) [2331]", Reason:"Expired", Details:(*unversioned.StatusDetails)(nil), Code:410}

The watcher is called in pkg/build/registry/rest.go:

// WaitForRunningBuild waits until the specified build is no longer New or Pending. Returns true if
// the build ran within timeout, false if it did not, and an error if any other error state occurred.
// The last observed Build state is returned.
func WaitForRunningBuild(watcher rest.Watcher, ctx kapi.Context, build *api.Build, timeout time.Duration) (*api.Build, bool, error) {
    fieldSelector := fields.OneTermEqualSelector("metadata.name", build.Name)
    options := &kapi.ListOptions{FieldSelector: fieldSelector, ResourceVersion: build.ResourceVersion}
    w, err := watcher.Watch(ctx, options)

The error is coming from coreos/etcd/store/event_history.go:

// scan enumerates events from the index history and stops at the first point
// where the key matches.
func (eh *EventHistory) scan(key string, recursive bool, index uint64) (*Event, *etcdErr.Error) {
    eh.rwl.RLock()
    defer eh.rwl.RUnlock()

    // index should be after the event history's StartIndex
    if index < eh.StartIndex {
        return nil,
            etcdErr.NewError(etcdErr.EcodeEventIndexCleared,
                fmt.Sprintf("the requested history has been cleared [%v/%v]",
                    eh.StartIndex, index), 0)
    }

Most likely there is a timing problem that causes the index we pass to be less than the EventHistory's StartIndex.

The error message in Jenkins printed StartIndex/index = 1332/1317, so the requested index was 15 events behind. Other runs were behind by different, but similarly small, amounts.

@liggitt
Contributor

liggitt commented Feb 10, 2016

starting a watch from a build's resource version might not work if enough other changes have occurred... if a watch fails for that reason, I think we have to re-list, filtered to that build, and take the list's resourceVersion as our watch resourceVersion
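
A minimal sketch of that list-then-watch ordering, with hypothetical BuildLister and BuildWatcher interfaces standing in for the real storage API (this is not the actual Origin code):

package watchutil

import "fmt"

// BuildLister and BuildWatcher are hypothetical stand-ins; the point is the
// ordering: list first, then watch from the list's resourceVersion.
type BuildLister interface {
    // List returns the builds matching name plus the collection's
    // resourceVersion at the time of the list.
    List(name string) (builds []string, resourceVersion string, err error)
}

type BuildWatcher interface {
    // Watch streams events that happened after resourceVersion.
    Watch(resourceVersion string) (<-chan string, error)
}

// watchFromFreshVersion re-lists first and starts the watch from the list's
// resourceVersion, which etcd can still serve, instead of the build's
// possibly compacted resourceVersion.
func watchFromFreshVersion(l BuildLister, w BuildWatcher, name string) (<-chan string, error) {
    _, rv, err := l.List(name)
    if err != nil {
        return nil, fmt.Errorf("re-list failed: %v", err)
    }
    return w.Watch(rv)
}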

@rhcarvalho
Contributor Author

FWIW, I believe this affects not only builds but deployments as well, as the code is pretty much the same.

@rhcarvalho
Contributor Author

I'm working on a fix for this; it will detect the error code and retry.

@mfojtik assigned rhcarvalho and unassigned bparees on Feb 11, 2016
@smarterclayton
Contributor

Make sure to limit the retry.


@rhcarvalho
Contributor Author

The retry is limited by a timeout. We could also add some kind of throttling.
But first I need to see the retries actually produce a Build rather than another expired watch.
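
For illustration, a retry bounded by an overall timeout and throttled between attempts could look roughly like this; errExpired and tryWatch are hypothetical placeholders, not the actual fix:

package retryutil

import (
    "errors"
    "fmt"
    "time"
)

// errExpired is a hypothetical sentinel for the 410 "Expired" watch error.
var errExpired = errors.New("watch expired")

// watchWithRetry retries tryWatch on expired watches until the overall
// timeout elapses, sleeping between attempts so we don't hammer the server.
func watchWithRetry(tryWatch func() error, timeout, throttle time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        err := tryWatch()
        if err != errExpired {
            return err // success (nil) or an error we don't retry on
        }
        time.Sleep(throttle)
    }
    return fmt.Errorf("watch still expired after %v", timeout)
}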

@rhcarvalho
Contributor Author

FWIW, it seems that just retrying with a new watcher is not a solution.

The test runs in Jenkins showed that every new watcher.Watch starts already in an expired state; trying again makes it no better.

BTW, I might have put the retry logic in the wrong place.

diff --git a/pkg/build/registry/rest.go b/pkg/build/registry/rest.go
index 3647344..69c319c 100644
--- a/pkg/build/registry/rest.go
+++ b/pkg/build/registry/rest.go
@@ -6,6 +6,7 @@ import (

        kapi "k8s.io/kubernetes/pkg/api"
        "k8s.io/kubernetes/pkg/api/rest"
+       "k8s.io/kubernetes/pkg/api/unversioned"
        "k8s.io/kubernetes/pkg/fields"

        "github.com/openshift/origin/pkg/build/api"
@@ -20,33 +21,44 @@ var ErrUnknownBuildPhase = fmt.Errorf("unknown build phase")
 func WaitForRunningBuild(watcher rest.Watcher, ctx kapi.Context, build *api.Build, timeout time.Duration) (*api.Build, bool, error) {
        fieldSelector := fields.OneTermEqualSelector("metadata.name", build.Name)
        options := &kapi.ListOptions{FieldSelector: fieldSelector, ResourceVersion: build.ResourceVersion}
-       w, err := watcher.Watch(ctx, options)
-       if err != nil {
-               return nil, false, err
-       }
-       defer w.Stop()

-       ch := w.ResultChan()
-       observed := build
        expire := time.After(timeout)
-       for {
-               select {
-               case event := <-ch:
-                       obj, ok := event.Object.(*api.Build)
-                       if !ok {
-                               return observed, false, fmt.Errorf("received unknown object while watching for builds")
-                       }
-                       observed = obj

-                       switch obj.Status.Phase {
-                       case api.BuildPhaseRunning, api.BuildPhaseComplete, api.BuildPhaseFailed, api.BuildPhaseError, api.BuildPhaseCancelled:
-                               return observed, true, nil
-                       case api.BuildPhaseNew, api.BuildPhasePending:
-                       default:
-                               return observed, false, ErrUnknownBuildPhase
+watcherLoop:
+       for {
+               w, err := watcher.Watch(ctx, options)
+               if err != nil {
+                       return nil, false, err
+               }
+               defer w.Stop()
+               ch := w.ResultChan()
+       eventLoop:
+               for {
+                       select {
+                       case event := <-ch:
+                               switch observed := event.Object.(type) {
+                               case *api.Build:
+                                       switch observed.Status.Phase {
+                                       case api.BuildPhaseRunning, api.BuildPhaseComplete, api.BuildPhaseFailed, api.BuildPhaseError, api.BuildPhaseCancelled:
+                                               // Build has started, return it.
+                                               return observed, true, nil
+                                       case api.BuildPhaseNew, api.BuildPhasePending:
+                                               // Build hasn't started yet, continue waiting for more events.
+                                               continue eventLoop
+                                       default:
+                                               // Build has an unknown phase.
+                                               return observed, false, ErrUnknownBuildPhase
+                                       }
+                               case *unversioned.Status:
+                                       if observed.Reason == unversioned.StatusReasonExpired {
+                                               // The watcher expired, need to start over with a new watcher.
+                                               continue watcherLoop
+                                       }
+                               }
+                               return build, false, fmt.Errorf("received unknown object while watching for builds: %v", event.Object)
+                       case <-expire:
+                               return build, false, nil
                        }
-               case <-expire:
-                       return observed, false, nil
                }
        }
 }

@rhcarvalho
Contributor Author

The retry should go in oc start-build, i.e. client code, if anywhere...

@liggitt
Contributor

liggitt commented Feb 12, 2016

The test runs in Jenkins showed that every new watcher.Watch starts already in an expired state.

Right, if the watch is expired you have to relist the builds (probably filtered to that particular build) to get a newly watchable version (which would be the resourceVersion of the BuildList).

The retry should go in oc start-build, i.e. client code, if anywhere...

I'd rather it be server side... that's where we're turning the list into a watch

@rhcarvalho
Contributor Author

I'd rather it be server side... that's where we're turning the list into a watch

Could you please have a look at the patch above? Does it make sense?

I thought that would do what you described, but in practice all new watches keep expiring until we hit the timeout...

In start-build we already detect timeouts and try again... so if we had --wait I guess we could retry fetching the logs regardless of whether we get a timeout or any other error (in this case, the expired watch).

@liggitt
Contributor

liggitt commented Feb 12, 2016

Could you please have a look at the patch above? Does it make sense?

No, you are using the same build that was passed in, with the same ResourceVersion, to start your watch inside the loop. You need to relist the builds inside the loop (probably filtered to that particular build) to get a newly watchable version (which would be the resourceVersion of the BuildList).

The error you're getting ("The event in requested index is outdated and cleared") means the resource version of the build is too old to start a watch from.
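
A sketch of that shape, re-listing inside the loop so each watch starts from a fresh resourceVersion; the store interface and event struct below are hypothetical, not the Origin code:

package watchsketch

import "fmt"

// event and store are hypothetical; a real implementation would use
// watch.Event and the registry's storage interfaces.
type event struct {
    expired bool // the server reported a 410 Expired status
    running bool // the build reached a running or terminal phase
}

type store interface {
    // listBuilds returns the collection's current resourceVersion.
    listBuilds(name string) (resourceVersion string, err error)
    // watch streams events that happen after resourceVersion.
    watch(resourceVersion string) (<-chan event, error)
}

// waitForBuild re-lists on every attempt, so the watch never reuses the
// build's original, possibly compacted, resourceVersion.
func waitForBuild(s store, name string, maxRetries int) error {
    for attempt := 0; attempt < maxRetries; attempt++ {
        rv, err := s.listBuilds(name) // fresh resourceVersion each time
        if err != nil {
            return err
        }
        ch, err := s.watch(rv)
        if err != nil {
            return err
        }
        for ev := range ch {
            if ev.expired {
                break // history cleared again; re-list and retry
            }
            if ev.running {
                return nil
            }
        }
    }
    return fmt.Errorf("build %s: watch expired %d times in a row", name, maxRetries)
}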

@rhcarvalho
Contributor Author

@liggitt thanks. The build is being fetched in pkg/build/registry/buildlog/rest.go, so that's the whole portion we need to repeat.

Should we limit the retry with a timeout, a maximum number of tries, or both? Right now r.Get is not limited by a timeout; only WaitForRunningBuild is bounded by the defaultTimeout of 10s.

@liggitt
Contributor

liggitt commented Feb 12, 2016

not sure... feels odd for something outside this function to be retrying on very specific watch errors

@rhcarvalho
Contributor Author

Hmmm, either that or retry on the client side, no?

@liggitt
Contributor

liggitt commented Feb 12, 2016

-1 for client-side retrying for things like this; it means we have to reimplement it in all our clients (we have 3 today and will have more in the future)

@rhcarvalho
Contributor Author

Closing this. Most likely the problem was a wrong expectation: we were trying to start builds in test/cmd tests, but the cluster where those tests run does not support running builds.
