
Watching logs fails when etcd event history is cleared #7174

Closed
rhcarvalho opened this issue Feb 10, 2016 · 15 comments
@rhcarvalho
Contributor

According to @smarterclayton, the logs endpoint should handle expired watches.

From #6715 / https://ci.openshift.redhat.com/jenkins/job/test_pull_requests_origin/10489/console:

FAILURE after 1801.720s: hack/../test/cmd/builds.sh:82: executing 'oc start-build --wait --follow busybox' expecting success and text 'hello world': the command returned the wrong error code; the output content test failed
Standard output from the command:
busybox-1
Standard error from the command:
error getting logs: unable to wait for build busybox-1 to run: received unknown object while watching for builds: &unversioned.Status{TypeMeta:unversioned.TypeMeta{Kind:"", APIVersion:""}, ListMeta:unversioned.ListMeta{SelfLink:"", ResourceVersion:""}, Status:"Failure", Message:"401: The event in requested index is outdated and cleared (the requested history has been cleared [1332/1317]) [2331]", Reason:"Expired", Details:(*unversioned.StatusDetails)(nil), Code:410}

The watcher is called in pkg/build/registry/rest.go:

// WaitForRunningBuild waits until the specified build is no longer New or Pending. Returns true if
// the build ran within timeout, false if it did not, and an error if any other error state occurred.
// The last observed Build state is returned.
func WaitForRunningBuild(watcher rest.Watcher, ctx kapi.Context, build *api.Build, timeout time.Duration) (*api.Build, bool, error) {
    fieldSelector := fields.OneTermEqualSelector("metadata.name", build.Name)
    options := &kapi.ListOptions{FieldSelector: fieldSelector, ResourceVersion: build.ResourceVersion}
    w, err := watcher.Watch(ctx, options)

The error is coming from coreos/etcd/store/event_history.go:

// scan enumerates events from the index history and stops at the first point
// where the key matches.
func (eh *EventHistory) scan(key string, recursive bool, index uint64) (*Event, *etcdErr.Error) {
    eh.rwl.RLock()
    defer eh.rwl.RUnlock()

    // index should be after the event history's StartIndex
    if index < eh.StartIndex {
        return nil,
            etcdErr.NewError(etcdErr.EcodeEventIndexCleared,
                fmt.Sprintf("the requested history has been cleared [%v/%v]",
                    eh.StartIndex, index), 0)
    }

Most likely there is a timing problem that causes the index we pass to be less than the EventHistory's StartIndex.

The error message in Jenkins printed StartIndex/index = 1332/1317, so the requested index was 15 events behind. Other runs were behind by different, but similarly small, amounts.

@liggitt
Contributor

liggitt commented Feb 10, 2016

starting a watch from a build's resource version might not work if enough other changes have occurred... if a watch fails for that reason, I think we have to re-list, filtered to that build, and take the list's resourceVersion as our watch resourceVersion
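
A minimal sketch of that list-then-watch ordering, with hypothetical BuildLister and BuildWatcher interfaces standing in for the real storage API (this is not the actual Origin code):

package watchutil

import "fmt"

// BuildLister and BuildWatcher are hypothetical stand-ins; the point is the
// ordering: list first, then watch from the list's resourceVersion.
type BuildLister interface {
    // List returns the builds matching name plus the collection's
    // resourceVersion at the time of the list.
    List(name string) (builds []string, resourceVersion string, err error)
}

type BuildWatcher interface {
    // Watch streams events that happened after resourceVersion.
    Watch(resourceVersion string) (<-chan string, error)
}

// watchFromFreshVersion re-lists first and starts the watch from the list's
// resourceVersion, which etcd can still serve, instead of the build's
// possibly compacted resourceVersion.
func watchFromFreshVersion(l BuildLister, w BuildWatcher, name string) (<-chan string, error) {
    _, rv, err := l.List(name)
    if err != nil {
        return nil, fmt.Errorf("re-list failed: %v", err)
    }
    return w.Watch(rv)
}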

@rhcarvalho
Contributor Author

FWIW, I believe this affects not only builds but deployments as well, as the code is pretty much the same.

@rhcarvalho
Contributor Author

I'm working on a fix for this; it will detect the error code and retry.

@mfojtik assigned rhcarvalho and unassigned bparees on Feb 11, 2016
@smarterclayton
Contributor

Make sure to limit the retry.


@rhcarvalho
Contributor Author

The retry is limited by a timeout. We could also add some kind of throttling.
But first I need to see the retries actually produce a Build rather than another expired watch.
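
For illustration, a retry bounded by an overall timeout and throttled between attempts could look roughly like this; errExpired and tryWatch are hypothetical placeholders, not the actual fix:

package retryutil

import (
    "errors"
    "fmt"
    "time"
)

// errExpired is a hypothetical sentinel for the 410 "Expired" watch error.
var errExpired = errors.New("watch expired")

// watchWithRetry retries tryWatch on expired watches until the overall
// timeout elapses, sleeping between attempts so we don't hammer the server.
func watchWithRetry(tryWatch func() error, timeout, throttle time.Duration) error {
    deadline := time.Now().Add(timeout)
    for time.Now().Before(deadline) {
        err := tryWatch()
        if err != errExpired {
            return err // success (nil) or an error we don't retry on
        }
        time.Sleep(throttle)
    }
    return fmt.Errorf("watch still expired after %v", timeout)
}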

@rhcarvalho
Contributor Author

FWIW, it seems that just retrying with a new watcher is not a solution.

The test runs in Jenkins showed that every new watcher.Watch starts already in an expired state; trying again makes it no better.

BTW, I might have put the retry logic in the wrong place.

diff --git a/pkg/build/registry/rest.go b/pkg/build/registry/rest.go
index 3647344..69c319c 100644
--- a/pkg/build/registry/rest.go
+++ b/pkg/build/registry/rest.go
@@ -6,6 +6,7 @@ import (

        kapi "k8s.io/kubernetes/pkg/api"
        "k8s.io/kubernetes/pkg/api/rest"
+       "k8s.io/kubernetes/pkg/api/unversioned"
        "k8s.io/kubernetes/pkg/fields"

        "github.com/openshift/origin/pkg/build/api"
@@ -20,33 +21,44 @@ var ErrUnknownBuildPhase = fmt.Errorf("unknown build phase")
 func WaitForRunningBuild(watcher rest.Watcher, ctx kapi.Context, build *api.Build, timeout time.Duration) (*api.Build, bool, error) {
        fieldSelector := fields.OneTermEqualSelector("metadata.name", build.Name)
        options := &kapi.ListOptions{FieldSelector: fieldSelector, ResourceVersion: build.ResourceVersion}
-       w, err := watcher.Watch(ctx, options)
-       if err != nil {
-               return nil, false, err
-       }
-       defer w.Stop()

-       ch := w.ResultChan()
-       observed := build
        expire := time.After(timeout)
-       for {
-               select {
-               case event := <-ch:
-                       obj, ok := event.Object.(*api.Build)
-                       if !ok {
-                               return observed, false, fmt.Errorf("received unknown object while watching for builds")
-                       }
-                       observed = obj

-                       switch obj.Status.Phase {
-                       case api.BuildPhaseRunning, api.BuildPhaseComplete, api.BuildPhaseFailed, api.BuildPhaseError, api.BuildPhaseCancelled:
-                               return observed, true, nil
-                       case api.BuildPhaseNew, api.BuildPhasePending:
-                       default:
-                               return observed, false, ErrUnknownBuildPhase
+watcherLoop:
+       for {
+               w, err := watcher.Watch(ctx, options)
+               if err != nil {
+                       return nil, false, err
+               }
+               defer w.Stop()
+               ch := w.ResultChan()
+       eventLoop:
+               for {
+                       select {
+                       case event := <-ch:
+                               switch observed := event.Object.(type) {
+                               case *api.Build:
+                                       switch observed.Status.Phase {
+                                       case api.BuildPhaseRunning, api.BuildPhaseComplete, api.BuildPhaseFailed, api.BuildPhaseError, api.BuildPhaseCancelled:
+                                               // Build has started, return it.
+                                               return observed, true, nil
+                                       case api.BuildPhaseNew, api.BuildPhasePending:
+                                               // Build hasn't started yet, continue waiting for more events.
+                                               continue eventLoop
+                                       default:
+                                               // Build has an unknown phase.
+                                               return observed, false, ErrUnknownBuildPhase
+                                       }
+                               case *unversioned.Status:
+                                       if observed.Reason == unversioned.StatusReasonExpired {
+                                               // The watcher expired, need to start over with a new watcher.
+                                               continue watcherLoop
+                                       }
+                               }
+                               return build, false, fmt.Errorf("received unknown object while watching for builds: %v", event.Object)
+                       case <-expire:
+                               return build, false, nil
                        }
-               case <-expire:
-                       return observed, false, nil
                }
        }
 }

@rhcarvalho
Contributor Author

The retry should go in oc start-build, i.e. client code, if anywhere...

@liggitt
Contributor

liggitt commented Feb 12, 2016

The test runs in Jenkins showed that every new watcher.Watch starts already in an expired state.

Right, if the watch is expired you have to relist the builds (probably filtered to that particular build) to get a newly watchable version (which would be the resourceVersion of the BuildList).

The retry should go in oc start-build, i.e. client code, if anywhere...

I'd rather it be server side... that's where we're turning the list into a watch

@rhcarvalho
Contributor Author

I'd rather it be server side... that's where we're turning the list into a watch

Could you please have a look at the patch above? Does it make sense?

I thought that would do what you described, but in practice all new watches keep expiring until we hit the timeout...

In start-build we already detect timeouts and try again... so if we had --wait I guess we could retry fetching the logs regardless of whether we get a timeout or any other error (in this case, the expired watch).

@liggitt
Contributor

liggitt commented Feb 12, 2016

Could you please have a look at the patch above? Does it make sense?

No, you are using the same build that was passed in, with the same ResourceVersion, to start your watch inside the loop. You need to relist the builds inside the loop (probably filtered to that particular build) to get a newly watchable version (which would be the resourceVersion of the BuildList).

The error you're getting ("The event in requested index is outdated and cleared") means the resource version of the build is too old to start a watch from.
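
A sketch of that shape, re-listing inside the loop so each watch starts from a fresh resourceVersion; the store interface and event struct below are hypothetical, not the Origin code:

package watchsketch

import "fmt"

// event and store are hypothetical; a real implementation would use
// watch.Event and the registry's storage interfaces.
type event struct {
    expired bool // the server reported a 410 Expired status
    running bool // the build reached a running or terminal phase
}

type store interface {
    // listBuilds returns the collection's current resourceVersion.
    listBuilds(name string) (resourceVersion string, err error)
    // watch streams events that happen after resourceVersion.
    watch(resourceVersion string) (<-chan event, error)
}

// waitForBuild re-lists on every attempt, so the watch never reuses the
// build's original, possibly compacted, resourceVersion.
func waitForBuild(s store, name string, maxRetries int) error {
    for attempt := 0; attempt < maxRetries; attempt++ {
        rv, err := s.listBuilds(name) // fresh resourceVersion each time
        if err != nil {
            return err
        }
        ch, err := s.watch(rv)
        if err != nil {
            return err
        }
        for ev := range ch {
            if ev.expired {
                break // history cleared again; re-list and retry
            }
            if ev.running {
                return nil
            }
        }
    }
    return fmt.Errorf("build %s: watch expired %d times in a row", name, maxRetries)
}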

@rhcarvalho
Contributor Author

@liggitt thanks. The build is being fetched in pkg/build/registry/buildlog/rest.go, so that's the whole portion we need to repeat.

Should we limit the retry with a timeout, a maximum number of tries, or both? Right now r.Get is not limited by a timeout; only WaitForRunningBuild is bounded by the defaultTimeout of 10s.

@liggitt
Contributor

liggitt commented Feb 12, 2016

not sure... feels odd for something outside this function to be retrying on very specific watch errors

@rhcarvalho
Contributor Author

Hmmm, either that or retry on the client side, no?

@liggitt
Contributor

liggitt commented Feb 12, 2016

-1 for client-side retrying for things like this; it means we have to reimplement it in all our clients (we have 3 today and will have more in the future)

@rhcarvalho
Contributor Author

Closing this. Most likely the problem was a wrong expectation: we were trying to start builds in test/cmd tests, but the cluster where those tests run does not support running builds.
