MCO: clear out failing status on success and add tests #442

Merged · 4 commits · Feb 17, 2019

Changes from 1 commit
43 changes: 23 additions & 20 deletions pkg/operator/status.go
@@ -28,13 +28,12 @@ func (optr *Operator) syncAvailableStatus() error {
}

optrVersion, _ := optr.vStore.Get("operator")
+	progressing := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorProgressing)
failing := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorFailing)
message := fmt.Sprintf("Cluster has deployed %s", optrVersion)

available := configv1.ConditionTrue

+	if failing && !progressing {
Member commented:

Hm, so failing now always means !available? OK, that looks like what the CVO does as well, I think.

runcom (Member, Author) commented on Feb 17, 2019:

my understanding for the MCO is that if we fail the sync during a progressing, we could have e.g. applied a new MCO or something else that can misbehave after a failure, and thus there is no reason to report available. @abhinavdahiya what do you think?

runcom (Member, Author) commented on Feb 17, 2019:

so, when looking at https://github.com/openshift/cluster-version-operator/blob/master/docs/dev/clusteroperator.md#conditions, there's an example:

> If an error blocks reaching 4.0.1, the conditions might be:
> - Failing is true with a detailed message Unable to apply 4.0.1: could not update 0000_70_network_deployment.yaml because the resource type NetworkConfig has not been installed on the server.
> - Available is true with message Cluster has deployed 4.0.0
> - Progressing is true with message Unable to apply 4.0.1: a required object is missing

I'm not sure how to read that: if any of the syncFuncs here fails during a progressing, the MCO isn't really available, because we may have e.g. started rolling out a new mcc while the mcd broke, and it makes little sense to report Available.
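The rule being debated here can be modeled with a minimal, self-contained sketch. The types below are simplified stand-ins (not the real configv1/cov1helpers API) and only capture the `failing && !progressing` decision:

```go
package main

import "fmt"

type ConditionStatus string

const (
	ConditionTrue  ConditionStatus = "True"
	ConditionFalse ConditionStatus = "False"
)

// availableStatus mirrors the rule under discussion: the operator stops
// reporting Available only when it is Failing outside of a rollout
// (i.e. not Progressing).
func availableStatus(failing, progressing bool) ConditionStatus {
	if failing && !progressing {
		return ConditionFalse
	}
	return ConditionTrue
}

func main() {
	// Failing during an upgrade: still Available for the current version,
	// matching the CVO example (Available=true, Progressing=true, Failing=true).
	fmt.Println(availableStatus(true, true)) // True
	// Failing while idle: not Available.
	fmt.Println(availableStatus(true, false)) // False
}
```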

runcom (Member, Author) commented:

actually, maybe that !progressing is still right, because the bug was really that we weren't clearing failing on a successful sync

runcom (Member, Author) commented:

This patch accounts for that case though, and the changed test accounts for my bad assumption (all the other tests still pass with this):

diff --git a/pkg/operator/status.go b/pkg/operator/status.go
index b71370e..3ce2262 100644
--- a/pkg/operator/status.go
+++ b/pkg/operator/status.go
@@ -29,11 +29,12 @@ func (optr *Operator) syncAvailableStatus() error {
 
 	optrVersion, _ := optr.vStore.Get("operator")
 	failing := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorFailing)
+	progressing := cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorProgressing)
 	message := fmt.Sprintf("Cluster has deployed %s", optrVersion)
 
 	available := configv1.ConditionTrue
 
-	if failing {
+	if (failing && !progressing) || (failing && optr.inClusterBringup) {
 		available = configv1.ConditionFalse
 		message = fmt.Sprintf("Cluster not available for %s", optrVersion)
 	}
diff --git a/pkg/operator/status_test.go b/pkg/operator/status_test.go
index 1437769..9b24c97 100644
--- a/pkg/operator/status_test.go
+++ b/pkg/operator/status_test.go
@@ -350,8 +350,7 @@ func TestOperatorSyncStatus(t *testing.T) {
 				},
 			},
 		},
-		// 3. test that if progressing fails, we report available=false because state of the operator
-		//    might have changed in the various sync calls
+		// 3. test that if progressing fails, we report available=true for the current version
 		{
 			syncs: []syncCase{
 				{
@@ -390,7 +389,7 @@ func TestOperatorSyncStatus(t *testing.T) {
 						},
 						{
 							Type:   configv1.OperatorAvailable,
-							Status: configv1.ConditionFalse,
+							Status: configv1.ConditionTrue,
 						},
 						{
 							Type:   configv1.OperatorFailing,
@@ -405,6 +404,29 @@ func TestOperatorSyncStatus(t *testing.T) {
 						},
 					},
 				},
+				{
+					// we mock the fact that we are at operator=test-version-2 after the previous sync
+					cond: []configv1.ClusterOperatorStatusCondition{
+						{
+							Type:   configv1.OperatorProgressing,
+							Status: configv1.ConditionFalse,
+						},
+						{
+							Type:   configv1.OperatorAvailable,
+							Status: configv1.ConditionTrue,
+						},
+						{
+							Type:   configv1.OperatorFailing,
+							Status: configv1.ConditionFalse,
+						},
+					},
+					syncFuncs: []syncFunc{
+						{
+							name: "fn1",
+							fn:   func(config renderConfig) error { return nil },
+						},
+					},
+				},
 			},
 		},
 		// 4. test that if progressing fails during bringup, we still report failing and not available
@@ -601,4 +623,4 @@ func TestInClusterBringUpStayOnErr(t *testing.T) {
 	assert.Nil(t, err, "expected syncAll to pass")
 
 	assert.False(t, optr.inClusterBringup)
-}
\ No newline at end of file
+}

runcom (Member, Author) commented:

the patch above effectively lets the MCO report available=true, progressing=true, failing=true when a sync fails during a progressing but the MCO is still available (an assumption taken from the CVO doc)

runcom (Member, Author) commented:

opened #450 to further discuss this.

-	if failing {
available = configv1.ConditionFalse
message = fmt.Sprintf("Cluster not available for %s", optrVersion)
}
@@ -67,6 +66,7 @@ func (optr *Operator) syncProgressingStatus() error {

if optr.vStore.Equal(co.Status.Versions) {
if optr.inClusterBringup {
message = fmt.Sprintf("Cluster is bootstrapping %s", optrVersion)
progressing = configv1.ConditionTrue
}
} else {
@@ -85,10 +85,7 @@
}

// syncFailingStatus applies the new condition to the mco's ClusterOperator object.
-func (optr *Operator) syncFailingStatus(ierr error) error {
-	if ierr == nil {
-		return nil
-	}
+func (optr *Operator) syncFailingStatus(ierr error) (err error) {
co, err := optr.fetchClusterOperator()
if err != nil {
return err
@@ -98,27 +95,33 @@
}

optrVersion, _ := optr.vStore.Get("operator")
-	var message string
-	if optr.vStore.Equal(co.Status.Versions) {
-		// syncing the state to exiting version.
-		message = fmt.Sprintf("Failed to resync %s because: %v", optrVersion, ierr.Error())
+	failing := configv1.ConditionTrue
+	var message, reason string
+	if ierr == nil {
+		failing = configv1.ConditionFalse
} else {
-	message = fmt.Sprintf("Unable to apply %s: %v", optrVersion, ierr.Error())
+	if optr.vStore.Equal(co.Status.Versions) {
+		// syncing the state to exiting version.
+		message = fmt.Sprintf("Failed to resync %s because: %v", optrVersion, ierr.Error())
+	} else {
+		message = fmt.Sprintf("Unable to apply %s: %v", optrVersion, ierr.Error())
+	}
+	reason = ierr.Error()
+
+	// set progressing
+	if cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorProgressing) {
+		cov1helpers.SetStatusCondition(&co.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Message: fmt.Sprintf("Unable to apply %s", version.Version.String())})
+	} else {
+		cov1helpers.SetStatusCondition(&co.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorProgressing, Status: configv1.ConditionFalse, Message: fmt.Sprintf("Error while reconciling %s", version.Version.String())})
+	}
}
// set failing condition
cov1helpers.SetStatusCondition(&co.Status.Conditions, configv1.ClusterOperatorStatusCondition{
-	Type: configv1.OperatorFailing, Status: configv1.ConditionTrue,
+	Type: configv1.OperatorFailing, Status: failing,
Message: message,
-	Reason: ierr.Error(),
+	Reason: reason,
})
-
-// set progressing
-if cov1helpers.IsStatusConditionTrue(co.Status.Conditions, configv1.OperatorProgressing) {
-	cov1helpers.SetStatusCondition(&co.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorProgressing, Status: configv1.ConditionTrue, Message: fmt.Sprintf("Unable to apply %s", version.Version.String())})
-} else {
-	cov1helpers.SetStatusCondition(&co.Status.Conditions, configv1.ClusterOperatorStatusCondition{Type: configv1.OperatorProgressing, Status: configv1.ConditionFalse, Message: fmt.Sprintf("Error while reconciling %s", version.Version.String())})
-}

optr.setMachineConfigPoolStatuses(&co.Status)
_, err = optr.configClient.ConfigV1().ClusterOperators().UpdateStatus(co)
return err
15 changes: 9 additions & 6 deletions pkg/operator/sync.go
@@ -50,17 +50,20 @@ func (optr *Operator) syncAll(rconfig renderConfig) error {
}

agg := utilerrors.NewAggregate(errs)
-	if agg != nil {
-		errs = append(errs, optr.syncFailingStatus(agg))
-		agg = utilerrors.NewAggregate(errs)
-		return fmt.Errorf("error syncing: %v", agg.Error())
+	if err := optr.syncFailingStatus(agg); err != nil {
+		return fmt.Errorf("error syncing failing status: %v", err)
}
-	if optr.inClusterBringup {
+
+	if err := optr.syncAvailableStatus(); err != nil {
+		return fmt.Errorf("error syncing available status: %v", err)
+	}
+
+	if optr.inClusterBringup && agg == nil {
glog.Infof("Initialization complete")
optr.inClusterBringup = false
}

-	return optr.syncAvailableStatus()
+	return agg
}

func (optr *Operator) syncCustomResourceDefinitions() error {