Introduce timeout for hanging workspaces #605

Merged: 4 commits, Sep 30, 2021

Changes from all commits
@@ -61,6 +61,11 @@ type WorkspaceConfig struct
// requires support in the workspace being started. If not specified, the default
// value of "15m" is used.
IdleTimeout string `json:"idleTimeout,omitempty"`
// ProgressTimeout determines the maximum duration a DevWorkspace can be in
// a "Starting" or "Failing" phase without progressing before it is automatically failed.
// Duration should be specified in a format parseable by Go's time package, e.g.
// "15m", "20s", "1h30m", etc. If not specified, the default value of "5m" is used.
ProgressTimeout string `json:"progressTimeout,omitempty"`
// IgnoredUnrecoverableEvents defines a list of Kubernetes event names that should
// be ignored when deciding to fail a DevWorkspace startup. This option should be used
// if a transient cluster issue is triggering false-positives (for example, if
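For reference, these values are parsed with Go's standard time.ParseDuration; a minimal sketch (illustrative values only, not operator code) of what is and is not accepted:

package main

import (
	"fmt"
	"time"
)

func main() {
	// Values like the ones documented above parse cleanly.
	for _, v := range []string{"15m", "20s", "1h30m"} {
		d, err := time.ParseDuration(v)
		fmt.Println(v, d, err) // e.g. 1h30m 1h30m0s <nil>
	}
	// A bare number is rejected; this is the kind of error that
	// "invalid duration specified for timeout" wraps below.
	_, err := time.ParseDuration("15")
	fmt.Println(err) // time: missing unit in duration "15"
}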
30 changes: 29 additions & 1 deletion controllers/workspace/devworkspace_controller.go
@@ -59,6 +59,10 @@ import (
dw "github.com/devfile/api/v2/pkg/apis/workspaces/v1alpha2"
)

const (
startingWorkspaceRequeueInterval = 5 * time.Second
Member:

I wonder if 5 seconds is always the best choice, or whether we could make it dynamic, e.g. 1 to 3 minutes in the case of a 5-minute timeout?

Collaborator Author:

I mostly picked 5 seconds because it feels long enough that it's not burdening the controller (we can reconcile many times a second) but also short enough to avoid strange issues in setting a timeout (e.g. if we set a 1-minute requeue, what happens if the config specifies a timeout of 1 minute 15 seconds?).

Member:

> 1 minute 15 seconds

I don't think there is a case for such a precise timeout. Maybe we can declare our precision to be something like 1 minute or 30 seconds.

And actually, I thought about reconciling after (timeout + last transition time - now + 5 sec), which would kick off the reconcile loop 5 seconds after the point where we potentially need to kill the workspace.

But I'm OK with any approach as long as it does not generate redundant load.

Collaborator Author (@amisevsk, Sep 30, 2021):
I tested this informally by starting a workspace that will time out (ignoring FailedScheduling events and asking for 100Gi in Theia). While the workspace was looping on checking the deployment every 5 seconds, I ran

for i in {1..5}; do
  kubectl patch dw theia-next --type merge -p '{"metadata": {"labels": {"touch": "false"}}}'
  sleep 0.5
  kubectl patch dw theia-next --type merge -p '{"metadata": {"labels": {"touch": "true"}}}'
  sleep 0.5
done

to trigger 10 reconciles of the object, on the assumption that each of those would also schedule a RequeueAfter. However, after the 10 quick reconciles complete, the controller goes back to queuing one reconcile every five seconds, rather than 10 reconciles every 5 seconds, so it seems like RequeueAfter is cancelled if an event triggers a reconcile.

I agree that 5 seconds may be the wrong value here; we should tweak this later.
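To make the precision concern above concrete, a small illustrative sketch (made-up numbers, not operator code): a timeout can only be noticed on a reconcile, so detection can lag by up to one requeue interval.

package main

import (
	"fmt"
	"time"
)

// detectedAt returns when the first reconcile at or after the timeout would
// run, assuming a reconcile fires exactly every interval and nothing else
// wakes the controller.
func detectedAt(timeout, interval time.Duration) time.Duration {
	n := (timeout + interval - 1) / interval // ceiling division
	return n * interval
}

func main() {
	// With a 1m requeue and a 1m15s timeout, the failure is only noticed
	// at the 2m mark, 45s late; a 5s requeue notices it on time.
	fmt.Println(detectedAt(75*time.Second, time.Minute))   // 2m0s
	fmt.Println(detectedAt(75*time.Second, 5*time.Second)) // 1m15s
}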

)

// DevWorkspaceReconciler reconciles a DevWorkspace object
type DevWorkspaceReconciler struct {
client.Client
@@ -134,7 +138,11 @@ func (r *DevWorkspaceReconciler) Reconcile(ctx context.Context, req ctrl.Request
// If debug annotation is present, leave the deployment in place to let users
// view logs.
if workspace.Annotations[constants.DevWorkspaceDebugStartAnnotation] == "true" {
return reconcile.Result{}, nil
if isTimeout, err := checkForFailingTimeout(workspace); err != nil {
return reconcile.Result{}, err
} else if !isTimeout {
return reconcile.Result{}, nil
}
}

patch := []byte(`{"spec":{"started": false}}`)
@@ -181,8 +189,16 @@ func (r *DevWorkspaceReconciler) Reconcile(ctx context.Context, req ctrl.Request
clusterWorkspace := workspace.DeepCopy()
timingInfo := map[string]string{}
timing.SetTime(timingInfo, timing.DevWorkspaceStarted)

defer func() (reconcile.Result, error) {
r.syncTimingToCluster(ctx, clusterWorkspace, timingInfo, reqLogger)
// Don't accidentally suppress errors by overwriting here; only check for timeout when no error
// encountered in main reconcile loop.
if err == nil {
if timeoutErr := checkForStartTimeout(clusterWorkspace); timeoutErr != nil {
reconcileResult, err = r.failWorkspace(workspace, timeoutErr.Error(), reqLogger, &reconcileStatus)
}
}
return r.updateWorkspaceStatus(clusterWorkspace, reqLogger, &reconcileStatus, reconcileResult, err)
}()
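
The deferred closure works because Reconcile uses named return values, which a defer may read and overwrite after the body returns. A standalone sketch of the pattern (checkTimeout is a hypothetical stand-in for checkForStartTimeout):

package main

import (
	"errors"
	"fmt"
)

// checkTimeout is a hypothetical stand-in for checkForStartTimeout.
func checkTimeout() error { return errors.New("timed out") }

// run mirrors the deferred-closure pattern above: the defer inspects and
// overwrites the named return values after the body has returned.
func run() (result string, err error) {
	defer func() {
		// Only overwrite when the body succeeded, so a real error from
		// the main path is never masked by the timeout check.
		if err == nil {
			if timeoutErr := checkTimeout(); timeoutErr != nil {
				result, err = "failed", timeoutErr
			}
		}
	}()
	return "ok", nil
}

func main() {
	result, err := run()
	fmt.Println(result, err) // failed timed out
}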

@@ -286,6 +302,9 @@ func (r *DevWorkspaceReconciler) Reconcile(ctx context.Context, req ctrl.Request
message = routingStatus.Message
}
reconcileStatus.setConditionFalse(dw.DevWorkspaceRoutingReady, message)
if !routingStatus.Requeue && routingStatus.Err == nil {
return reconcile.Result{RequeueAfter: startingWorkspaceRequeueInterval}, nil
}
return reconcile.Result{Requeue: routingStatus.Requeue}, routingStatus.Err
}
reconcileStatus.setConditionTrue(dw.DevWorkspaceRoutingReady, "Networking ready")
@@ -333,6 +352,9 @@ func (r *DevWorkspaceReconciler) Reconcile(ctx context.Context, req ctrl.Request
// FailStartup is not possible for generating the serviceaccount
reqLogger.Info("Waiting for workspace ServiceAccount")
reconcileStatus.setConditionFalse(dw.DevWorkspaceServiceAccountReady, "Waiting for DevWorkspace ServiceAccount")
if !serviceAcctStatus.Requeue && serviceAcctStatus.Err == nil {
return reconcile.Result{RequeueAfter: startingWorkspaceRequeueInterval}, nil
}
return reconcile.Result{Requeue: serviceAcctStatus.Requeue}, serviceAcctStatus.Err
}
serviceAcctName := serviceAcctStatus.ServiceAccountName
@@ -341,6 +363,9 @@ func (r *DevWorkspaceReconciler) Reconcile(ctx context.Context, req ctrl.Request
pullSecretStatus := wsprovision.PullSecrets(clusterAPI, serviceAcctName, workspace.GetNamespace())
if !pullSecretStatus.Continue {
reconcileStatus.setConditionFalse(conditions.PullSecretsReady, "Waiting for DevWorkspace pull secrets")
if !pullSecretStatus.Requeue && pullSecretStatus.Err == nil {
return reconcile.Result{RequeueAfter: startingWorkspaceRequeueInterval}, nil
}
return reconcile.Result{Requeue: pullSecretStatus.Requeue}, pullSecretStatus.Err
}
allPodAdditions = append(allPodAdditions, pullSecretStatus.PodAdditions)
@@ -355,6 +380,9 @@ func (r *DevWorkspaceReconciler) Reconcile(ctx context.Context, req ctrl.Request
}
reqLogger.Info("Waiting on deployment to be ready")
reconcileStatus.setConditionFalse(conditions.DeploymentReady, "Waiting for workspace deployment")
if !deploymentStatus.Requeue && deploymentStatus.Err == nil {
return reconcile.Result{RequeueAfter: startingWorkspaceRequeueInterval}, nil
Member:

Does the last or the first interval win here?

  • I mean, we wait for the deployment initially and schedule a reconcile in 5 seconds (1);
  • the Deployment creates pods, which invokes the reconcile loop, and we schedule again in 5 seconds (2);
  • the Pod is updated several times as container statuses propagate, so we get scheduled N more times in 5 seconds (N).

The question is: how many reconcile loops will our RequeueAfter calls initiate?

Collaborator Author:

My understanding is that reconciling due to an event cancels out a requeueAfter, but it's hard to check for sure. In my testing, I don't see bursts of "waiting on deployment" in the logs, which is what would likely happen if we were stacking requeues.

}
return reconcile.Result{Requeue: deploymentStatus.Requeue}, deploymentStatus.Err
}
reconcileStatus.setConditionTrue(conditions.DeploymentReady, "DevWorkspace deployment ready")
53 changes: 53 additions & 0 deletions controllers/workspace/status.go
@@ -19,6 +19,7 @@ import (
"net/http"
"net/url"
"sort"
"time"

dw "github.com/devfile/api/v2/pkg/apis/workspaces/v1alpha2"

@@ -31,6 +32,7 @@ import (
"github.com/devfile/devworkspace-operator/apis/controller/v1alpha1"
"github.com/devfile/devworkspace-operator/controllers/workspace/metrics"
"github.com/devfile/devworkspace-operator/pkg/conditions"
"github.com/devfile/devworkspace-operator/pkg/config"
wsprovision "github.com/devfile/devworkspace-operator/pkg/provision/workspace"
)

@@ -238,3 +240,54 @@ func updateMetricsForPhase(workspace *dw.DevWorkspace, oldPhase, newPhase dw.Dev
metrics.WorkspaceStarted(workspace, logger)
}
}

// checkForStartTimeout checks if the provided workspace has not progressed for longer than the configured
// startup timeout. This is determined by checking to see if the last condition transition time is more
// than [timeout] duration ago. Workspaces that are not in the "Starting" phase cannot time out. Returns
// an error with message when timeout is reached.
func checkForStartTimeout(workspace *dw.DevWorkspace) error {
if workspace.Status.Phase != dw.DevWorkspaceStatusStarting {
return nil
}
timeout, err := time.ParseDuration(config.Workspace.ProgressTimeout)
if err != nil {
return fmt.Errorf("invalid duration specified for timeout: %w", err)
}
currTime := clock.Now()
lastUpdateTime := time.Time{}
for _, condition := range workspace.Status.Conditions {
if condition.LastTransitionTime.Time.After(lastUpdateTime) {
lastUpdateTime = condition.LastTransitionTime.Time
}
}
if !lastUpdateTime.IsZero() && lastUpdateTime.Add(timeout).Before(currTime) {
return fmt.Errorf("devworkspace failed to progress past phase '%s' for longer than timeout (%s)",
workspace.Status.Phase, config.Workspace.ProgressTimeout)
}
return nil
}

// checkForFailingTimeout checks that the current workspace has not been in the "Failing" state for longer than the
// configured progress timeout. If the workspace is not in the Failing state or does not have a DevWorkspaceFailed
// condition set, returns false. Otherwise, returns true if the workspace has timed out. Returns an error if
// timeout is configured with an unparsable duration.
func checkForFailingTimeout(workspace *dw.DevWorkspace) (isTimedOut bool, err error) {
if workspace.Status.Phase != devworkspacePhaseFailing {
return false, nil
}
timeout, err := time.ParseDuration(config.Workspace.ProgressTimeout)
if err != nil {
return false, fmt.Errorf("invalid duration specified for timeout: %w", err)
}
currTime := clock.Now()
failedTime := time.Time{}
for _, condition := range workspace.Status.Conditions {
if condition.Type == dw.DevWorkspaceFailedStart {
failedTime = condition.LastTransitionTime.Time
}
}
if !failedTime.IsZero() && failedTime.Add(timeout).Before(currTime) {
return true, nil
}
return false, nil
}
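
Both helpers reduce to the same comparison: the most recent relevant condition transition time, plus the configured timeout, must still be in the future. A self-contained sketch of that arithmetic (made-up timestamps, not operator code):

package main

import (
	"fmt"
	"time"
)

func main() {
	timeout, _ := time.ParseDuration("5m") // the default ProgressTimeout
	now := time.Now()

	// A workspace whose newest condition transitioned 10 minutes ago has
	// exceeded the 5-minute budget.
	lastUpdate := now.Add(-10 * time.Minute)
	fmt.Println(lastUpdate.Add(timeout).Before(now)) // true: timed out

	// One that progressed 2 minutes ago has not.
	lastUpdate = now.Add(-2 * time.Minute)
	fmt.Println(lastUpdate.Add(timeout).Before(now)) // false: keep waiting
}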


8 changes: 8 additions & 0 deletions deploy/deployment/kubernetes/combined.yaml

8 changes: 8 additions & 0 deletions deploy/deployment/openshift/combined.yaml

1 change: 1 addition & 0 deletions pkg/config/defaults.go
@@ -24,5 +24,6 @@ var DefaultConfig = &v1alpha1.OperatorConfiguration{
ImagePullPolicy: "Always",
PVCName: "claim-devworkspace",
IdleTimeout: "15m",
ProgressTimeout: "5m",
},
}
3 changes: 3 additions & 0 deletions pkg/config/sync.go
@@ -205,6 +205,9 @@ func mergeConfig(from, to *controller.OperatorConfiguration) {
if from.Workspace.IdleTimeout != "" {
to.Workspace.IdleTimeout = from.Workspace.IdleTimeout
}
if from.Workspace.ProgressTimeout != "" {
to.Workspace.ProgressTimeout = from.Workspace.ProgressTimeout
}
if from.Workspace.IgnoredUnrecoverableEvents != nil {
to.Workspace.IgnoredUnrecoverableEvents = from.Workspace.IgnoredUnrecoverableEvents
}
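
As with IdleTimeout, the empty string is treated as "unset", so a partial custom config only overrides the fields it actually specifies. A stripped-down sketch of the merge semantics (simplified types, not the real API):

package main

import "fmt"

// workspaceConfig is a simplified stand-in for the operator's WorkspaceConfig;
// the real type has more fields and lives in the controller API package.
type workspaceConfig struct {
	IdleTimeout     string
	ProgressTimeout string
}

// merge mirrors the semantics above: only non-empty fields in the custom
// config override the corresponding defaults.
func merge(from, to *workspaceConfig) {
	if from.IdleTimeout != "" {
		to.IdleTimeout = from.IdleTimeout
	}
	if from.ProgressTimeout != "" {
		to.ProgressTimeout = from.ProgressTimeout
	}
}

func main() {
	config := &workspaceConfig{IdleTimeout: "15m", ProgressTimeout: "5m"} // defaults
	custom := &workspaceConfig{ProgressTimeout: "10m"}                    // IdleTimeout unset

	merge(custom, config)
	fmt.Println(*config) // {15m 10m}: only ProgressTimeout was overridden
}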