Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Executor hardening #685

Merged
merged 1 commit into from
Jan 17, 2018
Merged

Conversation

jessesuen
Copy link
Member

@jessesuen jessesuen commented Jan 15, 2018

  • add retries and memoization for executor k8s API calls
  • recover from unexpected panics and annotate the error.

}

// IsRetryableNetworkError returns whether or not the error is a retryable network error
func IsRetryableNetworkError(err error) bool {
Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is currently unused, and probably needs to be tweaked -- was laying the ground work for retrying artifact load/save errors.

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

var pod *apiv1.Pod
var err error
_ = wait.ExponentialBackoff(DefaultRetry, func() (bool, error) {
pod, err = podsIf.Get(we.PodName, metav1.GetOptions{})

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The underlying client-go request seems to have some built in retry for GET. You may want to check to see if it already meets your retry requirement.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks like this mechanism does not suit our needs. The issue we are trying to solve is the executor retrying getting pod/secrets from API server to address issues like this:

   ├-⚠ hybridize(23)              hm-92j79-3168642571  failed to load artifacts: Get https://10.11.240.1:443/api/v1/namespaces/default/secrets/gcs-credentials: dial tcp 10.11.240.1:443: getsockopt: connection refused

}

// IsRetryableNetworkError returns whether or not the error is a retryable network error
func IsRetryableNetworkError(err error) bool {

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nice

@jessesuen jessesuen force-pushed the executor-hardening branch 2 times, most recently from 8480040 to a3707ca Compare January 17, 2018 00:33
…calls

Recover from unexpected panics and annotate the error.
@jessesuen jessesuen merged commit 1011031 into argoproj:master Jan 17, 2018
@jessesuen jessesuen deleted the executor-hardening branch January 19, 2018 01:48
icecoffee531 pushed a commit to icecoffee531/argo-workflows that referenced this pull request Jan 5, 2022
if the name of Sensor is "webhook-sensor", the gate-way client cannot send request to the desired sensor, so the name should be changed into "webhook"
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants