test(e2e): Implement retriers for tests that could fail due to flakiness #449
Comments
I dug through some recent failures to find the common causes and came up with the following, in order of recency:
1. At least 2 times, and the most recent
2.
3.
4. At least 2 times
5.
6. Most of the older errors, at least 5 times
Let's triage these one by one.

For 1: The error is most likely coming from this step:

{
    Step: &kubernetes.ExecInPod{
        PodName:      podName,
        PodNamespace: "kube-system",
        Command:      req.Command,
    },
    Opts: &types.StepOptions{
        ExpectError:               req.ExpectError,
        SkipSavingParametersToJob: true,
    },
},

I also think we would benefit from adding retries to almost all k8s operations that have a chance of failing due to network issues. The same solution can be used for 3 as well.

For 2: We already have a retry on port forwarding. I can see whether it can be tuned, but other than that I don't see a fix for this kind of intermittent issue. My idea would be to spread out the retries in order to rule out network-related issues.

For 4: Maybe increase the time limit and see whether the issue is still there. The downside is that the test would be stuck for longer in case of an actual issue.

For 6: We are being throttled by the k8s API, so I would reduce the polling frequency to allow more tries at polling. We would have the same trade-off as in the previous case. I would also like to log the state of the Retina agents and maybe evaluate at runtime whether we need to end the test.
Number 6 was already addressed by increasing the timeout from 8 min to 20 min, so not much more is required there. I have addressed 1, 2 and 4 in PR #867.
…timeouts (partially fixes #449) (#867)

# Description

This pull request includes several changes to improve the reliability and efficiency of the Kubernetes end-to-end testing framework. The most important changes involve adjusting timeouts and retry mechanisms to enhance robustness and reduce wait times.

### Improvements to retry mechanisms:

* [`test/e2e/framework/kubernetes/exec-pod.go`](diffhunk://#diff-ebfee2072870c7e30ca7222eab3f94550af60a5ac8de53aa632a949fcd4fd667L42-R54): Added retry logic using `retry.OnError` for executing commands in a pod to handle transient errors more gracefully.
* [`test/e2e/framework/kubernetes/port-forward.go`](diffhunk://#diff-ed249ad2b2805041dfd7ff7466005e33aa4a55adc7728524f6c060af8131dd61L22-R30): Enabled exponential backoff in the default retrier to improve the efficiency of retry attempts.

### Adjustments to timeouts and delays:

* [`test/e2e/framework/azure/create-cluster-with-npm.go`](diffhunk://#diff-bf0613d6eb45a2bdc85b1446ab964c4249722b16bdc616129ec7b71dc0185553L21-R21): Increased the `clusterTimeout` from 10 to 15 minutes to allow more time for cluster creation.
* [`test/e2e/framework/kubernetes/port-forward.go`](diffhunk://#diff-ed249ad2b2805041dfd7ff7466005e33aa4a55adc7728524f6c060af8131dd61L22-R30): Reduced the `defaultRetryDelay` from 5 seconds to 500 milliseconds to decrease the wait time between retry attempts.

### Dependency updates:

* [`test/e2e/framework/kubernetes/exec-pod.go`](diffhunk://#diff-ebfee2072870c7e30ca7222eab3f94550af60a5ac8de53aa632a949fcd4fd667R16): Added import for `k8s.io/client-go/util/retry` to support the new retry logic.

## Related Issue

It fixes issue #449, which talks about the intermittent failures in our e2e tests.

## Checklist

- [X] I have read the [contributing documentation](https://retina.sh/docs/contributing).
- [X] I signed and signed-off the commits (`git commit -S -s ...`). See [this documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification) on signing commits.
- [X] I have correctly attributed the author(s) of the code.
- [X] I have tested the changes locally.
- [X] I have followed the project's style guidelines.
- [ ] I have updated the documentation, if necessary.
- [ ] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

Please add any relevant screenshots or GIFs to showcase the changes made.

## Additional Notes

Add any additional notes or context about the pull request here.

---

Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more information on how to contribute to this project.
# Description

This pull request includes changes to improve the end-to-end resiliency in DNS scenarios. The changes include:

- Added a delay to guarantee that the pods will have the BPF program attached before executing further steps in the scenario.
- Updated the label selector for the `CreateDenyAllNetworkPolicy` step to target the `agnhost-drop` pods instead of `agnhost-a` pods.
- Updated the `CreateAgnhostStatefulSet` step to use the `agnhost-drop` name instead of `agnhost-a`.
- Updated the `ExecInPod` steps to target the `agnhost-drop-0` pod instead of `agnhost-a-0`.
- Updated the `ValidateRetinaDropMetric` step to use the `agnhost-drop` source instead of `agnhost-a`.
- Updated the `DeleteKubernetesResource` step to target the `agnhost-drop` stateful set instead of `agnhost-a`.

## Related Issue

It addresses issue #449, which talks about the intermittent failures in our e2e tests.

## Checklist

- [x] I have read the [contributing documentation](https://retina.sh/docs/contributing).
- [x] I signed and signed-off the commits (`git commit -S -s ...`). See [this documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification) on signing commits.
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [ ] I have updated the documentation, if necessary.
- [ ] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed

## Additional Notes

Add any additional notes or context about the pull request here.

---

Please refer to the [CONTRIBUTING.md](../CONTRIBUTING.md) file for more information on how to contribute to this project.

Signed-off-by: Ritwik Ranjan <ritwikranjan@microsoft.com>
We have fixed all identified flakiness, closing the issue!