
test(e2e): Implement retriers for tests that could fail due to flakiness #449

Closed
Tracked by #355
nddq opened this issue Jun 6, 2024 · 4 comments

@nddq
Contributor

nddq commented Jun 6, 2024

No description provided.

@nddq nddq added the area/infra Test, Release, or CI Infrastructure label Jun 6, 2024
@nddq nddq modified the milestone: 1.0 Jun 6, 2024
@ritwikranjan
Contributor

ritwikranjan commented Oct 9, 2024

I dug through some recent failures to find the common causes and came up with the following, in order of recency:


1

At least 2 times, and the most recent.

2024/10/08 19:37:49 executing command "nslookup kubernetes.default" on pod "agnhost-basic-dns-port-forward-2360488725034690767-0" in namespace "kube-system"...
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:77
        	Error:      	Received unexpected error:
        	            	did not expect error from step ExecInPod but got error: error executing command [nslookup kubernetes.default]: error executing command: error dialing backend: EOF
        	Test:       	TestE2ERetina

2

2024/09/20 15:47:12 checking for metrics on http://localhost:10093/metrics
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:69
        	Error:      	Received unexpected error:
        	            	did not expect error from step ValidateAdvancedDNSRequestMetrics but got error: failed to verify advance dns request metrics networkobservability_adv_dns_request_count: failed to get prometheus metrics: could not start port forward within 300000000000s: HTTP request failed: Get "http://localhost:10093/metrics": dial tcp [::1]:10093: connect: connection refused	
        	Test:       	TestE2ERetina

3

2024/08/29 16:04:58 attempting to find pod with label "k8s-app=retina", on a node with a pod with label "app=agnhost-a"
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:65
        	Error:      	Received unexpected error:
        	            	did not expect error from step PortForward but got error: could not find pod with affinity: could not find a pod with label "k8s-app=retina", on a node that also has a pod with label "app=agnhost-a": no pod with label found with matching pod affinity
        	Test:       	TestE2ERetina

4

At least 2 times.

2024/08/27 17:47:16 failed to create cluster: context deadline exceeded
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:53
        	Error:      	Received unexpected error:
        	            	did not expect error from step CreateNPMCluster but got error: failed to create cluster: context deadline exceeded
        	Test:       	TestE2ERetina

5

2024/08/26 23:50:24 failed to find metric matching networkobservability_adv_dns_request_count: map[ip:10.224.4.108 namespace:kube-system podname:agnhost-adv-dns-port-forward-4243681485157638409-0 query:kubernetes.default.svc.cluster.local. query_type:A workload_kind:StatefulSet workload_name:agnhost-adv-dns-port-forward-4243681485157638409]
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:69
        	Error:      	Received unexpected error:
        	            	did not expect error from step ValidateAdvancedDNSRequestMetrics but got error: failed to verify advance dns request metrics networkobservability_adv_dns_request_count: failed to get prometheus metrics: no metric found
        	Test:       	TestE2ERetina

6

Most of the older errors; at least 5 times.

2024/08/23 15:53:02 Error received when checking status of resource retina-svc. Error: 'client rate limiter Wait returned an error: context deadline exceeded', Resource details: 'Resource: "/v1, Resource=services", GroupVersionKind: "/v1, Kind=Service"
Name: "retina-svc", Namespace: "kube-system"'
2024/08/23 15:53:02 Retryable error? true
2024/08/23 15:53:02 Retrying as current number of retries 0 less than max number of retries 30
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:65
        	Error:      	Received unexpected error:
        	            	did not expect error from step InstallHelmChart but got error: failed to install chart: context deadline exceeded
        	Test:       	TestE2ERetina

@ritwikranjan
Contributor

ritwikranjan commented Oct 10, 2024

Let's triage one by one

For 1

The error is most likely coming from this step:

		{
			Step: &kubernetes.ExecInPod{
				PodName:      podName,
				PodNamespace: "kube-system",
				Command:      req.Command,
			},
			Opts: &types.StepOptions{
				ExpectError:               req.ExpectError,
				SkipSavingParametersToJob: true,
			},
		},

The ExecInPod step does not have any built-in retry mechanism, so any failure in the step calling this function will result in the test failing. We should add a retry strategy to make command execution resilient.
Potential tool: https://pkg.go.dev/k8s.io/client-go/util/retry

I also think we would benefit from adding this to almost every k8s operation that has a chance of failing due to network issues. The same solution can be used for the 3rd one as well; a rough sketch follows.
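
A minimal sketch of what this could look like, assuming a placeholder `execCommand` helper rather than the framework's actual exec code:

```go
// Sketch only: wrap a pod exec in retry.OnError so transient errors such as
// "error dialing backend: EOF" are retried instead of failing the step.
package main

import (
	"errors"
	"fmt"

	"k8s.io/client-go/util/retry"
)

// execInPodWithRetry retries execCommand with client-go's default backoff,
// treating every error as retriable for simplicity.
func execInPodWithRetry(execCommand func() error) error {
	return retry.OnError(retry.DefaultBackoff, func(err error) bool {
		return err != nil // a real check might only retry network-type errors
	}, execCommand)
}

func main() {
	attempts := 0
	err := execInPodWithRetry(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("error dialing backend: EOF") // simulate a flake
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```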


For 2

We already have retry on port forwarding; I can see if I can tune it, but other than that I don't see a clear fix for this kind of intermittent issue. My idea would be to spread the retries out (e.g. with exponential backoff) in order to rule out network-related issues; a rough sketch follows.
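
A minimal sketch of spreading retries with exponential backoff via `k8s.io/apimachinery/pkg/util/wait`; the `tryPortForward` function and the backoff values are illustrative assumptions, not the framework's actual retrier:

```go
// Sketch only: retry a port-forward attempt with exponential backoff so the
// attempts are spread out over time instead of firing in quick succession.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func portForwardWithBackoff(tryPortForward func() error) error {
	backoff := wait.Backoff{
		Duration: 500 * time.Millisecond, // initial delay
		Factor:   2.0,                    // double the delay on each attempt
		Jitter:   0.1,                    // add some randomness
		Steps:    7,                      // ~500ms, 1s, 2s, 4s, ...
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := tryPortForward(); err != nil {
			fmt.Println("port-forward failed, will retry:", err)
			return false, nil // not done yet, keep retrying
		}
		return true, nil // success
	})
}
```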


For 4

Maybe increase the time limit and see if the issue persists. The downside is that the test would be stuck for longer in the case of an actual issue.


For 6

We are being throttled by the k8s API (the client rate limiter is kicking in). I would reduce the polling frequency to allow more of the polls to go through; we would have the same tradeoff as the last one. I would also like to log the state of the Retina agents and maybe evaluate at runtime whether we need to end the test early. A rough sketch of lower-frequency polling follows.
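
A minimal sketch of lower-frequency polling with `wait.PollUntilContextTimeout`; the `checkService` function, interval, and timeout are illustrative assumptions:

```go
// Sketch only: poll for resource readiness at a relaxed interval so the
// client-side rate limiter is not exhausted before the deadline.
package main

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForService polls checkService every 10s for up to 20 minutes.
// checkService is a placeholder for the real readiness check
// (e.g. "does retina-svc exist and have endpoints?").
func waitForService(ctx context.Context, checkService func(context.Context) (bool, error)) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 20*time.Minute, true, checkService)
}
```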

@ritwikranjan
Contributor

Number 6 was already addressed by increasing the timeout from 8 min to 20 min, so not much is required there. I have addressed 1, 2 & 4 in PR #867.

github-merge-queue bot pushed a commit that referenced this issue Oct 17, 2024
…timeouts (partially fixes #449) (#867)

# Description

This pull request includes several changes to improve the reliability
and efficiency of the Kubernetes end-to-end testing framework. The most
important changes involve adjusting timeouts and retry mechanisms to
enhance robustness and reduce wait times.

### Improvements to retry mechanisms:

*
[`test/e2e/framework/kubernetes/exec-pod.go`](diffhunk://#diff-ebfee2072870c7e30ca7222eab3f94550af60a5ac8de53aa632a949fcd4fd667L42-R54):
Added retry logic using `retry.OnError` for executing commands in a pod
to handle transient errors more gracefully.
*
[`test/e2e/framework/kubernetes/port-forward.go`](diffhunk://#diff-ed249ad2b2805041dfd7ff7466005e33aa4a55adc7728524f6c060af8131dd61L22-R30):
Enabled exponential backoff in the default retrier to improve the
efficiency of retry attempts.

### Adjustments to timeouts and delays:

*
[`test/e2e/framework/azure/create-cluster-with-npm.go`](diffhunk://#diff-bf0613d6eb45a2bdc85b1446ab964c4249722b16bdc616129ec7b71dc0185553L21-R21):
Increased the `clusterTimeout` from 10 to 15 minutes to allow more time
for cluster creation.
*
[`test/e2e/framework/kubernetes/port-forward.go`](diffhunk://#diff-ed249ad2b2805041dfd7ff7466005e33aa4a55adc7728524f6c060af8131dd61L22-R30):
Reduced the `defaultRetryDelay` from 5 seconds to 500 milliseconds to
decrease the wait time between retry attempts.

### Dependency updates:

*
[`test/e2e/framework/kubernetes/exec-pod.go`](diffhunk://#diff-ebfee2072870c7e30ca7222eab3f94550af60a5ac8de53aa632a949fcd4fd667R16):
Added import for `k8s.io/client-go/util/retry` to support the new retry
logic.


## Related Issue

It fixes issue #449, which talks about the intermittent failures in our e2e tests.

## Checklist

- [X] I have read the [contributing
documentation](https://retina.sh/docs/contributing).
- [X] I signed and signed-off the commits (`git commit -S -s ...`). See
[this
documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
on signing commits.
- [X] I have correctly attributed the author(s) of the code.
- [X] I have tested the changes locally.
- [X] I have followed the project's style guidelines.
- [ ] I have updated the documentation, if necessary.
- [ ] I have added tests, if applicable.

github-merge-queue bot pushed a commit that referenced this issue Oct 23, 2024
# Description

This pull request includes changes to improve the end-to-end resiliency
in DNS scenarios. The changes include:

- Added a delay to guarantee that the pods will have the BPF program
attached before executing further steps in the scenario (a rough sketch of
such a delay step is shown after this list).

- Updated the label selector for the `CreateDenyAllNetworkPolicy` step
to target the `agnhost-drop` pods instead of `agnhost-a` pods.

- Updated the `CreateAgnhostStatefulSet` step to use the `agnhost-drop`
name instead of `agnhost-a`.

- Updated the `ExecInPod` steps to target the `agnhost-drop-0` pod
instead of `agnhost-a-0`.

- Updated the `ValidateRetinaDropMetric` step to use the `agnhost-drop`
source instead of `agnhost-a`.

- Updated the `DeleteKubernetesResource` step to target the
`agnhost-drop` stateful set instead of `agnhost-a`.
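
A minimal sketch of what such a delay step could look like; the `Step` interface and `Sleep` type shown here are assumptions for illustration, not the framework's actual types:

```go
// Sketch only: a fixed-delay "step" that gives the Retina agents time to
// attach the BPF program before subsequent scenario steps run.
package main

import (
	"fmt"
	"time"
)

// Step is an assumed minimal interface for a scenario step.
type Step interface {
	Run() error
}

// Sleep pauses the scenario for Duration before the next step executes.
type Sleep struct {
	Duration time.Duration
}

func (s *Sleep) Run() error {
	fmt.Printf("sleeping %s to allow the BPF program to attach\n", s.Duration)
	time.Sleep(s.Duration)
	return nil
}
```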


## Related Issue

It addresses issue #449, which talks about the intermittent failures in our e2e tests.

## Checklist

- [x] I have read the [contributing
documentation](https://retina.sh/docs/contributing).
- [x] I signed and signed-off the commits (`git commit -S -s ...`). See
[this
documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
on signing commits.
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [ ] I have updated the documentation, if necessary.
- [ ] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed


![image](https://github.com/user-attachments/assets/5d8697b9-99e1-4bf7-9db8-12055b2d5ce0)


Signed-off-by: Ritwik Ranjan <ritwikranjan@microsoft.com>
@ritwikranjan
Contributor

We have fixed all identified flakiness, closing the issue!
