
test(e2e): Implement retriers for tests that could fail due to flakiness #449

Closed
Tracked by #355
nddq opened this issue Jun 6, 2024 · 4 comments

@nddq
Contributor

nddq commented Jun 6, 2024

No description provided.

@nddq nddq added the area/infra Test, Release, or CI Infrastructure label Jun 6, 2024
@nddq nddq modified the milestone: 1.0 Jun 6, 2024
@ritwikranjan
Contributor

ritwikranjan commented Oct 9, 2024

I dug through some recent failures to find the common causes and came up with the following, in order of recency:


1

At least 2 times, and the most recent.

2024/10/08 19:37:49 executing command "nslookup kubernetes.default" on pod "agnhost-basic-dns-port-forward-2360488725034690767-0" in namespace "kube-system"...
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:77
        	Error:      	Received unexpected error:
        	            	did not expect error from step ExecInPod but got error: error executing command [nslookup kubernetes.default]: error executing command: error dialing backend: EOF
        	Test:       	TestE2ERetina

2

2024/09/20 15:47:12 checking for metrics on http://localhost:10093/metrics
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:69
        	Error:      	Received unexpected error:
        	            	did not expect error from step ValidateAdvancedDNSRequestMetrics but got error: failed to verify advance dns request metrics networkobservability_adv_dns_request_count: failed to get prometheus metrics: could not start port forward within 300000000000s: HTTP request failed: Get "http://localhost:10093/metrics": dial tcp [::1]:10093: connect: connection refused	
        	Test:       	TestE2ERetina

3

2024/08/29 16:04:58 attempting to find pod with label "k8s-app=retina", on a node with a pod with label "app=agnhost-a"
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:65
        	Error:      	Received unexpected error:
        	            	did not expect error from step PortForward but got error: could not find pod with affinity: could not find a pod with label "k8s-app=retina", on a node that also has a pod with label "app=agnhost-a": no pod with label found with matching pod affinity
        	Test:       	TestE2ERetina

4

At least 2 times.

2024/08/27 17:47:16 failed to create cluster: context deadline exceeded
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:53
        	Error:      	Received unexpected error:
        	            	did not expect error from step CreateNPMCluster but got error: failed to create cluster: context deadline exceeded
        	Test:       	TestE2ERetina

5

2024/08/26 23:50:24 failed to find metric matching networkobservability_adv_dns_request_count: map[ip:10.224.4.108 namespace:kube-system podname:agnhost-adv-dns-port-forward-4243681485157638409-0 query:kubernetes.default.svc.cluster.local. query_type:A workload_kind:StatefulSet workload_name:agnhost-adv-dns-port-forward-4243681485157638409]
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:69
        	Error:      	Received unexpected error:
        	            	did not expect error from step ValidateAdvancedDNSRequestMetrics but got error: failed to verify advance dns request metrics networkobservability_adv_dns_request_count: failed to get prometheus metrics: no metric found
        	Test:       	TestE2ERetina

6

Most of the older errors; at least 5 times.

2024/08/23 15:53:02 Error received when checking status of resource retina-svc. Error: 'client rate limiter Wait returned an error: context deadline exceeded', Resource details: 'Resource: "/v1, Resource=services", GroupVersionKind: "/v1, Kind=Service"
Name: "retina-svc", Namespace: "kube-system"'
2024/08/23 15:53:02 Retryable error? true
2024/08/23 15:53:02 Retrying as current number of retries 0 less than max number of retries 30
    runner.go:27: 
        	Error Trace:	/home/runner/work/retina/retina/test/e2e/framework/types/runner.go:27
        	            				/home/runner/work/retina/retina/test/e2e/retina_e2e_test.go:65
        	Error:      	Received unexpected error:
        	            	did not expect error from step InstallHelmChart but got error: failed to install chart: context deadline exceeded
        	Test:       	TestE2ERetina

@ritwikranjan
Contributor

ritwikranjan commented Oct 10, 2024

Let's triage one by one

For 1

The error is most likely coming from this step:

		{
			Step: &kubernetes.ExecInPod{
				PodName:      podName,
				PodNamespace: "kube-system",
				Command:      req.Command,
			},
			Opts: &types.StepOptions{
				ExpectError:               req.ExpectError,
				SkipSavingParametersToJob: true,
			},
		},

The ExecInPod step does not have any built-in retry mechanism, so any failure in the step calling this function will result in the test failing. We should add a retry strategy to make command execution resilient.
Potential tool: https://pkg.go.dev/k8s.io/client-go/util/retry

I also think we would benefit from adding this to almost every k8s operation that has a chance of failing due to network issues. The same solution can be used for the 3rd one as well; a rough sketch follows.
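
A minimal sketch of what this could look like, assuming a placeholder `execCommand` helper rather than the framework's actual exec code:

```go
// Sketch only: wrap a pod exec in retry.OnError so transient errors such as
// "error dialing backend: EOF" are retried instead of failing the step.
package main

import (
	"errors"
	"fmt"

	"k8s.io/client-go/util/retry"
)

// execInPodWithRetry retries execCommand with client-go's default backoff,
// treating every error as retriable for simplicity.
func execInPodWithRetry(execCommand func() error) error {
	return retry.OnError(retry.DefaultBackoff, func(err error) bool {
		return err != nil // a real check might only retry network-type errors
	}, execCommand)
}

func main() {
	attempts := 0
	err := execInPodWithRetry(func() error {
		attempts++
		if attempts < 3 {
			return errors.New("error dialing backend: EOF") // simulate a flake
		}
		return nil
	})
	fmt.Println("attempts:", attempts, "err:", err)
}
```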


For 2

We already have retry on port forwarding; I can see if I can tune it, but other than that I don't see a clear fix for this kind of intermittent issue. My idea would be to spread the retries out (e.g. with exponential backoff) in order to rule out network-related issues; a rough sketch follows.
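
A minimal sketch of spreading retries with exponential backoff via `k8s.io/apimachinery/pkg/util/wait`; the `tryPortForward` function and the backoff values are illustrative assumptions, not the framework's actual retrier:

```go
// Sketch only: retry a port-forward attempt with exponential backoff so the
// attempts are spread out over time instead of firing in quick succession.
package main

import (
	"fmt"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

func portForwardWithBackoff(tryPortForward func() error) error {
	backoff := wait.Backoff{
		Duration: 500 * time.Millisecond, // initial delay
		Factor:   2.0,                    // double the delay on each attempt
		Jitter:   0.1,                    // add some randomness
		Steps:    7,                      // ~500ms, 1s, 2s, 4s, ...
	}
	return wait.ExponentialBackoff(backoff, func() (bool, error) {
		if err := tryPortForward(); err != nil {
			fmt.Println("port-forward failed, will retry:", err)
			return false, nil // not done yet, keep retrying
		}
		return true, nil // success
	})
}
```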


For 4

Maybe increase the time limit and see if the issue persists. The downside is that the test would be stuck for longer in the case of an actual issue.


For 6

We are being throttled by the k8s API (the client rate limiter is kicking in). I would reduce the polling frequency to allow more of the polls to go through; we would have the same tradeoff as the last one. I would also like to log the state of the Retina agents and maybe evaluate at runtime whether we need to end the test early. A rough sketch of lower-frequency polling follows.
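
A minimal sketch of lower-frequency polling with `wait.PollUntilContextTimeout`; the `checkService` function, interval, and timeout are illustrative assumptions:

```go
// Sketch only: poll for resource readiness at a relaxed interval so the
// client-side rate limiter is not exhausted before the deadline.
package main

import (
	"context"
	"time"

	"k8s.io/apimachinery/pkg/util/wait"
)

// waitForService polls checkService every 10s for up to 20 minutes.
// checkService is a placeholder for the real readiness check
// (e.g. "does retina-svc exist and have endpoints?").
func waitForService(ctx context.Context, checkService func(context.Context) (bool, error)) error {
	return wait.PollUntilContextTimeout(ctx, 10*time.Second, 20*time.Minute, true, checkService)
}
```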

@ritwikranjan
Contributor

Number 6 was already addressed by increasing the timeout from 8 min to 20 min, so not much is required there. I have addressed 1, 2 & 4 in PR #867.

github-merge-queue bot pushed a commit that referenced this issue Oct 17, 2024
…timeouts (partially fixes #449) (#867)

# Description

This pull request includes several changes to improve the reliability
and efficiency of the Kubernetes end-to-end testing framework. The most
important changes involve adjusting timeouts and retry mechanisms to
enhance robustness and reduce wait times.

### Improvements to retry mechanisms:

*
[`test/e2e/framework/kubernetes/exec-pod.go`](diffhunk://#diff-ebfee2072870c7e30ca7222eab3f94550af60a5ac8de53aa632a949fcd4fd667L42-R54):
Added retry logic using `retry.OnError` for executing commands in a pod
to handle transient errors more gracefully.
*
[`test/e2e/framework/kubernetes/port-forward.go`](diffhunk://#diff-ed249ad2b2805041dfd7ff7466005e33aa4a55adc7728524f6c060af8131dd61L22-R30):
Enabled exponential backoff in the default retrier to improve the
efficiency of retry attempts.

### Adjustments to timeouts and delays:

*
[`test/e2e/framework/azure/create-cluster-with-npm.go`](diffhunk://#diff-bf0613d6eb45a2bdc85b1446ab964c4249722b16bdc616129ec7b71dc0185553L21-R21):
Increased the `clusterTimeout` from 10 to 15 minutes to allow more time
for cluster creation.
*
[`test/e2e/framework/kubernetes/port-forward.go`](diffhunk://#diff-ed249ad2b2805041dfd7ff7466005e33aa4a55adc7728524f6c060af8131dd61L22-R30):
Reduced the `defaultRetryDelay` from 5 seconds to 500 milliseconds to
decrease the wait time between retry attempts.

### Dependency updates:

*
[`test/e2e/framework/kubernetes/exec-pod.go`](diffhunk://#diff-ebfee2072870c7e30ca7222eab3f94550af60a5ac8de53aa632a949fcd4fd667R16):
Added import for `k8s.io/client-go/util/retry` to support the new retry
logic.


## Related Issue

It fixes issue #449, which talks about the intermittent failures in our e2e tests.

## Checklist

- [X] I have read the [contributing
documentation](https://retina.sh/docs/contributing).
- [X] I signed and signed-off the commits (`git commit -S -s ...`). See
[this
documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
on signing commits.
- [X] I have correctly attributed the author(s) of the code.
- [X] I have tested the changes locally.
- [X] I have followed the project's style guidelines.
- [ ] I have updated the documentation, if necessary.
- [ ] I have added tests, if applicable.

github-merge-queue bot pushed a commit that referenced this issue Oct 23, 2024
# Description

This pull request includes changes to improve the end-to-end resiliency
in DNS scenarios. The changes include:

- Added a delay to guarantee that the pods will have the BPF program
attached before executing further steps in the scenario (a rough sketch of
such a delay step is shown after this list).

- Updated the label selector for the `CreateDenyAllNetworkPolicy` step
to target the `agnhost-drop` pods instead of `agnhost-a` pods.

- Updated the `CreateAgnhostStatefulSet` step to use the `agnhost-drop`
name instead of `agnhost-a`.

- Updated the `ExecInPod` steps to target the `agnhost-drop-0` pod
instead of `agnhost-a-0`.

- Updated the `ValidateRetinaDropMetric` step to use the `agnhost-drop`
source instead of `agnhost-a`.

- Updated the `DeleteKubernetesResource` step to target the
`agnhost-drop` stateful set instead of `agnhost-a`.
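
A minimal sketch of what such a delay step could look like; the `Step` interface and `Sleep` type shown here are assumptions for illustration, not the framework's actual types:

```go
// Sketch only: a fixed-delay "step" that gives the Retina agents time to
// attach the BPF program before subsequent scenario steps run.
package main

import (
	"fmt"
	"time"
)

// Step is an assumed minimal interface for a scenario step.
type Step interface {
	Run() error
}

// Sleep pauses the scenario for Duration before the next step executes.
type Sleep struct {
	Duration time.Duration
}

func (s *Sleep) Run() error {
	fmt.Printf("sleeping %s to allow the BPF program to attach\n", s.Duration)
	time.Sleep(s.Duration)
	return nil
}
```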


## Related Issue

It addresses issue #449, which talks about the intermittent failures in our e2e tests.

## Checklist

- [x] I have read the [contributing
documentation](https://retina.sh/docs/contributing).
- [x] I signed and signed-off the commits (`git commit -S -s ...`). See
[this
documentation](https://docs.github.com/en/authentication/managing-commit-signature-verification/about-commit-signature-verification)
on signing commits.
- [x] I have correctly attributed the author(s) of the code.
- [x] I have tested the changes locally.
- [x] I have followed the project's style guidelines.
- [ ] I have updated the documentation, if necessary.
- [ ] I have added tests, if applicable.

## Screenshots (if applicable) or Testing Completed


![image](https://github.com/user-attachments/assets/5d8697b9-99e1-4bf7-9db8-12055b2d5ce0)


Signed-off-by: Ritwik Ranjan <ritwikranjan@microsoft.com>
@ritwikranjan
Contributor

We have fixed all identified flakiness, closing the issue!
