Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix flaky failover tests #1682

Merged
merged 1 commit into from
Aug 1, 2024
Merged

Fix flaky failover tests #1682

merged 1 commit into from
Aug 1, 2024

Conversation

abaguas
Copy link
Collaborator

@abaguas abaguas commented Aug 1, 2024

The following test has been failing a couple of times

    --- FAIL: TestFullFailover/embedded_ingress_start_podinfo_on_the_second_cluster (124.78s)

Analysing the runs (example run 1, example run 2) in more detail it can be seen that the test expects the local targets to be empty, but they contain the entries of the second cluster:

TestFullFailover 2024-08-01T12:51:13Z logger.go:66: [172.18.0.7 172.18.0.8]
{
	"annotation": "expected IPs: []",
	"app-msg": "us",
	"podinfo-running": true,
	"podinfo-replicas": "1",
	"local-targets-ip": [
		"172.18.0.7",
		"172.18.0.8"
	],
	"ingress-ip": [
		"172.18.0.7",
		"172.18.0.8"
	],
	"dig-result": [
		"172.18.0.7",
		"172.18.0.8"
	],
	"coredns-ip": "10.43.198.151",
	"gslb-status": "map[terratest-failover.cloud.example.com:Healthy]",
	"cluster": "k3d-test-gslb2",
	"namespace": "k8gb-test-qndvzj",
	"ep0-dns-name": "localtargets-terratest-failover.cloud.example.com",
	"ep0-dns-targets": "[172.18.0.7 172.18.0.8]",
	"ep1-dns-name": "terratest-failover.cloud.example.com",
	"ep1-dns-targets": "[172.18.0.7 172.18.0.8]"
}
=== NAME  TestFullFailover/embedded_ingress_start_podinfo_on_the_second_cluster
    k8gb_full_failover_test.go:107: 
        	Error Trace:	/home/runner/work/k8gb/k8gb/terratest/test/k8gb_full_failover_test.go:107
        	Error:      	Received unexpected error:
        	            	'Wait for failover to happen and coredns to pickup new values...' unsuccessful after 120 retries
        	Test:       	TestFullFailover/embedded_ingress_start_podinfo_on_the_second_cluster

This doesn't make sense. The app had 0 replicas in both clusters, and then the app was scaled out in the second cluster, so the expected targets should be the IP addresses of the second cluster. This is also what the test tries to verify:

		err = instanceUS.WaitForExpected(usLocalTargets)

It can then be concluded that usLocalTargets is empty. This was double checked by adding some debug statements in the following build.

The usLocalTargets variable is populated by calling instanceUS.GetLocalTargets(), and the test tries to make sure it is not empty by calling WaitForAppIsRunning() beforehand. This function WaitForAppIsRunning() waits until the DNSEndpoint` resource is populated, however it does not verify if coredns picked up these values. This operation should be very fast, but apparently the test execution is sometimes too quick, ending up with empty results.

To fix the above, this PR adds a step to WaitForAppIsRunning() where it verifies if coredns picked up the entries. I am very confidant this will solve the flaky test. If that is not the case at least we will have additional log data to see how quickly coredns takes to pick up entries configured via DNSEndpoint resources.

@abaguas abaguas marked this pull request as draft August 1, 2024 15:41
@abaguas abaguas changed the title Trying to fix flaky failover tests Trying to fix flaky failover tests (in progress) Aug 1, 2024
@abaguas abaguas force-pushed the fix/tests branch 3 times, most recently from 15e5503 to 67f29c7 Compare August 1, 2024 17:12
Signed-off-by: Andre Baptista Aguas <andre.aguas@protonmail.com>
@abaguas abaguas changed the title Trying to fix flaky failover tests (in progress) Fix flaky failover tests Aug 1, 2024
@abaguas abaguas marked this pull request as ready for review August 1, 2024 17:40
Copy link
Member

@ytsarev ytsarev left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Great find 👍

@ytsarev ytsarev merged commit 8ef36b4 into k8gb-io:master Aug 1, 2024
15 checks passed
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants