Fix flaky failover tests #1682

abaguas · 2024-08-01T15:41:18Z

The following test has failed intermittently:

    --- FAIL: TestFullFailover/embedded_ingress_start_podinfo_on_the_second_cluster (124.78s)

A closer look at the failing runs (example run 1, example run 2) shows that the test expects the local targets to be empty, but they contain the entries of the second cluster:

TestFullFailover 2024-08-01T12:51:13Z logger.go:66: [172.18.0.7 172.18.0.8]
{
	"annotation": "expected IPs: []",
	"app-msg": "us",
	"podinfo-running": true,
	"podinfo-replicas": "1",
	"local-targets-ip": [
		"172.18.0.7",
		"172.18.0.8"
	],
	"ingress-ip": [
		"172.18.0.7",
		"172.18.0.8"
	],
	"dig-result": [
		"172.18.0.7",
		"172.18.0.8"
	],
	"coredns-ip": "10.43.198.151",
	"gslb-status": "map[terratest-failover.cloud.example.com:Healthy]",
	"cluster": "k3d-test-gslb2",
	"namespace": "k8gb-test-qndvzj",
	"ep0-dns-name": "localtargets-terratest-failover.cloud.example.com",
	"ep0-dns-targets": "[172.18.0.7 172.18.0.8]",
	"ep1-dns-name": "terratest-failover.cloud.example.com",
	"ep1-dns-targets": "[172.18.0.7 172.18.0.8]"
}
=== NAME  TestFullFailover/embedded_ingress_start_podinfo_on_the_second_cluster
    k8gb_full_failover_test.go:107: 
        	Error Trace:	/home/runner/work/k8gb/k8gb/terratest/test/k8gb_full_failover_test.go:107
        	Error:      	Received unexpected error:
        	            	'Wait for failover to happen and coredns to pickup new values...' unsuccessful after 120 retries
        	Test:       	TestFullFailover/embedded_ingress_start_podinfo_on_the_second_cluster

An empty expectation doesn't make sense here. The app had 0 replicas in both clusters and was then scaled out in the second cluster, so the expected targets should be the IP addresses of the second cluster. This is also what the test tries to verify:

		err = instanceUS.WaitForExpected(usLocalTargets)

It follows that usLocalTargets is empty. This was double-checked by adding some debug statements in the following build.

The usLocalTargets variable is populated by calling instanceUS.GetLocalTargets(), and the test tries to make sure it is not empty by calling WaitForAppIsRunning() beforehand. WaitForAppIsRunning() waits until the DNSEndpoint resource is populated, but it does not verify that coredns has picked up these values. That pickup should be very fast, but apparently the test sometimes runs ahead of it and ends up with empty results.

To fix the above, this PR adds a step to WaitForAppIsRunning() that verifies that coredns has picked up the entries. I am very confident this will solve the flaky test. If it does not, we will at least have additional log data showing how long coredns takes to pick up entries configured via DNSEndpoint resources.

Signed-off-by: Andre Baptista Aguas <andre.aguas@protonmail.com>

ytsarev

Great find 👍

abaguas requested review from donovanmuller, k0da, kuritka, ytsarev and jkremser as code owners August 1, 2024 15:41

abaguas marked this pull request as draft August 1, 2024 15:41

abaguas changed the title ~~Trying to fix flaky failover tests~~ Trying to fix flaky failover tests (in progress) Aug 1, 2024

abaguas force-pushed the fix/tests branch 3 times, most recently from 15e5503 to 67f29c7 Compare August 1, 2024 17:12

debug failing tests

f210504

Signed-off-by: Andre Baptista Aguas <andre.aguas@protonmail.com>

abaguas force-pushed the fix/tests branch from 67f29c7 to f210504 Compare August 1, 2024 17:14

abaguas changed the title ~~Trying to fix flaky failover tests (in progress)~~ Fix flaky failover tests Aug 1, 2024

abaguas marked this pull request as ready for review August 1, 2024 17:40

ytsarev approved these changes Aug 1, 2024

ytsarev merged commit 8ef36b4 into k8gb-io:master Aug 1, 2024
15 checks passed

abaguas mentioned this pull request Aug 2, 2024

Fix flaky e2e test TestFailoverPlayground/*stop_podinfo_on_eu_cluster #1684

Merged
