ROX-27073: add cluster migration e2e test #2127

johannes94 · 2024-12-16T09:44:02Z

Description

Adds a Ginkgo e2e test suite for multicluster tests
The first test verifies cluster migration between multiple clusters
Adds a make target, test/multicluster/e2e, to run the tests
Moves logic shared with the other e2e tests into a testutil package
Changes the admin CentralRequest presenter to actually set the ClusterId, which the test requires
See the test manual for how to run this against 2 infra clusters
Extends our k8s client pkg to allow loading config for multiple clusters

Checklist (Definition of Done)

Test manual

# Start 2 infra clusters (OSD on AWS) with infractl
# Login to kerberos for secret access
kinit jmalsam

export CLUSTER_TYPE="infra-openshift"
export CLUSTER_1_KUBECONFIG=/Users/johannes/kubes/cluster1
export CLUSTER_2_KUBECONFIG=/Users/johannes/kubes/cluster2
export QUAY_USER=<your user>
export QUAY_TOKEN=<your token>
export ENABLE_CENTRAL_EXTERNAL_CERTIFICATE="true"
export ROUTE53_ACCESS_KEY=<route53-key-id>
export ROUTE53_SECRET_ACCESS_KEY=<route53-secret>
export STATIC_TOKEN=<static-token-from-ci>
export STATIC_TOKEN_ADMIN=<static-admin-token-from-ci>

export INFRA_TOKEN=<your-infra-token>

# Download the kubeconfig artifact of each infra cluster
url=$(infractl artifacts jm-migration-1 --json | jq -r '.Artifacts[] | select(.Name=="kubeconfig") | .URL')
wget -O "$CLUSTER_1_KUBECONFIG" "$url"

url=$(infractl artifacts jm-migration-2 --json | jq -r '.Artifacts[] | select(.Name=="kubeconfig") | .URL')
wget -O "$CLUSTER_2_KUBECONFIG" "$url"

bash scripts/ci/multicluster_tests/deploy.sh

# Verify all services are running as expected
KUBECONFIG=$CLUSTER_1_KUBECONFIG kubectl get pods -n rhacs # fleet-manager and fleetshard-sync must be healthy
KUBECONFIG=$CLUSTER_2_KUBECONFIG kubectl get pods -n rhacs # fleetshard-sync must be healthy, fleet-manager does not exist

make test/multicluster/e2e

openshift-ci · 2024-12-16T09:47:13Z

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johannes94

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

~~OWNERS~~ [johannes94]

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

ebensh

Several nits, nothing major - the flow of the e2e test makes sense and is easy enough to read with the comments you provided throughout

ebensh · 2024-12-19T10:24:20Z

.gitignore

@@ -15,6 +15,10 @@
 emailsender-manifests.yaml
 central-chart/
 pids-port-forward
+# temp files created by multicluster e2e tests
+cluster-list.json


Where do these come from / why can we not clean them up after the test?

Found them in deploy.sh - is there no way to have a cleanup.sh equivalent? Or better to leave the artifacts in case debugging is necessary?

Those files are created by the test setup and not cleaned up, so you can already use them for debugging. This change only adds them to .gitignore so that they are not committed and pushed to the repo.

ebensh · 2024-12-19T10:32:29Z

e2e/dns/record_cleanup.go

@@ -0,0 +1,35 @@
+package dns


Would waiting for "deleting" state instead of "deprovisioning" rely on the FM's cleanup of the routes instead of manually cleaning them up here? (sorry if missing something, not used to looking at the e2e tests as much)

I'm not 100% sure about this. But I think FM would only start deleting the route records on deleting state.

We trigger a deletion and wait for deletion at the end of this test, which usually cleans up the records.

This logic is intended as an additional safeguard for test failures. On a failure we might be stuck somewhere in the usual reconciliation loop, so the deletion lifecycle logic may never run. In that case we don't want to leak those resources.

ebensh · 2024-12-19T10:34:02Z

e2e/e2e_suite_test.go

@@ -32,8 +31,8 @@ var (
 	dnsEnabled            bool
 	routesEnabled         bool
 	route53Client         *route53.Route53
-	waitTimeout           = getWaitTimeout()
-	extendedWaitTimeout   = getWaitTimeout() * 3
+	waitTimeout           = testutil.GetWaitTimeout()


Just a meta-comment - adding the testutil. in a separate PR if possible would've made reviewing just the cluster migration part easier. It's fast enough to skim by but next time would recommend splitting into 2 PRs

Yes, I thought about splitting PRs as well along the way, decided to not do it since I only wanted very few changes. Then it grew over time to what you see now :) . I'll split it next time.

ebensh · 2024-12-19T10:34:56Z

e2e/e2e_test.go

-				Expect(*recordSet.Name).To(Equal(domain))
-				Expect(*record.Value).To(Equal(reencryptIngress.RouterCanonicalHostname)) // TODO use route specific ingress instead of comparing with reencryptIngress for all cases
-			}
+			testutil.AssertDNSMatchesRouter(dnsRecordsLoader.CentralDomainNames, recordSets, &reencryptIngress)


ebensh · 2024-12-19T10:40:32Z

e2e/multicluster/multicluster_migration_test.go

+
+const (
+	dpCloudProvider = "standalone"
+	dpRegion        = "standalone"


nit - a unique value from dpCloudProvider

Do you mean dpCloudProvider should have a different value than dpRegion?

Unfortunately that's not so easy, because that's what we have in the cluster configurations. All other tests do it like this as well.

ebensh · 2024-12-19T10:53:02Z

e2e/multicluster/multicluster_migration_test.go

+func assertClusterAssignment(expectedClusterID string, centralID string, adminAPI fleetmanager.AdminAPI) {
+	var clusterAssignment string
+	Eventually(func() (err error) {
+		// assert the cluster ID outside the Eventually, since once we have a non-empty


Is this necessary? https://onsi.github.io/gomega/#eventually

Eventually checks that an assertion eventually passes. Eventually blocks when called and attempts an assertion periodically until it passes or a timeout occurs.
It shouldn't keep polling once the condition passes

Ah, unless the comment means "Once you get a value it will not change, so then do the assert" - instead of e.g. checking the value inside the Eventually and, if the value is there but wrong, getting stuck in the polling loop even though you know it will not change?

Yes, this is exactly what the comment means. I will try to express it more clearly.

ebensh · 2024-12-19T10:54:30Z

e2e/multicluster/multicluster_migration_test.go

+	for _, central := range centralList.Items {
+		if central.Id == centralID {
+			tenantExists = true
+			clusterID = central.ClusterId


ebensh · 2024-12-19T10:57:37Z

e2e/testutil/testutil.go

+}
+
+// ObtainCentralRequest queries fleet-manager public API for the CentralRequest with id and stores it in the given pointer
+func ObtainCentralRequest(ctx context.Context, client *fleetmanager.Client, id string, request *public.CentralRequest) error {


Nit - would use Get or Fetch instead of Obtain, if possible. Just a bit more consistent

You're right. It's called that because the private function I copied to testutil had that name before.

I changed it.

openshift-ci · 2024-12-19T12:59:50Z

@johannes94: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

| Test name | Commit | Details | Required | Rerun command |
| --- | --- | --- | --- | --- |
| ci/prow/e2e | `86a3959` | link | true | `/test e2e` |
| ci/prow/images | `86a3959` | link | true | `/test images` |

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.

openshift-ci bot added the do-not-merge/work-in-progress label Dec 16, 2024

openshift-ci bot added the approved label Dec 16, 2024

openshift-merge-robot added the needs-rebase label Dec 16, 2024

johannes94 added 2 commits December 16, 2024 13:48

implemented ginkgo e2e test for cluster migration

3dc51b6

rebase and some fixes

d7b23d7

johannes94 force-pushed the jmalsam/migration-e2e-test branch from 2c4d4d4 to d7b23d7 December 17, 2024 08:27

johannes94 temporarily deployed to development December 17, 2024 08:27 — with GitHub Actions Inactive

openshift-merge-robot removed the needs-rebase label Dec 17, 2024

small bugfixes for test implementation, tested against infra

22c1b7b

johannes94 temporarily deployed to development December 17, 2024 14:15 — with GitHub Actions Inactive

johannes94 marked this pull request as ready for review December 17, 2024 14:19

openshift-ci bot removed the do-not-merge/work-in-progress label Dec 17, 2024

johannes94 temporarily deployed to development December 17, 2024 14:19 — with GitHub Actions Inactive

johannes94 requested a review from ebensh December 17, 2024 14:19

ebensh requested changes Dec 19, 2024

openshift-ci bot assigned ebensh Dec 19, 2024

PR feedback

86a3959

johannes94 temporarily deployed to development December 19, 2024 12:59 — with GitHub Actions Inactive
