ROX-27073: add cluster migration e2e test #2127

Open
wants to merge 4 commits into
base: main

Contributor

@johannes94 johannes94 commented Dec 16, 2024

Description

  • Adds a Ginkgo e2e test suite for multicluster tests
  • First test verifies cluster migration between multiple clusters
  • Adds a make target, test/multicluster/e2e, to run the tests
  • Moves logic shared with other e2e tests to a testutils package
  • Changes the admin CentralRequest presenter to actually set the ClusterId, which is required for the test
  • See the test manual on how to execute this against 2 infra clusters
  • Extends our k8s client pkg to allow loading config for multiple clusters

Checklist (Definition of Done)

  • Unit and integration tests added
  • Added test description under Test manual
  • Documentation added if necessary (i.e. changes to dev setup, test execution, ...)
  • CI and all relevant tests are passing
  • Add the ticket number to the PR title if available, i.e. ROX-12345: ...
  • Discussed security and business related topics privately. Will move any security and business related topics that arise to private communication channel.
  • Add secret to app-interface Vault or Secrets Manager if necessary
  • RDS changes were e2e tested manually
  • Check AWS limits are reasonable for changes provisioning new resources
  • (If applicable) Changes to the dp-terraform Helm values have been reflected in the addon on integration environment

Test manual

# Start 2 OSD on AWS clusters with infractl
# Login to kerberos for secret access
kinit jmalsam

export CLUSTER_TYPE="infra-openshift"
export CLUSTER_1_KUBECONFIG=/Users/johannes/kubes/cluster1
export CLUSTER_2_KUBECONFIG=/Users/johannes/kubes/cluster2
export QUAY_USER=<your user>
export QUAY_TOKEN=<your token>
export ENABLE_CENTRAL_EXTERNAL_CERTIFICATE="true"
export ROUTE53_ACCESS_KEY=<route53-key-id>
export ROUTE53_SECRET_ACCESS_KEY=<route53-secret>
export STATIC_TOKEN=<static-token-from-ci>
export STATIC_TOKEN_ADMIN=<static-admin-token-from-ci>

export INFRA_TOKEN=<your-infra-token>

url=$(infractl artifacts jm-migration-1 --json | jq '.Artifacts[] | select(.Name=="kubeconfig") | .URL' -r)
wget -O "$CLUSTER_1_KUBECONFIG" "$url"

url=$(infractl artifacts jm-migration-2 --json | jq '.Artifacts[] | select(.Name=="kubeconfig") | .URL' -r)
wget -O "$CLUSTER_2_KUBECONFIG" "$url"

bash scripts/ci/multicluster_tests/deploy.sh

# Verify all services are running as expected
KUBECONFIG=$CLUSTER_1_KUBECONFIG kubectl get pods -n rhacs # fleet-manager and fleetshard-sync must be healthy
KUBECONFIG=$CLUSTER_2_KUBECONFIG kubectl get pods -n rhacs # fleetshard-sync must be healthy, fleet-manager does not exist

make test/multicluster/e2e
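
A possible preflight check for the exports above, sketched in Go because the test suite is written in Go. It is not part of this PR: `requiredVars` and `missingVars` are hypothetical names, and treating every exported variable as required is an assumption, since the manual does not say which ones are optional.

```go
package main

import (
	"fmt"
	"os"
	"strings"
)

// requiredVars mirrors the exports in the test manual.
// Assumption: all of them are needed; the manual does not mark any as optional.
var requiredVars = []string{
	"CLUSTER_TYPE",
	"CLUSTER_1_KUBECONFIG",
	"CLUSTER_2_KUBECONFIG",
	"QUAY_USER",
	"QUAY_TOKEN",
	"ENABLE_CENTRAL_EXTERNAL_CERTIFICATE",
	"ROUTE53_ACCESS_KEY",
	"ROUTE53_SECRET_ACCESS_KEY",
	"STATIC_TOKEN",
	"STATIC_TOKEN_ADMIN",
	"INFRA_TOKEN",
}

// missingVars returns every name in required for which lookup yields
// no value or only whitespace. lookup has the shape of os.LookupEnv so
// the function can be exercised without touching the real environment.
func missingVars(required []string, lookup func(string) (string, bool)) []string {
	var missing []string
	for _, name := range required {
		if v, ok := lookup(name); !ok || strings.TrimSpace(v) == "" {
			missing = append(missing, name)
		}
	}
	return missing
}

func main() {
	missing := missingVars(requiredVars, os.LookupEnv)
	if len(missing) > 0 {
		fmt.Println("missing environment variables:", strings.Join(missing, ", "))
		return
	}
	fmt.Println("all variables from the test manual are set")
}
```

Running it before `deploy.sh` would surface a forgotten export early instead of partway through the deployment.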

Contributor

openshift-ci bot commented Dec 16, 2024

[APPROVALNOTIFIER] This PR is APPROVED

This pull-request has been approved by: johannes94

The full list of commands accepted by this bot can be found here.

The pull request process is described here

Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

Collaborator

@ebensh ebensh left a comment

Several nits, nothing major - the flow of the e2e test makes sense and is easy enough to read with the comments you provided throughout

@@ -15,6 +15,10 @@
emailsender-manifests.yaml
central-chart/
pids-port-forward
# temp files created by multicluster e2e tests
cluster-list.json
Collaborator

Where do these come from / why can we not clean them up after the test?

Collaborator

Found them in deploy.sh - is there no way to have a cleanup.sh equivalent? Or better to leave the artifacts in case debugging is necessary?

Contributor Author

Those files are created by the test and not cleaned up, so you can already use them for debugging. This change is just the .gitignore entry that prevents those files from being committed and pushed to the repo.

@@ -0,0 +1,35 @@
package dns
Collaborator

Would waiting for "deleting" state instead of "deprovisioning" rely on the FM's cleanup of the routes instead of manually cleaning them up here? (sorry if missing something, not used to looking at the e2e tests as much)

Contributor Author

I'm not 100% sure about this, but I think FM only starts deleting the route records in the deleting state.

We trigger a deletion and wait for it to complete at the end of this test, which usually cleans up the records.

This logic is intended as an additional safeguard for test failures. On a failure we might be stuck somewhere in the usual reconciliation loop, so the deletion lifecycle logic never gets executed. In a case like this we don't want to leak those resources.

@@ -32,8 +31,8 @@ var (
 dnsEnabled bool
 routesEnabled bool
 route53Client *route53.Route53
-waitTimeout = getWaitTimeout()
-extendedWaitTimeout = getWaitTimeout() * 3
+waitTimeout = testutil.GetWaitTimeout()
Collaborator

Just a meta-comment - adding the testutil. in a separate PR if possible would've made reviewing just the cluster migration part easier. It's fast enough to skim by but next time would recommend splitting into 2 PRs

Contributor Author

Yes, I thought about splitting PRs as well along the way, decided to not do it since I only wanted very few changes. Then it grew over time to what you see now :) . I'll split it next time.

-Expect(*recordSet.Name).To(Equal(domain))
-Expect(*record.Value).To(Equal(reencryptIngress.RouterCanonicalHostname)) // TODO use route specific ingress instead of comparing with reencryptIngress for all cases
-}
+testutil.AssertDNSMatchesRouter(dnsRecordsLoader.CentralDomainNames, recordSets, &reencryptIngress)
Collaborator

👍


const (
dpCloudProvider = "standalone"
dpRegion = "standalone"
Collaborator

nit - a unique value from dpCloudProvider

Contributor Author

Do you mean dpCloudProvider should have a different value than dpRegion?

Unfortunately that's not so easy, because that's what we have in the cluster configurations. All other tests do it this way as well.

func assertClusterAssignment(expectedClusterID string, centralID string, adminAPI fleetmanager.AdminAPI) {
var clusterAssignment string
Eventually(func() (err error) {
// assert the cluster ID outside the Eventually, since once we have a non-empty
Collaborator

Is this necessary? https://onsi.github.io/gomega/#eventually

Eventually checks that an assertion eventually passes. Eventually blocks when called and attempts an assertion periodically until it passes or a timeout occurs.
It shouldn't keep polling once the condition passes

Collaborator

Ah, unless the comment means "Once you get a value it will not change, so then do the assert" - instead of e.g. checking the value inside the loop and, if the value is there but wrong, getting stuck in the polling loop even though you know it will not change?

Contributor Author

Yes, this is exactly what the comment means. I will try to express it more clearly.
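
A stdlib-only sketch of the pattern discussed in this thread, without Gomega: poll only until a non-empty cluster ID shows up, then compare it to the expected value exactly once. `pollUntil` and `waitForAssignment` are hypothetical names, and the timeout and interval values are arbitrary.

```go
package main

import (
	"errors"
	"fmt"
	"time"
)

// pollUntil calls fn every interval until it returns nil or the timeout elapses.
func pollUntil(timeout, interval time.Duration, fn func() error) error {
	deadline := time.Now().Add(timeout)
	for {
		err := fn()
		if err == nil {
			return nil
		}
		if time.Now().After(deadline) {
			return fmt.Errorf("timed out after %s: %w", timeout, err)
		}
		time.Sleep(interval)
	}
}

// waitForAssignment polls fetch until it reports a non-empty cluster ID.
// The comparison with expected happens after polling: once an ID is set
// it is assumed not to change, so a wrong ID should fail right away
// instead of being retried until the timeout.
func waitForAssignment(expected string, fetch func() (string, error)) error {
	var assigned string
	err := pollUntil(time.Second, 10*time.Millisecond, func() error {
		id, err := fetch()
		if err != nil {
			return err
		}
		if id == "" {
			return errors.New("cluster ID not set yet")
		}
		assigned = id
		return nil
	})
	if err != nil {
		return err
	}
	if assigned != expected {
		return fmt.Errorf("central assigned to %q, want %q", assigned, expected)
	}
	return nil
}

func main() {
	calls := 0
	// Simulated API: the assignment becomes visible on the third call.
	fetch := func() (string, error) {
		calls++
		if calls < 3 {
			return "", nil
		}
		return "cluster-2", nil
	}
	fmt.Println("matching ID:", waitForAssignment("cluster-2", fetch))
	fmt.Println("wrong ID:", waitForAssignment("cluster-1", fetch))
}
```

Had the comparison lived inside the polled function, the second call would keep retrying until the timeout even though the value can no longer change.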

for _, central := range centralList.Items {
if central.Id == centralID {
tenantExists = true
clusterID = central.ClusterId
Collaborator

break?
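
One way to get the effect of the suggested `break`, sketched with a hypothetical helper: extract the lookup into a function that returns as soon as the matching central is found. The `central` struct is a trimmed-down stand-in for the list item type, with field names mirroring the snippet above.

```go
package main

import "fmt"

// central is a hypothetical, reduced version of the list item;
// only the two fields used in the loop are modeled.
type central struct {
	Id        string
	ClusterId string
}

// findClusterID returns the cluster ID of the central with the given ID.
// It stops at the first match, so the rest of the list is not scanned.
func findClusterID(items []central, centralID string) (clusterID string, found bool) {
	for _, c := range items {
		if c.Id == centralID {
			return c.ClusterId, true
		}
	}
	return "", false
}

func main() {
	items := []central{
		{Id: "central-a", ClusterId: "cluster-1"},
		{Id: "central-b", ClusterId: "cluster-2"},
	}
	clusterID, tenantExists := findClusterID(items, "central-b")
	fmt.Println(clusterID, tenantExists)
}
```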

}

// ObtainCentralRequest queries fleet-manager public API for the CentralRequest with id and stores it in the given pointer
func ObtainCentralRequest(ctx context.Context, client *fleetmanager.Client, id string, request *public.CentralRequest) error {
Collaborator

Nit - would use Get or Fetch instead of Obtain, if possible. Just a bit more consistent

Contributor Author

You're right. It's called that way because the private function I copied to testutil was named like that before.

I changed it.

Contributor

openshift-ci bot commented Dec 19, 2024

@johannes94: The following tests failed, say /retest to rerun all failed tests or /retest-required to rerun all mandatory failed tests:

Test name      | Commit  | Details | Required | Rerun command
ci/prow/e2e    | 86a3959 | link    | true     | /test e2e
ci/prow/images | 86a3959 | link    | true     | /test images

Full PR test history. Your PR dashboard.

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository. I understand the commands that are listed here.
