Run VSphere tests in self-hosted, refactor test suite into spec per provider, support label filtering, get all tests green #360

Open: wants to merge 1 commit into main from vsphere-tests-on-self-hosted

Conversation

@squizzi (Contributor) commented on Sep 20, 2024

This PR refactors our existing e2e test spec into a more conventional e2e test suite. We now have a BeforeSuite that installs and validates the controller. Previously all of our e2e tests lived under a single Controller spec; this PR breaks them out into the following (sketched in the code example after the list):

  • Controller (Label(controller))
  • Providers (Label(provider))
    • AWS (Label(provider:aws, provider:cloud)) - test/e2e/provider_aws_test.go
    • Azure (Label(provider:azure, provider:cloud)) - test/e2e/provider_azure_test.go
    • Vsphere (Label(provider:vsphere, provider:onprem)) - test/e2e/provider_vsphere_test.go
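A minimal sketch of what this layout looks like with Ginkgo labels. The spec and test names below are illustrative, not the PR's exact code; only the file names and labels come from the list above.

```go
package e2e

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// The BeforeSuite installs the controller and validates it is ready before
// any provider spec runs (the real validation helpers live in the suite setup).
var _ = BeforeSuite(func() {
	By("installing and validating the hmc controller")
	Expect(true).To(BeTrue()) // placeholder for the actual validation
})

// Umbrella "controller" spec; today only the suite setup/teardown runs here.
var _ = Describe("Controller", Label("controller"), func() {})

// test/e2e/provider_aws_test.go
var _ = Describe("AWS Templates", Label("provider:aws", "provider:cloud"), func() {
	It("provisions standalone and hosted ManagedClusters", func() {
		// provider-specific test steps go here
	})
})

// test/e2e/provider_vsphere_test.go
var _ = Describe("vSphere Templates", Label("provider:vsphere", "provider:onprem"), func() {
	AfterEach(func() {
		// per this PR, vSphere cleanup happens in AfterEach
	})
	It("provisions a standalone ManagedCluster", func() {})
})
```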

The labels for these tests have been documented here: https://github.com/Mirantis/hmc/pull/360/files#diff-3a6cab76ecce19612a704fe0ef89dc83f1ed2df44f7bfa28d47d2529a7949fafR139

Each of the umbrella labels corresponds to a GitHub job that depends on the Build and Unit Test job, which provides setup. The Controller tests always run, regardless of label. Right now the Controller test is disabled because there aren't any tests other than the Before/AfterSuite, but we could enable it, since the suite setup is itself a test of the controller.
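Label filtering is what lets each CI job select its slice of the suite. Below is a hedged sketch of how a GINKGO_LABEL_FILTER environment variable (the variable this PR introduces) could be plumbed into the Ginkgo suite configuration; the exact wiring in the PR may differ, and the same effect is available on the CLI via `ginkgo --label-filter`.

```go
package e2e

import (
	"os"
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestE2E(t *testing.T) {
	RegisterFailHandler(Fail)

	suiteCfg, reporterCfg := GinkgoConfiguration()
	if filter := os.Getenv("GINKGO_LABEL_FILTER"); filter != "" {
		// Only specs whose labels match the filter expression will run,
		// e.g. "controller", "provider:cloud", or "provider:onprem".
		suiteCfg.LabelFilter = filter
	}

	RunSpecs(t, "e2e suite", suiteCfg, reporterCfg)
}
```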

The provider tests are still attached to the test e2e label, but I think we can break these out further with provider-specific GitHub labels at some point.

provider:onprem tests run on a self-hosted GitHub runner that has access to Mirantis' network. This supports providers, like vSphere, whose infrastructure is not necessarily reachable from a GitHub-hosted runner.

This PR also adds support for creating ClusterIdentity, Credential and their associated Secret resources with a new clusteridentity package.
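For illustration, here is a rough sketch of the kind of helpers such a clusteridentity package could provide, built on controller-runtime. The function names and signatures are assumptions for this example, not the PR's actual API.

```go
package clusteridentity

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// CreateCredentialSecret creates the Secret that a provider ClusterIdentity
// references; AlreadyExists errors are ignored so reruns stay idempotent.
func CreateCredentialSecret(ctx context.Context, c client.Client, namespace, name string, data map[string][]byte) error {
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Data:       data,
	}
	if err := c.Create(ctx, secret); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}

// CreateClusterIdentity creates a provider-specific ClusterIdentity as an
// unstructured object so the helper needs no provider Go types; the caller
// passes the GVK (e.g. an AWS or vSphere identity kind) and a spec map.
func CreateClusterIdentity(ctx context.Context, c client.Client, gvk schema.GroupVersionKind, name string, spec map[string]interface{}) error {
	obj := &unstructured.Unstructured{}
	obj.SetGroupVersionKind(gvk)
	obj.SetName(name)
	if err := unstructured.SetNestedMap(obj.Object, spec, "spec"); err != nil {
		return err
	}
	if err := c.Create(ctx, obj); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}
```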

And of course, with testing comes bug fixing, so this PR fixes several different issues discovered along the way.

Closes: #210, Closes: #323


Note: I've opened two other issues to further simplify these tests; they would make these a lot easier to write in the future.

@squizzi squizzi marked this pull request as draft on September 20, 2024 17:30
@squizzi squizzi added the "test e2e" label (Runs the entire provider E2E test suite; controller E2E tests always run) on Sep 20, 2024
@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch 2 times, most recently from dc5f8d2 to 605513d on September 20, 2024 21:03
@squizzi squizzi added the "github actions" label (Pull requests that update GitHub Actions code) on Sep 20, 2024

@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch 5 times, most recently from 92cc4f5 to d4f0344 on September 23, 2024 21:10
@squizzi squizzi changed the title from "Try running VSphere tests in self-hosted runner" to "Run VSphere tests in self-hosted, refactor test suite into spec per provider, support label filtering" on Sep 24, 2024
@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch 4 times, most recently from f76cfe0 to e19f75f on September 26, 2024 23:23
@squizzi squizzi changed the title from "Run VSphere tests in self-hosted, refactor test suite into spec per provider, support label filtering" to "Run VSphere tests in self-hosted, refactor test suite into spec per provider, support label filtering, get all tests green" on Sep 26, 2024
@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch 8 times, most recently from 7560f52 to 00088a6 on October 2, 2024 19:17
@squizzi (Contributor, Author) commented on Oct 2, 2024

  • All standalone-cp tests and the AWS hosted test are green 🎉
  • Azure hosted is red; working on debugging this one.

@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch 2 times, most recently from dc6d505 to 6386214 on October 2, 2024 20:45
@squizzi squizzi marked this pull request as ready for review on October 2, 2024 20:46
@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch 3 times, most recently from 6c6fc9f to 25aef1d on October 2, 2024 23:11
@squizzi (Contributor, Author) commented on Oct 3, 2024

Note: This run shows vSphere green: https://github.com/Mirantis/hmc/actions/runs/11153080327/job/31036965900?pr=360

We might see some failures in upcoming runs as Mirantis IT is doing some work on our self-hosted runner.

@kylewuolle (Contributor) left a comment

Looks good, just wondering about leaving tmate in there.

Review comment on .github/workflows/build_test.yml (outdated, resolved)
@kylewuolle kylewuolle self-requested a review on October 3, 2024 19:26
kylewuolle previously approved these changes on Oct 3, 2024
@squizzi (Contributor, Author) commented on Oct 4, 2024

csi-drivers are going to be the death of me 😂

@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch 2 times, most recently from f6cb0da to a65544d on October 9, 2024 17:46
@squizzi (Contributor, Author) commented on Oct 9, 2024

The azure-hosted-cp ManagedCluster appears to have hit some sort of bug in the last run: azuremachines reports a 100% ready status while machines does not, instead reporting NodeHealthy: false. The cluster-api-operator reports that the infra is in a status.ready=true state:

I1009 18:24:48.584810       1 machine_controller_phases.go:294] "Infrastructure provider has completed machine infrastructure provisioning and reports status.ready" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="hmc-system/ci-96157-azure-hosted-md-7fvv6-j8hd5" namespace="hmc-system" name="ci-96157-azure-hosted-md-7fvv6-j8hd5" reconcileID="ddd599f6-0410-4f20-91ae-adf60a9e115f" MachineSet="hmc-system/ci-96157-azure-hosted-md-7fvv6" MachineDeployment="hmc-system/ci-96157-azure-hosted-md" Cluster="hmc-system/ci-96157-azure-hosted" AzureMachine="hmc-system/ci-96157-azure-hosted-md-7fvv6-j8hd5"
I1009 18:24:48.723231       1 machine_controller_phases.go:294] "Infrastructure provider has completed machine infrastructure provisioning and reports status.ready" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="hmc-system/ci-96157-azure-hosted-md-7fvv6-j8hd5" namespace="hmc-system" name="ci-96157-azure-hosted-md-7fvv6-j8hd5" reconcileID="c7a6b3e7-aef9-4f2a-9d8f-070ad50aaa57" MachineSet="hmc-system/ci-96157-azure-hosted-md-7fvv6" MachineDeployment="hmc-system/ci-96157-azure-hosted-md" Cluster="hmc-system/ci-96157-azure-hosted" AzureMachine="hmc-system/ci-96157-azure-hosted-md-7fvv6-j8hd5"

Deleting the machine triggers the machinedeployment to reconcile a new instance but that instance re-enters this same state:

NAME                                   READY   SEVERITY   REASON   STATE       AGE
ci-96157-azure-hosted-md-7fvv6-p8hkh   True                        Succeeded   4m43s
status:
  addresses:
  - address: ci-96157-azure-hosted-md-7fvv6-p8hkh
    type: InternalDNS
  - address: 10.0.0.5
    type: InternalIP
  bootstrapReady: true
  conditions:
  - lastTransitionTime: "2024-10-09T19:36:45Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-10-09T19:35:25Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2024-10-09T19:36:45Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2024-10-09T19:36:45Z"
    reason: NodeProvisioning
    severity: Warning
    status: "False" <---
    type: NodeHealthy

https://github.com/kubernetes-sigs/cluster-api/blob/7b5489ce78ba178c5d95512fe413a893e3302282/internal/controllers/machine/machine_controller_noderef.go#L71-L88

https://github.com/kubernetes-sigs/cluster-api/blob/7b5489ce78ba178c5d95512fe413a893e3302282/internal/controllers/machine/machine_controller_noderef.go#L250

From the CAPI code it appears we're here ^^, but there are no log statements or events about being unable to find the ProviderID.

Checking the code, it iterates the NodeList and tries to ensure that spec.providerID != nil. However, spec.providerID is non-nil for both nodes; the node with issues reports the following, and these providerIDs match those of the machines:

    providerID: azure:///subscriptions/***/resourceGroups/ci-96157-azure/providers/Microsoft.Compute/virtualMachines/ci-96157-azure-hosted-md-7fvv6-vfss4

    providerID: azure:///subscriptions/***/resourceGroups/ci-96157-azure/providers/Microsoft.Compute/virtualMachines/ci-96157-azure-hosted-md-7fvv6-p8hkh
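For reference, a small client-go sketch that reproduces this cross-check by listing the workload cluster's nodes and printing spec.providerID, to compare by eye against the Machines. This is purely a debugging aid; the kubeconfig file name is the one used in the kubectl output below.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the downloaded workload-cluster kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", "ci-96157-azure-kubeconfig")
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Print each node's name and spec.providerID.
	nodes, err := cs.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		fmt.Printf("%s\t%s\n", n.Name, n.Spec.ProviderID)
	}
}
```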

kube-system shows no crashing pods or anything like that:

KUBECONFIG=ci-96157-azure-kubeconfig kubectl get pod -n kube-system
NAME                                        READY   STATUS    RESTARTS       AGE
calico-kube-controllers-695f6448bd-fhsr4    1/1     Running   0              139m
calico-node-8n2fr                           1/1     Running   0              135m
calico-node-btm9j                           1/1     Running   0              139m
cloud-controller-manager-6d6545b996-qhcpg   1/1     Running   0              139m
cloud-node-manager-b66nh                    1/1     Running   0              135m
cloud-node-manager-cm6g5                    1/1     Running   0              139m
coredns-6997b8f8bd-nbzhw                    1/1     Running   0              135m
coredns-6997b8f8bd-xcxbf                    1/1     Running   0              135m
csi-azuredisk-controller-5d67674cb8-d6jld   6/6     Running   2 (133m ago)   137m
csi-azuredisk-controller-5d67674cb8-wmg25   6/6     Running   1 (136m ago)   137m
csi-azuredisk-node-jc46l                    3/3     Running   0              137m
csi-azuredisk-node-l5gql                    3/3     Running   1 (134m ago)   135m
kube-proxy-9qfl9                            1/1     Running   0              135m
kube-proxy-nb6zw                            1/1     Running   0              139m
metrics-server-7cc78958fc-hgzwc             1/1     Running   0              139m

@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch from a65544d to 86dc992 on October 9, 2024 20:50
* Break provider tests into separate files with labels
  representing either onprem or cloud providers.
* Add new jobs to CI workflow which dictate where tests
  will run. Onprem provider tests like vSphere will run
  on self-hosted runners since they use internal
  resources to test. Cloud provider tests will use the
  existing workflow since they can reach their providers
  without internal network access and can take advantage
  of the much larger GitHub hosted pool. Hosted and
  self-hosted tests can run concurrently.
* Make Cleanup depend on the cloud-e2etest only.
* Use new GINKGO_LABEL_FILTER to manage what tests run
  where.
* Move controller validation into BeforeSuite since the
  controller needs to be up and ready for each provider test,
  this will also enable us to add controller specific test
  cases later and make those run without the "test e2e" flag.
* Separate self-hosted and hosted test concurrency groups
* Update docs with test filtering instructions
* Ensure a Release exists for the custom build.version we deploy
* Move all e2e related helpers into e2e dir
  * Add new clusteridentity package for creating ClusterIdentity kinds
  and associated Secrets.
* Merge PR workflows together
* Make sure VERSION gets passed across jobs
* Ensure uniqueness among deployed ManagedClusters, simplify
  MANAGED_CLUSTER_NAME in CI to prevent Azure availabilitySetName
  validation error.
* Default Azure test templates to uswest2 to prevent issues with
  AvailabilityZone.
* Use the same concurrency-group across all jobs, except Cleanup
  which intentionally does not belong to a concurrency-group.
* Use Setup Go across jobs for caching.
* Support patching other hosted clusters to status.ready with a
  common patching helper.
* Move VSphere delete into AfterEach to serve as cleanup.
* Add support for cleaning Azure resources.
* Prevent ginkgo from timing out tests.
* Use azure-disk CSI driver we deploy via templates.

Signed-off-by: Kyle Squizzato <ksquizzato@mirantis.com>
@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch from 86dc992 to 3a4c09c on October 9, 2024 20:56
Labels: github actions (Pull requests that update GitHub Actions code), test e2e (Runs the entire provider E2E test suite; controller E2E tests always run)
Successfully merging this pull request may close these issues: "E2E Test workflow should reuse the result of Build and Test", "E2e verification of Azure templates"
3 participants