Run VSphere tests in self-hosted, refactor test suite into spec per provider, support label filtering, get all tests green #360

Open: wants to merge 1 commit into main from vsphere-tests-on-self-hosted

Conversation

@squizzi (Contributor) commented on Sep 20, 2024

This PR refactors our existing e2e test spec into a more conventional e2e test suite. We now have a BeforeSuite that installs and validates the controller. Previously all of our e2e tests lived under a single Controller spec; this PR breaks them out into the following (sketched in the code example after the list):

  • Controller (Label(controller))
  • Providers (Label(provider))
    • AWS (Label(provider:aws, provider:cloud)) - test/e2e/provider_aws_test.go
    • Azure (Label(provider:azure, provider:cloud)) - test/e2e/provider_azure_test.go
    • Vsphere (Label(provider:vsphere, provider:onprem)) - test/e2e/provider_vsphere_test.go
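A minimal sketch of what this layout looks like with Ginkgo labels. The spec and test names below are illustrative, not the PR's exact code; only the file names and labels come from the list above.

```go
package e2e

import (
	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

// The BeforeSuite installs the controller and validates it is ready before
// any provider spec runs (the real validation helpers live in the suite setup).
var _ = BeforeSuite(func() {
	By("installing and validating the hmc controller")
	Expect(true).To(BeTrue()) // placeholder for the actual validation
})

// Umbrella "controller" spec; today only the suite setup/teardown runs here.
var _ = Describe("Controller", Label("controller"), func() {})

// test/e2e/provider_aws_test.go
var _ = Describe("AWS Templates", Label("provider:aws", "provider:cloud"), func() {
	It("provisions standalone and hosted ManagedClusters", func() {
		// provider-specific test steps go here
	})
})

// test/e2e/provider_vsphere_test.go
var _ = Describe("vSphere Templates", Label("provider:vsphere", "provider:onprem"), func() {
	AfterEach(func() {
		// per this PR, vSphere cleanup happens in AfterEach
	})
	It("provisions a standalone ManagedCluster", func() {})
})
```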

The labels for these tests have been documented here: https://github.com/Mirantis/hmc/pull/360/files#diff-3a6cab76ecce19612a704fe0ef89dc83f1ed2df44f7bfa28d47d2529a7949fafR139

Each of the umbrella labels corresponds to a GitHub job that depends on the Build and Unit Test job, which provides setup. The Controller tests always run, regardless of label. Right now the Controller test is disabled because there aren't any tests other than the Before/AfterSuite, but we could enable it, since the suite setup is itself a test of the controller.
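Label filtering is what lets each CI job select its slice of the suite. Below is a hedged sketch of how a GINKGO_LABEL_FILTER environment variable (the variable this PR introduces) could be plumbed into the Ginkgo suite configuration; the exact wiring in the PR may differ, and the same effect is available on the CLI via `ginkgo --label-filter`.

```go
package e2e

import (
	"os"
	"testing"

	. "github.com/onsi/ginkgo/v2"
	. "github.com/onsi/gomega"
)

func TestE2E(t *testing.T) {
	RegisterFailHandler(Fail)

	suiteCfg, reporterCfg := GinkgoConfiguration()
	if filter := os.Getenv("GINKGO_LABEL_FILTER"); filter != "" {
		// Only specs whose labels match the filter expression will run,
		// e.g. "controller", "provider:cloud", or "provider:onprem".
		suiteCfg.LabelFilter = filter
	}

	RunSpecs(t, "e2e suite", suiteCfg, reporterCfg)
}
```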

The provider tests are still attached to the test e2e label, but I think we can break these out further with provider-specific GitHub labels at some point.

provider:onprem tests run on a self-hosted GitHub runner that has access to Mirantis' network. This supports providers, like vSphere, whose infrastructure is not necessarily reachable from a GitHub-hosted runner.

This PR also adds support for creating ClusterIdentity, Credential and their associated Secret resources with a new clusteridentity package.
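For illustration, here is a rough sketch of the kind of helpers such a clusteridentity package could provide, built on controller-runtime. The function names and signatures are assumptions for this example, not the PR's actual API.

```go
package clusteridentity

import (
	"context"

	corev1 "k8s.io/api/core/v1"
	apierrors "k8s.io/apimachinery/pkg/api/errors"
	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/apimachinery/pkg/apis/meta/v1/unstructured"
	"k8s.io/apimachinery/pkg/runtime/schema"
	"sigs.k8s.io/controller-runtime/pkg/client"
)

// CreateCredentialSecret creates the Secret that a provider ClusterIdentity
// references; AlreadyExists errors are ignored so reruns stay idempotent.
func CreateCredentialSecret(ctx context.Context, c client.Client, namespace, name string, data map[string][]byte) error {
	secret := &corev1.Secret{
		ObjectMeta: metav1.ObjectMeta{Name: name, Namespace: namespace},
		Data:       data,
	}
	if err := c.Create(ctx, secret); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}

// CreateClusterIdentity creates a provider-specific ClusterIdentity as an
// unstructured object so the helper needs no provider Go types; the caller
// passes the GVK (e.g. an AWS or vSphere identity kind) and a spec map.
func CreateClusterIdentity(ctx context.Context, c client.Client, gvk schema.GroupVersionKind, name string, spec map[string]interface{}) error {
	obj := &unstructured.Unstructured{}
	obj.SetGroupVersionKind(gvk)
	obj.SetName(name)
	if err := unstructured.SetNestedMap(obj.Object, spec, "spec"); err != nil {
		return err
	}
	if err := c.Create(ctx, obj); err != nil && !apierrors.IsAlreadyExists(err) {
		return err
	}
	return nil
}
```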

And of course, with testing comes bug fixing, so this PR fixes several different issues discovered along the way.

Closes: #210, Closes: #323


Note: I've opened two other issues to further simplify these tests; they would make these a lot easier to write in the future.

@squizzi squizzi marked this pull request as draft on September 20, 2024 17:30
@squizzi squizzi added the "test e2e" label (Runs the entire provider E2E test suite; controller E2E tests always run) on Sep 20, 2024
@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch 2 times, most recently from dc5f8d2 to 605513d on September 20, 2024 21:03
@squizzi squizzi added the "github actions" label (Pull requests that update GitHub Actions code) on Sep 20, 2024

@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch 5 times, most recently from 92cc4f5 to d4f0344 on September 23, 2024 21:10
@squizzi squizzi changed the title from "Try running VSphere tests in self-hosted runner" to "Run VSphere tests in self-hosted, refactor test suite into spec per provider, support label filtering" on Sep 24, 2024
@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch 4 times, most recently from f76cfe0 to e19f75f on September 26, 2024 23:23
@squizzi squizzi changed the title from "Run VSphere tests in self-hosted, refactor test suite into spec per provider, support label filtering" to "Run VSphere tests in self-hosted, refactor test suite into spec per provider, support label filtering, get all tests green" on Sep 26, 2024
@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch 8 times, most recently from 7560f52 to 00088a6 on October 2, 2024 19:17
@squizzi (Contributor, Author) commented on Oct 2, 2024

  • All standalone-cp tests and the AWS hosted test are green 🎉
  • Azure hosted is red; working on debugging this one.

@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch 2 times, most recently from dc6d505 to 6386214 on October 2, 2024 20:45
@squizzi squizzi marked this pull request as ready for review on October 2, 2024 20:46
@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch 3 times, most recently from 6c6fc9f to 25aef1d on October 2, 2024 23:11
@squizzi (Contributor, Author) commented on Oct 3, 2024

Note: This run shows vSphere green: https://github.com/Mirantis/hmc/actions/runs/11153080327/job/31036965900?pr=360

We might see some failures in upcoming runs as Mirantis IT is doing some work on our self-hosted runner.

@kylewuolle (Contributor) left a comment

Looks good, just wondering about leaving tmate in there.

Review comment on .github/workflows/build_test.yml (outdated, resolved)
@kylewuolle kylewuolle self-requested a review on October 3, 2024 19:26
kylewuolle previously approved these changes on Oct 3, 2024
@squizzi (Contributor, Author) commented on Oct 4, 2024

csi-drivers are going to be the death of me 😂

@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch 2 times, most recently from f6cb0da to a65544d on October 9, 2024 17:46
@squizzi (Contributor, Author) commented on Oct 9, 2024

The azure-hosted-cp ManagedCluster appears to have hit some sort of bug in the last run: azuremachines reports a 100% ready status while machines does not, instead reporting NodeHealthy: false. The cluster-api-operator reports that the infra is in a status.ready=true state:

I1009 18:24:48.584810       1 machine_controller_phases.go:294] "Infrastructure provider has completed machine infrastructure provisioning and reports status.ready" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="hmc-system/ci-96157-azure-hosted-md-7fvv6-j8hd5" namespace="hmc-system" name="ci-96157-azure-hosted-md-7fvv6-j8hd5" reconcileID="ddd599f6-0410-4f20-91ae-adf60a9e115f" MachineSet="hmc-system/ci-96157-azure-hosted-md-7fvv6" MachineDeployment="hmc-system/ci-96157-azure-hosted-md" Cluster="hmc-system/ci-96157-azure-hosted" AzureMachine="hmc-system/ci-96157-azure-hosted-md-7fvv6-j8hd5"
I1009 18:24:48.723231       1 machine_controller_phases.go:294] "Infrastructure provider has completed machine infrastructure provisioning and reports status.ready" controller="machine" controllerGroup="cluster.x-k8s.io" controllerKind="Machine" Machine="hmc-system/ci-96157-azure-hosted-md-7fvv6-j8hd5" namespace="hmc-system" name="ci-96157-azure-hosted-md-7fvv6-j8hd5" reconcileID="c7a6b3e7-aef9-4f2a-9d8f-070ad50aaa57" MachineSet="hmc-system/ci-96157-azure-hosted-md-7fvv6" MachineDeployment="hmc-system/ci-96157-azure-hosted-md" Cluster="hmc-system/ci-96157-azure-hosted" AzureMachine="hmc-system/ci-96157-azure-hosted-md-7fvv6-j8hd5"

Deleting the machine triggers the machinedeployment to reconcile a new instance but that instance re-enters this same state:

NAME                                   READY   SEVERITY   REASON   STATE       AGE
ci-96157-azure-hosted-md-7fvv6-p8hkh   True                        Succeeded   4m43s
status:
  addresses:
  - address: ci-96157-azure-hosted-md-7fvv6-p8hkh
    type: InternalDNS
  - address: 10.0.0.5
    type: InternalIP
  bootstrapReady: true
  conditions:
  - lastTransitionTime: "2024-10-09T19:36:45Z"
    status: "True"
    type: Ready
  - lastTransitionTime: "2024-10-09T19:35:25Z"
    status: "True"
    type: BootstrapReady
  - lastTransitionTime: "2024-10-09T19:36:45Z"
    status: "True"
    type: InfrastructureReady
  - lastTransitionTime: "2024-10-09T19:36:45Z"
    reason: NodeProvisioning
    severity: Warning
    status: "False" <---
    type: NodeHealthy

https://github.com/kubernetes-sigs/cluster-api/blob/7b5489ce78ba178c5d95512fe413a893e3302282/internal/controllers/machine/machine_controller_noderef.go#L71-L88

https://github.com/kubernetes-sigs/cluster-api/blob/7b5489ce78ba178c5d95512fe413a893e3302282/internal/controllers/machine/machine_controller_noderef.go#L250

From the CAPI code it appears we're here ^^, but there are no log statements or events about being unable to find the ProviderID.

Checking the code, it iterates the NodeList and tries to ensure that spec.providerID != nil. However, spec.providerID is non-nil for both nodes; the node with issues reports the following, and these providerIDs match those of the machines:

    providerID: azure:///subscriptions/***/resourceGroups/ci-96157-azure/providers/Microsoft.Compute/virtualMachines/ci-96157-azure-hosted-md-7fvv6-vfss4

    providerID: azure:///subscriptions/***/resourceGroups/ci-96157-azure/providers/Microsoft.Compute/virtualMachines/ci-96157-azure-hosted-md-7fvv6-p8hkh
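For reference, a small client-go sketch that reproduces this cross-check by listing the workload cluster's nodes and printing spec.providerID, to compare by eye against the Machines. This is purely a debugging aid; the kubeconfig file name is the one used in the kubectl output below.

```go
package main

import (
	"context"
	"fmt"

	metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
	"k8s.io/client-go/kubernetes"
	"k8s.io/client-go/tools/clientcmd"
)

func main() {
	// Build a client from the downloaded workload-cluster kubeconfig.
	cfg, err := clientcmd.BuildConfigFromFlags("", "ci-96157-azure-kubeconfig")
	if err != nil {
		panic(err)
	}
	cs := kubernetes.NewForConfigOrDie(cfg)

	// Print each node's name and spec.providerID.
	nodes, err := cs.CoreV1().Nodes().List(context.Background(), metav1.ListOptions{})
	if err != nil {
		panic(err)
	}
	for _, n := range nodes.Items {
		fmt.Printf("%s\t%s\n", n.Name, n.Spec.ProviderID)
	}
}
```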

kube-system shows no crashing pods or anything like that:

KUBECONFIG=ci-96157-azure-kubeconfig kubectl get pod -n kube-system
NAME                                        READY   STATUS    RESTARTS       AGE
calico-kube-controllers-695f6448bd-fhsr4    1/1     Running   0              139m
calico-node-8n2fr                           1/1     Running   0              135m
calico-node-btm9j                           1/1     Running   0              139m
cloud-controller-manager-6d6545b996-qhcpg   1/1     Running   0              139m
cloud-node-manager-b66nh                    1/1     Running   0              135m
cloud-node-manager-cm6g5                    1/1     Running   0              139m
coredns-6997b8f8bd-nbzhw                    1/1     Running   0              135m
coredns-6997b8f8bd-xcxbf                    1/1     Running   0              135m
csi-azuredisk-controller-5d67674cb8-d6jld   6/6     Running   2 (133m ago)   137m
csi-azuredisk-controller-5d67674cb8-wmg25   6/6     Running   1 (136m ago)   137m
csi-azuredisk-node-jc46l                    3/3     Running   0              137m
csi-azuredisk-node-l5gql                    3/3     Running   1 (134m ago)   135m
kube-proxy-9qfl9                            1/1     Running   0              135m
kube-proxy-nb6zw                            1/1     Running   0              139m
metrics-server-7cc78958fc-hgzwc             1/1     Running   0              139m

@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch from a65544d to 86dc992 on October 9, 2024 20:50
* Break provider tests into separate files with labels
  representing either onprem or cloud providers.
* Add new jobs to CI workflow which dictate where tests
  will run. Onprem provider tests like vSphere will run
  on self-hosted runners since they use internal
  resources to test. Cloud provider tests will use the
  existing workflow since they can reach their providers
  without internal network access and can take advantage
  of the much larger GitHub hosted pool. Hosted and
  self-hosted tests can run concurrently.
* Make Cleanup depend on the cloud-e2etest only.
* Use new GINKGO_LABEL_FILTER to manage what tests run
  where.
* Move controller validation into BeforeSuite since the
  controller needs to be up and ready for each provider test,
  this will also enable us to add controller specific test
  cases later and make those run without the "test e2e" flag.
* Separate self-hosted and hosted test concurrency groups
* Update docs with test filtering instructions
* Ensure a Release exists for the custom build.version we deploy
* Move all e2e related helpers into e2e dir
  * Add new clusteridentity package for creating ClusterIdentity kinds
  and associated Secrets.
* Merge PR workflows together
* Make sure VERSION gets passed across jobs
* Ensure uniqueness among deployed ManagedClusters, simplify
  MANAGED_CLUSTER_NAME in CI to prevent Azure availabilitySetName
  validation error.
* Default Azure test templates to uswest2 to prevent issues with
  AvailabilityZone.
* Use the same concurrency-group across all jobs, except Cleanup
  which intentionally does not belong to a concurrency-group.
* Use Setup Go across jobs for caching.
* Support patching other hosted clusters to status.ready with a
  common patching helper.
* Move VSphere delete into AfterEach to serve as cleanup.
* Add support for cleaning Azure resources.
* Prevent ginkgo from timing out tests.
* Use azure-disk CSI driver we deploy via templates.

Signed-off-by: Kyle Squizzato <ksquizzato@mirantis.com>
@squizzi squizzi force-pushed the vsphere-tests-on-self-hosted branch from 86dc992 to 3a4c09c on October 9, 2024 20:56
Labels: github actions (Pull requests that update GitHub Actions code), test e2e (Runs the entire provider E2E test suite; controller E2E tests always run)
Successfully merging this pull request may close these issues: "E2E Test workflow should reuse the result of Build and Test", "E2e verification of Azure templates"
3 participants