Test deployments #753

Open · 3 tasks
TomAugspurger opened this issue Sep 23, 2020 · 5 comments

@TomAugspurger
Member

Recent commits (67b9df5...b12088d) added basic integration tests to the GCP staging deployment.

The basic idea is to start up a regular user pod (using the JupyterHub REST API) and kubectl exec a pytest session in that pod.
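
As a rough illustration of that flow (not the actual run_tests.sh): start the test user's server via the JupyterHub REST API, then exec pytest in the resulting pod. The hub URL, user name, namespace, and test path below are placeholders.

    # Hypothetical sketch; hub URL, user, namespace, and test path are placeholders.
    HUB_URL="https://staging.example.org"
    USERNAME="deployment-test"

    # Ask JupyterHub to start a server for the test user via its REST API.
    curl -X POST \
      -H "Authorization: token ${JUPYTERHUB_API_TOKEN}" \
      "${HUB_URL}/hub/api/users/${USERNAME}/server"

    # Once the pod is running, exec a pytest session in it. On a z2jh-style
    # deployment the user pod is typically named jupyter-<username>.
    kubectl exec -n staging "jupyter-${USERNAME}" -- pytest -v /srv/tests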

Possible next steps

  • Expand to other hubs (cc @salvis2 & @scottyhq, does this interest you?). This will require marking the classes (TestCommon, TestGCP, TestAWS, ...) and including a pytest -m <marks> invocation in run_tests.sh (see the sketch below).
  • Expand tests. Right now we just do the very basics of starting a dask cluster and connecting to it. What else should we test? Various imports (this would have caught the xgcm docrep issue)? Various computations?
  • Revert deployments if the tests fail. Currently we'll just fail CI if the tests fail. We could consider using helm to roll back the deployment.
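
A rough sketch of what the mark selection in run_tests.sh could look like; the mark names and the HUB_MARK argument are assumptions, not something that exists today.

    #!/usr/bin/env bash
    # Hypothetical run_tests.sh: run the common tests plus the tests marked
    # for the hub being deployed (e.g. gcp, aws).
    set -euo pipefail

    HUB_MARK="${1:-gcp}"
    pytest -v -m "common or ${HUB_MARK}" tests/
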
@salvis2
Member

salvis2 commented Sep 24, 2020

I think more testing is good in the long run. I haven't written any pytest tests before, but I'm happy to learn (it will also help me write tests for hubploy).

I'm thinking about the change to GitHub Actions: is the interaction between CI and these tests going to change? Will GitHub Actions be easier to work with? #699

It does seem like we should helm rollback if the deployment fails so we don't leave a hub in a broken state when trying to upgrade. I think the --cleanup-on-fail flag only removes new resources, so updated, previously-existing resources would still be in a bad state. helm rollback docs for reference: https://helm.sh/docs/helm/helm_rollback/

      --cleanup-on-fail    allow deletion of new resources created in this rollback when rollback fails
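
For reference, rolling a release back to its previous revision would look roughly like this; the release name and namespace are placeholders for whatever the actual deployment uses.

    # Omitting the revision number rolls back to the previous revision.
    helm rollback staging --namespace staging --cleanup-on-fail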

@salvis2
Member

salvis2 commented Sep 24, 2020

Related to hubploy testing: berkeley-dsep-infra/hubploy#95

@consideRatio had suggested using k3s to test deploying the Helm chart. I'm wondering if this is something we could do to validate the Helm chart values during PRs to catch errors earlier. I'm also wondering if our deployments will be too big for the CI machine / if there's enough value in doing this, since we are going to just test / rollback on pushes.

@TomAugspurger
Member Author

I'm thinking about the change to GitHub Actions: is the interaction between CI and these tests going to change? Will GitHub Actions be easier to work with? #699

Actions might make this a bit easier / nicer to look at. I couldn't figure out how to get a CircleCI "job" to depend on another job. Ideally "test" would depend on "deploy".

Does seem like we should helm rollback if the deployment fails so we don't leave a hub in a broken state when trying to upgrade.

I think we're saying the same thing, but just to clarify: I'm thinking about rolling back in the case where the helm upgrade ... succeeds (so the deployment succeeds) but then our testing fails.

to test deploying the Helm chart

There's definitely value in this. I think even a helm upgrade ... --dry-run on PRs would be valuable.
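
For illustration, a PR check along those lines might look like the following; the release name, chart path, and values file are placeholders.

    # Render and validate the upgrade without applying anything to the cluster.
    helm upgrade staging ./pangeo-deploy \
      --namespace staging \
      --values config/staging.yaml \
      --dry-run --debug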

@salvis2
Member

salvis2 commented Sep 24, 2020

I think we're saying the same thing, but just to clarify: I'm thinking about rolling back in the case where the helm upgrade ... succeeds (so the deployment succeeds) but then our testing fails.

I didn't realize that's what you meant, so thank you for clarifying. I was confused about whether --cleanup-on-fail would revert resources that were created in previous deployments. I agree that we should roll back if the deployment succeeds but the tests fail.

@consideRatio
Member

Tests in an ephemeral VM and an ephemeral k3s-based cluster

@consideRatio had suggested using k3s to test deploying the Helm chart. I'm wondering if this is something we could do to validate the Helm chart values during PRs to catch errors earlier.

  • I think the key value of a dedicated, non-ephemeral staging environment is:
    • ... to try the upgrade interaction from one version to the next.
    • ... to be able to interact with the deployment pre-production - in other words, for manual tests.
  • I think the key value of an ephemeral VM for PR feedback is:
    • ... quicker feedback on all the less complicated interactions unrelated to the upgrade process.

I think you can get the test infrastructure to fit in the ephemeral VMs of various CI providers without trouble. It would be good to avoid having different test configuration for the ephemeral CI k8s cluster and the persistent staging k8s cluster.

What to test in ephemeral VMs?

Some tests make sense even without a k8s cluster in the ephemeral VM, such as verifying that helm template runs without errors, or that the generated k8s resources are valid, using some special tooling.

I think perhaps doing something like this would make sense (a rough sketch follows below):

  • helm lint (check your templates)
  • helm template (check output of templates)
  • k3s create (create k8s cluster)
  • helm template --validate (validate rendered templates against the k8s api-server)
  • helm install (try installing the chart from scratch)
  • pytest ... (try things against the installed chart)

In z2jh, we also try to upgrade from the latest version by doing helm install followed by helm upgrade, within a k3s cluster.
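
A rough sketch of that sequence as a CI script; the chart path, release name, and namespace are placeholders, and the k3s line uses the upstream convenience script rather than whatever we would actually standardize on.

    set -euo pipefail

    helm lint ./pangeo-deploy                              # check the templates
    helm template ./pangeo-deploy > /dev/null              # check output of templates

    curl -sfL https://get.k3s.io | sh -                    # create a local k8s cluster
    export KUBECONFIG=/etc/rancher/k3s/k3s.yaml

    helm template ./pangeo-deploy --validate > /dev/null   # validate against the k8s api-server
    helm install staging ./pangeo-deploy \
      --namespace staging --create-namespace --wait        # try installing from scratch

    pytest tests/                                          # try things against the installed chart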

GitHub Actions vs CircleCI

I'm thinking about the change to GitHub Actions: is the interaction between CI and these tests going to change? Will GitHub Actions be easier to work with? #699

I'm neutral as a whole. I'm biased towards what's already in place, and biased towards GitHub Actions over CircleCI in general.

Helm flags

There is a lot to say about --cleanup-on-fail, --wait, --atomic, etc.

  • --cleanup-on-fail will only delete added resources, not updated ones.
  • --cleanup-on-fail will only happen if the Helm command gets time to do its job after it realizes it has failed.
  • --wait will influence the failure condition, and thereby on what grounds --cleanup-on-fail triggers.
  • --atomic implies --wait and will do a rollback on failure.
  • --atomic with --cleanup-on-fail won't use --cleanup-on-fail during an eventual rollback operation that fails.

I've discussed this in helm/helm#7811 and helm/helm#7876.
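
For illustration, a deploy command combining these flags might look like the following; the release name, chart path, and timeout are placeholders, and which combination we actually want is part of the discussion above.

    # --atomic implies --wait and rolls back if the upgrade fails;
    # --cleanup-on-fail deletes resources newly created by a failed upgrade.
    helm upgrade staging ./pangeo-deploy \
      --namespace staging \
      --install \
      --atomic \
      --cleanup-on-fail \
      --timeout 10m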
