Initial commit for auth and stages workflow #1003

Merged · 64 commits · Feb 3, 2022

Conversation

@costrouc (Member) commented Jan 20, 2022

Fixes | Closes | Resolves #

Please remove anything marked as optional that you don't need to fill in. Choose one of the preceding keywords to refer to the issue this PR solves, followed by the issue number (e.g., Fixes #666). If there is no issue, remove the line. Remove this note after reading.

Changes:

Ranked from largest to smallest:

  • removal of cookiecutter from qhub
  • split of the monolithic terraform deployment into multiple stages (a rough sketch of the staged flow follows this list)
  • all cloud providers produce a kubeconfig at $TMPDIR/QHUB_KUBECONFIG during deployment that can be reused at any time
  • checks between each stage of deployment, e.g. DNS resolving, webserver up, load balancer up, kubernetes cluster up
  • all qhub resources are targeted with nodeSelectors
  • SSO with Keycloak for all services: conda-store, jupyterhub, dask-gateway, and grafana
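
For context on the staged flow, here is a minimal sketch of a per-stage deploy-then-check loop; the stage directory names follow this PR's layout, but the helper functions (run_terraform_stage, wait_for_*) are illustrative and not qhub's actual API.

# Minimal sketch of a staged deployment: run each terraform stage, then verify
# it before moving on. Helper names are illustrative, not qhub's actual API.
import subprocess

def run_terraform_stage(directory):
    subprocess.run(["terraform", "init"], cwd=directory, check=True)
    subprocess.run(["terraform", "apply", "-auto-approve"], cwd=directory, check=True)

def wait_for_kubernetes_cluster():
    ...  # e.g. poll the API server using the kubeconfig at $TMPDIR/QHUB_KUBECONFIG

def wait_for_load_balancer():
    ...  # e.g. poll TCP ports 80/443 on the ingress address

stages = [
    ("stages/02-infrastructure", wait_for_kubernetes_cluster),
    ("stages/04-kubernetes-ingress", wait_for_load_balancer),
]

for directory, check in stages:
    run_terraform_stage(directory)
    check()  # fail fast if this stage did not converge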

Types of changes

What types of changes does your code introduce?

Put an x in the boxes that apply

  • Bug fix (non-breaking change which fixes an issue)
  • New feature (non-breaking change which adds a feature)
  • Breaking change (fix or feature that would cause existing features to not work as expected)
  • Documentation Update
  • Code style update (formatting, renaming)
  • Refactoring (no functional changes, no API changes)
  • Build related changes
  • Other (please describe):

Testing

Requires testing

  • Yes
  • No

In case you checked yes, did you write tests?

  • Yes
  • No

Further comments (optional)

If this is a relatively large or complex change, kick off the discussion by explaining why you chose the solution you did, what alternatives you considered, and so on.

@danlester (Contributor)

Fixes #992

@viniciusdc mentioned this pull request Jan 26, 2022
@danlester (Contributor) commented Jan 26, 2022

@costrouc , how is this intended to work with remote state, or is further work still needed?

We used to attempt to import terraform-state and create e.g. the S3 bucket and DynamoDB table if they didn't already exist. The infrastructure parts could then use that remote state because we knew it existed.

In this PR as it stands, a fresh install with e.g.

provider: do
terraform_state:
  type: remote

will just fall down when attempting to create the 01-terraform-state stage with remote backend set to itself!

[terraform]: Successfully configured the backend "s3"! Terraform will automatically
[terraform]: use this backend unless the backend configuration changes.
[terraform]: ╷
[terraform]: │ Error: Failed to get existing workspaces: S3 bucket does not exist.
...
[terraform]: │
[terraform]: │ Error: NoSuchBucket:
[terraform]: │ 	status code: 404, request id: tx00000000000001823392d-0061f12131-2741279b-nyc3c, host id:
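
One way around that chicken-and-egg problem, purely as a sketch (the render_stage helper below is hypothetical, not part of this PR), would be to apply the 01-terraform-state stage against a local backend first and only afterwards migrate its state into the bucket it just created:

# Hypothetical bootstrap of the remote-state stage:
# 1. render the stage without its remote backend block so terraform uses local state,
# 2. apply it to create the S3 bucket and DynamoDB lock table,
# 3. re-render with the backend block and migrate the local state into it.
import subprocess

def render_stage(directory, include_remote_backend):
    """Hypothetical helper that re-renders the stage templates with or
    without the remote backend block."""
    ...

state_stage = "stages/01-terraform-state"

render_stage(state_stage, include_remote_backend=False)
subprocess.run(["terraform", "init"], cwd=state_stage, check=True)
subprocess.run(["terraform", "apply", "-auto-approve"], cwd=state_stage, check=True)

render_stage(state_stage, include_remote_backend=True)
# -force-copy migrates the existing local state into the newly configured
# backend without an interactive prompt.
subprocess.run(["terraform", "init", "-force-copy"], cwd=state_stage, check=True)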

Also, qhub destroy will need to be updated to match, but this should now be much clearer since we should just have to reference each stage.

Let me know if I can help!

@viniciusdc (Contributor)

Hi @costrouc, I tested today on minikube and the deployment worked fine; I also tested lingering pods just to be sure the timeout would kick in. I noticed that running qhub render before qhub deploy breaks the deployment with a file error, and attempting to redeploy after a completed deployment triggers the same error.

@iameskild mentioned this pull request Jan 27, 2022
@iameskild (Member) commented Jan 27, 2022

@costrouc I also tried deploying on Minikube from this branch but it failed in stage 04, with the error message shown below. Many of the resources that were recently created seemed to be getting destroyed.

[terraform]: module.kubernetes-initialization.kubernetes_namespace.main: Destruction complete after 6s
[terraform]: ╷
[terraform]: │ Error: serviceaccounts "qhub-traefik-ingress" is forbidden: unable to create new content in namespace dev because it is being terminated
[terraform]: │ 
[terraform]: │   with module.kubernetes-ingress.kubernetes_service_account.main,
[terraform]: │   on modules/kubernetes/ingress/main.tf line 1, in resource "kubernetes_service_account" "main":
[terraform]: │    1: resource "kubernetes_service_account" "main" {
[terraform]: │ 
[terraform]: ╵
[terraform]: ╷
[terraform]: │ Error: services "qhub-traefik-ingress" is forbidden: unable to create new content in namespace dev because it is being terminated
[terraform]: │ 
[terraform]: │   with module.kubernetes-ingress.kubernetes_service.main,
[terraform]: │   on modules/kubernetes/ingress/main.tf line 58, in resource "kubernetes_service" "main":
[terraform]: │   58: resource "kubernetes_service" "main" {
[terraform]: │ 
[terraform]: ╵
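
For what it's worth, a pre-stage check could guard against this by waiting for the leftover namespace to finish terminating before re-applying; a rough sketch using the kubernetes Python client (the namespace name is just an example, and this is not part of this PR):

# Sketch: wait for a namespace left over from a previous run to finish
# terminating before re-applying the ingress stage.
import time
from kubernetes import client, config
from kubernetes.client.rest import ApiException

def wait_for_namespace_gone(name="dev", timeout=300):
    config.load_kube_config()  # e.g. the kubeconfig at $TMPDIR/QHUB_KUBECONFIG
    v1 = client.CoreV1Api()
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            v1.read_namespace(name=name)
        except ApiException as e:
            if e.status == 404:  # namespace is fully gone
                return
            raise
        time.sleep(5)
    raise TimeoutError(f"namespace {name} still terminating after {timeout}s")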

If I try to rerun the command, I run into another issue.

INFO:root:Modifying for development branch main
Traceback (most recent call last):
  File "/home/ubuntu/qhub/data-venv/bin/qhub", line 33, in <module>
    sys.exit(load_entry_point('qhub', 'console_scripts', 'qhub')())
  File "/home/ubuntu/qhub/qhub/__main__.py", line 7, in main
    cli(sys.argv[1:])
  File "/home/ubuntu/qhub/qhub/cli/__init__.py", line 52, in cli
    args.func(args)
  File "/home/ubuntu/qhub/qhub/cli/deploy.py", line 63, in handle_deploy
    render_template(args.output, args.config, force=True)
  File "/home/ubuntu/qhub/qhub/render/__init__.py", line 266, in render_template
    generate_files(
  File "/home/ubuntu/qhub/data-venv/lib/python3.8/site-packages/cookiecutter/generate.py", line 321, in generate_files
    shutil.copytree(indir, outdir)
  File "/usr/lib/python3.8/shutil.py", line 557, in copytree
    return _copytree(entries=entries, src=src, dst=dst, symlinks=symlinks,
  File "/usr/lib/python3.8/shutil.py", line 458, in _copytree
    os.makedirs(dst, exist_ok=dirs_exist_ok)
  File "/usr/lib/python3.8/os.py", line 223, in makedirs
    mkdir(name, mode)
FileExistsError: [Errno 17] File exists: '/home/ubuntu/qhub/minikube_deploy/stages/04-kubernetes-ingress/modules/kubernetes'
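
For reference, the copy in that traceback could be made idempotent on Python 3.8+, which would avoid the FileExistsError when stages/ already exists; a minimal sketch of that kind of change, not necessarily how it will actually be fixed:

# Sketch: overwrite an existing output directory instead of failing.
# dirs_exist_ok requires Python 3.8+.
import shutil

def copy_stage(indir, outdir):
    shutil.copytree(indir, outdir, dirs_exist_ok=True)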

@danlester (Contributor) commented Jan 31, 2022

Trying again on Digital Ocean today. You're probably aware @costrouc but:

  • The --skip-remote-state-provision flag is ignored, so the deploy always attempts to create the remote state and therefore necessarily fails on the second run (a sketch of how the flag could gate that stage follows this list). It would be useful to recreate the functionality of the terraform_state_sync function too.
  • I have to rm -rf state before all subsequent deploy runs too, because cookiecutter can't overwrite it when it's already there.
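
As a sketch of what honoring the flag might look like (deploy_stage and the attribute names here are hypothetical, not qhub's actual code):

# Hypothetical gating of the remote-state stage on --skip-remote-state-provision.
def deploy_stage(directory, config):
    """Hypothetical helper: terraform init/apply for a single stage."""
    ...

def handle_deploy(args, config):
    uses_remote_state = config.get("terraform_state", {}).get("type") == "remote"
    if uses_remote_state and not args.skip_remote_state_provision:
        deploy_stage("stages/01-terraform-state", config)
    deploy_stage("stages/02-infrastructure", config)
    # ... remaining stages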

Don't look into it in detail - I can try again when you say the time is right - but for what it's worth my deployment attempt failed here:

After stage directory=stages/04-kubernetes-ingress kubernetes ingress available on tcp ports={80, 443}
INFO:qhub.provider.dns.cloudflare:record name=qhubdotest type=A address=144.126.250.32 does not exists creating
INFO:qhub.deploy:Couldn't update the DNS record for cloud provider: do
Polling DNS domain=qhubdotest.mydomain.com does not exist
...
Polling DNS domain=qhubdotest.mydomain.com does not exist
ERROR: After stage directory=stages/04-kubernetes-ingress DNS domain=qhubdotest.mydomain.com does not point to ip=144.126.250.32

Looking on CloudFlare, the DNS did seem to be correct, although maybe it took too long to appear.
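
If it really was just propagation delay, the stage check could poll with a longer, configurable timeout; a rough standard-library sketch (the timeout values are arbitrary):

# Sketch: wait until the domain resolves to the expected ingress IP before
# declaring the DNS check failed. Timeout and interval values are examples.
import socket
import time

def wait_for_dns(domain, expected_ip, timeout=600, interval=15):
    deadline = time.time() + timeout
    while time.time() < deadline:
        try:
            addresses = {info[4][0] for info in socket.getaddrinfo(domain, None)}
            if expected_ip in addresses:
                return True
        except socket.gaierror:
            pass  # not resolvable yet
        time.sleep(interval)
    return False

# e.g. wait_for_dns("qhubdotest.mydomain.com", "144.126.250.32")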

@costrouc (Member, Author)

@danlester I will make sure this feature works. I agree it would be nice if you could completely remove the stages directory and still run a deployment.

I have to rm -rf state before all subsequent deploy runs too because cookie-cutter can't overwrite it when already there.

With the work to remove cookiecutter this will no longer be required. There are currently two files that I am not sure how we will handle without it: .github/workflows/* and .gitlabci.yml.

@costrouc (Member, Author)

@danlester state sync should now be implemented

@iameskild (Member)

This morning I was testing on AWS and I kept running into an error during the FULL stage.

python -m qhub init aws --project nebarieae --domain nebarieae.qhub.dev --auth-provider github --auth-auto-provision
python -m qhub deploy -c qhub-config.yaml --dns-provider cloudflare --dns-auto-provision

Issue

[terraform]: module.kubernetes-conda-store-server.kubernetes_deployment.main: Still creating... [9m50s elapsed]
[terraform]: module.qhub.module.kubernetes-dask-gateway.kubernetes_deployment.gateway: Still creating... [9m50s elapsed]
[terraform]: module.qhub.module.kubernetes-dask-gateway.kubernetes_deployment.controller: Still creating... [9m50s elapsed]
[terraform]: ╷
[terraform]: │ Error: Waiting for rollout to finish: 1 replicas wanted; 0 replicas Ready
[terraform]: │ 
[terraform]: │   with module.kubernetes-conda-store-server.kubernetes_deployment.main,
[terraform]: │   on modules/kubernetes/services/conda-store/main.tf line 46, in resource "kubernetes_deployment" "main":
[terraform]: │   46: resource "kubernetes_deployment" "main" {
[terraform]: │ 
[terraform]: ╵
[terraform]: ╷
[terraform]: │ Error: Waiting for rollout to finish: 1 replicas wanted; 0 replicas Ready
[terraform]: │ 
[terraform]: │   with module.qhub.module.kubernetes-dask-gateway.kubernetes_deployment.controller,
[terraform]: │   on modules/kubernetes/services/dask-gateway/controler.tf line 83, in resource "kubernetes_deployment" "controller":
[terraform]: │   83: resource "kubernetes_deployment" "controller" {
[terraform]: │ 
[terraform]: ╵
[terraform]: ╷
[terraform]: │ Error: Waiting for rollout to finish: 1 replicas wanted; 0 replicas Ready
[terraform]: │ 
[terraform]: │   with module.qhub.module.kubernetes-dask-gateway.kubernetes_deployment.gateway,
[terraform]: │   on modules/kubernetes/services/dask-gateway/gateway.tf line 102, in resource "kubernetes_deployment" "gateway":
[terraform]: │  102: resource "kubernetes_deployment" "gateway" {
[terraform]: │ 
[terraform]: ╵
[terraform]: Releasing state lock. This may take a few moments...
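
When a rollout hangs at 0 replicas Ready like this, the pod status usually says why (Pending on a nodeSelector, ImagePullBackOff, CrashLoopBackOff, and so on). A quick way to see that with the kubernetes Python client, as a sketch only (the namespace name is an example):

# Sketch: print each pod's phase and any "waiting" reason to see why the
# deployment never became Ready.
from kubernetes import client, config

config.load_kube_config()
v1 = client.CoreV1Api()
for pod in v1.list_namespaced_pod(namespace="dev").items:
    waiting_reasons = [
        cs.state.waiting.reason
        for cs in (pod.status.container_statuses or [])
        if cs.state and cs.state.waiting
    ]
    print(pod.metadata.name, pod.status.phase, waiting_reasons)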

@costrouc force-pushed the refactor-auth-stages branch from d080c97 to 61ce57e on February 1, 2022 at 21:35
@viniciusdc (Contributor)

Hi @costrouc, I just deployed qhub on minikube and all services seem to be working normally, but I do notice an issue with the conda-store/Keycloak OAuth (screenshot attached).

@costrouc (Member, Author) commented Feb 2, 2022

@viniciusdc change your default image for conda-store to conda_store: quansight/conda-store-server:v0.3.9. I forgot to change the default! It is now changed in qhub/initialize.py.

@costrouc (Member, Author) commented Feb 3, 2022

Known issues that we will need to open issues for:

  • forward-auth and dask-gateway: clusters are created and running, but the dashboard is not visible; likely a Traefik middleware issue
  • Traefik certificates from Let's Encrypt and the TLSStore CRD (I think I've figured this one out; it will be added after this PR). Solved, will open a PR after our meeting today
  • .gitlab and .github/workflows/ for CI without cookiecutter; we may still need a Jinja template for these (see the sketch after this list)
  • cdsdashboards and conda environments: this comes from the conda environments trait not being set and will need to follow the same pattern as in dask-gateway
  • getting the upgrade command for 0.4.0 working properly
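
For the CI files mentioned above, a plain Jinja render could replace cookiecutter; a sketch only, with hypothetical template names, context variables, and output paths:

# Sketch: render the GitHub Actions and GitLab CI files from Jinja templates
# instead of cookiecutter. Names and paths here are hypothetical.
from pathlib import Path
from jinja2 import Environment, FileSystemLoader

env = Environment(loader=FileSystemLoader("qhub/template"))
context = {"project_name": "qhub", "branch": "main"}

for template_name, output_path in [
    ("github-actions.yaml.j2", ".github/workflows/qhub-ops.yaml"),
    ("gitlab-ci.yaml.j2", ".gitlab-ci.yml"),
]:
    rendered = env.get_template(template_name).render(**context)
    out = Path(output_path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(rendered)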

costrouc pushed a commit that referenced this pull request Feb 3, 2022
* blacken

* test flake8

* fix typo

* Add buitlin exception to setup.cfg
costrouc added a commit that referenced this pull request Feb 11, 2022
Prior to #1003 there was the option `jupyterhub.overrides`. Here we
are adding back this option and ensuring that there is documentation
on the option.
danlester pushed a commit that referenced this pull request Feb 12, 2022
Prior to #1003 there was the option `jupyterhub.overrides`. Here we
are adding back this option and ensuring that there is documentation
on the option.
@danlester mentioned this pull request Feb 15, 2022
@viniciusdc deleted the refactor-auth-stages branch on August 18, 2022 at 15:55
Successfully merging this pull request may close these issues:

Integration of Keycloak authentication for Conda-Store