Setup autodeploy for GCP blueprints #5

Closed

jlewi opened this issue May 4, 2020 · 6 comments

@jlewi (Contributor) commented May 4, 2020

We should set up the auto-deploy infrastructure to auto-deploy from blueprints.

This way we ensure that our GCP blueprint stays up to date and working.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
feature 0.97


@issue-label-bot

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.

jlewi added the kind/feature label and removed the feature label on May 5, 2020
jlewi pushed a commit to jlewi/gcp-blueprints that referenced this issue May 5, 2020
* Fix some bugs in the blueprints that cropped up while working on
  setting up continuous auto-deployments using the blueprints (GoogleCloudPlatform#5)

Fix some bugs in the documentation.

* Fix bugs in the management config for the per-namespace components
  of CNRM. The namespaces of the role bindings weren't correct, so
  the cnrm manager pod ended up not having the appropriate permissions.

  * Also, the scoped namespace of the cnrm manager statefulset needs
    to be set to the managed project, not the host project.

* Update the Makefile to point at kubeflow/manifests master to pull in
  cert-manager changes.

* Add check_domain_length to validate the length of the hostname for the KF
  deployment so that we don't end up exceeding the certificate limits
  (see the sketch after this commit message).

Check in the blueprint manifests.

Clean up for PR.
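
For reference, a minimal sketch of the kind of validation a check like check_domain_length performs. The hostname pattern <name>.endpoints.<project>.cloud.goog and the 64-character limit on the certificate common name are assumptions based on the usual Kubeflow-on-GCP setup; the actual check may differ.

```python
# Hypothetical sketch of a hostname-length check; the hostname pattern and the
# 64-character common-name limit are assumptions, not taken from the actual script.
MAX_CERT_CN_LENGTH = 64  # X.509 common names are limited to 64 characters


def check_domain_length(kf_name: str, project: str) -> str:
    """Return the deployment hostname, or raise if it would exceed the cert limit."""
    hostname = f"{kf_name}.endpoints.{project}.cloud.goog"
    if len(hostname) > MAX_CERT_CN_LENGTH:
        raise ValueError(
            f"Hostname {hostname} is {len(hostname)} characters long; "
            f"certificate common names are limited to {MAX_CERT_CN_LENGTH}."
        )
    return hostname


if __name__ == "__main__":
    print(check_domain_length("kf-vbp-0504", "kubeflow-ci-deployment"))
```
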
jlewi pushed a commit to jlewi/gcp-blueprints that referenced this issue May 5, 2020
jlewi pushed a commit to jlewi/gcp-blueprints that referenced this issue May 5, 2020
@jlewi (Contributor, Author) commented May 5, 2020

Auto-deploy is running on cluster kf-ci-v1.

I made the service account kf-ci-v1-user@kubeflow-ci.iam.gserviceaccount.com an owner of the projects in folder ci-projects so that it can deploy into project kubeflow-ci-deployment.
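
Roughly, that grant amounts to something like the following; the folder ID is a placeholder, and the exact steps may have differed (e.g. done through the Cloud Console instead).

```python
# Hypothetical sketch of the IAM grant described above, driven through gcloud.
# The folder ID is a placeholder; roles/owner on the folder is what the comment
# describes, not necessarily the minimal role that would work.
import subprocess

CI_PROJECTS_FOLDER_ID = "123456789012"  # placeholder for the ci-projects folder
GSA = "kf-ci-v1-user@kubeflow-ci.iam.gserviceaccount.com"

subprocess.run(
    [
        "gcloud", "resource-manager", "folders", "add-iam-policy-binding",
        CI_PROJECTS_FOLDER_ID,
        f"--member=serviceAccount:{GSA}",
        "--role=roles/owner",
    ],
    check=True,
)
```
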

k8s-ci-robot pushed a commit that referenced this issue May 5, 2020
jlewi pushed a commit to jlewi/testing that referenced this issue May 6, 2020
* This Tekton pipeline will eventually be used to continually
  deploy a fresh instance from the blueprint for CI.

Reorganize how we are defining reusable Tekton tasks.

  * Tekton tasks are currently defined in tekton/templates.

  * I reorganized the Tekton tasks into kustomize packages.
  * I did this because I want to make it easier to hydrate the tasks
    for different installs (e.g. different namespaces).

    * E.g. for auto-deployment we will use namespace auto-deploy, but
      in other settings we might use a different namespace.

Start setting up an ACM repo in acm-repo.

* This will eventually be used to sync our Tekton tasks automatically
  to our cluster.
* The idea is to have a single ACM repo to manage all of our CI/CD clusters.
    * A single ACM repo can manage multiple clusters.
    * We could use ACM cluster selectors to select which cluster this applies
      to.

    * So we could eventually reuse this same repo for label-sync configs
      but only sync label-sync to the cluster where label-sync runs.

* Start putting hydrated Tekton pipelines here.
* ACM isn't actually installed on our cluster yet, so we aren't
  actually syncing the resources yet. Right now we are still applying them manually.

Update the management cluster to work for autodeployment.

  * Our management cluster needs to grant the kf-ci-v1-user@ GSA permissions
    to create CNRM resources so we can deploy Kubeflow.
    * We do this by adding a K8s RoleBinding binding that GSA to the
      cnrm-admin ClusterRole in namespace kubeflow-ci-deployment
      (see the sketch after this commit message).

To support GCP blueprints I had to update the test worker image:

  * Install anthoscli, kpt, and istioctl.
  * Install a newer version of yq (i.e. the yq that is a Go binary
    and not a wrapper around jq).

Related to: GoogleCloudPlatform/kubeflow-distribution#5
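
A minimal sketch of the RoleBinding described above, emitted by a small Python helper so it could be checked into the ACM repo or applied with kubectl. The metadata.name is a placeholder; the namespace, ClusterRole, and subject follow the commit message.

```python
# Hypothetical sketch: emit the RoleBinding that grants the CI Google service
# account the cnrm-admin ClusterRole in the kubeflow-ci-deployment namespace.
import yaml

role_binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {
        "name": "kf-ci-v1-user-cnrm-admin",  # placeholder name
        "namespace": "kubeflow-ci-deployment",
    },
    "roleRef": {
        "apiGroup": "rbac.authorization.k8s.io",
        "kind": "ClusterRole",
        "name": "cnrm-admin",
    },
    "subjects": [
        {
            # On GKE a Google service account authenticates as a User with its email.
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "User",
            "name": "kf-ci-v1-user@kubeflow-ci.iam.gserviceaccount.com",
        }
    ],
}

print(yaml.safe_dump(role_binding, sort_keys=False))
```
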
k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue May 6, 2020
jlewi pushed a commit to jlewi/testing that referenced this issue May 7, 2020
jlewi pushed a commit to jlewi/testing that referenced this issue May 7, 2020
* cnrm_clients.py is a quick hack to create a wrapper to make it
  easier to work with CNRM custom resources.

Related to: GoogleCloudPlatform/kubeflow-distribution#5 (autodeployments of blueprints)
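
As an illustration, a thin wrapper along these lines (not the actual cnrm_clients.py) could sit on top of the CustomObjectsApi from the official kubernetes Python client; which CNRM resources the real script wraps is an assumption, though the ContainerCluster group/version/plural shown follow the public CNRM API.

```python
# Hypothetical sketch of a CNRM helper in the spirit of cnrm_clients.py.
from kubernetes import client, config


class CnrmClient:
    """Thin convenience wrapper over namespaced CNRM custom resources."""

    def __init__(self, group: str, version: str, plural: str):
        config.load_kube_config()
        self._api = client.CustomObjectsApi()
        self._group, self._version, self._plural = group, version, plural

    def list(self, namespace: str):
        return self._api.list_namespaced_custom_object(
            self._group, self._version, namespace, self._plural)["items"]

    def delete(self, namespace: str, name: str):
        return self._api.delete_namespaced_custom_object(
            self._group, self._version, namespace, self._plural, name,
            body=client.V1DeleteOptions())


# Example: list the GKE clusters CNRM manages in the CI namespace.
clusters = CnrmClient("container.cnrm.cloud.google.com", "v1beta1",
                      "containerclusters").list("kubeflow-ci-deployment")
for c in clusters:
    print(c["metadata"]["name"])
```
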
jlewi pushed a commit to jlewi/testing that referenced this issue May 7, 2020
* We want to auto-deploy the GCP blueprint (GoogleCloudPlatform/kubeflow-distribution#5).

* We need to add logic and K8s resources to clean up the blueprints so
  we don't run out of GCP quota.

* Create cleanup_blueprints.py to clean up auto-deployed blueprints.
  * Don't put this code in cleanup_ci.py because we want to be able to
    use fire and possibly Python 3 (not sure the code in cleanup_ci.py is
    Python 3 compatible).

* Create a CLI, create_context.py, to create K8s config contexts. This will
  be used to get credentials to talk to the cleanup cluster when running
  on K8s (see the sketch after this commit message).

* Create a Tekton Task to run the cleanup script. This is intended
  as a replacement for our existing K8s job (kubeflow#654). There are a couple of
  reasons to start using Tekton:

  i) We are already using Tekton as part of the auto-deploy infrastructure.
  ii) We can leverage Tekton to handle git checkouts.
  iii) Tekton makes it easy to add additional steps to do things like
       create the context.

  * This is a partial solution. This PR contains a Tekton pipeline
    that only runs cleanup for the blueprints.

    * To do all cleanup using Tekton we just need to add a step or Task to
      run the existing cleanup-ci script. The only issue I foresee
      is that the Tekton pipeline runs in the kf-ci-v1 cluster and
      will need to be granted access to the kubeflow-testing cluster
      so we can clean up Argo workflows in that cluster.

* To run the Tekton pipeline regularly we create a cronjob that runs kubectl
  apply.

* cnrm_clients.py is a quick hack to create a wrapper to make it
  easier to work with CNRM custom resources.
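
A minimal sketch of what a context-creating CLI like create_context.py might do, assuming it shells out to gcloud and kubectl; the parameter names and the use of fire are assumptions based on the commit message.

```python
# Hypothetical sketch of a create_context.py-style CLI: fetch GKE credentials
# for a cluster and register them under a predictable kubeconfig context name.
# All argument values are placeholders supplied by the caller.
import subprocess

import fire  # the commit message mentions wanting to use fire for new CLIs


def create_context(name: str, project: str, zone: str, cluster: str):
    """Create a kubeconfig context called `name` for the given GKE cluster."""
    subprocess.run(
        ["gcloud", "container", "clusters", "get-credentials", cluster,
         f"--project={project}", f"--zone={zone}"],
        check=True,
    )
    # get-credentials names the context gke_<project>_<zone>_<cluster>;
    # rename it so later steps can refer to it by a stable name.
    subprocess.run(
        ["kubectl", "config", "rename-context",
         f"gke_{project}_{zone}_{cluster}", name],
        check=True,
    )


if __name__ == "__main__":
    fire.Fire(create_context)
```
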
jlewi pushed a commit to jlewi/testing that referenced this issue May 7, 2020
k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue May 8, 2020
jlewi pushed a commit to jlewi/testing that referenced this issue May 8, 2020
jlewi pushed a commit to jlewi/testing that referenced this issue May 9, 2020
* Create a blueprint reconciler to autodeploy and reconcile
  blueprints.

  * The reconciler decides whether we need to deploy a new blueprint
    and, if so, creates a Tekton PipelineRun to deploy Kubeflow
    (see the sketch after this commit message).

* Here are some differences in how we are deploying blueprints vs. kfctl
  deployments:

  * We are using Tekton PipelineRuns as opposed to K8s Jobs to do the
    deployment.

  * We no longer use deployments.yaml to describe the group of deployments.
    Instead we just create a PipelineRun.yaml, and that provides all
    the information the reconciler needs, e.g. the branch to watch
    for changes.

* Update the Flask app to provide information about blueprints.
  * Include a link to the Tekton dashboard showing the PipelineRun
    that deployed Kubeflow.

* Define a Pipeline to deploy Kubeflow so we don't have to inline the
  spec in the PipelineRun.

* Remove Dockerfile.skaffold; we can use skaffold auto-sync in developer mode.
  Add a column in the webserver to redirect to the Tekton dashboard for the
  PipelineRun that deployed it.

* GoogleCloudPlatform/kubeflow-distribution#5 Setup autodeploy for GCP blueprints.
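
A minimal sketch of the reconciler's core decision, assuming it keys deployments off the latest commit on the watched branch; the label names are placeholders and the auto-deploy namespace is assumed from the earlier commit message, so this is not the actual implementation.

```python
# Hypothetical sketch of the reconcile step: if the watched branch has a commit
# that no existing auto-deployed PipelineRun was created from, submit a new
# PipelineRun. Label names and the namespace are placeholders.
import copy
import subprocess

from kubernetes import client, config

GROUP, VERSION, PLURAL = "tekton.dev", "v1beta1", "pipelineruns"
NAMESPACE = "auto-deploy"  # assumed namespace for the auto-deploy infrastructure


def latest_commit(repo_url: str, branch: str) -> str:
    """Return the SHA at the tip of the branch without cloning the repo."""
    out = subprocess.run(["git", "ls-remote", repo_url, branch],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()[0]


def reconcile(repo_url: str, branch: str, run_template: dict):
    config.load_kube_config()
    api = client.CustomObjectsApi()
    commit = latest_commit(repo_url, branch)

    runs = api.list_namespaced_custom_object(
        GROUP, VERSION, NAMESPACE, PLURAL,
        label_selector=f"auto-deploy-branch={branch}")["items"]
    if any(r["metadata"].get("labels", {}).get("auto-deploy-commit") == commit
           for r in runs):
        return  # the newest commit has already been deployed

    run = copy.deepcopy(run_template)
    run.setdefault("metadata", {}).setdefault("labels", {}).update(
        {"auto-deploy-branch": branch, "auto-deploy-commit": commit})
    api.create_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, run)
```
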
jlewi pushed a commit to jlewi/testing that referenced this issue May 9, 2020
jlewi pushed a commit to jlewi/testing that referenced this issue May 9, 2020
k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue May 9, 2020
@jlewi (Contributor, Author) commented May 18, 2020

This was working, so I'm closing this issue.
It recently broke; kubeflow/testing#668 is tracking that.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/engprod 0.84

