Setup autodeploy for GCP blueprints #5

Closed

jlewi opened this issue May 4, 2020 · 6 comments

@jlewi (Contributor) commented May 4, 2020

We should set up the auto-deploy infrastructure to auto-deploy from blueprints.

This way we ensure that our GCP blueprint stays up to date and working.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
feature 0.97


@issue-label-bot

Issue Label Bot is not confident enough to auto-label this issue.
See dashboard for more details.

jlewi added the kind/feature label and removed the feature label on May 5, 2020
jlewi pushed a commit to jlewi/gcp-blueprints that referenced this issue May 5, 2020
* Fix some bugs in the blueprints that cropped up while working on
  setting up continuous auto-deployments using the blueprints (GoogleCloudPlatform#5)

Fix some bugs in the documentation.

* Fix bugs in the management config for the per-namespace components
  of CNRM. The namespaces of the role bindings weren't correct, so
  the cnrm manager pod ended up not having the appropriate permissions.

  * Also, the scoped namespace of the cnrm manager statefulset needs
    to be set to the managed project, not the host project.

* Update the Makefile to point at kubeflow/manifests master to pull in
  cert-manager changes.

* Add check_domain_length to validate the length of the hostname for the KF
  deployment so that we don't end up exceeding the certificate limits
  (see the sketch after this commit message).

Check in the blueprint manifests.

Clean up for PR.
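
For reference, a minimal sketch of the kind of validation a check like check_domain_length performs. The hostname pattern <name>.endpoints.<project>.cloud.goog and the 64-character limit on the certificate common name are assumptions based on the usual Kubeflow-on-GCP setup; the actual check may differ.

```python
# Hypothetical sketch of a hostname-length check; the hostname pattern and the
# 64-character common-name limit are assumptions, not taken from the actual script.
MAX_CERT_CN_LENGTH = 64  # X.509 common names are limited to 64 characters


def check_domain_length(kf_name: str, project: str) -> str:
    """Return the deployment hostname, or raise if it would exceed the cert limit."""
    hostname = f"{kf_name}.endpoints.{project}.cloud.goog"
    if len(hostname) > MAX_CERT_CN_LENGTH:
        raise ValueError(
            f"Hostname {hostname} is {len(hostname)} characters long; "
            f"certificate common names are limited to {MAX_CERT_CN_LENGTH}."
        )
    return hostname


if __name__ == "__main__":
    print(check_domain_length("kf-vbp-0504", "kubeflow-ci-deployment"))
```
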
jlewi pushed a commit to jlewi/gcp-blueprints that referenced this issue May 5, 2020
jlewi pushed a commit to jlewi/gcp-blueprints that referenced this issue May 5, 2020
@jlewi (Contributor, Author) commented May 5, 2020

Auto-deploy is running on cluster kf-ci-v1.

I made the service account kf-ci-v1-user@kubeflow-ci.iam.gserviceaccount.com an owner of the projects in folder ci-projects so that it can deploy into project kubeflow-ci-deployment.
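
Roughly, that grant amounts to something like the following; the folder ID is a placeholder, and the exact steps may have differed (e.g. done through the Cloud Console instead).

```python
# Hypothetical sketch of the IAM grant described above, driven through gcloud.
# The folder ID is a placeholder; roles/owner on the folder is what the comment
# describes, not necessarily the minimal role that would work.
import subprocess

CI_PROJECTS_FOLDER_ID = "123456789012"  # placeholder for the ci-projects folder
GSA = "kf-ci-v1-user@kubeflow-ci.iam.gserviceaccount.com"

subprocess.run(
    [
        "gcloud", "resource-manager", "folders", "add-iam-policy-binding",
        CI_PROJECTS_FOLDER_ID,
        f"--member=serviceAccount:{GSA}",
        "--role=roles/owner",
    ],
    check=True,
)
```
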

k8s-ci-robot pushed a commit that referenced this issue May 5, 2020
jlewi pushed a commit to jlewi/testing that referenced this issue May 6, 2020
* This Tekton pipeline will eventually be used to continually
  deploy a fresh instance from the blueprint for CI.

Reorganize how we are defining reusable Tekton tasks.

  * Tekton tasks are currently defined in tekton/templates.

  * I reorganized the Tekton tasks into kustomize packages.
  * I did this because I want to make it easier to hydrate the tasks
    for different installs (e.g. different namespaces).

    * E.g. for auto-deployment we will use namespace auto-deploy, but
      in other settings we might use a different namespace.

Start setting up an ACM repo in acm-repo.

* This will eventually be used to sync our Tekton tasks automatically
  to our cluster.
* The idea is to have a single ACM repo to manage all of our CI/CD clusters.
    * A single ACM repo can manage multiple clusters.
    * We could use ACM cluster selectors to select which cluster this applies
      to.

    * So we could eventually reuse this same repo for label-sync configs
      but only sync label-sync to the cluster where label-sync runs.

* Start putting hydrated Tekton pipelines here.
* ACM isn't actually installed on our cluster yet, so we aren't
  actually syncing the resources yet. Right now we are still applying them manually.

Update the management cluster to work for autodeployment.

  * Our management cluster needs to grant the kf-ci-v1-user@ GSA permissions
    to create CNRM resources so we can deploy Kubeflow.
    * We do this by adding a K8s RoleBinding binding that GSA to the
      cnrm-admin ClusterRole in namespace kubeflow-ci-deployment
      (see the sketch after this commit message).

To support GCP blueprints I had to update the test worker image:

  * Install anthoscli, kpt, and istioctl.
  * Install a newer version of yq (i.e. the yq that is a Go binary
    and not a wrapper around jq).

Related to: GoogleCloudPlatform/kubeflow-distribution#5
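
A minimal sketch of the RoleBinding described above, emitted by a small Python helper so it could be checked into the ACM repo or applied with kubectl. The metadata.name is a placeholder; the namespace, ClusterRole, and subject follow the commit message.

```python
# Hypothetical sketch: emit the RoleBinding that grants the CI Google service
# account the cnrm-admin ClusterRole in the kubeflow-ci-deployment namespace.
import yaml

role_binding = {
    "apiVersion": "rbac.authorization.k8s.io/v1",
    "kind": "RoleBinding",
    "metadata": {
        "name": "kf-ci-v1-user-cnrm-admin",  # placeholder name
        "namespace": "kubeflow-ci-deployment",
    },
    "roleRef": {
        "apiGroup": "rbac.authorization.k8s.io",
        "kind": "ClusterRole",
        "name": "cnrm-admin",
    },
    "subjects": [
        {
            # On GKE a Google service account authenticates as a User with its email.
            "apiGroup": "rbac.authorization.k8s.io",
            "kind": "User",
            "name": "kf-ci-v1-user@kubeflow-ci.iam.gserviceaccount.com",
        }
    ],
}

print(yaml.safe_dump(role_binding, sort_keys=False))
```
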
k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue May 6, 2020
jlewi pushed a commit to jlewi/testing that referenced this issue May 7, 2020
jlewi pushed a commit to jlewi/testing that referenced this issue May 7, 2020
* cnrm_clients.py is a quick hack to create a wrapper to make it
  easier to work with CNRM custom resources.

Related to: GoogleCloudPlatform/kubeflow-distribution#5 (autodeployments of blueprints)
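
As an illustration, a thin wrapper along these lines (not the actual cnrm_clients.py) could sit on top of the CustomObjectsApi from the official kubernetes Python client; which CNRM resources the real script wraps is an assumption, though the ContainerCluster group/version/plural shown follow the public CNRM API.

```python
# Hypothetical sketch of a CNRM helper in the spirit of cnrm_clients.py.
from kubernetes import client, config


class CnrmClient:
    """Thin convenience wrapper over namespaced CNRM custom resources."""

    def __init__(self, group: str, version: str, plural: str):
        config.load_kube_config()
        self._api = client.CustomObjectsApi()
        self._group, self._version, self._plural = group, version, plural

    def list(self, namespace: str):
        return self._api.list_namespaced_custom_object(
            self._group, self._version, namespace, self._plural)["items"]

    def delete(self, namespace: str, name: str):
        return self._api.delete_namespaced_custom_object(
            self._group, self._version, namespace, self._plural, name,
            body=client.V1DeleteOptions())


# Example: list the GKE clusters CNRM manages in the CI namespace.
clusters = CnrmClient("container.cnrm.cloud.google.com", "v1beta1",
                      "containerclusters").list("kubeflow-ci-deployment")
for c in clusters:
    print(c["metadata"]["name"])
```
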
jlewi pushed a commit to jlewi/testing that referenced this issue May 7, 2020
* We want to auto-deploy the GCP blueprint (GoogleCloudPlatform/kubeflow-distribution#5).

* We need to add logic and K8s resources to clean up the blueprints so
  we don't run out of GCP quota.

* Create cleanup_blueprints.py to clean up auto-deployed blueprints.
  * Don't put this code in cleanup_ci.py because we want to be able to
    use fire and possibly Python 3 (not sure the code in cleanup_ci.py is
    Python 3 compatible).

* Create a CLI, create_context.py, to create K8s config contexts. This will
  be used to get credentials to talk to the cleanup cluster when running
  on K8s (see the sketch after this commit message).

* Create a Tekton Task to run the cleanup script. This is intended
  as a replacement for our existing K8s job (kubeflow#654). There are a couple of
  reasons to start using Tekton:

  i) We are already using Tekton as part of the auto-deploy infrastructure.
  ii) We can leverage Tekton to handle git checkouts.
  iii) Tekton makes it easy to add additional steps to do things like
       create the context.

  * This is a partial solution. This PR contains a Tekton pipeline
    that only runs cleanup for the blueprints.

    * To do all cleanup using Tekton we just need to add a step or Task to
      run the existing cleanup-ci script. The only issue I foresee
      is that the Tekton pipeline runs in the kf-ci-v1 cluster and
      will need to be granted access to the kubeflow-testing cluster
      so we can clean up Argo workflows in that cluster.

* To run the Tekton pipeline regularly we create a cronjob that runs kubectl
  apply.

* cnrm_clients.py is a quick hack to create a wrapper to make it
  easier to work with CNRM custom resources.
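
A minimal sketch of what a context-creating CLI like create_context.py might do, assuming it shells out to gcloud and kubectl; the parameter names and the use of fire are assumptions based on the commit message.

```python
# Hypothetical sketch of a create_context.py-style CLI: fetch GKE credentials
# for a cluster and register them under a predictable kubeconfig context name.
# All argument values are placeholders supplied by the caller.
import subprocess

import fire  # the commit message mentions wanting to use fire for new CLIs


def create_context(name: str, project: str, zone: str, cluster: str):
    """Create a kubeconfig context called `name` for the given GKE cluster."""
    subprocess.run(
        ["gcloud", "container", "clusters", "get-credentials", cluster,
         f"--project={project}", f"--zone={zone}"],
        check=True,
    )
    # get-credentials names the context gke_<project>_<zone>_<cluster>;
    # rename it so later steps can refer to it by a stable name.
    subprocess.run(
        ["kubectl", "config", "rename-context",
         f"gke_{project}_{zone}_{cluster}", name],
        check=True,
    )


if __name__ == "__main__":
    fire.Fire(create_context)
```
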
jlewi pushed a commit to jlewi/testing that referenced this issue May 7, 2020
k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue May 8, 2020
jlewi pushed a commit to jlewi/testing that referenced this issue May 8, 2020
jlewi pushed a commit to jlewi/testing that referenced this issue May 9, 2020
* Create a blueprint reconciler to autodeploy and reconcile
  blueprints.

  * The reconciler decides whether we need to deploy a new blueprint
    and, if so, creates a Tekton PipelineRun to deploy Kubeflow
    (see the sketch after this commit message).

* Here are some differences in how we are deploying blueprints vs. kfctl
  deployments:

  * We are using Tekton PipelineRuns as opposed to K8s Jobs to do the
    deployment.

  * We no longer use deployments.yaml to describe the group of deployments.
    Instead we just create a PipelineRun.yaml, and that provides all
    the information the reconciler needs, e.g. the branch to watch
    for changes.

* Update the Flask app to provide information about blueprints.
  * Include a link to the Tekton dashboard showing the PipelineRun
    that deployed Kubeflow.

* Define a Pipeline to deploy Kubeflow so we don't have to inline the
  spec in the PipelineRun.

* Remove Dockerfile.skaffold; we can use skaffold auto-sync in developer mode.
  Add a column in the webserver to redirect to the Tekton dashboard for the
  PipelineRun that deployed it.

* GoogleCloudPlatform/kubeflow-distribution#5 Setup autodeploy for GCP blueprints.
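
A minimal sketch of the reconciler's core decision, assuming it keys deployments off the latest commit on the watched branch; the label names are placeholders and the auto-deploy namespace is assumed from the earlier commit message, so this is not the actual implementation.

```python
# Hypothetical sketch of the reconcile step: if the watched branch has a commit
# that no existing auto-deployed PipelineRun was created from, submit a new
# PipelineRun. Label names and the namespace are placeholders.
import copy
import subprocess

from kubernetes import client, config

GROUP, VERSION, PLURAL = "tekton.dev", "v1beta1", "pipelineruns"
NAMESPACE = "auto-deploy"  # assumed namespace for the auto-deploy infrastructure


def latest_commit(repo_url: str, branch: str) -> str:
    """Return the SHA at the tip of the branch without cloning the repo."""
    out = subprocess.run(["git", "ls-remote", repo_url, branch],
                         capture_output=True, text=True, check=True)
    return out.stdout.split()[0]


def reconcile(repo_url: str, branch: str, run_template: dict):
    config.load_kube_config()
    api = client.CustomObjectsApi()
    commit = latest_commit(repo_url, branch)

    runs = api.list_namespaced_custom_object(
        GROUP, VERSION, NAMESPACE, PLURAL,
        label_selector=f"auto-deploy-branch={branch}")["items"]
    if any(r["metadata"].get("labels", {}).get("auto-deploy-commit") == commit
           for r in runs):
        return  # the newest commit has already been deployed

    run = copy.deepcopy(run_template)
    run.setdefault("metadata", {}).setdefault("labels", {}).update(
        {"auto-deploy-branch": branch, "auto-deploy-commit": commit})
    api.create_namespaced_custom_object(GROUP, VERSION, NAMESPACE, PLURAL, run)
```
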
jlewi pushed a commit to jlewi/testing that referenced this issue May 9, 2020
jlewi pushed a commit to jlewi/testing that referenced this issue May 9, 2020
k8s-ci-robot pushed a commit to kubeflow/testing that referenced this issue May 9, 2020
@jlewi (Contributor, Author) commented May 18, 2020

This was working, so I'm closing this issue.
It recently broke; kubeflow/testing#668 is tracking that.

@issue-label-bot

Issue-Label Bot is automatically applying the labels:

Label Probability
area/engprod 0.84

