Code and jobs to clean up the new auto-deployed blueprints.
* We want to auto-deploy the GCP blueprint (GoogleCloudPlatform/kubeflow-distribution#5)

* We need to add logic and K8s resources to clean up the blueprints so
  we don't run out of GCP quota.

* Create cleanup_blueprints.py to clean up auto-deployed blueprints
  (a usage sketch appears after this list).
  * Don't put this code in cleanup_ci.py because we want to be able to
    use Fire and possibly Python 3 (it's not clear the code in cleanup_ci
    is Python 3 compatible).

* Create a CLI, create_context.py, to create K8s config contexts. This will
  be used to get credentials to talk to the cleanup cluster when running
  on K8s (a sketch of how the resulting context is consumed appears after
  this list).

* Create a Tekton task to run the cleanup script. This is intended
  as a replacement for our existing K8s job (kubeflow#654). There are a couple
  of reasons to start using Tekton:

  i) We are already using Tekton as part of the AutoDeploy infrastructure.
  ii) We can leverage Tekton to handle git checkouts.
  iii) Tekton makes it easy to add additional steps to do things like
       creating the context.

  * This is a partial solution. This PR contains a Tekton pipeline
    that only runs cleanup for the blueprints.

    * To do all cleanup using Tekton we just need to add a step or Task to
      run the existing cleanup-ci script. The only issue I foresee
      is that the Tekton pipeline runs in the kf-ci-v1 cluster and
      will need to be granted access to the kubeflow-testing cluster
      so we can clean up Argo workflows in that cluster.

* To run the Tekton pipeline regularly we create a CronJob that runs kubectl
  to create the PipelineRun.

* cnrm_clients.py is a quick hack to create a wrapper that makes it
  easier to work with CNRM custom resources (a rough sketch of such a
  wrapper appears after this list).
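
As an illustration of the Fire-based CLI in cleanup_blueprints.py, here is a minimal
sketch of how the cleanup entry point can be driven programmatically; the project and
context values are placeholders mirroring the defaults in the Tekton task below, and
the CLI form in the comment is the same invocation the task uses.

    # CLI form (as invoked by the Tekton task):
    #   python -m kubeflow.testing.cleanup_blueprints auto-blueprints \
    #     --project=kubeflow-ci-deployment --context=kubeflow-ci
    from kubeflow.testing.cleanup_blueprints import Cleanup

    # Fire exposes Cleanup.auto_blueprints as the auto-blueprints subcommand;
    # dryrun=True only logs what would be deleted.
    Cleanup.auto_blueprints(project="kubeflow-ci-deployment",
                            context="kubeflow-ci",
                            dryrun=True)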
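
The context created by create_context.py is then selected by name when the cleanup
code loads credentials (via kubeflow.testing.util.load_kube_config). A rough sketch
of the same idea using the stock kubernetes client; the context name is a placeholder
for whatever --name was passed to create_context.py:

    from kubernetes import client, config

    # Select the kubeconfig context created earlier; everything after this
    # talks to the cluster that context points at.
    config.load_kube_config(context="kubeflow-ci")
    api = client.CustomObjectsApi()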
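
cnrm_clients.py itself is not shown in this diff. Below is a rough sketch of what such
a wrapper might look like, based on how cleanup_blueprints.py calls it (a constructor
taking an ApiClient and a kind, plus list_namespaced and delete_namespaced); the
group/version match the constants in cleanup_blueprints.py, and the pluralization rule
is an assumption:

    from kubernetes import client as k8s_client

    class CnrmClientApi:
      """Thin wrapper around CustomObjectsApi for a single CNRM kind (sketch)."""

      GROUP = "cnrm.cloud.google.com"
      VERSION = "v1beta1"

      def __init__(self, api_client, kind):
        self._crd = k8s_client.CustomObjectsApi(api_client)
        # Assumption: the CRD plural is the lowercase kind plus "s",
        # e.g. "containercluster" -> "containerclusters".
        self._plural = kind.lower() + "s"

      def list_namespaced(self, namespace, **kwargs):
        return self._crd.list_namespaced_custom_object(
          self.GROUP, self.VERSION, namespace, self._plural, **kwargs)

      def delete_namespaced(self, namespace, name, body):
        return self._crd.delete_namespaced_custom_object(
          self.GROUP, self.VERSION, namespace, self._plural, name, body=body)
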
Jeremy Lewi committed May 7, 2020
1 parent 867fab0 commit 3c71fc3
Showing 17 changed files with 694 additions and 4 deletions.
@@ -0,0 +1,50 @@
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  labels:
    app: cleanup-ci-kubeflow-ci-deployment
  name: cleanup-ci-kubeflow-ci-deployment
  namespace: auto-deploy
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      creationTimestamp: null
      labels:
        job: cleanup-kubeflow-ci-deployment
    spec:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            job: cleanup-kubeflow-ci-deployment
        spec:
          containers:
          - command:
            - kubectl
            - create
            - -f
            - /configs/cleanup-blueprints-pipeline.yaml
            image: gcr.io/kubeflow-ci/test-worker-py3@sha256:b679ce5d7edbcc373fd7d28c57454f4f22ae987f200f601252b6dcca1fd8823b
            imagePullPolicy: IfNotPresent
            name: create-pipeline
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /configs
              name: cleanup-config
          restartPolicy: OnFailure
          serviceAccountName: default-editor
          volumes:
          - configMap:
              name: cleanup-config-4bm54d2bmb
            name: cleanup-config
  schedule: 0 */2 * * *
  successfulJobsHistoryLimit: 3
  suspend: false
status:
  lastScheduleTime: "2020-05-07T14:00:00Z"
@@ -0,0 +1,66 @@
apiVersion: tekton.dev/v1alpha1
kind: Task
metadata:
  annotations:
    sidecar.istio.io/inject: "false"
  name: cleanup-kubeflow-ci
  namespace: auto-deploy
spec:
  inputs:
    params:
    - default: kf-vbp-{uid}
      description: The name for the Kubeflow deployment
      name: name
      type: string
    - default: kubeflow-ci-deployment
      description: The project to clean up.
      name: project
      type: string
    - default: kf-ci-management
      description: The name of the management cluster.
      name: management-cluster-name
      type: string
    - default: kubeflow-ci
      description: The project containing the management cluster
      name: management-project
      type: string
    - default: us-central1
      description: The location of the management cluster
      name: management-location
      type: string
    resources:
    - description: The GitHub repo containing kubeflow testing scripts
      name: testing-repo
      type: git
  steps:
  - command:
    - python
    - -m
    - kubeflow.testing.create_context
    - create
    - --name=$(inputs.params.management-project)
    - --project=$(inputs.params.management-project)
    - --location=$(inputs.params.management-location)
    - --cluster=$(inputs.params.management-cluster-name)
    - --namespace=$(inputs.params.project)
    env:
    - name: KUBECONFIG
      value: /workspace/kubeconfig
    - name: PYTHONPATH
      value: /workspace/$(inputs.resources.testing-repo.name)/py
    image: gcr.io/kubeflow-ci/test-worker-py3@sha256:b679ce5d7edbcc373fd7d28c57454f4f22ae987f200f601252b6dcca1fd8823b
    name: create-context
  - command:
    - python
    - -m
    - kubeflow.testing.cleanup_blueprints
    - auto-blueprints
    - --project=$(inputs.params.project)
    - --context=$(inputs.params.management-project)
    env:
    - name: KUBECONFIG
      value: /workspace/kubeconfig
    - name: PYTHONPATH
      value: /workspace/$(inputs.resources.testing-repo.name)/py
    image: gcr.io/kubeflow-ci/test-worker-py3@sha256:b679ce5d7edbcc373fd7d28c57454f4f22ae987f200f601252b6dcca1fd8823b
    name: cleanup-ci
@@ -0,0 +1,23 @@
apiVersion: v1
data:
  cleanup-blueprints-pipeline.yaml: |
    # A Tekton PipelineRun to do a one off
    # cleaning up the Kubeflow auto-deployed blueprints.
    #
    apiVersion: tekton.dev/v1alpha1
    kind: PipelineRun
    metadata:
      generateName: cleanup-blueprints-
      namespace: auto-deploy
    spec:
      # TODO(jlewi): Override any parameters?
      #params: {}
      resources:
      - name: testing-repo
        resourceSpec:
          type: git
          params:
          # TODO(jlewi): Switch to master on kubeflow/gcp-blueprints
          - name: revision
            value: gcp_blueprint
          - name: url
            value: https://github.com/jlewi/testing.git
      # Need to use a KSA with appropriate GSA
      serviceAccountName: default-editor
      pipelineSpec:
        params:
        - name: management-cluster-name
          type: string
          description: The name of the management cluster.
          default: "kf-ci-management"
        resources:
        - name: testing-repo
          type: git
        tasks:
        - name: cleanup-blueprints
          # TODO(jlewi): expose other parameters? Right now
          # we are just relying on the defaults defined in the task
          params:
          - name: management-cluster-name
            value: "$(params.management-cluster-name)"
          resources:
            inputs:
            - name: testing-repo
              resource: testing-repo
          taskRef:
            name: cleanup-kubeflow-ci
            kind: namespaced
kind: ConfigMap
metadata:
  name: cleanup-config-4bm54d2bmb
  namespace: auto-deploy
2 changes: 1 addition & 1 deletion playbook/README.md
@@ -2,6 +2,6 @@

This directory contains various playbooks for the Kubeflow test infrastructure.

* [auto_deploy.md][auto_deploy.md] - Playbook for auto deployed infrastructure
* [auto_deploy.md](auto_deploy.md) - Playbook for auto deployed infrastructure
* [buildcop.md](buildcop.md) - Playbook for the buildcop
* [playbook.md](playbook.md) - General playbook for the test infrastructure
232 changes: 232 additions & 0 deletions py/kubeflow/testing/cleanup_blueprints.py
@@ -0,0 +1,232 @@
"""Cleanup auto deployed blueprints.
Note: This is in a separate file from cleanup_ci because we wanted to start
using Fire and python3.
"""
import collections
import datetime
from dateutil import parser as date_parser
import fire
import logging

from kubeflow.testing import cnrm_clients
from kubeflow.testing import util
from kubernetes import client as k8s_client

# The names of various labels used to encode information about the
#
# Which branch the blueprint was deployed from
BRANCH_LABEL = "blueprint-branch"
NAME_LABEL = "kf-name"
AUTO_DEPLOY_LABEL = "auto-deploy"

def _iter_blueprints(namespace, context=None):
  """Return an iterator over blueprints.

  Args:
    namespace: The namespace to look for blueprints.
    context: The kube context to use.
  """
  # We need to load the kube config so that we can have credentials to
  # talk to the APIServer.
  util.load_kube_config(persist_config=False, context=context)

  client = k8s_client.ApiClient()
  crd_api = cnrm_clients.CnrmClientApi(client, "containercluster")

  clusters = crd_api.list_namespaced(namespace)

  for c in clusters.get("items"):
    yield c

def _delete_blueprints(namespace, to_keep_names, context=None, dryrun=True):
  """Delete all auto-deployed resources that we don't want to keep.

  Args:
    namespace: The namespace that owns the CNRM objects.
    to_keep_names: Names of the blueprints to keep.
    context: The kubeconfig context to use.

  This function deletes all auto-deployed resources that we don't want
  to keep; it is intended to delete any orphaned resources.
  It works as follows:
    1. For each type of resource we issue a list to find all auto-deployed
       resources.
    2. We then remove from the deletion candidates any resource which belongs
       to a blueprint to keep.
    3. We also remove any resource that is less than 1 hour old.
       * This is to avoid race conditions where a blueprint was created
         after to_keep was computed.
    4. The remaining resources are deleted.
  """

  util.load_kube_config(persist_config=False, context=context)

  client = k8s_client.ApiClient()
  crd_api = k8s_client.CustomObjectsApi(client)

  BASE_GROUP = "cnrm.cloud.google.com"
  CNRM_VERSION = "v1beta1"


  # List of resources to GC
  kinds = ["containercluster", "iampolicymember",
           "iamserviceaccount", "containernodepool",
           "computeaddress", "computedisk"]


  # Mappings from resource type to list of resources
  to_keep = collections.defaultdict(lambda: [])
  to_delete = collections.defaultdict(lambda: [])

  api_client = k8s_client.ApiClient()

  # Loop over resources and identify resources to delete.
  for kind in kinds:
    client = cnrm_clients.CnrmClientApi(api_client, kind)

    selector = "{0}=true".format(AUTO_DEPLOY_LABEL)
    results = client.list_namespaced(namespace, label_selector=selector)

    for i in results.get("items"):
      name = i["metadata"]["name"]

      if name in to_keep_names:
        to_keep[kind].append(name)
        continue

      creation = date_parser.parse(i["metadata"]["creationTimestamp"])
      age = datetime.datetime.now(creation.tzinfo) - creation
      if age < datetime.timedelta(hours=1):
        to_keep[kind].append(name)
        logging.info("Not GC'ing %s %s; it was created too recently", kind,
                     name)
        continue

      to_delete[kind].append(name)

  for kind in kinds:
    client = cnrm_clients.CnrmClientApi(api_client, kind)
    for name in to_delete[kind]:
      if dryrun:
        logging.info("Dryrun: %s %s would be deleted", kind, name)
      else:
        logging.info("Deleting: %s %s", kind, name)
        client.delete_namespaced(namespace, name, {})

  for kind in kinds:
    logging.info("Deleted %s:\n%s", kind, "\n".join(to_delete[kind]))
    logging.info("Kept %s:\n%s", kind, "\n".join(to_keep[kind]))

class Cleanup:
  @staticmethod
  def auto_blueprints(project, context, dryrun=True, blueprints=None): # pylint: disable=too-many-branches
    """Clean up auto-deployed blueprints.

    For auto-deployed blueprints we only want to keep the most recent N
    deployments.

    Args:
      project: The project that owns the deployments.
      context: The kubernetes context to use to talk to the Cloud Config
        Connector cluster.
      dryrun: (True) set to False to actually clean up.
      blueprints: (Optional) iterator over CNRM ContainerCluster resources
        corresponding to blueprints.
    """
    logging.info("Cleanup auto blueprints")

    # Map from blueprint version e.g. "master" to a map of blueprint names to
    # their insert time e.g.
    # auto_deployments["master"]["kf-vbp-abcd"] returns the creation time
    # of blueprint "kf-vbp-abcd", which was created from the master branch
    # of the blueprints repo.
    auto_deployments = collections.defaultdict(lambda: {})

    if not blueprints:
      blueprints = _iter_blueprints(project, context=context)

    for b in blueprints:
      name = b["metadata"]["name"]
      if not b["metadata"].get("creationTimestamp", None):
        # This should not happen; all K8s objects should have a creation
        # timestamp.
        logging.error("Cluster %s doesn't have a creation timestamp; "
                      "skipping it", b["metadata"]["name"])
        continue

      # Use labels to identify auto-deployed instances.
      auto_deploy_label = b["metadata"].get("labels", {}).get(AUTO_DEPLOY_LABEL,
                                                              "false")

      is_auto_deploy = auto_deploy_label.lower() == "true"

      if not is_auto_deploy:
        logging.info("Skipping cluster %s; it's missing the auto-deploy label",
                     name)
        continue

      # The name of the blueprint.
      kf_name = b["metadata"].get("labels", {}).get(NAME_LABEL, "")

      if not kf_name:
        logging.info("Skipping cluster %s; it is not an auto-deployed instance",
                     name)
        continue

      if kf_name != name:
        # TODO(jlewi): This shouldn't be happening. Hopefully this was just a
        # temporary issue with the first couple of auto-deployed clusters I
        # created and we can delete this code.
        logging.error("Found cluster named %s with label kf-name: %s. The name "
                      "will be used. This shouldn't happen; it was hopefully "
                      "just due to a temporary bug in early versions of "
                      "create_kf_from_gcp_blueprint.py that should be fixed, "
                      "so it shouldn't be happening in new instances anymore.",
                      name, kf_name)
        kf_name = name

      logging.info("Blueprint %s is auto deployed", kf_name)

      blueprint_branch = b["metadata"]["labels"].get(BRANCH_LABEL, "unknown")

      if blueprint_branch == "unknown":
        logging.warning("Blueprint %s was missing label %s", kf_name,
                        BRANCH_LABEL)

      if kf_name in auto_deployments[blueprint_branch]:
        continue

      auto_deployments[blueprint_branch][kf_name] = (
        date_parser.parse(b["metadata"]["creationTimestamp"]))

    # Garbage collect the blueprints.
    to_keep = []
    to_delete = []
    for version, matched_deployments in auto_deployments.items():
      logging.info("For version=%s found deployments:\n%s", version,
                   "\n".join(matched_deployments.keys()))

      # Sort the deployments by their insert time.
      pairs = matched_deployments.items()
      sorted_pairs = sorted(pairs, key=lambda x: x[1])

      # Keep the 3 most recent deployments.
      to_keep.extend([p[0] for p in sorted_pairs[-3:]])
      to_delete.extend([p[0] for p in sorted_pairs[:-3]])

    _delete_blueprints(project, to_keep, context=context,
                       dryrun=dryrun)

    logging.info("Finished cleaning up auto-deployed blueprints")

if __name__ == "__main__":
  logging.basicConfig(level=logging.INFO,
                      format=('%(levelname)s|%(asctime)s'
                              '|%(pathname)s|%(lineno)d| %(message)s'),
                      datefmt='%Y-%m-%dT%H:%M:%S',
                      )
  logging.getLogger().setLevel(logging.INFO)
  fire.Fire(Cleanup)