Code and jobs to clean up the new auto-deployed blueprints.
* We want to auto-deploy the GCP blueprint (GoogleCloudPlatform/kubeflow-distribution#5)

* We need to add logic and K8s resources to clean up the blueprints so
  we don't run out of GCP quota.

* Create cleanup_blueprints.py to clean up auto-deployed blueprints
  (a usage sketch appears after this list).
  * Don't put this code in cleanup_ci.py because we want to be able to
    use Fire and possibly Python 3 (it's not clear the code in cleanup_ci
    is Python 3 compatible).

* Create a CLI, create_context.py, to create K8s config contexts. This will
  be used to get credentials to talk to the cleanup cluster when running
  on K8s (a sketch of how the resulting context is consumed appears after
  this list).

* Create a Tekton task to run the cleanup script. This is intended
  as a replacement for our existing K8s job (kubeflow#654). There are a couple
  of reasons to start using Tekton:

  i) We are already using Tekton as part of the AutoDeploy infrastructure.
  ii) We can leverage Tekton to handle git checkouts.
  iii) Tekton makes it easy to add additional steps to do things like
       creating the context.

  * This is a partial solution. This PR contains a Tekton pipeline
    that only runs cleanup for the blueprints.

    * To do all cleanup using Tekton we just need to add a step or Task to
      run the existing cleanup-ci script. The only issue I foresee
      is that the Tekton pipeline runs in the kf-ci-v1 cluster and
      will need to be granted access to the kubeflow-testing cluster
      so we can clean up Argo workflows in that cluster.

* To run the Tekton pipeline regularly we create a CronJob that runs kubectl
  to create the PipelineRun.

* cnrm_clients.py is a quick hack to create a wrapper that makes it
  easier to work with CNRM custom resources (a rough sketch of such a
  wrapper appears after this list).
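
As an illustration of the Fire-based CLI in cleanup_blueprints.py, here is a minimal
sketch of how the cleanup entry point can be driven programmatically; the project and
context values are placeholders mirroring the defaults in the Tekton task below, and
the CLI form in the comment is the same invocation the task uses.

    # CLI form (as invoked by the Tekton task):
    #   python -m kubeflow.testing.cleanup_blueprints auto-blueprints \
    #     --project=kubeflow-ci-deployment --context=kubeflow-ci
    from kubeflow.testing.cleanup_blueprints import Cleanup

    # Fire exposes Cleanup.auto_blueprints as the auto-blueprints subcommand;
    # dryrun=True only logs what would be deleted.
    Cleanup.auto_blueprints(project="kubeflow-ci-deployment",
                            context="kubeflow-ci",
                            dryrun=True)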
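
The context created by create_context.py is then selected by name when the cleanup
code loads credentials (via kubeflow.testing.util.load_kube_config). A rough sketch
of the same idea using the stock kubernetes client; the context name is a placeholder
for whatever --name was passed to create_context.py:

    from kubernetes import client, config

    # Select the kubeconfig context created earlier; everything after this
    # talks to the cluster that context points at.
    config.load_kube_config(context="kubeflow-ci")
    api = client.CustomObjectsApi()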
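
cnrm_clients.py itself is not shown in this diff. Below is a rough sketch of what such
a wrapper might look like, based on how cleanup_blueprints.py calls it (a constructor
taking an ApiClient and a kind, plus list_namespaced and delete_namespaced); the
group/version match the constants in cleanup_blueprints.py, and the pluralization rule
is an assumption:

    from kubernetes import client as k8s_client

    class CnrmClientApi:
      """Thin wrapper around CustomObjectsApi for a single CNRM kind (sketch)."""

      GROUP = "cnrm.cloud.google.com"
      VERSION = "v1beta1"

      def __init__(self, api_client, kind):
        self._crd = k8s_client.CustomObjectsApi(api_client)
        # Assumption: the CRD plural is the lowercase kind plus "s",
        # e.g. "containercluster" -> "containerclusters".
        self._plural = kind.lower() + "s"

      def list_namespaced(self, namespace, **kwargs):
        return self._crd.list_namespaced_custom_object(
          self.GROUP, self.VERSION, namespace, self._plural, **kwargs)

      def delete_namespaced(self, namespace, name, body):
        return self._crd.delete_namespaced_custom_object(
          self.GROUP, self.VERSION, namespace, self._plural, name, body=body)
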
Jeremy Lewi committed May 7, 2020
1 parent 867fab0 commit 3c71fc3
Showing 17 changed files with 694 additions and 4 deletions.
@@ -0,0 +1,50 @@
apiVersion: batch/v1beta1
kind: CronJob
metadata:
  labels:
    app: cleanup-ci-kubeflow-ci-deployment
  name: cleanup-ci-kubeflow-ci-deployment
  namespace: auto-deploy
spec:
  concurrencyPolicy: Forbid
  failedJobsHistoryLimit: 1
  jobTemplate:
    metadata:
      annotations:
        sidecar.istio.io/inject: "false"
      creationTimestamp: null
      labels:
        job: cleanup-kubeflow-ci-deployment
    spec:
      template:
        metadata:
          annotations:
            sidecar.istio.io/inject: "false"
          labels:
            job: cleanup-kubeflow-ci-deployment
        spec:
          containers:
          - command:
            - kubectl
            - create
            - -f
            - /configs/cleanup-blueprints-pipeline.yaml
            image: gcr.io/kubeflow-ci/test-worker-py3@sha256:b679ce5d7edbcc373fd7d28c57454f4f22ae987f200f601252b6dcca1fd8823b
            imagePullPolicy: IfNotPresent
            name: create-pipeline
            terminationMessagePath: /dev/termination-log
            terminationMessagePolicy: File
            volumeMounts:
            - mountPath: /configs
              name: cleanup-config
          restartPolicy: OnFailure
          serviceAccountName: default-editor
          volumes:
          - configMap:
              name: cleanup-config-4bm54d2bmb
            name: cleanup-config
  schedule: 0 */2 * * *
  successfulJobsHistoryLimit: 3
  suspend: false
status:
  lastScheduleTime: "2020-05-07T14:00:00Z"
@@ -0,0 +1,66 @@
apiVersion: tekton.dev/v1alpha1
kind: Task
metadata:
  annotations:
    sidecar.istio.io/inject: "false"
  name: cleanup-kubeflow-ci
  namespace: auto-deploy
spec:
  inputs:
    params:
    - default: kf-vbp-{uid}
      description: The name for the Kubeflow deployment
      name: name
      type: string
    - default: kubeflow-ci-deployment
      description: The project to clean up.
      name: project
      type: string
    - default: kf-ci-management
      description: The name of the management cluster.
      name: management-cluster-name
      type: string
    - default: kubeflow-ci
      description: The project containing the management cluster
      name: management-project
      type: string
    - default: us-central1
      description: The location of the management cluster
      name: management-location
      type: string
    resources:
    - description: The GitHub repo containing kubeflow testing scripts
      name: testing-repo
      type: git
  steps:
  - command:
    - python
    - -m
    - kubeflow.testing.create_context
    - create
    - --name=$(inputs.params.management-project)
    - --project=$(inputs.params.management-project)
    - --location=$(inputs.params.management-location)
    - --cluster=$(inputs.params.management-cluster-name)
    - --namespace=$(inputs.params.project)
    env:
    - name: KUBECONFIG
      value: /workspace/kubeconfig
    - name: PYTHONPATH
      value: /workspace/$(inputs.resources.testing-repo.name)/py
    image: gcr.io/kubeflow-ci/test-worker-py3@sha256:b679ce5d7edbcc373fd7d28c57454f4f22ae987f200f601252b6dcca1fd8823b
    name: create-context
  - command:
    - python
    - -m
    - kubeflow.testing.cleanup_blueprints
    - auto-blueprints
    - --project=$(inputs.params.project)
    - --context=$(inputs.params.management-project)
    env:
    - name: KUBECONFIG
      value: /workspace/kubeconfig
    - name: PYTHONPATH
      value: /workspace/$(inputs.resources.testing-repo.name)/py
    image: gcr.io/kubeflow-ci/test-worker-py3@sha256:b679ce5d7edbcc373fd7d28c57454f4f22ae987f200f601252b6dcca1fd8823b
    name: cleanup-ci
@@ -0,0 +1,23 @@
apiVersion: v1
data:
  cleanup-blueprints-pipeline.yaml: |
    # A Tekton PipelineRun to do a one off
    # cleaning up the Kubeflow auto-deployed blueprints.
    #
    apiVersion: tekton.dev/v1alpha1
    kind: PipelineRun
    metadata:
      generateName: cleanup-blueprints-
      namespace: auto-deploy
    spec:
      # TODO(jlewi): Override any parameters?
      #params: {}
      resources:
      - name: testing-repo
        resourceSpec:
          type: git
          params:
          # TODO(jlewi): Switch to master on kubeflow/gcp-blueprints
          - name: revision
            value: gcp_blueprint
          - name: url
            value: https://github.com/jlewi/testing.git
      # Need to use a KSA with appropriate GSA
      serviceAccountName: default-editor
      pipelineSpec:
        params:
        - name: management-cluster-name
          type: string
          description: The name of the management cluster.
          default: "kf-ci-management"
        resources:
        - name: testing-repo
          type: git
        tasks:
        - name: cleanup-blueprints
          # TODO(jlewi): expose other parameters? Right now
          # we are just relying on the defaults defined in the task
          params:
          - name: management-cluster-name
            value: "$(params.management-cluster-name)"
          resources:
            inputs:
            - name: testing-repo
              resource: testing-repo
          taskRef:
            name: cleanup-kubeflow-ci
            kind: namespaced
kind: ConfigMap
metadata:
  name: cleanup-config-4bm54d2bmb
  namespace: auto-deploy
2 changes: 1 addition & 1 deletion playbook/README.md
@@ -2,6 +2,6 @@

This directory contains various playbooks for the Kubeflow test infrastructure.

* [auto_deploy.md][auto_deploy.md] - Playbook for auto deployed infrastructure
* [auto_deploy.md](auto_deploy.md) - Playbook for auto deployed infrastructure
* [buildcop.md](buildcop.md) - Playbook for the buildcop
* [playbook.md](playbook.md) - General playbook for the test infrastructure
232 changes: 232 additions & 0 deletions py/kubeflow/testing/cleanup_blueprints.py
@@ -0,0 +1,232 @@
"""Cleanup auto deployed blueprints.
Note: This is in a separate file from cleanup_ci because we wanted to start
using Fire and python3.
"""
import collections
import datetime
from dateutil import parser as date_parser
import fire
import logging

from kubeflow.testing import cnrm_clients
from kubeflow.testing import util
from kubernetes import client as k8s_client

# The names of various labels used to encode information about the
#
# Which branch the blueprint was deployed from
BRANCH_LABEL = "blueprint-branch"
NAME_LABEL = "kf-name"
AUTO_DEPLOY_LABEL = "auto-deploy"

def _iter_blueprints(namespace, context=None):
  """Return an iterator over blueprints.

  Args:
    namespace: The namespace to look for blueprints.
    context: The kube context to use.
  """
  # We need to load the kube config so that we can have credentials to
  # talk to the APIServer.
  util.load_kube_config(persist_config=False, context=context)

  client = k8s_client.ApiClient()
  crd_api = cnrm_clients.CnrmClientApi(client, "containercluster")

  clusters = crd_api.list_namespaced(namespace)

  for c in clusters.get("items"):
    yield c

def _delete_blueprints(namespace, to_keep_names, context=None, dryrun=True):
  """Delete all auto-deployed resources that we don't want to keep.

  Args:
    namespace: The namespace that owns the CNRM objects.
    to_keep_names: Names of the blueprints to keep.
    context: The kubeconfig context to use.

  This function deletes all auto-deployed resources that we don't want
  to keep; it is intended to delete any orphaned resources.
  It works as follows:
    1. For each type of resource we issue a list to find all auto-deployed
       resources.
    2. We then remove from the deletion candidates any resource which belongs
       to a blueprint to keep.
    3. We also remove any resource that is less than 1 hour old.
       * This is to avoid race conditions where a blueprint was created
         after to_keep was computed.
    4. The remaining resources are deleted.
  """

  util.load_kube_config(persist_config=False, context=context)

  client = k8s_client.ApiClient()
  crd_api = k8s_client.CustomObjectsApi(client)

  BASE_GROUP = "cnrm.cloud.google.com"
  CNRM_VERSION = "v1beta1"


  # List of resources to GC
  kinds = ["containercluster", "iampolicymember",
           "iamserviceaccount", "containernodepool",
           "computeaddress", "computedisk"]


  # Mappings from resource type to list of resources
  to_keep = collections.defaultdict(lambda: [])
  to_delete = collections.defaultdict(lambda: [])

  api_client = k8s_client.ApiClient()

  # Loop over resources and identify resources to delete.
  for kind in kinds:
    client = cnrm_clients.CnrmClientApi(api_client, kind)

    selector = "{0}=true".format(AUTO_DEPLOY_LABEL)
    results = client.list_namespaced(namespace, label_selector=selector)

    for i in results.get("items"):
      name = i["metadata"]["name"]

      if name in to_keep_names:
        to_keep[kind].append(name)
        continue

      creation = date_parser.parse(i["metadata"]["creationTimestamp"])
      age = datetime.datetime.now(creation.tzinfo) - creation
      if age < datetime.timedelta(hours=1):
        to_keep[kind].append(name)
        logging.info("Not GC'ing %s %s; it was created too recently", kind,
                     name)
        continue

      to_delete[kind].append(name)

  for kind in kinds:
    client = cnrm_clients.CnrmClientApi(api_client, kind)
    for name in to_delete[kind]:
      if dryrun:
        logging.info("Dryrun: %s %s would be deleted", kind, name)
      else:
        logging.info("Deleting: %s %s", kind, name)
        client.delete_namespaced(namespace, name, {})

  for kind in kinds:
    logging.info("Deleted %s:\n%s", kind, "\n".join(to_delete[kind]))
    logging.info("Kept %s:\n%s", kind, "\n".join(to_keep[kind]))

class Cleanup:
  @staticmethod
  def auto_blueprints(project, context, dryrun=True, blueprints=None): # pylint: disable=too-many-branches
    """Clean up auto-deployed blueprints.

    For auto-deployed blueprints we only want to keep the most recent N
    deployments.

    Args:
      project: The project that owns the deployments.
      context: The kubernetes context to use to talk to the Cloud Config
        Connector cluster.
      dryrun: (True) set to False to actually clean up.
      blueprints: (Optional) iterator over CNRM ContainerCluster resources
        corresponding to blueprints.
    """
    logging.info("Cleanup auto blueprints")

    # Map from blueprint version e.g. "master" to a map of blueprint names to
    # their insert time e.g.
    # auto_deployments["master"]["kf-vbp-abcd"] returns the creation time
    # of blueprint "kf-vbp-abcd", which was created from the master branch
    # of the blueprints repo.
    auto_deployments = collections.defaultdict(lambda: {})

    if not blueprints:
      blueprints = _iter_blueprints(project, context=context)

    for b in blueprints:
      name = b["metadata"]["name"]
      if not b["metadata"].get("creationTimestamp", None):
        # This should not happen; all K8s objects should have a creation
        # timestamp.
        logging.error("Cluster %s doesn't have a creation timestamp; "
                      "skipping it", b["metadata"]["name"])
        continue

      # Use labels to identify auto-deployed instances.
      auto_deploy_label = b["metadata"].get("labels", {}).get(AUTO_DEPLOY_LABEL,
                                                              "false")

      is_auto_deploy = auto_deploy_label.lower() == "true"

      if not is_auto_deploy:
        logging.info("Skipping cluster %s; it's missing the auto-deploy label",
                     name)
        continue

      # The name of the blueprint.
      kf_name = b["metadata"].get("labels", {}).get(NAME_LABEL, "")

      if not kf_name:
        logging.info("Skipping cluster %s; it is not an auto-deployed instance",
                     name)
        continue

      if kf_name != name:
        # TODO(jlewi): This shouldn't be happening. Hopefully this was just a
        # temporary issue with the first couple of auto-deployed clusters I
        # created and we can delete this code.
        logging.error("Found cluster named %s with label kf-name: %s. The name "
                      "will be used. This shouldn't happen; it was hopefully "
                      "just due to a temporary bug in early versions of "
                      "create_kf_from_gcp_blueprint.py that should be fixed, "
                      "so it shouldn't be happening in new instances anymore.",
                      name, kf_name)
        kf_name = name

      logging.info("Blueprint %s is auto deployed", kf_name)

      blueprint_branch = b["metadata"]["labels"].get(BRANCH_LABEL, "unknown")

      if blueprint_branch == "unknown":
        logging.warning("Blueprint %s was missing label %s", kf_name,
                        BRANCH_LABEL)

      if kf_name in auto_deployments[blueprint_branch]:
        continue

      auto_deployments[blueprint_branch][kf_name] = (
        date_parser.parse(b["metadata"]["creationTimestamp"]))

    # Garbage collect the blueprints.
    to_keep = []
    to_delete = []
    for version, matched_deployments in auto_deployments.items():
      logging.info("For version=%s found deployments:\n%s", version,
                   "\n".join(matched_deployments.keys()))

      # Sort the deployments by their insert time.
      pairs = matched_deployments.items()
      sorted_pairs = sorted(pairs, key=lambda x: x[1])

      # Keep the 3 most recent deployments.
      to_keep.extend([p[0] for p in sorted_pairs[-3:]])
      to_delete.extend([p[0] for p in sorted_pairs[:-3]])

    _delete_blueprints(project, to_keep, context=context,
                       dryrun=dryrun)

    logging.info("Finished cleaning up auto-deployed blueprints")

if __name__ == "__main__":
  logging.basicConfig(level=logging.INFO,
                      format=('%(levelname)s|%(asctime)s'
                              '|%(pathname)s|%(lineno)d| %(message)s'),
                      datefmt='%Y-%m-%dT%H:%M:%S',
                      )
  logging.getLogger().setLevel(logging.INFO)
  fire.Fire(Cleanup)