Pod Management using Role Manifest Tags
All roles are deployed as StatefulSets. Their stable indexes and rolling updates are useful, and they are the closest match to how BOSH works.
Role Tag | Description |
---|---|
active-passive | Fissile will set up a special readiness probe that alters a label on the pods; the load balancing service uses that label as a selector. |
sequential-startup | Pods will be part of a StatefulSet that has an OrderedReady podManagementPolicy. All other roles will be StatefulSets with a Parallel podManagementPolicy. |
stop-on-failure | A bosh-task with the tag stop-on-failure is a basic Pod that won't restart, even if it fails. This is useful for running tests, where a failure should be reported rather than trigger a pod restart. |
active-passive | sequential-startup | deployment type | spec.index | spec.bootstrap | comments |
---|---|---|---|---|---|
× | × | Parallel StatefulSet | arbitrary number | false | |
× | ✅ | OrderedReady StatefulSet | the pod's index† | if no other pod running this image | |
✅ | × | Parallel StatefulSet | arbitrary number | false | required A/P healthcheck |
✅ | ✅ | OrderedReady StatefulSet | the pod's index† | if no other pod running this image | required A/P healthcheck |
†: the pod's ordinal index is the number in the pod name; for example, the pod foo-0 has an index of 0.
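As a small illustration of the footnote above, the ordinal index can be recovered from a pod name by stripping everything up to the last hyphen. This is only a sketch; the function name `pod_index` is hypothetical and is not something fissile provides.

```shell
# pod_index is a hypothetical helper, not part of fissile.
# It prints the ordinal suffix of a StatefulSet pod name.
pod_index() {
    # Remove the longest prefix ending in "-", leaving only the ordinal.
    printf '%s\n' "${1##*-}"
}

# Inside a pod, the pod name is normally the hostname:
#   pod_index "$(hostname)"
```

For example, `pod_index foo-0` prints `0`.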
Fissile should validate tag usage and throw an error if an unsupported combination is used. The stop-on-failure tag can be used in any situation, if the pod should not be restarted on failure.
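The mapping in the table above can be sketched as a small shell function. This is not fissile's actual validation code: the name `deployment_type` is hypothetical, and rejecting unknown tags is an assumption about what "unsupported" might mean. The stop-on-failure tag is accepted and leaves the result unchanged, since the table does not cover it.

```shell
# deployment_type is a hypothetical helper mirroring the table above.
# Usage: deployment_type [TAG...]
deployment_type() {
    policy="Parallel"
    for tag in "$@"; do
        case "$tag" in
            sequential-startup) policy="OrderedReady" ;;
            active-passive)     ;;  # affects the service selector, not the policy
            stop-on-failure)    ;;  # not covered by the table; left unchanged here
            *)
                # Assumption: any other tag is treated as an error.
                echo "unknown tag: $tag" >&2
                return 1
                ;;
        esac
    done
    printf '%s StatefulSet\n' "$policy"
}
```

Calling `deployment_type active-passive sequential-startup` prints `OrderedReady StatefulSet`, matching the last row of the table.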
The service is created with the proper skiff-role-name selector:
apiVersion: v1
kind: Service
metadata:
  name: routing-api
  namespace: cf
spec:
  clusterIP: 10.254.115.98
  ports:
  - name: routing-api
    port: 3000
    protocol: TCP
    targetPort: routing-api
  selector:
    skiff-role-name: routing-api
  sessionAffinity: None
  type: ClusterIP
status:
  loadBalancer: {}
The default readiness probe must check if all monit processes are up and running. Sample implementation:
# Succeed only if no monit "Process" line is in a state other than running.
if [ -z "$(monit summary | grep Process | grep -v running)" ]
then
    exit 0
else
    exit 1
fi
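The same check can be written as a function that reads the summary from standard input, which makes it possible to try the logic without a running monit. This is a sketch, not the probe fissile ships; the function name `all_processes_running` is hypothetical, and it assumes each process appears on a line containing the word `Process`, as the sample above does.

```shell
# all_processes_running is a hypothetical helper.
# It reads `monit summary` output on stdin and succeeds only when
# every "Process" line also contains the word "running".
all_processes_running() {
    if grep Process | grep -v running >/dev/null; then
        # At least one process line is not in the running state.
        return 1
    fi
    return 0
}

# Intended use inside a pod:
#   monit summary | all_processes_running
```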
The default readiness probe cannot be changed.
It should be built into images by fissile, at /opt/fissile/readiness-probe.sh. The readiness probe for roles should then look something like:
readinessProbe:
  exec:
    command:
    - /opt/fissile/readiness-probe.sh
  initialDelaySeconds: 5
  periodSeconds: 5
When the active-passive tag is present, the load balancing service has an extra selector for skiff-role-active:
selector:
  skiff-role-active: "true"
  skiff-role-name: routing-api
Important: by default, all pods that have an active-passive tag are started with skiff-role-active: "false".
Active/Passive roles also need to specify an extra script for checking whether the role is active. Example:
- name: routing-api
  tags: [active-passive]
  jobs:
  - name: routing-api
    release_name: routing
  processes:
  - name: metron_agent
  - name: routing-api
  run:
    active-passive-probe: '/var/vcap/jobs/routing-api-check/bin/routing-api-ready.sh'
The /var/vcap/jobs/routing-api-check/bin/routing-api-ready.sh script needs to be rendered as part of a BOSH job.
The script is embedded by fissile within the default readiness probe.
For the above example, /opt/fissile/readiness-probe.sh would look like this:
# Run the active/passive probe and record the result as a pod label.
if /var/vcap/jobs/routing-api-check/bin/routing-api-ready.sh; then
    kubectl label --overwrite pods "$(hostname)" --namespace "$KUBERNETES_NAMESPACE" skiff-role-active=true
else
    kubectl label --overwrite pods "$(hostname)" --namespace "$KUBERNETES_NAMESPACE" skiff-role-active=false
fi
# Default readiness check: every monit process must be running.
if [ -z "$(monit summary | grep Process | grep -v running)" ]
then
    exit 0
else
    exit 1
fi
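The labeling step can be isolated into a function so that it can be dry-run outside a cluster. The following is a sketch under stated assumptions, not the script fissile generates: `update_active_label`, the `KUBECTL` override and the `POD_NAME` override are all hypothetical names introduced here for illustration.

```shell
# KUBECTL is a hypothetical override; set it to "echo" for a dry run.
KUBECTL="${KUBECTL:-kubectl}"

# update_active_label is a hypothetical helper.
# Usage: update_active_label PROBE_COMMAND
# Runs the probe and sets skiff-role-active on this pod accordingly.
update_active_label() {
    if "$1"; then
        active=true
    else
        active=false
    fi
    # POD_NAME is a hypothetical override; by default the hostname is used,
    # as in the script above.
    $KUBECTL label --overwrite pods "${POD_NAME:-$(hostname)}" \
        --namespace "$KUBERNETES_NAMESPACE" "skiff-role-active=$active"
}

# Intended use inside a pod:
#   update_active_label /var/vcap/jobs/routing-api-check/bin/routing-api-ready.sh
```

With `KUBECTL=echo`, the function prints the `label` command it would have run instead of contacting the API server.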
Important: stemcells need to include the kubectl CLI, to facilitate the active-passive-probe.sh script.
The general rules:
- if something needs locket, it will be active/passive
- if something connects to a DB, it should start sequentially since at some point it might need to run migrations
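The two rules above can be written down as a small helper. This is only an illustration of the reasoning; `suggest_tags` is a hypothetical name, and the yes/no arguments are an assumed way of describing a role.

```shell
# suggest_tags is a hypothetical helper encoding the two general rules.
# Usage: suggest_tags NEEDS_LOCKET CONNECTS_TO_DB   (each "yes" or "no")
suggest_tags() {
    tags=""
    if [ "$1" = "yes" ]; then
        # Needs locket: the role is active/passive.
        tags="active-passive"
    fi
    if [ "$2" = "yes" ]; then
        # Connects to a DB: start sequentially in case migrations run.
        tags="${tags:+$tags,}sequential-startup"
    fi
    printf '%s\n' "$tags"
}
```

For a role that uses locket and a database, `suggest_tags yes yes` prints `active-passive,sequential-startup`.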
Note: services are referenced as both <role>.((KUBE_SERVICE_DOMAIN_SUFFIX)) and <role>.((KUBERNETES_NAMESPACE)). This is inconsistent. We should only be using <role>.((KUBE_SERVICE_DOMAIN_SUFFIX)).
role | tags |
---|---|
syslog-scheduler | |
adapter | |
consul | |
nats | (service needed for other roles to communicate with nats) |
mysql | sequential-startup |
postgres | sequential-startup |
cf-usb | sequential-startup (connects to a DB) |
diego-api | active-passive, sequential-startup (uses locket to grab a lock, connects to a DB) |
router | |
tcp-router | |
routing-api | active-passive, sequential-startup (uses locket to grab a lock, connects to a DB and performs migrations) |
api | sequential-startup (connects to a DB, only the replica with spec.bootstrap == true runs migrations) |
cc-worker | |
blobstore | |
cc-clock | |
doppler | |
log-api | |
diego-brain | active-passive (uses locket to grab a lock) |
cc-uploader | |
diego-access | |
nfs-broker | sequential-startup (connects to a DB) |
diego-cell | |
acceptance-tests-brain | stop-on-failure |
acceptance-tests | stop-on-failure |
smoke-tests | stop-on-failure |
secret-generation | |
post-deployment-setup | |
credhub-user | |
autoscaler-postgres | |
autoscaler-api | |
autoscaler-metrics | |
autoscaler-actors | |
autoscaler-smoke | stop-on-failure |