Support zone awareness #203

rverma-dev · 2021-08-25T11:00:00Z

To support zone awareness we need to create different ingesters configurations as many zones required. Currently this is not possible through the helm install.

nschad · 2021-08-25T11:49:59Z

Hi @rverma-nsl,

Thank you for reporting this issue. Can you elaborate what you are missing from the helm-chart?

rverma-dev · 2021-09-10T08:19:35Z

If you look at the ingester_config as provided here https://cortexmetrics.io/docs/configuration/configuration-file/#ingester_config, you will observe that for a zone awareness, currently we need to provide the az per ingester. It meant that we actually need to have 2 differenet config file and two different deployments of ingesters. I am not sure if this can be handled any other way.

nschad · 2021-09-10T08:25:38Z

If you look at the ingester_config as provided here https://cortexmetrics.io/docs/configuration/configuration-file/#ingester_config, you will observe that for a zone awareness, currently we need to provide the az per ingester. It meant that we actually need to have 2 differenet config file and two different deployments of ingesters. I am not sure if this can be handled any other way.

Yeah to be honest that sounds reasonable to me. With helm you could also have multi values.yml. So you could have one file for the general config and then two or more files with very minimal config which is basically just the cortex AZ config.

helm install cortex-az1 cortex-helm/cortex -f myconfig.yaml -f az1.yaml
helm install cortex-az2 cortex-helm/cortex -f myconfig.yaml -f az2.yaml

Otherwise I'm not sure how the cortex-helm-chart is supposed to support this and I also don't think it makes sense

danfromtitan · 2022-02-02T16:21:14Z

The bigger issue here is making sure the pods end-up in the zone they are configured for and fortunately the chart supports a nodeSeletor label.

Following on the example above, the problem starts when you create different deployments (or statefulsets for what matters) for each zone: they have to have different names, which the chart doesn't support. Say you add support for deployment names override, then you run into the problem with the service name and headless service too: you don't want those to have different names. Now you have to provide support for service to take independent names from the names of the deployments (currently they all take their name from cortex.fullname helper).

This problem is not only limited to ingesters, store gateway and distributors fall in the same category: different deployments are needed in each availability zone. In the end I don't like this approach, it works against K8s' pod scheduler.

Without making any change whatsoever, in the current version of the chart the PODs are evenly distributed across AZ's, which is great. The only thing that's left is to insert the AZ name in the Cortex config. I'm thinking an init container (which the chart supports adding) could do that.

nschad · 2022-02-02T17:21:27Z

The only thing that's left is to insert the AZ name in the Cortex config. I'm thinking an init container (which the chart supports adding) could do that.

I'm open to ideas if you wanna give that a go.

nschad · 2022-02-02T23:38:18Z

@danfromtitan

they have to have different names, which the chart doesn't support

I don't really understand that point. When you use a different helm release name the name also changes also with Kubernetes namespaces I don't see any Issue. I never deployed multiple az so maybe I'm missing something.

danfromtitan · 2022-02-03T00:23:21Z

When you use a different helm release name the name also changes also with Kubernetes namespaces I don't see any Issue

You'd want a single helm deployment release, otherwise you'd end-up with multiple querier and distributor services for the same Cortex instance.

nschad · 2022-02-03T07:36:32Z

When you use a different helm release name the name also changes also with Kubernetes namespaces I don't see any Issue

You'd want a single helm deployment release, otherwise you'd end-up with multiple querier and distributor services for the same Cortex instance.

Ah I see. I feel like that it could be in somebody best interest in to have multiple "sets". But for memberlist you would have to configure the memberlist service to pick up multiple azs (the selector currently is quite strict). I don't think you can do that as of now without manually changing stuff. Either way it seems that this feature is not really supported by the helm-chart. I'm open to any suggestion

danfromtitan · 2022-02-03T17:44:17Z

Yeah, memberlist is another one, I use etcd so I don't have that problem.

The idea is to create a dynamic config file in Cortex and add the zone awareness configuration to it from an init container that has get access to kube API pods and nodes. That would require adding a cluster role with read access to pods and nodes in core API group and binding it with the cortex service account (similar to how was done for ruler sidecar).

The init container then would:

get the node name where the pod's running
get the AZ name from the node's topology zone label
update availability zone in config file

For init container to write a file, the config file would have to be a file on an emptyDir mount. Currently that comes from a read-only secret or configmap. So we need the ability in the chart to override the config filename, something like:

{{- define "cortex.configFileName" -}}
{{- default "/etc/cortex/cortex.yaml" .Values.configFileName -}}
{{- end -}}

The init container would copy the read-ony config that comes from configmap and replace the AZ name. Thoughts ?

nschad · 2022-02-03T18:20:27Z

Yeah, memberlist is another one, I use etcd so I don't have that problem.

The idea is to create a dynamic config file in Cortex and add the zone awareness configuration to it from an init container that has get access to kube API pods and nodes. That would require adding a cluster role with read access to pods and nodes in core API group and binding it with the cortex service account (similar to how was done for ruler sidecar).

The init container then would:

get the node name where the pod's running

get the AZ name from the node's topology zone label

update availability zone in config file

For init container to write a file, the config file would have to be a file on an emptyDir mount. Currently that comes from a read-only secret or configmap. So we need the ability in the chart to override the config filename, something like:
{{- define "cortex.configFileName" -}}
{{- default "/etc/cortex/cortex.yaml" .Values.configFileName -}}
{{- end -}}
The init container would copy the read-ony config that comes from configmap and replace the AZ name. Thoughts ?

I like the approach of parsing topology.kubernetes.io/zone of the current node running the pod. However the config file is kinda hacky and the config file would still be the old in the secret/configmap. I not even sure if those overriden changes in the init-container persist through the non-init container.

Wait..What are you trying to do exactly here with the configmap? Are you trying to mount the non-az-config and then alter and save that as az-aware-config for use later with the cortex components?

Do you have some thought's on this @kd7lxl?

kd7lxl · 2022-02-03T18:31:14Z

I think a mutating admission controller might be a better way to inject the zone name into the cortex container arguments or environment. Surely there is an existing admission controller that is capable of injecting a node label into a pod environment, like https://github.com/Collaborne/kubernetes-aws-metadata-service but ideally one that is still maintained. Otherwise, we can write one. Then we can configure Cortex to read that env var.

Related reading: elastic/cloud-on-k8s#3933 (comment)

nschad · 2022-02-03T18:42:02Z

mutating admission controller

~~Okay interesting so we basically watch for resource "node"~~ (didn't know about that binding thing, even better) and then call our webhook which will then read topology.kubernetes.io/zone of that node and then update our cortex ingesters by setting an ENV variable? Sounds cool.

Update: we also would have to support a way on how to configure memberlist with the 2nd and 3rd zone

nschad · 2022-02-07T18:41:36Z

Do you wanna give that a go? @danfromtitan

Would be great if we could just use something that already exists

danfromtitan · 2022-02-07T19:13:49Z

Do you wanna give that a go

I'm looking into it. Do you want the webhook config/deployment added to the chart with a zone_aware_enabled flag around it ?

Would be great if we could just use something that already exists

There ain't any that does exactly what we want here, but I can change an existing one. Either way, the admission controller image is asking for a separate project: how do you want to go about that ?

There is an option to keep everything outside this chart and deploy the admission controller in a separate namespace.

nschad · 2022-02-07T19:25:26Z

I'm looking into it. Do you want the webhook config/deployment added to the chart with a zone_aware_enabled flag around it ?

👍 sounds good.

There ain't any that does exactly what we want here, but I can change an existing one. Either way, the admission controller image is asking for a separate project: how do you want to go about that ?

Yeah I feared this might be the case. I guess I could ask if we could get a seperate repo in the cortexproject space. And then we could use github actions to build/deploy the docker image and maybe even github registry.

There is an option to keep everything outside this chart and deploy the admission controller in a separate namespace.

I'm all ears. I mean we could do our part here and maybe have a guide that references the controller that you would have to setup yourself?

danfromtitan · 2022-02-07T19:32:34Z

I mean we could do our part here and maybe have a guide that references the controller that you would have to setup yourself

That'd be the cleanest approach, the chart project is already overloaded (ruler sidecar is an example that comes to mind).

danfromtitan · 2022-02-08T00:16:18Z

@kd7lxl - there is one fundamental issue with the idea of using an admission controller: at the time the API request makes it through the controller, there is no decision yet by the scheduler about the worker node where the pod would be deployed, meaning the AZ cannot be determined at that time.

It seems to me an init container is the only way to determine the AZ, because by the time init container is started, the pod has been scheduled to a given node. Without going into nasty hardwiring, I can't think of any other way to tell a cortex container the AZ where it's running.

nschad · 2022-02-08T00:39:36Z

@kd7lxl - there is one fundamental issue with the idea of using an admission controller: at the time the API request makes it through the controller, there is no decision yet by the scheduler about the worker node where the pod would be deployed, meaning the AZ cannot be determined at that time.

The pod/binding event looked promising. Have you tried that?

It seems to me an init container is the only way to determine the AZ, because by the time init container is started, the pod has been scheduled to a given node. Without going into nasty hardwiring, I can't think of any other way to tell a cortex container the AZ where it's running.

That surely exists right? Or would we also have to build that ourselves.

kd7lxl · 2022-02-08T01:11:31Z

@kd7lxl - there is one fundamental issue with the idea of using an admission controller: at the time the API request makes it through the controller, there is no decision yet by the scheduler about the worker node where the pod would be deployed, meaning the AZ cannot be determined at that time.

For an admission controller, I think the sequence would have to be like this:

Receive Pod CREATE AdmissionReview
Patch pod with custom envFrom ConfigMap that does not yet exist
Pod gets scheduled, but will be blocked from starting until the ConfigMap exists
Receive ~~Pod UPDATE~~ Binding AdmissionReview (triggered by the scheduling event that sets the pod's .nodeName)
Use the k8s API to get the topology label value of the node.
Store the topology value in the ConfigMap
Pod starts with envFrom ConfigMap

I haven't tested this to confirm the sequence and events all work. Maybe I missed something.

It seems to me an init container is the only way to determine the AZ, because by the time init container is started, the pod has been scheduled to a given node. Without going into nasty hardwiring, I can't think of any other way to tell a cortex container the AZ where it's running.

initContainer would work too, but you still have to come up with a way to get the result of the initContainer into the environment of the primary container. I really don't like having to use wrapper scripts to replace a container's entrypoint, nor do I really want to give the initContainer permission to create ConfigMaps.

The reason I like the admission controller approach is that the only thing that needs permission to mutate pods and create config maps is the admission controller, not the app itself.

This is a commonly requested feature: kubernetes/kubernetes#62078 (and others) If we can provide a good solution I'm sure many people will appreciate it.

rverma-dev · 2022-02-08T01:38:19Z

We can also use https://kubernetes.io/docs/concepts/workloads/pods/pod-topology-spread-constraints/ to ensure that pods are spread across as and then use the configmap with envfrom annotation value which would be correctly assigned.

danfromtitan · 2022-02-08T03:47:21Z

I'm not sure the API service would run the resource admission twice through the admission controller between steps 3 and 4 like you've described.

In fact what I see happening at step 3 is the POD resource is not blocked from starting, but it's rather created and sits there looping with an error "configmap not found". Once the error condition is cleared by creating the configmap, there is no UPDATE admission request send to the controller, the POD rather follows through and it completes the start-up sequence. I tried this twice, with a standalone POD and pods from a deployment, same outcome.

An UPDATE operation could be initiated from an outside request (i.e. helm chart update or kubectl apply) but not by having the admission controller sit in a loop.

In the subsequent UPDATE operation I can see indeed the nodeName making it in the admission request, but that is the node where POD runs at the time before update would get applied. I guess after update the scheduler might move the POD to a different node (i.e. let's say because node capacity has been changed by unrelated PODs), so we still won't know where the POD lands next.

Providing this would somehow work, there is still an open question of what happens with legit deployment updates: the trick with missing configmap won't work a second time since now the resource already exists.

By contrast an initContainer is a bit simpler and it's guaranteed to work all the time since it runs in the same POD:

create a cluster role with read access to pods and nodes in core API group
bind the cluster role with cortex service account
add an init container that makes curl requests to kube API to find the node where the pod's running and get the node topology
in the same init container, copy the read-only cortex configuration coming from a secret/configmap to a read-write emptyDir volume
init container updates the zone name in place and terminates. In-memory volume with dynamic configuration remains.
cortex container starts and loads the configuration from the volatile volume

The only update that's needed in the chart is to change the hardcoded configuration file in the cortex container arguments with a parameter value, i.e.

args:
            - "-target=compactor"
            - "-config.file={{ include "cortex.configFileName" . }}"

where

{{- define "cortex.configFileName" -}}
{{- default "/etc/cortex/cortex.yaml" .Values.configFileName -}}
{{- end -}}

danfromtitan · 2022-02-08T04:10:38Z

use pod-topology-spread-constraints to ensure that pods are spread across

There is no problem spreading PODs evenly across AZ's with the current anti-affinity rules in the chart. The issue is telling cortex distributors/queriers about where the ingesters run, just so time series could be replicated across different AZ's and not being lost in case of a zone failure.

Having read through that part in the Cortex code that implements the writing across AZ's, I'm not even sure zone awareness is worth the effort of implementing it: the distributors would merely attempt a fair distribution of times series within the replication factor configured and won't make any attempt to keep the traffic in the same AZ and save on the cross-zone data transfer cost.

danfromtitan · 2022-02-08T05:19:55Z

The pod/binding event looked promising. Have you tried that?

Couldn't find any "pod/binding" resource in the API reference. The implementation of binding a pod to machine doesn't look like it leaves room for catching/updating the request in the outside admission controler.

nschad · 2022-02-08T07:57:09Z

The pod/binding event looked promising. Have you tried that?

Couldn't find any "pod/binding" resource in the API reference. The implementation of binding a pod to machine doesn't look like it leaves room for catching/updating the request in the outside admission controler.

elastic/cloud-on-k8s#3933 (comment)

kd7lxl · 2022-02-08T16:49:55Z

The pod/binding event looked promising. Have you tried that?

Couldn't find any "pod/binding" resource in the API reference. The implementation of binding a pod to machine doesn't look like it leaves room for catching/updating the request in the outside admission controler.

It's called pods/binding and the event looked like this when I tested it:

apiVersion: admissionregistration.k8s.io/v1
kind: MutatingWebhookConfiguration
...
webhooks:
...
  rules:
  - apiGroups:
    - ""
    apiVersions:
    - v1
    operations:
    - CREATE
    resources:
    - pods
    - pods/binding
    scope: '*'

{
  "kind": "AdmissionReview",
  "apiVersion": "admission.k8s.io/v1beta1",
  "request": {
    "uid": "645135a1-48c9-4550-8ca0-9e5d1343c3d5",
    "kind": {
      "group": "",
      "version": "v1",
      "kind": "Binding"
    },
    "resource": {
      "group": "",
      "version": "v1",
      "resource": "pods"
    },
    "subResource": "binding",
    "requestKind": {
      "group": "",
      "version": "v1",
      "kind": "Binding"
    },
    "requestResource": {
      "group": "",
      "version": "v1",
      "resource": "pods"
    },
    "requestSubResource": "binding",
    "name": "nginx",
    "namespace": "default",
    "operation": "CREATE",
    "userInfo": {
      "username": "system:kube-scheduler",
      "groups": [
        "system:authenticated"
      ]
    },
    "object": {
      "kind": "Binding",
      "apiVersion": "v1",
      "metadata": {
        "name": "nginx",
        "namespace": "default",
        "uid": "75fc040e-c49e-4418-8525-674c4deeabbc",
        "creationTimestamp": null,
        "managedFields": [
          {
            "manager": "kube-scheduler",
            "operation": "Update",
            "apiVersion": "v1",
            "time": "2022-02-08T03:16:50Z",
            "fieldsType": "FieldsV1",
            "fieldsV1": {
              "f:target": {
                "f:kind": {},
                "f:name": {}
              }
            }
          }
        ]
      },
      "target": {
        "kind": "Node",
        "name": "kind-control-plane"
      }
    },
    "oldObject": null,
    "dryRun": false,
    "options": {
      "kind": "CreateOptions",
      "apiVersion": "meta.k8s.io/v1"
    }
  }
}

No mutation to this resource is required -- the artifact of this event would be creating a ConfigMap.

danfromtitan · 2022-02-10T18:19:33Z

@kd7lxl , there are couple of issues with putting the env vars in a config map:

different Zone values ask for multiple config maps
config maps need to be cleaned-up after pods are deleted

Instead of configmaps I was going to explore the idea of adding container env vars from the binding admission request.

Do you know whether the mutation is possible from the binding event ? I'm seeing the error below while the pod sits in pending and I'm not sure if mutation is not allowed or there is another reason.

 Warning  FailedScheduling  65s (x2 over 67s)  default-scheduler  binding rejected: running Bind plugin "DefaultBinder": Internal error occurred: add operation does not apply

danfromtitan · 2022-02-10T20:02:19Z

NVM, I replied my own question: the mutate admission applies to the object kind in the request, in this case "kind": "Binding", so it would not patch the pod here.

kd7lxl · 2022-02-10T20:11:30Z

I agree that configmaps are not needed. Labeling the pod works just fine -- no need to create another resource.

I wrote a little proof of concept here: https://gist.github.com/kd7lxl/1997a4d162bd49d4dbdb8f69506a1e01. It does the whole process with initContainers, but I've annotated it with which concerns I'd like to move to an admission controller to be able to reduce the permissions granted to the pod.

I would not return a mutation/patch from the pods/binding AdmissionReview (pretty sure that wouldn't work), rather just use the event to trigger a labeling operation on the pod with a client. It's just slightly simpler than watching all pod events.

nschad · 2022-02-10T23:38:24Z

I had a peak at some admission webhook examples and the whole setup can be a bit tricky, especially since you need a proper certificate for your webhook server. Which we can automate with stuff like this: https://github.com/newrelic/k8s-webhook-cert-manager (maybe not exactly this) but this sure as hell won't make things easier. So I'm asking myself at what point does this become an entire software suite :D

rverma-dev · 2022-02-11T01:29:08Z

A noob question, Can't we use istio, they have zone aware routing? And reduce the zone awareness functionalities from cortex? Similar functionalities are there in cilium too.

nschad · 2022-02-11T01:40:56Z

A noob question, Can't we use istio, they have zone aware routing? And reduce the zone awareness functionalities from cortex? Similar functionalities are there in cilium too.

Maybe I wouldn't advise it but that is definitely something you will have to try on your own :)

danfromtitan · 2022-02-26T05:56:16Z

Fully functional admission controller that inserts env vars from node labels.

There are no changes required for the helm chart. To enable zone awareness:

follow the instructions to build and deploy the admission controller (making sure the webhook config allows requests from the namespace where you deploy Cortex)
enable zone awareness in Cortex chart's values under config section, like such:

config:
  ingester:
    lifecycler:
      ring:
[...]
        zone_awareness_enabled: true
      availability_zone: ${NODE_TOPOLOGY_KUBERNETES_IO_ZONE}
  store_gateway:
    sharding_ring:
[...]
      zone_awareness_enabled: true
      instance_availability_zone: ${NODE_TOPOLOGY_KUBERNETES_IO_ZONE}

tell Cortex to resolve env vars in config file, like such:

ingester: 
  extraArgs:
    config.expand-env: true
[...]
store_gateway:
  extraArgs:
    config.expand-env: true

If there are no outstanding questions, this issue can be closed.

oujonny · 2022-07-12T09:44:24Z

Just read this interesting article with fits into this disscussion.
https://gmaslowski.com/kubernetes-node-label-to-pod/

nschad · 2022-07-12T20:14:03Z

@danfromtitan I must have overlooked this. Thank you for your implementation of the custom webhook. Are you willing to write-up a small Guide in our docs/ folder for setting it up? DEV Guide

danfromtitan · 2022-07-13T02:20:06Z

Yeah, can do.

locmai · 2024-04-11T08:44:07Z

Hi folks, to confirm my understanding, we have a couple of options to make this work:

Using the mutating webhook configuration to propagate the node labels down to the pods (since the Kubernetes downward API doesn't support this)
- This requires to deploy an extra webhook - either we consider to add it as supported in the Cortex chart OR advice users to do it themselves via DEV guide.
- Is it a little hacky as in the Kubernetes downward API doesn't support this for reasons?
Make a map of zones and generate the statefulset per zone and have it set right there? (similar to Mimir chart) so instead of {{ include "cortex.ingesterFullname" . }} we can have {{ include "cortex.ingesterFullname" . }}-zone-0 , {{ include "cortex.ingesterFullname" . }}-zone-1, and {{ include "cortex.ingesterFullname" . }}-zone-2 - it's not as dynamic as the node label way.
An extra one is to have Cortex to be able to detect the topology.kubernetes.io/zone label from the node itself (have a client implemented to get the node label).

I think I prefer the option 2 if we wanted to support this in the Helm chart. Will test out the option 1 as well - though we (as a user) usually don't want to add an additional component (with ++ CVE concerns) so the Helm way is more preferred in our case.

nschad closed this as completed Sep 14, 2021

nschad reopened this Feb 3, 2022

nschad added the help wanted Extra attention is needed label Feb 7, 2022

nschad added the keepalive Keeps Issue's from beeing closed label Jul 12, 2022

Support zone awareness #203

Support zone awareness #203

Comments

rverma-dev commented Aug 25, 2021

nschad commented Aug 25, 2021

rverma-dev commented Sep 10, 2021

nschad commented Sep 10, 2021

danfromtitan commented Feb 2, 2022

nschad commented Feb 2, 2022

nschad commented Feb 2, 2022

danfromtitan commented Feb 3, 2022 • edited Loading

nschad commented Feb 3, 2022

danfromtitan commented Feb 3, 2022

nschad commented Feb 3, 2022 • edited Loading

kd7lxl commented Feb 3, 2022 • edited Loading

nschad commented Feb 3, 2022 • edited Loading

nschad commented Feb 7, 2022 • edited Loading

danfromtitan commented Feb 7, 2022 • edited Loading

nschad commented Feb 7, 2022

danfromtitan commented Feb 7, 2022

danfromtitan commented Feb 8, 2022

nschad commented Feb 8, 2022

kd7lxl commented Feb 8, 2022 • edited Loading

rverma-dev commented Feb 8, 2022

danfromtitan commented Feb 8, 2022

danfromtitan commented Feb 8, 2022

danfromtitan commented Feb 8, 2022

nschad commented Feb 8, 2022

kd7lxl commented Feb 8, 2022

danfromtitan commented Feb 10, 2022

danfromtitan commented Feb 10, 2022

kd7lxl commented Feb 10, 2022 • edited Loading

nschad commented Feb 10, 2022

rverma-dev commented Feb 11, 2022

nschad commented Feb 11, 2022

danfromtitan commented Feb 26, 2022 • edited Loading

oujonny commented Jul 12, 2022

nschad commented Jul 12, 2022

danfromtitan commented Jul 13, 2022

locmai commented Apr 11, 2024

danfromtitan commented Feb 3, 2022 •

edited

Loading

nschad commented Feb 3, 2022 •

edited

Loading

kd7lxl commented Feb 3, 2022 •

edited

Loading

nschad commented Feb 3, 2022 •

edited

Loading

nschad commented Feb 7, 2022 •

edited

Loading

danfromtitan commented Feb 7, 2022 •

edited

Loading

kd7lxl commented Feb 8, 2022 •

edited

Loading

kd7lxl commented Feb 10, 2022 •

edited

Loading

danfromtitan commented Feb 26, 2022 •

edited

Loading