
Isolate the operator from misconfiguration of HTTP service #4394

Closed
pebrc opened this issue Mar 30, 2021 · 6 comments · Fixed by #5211
Assignees
naemono
Labels
discuss (We need to figure this out) · >enhancement (Enhancement of existing functionality)

Comments

@pebrc
Collaborator

pebrc commented Mar 30, 2021

Related #3732 and #2572

Currently it is possible for users to misconfigure the default HTTP service in such a way that the operator's ability to roll out further configuration changes is impeded.

Consider:

...
spec:
  http:
    service:
      spec:
        selector:
          app: coordinating-node
...
  - config:
       node.roles: []
    count: 1
    name: coordinating-node
    podTemplate:
      metadata:
        labels:
          app: client-node

There are two ways for users to fix this misconfiguration: a) change the service selector to client-node, or b) change the app label on the nodeSet from client-node to coordinating-node. Approach a) would work, while b) would not.

Once the user has rolled out the change with the incorrect selector in the service, the operator has only limited ability to roll out further changes, because it has lost connectivity with Elasticsearch. In such a situation any rolling upgrade of existing nodes is effectively blocked, while new nodes can still be added to the cluster and updates to the service itself would still be rolled out. This is because those operations don't require API access to Elasticsearch, whereas a rolling upgrade or a downscale does.

We could consider isolating the operator from this class of configuration errors by either not relying on the service (talking to Pods directly, as we already do when we switch from HTTP to HTTPS) or by creating an internal service that users cannot configure and that is used by the operator.

@pebrc pebrc added the discuss We need to figure this out label Mar 30, 2021
@botelastic botelastic bot added the triage label Mar 30, 2021
@pebrc pebrc added the >enhancement Enhancement of existing functionality label Mar 30, 2021
@botelastic botelastic bot removed the triage label Mar 30, 2021
@pebrc
Collaborator Author

pebrc commented Apr 1, 2021

@sebgl suggested routing requests to individual Pods directly, as we are already doing when we switch protocols (TLS/non-TLS). Instead of this manual routing being the exception, we could make it the default and sidestep accidental misconfiguration this way.
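
For illustration, routing to an individual Pod could look roughly like the sketch below; the helper name, the port and the way the scheme is chosen are placeholders for this sketch, not the operator's actual code.

package example

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// podURL builds a URL for talking to a single Elasticsearch Pod directly,
// bypassing the user-configurable Service. The port and scheme handling are
// assumptions here; the operator would derive both from the Elasticsearch spec.
func podURL(pod corev1.Pod, useTLS bool) string {
    scheme := "http"
    if useTLS {
        scheme = "https"
    }
    // Pod IPs come straight from the Pod's Status and are routable in-cluster.
    return fmt.Sprintf("%s://%s:9200", scheme, pod.Status.PodIP)
}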

@naemono naemono self-assigned this Oct 27, 2021
@naemono
Contributor

naemono commented Oct 27, 2021

A suggestion on which of these two options to take:

  1. Speak to pods directly
  2. Create an internal service managed by the operator

I'd like to suggest an internal service managed by the operator, for the following reasons:

  1. Speaking to pods directly, we would be taking on some of the burden that the Service itself carries, such as health checking. If we speak to pods directly, we'd have to look at the readiness check of each pod to see whether it's healthy before speaking to it. Pods are already moved in and out of a Service according to their health/readiness checks.
  2. Reduce load on the apiserver: every time we need to speak to a set of ES nodes (master or data), we'd need to query Kubernetes for the active pods that match the labels. While not a large increase in load, it would be unnecessary if we maintain our own internal service, or a set of services per node type.

This still doesn't remove the possibility of someone doing bad things to the internal service, but it seems to alleviate the initial concern of a customer accidentally misconfiguring the operator into a situation it can't get out of.
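
For illustration, a rough sketch of what such an operator-managed internal Service could look like; the name suffix, label key and port are placeholders for this sketch, not a final design.

package example

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// internalHTTPService returns a Service whose selector is built exclusively
// from operator-owned labels, so a user-supplied spec.http.service can never
// redirect the traffic the operator itself depends on.
func internalHTTPService(esName, namespace string) *corev1.Service {
    return &corev1.Service{
        ObjectMeta: metav1.ObjectMeta{
            Name:      esName + "-es-internal-http", // placeholder naming
            Namespace: namespace,
        },
        Spec: corev1.ServiceSpec{
            Selector: map[string]string{
                // operator-owned label, never taken from spec.http.service
                "elasticsearch.k8s.elastic.co/cluster-name": esName,
            },
            Ports: []corev1.ServicePort{
                {Name: "https", Port: 9200, TargetPort: intstr.FromInt(9200)},
            },
        },
    }
}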

Thoughts?

@pebrc
Collaborator Author

pebrc commented Oct 28, 2021

Speaking to pods directly, we would be taking on some of the burden that the Service itself carries, such as health checking. If we speak to pods directly, we'd have to look at the readiness check of each pod to see whether it's healthy before speaking to it.

I think that would be a fairly straightforward inspection of the Pod's Status subresource, wouldn't it?
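
Something along these lines, as a sketch (not existing operator code):

package example

import corev1 "k8s.io/api/core/v1"

// podIsReady reports whether the Pod's Ready condition is True, based solely
// on the Pod's Status, which the operator already has in its cache.
func podIsReady(pod corev1.Pod) bool {
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodReady {
            return cond.Status == corev1.ConditionTrue
        }
    }
    return false
}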

Reduce load on the apiserver: every time we need to speak to a set of ES nodes (master or data), we'd need to query Kubernetes for the active pods that match the labels.

I am not sure I follow that line of argument. As we are using a cached client and we access Pods in other places in the operator, we are already incurring the cost of keeping that cache in sync. The additional listing of Pods for the purpose of finding one we can talk to would therefore be essentially "free".
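
To make that concrete, a sketch of what listing candidate Pods through the cached client could look like; the label key is a placeholder and podIsReady refers to the sketch above.

package example

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// reachablePods lists the cluster's Pods through the (cached) client and keeps
// only those whose Ready condition is True. The label key is illustrative;
// the operator would use whatever labels it already owns on these Pods.
func reachablePods(ctx context.Context, c client.Client, namespace, esName string) ([]corev1.Pod, error) {
    var pods corev1.PodList
    if err := c.List(ctx, &pods,
        client.InNamespace(namespace),
        client.MatchingLabels{"elasticsearch.k8s.elastic.co/cluster-name": esName},
    ); err != nil {
        return nil, err
    }
    var ready []corev1.Pod
    for _, pod := range pods.Items {
        if podIsReady(pod) { // see the sketch above
            ready = append(ready, pod)
        }
    }
    return ready, nil
}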

@naemono
Contributor

naemono commented Oct 28, 2021

I hadn't examined where we interact with pods already in the code base. Let me take a look and get back to you on this.

@naemono
Contributor

naemono commented Oct 28, 2021

Ok, I believe I found where you're referring to. Is this the section https://github.com/elastic/cloud-on-k8s/blob/master/pkg/controller/elasticsearch/services/services.go#L149-L155 ?

If so, yeah, I see how just leveraging that would make the most sense. I don't see any filtering of ready pods though. Is that something we should include during this change?

@naemono
Contributor

naemono commented Oct 28, 2021

Nevermind on the filtering piece, I just ran into this https://github.com/elastic/cloud-on-k8s/blob/master/pkg/controller/elasticsearch/driver/driver.go#L312. 👍
