
Isolate the operator from misconfiguration of HTTP service #4394

Closed
pebrc opened this issue Mar 30, 2021 · 6 comments · Fixed by #5211
Assignees
naemono
Labels
discuss (We need to figure this out) · >enhancement (Enhancement of existing functionality)

Comments

@pebrc
Collaborator

pebrc commented Mar 30, 2021

Related #3732 and #2572

Currently it is possible for users to misconfigure the default HTTP service in such a way that the operator's ability to roll out further configuration changes is impeded.

Consider:

...
spec:
  http:
    service:
      spec:
        selector:
          app: coordinating-node
...
  - config:
       node.roles: []
    count: 1
    name: coordinating-node
    podTemplate:
      metadata:
        labels:
          app: client-node

There are two ways for users to fix this misconfiguration: a) change the service selector to client-node, or b) change the app label on the nodeSet from client-node to coordinating-node. Approach a) would work, while b) would not.

Once the user has rolled out the change with the incorrect selector in the service, the operator has only limited ability to roll out further changes, because it has lost connectivity with Elasticsearch. In such a situation any rolling upgrade of existing nodes is effectively blocked, while new nodes can still be added to the cluster and updates to the service itself would still be rolled out. This is because those operations don't require API access to Elasticsearch, whereas a rolling upgrade or a downscale does.

We could consider isolating the operator from this class of configuration errors by either not relying on the service (talking to Pods directly, as we already do when we switch from HTTP to HTTPS) or by creating an internal service that users cannot configure and that is used by the operator.

@pebrc pebrc added the discuss We need to figure this out label Mar 30, 2021
@botelastic botelastic bot added the triage label Mar 30, 2021
@pebrc pebrc added the >enhancement Enhancement of existing functionality label Mar 30, 2021
@botelastic botelastic bot removed the triage label Mar 30, 2021
@pebrc
Collaborator Author

pebrc commented Apr 1, 2021

@sebgl suggested routing requests to individual Pods directly, as we are already doing when we switch protocols (TLS/non-TLS). Instead of this manual routing being the exception, we could make it the default and sidestep accidental misconfiguration this way.
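
For illustration, routing to an individual Pod could look roughly like the sketch below; the helper name, the port and the way the scheme is chosen are placeholders for this sketch, not the operator's actual code.

package example

import (
    "fmt"

    corev1 "k8s.io/api/core/v1"
)

// podURL builds a URL for talking to a single Elasticsearch Pod directly,
// bypassing the user-configurable Service. The port and scheme handling are
// assumptions here; the operator would derive both from the Elasticsearch spec.
func podURL(pod corev1.Pod, useTLS bool) string {
    scheme := "http"
    if useTLS {
        scheme = "https"
    }
    // Pod IPs come straight from the Pod's Status and are routable in-cluster.
    return fmt.Sprintf("%s://%s:9200", scheme, pod.Status.PodIP)
}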

@naemono naemono self-assigned this Oct 27, 2021
@naemono
Contributor

naemono commented Oct 27, 2021

A suggestion on which of these two options to take:

  1. Speak to pods directly
  2. Create an internal service managed by the operator

I'd like to suggest an internal service managed by the operator, for the following reasons:

  1. Speaking to pods directly, we would be taking on some of the burden that the Service itself carries, such as health checking. If we speak to pods directly, we'd have to look at the readiness check of each pod to see whether it's healthy before speaking to it. Pods are already moved in and out of a Service according to their health/readiness checks.
  2. Reduce load on the apiserver: every time we need to speak to a set of ES nodes (master or data), we'd need to query Kubernetes for the active pods that match the labels. While not a large increase in load, it would be unnecessary if we maintain our own internal service, or a set of services per node type.

This still doesn't remove the possibility of someone doing bad things to the internal service, but it seems to alleviate the initial concern of a customer accidentally misconfiguring the operator into a situation it can't get out of.
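
For illustration, a rough sketch of what such an operator-managed internal Service could look like; the name suffix, label key and port are placeholders for this sketch, not a final design.

package example

import (
    corev1 "k8s.io/api/core/v1"
    metav1 "k8s.io/apimachinery/pkg/apis/meta/v1"
    "k8s.io/apimachinery/pkg/util/intstr"
)

// internalHTTPService returns a Service whose selector is built exclusively
// from operator-owned labels, so a user-supplied spec.http.service can never
// redirect the traffic the operator itself depends on.
func internalHTTPService(esName, namespace string) *corev1.Service {
    return &corev1.Service{
        ObjectMeta: metav1.ObjectMeta{
            Name:      esName + "-es-internal-http", // placeholder naming
            Namespace: namespace,
        },
        Spec: corev1.ServiceSpec{
            Selector: map[string]string{
                // operator-owned label, never taken from spec.http.service
                "elasticsearch.k8s.elastic.co/cluster-name": esName,
            },
            Ports: []corev1.ServicePort{
                {Name: "https", Port: 9200, TargetPort: intstr.FromInt(9200)},
            },
        },
    }
}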

Thoughts?

@pebrc
Collaborator Author

pebrc commented Oct 28, 2021

Speaking to pods directly, we would be taking on some of the burden that the Service itself carries, such as health checking. If we speak to pods directly, we'd have to look at the readiness check of each pod to see whether it's healthy before speaking to it.

I think that would be a fairly straightforward inspection of the Pod's Status subresource, wouldn't it?
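
Something along these lines, as a sketch (not existing operator code):

package example

import corev1 "k8s.io/api/core/v1"

// podIsReady reports whether the Pod's Ready condition is True, based solely
// on the Pod's Status, which the operator already has in its cache.
func podIsReady(pod corev1.Pod) bool {
    for _, cond := range pod.Status.Conditions {
        if cond.Type == corev1.PodReady {
            return cond.Status == corev1.ConditionTrue
        }
    }
    return false
}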

Reduce load on the apiserver: every time we need to speak to a set of ES nodes (master or data), we'd need to query Kubernetes for the active pods that match the labels.

I am not sure I follow that line of argument. As we are using a cached client and we access Pods in other places in the operator, we are already incurring the cost of keeping that cache in sync. The additional listing of Pods for the purpose of finding one we can talk to would therefore be essentially "free".
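
To make that concrete, a sketch of what listing candidate Pods through the cached client could look like; the label key is a placeholder and podIsReady refers to the sketch above.

package example

import (
    "context"

    corev1 "k8s.io/api/core/v1"
    "sigs.k8s.io/controller-runtime/pkg/client"
)

// reachablePods lists the cluster's Pods through the (cached) client and keeps
// only those whose Ready condition is True. The label key is illustrative;
// the operator would use whatever labels it already owns on these Pods.
func reachablePods(ctx context.Context, c client.Client, namespace, esName string) ([]corev1.Pod, error) {
    var pods corev1.PodList
    if err := c.List(ctx, &pods,
        client.InNamespace(namespace),
        client.MatchingLabels{"elasticsearch.k8s.elastic.co/cluster-name": esName},
    ); err != nil {
        return nil, err
    }
    var ready []corev1.Pod
    for _, pod := range pods.Items {
        if podIsReady(pod) { // see the sketch above
            ready = append(ready, pod)
        }
    }
    return ready, nil
}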

@naemono
Contributor

naemono commented Oct 28, 2021

I hadn't examined where we interact with pods already in the code base. Let me take a look and get back to you on this.

@naemono
Contributor

naemono commented Oct 28, 2021

Ok, I believe I found where you're referring to. Is this the section https://github.com/elastic/cloud-on-k8s/blob/master/pkg/controller/elasticsearch/services/services.go#L149-L155 ?

If so, yeah, I see how just leveraging that would make the most sense. I don't see any filtering of ready pods though. Is that something we should include during this change?

@naemono
Contributor

naemono commented Oct 28, 2021

Nevermind on the filtering piece, I just ran into this https://github.com/elastic/cloud-on-k8s/blob/master/pkg/controller/elasticsearch/driver/driver.go#L312. 👍
