---
title: "29. gRPC in Kubernetes"
date: 2024-06-27T14:05:00+01:00
weight: 29
geekdocRepo: https://github.com/owncloud/ocis
geekdocEditPath: edit/master/docs/ocis/adr
geekdocFilePath: 0029-grpc-in-kubernetes.md
---

* Status: draft
* Deciders: [@butonic](https://github.com/butonic)
* Date: 2024-06-27

## Context and Problem Statement

[Scaling oCIS in Kubernetes causes requests to fail #8589](https://github.com/owncloud/ocis/issues/8589), sometimes persisting until the affected services have been restarted manually. This manifests as two symptoms:
1. when a new pod is added, it does not receive traffic
2. when a pod is shut down, existing clients still try to send requests to it

To leverage the Kubernetes pod state, we first used the go-micro kubernetes registry implementation. When a pod fails its health or readiness probes, Kubernetes will no longer
- send traffic to the pod via the kube-proxy, which handles the ClusterIP for a service, or
- list the pod in DNS responses, when the ClusterIP is disabled by setting it to `None` (a headless service).

When using the ClusterIP, HTTP/1.1 requests will be routed to a working pod.

This nice setup starts to fail with long-lived connections. The kube-proxy is connection-based, causing requests with Keep-Alive to stick to the same pod for more than one request. Worse, HTTP/2, and in turn gRPC, multiplexes all requests over a single connection. Such clients will not pick up any changes to pods, explaining the symptoms (see the sketch after this list):
1. new pods will not be used, because clients reuse the existing gRPC connection
2. gRPC clients will still try to send traffic to killed pods, because they have not picked up that the pod was killed, or because the pod was killed a millisecond after the lookup was made
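
The following minimal sketch illustrates the failure mode; the service address is a placeholder and the snippet assumes a plaintext connection:

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// grpc.Dial resolves the ClusterIP once and opens a single HTTP/2
	// connection. Every subsequent RPC is multiplexed over this one TCP
	// connection, so the kube-proxy never re-balances: new pods receive
	// no traffic and a killed pod keeps receiving doomed requests.
	conn, err := grpc.Dial(
		"gateway.ocis.svc.cluster.local:9142", // hypothetical ClusterIP service
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
	// all generated client stubs would share conn from here on
}
```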

Adding to this problem, the health and readiness implementations of oCIS services do not always reflect the actual state of the service. One example is the storage-users service, which returns ready `true` while still running a migration on startup.
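
A readiness probe that reflects such startup work could look like the following sketch; the names and the `/readyz` endpoint are hypothetical, this is an illustration, not the actual oCIS implementation:

```go
package main

import (
	"log"
	"net/http"
	"sync/atomic"
	"time"
)

var migrationDone atomic.Bool

// runMigrations stands in for whatever the service does on startup.
func runMigrations() {
	time.Sleep(10 * time.Second) // hypothetical migration work
	migrationDone.Store(true)
}

// readyz reports ready only after the migration has finished, so a
// readiness probe pointed at it keeps the pod out of the service
// endpoints until then.
func readyz(w http.ResponseWriter, r *http.Request) {
	if !migrationDone.Load() {
		http.Error(w, "migration in progress", http.StatusServiceUnavailable)
		return
	}
	w.WriteHeader(http.StatusOK)
}

func main() {
	go runMigrations()
	http.HandleFunc("/readyz", readyz)
	log.Fatal(http.ListenAndServe(":9000", nil))
}
```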

Furthermore, the go-micro kubernetes registry put too much load on the etcd service registry / Kubernetes API, possibly because every pod keeps a connection open and receives events, causing a lot of traffic when multiple oCIS deployments are running in the same Kubernetes cluster. Admittedly, that explanation still needs to be verified, but keep this problem in mind, as the possible solutions will have to deal with the same root cause.

To take the load off the Kubernetes API, we now roll our own nats-js-kv based service registry. It works, but we now ignore the readiness and health probes made by Kubernetes. The go-micro client would, however, address the problems (see the sketch after this list):
1. the selector.Select() call fetches a client based on the configured micro registry implementation. All registry implementations subscribe to changes, so new pods will be picked up. The kubernetes registry implementation even takes the pod readiness probes into account, which would make it the right solution, but as mentioned above, it seems to cause too much load on the Kubernetes API. The nats-js-kv registry is aware of pods even if they are not ready yet, leading to race conditions and failed requests when a pod is added. This is mitigated somewhat by the next go-micro client feature
2. by default, requests are retried five times when they fail. This mainly addresses failed requests when a pod is killed, because the client will make another selector.Select() call to find a working connection. It also helps with pods that have been registered but are not ready yet.
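
As a sketch, a go-micro client call relying on both features could look like this, assuming the go-micro v4 client API; the service name, endpoint and message types are placeholders:

```go
package main

import (
	"context"
	"fmt"

	"go-micro.dev/v4/client"
)

// hypothetical message types for the sketch
type statRequest struct{}
type statResponse struct{}

func main() {
	// The client resolves nodes through the configured registry on every
	// attempt, so a freshly registered pod can be selected and a killed
	// pod can be skipped on the next retry.
	c := client.NewClient(
		client.Retries(5),
	)

	req := c.NewRequest("com.owncloud.api.gateway", "Gateway.Stat", &statRequest{})
	var rsp statResponse

	// Call runs selector.Select() internally and retries on failure.
	if err := c.Call(context.Background(), req, &rsp); err != nil {
		fmt.Println("call failed after retries:", err)
	}
}
```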

Unfortunately, reva does not use go-micro gRPC clients. We implemented our own selector mechanism for the reva pool and always select the next client. This yields an IP that the upstream grpc-go client uses to dial the connection. No retries are configured.
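
A hypothetical sketch of such a next-client selection; the names are illustrative, not reva's actual code:

```go
package pool

import (
	"sync/atomic"

	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

// nextSelector round-robins over the addresses known at construction time.
type nextSelector struct {
	addrs []string
	n     atomic.Uint64
}

// Next returns the next address in round-robin order. Nothing here
// re-checks whether the pod behind it is still alive or ready.
func (s *nextSelector) Next() string {
	i := s.n.Add(1) - 1
	return s.addrs[i%uint64(len(s.addrs))]
}

// Dial connects to exactly the selected IP; if that pod is gone, every
// RPC on the resulting connection fails and no retry is configured.
func (s *nextSelector) Dial() (*grpc.ClientConn, error) {
	return s.dial(s.Next())
}

func (s *nextSelector) dial(addr string) (*grpc.ClientConn, error) {
	return grpc.Dial(addr, grpc.WithTransportCredentials(insecure.NewCredentials()))
}
```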

What makes this worse is that we cannot use the native retry mechanism of grpc-go: we have already looked up an IP, the pod behind it might already have been killed, and we would just retry sending requests to the same dead IP.
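
For reference, grpc-go's native retry mechanism is enabled through a service config like the following sketch (target and policy values are placeholders); it only helps when the resolver can still produce a healthy address:

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// Transparent per-method retries. With a single pre-resolved IP the
	// retry would hit the same dead pod, which is exactly the problem
	// described above.
	conn, err := grpc.Dial(
		"dns:///gateway.ocis.svc.cluster.local:9142",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{
		  "methodConfig": [{
		    "name": [{}],
		    "retryPolicy": {
		      "maxAttempts": 4,
		      "initialBackoff": "0.1s",
		      "maxBackoff": "1s",
		      "backoffMultiplier": 2,
		      "retryableStatusCodes": ["UNAVAILABLE"]
		    }
		  }]
		}`),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```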

To top it all off, go-micro v5 has changed its license to the BSL, which raises concerns about how long we can safely rely on it.

We need to decide how we want to load balance gRPC connections in Kubernetes.

## Decision Drivers

* oCIS should scale in Kubernetes without losing requests
* The code should be maintainable
* Connections should work on localhost as well as in kubernetes

## Considered Options

* go-micro clients
* Proxy load balancing
* Thick client-side load balancing
* Lookaside Load Balancing

## Decision Outcome

### Positive Consequences:

### Negative Consequences:

## Pros and Cons of the Options
### go-micro clients
* good, because we can use the go-micro client retry mechanism
* good, because we keep using interfaces that allow us to change the implementation, e.g. to test nats-js-kv vs. kubernetes or whatever comes next
* bad, because pod readiness in Kubernetes is basically ignored
* bad, because go-micro v5 changed the license to the BSL
* bad, because we would need to change every line of code that makes a gRPC call

### [Proxy load balancing](https://grpc.io/blog/grpc-load-balancing/#proxy-load-balancer-options)
We could use an L7 proxy like Envoy to do the gRPC load balancing. Kubernetes would have to return all pod IPs in DNS responses by setting `clusterIP: None` to use headless services, and clients would have to use the Envoy proxy address.
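
On the client side this would look like the sketch below; the proxy address is a placeholder. The long-lived connection problem moves to the proxy, which balances per request instead of per connection:

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// The client still keeps one long-lived HTTP/2 connection, but to the
	// proxy; Envoy then picks a healthy pod for each request at L7.
	conn, err := grpc.Dial(
		"envoy.ocis.svc.cluster.local:9142", // hypothetical proxy address
		grpc.WithTransportCredentials(insecure.NewCredentials()),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```
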
* good, because clients are simple
* bad, because the proxy adds an extra hop
* bad, because we would not be using the go-micro service registry

### [Thick client-side load balancing](https://grpc.io/blog/grpc-load-balancing/#thick-client)
For this we would have to replace the go-micro service registry and rely on DNS as the service registry. Service names would have to be configured with a scheme, e.g. `dns:///localhost:9142`, `dns:///gateway.ocis.svc.cluster.local:9142` or `unix:/var/run/gateway.socket`, and Kubernetes would have to return all pod IPs in DNS responses by setting `clusterIP: None` to use headless services.
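
A minimal sketch with grpc-go's built-in `dns` resolver and `round_robin` policy, reusing the example targets above:

```go
package main

import (
	"google.golang.org/grpc"
	"google.golang.org/grpc/credentials/insecure"
)

func main() {
	// The dns resolver returns all pod IPs of the headless service, and
	// round_robin opens a subchannel per IP, spreading RPCs across pods
	// instead of pinning them to a single connection.
	conn, err := grpc.Dial(
		"dns:///gateway.ocis.svc.cluster.local:9142",
		grpc.WithTransportCredentials(insecure.NewCredentials()),
		grpc.WithDefaultServiceConfig(`{"loadBalancingConfig": [{"round_robin":{}}]}`),
	)
	if err != nil {
		panic(err)
	}
	defer conn.Close()
}
```
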
* good, because we can use the grpc-go native retry mechanism
* good, because pod readiness is respected
* good, because we get rid of the complexity of a service registry - which means revisiting [ADR0006 Service Discovery](https://owncloud.dev/ocis/adr/0006-service-discovery/)
* bad, because we would lose the service registry
* bad, because every client will hammer the DNS server, causing a load problem similar to the one with the go-micro kubernetes registry implementation

### [Lookaside Load Balancing](https://grpc.io/blog/grpc-load-balancing/#lookaside-load-balancing)
* good, because the gRPC blog recommends it for "very high performance requirements (low latency, high traffic)"
* bad, because we have no experience with it

## Links
* [Load balancing and scaling long-lived connections in Kubernetes](https://learnk8s.io/kubernetes-long-lived-connections) explains why the kube-proxy is a bad fit for long-lived connections like gRPC
* [Don’t Load Balance GRPC or HTTP2 Using Kubernetes Service](https://medium.com/@lapwingcloud/dont-load-balance-grpc-or-http2-using-kubernetes-service-ae71be026d7f)
* gRPC Blog - gRPC Load Balancing - [Recommendations and best practices](https://grpc.io/blog/grpc-load-balancing/#recommendations-and-best-practices)
* gRPC Name Resolution - [Name Syntax](https://github.com/grpc/grpc/blob/master/doc/naming.md)
