This repository has been archived by the owner on Jul 11, 2023. It is now read-only.

Adding HA docs
Document HA and resource statelessness, distribution
and design decisions taken on OSM for proper HA functioning.

Signed-off-by: Eduard Serra <eduser25@gmail.com>
eduser25 committed Feb 24, 2021
1 parent a3df32d commit b3705df
Showing 5 changed files with 63 additions and 0 deletions.
2 changes: 2 additions & 0 deletions DESIGN.md
@@ -81,6 +81,8 @@ Let's take a look at each component:
### (1) Proxy Control Plane
The Proxy Control Plane plays a key part in operating the [service mesh](https://www.bing.com/search?q=What%27s+a+service+mesh%3F). All proxies are installed as [sidecars](https://docs.microsoft.com/en-us/azure/architecture/patterns/sidecar) and establish an mTLS gRPC connection to the Proxy Control Plane. The proxies continuously receive configuration updates. This component implements the interfaces required by the specific reverse proxy chosen. OSM implements [Envoy's go-control-plane xDS v3 API](https://github.com/envoyproxy/go-control-plane). The xDS v3 API can also be used to extend the functionality provided by SMI, when [advanced Envoy features are needed](https://github.com/openservicemesh/osm/issues/1376).

The Proxy Control Plane's availability is of foremost importance when it comes to traffic policy enforcement and connectivity management between services. Some of the Control Plane design decisions are heavily influenced by that fact, such as its stateless nature. To read more on the design decisions behind the High Availability design of the Control Plane, please refer to the [HA design doc](/docs/content/docs/HA.md).

### (2) Certificate Manager
Certificate Manager is a component that provides each service participating in the service mesh with a TLS certificate.
These service certificates are used to establish and encrypt connections between services using mTLS.
61 changes: 61 additions & 0 deletions docs/content/docs/HA.md
@@ -0,0 +1,61 @@
---
title: "HA Design considerations"
description: "Open Service Mesh HA Design considerations"
type: docs
---
# HA Design considerations

OSM's control plane components are built with High Availability and Fault Tolerance in mind. The following sections document how these are addressed.

## High Availability and Fault Tolerance

High Availability and Fault Tolerance are achieved through several design decisions and external mechanisms in OSM, which are documented in the following sections:

### Statelessness
OSM's control plane components do not own or hold any state-dependent data that needs to be saved at runtime, with the controlled exceptions of:
- CA / Root Certificate: The root CA certificate must be the same across all OSM instances when running multiple replicas. For [Certificate Manager](/DESIGN.md#2-certificate-manager) implementations that require the root CA to have been generated/provided prior to OSM execution (Vault, Cert-Manager), the root CA will be fetched from the provider at boot by all instances.
For other Certificate Providers that can autogenerate a CA when none is present (such as Tresor), creation is atomic and synchronized, ensuring all instances load the same CA.
- Envoy Bootstrap Certificates (used by the proxies to authenticate against the control plane): these are created during injection webhook handling and inlined as part of the proxy's bootstrap configuration. The configuration is stored as a Kubernetes secret and mounted in the injected Envoy pod as a volume, ensuring idempotence for a single pod at any one time.
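
The create-or-load behaviour described for auto-generated root CAs can be illustrated with a minimal Python sketch. `FakeSecretStore`, `create_if_absent`, and `load_or_create_root_ca` are hypothetical names, and the in-memory store is only a stand-in for shared storage; this is an illustration of the idea, not OSM's implementation.

```python
import secrets
import threading


class FakeSecretStore:
    """Hypothetical in-memory stand-in for a shared store with atomic create."""

    def __init__(self):
        self._lock = threading.Lock()
        self._data = {}

    def create_if_absent(self, key, value):
        # Atomically store `value` only if `key` is missing,
        # and return whatever value ends up stored under `key`.
        with self._lock:
            return self._data.setdefault(key, value)


def load_or_create_root_ca(store):
    # Every instance proposes its own candidate; only the first one is kept.
    candidate = secrets.token_hex(16)  # placeholder for real CA material
    return store.create_if_absent("root-ca", candidate)


store = FakeSecretStore()
results = []
threads = [
    threading.Thread(target=lambda: results.append(load_or_create_root_ca(store)))
    for _ in range(5)
]
for t in threads:
    t.start()
for t in threads:
    t.join()
```

In this sketch all five simulated instances end up holding the same root CA value, whichever one happened to create it first.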

Other than those exceptions, the rest of the configuration is built and fetched from Kubernetes.

The domain state used to compute the traffic policies is entirely provided by the different runtime providers (Kubernetes _et al_) and by Kubernetes client-go informers on the related objects that `osm-controller` subscribes to.

Multiple running `osm-controller` instances will subscribe to the same set of objects and will generate identical configurations for the service mesh. Because Kubernetes client-go informers are eventually consistent, `osm-controller` guarantees only that policy enforcement is eventually consistent.
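
As a rough illustration of why stateless replicas converge, the sketch below derives a configuration purely from an observed snapshot of cluster objects. The snapshot layout, the service names, and the functions `compute_config` and `config_digest` are assumptions made for this example; they are not OSM data structures.

```python
import hashlib
import json


def compute_config(snapshot):
    """Pure function: the config depends only on the observed cluster state."""
    routes = sorted(
        (policy["source"], policy["destination"])
        for policy in snapshot["traffic_policies"]
    )
    return {
        "services": sorted(snapshot["services"]),
        "allowed_routes": [list(route) for route in routes],
    }


def config_digest(config):
    # Canonical JSON so that equal configs always hash to the same value.
    return hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()


snapshot_seen_by_a = {
    "services": ["bookstore", "bookbuyer"],
    "traffic_policies": [{"source": "bookbuyer", "destination": "bookstore"}],
}
# Replica B observes the same objects, delivered in a different order.
snapshot_seen_by_b = {
    "services": ["bookbuyer", "bookstore"],
    "traffic_policies": [{"source": "bookbuyer", "destination": "bookstore"}],
}

digest_a = config_digest(compute_config(snapshot_seen_by_a))
digest_b = config_digest(compute_config(snapshot_seen_by_b))
```

Both digests match once the replicas have observed the same objects. While one informer cache lags behind, the digests may differ temporarily, which is the eventual consistency described above.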

<p align="center">
<img src="/docs/content/docs/images/ha/ha1.png" width="400" height="350"/>
</p>

### Restartability
The stateless design considerations above should ensure OSM's control plane components are fully restartable.

- A restarting instance will resynchronize all Kubernetes domain resources. Existing proxies will reconnect, and (assuming no changes occurred on the mesh topology or policy) the same configuration should be recomputed and pushed as a new version to the proxies.
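
A small sketch of this restart behaviour, under the assumption that version identifiers are held only in process memory. `ControlPlaneInstance` and its version format are hypothetical and chosen only to make the point visible.

```python
import itertools

_instance_ids = itertools.count(1)


class ControlPlaneInstance:
    """Hypothetical controller process that keeps no state across restarts."""

    def __init__(self, cluster_state):
        self.instance_id = next(_instance_ids)
        self.push_count = 0
        # Resynchronize: rebuild everything from the cluster on startup.
        self.resources = sorted(cluster_state)

    def push(self):
        # Each push carries a fresh version label, even for identical content.
        self.push_count += 1
        version = f"{self.instance_id}-{self.push_count}"
        return version, list(self.resources)


cluster_state = {"bookstore", "bookbuyer"}

before_restart = ControlPlaneInstance(cluster_state)
version_before, resources_before = before_restart.push()

# Simulate a restart: a new process with empty memory, same cluster state.
after_restart = ControlPlaneInstance(cluster_state)
version_after, resources_after = after_restart.push()
```

The recomputed resources are identical, while the version label differs, which mirrors "the same configuration ... pushed as a new version".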

### Horizontal scaling
The `osm-controller` and `osm-injector` components can be scaled horizontally and independently of each other, depending on load or availability requirements.
- When `osm-controller` is deployed with multiple replicas, connecting proxies may be load-balanced and connected to any of the OSM instances running for the control plane.
- Similarly, `osm-injector` can be horizontally scaled to handle an increased number/rate of pod onboardings on the mesh.
- In `osm-controller`, service certificates (used by proxies to authenticate and communicate with each other over TLS) are short-lived and kept only at runtime on the control plane (though pushed to the proxies over the xDS protocol when required).

Multiple `osm-controller` instances might create different yet valid service certificates for a single service. These different certificates will (1) have been signed by the same root, as multiple OSM instances must load the same root CA, and (2) have the same Common Name (CN), which is used to match against and authenticate when traffic is proxied between services.

<p align="center">
<img src="/docs/content/docs/images/ha/ha2.png" width="450" height="400"/>
</p>

In short, no matter which control plane instance a proxy connects to, a valid certificate, with the correct CN and signed by the shared control plane root CA, will be pushed to it.
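
The property can be sketched with a deliberately simplified toy model. Real service certificates are X.509 certificates with asymmetric signatures; here an HMAC keyed with a shared secret stands in for "signed by the same root CA". The names `issue_cert`, `verify_cert`, and the CN `bookstore.default` are assumptions for illustration only.

```python
import hashlib
import hmac
import secrets

# Stand-in for the root CA key that every control plane instance loads.
ROOT_KEY = secrets.token_bytes(32)


def issue_cert(common_name, root_key):
    # Each issuance gets its own serial, so two certs are never identical.
    serial = secrets.token_hex(8)
    payload = f"{common_name}|{serial}".encode()
    signature = hmac.new(root_key, payload, hashlib.sha256).hexdigest()
    return {"cn": common_name, "serial": serial, "signature": signature}


def verify_cert(cert, root_key, expected_cn):
    # Valid if the signature chains to the shared root and the CN matches.
    payload = f"{cert['cn']}|{cert['serial']}".encode()
    expected = hmac.new(root_key, payload, hashlib.sha256).hexdigest()
    return hmac.compare_digest(expected, cert["signature"]) and cert["cn"] == expected_cn


# Two control plane instances issue a certificate for the same service.
cert_from_a = issue_cert("bookstore.default", ROOT_KEY)
cert_from_b = issue_cert("bookstore.default", ROOT_KEY)
```

The two certificates differ, yet both verify against the shared root and the expected CN, so a peer does not need to know which instance issued either of them.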

- Increasing horizontal scale will NOT redistribute established proxy connections to the control plane unless they are disconnected.
- Decreasing horizontal scale will make the disconnected proxies connect to instances that were not terminated by the downscale. New versions of the config should be computed and pushed upon establishing the connection anew.
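
The two scaling rules above can be simulated with a short sketch. The instance and proxy names are made up, and the round-robin choice of a surviving instance is an assumption; in practice the reconnection target depends on how connections are load-balanced.

```python
def scale_up(connections, instances, new_instance):
    # Established connections are left untouched; only the pool grows.
    return dict(connections), instances + [new_instance]


def scale_down(connections, instances, removed):
    # Assumes at least one instance survives the downscale.
    remaining = [i for i in instances if i != removed]
    updated = {}
    for n, (proxy, instance) in enumerate(sorted(connections.items())):
        if instance == removed:
            # A disconnected proxy reconnects to some surviving instance.
            updated[proxy] = remaining[n % len(remaining)]
        else:
            updated[proxy] = instance
    return updated, remaining


connections = {
    "proxy-1": "osm-controller-0",
    "proxy-2": "osm-controller-0",
    "proxy-3": "osm-controller-1",
}
instances = ["osm-controller-0", "osm-controller-1"]

after_up, instances_up = scale_up(connections, instances, "osm-controller-2")
after_down, instances_down = scale_down(after_up, instances_up, "osm-controller-0")
```

After the scale-up no proxy has moved to the new instance; after the scale-down only the proxies that were attached to the removed instance have changed their connection.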

<p align="center">
<img src="/docs/content/docs/images/ha/ha3.png" width="450" height="400"/>
</p>

- If the control plane is brought down entirely, running proxies should continue to operate in headless<sup>[1]</sup> mode until they can reconnect to a running control plane.




[1] Headless: commonly used in the control-plane/data-plane design paradigm, this refers to the concept that, when there is a dependency between two components, allows the dependent agent to survive and keep running with the latest state when the component it depends on dies or becomes unreachable.
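
A minimal sketch of the headless idea, assuming a proxy that simply keeps its last applied configuration when a sync attempt fails. `Proxy`, `ControlPlaneUnavailable`, and the two fetch functions are hypothetical and do not describe Envoy's actual internals.

```python
class ControlPlaneUnavailable(Exception):
    """Raised by the fetch function when no control plane can be reached."""


class Proxy:
    def __init__(self):
        self.active_config = None

    def sync(self, fetch_config):
        try:
            self.active_config = fetch_config()
            return True
        except ControlPlaneUnavailable:
            # Headless mode: keep serving with the last known configuration.
            return False


def healthy_control_plane():
    return {"allowed_routes": [["bookbuyer", "bookstore"]]}


def dead_control_plane():
    raise ControlPlaneUnavailable("no control plane instance reachable")


proxy = Proxy()
first_sync = proxy.sync(healthy_control_plane)
second_sync = proxy.sync(dead_control_plane)
```

The second sync fails, but `proxy.active_config` still holds the configuration received during the first one.
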
Binary file added docs/content/docs/images/ha/ha1.png
Binary file added docs/content/docs/images/ha/ha2.png
Binary file added docs/content/docs/images/ha/ha3.png