-
-
Notifications
You must be signed in to change notification settings - Fork 158
Composable Distributed OS
Spitballing a distributed OS design following the Perlis-Thompson Principle. (An elaboration on this lobste.rs subthread on Unix Shell: History and Trivia)
- Understand Borg (since that's my background). This system makes sense from first principles, but it's easier to explain as a "diff".
- Read "A Better Kubernetes from the Ground Up". This page doesn't have much overlap with that one, but it's worthwhile to see the "diff".
- Kubernetes is Our Generation's Multics, and "cloud review" -- a distributed OS should be a bunch of shell scripts.
Short answer: shell scripts, FastCGI scripts, and git :-) Loosely coupled, concurrent processes and versioned, hierarchical data.
Key point: Borg/Kubernetes are the control plane, not the data plane. They deal with extremely little data and can be written in interpreted languages. The kernel is a very fast data plane, and you just have to use it correctly.
- git needs a couple augmentations:
- The notion of a "pointer" to a tree in another git repository, or a flat blob (leaf). This creates a big uniform namespace where each subtree is a value (in the Rich Hickey sense), and it has a short hash as an immutable identifier (to be concrete, let's say it looks like
K-abcd0123
) - A way to "listen" to for updates across a subtree. The simplest version of this could literally be implemented using
inotify()
?- For example, we might have a hierarchy
/users/alice
and/oci-images/myapp/
. Some scripts/applets might want to listen for changes to/users
in order to commit another file, or listen for changes to/oci-images/myapp/
in order to deploy new versions.
- For example, we might have a hierarchy
- The notion of a "pointer" to a tree in another git repository, or a flat blob (leaf). This creates a big uniform namespace where each subtree is a value (in the Rich Hickey sense), and it has a short hash as an immutable identifier (to be concrete, let's say it looks like
- Unified storage and networking (somewhat like Plan 9). There's no "networking or RPC". It's based on versioned state synchronization, like
git pull
. - Single storage namespace. Logically, there are not separate repos for Debian / PyPI / Docker images. They all live behind one abstraction and are identified by KIDs
- No separation between WAN and LAN. It can run across a WAN.
- The "master" state has two implementations: single node or Paxos. But it's polymorphic.
- It's built to be easily turned up / bootstrapped and doesn't have an "inner platform effect". There's no another distributed OS below it that distributes its binaries and has its own auth system.
- Uses only process-based concurrency. There are no threads and no goroutines. (This solves O(M * N) problems in distributed tracing and debugging, which are very important.)
- There is a single shell language for coordinating processes :)
- It's a source-based OS. You push source code (like Heroku) and build configuration, and the system spins up processes to build it into a binary / OCI image that can be deployed to many machines. Both the source code and binary image are identified with KIDs.
- Being source-based means that distributed debugging and tracing tools can always refer back to source code. The system knows more about what's running on it than just opaque containers.
- There's no geographic split like "AWS regions". The cluster manager itself can manage nodes across a WAN, i.e. across multiple geographic regions. It's based on synchronization of state.
- Applications like databases could have some notion of region if they want, but the cluster manager doesn't.
- Controversial: no types or schemas!
- Schemas are another possibly incommensurable "concept"; types inhibit metaprogramming and generic operations, and are anti-modular. All of this has a bearing on distributed systems. Instead I would use Hickey's style of interfaces: "strengthen a promise" and "relax a requirement".
- The data model is REST with the "uniform interface constraint", which is like Plan 9: a hierarchy of tables, objects, and documents.
- objects are used for configuration (Oil configuration evaluates to JSON)
- tables/relations are used for cluster state and metrics (maybe there are streams which are like infinite tables)
- documents are used for the user interface (as well as docs/online help!). The state of every node has to be reflected a web UI.
- The system should separate policy and mechanism. Example: starting or killing a process is a mechanism. But gracefully restarting 20 copies of a web server is an algorithm (policy) that belongs at a higher layer. This shouldn't be embedded in the config file for your service; it's part of a different tool -- that launches a remote process that operates on remote processes!
- Web as a modest and humble (and this brilliant) extension of Unix. This design is for a very minimal extension of UNix.
- It changes the file system to use distributed syncable state with global IDs (KIDs)
- It uses contained processes.
- It uses existing auth mechanisms
So those are the three concepts: running code, data (control plane or data plane), and some form of identity.
An OS is a mediator between applications, people, and hardware. That's it. This is the minimal system to achieve those goals.
- Use Cases span the gamut. It scales down as well as scales up.
- Hosting Static Files. You need something like nginx and a file system.
- Hosting simple database-backed web apps (like Heroku does, e.g. "12 factor apps")
- Hosting off-the-shelf distributed databases. It needs to be flexible enough for this.
- Hosting a search engine: batch jobs for indexing, a tree of servers for serving posting lists, etc.
- Hosting its own binaries: metrics, monitoring, alerting
- Hosting another copy of itself: It needs to compose.
- There is a "virtualization theorem" for hardware; likewise we want some kind of "virtualization" for a distributed. Not because this is necessarily a great idea, but because it's a test of the power of the abstractions developed. Arguably, the fact that VMs and containers exist is a sign of a deficit in the design of a Unix process. A process was a virtual machine, except where it leaked and caused problems.
-
The Inner Platform effect and the Bootstrapping Problem
- We want to solve this.
-
There is a single storage abstraction and single namespace. Let's call it a "Keg" to be concrete.
- Think of Keg as a wrapper over git. It supports versioning and differential compression. It can efficiently hold 1,000 copies of a 1 GiB container (possibly by using "pointers". Programs can read and write its content with the regular POSIX file system API, not a RPC wrapper.
-
Contrast to the cloud
- there is no separate Debian repository or PyPI repository. We import those all into a Keg and assign them a KID.
- there is no separate Docker image repository
- there is no separate thing to store membership
-
The "master" holds authoriative State, and has at least two implementations.
- A single machine implementation. It just stores everything on a file system, like a git repo. A single machine can be very reliable -- more reliable than an entire AWS region, because those have single points of failure and misconfigurations.
- A implementation that uses Paxos. This probably isn't necessary. If the master is down, then it just means you can't deploy new code or stop processes. All the mayors will maintain the integrity of the images they're assigned to run.
-
Everything is a KID. A KID can be thought of as a value (in the Rich Hickey sense), and also a distributed pointer. It's a handle to versioned data.
- To be concrete, let's write it as K-0123abcd. This is a hex number.
-
Examples
- Every user has a KID (could be a hash of an e-mail address)
- Every machine has a KID. (It could be a hash of a human-readable name)
- Every version of source code is represented by a KID (can be derived from git hash)
- There is a single executable format like an OCI container, and each one has a unique KID.
- Every running process had a KID. The Mayor maintains a map from PID to KID.
-
Operations:
- wait() on a set of process KIDs to exit.
-
Components
- Mayor: a distributed init. Maintains more of its own integrity than Borg.
- Machine state: A table of (KID for binary, KID for user, )
- Mayor: a distributed init. Maintains more of its own integrity than Borg.