Customer Cluster Chronicle

Problems

To solve any problem we must first fully understand the problem.

Here at Runnable we care about the following things, and so far we have prioritized them in this order:

1. Build Speed
2. Stability
3. Cost Optimization
4. Performance
5. Security

Definitions

Now that I have listed the priorities, I will define what each entails.

Build Speed

Here are the factors that go into build speed:

docker cache
FROM image availability
cpu shares
available ram

Docker Cache

Docker cache is the availability of Docker layers used in previous builds. The more previous layers that are on the box, the more cache will be used. If we have cache misses, the RUN commands will have to be rerun. The limitation here is disk space.

`FROM` Image Availability

If the FROM image is not on the box, it will have to be pulled. The limitation here is disk space.

CPU Shares

The more containers that are on the box, the less CPU each container gets. The limitation here is CPUs.
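
As a rough worked example of this effect: Docker's CPU shares are relative weights (1024 by default), so when every container on a dock is busy, each one gets CPU in proportion to its weight. The sketch below assumes a hypothetical 8-CPU dock and ignores everything else running on the box; `cpus_per_container` is an illustrative helper, not part of any existing tooling.

```python
from typing import List


def cpus_per_container(total_cpus: float, shares: List[int]) -> List[float]:
    """Worst-case CPU per container when all containers are busy.

    Each container receives CPU in proportion to its share weight.
    """
    total_shares = sum(shares)
    return [total_cpus * s / total_shares for s in shares]


# Hypothetical dock with 8 CPUs: doubling the container count halves each share.
four_containers = cpus_per_container(8, [1024] * 4)
eight_containers = cpus_per_container(8, [1024] * 8)
```

With four equally weighted containers each gets 2 CPUs under full contention; with eight, each gets 1.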

Available RAM

The more RAM a container has available, the more RAM cache it will have. The limitation here is RAM.

Stability

Here are the factors that go into stability:

built image repository
services
dock availability
limiting ram
disk space / inodes

Built Image Repository

When a dock becomes unhealthy and we need to roll over, we need to migrate its images. Having images stored in a repository allows us to recover an image at the cost of pull time only.

Services

In order to provide the best experience, all of our services need to be robust. Networking, DNS, file tree, and registry all need to be up.

Dock Availability

To run builds and containers, we need to ensure we have enough docks.

Limiting RAM

For builds and running containers to run smoothly, they need enough RAM. We also need to limit RAM so that one container does not use all of the system's RAM.

Disk Space / inodes

Each container needs enough disk space and inodes to perform its task. If we run out of disk space or inodes, containers cannot function properly.
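
Both limits can be checked from the dock itself. The following is a minimal sketch using Python's standard `os.statvfs`; the function names and the 10% threshold are assumptions chosen for illustration, not values taken from our setup.

```python
import os
from typing import Dict


def disk_headroom(path: str = "/") -> Dict[str, float]:
    """Report free bytes and free inodes for the filesystem holding `path`."""
    st = os.statvfs(path)
    return {
        "free_bytes": st.f_bavail * st.f_frsize,
        "free_inodes": st.f_favail,
        # Some filesystems report zero totals, so guard the divisions.
        "bytes_free_ratio": st.f_bavail / st.f_blocks if st.f_blocks else 0.0,
        "inodes_free_ratio": st.f_favail / st.f_files if st.f_files else 1.0,
    }


def is_low(headroom: Dict[str, float], min_ratio: float = 0.10) -> bool:
    """True when either disk space or inodes fall below the (assumed) threshold."""
    return (
        headroom["bytes_free_ratio"] < min_ratio
        or headroom["inodes_free_ratio"] < min_ratio
    )
```

Checking inodes separately matters because a disk full of many small layer files can run out of inodes while still showing free space.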

Cost Optimization

Here are the factors that go into cost optimization:

number and size of docks
size of disks
disk and network IO

Number and Size of Docks

The more docks we have, the higher our cost. The bigger the docks, the higher our cost.

Size of Disks

The bigger the disk we put on each dock, the higher our cost.

Disk and Network IO

The more IO a user container or build uses, the higher our cost.

Performance

Here are the factors that go into performance:

ram
cpu

RAM

The more RAM a container has, the more performant it will be.

CPU

The less CPU is shared, the more performant the container will be.

Security

Here are the factors that go into security:

isolated access

Isolated Access

Containers should not be able to access anything they are not supposed to. People should not be able to access containers they are not supposed to.

Solutions

Now that we know what each problem entails, I will detail how things can be improved.

Build Speed

Docker Cache

To improve Docker cache usage, we need layers to be available:

ensure layers are on docks builds are scheduled on
distribute layers so we have high availability
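
One way to read the first point is as a scheduling rule: send a build to the dock that already holds the most layers it needs. The sketch below is only an illustration of that idea; `pick_dock`, the dock names, and the layer IDs are all hypothetical, and a real scheduler would also weigh CPU, RAM, and disk.

```python
from typing import Dict, Set


def pick_dock(build_layers: Set[str], docks: Dict[str, Set[str]]) -> str:
    """Return the dock whose local layers overlap most with the build's layers.

    Dock names are sorted first so that ties resolve deterministically.
    """
    return max(sorted(docks), key=lambda name: len(build_layers & docks[name]))


# Hypothetical inventory of layer IDs per dock.
docks = {
    "dock-a": {"layer-1"},
    "dock-b": {"layer-1", "layer-2", "layer-3"},
}
chosen = pick_dock({"layer-1", "layer-2"}, docks)
```

Here `dock-b` is chosen because it already has both layers the build needs, so fewer RUN commands would have to be rerun.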

FROM Image Availability

ensure FROM images are on docks builds are scheduled on
distribute FROM images so we have high availability

CPU Shares

ensure we run as few containers per dock as we can

Available RAM

ensure we run as few containers per dock as we can

Stability

Built Image Repository

localhost registry
Amazon ECR (or other hosted solutions like quay.io)

Services

ensure services are always up and robust

Dock Availability

autoscale groups
correct scaling in / out
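
What "correct" scaling means is not pinned down here, but one simple interpretation is to size the group from the number of containers to run plus some spare capacity. The sketch below assumes a fixed containers-per-dock target and one spare dock; both numbers and both function names are illustrative assumptions.

```python
import math


def desired_docks(containers: int, per_dock: int, spare: int = 1) -> int:
    """Docks needed to hold `containers` at `per_dock` each, plus spare docks."""
    return math.ceil(containers / per_dock) + spare


def scaling_action(current_docks: int, containers: int, per_dock: int) -> str:
    """Return 'out', 'in', or 'hold' by comparing current and desired dock counts."""
    target = desired_docks(containers, per_dock)
    if current_docks < target:
        return "out"
    if current_docks > target:
        return "in"
    return "hold"
```

Keeping a spare dock trades some cost for availability, which matches the priority order of stability over cost optimization.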

Limiting RAM

limit RAM to a reasonable limit
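
Docker exposes a hard memory limit through the `--memory` flag of `docker run`. As a minimal sketch, the helper below only builds the argument list; the helper name, the image name, and the `1g` value are placeholders, and what counts as a reasonable limit is left open.

```python
from typing import List


def docker_run_args(image: str, memory_limit: str) -> List[str]:
    """Build a `docker run` command line that caps the container's RAM."""
    return ["docker", "run", "--detach", "--memory", memory_limit, image]


# Placeholder image name and limit, for illustration only.
args = docker_run_args("example/user-container", "1g")
```

The resulting list can be passed to `subprocess.run(args, check=True)` on a dock where Docker is installed.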

Disk Space / inodes

ensure we provide enough disk space
clean old images
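
Cleaning old images competes with the Docker cache goal above, so one reasonable policy (an assumption, not a description of what we run today) is to remove the least recently used images first and stop as soon as enough space is freed. The tuple layout and function name below are hypothetical.

```python
from typing import List, Tuple

# (image name, size in bytes, last-used timestamp)
Image = Tuple[str, int, float]


def images_to_remove(images: List[Image], bytes_needed: int) -> List[str]:
    """Pick least recently used images until at least `bytes_needed` is freed."""
    freed = 0
    chosen = []
    for name, size, _last_used in sorted(images, key=lambda img: img[2]):
        if freed >= bytes_needed:
            break
        chosen.append(name)
        freed += size
    return chosen


# Hypothetical images on a dock; "b" was used longest ago.
images = [("a", 100, 3.0), ("b", 200, 1.0), ("c", 50, 2.0)]
to_remove = images_to_remove(images, 220)
```

In this example `b` and then `c` are selected, freeing 250 bytes, while the most recently used image `a` stays available as cache.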

Cost Optimization / Performance / Security

I will stop here, as build speed and stability are our highest priorities.

Example Tradeoffs

Scheduling fewer containers on each dock: more CPU and RAM per container; increases stability and performance; decreases cost optimization and build speed.
Build, push, pull scheduling: decreases build speed; increases stability, performance, and cost optimization.

Data
