Stable Systems Checklist

Below is an opinionated list of attributes and policies that need to be met in order to establish a stable software system.

Preparation

Developers are in control of the software and they own the code.
Work only in small units. Fix, deploy, develop.
Keep each project team small: no more than 6 people.
Every project has a win condition.
Show the feasibility of the project by working on an initial seed for no longer than 3-4 days.
Gamble on new technology sensibly.
Define the scope of the project. Make sure it is not too broad.
Define feature extensions of each project.
Run experiments and flag them as analysis or pre-planning work for the real project.
Be on guard and look for bad APIs.
Data should be clean and not garbage.

Process & People

Pick the parts of Agile/XP/SCRUM/Kanban that work for the team and kill the rest.
Prefer asynchronous communication.
Know how different people on your team like to work with the code base.

System Planning

The system is built for production.
Design a shared-nothing architecture.
Build your system as a 12-factor app:
- Use one codebase tracked in revision control, with many deploys.
- Declare dependencies with package managers.
- Store configuration in the environment.
- Treat backing services as attached resources.
- Use separate build, release, and run stages.
- Execute the app as one or more stateless processes.
- Export services via port binding.
- Scale out via the process model. Never daemonize or write PID files. Use process managers.
- Processes start fast and shut down gracefully when they receive a SIGTERM.
- Keep development, staging, and production as similar as possible. Vagrant allows developers to run local environments.
- Treat logs as event streams.
- Run admin/management tasks as one-off processes. For example, Django manage.py commands.
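As an illustration of the "store configuration in the environment" factor above, here is a minimal Python sketch. The variable names `DATABASE_URL`, `PORT`, and `DEBUG`, and the `Settings` class itself, are assumptions made for the example, not names this checklist prescribes.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class Settings:
    """Hypothetical application settings, all sourced from the environment."""
    database_url: str
    port: int
    debug: bool


def load_settings(env=os.environ):
    """Build settings from a mapping of environment variables.

    Fails fast when a required variable is missing, so a misconfigured
    deploy stops at startup instead of at the first request.
    """
    try:
        database_url = env["DATABASE_URL"]
    except KeyError:
        raise RuntimeError("DATABASE_URL must be set in the environment") from None
    return Settings(
        database_url=database_url,
        port=int(env.get("PORT", "8000")),  # assumed default port
        debug=env.get("DEBUG", "false").lower() in ("1", "true", "yes"),
    )
```

Passing the mapping as a parameter keeps the function easy to test, and the same artifact can run unchanged in staging and production with different environments.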
The system is a set of modules with loose coupling.
Modules communicate loosely via a protocol.
Design protocols for future extension. Design each module for independence. Design each module so it could be ripped out and placed in another system and still work.
Avoid deep dependency hierarchies.
Avoid intermediaries that parse and interpret data.
Have a supervision/restart strategy.
Prefer ratcheting methods via idempotence.
Use a unique ID on every message. You can then always retry a message after a timeout and be sure the receiving system won’t run it twice, provided the receiver keeps a log of what it has already done.
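The unique-ID idea can be sketched as follows. This is a toy receiver, assuming an in-memory log and a hypothetical "add to balance" operation; a real receiver would keep its log in durable storage and update it atomically with the effect.

```python
import uuid


class IdempotentReceiver:
    """Applies each message at most once, keyed by its unique ID."""

    def __init__(self):
        self._log = {}    # message ID -> result of the first processing
        self.balance = 0  # hypothetical state changed by messages

    def handle(self, message_id, amount):
        # A retried message is answered from the log and not re-applied.
        if message_id in self._log:
            return self._log[message_id]
        self.balance += amount
        self._log[message_id] = self.balance
        return self.balance


receiver = IdempotentReceiver()
message_id = str(uuid.uuid4())             # the sender picks the ID once
first = receiver.handle(message_id, 100)
retry = receiver.handle(message_id, 100)   # sender timed out and resent
```

Because the retry returns the logged result, the sender cannot tell a retry from a first delivery, which is what makes blind retries safe.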
UNIX principle: each tool does one thing well.
Define the capacity of the system up front.
Decouple your SLA.
Put limits into application-level protocols: HTTP, RPC, etc.

Setup

First you build an empty project.
Add this empty project to continuous integration.
Deploy the empty project into staging.
Once this works, you start building the application.
Preconfigure your systems so you need no external dependencies when deploying.
The same artifact is deployed to staging and production. It picks up a context from the environment and this context configures it.
Don’t use advanced technology too early on.
Lock dependencies to specific tags/versions.
Make upgrading dependencies a decision on your part.
Vendor everything.
Make a production deploy take less than 1 minute from button-push-to-operational-on-the-first-instance.
Build a default library you include in every application you write.
Let every application use the same library.

Development

Correctness is more important than fast.
Elegant is more important than fast.
Code Quality is more important than fast.
Fast is not really important.

Build your system to collect metrics about itself as it runs.
Ship metrics to a central point for further analysis.
Use unit tests, property-based tests, type systems, static analysis, and profiling.
No vendor lock-in.
Use proven synchronization primitives.
No code formatting disputes.
Use load regulation at the border of the system.
Use a retry policy for failed requests. Consider delayed retries with exponential backoff.
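A minimal sketch of delayed retries with exponential backoff is shown here. The function name, the default limits, and the use of random jitter are assumptions for illustration; the operation being retried should be idempotent.

```python
import random
import time


def retry_with_backoff(operation, max_attempts=5, base_delay=0.1,
                       max_delay=5.0, sleep=time.sleep):
    """Call operation() until it succeeds or max_attempts is reached.

    The wait ceiling doubles after each failure (base_delay * 2**attempt),
    capped at max_delay. A random delay below that ceiling is used so that
    many clients do not retry in lockstep.
    """
    for attempt in range(max_attempts):
        try:
            return operation()
        except Exception:
            # Broad catch for brevity; real code should list retryable errors.
            if attempt == max_attempts - 1:
                raise
            ceiling = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, ceiling))
```

The `sleep` parameter exists only so tests can replace the real delay.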
Use a timeout policy for slow requests.
Use circuit breakers to break cascading dependency failure.
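One way to picture a circuit breaker is the small sketch below. The class, its thresholds, and the single-trial half-open behaviour are assumptions for illustration, and the sketch is not thread-safe.

```python
import time


class CircuitOpenError(Exception):
    """Raised when a call is rejected because the circuit is open."""


class CircuitBreaker:
    def __init__(self, failure_threshold=3, reset_timeout=30.0,
                 clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_timeout = reset_timeout
        self._clock = clock
        self._failures = 0
        self._opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self._opened_at is not None:
            if self._clock() - self._opened_at < self.reset_timeout:
                # Fail fast instead of piling load on a broken dependency.
                raise CircuitOpenError("circuit open")
            # Timeout elapsed: let one trial call through (half-open).
        try:
            result = operation()
        except Exception:
            self._failures += 1
            if (self._opened_at is not None
                    or self._failures >= self.failure_threshold):
                self._opened_at = self._clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self._failures = 0
        self._opened_at = None
        return result
```

While the circuit is open, callers get an immediate error they can turn into a degraded response, so a slow dependency does not tie up every worker upstream.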
Use bulkheads to partition systems. Protect critical clients by giving them their own pool to call. Virtual servers provide an excellent mechanism for implementing bulkheads. At smaller scale, bind processes to CPUs.
Use soft/weak references where possible to minimize memory footprint.

Picking a database

Pick PostgreSQL as the default. If you need MongoDB-like functionality, create a jsonb column.
Export to Elasticsearch from PostgreSQL.
Use PgBouncer.
Isolate complex transactional interactions to a few parts of the store.
Look for idempotent ratcheting methods as an alternative.

Picking a programming language

Avoid the monoculture.
Know the weaknesses of a language.
The deployment tooling must be in place before use.
Use make. Use the same make targets for all projects in the organization.

Picking Architecture

Use REST.
Use REST specification formats such as OpenAPI or RAML.

Configuration

Secure defaults.
Persistent data lives outside of the artifact path, on a dedicated disk with dedicated quota.
Log rotation.
The artifact path is not writable by the application.
Use different credentials in production and staging.
Deny developers’ laptops easy access to the production environment.
Avoid the temptation of too early etcd/Consul/Chubby setups.

Operations

Optimize for sleep. The system must avoid waking people up in the middle of the night at all costs.
The system must be able to gracefully degrade.
The system runs out of monit, supervise, upstart, systemd, rcNG, SMF, or the like.
The application must gracefully stop and start if given the command to do so.
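A minimal sketch of a worker that stops cleanly on SIGTERM might look like this. The function names and the polling loop are assumptions for illustration; the handler only sets a flag, and the loop finishes its current item before exiting.

```python
import signal
import threading

shutdown_requested = threading.Event()


def handle_sigterm(signum, frame):
    """Signal handler: record the request and return immediately."""
    shutdown_requested.set()


def install_handlers():
    """Register the handler; must be called from the main thread."""
    signal.signal(signal.SIGTERM, handle_sigterm)


def run(process_one_item, poll_interval=0.1):
    """Process items until shutdown is requested, never mid-item."""
    while not shutdown_requested.is_set():
        process_one_item()
        # Wait instead of sleeping so a shutdown request wakes the loop early.
        shutdown_requested.wait(poll_interval)
    # Flush buffers, close connections, and release locks here.
```

A service would call `install_handlers()` once at startup and then `run(...)`. The process manager sends SIGTERM, the loop drains, and the process exits on its own.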
Every log file is shipped and indexed outside of the system. Every interesting metric too.
Don’t leave log files on production systems. Copy them to a staging area for analysis.
The only way to make changes to a production host is to redeploy.
Make it easy to roll back and downgrade a deployment.
In a production system you must be able to query its state in an ad-hoc fashion.
If you enable ad-hoc query and tracing on the system and then disable it again, there must be no segfaults, no kernel crashes, and no long-term impact.

Site Reliability

Hire coders only.
Have an SLA for your service.
Measure performance based on your SLA.
Share 5% of operations work with developers.
Do postmortems after each event and focus only on processes, not people.

Tools

Debugging

DTrace
GDB

Cloud Storage

minio

Security

Database

Use encryption for sensitive data.
All backups are stored encrypted as well.
Use minimal privilege for the database access user account.
Store and distribute secrets using a key store designed for the purpose.
Don’t hard-code secrets in your applications.
Only use SQL prepared statements.
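To show why parameterized statements matter, here is a small sketch using Python's built-in `sqlite3` module as a stand-in database. The `users` table and `find_user` helper are hypothetical, and PostgreSQL drivers use their own placeholder syntax.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT)")
conn.execute("INSERT INTO users (name) VALUES (?)", ("alice",))


def find_user(conn, name):
    """Look up a user by name; the value is bound, never spliced into SQL."""
    cursor = conn.execute("SELECT id, name FROM users WHERE name = ?", (name,))
    return cursor.fetchall()


# With string concatenation this input would match every row.
hostile = "alice' OR '1'='1"
rows_for_alice = find_user(conn, "alice")
rows_for_hostile = find_user(conn, hostile)
```

The hostile input is compared as a plain string, so it matches no rows.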

Development

Use vulnerability scanners for every version pushed to production.
Run memory leak analyzers against your production runtime binaries.
Use race condition detection in your runtime binaries.
Acquire and investigate any vendor libraries for surprises and failure modes.

Authentication

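Ensure all passwords are hashed using an appropriate password-hashing function such as bcrypt. Use cryptographically secure random bytes wherever randomness is needed, for example for salts and tokens.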
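The checklist names bcrypt, which in Python is a third-party package. The sketch below instead uses PBKDF2-HMAC-SHA256 from the standard library so that it runs without extra dependencies. The function names and the iteration count are assumptions for illustration, and the work factor should be tuned to current guidance and your hardware.

```python
import hashlib
import hmac
import secrets

ITERATIONS = 600_000  # assumed work factor; tune for your environment


def hash_password(password):
    """Return (salt, digest) for storage; the salt is fresh per password."""
    salt = secrets.token_bytes(16)  # cryptographically secure random bytes
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 salt, ITERATIONS)
    return salt, digest


def verify_password(password, salt, expected_digest):
    """Recompute the digest and compare in constant time."""
    digest = hashlib.pbkdf2_hmac("sha256", password.encode("utf-8"),
                                 salt, ITERATIONS)
    return hmac.compare_digest(digest, expected_digest)
```

Store the salt and digest together with the algorithm and iteration count, so the work factor can be raised later without invalidating existing accounts.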
Apply password rules that encourage users to have long, random passwords.
Use multi-factor authentication for your logins to all your service providers.

Denial of Service Protection

At a minimum, have rate limiters on your slower API paths and authentication related APIs like login and token generation routines.
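A token bucket is one common way to build such a rate limiter. The sketch below is a single-process illustration with assumed names and parameters; a multi-instance service would usually keep the bucket state in a shared store.

```python
import time


class TokenBucket:
    """Allows bursts up to `capacity` and a sustained `rate` per second."""

    def __init__(self, rate, capacity, clock=time.monotonic):
        self.rate = rate
        self.capacity = capacity
        self._clock = clock
        self._tokens = float(capacity)
        self._last = clock()

    def allow(self, cost=1.0):
        """Return True and spend tokens if the request may proceed."""
        now = self._clock()
        # Refill for the elapsed time, never beyond capacity.
        elapsed = now - self._last
        self._tokens = min(self.capacity, self._tokens + elapsed * self.rate)
        self._last = now
        if self._tokens >= cost:
            self._tokens -= cost
            return True
        return False
```

A login endpoint could keep one bucket per client identifier and reject the request (for example with HTTP 429) when `allow()` returns `False`. Giving expensive paths a higher `cost` protects the slower APIs more aggressively.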
Use CAPTCHA on the front end.
Enforce sanity limits on the size and structure of user submitted data and requests.
Use a global caching proxy service like CloudFlare.
No single points of failure. Have redundancy on machines.
Use bulkhead server partitioning. In essence, assign limited resources to specific (groups of) clients, applications, operations, client endpoints, and so on.

Web Traffic

Use the strict-transport-security header to force HTTPS on all requests.
Cookies must be httpOnly and secure and be scoped by path and domain.
Use Content Security Policy without allowing unsafe-* backdoors.
Use CSP Subresource Integrity for CDN content.
Use X-Frame-Options and X-XSS-Protection headers in client responses.
Use CSRF tokens in all forms.
Use the SameSite cookie attribute, which mitigates CSRF in browsers that support it.
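The cookie recommendations in this section (httpOnly, secure, scoped by path and domain, SameSite) can be combined in one `Set-Cookie` value. The sketch below uses Python's standard `http.cookies` module. The cookie name `session`, the helper name, and the choice of `Strict` are assumptions for illustration.

```python
from http.cookies import SimpleCookie


def build_session_cookie(session_id, domain, path="/"):
    """Return a Set-Cookie header value with the hardening attributes set."""
    cookie = SimpleCookie()
    cookie["session"] = session_id
    morsel = cookie["session"]
    morsel["httponly"] = True      # not readable from JavaScript
    morsel["secure"] = True        # only sent over HTTPS
    morsel["samesite"] = "Strict"  # not sent on cross-site requests
    morsel["path"] = path
    morsel["domain"] = domain
    return morsel.OutputString()


header_value = build_session_cookie("abc123", "example.com")
```

Most web frameworks expose the same attributes as options on their own cookie-setting call, which is usually preferable to building the header by hand.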
Keep as little in the session state as possible.
Use a robots.txt file to keep legitimate bots away.

APIs

No resources are enumerable in your public API.
All users are fully authenticated and authorized appropriately when using your API.
Use canary checks in APIs to detect illegal or abnormal requests that indicate attacks.

Validation and Encoding

Do client-side input validation for quick user feedback, but never trust it. Validate all input again on the server.
Escape text before displaying it.
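As a small illustration of escaping before display, the sketch below uses `html.escape` from the Python standard library. The `render_comment` helper and its markup are hypothetical; template engines with auto-escaping do the same job more reliably.

```python
import html


def render_comment(author, text):
    """Build an HTML fragment with every user-supplied value escaped."""
    return "<p><b>{}</b>: {}</p>".format(html.escape(author), html.escape(text))


rendered = render_comment("eve", "<script>alert(1)</script>")
```

The submitted markup is rendered as visible text instead of being executed by the browser.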

Cloud Configuration

Ensure all services have minimum ports open.
Host backend database and services on private VPCs that are not visible on any public network.
Isolate logical services in separate VPCs and peer VPCs to provide inter-service communication.
Ensure all services only accept data from a minimal set of IP addresses.
Restrict outgoing IP and port traffic to minimize APTs and “botification”.
No root credentials.
Use minimal access privilege for all ops and developer staff.
Regularly rotate passwords and access keys according to a schedule.

Infrastructure

Ensure you can do automated upgrades without downtime.
Create all infrastructure using a tool such as Terraform, and not via the cloud console. Have zero tolerance for any resource created in the cloud by hand.
Use centralized logging for all services. You should never need SSH to access or retrieve logs.
Don’t SSH into services except for one-off diagnosis. Using SSH regularly typically means you have not automated an important task.
Don’t keep port 22 open on any AWS security groups on a permanent basis. If you must use SSH, only use public key authentication and not passwords.
Create immutable hosts instead of long-lived servers that you patch and upgrade.
Protect infrastructure secrets with centralized secret management tools like Vault or Keywhiz.

Operation

Power off unused services and servers.
Have a practiced security incident plan.

Test

Do Penetration Testing.
Do fuzz testing.
Everything is auditable.
Identify whatever your most expensive transactions are, and double or triple the proportion of those transactions to see how your system handles stress.
Do Stress Tests.

Security tools

Auditing

auditd

Encryption

Keyczar

References

12 Factor app
How to build stable systems
Continuous Code Quality
Distributed Systems Safety Research
Site Reliability Book
How Complex Systems Fail
Web operations
Continuous Delivery
Release It!
Web Developer Security Checklist
Making the Netflix API more Resilient

License

To the extent possible under law, Theo Despoudis has waived all copyright and related or neighboring rights to this work.
