What is Chaos Engineering?
Wikipedia: "Chaos Engineering is the discipline of experimenting on a system in order to build confidence in the system's capability to withstand turbulent conditions in production."
TechTarget: "Chaos engineering is the process of testing a distributed computing system to ensure that it can withstand unexpected disruptions."
What's a typical Chaos Engineering workflow?
According to Gremlin, there are three steps:
- Plan an experiment: form a hypothesis and choose a scenario in which your system might fail to operate properly
- Execute the smallest possible experiment that tests the hypothesis
- If nothing goes wrong, scale the experiment up and widen the blast radius. If the system breaks, you now have a better understanding of why and can start fixing it
The process then repeats, either with the same scenario or with a new one.
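The loop described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the API of Gremlin or any other tool: `Experiment`, `FakeCluster`, and `run_with_growing_blast_radius` are hypothetical names, and the "system" is an in-memory stand-in that stays healthy as long as at least `min_healthy` replicas are up.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass
class Experiment:
    """Hypothetical chaos experiment: a fault, a steady-state check, a rollback."""
    name: str
    inject_fault: Callable[[List[str]], None]  # applies the fault to the chosen targets
    steady_state_ok: Callable[[], bool]        # True if the hypothesis still holds
    rollback: Callable[[], None]               # undoes the fault


class FakeCluster:
    """In-memory stand-in for a real system (an assumption for this sketch)."""

    def __init__(self, replicas: List[str], min_healthy: int) -> None:
        self.replicas = set(replicas)
        self.down = set()
        self.min_healthy = min_healthy

    def kill(self, names: List[str]) -> None:
        self.down.update(names)

    def healthy(self) -> bool:
        return len(self.replicas - self.down) >= self.min_healthy

    def restore(self) -> None:
        self.down.clear()


def run_with_growing_blast_radius(experiment: Experiment,
                                  targets: List[str],
                                  start: int = 1,
                                  factor: int = 2) -> int:
    """Start with the smallest experiment and widen it while the hypothesis holds.

    Returns the largest blast radius (number of targets) that passed,
    or 0 if even the smallest experiment broke the system.
    """
    if not targets:
        return 0
    size = max(1, min(start, len(targets)))
    largest_ok = 0
    while True:
        experiment.inject_fault(targets[:size])
        try:
            ok = experiment.steady_state_ok()
        finally:
            experiment.rollback()  # always clean up, even if the check raises
        if not ok:
            return largest_ok      # the system broke: stop and investigate
        largest_ok = size
        if size == len(targets):
            return largest_ok      # nothing left to widen
        size = min(size * factor, len(targets))


if __name__ == "__main__":
    cluster = FakeCluster([f"r{i}" for i in range(8)], min_healthy=5)
    exp = Experiment("kill-replicas", cluster.kill, cluster.healthy, cluster.restore)
    print(run_with_growing_blast_radius(exp, sorted(cluster.replicas)))  # prints 2
```

With eight replicas and a minimum of five healthy ones, the runs with one and two replicas down pass, and the run with four down fails, so the sketch reports 2 as the largest safe blast radius. In a real setup the fault, the health check, and the rollback would be provided by the chaos tool and your monitoring, not by an in-memory class.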
Name a few tools used to run Chaos Engineering experiments
- AWS Fault Injection Simulator: inject failures in AWS resources
- Azure Chaos Studio: inject failures in Azure resources
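- Chaos Monkey: one of the best-known chaos tools, created by Netflix; it randomly terminates instances in production and integrates with Spinnaker, so it works with the cloud providers Spinnaker supports
- Litmus: a Chaos Engineering framework for Kubernetes
- Chaos Mesh: a Chaos Engineering platform for Kubernetes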
See an extensive list here