Chaos engineering


Chaos engineering is the discipline of experimenting on a software system in production in order to build confidence in the system's capability to withstand turbulent and unexpected conditions.

Concept

In software development, a given software system's ability to tolerate failures while still ensuring adequate quality of service—often generalized as resiliency—is typically specified as a requirement. However, development teams often fail to meet this requirement due to factors such as short deadlines or lack of knowledge of the field. Chaos engineering is a technique to meet the resilience requirement.
Chaos engineering can be used to achieve resilience against infrastructure failures, network failures, and application failures.
While overseeing Netflix's migration to the cloud in 2011, Greg Orzell had the idea to address the lack of adequate resilience testing by setting up a tool that would cause breakdowns in their production environment, the environment used by Netflix customers. The intent was to move from a development model that assumed no breakdowns to a model where breakdowns were considered to be inevitable, driving developers to consider built-in resilience to be an obligation rather than an option:
"At Netflix, our culture of freedom and responsibility led us not to force engineers to design their code in a specific way. Instead, we discovered that we could align our teams around the notion of infrastructure resilience by isolating the problems created by server neutralization and pushing them to the extreme. We have created Chaos Monkey, a program that randomly chooses a server and disables it during its usual hours of activity. Some will find that crazy, but we could not depend on the random occurrence of an event to test our behavior in the face of the very consequences of this event. Knowing that this would happen frequently has created a strong alignment among engineers to build redundancy and process automation to survive such incidents, without impacting the millions of Netflix users. Chaos Monkey is one of our most effective tools to improve the quality of our services."

By regularly "killing" random instances of a software service, it was possible to test a redundant architecture to verify that a server failure did not noticeably impact customers.
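The idea of killing random instances can be illustrated with a small simulation. The following is a minimal sketch, not Netflix's implementation: the `Service` class, the instance ids, and the function names are hypothetical, and a real experiment would terminate actual servers rather than remove entries from a set.

```python
import random


class Service:
    """A toy redundant service: a request succeeds if any instance is alive."""

    def __init__(self, instance_ids):
        self.alive = set(instance_ids)

    def handle_request(self):
        # With redundancy, one healthy instance is enough to answer.
        return len(self.alive) > 0


def kill_random_instance(service, rng):
    """Terminate one randomly chosen live instance and return its id."""
    if not service.alive:
        return None
    victim = rng.choice(sorted(service.alive))
    service.alive.remove(victim)
    return victim


def run_experiment(service, rng, kills):
    """Kill `kills` instances one by one, checking the service after each."""
    results = []
    for _ in range(kills):
        victim = kill_random_instance(service, rng)
        results.append((victim, service.handle_request()))
    return results
```

Each entry in the returned list pairs the terminated instance with whether the service could still answer a request, which is the customer-facing signal the experiment is meant to verify.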

Perturbation models

The Simian Army is a suite of tools developed by Netflix to test the reliability, security, or resiliency of its Amazon Web Services infrastructure and includes the following tools:

Chaos Monkey

Chaos Monkey is a tool invented in 2011 by Netflix to test the resilience of its IT infrastructure. It works by intentionally disabling computers in Netflix's production network to test how remaining systems respond to the outage. Chaos Monkey is now part of a larger suite of tools called the Simian Army designed to simulate and test responses to various system failures and edge cases.
The code behind Chaos Monkey was released by Netflix in 2012 under an Apache 2.0 license.
The name "Chaos Monkey" is explained in the book Chaos Monkeys by Antonio Garcia Martinez:
"Imagine a monkey entering a 'data center', these 'farms' of servers that host all the critical functions of our online activities. The monkey randomly rips cables, destroys devices and returns everything that passes by the hand. The challenge for IT managers is to design the information system they are responsible for so that it can work despite these monkeys, which no one ever knows when they arrive and what they will destroy."

Chaos Kong

At the very top of the Simian Army hierarchy, Chaos Kong drops a full AWS "Region". Though rare, the loss of an entire region does happen, and Chaos Kong simulates a system's response to, and recovery from, this type of event.

Chaos Gorilla

Chaos Gorilla drops a full Amazon "Availability Zone".
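The effect of losing a whole zone can be reasoned about with a simple capacity check. The sketch below is an illustration only: the fleet layout, zone names, and the `min_instances` requirement are assumptions, and it models the outage on a dictionary rather than calling any cloud API.

```python
def drop_zone(instances, zone):
    """Return the instances that survive the loss of `zone`.

    `instances` maps an instance id to the zone it runs in.
    """
    return {iid: z for iid, z in instances.items() if z != zone}


def survives_any_zone_loss(instances, min_instances):
    """Check that losing any single zone leaves at least `min_instances`."""
    zones = set(instances.values())
    return all(
        len(drop_zone(instances, zone)) >= min_instances for zone in zones
    )
```

For a fleet with two instances in one zone and one in each of two others, losing the largest zone leaves two instances, so a requirement of two is met while a requirement of three is not.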

Latency Monkey

Introduces communication delays to simulate degradation or outages in a network.
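Delay injection of this kind can be sketched as a wrapper around a call. This is a generic illustration rather than Latency Monkey's actual mechanism: `with_latency` and `call_with_timeout_budget` are hypothetical names, and the `sleep` and `clock` parameters exist only so the delay can be replaced in tests.

```python
import functools
import time


def with_latency(delay_seconds, sleep=time.sleep):
    """Wrap a function so every call is preceded by an artificial delay."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            sleep(delay_seconds)  # simulated network delay
            return func(*args, **kwargs)
        return wrapper
    return decorator


def call_with_timeout_budget(func, budget_seconds, clock=time.monotonic):
    """Call `func` and report whether it finished within the caller's budget."""
    start = clock()
    result = func()
    elapsed = clock() - start
    return result, elapsed <= budget_seconds
```

Comparing the elapsed time against the caller's budget shows whether a dependent service would still meet its own deadline when the network degrades.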

Doctor Monkey

Performs health checks by monitoring performance metrics such as CPU load to detect unhealthy instances, which can then be analyzed for root causes and eventually fixed or retired.
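A metric-based health check can be sketched as a threshold test over recent samples. The threshold of 0.9 and the requirement of three breaches are illustrative assumptions, not values used by Doctor Monkey, and `find_unhealthy` is a hypothetical name.

```python
def find_unhealthy(cpu_samples, threshold=0.9, min_breaches=3):
    """Return sorted ids of instances whose CPU load is persistently high.

    `cpu_samples` maps an instance id to a list of load readings in [0, 1].
    An instance is flagged when at least `min_breaches` readings exceed
    `threshold` (both defaults are illustrative).
    """
    unhealthy = []
    for instance_id, samples in cpu_samples.items():
        breaches = sum(1 for load in samples if load > threshold)
        if breaches >= min_breaches:
            unhealthy.append(instance_id)
    return sorted(unhealthy)
```

Requiring several breaches rather than one avoids flagging an instance because of a single short spike.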

Janitor Monkey

Identifies and disposes of unused resources to avoid waste and clutter.
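One plausible way to identify unused resources is to look for items that are detached and have been idle for longer than a retention period. The sketch below assumes a hypothetical resource record with `id`, `attached`, and `last_used` fields and a 30-day default; the actual criteria used by Janitor Monkey are not described here.

```python
from datetime import datetime, timedelta


def find_unused(resources, now, max_idle=timedelta(days=30)):
    """Return ids of detached resources idle for longer than `max_idle`.

    Each resource is a dict with the keys "id", "attached" (bool) and
    "last_used" (datetime); this record shape is an assumption.
    """
    return [
        r["id"]
        for r in resources
        if not r["attached"] and now - r["last_used"] > max_idle
    ]
```

The function only reports candidates; deleting them is left to a separate step so that the list can be reviewed first.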

Conformity Monkey

A tool that determines whether an instance is nonconforming by testing it against a set of rules. If any of the rules determines that the instance is not conforming, the monkey sends an email notification to the owner of the instance.
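Rule-based conformity checking can be sketched as a list of named predicates applied to each instance. The two rules, the instance fields, and the function names below are hypothetical examples; the sketch collects the notifications that would be sent instead of sending email.

```python
# Example rules (assumptions): each is a name plus a predicate that
# returns True when the instance conforms.
RULES = [
    ("has_security_group", lambda inst: len(inst.get("security_groups", [])) > 0),
    ("has_health_check", lambda inst: inst.get("health_check_url") is not None),
]


def check_conformity(instance, rules):
    """Return the names of the rules that the instance violates."""
    return [name for name, predicate in rules if not predicate(instance)]


def build_notifications(instances, rules):
    """Return (owner, instance id, violations) for each nonconforming instance."""
    notifications = []
    for inst in instances:
        violations = check_conformity(inst, rules)
        if violations:
            notifications.append((inst["owner"], inst["id"], violations))
    return notifications
```

An instance is reported as soon as any single rule fails, which matches the behavior described above.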

Security Monkey

Derived from Conformity Monkey, a tool that searches for and disables instances that have known vulnerabilities or improper configurations.

10-18 Monkey

A tool that detects problems with localization and internationalization for software serving customers across different geographic regions.

Byte-Monkey

A small Java library for testing failure scenarios in JVM applications. It works by instrumenting application code on the fly to deliberately introduce faults such as exceptions and latency.
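The general technique of injecting exceptions and latency into application calls can be illustrated with a decorator. This is a Python analogue written for illustration, not Byte-Monkey's API: Byte-Monkey instruments JVM code, whereas the hypothetical `inject_faults` below wraps a function explicitly, and its parameters (`mode`, `rate`, `delay_seconds`) are assumptions.

```python
import functools
import random
import time


class InjectedFault(Exception):
    """Raised by the injector instead of running the wrapped call."""


def inject_faults(mode, rate, rng, delay_seconds=0.0, sleep=time.sleep):
    """Wrap a function so a fraction `rate` of calls fail or are delayed.

    mode: "exception" raises InjectedFault; "latency" delays the call.
    rng:  a random.Random instance, passed in so runs can be reproduced.
    """
    if mode not in ("exception", "latency"):
        raise ValueError("mode must be 'exception' or 'latency'")

    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rng.random() < rate:
                if mode == "exception":
                    raise InjectedFault(func.__name__)
                sleep(delay_seconds)  # delayed calls still run afterwards
            return func(*args, **kwargs)
        return wrapper
    return decorator
```

Passing a seeded `random.Random` makes a failing run repeatable, which helps when the injected fault exposes a bug that needs to be reproduced.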

Chaos Machine

ChaosMachine is a tool that does chaos engineering at the application level in the JVM. It concentrates on analyzing the error-handling capability of each try-catch block involved in the application by injecting exceptions.

Proofdock Chaos Engineering Platform

A chaos engineering platform that focuses on and leverages the Microsoft Azure platform and the Azure DevOps services. Users can inject failures on the infrastructure, platform and application level.

Gremlin

A "failure-as-a-service" platform that offers engineers a fully hosted solution for safely running experiments on complex systems, in order to identify weaknesses before they impact customers and cause revenue loss.

Facebook Storm

To prepare for the loss of a datacenter, Facebook regularly tests the resilience of its infrastructure to extreme events. Known as the Storm Project, the program simulates massive data center failures.

Days of Chaos

Inspired by AWS GameDays to test the resilience of its applications, teams from Voyages-sncf.com participated in a Day of Chaos. Every 30 minutes, operators simulated failures in pre-production. Teams earned points based on detections, diagnoses, and resolutions. This type of gamified event helps to introduce development teams to the concept of resilience.
The concept was presented at the 2017 DevOps REX conference and is described on the site http://days-of-chaos.com, which also aims to collect other such experiments.

ChaoSlingr

ChaoSlingr is the first open-source application of chaos engineering to cyber security. It focuses primarily on performing security experimentation on AWS infrastructure to proactively discover security weaknesses in complex distributed system environments. It was published on GitHub in September 2017.

Chaos Toolkit

The Chaos Toolkit was born from the desire to simplify access to the discipline of chaos engineering and to demonstrate that the experimentation approach can be applied at different levels: infrastructure, platform, and application. The Chaos Toolkit is an open-source tool, licensed under Apache 2, that was published in October 2017.

Mangle

Mangle enables users to run chaos engineering experiments against applications and infrastructure components to assess resiliency and fault tolerance. It is designed to introduce faults with very little pre-configuration and supports infrastructure including K8S, Docker, vCenter, and any remote machine with SSH enabled. Through its plugin model, users can define a custom fault based on a template and run it without writing the code from scratch.

Chaos Mesh

Chaos Mesh is a cloud-native chaos engineering platform that orchestrates chaos in Kubernetes environments. It is an open-source tool, licensed under Apache 2, that was published in December 2019.

Litmus Chaos

Litmus is a toolset for cloud-native chaos engineering. It provides tools to orchestrate chaos on Kubernetes to help SREs find weaknesses in their deployments. SREs use Litmus to run chaos experiments, initially in the staging environment and eventually in production, to find bugs and vulnerabilities. Fixing these weaknesses increases the resilience of the system.
Litmus Chaos is licensed under Apache 2.

DevOps

The rapid pace of the DevOps methodology of software deployment makes it challenging to ensure a sufficient level of confidence in the face of frequent releases. A key element to address this is for monitoring and testing to be done throughout the development and release cycle. Integrating chaos engineering into the DevOps toolchain contributes to the goal of continuous testing.