An Introduction to Chaos Engineering: Trying to Break Stuff

Written by Gerardo Santovena | Dec 23, 2019 5:00:00 AM

Some time ago there was a car ad with the slogan "Don't use it, abuse it", meaning that no matter what you do to the car, it will continue working just fine. In other words, it was a resilient car. I'm changing "abuse" to "break" and referring not to a car but to your platform! Yes, your platform! We will try to break it.

Let me explain

Today's platforms are composed of many parts: microservices, web services, database servers, API services, storage, and networking. Even users are part of your platform. These parts should be loosely coupled, and each service should (must?) work by itself, serving one and only one purpose. Together, they make up your platform. For instance, a simple web application can be built from databases, search services, cache services, and, of course, web and application servers. And any of them can fail at any point.

One SRE principle is to design your platform architecture with a "this service will fail" mindset, meaning that every part has to be highly available and fault-tolerant. On paper this is easy peasy: you put a load balancer in front of your web servers, set up replication for the database servers, perhaps add another load balancer for your X service and another one for your Y service, and so on.

Now, how can you be sure that your platform will keep working and self-heal when something fails? What if the load balancer itself is the one that fails? Or a complete data center fails, taking all the database servers down with it? What happens if the cache servers fail? Would your platform continue working? What if the underlying network is the one flapping? This is where Chaos Engineering can help!

What is Chaos Engineering?

Chaos engineering is the facilitation of experiments to uncover systemic weaknesses. Information about the principles of Chaos Engineering can be found here. There are many tools you can use to create your experiments. In this article, we are going to use the Chaos Toolkit and will jump directly to the practice. Let's start by installing it:

$ pip3 install -U chaostoolkit chaostoolkit-aws
 $ chaos --version
 chaos, version 1.3.0

Now, our resilient application

Chaos Engineering was created to find weaknesses in distributed systems composed of hundreds or even thousands of microservices. For this example, we are going to model a much simpler application: an Echo server that listens on port 5566 and echoes each string it receives back to the client until it receives the word 'bye', at which point it disconnects. We have some restrictions for it, too:

The Echo server will be a service running on an AWS EC2 instance with an Elastic IP attached to it.
If the EC2 instance fails, another one will take its place and will reclaim the Elastic IP.
Keepalived will be used to make this happen.

import java.io.*;
import java.net.*;

public class EchoServer {
    public static void main(String[] args) {
        // Accept clients forever; each connection gets its own handler thread.
        try (ServerSocket server = new ServerSocket(5566)) {
            System.out.println("Listing on port 5566...");
            while (true) {
                Socket client = server.accept();
                EchoHandler handler = new EchoHandler(client);
                handler.start();
            }
        } catch (Exception e) {
            System.err.println("Exception caught: " + e);
        }
    }
}

class EchoHandler extends Thread {
    private final Socket client;

    EchoHandler(Socket client) {
        this.client = client;
    }

    @Override
    public void run() {
        try {
            BufferedReader reader = new BufferedReader(new InputStreamReader(client.getInputStream()));
            PrintWriter writer = new PrintWriter(client.getOutputStream(), true);

            System.out.println("[connected] " + client.getInetAddress());

            String line;
            // readLine() returns null once the client closes the connection.
            while ((line = reader.readLine()) != null) {
                System.out.println(client.getInetAddress() + " says " + line);

                // Echo the line back, then stop if the client said 'bye'.
                writer.println(line);
                if (line.trim().equals("bye")) {
                    break;
                }
            }
        } catch (Exception e) {
            System.err.println("Exception caught: client disconnected.");
        } finally {
            try {
                System.out.println("[disconnected] " + client.getInetAddress());
                client.close();
            } catch (Exception e) {
                // Nothing useful to do if closing the socket fails.
            }
        }
    }
}
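The restrictions listed above leave open how the replacement node actually reclaims the Elastic IP. One common approach is a script that Keepalived runs when a node is promoted to master. What follows is only a sketch of that idea, not the configuration used for this article: `claim_elastic_ip`, `FakeEC2`, and the IDs are hypothetical, and the `ec2` argument is assumed to be a boto3 EC2 client, whose `associate_address` and `create_tags` calls are the only ones used.

```python
def claim_elastic_ip(ec2, allocation_id, instance_id):
    """Move the Elastic IP to this instance and tag it as the new master.

    `ec2` is assumed to be a boto3 EC2 client; the IDs are placeholders.
    """
    # AllowReassociation lets the address move even though it is still
    # associated with the failed master.
    ec2.associate_address(
        AllocationId=allocation_id,
        InstanceId=instance_id,
        AllowReassociation=True,
    )
    # The experiment in the next section counts instances by this "State" tag.
    ec2.create_tags(
        Resources=[instance_id],
        Tags=[{"Key": "State", "Value": "MASTER"}],
    )


class FakeEC2:
    """Stand-in client that records calls, so the sketch runs without AWS."""

    def __init__(self):
        self.calls = []

    def associate_address(self, **kwargs):
        self.calls.append(("associate_address", kwargs))

    def create_tags(self, **kwargs):
        self.calls.append(("create_tags", kwargs))


fake = FakeEC2()
claim_elastic_ip(fake, "eipalloc-EXAMPLE", "i-EXAMPLE")
for name, kwargs in fake.calls:
    print(name, kwargs)
```

In a real setup the fake client would be replaced by `boto3.client("ec2")`, and the instance would need IAM permissions for both calls.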

Our experiment

The Echo service is simple but it will teach us much about Chaos Engineering. Consider the following JSON file with our Chaos Toolkit experiment:

{
  "version": "1.0.0",
  "title": "Validating High-Availability of the Echo server.",
  "description": "Ensure that there will always be an EC2 instance with the ElasticIP attached.",
  "tags": [
    "echoserver",
    "keepalived",
    "elasticip",
    "ha",
    "master",
    "backup"
  ],
  "configuration": {
    "aws_region": "us-east-1"
  },
  "steady-state-hypothesis": {
    "title": "The backup node will take the role of master when the original one is terminated.",
    "probes": [
      {
        "type": "probe",
        "name": "port-5566-is-listening-and-working-properly",
        "tolerance": true,
        "provider": {
          "type": "python",
          "module": "chaospythian.echo.probes",
          "func": "echoserver",
          "arguments": {
            "tcp_ip": "54.71.185.10",
            "tcp_port": 5566,
            "message": "testing text\n"
          }
        }
      },
      {
        "type": "probe",
        "name": "master-and-backup-instances-up-and-running",
        "tolerance": [1, 2],
        "provider": {
          "type": "python",
          "module": "chaosaws.ec2.probes",
          "func": "count_instances",
          "arguments": {
            "filters": [
              {
                "Name": "instance-state-name",
                "Values": ["running"]
              },
              {
                "Name": "tag:Name",
                "Values": ["production-echoserver"]
              }
            ]
          }
        }
      },
      {
        "type": "probe",
        "name": "one-ec2-instance-tagged-as-master",
        "tolerance": 1,
        "provider": {
          "type": "python",
          "module": "chaosaws.ec2.probes",
          "func": "count_instances",
          "arguments": {
            "filters": [
              {
                "Name": "instance-state-name",
                "Values": ["running"]
              },
              {
                "Name": "tag:State",
                "Values": ["MASTER"]
              },
              {
                "Name": "tag:Name",
                "Values": ["production-echoserver"]
              }
            ]
          }
        }
      },
      {
        "type": "probe",
        "name": "one-ec2-instance-tagged-as-backup",
        "tolerance": [0, 1],
        "provider": {
          "type": "python",
          "module": "chaosaws.ec2.probes",
          "func": "count_instances",
          "arguments": {
            "filters": [
              {
                "Name": "instance-state-name",
                "Values": ["running"]
              },
              {
                "Name": "tag:State",
                "Values": ["BACKUP"]
              },
              {
                "Name": "tag:Name",
                "Values": ["production-echoserver"]
              }
            ]
          }
        }
      }
    ]
  },
  "method": [
    {
      "type": "action",
      "name": "terminate-master-node",
      "provider": {
        "type": "python",
        "module": "chaosaws.ec2.actions",
        "func": "terminate_instance",
        "arguments": {
          "filters": [
            {
              "Name": "instance-state-name",
              "Values": ["running"]
            },
            {
              "Name": "tag:State",
              "Values": ["MASTER"]
            },
            {
              "Name": "tag:Name",
              "Values": ["production-echoserver"]
            }
          ]
        }
      },
      "pauses": {
        "after": 10
      }
    }
  ]
}

This experiment first probes whether the service is up and running, listening on port 5566, and working as expected. If a client connects and sends a string, it should receive the same string back:

Client-side:

$ nc 54.71.185.10 5566
 hello
 hello
 test
 test
 one two three
 one two three
 bye
 bye

Server-side:

Listing on port 5566...
 [connected] /190.104.119.195
 /190.104.119.195 says hello
 /190.104.119.195 says test
 /190.104.119.195 says one two three
 /190.104.119.195 says bye
 [disconnected] /190.104.119.195

Then, it checks the EC2 instance tags, looking for one and only one MASTER node and zero or one BACKUP nodes. It's okay to have no backup node for a moment: if the master node dies, the backup takes the lead and, from the client's point of view, the service stays up and running. Launching a replacement server takes a few seconds, and that server becomes the new backup node. If the backup node dies instead, no problem: our autoscaling group launches a new one, and the Elastic IP never leaves the master node.
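How a probe's return value is compared with its tolerance is worth spelling out. The sketch below is a simplified model of the two tolerance forms used in this experiment, not Chaos Toolkit's own code: a scalar tolerance is assumed to require equality, and a list tolerance is assumed to mean the value must be one of the listed items. For the integer counts used here, reading `[0, 1]` as an inclusive range would give the same answers. `matches_tolerance` is a hypothetical name.

```python
def matches_tolerance(tolerance, value):
    """Simplified check: a list means membership, a scalar means equality."""
    if isinstance(tolerance, list):
        return value in tolerance
    return value == tolerance


# Tolerances taken from the experiment's probes, with sample probe results.
print(matches_tolerance(True, True))   # echo probe returned True
print(matches_tolerance(1, 1))         # exactly one MASTER instance
print(matches_tolerance(1, 2))         # two masters would break the hypothesis
print(matches_tolerance([0, 1], 0))    # no BACKUP instance yet is acceptable
print(matches_tolerance([0, 1], 2))    # two backups is outside the tolerance
```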

Now the fun part: let's terminate the master node!

Before we do this, let's talk about extensions: Chaos Toolkit can use modules to extend its functionality. For instance, the experiment above uses two extensions: the AWS extension (chaosaws) and a custom one (chaospythian). The latter is a simple Python function that simulates a client connecting to the Echo server, sending a string, and waiting for the response:

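# -*- coding: utf-8 -*-
import socket

__all__ = ["echoserver"]


def echoserver(tcp_ip, tcp_port, message):
    """Return True if the Echo server sends `message` back unchanged."""
    # The context manager closes the socket even if the probe raises.
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as s:
        s.connect((tcp_ip, tcp_port))
        s.sendall(message.encode('ascii'))
        data = s.recv(1024)
        # Say goodbye with a full line so the server ends the session cleanly.
        s.sendall("bye\n".encode('ascii'))

    return data.decode('ascii').strip() == message.strip()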
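Before pointing the probe at a real instance, it can be exercised locally. The following is a minimal sketch, assuming a throwaway line-echo server on localhost stands in for the Java Echo server. `start_local_echo_server` is a hypothetical helper, and the probe is repeated (with a connection timeout added) only so the block runs on its own.

```python
import socket
import threading


def echoserver(tcp_ip, tcp_port, message):
    """Same check as the probe above, repeated so this block is self-contained."""
    with socket.create_connection((tcp_ip, tcp_port), timeout=5) as s:
        s.sendall(message.encode("ascii"))
        data = s.recv(1024)
        s.sendall(b"bye\n")
    return data.decode("ascii").strip() == message.strip()


def start_local_echo_server():
    """Hypothetical stand-in server: echoes lines until it reads 'bye'."""
    listener = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    listener.bind(("127.0.0.1", 0))  # port 0 lets the OS pick a free port
    listener.listen(1)

    def serve_one_client():
        conn, _ = listener.accept()
        with conn, conn.makefile("rb") as lines:
            for line in lines:
                conn.sendall(line)
                if line.strip() == b"bye":
                    break
        listener.close()

    threading.Thread(target=serve_one_client, daemon=True).start()
    return listener.getsockname()[1]


port = start_local_echo_server()
result = echoserver("127.0.0.1", port, "testing text\n")
print("probe passed:", result)
```

The message is the same `"testing text\n"` value the experiment passes in its `arguments` block, so a local pass suggests the probe logic itself is sound before AWS enters the picture.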

Okay, let's continue. The next part of the Chaos Toolkit experiment is the method, where we terminate the EC2 instance tagged as master. If our Keepalived configuration works, it will associate the Elastic IP with the backup node and tag it as the new master. Ideally, our clients won't notice. After the method, the hypothesis runs a second time: it tests once more that the service is up and running, that there is exactly one master node, and that there are zero or one backup nodes. If it fails, then we have found a weakness in our platform and we will have to fix it. You can probe anything using the Chaos Toolkit: not only what happens when a server goes down, but also certificate validity, network connectivity, queries to database servers, cache servers, etc.
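The order of operations described above (check the hypothesis, run the method, pause, check the hypothesis again) can be modeled in a few lines. This is a simplified sketch of the flow, not Chaos Toolkit's implementation: `run_experiment`, the fake cluster, and its probes and action are hypothetical, and the pause is set to zero so the example finishes immediately.

```python
import time


def run_experiment(probes, action, pause_after):
    """Model of the flow: verify steady state, inject the failure, verify again."""
    # If the system is unhealthy to begin with, running the method teaches nothing.
    if not all(probe() for probe in probes):
        return "aborted: steady state not met before the method"

    action()
    time.sleep(pause_after)  # give the failover time to happen

    if all(probe() for probe in probes):
        return "passed: steady state held after the method"
    return "deviated: a weakness was found"


# Fake cluster state: number of running instances per "State" tag.
cluster = {"MASTER": 1, "BACKUP": 1}


def terminate_master():
    # The master dies and the backup is promoted, leaving no backup for now.
    cluster["BACKUP"] -= 1


def one_master():
    return cluster["MASTER"] == 1


def zero_or_one_backup():
    return cluster["BACKUP"] in (0, 1)


outcome = run_experiment([one_master, zero_or_one_backup], terminate_master, pause_after=0)
print(outcome)
```

Swapping `terminate_master` for an action that leaves no master at all makes the second check fail, which is exactly the kind of deviation the real experiment is meant to surface.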
