In search of a simple consensus algorithm

It seems that industry reached consensus on consensus:

  • It isn't an easy task to create Paxos-based systems. The Googlers behind the Paxos Made Live paper write:
    Despite the existing literature in the field, building such a database (Paxos-based) proved to be non-trivial.
  • Raft is a way more understandable algorithm battle-tested in such products as Etcd, CockroachDB, RethinkDB and TiDB.

It isn't obvious. Despite being understandable, Raft is a complex protocol[1][2] which has performance penalty when a leader is unstable.

In this post:

  • covered an availability limitation of the Raft protocol
  • demonstrated that modern implementations of Raft are subject to it
  • described an existing simpler approach to the problem of consensus
  • showed that its toy 500-lines implementation has performance similar to Etcd but doesn't suffer from Raft's performance penalty

Raft's instability

Raft is an algorithm which describes how to build a replicated consistent fault-tolerant log. It uses such building blocks as:

  • Strong leader
  • Leader election

Philosoraptor isn't the only one who noticed this fact. The folks behind the the EPaxos paper paper asked the same question. They observed that Multi-Paxos has a defect:

if the master fails, then the system cannot serve requests until a new master is elected.

The same flaw affects Raft. The reason why EPaxos paper doesn't mention it is because the Raft and EPaxos papers were published in the same year.

I picked four modern Raft-based distributed systems (Etcd, CockroachDB, RethinkDB & TiKV) and tested them with default settings to demonstrate that they all are affected by the mentioned issue.

Crashed leader Isolated leader Version
Etcd 2s 2s 3.1.0 (8ba2897)
RethinkDB 0.6s 15s rethinkdb 2.3.5~0xenial (GCC 5.3.1)
CockroachDB 12s 10s beta-20170223
TiDB 18s 2m 40s Bug pd: f5744d7
tikv: eb185b3
tidb: a8d185d
Crashed leader
a case when I stopped a leader with "kill -9"
Isolated leader
a case when I isolated a leader with iptables
Time interval
a duration of unavailability when no one node of storage was able to serve requests

I tested the systems with the default settings. Of course, a test with the defaults settings is the test of the default settings so it can't be used to decide which system handles the leader issues better. Each system can be tuned to have a lesser unavailability window. However, the lesser leader election time means the more sensibility to the network glitches. The point of the test is to demonstrate that all Raft implementations have this tradeoff.

Later in post I'll show that the tradeoff isn't essential to the problem of consensus and it can be avoided with a different consensus algorithm.

The systems were tested on the cluster of 4 machines. Three nodes hosted a storage, and the 4th was a client. The client used three coroutines, each of them opened a connection to its dedicated node of a storage and were executing the following loop:

  1. read a value by a key
  2. if the wasn't set then use 0
  3. increment the value
  4. write it back
  5. increment a number of successful iterations
  6. repeat the loop

Each coroutine used its own key to avoid collisions. If there was an error during the loop, then it closed the current connection, opened a new one and began next iteration of the loop.

Once in a second (100ms in the case of 10x) the app dumped the number of successful iterations since the last dump per cluster and each node. Such simple metrics helped to analyze a behavior of a cluster when a leader was disturbed.

I picked a leader empirically as a node with the highest rate of successful iterations.

Etcd (10x)
261 51 16 16 19
262 44 14 13 17
RethinkDB (10x)
175 74 23 23 28
176 72 22 21 29
CockroachDB
146 489 217 134 138
147 512 234 133 145
TiDB
498 426 120 171 135
499 436 126 171 139
1st column of a storage's subtable
nth second of the experiment
2nd
number of successful iterations per last second per all nodes
3rd
4th
5th
number of successful iterations per second per 1st (2nd or 3rd) node

In the case of Etcd and RethinkDB the 3rd node is the leader, in the case of CockroachDB it's the 1st node and with TiDB it's the 2nd. Let's kill a leader and see how it affects the health of the cluster.

Etcd (10x)
266 39 13 12 14
267  4  1  1  2
268  0  0  0  0
...
286  0  0  0  0
287 23 13 10  0
288 28 15 13  0
RethinkDB (10x)
179 68 21 21 26
180 61 20 19 22
181  0  0  0  0
...            
186  0  0  0  0
187 41 23 18  0
188 42 23 19  0
CockroachDB
150 549 250 143 156
151 410 186 109 115
152   0   0   0   0
...
161   0   0   0   0
162 106   0 106   0
163 221   0 167  54
164 310   0 188 122
TiDB
501 476 134 192 150
502 133  37  54  42
503   0   0   0   0
...
517   0   0   0   0
518  57  56   0   1
519 211 146   0  65
520 294 163   0 131

You can see that the duration of the unavailability windows corresponds to the data provided in the first table.

For more information see the tests repository.

Leaderless consensus algorithms

One may think that since all the tested systems are affected by this issue then probably it's inevitable for all the consensus protocols.

Luckily, it isn't true. Epaxos is an example of a leaderless consensus protocol, but it isn't the only one - it's pretty easy to come up with a design of a "leaderless" system built on top of known battle-tested ideas.

For example, instead of using a distributed log as an Event Sourcing backend for building a key/value storage as an interpretation of an ordered stream of updates we can run a distributed log instance per key. It distributes different leaders across different nodes to achieve the same effect as a leaderless consensus does: a failure of one node doesn't affect the unavailability of the whole cluster.

A log per key looks like overkill. Hopefully, we have a simpler alternative: Single Decree Paxos (Synod). Multi-Paxos (a log) is an abstraction on top of an array of write-once registers implemented with the Single Decree Paxos algorithm, but nothing prevents us from using the same algorithm to build a rewritable register. Once we have it then creating a key/value storage becomes a trivial task since a key/value storage is by definition is a tagged set of rewritable registers.

As far as I know, there isn't a computer science paper dedicated solely to the topic of building a rewritable register on top Single Decree Paxos. But the idea is known both in the industry (see TreodeDB) and in academia (Heidi Howard, Cambridge Ph.D. student, described it in her presentation and going to cover in the Ph.D. thesis).

I practiced logic and proved it in the How Paxos works post, but the proof wasn't peer reviewed so use it at your risk :) However, it was enough for me, so I built Gryadka, a toy key/value storage of 500 line of code, and extensively tested it with the fault injection technique.

Besides the consistency testing, I measured the unavailability window by using the same method I used with Etcd, CockroachDB, RethinkDB and TiDB. Since Gryadka is based on a leaderless (multi-leader) consensus protocol it demonstrated zero downtime both when a node was crashed and when it was isolated.

Gryadka (crash)
182 435 147 143 145
183 435 146 144 145
184 412 116 148 148
185 296   0 149 147
186 309   0 154 155
187 289   0 145 144
Gryadka (isolation)
94 465 152 156 157
95 455 151 149 155
96 453 143 154 156
97 318   0 157 161
98 292   0 144 148
99 290   0 144 146

Performance

It's interesting to compare performance of Gryadka with an acknowledged solution, so I chose Etcd.

Both systems were tested on a 4-nodes cluster (3 nodes for storage + 1 client) and demonstrated similar results:

Avg Latency Max Latency Requests/sec
Etcd 1.55ms 19.53ms 5227.69
Gryadka 1.68ms 28.53ms 4720.19
How were systems tested?

The systems were tested with a similar workload: 8 threads/coroutines, each in a loop reads a value (by its own key to avoid collisions), increments it and writes back.

Usually, linearized reads and linearized writes have similar characteristics, so the statistics aggregate requests of both type and you should treat 4720rps roughly as 2360 reads and 2360 writes per second, the same applies to the latency parts too.

I used wrk with custom lua scripts to test Etcd. Gryadka was tested with a javascript client with async/await-based coroutines.

Etcd was tested with the default parameters. Gryadka doesn't have parameters to tune but the underneath Redis instances were configured to use AOF persistence mode and to flush data to the disk after every request.

Despite the similarities in the workload some other aspects of the experiments, like client-server topology, were different. Etcd's client was talking only with a leader who in its own turn propagates the changes to the followers. Gryadka's client was acting as a proposer and was directly communicating with all the Redis instances.

Complexity

Complexity has formal and informal definitions. Let's see how different interpretations of complexity align with Single-Decree Paxos.

In algorithmic information theory, the Kolmogorov complexity of an object, such as a piece of text, is the length of the shortest computer program (in a predetermined programming language) that produces the object as output.

When it is applied to the algorithms, it means that an algorithm with the shortest implementation is simpler.

In the 500 Lines or Less book Dustin J. Mitchell writes about Multi-Paxos:

In early drafts of this implementation, I implemented support for membership changes. This seemingly simple change introduced a great deal of complexity ... The result was far too large for this book!

Gryadka has less than 500 lines of code but supports membership change, so Single Decree Paxos seems simpler than Multi-Paxos.

An intuitive perception of complexity correlates with the number of parts: the more parts a jigsaw puzzle has, the more complex it is. Tobias Schottdorf from Cockroach Labs admits several concerns in implementing Raft algorithm, among them are log truncation and snapshotting. Single Decree Paxos isn't a log based, so it doesn't have such parts and hence may be simpler than Raft.

Complexity can be subjective e.g. if something is hard, if it takes a lot of time or effort then it seems complex. Ronald Duncan writes about his experience of implementing Raft that it wasn't easy and took several months. The same I can tell about Gryadka.

I planned to finish it in a couple of days, but the whole endeavor lasted a couple of months, the first version has consistency issues, so I had to mock the network and to introduce fault injections to catch the bugs. So it seems that from the position of subjective complexity both algorithms are sophisticated.

Single Decree Paxos seems simpler than Raft and Multi-Paxos but remains complex enough to spend months of weekends in chasing consistency.

Conclusion

In this post I've demonstrated that Single Decree Paxos is simpler than Multi-Paxos & Raft and can be used for building strongly consistent replicated applications.

A straightforward Single Decree Paxos based key value implementation has latency and throughput similar to Etcd, a popular Raft implementation, but has better resilience among Etcd CockroachDB, RethinkDB, and TiDB.

Hence Single Decree Paxos effectively achieves the goals set in the EPaxos paper:

(1) optimal commit latency in the wide-area when tolerating one and two failures, under realistic conditions; (2) uniform load balancing across all replicas (thus achieving high throughput); and (3) graceful performance degradation when replicas are slow or crash.