It seems that industry reached consensus on consensus:
Despite the existing literature in the field, building such a databaseproved to be non-trivial.
In this post:
Raft is an algorithm which describes how to build a replicated consistent fault-tolerant log. It uses such building blocks as:
Philosoraptor isn't the only one who noticed this fact. The folks behind the the EPaxos paper paper asked the same question. They observed that Multi-Paxos has a defect:
if the master fails, then the system cannot serve requests until a new master is elected.
The same flaw affects Raft. The reason why EPaxos paper doesn't mention it is because the Raft and EPaxos papers were published in the same year.
I picked four modern Raft-based distributed systems (Etcd, CockroachDB, RethinkDB & TiKV) and tested them with default settings to demonstrate that they all are affected by the mentioned issue.
|Crashed leader||Isolated leader||Version|
|RethinkDB||0.6s||15s||rethinkdb 2.3.5~0xenial (GCC 5.3.1)|
|TiDB||18s||2m 40s Bug||pd: f5744d7
I tested the systems with the default settings. Of course, a test with the defaults settings is the test of the default settings so it can't be used to decide which system handles the leader issues better. Each system can be tuned to have a lesser unavailability window. However, the lesser leader election time means the more sensibility to the network glitches. The point of the test is to demonstrate that all Raft implementations have this tradeoff.
Later in post I'll show that the tradeoff isn't essential to the problem of consensus and it can be avoided with a different consensus algorithm.
The systems were tested on the cluster of 4 machines. Three nodes hosted a storage, and the 4th was a client. The client used three coroutines, each of them opened a connection to its dedicated node of a storage and were executing the following loop:
Each coroutine used its own key to avoid collisions. If there was an error during the loop, then it closed the current connection, opened a new one and began next iteration of the loop.
Once in a second (100ms in the case of 10x) the app dumped the number of successful iterations since the last dump per cluster and each node. Such simple metrics helped to analyze a behavior of a cluster when a leader was disturbed.
I picked a leader empirically as a node with the highest rate of successful iterations.
261 51 16 16 19 262 44 14 13 17
175 74 23 23 28 176 72 22 21 29
146 489 217 134 138 147 512 234 133 145
498 426 120 171 135 499 436 126 171 139
In the case of Etcd and RethinkDB the 3rd node is the leader, in the case of CockroachDB it's the 1st node and with TiDB it's the 2nd. Let's kill a leader and see how it affects the health of the cluster.
266 39 13 12 14 267 4 1 1 2 268 0 0 0 0 ... 286 0 0 0 0 287 23 13 10 0 288 28 15 13 0
179 68 21 21 26 180 61 20 19 22 181 0 0 0 0 ... 186 0 0 0 0 187 41 23 18 0 188 42 23 19 0
150 549 250 143 156 151 410 186 109 115 152 0 0 0 0 ... 161 0 0 0 0 162 106 0 106 0 163 221 0 167 54 164 310 0 188 122
501 476 134 192 150 502 133 37 54 42 503 0 0 0 0 ... 517 0 0 0 0 518 57 56 0 1 519 211 146 0 65 520 294 163 0 131
You can see that the duration of the unavailability windows corresponds to the data provided in the first table.
For more information see the tests repository.
One may think that since all the tested systems are affected by this issue then probably it's inevitable for all the consensus protocols.
Luckily, it isn't true. Epaxos is an example of a leaderless consensus protocol, but it isn't the only one - it's pretty easy to come up with a design of a "leaderless" system built on top of known battle-tested ideas.
For example, instead of using a distributed log as an Event Sourcing backend for building a key/value storage as an interpretation of an ordered stream of updates we can run a distributed log instance per key. It distributes different leaders across different nodes to achieve the same effect as a leaderless consensus does: a failure of one node doesn't affect the unavailability of the whole cluster.
A log per key looks like overkill. Hopefully, we have a simpler alternative: Single Decree Paxos (Synod). Multi-Paxos (a log) is an abstraction on top of an array of write-once registers implemented with the Single Decree Paxos algorithm, but nothing prevents us from using the same algorithm to build a rewritable register. Once we have it then creating a key/value storage becomes a trivial task since a key/value storage is by definition is a tagged set of rewritable registers.
As far as I know, there isn't a computer science paper dedicated solely to the topic of building a rewritable register on top Single Decree Paxos. But the idea is known both in the industry (see TreodeDB) and in academia (Heidi Howard, Cambridge Ph.D. student, described it in her presentation and going to cover in the Ph.D. thesis).
I practiced logic and proved it in the How Paxos works post, but the proof wasn't peer reviewed so use it at your risk :) However, it was enough for me, so I built Gryadka, a toy key/value storage of 500 line of code, and extensively tested it with the fault injection technique.
Besides the consistency testing, I measured the unavailability window by using the same method I used with Etcd, CockroachDB, RethinkDB and TiDB. Since Gryadka is based on a leaderless (multi-leader) consensus protocol it demonstrated zero downtime both when a node was crashed and when it was isolated.
182 435 147 143 145 183 435 146 144 145 184 412 116 148 148 185 296 0 149 147 186 309 0 154 155 187 289 0 145 144
94 465 152 156 157 95 455 151 149 155 96 453 143 154 156 97 318 0 157 161 98 292 0 144 148 99 290 0 144 146
It's interesting to compare performance of Gryadka with an acknowledged solution, so I chose Etcd.
Both systems were tested on a 4-nodes cluster (3 nodes for storage + 1 client) and demonstrated similar results:
|Avg Latency||Max Latency||Requests/sec|
The systems were tested with a similar workload: 8 threads/coroutines, each in a loop reads a value (by its own key to avoid collisions), increments it and writes back.
Usually, linearized reads and linearized writes have similar characteristics, so the statistics aggregate requests of both type and you should treat 4720rps roughly as 2360 reads and 2360 writes per second, the same applies to the latency parts too.
Etcd was tested with the default parameters. Gryadka doesn't have parameters to tune but the underneath Redis instances were configured to use AOF persistence mode and to flush data to the disk after every request.
Despite the similarities in the workload some other aspects of the experiments, like client-server topology, were different. Etcd's client was talking only with a leader who in its own turn propagates the changes to the followers. Gryadka's client was acting as a proposer and was directly communicating with all the Redis instances.
Complexity has formal and informal definitions. Let's see how different interpretations of complexity align with Single-Decree Paxos.
In algorithmic information theory, the Kolmogorov complexity of an object, such as a piece of text, is the length of the shortest computer program (in a predetermined programming language) that produces the object as output.
When it is applied to the algorithms, it means that an algorithm with the shortest implementation is simpler.
In the 500 Lines or Less book Dustin J. Mitchell writes about Multi-Paxos:
In early drafts of this implementation, I implemented support for membership changes. This seemingly simple change introduced a great deal of complexity ... The result was far too large for this book!
Gryadka has less than 500 lines of code but supports membership change, so Single Decree Paxos seems simpler than Multi-Paxos.
An intuitive perception of complexity correlates with the number of parts: the more parts a jigsaw puzzle has, the more complex it is. Tobias Schottdorf from Cockroach Labs admits several concerns in implementing Raft algorithm, among them are log truncation and snapshotting. Single Decree Paxos isn't a log based, so it doesn't have such parts and hence may be simpler than Raft.
Complexity can be subjective e.g. if something is hard, if it takes a lot of time or effort then it seems complex. Ronald Duncan writes about his experience of implementing Raft that it wasn't easy and took several months. The same I can tell about Gryadka.
I planned to finish it in a couple of days, but the whole endeavor lasted a couple of months, the first version has consistency issues, so I had to mock the network and to introduce fault injections to catch the bugs. So it seems that from the position of subjective complexity both algorithms are sophisticated.
Single Decree Paxos seems simpler than Raft and Multi-Paxos but remains complex enough to spend months of weekends in chasing consistency.
In this post I've demonstrated that Single Decree Paxos is simpler than Multi-Paxos & Raft and can be used for building strongly consistent replicated applications.
A straightforward Single Decree Paxos based key value implementation has latency and throughput similar to Etcd, a popular Raft implementation, but has better resilience among Etcd CockroachDB, RethinkDB, and TiDB.
Hence Single Decree Paxos effectively achieves the goals set in the EPaxos paper:
(1) optimal commit latency in the wide-area when tolerating one and two failures, under realistic conditions; (2) uniform load balancing across all replicas (thus achieving high throughput); and (3) graceful performance degradation when replicas are slow or crash.