Abstract. Papers on leaderless consensus protocols such as EPaxos and Atlas describe the protocols from the ground up. This approach helps a reader without prior knowledge to understand how they work, but at the same time it makes it hard for an expert to grok them because they need to trace and check all the assumptions from scratch rather than rely on already proved protocols. This may explain why the leaderless consensus protocols didn't get traction in the industry despite being known since 2013.
In this post I present a leaderless consensus protocol built in a modular way on top of known consensus protocol. Its modularity helps in understanding and retrofitting into existing solutions.
From an engineering perspective a consensus protocol is a replication protocol. Usually it is used to replicate a state of an application (a state machine) to increase its availability. The key difference between consensus and primary-backup replication is an ability to tolerate node and network failures without losing integrity of the data.
Ways to replicate common data structures via consensus
|Log||Multi-Paxos||Paxos Made Live - An Engineering Perspective (2007)|
|Raft||In Search of an Understandable Consensus Algorithm (2014)|
|Write once register||Synod||Paxos Made Simple (2001)|
|Rewrite register||CASPaxos||CASPaxos: Replicated State Machines without logs (2018)|
|Active Disk Paxos||Active Disk Paxos with infinitely many processes (2002)|
One may think of consensus replication as a strongly consistent variant of master-master replication where any replica (a proposer) can initiate a change. Althrouth consistency is never compromised in practice the concurrent processes may abort each other and stall the performance. In literature this situation is known as dueling proposers.
"It’s easy to construct a scenario in which two proposers each keep issuing a sequence of proposals with increasing numbers, none of which are ever chosen", Paxos Made Simple
"The presence of two concurrent leaders can prevent the commitment of any consensus decision and hence prevent progress", Deconstructing Paxos
"if multiple processes detect that there is no leaser and all attempt to become leader an the same time, then none of the process is likely to succeed (again dueling proposers)", Google SRE book
The can't be dismissed as a fluke, take latency of CASPaxos when all requests to change the same object land the same replica and latency of CASPaxos when the requests are uniformly distributed across all five replicas.
We see that when there are 20 clients then the dueling reduces throughput 2.85 times from 4513 ops/s to 1581 op/s and increases p99 latency 10.37 times from 4467 microseconds to 46 milliseconds.
The absolute numbers and the degradation ratio depend on the hardware. For this and all the follow up experiments I simulated
5+c nodes cluster (where
c is the number of the clients) with around 178 us RTT between any nodes, 655 bytes/us network throughput, 2 ms fsync disk latency and 1132 bytes/us disk throughput, although I haven't pushed the system to the limit of the bandwidth so the throughput numbers for disk and network don't really matter.
The shape of the charts surprised me. Usually when a system reaches saturation the latency looks linear while throughput reaches plateau and stops growing. But in this case the shapes are inverted which implies that the system is far from saturation and that we can keep adding clients increasing throughput without affecting latency. It happens because CASPaxos replicates state instead of the commands so a proposer can apply multiple commands in chain to the stored state and replicate only the final version (batching on steroids) reducing the load on the network.
Leslie Lamport suggested using a leader election abstraction to select a leader (a distinguished proposer) which is the only one capable of processing requests. It's worth noting that the idea of a leader is used to reduce contention and not to enforce consistency. Meaning the leader election may be weak and may not prevent the case where several replicas become leaders for the same period of time. It allows the use of a large class of protocols for leader election including randomised timeouts & heartbeats, consistent hashing routering or external arbiter.
But the idea of leader election has its flaws:
I built a leadered version of consensus replication and used fault injection to demonstrate the issues. The stable leadership is compared with ad hoc leadership which is known to suffer from dueling proposers under normal circumstances. The leaderd system uses 4ms election timeout while the fault injector isolates a random replica for 12ms.
For 20 clients the fault injections cause the stable leadership to have 2.46 times lower throughput (709 ops/s vs. 1750 ops/s) and 1.5 times higher p99 latency (60ms vs. 40ms) than an ad hoc leadership approach.
It looks like we have to choose between scylla and charybdis: either to have 6.36 times lower than optimal throughput (4513 ops/s vs 709 ops/s) during unavoidable transient issues under strong leadership or to have throughput profile independent of failures (1581 ops/s and 1750 ops/s) but both 2.85 times less than optimal under ad hoc leadership.
The question is if there is a compromise between being fast without failures and being fast on a rainy day. The family of leaderless consensus protocols fills that niche.
EPaxos solves the contention problem and provides excellent fault tolerance. But despite being around for seven years it hasn't generated enough traction and I don't know any product in the industry which is using it. The lack of interest is especially noticeable if we compare EPaxos to Raft protocol which was discovered around the same time.
|EPaxos||There Is More Consensus in Egalitarian Parliaments (2013)|
|Atlas||State-Machine Replication for Planet-Scale Systems (2020)|
|Gryff||Gryff: Unifying Consensus and Shared Registers (2020)|
The lack of popularity may be attributed to the complexity. I've found several reports on implementing EPaxos: Reproducing EPaxos by Robin Qiu and EPaxos by Blake Eggleston, and both of them mention the complexity of EPaxos:
One major drawback of the algorithm is its complexity. From our understanding, EPaxos is more like a two-dimensional Paxos on top of an eventual-consistency-like layer.
We only implemented the basic EPaxos, and it is already much more complicated than Raft. We believe the fully optimized EPaxos will be even more complicated. In practice, complicated algorithms are hard to design, optimize, and maintain, making them less likely to be adopted in production.
It’s still pretty rough, I just wanted to get it to a point where we could get a feel for the performance advantages and decide if the additional complexity was worth it
Funny enough the complexity of the protocol manifested in the official reference implementation of the EPaxos and TLA+ proof having a bug which leads replicas to diverge and break the linearizability of the replicated service.
This post is an attempt to design a leaderless consensus protocol but keep it simple via the modular approach by reusing already proved components.
The core idea of the pacified consensus is to order client requests and to let replicas to co-execute the same request in parallel. The preordering reduces contention and parallel execution increases fault tolerance. Even when one replica gets isolated or goes offline another finishes its request without a delay because it is already working on it in parallel.
The parallel execution doesn't affect correctness because the effects of the parallel execution can be explained by the selective duplication of messages and Paxos tolerates this kind of intrusion
"Messages can take arbitrarily long to be delivered, can be duplicated, and can be lost, but they are not corrupted", Paxos Made Simple
The task of ordering client requests in a fault tolerant way and replicating it to the replicas sounds like a consensus problem. It is natural to suggest that running two consensus replication processes can't be faster than running just one but actually it is.
The trick is that a major part of the latency comes from the disk IO. When we aim for high throughput we should make small delays to accumulate writes (4 ms) and write them in batches even when we use NVM.
4 ms is 20 times higher than intra datacenter RTT, so in-memory consensus in front of on-disk consensus doesn't contribute much to the overall end to end latency but completely eliminates contention.
This idea isn't just wild speculations, I've built it and measured the performance. And the leaderless consensus performs on par with leadered when there are no faults, latency is just 23% worse but throughput is 36% better.
But when we inject the faults latency wise the leadeless consensus protocol becomes almost an order of magnitude better than leadered or ad hoc solutions.
|Metric||Leaderless||Leadership||Ad hoc leadership|
|p99 latency w/o faults||5 ms||4 ms||46 ms|
|p99 latency w faults||9 ms||60 ms||40 ms|
|throuthput w/o faults||6154 ops/s||4513 ops/s||1581 ops/s|
|throuthput w faults||5392 ops/s||709 ops/s||1750 ops/s|
The numbers for throughput are far from saturation and just represent behaviour of the system under the same close loop load caused by 20 clients.
In the next sections we review the optimisations contributed to the improvement of the performance.
Pacified consensus is about running two layers of consensus protocols with different disk access policies. It doesn't matter if it's in-memory Raft in front of on-disk Raft or in-memory Multi-Paxos in front of on-disk Multi-Paxos. To keep things compact I focus on CASPaxos so it's a prerequisite to read the paper about it.
The ad hoc leadership consensus is a straightforward CASPaxos implementation expressed in a js-like pseudocode. The differences are retrying timeouted commands, using vector clocks to prevent execution of the same commands twice and batching to execute multiple commands per a replication cycle.
The leaderless consensus is basically the same code with an exception of synchronization of commands and read state before execution. The last bit is necessary because in case of a failure a consensus based replication may exist in a transient state where multiple outcomes are possible, without a synchronization different proposers executing the same command may choose different states and come with different outcomes.
|Ad hoc leadership consensus||Pacified leaderless consensus|
Just like with regular Multi-Paxos the next prepare may piggyback on the last accept phase and reduce the number of round trips by eliminating prepare and synchronization of state phases in most cases.
When proposers try to suggest and synchronize a batch of new commands using in-memory consensus only one proposer wins while the others read that batch and co-execute it. In case of replication over five replicas on average each proposer wins every 5th round so the requests take 5 fsyncs to execute and the latency becomes too high.
The picture is similar to a bitcoin-like blockchain, the requests are transactions, the proposers are miners, the command synchronization is mining and co-execution is validation. Bitcoin solves the problem of low per miner rate of block generation by sending transactions to multiple miners. Because from a client perspective it doesn't matter which particular miner executes it. Pacified consensus utilizes the same idea.
Proposers take a batch of incoming client request, timestamp them with vector clocks, gossip to another proposers, combine a batch with incoming gossips to form a consolidated incoming batch and then start the synchronize / co-execution loop until the vector timestamp of the executed commands catches up with the timestamp of the consolidated incoming batch. Since the same commands may be part of different consolidated upcoming batches the proposers use vector clocks during the co-execution phase to filter out already executed commands.
The gossip is embedded into the prepare phase of the synchronization so the total number of round trips isn't increased.
Even though synchronization of the commands happens in-memory it still may encounter conflicts leading to multiple retries and increasing latency. It is important to discuss how to reduce the conflicts. Oddly enough we may use the technique similar to the eliminating conflicts of on-disk paxos - co-execution of requests.
A proposer passes a list of commands during a prepare phase and uses a combination of an epoch and a hash of the list as a ballot number. In that way once a concurrent proposer encounters a conflict it could pick up a ballot of a contender, its value and co-execute it. But unlike the on-disk paxos it doesn't need to synchronize the read state because the consolidated incoming batch of the commands of the next epoch semantically doesn't depend on the previous epoch.
The same technique is used with synchronization of the state (its ballot is a combination of epoch and read acceptedAt).
The use of in-memory paxos looks like a potential safety violation because paxos depends on stable storage and in case of a failure all data stored in volatile memory is lost.
The violation happens when different proposers choose different consolidated incoming batches or read state for the same epoch. To prevent it pacified consensus use an eon as another dimension to ballot numbers and epoch. When a proposer loads or restarts it reads an eon, increments it, saves to local storage, generates a ballot number with that eon, executes prepare phase using the ballot and only then starts serving the requests. Once a proposer encounters a ballot with higher eon, it should update its eon in the storage before serving a request.
This fencing mechanism is expressed in terms of ballots numbers and doesn't compromise their uniqueness or monotonicity so by definition of paxos correctness can't be affected. From a client perspective a change of an eon causes requests originated in a previous eon to timeout.
Heidi Howard and Richard Mortier demonstrated in their paper Paxos vs Raft: Have we reached consensus on distributed consensus? that Paxos and Raft take a very similar approach to distributed consensus, differing only in their approach to leader election. With pacified consensus every replica becomes a leader so the need for leader election goes away and Raft becomes almost Paxos. It makes possible to apply the idea of pacified consensus up to almost any Raft-base product.
Once we get rid of heartbeats, leases and leader election and add the command / state synchronization layer Pacified Raft becomes almost indistinguishable from Pacified CASPaxos. The main difference is that CASPaxos replicates state and Raft replicates a log of commands so in case of staleness Raft needs to fetch missing commands to fill the gaps before executing new commands.
When the state is heavy (e.g. a fsm per the whole database) one should choose log replication but when it's tiny (e.g. a fsm per key in a key value storage) the choice between state and logs is a choice between engineering complexity of implementing multi-key transactions and engineering complexity of managing log such as snapshotting and compaction.
|Pacified CASPaxos||Pacified Raft|
Just like with Paxos the prepare, state synchronization and fetching phases may be skipped when there is no gaps between current epoch and previous epoch (
The core idea of the pacified consensus is trivial, let the in-memory CASPaxos consensus order all the incoming requests, and then let proposer to co-execute them in sequences. Co-execution increases fault tolerance while sequential execution eliminates dueling of proposers.
Since both consensuses differ only by storage they use the overall complexity is no greater than Paxos which we know is made simple 😃 The proof of correctness is based on well known properties of Paxos such as duplication of network messages.
The lowest latency is achieved when the core idea is combined with several known and novel optimizations:
The glueing all the parts together is a straightforward engineering so now it looks like we get an understandable leaderless consensus protocol.