Best of both worlds: Raft's joint consensus + Single Decree Paxos

Paxos is a family of protocols used to build CP-distributed data structures. The Paxos Made Simple paper describes two members of that family: Single Decree Paxos for building write-once distributed registers and Multi Paxos for building distributed append-only logs.

The simplest member, Single Decree Paxos, seems impractical because "Paxos Made Simple" doesn't tell us how to change membership. Until we know it, we can't replace failed nodes, and since failures happen — the system eventually becomes unavailable. In this post, I took Raft's idea of joint consensus to manage membership, adjusted it to Single Decree Paxos and proved its correctness.

One might say that it isn't important because nobody wants to build systems on top of write-once distributed registers when they can use distributed logs. It's true, but a little tweak turns a write-once register into a variable one.

Distributed variables may be more appealing to practitioners than distributed logs because variables are already powerful enough to be used in the real-world systems and they don't have log-related problems like log compaction.

For example, we may use an array of distributed variables as a distributed hashtable. Since the only API of many distributed storages is a key/value API (hashtable) it proves that Single Decree Paxos is applicable to the problems which people currently solve with Raft and Multi-Paxos.

Among the popular distributed key/value solutions are:

  • Cassandra with lightweight transactions
  • Riak with consistent buckets
  • RethinkDB
  • ZooKeeper
  • Etdc

Haven't you written about it before?

Yeah, I blogged about this problem and proposed a solution in the Dynamic Plain Paxos post. The algorithm is correct, does the job, but has reliability issues. If we define reliability as the number of nodes that may fail without affecting the system, then during the transition phase of the the Dynamic Plain Paxos algorithm between $2f+1$ and $2f+2$ nodes reliability goes down from $f$ to $f-1$, and then restores back to $f$.

The Raft's joint consensus approach doesn't have this disadvantage and it's worth it to backport it to Single Decree Paxos.

Ok, tell me the details!

The algorithm is based on two principles.

Principle 1. Filter equivalence.
If a change in the behavior of the paxos cluster can be explained by delaying or omitting the messages between the nodes of the cluster, then the change doesn't affect consistency. The principle follows from the ability of paxos to tolerate interventions of such kind. It gives the freedom to change the system as long as the change can be modeled as a message filter on top of the unchanged system.
Principle 2. Reduced read quorums.
When a proposer receives majority of $f+1$ of the 'promises' it may choose to ignore one 'promise'. It doesn't affect consistency if the size of the cluster is even. I wrote about it and proved it in the Read write quorums in Paxos post.

Let's review how Paxos works, describe joint consensus, apply it to Single Decree Paxos and prove that the consistency of the system holds during the transition from a $2f+1$ to a $2f+2$ node cluster.

How Paxos works

When a proposer receives a request from a client to change the distributed variable in the $2f+1$ node paxos cluster it should:

  1. Ask acceptors to provide the current value (send the 'prepare' message).
  2. Wait until majority of the acceptors (at least $f+1$ nodes) answer.
  3. Select the value with the maximum ballot number.
  4. Change the value.
  5. Send the new value back to the acceptors.
  6. Wait until a majority of the acceptors (at least $f+1$ nodes) answer.
  7. Send confirmation to the client.

To read a value, a proposer should execute the same algorithm but keep the value untouched on the 4th step.

Please, see the How Paxos works for details.

Joint consensus

Joint consensus is a way of changing set of acceptors in the paxos cluster without affecting the linearizability. Suppose we want to make a transition from old set of acceptors O to the new one N. To make it we:

  1. Make proposers work and wait confirmation of the both set of acceptors simultaneously.
    The old set of acceptors guarantees linearizability, the new set of acceptors make the system behaves as N-acceptors paxos cluster for the future requests.
  2. Read and write all previously written key/value pairs.
    After we do it all the current and future accepted key/value pairs will accepted by the system which behaves as N-acceptors paxos cluster. So the state of the system is equal to the some state of N-acceptors paxos cluster. Since the current system has the state and the behavior of the N-acceptors paxos cluster then linearizability is now guaranteed by the N set of acceptors too.
  3. Turn off the O configuration.

Now that we got the general idea, we can dive deeper and take a look at the joint consensus inspired algorithm to change membership in paxos.

The algorithm

Let's describe how to enlarge a paxos cluster from $2f+1$ acceptors to $2f+2$ acceptors step by step. The $A_1, A_2 \cdots A_{2f+1}$ are the original acceptors and we want to add a new $A_{2f+2}$ acceptor to the cluster. The steps to achieve this are:

  1. Turn on $A_{2f+2}$ acceptor.
  2. Connect to each proposer and switch it from the regular to the transient mode between O ($A_1, A_2 \cdots A_{2f+1}$) and N ($A_1, A_2 \cdots A_{2f+2}$):
    1. Send a propose message to the $\mathbf{O} \cup \mathbf{N}$ acceptors.
    2. Wait until a majority of O (at least $f+1$ nodes) answer.
    3. Wait until a majority of N (at least $f+2$ nodes) answer.
    4. Use the reduced read quorums principle and filter out the response from $A_{2f+2}$.
    5. Select the value with the maximum ballot number.
    6. Perform a transformation of the value.
    7. Send the new value back to the acceptors.
    8. Wait until a majority of O answer.
    9. Wait until a majority of N answer.
    10. Send a confirmation to the client.
    Since the modification of the proposer behavior can be explained by a filter, the filter equivalence principle guarantees that the system keeps working as a valid O cluster. The reduced read quorums principle guarantees that the system works as a valid N cluster
  3. Connect to each acceptor, get the list of the known keys up to this moment and perform a read for each key. It converts the current state into a N-reachable state via the read operation, so consistency isn't impacted.
  4. Connect to each proposer and switch it to the N-regular mode. If system is in a N-reachable state and works as a N system then it is a N system, so we can't affect it by switching it explicitly to that mode.

The membership change from $2f$ to $2f+1$ is much simpler. We think about the $2f$ node system as a $2f+1$ node system where all the messages to the new node are filtered out. So we just ask each proposer to switch itself to the new configuration. The filter equivalence principle guarantees correctness.

To exclude a node from the cluster, we should follow the same algorithms but in the reversed order.

In case something goes wrong during the cluster extension, we can always go back to the previous configuration or just pause the extension, fix the problem and resume it later. Since the extension doesn't affect reliability we don't have to finish it as fast as possible.