Visualization of a Paxos-based distributed state machine

The main result of the Paxos Made Simple paper by Leslie Lamport is an algorithm to transform any finite-state machine (FSM) into a fault-tolerant distributed system.

It sounds complicated, but with the step-by-step visualization and this post I'm going to demystify it.

Let us start with the FSM. It's a fancy name for a pretty wide class of applications. How wide is it? If you can model your computation as a pure function from a state and an event to a new state, and you have a fixed limit on the size of each state and event, then your computation (your application) belongs to the FSM class.
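
For example, here is a minimal sketch in TypeScript (the names and the toy counter are mine, not something from the visualization):

```typescript
// An FSM in this sense is just a pure transition function: given the
// current state and an event, it returns the new state.
type Transition<State, Event> = (state: State, event: Event) => State;

// A toy example: a counter capped at 100, so the state size is bounded.
type CounterState = { value: number };
type CounterEvent = { kind: "inc" } | { kind: "reset" };

const step: Transition<CounterState, CounterEvent> = (state, event) =>
  event.kind === "inc"
    ? { value: Math.min(state.value + 1, 100) }
    : { value: 0 };

console.log(step({ value: 41 }, { kind: "inc" })); // { value: 42 }
```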

You can guess that almost any web application using a three-tier architecture matches the description: the stateless servers play the role of the pure function, while the size of the database is limited by the storage capacity.

Now it should be clear why Paxos is important: it allows us to convert almost any application into a distributed system, achieving more reliability than each individual node provides.

The paper was written in 2001 and it contains several ground-breaking ideas:

  • Leslie Lamport demonstrated how to build two linearizable distributed data structures:
    • a distributed write-once register (based on his Synod protocol)
    • a distributed append-only log (known as Multi-Paxos, based on the sequence of distributed write-once registers)
  • He suggested using a distributed append-only log to store the sequence of all events leading a system to its current state, instead of storing the state itself (this idea was later rediscovered as the event sourcing pattern; see the sketch below).
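
In code, the event sourcing idea boils down to folding the transition function over the log. A minimal sketch, with my own names:

```typescript
type Transition<State, Event> = (state: State, event: Event) => State;

// The current state is a fold (reduce) of the transition function
// over the full event log, starting from the initial state.
function replay<State, Event>(
  step: Transition<State, Event>,
  initial: State,
  log: Event[],
): State {
  return log.reduce(step, initial);
}

// With a replicated append-only log, every node that replays the same
// log from the same initial state arrives at the same current state.
```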

In 2016 it feels natural to use logs for FSMs (see ZooKeeper, Etcd, Raft). But logs aren't the first thing I want to work with as an engineer, because they force me to implement garbage collection and snapshots (nasty stuff; see the Paxos Made Live paper by Google). It's also possible to imagine distributed data structures other than logs modeling an FSM. I wonder if they really exist and why FSMs implemented on top of them aren't widely known...

  • They do exist. For example, TreodeDB uses a Synod-based distributed rewritable register. Interestingly, I haven't seen any academic paper dedicated to implementing an FSM on top of the distributed rewritable register data structure (a distributed variable), but there are industrial systems like riak_ensemble and TreodeDB which use it. So I practiced my Math-Fu and described and proved an FSM over Single Decree Paxos (a flavor of distributed variable) in the How Paxos Works post.
  • IMHO the log-based solutions are more popular because they had a head start. Until Vertical Paxos was published in 2009, log replication was the only consensus protocol capable of cluster membership change, so it was the only practical solution for highly available services. But now we have a few more algorithms to choose from, for example Vertical Paxos and Raft's joint consensus membership change. Take a look at the Best of both worlds: Raft's joint consensus + Single Decree Paxos post if you're interested in understanding how dynamic membership works with a distributed variable.

Visualization

Now that we understand there is a place for alternative distributed FSMs, let's discuss the visualization. We're going to visualize the work of a distributed FSM managing an intercontinental ballistic missile launch system.

FSM

Imagine that we want to model an intercontinental ballistic missile launch system with the following rules:

  1. The missile is launched when at least two generals have signed off on the launch.
  2. Until the missile is launched, a general can revoke their sign-off.

The (non-distributed) FSM engine may look like this:
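
A minimal TypeScript sketch of such an engine (my own rendition, not the code from the visualization):

```typescript
// A minimal FSM engine: it owns the current state and applies pure
// transition functions to it, one event at a time.
type Transition<State, Event> = (state: State, event: Event) => State;

class FsmEngine<State, Event> {
  constructor(
    private state: State,
    private readonly step: Transition<State, Event>,
  ) {}

  // Apply an event and return the new state.
  execute(event: Event): State {
    this.state = this.step(this.state, event);
    return this.state;
  }
}
```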

And the state transition functions are:
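
A possible rendition of them for the two rules above (again my own sketch, not the code from the visualization):

```typescript
// State for the launch rules: who has signed off, and whether the
// missile has already been launched.
type LaunchState = { signed: Set<string>; launched: boolean };

type LaunchEvent =
  | { kind: "sign"; general: string }
  | { kind: "unsign"; general: string };

function transition(state: LaunchState, event: LaunchEvent): LaunchState {
  if (state.launched) return state; // after launch nothing can change

  const signed = new Set(state.signed);
  if (event.kind === "sign") signed.add(event.general);
  else signed.delete(event.general); // rule 2: revoke before launch

  // rule 1: the missile launches once at least two generals signed off
  return { signed, launched: signed.size >= 2 };
}
```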

The rules define the possible states of the system, while the actual state is determined by the generals' behavior. In our example it's quite simple (a code sketch follows the list):

  1. unsign the launch
  2. sign the launch
  3. goto 1
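
Wired to the engine and transition function sketched above, one iteration of this loop might look like this:

```typescript
// Reuses FsmEngine, LaunchState, LaunchEvent and transition from the
// sketches above. One iteration of a general's behavior: revoke the
// sign-off, then sign again.
function generalStep(
  engine: FsmEngine<LaunchState, LaunchEvent>,
  name: string,
): LaunchState {
  engine.execute({ kind: "unsign", general: name });
  return engine.execute({ kind: "sign", general: name });
}

const engine = new FsmEngine<LaunchState, LaunchEvent>(
  { signed: new Set<string>(), launched: false },
  transition,
);

generalStep(engine, "Lee");                        // one signature so far
console.log(generalStep(engine, "Ross").launched); // true: two sign-offs
```
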
Distributed FSM

A distributed FSM has several endpoints (think of nodes running the execute REST API), but it doesn't matter which endpoint you choose to work with, because the system behaves exactly like a system with several endpoints leading to the same (non-distributed) FSM. The only difference is that the distributed version runs on 2n+1 nodes and tolerates up to n failures.

Visualization

The visualization consists of two code columns and a side view with (a) the persistent state, (b) a watch list with all variables from the stack frames, and (c) a network interface to control the ordering and loss of cross-node messages.

The first column contains the code of a distributed Single Decree Paxos FSM engine (37 lines of code), while the second column is dedicated to the state transition functions and the generals' behavior (29 lines of code).

You can iterate through the visualization by clicking the start/step buttons in the upper-right corner next to the 'General Lee' and 'General Ross' labels. General Lee works with the 0th proposer and General Ross uses the 1st proposer.

It's worth mentioning that the state transition functions weren't changed at all, so it is extremely easy to convert an existing FSM into a DFSM. Incidentally, the actions in the example look like message handlers in the actor model, so an FSM over a distributed variable looks like a great way to store the state of actors in frameworks like Akka or Orleans.

FAQ
The Paxos Made Simple paper describes the proposer's, acceptor's, and learner's roles in a Paxos system, but the visualization uses only the first two. How does it work without learners?

Leslie Lamport wrote in the Paxos Made Simple paper:

The obvious algorithm is to have each acceptor, whenever it accepts a proposal, respond to all learners, sending them the proposal.

In the visualized Paxos, when an acceptor accepts a value it responds to the proposer, so the proposers de facto play the learner's role.
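
A minimal sketch of that shortcut, with made-up message shapes (not the visualization's code):

```typescript
// The proposer doubles as the learner: it collects 'accepted' replies
// and considers the value chosen once a majority quorum accepted it.
type AcceptedReply = { ballot: number; value: string };

function chosenValue(
  replies: AcceptedReply[],
  ballot: number,
  clusterSize: number,
): string | null {
  const quorum = Math.floor(clusterSize / 2) + 1; // majority of acceptors
  const acks = replies.filter((r) => r.ballot === ballot);
  return acks.length >= quorum ? acks[0].value : null;
}
```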

I heard that Paxos uses (a) a leader to serialize the updates and (b) leader election to be fault-tolerant. How does the visualization work without a leader?

Leadership isn't an essential part of any Paxos system (the Synod protocol). Even if each proposer "thinks" that it is the leader, the protocol still guarantees safety. Leadership is an optimization that reduces the number of round trips per state transition from 2 to 1 and lowers the risk of concurrent updates terminating each other.

The idea behind the optimization is to let a proposer piggyback the 'prepare' message for the next ballot on the current 'accept' message, even if the next client hasn't requested a change yet. The optimization preserves safety because every execution of the optimized version has the same behavior as some execution of the untouched version.
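
A rough sketch of the acceptor's side of this optimization, with made-up names and message shapes:

```typescript
type Ballot = number;

interface AcceptorState {
  promisedBallot: Ballot;
  acceptedBallot: Ballot | null;
  acceptedValue: string | null;
}

// Without the optimization an update costs two round trips:
// prepare/promise, then accept/accepted. Here the acceptor piggybacks
// a promise for the next ballot on its 'accepted' reply, so the same
// proposer's next update can skip the prepare round (one round trip).
function onAccept(
  acc: AcceptorState,
  ballot: Ballot,
  value: string,
): { accepted: boolean; promisedNextBallot?: Ballot } {
  if (ballot < acc.promisedBallot) return { accepted: false };
  acc.acceptedBallot = ballot;
  acc.acceptedValue = value;
  acc.promisedBallot = ballot + 1; // the piggybacked promise
  return { accepted: true, promisedNextBallot: ballot + 1 };
}
```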

The optimization works if all proposers cooperate and use one of them as the leader. Usually this is done with one of the gossip protocols.

I didn't include leadership in the visualization because it isn't essential to the Synod protocol and can be implemented at the next abstraction layer. Hopefully the same idea can be applied to the visualized Single Decree Paxos without modifications.


P.S. Thanks to Gleb Smirnov for a review and helpful suggestions!