Visualization of a Paxos-based distributed state machine
Rystsov Denis (@rystsov), May 10, 2016
The main result of the Paxos Made Simple paper by Leslie Lamport is an algorithm to transform any finite-state machine (FSM) into a fault-tolerant distributed system.
It sounds complicated, but with the step-by-step visualization and this post I'm going to demystify it.
Let us start with the FSM. It's a fancy name for a pretty wide class of applications. How wide? If you can model your computation as a pure function from a state and an event to a new state, and there is a fixed limit on the size of each state and event, then your computation (your application) belongs to the FSM class.
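To make the definition concrete, here is a minimal sketch (the counter, its events, and all names are my own illustration, not anything from the post): a pure function takes a small bounded state and a small bounded event and returns a new state.

```python
# Illustrative sketch of the FSM definition: a pure transition function
# from (state, event) to a new state. Both values are small and bounded.
def transition(state, event):
    if event == "increment":
        return state + 1
    if event == "reset":
        return 0
    return state  # unknown events leave the state unchanged

# Running the machine is just folding the events over the function.
state = 0
for event in ["increment", "increment", "reset", "increment"]:
    state = transition(state, event)
```

Because the function is pure, replaying the same events from the same initial state always yields the same final state, which is exactly the property replication relies on.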
You can guess that almost any web application using a three-tier architecture matches the description: the stateless servers play the role of the pure function, while the size of the database is limited by the storage capacity.
Now it should be clear why Paxos is important: it lets us convert almost any application into a distributed system and achieve more reliability than any individual node provides.
The paper was written in 2001, and it contains several ground-breaking ideas.
In 2016 it feels natural to use logs to implement an FSM (see ZooKeeper, etcd, Raft). But a log isn't the first thing I want to work with as an engineer, because it forces me to implement garbage collection and snapshots (nasty stuff; see the Paxos Made Live paper by Google). It's also possible to imagine distributed data structures other than logs modeling an FSM. I wonder whether they really exist, and why FSMs implemented on top of them aren't well known...
Now that we understand there is room for alternative distributed FSMs, let's discuss the visualization. We're going to visualize the work of a distributed FSM managing an intercontinental ballistic missile launch system.
Imagine that we want to model this launch system with a handful of simple rules.
A (non-distributed) FSM engine is just a loop that stores the current state and feeds incoming events to the transition function.
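The engine's actual code is shown in the visualization; the sketch below only illustrates its shape (the class, its methods, and the counter example are my own names, not the post's code):

```python
# Illustrative sketch of a non-distributed FSM engine: it holds the
# current state and applies pure transition functions to events.
class FSM:
    def __init__(self, initial_state, transitions):
        self.state = initial_state
        self.transitions = transitions  # event name -> transition function

    def execute(self, event, payload=None):
        # Replace the state with whatever the pure function returns.
        self.state = self.transitions[event](self.state, payload)
        return self.state

# A toy machine: a counter with two events.
counter = FSM(0, {"inc": lambda state, _: state + 1,
                  "reset": lambda state, _: 0})
counter.execute("inc")
counter.execute("inc")
```

Note that all the application logic lives in the transition functions; the engine itself knows nothing about the domain, which is what makes it reusable.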
The state transition functions encode the rules of the system: each one takes the current state and an event and returns the next state.
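The real transition functions are in the second code column of the visualization; the hypothetical pair below only illustrates their shape (the approval/launch rules and the state layout are my assumptions, not the post's rules):

```python
# Hypothetical transition functions for the launch example. The state
# is modeled as a set of facts ("who approved", "launched").
def approve(state, general):
    # A general approves the launch.
    return state | {general}

def launch(state, _):
    # The launch succeeds only once both generals have approved;
    # otherwise the event has no effect on the state.
    if {"Lee", "Ross"} <= state:
        return state | {"launched"}
    return state
```

Both functions are pure: they never mutate the old state, they build a new one, so the engine can safely replay or retry them.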
The rules define the possible states of the system, while the actual state is determined by the generals' behavior, which is quite simple in our example.
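In other words, the behavior is just the sequence of events each general submits; the rules then determine which states the system actually visits. A self-contained sketch, with the state shape and event encoding assumed for illustration:

```python
# Illustrative sketch: each general's "behavior" is the events it
# submits. Here the state is simply the set of generals who approved.
def transition(state, event):
    general, action = event
    return state | {general} if action == "approve" else state

# General Lee and General Ross each approve the launch.
events = [("Lee", "approve"), ("Ross", "approve")]
state = set()
for event in events:
    state = transition(state, event)
```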
A distributed FSM has several endpoints (think of nodes exposing an execute REST API), but it doesn't matter which endpoint you choose to work with, because the system behaves exactly like several endpoints leading to the same (non-distributed) FSM. The only difference is that the distributed version runs on 2n+1 nodes and tolerates up to n failures.
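The 2n+1 / n figure is plain majority-quorum arithmetic (a small sketch of my own, not code from the post):

```python
def quorum_size(total_nodes):
    # A majority quorum: with 2n+1 nodes this is n+1. Any two quorums
    # of this size must intersect, which is the property Paxos relies
    # on to stay safe while up to n nodes are down.
    return total_nodes // 2 + 1
```

For example, a 5-node cluster (n = 2) has quorums of 3 nodes, so it keeps making progress with any 2 nodes unreachable.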
The visualization consists of two code columns and a side view with (a) the persistent state, (b) a watch list with all variables from the stack frames, and (c) a network interface to control the ordering and loss of messages between nodes.
The first column contains the code of a distributed Single Decree Paxos FSM engine (37 lines of code), while the second column is dedicated to the state transition functions and the generals' behavior (29 lines of code).
You can iterate through the visualization by clicking the start/step buttons in the upper-right corner next to the 'General Lee' and 'General Ross' labels. General Lee works with the 0th proposer and Ross uses the 1st proposer.
It's worth mentioning that the state transition functions weren't changed at all, so it is extremely easy to convert an existing FSM into a distributed FSM. Incidentally, the actions in the example look like message handlers in the actor model, so an FSM over a distributed variable looks like a great way to store the state of actors from the Akka or Orleans frameworks.
Leslie Lamport wrote in the Paxos Made Simple paper
The obvious algorithm is to have each acceptor, whenever it accepts a proposal, respond to all learners, sending them the proposal.
In the visualized Paxos, when an acceptor accepts a value it responds to the proposer, so the proposers de facto play the learner's role.
Leadership isn't an essential part of a Paxos system (the Synod protocol). Even if every proposer "thinks" it is the leader, the protocol still guarantees safety. Leadership is an optimization that reduces the number of round trips per state transition from 2 to 1 and lowers the risk of concurrent updates terminating each other.
The idea behind the optimization is to let a proposer piggyback the next 'promise' exchange on the current 'accept' message, even if the next client hasn't requested a change yet. The optimization preserves safety because every execution of the optimized version behaves like some execution of the untouched version.
The optimization works if all proposers cooperate and treat one of them as the leader; usually this is arranged with a gossip protocol.
I didn't mention leadership in the post because it isn't essential to the Synod protocol and can be implemented at the next abstraction layer. Presumably the same idea can be applied to the visualized Single Decree Paxos without modifications.
P.S. Thanks to Gleb Smirnov for a review and helpful suggestions!