Visualization of serializable cross shard client-side transactions

TL;DR step-by-step visualization of serializable cross shard transactions

In 2012 I wrote the Transactions in MongoDB, Cassandra, Zookeeper and others post where I described how to implement serializable multi-key client-side transactions for any key/value storage if it supports per-key linearizability and compare-and-set update operation.

In the original post I focused on the distributed storages but the same algorithm can be applied to the traditional RDBMS like MySQL or PostgreSQL. You may wonder why somebody needs to do it if a RDBMS already supports transactions.

In order to implement the client-side transactions we don't require the key/value pairs participating in the transaction to be aware about each other. It means that they can live on different nodes, so we can run a fistful of MySQL servers, shard the data between them and use the proposed solution to perform serializable cross shard transactions.

You can think about the client side transactions as a replacement to the XA transactions. The interesting part is that a vendor shouldn't support XA explicitly, so you can easily do cross MySQL cross PostgreSQL cross Redis transactions. It's hard to justify such polyglot persistence madness but it can be handy when you're migrating from one database engine to another.

Alternatively we still can take a large distributed storage supporting CAS and per-key linearizability. The list of such systems increased since 2012 and now we can choose from:

  • Cassandra with lightweight transactions
  • Riak with consistent buckets
  • RethinkDB
  • ZooKeeper
  • Etdc
  • HBase
  • DynamoDB
  • MongoDB

Use cases

There are three major use cases for the client side transactions. The first one is to use it within a single distributed or sharded key/value store to perform atomic multi-key updates. The second is to use transactions to migrate from one key/value store to another without the downtime. The last major use case is to reshard the sharded key/value store.

Applications and history

I wasn't the only one who noticed such approach. In 2013 Jeff Barr wrote how to perform transactions over DynamoDB and in 2015 the CockroachDB guys wrote how they apply the same idea for the transactions in CockroachDB.

Obviously I also wasn't the first. Google wrote about this idea in the Large-scale Incremental Processing Using Distributed Transactions and Notifications paper in 2010 and the CockroachDB folks told that they were inspired by the works of Maysam Yabandeh.

I bet the algorithm is actually even older since it's a straightforward application of the lock-free ideas to the database world.

Why it's hard to think about distributed systems

Yet the idea is very simple it's still hard to explain it in a way that its correctness is obvious because it's unnatural to people to think about concurrent systems.

People demonstrate a lot of fallacies and limitations when the try to think about events and time (the essence of the distributed systems). Among them are:

What can we do about it

The natural inability to think about the concurrent systems is the reason why I was facinated by The Deadlock Empire game. I heard about it first in the devzen.ru podcast. In this game you act on behalf of a scheduler and try to schedule the execution of the threads in order to archive the desired state of the system.

When you act on behalf of the scheduler you convert the concurrent system with time into the sequential system with causal relation. The latter is much simpler. So I thought that the same effect can be used to simplify the understanding of the distributed systems. As a result I created the step-by-step visualization of the client side transactions where you act as a scheduler and control the execution of the clients.

The visualization

In the visualization you control the execution of the two clients. Both clients are trying transactionally swap the values. The first client is going to swap the values corresponding to the "a" and "b" keys and the second is trying to swap the "b" and "c" keys. Since the keys are intercepting then there is a contention and depending on the scheduling the different outcomes are possible.

I hope it will somebody to understand the distributed transactions and to build more reliable software.

Hacker News submission  Lobsters submission