Paper review: Paxos, Paxos, and Chubby

September 22, 2011

I'm combining my paper reviews this time, since they are pretty closely coupled:

  • "Paxos Made Simple", Lamport, 2001
  • "Paxos Made Practical", Mazieres, 2007
  • "The Chubby Lock Service for Loosely-coupled Distributed Systems", Burrows, 2006

Paxos Made Simple

Paxos is a distributed consensus algorithm. At its essence, it's a quorum-based, fault-tolerant way of arriving at a single consistent value among a group of machines. This can be used for leader election (consensus on who's the master), or for synchronous strong consistency (replicating writes in a distributed database). In the case of Chubby, it's used for both: determining who holds a lock (leader election), and serving strongly consistent small files.

Paxos has two basic types of actors:

  • Proposers are nodes that are trying to get a value accepted as the "true" value during a round of consensus. Normally the proposer role is pretty stable, and only changes on failure.
  • Acceptors receive proposals from proposers, and vote as part of the quorum. They act as replicas for the global state of the system.

Paxos also has two rounds of communication; here I'm showing the basic version. There are four types of messages: propose, promise, accept, and accepted.

  1. Propose
    • The proposer sends a proposal to all the acceptors in the system, and waits for a reply. This proposal is tagged with a round number N, which has to increase each time this proposer makes a new proposal. A common thing to do is build this number out of the IP address and some local counter, so the numbers are globally unique.
    • The acceptors promise to accept the proposal if the proposal's N is the highest N they've seen from any proposer. A promise indicates that the acceptor will ignore any proposal with a lower number. This is so there's a total ordering on proposals; we don't care which proposer wins, we just want one to win.
  2. Accept
    • If the proposer hears back from a quorum of acceptors, it sends out accept requests to the quorum with the value it wants to set.
    • If the acceptor hasn't promised to a higher proposal number in the meantime, it tells the proposer it accepted.

If the proposer gets back a quorum of accepted messages, then it commits. At this point, it can further replicate the value to other nodes, or respond to the client that the operation worked. In practice, it's normal for all nodes in the system to fill the roles of both proposer and acceptor as needed.
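To make the message handling concrete, here's a minimal single-value acceptor sketch in Python. It's not from any of the papers; the names (Acceptor, on_propose, on_accept) are mine, and a real implementation would also need durable storage and networking.

    class Acceptor:
        def __init__(self):
            self.promised_n = -1        # highest proposal number we've promised
            self.accepted_n = -1        # highest proposal number we've accepted
            self.accepted_value = None

        def on_propose(self, n):
            # Round 1: promise to ignore anything numbered below n.
            if n > self.promised_n:
                self.promised_n = n
                # Also report the highest-numbered proposal we've already
                # accepted, if any (this matters for the detail below).
                return ("promise", n, self.accepted_n, self.accepted_value)
            return ("nack", n)

        def on_accept(self, n, value):
            # Round 2: accept unless we've promised a higher-numbered proposal.
            if n >= self.promised_n:
                self.promised_n = n
                self.accepted_n = n
                self.accepted_value = value
                return ("accepted", n)
            return ("nack", n)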

There is one detail that prevents later proposers from changing a value that has already been chosen when they issue a higher-numbered proposal. In the first round, each acceptor's promise also carries the number and value of the highest-numbered proposal it has already accepted. The proposer must then use the value of the highest-numbered accepted proposal it hears back (if any) as the value in its accept requests, instead of its own.
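A matching proposer sketch, again with invented names and direct method calls standing in for real RPCs with timeouts, might look like this. Note how it adopts the highest-numbered accepted value it hears about instead of blindly using its own.

    def run_round(acceptors, n, my_value):
        quorum = len(acceptors) // 2 + 1

        # Round 1: propose / promise.
        promises = [r for r in (a.on_propose(n) for a in acceptors)
                    if r[0] == "promise"]
        if len(promises) < quorum:
            return None  # lost to a higher-numbered proposal; retry with a bigger n

        # Adopt the value of the highest-numbered accepted proposal, if any.
        accepted = [(acc_n, acc_v) for (_, _, acc_n, acc_v) in promises
                    if acc_n >= 0]
        value = max(accepted, key=lambda p: p[0])[1] if accepted else my_value

        # Round 2: accept / accepted.
        acks = [a.on_accept(n, value) for a in acceptors]
        if sum(1 for r in acks if r[0] == "accepted") >= quorum:
            return value  # committed
        return None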

Paxos Made Practical

This paper goes into a lot of detail about how to actually implement Paxos, handling nodes leaving and joining, and building a Paxos-replicated state machine (à la an RPC server). All nodes run the same deterministic code, which transitions based on its input. Non-deterministic bits are decided externally by a single machine before being passed into the state machine. One of the machines acts as the master (proposer) in Paxos operations: it handles a client request, runs Paxos, and then executes the operation and responds to the client after a majority of nodes have logged it. Changes in group membership (the view) are handled by again running Paxos to agree on the membership.
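As an illustration of the replicated-state-machine idea (my own toy example, not the paper's code): every replica applies the same operations in the same order, and anything non-deterministic is filled in by the master before the operation is replicated.

    class KVStateMachine:
        def __init__(self):
            self.data = {}

        def apply(self, op):
            # op is fully determined by the master before replication, so every
            # replica that applies the same log ends up in the same state.
            if op["kind"] == "put":
                self.data[op["key"]] = op["value"]
            elif op["kind"] == "delete":
                self.data.pop(op["key"], None)
            return self.data.get(op["key"])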

Chubby Lock Service

This is an industry engineering paper from Google; those are always interesting reads. The basic idea of Chubby is running Paxos on a small cell of five machines to solve consensus problems. One of the five machines in the cell acts as the master and runs Paxos to decide writes; reads are all served from the master. This lets Chubby provide coarse-grained locking, where locks are expected to be held for long periods of time (minutes, hours, days). Finer-grained locking is deferred to application-level servers. Chubby's API is a simple Unix-like filesystem, on which clients can open file handles and acquire and release reader/writer locks. This is also convenient for advertising results, and can be used to store small files in a strongly consistent manner. A typical use within Google is as a better version of DNS (no worrying about stale entries and TTLs).
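For example, a client might use it for leader election roughly like this. This is a hypothetical sketch: the client object and its method names are invented, not Chubby's actual API.

    def try_become_leader(chubby, my_address):
        # Hypothetical Chubby-like client: open a lock file, try to grab the
        # exclusive (writer) lock, and advertise ourselves if we win.
        handle = chubby.open("/ls/cell/service/leader", create=True)
        if handle.try_acquire_exclusive():
            handle.set_contents(my_address)   # like writing our address into DNS
            return True                       # we are now the leader
        return False                          # someone else holds the lock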

The rest of the paper describes other features of Chubby: failover, cache consistency, and event notifications. Master failover is handled by leader election in the Chubby cell, with a new node brought online in the background. Client caching is also an essential part of reducing load on the Chubby master, but the master has to synchronously invalidate caches on writes. Invalidation is piggybacked on heartbeat KeepAlive messages (default every 12 seconds), a lease mechanism that keeps client locks and caches alive. Event notifications can also be used to watch files for certain events: modification, master failover, etc. This is most often used to wait for modifications to a file, indicating that a new leader has locked it and written its address to it.
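A sketch of how a client might follow that leader file using event notifications (again with invented method names, not the real API):

    def follow_leader(chubby):
        handle = chubby.open("/ls/cell/service/leader")
        watch = handle.watch(events=["modified"])  # hypothetical subscription call
        while True:
            watch.wait()    # notifications ride along with KeepAlive responses
            print("new leader:", handle.get_contents())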

Future trends

Seeing how heavily this is used at Google, it clearly has a lot of future applicability. Chubby seems to be used mostly for inter-datacenter locking; I wonder whether any important modifications are needed for intra-datacenter locking. The whole space of eventual consistency gives a lot of alternatives to the strong guarantees that Paxos offers, so there are lots of ways to trade off availability and performance against consistency.
