The ubiquitous Paxos algorithm is one of the earliest and most widely cited consensus algorithms for designing a replicated state machine, and it will be our follow-up to two-phase commit in our ongoing effort to build this type of system. I’ll build on our previous discussion of two-phase commit, examine the tradeoffs of one versus the other, introduce an additional constraint to our problem of building a replicated state machine, and establish some vocabulary for accurately contrasting different consensus algorithms.
A new system constraint
The constraint to introduce to our problem of building a replicated state machine is the following: we are going to make the assumption that we are trying to build a replicated state machine given an asynchronous network of nodes. What this means exactly is that there is no upper bound on the message delay experienced in the system.
This is an extremely important constraint. Designing a consensus algorithm for a synchronous network, where message delivery is bounded by a known upper limit on delay, is a far easier problem. In an asynchronous network, by contrast, we cannot even tell the difference between a message that has been indefinitely delayed and a node that has actually failed, and therefore cannot reliably detect failure at all; this is the crux of what makes consensus so difficult.
Safety, Liveness, and Fault-Tolerance
To be able to appropriately compare different solutions for consensus, we need to describe the specific guarantees provided by one algorithm vs. another; in particular, we will want to understand the properties known as safety, liveness, and fault-tolerance. They can roughly be described as follows:
- safety: The guarantee that there will never be a “bad result”, or more specifically, that the result will be identical from any node (which has not failed)
- liveness: The guarantee that a request directed to a node (which has not failed) will yield a result in a finite amount of time
- fault-tolerance: The guarantee that the system can survive the failure of a node at any point
When designing a deterministic consensus algorithm for an asynchronous network, we can actually go so far as to say that we can only ever guarantee two out of the three of safety, liveness, and fault-tolerance. More on why this is the case is covered in another seminal paper in distributed systems, which we will return to later.
What is Paxos?
Paxos has often been described as an extremely difficult algorithm to understand, but in truth it is fairly intuitive once we drill down into its details. At its core, Paxos was conceived by Leslie Lamport as a protocol for a system to come to consensus on a single action to take, though the same steps can be extended to a multi-round protocol akin to two-phase commit.
The single action protocol is as follows:
Preparing a Vote
Processes participating in the Paxos protocol can initiate the balloting of a particular vote. Ballots are enumerated with sequence numbers and are initiated with a proposed action by the initiating process. We can define this message, sent from the initiator to all other nodes in the system, as Prepare(n, a), where n is the sequence number of the proposal (and must be greater than the sequence number of any prior proposal by this process), and a is the proposed action for the vote being prepared.
Making a Promise
Processes that receive a Prepare(n, a) message can either return back a promise to the initiator, or ignore the message outright. Processes will ignore these messages when they have already returned a promise in response to a separate Prepare(m, x) message, where x can be any action, and m is a sequence number greater than n.
Processes make a promise to an initiator by returning a message of the form Promise(n, b), where n is the sequence number the promise is being made for, and b is the last accepted action by the process. Note that b may be null if the process has not accepted any prior actions, and also note that accepting an action is NOT the same as making a promise for a given action; more on what it means to accept an action is covered later in Voting on a Proposal.
When processes return a promise to an initiator, they also cache the sequence number of the promised proposal, to ensure they abide by the rule of not returning a promise to any lower numbered proposals going forward.
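The promise-handling logic described above can be sketched as follows. This is a minimal sketch; the `Acceptor` and `Promise` names are my own, not from the paper.

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class Promise:
    n: int                         # sequence number the promise is made for
    last_accepted: Optional[str]   # last accepted action, or None if none yet

class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest sequence number promised so far
        self.last_accepted = None  # last accepted action (None if none)

    def on_prepare(self, n: int) -> Optional[Promise]:
        # Ignore a Prepare whose sequence number does not exceed an
        # earlier promise; returning None models ignoring the message.
        if n <= self.promised_n:
            return None
        # Cache the promised sequence number so that lower-numbered
        # proposals are ignored from now on.
        self.promised_n = n
        return Promise(n, self.last_accepted)
```

Returning `None` rather than an explicit rejection mirrors the protocol's "ignore the message outright" behavior.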
Starting a Vote
Once a majority of processes (not all processes, unlike two-phase commit) have returned a promise to the initiating process, it is time to start the vote.
The initiator first checks all of the promises it received; if any promise carried a non-null last accepted action, the initiator updates its proposed action to that value. It then broadcasts a Propose(n, a) message to all processes. More later in Key Takeaways on why there will never be more than one distinct last accepted action across the returned promises.
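The action-selection step can be sketched as below, following this post's simplified model in which at most one distinct accepted action can appear among the promises. (In the general protocol, the proposer adopts the action of the highest-numbered accepted proposal; the function name here is my own.)

```python
from typing import List, Optional

def choose_action(proposed: str, promised_actions: List[Optional[str]]) -> str:
    """Pick the action to carry in the Propose message: a non-null last
    accepted action from any promise wins; otherwise the initiator's
    own proposed action is used."""
    for last_accepted in promised_actions:
        if last_accepted is not None:
            return last_accepted
    return proposed
```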
Voting on a Proposal
Once a process has received a Propose(n, a) message, it can choose either to accept the proposal, by responding with a message of the form Accept(n), or to reject it, by ignoring the message outright.
If the process chooses to accept the proposal, it also caches the proposed action as the last accepted action, which is used as described above in Making a Promise, with more on why this is important in More Takeaways.
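The acceptor-side voting logic can be sketched as follows, again with names of my own choosing. The key detail is that accepting caches the proposed action as the last accepted action, which future Promise messages will carry.

```python
from typing import Optional

class Acceptor:
    def __init__(self):
        self.promised_n = -1       # highest sequence number promised so far
        self.last_accepted = None  # last accepted action (None if none)

    def on_propose(self, n: int, action: str) -> Optional[int]:
        # Reject (by ignoring) any proposal numbered below the promise.
        if n < self.promised_n:
            return None
        # Accept: cache the action so it is returned in future promises.
        self.promised_n = max(self.promised_n, n)
        self.last_accepted = action
        return n  # models the Accept(n) reply
```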
Committing a Vote
Once the initiating process has received an accept from the majority of processes, it can commit the operation and broadcast this to all other nodes with a message of the form Commit(a).
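The commit condition is a simple strict-majority check, which might look like this (function name is mine):

```python
def can_commit(accept_replies: int, total_processes: int) -> bool:
    # Commit only once a strict majority of processes reply Accept(n).
    return accept_replies > total_processes // 2
```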
Multi-Round Paxos and Side Tasks
Since Paxos was originally written as a mathematical proof of correctness for a single-vote consensus algorithm, the tasks of extending it to a multi-round protocol, along with other tasks that don’t affect the core algorithm’s correctness (e.g. catching up replicas that missed a vote, leader election, etc.), are left as exercises to the reader of the paper or described only briefly.
For the purpose of multi-round Paxos, the existing single round protocol can be expanded to also include a log index, which tracks which entry in the log the Paxos proposal is being made for. Processes can use a null entry for any log indices they are not aware of the action for, and can get these corresponding actions from other processes during normal message passing.
Note that a majority of processes must accept a proposal for it to be committed, and from that we can say a majority set of processes will together be aware of the action for all log indexes (since at least one member of the majority set had to have accepted each of the committed proposals).
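A log with null entries for unknown indices can be modeled as below; the helper names are my own, assuming a dictionary keyed by log index.

```python
from typing import Dict, List, Optional

def record_commit(log: Dict[int, Optional[str]], index: int, action: str) -> None:
    # Store the committed action for this log index.
    log[index] = action

def missing_indices(log: Dict[int, Optional[str]], up_to: int) -> List[int]:
    # Indices this process has not yet learned an action for
    # (null/absent entries), to be filled in from other processes.
    return [i for i in range(up_to) if log.get(i) is None]
```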
For leader election, a background messaging task that multicasts some identifying information of each process (e.g. its PID) is sufficient to determine a leader: processes assume they are the leader while they have not received a message with a lower-ordered identifier in some time. Unlike other protocols, this does not cause split-brain issues, because Paxos has a well-defined way of prioritizing proposals (always accepting proposals numbered higher than the most recently promised one, ignoring lower-numbered proposals, with no duplicate-numbered proposals), but it can slow the system down by causing too many proposals to be cancelled on the arrival of a new one.
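One hypothetical shape for this background task, assuming each process periodically multicasts its PID as a heartbeat (the `Process` class and timeout value are illustrative, not from the paper):

```python
import time

class Process:
    """A process assumes leadership while it has not recently heard
    a heartbeat carrying a lower-ordered identifier."""

    def __init__(self, pid: int, timeout: float = 5.0):
        self.pid = pid
        self.timeout = timeout
        self.lowest_heard = pid              # lowest pid heard recently
        self.last_heard = time.monotonic()   # when we last heard a lower pid

    def on_heartbeat(self, sender_pid: int) -> None:
        # Only lower-ordered identifiers matter for ceding leadership.
        if sender_pid < self.pid:
            self.lowest_heard = min(self.lowest_heard, sender_pid)
            self.last_heard = time.monotonic()

    def is_leader(self) -> bool:
        # Leader unless a lower-pid heartbeat arrived within the timeout.
        heard_lower_recently = (
            self.lowest_heard < self.pid
            and time.monotonic() - self.last_heard < self.timeout
        )
        return not heard_lower_recently
```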
Additionally, once a single leader can be established, Paxos message passing becomes much more efficient, as the first step of Prepare/Promise can be skipped entirely (since only one process is making proposals, no others to compete with/get information from). This further allows for a leader to batch a set of proposed actions, and for followers to accept or reject these actions as a batch, rather than having a separate request/response cycle for each action taken.
Catchup is trivial for a Paxos system, and can be taken care of by background messages which ship entries of the log to followers which are missing them. This can be incorporated into the existing message passing of Paxos, where followers can include their latest log sequence numbers, and get back from others missing log entries.
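Given a follower's reported latest log index, the entries to ship back can be computed as in this sketch (function name mine, assuming a dictionary-shaped log):

```python
from typing import Dict

def catchup_entries(peer_log: Dict[int, str], follower_latest: int) -> Dict[int, str]:
    # Log entries past the latest index the follower reported
    # in its normal Paxos message passing.
    return {i: a for i, a in peer_log.items() if i > follower_latest}
```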
Key Takeaways
There are some key differences between Paxos and two-phase commit, but also some subtle nuances that truly make it the powerful algorithm it is.
It’s fairly clear from the quorum-based (majority) voting style that Paxos is far more tolerant to failures of participating processes. Leader failure is still an issue, as in two-phase commit, and requires followers to eventually time out the leader and start a new round; however, followers can fail without the protocol needing to start over, since only a majority of the processes in the system is needed to complete a vote. Thus, a round of Paxos is tolerant to up to n out of 2n + 1 process failures.
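The n-out-of-2n + 1 bound follows directly from the quorum size, as this small arithmetic sketch shows (function name mine):

```python
def max_tolerable_failures(total: int) -> int:
    # A quorum of (total // 2) + 1 processes must survive a round,
    # so everyone outside one quorum can fail.
    return total - (total // 2 + 1)
```

For example, with 2n + 1 = 5 processes, up to n = 2 may fail while a round still completes.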
Accepted Actions are always Committed
This is a crucial innovation of Paxos that sets it apart from protocols such as two-phase commit. Consider a two-phase commit round where all followers vote to accept a leader’s proposal. At this point, there are still several things that could go wrong and prevent the voted action from being committed, such as the leader crashing before broadcasting the message to commit, or some followers crashing prior to committing and responding back to the leader (which requires the committed action to be rolled back), etc. For this reason, it’s entirely possible that all followers accept an action and the action is never committed.
The paper has an awesome proof describing in detail why an accepted proposal in Paxos will always be the one committed, which can be summarized as follows.
- Consider a system of 2n + 1 processes, where a round of Paxos is initiated, with no proposals having been accepted so far.
- The first accepted proposal, whose action we’ll name A, had to have been both promised by a majority of processes and accepted by a majority of processes (a minimum of n + 1), meaning a majority of processes will now have A as their last accepted action.
- At this point, assume the action fails to be committed.
- For any proposal going forward, the proposing node will have to get a promise back from a majority of processes before initiating a vote. Since there can be no two disjoint majorities in a set of processes, the majority set of processes which responded to the initiator with a promise and the majority set which accepted A are guaranteed to overlap in at least one process. This means the initiating process will get back A in a promise from at least one process, and will change the action of its proposal to A.
More simply put, the above steps can be described as follows: once an action has been accepted for a round of Paxos, it is guaranteed any future vote for this round will be for the same action.
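The no-two-disjoint-majorities fact underpinning the steps above can be checked exhaustively for a small system (the helper names here are mine):

```python
from itertools import combinations
from typing import List, Set

def majority_sets(num_processes: int) -> List[Set[int]]:
    # Every subset of processes of at least strict-majority size.
    quorum = num_processes // 2 + 1
    members = range(num_processes)
    return [set(c)
            for size in range(quorum, num_processes + 1)
            for c in combinations(members, size)]

def all_pairs_overlap(num_processes: int) -> bool:
    # Any two majority sets share at least one member, so a new
    # proposer's promise quorum always intersects the quorum that
    # accepted A.
    sets = majority_sets(num_processes)
    return all(a & b for a in sets for b in sets)
```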
Paxos guarantees safety, similarly to two-phase commit, as we are ensured by the majority consensus rule that there will never be inconsistency among processes in the system.
Many schemes also exist to further strengthen this when it comes to reading up-to-date data from the system. One is to treat reads that must return up-to-date information as proposed actions: a majority set of processes is then involved in the action, which guarantees log records are shared across the majority set and brings the replica serving the read up to date.
As is described in great detail in the famous FLP paper, no asynchronous system can guarantee both safety and liveness in the face of even a single fault.
Although we cannot guarantee liveness when using Paxos, we get a weaker guarantee that is still much stronger than that provided by two-phase commit. While 2PC can provide liveness only under the condition that no process in the system faults, Paxos can provide liveness so long as a majority of processes in the system do not fault, thanks to the previously described fact that all accepted actions are eventually committed.
There is never a point where Paxos cannot recover due to a set of process faults. Note that failing to recover is different from failing to make progress (liveness): Paxos will not satisfy liveness without a majority set, but it is still fault-tolerant in this case, since it can continue to make progress in the future provided a majority set of processes comes back online.
This is unlike two-phase commit, which can potentially be derailed permanently by a set of faults.
- Consider a proposal which is accepted by all followers, with the leader broadcasting to subsequently commit.
- Now, consider a follower which has attempted to commit, and responds back to the leader with the result.
- At this point, if both the leader and that follower crash, no process in the system knows the result of the operation on the crashed follower. Even if a new leader comes up, it will not know whether to commit or abort the transaction, since it cannot determine what the crashed follower did, and the operation may have irreversible effects.
It’s hard to overstate the dramatic influence that Lamport’s Paxos algorithm has had on fault-tolerant replicated state machines and I would be remiss not to strongly recommend reading his seminal paper. The only thing Lamport’s paper lacks is a crisp mapping from the theorized protocol to an actual software implementation.
In my next post, I’ll talk about Raft, which is in a lot of ways the modern reincarnation of Paxos, taking the core pillar of quorum-based consensus, and fitting it with well-defined mechanisms for leader election, checkpoints, and the other components of a production-system level consensus algorithm.