Two-phase commit: The start of our journey to a fault-tolerant replicated state machine
I feel like I drone on way too long usually when discussing papers and algorithms, and based on my not so stellar viewership metrics, I thought i’d cut to the chase with this one.
I talked some time back about consensus algorithms very briefly, but not the core problem we are attempting to solve with them. It suffices to say that much of the goal of consensus algorithms is to accomplish the goal of building fault-tolerant, replicated state machines.
This phrase isn’t the most straightforward, but what it means to convey is that it’s quite hard to come up with an algorithm which allows a group of nodes to unanimously agree upon a sequence of actions, in a manner which is resistant to intermittent failures in the system, whether a node crashing, or a partition in the network.
More formally, when we say we are trying to implement a replicated, fault-tolerant state machine, we are attempting to execute a distributed set of processes each executing the same state machine, where actions can be taken to modify the state of the state machine, and are expected to be taken uniformly, and in the same sequence across all processes. Furthermore, the expectation of a state machine in this context is that any actions are taken are deterministic, and thus, in a replicated state machine where actions are taken uniformly, we expect all processes to always end up in the same state, or in other words, we expect all replicas of the state machine to be consistent.
Two-phase commit is arguably the most intuitive algorithm for implementing a replicated state machine, so let’s start there…
What is 2PC
Two-phase commit is a fairly straightforward protocol, which goes as follows:
- A coordinator process proposes an action to take, and broadcasts this proposal to all follow processes.
- The follower processes all respond back to the coordinator with a yes/no vote to the proposal.
- The coordinator responds back to all of the followers with the decision, which is to commit if all followers voted yes, and to abort otherwise.
- After executing the decision to either commit or abort, the followers respond back to the coordinator, acknowledging they have completed this step.
The steps as described above would be repeated indefinitely, with each successive iteration of the protocol serving the goal of deciding the next action for the state machine.
There are several problems that arise when attempting to use two-phase commit for our goal of a replicated state machine.
Firstly, if the coordinator crashes during any of the sequences of messages between the coordinator and followers after the initial proposal is sent out, it can cause the protocol to get stuck. For example, if the coordinator has crashed between the proposal and the receiving of votes, and a fraction of the followers have returned a vote to the coordinator, while another fraction are still waiting for the proposal, then those that have voted will wait indefinitely, and have no means of recognizing the coordinator has crashed. This problem is a byproduct of the fact that a coordinator failure is indistinguishable from a network partition causing failures to message the coordinator, as an asynchronous network makes no guarantee about an upper bound on network latency.
We can get around coordinator failures by imposing a timeout, and electing a new coordinator after this timeout has passed, which will at least allow for continuing the iteration, by having the new coordinator message all followers asking for their votes again, and continuing from there.
Only as strong as the weakest link
In addition to the problems stemming from coordinator failure, relying on every single follower to vote on a proposal and to acknowledge the decision means that this protocol is not fault tolerant to process or network failures at all, since a broken link between a follower and the coordinator, or a crashed follower will bring the protocol to a grinding halt, waiting infinitely on this follower to respond back with its vote on the proposal, or its acknowledgement of the decision.
This problem can also be solved with a timeout, similarly to what is used to detecting coordinator failure, and the system can continue without the failed follower, but is still not an ideal solution, as we will just restart the protocol at this point, rather than being able to recover the current iteration.
A crash on acknowledgement
The last open issue with 2PC has to do with a type of failure that will render the protocol irrecoverable. Consider the decision to commit the proposed action has been broadcasted to all nodes, and the followers are in the process of committing the action and acknowledging the decision. Now at this point, if this node and the coordinator were to crash, we would lose any state about this particular node, and whether they were able to execute the decision or not, since this is information known only by the node itself, and by the now-crashed coordinator (if it has already received the acknowledgement message from the follower). At this point, we will eventually have a new coordinator if we implement a timeout as suggested earlier, but this doesn’t change the fact that we do not know what decision the crashed process took. We furthermore cannot force commit the action, since the crashed node may have failed to commit the action and responded with an abort. We also cannot force abort, since we may have reached a point where we cannot rollback the changes made by this action. At this point, the system is in an irrecoverable state.
Two-phase commit is a good start, but we’ll need to do better than this to have a reliable protocol for implementing a fault-tolerant, replicated state machine. Next time, I’ll discuss a similar protocol to 2PC which attempts to band-aid some of its lingering issues we’ve discussed, while opening a whole new can of worms in and of itself.