The Trouble with Distributed Systems


Faults and Partial Failures

Software running on a single computer either works or it doesn't; there is little reason for it to behave erratically or unpredictably. In a distributed system, however, we have to accept that different components can fail in unpredictable ways, and that some parts of the system may be broken while others work fine (a partial failure). We therefore need to build fault-tolerance mechanisms into the software so that the system as a whole keeps functioning even when some of its components are unreliable.

Unreliable Networks

In a shared-nothing system, machines communicate only over the network. The internet and most internal networks are asynchronous packet networks: they make no guarantees about whether a message will be delivered or how long it will take. A request or its response may be lost or delayed, and a remote node may have failed or be temporarily unavailable; from the sender's point of view these cases are indistinguishable. The usual way of handling this is a timeout: after some amount of time the sender gives up waiting and assumes the response is not going to arrive. The only way to confirm that a request succeeded is to receive a positive response from the application itself; if something goes wrong, you have to assume that you may get no response at all.

Choosing a timeout involves a trade-off. A long timeout means a long wait before a node is declared dead; a short timeout detects failures faster but risks declaring a node dead when it is merely slow. When a node is declared dead, its responsibilities must be transferred to other nodes, which places additional load on the system and the network. To build software that can recover from network faults, it is important to know how the software behaves when such problems occur and to test the system's response by deliberately triggering them.
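
The essential point is that a timeout only tells you that no response arrived in time, not what happened to the request. The following minimal sketch (the host, port, and wire format are hypothetical placeholders) keeps the three outcomes separate:

```python
import socket

def send_request(host: str, port: int, payload: bytes, timeout_s: float = 1.0):
    """Send a request and wait for a response, distinguishing success,
    definite failure, and the 'unknown' case where the timeout fired."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s) as sock:
            sock.settimeout(timeout_s)
            sock.sendall(payload)
            response = sock.recv(4096)
            return ("ok", response)       # positive response from the application
    except socket.timeout:
        return ("unknown", None)          # the request may or may not have been processed
    except OSError as exc:
        return ("failed", exc)            # e.g. connection refused, network unreachable
```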

Network Congestion and Queueing

In a network, there are several factors that can cause delays or queuing of data. For example, when multiple nodes try to send packets simultaneously to the same destination, the network switch may need to queue the packets and deliver them one by one to the destination. If the switch becomes full, it may discard packets. Similarly, if CPU cores are busy, requests may be queued by the operating system until the applications are ready to handle them. In virtual environments, the operating system may be paused while another virtual machine uses a CPU core, resulting in incoming data being queued. Additionally, TCP performs flow control, which can result in a node limiting its own rate of sending data to avoid overloading a network link or the receiving node, leading to additional queuing at the sender. To determine appropriate timeouts, it is possible to measure the distribution of network round-trip times over an extended period and choose timeouts experimentally. Systems can also continuously measure response times and their variability (jitter) and automatically adjust timeouts based on the observed response time distribution.
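
As a rough illustration of jitter-aware timeouts, the sketch below keeps an exponentially weighted estimate of the round-trip time and its variability, in the spirit of TCP's retransmission-timeout calculation; the smoothing factors and the 4x multiplier are conventional choices, not values taken from the text:

```python
class AdaptiveTimeout:
    """Estimate a timeout from observed round-trip times and their jitter."""

    def __init__(self, initial_timeout_s: float = 1.0,
                 alpha: float = 0.125, beta: float = 0.25):
        self.alpha = alpha        # weight given to new RTT samples
        self.beta = beta          # weight given to new jitter samples
        self.srtt = None          # smoothed round-trip time
        self.rttvar = None        # smoothed deviation (jitter)
        self.timeout = initial_timeout_s

    def observe(self, rtt_s: float) -> float:
        """Feed in a measured round-trip time; returns the updated timeout."""
        if self.srtt is None:
            self.srtt, self.rttvar = rtt_s, rtt_s / 2
        else:
            self.rttvar = (1 - self.beta) * self.rttvar + self.beta * abs(self.srtt - rtt_s)
            self.srtt = (1 - self.alpha) * self.srtt + self.alpha * rtt_s
        self.timeout = self.srtt + 4 * self.rttvar
        return self.timeout
```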

Synchronous Versus Asynchronous Networks

In a telephone network, a call establishes a circuit: a fixed amount of bandwidth is reserved along the entire route between the two callers, so the data passes through the network without queueing delays and the maximum end-to-end latency is fixed (bounded). A TCP connection, by contrast, has no reserved bandwidth; its packets opportunistically use whatever network bandwidth happens to be available. Using circuits for bursty data transfers would waste network capacity and make transfers unnecessarily slow, whereas TCP dynamically adapts its rate of transfer to the available capacity. We therefore have to accept that network congestion, queueing, and unbounded delays can occur and design systems accordingly. As a result, there is no "correct" value for timeouts; they must be determined experimentally.

Unreliable Clocks

When multiple machines are involved, it is often difficult to determine the order in which events occurred: a message is always received some time after it was sent, but because of variable network delays we do not know how much later. Moreover, each machine has its own clock, which may run slightly faster or slower than the others; clocks can be synchronized to some degree using the Network Time Protocol (NTP). There are two kinds of clocks: time-of-day clocks, which return the current date and time according to some calendar, and monotonic clocks, which are guaranteed to always move forward and are therefore suited to measuring elapsed time, such as timeouts. NTP may adjust the rate of a monotonic clock by a small amount, but it cannot cause it to jump forward or backward. In a distributed system, using a monotonic clock to measure elapsed time is usually fine. If you rely on synchronized time-of-day clocks, however, you must carefully monitor the clock offsets between all machines, because relying on an inaccurately synchronized clock is more likely to cause silent, subtle data loss than a dramatic crash.
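
A small illustration of the difference between the two clock types, using Python's standard library (do_some_work is just a stand-in for the operation being timed):

```python
import time

def do_some_work():
    # Stand-in for the operation whose duration we want to measure.
    sum(range(1_000_000))

start_wall = time.time()        # time-of-day clock: may jump if NTP steps it
start_mono = time.monotonic()   # monotonic clock: guaranteed to move forward

do_some_work()

elapsed = time.monotonic() - start_mono   # safe: always non-negative
naive = time.time() - start_wall          # may be wrong (even negative) if the
                                          # clock was stepped during the work
print(f"monotonic: {elapsed:.6f}s, time-of-day: {naive:.6f}s")
```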

It is tempting but risky to rely on time-of-day clocks for ordering events across multiple nodes. For example, last write wins (LWW), a conflict-resolution strategy widely used in multi-leader replication and in leaderless databases such as Cassandra and Riak, can silently drop writes when the writers' clocks are out of sync, and the notion of which write is most "recent" depends on a local time-of-day clock that may well be incorrect. A safer alternative for ordering events is a logical clock, which is based on incrementing counters rather than an oscillating quartz crystal. Logical clocks do not measure the time of day or elapsed time, only the relative ordering of events; time-of-day and monotonic clocks, by contrast, are known as physical clocks.
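
As an illustration, here is a minimal sketch of one well-known kind of logical clock, a Lamport clock: a counter that is incremented on every event and merged on message receipt so that causally later events always get larger values. The class and method names are illustrative, not taken from any particular system:

```python
class LamportClock:
    """A logical clock: an incrementing counter that captures the relative
    ordering of events rather than the time of day."""

    def __init__(self):
        self.counter = 0

    def tick(self) -> int:
        # Local event: advance the counter.
        self.counter += 1
        return self.counter

    def send(self) -> int:
        # Attach the current counter value to an outgoing message.
        return self.tick()

    def receive(self, message_counter: int) -> int:
        # On receipt, move the counter past anything the sender had seen,
        # so the receive event is ordered after the corresponding send.
        self.counter = max(self.counter, message_counter) + 1
        return self.counter
```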

It is more accurate to think of a clock reading not as a single point in time but as a range of times within a confidence interval: for example, a clock might report that the current time is between 8.1 and 8.4 with 98% confidence. This matters for snapshot isolation, whose most common implementation requires a monotonically increasing transaction ID, something that is hard to generate across data centers. The Spanner database implements snapshot isolation across data centers by using clock timestamps as transaction IDs and taking the clock's confidence interval into account: if two confidence intervals overlap, the events they represent may have occurred in either order and must be handled accordingly.

If two confidence intervals do not overlap (that is, the latest possible time of one interval is earlier than the earliest possible time of the other), then we can be certain that the event with the later interval happened after the event with the earlier one. To ensure that transaction timestamps reflect this ordering, Spanner deliberately waits for the length of the confidence interval before committing a read-write transaction, so that its interval cannot overlap with that of any later transaction. To keep that wait short, Google deploys a GPS receiver or atomic clock in each data center where Spanner is used, keeping the clock uncertainty, and therefore the confidence intervals, as small as possible.
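
A minimal sketch of reasoning about clock readings as confidence intervals (in the spirit of Spanner's approach, not its actual API; the names are illustrative):

```python
from dataclasses import dataclass

@dataclass
class ClockInterval:
    """A clock reading represented as an uncertainty interval [earliest, latest]."""
    earliest: float
    latest: float

def definitely_before(a: ClockInterval, b: ClockInterval) -> bool:
    # Only if a's latest possible time is earlier than b's earliest possible
    # time can we conclude that the event behind a happened before b.
    return a.latest < b.earliest

def overlaps(a: ClockInterval, b: ClockInterval) -> bool:
    # Overlapping intervals: the ordering of the two events is ambiguous.
    return not (definitely_before(a, b) or definitely_before(b, a))
```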

Process Pauses

One way for a node to know that it is still the leader is to hold a lease from the other nodes, which is similar to a lock with a timeout: the node remains the leader until the lease expires, at which point another node can take over, so to stay leader it must periodically renew the lease. The danger lies in assuming how much time passes between checking the lease and acting on it, as the sketch below illustrates: a process can be paused for a long time for many reasons, such as garbage collection, virtual machine suspension, suspension of execution on laptops, operating system context switches, synchronous disk access, paging, or being stopped with a Unix SIGSTOP signal.
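
The following sketch shows the problematic pattern in miniature (lease, get_incoming_request, and process are hypothetical stand-ins, and the renewal margin is an arbitrary choice). Note also that comparing a lease expiry set on another machine against the local time-of-day clock assumes synchronized clocks, which is a further source of trouble:

```python
import time

LEASE_RENEWAL_MARGIN_S = 10.0   # hypothetical: renew when <10s of the lease remain

def request_handling_loop(lease, get_incoming_request, process):
    # Problematic pattern: the lease check and the request processing are not
    # atomic. If the process is paused between them (garbage collection, VM
    # suspension, SIGSTOP, ...), the lease may have expired and another node
    # may have taken over by the time the request is actually processed.
    while True:
        if lease.expiry_time - time.time() < LEASE_RENEWAL_MARGIN_S:
            lease = lease.renew()
        request = get_incoming_request()
        # <-- an arbitrarily long pause can happen here
        process(request)   # the lease may no longer be valid at this point
```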

Some systems, such as real-time operating systems (RTOSs), require software to respond before a specified deadline. In such systems, library functions must document their worst-case execution times, dynamic memory allocation may be restricted or disallowed altogether, and an enormous amount of testing and measurement is required. Outside hard real-time settings, one way to limit the impact of garbage collection pauses is to treat them like brief planned outages of a node: if the runtime can warn the application that a node will soon need to pause for garbage collection, the application can stop sending new requests to that node and let it collect garbage while no requests are in progress. Alternatively, the garbage collector can be used only for short-lived objects, with the process restarted periodically before a full collection of long-lived objects becomes necessary.

Knowledge, Truth and Lies

In many distributed systems, decisions require a quorum of nodes, typically an absolute majority of more than half the nodes, so that the system never has to rely on a single node's judgment. A node cannot necessarily trust its own judgment of a situation: for example, it may believe it still holds a lease that has in fact expired. Fencing tokens protect against this. A fencing token is a number that the lock service increments every time it grants a lock or lease, and a client must include its current token with every write request to the storage service. The storage server can then compare the token against those it has already processed and reject requests with lower token numbers. In systems that use ZooKeeper as a lock service, the transaction ID (zxid) or the node version (cversion) can serve as fencing tokens. Fencing tokens are effective at detecting and blocking nodes that are inadvertently acting in error, but they are not sufficient against Byzantine faults, where nodes may "lie" or malfunction deliberately. A Byzantine fault-tolerant system continues to operate correctly even if some nodes are malfunctioning or malicious; this matters in environments such as aerospace, where radiation can corrupt a node's state, and in systems involving multiple organizations, where some participants may attempt to cheat or defraud others.
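
A minimal sketch of the server-side check (the class and field names are illustrative, not any particular system's API):

```python
class FencedStorage:
    """A storage service that rejects writes carrying a fencing token
    older than the highest one it has already processed."""

    def __init__(self):
        self.max_token_seen = -1
        self.data = {}

    def write(self, key, value, fencing_token: int) -> bool:
        if fencing_token < self.max_token_seen:
            # A client with an expired lease (and therefore an old token)
            # is trying to write; reject the request.
            return False
        self.max_token_seen = max(self.max_token_seen, fencing_token)
        self.data[key] = value
        return True
```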

System Model and Reality

To solve problems in distributed systems, many algorithms have been developed. These algorithms must be able to tolerate the various types of faults that can occur in distributed systems, and they must be written in a way that is not overly dependent on the hardware and software configuration on which they are run. To do this, we need to formally define the types of faults that we expect to occur in a system, through the use of a system model, which is an abstraction that describes the assumptions an algorithm can make. There are three commonly used system models for timing assumptions:

  • the synchronous model, which assumes bounded network delay, bounded process pauses, and bounded clock error
  • the partially synchronous model, which assumes that a system behaves like a synchronous system most of the time but may occasionally exceed the bounds for network delay, process pauses, and clock drift
  • the asynchronous model, which does not allow any timing assumptions; it does not even have a clock, so it cannot use timeouts.

In addition to timing issues, we must consider node failures, which can be classified into three models:

  • crash-stop faults, where nodes may suddenly stop responding and are gone forever
  • crash-recovery faults, where nodes may crash and then start responding again after some unknown time, with stable storage being preserved across crashes
  • Byzantine (arbitrary) faults, where nodes may do anything, including attempting to deceive other nodes.

The partially synchronous model with crash-recovery faults is generally the most useful for modeling real systems.

Summary

We discussed the various problems that can occur in distributed systems: packets may be lost or arbitrarily delayed, clocks may be inaccurate or out of sync with other nodes, and a process may pause for a substantial amount of time or crash at any point. These issues are inherent to distributed systems, so tolerance of partial failure must be built into the software if the system is to keep functioning even when some of its parts are broken. Handling faults requires detecting them first, which is hard because nodes share no state and the network is unreliable; even tolerating a detected fault is not easy, since decisions generally require getting a quorum of nodes to agree. It is possible to build systems with hard real-time response guarantees and bounded delays, but doing so is expensive and results in lower utilization of hardware resources. Finally, we discussed the role of logical clocks in ordering events and of fencing tokens in protecting against nodes that are acting in error.