The Future of Data Systems
"If the highest aim of a captain was to preserve his ship, he would keep it in port forever".
Data Integration
Updating a derived data system based on an event log can often be made deterministic and idempotent, which makes recovering from faults straightforward. This is in contrast to distributed transactions, which use locks for mutual exclusion and atomic commit to ensure that changes take effect exactly once; transaction systems provide linearizability, whereas log-based systems rely on deterministic retry and idempotence. Log-based integration is attractive when connecting different data systems, because it does not depend on widespread support for a good distributed transaction protocol. However, as systems are scaled to handle larger and more complex workloads, limitations emerge. Constructing a totally ordered log requires all events to pass through a single leader node, which can become a throughput bottleneck, and the ordering of events is undefined when they originate in different datacenters or in different services. Deciding on a total order of events is known as total order broadcast, which is equivalent to consensus, and it is still an open research problem to design consensus algorithms that scale beyond the throughput of a single node.
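As a rough sketch (not from the book, all names illustrative), the following shows why idempotence makes log-based derived data easy to recover: the consumer remembers the offset of the last event it applied, so replaying the log after a crash or retry leaves the derived state unchanged.

```python
# Minimal sketch of an idempotent, deterministic consumer of an event log.
class DerivedStore:
    def __init__(self):
        self.state = {}           # derived view, e.g. key -> latest value
        self.applied_offset = -1  # highest log offset already applied (persisted in practice)

    def apply(self, offset, event):
        if offset <= self.applied_offset:
            return  # already applied: replay is a no-op, hence idempotent
        # deterministic update: output depends only on current state and the event
        self.state[event["key"]] = event["value"]
        self.applied_offset = offset

store = DerivedStore()
log = [{"key": "user:1", "value": "Alice"}, {"key": "user:2", "value": "Bob"}]
for off, ev in enumerate(log):
    store.apply(off, ev)
for off, ev in enumerate(log):   # replaying the same events changes nothing
    store.apply(off, ev)
print(store.state)
```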
Batch and Stream Processing
The main difference between batch processors and stream processors is that batch processors operate on a finite, known amount of data, while stream processors operate on unbounded input. Spark performs stream processing on top of a batch engine by breaking the stream into microbatches, whereas Apache Flink builds batch processing on top of a stream-processing engine. Batch processing has a strong functional flavor: the output depends only on the input, and there are no side effects. Stream processing is similar, but it additionally allows managed, fault-tolerant state. Derived data systems can be maintained synchronously or asynchronously; asynchrony makes systems based on event logs more robust, because a fault in one part of the system can be contained locally. Stream processing allows changes in the input to be reflected in derived views with low delay, while batch processing allows large amounts of accumulated historical data to be reprocessed in order to derive new views onto an existing dataset. Derived views also enable gradual evolution: an old and a new schema can be maintained side by side as independent views onto the same underlying data, until eventually the old view is dropped.
The Lambda Architecture
The lambda architecture is a proposal for combining batch and stream processing by running them in parallel. In this approach, incoming data is recorded by appending immutable events to an always-growing dataset, similar to event sourcing. Read-optimized views are then derived from these events, with a batch processing system like Hadoop MapReduce producing a corrected version of the view and a separate stream-processing system like Storm producing a quick approximate update to the view. The lambda architecture was influential in shaping the design of data systems, particularly by popularizing the idea of deriving views onto streams of immutable events and reprocessing events as needed. However, it also has practical problems: the same logic has to be maintained in both a batch and a stream processing framework, and the separate outputs of the two pipelines have to be merged in order to answer queries. Additionally, frequently reprocessing the entire history is expensive on large datasets, so the batch pipeline may need to be set up to process incremental batches, which raises issues such as handling stragglers and windows that cross boundaries between batches.
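A toy sketch (purely illustrative, not the architecture's canonical implementation) of the lambda idea: one pure derivation function is reused by a slow full recompute and a fast incremental update, and the read path has to merge the two views, which is one source of the operational complexity mentioned above.

```python
from collections import Counter

def count_pageviews(events):
    """Pure, deterministic derivation: page -> view count."""
    counts = Counter()
    for e in events:
        counts[e["page"]] += 1
    return counts

history = [{"page": "/home"}, {"page": "/about"}, {"page": "/home"}]
recent  = [{"page": "/home"}]               # events since the last batch run

batch_view = count_pageviews(history)       # slow, corrected view (batch layer)
speed_view = count_pageviews(recent)        # fast, approximate view (speed layer)

def read(page):
    # the read path must merge both views
    return batch_view[page] + speed_view[page]

print(read("/home"))  # 3
```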
Unbundling Databases
At a high level, databases, Hadoop, and operating systems all perform similar functions by storing data and allowing it to be processed and queried. However, they differ in their approaches to information management. Unix presents programmers with a logical but low-level hardware abstraction, while relational databases offer a high-level abstraction that hides the complexities of data structures, concurrency, and crash recovery. Unix uses pipes and files as sequences of bytes, while databases use SQL and transactions. The tension between these philosophies has persisted for decades, with the NoSQL movement representing an attempt to apply a Unix-like approach of low-level abstractions to distributed OLTP data storage. Let's try to reconcile these philosophies and explore how to combine the best of both worlds.
Creating an Index
Batch and stream processors can be thought of as elaborate implementations of triggers, stored procedures, and materialized view maintenance routines, and the derived data systems they maintain are like different index types. There are two ways in which different storage and processing tools can be composed into a cohesive system: federated databases, which unify reads by providing a single query interface over a variety of underlying storage engines and processing methods, and unbundled databases, which unify writes by ensuring that every data change ends up in all the right places across disparate technologies. Keeping writes in sync across several storage systems is the harder engineering problem: the traditional approach is distributed transactions across heterogeneous systems, but an asynchronous event log with idempotent writes is a more robust and practical solution. The advantages of asynchronous event streams and unbundling are increased robustness and loose coupling between components, allowing different software components and services to be developed, improved, and maintained independently by different teams. However, if there is a single technology that meets all of your needs, it is usually more efficient to simply use that product rather than recreating it from lower-level components.
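A minimal sketch (all class names hypothetical) of what "unifying writes" can look like: every change is appended to a single log, and independent consumers keep a key-value view and a simple search index in sync by each applying the log at their own pace.

```python
changelog = []   # stand-in for a durable, ordered log such as Kafka

def write(key, value):
    changelog.append({"key": key, "value": value})

class KeyValueView:
    def __init__(self):
        self.data, self.pos = {}, 0
    def catch_up(self):
        while self.pos < len(changelog):
            e = changelog[self.pos]
            self.data[e["key"]] = e["value"]   # idempotent upsert
            self.pos += 1

class WordIndexView:
    def __init__(self):
        self.index, self.pos = {}, 0
    def catch_up(self):
        while self.pos < len(changelog):
            e = changelog[self.pos]
            for word in e["value"].split():
                self.index.setdefault(word, set()).add(e["key"])
            self.pos += 1

kv, idx = KeyValueView(), WordIndexView()
write("doc1", "hello world")
write("doc2", "hello again")
kv.catch_up(); idx.catch_up()
print(kv.data["doc1"], sorted(idx.index["hello"]))
```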
Designing Applications Around Dataflow
It is often beneficial to separate stateless application logic from state management (databases), and to have some parts of a system specialize in durable data storage while other parts specialize in running application code. Instead of treating the database as a passive variable manipulated by the application, application code can respond to state changes in one place by triggering state changes in another place, an approach known as dataflow. This can be implemented with stream processors and services, and subscribing to change streams can be faster and more robust than querying an external service or database on every request. Materialized views and caching precompute results for common queries and so reduce the work on the read path. Read requests can even be represented as streams of events and sent through a stream processor, which allows causal dependencies to be tracked and makes it possible to reconstruct what the user saw before making a particular decision.
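A hedged illustration of this dataflow style (names are made up for the example): instead of the application querying a database on every read, a small piece of code subscribes to change events and keeps a materialized view of unread-message counts up to date, so reads become cheap lookups.

```python
from collections import defaultdict

unread_counts = defaultdict(int)   # materialized view serving the read path

def on_event(event):
    # application code reacting to a state change in one place by
    # updating derived state in another place
    if event["type"] == "message_sent":
        unread_counts[event["to"]] += 1
    elif event["type"] == "message_read":
        unread_counts[event["user"]] -= 1

for ev in [
    {"type": "message_sent", "to": "alice"},
    {"type": "message_sent", "to": "alice"},
    {"type": "message_read", "user": "alice"},
]:
    on_event(ev)

print(unread_counts["alice"])   # a cheap lookup into the precomputed view
```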
Aiming for Correctness
It is important to build applications that are reliable and correct, but the traditional approach of transactions providing atomicity, isolation, and durability has limits in terms of performance and scalability, and the consistency guarantees you actually get can be hard to pin down. It is often difficult to determine whether it is safe to run an application at a particular transaction isolation level or replication configuration, and even infrastructure products such as databases have bugs that can lead to data corruption or loss. On the other hand, if an application can tolerate the occasional loss or corruption of data, it may be able to avoid the cost of serializability and atomic commit, which typically work only within a single datacenter and limit scale and fault tolerance. Against this background, it is worth considering other ways of thinking about correctness, particularly in dataflow architectures.
The End-to-end Argument for Databases
Bugs and mistakes are a fact of life in any system. When they occur, data may be processed incorrectly and produce wrong results. To mitigate these issues, it is often advisable to use immutable and append-only data structures: because data cannot be modified once it has been written, it is much easier to recover from mistakes and bugs.
Ensuring exactly-once semantics, or effectively-once semantics, means designing the computation so that the final result is the same as if no faults had occurred, even if some operations had to be retried. This can be achieved through idempotent operations, which have the same effect whether they are executed once or multiple times. Implementing idempotence requires some effort and care: it may be necessary to maintain additional metadata (such as operation IDs) and to ensure fencing when failing over from one node to another.
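A minimal sketch (assumed names, in-memory stand-ins for durable storage) of the two ingredients just mentioned: an operation-ID set for duplicate suppression, and a fencing token that lets the store reject writes from a node whose leadership has been superseded.

```python
processed_ops = set()      # operation IDs already applied (kept durably in practice)
balance = {"alice": 100}
latest_token = 0           # highest fencing token the store has seen

def apply_debit(op_id, fencing_token, account, amount):
    global latest_token
    if fencing_token < latest_token:
        raise RuntimeError("stale leader: write rejected by fencing")
    latest_token = fencing_token
    if op_id in processed_ops:
        return             # retry of an already-applied operation: no effect
    balance[account] -= amount
    processed_ops.add(op_id)

apply_debit("op-42", fencing_token=7, account="alice", amount=30)
apply_debit("op-42", fencing_token=7, account="alice", amount=30)  # retried: ignored
print(balance["alice"])    # 70, not 40
```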
While transactions have traditionally been seen as a good abstraction for ensuring the correctness of data processing, they are not sufficient on their own to provide end-to-end correctness. To prevent duplicate requests from being processed twice, an end-to-end solution is needed: a transaction identifier that is passed all the way from the client to the database. This identifier can be generated in various ways, for example as a random UUID or as a hash of the relevant form fields.
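As a small sketch of this end-to-end argument (the schema and function names are invented for the example), the client attaches a UUID to its submission and the database enforces uniqueness on that ID, so retrying the same request cannot apply the transfer twice.

```python
import sqlite3, uuid

db = sqlite3.connect(":memory:")
db.execute(
    "CREATE TABLE transfers (request_id TEXT PRIMARY KEY, payer TEXT, payee TEXT, amount INTEGER)"
)

def submit_transfer(request_id, payer, payee, amount):
    try:
        with db:
            db.execute(
                "INSERT INTO transfers VALUES (?, ?, ?, ?)",
                (request_id, payer, payee, amount),
            )
    except sqlite3.IntegrityError:
        pass  # same request ID seen before: the retry is suppressed end to end

req = str(uuid.uuid4())      # generated once by the client, reused on every retry
submit_transfer(req, "alice", "bob", 25)
submit_transfer(req, "alice", "bob", 25)   # e.g. the user clicked "send" twice
print(db.execute("SELECT COUNT(*) FROM transfers").fetchone()[0])  # 1
```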
It is important to note that low-level reliability mechanisms, such as those found in TCP, work quite well and as a result, higher-level faults occur fairly infrequently. However, it is still worth exploring fault-tolerance abstractions that provide application-specific end-to-end correctness while also maintaining good performance and operational characteristics. By doing so, we can ensure that our systems are as robust and reliable as possible, even in the face of bugs and mistakes.
Enforcing Constraints
Enforcing a uniqueness constraint requires consensus, and the most common way of achieving consensus in a distributed system is to designate a single node as the leader, which is responsible for making all the decisions; if the leader fails, the consensus problem must be solved again to appoint a new one. Uniqueness checking can be scaled out by partitioning on the value that needs to be unique: for example, if usernames must be unique, requests can be partitioned by the hash of the username, so that all requests for the same name go to the same partition. Asynchronous multi-master replication is not suitable for maintaining uniqueness, because different masters may concurrently accept conflicting writes. If any write that would violate the constraint must be rejected immediately, synchronous coordination is unavoidable.
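A small sketch of that partitioning idea (the partition count and function names are illustrative): every request for a given username is routed, by hashing the name, to the partition that owns it, so each partition can check its own names without cross-partition coordination.

```python
import hashlib

NUM_PARTITIONS = 4
taken = [set() for _ in range(NUM_PARTITIONS)]   # usernames owned by each partition

def partition_for(username):
    digest = hashlib.sha256(username.encode()).hexdigest()
    return int(digest, 16) % NUM_PARTITIONS

def claim_username(username):
    p = partition_for(username)      # all requests for this name land on partition p
    if username in taken[p]:
        return False                 # rejected: name already taken
    taken[p].add(username)
    return True

print(claim_username("martin"))   # True
print(claim_username("martin"))   # False: the same partition sees the conflict
```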
In the log-based approach to uniqueness checking, requests to claim a name are encoded as messages in a log partitioned by the requested value. A stream processor sequentially reads the requests in its partition: if the requested name is available, it records the name as taken and emits a success message to an output stream; if the name is already taken, it emits a rejection message. The client then waits for the success or rejection message corresponding to its request. This approach works not only for uniqueness constraints but for many other kinds of constraints as well.
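A hedged sketch of that flow (in-memory lists stand in for log partitions and output streams): a single-threaded processor decides each claim in log order, and clients look up the decision for their request ID on the output stream.

```python
request_log = []     # one partition of the request log (stand-in for e.g. Kafka)
output_stream = []   # decisions emitted by the stream processor

def submit_claim(request_id, username):
    request_log.append({"request_id": request_id, "username": username})

def run_processor():
    taken = set()
    for req in request_log:             # processed sequentially, in log order
        if req["username"] in taken:
            output_stream.append({"request_id": req["request_id"], "status": "rejected"})
        else:
            taken.add(req["username"])
            output_stream.append({"request_id": req["request_id"], "status": "ok"})

submit_claim("r1", "martin")
submit_claim("r2", "martin")   # conflicting claim for the same name
run_processor()
print(output_stream)           # r1 succeeds, r2 is rejected
```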
Multi-partition request processing can be handled in a similar way. For example, a transaction transferring money from one account (A) to another (B) potentially involves three partitions: one for the request ID, one for the payer account, and one for the payee account. In a traditional database, executing this transaction would require an atomic commit across all three partitions, but equivalent correctness can be achieved with partitioned logs and without an atomic commit. The client generates a unique request ID for the transfer and appends it to a log partition determined by that request ID. A stream processor reads the log of requests and, for each one, emits two messages to output streams: a debit instruction for the payer account (partitioned by A) and a credit instruction for the payee account (partitioned by B), both carrying the original request ID. Additional processors consume the streams of credit and debit instructions, deduplicate them by request ID, and apply the changes to the account balances.
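The following is a rough single-process sketch of that multi-partition flow (all names and in-memory lists are stand-ins for real log partitions and processors): the request is logged once, fanned out into a debit and a credit instruction carrying the same request ID, and the account processors deduplicate on that ID, so even a retried fan-out has the effect of running exactly once.

```python
request_log, debit_log, credit_log = [], [], []
balances = {"A": 100, "B": 0}
seen_debits, seen_credits = set(), set()

def submit_transfer(request_id, payer, payee, amount):
    request_log.append({"id": request_id, "payer": payer, "payee": payee, "amount": amount})

def fan_out():
    for req in request_log:
        debit_log.append({"id": req["id"], "account": req["payer"], "amount": req["amount"]})
        credit_log.append({"id": req["id"], "account": req["payee"], "amount": req["amount"]})

def apply_debits():
    for msg in debit_log:
        if msg["id"] not in seen_debits:       # deduplicate by request ID
            balances[msg["account"]] -= msg["amount"]
            seen_debits.add(msg["id"])

def apply_credits():
    for msg in credit_log:
        if msg["id"] not in seen_credits:
            balances[msg["account"]] += msg["amount"]
            seen_credits.add(msg["id"])

submit_transfer("req-1", "A", "B", 30)
fan_out(); fan_out()                 # even if the fan-out step is retried...
apply_debits(); apply_credits()
print(balances)                      # ...balances end up as if it ran exactly once
```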
Timeliness and Integrity
Consumers of a log operate asynchronously, so a sender does not need to wait for its message to be processed by consumers, although a client can choose to wait for a message to appear on an output stream. The term consistency conflates two different requirements:
- Timeliness, which means that users should observe the system in an up-to-date state
- Integrity, which means that the data is correct and there is no corruption or contradiction.
Violations of timeliness are referred to as "eventual consistency," while violations of integrity are referred to as "perpetual inconsistency."
Correctness of Dataflow Systems
When processing event streams asynchronously, there is no guarantee of timeliness unless consumers are explicitly built to wait for a message to arrive. Integrity, however, is central to streaming systems: fault-tolerant message delivery and duplicate suppression are what provide exactly-once (or effectively-once) semantics and preserve integrity, and they do so without distributed transactions or atomic commit protocols, which allows for better performance and operational robustness. Integrity can be achieved through a combination of techniques: representing a write operation as a single message, updating state with deterministic derivation functions, passing a client-generated request ID through all levels of processing for end-to-end duplicate suppression and idempotence, and making messages immutable so that derived data can be reprocessed from time to time. In some business contexts it is acceptable to temporarily violate a constraint and fix it up later, as the cost of apologizing (in money or reputation) may be low.
Coordination-Avoiding Data Systems
Dataflow systems can maintain the integrity of derived data without relying on atomic commit, linearizability, or synchronous cross-partition coordination. Some applications may be satisfied with loose uniqueness constraints that may be temporarily violated and corrected later, rather than strict constraints that require timely coordination. Dataflow systems can offer strong integrity guarantees without the need for coordination, resulting in better performance and fault tolerance compared to systems that require synchronous coordination.
Trust, But Verify
Verifying the accuracy and completeness of data is known as auditing. It is important to check the integrity of data regularly, by reading it back and by performing restores from backups to ensure that they actually work. Self-validating or self-auditing systems continuously check their own integrity. Traditional ACID databases have encouraged us to trust the technology blindly rather than designing for auditability, whereas event-based systems such as event sourcing lend themselves better to auditing. Cryptographic auditing and integrity checking often relies on Merkle trees, which are also used in security technologies such as certificate transparency for validating TLS/SSL certificates.
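A small sketch of the Merkle-tree idea behind such integrity checks (a simplified construction, not any particular system's implementation): each leaf is the hash of a record, parents hash their children, and two parties can compare a single root hash to detect whether any record differs.

```python
import hashlib

def h(data: bytes) -> bytes:
    return hashlib.sha256(data).digest()

def merkle_root(records):
    level = [h(r) for r in records]          # leaf hashes
    while len(level) > 1:
        if len(level) % 2 == 1:
            level.append(level[-1])          # duplicate last node if the level is odd
        level = [h(level[i] + level[i + 1]) for i in range(0, len(level), 2)]
    return level[0]

records = [b"event-1", b"event-2", b"event-3", b"event-4"]
print(merkle_root(records).hex())
tampered = [b"event-1", b"event-2", b"eVent-3", b"event-4"]
print(merkle_root(tampered).hex())           # a single changed record changes the root
```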
Doing the Right Thing
It is essential to treat data about individuals with respect and dignity, because those individuals are humans. There are guidelines, such as the ACM's Software Engineering Code of Ethics and Professional Practice, to help navigate these issues. It is not enough for software engineers to focus solely on the technology and ignore its potential consequences; ethical responsibility is a crucial aspect of our work.

In societies that value human rights, the criminal justice system assumes innocence until proven guilty, but automated systems can exclude a person from participating in society arbitrarily, without evidence of wrongdoing and with little opportunity for appeal. Predictive analytics often base their predictions on how people similar to the individual in question have behaved in the past, which can perpetuate stereotypes, and if there is systematic bias in the input to an algorithm, the system is likely to amplify that bias in its output. Algorithms can also make mistakes, so it is important to determine who is accountable when things go wrong. Data and models should be our tools, not our masters, and we should exercise moral imagination in order to create a better future. It is also crucial to consider the entire system, including its non-computerized elements, in order to anticipate potential consequences, an approach known as systems thinking.

When a service tracks and logs a user's activity for the benefit of advertisers, rather than solely performing the tasks the user requested, it becomes surveillance. Users should have a clear understanding of how their data is used and should be able to give meaningful consent, and data about one user can also reveal information about others who never agreed to any terms of service. It is important to protect privacy and to prevent data from being used to harm individuals. Services that provide personalized content may create echo chambers that promote misinformation, stereotypes, and polarization. We should consider the ethical implications of the technology we build and design systems that respect user autonomy and privacy.
In the end, I have to quote the author of Designing Data-Intensive Applications:
"As software and data are having such a large impact on the world, we engineers must remember that we carry a responsibility to work toward the kind of world that we want to live in: a world that treats people with humanity and respect. I hope that we can work together toward that goal."
Summary
We explored alternative approaches to designing data systems and discussed their future potential. Different use cases are best served by different tools, so applications necessarily combine several pieces of software, and data integration between them can be achieved through batch processing and event streams. Certain systems act as systems of record, while other data is derived from them through transformations; keeping these transformations asynchronous and loosely coupled improves robustness and fault tolerance, because a problem in one part of the system is less likely to spread to the rest. Expressing dataflows as transformations from one dataset to another also facilitates application evolution and recovery from errors. Building on this, we introduced the idea of unbundling the components of a database and composing them to build applications. We discussed how end-to-end request identifiers and asynchronous constraint checking can maintain integrity while avoiding coordination in dataflow systems, and touched on the role of audits in verifying data integrity and detecting corruption. Finally, we examined the ethical considerations of building data-intensive applications, including the potential for data to cause harm through decisions with serious consequences, discrimination, surveillance, and data breaches.