Making Sense of Stream Processing
Martin Kleppmann
2016
The Philosophy Behind Apache Kafka and Scalable Stream Data Platforms
Chapter 1 – Events and Stream Processing
Storing raw events vs. storing aggregated data
Raw event data:
- used by analysts
- offline processing
Aggregate immediately:
- scanning raw data too slow
- monitoring, alerting
Aggregated Summaries - Streaming aggregation
- directly implemented in the web server (e.g. the server increments a counter on every request)
- implemented with an event stream (or message queue, or event log) and multiple consumers (raw storage + aggregation) – see the sketch below
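A minimal sketch of the second option, assuming a Kafka topic named page_views whose message values are URLs; this consumer group maintains the aggregate, while a second group with a different group.id (e.g. a raw-storage archiver) would independently receive the same full stream:

```java
import java.time.Duration;
import java.util.Collections;
import java.util.HashMap;
import java.util.Map;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class PageViewAggregator {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "page-view-aggregator");       // a second group, e.g. "raw-archiver",
                                                              // gets its own full copy of the stream
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        Map<String, Long> viewsPerUrl = new HashMap<>();      // in-memory aggregate (a real system
                                                              // would persist this to a store)
        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("page_views"));   // assumed topic name
            while (true) {
                for (ConsumerRecord<String, String> record : consumer.poll(Duration.ofMillis(500))) {
                    String url = record.value();
                    viewsPerUrl.merge(url, 1L, Long::sum);    // streaming aggregation: increment per event
                }
            }
        }
    }
}
```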
Event Sourcing: from the DDD (Domain-Driven Design) Community
Keeping the list of all changes as a log of immutable events thus gives you strictly richer information than if you overwrite things in the database.
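A toy illustration of that point (plain Java; the cart events are a hypothetical schema): the log records that an item was added and then removed, which an overwritten current-state table would never tell you.

```java
import java.util.ArrayList;
import java.util.List;

public class CartEventSourcing {
    // Immutable events: what the user did, not just the end result.
    record CartEvent(String type, String item) {}   // type is "ADDED" or "REMOVED" (hypothetical schema)

    // Current state is derived by replaying (folding over) the event log.
    static List<String> currentCart(List<CartEvent> log) {
        List<String> cart = new ArrayList<>();
        for (CartEvent e : log) {
            if (e.type().equals("ADDED")) cart.add(e.item());
            else if (e.type().equals("REMOVED")) cart.remove(e.item());
        }
        return cart;
    }

    public static void main(String[] args) {
        List<CartEvent> log = List.of(
                new CartEvent("ADDED", "book"),
                new CartEvent("ADDED", "toaster"),
                new CartEvent("REMOVED", "toaster"));   // an overwrite-style table would never reveal
                                                        // that the toaster was considered and removed
        System.out.println(currentCart(log));           // prints [book]
    }
}
```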
Bringing Together Event Sourcing and Stream Processing
- Raw events = history of what happened / Aggregated data = current state / end result
- Raw events are optimised for DB writes / Aggregate values are optimised for reads
- As a rule of thumb, clicking a button causes an event to be written, and what a user sees on their screen corresponds to aggregated data that is read.
- Raw events = Source of truth / Aggregated data = Derived & Denormalised
The database you read from is just a cached view of the event log.
Treat all the database writes as a stream of events, and build aggregates (DWH (ETL), views, caches (update), search indexes, processes to create new output streams) from that stream.
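A minimal sketch of deriving more than one view from the same stream (plain Java; the UserUpdated event shape is an assumption): both views are built by reading the log sequentially, and a new view could be added later by replaying the log from the start.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class DerivedViews {
    // A write, expressed as an immutable event rather than an in-place update (hypothetical schema).
    record UserUpdated(long id, String name, String email) {}

    public static void main(String[] args) {
        List<UserUpdated> log = List.of(
                new UserUpdated(1, "Alice", "alice@example.com"),
                new UserUpdated(2, "Bob", "bob@example.com"),
                new UserUpdated(1, "Alice", "alice@work.example"));  // later event supersedes the earlier one

        // View 1: current record per user id (like a cache or primary table).
        Map<Long, UserUpdated> byId = new HashMap<>();
        // View 2: user id per email address (like a secondary index).
        Map<String, Long> idByEmail = new HashMap<>();

        // Both views are maintained by reading the same log sequentially.
        for (UserUpdated event : log) {
            UserUpdated previous = byId.put(event.id(), event);
            if (previous != null) idByEmail.remove(previous.email());
            idByEmail.put(event.email(), event.id());
        }

        System.out.println(byId.get(1L).email());              // alice@work.example
        System.out.println(idByEmail.get("bob@example.com"));  // 2
    }
}
```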
Several reasons why you might benefit from an event-sourced approach:
- Loose coupling between writing and reading
- Read and write performance – the old trade-off (normalise for faster writes vs. denormalise for faster reads) no longer applies, because the write-optimised events and the read-optimised views are kept separate
- Scalability – the event stream is a simple abstraction that is easy to parallelise & to decompose into producers and consumers
- Flexibility & agility – no more risky schema migrations, since views are decoupled from the source of truth and a new view can be built alongside the old one
- Error scenarios – the source of truth always remains untainted by buggy code (nothing is destructively overwritten), so a broken view can simply be rebuilt from the log
Complex Event Processing (CEP) – Stream query engines provide higher-level abstractions than stream processing frameworks, but many CEP products are commercial, expensive enterprise software. With CEP, you write queries or rules that match certain patterns in the events. They are comparable to SQL queries (which describe what results you want to return from a database), except that the CEP engine continually searches the stream for sets of events that match the query and notifies you (generates a “complex event”) whenever a match is found. This is useful for fraud detection or monitoring business processes, for example.
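A real CEP engine is far more expressive, but a hand-rolled sketch conveys the idea (plain Java; the event shape, threshold, and window size are assumptions): watch a stream of failed-login events and generate a "complex event" when one account fails five times within a minute.

```java
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

public class FailedLoginRule {
    // Input event (hypothetical schema): a failed login attempt for an account at a given time.
    record LoginFailed(String account, long timestampMillis) {}

    private static final int THRESHOLD = 5;          // assumed rule parameters
    private static final long WINDOW_MILLIS = 60_000;

    private final Map<String, Deque<Long>> recentFailures = new HashMap<>();

    // Called for every event in the stream; returns true when the pattern matches,
    // i.e. when the "complex event" (possible brute-force attack) should be generated.
    boolean onEvent(LoginFailed event) {
        Deque<Long> times = recentFailures.computeIfAbsent(event.account(), k -> new ArrayDeque<>());
        times.addLast(event.timestampMillis());
        // Drop failures that have fallen out of the sliding window.
        while (!times.isEmpty() && event.timestampMillis() - times.peekFirst() > WINDOW_MILLIS) {
            times.removeFirst();
        }
        return times.size() >= THRESHOLD;
    }

    public static void main(String[] args) {
        FailedLoginRule rule = new FailedLoginRule();
        long t = 0;
        for (int i = 1; i <= 6; i++) {
            boolean match = rule.onEvent(new LoginFailed("alice", t += 5_000));
            System.out.println("failure " + i + " -> alert: " + match);   // alerts from the 5th failure on
        }
    }
}
```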
Other event-ish stuff:
- Actor frameworks
- “Reactive”/FRP/Dataflow
- Database change capture
Chapter 2 - Using Logs to Build a Solid Data Infrastructure
Having many different storage systems is not a problem in and of itself: if they were all independent from one another, it wouldn’t be a big deal. The real trouble here is that many of them end up containing the same data, or related data, but in different form. Denormalisation, caching, indexes, and aggregations are various kinds of redundant data: keeping the same data in a different representation in order to speed up reads.
“The problem of data integration: keeping data systems synchronized.”
Concurrent writes to several systems (dual writes) can race with each other and leave those systems holding different values.
Transaction atomicity means that if you make several changes, they either all happen or none happen. The traditional approach of wrapping the two writes in a transaction works fine in databases that support it, but many of the new generation of databases (“NoSQL”) don’t, so you’re on your own. The problem also remains if the two writes go to two different databases, since no transaction spans both systems.
If you do all your writes sequentially, without any concurrency, you have removed the potential for race conditions. Moreover, if you write down the order in which you make your writes, it becomes much easier to recover from partial failures.
Logs appear in many areas of computing:
- DB storage engines – Write-Ahead Log (WAL), Log-Structured Storage
- DB replication – Replication Log
- Distributed consensus – Raft consensus protocol
- Kafka – typically used as a message broker for publish-subscribe event streams.
Data streams in Kafka are split into partitions. Each partition is a log, that is, a totally ordered sequence of messages. However, different partitions are completely independent from one another, so there is no ordering guarantee across different partitions. This allows different partitions to be handled on different servers, and so Kafka can scale horizontally.
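A minimal producer sketch (the topic name user_events and the broker address are assumptions): using the user ID as the message key routes all of one user's events to the same partition, preserving their relative order, while different users' events may land on different partitions.

```java
import java.util.Properties;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class UserEventProducer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("key.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");
        props.put("value.serializer",
                "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaProducer<String, String> producer = new KafkaProducer<>(props)) {
            // Same key ("user-42") => same partition => the two events keep their order.
            producer.send(new ProducerRecord<>("user_events", "user-42", "added item to cart"));
            producer.send(new ProducerRecord<>("user_events", "user-42", "checked out"));
            // A different key may go to a different partition; no ordering guarantee relative to user-42.
            producer.send(new ProducerRecord<>("user_events", "user-7", "logged in"));
            producer.flush();
        }
    }
}
```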
Advanced Message Queuing Protocol (AMQP) and Java Message Service (JMS) message brokers use per-message acknowledgements to keep track of which messages were successfully consumed, and redeliver any messages that the consumer failed to process. A consequence of this redelivery behaviour is that messages can be delivered out of order. Kafka maintains a fixed ordering of messages within one partition, and always delivers those messages in the same order. For that reason, Kafka doesn’t need to keep track of acknowledgements for every single message: instead, it is sufficient to keep track of the latest message offset that a consumer has processed in each partition. AMQP and JMS are good for job queues (where the ordering of messages is not important, but it is important to be able to easily use a pool of threads to process jobs in parallel and retry any failed jobs); Kafka is good for event logs.
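A sketch of the offset-based model (same assumed topic; the consumer group name is made up): rather than acknowledging individual messages, the consumer periodically commits "processed up to offset N" per partition, and after a crash it resumes from the last committed offset.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;
import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;

public class OffsetTrackingConsumer {
    public static void main(String[] args) {
        Properties props = new Properties();
        props.put("bootstrap.servers", "localhost:9092");   // assumed broker address
        props.put("group.id", "audit-log-writer");           // hypothetical consumer group
        props.put("enable.auto.commit", "false");            // commit offsets explicitly, after processing
        props.put("key.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");
        props.put("value.deserializer",
                "org.apache.kafka.common.serialization.StringDeserializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(props)) {
            consumer.subscribe(Collections.singletonList("user_events"));
            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // Process in partition order; there is no per-message ack.
                    System.out.printf("partition %d offset %d: %s%n",
                            record.partition(), record.offset(), record.value());
                }
                // One number per partition is enough: "processed everything up to here".
                // After a crash the consumer resumes from the last committed offset,
                // so messages may be reprocessed but are never skipped.
                consumer.commitSync();
            }
        }
    }
}
```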
Solving the Data Integration Problem – Stop doing dual writes (they lead to inconsistent data); have your application only append data to a log, and have all databases, indexes, and caches be constructed by reading sequentially from that log.
Transactions and Integrity Constraints – using “claim” events
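A minimal sketch of the claim-event idea for a uniqueness constraint (plain Java; names and event shape are assumptions): each registration request is appended to the log as a claim, and a single validator processes the claims strictly in log order, so the first claim for a name wins and there is no race.

```java
import java.util.HashSet;
import java.util.List;
import java.util.Set;

public class UsernameClaims {
    // Appended to the log by any number of concurrent writers (hypothetical event shape).
    record UsernameClaimed(String requestId, String username) {}

    public static void main(String[] args) {
        // The log gives all claims a single total order, even if they were submitted concurrently.
        List<UsernameClaimed> log = List.of(
                new UsernameClaimed("req-1", "martin"),
                new UsernameClaimed("req-2", "alice"),
                new UsernameClaimed("req-3", "martin"));   // duplicate claim, later in the log

        // A single-threaded validator consumes the log sequentially, so the
        // "is this name taken?" check and the registration cannot interleave.
        Set<String> taken = new HashSet<>();
        for (UsernameClaimed claim : log) {
            boolean granted = taken.add(claim.username());
            // In a real system the outcome would be emitted as another event
            // (e.g. UsernameGranted / UsernameRejected) for the original requester to observe.
            System.out.println(claim.requestId() + " -> " + (granted ? "granted" : "rejected"));
        }
    }
}
```

This is the same point as the earlier note on sequential writes: because the validator's check and update are serialised by the log, the check-then-act race condition disappears.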
“The truth is the log. The database is a cache of a subset of the log.” – Pat Helland
Chapter 3 - Integrating Databases and Kafka with Change Data Capture (CDC)
Capturing all data changes that are written to a database, and exporting them to a log.
Some self-promotion for “Bottled Water”, the author's own tool that turns a PostgreSQL database into a Kafka stream: it takes a consistent snapshot (dump) of the existing data and then streams every subsequent change starting from the point where the dump ended.
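A sketch of the consuming side of that dump-then-stream approach (plain Java; the change-event shape is an assumption): a derived store loads the consistent snapshot first, then applies each later change in log order, ending up with the same rows as the source database.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class ChangeDataCaptureApply {
    // One row-level change from the source database (hypothetical schema).
    // op is "INSERT", "UPDATE" or "DELETE"; key is the primary key; value is the new row (null on delete).
    record Change(String op, String key, String value) {}

    public static void main(String[] args) {
        // 1. Consistent snapshot of the existing table contents.
        Map<String, String> view = new HashMap<>();
        view.put("user:1", "Alice");
        view.put("user:2", "Bob");

        // 2. Change stream starting exactly where the snapshot ended.
        List<Change> changes = List.of(
                new Change("UPDATE", "user:2", "Bobby"),
                new Change("INSERT", "user:3", "Carol"),
                new Change("DELETE", "user:1", null));

        // Applying the changes in log order keeps the derived view in sync with the source database.
        for (Change c : changes) {
            if (c.op().equals("DELETE")) view.remove(c.key());
            else view.put(c.key(), c.value());   // INSERT and UPDATE are both upserts here
        }

        System.out.println(view);   // two rows remain: user:2=Bobby and user:3=Carol
    }
}
```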
Chapter 4 – The Unix Philosophy of Distributed Data
The Unix philosophy:
- make each program do one thing well
- expect the output of every program to become the input to another, as yet unknown program
Unix Architecture versus Database Architecture
Chapter 5 - Turning the Database Inside Out
...