
Designing Event Driven Systems – Summary of Arguments

Thursday, October 4th, 2018

This post provides a terse summary of the high-level arguments addressed in my book.

Why Change is Needed

Technology has changed:

  • Partitioned, replayable logs provide previously unattainable levels of throughput (terabits per second), storage (petabytes) and high availability.
  • Stateful stream processors include a rich suite of utilities for handling streams, tables, joins, buffering of late events (important in asynchronous communication) and state management. These tools interface directly with business logic. Transactions tie streams and state together efficiently.
  • Kafka Streams and KSQL are DSLs which can be run as standalone clusters, or embedded directly into applications and services. The latter approach makes streaming an API, interfacing inbound and outbound streams directly with your code.
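The "buffering of late events" point is central to asynchronous processing. As a rough illustration (a minimal Python sketch, not the Kafka Streams API), a windowed aggregator can keep accepting events for a window during a grace period, so out-of-order arrivals still land in the window their timestamp belongs to:

```python
from collections import defaultdict

class WindowedCounter:
    """Counts events in tumbling windows, tolerating late arrivals.

    An event is accepted as long as it is not more than `grace` time
    units behind the highest event time seen so far; it is assigned to
    the window its own timestamp belongs to.
    """

    def __init__(self, window_size, grace):
        self.window_size = window_size
        self.grace = grace
        self.counts = defaultdict(int)   # window start -> event count
        self.max_seen = 0                # high-water mark of event time

    def on_event(self, timestamp):
        self.max_seen = max(self.max_seen, timestamp)
        window_start = timestamp - (timestamp % self.window_size)
        # Accept unless the event is later than the grace period allows.
        if timestamp >= self.max_seen - self.grace:
            self.counts[window_start] += 1

counter = WindowedCounter(window_size=10, grace=5)
for ts in [1, 4, 12, 8, 17, 2]:   # 8 and 2 arrive out of order
    counter.on_event(ts)

# The late event at t=8 still lands in window [0,10);
# t=2 arrives beyond the grace period and is dropped.
print(dict(counter.counts))  # {0: 3, 10: 2}
```

Kafka Streams expresses the same idea declaratively (windowed aggregations with a grace period); the sketch only shows why a stateful buffer is needed at all when events arrive asynchronously.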

Businesses need asynchronicity:

  • Businesses are a collection of people, teams and departments performing a wide range of functions, backed by technology. Teams need to work asynchronously with respect to one another to be efficient.
  • Many business processes are inherently asynchronous, for example shipping a parcel from a warehouse to a user’s door.
  • A business may start as a website, where the front end makes synchronous calls to backend services, but as it grows the web of synchronous calls tightly couple services together at runtime. Event-based methods reverse this, decoupling systems in time and allowing them to evolve independently of one another.

A message broker has notable benefits:

  • It flips control of routing, so a sender does not know who receives a message, and there may be many different receivers (pub/sub). This makes the system pluggable, as the producer is decoupled from the potentially many consumers.
  • Load and scalability become a concern of the broker, not the source system.
  • There is no need for backpressure: each receiver defines its own flow control.
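These three properties can be seen in a toy broker (a hypothetical Python sketch, loosely modelled on the pull-based style Kafka uses): producers append to a topic without knowing who reads it, and each consumer pulls at its own pace via a private offset, so no backpressure is required.

```python
class Broker:
    """Minimal pub/sub broker: producers append to a topic log; each
    consumer pulls at its own pace via a private offset, so the broker
    (not the producer) absorbs load."""

    def __init__(self):
        self.topics = {}    # topic -> list of messages (the log)
        self.offsets = {}   # (topic, consumer) -> next offset to read

    def publish(self, topic, message):
        # The producer has no knowledge of consumers: routing is flipped.
        self.topics.setdefault(topic, []).append(message)

    def poll(self, topic, consumer, max_messages=1):
        """Consumer-driven flow control: fetch at most `max_messages`."""
        log = self.topics.get(topic, [])
        offset = self.offsets.get((topic, consumer), 0)
        batch = log[offset:offset + max_messages]
        self.offsets[(topic, consumer)] = offset + len(batch)
        return batch

broker = Broker()
for n in range(5):
    broker.publish("orders", {"order_id": n})

fast = broker.poll("orders", "billing", max_messages=5)   # drains everything
slow = broker.poll("orders", "shipping", max_messages=2)  # reads at its own pace
print(len(fast), len(slow))  # 5 2
```

Adding a new consumer is just a new offset entry: the system is pluggable because the producer is unaffected.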

Systems still require Request-Response:

  • Whilst many systems are built to be entirely event-driven, request-response protocols remain the best choice for many use cases. The rule of thumb is: use request-response for intra-system communication, particularly queries or lookups (customers, shopping carts, DNS); use events for state changes and inter-system communication (changes to business facts that are needed beyond the scope of the originating system).

Data-on-the-outside is different:

  • In service-based ecosystems the data that services share is very different to the data they keep inside their service boundary. Outside data is harder to change, but it has more value in a holistic sense.
  • The events services share form a journal, or ‘Shared Narrative’, describing exactly how your business evolved over time.

Databases aren’t well shared:

  • Databases have rich interfaces that couple them tightly with the programs that use them. This makes them useful tools for data manipulation and storage, but poor tools for data integration.
  • Shared databases form a bottleneck (performance, operability, storage etc.).

Data Services are still “databases”:

  • A database wrapped in a service interface still suffers from many of the issues seen with shared databases (The Integration Database Antipattern). Either it provides all the functionality you need (becoming a homegrown database) or it provides a mechanism for extracting that data and moving it (becoming a homegrown replayable log).

Data movement is inevitable as ecosystems grow.

  • The core datasets of any large business end up being distributed to the majority of applications.  
  • Messaging moves data from a tightly coupled place (the originating service) to a loosely coupled place (the service that is using the data). Because this gives teams more freedom (operationally, data enrichment, processing), it tends to be where they eventually end up.

Why Event Streaming

Events should be 1st Class Entities:

  • Events are two things: (a) a notification and (b) a state transfer. The former leads to stateless architectures, the latter to stateful architectures. Both are useful.
  • Events become a Shared Narrative describing the evolution of the business over time: When used with a replayable log, service interactions create a journal that describes everything a business does, one event at a time. This journal is useful for audit, replay (event sourcing) and debugging inter-service issues.
  • Event-Driven Architectures move data to wherever it is needed: Traditional services are about isolating functionality that can be called upon and reused. Event-Driven architectures are about moving data to code, be it a different process, geography, disconnected device etc. Companies need both. The larger and more complex a system gets, the more it needs to replicate state.
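The notification/state-transfer distinction from the first bullet can be made concrete. In this hypothetical sketch (event fields are illustrative, not a prescribed schema), a notification event forces the consumer to call back for details, while an event-carried state transfer lets the consumer build its own local state:

```python
# (a) Notification: consumers learn *that* something happened and must
#     call back (request-response) for details -> stateless consumer.
notification = {"type": "OrderCreated", "order_id": 42}

# (b) Event-carried state transfer: the event carries the facts, so a
#     consumer can materialise its own view without calling back.
state_transfer = {
    "type": "OrderCreated",
    "order_id": 42,
    "customer": "alice",
    "items": [{"sku": "book-1", "qty": 2}],
}

# A stateful consumer builds a local view from state-transfer events.
orders_view = {}

def on_event(event):
    if event["type"] == "OrderCreated" and "items" in event:
        orders_view[event["order_id"]] = event

on_event(state_transfer)
print(orders_view[42]["customer"])  # alice
```

Style (a) keeps consumers simple but couples them to the source at runtime; style (b) replicates state and decouples them, which is the trade the rest of this post explores.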

Messaging is the most decoupled form of communication:

  • Coupling relates to a combination of (a) data, (b) function and (c) operability.
  • Businesses have core datasets: these provide a base level of unavoidable coupling.
  • Messaging moves this data from a highly coupled source to a loosely coupled destination, giving destination services control.

A Replayable Log turns ‘Ephemeral Messaging’ into ‘Messaging that Remembers’:

  • Replayable logs can hold large, “Canonical” datasets where anyone can access them.
  • You don’t ‘query’ a log in the traditional sense. Instead you extract the data and create a view, in a cache or a database of your own, or you process it in flight. The replayable log provides a central reference. This pattern gives each service the ‘slack’ it needs to iterate and change, as well as the freedom to fit the derived view to the problem it needs to solve.
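As a minimal sketch of this pattern (plain Python, with an in-memory list standing in for the replayable log), two services replay the same log but derive completely different views of it:

```python
from collections import Counter

# An append-only log: the central, canonical reference.
log = []

def append(event):
    log.append(event)

append({"customer": "alice", "country": "UK"})
append({"customer": "bob", "country": "US"})
append({"customer": "alice", "country": "FR"})  # alice moves country

# Service A replays the log into a key-value cache: latest state per customer.
cache = {}
for event in log:
    cache[event["customer"]] = event["country"]

# Service B replays the *same* log into a different view: counts per country.
counts = Counter(event["country"] for event in log)

print(cache)         # {'alice': 'FR', 'bob': 'US'}
print(counts["UK"])  # 1
```

If a view's requirements change, the service simply re-derives it from the log; nothing upstream needs to know.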

Replayable Logs work better at keeping datasets in sync across a company:

  • Data that is copied around a company can be hard to keep in sync. The different copies have a tendency to slowly diverge over time. Use of messaging in industry has highlighted this.
  • If messaging ‘remembers’, it’s easier to stay in sync: the back-catalogue of data (the source of truth) is readily available.
  • Streaming encourages derived views to be frequently re-derived. This keeps them close to the data in the log.

Replayable logs lead to Polyglot Views:

  • There is no one-size-fits-all in data technology.
  • Logs let you have many different data technologies, or data representations, sourced from the same place.

In Event-Driven Systems the Data Layer isn’t static

  • In traditional applications the data layer is a database that is queried. In event-driven systems the data layer is a stream processor that prepares and coalesces data into a single event stream for ingest by a service or function.
  • KSQL can be used as a data preparation layer that sits apart from the business functionality. KStreams can be used to embed the same functionality into a service.
  • The streaming approach removes shared state (for example a database shared by different processes) allowing systems to scale without contention.
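The "data layer as a stream processor" idea amounts to a stream-table join that coalesces inputs into a single prepared stream. This hypothetical sketch (plain Python, with invented event fields; KSQL or Kafka Streams would express the same join declaratively) enriches an order stream with customer state before a downstream service sees it:

```python
# Table: latest state per customer, continuously built from a stream.
customers = {}

# The single, coalesced stream the downstream service ingests.
enriched = []

def on_customer_event(event):
    customers[event["id"]] = event          # upsert into the table

def on_order_event(order):
    customer = customers.get(order["customer_id"], {})
    # Stream-table join: coalesce the two inputs into one event.
    enriched.append({**order, "country": customer.get("country")})

on_customer_event({"id": "c1", "country": "UK"})
on_order_event({"order_id": 1, "customer_id": "c1"})
print(enriched[0]["country"])  # UK
```

Because the preparation happens in flight, the downstream service holds no shared database with anyone else: its only dependency is the event stream.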

The ‘Database Inside Out’ analogy is useful when applied at cross-team or company scales:

  • A streaming system can be thought of as a database turned inside out: a commit log and a set of materialized views, caches and indexes created in different datastores or in the streaming system itself. This leads to two benefits.
    • Data locality is used to increase performance: data is streamed to where it is needed, in a different application, a different geography, a different platform, etc.
    • Data locality is used to increase autonomy: Each view can be controlled independently of the central log.
  • At company scales this pattern works well because it carefully balances the need to centralize data (to keep it accurate), with the need to decentralise data access (to keep the organisation moving).

Streaming is a State of Mind:

  • Databases, request-response protocols and imperative programming lead us to think in blocking calls and command-and-control structures. Thinking of a business solely in this way is flawed.
  • The streaming mindset starts by asking “what happens in the real world?” and “how does the real world evolve in time?” The business process is then modelled as a set of continuously computing functions driven by these real-world events.
  • Request-response is about displaying information to users. Batch processing is about offline reporting. Streaming is about everything that happens in between.

The Streaming Way:

  • Broadcast events
  • Cache shared datasets in the log and make them discoverable.
  • Let users manipulate event streams directly (e.g., with a streaming engine like KSQL)
  • Drive simple microservices or FaaS, or create use-case-specific views in a database of your choice

The various points above lead to a set of broader principles that summarise the properties we expect in this type of system:

The WIRED Principles

Windowed: Reason accurately about an asynchronous world.

Immutable: Build on a replayable narrative of events.

Reactive: Be asynchronous, elastic & responsive.

Evolutionary: Decouple. Be pluggable. Use canonical event streams.

Data-Enabled: Move data to services and keep it in sync.


Slides from Craft Meetup

Wednesday, May 9th, 2018

The slides for the Craft Meetup can be found here.


Building Event Driven Services with Kafka Streams (Kafka Summit Edition)

Monday, April 23rd, 2018

The Kafka Summit version of this talk is more practical and includes code examples which walk through how to build a streaming application with Kafka Streams.

Building Event Driven Services with Kafka Streams from Ben Stopford

Slides from Strata Software Architecture: The Data Dichotomy – Rethinking data and services with streams

Wednesday, April 5th, 2017

Strata Software Architecture NY: The Data Dichotomy from Ben Stopford

QCon 2017: The Power of the Log

Wednesday, March 8th, 2017

VIDEO HERE

This talk is about the beauty of sequential access and append-only data structures. We’ll do this in the context of a little-known paper entitled “Log Structured Merge Trees”. LSM describes a surprisingly counterintuitive approach to storing and accessing data in a sequential fashion. It came to prominence in Google’s Bigtable paper and today, the use of logs, LSM and append-only data structures drives many of the world’s most influential storage systems: Cassandra, HBase, RocksDB, Kafka and more. Finally we’ll look at how the beauty of sequential access goes beyond database internals, right through to how applications communicate, share data and scale.

The Power of the Log from Ben Stopford



Streaming, Databases & Distributed Systems – Bridging the Divide

Wednesday, November 23rd, 2016

This talk introduces Stateful Stream Processing and makes a case for SSP as a general approach to data computation in distributed environments.

Slides, alone, can be found here:

Streaming, Database & Distributed Systems Bridging the Divide from Ben Stopford



Slides from Codemesh & BigDataLdn

Friday, November 4th, 2016

Streaming, Database & Distributed Systems Bridging the Divide from Ben Stopford

Data Pipelines with Apache Kafka from Ben Stopford

Slides for JAX London

Wednesday, October 12th, 2016

Same title, but different content to the QCon one.

JAX London Slides from Ben Stopford

Slides from QCon: Microservices for a Streaming World

Monday, March 7th, 2016

Full talk can be found HERE


Abstract for Code Mesh 2015

Monday, July 20th, 2015

Contemporary Approaches to Data at Scale (tbc)

We use a host of tricks these days for handling data at scale. Disk structures are tuned to specific workloads. Streams are used to create continuous pipelines of processing. Hardware offers incredible diversity in terms of latency and throughput.

The tools available: Cassandra, Postgres, Hadoop, Kafka, Hazelcast, Storm etc all come with tradeoffs unique to themselves. We’ll look at these as individual elements. We’ll also look at compositions that leverage these individual sweet spots to create more powerful, holistic platforms.


Abstracts for Øredev 2015

Thursday, July 9th, 2015

The Future of Data Technology (6th Nov 15.40)

No longer does one size fit all when it comes to data technology. At least not for many of today’s use cases. Will this ever change? Will we continue to diversify? Will we go full circle? Certainly ours is an industry in flux. NoSQL, Big Data and stream technology, containerisation, commodity PCIe storage, non-volatile memory and a host of other forces will shape the data technologies of the future.

In this talk we’ll make a case for what the future may look like, what challenges we’ll encounter and how it will likely change the applications we build.

Elements of Scale: Composing And Scaling Data Platforms (5th Nov 14.20)

Today there are a host of data-centric challenges that need more than a single technology to solve. Data platforms step in, blending different technologies to solve a common goal. 

But to compose such platforms we need an understanding of the tradeoffs of each constituent part: their sweet spots, how they complement one another and what sacrifices they make in return.

This talk is really a grand tour of these evolutionary forces. We’ll cover a lot of ground, building up from disk formats right through to fully distributed, streaming and batch driven architectures. In the end we should see how these various pieces come together to form a pleasant and useful whole. 


Abstract for JAX London 2015

Thursday, July 9th, 2015

Intuitions for Scaling Data-Centric Architectures (14th Oct 11.20)

This talk will examine the various intuitions and trade-offs needed to scale a data-centric application or architecture. Building from the fundamentals of data locality, immutability and parallelism, attendees will gain a sense for how full-blown architectures can be sewn together. The result: a balance of real-time storage, streaming and analytics that plays to the relative strengths of different component parts.

