This transcribed talk explores a range of data platforms through a lens of basic hardware and software tradeoffs.
Streams have many benefits, from promoting reactive architecture and asynchronicity to bridging operational and analytic worlds. This post explores how.
A detailed look at the interesting LSM file organisation used in BigTable, Cassandra and, most recently, MongoDB.
A lighthearted look at Oracle & Google using a metaphorical format. The style won’t suit everyone, but it’s a bit of fun!
- Codemesh-2016: Streaming, Databases & Distributed Systems – Bridging the Divide
- JAXLondon-2016: Microservices for a Streaming World (The Data Dichotomy)
- QCon-2016: Microservices for a Streaming World (video)
- CodeMesh-2015: Contemporary Approaches to Data at Scale (video)
- Øredev-2015: The Future of Data Technology (video)
- JAXLondon-2015: Intuitions for Scaling Data-Centric Architectures (video)
- ProgsCon/JAXF-2015: Elements of Scale
- RBS-2014: Scaling Data
- BigDataCon-2013: The Return of Big Iron?
- JAX-2013: The Return of Big Iron?
- QCon-2012: Where Big Data meets Big Database (video)
- QCon-2012: Progressive Architectures at RBS (video)
- JavaOne-2011: Balancing Replication and Partitioning in a Distributed Java Database
- QCon-2011: Beyond the Data Grid (video)
- UCL-2011: A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for High Performance Data Access
- CoSIG-2011: Oracle Coherence Implementation Patterns (Special Interest Group)
- ICST-2011: Test-Oriented Languages: a new era?
- ICST-2011: Enabling Development Practices in Remote Locations
- Birkbeck-2011: Data Storage for Extreme Use Cases
- RefTest-2010: Has Mocking Gone Wrong?
- RBS-2009: Data Grids with Oracle Coherence
- Brunel-2008: The Architect's Two Hats
- Brunel-2007: Architecture and Design in Industry
- Does In-Memory Really Make Sense? (2016)
- Elements of Scale: Composing and Scaling Data Platforms (2015)
- Upside Down Databases: Bridging the Operational and Analytic Worlds with Streams (2015)
- Log Structured Merge Trees (2015)
- Building a Career in Technology (2015)
- A World of Chinese Whispers (2014)
- Database Y (2013)
- The Big Data Conundrum (2012)
- Where does Big Data meet Big Database? (2012)
- A Story about George (2012)
- The Rebirth of the In-Memory Database (2011)
- Is the Traditional Database a Thing of the Past? (2009)
- Shared Nothing vs. Shared Disk Architectures: An Independent View (2009)
- Component Software. Where is it going? (2005)
- Do Metrics Have a Place in Software Engineering Today? (2004)
Test Driven Development (all)
- Test Oriented Languages: Is it Time for a New Era? (2011)
- Beyond Stubs: Why We Need Interaction Testing (2010)
- Isolating Functional Units: Why We Need Stubs (2010)
- Are Mocks All They Are Cracked Up To Be? (2010)
Data Tech (all)
- Best of VLDB 2014 (2015)
- A Guide to building a Central, Consolidated Data Store for a Company (2014)
- An initial look at Actian’s ‘SQL in Hadoop’ (2014)
- The Best of VLDB 2012 (2012)
- Thinking in Graphs: Neo4J (2012)
- A Brief Summary of the NoSQL World (2012)
- ODC – A Distributed Datastore built at RBS (2012)
- Looking at Intel Xeon Phi (Knights Corner) (2012)
Team / Process / Interviewing (all)
- The Iffy Tractor (Can they code OO?) (2011)
- The Business Analyst Test (2011)
- Distributing Skills Across a Continental Divide (2011)
- Learning Practices for Distributed Teams (ICST) (2011)
- Interviewing: The Importance of Examining Applied Knowledge (2010)
- Mapping Personal Practices (2010)
- Four HPC Architecture Questions – With Answers (2009)
This talk introduces Stateful Stream Processing and makes a case for SSP as a general approach to data computation in distributed environments.
Slides alone can be found here:
Slides from Codemesh & BigDataLdn (Nov 4th, 2016)
Slides for JAX London (Oct 12th, 2016)
Same title, but different content to the QCon one.
Full talk can be found HERE
QCon Interview on Microservices and Stream Processing (Feb 19th, 2016)
This is a transcript from an interview I did for QCon (delivered verbally):
QCon: What is your main focus right now at Confluent?
Ben: I work as an engineer in the Apache Kafka Core Team. I do some system architecture work too. At the moment, I am working on automatic data balancing within Kafka. Auto data balancing is basically expanding, contracting and balancing resources within the cluster as you add/remove machines or add some other kind of constraint or invariant. In short, I'm working on making the cluster grow and shrink dynamically.
QCon: Is stream processing new?
Ben: Stream processing, as we know it, has really come from the background of batch analytics (around Hadoop), and that has kind of evolved into this stream processing thing as people needed to get things done faster. Although, to be honest, stream processing has been around for 30 years in one form or another, it has just always been quite niche. It's only recently that it's moved mainstream. That's important because if you look at the stream processing technology from a decade ago, it was just a bit more specialist, less scalable, less available and less accessible (though certainly not simple). Now that stream processing is more mainstream, it comes with a lot of quite powerful tooling and the ecosystem is just much bigger.
QCon: Why do you feel streaming data is an important consideration for Microservice architectures?
Ben: So you don’t see people talking about stream processing and Microservices together all that much. This is largely because they came from different places. But stream processing turns out to be pretty interesting from the Microservice perspective because there’s a bunch of overlap in the problems they need to solve as data scales out and business workflows cross service and data boundaries.
As you move from a monolithic application to a set of distributed services, you end up with much more complicated systems to plan and build (whether you like it or not). People typically have ReST for Request/Response, but most of the projects we see have moved their fact distribution to some sort of brokered approach, meaning they end up with some combination of request/response and event-based processing. So if ReST is at one side of the spectrum, then Kafka is at the other, and the two end up being pretty complementary. There is actually a cool interplay between these two when you start thinking about it. Synchronous communication works well for a bunch of use cases, particularly GUIs or external services that are inherently RPC. Event-driven methods tend to work better for business processes, particularly as they get larger and more complex. This leads to patterns that end up looking a lot like event-driven architectures.
So when we actually build these things a bunch of problems pop up because no longer do we have a single shared database. We have no global bag of state in the sky to lean on. Sure, we have all the benefits of bounded contexts, nicely decoupled from the teams around them and this is pretty great for the most part. Database couplings have always been a bit troublesome and hard to manage. But now we hit all the pains of a distributed system and this means we end up having to be really careful about how we sew data together so we don’t screw it up along the way.
Relying on a persistent, distributed log helps with some of these problems. You can blend the good parts of shared statefulness and reliability without the tight centralised couplings that come with a shared database. That’s actually pretty useful from a microservices perspective because you can lean on the backing layer for a bunch of stuff around durability, availability, recovery, concurrent processing and the like.
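To make the idea concrete, here is a minimal, single-process sketch (my own illustration, not Kafka's actual implementation): services share an append-only log, and each consumer tracks its own offset, so consumers read independently and can replay history after a restart. The `Log` class and event shapes are assumptions for illustration only.

```python
class Log:
    """Toy append-only event log with independent consumer offsets."""

    def __init__(self):
        self.events = []   # durable, ordered history shared by all consumers
        self.offsets = {}  # consumer name -> next index to read

    def append(self, event):
        self.events.append(event)

    def poll(self, consumer):
        """Return this consumer's unread events and advance its offset."""
        start = self.offsets.get(consumer, 0)
        unread = self.events[start:]
        self.offsets[consumer] = len(self.events)
        return unread


log = Log()
log.append({"type": "OrderCreated", "id": 1})
log.append({"type": "PaymentReceived", "id": 1})

# Two services consume at their own pace; neither blocks the other,
# and a late-starting consumer still sees the full history.
billing = log.poll("billing-service")    # sees the first two events
log.append({"type": "OrderShipped", "id": 1})
shipping = log.poll("shipping-service")  # sees all three, from offset 0
```

Because the log retains history, recovery is just resetting an offset and replaying, which is the durability and recovery property described above.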
But it isn’t just durability and history that helps. Services end up having to tackle a whole bunch of other problems that share similarities with stream processing systems. Scaling out, providing redundancy at a service level, dealing with sources that can come and go, where data may arrive on time or may be held up. Combining data from a bunch of different places. Quite often this ends up being solved by pushing everything into a database, inside the service, and querying it there, but that comes with a bunch of problems in its own right.
So a lot of the functions you end up building to do processing in these services, overlap with what stream processing engines do: join tables and streams from different places, create views that match your own domain model. Filter, aggregate, window these things further. Put this alongside a highly available distributed log and you start to get a pretty compelling toolset for building services that scale simply and efficiently.
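As a rough sketch of the filter/aggregate/window operations mentioned above (in plain Python rather than a real stream processing engine; the event shape and window size are illustrative assumptions), a fixed-window count looks like this:

```python
from collections import defaultdict


def windowed_counts(events, window_ms, predicate):
    """Filter a stream of events, then count them per key in fixed windows."""
    counts = defaultdict(int)
    for event in events:
        if not predicate(event):
            continue  # filter step
        # Assign the event to the fixed window containing its timestamp.
        window_start = (event["ts"] // window_ms) * window_ms
        counts[(event["key"], window_start)] += 1  # aggregate step
    return dict(counts)


events = [
    {"key": "payments", "ts": 100, "amount": 10},
    {"key": "payments", "ts": 450, "amount": 0},   # filtered out below
    {"key": "payments", "ts": 900, "amount": 25},
    {"key": "payments", "ts": 1200, "amount": 5},
]

# Count non-zero payments in one-second windows.
result = windowed_counts(events, 1000, lambda e: e["amount"] > 0)
# -> {('payments', 0): 2, ('payments', 1000): 1}
```

A real engine adds the hard parts this sketch ignores: late-arriving data, state that survives failure, and joins across partitioned sources.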
QCon: What’s the core message of your talk?
Ben: So the core message is pretty simple. There’s a bunch of stuff going on over there, there’s a bunch of stuff going on over here. Some people are mashing this stuff together and some pretty interesting things are popping out. It’s about bringing these parts of industry together. So utilizing a distributed log as a backbone has some pretty cool side effects. Add a bit of stream processing into the mix and it all gets a little more interesting still.
So say you’re replacing a mainframe with a distributed service architecture, running on the cloud, you actually end up hitting a bunch of the same problems you hit in the analytic space as you try to get away from the Lambda Architecture.
The talk dives into some of these problems and tries to spell out a different way of approaching them from a services perspective, but using a stream processing toolset. Interacting with external sources, slicing and dicing data, dealing with windowing, dealing with exactly once processing, and not just from the point of view of web logs or social data. We’ll be thinking business systems, payment processing and the like.
Does In-Memory Really Make Sense? (Jan 3rd, 2016)
There is an intuition we all share that RAM is faster than disk. This is a general truth, despite there being examples to the contrary. It’s not surprising then that in-memory technologies remain popular in the data space. Yet they’re not without downsides. Some obvious, some less so.
Let's consider why we use disk at all. Gaining a degree of fault tolerance is a common reason: we want to be able to pull the plug without fear of losing data. But if we have the data safely held elsewhere, this isn't such a big deal.
Disk is also useful for providing extra storage, allowing us to 'overflow' our available memory. This can become painful if we take the concept too far. The sluggish performance of an overloaded PC that's constantly paging memory to and from disk is an intuitive example, but this approach actually proves to be very successful in many data technologies, when the commonly used dataset fits largely in memory.
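The overflow idea can be sketched in a few lines (a toy of my own, not any particular product): keep the hot working set in a bounded in-memory tier and spill the coldest entries to a slower backing store standing in for disk. Names and sizes here are illustrative assumptions.

```python
from collections import OrderedDict


class OverflowStore:
    """Toy two-tier store: bounded LRU memory tier over a slower 'disk' tier."""

    def __init__(self, memory_slots):
        self.memory = OrderedDict()  # hot set, least-recently-used first
        self.disk = {}               # slower overflow tier
        self.memory_slots = memory_slots

    def put(self, key, value):
        self.memory[key] = value
        self.memory.move_to_end(key)
        if len(self.memory) > self.memory_slots:
            # Page the coldest entry out to the overflow tier.
            cold_key, cold_value = self.memory.popitem(last=False)
            self.disk[cold_key] = cold_value

    def get(self, key):
        if key in self.memory:
            self.memory.move_to_end(key)  # fast path: memory hit
            return self.memory[key]
        value = self.disk.pop(key)        # slow path: page back in
        self.put(key, value)
        return value


store = OverflowStore(memory_slots=2)
store.put("a", 1)
store.put("b", 2)
store.put("c", 3)  # "a" is coldest, so it is paged out to the overflow tier
```

As long as the hot set fits in the memory tier, reads stay on the fast path; performance only degrades when access patterns force constant paging, as in the overloaded-PC example above.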