This transcribed talk explores a range of data platforms through a lens of basic hardware and software tradeoffs.
Streams have many benefits, from promoting reactive architecture and asynchronicity to bridging operational and analytic worlds. This post explores how.
Essay exploring whether products like MongoDB are viable threats to the incumbent database vendors
A lighthearted look at Oracle & Google using a metaphorical format. The style won’t suit everyone, but it’s a bit of fun!
- RBS-2014: Scaling Data
- BigDataCon-2013: The Return of Big Iron? (video)
- JAX-2013: The Return of Big Iron?
- QCon-2012: Where does Big Data meet Big Database? (video)
- QCon-2012: Progressive Architectures at RBS (video)
- JavaOne-2011: Balancing Replication and Partitioning in a Distributed Java Database
- QCon-2011: Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability (video)
- UCL-2011: A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for High Performance Data Access
- CoSIG-2011: Oracle Coherence Implementation Patterns (Special Interest Group)
- ICST-2011: Test-Oriented Languages: a new era?
- ICST-2011: Enabling Testing, Design and Refactoring Practices in Remote Locations
- Birkbeck-2011: Data Storage for Extreme Use Cases
- RefTest-2010: Has Mocking Gone Wrong?
- RBS-2009: Data Grids with Oracle Coherence
- Brunel-2008: The Architect's Two Hats
- Brunel-2007: Architecture and Design in Industry
- Upside Down Databases: Bridging the Operational and Analytic Worlds with Streams (2015)
- A World of Chinese Whispers (2014)
- Database Y (2013)
- The Big Data Conundrum (2012)
- Where does Big Data meet Big Database? (2012)
- A Story about George (2012)
- The Rebirth of the In-Memory Database (2011)
- Is the Traditional Database a Thing of the Past? (2009)
- Software Writing and the Intellectual Superiority Complex (2005)
- Component Software. Where is it going? (2005)
- Do Metrics Have a Place in Software Engineering Today? (2004)
Test Driven Development (all)
- Test Oriented Languages: Is it Time for a New Era? (2011)
- Beyond Stubs: Why We Need Interaction Testing (2010)
- Isolating Functional Units: Why We Need Stubs (2010)
- Are Mocks All They Are Cracked Up To Be? (2010)
Data Tech (all)
- Elements of Scale: Composing and Scaling Data Platforms (2015)
- Best of VLDB 2014 (2015)
- Log Structured Merge Trees (2015)
- A Guide to building a Central, Consolidated Data Store for a Company (2014)
- An initial look at Actian’s ‘SQL in Hadoop’ (2014)
- The Best of VLDB 2012 (Very Large Database Conference) (2012)
- Thinking in Graphs: Neo4J (2012)
- ODC – RBS’s Distributed Datastore (2012)
- Shared Nothing v.s. Shared Disk Architectures: An Independent View (2009)
Team / Process / Interviewing (all)
- Building a Career in Technology (2015)
- The Iffy Tractor (Can they code OO?) (2011)
- The Business Analyst Test (2011)
- Distributing Skills Across a Continental Divide (2011)
- Learning Practices for Distributed Teams (ICST) (2011)
- Interviewing: The Importance of Examining Applied Knowledge (2010)
- Mapping Personal Practices (2010)
- Four HPC Architecture Questions – With Answers (2009)
Elements of Scale: Composing and Scaling Data PlatformsApr 28th, 2015
This post is the transcript from a talk, of the same name, given at Progscon & JAX Finance 2015
As software engineers we are inevitably affected by the tools we surround ourselves with. Languages, frameworks, even processes all act to shape the software we build.
Likewise databases, which have trodden a very specific path, inevitably affect the way we treat mutability and share state in our applications.
Over the last decade we’ve explored what the world might look like had we taken a different path. Small open source projects try out different ideas. These grow. They are composed with others. The platforms that result utilise suites of tools, with each component often leveraging some fundamental hardware or systemic efficiency. The result, platforms that solve problems too unwieldy or too specific to work within any single tool.
So today’s data platforms range greatly in complexity. From simple caching layers or polyglotic persistence right through to wholly integrated data pipelines. There are many paths. They go to many different places. In some of these places at least, nice things are found.
So the aim for this talk is to explain how and why some of these popular approaches work. We’ll do this by first considering the building blocks from which they are composed. These are the intuitions we’ll need to pull together the bigger stuff later on.
Remember the days when people would write entire applications, embedded inside a database? It seems a bit crazy now when you think about it. Imagine writing an entire application in SQL. I worked on a beast like that, very briefly, in the late 1990s. It had a few shell scripts but everything else was SQL. Everything. Suffice to say it wasn’t much fun – you can probably imagine – but there was a slightly perverse simplicity to the whole thing.
So Martin Kleppmann did a talk recently around the idea of turning databases inside out. I like this idea. It’s a nice way to frame a problem that has lurked unresolved for years. To paraphrase somewhat… databases do very cool stuff: caching, indexes, replication, materialised views. These are very cool things. They do them well too. It’s a shame that they’re locked in a world dislocated from general consumer programs.
There are also a few things missing, like databases don’t really do events, streams, messaging, whatever you want to call it. Some newer ones do, but none cover what you might call ‘general purpose’ streams. This means the query-driven paradigm often leaks into the application space. Applications end up circling around centralised mutable state. Whilst there are valid use cases for this, the rigid and synchronous world produced can be counterproductive for many types of programs.
Best of VLDB 2014Mar 8th, 2015
Interesting paper on write ahead logs in persistent in memory media. Recent non-volatile memory (NVM) technologies, such as PCM, STT-MRAM and ReRAM, can act as both main memory and storage. This has led to research into NVM programming models, where persistent data structures remain in memory and are accessed directly through CPU loads and stores. REWIND outperforms state-of-the-art approaches for data structure recoverability as well as general purpose and NVM-aware DBMS-based recovery schemes by up to two orders of magnitude.
Asterix is an academically established hierarchical store. It’s now an Apache Incubator project. It utilises sets of LSM structures, tied transactionally together. Additional index structures can also be formed, for example R-Trees.
As the number of cores increases, the complexity of coordinating competing accesses to data will likely diminish the gains from increased core counts.We conclude that rather than pursuing incremental solutions, many-core chips may require a completely redesigned DBMS architecture that is built from ground up and is tightly coupled with the hardware.
Log Structured Merge TreesFeb 14th, 2015
It’s nearly a decade since Google released its ‘Big Table’ paper. One of the many cool aspects of that paper was the file organisation it uses. The approach is more generally known as the Log Structured Merge Tree, after this 1996 paper (although not specifically referenced as such by Google).
LSM is now used in a number of products as the main file organisation strategy. HBase, Cassandra, LevelDB, SQLite, even MongoDB 3.0 comes with an optional LSM engine, after it’s acquisition of Wired Tiger.
What makes LSM trees interesting is their departure from binary tree style file organisations that have dominated the space for decades. LSM seems almost counter intuitive when you first look at it, only making sense when you closely consider how files work in modern, memory heavy systems.
In a nutshell LSM trees are designed to provide better write throughput than traditional B+ tree or ISAM approaches. They do this by removing the need to perform random update-in-place operations.
So why is this a good idea? At its core it’s the old problem of disks being slow for random operations, but fast when accessed sequentially. A gulf exists between these two types of access, regardless of whether the disk is magnetic or solid state or even, although to a lesser extent, main memory.
The figures in this ACM report here/here make the point well. They show that, somewhat counter intuitively, sequential disk access is faster than randomly accessing main memory. More relevantly they also show sequential access to disk, be it magnetic or SSD, to be at least three orders of magnitude faster and random IO. This means random operations are to be avoided. Sequential access is well worth designing for.
So with this in mind lets consider a little thought experiment: if we are interested in write throughput, what is the best method to use? A good starting point is to simply append data to a file. This approach, often termed logging, journaling or a heap file, is fully sequential so provides very fast write performance equivalent to theoretical disk speeds (typically 200-300MB/s per drive).
Benefiting from both simplicity and performance log/journal based approaches have rightfully become popular in many big data tools. Yet they have an obvious downside. Reading arbitrary data from a log will be far more time consuming than writing to it, involving a reverse chronological scan, until the required key is found. (more…)
List of Database/BigData BenchmarksFeb 13th, 2015
I did some research at the end of last year looking at the relative performance of different types of databases: key value, Hadoop, NoSQL, relational.
I’ve started a collaborative list of the various benchmarks I came across. There are many! Checkout below and contribute if you know of any more (link).
Building a Career in TechnologyJan 2nd, 2015
I was asked to talk to some young technologists about about their career path in technology. These are my notes which wander somewhat between career and general advice.
- Don’t assume progress means a career into management – unless you really love management. If you do, great, do that. You’ll get paid well, but it will come with downsides too. Focus on what you enjoy.
- Don’t confuse management with autonomy or power, it alone will give you neither. If you work in a company, you will always have a boss. The value you provide to the company gives you autonomy. Power comes mostly from the respect others have for you. Leadership and management are not synonymous. Be valuable by doing things that you love.
- If you want to be a good programmer a foundation in Computer Science is really important. If you didn’t do undergrad CS (like me) you need to compensate. Ensure you know the basics well (data structures, algorithms, hardware architecture, networks, security etc). Learn this sooner rather than later as knowledge compounds.
- Practice communicating your ideas. Blog, convince friends, colleagues, use github, whatever. If you want to change things you need to communicate your ideas, finding ways to reach your different audiences. If you see something that seems wrong, try to change it by both communicating and by doing.
- Try to always have one side project (either in work or outside) bubbling along. Something that’s not directly part of your job. Go to a hack night, learn a new language, write a new website, whatever. Something that makes you learn in new avenues.
- Sometimes things don’t come off the way you expect. Normally there is something good in there anyway. This is ok.
- The T-shaped people idea from the Valve handbook is a good way to think about your personal development. What’s your heavy weaponry?
- It’s cliched but so very true: Don’t prematurely optimise. Know that all good engineers do. All of them. Including you. Even if you don’t think you do, you probably do. Realise this about yourself and fight it.
- If you think any particular technology is the best thing since sliced bread, and it’s somewhere near a top of the Gartner hype-curve, you are probably not seeing the full picture yet. Be critical of your own opinions and look for bias in yourself.
- In my experience the most important characteristic of a good company is that its employee’s assume, by default, that the rest of the company are smart people. If the modus operandi of a company (or worse, a team) is ‘everyone else is an idiot’ look elsewhere.
- If you’re motivated to do something, try to capitalise on that motivation there and then and enjoy the productivity that comes with it. Motivation is your most precious commodity.
- Learn to control your reaction to negative situations. The term ‘well-adjusted’ means exactly that. Start with email. Never press send if you feel angry or slighted. In tricky situations stick purely to facts and remove all subjective or emotional content. Let the tricky situation diffuse organically. Doing this face to face takes more practice as you need to notice the onset of stress and then cage your reaction, but the rules are the same (stick to facts, avoid emotional language, let it go).
- If you offend someone always apologise. Even if you are right about whatever it was it is unlikely you intention was to offend them.
- Recognise the difference between being politically right and emotionally right. As humans we’re great at creating plausible rationalisations/justifications for our actions, both to ourselves and others. Rationalisations are often a ‘sign’ of us covering an emotional mistake. Learn to look past them to your moral compass.
View full blogroll