Log Structured Merge Trees
Feb 14th, 2015

It’s nearly a decade since Google released its ‘Bigtable’ paper. One of the many cool aspects of that paper was the file organisation it used. The approach is more generally known as the Log Structured Merge Tree, after this 1996 paper (although it is not specifically referenced as such by Google).

LSM is now used in a number of products as the main file organisation strategy: HBase, Cassandra, LevelDB, SQLite, and even MongoDB 3.0, which comes with an optional LSM engine following its acquisition of WiredTiger.

What makes LSM trees interesting is their departure from the traditional file organisations used in most databases. In fact it appears almost counterintuitive when you first look at it.

Some Background

In a nutshell LSM trees are designed to provide far better write throughput than traditional B+ tree or ISAM approaches. They do this by removing the need to perform random update-in-place operations.
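To make that idea concrete, here is a toy sketch (in Python, with names of my own invention) of the basic write path: writes land in an in-memory buffer and only reach disk as sorted batches, written sequentially. Real LSM engines add write-ahead logging, compaction and much more; this shows only the no-update-in-place principle.

```python
import json
import os
import tempfile

class TinyLSM:
    """Toy LSM write path: buffer writes in memory, then flush each
    sorted batch to disk with one sequential append per segment."""

    def __init__(self, directory, flush_threshold=4):
        self.directory = directory
        self.flush_threshold = flush_threshold
        self.memtable = {}      # recent writes live purely in memory
        self.segment_count = 0

    def put(self, key, value):
        self.memtable[key] = value          # no disk seek at all
        if len(self.memtable) >= self.flush_threshold:
            self.flush()

    def flush(self):
        # The whole sorted batch goes out in one sequential write;
        # nothing already on disk is ever updated in place.
        path = os.path.join(self.directory,
                            f"segment-{self.segment_count:04d}.json")
        with open(path, "w") as f:
            json.dump(dict(sorted(self.memtable.items())), f)
        self.segment_count += 1
        self.memtable.clear()

store = TinyLSM(tempfile.mkdtemp())
for i in range(8):
    store.put(f"key-{i}", i)    # 8 writes, threshold 4 -> 2 segments
```

Note that the disk only ever sees large, sequential appends, which is exactly the access pattern the next section argues for.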

So why is this a good idea? At its core it’s the old problem of disks being slow for random operations, but fast when accessed sequentially. A large gulf exists between these two types of access, regardless of whether the disk is magnetic or solid state.

The figures in this ACM report here/here make the point well. They show that, somewhat counterintuitively, sequential disk access can be faster than randomly accessing main memory. More relevantly, they also show sequential access to disk, be it magnetic or SSD, to be at least three orders of magnitude faster than random IO. This means random operations are to be avoided; sequential access is well worth designing for.
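You can get a feel for the gap yourself with a crude experiment: read the same blocks of a file once in order and once in shuffled order. A word of caution on this sketch (all names are my own): a file this small will sit in the OS page cache after being written, so the measured gap here will be far smaller than on a cold, disk-resident dataset.

```python
import os
import random
import tempfile
import time

BLOCK = 4096
BLOCKS = 2048                      # ~8 MB test file

path = os.path.join(tempfile.mkdtemp(), "data.bin")
with open(path, "wb") as f:
    f.write(os.urandom(BLOCK * BLOCKS))

def read_blocks(order):
    """Read every block of the file in the given order."""
    total = 0
    with open(path, "rb") as f:
        for i in order:
            f.seek(i * BLOCK)      # effectively a no-op when in order
            total += len(f.read(BLOCK))
    return total

sequential = list(range(BLOCKS))
shuffled = list(range(BLOCKS))
random.shuffle(shuffled)

t0 = time.perf_counter()
seq_bytes = read_blocks(sequential)
t1 = time.perf_counter()
rand_bytes = read_blocks(shuffled)
t2 = time.perf_counter()
print(f"sequential: {t1 - t0:.4f}s  random: {t2 - t1:.4f}s")
```

Both passes read the same number of bytes; only the access pattern differs.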

So with this in mind let’s consider a little thought experiment: if we are interested in write throughput, what is the best method to use? A good starting point is simply to append data to a file. This approach, often termed logging, journalling or a heap file, is fully sequential and so provides very fast write performance, equivalent to theoretical disk speeds (typically 200-300MB/s per drive).
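The append-only approach is almost trivially simple. A minimal sketch, assuming a newline-delimited JSON record format (the format and function name are my own, purely for illustration):

```python
import json
import os
import tempfile

def append_record(log_path, key, value):
    """Append one record to the end of the log -- a purely sequential
    write that never touches data written earlier."""
    with open(log_path, "a") as log:
        log.write(json.dumps({"key": key, "value": value}) + "\n")

log_path = os.path.join(tempfile.mkdtemp(), "data.log")
append_record(log_path, "user:1", "alice")
append_record(log_path, "user:1", "alison")  # an 'update' is just a newer record
```

Note there is no update-in-place anywhere: changing a value simply appends a newer record for the same key.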

Benefiting from both simplicity and performance, log/journal-based approaches have rightfully become popular in many big data tools. Yet they have an obvious downside: reading arbitrary data from a log is far more time-consuming than writing to it, involving a reverse-chronological scan until the required key is found. (more…)
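That reverse-chronological scan can be sketched directly, assuming the same newline-delimited JSON log format as above (again, names are my own): the first match found when walking newest-to-oldest is the most recent value for the key.

```python
import json
import os
import tempfile

def read_latest(log_path, key):
    """Scan the log newest-first: the first match is the most recent
    value written for `key`. Worst case this reads the whole log."""
    with open(log_path) as log:
        lines = log.readlines()
    for line in reversed(lines):            # reverse-chronological scan
        record = json.loads(line)
        if record["key"] == key:
            return record["value"]
    return None

# Build a small log to read from (newline-delimited JSON records).
log_path = os.path.join(tempfile.mkdtemp(), "data.log")
with open(log_path, "a") as log:
    for k, v in [("a", 1), ("b", 2), ("a", 3)]:
        log.write(json.dumps({"key": k, "value": v}) + "\n")
```

Every read is O(n) in the size of the log; this is exactly the cost that LSM trees, with their sorted on-disk segments, set out to tame.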

Posted at Feb 14th |Filed Under: Big Data, Blog, Data Tech, Top4 - read on

List of Database/BigData Benchmarks
Feb 13th, 2015

I did some research at the end of last year looking at the relative performance of different types of databases: key value, Hadoop, NoSQL, relational.

I’ve started a collaborative list of the various benchmarks I came across. There are many! Check out the list below and contribute if you know of any more (link).


Posted at Feb 13th |Filed Under: Blog - read on

Building a Career in Technology
Jan 2nd, 2015

I was asked to talk to some young technologists about their career path in technology. These are my notes, which wander somewhat between career and general advice.

  1. Don’t assume progress means a move into management – unless you really love management. If you do, great, do that. You’ll get paid well, but it will come with downsides too. Focus on what you enjoy.
  2. Don’t confuse management with autonomy or power; it alone will give you neither. If you work in a company, you will always have a boss. The value you provide to the company gives you autonomy. Power comes mostly from the respect others have for you. Leadership and management are not synonymous. Be valuable by doing things that you love.
  3. If you want to be a good programmer a foundation in Computer Science is really important. If you didn’t do CS as a degree (like me) you need to compensate. Ensure you know the basics well (data structures, algorithms, hardware architecture, networks, security etc). Learn this sooner rather than later as knowledge compounds.
  4. Practice communicating your ideas. Blog, convince friends, colleagues, use github, whatever. If you want to change things you need to communicate your ideas, finding ways to reach your different audiences. If you see something that seems wrong, try to change it by both communicating and by doing.
  5. Try to always have one side project (either in work or outside) bubbling along. Something that’s not directly part of your job. Go to a hack night, learn a new language, write a new website, whatever. Something that pushes your learning into new avenues.
  6. Sometimes things don’t come off the way you expect. Normally there is something good in there anyway. This is ok.
  7. The T-shaped people idea from the Valve handbook is a good way to think about your personal development. What’s your heavy weaponry?
  8. It’s cliched but so very true: Don’t prematurely optimise. Know that all good engineers do. All of them. Including you. Even if you don’t think you do, you probably do. Realise this about yourself and fight it.
  9. If you think any particular technology is the best thing since sliced bread, and it’s somewhere near a top of the Gartner hype-curve, you are probably not seeing the full picture yet. Be critical of your own opinions and look for bias in yourself.
  10. In my experience the most important characteristic of a good company is that its employees assume, by default, that the rest of the company are smart people. If the modus operandi of a company (or worse, a team) is ‘everyone else is an idiot’, look elsewhere.
  11. If you’re motivated to do something, try to capitalise on that motivation there and then and enjoy the productivity that comes with it. Motivation is your most precious commodity.
  12. Learn to control your reaction to negative situations. The term ‘well-adjusted’ means exactly that. Start with email. Never press send if you feel angry or slighted. In tricky situations stick purely to facts and remove all subjective or emotional content. Let the tricky situation defuse organically. Doing this face to face takes more practice, as you need to notice the onset of stress and then cage your reaction, but the rules are the same (stick to facts, avoid emotional language, let it go).
  13. If you offend someone, always apologise. Even if you are right about whatever it was, it is unlikely your intention was to offend them.
  14. Recognise the difference between being politically right and emotionally right. As humans we’re great at creating plausible rationalisations/justifications for our actions, both to ourselves and others. Rationalisations are often a sign that we are covering an emotional mistake. Learn to look past them to your moral compass.
Posted at Jan 2nd |Filed Under: Blog, Team Development - read on

A Guide to building a Central, Consolidated Data Store for a Company
Dec 2nd, 2014

Quite a few companies are looking at some form of centralised operational store, data warehouse, or analytics platform. The company I work for set out to build a centralised scale-out operational store using NoSQL technologies five or six years ago, and it’s been an interesting journey. The technology landscape has changed a lot in that time, particularly around analytics, although that was not our main focus (but it will be an area of growth). Having an operational store that is used by many teams is, in many ways, harder than building an analytics platform, as there is a greater need for real-time consistency. The below is essentially a brain dump on the subject.

On Inter-System (Enterprise) Architecture

  1. Having a single schema for the data in your company has little value unless you use it to write down the data it describes as a permanent record. Standardising the wire format alone can create more problems than it solves. Avoid Enterprise Messaging Schemas for the problem of bulk state transfer between data stores (see here). Do use messages/messaging for notifying the need to act.
  2. Prefer direct access at source, using data virtualisation. Where that doesn’t work (e.g. high-cardinality joins), collocate data using replication technologies (relational or NoSQL) to materialise read-only clones of the source. Avoid enterprise messaging.
  3. Federated approaches, which leave data sets in place, will get you there faster if you can get all the different parts of the company to conform. That itself is a big ask, though a good technical strategy can help. Expect to spend a lot on operations and automation lining disparate datasets up with one another.
  4. When standardising the persisted representation don’t create a single schema upfront if you can help it. You’ll end up in EDW paralysis. Evolve to it over time.
  5. Start with disparate data models and converge them incrementally over time using schema-on-need (and yes you can do this relationally, it’s just a bit harder).
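The schema-on-need idea in the last point can be sketched relationally. In this illustration (table and column names are my own, purely hypothetical), records start life as opaque JSON documents, and an attribute is promoted to a typed, indexed column only once a consumer actually needs to query it:

```python
import json
import sqlite3

# Phase 1: persist whole records as JSON documents -- no upfront schema.
db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE trades (id INTEGER PRIMARY KEY, doc TEXT)")
docs = [{"ccy": "USD", "notional": 1000000},
        {"ccy": "EUR", "notional": 500000}]
db.executemany("INSERT INTO trades (doc) VALUES (?)",
               [(json.dumps(d),) for d in docs])

# Phase 2 (later, on need): a consumer wants to filter by currency,
# so promote just that attribute to a typed, indexed column.
db.execute("ALTER TABLE trades ADD COLUMN ccy TEXT")
for rowid, doc in db.execute("SELECT id, doc FROM trades").fetchall():
    db.execute("UPDATE trades SET ccy = ? WHERE id = ?",
               (json.loads(doc)["ccy"], rowid))
db.execute("CREATE INDEX idx_trades_ccy ON trades (ccy)")

usd = db.execute("SELECT COUNT(*) FROM trades WHERE ccy = 'USD'").fetchone()[0]
```

The point is that the converged schema accretes one attribute at a time, driven by real consumers, rather than being designed upfront.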


Posted at Dec 2nd |Filed Under: Analysis and Opinion, Blog - read on

Useful talk on Linux Performance Tools
Aug 24th, 2014

Posted at Aug 24th |Filed Under: Blog - read on

An initial look at Actian’s ‘SQL in Hadoop’
Aug 4th, 2014

I had an exploratory chat with Actian today about their new product ‘SQL in Hadoop’.

In short it’s a distributed database which runs on HDFS. The company are bundling their DataFlow technology alongside this. DataFlow is a GUI-driven integration and analytics tool (think suite of connectors, some distributed functions and a GUI to sew it all together).

Neglecting the DataFlow piece for now, SQL in Hadoop has some obvious strengths. The engine is essentially Vectorwise (recently renamed ‘Vector’): a clever single-node columnar database which evolved from MonetDB and touts vectorisation as a key part of its secret sauce. Along with the usual columnar benefits come positional delta trees, which improve on the poor update performance seen in most columnar databases, and some clever cooperative scan technology which was presented at VLDB a couple of years back (though they don’t seem to tout this one directly). Most notably, Vector has always had quite impressive benchmarks, both in absolute and price-performance terms. I’ve always thought of it as the relational analytics tool I’d look to if I were picking up the tab. (more…)
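Vector’s real engine does vectorisation in C++ over CPU-register-friendly chunks of columns; purely as a loose illustration of the batch-at-a-time idea (all names here are my own), compare tuple-at-a-time and block-at-a-time scans:

```python
# Tuple-at-a-time: the scan handles one value per step, paying
# interpretation overhead on every single value.
def scan_rowwise(column, threshold):
    out = []
    for v in column:
        if v > threshold:
            out.append(v)
    return out

# Vectorised (batch-at-a-time): each step processes a whole block of
# values, amortising per-call overhead across the block -- a (very)
# loose analogue of what a vectorised engine does per column chunk.
def scan_vectorised(column, threshold, batch_size=1024):
    out = []
    for start in range(0, len(column), batch_size):
        block = column[start:start + batch_size]
        out.extend(v for v in block if v > threshold)
    return out

column = list(range(5000))
fast = scan_vectorised(column, 4000)
slow = scan_rowwise(column, 4000)
```

In an interpreted language the two look similar; the payoff appears when the per-block work compiles down to tight loops over contiguous column data.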

Posted at Aug 4th |Filed Under: Blog, Data Tech - read on
