August, 2012

Where does Big Data meet Big Database

Friday, August 17th, 2012


InfoQ published the video for my Where does Big Data meet Big Database talk at QCon this year.

Thoughts appreciated.


Thinking in Graphs: Neo4J

Friday, August 17th, 2012

I enjoyed listening to Ian Robinson speak about Neo4J (slides here: Thinking in Graphs). The odd one out of the NoSQL pack, Neo4J is a fascinating alternative to your regular key-value store. For me it's about a different way of thinking about data, simply because the relationships between nodes are as much a part of the data model as the nodes themselves. I am left wondering somewhat how one might apply this solution to the enterprise space, particularly finance. Multi-step Monte Carlo springs to mind as it creates a large connected space, but there is no real need to traverse that space retrospectively. There may be application in other simulation techniques though. The below is a paraphrased version of Ian's words.

Today’s problems can be classified as a function of not only size but also connectedness and structure.

F(size, connectedness, structure)

The Relational model struggles to deal with each of these three factors. The use of sparsely populated tables in our databases and null checks in client-side code attest to the unsuitability of this model.

NoSQL offers a solution. The majority of this fledgling field relies on the concept of a Map (Dictionary) in some way. First came simple key-value stores like Dynamo. Next came column-oriented stores like Cassandra and BigTable. Finally, document databases provide a more complex document model (for example JSON), with facilities for simple introspection.

Neo4J is quite different from its NoSQL siblings: a graph database that allows users to model data as a set of nodes and relationships. Once modelled, the data can be examined based on its connectedness (i.e. how one node relates to others) rather than simply on its attributes.

Neo4J uses a specific type of graph model termed a Property Graph: each node has associated attributes that describe its specificities. These need not be homogeneous (as they would be in a relational or object schema). Further, the relationships between nodes are both named and directed. As such they can be used in search criteria to find relationships between nodes.
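To make this concrete, below is a minimal sketch of building a small property graph with Neo4J's embedded Java API of the era (roughly the 1.8 line); the node properties and the WORKS_WITH relationship type are invented for the example.

import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class PropertyGraphSketch {

    // Relationship types are named, and every relationship is directed.
    enum RelTypes implements RelationshipType { WORKS_WITH }

    public static void main(String[] args) {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase("target/example-db");
        Transaction tx = db.beginTx();   // Neo4J is fully ACID
        try {
            Node ian = db.createNode();
            ian.setProperty("name", "Ian");
            ian.setProperty("role", "speaker");          // properties need not be uniform...

            Node ben = db.createNode();
            ben.setProperty("name", "Ben");
            ben.setProperty("twitter", "@benstopford");  // ...across nodes

            // A named, directed relationship; it can carry properties of its own.
            Relationship r = ian.createRelationshipTo(ben, RelTypes.WORKS_WITH);
            r.setProperty("since", 2012);

            tx.success();
        } finally {
            tx.finish();
        }
        db.shutdown();
    }
}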

The Property Graph model represents a pragmatic trade-off between the purity of a traditional graph database and what you might see in a document database. This can be contrasted with the other graph database models: in 'Triple Stores' every attribute is broken out as a separate node (this is a bit like third normal form for a graph database). Another alternative is the Hypergraph, where an edge can connect more than two nodes (see Ian's slides to get a better understanding of this). Triple stores suffer from their fine-grained nature (I'm thinking binary vs red-black trees). Hypergraphs can be hard to apply to real-world modelling because the multiplicity of relationships can make them hard to comprehend. The Property Graph model avoids the verbosity of triple stores and the conceptual complexity of hypergraphs. As such the model works well for complex, densely connected domains and 'messy' data.

The fundamental attribute of the graph database is that relationships are first-class elements. That is to say, querying relationships in a graph database is as natural as querying the data the nodes contain.

Neo4J, like many NoSQL databases, is schemaless. You simply create nodes and relate them to one another to form a graph. Graphs need not be connected and many sub-graphs can be supported.

A query is simply 'parachuted' into a point in the graph, from where it explores the local area looking for some search pattern. So for example you might search for the pattern A–>B–>C. The query itself can be executed either via a 'traversal' or using the Cypher graph language. The traversal method simply visits the graph based on some criteria. For example, it might only traverse arcs of a particular type. Cypher is a more formal graph language that allows the identification of patterns within the graph.
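As a rough sketch of the traversal style, restricting the walk to outgoing arcs of one particular type might look like this (using the org.neo4j.kernel.Traversal helper from the 1.x API; the relationship type is whatever your model defines):

import org.neo4j.graphdb.Direction;
import org.neo4j.graphdb.Node;
import org.neo4j.graphdb.RelationshipType;
import org.neo4j.graphdb.traversal.TraversalDescription;
import org.neo4j.kernel.Traversal;

public class TraversalSketch {

    // Visit only the nodes reachable along outgoing arcs of one particular type.
    static void printReachable(Node start, RelationshipType type) {
        TraversalDescription onlyThatType = Traversal.description()
                .relationships(type, Direction.OUTGOING);

        for (Node found : onlyThatType.traverse(start).nodes()) {
            System.out.println(found);
        }
    }
}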

Imagine a simple graph of two anonymous nodes with an arc between them:

O–>O

In Cypher this would be represented as:

A-[:connected_to]->B

Considering a more complex graph:

A–>B–>C, A–>C or A–>B–>C–>A

We can start to build up pattern-matching logic over these graphs, for example A-[*]->B to represent that A is somehow connected to B (think regex for graphs). This allows the graph to be mined for patterns based on any combination of the properties, arc directions or names (types).
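As a sketch, a 'somehow connected' query like A-[*]->B could be run from Java roughly like this, assuming the Cypher syntax and javacompat API of the 1.8 era (the start-node id is just a placeholder):

import java.util.Collections;
import java.util.Iterator;

import org.neo4j.cypher.javacompat.ExecutionEngine;
import org.neo4j.cypher.javacompat.ExecutionResult;
import org.neo4j.graphdb.GraphDatabaseService;
import org.neo4j.graphdb.Node;

public class CypherPatternSketch {

    // Find every node that 'a' is somehow connected to, regardless of how
    // many hops, or which relationship types, lie in between.
    static void printReachableFrom(GraphDatabaseService db, long startNodeId) {
        ExecutionEngine engine = new ExecutionEngine(db);
        ExecutionResult result = engine.execute(
                "START a = node({id}) MATCH a-[*]->b RETURN b",
                Collections.<String, Object>singletonMap("id", startNodeId));

        Iterator<Node> reachable = result.columnAs("b");
        while (reachable.hasNext()) {
            System.out.println(reachable.next());
        }
    }
}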

There are further Cypher examples here including links to an online console where you can interactively experiment with the queries. Almost all of the query examples and diagrams are generated from the unit tests used to develop Cypher. This means that the manual is always an accurate reflection of the current feature set.

Physical Characteristics:

The product itself is JVM-based (the query language is written in Scala) and exposes a RESTful HTTP API. It is fully transactional (ACID), and it is possible to override the transaction manager should you need to coordinate with an external one (for example because you want to coordinate with an external store). An object cache is used to store entities in memory, with fall-through to memory-mapped files if the dataset does not fit in RAM.

HA support uses a master-slave, replicated model (single master model). You can write to a slave (i.e. any node) and it will obtain a lock from the master. Lucene is the default index provider.

The team have several strategies for mitigating the impact of GC pauses, the most important being a GC-resistant caching strategy. This assigns a fixed amount of space in the JVM heap to the cache and purges objects whenever that maximum size is about to be reached, instead of relying on GC to make the decision. Because the cache is given a maximum heap-space budget, its competition with other objects in the heap, and the resulting GC pauses, can be better controlled. Caching is described in more detail here.
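The talk didn't go into Neo4J's implementation, but the principle (give the cache an explicit budget and purge eagerly rather than letting GC decide) can be illustrated with a much-simplified sketch, bounded here by entry count rather than heap bytes:

import java.util.LinkedHashMap;
import java.util.Map;

// A much-simplified illustration of the principle, not Neo4J's code: the cache
// owns an explicit budget and purges entries itself as the limit is reached,
// rather than waiting for the garbage collector to reclaim them.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {

    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true);          // access order, so eviction is LRU-ish
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries;      // purge as soon as the budget is exceeded
    }
}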

Ian mentioned a few applications too:

  • Telcos: managing the network graph. If something goes wrong they use the graph database to help predict where the problem likely comes from by simulating the network topology.
  • Logistics: parcel routing. This is a hierarchical problem. Neo4J helps by allowing them to model the various routes to get a parcel from its start location to its end location. Routes change (and become unavailable).
  • Finally the social graph which is fairly self explanatory!

All round an eye-opening approach to the modelling and inspection of connected data sets!


A Brief Summary of the NoSQL World

Saturday, August 11th, 2012

James Phillips (co-founder of Couchbase) did a nice talk on NoSQL Databases at QCon:

Memcached – the simplest and original. Pure key value store. Memory focussed

Redis – Extends the simple map-like semantic with extensions that allow the manipulation of certain specific data structures, stored as values. So there are operations for manipulating values as lists, queues etc. Redis is primarily memory focussed.

Membase – extends the Memcached approach to include persistence, the ability to add nodes, and backups on other nodes.

Couchbase – a cross between Membase and CouchDB: Membase on the front, CouchDB on the back. The addition of CouchDB means you can store and reflect on more complex documents (in JSON). To query Couchbase you need to write JavaScript mapping functions that effectively materialise the schema (think index) so that you can create a query model. Couchbase is CA not AP (i.e. not eventually consistent).

MongoDB – Uses BSON (a binary version of JSON which is open source but only really used by Mongo). Mongo differs from Couchbase in that the query language is dynamic: Mongo doesn't require the declaration of indexes. This makes it better at ad hoc analysis but slightly weaker from a production perspective.

Cassandra – Column-oriented key-value store. Values are split into columns which are pre-indexed before the information can be retrieved. Eventually consistent (unlike Couchbase). This makes it better for highly distributed use cases or ones where the data is spread over an unreliable network.

Neo4J – Graph oriented database. Much more niche. Not distributed.

There are obviously a few more that could have been covered (Voldemort, Dynamo etc.) but a good summary from James nonetheless.

Full slides/video can be found here.


ODC – A Distributed Datastore built at RBS

Thursday, August 9th, 2012

[Edit – 2014 – a more up to date picture here]

This article describes a little about ODC – primarily because we are hiring and we’d like candidates to know a little more about what we do here before they rock up – but it may also be of interest to those attempting to consolidate large amounts of data into a single, real-time, enterprise-wide store.

The Big Idea

ODC Core is the data store that sits at the centre of the ODC project. It was designed to be the one datastore the bank needs; the single port of call for all our trades and valuations with the vision of one day blending processing and data in a collocated manner. In fairness it is not quite that yet, as such a mythical beast is hard to come by, but it has made significant inroads.

So why is one big datastore useful you may ask? In short we, like many organisations, have a lot of problems with data. Most of these problems have nothing to do with technology. They are about different people’s interpretation of their part of our domain. Hundreds of systems across the bank each implement these different interpretations. Data is forwarded from system to system and the problem compounds. Enterprise messaging can only do so much to solve this problem because it is inherently point-in-time (so the interpretation of the message is still left to each application and their own method of persistence). Joining up all the dots to get a global view of the bank’s activity can be a confusing, manual and painful process. So the concept is simple: one golden copy that holds the truth. Get it right in one place and then migrate applications to that one single model and the one single data source. Simple idea. Somewhat harder to make a reality.

What is ODC Now

ODC has been live for coming up to two years with development starting back in Jan 2010. The datastore is written inside Oracle Coherence, which provides a data-fabric in which we have built a distributed, normalised database. ODC Core (which is the data store itself) has some interesting qualities that differentiate it from your average database (or Coherence cluster). The three I cover in more detail below are messaging as a system of record, a dynamic data replication model to support efficient distributed joins and our dynamic object and sql interfaces. There are some other quite neat features that I won’t go into here such as a distributed clock implementation that allows reliable and efficient snapshots of the datastore, the use of compression on large result sets (our own interpretation of dictionary encoding) and a sample-based query optimiser.

Messaging as a System of Record: Unlike most databases, ODC Core provides both query and subscription semantics. This actually falls out quite naturally as messaging sits at the very core of the product. In fact messaging is our system of record, so when data is written to the store that data is only 'accepted' once it has been written synchronously to the event stream. Having an event stream as your system of record proves to be a powerful concept.

From a non-functional perspective this allows persistence to scale out linearly in a 'load balanced' manner (we use topics rather than queues so there is global ordering and hence no need to share state across different servers in the messaging layer). Providing write scalability is only one advantage though. Having everything persisted through a single event stream means you can hook anything you like into it. If you are interested in a certain type of event you can just subscribe with a message selector. If you want to create a copy of the store in a relational database you can just hook into the same stream. If you want a disaster recovery instance … you get the picture I'm sure.
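ODC's internals aren't spelt out here, but the shape of the idea can be sketched with plain JMS: a write completes only once the event has been published synchronously to a topic, and anything interested (a relational mirror, a DR instance, a subscriber with a selector) hooks into the same stream. The 'eventType' property and the 'TRADE' value are invented for the example.

import javax.jms.DeliveryMode;
import javax.jms.JMSException;
import javax.jms.MessageConsumer;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;

public class EventStreamSketch {

    // A write is 'accepted' only once it has been published synchronously to
    // the event stream; the topic gives a single, globally ordered log.
    static void write(Session session, Topic eventStream, String payload) throws JMSException {
        MessageProducer producer = session.createProducer(eventStream);
        producer.setDeliveryMode(DeliveryMode.PERSISTENT);
        TextMessage event = session.createTextMessage(payload);
        event.setStringProperty("eventType", "TRADE");   // hypothetical property
        producer.send(event);        // blocks until the broker has taken the message
    }

    // Anything can hook into the same stream; a message selector narrows the feed.
    static MessageConsumer subscribeToTrades(Session session, Topic eventStream) throws JMSException {
        return session.createConsumer(eventStream, "eventType = 'TRADE'");
    }
}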

ODC Core efficiently joins normalised data: All distributed stores that support a degree of normalisation struggle if they need to join data elements that are not collocated with one another. They are forced to ship potentially large amounts of data across the network to compute the join. Sharding helps a little, but you can only shard by a single key so there will always be elements that don't end up collocated (because they have 'crosscutting' keys). We use a relatively novel approach to solving this problem: in short, we replicate data that does not shard. However simply replicating data would cause the cluster to run out of memory as there would be too much replicated data on each node.

To get around this problem, when data is written to the store the system walks the object model, ensuring that all items the data 'connects to' are replicated. So we start out by replicating nothing; as data is written to the cluster we walk the domain model to make sure the 'dimensions' that data connects to are replicated. Most importantly, at any point in time data that is not 'connected' will not be replicated. This reduces the amount of replicated data by an order of magnitude, so replication can be used for efficient joins with 'dimensions'. If you're interested in this pattern you can find out more about it here and here.
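The pattern is described properly in the posts linked above; very loosely, and with invented types rather than ODC's real ones, the write-time walk looks something like this:

import java.util.HashSet;
import java.util.List;
import java.util.Set;

// A loose sketch of the idea with invented types, not ODC's code: when a fact
// arrives, walk the dimensions it connects to (and anything they connect to in
// turn) and mark them for replication. Unconnected dimensions are never
// replicated, which keeps the replicated set small.
public class ConnectedReplicationSketch {

    interface Dimension {
        List<Dimension> connectedDimensions();
    }

    interface Fact {
        List<Dimension> dimensions();
    }

    private final Set<Dimension> replicated = new HashSet<Dimension>();

    public void onWrite(Fact fact) {
        for (Dimension dimension : fact.dimensions()) {
            markForReplication(dimension);
        }
    }

    private void markForReplication(Dimension dimension) {
        if (replicated.add(dimension)) {                       // not seen before
            for (Dimension next : dimension.connectedDimensions()) {
                markForReplication(next);                      // walk the connected graph
            }
        }
    }
}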

ODC supports Object and Relational models through a single interface: ODC is primarily an object database. This is important because it represents a 2D domain model (a representation of the bank's Logical Domain Model – something we hold very dear). We have a simple object-based query language which allows a user to query (filter, group etc.) by any element of any object in the store (the API is derived reflectively from the domain model). The language is SQL-like but has all the benefits of IntelliSense in your IDE. That is to say you can filter, group, select etc. on any getter, collection etc. that any of our objects expose. You can define which joins you would like to make to bring more data back, add predicate logic and so on.

In addition we support a basic JDBC driver, which means users can get at our data in rows and columns if they wish. We'd prefer that they didn't, as rows and columns just don't really work for a 2D domain model, but we also understand that lots and lots of tools want to interact with their data in SQL. The SQL adapter actually works in exactly the same way as the object-based interface; that is to say, the information that is sent to the store is the same. We just have to do a little more work to present the data in a tabular form.
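From a client's point of view that just means standard JDBC; the driver class, URL and table/column names below are hypothetical, the point being only that a SQL tool sees the store as it would any other JDBC source.

import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class JdbcAccessSketch {

    public static void main(String[] args) throws Exception {
        // Driver class, URL, table and column names are all hypothetical.
        Class.forName("com.example.odc.jdbc.Driver");
        Connection connection = DriverManager.getConnection("jdbc:odc://odc-core:1234");
        try {
            Statement statement = connection.createStatement();
            ResultSet rows = statement.executeQuery(
                    "SELECT trade_id, book FROM trades WHERE book = 'EMEA'");
            while (rows.next()) {
                System.out.println(rows.getString("trade_id") + " " + rows.getString("book"));
            }
        } finally {
            connection.close();
        }
    }
}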

ODC is continuously delivered: We've put a lot of work into continuously delivering our application suite, or at least something as close to continuous delivery as we can manage. The challenge is that ODC is quite big. Each environment runs around 450 processes with 50 different process definitions, and the database is around 2TB, which means it takes a long time to migrate (see the Future section below).

So why bother with continuous delivery? It's really about how long it takes to get feedback on a problem. With this system in place we get feedback on changes with a real data set, in something that looks and runs identically to production, every day. The effort has gone into a series of increasingly comprehensive tests. 20k unit and functional tests run before you check in (taking just under 20 mins). The MiniMe build migrates the database should any database changes be checked in. It does this on a cut-down dataset, which means it can do pretty much any migration in twenty to thirty minutes. If that passes, a full migration ensures that the code works with a fully populated data set. Finally, if all that passes, we rip down the almost-prod environment, release to it and start everything up again. If anything goes wrong we roll back using a database flashback. All in all a lot of pain, but that's the world of databases in the terabytes. The luxury of seeing a new bit of work in a production-identical environment within a day is worth it though. The continuous delivery system is written in Gradle by Greg Gigon.

The Future

The future for ODC's data store revolves around its ability to adapt to a changing world. Databases aren't so good at that. When you have a database you need to understand your data before you store it. Part of moving fast is accepting that you can't understand all your data at the get-go, however much you may wish to. Understanding data just takes time (and you get it wrong). The plan is to avoid these problems by using late binding to wrap a schema onto original, unaltered facts at runtime. This concept of the late-bound schema allows us to change our mind and map data late in the delivery cycle, because the unaltered facts always sit at its core. Doing this in a traditional schema-oriented store (like a database) isn't possible, since you would have to back-populate any new additions. The schema is more like a view in a database, except that the view is over the data file as it was provided to the database, rather than some mapped version of it. Some big data technologies offer properties like this, but none we've come across offer it in the context of a statically typed language that can version data, provide consistent views and join entities that have disparate lifecycles. We see this step as an important move towards becoming the one store that a large number of systems can rely on.

The higher level vision (which is the vision of our CIO) is a data oriented architecture in which services are deployed and run in a cloud like environment that is ‘preloaded’ with all the bank’s primary data. That is to say that services running in this environment utilise only centralised persistence for the bank’s core facts.

The Team

The team are split between London and India. There is a strong influence from the software industry and that goes for the work as well as the ethos. We don’t always agree (lots of strong characters) but we always get along. If you are interested some of us mapped out what we value most here a while back.

We practice something that is a little bit like agile. We work iteratively. We write lots of tests. We keep the build time down. But we’re aging slightly which means we don’t pair as often as we used to (but we do still pair). Iterations overhang a little too often but hopefully you can forgive us for that.

So if you're looking for work because you want to pay the bills, there are better teams out there. If you've chosen a life in software because it's something that you find yourself musing about in idle moments and excited about when you wake in the morning, then it could be for you.

If you’d like to find out more just email me or get me on twitter @benstopford.


Looking at Intel Xeon Phi (Knights Corner)

Thursday, August 9th, 2012

Characteristics:

  • Intel's new MIC 'Knights Corner' coprocessor (in the Intel Xeon Phi line) is targeted at the high-concurrency market, previously dominated by GPGPUs, but without the need for code to be rewritten in CUDA etc. (note Knights Ferry is the older prototype version).
  • The chip has 64 cores and 8GB of RAM with a 512-bit vector engine. Clock speed is ~1.1GHz and there is a 512k L1 cache. The Linux kernel runs on two 2.2GHz processors.
  • It comes on a card that drops into a PCI slot so machines can install multiple units.
  • It uses a MESI protocol for cache coherence.
  • There is a slimmed down linux OS that can run on the processor.
  • Code must be compiled to two binaries, one for the main processor and one for Knights Corner.
  • Compilers are currently available only for C++ and Fortran. Only Intel compilers at present.
  • It’s on the cusp of being released (Q4 this year) for NDA partners (though we – GBM – have access to one off-site at Maidenhead). Due to be announced at the Supercomputing conference in November(?).
  • KC is 4-6 GFLOPS/W – which works out at 0.85-1.8 TFLOPS for double precision.
  • It is expected to be GA Q1 ’13.
  • It's a large 'device': the wafer is a 70mm-square form factor!
  • Access to a separate board over PCI is a temporary step. Expected that future versions will be a tightly-coupled co-processor. This will also be on the back of the move to the 14nm process.
  • A single host can (depending on OEM design) support several PCI cards.
  • Similarly, power draw and heat dispersal are an OEM decision.
  • Reduced instruction set e.g. no VM support instructions or context-switch optimisations.
  • Performance now being expressed as GFlops per Watt. This is a result of US Government (efficiency) requirements.
  • A single machine can go faster than a room-filling supercomputer of '97 – ASCI Red!
  • The main constraint to doing even more has been the limited volume production pipeline.
  • Pricing not announced, but expected to be ‘consistent with’ GPGPUs.
  • Key goal is to make programming it 'easy', or rather a lot easier than platform-dedicated approaches or abstraction mechanisms such as OpenCL.
  • Once booted (probably by a push of an OS image from the main host’s store to the device) it can appear as a distinct host over the network.

Commentary:

  • The key point is that Knights Corner provides most of the advantages of a GPGPU but without the painful and costly exercise of migrating software from one language to another (that is to say it is based on the familiar x86 programming model).
  • Offloading work to the card is instructed through the offload pragma or offloading keywords via shared virtual memory.
  • Computation occurs in a heterogeneous environment that spans both the main CPU and the MIC card which is how execution can be performed with minimal code changes.
  • There is a reduced instruction set for Knights Corner but the majority of the x86 instructions are there.
  • There is support for OpenCL, although Intel are not recommending that route to customers due to performance constraints.
  • Real-world testing has shown a provisional 4x improvement in throughput using an early version of the card running some real programs. However, results from a sample test show perfect scaling. Some restructuring of the code was necessary. Not huge but not insignificant.
  • There are currently only C++ and Fortran interfaces (so not much use if you're running Java or C#).
  • You need to remember that you are on PCI Express so you don’t have the memory bandwidth you might want.

References:

Good introduction to the history and development of Knights Corner
A second recent article on Knights Ferry/Corner
Intel slides discussing KC and finishing with a Black-Scholes example

Other things worth thinking about:

http://www.altera.com/

Thanks to Mark Atwell for his help with this post.

