Whilst Big Data contains the promise of fame and fortune, the signal-to-noise ratio is low. How do you make sense of the marketing blurb?
Once upon a time, in a land of metaphor, there lived a database called George…
Address-spaces are becoming both vast and durable. What effect will this have on the way we store our data?
Have you ever wished that code was easier to test? The talk, and associated paper, explores how things might be different if we started from scratch.
My experiences of late have largely focused on distributed data storage, in particular Oracle Coherence, and this has inevitably shaped my understanding of systems, so keep that in mind as you read on.
The articles above are those that are recent or popular. There are a variety of other articles, categorised to the right and more traditional blog entries below.
- QCon-2012: Where does Big Data meet Big Database?
- QCon-2012: Progressive Architectures at RBS
- JavaOne-2011: Balancing Replication and Partitioning in a Distributed Java Database
- QCon-2011: Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability
- OpenWorld-2011: Adopting Oracle Coherence as an Enterprise Standard
- UCL-2011: A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for High Performance Data Access
- CoSIG-2011: Oracle Coherence Implementation Patterns (Special Interest Group)
- ICST-2011: Test-Oriented Languages: a new era?
- ICST-2011: Enabling Testing, Design and Refactoring Practices in Remote Locations
- Birkbeck-2011: Data Storage for Extreme Use Cases
- RefTest-2010: Has Mocking Gone Wrong?
- RBS-2009: Data Grids with Oracle Coherence
- Brunel-2008: The Architect's Two Hats
- Brunel-2007: Architecture and Design in Industry
- Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability (QCon)
- Coherence Part I: An Introduction
- Coherence Part II: Delving a Little Deeper
- Coherence Part III: The Coherence Toolbox
- Coherence Part IV: Merging Data And Processing
- Coherence: The Fallacy of Linear Scalability
- How Fault Tolerant Is Coherence Really?
- Merging Data And Processing: Why it doesn’t “just work”
- A Reliable version of putAll()
- A Singleton Service
- Coherence Implementation Patterns – Slides from Coherence SIG
- Joins: using Key-Association (Simplest)
- Joins: using Snowflake Schemas & CQCs (Intermediate)
- Joins: with Connected-Replication (Advanced)
- Managing Versioning
- The Collections Cache
- The Big Data Conundrum (2012)
- Thoughts on Big Data Technologies (4): Our Love-Hate relationship with the Relational Database (2012)
- Thoughts on Big Data Technologies (3): Objections Worth Thinking About (2012)
- Thoughts on Big Data Technologies (2): How big is Big? (2012)
- Thoughts on Big Data Technologies (1) (2012)
- A Story about George (2012)
- Data Storage for Extreme Use Cases: An Introduction (and a Peek at ODC) (2011)
- The Rebirth of the In-Memory Database (2011)
- Is the Traditional Database a Thing of the Past? (2009)
- Shared Nothing v.s. Shared Disk Architectures: An Independent View (2009)
- Component Software. Where is it going? (2005)
- Do Metrics Have a Place in Software Engineering Today? (2004)
- Are Mocks All They Are Cracked Up To Be?
- Beyond Stubs: Why We Need Interaction Testing
- Isolating Functional Units: Why We Need Stubs
- Test Oriented Languages: Is it Time for a New Era?
- Distributing Skills Across a Continental Divide
- Four HPC Architecture Questions – With Answers
- Interviewing: The Importance of Examining Applied Knowledge
- Learning Practices for Distributed Teams (ICST)
- Mapping Personal Practices
- The Business Analyst Test
The slides from yesterday’s guest lecture on NoSQL, NewSQL and Big Data can be found here.
Slides from today’s European Trading Architecture Summit 2012 are here.
I attended an interesting talk at JAX earlier this year by a guy called Ian Plosker, from Basho (the company behind Riak), somewhat amusingly entitled ‘TheBigDataCon’ (worth a look by the way – the slides are good). Ian makes a little fun of all the current hype, joking that vendors seem to be the only people actually monetising Big Data. I think we can’t help but be a little cynical of anything that has this much hype.
On another level the term has become overloaded. It has many definitions: Oracle, for example, talk about Big Data in a very different way to, say, MapR. It seems to boil down broadly to two angles though:
- The promise of greater insight using the huge amounts of data we produce
- A change in the technologies we use to crunch our way through the data we have (or expect to have)
Like any other commodity, the harder it is to extract, the more it costs. The aspirational, needle-in-a-haystack concept that drives much of the marketing paraphernalia is certainly real and should not be ignored. However the hype around the ‘hidden insight’ thing masks a more fundamental, and more grounded, point: the technology shift that facilitates all this.
There is a view that today’s data is ‘big’ and that having big data means some form of MapReduce. Yet it is not size that really matters. Both the relational and NoSQL camps can deal with the data volumes (and even, for the most part, the three Vs in one way or another). eBay, for example, runs a 20PB+ database. Yahoo and Google both have larger MR clusters, but not that much larger. For most problems data volume alone is not enough to make a sensible technology choice (and I’d contest that any of the Vs is really enough either). As the academic world likes to keep reminding us (here and here), performance is not the reason to pick up a big data technology. The reason is that these new technologies embrace a very different approach to data analysis, particularly in the context of the whole ‘lifecycle’ of our data analysis work. Big Data technologies decouple us from some of the shackles that make big data problems hard. There is no free lunch, however, and they come with some shackles of their own.
A core difference is the ability to define a schema at runtime, rather than upfront. That alone is a powerful, game-changing idea. Dave Campbell put it well in his VLDB keynote when he said the ‘ability to model data is much more of a gating factor than raw size, particularly when considering new forms of data’. Modelling data, getting it into a form we can interpret and understand, can be a long-winded and painful task, and something we must do before we can do anything useful with it.
Our ability to model data is much more of a gating factor than raw size
Traditional databases push us to model our data before we store it. Big data solutions often leave their data in its natural form. A ‘virtual schema’ is bound at runtime. This concept of binding the schema ‘late’ is powerful. It allows the interpretation of the data to be changed at any time without having to change the physical format of the data on disk. Something that becomes increasingly important as the size of the dataset increases. The downside of not imposing a schema from the point of ingestion is that keeping old forms of data ‘current’ becomes an increasingly difficult task. That’s to say that the client is left with the problem of handling many data representations. Fine if the model is free text, tougher if the model has any real structure (explicit or implicit).
The concept of binding the schema ‘late’, with the data held in its natural form, is powerful.
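To make that concrete, here is a toy Java sketch of what late binding can look like: the raw fact is stored untouched, and a ‘virtual schema’ – just a set of named extraction functions – interprets it only at read time. The record format and field names here are invented for the example, not taken from any particular product.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

class LateBoundReader {
    // The raw fact, exactly as ingested; it is never rewritten on disk.
    static final String RAW = "2012-11-02|ACME|1500.0";

    // A 'schema' is just a set of named extraction functions. Swapping in a
    // new version changes the interpretation without touching stored data.
    static Map<String, Function<String[], Object>> schemaV2() {
        Map<String, Function<String[], Object>> schema = new HashMap<>();
        schema.put("tradeDate", f -> f[0]);
        schema.put("counterparty", f -> f[1]);
        schema.put("notional", f -> Double.parseDouble(f[2])); // reinterpreted late
        return schema;
    }

    public static void main(String[] args) {
        String[] fields = RAW.split("\\|");
        schemaV2().forEach((name, extract) ->
                System.out.println(name + " = " + extract.apply(fields)));
    }
}
```

The flip side, as noted above, is that every reader now carries the burden of coping with old representations.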
Big Data technologies offer very different performance profiles to relational analytics tools. The lack of indexing and overarching structure means inserts are fast, making them suitable for high-velocity systems and batch processing. The imperative interface and the absence of a schema make diverse, ad hoc analytics hard though. Instead they work best for specific, well-defined data operations (I often use the data-enabled grid analogy). It is likely this that has driven the Big Data leaders, Google and more recently Hadoop, to add more database-like features to their products (Dremel/Megastore/Spanner providing SQL-like interfaces and ACID semantics). Yet it’s much harder to optimise in a late-bound world; no big data solution today comes close to the raw performance of the top-end analytics engines.
Most of today’s databases are hindered hugely by needing the schema to be defined upfront.
The last thing to consider is the cost of change. For simple data sets it is less apparent. Start joining data sets together, though, and it becomes a different ball game. Whilst possible with Big Data technologies, it’s just going to cost more, and managing the complexity in the absence of a schema becomes an increasingly uphill struggle. In this case, better to stick with the relational world (for now at least).
However most of today’s databases have the HUGE disadvantage that the schema needs to be defined, and the data understood, upfront. Great for simple, well-defined business data, but if you’re searching free text, machine-generated data or simply a hugely diverse data population (like the data that gets thrown around most big organisations) it’s simply not practical, or maybe not possible, to understand and model the data upfront. By applying the schema later in the cycle, the cost of change, the availability of insight and the inherent feedback cycles can all be improved.
As for the future?
You’ll probably have noticed that every database vendor worth their salt now has some form of Big Data offering, be it bought in, ‘tacked on’ or genuinely integrated. Likewise the Big Data vendors are looking more and more like their relational counterparts, sprouting query languages, loose schemas, columnar storage, indexing, even elements of transactionality. The two camps are converging.
Many of the new set of relational technologies look more like MapReduce than they do like System R (IBM’s original relational database). Yet the majority of the database community still seem to be lurking in the corner of the playground, wearing anoraks and murmuring (although these days the anoraks are made by Armani). They are a long way from penetrating the progressive Internet space. Joe Hellerstein’s words still ring true today.
The new cool kids of the database world are making their mark with technologies of the moment, backed with a hefty dose of academic acumen.
The future is likely to be one of convergence, and redirecting the database community is undoubtedly good. In fact possibly the most useful thing the NoSQL movement has done has been to give a well-timed boot to the database world’s behind, reminding them that they need to listen to their consumers. They got stuck in a rut and the internet space wasn’t going to wait around. Convergence on some newly shared values that sit between the two camps is of course inevitable, and welcome.
The database world got stuck in its ways and the internet space wasn’t going to wait around.
The evidence is quite plain already. There are a host of young(ish) upstart technologies hitting the space. The number of shared-nothing analytics engines has significantly increased (Aster Data, Vertica, VoltDB, Exasol, ParAccel, Greenplum, Hana – the list goes on) and their benchmarks are extremely impressive. There are hybrid engines mixing MapReduce with smarter storage, routing and indexing strategies. Hadapt and Impala are good examples, the former particularly as it is the one that probably best personifies the blending of the two worlds.
These new upstart database technologies redefine the current mainstream with, not in spite of, the lessons of the past.
Finally there are some interesting one-stop-shop approaches: holistic solutions that span dynamic schema provisioning and data access, all the way to presentation, in a single package. Originating in the machine-generated data space, Splunk (dominant) and Logscape (scalable) are the current leaders, and there is likely to be a lot more activity here. For answering what-if questions or assembling high-level MI stacks, these all-inclusive solutions get closest to answering the more insightful questions we have today.
Whether this ever breaks the stranglehold clenched by the oligopoly of key database players remains to be seen. Michael Stonebraker still rains disdain on the NoSQL world even today [see here]. He may be outspoken, he may come across like a bit of **** at times, but it is unlikely that he is wrong. The solutions of the future will not be the pure (and relatively simplistic) MapReduce of today. They will be blends that protect our data, even at scale. For me the new technologies coming from both camps are exciting because they redefine the current mainstream thinking with, not in spite of, the lessons of the past.
Over the last few years we’ve had a fair few discussions around the various different ways to branch and how they fit into a world of Continuous Integration (and more recently Continuous Delivery). It’s so fundamental that it’s worth a post of its own!
Dave Farley (the man who literally wrote the book on it) penned the best advice I’ve seen on the topic a while back. Worth a read, or even a reread (it gets better towards the end).
Here are some of the highlights of the 210 papers presented at VLDB earlier this year. You can find the full list here.
From Cooperative Scans to Predictive Buffer Management (here)
Intriguing paper from the Vectorwise guys on improving IO efficiency under load. LRU/MRU caching policies are known to break down under large, concurrent workloads. SQL Server and DB2 both have mechanisms for sharing IO between queries (by attaching to an existing scan or throttling faster queries so that IO can be shared). The Cooperative Scans discussed here take this a step further, incorporating an active buffer manager with which scans register their interest in data. The manager then adaptively chooses which pages to load and pass to the various concurrent requests.
There is another related paper at this conference: SharedDB: Killing One Thousand Queries With One Stone (here)
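For a flavour of the cooperative-scan idea, here is a minimal Java sketch – my own reading of the concept, not anything from the paper’s code: scans register the pages they still need, and an active buffer manager loads whichever page is wanted by the most scans, so one physical read serves many logical readers.

```java
import java.util.*;

class RelevanceBufferManager {
    // page number -> ids of the scans still waiting for that page
    private final Map<Integer, Set<String>> interest = new HashMap<>();

    // A scan declares up front which pages it still needs.
    synchronized void register(String scanId, Collection<Integer> pages) {
        for (int p : pages)
            interest.computeIfAbsent(p, k -> new HashSet<>()).add(scanId);
    }

    // The manager loads the page that satisfies the most concurrent scans,
    // sharing one physical read across many readers.
    synchronized OptionalInt nextPageToLoad() {
        int bestPage = -1, bestCount = 0;
        for (Map.Entry<Integer, Set<String>> e : interest.entrySet()) {
            if (e.getValue().size() > bestCount) {
                bestCount = e.getValue().size();
                bestPage = e.getKey();
            }
        }
        return bestCount == 0 ? OptionalInt.empty() : OptionalInt.of(bestPage);
    }

    // Once a page has been passed to its waiting scans, drop the interest.
    synchronized void delivered(int page) { interest.remove(page); }
}
```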
Processing a Trillion Cells per Mouse Click (Google) (here)
Interesting paper from Google suggesting an alternative to the approach to column orientation taken in Dremel. PowerDrill uses a double-dictionary encoded column store where the encodings live largely in memory. Further optimisations are made at load time to ensure minimal access to persistent storage. This makes it more akin to column stores like ParAccel or Vectorwise, applied to analytical workloads (aggregates, group bys etc).
Can the elephants handle the NoSQL onslaught (here)
Another paper comparing the performance of Hadoop with a relational database (in a similar vein to the SIGMOD 09 paper DeWitt published previously here). I sympathise with the message – databases outperform Hadoop on small to medium workloads – but I hope that most people know that already. This time the comparison is with Microsoft’s SQL Server PDW (Parallel Data Warehouse). The choice of data sizes between 250GB and 16TB means that the study has the same failing as the previous SIGMOD one; it’s not looking at large dataset performance.
Interactive Query Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads (here)
Useful, empirically driven paper with detailed data sets from a number of MapReduce deployments. Chen et al. performed an empirical study of Hadoop implementations at a number of companies, including Facebook. It hints at the current ‘elephant in the room’: Hadoop’s focus on batch-time over real-time performance (roll on Impala!). Having data of this level of granularity over a range of real systems is in itself quite valuable. They note that 90% of jobs are small (resulting in MBs of data returned).
High-Performance Concurrency Control Mechanisms for Main-Memory Databases (here)
Proposes an optimistic MVCC method for in-memory concurrency control. The conclusion: single-version locking performs well only when transactions are short and contention is low; higher contention, or workloads including some long transactions, favour the multiversion methods, and the optimistic method performs better than the pessimistic one.
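The optimistic flavour is easy to picture. Here is a hedged toy sketch in Java – nothing to do with the paper’s actual implementation, and using one coarse lock purely to keep it short: readers record the versions they saw, and a commit succeeds only if those versions are still current.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.atomic.AtomicLong;

class VersionedStore {
    static final class Versioned {
        final Object value; final long version;
        Versioned(Object v, long ver) { value = v; version = ver; }
    }

    private final Map<String, Versioned> data = new HashMap<>();
    private final AtomicLong clock = new AtomicLong();

    synchronized Versioned read(String key) { return data.get(key); }

    // Optimistic commit: succeed only if every key read is still at the
    // version the transaction observed; otherwise abort and let it retry.
    synchronized boolean commit(Map<String, Long> readSet, Map<String, Object> writes) {
        for (Map.Entry<String, Long> r : readSet.entrySet()) {
            Versioned current = data.get(r.getKey());
            long v = (current == null) ? -1 : current.version;
            if (v != r.getValue()) return false;            // validation failed
        }
        long commitVersion = clock.incrementAndGet();
        for (Map.Entry<String, Object> w : writes.entrySet())
            data.put(w.getKey(), new Versioned(w.getValue(), commitVersion));
        return true;
    }
}
```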
Blink and It’s Done: Interactive Queries on Very Large Data (here)
Blink is different to the mainstream database as it’s not designed to give you an exact answer. Instead you specify either error (confidence) or maximum time constraints on your query. The approach uses a number of sampling based strategies to achieve the required confidence level. There is a related paper: Model-based Integration of Past & Future in TimeTravel (here)
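The sampling idea is simple enough to sketch. Below is a rough Java illustration of an error-or-time-bounded average – my own toy, not Blink’s algorithm: keep sampling until the 95% confidence interval is narrower than the requested error, or the time budget expires.

```java
import java.util.Random;

class ApproximateAverage {
    // 'rows' stands in for the full table; a real system would query
    // pre-materialised samples. The 1.96 factor gives a ~95% interval.
    static double estimate(double[] rows, double maxError, long timeBudgetMs) {
        Random rnd = new Random();
        long deadline = System.currentTimeMillis() + timeBudgetMs;
        double sum = 0, sumSq = 0;
        int n = 0;
        do {
            double x = rows[rnd.nextInt(rows.length)];
            sum += x; sumSq += x * x; n++;
            if (n > 30) {
                double mean = sum / n;
                double var = (sumSq - n * mean * mean) / (n - 1);
                double halfWidth = 1.96 * Math.sqrt(var / n);
                if (halfWidth < maxError) return mean;   // error bound met
            }
        } while (System.currentTimeMillis() < deadline);
        return sum / n;   // best effort once the time budget is spent
    }
}
```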
Developing and Analyzing XSDs through BonXai (here)
B+-tree Index Optimization by Exploiting Internal Parallelism of Flash-based Solid State Drives (here)
SSD performance increases (initially) with the number of concurrent executions (in stark contrast with magnetic drives). This paper looks into maximising this with the use of concurrent B-trees that utilise parallel IO. Useful research, as flash is only going to get cheaper.
SCOUT: Prefetching for Latent Structure Following Queries (here)
I quite like the ideas in this paper around prefetching data based on a known structure (probably because it’s similar to some of the stuff we do).
Fast Updates on Read-Optimized Databases Using Multi-Core CPUs (here)
Addresses the problem some columnar architectures suffer where they accumulate writes in a separate partition, which must be periodically merged with the read-optimised main one.
FDB: A Query Engine for Factorised Relational Databases (here)
I hadn’t come across the idea of Factorised Databases before. An interesting concept. The paper demonstrates performance improvements over traditional methods for many-to-many join criteria.
- Only Aggressive Elephants are Fast Elephants (here)
Interesting approach to indexing Hadoop that claims to improve both read and write performance. I couldn’t find the code though so couldn’t try it.
The Vertica Analytic Database: C-Store 7 Years Later (here)
A good summary of this mature shared-nothing, columnar database. They discuss their use of super projections over join indexes, due to the overheads associated with tuple construction and the verbosity of storing the associated rowids. There is a summary of the encoding types used, as well as partitioning and locking strategies.
Muppet: MapReduce-Style Processing of Fast Data (here)
Whilst the majority of MapReduce commentary focuses on improving MR query performance, this paper looks at the problem of ingesting data quickly for high-throughput, streaming workloads. The approach focuses on data as streams (in and out) in association with a moving historical window (which they denote a slate). To me there seems to be a lot of similarity between this approach and the one taken by products like StreamBase and Cloudscale, but the authors differentiate themselves by being less schema-oriented, more akin to the traditional MR style.
Serializable Snapshot Isolation in PostgreSQL (here)
Interesting paper on the implementation of serializable isolation using the snapshot model.
Other papers of note:
- Minuet: A Scalable Distributed Multiversion B-Tree (here)
- A Statistical Approach Towards Robust Progress Estimation (here)
- Efficient Multi-way Theta-Join Processing Using MapReduce (here)
- Avatara: OLAP for Web-scale Analytics Products (OLAP cubes over a NoSQL @LinkedIn) (here)
- 10 Year Best Paper Award: Approximate Frequency Counts over Data Streams (here)
InfoQ published the video for my Where does Big Data meet Big Database talk at QCon this year.
Ian Robinson kindly came to RBS yesterday to speak about Neo4J (slides are here Thinking in Graphs). The odd one out of the NoSQL pack, Neo4J is a fascinating alternative to your regular key-value store. For me it’s about a different way of thinking about data, simply because the relations between nodes are as much a part of the data model as the nodes themselves. I am left wondering somewhat how one might apply this solution to the enterprise space, particularly finance. Multi-step Monte Carlo springs to mind as it creates a large connected space, but there is no real need to traverse that space retrospectively. There may be application in other simulation techniques though. The below is a paraphrased version of Ian’s words.
Today’s problems can be classified as a function of not only size but also connectedness and structure.
F(size, connectedness, structure)
The Relational model struggles to deal with each of these three factors. The use of sparsely populated tables in our databases and null checks in client side code allude to the unsuitability of this model.
NoSQL offers a solution. The majority of this fledgling field relies on the concept of a Map (Dictionary) in some way. First came simple key-value stores like Dynamo. Next, column-oriented stores like Cassandra and BigTable. Finally, Document Databases provide a more complex document model (for example JSON), with facilities for simple introspection.
Neo4J is quite different to its NoSQL siblings: A graph database that allows users to model data as a set of nodes and relationships. Once modelled the data can be examined based on its connectedness (i.e. how one node relates to others) rather than simply based on its attributes.
Neo4J uses a specific type of graph model termed a Property Graph: each node has associated attributes that describe its specificities. These need not be homogeneous (as they would be in a relational or object schema). Further, the relationships between nodes are both named and directed. As such they can be used in search criteria to find relationships between nodes.
The Property Graph model represents a pragmatic trade-off between the purity of a traditional graph database and what you might see in a document database. This can be contrasted with the other graph database models: in ‘Triple Stores’ every attribute is broken out as a separate node (this is a bit like third normal form for a graph database). Another alternative is Hypergraphs, where an edge can connect more than two nodes (see Ian’s slide to get a better understanding of this). Triple stores suffer from their fine-grained nature (I’m thinking binary vs red-black trees). Hypergraphs can be hard to apply to real-world modelling applications, as the multiplicity of relationships can make them hard to comprehend. The Property Graph model avoids the verbosity of triple stores and the conceptual complexity of Hypergraphs. As such the model works well for complex, densely connected domains and ‘messy’ data.
The fundamental attribute of the graph database is that Relationships are first class elements. That is to say querying relationships in a graph database is as natural as querying the data the nodes contain.
Neo4J, like many NoSQL databases is schemaless. You simply create nodes and relate them to one another to form a graph. Graphs need not be connected and many sub-graphs can be supported.
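As a flavour of quite how schemaless this is, here’s a minimal sketch using Neo4J’s embedded Java API (the 1.x-era API current at the time of the talk; later versions changed it), creating two nodes and relating them with no upfront declarations:

```java
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class TinyGraph {
    public static void main(String[] args) {
        GraphDatabaseService db = new GraphDatabaseFactory()
                .newEmbeddedDatabase("data/graph.db");
        Transaction tx = db.beginTx();
        try {
            Node a = db.createNode();          // no schema to declare first
            a.setProperty("name", "A");
            Node b = db.createNode();
            b.setProperty("name", "B");
            // Relationships are named and directed, and carry the model.
            a.createRelationshipTo(b, DynamicRelationshipType.withName("CONNECTED_TO"));
            tx.success();
        } finally {
            tx.finish();                       // 1.x idiom; close() in 2.x
        }
        db.shutdown();
    }
}
```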
A query is simply ‘parachuted’ into a point in the graph, from where it explores the local area looking for some search pattern. So for example you might search for the pattern A–>B–>C. The query itself can be executed either via a ‘traversal’ or using the Cypher graph language. The traversal method simply visits the graph based on some criteria. For example it might only traverse arcs of a particular type. Cypher is a more formal graph language that allows the identification of patterns within the graph.
Imagine a simple graph of two anonymous nodes with an arc between them:
In Cypher this would be represented as A-[:connected_to]-B
Considering a more complex graph:
A–>B–>C, A–>C or A–>B–>C–>A
We can start to build up pattern matching logic over these graphs, for example A-[*]->B to represent that A is somehow connected to B (think regex for graphs). This allows the graph to be mined for patterns based on any combination of the properties, arc directions or name (type).
There are further Cypher examples here including links to an online console where you can interactively experiment with the queries. Almost all of the query examples and diagrams are generated from the unit tests used to develop Cypher. This means that the manual is always an accurate reflection of the current feature set.
The product itself is JVM based (the query language is written in Scala), with a RESTful HTTP API alongside. It is fully transactional (ACID) and it is possible to override the transaction manager should you need to coordinate with an external one (for example because you want to coordinate with an external store). An object cache is used to store the entities in memory, with fall-through to memory-mapped files if the dataset does not fit in RAM.
HA support uses a master-slave, replicated model (single master model). You can write to a slave (i.e. any node) and it will obtain a lock from the master. Lucene is the default index provider.
The team have several strategies for mitigating the impact of GC pauses, the most important being a GC-resistant caching strategy. This assigns a certain amount of space in the JVM heap to the cache; it then purges objects whenever the maximum size is about to be reached, instead of relying on GC to make that decision. Here the competition with other objects in the heap, as well as GC pauses, can be better controlled, since the cache gets assigned a maximum heap space usage. Caching is described in more detail here.
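The essence of that strategy fits in a few lines of Java. This is only an illustration of the principle, not Neo4J’s code – a real implementation budgets heap space rather than entry count:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Bound the cache by an explicit budget and evict *before* the limit is
// hit, rather than leaving the decision to the garbage collector.
class BudgetedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;   // stand-in for a max heap-space budget

    BudgetedCache(int maxEntries) {
        super(16, 0.75f, true);     // access-order, i.e. LRU eviction
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // purge proactively at the boundary
    }
}
```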
Ian mentioned a few applications too:
- Telcos: managing the network graph. If something goes wrong they use the graph database to help predict where the problem likely comes from by simulating the network topology.
- Logistics: parcel routing. This is a hierarchical problem. Neo4J helps by allowing them to model the various routes to get a parcel from its start to its end location. Routes change (and become unavailable).
- Finally the social graph which is fairly self explanatory!
All round an eye-opening approach to the modelling and inspection of connected data sets.
James Phillips (co-founder of Couchbase) did a nice talk on NoSQL Databases at QCon:
Memcached – the simplest and original. Pure key value store. Memory focussed
Redis – Extends the simple map-like semantic with extensions that allow the manipulation of certain specific data structures, stored as values. So there are operations for manipulating values as lists, queues etc. Redis is primarily memory focussed.
Membase – extends the memcached approach to include persistence, the ability to add nodes, and backups on other nodes.
MongoDB – Uses BSON (a binary version of JSON which is open source but only really used by Mongo). Mongo is unlike Couchbase in that the query language is dynamic: Mongo doesn’t require the declaration of indexes. This makes it better at ad hoc analysis but slightly weaker from a production perspective.
Cassandra – Column-oriented, key-value. Values are split into columns which are pre-indexed before the information can be retrieved. Eventually consistent (unlike Couchbase). This makes it better for highly distributed use cases, or ones where the data is spread over unreliable networks.
Neo4J – Graph oriented database. Much more niche. Not distributed.
There are obviously a few more that could have been covered (Voldemort, Dynamo etc.) but a good summary from James nonetheless.
Full slides/video can be found here.
This article describes a little about ODC – primarily because we are hiring and we’d like candidates to know a little more about what we do here before they rock up – but it may also be of interest to those attempting to consolidate large amounts of data into a single, real-time, enterprise-wide store.
The Big Idea
ODC Core is the data store that sits at the centre of the ODC project. It was designed to be the one datastore the bank needs; the single port of call for all our trades and valuations with the vision of one day blending processing and data in a collocated manner. In fairness it is not quite that yet, as such a mythical beast is hard to come by, but it has made significant inroads.
So why is one big datastore useful you may ask? In short we, like many organisations, have a lot of problems with data. Most of these problems have nothing to do with technology. They are about different people’s interpretation of their part of our domain. Hundreds of systems across the bank each implement these different interpretations. Data is forwarded from system to system and the problem compounds. Enterprise messaging can only do so much to solve this problem because it is inherently point-in-time (so the interpretation of the message is still left to each application and their own method of persistence). Joining up all the dots to get a global view of the bank’s activity can be a confusing, manual and painful process. So the concept is simple: one golden copy that holds the truth. Get it right in one place and then migrate applications to that one single model and the one single data source. Simple idea. Somewhat harder to make a reality.
What is ODC Now
ODC has been live for coming up to two years, with development having started back in Jan 2010. The datastore is written inside Oracle Coherence, which provides a data fabric in which we have built a distributed, normalised database. ODC Core (the data store itself) has some interesting qualities that differentiate it from your average database (or Coherence cluster). The three I cover in more detail below are messaging as a system of record, a dynamic data replication model to support efficient distributed joins, and our dynamic object and SQL interfaces. There are some other quite neat features that I won’t go into here, such as a distributed clock implementation that allows reliable and efficient snapshots of the datastore, the use of compression on large result sets (our own interpretation of dictionary encoding) and a sample-based query optimiser.
Messaging as a System of Record: Unlike most databases, ODC Core provides both query and subscription semantics. This actually falls out quite naturally, as messaging sits at the very core of the product. In fact messaging is our system of record. So when data is written to the store, that data is only ‘accepted’ once it has been written synchronously to the event stream. Having an event stream as your system of record proves to be a powerful concept.
From a non-functional perspective this allows persistence to scale out linearly in a ‘load balanced’ manner (we use topics rather than queues, so there is global ordering and hence no need to share state across different servers in the messaging layer). Providing write scalability is only one advantage though. Having everything persisted through a single event stream means you can hook anything you like into it. If you are interested in a certain type of event you can just subscribe with a message selector. If you want to create a copy of the store in a relational database you can just hook into the same stream. If you want a disaster recovery instance … you get the picture I’m sure.
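To illustrate the essential contract – and this is a hedged sketch using standard JMS, not ODC’s actual code – a write returns only once the event has been synchronously published to the topic, making the stream, not the in-memory store, the source of truth:

```java
import javax.jms.*;

class EventStreamWriter {
    private final TopicConnectionFactory factory;
    private final Topic topic;

    EventStreamWriter(TopicConnectionFactory factory, Topic topic) {
        this.factory = factory;
        this.topic = topic;
    }

    // Returns only after the broker has accepted the persistent message,
    // so a write is never 'accepted' before it is on the event stream.
    void write(String payload) throws JMSException {
        TopicConnection conn = factory.createTopicConnection();
        try {
            TopicSession session = conn.createTopicSession(false, Session.AUTO_ACKNOWLEDGE);
            TopicPublisher publisher = session.createPublisher(topic);
            publisher.setDeliveryMode(DeliveryMode.PERSISTENT);
            publisher.publish(session.createTextMessage(payload)); // synchronous
        } finally {
            conn.close();
        }
    }
}
```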
ODC Core efficiently joins normalised data: All distributed stores that support a degree of normalisation struggle if they need to join data elements that are not collocated with one another. They are forced to ship potentially large amounts of data across the network to compute the join. Sharding helps a little, but you can only shard by a single key, so there will always be elements that don’t end up collocated (because they have ‘crosscutting’ keys). We use a relatively novel approach to solving this problem: in short, we replicate data that does not shard. However simply replicating data would cause the cluster to run out of memory, as there would be too much replicated data on each node.
To get around this problem, when data is written to the store the system walks the object model, ensuring that all items that the data ‘connects to’ are replicated. So we start out by replicating nothing. As data is written to the cluster we walk the domain model to make sure the ‘dimensions’ that data connects to are replicated. Most importantly, at any point in time data that is not ‘connected’ will not be replicated. This reduces the amount of replicated data by an order of magnitude so that replication can be used for efficient joins with ‘Dimensions’. If you’re interested in this pattern you can find out more about it here and here.
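In pseudo-Java the walk looks something like the sketch below. All the names here (Fact, Dimension, replicateToAllNodes) are invented for illustration; the real mechanics are described in the linked posts.

```java
import java.util.HashSet;
import java.util.Set;

class ConnectedReplication {
    interface Dimension { Iterable<Dimension> references(); }
    interface Fact { Iterable<Dimension> dimensions(); }

    private final Set<Dimension> replicated = new HashSet<>();

    // Only dimensions actually *connected* to written facts ever get
    // replicated, which keeps the replicated footprint small.
    void onWrite(Fact fact) {
        for (Dimension d : fact.dimensions()) walk(d);
    }

    private void walk(Dimension d) {
        if (!replicated.add(d)) return;   // already replicated; stop here
        replicateToAllNodes(d);
        for (Dimension ref : d.references()) walk(ref);
    }

    private void replicateToAllNodes(Dimension d) {
        // in the real system: push to a replicated cache on every node
    }
}
```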
ODC supports Object and Relational models through a single interface: ODC is primarily an object database. This is important because it represents a 2D domain model (a representation of the bank’s Logical Domain Model – something we hold very dear). We have a simple object-based query language which allows a user to query (filter, group etc.) by any element of any object in the store (the API is derived reflectively from the domain model). The language is SQL-like but has all the benefits of intellisense in your IDE. That is to say you can filter, group, select etc. on any getter, collection etc. that any of our objects expose. You can define which joins you would like to make to bring more data back, add predicate logic and so on.
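This isn’t our actual API, but a hypothetical Java interface of the same shape gives the idea: because queries hang off the domain model’s own types, the IDE can autocomplete every legal filter, join and grouping.

```java
import java.util.List;
import java.util.function.Function;
import java.util.function.Predicate;

// Hypothetical only: the rough shape of a typed, fluent query API in
// which the compiler and IDE know the domain model.
interface Query<T> {
    Query<T> where(Predicate<T> filter);       // filter on any getter
    <J> Query<T> join(Class<J> related);       // pull in connected entities
    <K> Query<T> groupBy(Function<T, K> key);  // group on any property
    List<T> list();                            // execute on the grid
}

// Usage would read something like:
//   store.query(Trade.class)
//        .where(t -> "ACME".equals(t.getCounterparty().getName()))
//        .join(Valuation.class)
//        .list();
```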
In addition we support a basic JDBC driver, which means users can get at our data in rows and columns if they wish. We’d prefer that they didn’t, as rows and columns just don’t really work for a 2D domain model, but we also understand that lots and lots of tools want to interact with their data in SQL. The SQL adapter actually works in exactly the same way as the object-based interface. That is to say the information that is sent to the store is the same; we just have to do a little more work to present the data in a tabular form.
ODC is continuously delivered: We’ve put a lot of work into continuously delivering our application suite, or at least getting as close to continuous delivery as we can. The challenge is that ODC is quite big. Each environment runs around 450 processes with 50 different process definitions, and the database is around 2TB, which means it takes a long time to migrate (see the Future section below).
So why bother with continuous delivery? It’s really about how long it takes to get feedback on a problem. With this system in place we get feedback on changes with a real data set, in something that looks and runs identically to production. We get that feedback every day. The effort has gone into a series of increasingly comprehensive tests. 20k unit and functional tests run before you check in (takes just under 20 mins). The MiniMe build migrates the database should any database changes be checked in. It does this on a cut-down dataset, which means it can do pretty much any migration in twenty to thirty minutes. If that passes, a full migration ensures that the code works with a fully populated data set. Finally, if all that passes, we rip down the almost-prod environment, release to it and start everything up again. If anything goes wrong we roll back using a database flashback. All in all a lot of pain, but that’s the world of databases in the terabytes. The luxury of seeing a new bit of work in a production-identical environment within a day is worth it though. The continuous delivery system is written in Gradle by Greg Gigon.
The future for ODC’s data store revolves around its ability to adapt to a changing world. Databases aren’t so good at that. When you have a database you need to understand your data before you store it. Part of moving fast is accepting that you can’t understand all your data at the get-go, however much you may wish to. Understanding data just takes time (and you get it wrong). The plan is to avoid these problems by using late binding to wrap a schema onto original, unaltered facts at runtime. This concept of the late-bound schema allows us to change our mind and map data late in the delivery cycle, because the unaltered facts always sit at its core. Doing this in a traditional schema-oriented store (like a database) isn’t possible, since you would have to back-populate any new additions. The schema is more like a view in a database, except that the view is over the data file as it was provided to the database, rather than some mapped version of it. Some big data technologies offer properties like this, but none we’ve come across offer it in the context of a statically typed language that can version data, provide consistent views and join entities that have disparate lifecycles. We see this step as an important move towards becoming the one store that a large number of systems can rely on.
The higher level vision (which is the vision of our CIO) is a data oriented architecture in which services are deployed and run in a cloud like environment that is ‘preloaded’ with all the bank’s primary data. That is to say that services running in this environment utilise only centralised persistence for the bank’s core facts.
The team are split between London and India. There is a strong influence from the software industry and that goes for the work as well as the ethos. We don’t always agree (lots of strong characters) but we always get along. If you are interested some of us mapped out what we value most here a while back.
We practice something that is a little bit like agile. We work iteratively. We write lots of tests. We keep the build time down. But we’re aging slightly which means we don’t pair as often as we used to (but we do still pair). Iterations overhang a little too often but hopefully you can forgive us for that.
So if you’re looking for work because you want to pay the bills, there are better teams out there. If you’ve chosen a life in software because it’s something that you find yourself musing about in idle moments and excited about when you wake in the morning, then it could be for you.
If you’d like to find out more just email me: benjamin[dot]stopford[at]rbs.com
- Intel’s new MIC ‘Knights Corner’ coprocessor (in the Intel Xeon Phi line) is targeted at the high-concurrency market, previously dominated by GPGPUs, but without the need for code to be rewritten into CUDA etc. (note Knights Ferry is the older prototype version).
- The chip has 64 cores and 8GB of RAM with a 512-bit vector engine. Clock speed is ~1.1GHz, with a 512KB L2 cache per core. The Linux kernel runs on two 2.2GHz processors.
- It comes on a card that drops into a PCI slot so machines can install multiple units.
- It uses a MESI protocol for cache coherence.
- There is a slimmed down linux OS that can run on the processor.
- Code must be compiled to two binaries, one for the main processor and one for Knights Corner.
- Compilers are currently available only for C++ and Fortran. Only Intel compilers at present.
- It’s on the cusp of being released (Q4 this year) for NDA partners (though we – GBM – have access to one off-site at Maidenhead). Due to be announced at the Supercomputing conference in November(?).
- KC is 4-6 GFLOPS/W – which works out at 0.85-1.8 TFLOPS for double precision.
- It is expected to be GA Q1 ‘13.
- It’s a large ‘device’ the wafer is a 70mm square form-factor!
- Access to a separate board over PCI is a temporary step. Expected that future versions will be a tightly-coupled co-processor. This will also be on the back of the move to the 14nm process.
- A single host can (depending on OEM design) support several PCI cards.
- Similarly, power-draw and heat-dispersal are an OEM decision.
- Reduced instruction set e.g. no VM support instructions or context-switch optimisations.
- Performance is now being expressed as GFLOPS per Watt. This is a result of US Government (efficiency) requirements.
- A single machine can now go faster than ASCI Red, the room-filling supercomputer of ’97!
- The main constraint to doing even more has been the limited volume production pipeline.
- Pricing not announced, but expected to be ‘consistent with’ GPGPUs.
- Key goal is to make programming it ‘easy’ or rather: a lot easier than the platform dedicated approaches or abstraction mechanisms such as OpenCL.
- Once booted (probably by a push of an OS image from the main host’s store to the device) it can appear as a distinct host over the network.
- The key point is that Knights Corner provides most of the advantages of a GPGPU but without the painful and costly exercise of migrating software from one language to another (that is to say it is based on the familiar x86 programming model).
- Offloading work to the card is instructed through the offload pragma or offloading keywords via shared virtual memory.
- Computation occurs in a heterogeneous environment that spans both the main CPU and the MIC card which is how execution can be performed with minimal code changes.
- There is a reduced instruction set for Knights Corner but the majority of the x86 instructions are there.
- There is support for OpenCL, although Intel are not recommending that route to customers due to performance constraints.
- Real-world testing has shown a provisional 4x improvement in throughput using an early version of the card running some real programs, while results from a sample test show perfect scaling. Some restructuring of the code was necessary: not huge, but not insignificant.
- There are currently only C++ and Fortran interfaces (so not much use if you’re running Java or C#)
- You need to remember that you are on PCI Express so you don’t have the memory bandwidth you might want.
Thanks to Mark Atwell for his help with this post.
Michael Stal wrote a nice article about our Progressive Architectures talk from this year’s QCon. The video is up too.
Read the article here.
Watch the video here.
A big thanks to Fuzz, Mark and Ciaran for making this happen.
I really enjoyed Harvey’s ‘POF Art’ talk at the Coherence SIG. Slides are here if you’re into that kind of thing: POF-Art.
What if, more than anything else, we valued helping each other out? What if this was the ultimate praise: not being the best technologists, not an ability to hit deadlines, not production stability? What if the ultimate accolade was to consistently help others get things done? Is that crazy? It’s certainly not always natural; we innately divide into groups, building psychological boundaries. Politics erupts from trivial things. And what about the business? How would we ever deliver anything if we spent all our time helping each other out? But maybe we’d deliver quite a lot.
If helping each other out were our default position wouldn’t we be more efficient? We’d have less politics, less conflict, fewer empires and we’d spend less money managing them.
We probably can’t change who we are. We’ll always behave a bit like we do now. Conflict will always arise and it will always cause problems; we all have tempers, we play games, we frustrate others and react to the slights and injustices.
But what if it was simply our default position? Our core value. The thing we fall back on. It wouldn’t change the world, but it might make us a little bit more efficient.
… right back to the real world
Valve handbook. Very cool:
Jon ‘The Gridman’ Knight has finally dusted off his keyboard and entered the blogosphere with a fantastic post on how we implement a reliable version of Coherence’s putAll() over here on ODC. One to add to your feed if you are interested in all things Coherence.
- Intel managing to squeeze 50 cores on a single chip, breaking through the teraflop boundary as they do so: Brier Dudley’s Blog | Wow: Intel unveils 1 teraflop chip with 50-plus cores | Seattle Times Newspaper
- RISC architectures have had a renaissance thanks largely to the needs of the mobile sector, could their low power consumption make them a serious contender for enterprise space? x86 Faces Unexpected RISC Competition
- AMD announce 4 memory channels allowing massive addressable spaces up to 364GB per CMP : AMD’s Interlagos and Valencia finally emerge
- Anyone who follows my blog will know of my belief in large address spaces reshaping the landscape, certainly for enterprise applications. This article echoes these views: Megatrend: Cheap RAM Reshaping All of Computing | Dr Dobb’s
- IBM’s Lime is an interesting approach to simplifying the programming of secondary devices. See Lime paper and the related Liquid Metal project.
- JVM on FPGA: JOP: A Tiny Java Processor Core for FPGA
- An interesting paper on using FPGA for Monte Carlo Simulation: FPGA for monte carlos
High Performance Java
- An excellent talk about using memory efficiently in Java applications; the costs are often higher than we think. It includes clear descriptions of the footprint of all Java objects and utilities: Building Memory Efficient Java Applications
- There has been a flurry of activity coming from Azul Systems recently. Most notably the release of Zing, their pauseless garbage collector. Gil Tene’s talk about the State of the Art in GC from QCon SF 2011 is one of the best I’ve seen (QConSF 2011: State of the Art in Garbage Collection).
- Azul have also recently released jHiccup, an interesting utility that measures operating system stalls: Java Developer Tools: jHiccup Java Performance Analysis
- Charles Nutter’s comments on his favourite JVM flags including my favourite (-XX:+PrintOptoAssembly): Headius: My Favorite Hotspot JVM Flags
Distributed Data Storage
- A great paper from VLDB describing an approach for balancing replication and partitioning, something close to my own heart: Schism: a Workload-Driven Approach to Database Replication and Partitioning
- Hasso Plattner (the P in SAP) wrote this paper, which provides an insightful view of where he believes the field should be going (and of course SAP’s solution, Hana): Hasso Plattner on In-Memory OLAP & OLTP
- I enjoyed watching this talk about Mongo: InfoQ: Scaling with MongoDB
- An entertaining article from the Economist about David Gelernter’s predictions of the future of computing: Brain scan: Seer of the mirror world | The Economist
- Could Prezi really dislodge PowerPoint? Prezi
- Double Loop Learning – a different view on organizational learning. Chris Argyris.
- Worth reading if you are not familiar with the idea already: CQRS
- An interesting twist on the traditional storyboard approach Our Story Board is Better Than Yours… I’m a big fan of replacing estimation with uniformly sized stories.
- Booked your next holiday? What about a Code Retreat with Corey Haines
High Performance Java
- Not exactly lightweight reading but one of the most detailed and influential papers on tuning your software for processing efficiency: What Developers Should Understand About Memory
- If you read the above and want to put some of it into action then VTune should be your next port of call. Diagnostic software for CPU cache hits etc: VTune™ Amplifier XE 2011 from Intel – Intel® Software Network
- When it really won’t go any faster, look at the Assembler: Deep dive into assembly code from Java | Java.net
- In anticipation of G1 (in case they ever get it finished) here’s the original paper with anticipated performance figures: G1 paper with figures
- A different approach to GC using processor specific minor collections (in Haskell): Multicore Garbage Collection with Local Heaps
Distributed Data Storage:
- The new Oracle NoSQL database – this is the best article I’ve read summarising its position in the market: DBMS Musings: Overview of the Oracle NoSQL Database
- The official Oracle NoSQL Whitepaper: Oracle NoSQL Database White Paper
- An interesting approach to data storage: an FPGA based data warehouse: FPGA Data Warehouse
- Google’s interesting SQL wrapped MapReduce framework: Tenzing A SQL Implementation On The MapReduce Framework
- The Actors Model – just in case you’re not familiar with it: Actors model for distribution
- Gluster – an open source distributed file system: Gluster
- Running Cuda natively on x86 processors: Running CUDA Code Natively on x86 Processors | Dr Dobb’s Journal
- Thinking about using 64bit JVMs with compressed pointers : 32-bit or 64-bit JVM? How about a Hybrid?
- Using different caches for reads and writes. A sensible pattern for Coherence implementations: Alexey Ragozin’s Blog
- OCZ Z-Drive – an interesting and competitively priced alternative to FusionIO:
- The architecture of the transputer. An interesting reflection on a couple of Bristol’s finest exports (other than Portishead): the Transputer and the occam programming language. David May, parallel processing pioneer • reghardware
- Is your brain like an Iphone? Is Your Brain Like an iPhone? Which App is Running Now? – Novato, CA Patch
- Just be still for once: No Shame in Stillness « Under the Apricot Tree
- Of the huge amount of writing about Steve Jobs I thought the Economist’s coverage was the best: Steve Jobs: The magician | The Economist
- Scott Marcar’s thought-provoking dialogue on technology through a financial crisis: The Long Haul: Scott Marcar Leads RBS’ Tech Team Through the Financial Crisis – WatersTechnology.com
- Short but thought provoking article on company culture: Why You Should Question Your Culture – Ron Ashkenas – Harvard Business Review
Here are the slides for the talk I gave at JavaOne:
Balancing Replication and Partitioning in a Distributed Java Database
This session describes the ODC, a distributed, in-memory database built in Java that holds objects in a normalized form in a way that alleviates the traditional degradation in performance associated with joins in shared-nothing architectures. The presentation describes the two patterns that lie at the core of this model. The first is an adaptation of the Star Schema model, used to hold data either replicated or partitioned, depending on whether the data is a fact or a dimension. In the second pattern, the data store tracks arcs on the object graph to ensure that only the minimum amount of data is replicated. Through these mechanisms, almost any join can be performed across the various entities stored in the grid, without the need for key shipping or iterative wire calls.
- Original talk, given at QCon London, which is more Coherence specific
- [pptx-9MB] [pdf-48MB]
- A related post documenting the main points covered in the talk
I’m heading to JavaOne in October to talk about some of the stuff we’ve been doing at RBS. The talk is entitled “Balancing Replication and Partitioning in a Distributed Java Database”.
Is anyone else going?
Because the future will inevitably be in-memory databases:
- SAP (slightly weirdly) is leading the way with Hana
- SSD makes a new kind of database possible
- The move away from clusters is not restricted to the enterprise
- More drinking of the Hana Kool-Aid
- Fusion IO
- Phase Change Memory breakthrough at IBM
Other interesting stuff:
- Interesting retrospective on computing giants of the past and future (in typical Economist style)
- A mathematician’s lament
- The next generation of Map Reduce
- Where google may be going wrong
The LMAX guys have open-sourced their Disruptor queue implementation. Their stats show some significant improvements (over an order of magnitude) over standard ArrayBlockingQueues in a range of concurrent tests. Both interesting and useful.
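The core trick is easy to caricature: a pre-allocated ring with separate producer and consumer sequence counters, so a single producer and single consumer never contend on a lock. The toy single-producer/single-consumer sketch below is my own illustration of the idea, not LMAX’s implementation (theirs adds cache-line padding, batching and multi-consumer coordination).

```java
import java.util.concurrent.atomic.AtomicLong;

class SpscRingBuffer {
    private final long[] buffer;
    private final int mask;                           // size is a power of two
    private final AtomicLong head = new AtomicLong(); // next slot to read
    private final AtomicLong tail = new AtomicLong(); // next slot to write

    SpscRingBuffer(int sizePowerOfTwo) {
        buffer = new long[sizePowerOfTwo];
        mask = sizePowerOfTwo - 1;
    }

    // Single producer: write the slot, then publish the new tail.
    boolean offer(long value) {
        long t = tail.get();
        if (t - head.get() == buffer.length) return false;  // ring is full
        buffer[(int) (t & mask)] = value;
        tail.lazySet(t + 1);          // release-store; no lock, no CAS
        return true;
    }

    // Single consumer: read the slot, then publish the new head.
    long poll() {                     // Long.MIN_VALUE signals empty (toy protocol)
        long h = head.get();
        if (h == tail.get()) return Long.MIN_VALUE;
        long value = buffer[(int) (h & mask)];
        head.lazySet(h + 1);
        return value;
    }
}
```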
The slides/video from my talk at QCon London have been put up on InfoQ.
An effort well worthy of its own post: http://www.christof-strauch.de/nosqldbs.pdf
- Nice talk covering optimising code in a single JVM: LMAX
- Biased locking in Hotspot: biased_locking_in_hotspot
- Good overview of caching: intel-cpu-caches
- Good overview of lock free algorithms: lock-free-algorithms
- Nice overview of the key NoSQL players: cassandra-vs-mongodb-vs-couchdb-vs-redis
- Google’s layering of ACID over BigTable (at least ACID inside a partition):
- Typically Economist: economist.com
Just a little plug for the 5th annual QCon London on March 7-11, 2011. There is a bunch of cool speakers including Craig Larman and Juergen Hoeller, as well as the obligatory set of ex-TW types. I’ll be doing a session on Going beyond the Data Grid.
You can save £100 and give £100 to charity if you book with this code: STOP100
More discussions on the move to in memory storage:
- RAM is my friend
- LMAX – How to Do 100K TPS at Less than 1ms Latency
- The problems with ACID, and how to fix them without going NoSQL
- Basho Riak: An Open Source Scalable Data Store
- Facebook’s belief in HBase
- Numbers Everyone Should Know
- Google Dremel Paper
- Facebook’s New Year Performance Stats
I’ve been working on a medium sized data store (around half a TB) that provides high bandwidth and low latency access to data.
Caching and warehousing techniques push you towards denormalisation, but this becomes increasingly problematic when you move to a highly distributed environment (certainly if the data is long-lived). We’ve worked on a model that is semi-normalised whilst retaining the performance benefits associated with denormalisation.
The other somewhat novel attribute of the system is its use of Messaging as a system of record.
I’ll also be adding some more posts in the near future to flesh out how this all works.
Submissions are being accepted for RefTest at IEEE International Conference on Testing, Verification and Validation.
Submissions can be short (2-page) or full-length conference papers. The deadline is Jan 4th 2011.
Full details are here.