A look at where change in the database market has come from and where it might be going
Whilst Big Data contains the promise of fame and fortune, the signal-to-noise ratio is low. How do you make sense of the marketing blurb?
Once upon a time, in a land of metaphor, there lived a database called George…
Have you ever wished that code was easier to test? The talk, and associated paper, explores how things might be different if we started from scratch.
My experiences of late have largely focused around distributed data storage, in particular Oracle Coherence, and this has inevitably shaped my understanding of systems, so keep that in mind as you read on.
The articles above are those that are recent or popular. There are a variety of other articles, categorised to the right and more traditional blog entries below.
- JAX-2013: The Return of Big Iron?
- QCon-2012: Where does Big Data meet Big Database?
- QCon-2012: Progressive Architectures at RBS
- JavaOne-2011: Balancing Replication and Partitioning in a Distributed Java Database
- QCon-2011: Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability
- OpenWorld-2011: Adopting Oracle Coherence as an Enterprise Standard
- UCL-2011: A Paradigm Shift: The Increasing Dominance of Memory-Oriented Solutions for High Performance Data Access
- CoSIG-2011: Oracle Coherence Implementation Patterns (Special Interest Group)
- ICST-2011: Test-Oriented Languages: a new era?
- ICST-2011: Enabling Testing, Design and Refactoring Practices in Remote Locations
- Birkbeck-2011: Data Storage for Extreme Use Cases
- RefTest-2010: Has Mocking Gone Wrong?
- RBS-2009: Data Grids with Oracle Coherence
- Brunel-2008: The Architect's Two Hats
- Brunel-2007: Architecture and Design in Industry
- Beyond the Data Grid: Coherence, Normalisation, Joins and Linear Scalability (QCon)
- Coherence Part I: An Introduction
- Coherence Part II: Delving a Little Deeper
- Coherence Part III: The Coherence Toolbox
- Coherence Part IV: Merging Data And Processing
- Coherence: The Fallacy of Linear Scalability
- How Fault Tolerant Is Coherence Really?
- Merging Data And Processing: Why it doesn’t “just work”
- A Reliable version of putAll()
- A Singleton Service
- Coherence Implementation Patterns – Slides from Coherence SIG
- Joins: using Key-Association (Simplest)
- Joins: using Snowflake Schemas & CQCs (Intermediate)
- Joins: with Connected-Replication (Advanced)
- Managing Versioning
- The Collections Cache
- Database Y (2013)
- The Big Data Conundrum (2012)
- Thoughts on Big Data Technologies (4): Our Love-Hate relationship with the Relational Database (2012)
- Thoughts on Big Data Technologies (3): Objections Worth Thinking About (2012)
- Thoughts on Big Data Technologies (2): How big is Big? (2012)
- Thoughts on Big Data Technologies (1) (2012)
- A Story about George (2012)
- Data Storage for Extreme Use Cases: An Introduction (and a Peek at ODC) (2011)
- The Rebirth of the In-Memory Database (2011)
- Is the Traditional Database a Thing of the Past? (2009)
- Shared Nothing v.s. Shared Disk Architectures: An Independent View (2009)
- Component Software. Where is it going? (2005)
- Do Metrics Have a Place in Software Engineering Today? (2004)
- Are Mocks All They Are Cracked Up To Be?
- Beyond Stubs: Why We Need Interaction Testing
- Isolating Functional Units: Why We Need Stubs
- Test Oriented Languages: Is it Time for a New Era?
- Distributing Skills Across a Continental Divide
- Four HPC Architecture Questions – With Answers
- Interviewing: The Importance of Examining Applied Knowledge
- Learning Practices for Distributed Teams (ICST)
- Mapping Personal Practices
- The Business Analyst Test
MongoDB recently secured $150m in funding. If you’re not sure how to place that figure, it is more than any database vendor has secured in a single funding round, ever! The company was reportedly valued at $1.2 billion, a huge amount considering the total annual revenue in the NoSQL market was only $542m in 2012. News reports of new databases appear on what feels like a weekly basis. The latest database landscape cites more than two hundred products, and only really scratches the surface.
This is an interesting time for the database world, and some inevitable questions arise about where this change has come from. Certainly it would seem that the database field has not been serving us, the customers, sufficiently well; if it had been, these new products would be unlikely to exist. Behind this likely sits a more fundamental change in our needs. The internet has been an obvious contributory factor, but there is likely more to this change than simply the need to scale. Joe Hellerstein, something of a guru in the database field, called this out more than a decade ago, and his words make interesting reading today, with provocative, if informal, comments like:
“Databases commoditized and cornered … to slow-moving, evolving, structure-intensive apps that require schema evolution”
The companies that innovate, in the Christensen sense, tend to operate in smaller, niche markets that are less well served; markets that tolerate the rough edges new products inevitably have. These innovations often pre-empt changes in the mainstream. If and when shifts occur, the mainstream incumbents can be left lacking the expertise they need to compete.
Amongst other examples he cites the analysis of hard disk production in the 80s, where Seagate came to dominance through the mainstream shift from 8” to 5.25” disk technology. Disk manufacturers of the time, preoccupied with increasing the performance of the 8” format preferred by mainframe customers, could not compete when the market changed to the smaller format. By focussing on their existing customers and the flow of sales, they missed what, in hindsight, seems like an obvious shift but likely seemed less certain at the time.
Seagate, having cut their teeth in 5.25” technology for the best part of a decade, selling to the then niche desktop computer market, filled the gap and replaced them.
So I find myself wondering: is the database world, with its flourishing, open-source-wedded NoSQL movement, also following this pattern?
Whilst it seems unlikely that the latest bedroom-crafted distributed hash table will oust the likes of Microsoft, Oracle and IBM, that does not mean the pattern will not play out in one form or another. Enterprise customers are not demanding change in the way the internet companies are, and with enterprise customers making up the lion’s share of the database gravy train, the money is not going anywhere fast. The interesting question is whether there is something genuinely disruptive here; something that could leave the incumbent database vendors lacking.
One of the key problems that many of the NoSQL (and NewSQL) vendors face is the absence of the rigorous heritage of the traditional database products. Databases are complicated software, likely the most complex libraries you will program with on a day-to-day basis. Most recent relational database contenders – Aster Data, ParAccel, Greenplum, Netezza, Vertica, to name just a few of many – have been Postgres forks, this being necessary to get a leg up the steep development curve required to build a fully functional DBMS.
The NoSQLs Couchbase, Cassandra and Dynamo (Voldemort) are all under 200k lines of code. A questionable metric, I know, but it hints at the comparative simplicity of some of these solutions.
The NoSQLs are, in many ways, slightly crappy products. A quick scour of the internet will find you numerous articles citing the crappiness of MongoDB: its strange and weird failure scenarios, its overly simplistic delegation of IO to the operating system, and its (now changed) default policy of not putting any of your data on disk synchronously. The MongoDB is Webscale clip plays on this to hilarious effect.
Yet the market is surprisingly tolerant, at least for the meanwhile. The Christensen explanation would be that niche markets are inherently tolerant of slightly crappy products because there is simply less to compare them to. The first Walkman, personal computer and mobile phone were all actually pretty crappy. They were, however, also very cool. They did stuff that the mainstream did not, opening markets that were previously unavailable. Few large corporates manage to do this (Jobs’s Apple being the obvious counter).
Despite being a little rough round the edges, the use of Mongo has gone viral. The recent valuation of $1.2 billion evidences this success. MarkLogic, a far more mature, solid (and expensive) NoSQL with a heritage in XML document storage, makes for an interesting comparison. MarkLogic has also had commercial success (Wikibon puts MarkLogic as the dominant player in the NoSQL space – see graphic) but has nothing like the community penetration of Mongo. Google Trends puts this well (lower graphic), and VC money in this day and age often follows the herd rather than the balance sheet.
Part of this success can be attributed to the open-source pricing model. Open source makes technology more accessible, coming in under the corporate radar, and people are more tolerant of something that is free. Finally, and importantly, it has a laid-back dressing of web-cool that captivates young engineering minds… although that might change if an Oracle-esque corporate sales machine kicks in.
So if NoSQL is simply about scaling, why Mongo’s recent success? Mongo has a fraction of the scalability of Aerospike, Riak or Dynamo, because it has chosen a less scalable, but far more user-friendly, free-form query API, which puts it far closer to the traditional OLTP space than other products in the NoSQL field.
Possibly more importantly it panders to the iterative, poly-structured world of today’s engineers and in doing so it is locking into a fundamental truth: the relational world doesn’t sit well with today’s development practices and the latest generation of programmers are less wedded to it than the last.
So are we witnessing an interesting niche field growing up, or is this the prologue for a Christensen-style shift of the mainstream?
Oracle took $11 billion of the $34 billion database market in 2012. The whole NoSQL market is a little over half a billion currently (2%), so it seems unlikely that Larry will be quaking in his platinum-coated deck shoes just yet. But to Christensen’s point, if the differentiation is both significant and sustained and the mainstream vendors do not keep up, that could spell change.
It is unclear to me, as it stands today, whether this path is really disruptive. I’ve discussed the signs of convergence before, but there are also a number of technical approaches acting as differentiators: the use of shards and replicas is in many ways more advanced than that seen in relational systems; the application of schema-on-need provides an iterative, explorative approach to data modelling and data-model evolution; and, more progressively, No/NewSQL is moving beyond CAP to include lightweight transactional models that provide stronger consistency guarantees without all the baggage of serialisable two-phase commit.
So whether this is tangential or revolutionary remains to be seen. Intuitively it certainly seems unlikely that David will topple any of the Goliaths; in fact it seems more likely that the Goliaths will simply eat (buy) the Davids. But we should still see deeper changes than those evidenced by the current lip service paid by the big vendors. This is a good thing.
Big Data’s ill-defined promise of otherwise missed insight has done more to bolster vendor sales figures than it has to derive value in many of the organisations that implement it.
That’s not to say that it is without successes, far from it. Empirical exploration of customer behaviours has changed the way we interact with customers and has in fact raised the bar for all online custom. But approaches like bolting an MR framework onto an existing storage engine, or adding a JSON column type, smack more of feature-sheet bolstering than they do of true technical innovation.
Instead the big vendors should look to the world of today’s engineers: cheap commodity hardware, agile development, continuous integration / delivery, poly-structured data both on the wire and in programs and a general lack of tolerance for anything that removes control from the development process. Mongo’s success can certainly be attributed, at least in part, to this.
The relational world is however both deeply embedded and wholly trusted, and so it should be. Data is something that lasts for decades and holding long-lived data in a schemaless store will bring with it many additional challenges not seen today. This fact alone may see truly schemaless stores marginalised in the long run.
But the generational thing may be decisive. If decision makers in large companies have one thing in common it is their age. Having sat through decades of relational dominance we can cite a thousand sensible, logical, often unquestionable reasons why this dominance should continue. Generation Y however are simply less likely to see it the same way.
Slides for today’s talk at RBS Techstock:
A similar name to the Big Data 2013 talk, but a very different deck:
The slides from yesterday’s guest lecture on NoSQL, NewSQL and Big Data can be found here.
Slides from today’s European Trading Architecture Summit 2012 are here.
Over the last few years we’ve had a fair few discussions around the various different ways to branch and how they fit into a world of Continuous Integration (and more recently Continuous Delivery). It’s so fundamental that it’s worth a post of its own!
Dave Farley (the man who literally wrote the book on it) penned the best advice I’ve seen on the topic a while back. Worth a read, or even a reread (it gets better towards the end).
Here are some of the highlights of the 210 papers presented at VLDB earlier this year. You can find the full list here.
From Cooperative Scans to Predictive Buffer Management (here)
Intriguing paper from the Vectorwise guys on improving IO efficiency under load. LRU/MRU caching policies are known to break down under large, concurrent workloads. SQL Server and DB2 both have mechanisms for sharing IO between queries (by attaching to an existing scan, or by throttling faster queries so that IO can be shared). The Cooperative Scans approach discussed here takes this a step further by incorporating an active buffer manager with which scans register their interest in data. The manager then adaptively chooses which pages to load and pass to the various concurrent requests.
There is another related paper at this conference: SharedDB: Killing One Thousand Queries With One Stone (here)
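To make the idea concrete, here is a toy sketch of the core mechanic – scans register interest in the pages they still need, and an active manager shares each physical read among every scan that wants it. All names are invented; this is an illustration of the principle, not the paper’s (considerably more sophisticated) algorithm:

```java
import java.util.*;

// Toy illustration only: the manager loads whichever page satisfies the most
// registered scans, so a single physical read is shared between them.
interface Scan {
    Set<Integer> pagesStillNeeded();
    void consume(int pageId, byte[] page);
}

final class CooperativeBufferManager {
    private final Map<Integer, List<Scan>> interest = new HashMap<>(); // page -> waiting scans

    void register(Scan scan) {
        for (int pageId : scan.pagesStillNeeded())
            interest.computeIfAbsent(pageId, p -> new ArrayList<>()).add(scan);
    }

    /** Load the most-wanted page and hand it to every scan waiting on it. */
    void step() {
        interest.entrySet().stream()
                .max(Comparator.comparingInt(
                        (Map.Entry<Integer, List<Scan>> e) -> e.getValue().size()))
                .ifPresent(e -> {
                    byte[] page = loadFromDisk(e.getKey());
                    e.getValue().forEach(s -> s.consume(e.getKey(), page));
                    interest.remove(e.getKey());
                });
    }

    private byte[] loadFromDisk(int pageId) {
        return new byte[8192]; // stand-in for the real IO
    }
}
```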
Processing a Trillion Cells per Mouse Click (Google) (here)
Interesting paper from Google suggesting an alternative to the approach to column orientation taken in Dremel. PowerDrill uses a double-dictionary encoded column store where the encodings live largely in memory. Further optimisations are made at load time to ensure minimal access to persistent storage. This makes it more akin to column stores like ParAccel or Vectorwise, applied to analytical workloads (aggregates, group bys etc).
Can the elephants handle the NoSQL onslaught (here)
Another paper comparing the performance of Hadoop with a relational database (in a similar vein to the SIGMOD 09 paper DeWitt published previously here). I sympathise with the message – databases outperform Hadoop on small to medium workloads – but I hope that most people know that already. This time the comparison is with Microsoft’s SQL Server PDW (Parallel Data Warehouse). The choice of data sizes between 250GB and 16TB means that the study has the same failing as the previous SIGMOD one: it’s not looking at large dataset performance.
Interactive Query Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads (here)
Useful, empirically driven paper: Chen et al. study Hadoop implementations, with detailed data sets, at a number of companies including Facebook. It hints at the current ‘elephant in the room’ that is Hadoop’s focus on batch-time over real-time performance (roll on Impala!). Having data of this level of granularity over a range of real systems is in itself quite valuable. They note that 90% of jobs are small (resulting in MBs of data returned).
High-Performance Concurrency Control Mechanisms for Main-Memory Databases (here)
Proposes an optimistic MVCC method for in-memory concurrency control. The conclusion: single-version locking performs well only when transactions are short and contention is low; higher contention, or workloads including some long transactions, favour the multiversion methods, and the optimistic method performs better than the pessimistic one.
Blink and It’s Done: Interactive Queries on Very Large Data (here)
Blink is different to the mainstream database in that it’s not designed to give you an exact answer. Instead you specify either error (confidence) or maximum-time constraints on your query. The approach uses a number of sampling-based strategies to achieve the required confidence level. There is a related paper: Model-based Integration of Past & Future in TimeTravel (here)
Developing and Analyzing XSDs through BonXai (here)
B+-tree Index Optimization by Exploiting Internal Parallelism of Flash-based Solid State Drives (here)
SSD performance increases (initially) with the number of concurrent executions (in stark contrast with magnetic drives). This paper looks into maximising this with the use of concurrent B-trees that utilise parallel IO. Useful research, as flash is only going to get cheaper.
SCOUT: Prefetching for Latent Structure Following Queries (here)
I quite like the ideas in this paper around prefetching data based on a known structure (probably because it’s similar to some of the stuff we do).
Fast Updates on Read-Optimized Databases Using Multi-Core CPUs (here)
Addresses the problem some columnar architectures suffer where they accumulate writes in a separate partition, which must be periodically merged with the read-optimised main one.
FDB: A Query Engine for Factorised Relational Databases (here)
I hadn’t come across the idea of Factorised Databases before. An interesting concept. The paper demonstrates performance improvements over traditional methods for many-to-many join criteria.
- Only Aggressive Elephants are Fast Elephants (here)
Interesting approach to indexing Hadoop that claims to improve both read and write performance. I couldn’t find the code though so couldn’t try it.
The Vertica Analytic Database: C-Store 7 Years Later (here)
A good summary of this mature shared-nothing, columnar database. They discuss their use of super projections over join indexes, due to the overheads associated with tuple construction and the verbosity of storing the associated rowids. There is a summary of the encoding types used, as well as partitioning and locking strategies.
Muppet: MapReduce-Style Processing of Fast Data (here)
Whilst the majority of MapReduce commentary focuses on improving MR query performance, this paper looks at the problem of ingesting data quickly for high-throughput, streaming workloads. The interesting approach focuses on data as streams (in and out) in association with a moving historical window (which they denote a slate). To me there seems to be a lot of similarity between this approach and the one taken by products like StreamBase and Cloudscale, but the authors differentiate themselves by being less schema oriented, more akin to the traditional MR style.
Serializable Snapshot Isolation in PostgreSQL (here)
Interesting paper on the implementation of serializable isolation using the snapshot model.
Other papers of note:
- Minuet: A Scalable Distributed Multiversion B-Tree (here)
- A Statistical Approach Towards Robust Progress Estimation (here)
- Efficient Multi-way Theta-Join Processing Using MapReduce (here)
- Avatara: OLAP for Web-scale Analytics Products (OLAP cubes over a NoSQL @LinkedIn) (here)
- 10 Year Best Paper Award: Approximate Frequency Counts over Data Streams (here)
InfoQ published the video for my Where does Big Data meet Big Database talk at QCon this year.
Ian Robinson kindly came to RBS yesterday to speak about Neo4J (slides are here: Thinking in Graphs). The odd one out of the NoSQL pack, Neo4J is a fascinating alternative to your regular key-value store. For me it’s about a different way of thinking about data, simply because the relations between nodes are as much a part of the data model as the nodes are themselves. I am left wondering somewhat how one might apply this solution to the enterprise space, particularly finance. Multi-step Monte Carlo springs to mind as it creates a large connected space, but there is no real need to traverse that space retrospectively. There may be application in other simulation techniques though. The below is a paraphrased version of Ian’s words.
Today’s problems can be classified as a function of not only size but also connectedness and structure.
F(size, connectedness, structure)
The Relational model struggles to deal with each of these three factors. The use of sparsely populated tables in our databases and null checks in client side code allude to the unsuitability of this model.
NoSQL offers a solution. The majority of this fledgling field relies on the concept of a Map (Dictionary) in some way. First came simple key-value stores like Dynamo. Next came column-oriented stores like Cassandra and BigTable. Finally, Document Databases provide a more complex document model (for example JSON), with facilities for simple introspection.
Neo4J is quite different to its NoSQL siblings: A graph database that allows users to model data as a set of nodes and relationships. Once modelled the data can be examined based on its connectedness (i.e. how one node relates to others) rather than simply based on its attributes.
Neo4J uses a specific type of graph model termed a Property Graph: each node has associated attributes that describe its specificities. These need not be homogeneous (as they would be in a relational or object schema). Further, the relationships between nodes are both named and directed. As such they can be used in search criteria to find relationships between nodes.
The Property Graph model represents a pragmatic trade-off between the purity of a traditional graph database and what you might see in a document database. This can be contrasted with the other graph database models. In ‘Triple Stores’ every attribute is broken out as a separate node (this is a bit like third normal form for a graph database). Another alternative is Hypergraphs, where an edge can connect more than two nodes (see Ian’s slide to get a better understanding of this). Triple stores suffer from their fine-grained nature (I’m thinking binary vs red-black trees). Hypergraphs can be hard to apply to real-world modelling applications, as the multiplicity of relationships can make them hard to comprehend. The Property Graph model avoids the verbosity of triple stores and the conceptual complexity of Hypergraphs. As such the model works well for complex, densely connected domains and ‘messy’ data.
The fundamental attribute of the graph database is that Relationships are first class elements. That is to say querying relationships in a graph database is as natural as querying the data the nodes contain.
Neo4J, like many NoSQL databases is schemaless. You simply create nodes and relate them to one another to form a graph. Graphs need not be connected and many sub-graphs can be supported.
A query is simply ‘parachuted’ into a point in the graph, from where it explores the local area looking for some search pattern. So for example you might search for the pattern A–>B–>C. The query itself can be executed either via a ‘traversal’ or using the Cypher graph language. The traversal method simply visits the graph based on some criteria. For example, it might only traverse arcs of a particular type. Cypher is a more formal graph language that allows the identification of patterns within the graph.
Imagine a simple graph of two anonymous nodes with an arc between them:
In Cypher this would be represented as A-[:connected_to]-B
Considering a more complex graph:
A–>B–>C, A–>C or A–>B–>C–>A
We can start to build up pattern-matching logic over these graphs, for example A-[*]->B to represent that A is somehow connected to B (think regex for graphs). This allows the graph to be mined for patterns based on any combination of the properties, arc directions or names (types).
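For the curious, a minimal sketch of what this looks like from Java using Neo4J’s embedded API is below. The method signatures have shifted between Neo4J versions, so treat this as indicative rather than definitive:

```java
import java.io.File;
import org.neo4j.graphdb.*;
import org.neo4j.graphdb.factory.GraphDatabaseFactory;

public class GraphPatternSketch {
    // relationship types are just named, directed labels on arcs
    enum RelTypes implements RelationshipType { CONNECTED_TO }

    public static void main(String[] args) {
        GraphDatabaseService db =
                new GraphDatabaseFactory().newEmbeddedDatabase(new File("graph.db"));

        try (Transaction tx = db.beginTx()) {
            Node a = db.createNode();
            Node b = db.createNode();
            a.createRelationshipTo(b, RelTypes.CONNECTED_TO); // A-[:CONNECTED_TO]->B
            tx.success();
        }

        try (Transaction tx = db.beginTx()) {
            // 'A is somehow connected to B' - the regex-for-graphs style pattern
            Result result = db.execute("MATCH (a)-[:CONNECTED_TO*]->(b) RETURN a, b");
            while (result.hasNext()) {
                System.out.println(result.next());
            }
            tx.success();
        }
        db.shutdown();
    }
}
```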
There are further Cypher examples here including links to an online console where you can interactively experiment with the queries. Almost all of the query examples and diagrams are generated from the unit tests used to develop Cypher. This means that the manual is always an accurate reflection of the current feature set.
The product itself is JVM based (the query language is written in Scala) and fully transactional (ACID); it is possible to override the transaction manager should you need to coordinate with an external one (for example because you want to coordinate with an external store). An object cache is used to store the entities in memory, with fall-through to memory-mapped files if the dataset does not fit in RAM. There is also an HTTP-based (RESTful) API.
HA support uses a master-slave, replicated model (single master model). You can write to a slave (i.e. any node) and it will obtain a lock from the master. Lucene is the default index provider.
The team have several strategies for mitigating the impact of GC pauses, the most important being a GC-resistant caching strategy. This assigns a certain amount of space in the JVM heap and purges objects whenever the maximum size is about to be reached, instead of relying on GC to make that decision. Here the competition with other objects in the heap, as well as GC pauses, can be better controlled, since the cache gets assigned a maximum heap-space usage. Caching is described in more detail here.
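The flavour of this can be sketched in a few lines of Java. This is not Neo4J’s implementation – which bounds estimated heap bytes rather than entry counts – just a minimal illustration of purging deterministically on size rather than leaving reclamation to the collector:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Minimal sketch of the principle: bound the cache and evict eagerly,
// rather than letting garbage collection decide when memory is reclaimed.
public class BoundedCache<K, V> extends LinkedHashMap<K, V> {
    private final int maxEntries;

    public BoundedCache(int maxEntries) {
        super(16, 0.75f, true); // access-order, so eviction starts at the LRU entry
        this.maxEntries = maxEntries;
    }

    @Override
    protected boolean removeEldestEntry(Map.Entry<K, V> eldest) {
        return size() > maxEntries; // purge before memory pressure builds
    }
}
```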
Ian mentioned a few applications too:
- Telcos: managing the network graph. If something goes wrong they use the graph database to help predict where the problem likely comes from by simulating the network topology.
- Logistics: parcel routing. This is a hierarchical problem. Neo4J helps by allowing them to model the various routes to get a parcel from its start to its end location. Routes change (and become unavailable).
- Finally the social graph which is fairly self explanatory!
All round an eye-opening approach to the modelling and inspection of connected data sets.
James Phillips (co-founder of Couchbase) did a nice talk on NoSQL Databases at QCon:
Memcached – the simplest and original. Pure key value store. Memory focussed
Redis – Extends the simple map-like semantic with extensions that allow the manipulation of certain specific data structures, stored as values. So there are operations for manipulating values as lists, queues etc. Redis is primarily memory focussed.
Membase – extends the memcached approach to include persistence, the ability to add nodes, and backups on other nodes.
MongoDB – uses BSON (a binary version of JSON which is open source but only really used by Mongo). Mongo is unlike Couchbase in that the query language is dynamic: Mongo doesn’t require the declaration of indexes. This makes it better at ad hoc analysis but slightly weaker from a production perspective.
Cassandra – column-oriented, key-value. The values are split into columns which are pre-indexed before the information can be retrieved. Eventually consistent (unlike Couchbase). This makes it better for highly distributed use cases or ones where the data is spread over an unreliable network.
Neo4J – Graph oriented database. Much more niche. Not distributed.
There are obviously a few more that could have been covered (Voldemort, Dynamo etc.) but a good summary from James nonetheless.
Full slides/video can be found here.
This article describes a little about ODC – primarily because we are hiring and we’d like candidates to know a little more about what we do here before they rock up – but it may also be of interest to those attempting to consolidate large amounts of data into a single, real-time, enterprise-wide store.
The Big Idea
ODC Core is the data store that sits at the centre of the ODC project. It was designed to be the one datastore the bank needs; the single port of call for all our trades and valuations with the vision of one day blending processing and data in a collocated manner. In fairness it is not quite that yet, as such a mythical beast is hard to come by, but it has made significant inroads.
So why is one big datastore useful you may ask? In short we, like many organisations, have a lot of problems with data. Most of these problems have nothing to do with technology. They are about different people’s interpretation of their part of our domain. Hundreds of systems across the bank each implement these different interpretations. Data is forwarded from system to system and the problem compounds. Enterprise messaging can only do so much to solve this problem because it is inherently point-in-time (so the interpretation of the message is still left to each application and their own method of persistence). Joining up all the dots to get a global view of the bank’s activity can be a confusing, manual and painful process. So the concept is simple: one golden copy that holds the truth. Get it right in one place and then migrate applications to that one single model and the one single data source. Simple idea. Somewhat harder to make a reality.
What is ODC Now
ODC has been live for coming up to two years, with development starting back in Jan 2010. The datastore is written inside Oracle Coherence, which provides a data fabric in which we have built a distributed, normalised database. ODC Core (the data store itself) has some interesting qualities that differentiate it from your average database (or Coherence cluster). The three I cover in more detail below are messaging as a system of record, a dynamic data replication model to support efficient distributed joins, and our dynamic object and SQL interfaces. There are some other quite neat features that I won’t go into here, such as a distributed clock implementation that allows reliable and efficient snapshots of the datastore, the use of compression on large result sets (our own interpretation of dictionary encoding) and a sample-based query optimiser.
Messaging as a System of Record: unlike most databases, ODC Core provides both query and subscription semantics. This actually falls out quite naturally, as messaging sits at the very core of the product. In fact messaging is our system of record: when data is written to the store, that data is only ‘accepted’ once it has been written synchronously to the event stream. Having an event stream as your system of record proves to be a powerful concept.
From a non-functional perspective this allows persistence to scale out linearly in a ‘load balanced’ manner (we use topics rather than queues, so there is global ordering and hence no need to share state across different servers in the messaging layer). Providing write scalability is only one advantage though. Having everything persisted through a single event stream means you can hook anything you like into it. If you are interested in a certain type of event you can just subscribe with a message selector. If you want to create a copy of the store in a relational database you can just hook into the same stream. If you want a disaster recovery instance… you get the picture I’m sure.
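As a rough illustration of the write path, here is a hedged sketch using plain JMS (invented names; our actual messaging layer differs): the store publishes synchronously to a persistent topic, and only then applies and acknowledges the change.

```java
import javax.jms.Connection;
import javax.jms.DeliveryMode;
import javax.jms.JMSException;
import javax.jms.MessageProducer;
import javax.jms.Session;
import javax.jms.TextMessage;
import javax.jms.Topic;

// Sketch only: a write is 'accepted' once the synchronous publish to the
// persistent topic returns - the stream, not the store, is the system of record.
public class EventStreamWriter {
    private final Session session;
    private final MessageProducer producer;

    public EventStreamWriter(Connection connection, String topicName) throws JMSException {
        this.session = connection.createSession(false, Session.AUTO_ACKNOWLEDGE);
        Topic topic = session.createTopic(topicName);
        this.producer = session.createProducer(topic);
        producer.setDeliveryMode(DeliveryMode.PERSISTENT);
    }

    public void write(String event) throws JMSException {
        TextMessage message = session.createTextMessage(event);
        producer.send(message); // blocks until the broker has taken the message
        // only now is the change applied to the in-memory store and acknowledged
    }
}
```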
ODC Core efficiently joins normalised data: all distributed stores that support a degree of normalisation struggle if they need to join data elements that are not collocated with one another. They are forced to ship potentially large amounts of data across the network to compute the join. Sharding helps a little, but you can only shard by a single key, so there will always be elements that don’t end up collocated (because they have ‘crosscutting’ keys). We use a relatively novel approach to solving this problem: in short, we replicate data that does not shard. However, simply replicating that data would cause the cluster to run out of memory, as there would be too much replicated data on each node.
To get around this problem, when data is written to the store the system walks the object model, ensuring that all items the data ‘connects to’ are replicated. So we start out by replicating nothing. As data is written to the cluster we walk the domain model to make sure the ‘dimensions’ that data connects to are replicated. Most importantly, at any point in time data that is not ‘connected’ will not be replicated. This reduces the amount of replicated data by an order of magnitude, so that replication can be used for efficient joins with ‘dimensions’. If you’re interested in this pattern you can find out more about it here and here.
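In pseudo-Java the walk looks something like the sketch below. All names are invented, and the real implementation works against Coherence caches and our domain model, but the principle – replicate only what the written facts actually connect to – is the same:

```java
import java.util.ArrayDeque;
import java.util.Collection;
import java.util.Deque;
import java.util.HashMap;
import java.util.Map;

// Illustrative only: on each write we walk outwards from the new fact, copying
// any dimension it 'connects to' into the replicated cache. Dimensions never
// reached by any fact are never replicated.
public class ConnectedReplicator {
    interface Dimension {
        Object key();
        Collection<Dimension> references(); // the dimensions this one points at
    }

    private final Map<Object, Dimension> replicatedCache = new HashMap<>();

    public void onWrite(Collection<Dimension> referencedByFact) {
        Deque<Dimension> toVisit = new ArrayDeque<>(referencedByFact);
        while (!toVisit.isEmpty()) {
            Dimension d = toVisit.pop();
            if (replicatedCache.containsKey(d.key())) continue; // already connected
            replicatedCache.put(d.key(), d);                    // replicate it
            toVisit.addAll(d.references());                     // and keep walking
        }
    }
}
```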
ODC supports Object and Relational models through a single interface: ODC is primarily an object database. This is important because it represents a 2D domain model (a representation of the bank’s Logical Domain Model – something we hold very dear). We have a simple object-based query language which allows a user to query (filter, group etc.) by any element of any object in the store (the API is derived reflectively from the domain model). The language is SQL-like but has all the benefits of intellisense in your IDE. That is to say you can filter, group, select etc. on any getter, collection etc. that any of our objects expose. You can define which joins you would like to make to bring more data back, add predicate logic and so on.
In addition we support a basic JDBC driver, which means users can get at our data in rows and columns if they wish. We’d prefer that they didn’t, as rows and columns just don’t really work for a 2D domain model, but we also understand that lots and lots of tools want to interact with their data in SQL. The SQL adapter actually works in exactly the same way as the object-based interface; that is to say the information sent to the store is the same. We just have to do a little more work to present the data in a tabular form.
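For example, a reporting tool could hit the store through plain JDBC along these lines (the driver class, URL and table name here are invented for illustration):

```java
import java.sql.Connection;
import java.sql.DriverManager;
import java.sql.ResultSet;
import java.sql.Statement;

public class OdcJdbcExample {
    public static void main(String[] args) throws Exception {
        Class.forName("com.example.odc.jdbc.Driver"); // hypothetical driver
        try (Connection c = DriverManager.getConnection("jdbc:odc://host:1234");
             Statement s = c.createStatement();
             ResultSet rs = s.executeQuery(
                     "select book, count(*) from Trade group by book")) {
            while (rs.next()) {
                System.out.println(rs.getString(1) + ": " + rs.getLong(2));
            }
        }
    }
}
```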
ODC is continuously delivered: we’ve put a lot of work into continuously delivering our application suite, or at least something as close to continuous delivery as we can manage. The challenge is that ODC is quite big. Each environment runs around 450 processes with 50 different process definitions, and the database is around 2TB, which means it takes a long time to migrate (see the Future section below).
So why bother with continuous delivery? It’s really about how long it takes to get feedback on a problem. With this system in place we get feedback on changes with a real data set, in something that looks and runs identically to production. We get that feedback every day. The effort has gone into a series of ever-increasingly comprehensive tests. 20k unit and functional tests run before you check in (taking just under 20 mins). The MiniMe build migrates the database should any database changes be checked in. It does this on a cut-down dataset, which means it can do pretty much any migration in twenty to thirty minutes. If that passes, a full migration ensures that the code works with a fully populated data set. Finally, if all that passes, we rip down the almost-prod environment, release to it and start everything up again. If anything goes wrong we roll back using a database flashback. All in all a lot of pain, but that’s the world of databases in the terabytes. The luxury of seeing a new bit of work in a production-identical environment within a day is worth it though. The continuous delivery system is written in Gradle by Greg Gigon.
The future for ODC’s data store revolves around its ability to adapt to a changing world. Databases aren’t so good at that. When you have a database you need to understand your data before you store it. Part of moving fast is accepting that you can’t understand all your data at the get-go, however much you may wish to. Understanding data just takes time (and you get it wrong). The plan is to avoid these problems by using late binding to wrap a schema onto original, unaltered facts at runtime. This concept of the late-bound schema allows us to change our mind and map data late in the delivery cycle, because the unaltered facts always sit at its core. Doing this in a traditional schema-oriented store (like a database) isn’t possible, since you would have to back-populate any new additions. The schema is more like a view in a database, except that the view is over the data file as it was provided to the database, rather than some mapped version of it. Some big data technologies offer properties like this, but none we’ve come across offer it in the context of a statically typed language that can version data, provide consistent views and join entities that have disparate lifecycles. We see this step as an important move towards becoming the one store that a large number of systems can rely on.
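A dynamic proxy gives a feel for the late-binding idea: the raw fact is stored untouched, and a typed view is bound over it at read time. The sketch below is a toy with invented names – the real mechanism adds versioning, consistent views and joins across lifecycles:

```java
import java.lang.reflect.Proxy;
import java.util.HashMap;
import java.util.Map;

// Toy illustration of a late-bound schema: the fact is kept unaltered and a
// typed 'schema' is wrapped over it when it is read, not when it is stored.
public class LateBoundSchemaSketch {
    interface TradeView {
        String getBook();
        Long getNotional();
    }

    @SuppressWarnings("unchecked")
    static <T> T bind(Map<String, Object> rawFact, Class<T> view) {
        return (T) Proxy.newProxyInstance(
                view.getClassLoader(), new Class<?>[]{view},
                (proxy, method, args) -> {
                    // resolve getFoo() -> field "foo" at call time, not load time
                    String name = method.getName().replaceFirst("^get", "");
                    String field = Character.toLowerCase(name.charAt(0)) + name.substring(1);
                    return rawFact.get(field);
                });
    }

    public static void main(String[] args) {
        Map<String, Object> fact = new HashMap<>(); // the unaltered fact, as provided
        fact.put("book", "EquityDeriv");
        fact.put("notional", 1000000L);
        TradeView trade = bind(fact, TradeView.class); // schema applied on read
        System.out.println(trade.getBook() + " / " + trade.getNotional());
    }
}
```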
The higher level vision (which is the vision of our CIO) is a data oriented architecture in which services are deployed and run in a cloud like environment that is ‘preloaded’ with all the bank’s primary data. That is to say that services running in this environment utilise only centralised persistence for the bank’s core facts.
The team are split between London and India. There is a strong influence from the software industry and that goes for the work as well as the ethos. We don’t always agree (lots of strong characters) but we always get along. If you are interested some of us mapped out what we value most here a while back.
We practice something that is a little bit like agile. We work iteratively. We write lots of tests. We keep the build time down. But we’re aging slightly which means we don’t pair as often as we used to (but we do still pair). Iterations overhang a little too often but hopefully you can forgive us for that.
So if you’re looking for work because you want to pay the bills, there are better teams out there. If you’ve chosen a life in software because it’s something that you find yourself musing about in idle moments and excited about when you wake in the morning, then it could be for you.
If you’d like to find out more just email me: benjamin[dot]stopford[at]rbs.com
- Intel’s new MIC ‘Knights Corner’ coprocessor (in the Intel Xeon Phi line) is targeted at the high concurrency market, previously dominated by GPGPUs, but without the need for code to be rewritten into Cuda etc (note Knights Ferry is the older prototype version).
- The chip has 64 cores and 8GB of RAM with a 512-bit vector engine. Clock speed is ~1.1GHz and each core has a 512KB L1 cache. The Linux kernel runs on two 2.2GHz processors.
- It comes on a card that drops into a PCI slot so machines can install multiple units.
- It uses a MESI protocol for cache coherence.
- There is a slimmed down linux OS that can run on the processor.
- Code must be compiled to two binaries, one for the main processor and one for Knights Corner.
- Compilers are currently available only for C++ and Fortran. Only Intel compilers at present.
- It’s on the cusp of being released (Q4 this year) for NDA partners (though we – GBM – have access to one off-site at Maidenhead). Due to be announced at the Supercomputing conference in November(?).
- KC is 4-6 GFLOPS/W – which works out at 0.85-1.8 TFLOPS for double precision.
- It is expected to be GA Q1 ‘13.
- It’s a large ‘device’: the wafer is a 70mm-square form factor!
- Access to a separate board over PCI is a temporary step. Expected that future versions will be a tightly-coupled co-processor. This will also be on the back of the move to the 14nm process.
- A single host can (depending on OEM design) support several PCI cards.
- Similarly, power draw and heat dispersal are OEM decisions.
- Reduced instruction set e.g. no VM support instructions or context-switch optimisations.
- Performance now being expressed as GFlops per Watt. This is a result of US Government (efficiency) requirements.
- A single machine can go faster than a room-filling supercomputer of ’97 – ASCI Red!
- The main constraint to doing even more has been the limited volume production pipeline.
- Pricing not announced, but expected to be ‘consistent with’ GPGPUs.
- Key goal is to make programming it ‘easy’ or rather: a lot easier than the platform dedicated approaches or abstraction mechanisms such as OpenCL.
- Once booted (probably by a push of an OS image from the main host’s store to the device) it can appear as a distinct host over the network.
- The key point is that Knights Corner provides most of the advantages of a GPGPU but without the painful and costly exercise of migrating software from one language to another (that is to say it is based on the familiar x86 programming model).
- Offloading work to the card is instructed through the offload pragma or offloading keywords via shared virtual memory.
- Computation occurs in a heterogeneous environment that spans both the main CPU and the MIC card which is how execution can be performed with minimal code changes.
- There is a reduced instruction set for Knights Corner but the majority of the x86 instructions are there.
- There is support for OpenCL, although Intel are not recommending that route to customers due to performance constraints.
- Real-world testing has shown a provisional 4x improvement in throughput using an early version of the card running some real programs. However, results from a sample test show perfect scaling. Some restructuring of the code was necessary. Not huge, but not insignificant.
- There are currently only C++ and Fortran interfaces (so not much use if you’re running Java or C#).
- You need to remember that you are on PCI Express so you don’t have the memory bandwidth you might want.
Other things worth thinking about:
Thanks to Mark Atwell for his help with this post.
Michael Stal wrote a nice article about our Progressive Architectures talk from this year’s QCon. The video is up too.
Read the article here.
Watch the video here.
A big thanks to Fuzz, Mark and Ciaran for making this happen.
I really enjoyed Harvey’s ‘POF Art’ talk at the Coherence SIG. Slides are here if you’re into that kind of thing POF-Art.
What if, more than anything else, we valued helping each other out? What if this was the ultimate praise, not the best technologists, not an ability to hit deadlines, not production stability. What if the ultimate accolade was to consistently help others get things done? Is that crazy? It’s certainly not always natural; we innately divide into groups, building psychological boundaries. Politics erupts from trivial things. And what about the business? How would we ever deliver anything if we spent all our time helping each other out? But maybe we’d deliver quite a lot.
If helping each other out were our default position wouldn’t we be more efficient? We’d have less politics, less conflict, fewer empires and we’d spend less money managing them.
We probably can’t change who we are. We’ll always behave a bit like we do now. Conflict will always arise and it will always result in problems; we all have tempers, we play games, we frustrate others and we react to the slights and injustices.
But what if it were simply our default position? Our core value. The thing we fall back on. It wouldn’t change the world, but it might make us a little bit more efficient.
… right back to the real world
Valve handbook. Very cool:
Jon ‘The Gridman’ Knight has finally dusted off his keyboard and entered the blogosphere with a fantastic post on how we implement a reliable version of Coherence’s putAll() over here on ODC. One to add to your feed if you are interested in all things Coherence.
- Intel managing to squeeze 50 cores on a single chip, breaking through the teraflop boundary as they do so: Brier Dudley’s Blog | Wow: Intel unveils 1 teraflop chip with 50-plus cores | Seattle Times Newspaper
- RISC architectures have had a renaissance thanks largely to the needs of the mobile sector, could their low power consumption make them a serious contender for enterprise space? x86 Faces Unexpected RISC Competition
- AMD announce 4 memory channels allowing massive addressable spaces up to 364GB per CMP : AMD’s Interlagos and Valencia finally emerge
- Anyone who follows my blog will know of my belief in large address spaces reshaping the landscape, certainly for enterprise applications. This article echoes these views: Megatrend: Cheap RAM Reshaping All of Computing | Dr Dobb’s
- IBM’s Lime is an interesting approach to simplifying the programming of secondary devices. See Lime paper and the related Liquid Metal project.
- JVM on FPGA: JOP: A Tiny Java Processor Core for FPGA
- An interesting paper on using FPGA for Monte Carlo Simulation: FPGA for monte carlos
High Performance Java
- An excellent talk about using memory efficiently in Java applications; the costs are often higher than we think. It includes clear descriptions of the footprint of all Java objects and utilities: Building Memory Efficient Java Applications
- There has been a flurry of activity coming from Azul Systems recently, most notably the release of Zing, their pauseless garbage collector. Gil Tene’s talk about the state of the art in GC from QCon SF 2011 is one of the best I’ve seen (QConSF 2011: State of the Art in Garbage Collection).
- Azul have also recently released JHiccup. An interesting utility that measures operating system stalls. Java Developer Tools: jHiccup Java Performance Analysis
- Charles Nutter’s comments on his favourite JVM flags including my favourite (-XX:+PrintOptoAssembly): Headius: My Favorite Hotspot JVM Flags
Distributed Data Storage
- A great paper from VLDB describing an approach for balancing replication and partitioning, something close to my own heart: Schism: a Workload-Driven Approach to Database Replication and Partitioning
- Hasso Plattner (the P in SAP) wrote this paper, which provides an insightful view of where he believes the field should be going (and of course SAP’s solution, Hana): Hasso Plattner on In-Memory OLAP & OLTP
- I enjoyed watching this talk about Mongo: InfoQ: Scaling with MongoDB
- An entertaining article from the Economist about David Gelernter’s predictions of the future of computing: Brain scan: Seer of the mirror world | The Economist
- Could Prezi really dislodge PowerPoint? Prezi
- Double Loop Learning – a different view on organizational learning. Chris Argyris.
- Worth reading if you are not familiar with the idea already: CQRS
- An interesting twist on the traditional storyboard approach Our Story Board is Better Than Yours… I’m a big fan of replacing estimation with uniformly sized stories.
- Booked your next holiday? What about a Code Retreat with Corey Haines
High Performance Java
- Not exactly lightweight reading but one of the most detailed and influential papers on tuning your software for processing efficiency: What Developers Should Understand About Memory
- If you read the above and want to put some of it into action then VTune should be your next port of call. Diagnostic software for CPU cache hits etc: VTune™ Amplifier XE 2011 from Intel – Intel® Software Network
- When it really won’t go any faster, look at the Assembler: Deep dive into assembly code from Java | Java.net
- In anticipation of G1 (in case they ever get it finished) here’s the original paper with anticipated performance figures: G1 paper with figures
- A different approach to GC using processor specific minor collections (in Haskell): Multicore Garbage Collection with Local Heaps
Distributed Data Storage:
- The new Oracle NoSQL database – this is the best article I’ve read summarising it’s position in the market: DBMS Musings: Overview of the Oracle NoSQL Database
- The official Oracle NoSQL Whitepaper: Oracle NoSQL Database White Paper
- An interesting approach to data storage: an FPGA based data warehouse: FPGA Data Warehouse
- Google’s interesting SQL wrapped MapReduce framework: Tenzing A SQL Implementation On The MapReduce Framework
- The Actors Model – just in case you’re not familiar with it: Actors model for distribution
- Gluster – an open source distributed file system: Gluster
- Running Cuda natively on x86 processors: Running CUDA Code Natively on x86 Processors | Dr Dobb’s Journal
- Thinking about using 64bit JVMs with compressed pointers : 32-bit or 64-bit JVM? How about a Hybrid?
- Using different caches for read and write. A sensible pattern for Cohernece implementation: Alexey Ragozin’s Blog
- OCZ Z-Drive – an interesting and competitively priced alternative to FusionIO:
- The architecture of the transputer. An interesting reflection on a couple of Bristol’s finest exports (other than Portishead): the Transputer and the occam programming language. David May, parallel processing pioneer • reghardware
- Is your brain like an Iphone? Is Your Brain Like an iPhone? Which App is Running Now? – Novato, CA Patch
- Just be still for once: No Shame in Stillness « Under the Apricot Tree
- Of the huge amount of writing about Steve Jobs I thought the Economist’s coverage was the best: Steve Jobs: The magician | The Economist
- Scott Marcar’s thought-provoking dialogue on technology through a financial crisis: The Long Haul: Scott Marcar Leads RBS’ Tech Team Through the Financial Crisis – WatersTechnology.com
- Short but thought provoking article on company culture: Why You Should Question Your Culture – Ron Ashkenas – Harvard Business Review
Here are the slides for the talk I gave at JavaOne:
Balancing Replication and Partitioning in a Distributed Java Database
This session describes the ODC, a distributed, in-memory database built in Java that holds objects in a normalized form in a way that alleviates the traditional degradation in performance associated with joins in shared-nothing architectures. The presentation describes the two patterns that lie at the core of this model. The first is an adaptation of the Star Schema model, used to hold either replicated or partitioned data, depending on whether the data is a fact or a dimension. In the second pattern, the data store tracks arcs on the object graph to ensure that only the minimum amount of data is replicated. Through these mechanisms, almost any join can be performed across the various entities stored in the grid, without the need for key shipping or iterative wire calls.
- Original talk, given at QCon London, which is more Coherence specific
- [pptx-9MB] [pdf-48MB]
- A related post documenting the main points covered in the talk
I’m heading to JavaOne in October to talk about some of the stuff we’ve been doing at RBS. The talk is entitled “Balancing Replication and Partitioning in a Distributed Java Database”.
Is anyone else going?
Because the future will inevitably be in-memory databases:
- SAP (slightly weirdly) is leading the way with Hana
- SSD makes a new kind of database possible
- The move away from clusters is not restricted to the enterprise
- More drinking of the Hana Kool-Aid
- Fusion IO
- Phase Change Memory breakthrough at IBM
Other interesting stuff:
- Interesting retrospective on computing giants of the past and future (in typical Economist style)
- A mathematician’s lament
- The next generation of Map Reduce
- Where google may be going wrong
The LMAX guys have open-sourced their Disruptor queue implementation. Their stats show some significant improvements (over an order of magnitude) over standard ArrayBlockingQueues in a range of concurrent tests. Both interesting and useful.
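If you want a feel for the programming model, a minimal producer/consumer wiring looks something like the sketch below. Note it assumes the later 3.x-style API, which differs a little from the version as originally open-sourced:

```java
import com.lmax.disruptor.RingBuffer;
import com.lmax.disruptor.dsl.Disruptor;
import com.lmax.disruptor.util.DaemonThreadFactory;

public class DisruptorSketch {
    static final class ValueEvent { long value; } // pre-allocated, reused slots

    public static void main(String[] args) {
        int bufferSize = 1024; // ring size must be a power of two
        Disruptor<ValueEvent> disruptor =
                new Disruptor<>(ValueEvent::new, bufferSize, DaemonThreadFactory.INSTANCE);

        disruptor.handleEventsWith((event, sequence, endOfBatch) ->
                System.out.println("consumed: " + event.value));
        RingBuffer<ValueEvent> ring = disruptor.start();

        for (long i = 0; i < 10; i++) {
            long seq = ring.next();      // claim the next slot
            try {
                ring.get(seq).value = i; // write into the pre-allocated event
            } finally {
                ring.publish(seq);       // make it visible to consumers
            }
        }
        disruptor.shutdown();
    }
}
```

The key design point is visible even in this toy: events are pre-allocated in a power-of-two ring, so publishing claims a slot, writes in place and publishes a sequence number – no locks and no garbage on the hot path.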
The slides/video from my talk at QCon London have been put up on InfoQ.
An effort well worthy of its own post: http://www.christof-strauch.de/nosqldbs.pdf
- Nice talk covering optimising code in a single JVM: LMAX
- Biased locking in Hotspot: biased_locking_in_hotspot
- Good overview of caching: intel-cpu-caches
- Good overview of lock free algorithms: lock-free-algorithms
- Nice overview of the key NoSQL players: cassandra-vs-mongodb-vs-couchdb-vs-redis
- Google’s layering of ACID over BigTable (at least ACID inside a partition):
- Typically Economist: economist.com
Just a little plug for the 5th annual QCon London on March 7-11, 2011. There is a bunch of cool speakers including Craig Larman and Juergen Hoeller, as well as the obligatory set of ex-TW types. I’ll be doing a session on Going Beyond the Data Grid.
You can save £100 and give £100 to charity if you book with this code: STOP100
More discussions on the move to in memory storage:
- RAM is my friend
- LMAX – How to Do 100K TPS at Less than 1ms Latency
- The problems with ACID, and how to fix them without going NoSQL
- Basho Riak: An Open Source Scalable Data Store
- Facebook’s belief in HBase
- Numbers Everyone Should Know
- Google Dremel Paper
- Facebook’s New Year Performance Stats
I’ve been working on a medium sized data store (around half a TB) that provides high bandwidth and low latency access to data.
Caching and Warehousing techniques push you towards denormalisation but this becomes increasingly problematic when you move to a highly distributed environment (certainly if the data is long lived). We’ve worked on a model that is semi normalised whilst retaining the performance benefits associated with denormalisation.
The other somewhat novel attribute of the system is its use of Messaging as a system of record.
I’ll also be adding some more posts in the near future to flesh out how this all works.
Submissions are being accepted for RefTest at IEEE International Conference on Testing, Verification and Validation.
Submissions can be short (2-page) or full-length conference papers. The deadline is Jan 4th 2011.
Full details are here.