‘Data Tech’

Best of VLDB 2014

Sunday, March 8th, 2015

REWIND: Recovery Write-Ahead System for In-Memory Non-Volatile Data-Structures

An interesting paper on write-ahead logging for persistent in-memory media. Recent non-volatile memory (NVM) technologies, such as PCM, STT-MRAM and ReRAM, can act as both main memory and storage. This has led to research into NVM programming models, where persistent data structures remain in memory and are accessed directly through CPU loads and stores. REWIND outperforms state-of-the-art approaches for data structure recoverability as well as general purpose and NVM-aware DBMS-based recovery schemes by up to two orders of magnitude.

Storage Management in Asterix

Asterix is an academically established hierarchical store. It’s now an Apache Incubator project. It utilises sets of LSM structures, tied transactionally together. Additional index structures can also be formed, for example R-Trees.

Staring into the Abyss: An Evaluation of Concurrency Control with One Thousand Cores

As the number of cores increases, the complexity of coordinating competing accesses to data will likely diminish the gains from increased core counts. We conclude that rather than pursuing incremental solutions, many-core chips may require a completely redesigned DBMS architecture that is built from ground up and is tightly coupled with the hardware.

E-Store: Fine-Grained Elastic Partitioning for Distributed Transaction Processing Systems

OLTP DBMSs need to be elastic; that is, they must be able to expand and contract resources in response to load fluctuations and dynamically balance load as hot tuples vary over time. This paper presents E-Store, an elastic partitioning framework for distributed OLTP DBMSs. It automatically scales resources in response to demand spikes, periodic events, and gradual changes in an application’s workload. E-Store addresses localized bottlenecks through a two-tier data placement strategy: cold data is distributed in large chunks, while smaller ranges of hot tuples are assigned explicitly to individual nodes. This is in contrast to traditional single-tier hash and range partitioning strategies.

Large-Scale Distributed Graph Computing Systems: An Experimental Evaluation

Good coverage of different systems for distributed graph computation, including reflection on why some work better than others. Interesting.

Faster Set Intersection with SIMD instructions by Reducing Branch Mispredictions

This paper describes our new algorithm to efficiently find set intersections with sorted arrays on modern processors with SIMD instructions and high branch misprediction penalties. The key insight for our improvement is that we can reduce the number of costly hard-to-predict conditional branches by advancing a pointer by more than one element at a time. Although this algorithm increases the total number of comparisons, we can execute these comparisons more efficiently using the SIMD instructions and gain the benefits of the reduced branch misprediction overhead. Also see Improving Main Memory Hash Joins on Intel Xeon Phi Processors: An Experimental Approach
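The "advance by more than one element" idea can be illustrated with a scalar sketch. This is not the paper's algorithm, just the general block-wise shape of it under my own assumptions: the function name `intersect_blockwise` and the `block` parameter are hypothetical, inputs are sorted lists of unique values, and the all-pairs block comparison (done here with a Python set) stands in for what SIMD compare instructions would do in hardware.

```python
def intersect_blockwise(a, b, block=4):
    """Intersect two sorted lists of unique values, one block at a time."""
    result = []
    i = j = 0
    while i < len(a) and j < len(b):
        block_a = a[i:i + block]
        block_b = b[j:j + block]
        # All-pairs comparison of the two blocks. On real hardware this is the
        # step that SIMD compares would perform without per-element branches.
        members_b = set(block_b)
        result.extend(x for x in block_a if x in members_b)
        # A single comparison of the block maxima decides which whole block
        # (or both) can be skipped, so the pointer moves `block` elements.
        last_a, last_b = block_a[-1], block_b[-1]
        if last_a <= last_b:
            i += len(block_a)
        if last_b <= last_a:
            j += len(block_b)
    return result


print(intersect_blockwise([1, 3, 5, 7, 9, 11], [3, 4, 5, 6, 11, 12]))
```

More comparisons are made than in a classic merge-style intersection, but there is only one data-dependent branch decision per block rather than one per element, which is the trade-off the abstract describes.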

Memory-Efficient Hash Joins

New hash tables for joins, and a hash join based on them, that consume far less memory and are usually faster than recently published in-memory joins. The join mechanism is not restricted to outer tables that fit wholly in memory. The key to this hash join is a new concise hash table (CHT), a linear probing hash table that has 100% fill factor, and uses a sparse bitmap with embedded population counts to almost entirely avoid collisions. This bitmap also serves as a Bloom filter for use in multi-table joins. Our experiments show that we can reduce the memory usage by one to three orders of magnitude, while also being competitive in performance.
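To make the bitmap-plus-population-count idea concrete, here is a much simplified sketch. It is not the paper's implementation: the class name `ConciseHashTable`, the 32-bit word size, the `bits_per_key` parameter and the use of a plain dict for colliding keys are all assumptions of mine, and keys are assumed to be unique. The point is only to show how a sparse bitmap can index a completely dense array.

```python
class ConciseHashTable:
    """Read-only table: a sparse bitmap indexes a dense (100% full) array."""

    WORD = 32  # bits per bitmap word (assumed for illustration)

    def __init__(self, items, bits_per_key=8):
        items = list(items)
        n_words = -(-max(self.WORD, bits_per_key * len(items)) // self.WORD)
        self.n_bits = n_words * self.WORD
        self.words = [0] * n_words
        self.overflow = {}  # keys whose bucket was already taken
        placed = {}         # bucket number -> (key, value)
        for key, value in items:
            bucket = hash(key) % self.n_bits
            if bucket in placed:
                self.overflow[key] = value
            else:
                placed[bucket] = (key, value)
                self.words[bucket // self.WORD] |= 1 << (bucket % self.WORD)
        # Population count of all words before each word.
        self.prefix = []
        total = 0
        for word in self.words:
            self.prefix.append(total)
            total += bin(word).count("1")
        # Dense array: entries stored in bucket order with no gaps.
        self.array = [placed[bucket] for bucket in sorted(placed)]

    def get(self, key, default=None):
        bucket = hash(key) % self.n_bits
        word, offset = divmod(bucket, self.WORD)
        if (self.words[word] >> offset) & 1:
            # Array position = number of set bits before this bucket.
            below = self.words[word] & ((1 << offset) - 1)
            stored_key, value = self.array[self.prefix[word] + bin(below).count("1")]
            if stored_key == key:
                return value
        return self.overflow.get(key, default)


table = ConciseHashTable((i, i * i) for i in range(1000))
print(table.get(12), table.get(5000))
```

A lookup that finds its bit unset can stop immediately, which is also why the bitmap can double as a cheap membership filter.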

General Incremental Sliding-Window Aggregation

This paper presents Reactive Aggregator (RA), a new framework for incremental sliding-window aggregation. RA is general in that it does not require aggregation functions to be invertible or commutative, and it does not require windows to be FIFO. We implemented RA as a drop-in replacement for the Aggregate operator of a commercial streaming engine.
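As a rough illustration of why invertibility is not needed, the sketch below keeps partial aggregates in a binary tree over the window slots, so changing or evicting any slot only recomputes the path to the root. This is a generic tree aggregator under my own naming (`WindowAggregator`, `set`, `clear`), not the RA implementation from the paper.

```python
class WindowAggregator:
    """Binary tree of partial aggregates over a fixed number of window slots."""

    def __init__(self, capacity, combine, identity):
        size = 1
        while size < capacity:
            size *= 2
        self.size = size
        self.combine = combine    # any associative function
        self.identity = identity  # value of an empty slot
        self.tree = [identity] * (2 * size)

    def set(self, slot, value):
        i = slot + self.size
        self.tree[i] = value
        i //= 2
        while i >= 1:
            # Left operand always comes before right, so slot order is kept
            # and non-commutative functions still work.
            self.tree[i] = self.combine(self.tree[2 * i], self.tree[2 * i + 1])
            i //= 2

    def clear(self, slot):
        # Eviction needs no inverse: just reset the slot and recombine.
        self.set(slot, self.identity)

    def result(self):
        return self.tree[1]


window = WindowAggregator(4, max, float("-inf"))
for slot, value in enumerate([5, 9, 2, 7]):
    window.set(slot, value)
window.clear(1)          # evict the 9, out of FIFO order
print(window.result())   # max is not invertible, yet the answer is still exact
```

Each update costs O(log n) combines, whatever the aggregation function is and whichever slot leaves the window.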

Persistent B+-Trees in Non-Volatile Main Memory

A look at the application of B+-tree optimisations to provide efficient retrieval algorithms for Phase Change Memory structures.

Understanding the Causes of Consistency Anomalies in Apache Cassandra

A recent paper on benchmarking eventual consistency showed that when a constant workload is applied against Cassandra, the staleness of values returned by read operations exhibits interesting but unexplained variations when plotted against time. In this paper we reproduce this phenomenon and investigate in greater depth the low-level mechanisms that give rise to stale reads. We show that the staleness spikes exhibited by Cassandra are strongly correlated with garbage collection, particularly the “stop-the-world” phase which pauses all application threads in a Java virtual machine. We show experimentally that the staleness spikes can be virtually eliminated by delaying read operations artificially at servers immediately after a garbage collection pause. In our experiments this yields more than a 98% reduction in the number of consistency anomalies that exceed 5ms, and has negligible impact on throughput and latency.

MRCSI: Compressing and Searching String Collections with Multiple References

Efficiently storing and searching collections of similar strings, such as large populations of genomes or long change histories of documents from Wikis, is a timely and challenging problem. We propose three heuristics for computing Multi-Reference Compressed Search Indexes, achieving increasing compression ratios. Compared to state-of-the-art competitors, our methods target an interesting and novel sweet-spot between high compression ratio versus search efficiency.

Posted in Blog, Data Tech | No Comments »

A Guide to building a Central, Consolidated Data Store for a Company

Tuesday, December 2nd, 2014

Quite a few companies are looking at some form of centralised operational store, data warehouse, or analytics platform. The company I work for set out to build a centralised scale-out operational store using NoSQL technologies five or six years ago and it’s been an interesting journey. The technology landscape has changed a lot in that time, particularly around analytics, although that was not our main focus (but it will be an area of growth). Having an operational store that is used by many teams is, in many ways, harder than an analytics platform as there is a greater need for real-time consistency. The below is essentially a brain dump on the subject.

On Inter-System (Enterprise) Architecture

Assuming you use some governed enterprise messaging schema: if it ends up just being an intermediary for replicating from one database to another, then you’ll be in for trouble in the long run (see here). Make something the system of record. Replicate that as a stream of changes. Ideally, make it a database replication stream via GoldenGate or similar so it’s exactly what is in the source database.
Following this – clone data from a database transaction log, rather than extracting it and manually transforming to a wire format. The problem here is the word ‘manually’.
Prefer direct access at source. Use data virtualisation if you can get it to work.
Federated approaches, which leave data sets in place, will get you there faster if you can get all the different parts of the company to conform. That itself is a big ask though, but a good technical strategy can help. Expect to spend a lot on operations and automation lining disparate datasets up with one another.
When standardising the persisted representation don’t create a single schema upfront if you can help it. You’ll end up in EDW paralysis. Evolve to it over time.
Start with disparate data models and converge them incrementally over time using schema-on-need (and yes you can do this relationally, it’s just a bit harder).
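A minimal sketch of what schema-on-need can look like in practice: records are stored exactly as they arrive, and a small set of per-source bindings maps only the fields somebody currently needs onto common names at read time. Everything here (`RAW_STORE`, `BINDINGS`, `read`, `query`, and the sample records) is hypothetical and purely illustrative.

```python
# Records are kept in whatever shape each source produced them.
RAW_STORE = [
    {"source": "crm", "payload": {"cust_id": "C1", "fullName": "Ada Lovelace", "tier": "gold"}},
    {"source": "billing", "payload": {"customer": "C1", "name": "A. Lovelace", "balance": 120.5}},
]

# Bindings grow over time: common field name -> field name used by each source.
BINDINGS = {
    "crm": {"customer_id": "cust_id", "name": "fullName"},
    "billing": {"customer_id": "customer", "name": "name"},
}


def read(record, bindings=BINDINGS):
    """Project a raw record onto the common fields bound so far."""
    mapping = bindings[record["source"]]
    payload = record["payload"]
    view = {field: payload.get(raw_name) for field, raw_name in mapping.items()}
    view["_raw"] = payload  # the original stays reachable
    return view


def query(store, **criteria):
    """Filter on common field names, whatever the source called them."""
    views = (read(record) for record in store)
    return [v for v in views if all(v.get(k) == val for k, val in criteria.items())]


print([v["name"] for v in query(RAW_STORE, customer_id="C1")])

# Converging a further field later is a binding change, not a data reload.
BINDINGS["billing"]["balance"] = "balance"
print(read(RAW_STORE[1])["balance"])
```

The stored data never changes when a new field is converged; only the bindings do, which is what makes the approach incremental.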

On building a Multi-tenanted Read-Write store

Your goal, in essence, is to appear like a single store from the outside but with performance and flexibility that simulates (or is) a dedicated instance for each customer. It works best to think federated from the start and build centralisation in, not the other way around.
Utilise off the shelf products as much as possible. NoSQL products, in particular, are often better suited to this use case, but know their limitations too (see technology choice later)
Do everything you can to frame your use case to not be everything a traditional database has to do, at least holistically, but you will probably end up having to do the majority of it anyway, at least from the point of view of a single customer.
Question whether you really need a synchronous consolidation point for writes. It’s much easier to scale asynchronous replicas.
Use both sharding and read-replicas to manage query load. Sharding scales key-value read and write throughput linearly, replication scales complex (non-key-value) query load linearly (at the cost of write performance if it’s synchronous). You need primitives for both sharding and replication to scale non-trivial workloads.
Help yourself by grouping actors as either Read-Only or Read-Write for different data sets. Read-Only users can generally operate on an asynchronous dataset. This removes them from the global-write-path and hence avoids them contributing to the bottleneck that forms around it. Again, question whether you really need a single consolidation point for writes.
This is important enough to restate: leverage both sharding and replication (both sync and async). Make async the default. Use sharding + synchronous replicas to scale complex query load on read-write datasets. Use async for everything else. Place replicas on different hardware to provide resource isolation. Use this to create a ‘store-of-stores’ model that mixes replicas (for workload isolation) with sharding (for scale out in a replica).
Include a single event stream (log); one that exactly represents the entire stream of state. This should serve both as your asynchronous replication stream, but also as a basis for notification so you can listen as easily as you can query.
If you provide notifications on asynchronous replicas use a proxy, located on each replica, to republish events so that the read and notification ‘views’ line up temporally. This allows clients to listen to the replicas they are reading from without race conditions between events and data being present in a replica. A prerequisite for this is consistent timestamping (covered below).
Leverage schema-on-need. This is a powerful concept for a centralised store as it provides an incremental approach for conformity. It gets you there far faster than schema-upfront approaches. I cannot overstate how useful this is; it underpins concepts like the Data Lake.
However, if you take the schemaless/on-need route be warned: historic data will need to be ‘migrated’ as the schema of the data changes, else programs will have to support different schemas or simply won’t work with old data. You can do this with different sets of ‘late-bindings’ but the eventual migration is always needed so make sure you’re tooled up for this.
So… provision a mechanism for schema migration, else new programs won’t work with older data (note that schema migration doesn’t imply an upfront schema. It does imply a schema for each entity type though).
Accept that all non-freeform data has a schema itself, whether you declare it up-front, on-need or not-at-all.
Leverage the difference between query-schema and data-schema (query- being a subset of the data-schema) to minimise coupling to the store itself (all stores which utilise indexes will need a query schema as a minimum).
Even if you are schemaless, grow some mandatory schema over time, as data stabilises. Schemas make it much easier for you to manage change when you have more customers.
Whether you have a schema or not, data will change in a non-backwardly compatible way over time. Support this with two schemas (or data sets) visible concurrently to allow customers to upgrade using a rolling window rather than individual, lock-step releases.
If you have to transform data on the way in, keep hold of the original in the store and give it the same key/versioning/timestamping so you can refer back to it. Make this original form a first class citizen.
Apply the single writer principle wherever possible so only one source masters a certain dataset. Often this won’t be possible but do it wherever you can. It will allow you to localise/isolate their consistency concerns and leverage asynchronicity where consumption is off the write path.
Don’t mix async inputs (e.g. messages that overwrite) with synchronous updates (edits) on the same entity. At best people will get very confused. If you have to mix them, hold them separately and version each. Expose this difference between these updates/overwrites on your API so they can at least be specified declaratively to the user.
Leverage the fact that, in a collaborative store, atomicity is a requirement but consistency can be varied. That is to say that different observers (readers not writers) can lock into different temporal points and see an atomic stream of changes. This alleviates the need for a single, global synchronisation on read. This only works if systems don’t message one another directly, else you’ll get race conditions, but it releases you from global transactional locks and that’s often worth the tradeoff, particularly if you’re crossing data-centre boundaries.
Make data immutable. Timestamp and version everything (validFrom-validTo etc). Build this into the very core of the system. Flag the latest record so you don’t always have to read all versions to get the latest. Keep all changes if you can afford the storage. It will allow you to look back in time. But know that temporal indexes are the least-efficient / most-complex-to-manage index type (generally require temporal partitioning).
Applying time consistently in a distributed environment requires synchronisation on a central clock (don’t rely on NTP). As a central clock is not always desirable, consider using epochs (consistent periods) which are pushed to each node so as to define global read-consistent periods without explicit synchronisation (but at a coarser granularity of ‘tick’). See here.
Don’t fall into the relational trap of splitting entities just because they are polystructured and don’t fit in a single row. Hold entities separately only where the real-world entities they represent vary independently.
In tension with that, don’t denormalise data from different sources on the write path (i.e. using aggregates), if those aggregates contain many->1 relationships that do change independently. It will make writing atomically more difficult as writes must update multiple denormalised entities. Prefer joins to bring together the data at runtime or use aggregates in reporting replicas.
Provide, as a minimum, multi-key transactions in the context of a master replica. This will require synchronisation of writes, but it is possible to avoid synchronisation of reads.
Access to the data should be declarative (don’t fall into the trap of appending methods to an API to add new functionality). Make requests declarative. Use SQL (or a subset of) if you can.
Know that building a platform used by different teams is much harder than building a system. Much of the extra difficulty comes from peripheral concerns like testing, automation, reliability and SLAs that compound as the number of customers grows.
Following from this, think about the customer development lifecycle early, not just your production system. Make environments self-service. Remember data movement is network limited and datasets will be large.
Testing will hurt you more and more as you grow. Provide system-replay functions to make testing faster.
Your value in a company is based mostly on a perception of your overall value. If you’re doing a platform-based data consolidation project you will inevitably restrict what your users can do somewhat. Focus on marketing and support to keep your users happy.
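The "make data immutable, timestamp and version everything" point above is easier to picture with a small sketch. This is a toy, single-process illustration under my own assumptions: `Version` and `VersionedStore` are hypothetical names, timestamps are plain integers supplied by the caller, and the only thing ever altered on an old record is the closing of its validity interval.

```python
from dataclasses import dataclass, replace
from typing import Any, Optional


@dataclass(frozen=True)
class Version:
    key: str
    value: Any
    version: int
    valid_from: int
    valid_to: Optional[int] = None  # None flags the latest version

    @property
    def is_latest(self):
        return self.valid_to is None


class VersionedStore:
    """Append-only store: a write adds a version, it never overwrites a value."""

    def __init__(self):
        self._history = {}  # key -> list of Version, oldest first

    def put(self, key, value, timestamp):
        versions = self._history.setdefault(key, [])
        if versions:
            if timestamp <= versions[-1].valid_from:
                raise ValueError("timestamps must increase per key")
            # Close the previous version's validity interval.
            versions[-1] = replace(versions[-1], valid_to=timestamp)
        versions.append(Version(key, value, len(versions) + 1, timestamp))

    def latest(self, key):
        # The latest record is found directly, without reading all versions.
        versions = self._history.get(key)
        return versions[-1] if versions else None

    def as_of(self, key, timestamp):
        # Look back in time: the version whose interval contains the timestamp.
        for v in reversed(self._history.get(key, [])):
            if v.valid_from <= timestamp and (v.valid_to is None or timestamp < v.valid_to):
                return v
        return None

    def history(self, key):
        return list(self._history.get(key, []))


store = VersionedStore()
store.put("trade-1", {"qty": 100}, timestamp=10)
store.put("trade-1", {"qty": 150}, timestamp=20)
print(store.as_of("trade-1", 15).value, store.latest("trade-1").value)
```

The linear scan in `as_of` is the part that becomes expensive at scale, which is where the temporal index and partitioning concerns mentioned above come in.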

On Technology Choice

Use off the shelf products as much as possible.
The relational model is good for data you author but not so good for data sourced from elsewhere (data tends to arrive polystructured so is best stored polystructured).
Be wary of pure in-memory products and impressive RAM-centric benchmarks. Note their performance as data expands onto disk. You always end up with more data than you expect and write amplification is always more than vendors state.
Accept that, despite the attraction of a polystructured approach, the relational model is a necessity in most companies, at least for reporting workloads, so the wealth of software designed to work with it (and people skilled in it) can be leveraged.
NoSQL technologies are stronger in a number of areas, most notably:
1. scalability for simple workloads,
2. polystructured data,
3. replication primitives,
4. continuous availability,
5. complex programmable analytics.
Relational technologies are stronger at:
1. keeping data usable for evolving programs over long periods of time,
2. transactional changes,
3. pre-defined schemas,
4. breadth of query function,
5. analytical query performance.
In effect this ends up being: use the right tool for the right job, refer to 5/6 with as few technologies as you can survive with.


An initial look at Actian’s ‘SQL in Hadoop’

Monday, August 4th, 2014

I had an exploratory chat with Actian today about their new product ‘SQL in Hadoop’.

In short it’s a distributed database which runs on HDFS. The company are bundling their DataFlow technology alongside this. DataFlow is a GUI-driven integration and analytics tool (think suite of connectors, some distributed functions and a GUI to sew it all together).

Neglecting the DataFlow piece for now, SQL in Hadoop has some obvious strengths. The engine is essentially Vectorwise (recently renamed ‘Vector’): a clever, single-node columnar database which evolved from MonetDB and touts the use of vectorisation as a key part of its secret sauce. Along with the usual columnar benefits comes the use of positional delta trees, which improve on the poor update performance seen in most columnar databases. There is also some clever cooperative scan technology, which was presented at VLDB a couple of years back, although they don’t seem to tout this one directly. Most notably, Vector has always had quite impressive benchmarks in both absolute and price-performance terms. I’ve always thought of it as the relational analytics tool I’d look to if I were picking up the tab.

The Vector engine (termed x100) is the core of Actian’s SQL in Hadoop platform. It’s been reengineered to use HDFS as its storage layer, which one has to assume will allow it to operate at a better price-performance point when compared to storage-array based MPPs. It has also had a distribution layer placed above it to manage distributed queries. This appears to leverage parts of the DataFlow cluster as well as YARN and some other elements of the standard Hadoop stack. It inherits some aspects of traditional MPPs such as the use of replication to improve data locality over declared, foreign key, join conditions. The file format in HDFS is wholly proprietary though, so it can’t be introspected directly by other tools.

So whilst it can be deployed inside an existing Hadoop ecosystem, the only benefit gained from the Hadoop stack, from a user’s perspective, is the use of HDFS and YARN. There is no mechanism for integrating MR or other tools directly with the x100 engines. As the file format is opaque the database must be addressed as an SQL appliance even from elsewhere within the Hadoop ecosystem.

Another oddity is that, by branching Vector into the distributed world the product directly competes with its sister product Matrix (a.k.a. ParAccel); another fairly accomplished MPP which Actian acquired a couple of years ago. If nothing else this leads to a website that is pretty confusing.

So is the product worth consideration?

Its most natural competition must be Impala. Impala however abstracts itself away from the storage format, meaning it can theoretically operate on any data in HDFS, although from what I can tell all practical applications appear to transform source files to something better tuned for navigation (most notably Parquet). Impala thus has the benefit that it will play nicely with other areas of the Hadoop stack: Hive, HBase etc. You won’t get any of this with the Actian SQL in Hadoop product, although nothing is to stop you running these tools alongside Actian on Hadoop, inside the same HDFS cluster.

Actian’s answer to this may be the use of its DataFlow product to widen the appeal to non-sql workloads as well as data ingestion and transformation tasks. The DataFlow UI does provide a high level abstraction for sewing together flows. I’ve always been suspicious of how effective this is for non-trivial workflows which generally involve the authoring of custom adapters and functions, but it obviously has some traction.

A more natural comparison might come from other MPPs such as Greenplum, which offers an HDFS-integrated version and ties in to the Pivotal Hadoop distribution. Comparisons with other MPPs (ParAccel, Netezza, Vertica etc.) are also reasonable if you are not restricted to HDFS.

So we may really be looking at a tradeoff between the breadth of the OS Hadoop stack vs. SQL compliance & raw performance. The benefits of playing entirely in a single ecosystem, like the Cloudera offering, are better integration between the tools, an open source stack which has a rich developer community, fewer issues of vendor lock-in and a much broader ecosystem of players (Drill, Storm, Spark, Sqoop and many more).

Actian on the other hand can leverage its database heritage: efficient support for the full SQL spec, ACID transactions and a performance baseline that comes from a relatively mature data warehousing offering where aggregation performance was the dominant driver.

As a full ecosystem it is probably fair to say it lacks maturity at this time, certainly when compared to Hadoop’s. In the absence of native connectivity with other Hadoop products it is really a database appliance on HDFS rather than a 1st class entity in the Hadoop world. But there are certainly companies pushing for better performance than they can currently get on existing HDFS infrastructure with the core Hadoop ecosystem. For this Actian’s product could provide a useful tool.

In reality the proof will be in the benchmarks. Actian claim order of magnitude performance improvements over Impala. Hadapt, an SQL on Hadoop startup which was backed by an ex-Vertica/academic partnership, was hammered by market pressure from Impala and was eventually bought by Teradata. The point being that the performance needs to justify breaking out of the core ecosystem.

There may be a different market though in companies with relationally-centric users looking to capitalise on the cheaper storage offered by HDFS. This would also aid, or possibly form a strategy away from siloed, SAN based products and towards the broadly touted (if somewhat overhyped) merits of Big Data and commodity hardware on the Cloud. Hype aside that could have considerable financial benefit.

Edit: Peter Boncz, who is one of the academics behind the original vector product, published an informative review with benchmarks here. There is also an academic overview of (a precursor to) the system here. Worth a read.


The Best of VLDB 2012

Sunday, October 28th, 2012

Here are some of the highlights of the 210 papers presented at VLDB earlier this year. You can find the full list here.

Solving Big Data Challenges for Enterprise Application Performance Management (here)

This paper includes a detailed analysis of the performance characteristics of Cassandra, HBase, Redis, VoltDB, MySQL Cluster and is the most comprehensive comparison I’ve come across.

From Cooperative Scans to Predictive Buffer Management (here)

Intriguing paper from the Vectorwise guys for improving IO efficiency under load. LRU/MRU caching policies are known to break down under large, concurrent workloads. SQL Server and DB2 both have mechanisms for sharing IO between queries (by attaching to an existing scan or throttling faster queries so that IO can be shared). The Cooperative Scans approach discussed here takes this a step further by incorporating an active buffer manager which scans use to register their interest in data. The manager then adaptively chooses which pages to load and pass to the various concurrent requests.
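A toy sketch of the "scans register interest, the manager decides what to load" idea follows. The class name `CooperativeBufferManager` and, importantly, the policy of loading whichever chunk the most scans are waiting for are my own simplifying assumptions, not the relevance functions used in the paper. Chunk ids are assumed to be integers.

```python
from collections import defaultdict


class CooperativeBufferManager:
    """Scans declare the chunks they need; the manager picks the load order."""

    def __init__(self):
        self.pending = defaultdict(set)     # chunk id -> scans still waiting for it
        self.delivered = defaultdict(list)  # scan name -> chunks in delivery order
        self.loads = 0                      # number of simulated disk reads

    def register(self, scan, chunks):
        for chunk in chunks:
            self.pending[chunk].add(scan)

    def run(self):
        while self.pending:
            # Assumed policy: load the chunk wanted by the most scans,
            # breaking ties in favour of the lowest chunk id.
            chunk = max(self.pending, key=lambda c: (len(self.pending[c]), -c))
            self.loads += 1
            # One load is handed to every scan that registered interest.
            for scan in sorted(self.pending.pop(chunk)):
                self.delivered[scan].append(chunk)


manager = CooperativeBufferManager()
manager.register("q1", range(0, 4))
manager.register("q2", range(2, 6))
manager.run()
print(manager.loads, dict(manager.delivered))
```

Two independent scans would read eight chunks here; sharing brings that down to six, at the price of each scan receiving its chunks out of order.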

There is another related paper at this conference SharedDB: Killing One Thousand Queries With One Stone (here)

Processing a Trillion Cells per Mouse Click (Google) (here)

Interesting paper from Google suggesting an alternative to the approach to column orientation taken in Dremel. PowerDrill uses a double-dictionary encoded column store where the encodings live largely in memory. Further optimisations are made at load time to ensure minimal access to persistent storage. This makes it more akin to column stores like ParAccel or Vectorwise, applied to analytical workloads (aggregates, group bys etc).

Can the elephants handle the NoSQL onslaught (here)

Another paper comparing the performance of Hadoop with a relational database (in a similar vein to the SIGMOD 09 paper DeWitt published previously here). I sympathise with the message – databases outperform Hadoop on small to medium workloads – but I hope that most people know that already. This time the comparison is with Microsoft’s SQL Server PDW (Parallel Data Warehouse). The choice of data sizes between 250GB and 16TB means that the study has the same failing as the previous SIGMOD one; it’s not looking at large dataset performance.

Interactive Query Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads (here)

A useful, empirically driven paper: Chen et al. studied the implementation of Hadoop at a number of companies, including Facebook, and provide detailed data sets. It hints at the current ‘elephant in the room’ that is Hadoop’s focus on batch-time over real-time performance (roll on Impala!). Having data at this level of granularity over a range of real time systems is in itself quite valuable. They note that 90% of jobs are small (resulting in MBs of data returned).

High-Performance Concurrency Control Mechanisms for Main-Memory Databases (here)

Proposes an optimistic MVCC method for in-memory concurrency control. The conclusion: single-version locking performs well only when transactions are short and contention is low; higher contention or workloads including some long transactions favor the multiversion methods, and the optimistic method performs better than the pessimistic one.

Blink and It’s Done: Interactive Queries on Very Large Data (here)

Blink is different to the mainstream database as it’s not designed to give you an exact answer. Instead you specify either error (confidence) or maximum time constraints on your query. The approach uses a number of sampling based strategies to achieve the required confidence level. There is a related paper: Model-based Integration of Past & Future in TimeTravel (here)
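To give a feel for querying with an error bound rather than for an exact answer, here is a small sketch that estimates a mean from a growing random sample and stops once the confidence interval is tight enough. It is a generic sampling illustration, not how the system in the paper is built; the function `approx_mean` and its parameters (`max_error`, `z`, `start`, `seed`) are hypothetical names of mine.

```python
import math
import random
import statistics


def approx_mean(data, max_error, z=1.96, start=100, seed=0):
    """Estimate the mean to within +/- max_error at roughly 95% confidence.

    Returns (estimate, error_margin, rows_read).
    """
    rows = list(data)
    if not rows:
        raise ValueError("no data")
    random.Random(seed).shuffle(rows)  # a prefix of a shuffle is a random sample
    n = max(2, min(start, len(rows)))
    while True:
        if n >= len(rows):
            # Ran out of data: the answer is exact.
            return statistics.fmean(rows), 0.0, len(rows)
        sample = rows[:n]
        # Normal-approximation confidence interval half-width.
        margin = z * statistics.stdev(sample) / math.sqrt(n)
        if margin <= max_error:
            return statistics.fmean(sample), margin, n
        n = min(len(rows), n * 2)  # not confident enough yet: read more rows


data = [float(i % 100) for i in range(100_000)]
estimate, margin, rows_read = approx_mean(data, max_error=1.0)
print(f"{estimate:.2f} +/- {margin:.2f} using {rows_read} of {len(data)} rows")
```

A time-bounded variant would follow the same loop but stop on a deadline and report whatever margin had been reached by then.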

Developing and Analyzing XSDs through BonXai (here)

This one struck a chord with me as I’m not the biggest fan of XSD. BonXai provides an expression-based rather than type-based approach to defining the data schema. More info here and here.

B+-tree Index Optimization by Exploiting Internal Parallelism of Flash-based Solid State Drives (here)

SSD performance increases (initially) with the number of concurrent executions (in stark contrast with magnetic drives). This paper looks into maximising this with the use of concurrent B-trees that utilise parallel IO. Useful research as flash is only going to get cheaper.

SCOUT: Prefetching for Latent Structure Following Queries (here)

I quite like the ideas in this paper around prefetching data based on a known structure (probably because it’s similar to some of the stuff we do).

Fast Updates on Read-Optimized Databases Using Multi-Core CPUs (here)

Addresses the problem some columnar architectures suffer where they accumulate writes in a separate partition, which must be periodically merged with the read-optimised main one.

FDB: A Query Engine for Factorised Relational Databases (here)

I hadn’t come across the idea of Factorised Databases before. An interesting concept. The paper demonstrates performance improvements over traditional methods for many-to-many join criteria.

Only Aggressive Elephants are Fast Elephants (here)

Interesting approach to indexing Hadoop that claims to improve both read and write performance. I couldn’t find the code though so couldn’t try it.

The Vertica Analytic Database: C-Store 7 Years Later (here)

A good summary of this mature shared-nothing, columnar database. They discuss their use of super projections over join indexes, due to the overheads associated with tuple construction and the verbosity of storing the associated rowids. There is a summary of the encoding types used as well as partitioning and locking strategies.

Muppet: MapReduce-Style Processing of Fast Data (here)

Whilst the majority of MapReduce commentary focuses on improving MR query performance, this paper looks at the problem of ingesting data quickly for high throughput, streaming workloads. The interesting approach focuses on data as streams (in and out) in association with a moving historical window (which they denote a slate). To me there seems to be a lot of similarity between this approach and the one taken by products like StreamBase and Cloudscale, but the authors differentiate themselves by being less schema oriented, more akin to the traditional MR style.

Serializable Snapshot Isolation in PostgreSQL (here)

Interesting paper on the implementation of serializable isolation using the snapshot model.

Other papers of note:

Minuet: A Scalable Distributed Multiversion B-Tree (here)
A Statistical Approach Towards Robust Progress Estimation (here)
Efficient Multi-way Theta-Join Processing Using MapReduce (here)
Avatara: OLAP for Web-scale Analytics Products (OLAP cubes over a NoSQL @LinkedIn) (here)
10 Year Best Paper Award: Approximate Frequency Counts over Data Streams (here)


Thinking in Graphs: Neo4J

Friday, August 17th, 2012

I enjoyed listening to Ian Robinson speak about Neo4J (slides are here Thinking in Graphs). The odd one out of the NoSQL pack, Neo4J is a fascinating alternative to your regular key value store. For me it’s about a different way of thinking about data simply because the relations between nodes are as much a part of the data model as the nodes are themselves. I am left wondering somewhat how one might apply this solution to the enterprise space, particularly finance. Multistep Monte Carlo springs to mind as it creates a large connected space but there is no real need to traverse that space retrospectively. There may be application in other simulation techniques though. The below is a paraphrased version of Ian’s words.

Today’s problems can be classified as a function of not only size but also connectedness and structure.

F(size, connectedness, structure)

The Relational model struggles to deal with each of these three factors. The use of sparsely populated tables in our databases and null checks in client side code allude to the unsuitability of this model.

NoSQL offers a solution. The majority of this fledgling field relies on the concept of a Map (Dictionary) in some way. First came simple key-value stores like Dynamo. Next came column-oriented stores like Cassandra and BigTable. Finally, Document Databases provide a more complex document model (for example JSON), with facilities for simple introspection.

Neo4J is quite different to its NoSQL siblings: A graph database that allows users to model data as a set of nodes and relationships. Once modelled the data can be examined based on its connectedness (i.e. how one node relates to others) rather than simply based on its attributes.

Neo4J uses a specific type of graph model termed a Property Graph: each node has associated attributes that describe its specificities. These need not be homogeneous (as they would in a relational or object schema). Further, the relationships between nodes are both named and directed. As such they can be used in search criteria to find relationships between nodes.

The Property Graph model represents a pragmatic trade off between the purity of a traditional graph database and what you might see in a document database. This can be contrasted with the other graph database models: In ‘Triple Stores’ every attribute is broken out as a separate node (this is a bit like third normal form for a graph database). Another alternative is Hypergraphs, where an edge can connect more than two nodes (see Ian’s slide to get a better understanding of this). Triple stores suffer from their fine-grained nature (I’m thinking binary vs red-black trees). Hypergraphs can be hard to apply to real world modelling applications as the multiplicity of relationships can make them hard to comprehend. The Property Graph model avoids the verbosity of triple stores and the conceptual complexity of Hypergraphs. As such the model works well for Complex, densely connected domains and ‘Messy’ data.

The fundamental attribute of the graph database is that Relationships are first class elements. That is to say querying relationships in a graph database is as natural as querying the data the nodes contain.

Neo4J, like many NoSQL databases, is schemaless. You simply create nodes and relate them to one another to form a graph. Graphs need not be connected and many sub-graphs can be supported.

A query is simply ‘parachuted’ into a point in the graph, from where it explores the local area looking for some search pattern. So for example you might search for the pattern A–>B–>C. The query itself can be executed either via a ‘traversal’ or using the Cypher graph language. The traversal method simply visits the graph based on some criteria. For example it might only traverse arcs of a particular type. Cypher is a more formal graph language that allows the identification of patterns within the graph.

Imagine a simple graph of two anonymous nodes with an arc between them:

O–>O

In Cypher this would be represented as:

A-[:connected_to]->B

Considering a more complex graph:

A–>B–>C, A–>C or A–>B–>C–>A

We can start to build up pattern matching logic over these graphs, for example A-[*]->B to represent that A is somehow connected to B (think regex for graphs). This allows the graph to be mined for patterns based on any combination of the properties, arc directions or name (type).
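To make the ‘A is somehow connected to B’ idea concrete, here is a minimal Python sketch of what a variable-length match has to do under the covers. It is not how Neo4J implements it: the node names, the arc list and the helper functions are all invented for illustration, and the search is a plain breadth-first walk over named, directed arcs.

```python
from collections import deque

# Hypothetical toy property graph: node properties plus named, directed arcs.
nodes = {
    "A": {"name": "Alice"},
    "B": {"name": "Bob", "age": 42},  # properties need not be homogeneous
    "C": {"name": "Carol"},
    "D": {"name": "Dan"},
}
# Each arc is (source, relationship type, target).
arcs = [
    ("A", "connected_to", "B"),
    ("B", "connected_to", "C"),
    ("C", "knows", "A"),
]


def neighbours(node, rel_type=None):
    """Follow outgoing arcs from node, optionally only those of one type."""
    return [dst for src, rel, dst in arcs
            if src == node and (rel_type is None or rel == rel_type)]


def is_connected(start, end, rel_type=None):
    """Breadth-first search: is there a directed path of one or more arcs?"""
    seen = set()
    queue = deque(neighbours(start, rel_type))
    while queue:
        current = queue.popleft()
        if current == end:
            return True
        if current in seen:
            continue
        seen.add(current)
        queue.extend(neighbours(current, rel_type))
    return False
```

Calling is_connected("A", "C") follows A to B to C, while passing rel_type="connected_to" restricts the walk to arcs of that name, which is the same kind of filtering a traversal does when it only follows arcs of a particular type.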

There are further Cypher examples here including links to an online console where you can interactively experiment with the queries. Almost all of the query examples and diagrams are generated from the unit tests used to develop Cypher. This means that the manual is always an accurate reflection of the current feature set.

Physical Characteristics:

The product itself is JVM based (the query language is written in Scala). There is also a RESTful HTTP API. It is fully transactional (ACID) and it is possible to override the transaction manager should you need to coordinate with an external transaction manager (for example because you want to coordinate with an external store). An object cache is used to store the entities in memory, with fall through to memory-mapped files if the dataset does not fit in RAM.

HA support uses a master-slave, replicated model (single master model). You can write to a slave (i.e. any node) and it will obtain a lock from the master. Lucene is the default index provider.

The team have several strategies for mitigating the impact of GC pauses, the most important being a GC resistant caching strategy. This assigns a certain amount of space in the JVM heap; it then purges objects whenever the maximum size is about to be reached, instead of relying on GC to make that decision. Here the competition with other objects in the heap, as well as GC pauses, can be better controlled since the cache gets assigned a maximum heap space usage. Caching is described in more detail here.
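The idea of a cache that polices its own size is easy to sketch. The Python below is only an illustration of the principle, not Neo4J’s code: the class name, the explicit size argument and the least-recently-used eviction order are all my assumptions, as the talk did not say which entries get purged first.

```python
from collections import OrderedDict


class BoundedCache:
    """Toy cache holding at most max_bytes of values, evicting least recently used first."""

    def __init__(self, max_bytes):
        self.max_bytes = max_bytes
        self.used_bytes = 0
        self._entries = OrderedDict()  # key -> (value, size)

    def put(self, key, value, size):
        if size > self.max_bytes:
            return  # too big to ever fit, so don't cache it
        if key in self._entries:
            self.used_bytes -= self._entries.pop(key)[1]
        # Purge before the budget would be exceeded rather than waiting for GC.
        while self.used_bytes + size > self.max_bytes:
            _, (_, evicted_size) = self._entries.popitem(last=False)
            self.used_bytes -= evicted_size
        self._entries[key] = (value, size)
        self.used_bytes += size

    def get(self, key):
        if key not in self._entries:
            return None  # caller falls through to the underlying store
        self._entries.move_to_end(key)  # mark as recently used
        return self._entries[key][0]
```

The point is that the cache, not the garbage collector, decides when something goes: the budget is checked on every put, so the cache’s share of the heap never grows past the limit it was given.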

Ian mentioned a few applications too:

Telcos: Managing the network graph: If something goes wrong they use the graph database to help predict where the problem likely comes from by simulating the network topology.
Logistics: parcel routing. This is a hierarchical problem. Neo4J helps by allowing them to model the various routes to get a parcel from its start to end locations. Routes change (and become unavailable).
Finally the social graph which is fairly self explanatory!

All round an eye-opening approach to the modelling and inspection of connected data sets!


A Brief Summary of the NoSQL World

Saturday, August 11th, 2012

James Phillips (co-founder of Couchbase) did a nice talk on NoSQL Databases at QCon:

Memcached – the simplest and original. Pure key value store. Memory focussed

Redis – Extends the simple map-like semantic with extensions that allow the manipulation of certain specific data structures, stored as values. So there are operations for manipulating values as lists, queues etc. Redis is primarily memory focussed.

Membase – extends the memcached approach to include persistence, the ability to add nodes, and backups on other nodes.

Couchbase – a cross between Membase and CouchDB. Membase on the front, CouchDB on the back. The addition of CouchDB means you can store and reflect on more complex documents (in JSON). To query Couchbase you need to write JavaScript mapping functions that effectively materialise the schema (think index) so that you can create a query model. Couchbase is CA not AP (i.e. not eventually consistent).

MongoDB – Uses BSON (a binary version of JSON which is open source but only really used by Mongo). Mongo is unlike Couchbase in that the query language is dynamic: Mongo doesn’t require the declaration of indexes. This makes it better at ad hoc analysis but slightly weaker from a production perspective.

Cassandra – Column oriented, key value. The values are split into columns which are pre-indexed before the information can be retrieved. Eventually consistent (unlike Couchbase). This makes it better for highly distributed use cases or ones where the data is spread over an unreliable network.

Neo4J – Graph oriented database. Much more niche. Not distributed.

There are obviously a few more that could have been covered (Voldemort, Dynamo etc.) but it was a good summary from James nonetheless.

Full slides/video can be found here.


ODC – A Distributed Datastore built at RBS

Thursday, August 9th, 2012

[Edit – 2014 – a more up to date picture here]

This article describes a little about ODC – primarily because we are hiring and we’d like candidates to know a little more about what we do here before they rock up – but it may also be of interest to those attempting to consolidate large amounts of data into a single, real-time, enterprise-wide store.

The Big Idea

ODC Core is the data store that sits at the centre of the ODC project. It was designed to be the one datastore the bank needs; the single port of call for all our trades and valuations with the vision of one day blending processing and data in a collocated manner. In fairness it is not quite that yet, as such a mythical beast is hard to come by, but it has made significant inroads.

So why is one big datastore useful you may ask? In short we, like many organisations, have a lot of problems with data. Most of these problems have nothing to do with technology. They are about different people’s interpretation of their part of our domain. Hundreds of systems across the bank each implement these different interpretations. Data is forwarded from system to system and the problem compounds. Enterprise messaging can only do so much to solve this problem because it is inherently point-in-time (so the interpretation of the message is still left to each application and their own method of persistence). Joining up all the dots to get a global view of the bank’s activity can be a confusing, manual and painful process. So the concept is simple: one golden copy that holds the truth. Get it right in one place and then migrate applications to that one single model and the one single data source. Simple idea. Somewhat harder to make a reality.

What is ODC Now

ODC has been live for coming up to two years, with development starting back in Jan 2010. The datastore is written inside Oracle Coherence, which provides a data-fabric in which we have built a distributed, normalised database. ODC Core (which is the data store itself) has some interesting qualities that differentiate it from your average database (or Coherence cluster). The three I cover in more detail below are messaging as a system of record, a dynamic data replication model to support efficient distributed joins, and our dynamic object and SQL interfaces. There are some other quite neat features that I won’t go into here, such as a distributed clock implementation that allows reliable and efficient snapshots of the datastore, the use of compression on large result sets (our own interpretation of dictionary encoding) and a sample-based query optimiser.

Messaging as a System of Record: Unlike most databases ODC Core provides both query and subscription semantics. This actually falls out quite naturally as messaging sits at the very core of the product. In fact messaging is our system of record. So when data is written to the store that data is only ‘accepted’ once it has been written synchronously to the event stream. Having an event stream as your system of record proves to be a powerful concept.

From a non-functional perspective this allows persistence to scale out linearly in a ‘load balanced’ manner (we use topics rather than queues so there is global ordering and hence no need to share state across different servers in the messaging layer). Providing write scalability is only one advantage though. Having everything persisted through a single event stream means you can hook anything you like into it. If you are interested in a certain type of event you can just subscribe with a message selector. If you want to create a copy of the store in a relational database you can just hook into the same stream. If you want a disaster recovery instance … you get the picture I’m sure.
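As a rough illustration of the pattern (and nothing like the real ODC code, which sits on Coherence and a proper messaging layer), here is a toy Python version. The EventStream and Store classes, the event fields and the lambda selectors are all made up for the example; the only thing it tries to show is that the store is just one more subscriber to the stream that holds the truth.

```python
class EventStream:
    """Toy append-only event stream with selector-based subscriptions."""

    def __init__(self):
        self.log = []
        self._subscribers = []  # (selector, callback) pairs

    def subscribe(self, selector, callback):
        self._subscribers.append((selector, callback))

    def publish(self, event):
        self.log.append(event)  # the event is recorded before anyone is told
        for selector, callback in self._subscribers:
            if selector(event):
                callback(event)


class Store:
    """Queryable view that is itself just one more subscriber to the stream."""

    def __init__(self, stream):
        self.stream = stream
        self.data = {}
        stream.subscribe(lambda event: True, self._apply)

    def _apply(self, event):
        self.data[event["key"]] = event["value"]

    def write(self, key, value, kind):
        # A write only counts once it has gone through the event stream.
        self.stream.publish({"key": key, "value": value, "kind": kind})


stream = EventStream()
store = Store(stream)

# Anything else can hook in with its own selector, e.g. trades only.
trades_seen = []
stream.subscribe(lambda event: event["kind"] == "trade", trades_seen.append)

store.write("t1", {"notional": 100}, kind="trade")
store.write("v1", {"pv": 3.5}, kind="valuation")
```

After the two writes the log holds both events, the store can answer queries for either key, and the trade-only subscriber has seen just the first one. A relational copy or a DR instance would be further subscribers of the same shape.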

ODC Core efficiently joins normalised data: All distributed stores that support a degree of normalisation struggle if they need to join data elements that are not collocated with one another. They are forced to ship potentially large amounts of data across the network to compute the join. Sharding helps a little but you can only shard by a single key so there will always be elements that don’t end up collocated (because they have ‘crosscutting’ keys). We use a relatively novel approach to solving this problem. In short we replicate data that does not shard. However simply replicating data would cause the cluster to run out of memory as there would simply be too much replicated data on each node.

To get around this problem, when data is written to the store the system walks the object model, ensuring that all items that the data ‘connects to’ are replicated. So we start out by replicating nothing. As data is written to the cluster we walk the domain model to make sure the ‘dimensions’ that data connects to are replicated. Most importantly, at any point in time data that is not ‘connected’ will not be replicated. This reduces the amount of replicated data by an order of magnitude so that replication can be used for efficient joins with ‘Dimensions’. If you’re interested in this pattern you can find out more about it here and here.
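A small Python sketch may help picture the mechanics. It is a simplification under my own assumptions rather than the ODC implementation: the Cluster class, the "refs" field used to express what an object connects to, and the idea of holding the full dimension set in one dictionary are all invented for the example.

```python
class Cluster:
    """Toy cluster: facts are sharded by key, dimensions replicated only once connected."""

    def __init__(self, node_count, dimensions):
        self.node_count = node_count
        self.all_dimensions = dimensions  # full dimension data, not yet replicated
        self.partitions = [dict() for _ in range(node_count)]
        self.replicated = [dict() for _ in range(node_count)]

    def write_fact(self, key, fact):
        # Shard the fact by its own key.
        self.partitions[hash(key) % self.node_count][key] = fact
        # Walk the references and replicate anything the fact connects to.
        for dim_key in fact["refs"]:
            self._replicate(dim_key)

    def _replicate(self, dim_key):
        if dim_key in self.replicated[0]:
            return  # already connected, and so already on every node
        dimension = self.all_dimensions[dim_key]
        for replica in self.replicated:
            replica[dim_key] = dimension
        # Dimensions can reference other dimensions, so keep walking.
        for parent in dimension.get("refs", []):
            self._replicate(parent)

    def local_join(self, node):
        """Join each fact on a node to its dimensions without leaving the node."""
        return {key: [self.replicated[node][d] for d in fact["refs"]]
                for key, fact in self.partitions[node].items()}
```

Writing a single trade that references one counterparty pulls that counterparty (and whatever it references in turn) onto every node, while counterparties that no trade points at stay unreplicated. Whichever node the trade lands on can then do the join locally.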

ODC supports Object and Relational models through a single interface: ODC is primarily an object database. This is important because it represents a 2D domain model (a representation of the bank’s Logical Domain Model – something we hold very dear). We have a simple object-based query language which allows a user to query (filter, group etc) by element of any object in the store (the API is derived reflectively from the domain model). The language is SQL-like but has all the benefits of intellisense in your IDE. That is to say you can filter, group, select etc on any getter, collection etc that any of our objects expose. You can define which joins you would like to make to bring more data back, add predicate logic etc.

In addition we support a basic JDBC driver, which means users can get at our data in rows and columns if they wish. We’d prefer that they didn’t, as rows and columns just don’t really work for a 2D domain model, but we also understand that lots and lots of tools want to interact with their data in SQL. The SQL adapter actually works in exactly the same way as the object-based interface. That is to say that the information that is sent to the store is the same. We just have to do a little more work to present the data in a tabular form.

ODC is continuously delivered: We’ve put a lot of work in to continuously deliver our application suite, or at least to do something as close to continuous delivery as we can. The challenge is that ODC is quite big. Each environment runs around 450 processes with 50 different process definitions and the database is around 2TB, which means it takes a long time to migrate (see the Future section below).

So why bother with continuous delivery? It’s really about how long it takes to get feedback on a problem. With this system in place we get feedback on changes with a real data set in something that looks and runs identically to production. We get that feedback every day. The effort has gone into a series of increasingly comprehensive tests. 20k unit and functional tests run before you check in (this takes just under 20 mins). The MiniMe build migrates the database should any database changes be checked in. It does this on a cut-down dataset, which means it can do pretty much any migration in twenty to thirty mins. If that passes, a full migration ensures that the code works with a fully populated data set. Finally, if all that passes we rip down the almost-prod env, release to it and start everything up again. If anything goes wrong we roll back using a database flashback. All in all a lot of pain, but that’s the world of databases in the terabytes. The luxury of seeing a new bit of work in a production-identical environment within a day is worth it though. The continuous delivery system is written in Gradle by Greg Gigon.

The Future

The future for ODC’s data store revolves around its ability to adapt to a changing world. Databases aren’t so good at that. When you have a database you need to understand your data before you store it. Part of moving fast is accepting that you can’t understand all your data at the get-go however much you may wish to. Understanding data just takes time (and you get it wrong). The plan is to avoid these problems by using late binding to wrap a schema onto original, unaltered facts at runtime. This concept of the late bound schema allows us to change our mind and map data late on in the delivery cycle because the unaltered facts always sit at its core. Doing this in a traditional schema oriented store (like a database) isn’t possible since you would have to back-populate any new additions. The schema is more like a view in a database, except that the view is over the data file as it was provided to the database, rather than some mapped version of it. Some big data technologies offer properties like this but none we’ve come across offer this in the context of a statically typed language that can version data, provide consistent views and join entities that have disparate lifecycles. We see this step as an important move towards becoming the one store that a large number of systems can rely on.
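To give a feel for what a late bound schema means in practice, here is a deliberately tiny Python sketch. It is only an illustration of the idea: the JSON facts, the field names and the two schema versions are all invented, and it says nothing about versioning, consistent views or static typing, which are the hard parts.

```python
import json

# Facts are kept exactly as they arrived; nothing is mapped at write time.
raw_facts = [
    '{"id": "t1", "ccy": "GBP", "amt": "100.50"}',
    '{"id": "t2", "ccy": "USD", "amt": "20"}',
]

# A schema is just a set of named field extractors bound at read time.
schema_v1 = {
    "trade_id": lambda raw: raw["id"],
    "amount": lambda raw: float(raw["amt"]),
}

# Changing our mind later means writing a new schema, not rewriting the facts.
schema_v2 = dict(schema_v1, currency=lambda raw: raw["ccy"])


def read(schema):
    """Bind the schema onto every original fact at query time."""
    for line in raw_facts:
        raw = json.loads(line)
        yield {field: extract(raw) for field, extract in schema.items()}
```

Reading with schema_v2 surfaces the currency for trades that were stored long before anyone thought to map it, and no back-population was needed because the original facts were never altered.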

The higher level vision (which is the vision of our CIO) is a data oriented architecture in which services are deployed and run in a cloud like environment that is ‘preloaded’ with all the bank’s primary data. That is to say that services running in this environment utilise only centralised persistence for the bank’s core facts.

The Team

The team are split between London and India. There is a strong influence from the software industry and that goes for the work as well as the ethos. We don’t always agree (lots of strong characters) but we always get along. If you are interested some of us mapped out what we value most here a while back.

We practice something that is a little bit like agile. We work iteratively. We write lots of tests. We keep the build time down. But we’re aging slightly which means we don’t pair as often as we used to (but we do still pair). Iterations overhang a little too often but hopefully you can forgive us for that.

So if you’re looking for work because you want to pay the bills there are better teams out there. If you’ve chosen a life in software because it’s something that you find yourself musing about in idle moments and excited about when you wake in the morning then it could be for you.

If you’d like to find out more just email me or get me on twitter @benstopford.


Looking at Intel Xeon Phi (Knights Corner)

Thursday, August 9th, 2012

Characteristics:

Intel’s new MIC ‘Knights Corner’ coprocessor (in the Intel Xeon Phi line) is targeted at the high concurrency market, previously dominated by GPGPUs, but without the need for code to be rewritten into Cuda etc (note Knights Ferry is the older prototype version).
The chip has 64 cores and 8GB of RAM with a 512-bit vector engine. Clock speed is ~1.1GHz and it has a 512k L1 cache. The Linux kernel runs on two 2.2GHz processors.
It comes on a card that drops into a PCI slot so machines can install multiple units.
It uses a MESI protocol for cache coherence.
There is a slimmed down Linux OS that can run on the processor.
Code must be compiled to two binaries, one for the main processor and one for Knights Corner.
Compilers are currently available only for C++ and Fortran. Only Intel compilers at present.
It’s on the cusp of being released (Q4 this year) for NDA partners (though we – GBM – have access to one off-site at Maidenhead). Due to be announced at the Supercomputing conference in November(?).
KC is 4-6 GFLOPS/W – which works out at 0.85-1.8 TFLOPS for double precision.
It is expected to be GA Q1 ’13.
It’s a large ‘device’: the wafer is a 70mm square form-factor!
Access to a separate board over PCI is a temporary step. Expected that future versions will be a tightly-coupled co-processor. This will also be on the back of the move to the 14nm process.
A single host can (depending on OEM design) support several PCI cards.
Similarly power-draw and heat-dispersal are an OEM decision.
Reduced instruction set e.g. no VM support instructions or context-switch optimisations.
Performance now being expressed as GFLOPS per Watt. This is a result of US Government (efficiency) requirements.
A single machine can go faster than a room-filling supercomputer of ’97 – ASCI Red!
The main constraint to doing even more has been the limited volume production pipeline.
Pricing not announced, but expected to be ‘consistent with’ GPGPUs.
Key goal is to make programming it ‘easy’ or rather: a lot easier than the platform dedicated approaches or abstraction mechanisms such as OpenCL.
Once booted (probably by a push of an OS image from the main host’s store to the device) it can appear as a distinct host over the network.
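As a back-of-envelope check on the GFLOPS/W and TFLOPS figures quoted above, the two can be combined to see what power draw they imply. The Python below is just that arithmetic; pairing the lower throughput with the lower efficiency (and upper with upper) is my assumption, not something Intel stated.

```python
def implied_watts(tflops, gflops_per_watt):
    """Power draw implied by a throughput and an efficiency figure."""
    return tflops * 1000 / gflops_per_watt


low = implied_watts(0.85, 4)   # lower throughput at lower efficiency
high = implied_watts(1.8, 6)   # upper throughput at upper efficiency
```

That puts the implied draw at a couple of hundred to a few hundred watts per card, which is worth bearing in mind given that power and cooling are left to the OEM.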

Commentary:

The key point is that Knights Corner provides most of the advantages of a GPGPU but without the painful and costly exercise of migrating software from one language to another (that is to say it is based on the familiar x86 programming model).
Offloading work to the card is instructed through the offload pragma or offloading keywords via shared virtual memory.
Computation occurs in a heterogeneous environment that spans both the main CPU and the MIC card which is how execution can be performed with minimal code changes.
There is a reduced instruction set for Knights Corner but the majority of the x86 instructions are there.
There is support for OpenCL although Intel are not recommending that route to customers due to performance constraints.
Real world testing has shown a provisional 4x improvement in throughput using an early version of the card running some real programs. However results from a sample test show perfect scaling. Some restructuring of the code was necessary. Not huge but not insignificant.
There are currently only C++ and Fortran interfaces (so not much use if you’re running Java or C#).
You need to remember that you are on PCI Express so you don’t have the memory bandwidth you might want.

References:

Good introduction to the history and development of Knights Corner

A second recent article on Knights Ferry/Corner

Intel slides discussing KC and finishing with a Black Scholes example

Other things worth thinking about:

http://www.altera.com/

Thanks to Mark Atwell for his help with this post.


ben stopford
