Disruptive Technologies in the Data Layer
The last time I was in the market for a commercial database (back in 2003) it was a pretty simple process. The DBA team supplied the most reasonably priced database from one of the available vendors (with this preference changing frequently as contracts were renegotiated). We didn’t have that much involvement in the decision, largely because there was little between the vendors anyway. Each vendor offered only minor variations on the same basic product.
Revisiting the market now is a very different experience. Most notably the introduction, or at least coming of age, of a number of disruptive technologies which have segregated the market into smaller niece fields. Some of these have evolved from the application framework arena, some from super-computing and others from the database world itself. The overriding point is however: No one database technology can do it all.
These days you need to work out want type of database you need. No one product can do it all.
Of course it is not really the technologies that have changed, as such, but rather the database market has evolved to have needs that can only be serviced by new technologies. These days there are far more diverse requirements for the consumption of data, lead largely by the distribution of systems over greater and greater hardware estates.
For the sake of this article I’ll present a summary of the newer technologies that have entered the database world. Hopefully this will provide sufficient a primer for you to form your own opinions about what you may need.
Four main disruptive technologies have entered the database market: Clustering, Shared Nothing Architectures, Column Orientation, Distributed Caching.
Firstly lets look at four of the new technologies that have entered the database world:
-
Clustering
-
Shared Nothing Architectures
-
Column Orientation
-
Distributed Caching
Clustering
Clustering is the road out of the scale-up world. It’s a hard technology to implement well, fortunately it’s an easy technology to understand, being simply the mechanism for joining a collection of computers together so they look something like a single entity. The challenges are far and wide and go way beyond the scope of this article but if you are interested they include consensus problems, ordering problems, concurrency. Clustering, in some form, is fundamental to any scale-out system, particularly a database where there is need for shared state in the application tier and where a load balanced architecture will not suffice. The downside is that clustering pushes a fundamental problem of hardware architecture; access to shared memory, into the software domain. Not only must software handle the federation of hardware but these disparate machines are connected via dramatically slower interconnects then their scale-up counterparts (10μs being typical for a wire call vs 100ns for local memory access). This complexity is no doubt at the heart of the many stories told of clustered data systems performing more poorly than their single hosted brethren.
However the problems seen with clustered databases can, in at least some cases, be put down to bad system design. When it comes to distributed systems, and even more so for distributed data management, it is inadvisable to treat the system simply as a black box if you expect it to perform. If a black box is what you need then buy a bigger machine (or try Oracle Exadata, an appliance that is practically synonymous with a large machine even though it’s really a cluster).
In distributed computing systems cannot be treated simply as black boxes if you expect them to perform in the general case due to the cost of sharing data. You need to think about what the system is doing. If you want a black box don’t go distributed, just buy a bigger machine.
The challenges of remote shared memory are a problem for clustered databases primarily during the orchestration of updates when distributed lock management must be performed. In the worst case scenario nodes in the cluster become ‘chatty’ as they move blocks, or locks from one machine to another, adding large latencies to writes. This challenge is dependent on the architecture used by the DBMS and is discussed further in [2]).
The benefits of clustering are not limited to scaling out performance. Clustering also provides implicit fault tolerance and products in this space have matured significantly in the last few years. Certainly a frequent use case for providing an Oracle RAC cluster at my current workplace is High Availability alone.
Shared Nothing Architectures
In terms of a categorisation these are often termed MPP databases (Massively Parallel Processors) although their defining property is a lack of shared state. The Shared Nothing Architecture involves a physical partitioning of resources so that processing, memory and disk become dedicated to a certain sub-section of the data set (the local partition). Partitioning (or Sharding [3]) provides the decoupling required for scaling clusters beyond non-trivial sizes. Thus a MPP stores will scale linearly as hardware is added, assuming the technology is implemented well.
However SNAs come at a price. The partitioning model breaks down when operations require that intermediary results be shipped between machines, particularly where those intermediary results will not form part of the result. Examples include joins between Fact tables (which will require the join keys to be moved form one machine to another), multidimensional aggregations such as multi-dimensional risk calculations (i.e. the OLAP domain), or transactional writes that span the current partitioning strategy. A more detailed comparison between shared nothing and shred disk can be found here [2].
Column-Oriented Databases
Column oriented databases have been around for over twenty years but only recently have they really become popular. There are a couple of reasons for this. Firstly the technology has matured significantly of late and secondly because increased data sizes have pushed mainstream use towards the realms data warehousing.
Column oriented databases fundamentally change the way data is stored and the repercussions of this are fairly extensive. Of the disruptive technologies discussed here the trade-offs between column/row approaches are probably the hardest to fully understand. So much so that the subject really deserves an article to itself (however if you are interested [6] is an excellent resource on the subject).
To summarise the benefits of Column Stores; by storing data in columns certain operations are much faster. As an example consider a simple query that returns the sum of integer values in a single column. The row based database would need to read all rows from disk to memory before performing the summation on just one column. The column based approach however only need extract the data for that column. If there are 20 columns in the table that ~ 1/20th less data to shift off the disk.
In addition to this more precise retrieval for single column queries, holding data as columns allows large decreases in stored data size through compression, particularly when the cardinality of the data in the column is low. This compression reduces the amount of data that needs to be stored, and read around as a part of queries.
There are, of course, some downside to the column oriented model. The most notable are these are the fact that inserts are slow, when compared to their row based brethren. There is also significant cost associated with ‘sewing’ columns back together (tuple construction) so if there are a large number of columns returned this can be costly.
Both of these inefficiencies arise from the fact that a single ‘row’ will be spread across multiple parts of the disk (i.e. one per column). Thus writing a single row will involve the mutation of each of the columns in the row and hence each column incurs a separate I/O. This can be compared to the row based approach which writes the row’s data as a contiguous section of disk and hence the whole row is a single I/O.
The problem with returning large numbers of columns is analogous. Each column in the result set must be bound together. This process is called tuple construction. In the extreme case of returning a single row of many columns the cost would be 1 I/O per column as apposed to a single I/O for the row based approach. A full treatment of columnar stores can be found in [6].
Distributed Caching
The final disruptive technology is distributed caching. Distributed caches generally follow a Shared Nothing Architecture. Those that don’t are really Replicated Caches (having all elements replicated to each node). The later is generally considered an inferior architecture for the majority of use cases as (i) storage is limited to the size of a single process and (ii) write performance degrades as the number of copies increases as the cluster scales.
In practice the use of distributed caching has a fundamental impact on the way we write data-based systems that utilise them because they simplify the contract between client and data. This change in contract pushes transactionality into the realm of the user with the return being hugely increased performance.
The big BUT is that distributed caches do not have full ACID support. In fact Atomic is generally all you get. This means that, should your application write to the cache, transactions to tie together different entities, consistency checks (for example foreign key constraints) etc must be performed manually or deferred to the database if they are needed. Obviously there are many applications where this would be disastrous, or at least prohibitive. However there are also many cases where the application designers are quite happy to sacrifice such safety checks for the sake of speed.
Distributed Caches force a change in contract to a non-transactional world.
To site some concrete examples: Distributed caching if often used in conjunction with large compute grids as a means for engines to get fast access to the data they may need. If the data set is well known, and loaded from a database, then there is no requirement for consistency checks. Their existence in a DBMS guarantees the C in ACID, Consistency, by proxy.
A more subtle example might be a trading system that writes trades, as a two dimensional entity; Trade+Legs, to the cache. Traditionally a transaction would be used to ensure that a 2-phase commit released trade and corresponding legs in one go. However some user cases will be satisfied by simply writing the legs before the trade ensuring that rule: trade’s always have a complete set of legs, but not the inverse, trade legs always have a trade (Note such models require version based entities to work correctly). The point is that, in the general case, a rule like this must be symmetric (hence ACID) but by making some simple statements about how your system operates (legs are never queried without their corresponding trades) the ‘I’ in ACID can be worked around.
Distributed caching should be employed where a significant increase in speed or scalability is worth the effort of working around the guarantees of an ACID resource.
Distributed Caching is different to the other technologies sited here in that it often augments a data architecture. However to be used well it needs to be more than a cache. It forms a data repository for a targeted data set that requires massive scalability and low latency access. However the change in the contract changes everything and this change may be prohibitive in some use cases. It is safe to say that whatever your use case the engineering cost will be greater if you put it in a caching technology. However this discussion goes beyond the scope of this paper. It should be sufficient to conclude that distributed caching should be employed where a significant increase in speed or scalability is worth the effort of working around the guarantees of an ACID resource.
Conclusions
For the majority of enterprise users the single node database should still suffice, with the limit tending to be the network connection in most modern enterprises (32 CPU/128GB machines are likely to thrash a 1 GB Ethernet connection). The point is; if you can manage with a single box it is so much simpler to do so.
If your requirement for bandwidth or processor exceeds that provided by a single machine, or you require high availability, then a clustered solution will be required.
Column orientation should be included in the mix if you can sidestep the restrictions on writes i.e. your writes can be batched into bulk load processes rather than frequent, small transactions. If you also have a use case that is particularly analytical or ad hoc, then column orientation will increase performance. Column oriented solutions exist in each of the other spaces; single machine/SMP (Sybase IQ), Clustered Shared Disk (Sybase IQ), Clustered Shared Nothing (Vertica).
Shared Nothing should be employed for very large data processing requirements; typically clusters beyond 10 machines. The overhead of shared state becomes prohibitive in such large clustered environments, Shared Nothing Architectures are the solution.
Use Distributed Caching if you can avoid ACID for at least a subset of your use case. The tricky parts tend be be providing sufficient Isolation and Consistency (see examples above). A good approach is to use caching for specific subsets of the full dataset, ones for which concerns of Isolation and Consistency can be avoided or satisfied through trivial checks coded into the system.
Closing Thoughts
The database world has certainly changed a lot since 2003 but having looked at it, in some detail, there are some fairly obvious conclusions. The first is that the domain of the traditional DBMS is definitely under threat. Shared Nothing Architectures and Column Orientation represent radically different approaches to the database problem. But Shared Nothing, in particular, feels immature. It is no doubt the influence of my background but I can’t help feeling the Shared Nothing architectures have geared themselves towards the data warehousing markets where input data streams are large and course grained, yet their fundamental architecture favours high transactional throughput. Where are the Shared Nothing systems that are as good at Data Warehousing as they are at OLTP use cases?
Finally if you are interested in what the future will likely look like have a read of this paper by the inventor or POSTGRES and noted academic [4] [5].
References:
[1] http://www.sun.com/servers/white-papers/med-smp.architecture.wp.html#1.1
[2] http://www.benstopford.com/2009/11/24/understanding-the-shared-nothing-architecture/
[3] http://en.wikipedia.org/wiki/Shard_%28database_architecture%29
[4] http://www.vldb.org/conf/2007/papers/industrial/p1150-stonebraker.pdf
No comments
Jump to comment form | comments rss | trackback uri