Shared Nothing v.s. Shared Disk Architectures: An Independent View
The Shared Nothing Architecture is a relatively old pattern that has had a resurgence of late in data storage technologies, particularly in the NoSQL, Data Warehousing and Big Data spaces. As architectures go it’s there are fairly dramatic performance tradeoffs across the two. This article contrasts Shared Nothing with Shared Disk Architectures.
Shared Disk and Shared Nothing?
Shared Nothing is a data architecture for distributed data storage in a clustered environment. Data is partitioned in some manner and spread across a set of machines with each machine having sole access, and hence sole responsibility, for the data it holds.
By comparison Shared Disk is exactly what it says; disk accessible from all cluster nodes. These two architectural styles are described in the two figures. In the Shared Nothing Architecture notice how there is complete segregation of data i.e. it is owned solely by a single node.
In the Shared Disk any node can access any piece of data and any single piece of data has no dedicated owner. The node is, from a data perspective, autonomous.
Understanding the Trade-offs for Writing
When persisting data in a Shared Disk architecture writes can be performed against any node. If node 1 and 2 both attempt to write a tuple then, to ensure consistency with other nodes, the management system must either use a disk based lock table or else communicated their intention to lock the tuple with the other nodes in the cluster. Both methods provide scalability issues. Adding more nodes either increases contention on the lock table or alternatively increases the number of nodes over which lock agreement must be found.
To explain this a little further consider the case described by the below diagram. The clustered Shared Disk database contains a record with PK = 1 and data = foo. For efficiency both nodes have cached local copies of record 1 in memory. A client then tries to update record 1 so that ‘foo’ becomes ‘bar’. To do this in a consistent manner the DBMS must take a distributed lock on all nodes that may have cached record 1. Such distributed locks become slower and slower as you increase the number of machines in the cluster and as a result can impede scalability.
The other mechanism, locking explicitly on disk, is rarely done in practice in modern systems as caching is so fundamental to performance.
However the Shared Nothing Architecture does not suffer from this distributed locking problem, assuming that the client is directed to the correct node (that is to say a client writing ‘A’, in the figure above, directs that write at Node 1) , the write can flow straight though to disk with any lock mediation performed in memory. This is because only one machine has ownership of any single piece of data, hence by definition there only ever needs to be one lock.
Thus Shared Nothing Architectures can scale linearly from a write perspective without increasing the overhead of locking data items, because each node has sole responsibility for the data it owns.
However Shared Nothing will still have to execute a distributed lock for transactional writes that span data on multiple nodes (i.e. a distributed two-phase commit). These are not as large an impedance on scalability as the caching problem above, as they span only the nodes involved in the transaction (as apposed to the caching case which spans all nodes), but they add a scalability limit none the less (and they are also likely to be quite slow when compared to the shared disk case).
So Shared Nothing is great for systems needing high throughput writes if you can shard your data and stick clear of transactions that span different shards. The trick for this is to find the right partitioning strategy, for instance you might partition data for a online banking system such that all aspects of a user’s account are on the same machine. If the data set can be partitioned in such a way that distributed transactions are avoided then linear scalability, at least for key-based access, is at your fingertips.
In summary, Shared Disk Architectures are write limited as locks must be coordinated across the cluster. Writes to Shared Nothing architectures are limited should they require a distributed two phase commit.
The counter, from the Shared Disk camp, is that they can use partitioning too. Just because the disk is shared does not mean that data can’t be partitioned logically with different nodes servicing different partitions. There is some truth to this, assuming you can set up your architecture so that write requests are routed to the correct machine, as this tactic will reduce the amount of lock (or block) shipping taking place (and is exactly how you optimise Oracle RAC). So that is to say that a shared disk implementation can be configured in a shared nothing mode. The difference here is just the physical placement of data. Shared disk is always network attached in some way, never local. From a performance perspective remote disks can be very fast. As fast as local. But at a much greater monetary cost.
Considering the Retrieval of Data
The retrieval of data in these architectures suffers from different constraints to writes. Looking firstly Shared Disk we find it to have two significant drawbacks when compared with Shared Nothing:
The first is the potential for resource starvation, most notably disk contention on the SAN/NAS drives. This can be combated, to a certain extent, through the use of partitioning but then the Shared Disk architecture starts to suffer from the other downside of Shared Nothing; the data must be physically partitioned in advance.
The second issues is that caching is less efficient due to the scope of each cache. Unlike the Shared Nothing architecture, where queries for a certain data set will be consistently directed to the node that owns the data, Shared Disk architectures will spread query load across all nodes. This increases the data churn in the cache on each node and hence caching is less effective (a better way to describe this may be that in Shared Disk each cache must serve the whole data set, in Shared Nothing each cache must only serve 1/n of the data where n = number of nodes).
Shared Nothing has the one major flaw: Shared Nothing works brilliantly if the query is self sufficient i.e. it can complete entirely on one node or through an efficient parallel processing pattern like MapReduce. However there will inevitably come a time when data, from multiple nodes, must be operated on. This is most notable in cross table joins. Such cases require that the data, that will not included in the final result, be shipped from one node to another and thus degrades query performance.
The reality is that the number of queries requiring data shipping will depend on the use case and the partitioning strategy and in many cases can be minimised or eliminated (for example commercial search engines). However, for general business cases , for example ones with related fact tables, some data shipping is inevitable. In some cases this can be quite crippling, for example the ‘Complex Case’ discussed here. This is why all self respecting Shared Nothing solutions recommend the use of fast 10GE networks. Five years ago it was a far (a.k.a. ten times worse) greater problem.
There is one additional issue with reads in a Shared Nothing architecture. Locating unindexed data (i.e where the partitioning key can not be leveraged) requires sending the query to all machines (partitions)). This presents a limit to the scalability of the architecture for questions one items other than the key. For example adding more users will increase the number of such queries and this, in turn, will increase the number of queries that each machine must service (This being independent of how many machines you have). This becomes prohibitive as the database scales as each query will contribute a base latency to every node.
This is important enough to restate. Shared nothing is only lineary scalable for key-based access. The use of secondary indexes always results in every node being consulted. This limits scalability, certainly in terms of the number of concurrent requests that can be services. This is one of the reasons for many distributed key-value stores sticking to the very simple K-V contract.
The retort is that the problem can be managed by reducing the amount of data stored per machine so that each query is faster and hence the average load per query is reduced. This being achieved by adding more machines to the cluster but keeping the total disk usage constant. The problem is there is always a base latency, an overhead on every machine. The reality is that such queries present problems for both architectures so to really scale you need a combination (For example MongoDB uses shared nothing and data replication to achieve the benefits of both of these architectures – replication being, in some ways, similar to the trade offs of shared disk)
Complexity at Scale
A possibly less obvious benefit of shared nothing is to do with complexity at scale. Put simply, because of the autonomous nature, each node in a SN system has a relatively simple contract. It’s concerns are encapsulated in its own data partition. This means the software to manage failure can be simpler, behaving with little or no knowledge of it’s wider role in the cluster.
In contrast shared disk systems are fully open to the influence of other nodes. These couplings take the forms of locks, with timeouts and relatively complex failure semantics. If we consider a failure in a shared disk system, the node is likely to have locks out on the underlying shared disk structure. These locks implicitly affect the other processing nodes and the system must go through a process of discovering the failed node, it’s locks and then releasing them or letting them timeout.
The shared nothing system only has to detect failure and promote the backup node (or similar depending on the failure strategy of the system). In fairness these problems are well understood, but often still misapplied and always bring, IMHO, a little more complexity to bare. Certainly the complexity of these issues seems to grow with the number of nodes and the heterogeneity of the deployment, favouring SN for the very large.
So Which Should You Use?
If you are Google or Amazon then you will inevitably choose scalability over consistency and use a shared nothing architecture. The size and heterogeneity of your deployment make the simple, autonomous SN model attractive.
If you are a business system that is unlikely to need more than two or three servers then the complexities of partitioning a complex domain model efficiently may outweigh the benefits, so Shared Disk may be preferable. Particularly since you can partition in a SD model too, simulating the autonomy of a SN model but on a SD subsystem.
In the relational world the two big players for warehousing are Exadata (shared disk) and Teradata (shared nothing). These are both fairly pure implementations of the respective patterns and sit on the respective ends of the spectrum. Probably the most interesting technologies of the moment sit somewhere in between.
Blending sharding (shared nothing) and replication is a good way to gain different types of scalability. Databases like MongoDB provide primitives for both of these.
Hadoop provides a blended model. HDFS is really a type of shared disk but the execution model used by Hadoop is shared nothing, with computation routed to the nodes where data lies, wherever this is possible. This composite model is attractive as it provides benefits from both approaches: a shared disk subsystem which spans the various machines in the cluster, and a programming model that treats it as shared nothing, with each node assuming an autonomous data subset, which will be collocated with processing most of the time. So the benefits of shared nothing’s scaling through autonomy but with the power to break from the model where needed. Clever!
There are a number of good papers on the subject. The infamous Michael Stonebraker was one of the early SN evangelists, back in the early 80’s. His paper The Case For Shared Nothing still makes good reading, even if it does skip some issues.
Also Shared-Disk vs. Shared Nothing by the makers of ScaleDB – a Shared Disk database. This paper makes the case for Shared Disk and enumerates the downsides of Shared Nothing.
The last paper presents the opposite view. How to Build A High Performance Data Warehouse is well written, mapping the pros and cons of each architecture. However don’t be sucked in by the academic URL. The authors are all affiliated with Vertica which is a commercial implementation from the Stonebraker camp, and the paper noticeably favours a Shared Nothing Columnar Architecture model, like the one used by Vertica. Never the less it’s a good read.
Finally there is a good section in Architecture of a Database System.
See also Are Databases a Thing of the Past?
Posted on November 24th, 2009 in Data Tech