Shared Nothing v.s. Shared Disk Architectures: An Independent View

The Shared Nothing Architecture is a relatively old pattern that has had a resurgence of late in data storage technologies, particularly in the NoSQL, Data Warehousing and Big Data spaces. As architectures go it’s there are fairly dramatic performance tradeoffs across the two. This article contrasts Shared Nothing  with Shared Disk Architectures.

Shared Disk and Shared Nothing?

Shared Nothing is a data architecture for distributed data storage in a clustered environment. Data is partitioned in some manner and spread across a set of machines with each machine having sole access, and hence sole responsibility, for the data it holds.

By comparison Shared Disk is exactly what it says; disk accessible from all cluster nodes. These two architectural styles are described in the two figures.  In the Shared Nothing Architecture notice how there is complete segregation of data i.e. it is owned solely by a single node.

In the Shared Disk any node can access any piece of data and any single piece of data has no dedicated owner. The node is, from a data perspective, autonomous.

Understanding the Trade-offs for Writing

When persisting data in a Shared Disk architecture writes can be performed against any node. If node 1 and 2 both attempt to write a tuple then, to ensure consistency with other nodes, the management system must either use a disk based lock table or else communicated their intention to lock the tuple with the other nodes in the cluster. Both methods provide scalability issues. Adding more nodes either increases contention on the lock table or alternatively increases the number of nodes over which lock agreement must be found.

To explain this a little further consider the case described by the below diagram. The clustered Shared Disk database contains a record with PK = 1 and data = foo. For efficiency both nodes have cached local copies of record 1 in memory. A client then tries to update record 1 so that ‘foo’ becomes ‘bar’. To do this in a consistent manner the DBMS must take a distributed lock on all nodes that may have cached record 1. Such distributed locks become slower and slower as you increase the number of machines in the cluster and as a result can impede scalability.

The other mechanism, locking explicitly on disk, is rarely done in practice in modern systems as caching is so fundamental to performance.

However the Shared Nothing Architecture does not suffer from this distributed locking problem, assuming that the client is directed to the correct node (that is to say a client writing ‘A’, in the figure above, directs that write at Node 1) , the write can flow straight though to disk with any lock mediation performed in memory. This is because only one machine has ownership of any single piece of data, hence by definition there only ever needs to be one lock.

Thus Shared Nothing Architectures can scale linearly from a write perspective without increasing the overhead of locking data items, because each node has sole responsibility for the data it owns.

However Shared Nothing will still have to execute a distributed lock for transactional writes that span data on multiple nodes (i.e. a distributed two-phase commit). These are not as large an impedance on scalability as the caching problem above, as they span only the nodes involved in the transaction (as apposed to the caching case which spans all nodes), but they add a scalability limit none the less (and they are also likely to be quite slow when compared to the shared disk case).distributed lock

So Shared Nothing is great for systems needing high throughput writes if you can shard your data and stick clear of transactions that span different shards. The trick for this is to find the right partitioning strategy, for instance you might partition data for a online banking system such that all aspects of a user’s account are on the same machine. If the data set can be partitioned in such a way that distributed transactions are avoided then linear scalability, at least for key-based access, is at your fingertips.

In summary, Shared Disk Architectures are write limited as locks must be coordinated across the cluster. Writes to Shared Nothing architectures are limited should they require a distributed two phase commit.

Shared Disk Architectures are write limited should they require locks that must be coordinated across the cluster. Shared Nothing Architectures are write limited should they require a distributed two phase commit.

The counter, from the Shared Disk camp, is that they can use partitioning too. Just because the disk is shared does not mean that data can’t be partitioned logically with different nodes servicing different partitions. There is some truth to this, assuming you can set up your architecture so that write requests are routed to the correct machine, as this tactic will reduce the amount of lock (or block) shipping taking place (and is exactly how you optimise Oracle RAC). So that is to say that a shared disk implementation can be configured in a shared nothing mode. The difference here is just the physical placement of data. Shared disk is always network attached in some way, never local. From a performance perspective remote disks can be very fast. As fast as local. But at a much greater monetary cost.

Considering the Retrieval of Data

The retrieval of data in these architectures suffers from different constraints to writes. Looking firstly Shared Disk we find it to have two significant drawbacks when compared with Shared Nothing:

The first is the potential for resource starvation, most notably disk contention on the SAN/NAS drives. This can be combated, to a certain extent, through the use of partitioning but then the Shared Disk architecture starts to suffer from the other downside of Shared Nothing; the data must be physically partitioned in advance.

The second issues is that caching is less efficient due to the scope of each cache. Unlike the Shared Nothing architecture, where  queries for a certain data set will be consistently directed to the node that owns the data, Shared Disk architectures will spread query load across all nodes. This increases the data churn in the cache on each node and hence caching is less effective (a better way to describe this may be that in Shared Disk each cache must serve the whole data set, in Shared Nothing each cache must only serve 1/n of the data where n = number of nodes).

Shared Nothing has the one major flaw: Shared Nothing works brilliantly if the query is self sufficient i.e. it can complete entirely on one node or through an efficient parallel processing pattern like MapReduce. However there will inevitably come a time when data, from multiple nodes, must be operated on. This is most notable in cross table joins. Such cases require that the data, that will not included in the final result, be shipped from one node to another and thus degrades query performance.

The reality is that the number of queries requiring data shipping will depend on the use case and the partitioning strategy and in many cases can be minimised or eliminated (for example commercial search engines). However, for general business cases , for example ones with related fact tables, some data shipping is inevitable. In some cases this can be quite crippling, for example the ‘Complex Case’ discussed here. This is why all self respecting Shared Nothing solutions recommend the use of fast 10GE networks. Five years ago it was a far (a.k.a. ten times worse) greater problem.

There is one additional issue with reads in a Shared Nothing architecture. Locating unindexed data (i.e where the partitioning key can not be leveraged) requires sending the query to all machines (partitions)). This presents a limit to the scalability of the architecture for questions one items other than the key. For example adding more users will increase the number of such queries and this, in turn, will increase the number of queries that each machine must service (This being independent of how many machines you have). This becomes prohibitive as the database scales as each query will contribute a base latency to every node.

This is important enough to restate. Shared nothing is only lineary scalable for key-based access. The use of secondary indexes always results in every node being consulted. This limits scalability, certainly in terms of the number of concurrent requests that can be services. This is one of the reasons for many distributed key-value stores sticking to the very simple K-V contract.

The retort is that the problem can be managed by reducing the amount of data stored per machine so that each query is faster and hence the average load per query is reduced. This being achieved by adding more machines to the cluster but keeping the total disk usage constant. The problem is there is always a base latency, an overhead on every machine. The reality is that such queries present problems for both architectures so to really scale you need a combination (For example MongoDB uses shared nothing and data replication to achieve the benefits of both of these architectures – replication being, in some ways, similar to the trade offs of shared disk)

Reads in Shared Disk Architectures can suffer from resource starvation issues and less efficient caching as the cluster scales. Shared Nothing  Architectures have the potential for far more scale but this can be hampered by queries that must hit all machines. Query speed can also be affected if  non-result (intermediary) data sets must be shipped cross-machine.

Complexity at Scale

A possibly less obvious benefit of shared nothing is to do with complexity at scale. Put simply, because of the autonomous nature, each node in a SN system has a relatively simple contract. It’s concerns are encapsulated in its own data partition. This means the software to manage failure can be simpler, behaving with little or no knowledge of it’s wider role in the cluster.

In contrast shared disk systems are fully open to the influence of other nodes. These couplings take the forms of locks, with timeouts and relatively complex failure semantics. If we consider a failure in a shared disk system, the node is likely to have locks out on the underlying shared disk structure. These locks implicitly affect the other processing nodes and the system must go through a process of discovering the failed node, it’s locks and then releasing them or letting them timeout.

The shared nothing system only has to detect failure and promote the backup node (or similar depending on the failure strategy of the system). In fairness these problems are well understood, but often still misapplied and always bring, IMHO, a little more complexity to bare. Certainly the complexity of these issues seems to grow with the number of nodes and the heterogeneity of the deployment, favouring SN for the very large.

So Which Should You Use?

If you are Google or Amazon then you will inevitably choose scalability over consistency and use a shared nothing architecture. The size and heterogeneity of your deployment make the simple, autonomous SN model attractive.

If you are a business system that is unlikely to need more than two or three servers then the complexities of partitioning a complex domain model efficiently may outweigh the benefits, so Shared Disk may be preferable. Particularly since you can partition in a SD model too, simulating the autonomy of a SN model but on a SD subsystem.

In the relational world the two big players for warehousing are Exadata (shared disk) and Teradata (shared nothing). These are both fairly pure implementations of the respective patterns and sit on the respective ends of the spectrum. Probably the most interesting technologies of the moment sit somewhere in between.

Blending sharding (shared nothing) and replication is a good way to gain different types of scalability. Databases like MongoDB provide primitives for both of these.

Hadoop provides a blended model. HDFS is really a type of shared disk but the execution model used by Hadoop is shared nothing, with computation routed to the nodes where data lies, wherever this is possible. This composite model is attractive as it provides benefits from both approaches: a shared disk subsystem which spans the various machines in the cluster, and a programming model that treats it as shared nothing, with each node assuming an autonomous data subset, which will be collocated with processing most of the time. So the benefits of shared nothing’s scaling through autonomy but with the power to break from the model where needed. Clever!

Further Reading

There are a number of good papers on the subject. The infamous Michael Stonebraker was one of the early SN evangelists, back in the early 80’s. His paper The Case For Shared Nothing still makes good reading, even if it does skip some issues.

Also Shared-Disk vs. Shared Nothing by the makers of ScaleDB – a Shared Disk database. This paper makes the case for Shared Disk and enumerates the downsides of Shared Nothing.

The last paper presents the opposite view. How to Build A High Performance Data Warehouse is well written, mapping the pros and cons of each architecture. However don’t be sucked in by the academic URL. The authors are all affiliated with Vertica which is a commercial implementation from the Stonebraker camp, and the paper noticeably favours a Shared Nothing Columnar Architecture model, like the one used by Vertica. Never the less it’s a good read.

Finally there is a good section in Architecture of a Database System.

See also Are Databases a Thing of the Past?

Posted on November 24th, 2009 in Data Tech

  1. Andrew Wilson December 3rd, 2009
    9:20 GMT

    Nice, lots of good background reading. Thanks, A.

  2. Hal December 9th, 2009
    13:38 GMT

    Good precis of the issues. Nice bit of cynicism at the end to warn that “academic” paper is indeed marketing material. PS: Didn’t realise the term Shared Nothing was as old as that – good to know that there is Nothing new in this world.

  3. Jeff Darcy January 4th, 2011
    14:13 GMT

    Good article. Thanks.

    Besides performance/scalability, another difference between shared-disk and shared-nothing has to do with how failures are handling. In a shared-disk model, failure of a node requires a complex recovery/failover procedure in which locks are broken, ownership of storage transferred, etc. This is fairly well understood technology today, but – having been around since we were all inventing that technology together – I continue to see it misapplied. In a shared-nothing model, storage and node failures can be handled the same way. Precisely because any given piece of raw storage is “stranded” on one node, systems built around this paradigm tend to involve replication “further out” from the disks so the requesting node simply uses another replica regardless of which type of failure it was. It’s one of those cases IMO where a worse problem leads to a better solution, because the problem (half solved by the multi-controller RAID hardware that shared-disk types use and shared-nothing types eschew) can’t be ignored.

    Another interesting angle here, since you work on Coherence, is how some of these same principles apply to more memory-centric stores. Even with SCI and IB available, memory is essentially shared-nothing so it makes sense if it and disk can be handled in consistent ways instead of having two different systems with sharply different performance and failure-handling characteristics.

  4. ben January 5th, 2011
    17:47 GMT

    An interesting comment Jeff. I hadn’t really considered the differences with respect to failure semantics. From a Coherence perspective a fair bit simpler as it’s all shared nothing and there is no disk (on my current project we use messaging as a system of record, which is of course disk based but it’s fire and forget). Your very point is the beauty of Coherence and the reason that a lot of folks are pushing all memory solutions. These guys are leading the way IMO (if you haven’t seen it already): However, as the use of SSD increases we may see the differences you allude to come back again!

  5. Simon Griffiths January 9th, 2011
    2:07 GMT

    Ben, an excellent precis of the problem. My expertise is in the large scale DW and I’d like to add a few comments.

    One of the interesting aspects of data warehousing is that in many cases, data is only ever added to the database and rows should not be updated. This is often a guiding principle and applies to both fact based and dimensional data. This is often implemented through the use of effective dates. In that context, the requirement to ‘lock’ rows for update is substantially reduced. In this situation the main limiting factor that you have identified is substantially reduced in its effect for SD. How about the following as a suggested amendment :

    Shared Disk Architectures are write limited should they require locks that must be coordinated across the cluster. Shared Nothing Architectures are limited should they require a distributed two phase commit.

    In fact there is then a close relationship between the two limiting factors – essentially both systems are limited by the requirement to co-ordinate data change across the cluster. In the case of SN, the co-ordination is required to ensure two or more changes are co-ordinated, in the case of SD the co-ordination is required to co-ordinate competing write requests.

    This observation is also relevant when considering partitioning of data in a SD system, If data is only ever being added, then data-loading becomes straightforward to parallelise up to the level of a load process per data partition and the SN system becomes the equivalent of a SN system for the load process.

    When considering the case of complex queries, then the second limitation of SN you mention (data shipping) becomes the main limiting factor. Again in a DW use case, complex queries that include many tens of joins are the norm – queries with 20-30 are not unusual. Thus the output of each join action will probably not be partitioned on the same key as the next join action and so cross-node shipping of data becomes the norm rather then the exception. The bottleneck for query performance then becomes the network between the nodes. Interestingly, this provides a performance profile for such queries which can provide a reduction in performance as the number of nodes in a SN cluster increases – for 2 nodes, 50% of the data has to be shipped, for 10 nodes, that becomes 90%. I have seen situations where scaling the cluster on SN is near linear until the network is saturated, after that, performance will be flat on extending the cluster. However, as congestion on the network increases, performance will then degrade and so the addition of an node to an SN cluster can result in slower query times.

    A final point I would add is that a major factor effecting the performance of many implementations of SN in data warehousing is data skew. Where the partitioning (or sharding) is determined by the value of a data item then if that data item has anything other than a symmetric distribution then it’s likely that the partitions will have unequal sizes and thus unequal scan times. This is not mitigated by hashing (as hashing changes data location but not distribution) but is mitigated by a round-robin partitioning algorithm.

  6. Dominique De Vito July 6th, 2011
    14:10 GMT

    Good article. Thanks.

    Let’s mention also, here, the denormalization method that is one method, used in SN architecture, for enabling all related data to hold on the same node.

  7. ben July 7th, 2011
    7:26 GMT

    That’s very true Dominique and a point close to my own heart. At RBS we’ve put a lot of effort into the problem of balancing replication and partitioning with the result being the Connected Replication pattern discussed here. The pattern splits data into a star schema and denormalises only the Dimensions. The trick being that the absolute minimum amount of Dimension data is replicated around the grid by constantly tracking what Dimensions are connected to Facts. I’ll update the post to reflect your point. Thanks

Have your say

XHTML: You can use these tags: <a href="" title=""> <abbr title=""> <acronym title=""> <b> <blockquote cite=""> <cite> <code> <del datetime=""> <em> <i> <q cite=""> <strike> <strong>

Safari hates me
IMPORTANT! To be able to proceed, you need to solve the following simple problem (so we know that you are a human) :-)

Add the numbers ( 14 + 15 ) and SUBTRACT two ?
Please leave these two fields as-is: