Coherence: The Fallacy of Linear Scalability

I stated in a previous post: “Desert Island Disks Top 3 reasons for using Coherence have to be: Speed, Scalability and Fault Tolerance”. There are good reasons for this statement (discussed further here) but in some ways it is a little naive, particularly when considering scalability.

The underpinning of Coherence’s scalability is that adding a machine to a cluster proportionally increases the amount of CPU, network bandwidth and storage (memory) available. This is, of course, true; however, the statement is only of real worth if the mechanisms used for reading and writing data scale too.

Coherence uses a Shared Nothing architecture and, as is typical with this style, basic reads (cache.get(key)) and writes (cache.put(key, val)) scale linearly as the number of nodes in the cluster is increased. This scaling leverages the fact that data is partitioned (spread) across the available machines. Any single read or write operation is simply routed to the machine that owns the relevant partition, i.e. only one machine is ever asked to service a single ‘get’.
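The routing idea can be sketched in plain Java. The sketch below is a simplified stand-in for Coherence's internal partitioning (the real partition count and partition-to-member assignment are managed by the partitioned cache service, not by a simple modulo); it only illustrates why a key-based get or put lands on exactly one machine.

```java
import java.util.List;

public class PartitionRouting {
    // Map a key to one of `partitionCount` partitions via its hash.
    // A simplified stand-in for Coherence's key-to-partition mapping.
    static int partitionFor(Object key, int partitionCount) {
        return Math.floorMod(key.hashCode(), partitionCount);
    }

    // Each partition is owned by exactly one node, so any single get or
    // put is serviced by exactly one machine in the cluster.
    static int ownerOf(Object key, int partitionCount, int nodeCount) {
        return partitionFor(key, partitionCount) % nodeCount;
    }

    public static void main(String[] args) {
        int partitions = 257, nodes = 4;
        for (String key : List.of("order:1", "order:2", "order:3")) {
            System.out.println(key + " -> node "
                + ownerOf(key, partitions, nodes));
        }
    }
}
```

Because the mapping is deterministic, any node can compute the owner locally and send the request straight there; no broadcast is needed.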

The problem is that, in real world use cases, ‘get’ and ‘put’ are not enough. Users inevitably evolve more complex access patterns that necessitate the use of queries (by queries I mean non-key-based access to data)… and queries don’t scale in the same way.

When Coherence executes a query, it cannot leverage the hashing algorithm used to partition the data, because it has no way of knowing in advance which partitions hold matching entries. The query must therefore be sent to every node in the cluster. This means the number of queries serviced by each machine equals the number of queries serviced by the system as a whole; adding machines does not reduce the per-machine query load. The implications for scalability are obvious.
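A toy cost model (plain Java, not Coherence code) makes the asymmetry concrete: growing the cluster shrinks the slice of data each node must scan per query, but it never reduces the number of queries each node must service.

```java
public class QueryFanOut {
    // Toy shared-nothing cost model: `totalEntries` spread evenly over
    // `nodes` machines. A query fans out to all nodes, but each node
    // only scans its own slice of the data.
    static int entriesScannedPerNodePerQuery(int totalEntries, int nodes) {
        return totalEntries / nodes;   // shrinks as the cluster grows
    }

    // Every node must service every query, so per-node query load does
    // not improve with cluster size.
    static int queriesServicedPerNode(int totalQueries) {
        return totalQueries;
    }

    public static void main(String[] args) {
        for (int nodes : new int[] {2, 4, 8}) {
            System.out.println(nodes + " nodes: "
                + entriesScannedPerNodePerQuery(1_000_000, nodes)
                + " entries scanned per node per query, "
                + queriesServicedPerNode(100)
                + " of 100 queries serviced per node");
        }
    }
}
```

The first number is why technique two in the list below (a bigger cluster) helps response times; the second is why it cannot fix throughput.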

So how do you manage this problem in Coherence? There are a few techniques you can use:

  • Try to use key based access instead of queries wherever possible.
  • Increase the cluster size so that the amount of data serviced by each node is reduced. This decreases the response time for each request and thus the overall load on each server. It is however somewhat expensive and wasteful.
  • Use a Partition Filter to paginate over the available partitions. This spreads the query over a longer time frame reducing the risk of load spikes.

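In Coherence the third technique is done by wrapping the query in a PartitionedFilter with a PartitionSet covering the current page of partitions. The plain-Java sketch below mimics the idea with a hypothetical `pagedQuery` helper: only `pageSize` partitions are examined per pass, so the instantaneous load is bounded while the full result set is still assembled.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class PartitionPagination {
    // Hypothetical helper: run a query over the partitions one page at a
    // time. Only `pageSize` partitions are examined per pass, spreading
    // the work over a longer time frame and reducing load spikes.
    static List<Integer> pagedQuery(Map<Integer, List<Integer>> partitions,
                                    int pageSize) {
        List<Integer> results = new ArrayList<>();
        List<Integer> ids = new ArrayList<>(partitions.keySet());
        for (int start = 0; start < ids.size(); start += pageSize) {
            int end = Math.min(start + pageSize, ids.size());
            for (int i = start; i < end; i++) {
                for (int value : partitions.get(ids.get(i))) {
                    if (value % 2 == 0) {      // the query predicate
                        results.add(value);
                    }
                }
            }
            // A real implementation might pause between pages to smooth load.
        }
        return results;
    }

    public static void main(String[] args) {
        Map<Integer, List<Integer>> parts = new HashMap<>();
        for (int p = 0; p < 8; p++) {
            parts.put(p, List.of(p, p + 10, p + 20));
        }
        System.out.println(pagedQuery(parts, 2).size());
    }
}
```

The result is identical whatever the page size; only the shape of the load changes. The trade-off is latency: the query as a whole takes longer to complete.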
You may be slightly disappointed with this list as it contains no ‘silver bullet’. The reason is that none of these techniques addresses the fundamental problem directly: it is intrinsic to the shared-nothing architectural style, and addressing it would require a change to the architecture at a macroscopic level. The techniques suggested here simply help postpone the onset of the problem.