Joins: using Snowflake Schemas & CQCs (Intermediate)
The application we recently built (ODC) departs slightly from the standard “dump the whole denormalised object into a cache” approach recommended for most caching implementations. The reason is that the project is not really a cache, it’s a data store, and as such we need to be able to manipulate different parts of the domain model independently. Holding objects in a denormalised form leads to the problem of how you keep all that denormalised data consistent. Data must be duplicated and this both eats memory (and much more so if you implement versioning) as well as making it very hard to implement any kind of consistency across all those copies.
To get around this we use a Star-Schema approach asa convenient mechansim for denoting whether an entity should be (a) a big thing that needs to partitioned accross the grid or (b) a smaller thing that we can afford to replicate in our query processing layer as there are simply not that many of them. In fact we take this model a step further my tracking the arcs on our domain model and only replicating those that are ‘connected’… but more on that later (see here).
In our case we split the application architecture into two layers: The Query Layer and the Data Layer. Dimensions are cached in Continuous Query Caches in the Query layer and the Facts are spread across the Coherence cluster. Related facts can be joined in-process as Key Affinity is used to ensure collocation (i.e. they are partitioned with the same key and Coherence uses the key to determine which partition they should go in).
The star-schema approach is appropriate as it allows dimensions to be changed in one place (atomically/isolated) in contrast to the denormalised approach, in which the dimension attributes would be held with the many facts that related to them leading to lots of duplication. Using this model you can store a complex relational model in Coherence and query it in an efficient way.
Efficiently Joining Facts and Dimensions in Coherence
The de facto standard for joining in Coherence is the 2 stage query (although it is really multi-stage). The first stage always hits the dimension tables for the query predicates. The second stage returns the facts based on the dimension sets. The third stage returns the dimensions required to present the users view.
If there are extra references internal to a dimension then these will elongate steps 1 and 3. The key point however is that we use Continuous Query Caches to ensure all joins to Dimensions are local.
Efficiently Joining Facts with other Facts in Coherence
If two fact entities must be joined then the join should be done object by object across the cluster. An Aggregator is used to run the computation with Affinity used to bind related records into the same JVM. The mechanism for doing this is documented fully in [i].
The Fact-Fact join approach is likely to be used in conjunction with the 2-stage query with the later composing in relevant dimension objects.
Managing Concurrency and Isolation in the Data Model
Coherence provides little support Atomic Transactions (beyond the granularity of a single object), Consistency nor Isolation. Thus we must simulate them as required (Recently Oracle has introduced a Transaction API but we still find it preferable to avoid the need for taking out distributed locks wherever possible).
The first tool for doing this is the implementation of MVCC[ii] within the data model. This is really just a fancy way of saying that we version objects and thus those objects are immutable. The benefit of such a model is that the view of the system can be defined, at any time, as the set of all objects whose version matches that point in time. The versioning also allows clients to determine consistency themselves (when writing a record you can ensure that there were no other concurrent updates from other users that might be overwritten).
As Fact objects enter the cache their content must be checked for referential integrity. This means that the user’s identification of every Dimension object must validated that:
- The Dimension is identifiable (i.e. it exists)
- It is unique (only one dimension is identified)
All dimensions are checked, in process, via the same CQCs used to speed up the 2-stage query. This has an unfortunate consequence: Changing a dimension will result in CQCs being updated across the cluster via one-phase commit. This presents a potential threat to atomaticity and isolation since the changing dimension will be incoherent across multiple JVMs during the one-phase commit. Fortunately query isolation in such a model is still ensured, from our perspective by making the simplifying assumption: consistency is only implemented within a single query processing node.
For more information see QCon presentation [iii].
[i] http://www.benstopford.com/2009/11/20/how-to-perform-efficient-cross-cache-joins-in-coherence/
[ii] http://en.wikipedia.org/wiki/Multiversion_concurrency_control
1 Comment
Jump to comment form | comments rss | trackback uri