Archive for the ‘Links’ Category
Sunday, October 28th, 2012
Here are some of the highlights of the 210 papers presented at VLDB earlier this year. You can find the full list here.
From Cooperative Scans to Predictive Buffer Management (here)
Intriguing paper from the Vectorwise guys on improving IO efficiency under load. LRU/MRU caching policies are known to break down under large, concurrent workloads. SQL Server and DB2 both have mechanisms for sharing IO between queries (by attaching to an existing scan or throttling faster queries so that IO can be shared). The Cooperative Scans approach discussed here takes this a step further by incorporating an active buffer manager with which scans register their interest in data. The manager then adaptively chooses which pages to load and pass to the various concurrent requests.
There is another related paper at this conference: SharedDB: Killing One Thousand Queries With One Stone (here)
Processing a Trillion Cells per Mouse Click (Google) (here)
Interesting paper from Google suggesting an alternative to the approach to column orientation taken in Dremel. PowerDrill uses a double-dictionary encoded column store where the encodings live largely in memory. Further optimisations are made at load time to ensure minimal access to persistent storage. This makes it more akin to column stores like ParAccel or Vectorwise, applied to analytical workloads (aggregates, group bys etc).
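To make the double-dictionary idea concrete, here is a minimal sketch. It assumes a two-level layout in which a global dictionary maps each distinct value to an integer id and each chunk keeps its own small dictionary of the global ids it contains. The function names and the chunking are illustrative assumptions, not PowerDrill's actual implementation.

```python
import bisect


def dictionary_encode(values):
    """Return (sorted distinct values, per-row integer codes)."""
    dictionary = sorted(set(values))
    position = {value: code for code, value in enumerate(dictionary)}
    return dictionary, [position[value] for value in values]


def double_dictionary_encode(values, chunk_size):
    """Encode a column with a global dictionary plus one dictionary per chunk.

    Hypothetical layout: each chunk stores (chunk_dict, codes), where
    chunk_dict lists the global ids present in the chunk and codes index
    into chunk_dict.
    """
    global_dict, global_ids = dictionary_encode(values)
    chunks = []
    for start in range(0, len(global_ids), chunk_size):
        chunks.append(dictionary_encode(global_ids[start:start + chunk_size]))
    return global_dict, chunks


def _lookup(sorted_list, item):
    """Binary-search a sorted list; return the index of item or None."""
    i = bisect.bisect_left(sorted_list, item)
    return i if i < len(sorted_list) and sorted_list[i] == item else None


def count_equal(global_dict, chunks, value):
    """Count rows equal to value, skipping chunks that cannot contain it."""
    global_id = _lookup(global_dict, value)
    if global_id is None:
        return 0
    total = 0
    for chunk_dict, codes in chunks:
        local_code = _lookup(chunk_dict, global_id)
        if local_code is None:
            continue  # value absent from this chunk: no need to touch its rows
        total += codes.count(local_code)
    return total


def decode(global_dict, chunks):
    """Rebuild the original column from the two dictionary levels."""
    return [global_dict[chunk_dict[code]]
            for chunk_dict, codes in chunks
            for code in codes]
```

The per-chunk dictionaries are what allow whole chunks to be skipped for a selective filter, and they keep the per-row codes small.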
Can the elephants handle the NoSQL onslaught (here)
Another paper comparing the performance of Hadoop with a relational database (in a similar vein to the SIGMOD 09 paper DeWitt published previously here). I sympathise with the message – databases outperform Hadoop on small to medium workloads – but I hope that most people know that already. This time the comparison is with Microsoft’s SQL Server PDW (Parallel Data Warehouse). The choice of data sizes between 250GB and 16TB means that the study has the same failing as the previous SIGMOD one; it’s not looking at large dataset performance.
Interactive Query Processing in Big Data Systems: A Cross-Industry Study of MapReduce Workloads (here)
Useful, empirically driven paper. Chen et al. studied Hadoop deployments at a number of companies, including Facebook, and provide detailed data sets from them. It hints at the current ‘elephant in the room’: Hadoop’s focus on batch-time over real-time performance (roll on Impala!). Having data at this level of granularity over a range of real-time systems is in itself quite valuable. They note that 90% of jobs are small (resulting in MBs of data returned).
High-Performance Concurrency Control Mechanisms for Main-Memory Databases (here)
Proposes an optimistic MVCC method for in-memory concurrency control. The conclusion: single-version locking performs well only when transactions are short and contention is low; higher contention or workloads including some long transactions favor the multiversion methods, and the optimistic method performs better than the pessimistic one.
Blink and It’s Done: Interactive Queries on Very Large Data (here)
Blink is different to the mainstream database as it’s not designed to give you an exact answer. Instead you specify either error (confidence) or maximum time constraints on your query. The approach uses a number of sampling based strategies to achieve the required confidence level. There is a related paper: Model-based Integration of Past & Future in TimeTravel (here)
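As a rough illustration of answering a query from a sample with an error bound, here is a minimal sketch that estimates a mean from a uniform random sample and reports a margin of error at a chosen confidence level. It assumes plain uniform sampling and a normal approximation; the function name and parameters are hypothetical, and the sampling strategies in the paper are more sophisticated than this.

```python
import math
import random
import statistics


def approximate_mean(data, sample_size, confidence=0.95, seed=None):
    """Estimate the mean of data from a uniform sample.

    Returns (estimate, margin): the true mean is expected to lie within
    estimate +/- margin at roughly the given confidence level, using a
    normal approximation. Requires sample_size >= 2.
    """
    rng = random.Random(seed)
    sample = rng.sample(data, sample_size)  # sampling without replacement
    estimate = statistics.fmean(sample)
    std_error = statistics.stdev(sample) / math.sqrt(sample_size)
    # Two-sided critical value, e.g. about 1.96 for 95% confidence.
    z = statistics.NormalDist().inv_cdf(0.5 + confidence / 2)
    return estimate, z * std_error


if __name__ == "__main__":
    rows = list(range(1_000_000))
    estimate, margin = approximate_mean(rows, sample_size=10_000, seed=1)
    print(f"mean is about {estimate:.0f} +/- {margin:.0f} (95% confidence)")
```

Quadrupling the sample size roughly halves the margin, which is the trade-off between response time and error that such a system exposes to the user.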
Developing and Analyzing XSDs through BonXai (here)
B+-tree Index Optimization by Exploiting Internal Parallelism of Flash-based Solid State Drives (here)
SSD performance increases (initially) with the number of concurrent executions (in stark contrast with magnetic drives). This paper looks into maximising this with the use of concurrent B-trees that utilise parallel IO. Useful research as flash is only going to get cheaper.
SCOUT: Prefetching for Latent Structure Following Queries (here)
I quite like the ideas in this paper around prefetching data based on a known structure (probably because it’s similar to some of the stuff we do).
Fast Updates on Read-Optimized Databases Using Multi-Core CPUs (here)
Addresses the problem some columnar architectures suffer where they accumulate writes in a separate partition, which must be periodically merged with the read-optimised main one.
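The main/delta arrangement described above can be sketched as follows: a sorted, read-optimised main store, an append-only write buffer, reads that consult both, and a periodic merge. This is a single-threaded illustration with made-up class and method names, not the paper's multi-core merge algorithm.

```python
import bisect
from heapq import merge


class DeltaColumn:
    """A column with a sorted main store and an unsorted write delta."""

    def __init__(self, values=()):
        self.main = sorted(values)  # read-optimised: kept sorted
        self.delta = []             # write-optimised: append only

    def insert(self, value):
        """Writes are cheap: they only touch the delta."""
        self.delta.append(value)

    def count_range(self, low, high):
        """Count values in [low, high]; reads must consult both parts."""
        in_main = (bisect.bisect_right(self.main, high)
                   - bisect.bisect_left(self.main, low))
        in_delta = sum(1 for value in self.delta if low <= value <= high)
        return in_main + in_delta

    def merge_delta(self):
        """Fold the delta into the main store and empty it."""
        self.main = list(merge(self.main, sorted(self.delta)))
        self.delta = []
```

The longer the delta grows, the more every read pays for the linear scan over it, which is why the cost of the merge step matters.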
FDB: A Query Engine for Factorised Relational Databases (here)
I hadn’t come across the idea of Factorised Databases before. An interesting concept. The paper demonstrates performance improvements over traditional methods for many-to-many join criteria.
Only Aggressive Elephants are Fast Elephants (here)
Interesting approach to indexing Hadoop that claims to improve both read and write performance. I couldn’t find the code though so couldn’t try it.
The Vertica Analytic Database: C-Store 7 Years Later (here)
A good summary of this mature shared-nothing, columnar database. They discuss their use of super projections over join indexes, due to the overheads associated with tuple construction and the verbosity of storing the associated rowids. There is a summary of the encoding types used as well as partitioning and locking strategies.
Muppet: MapReduce-Style Processing of Fast Data (here)
Whilst the majority of MapReduce commentary focuses on improving MR query performance, this paper looks at the problem of ingesting data quickly for high throughput, streaming workloads. The interesting approach focuses on data as streams (in and out) in association with a moving historical window (which they call a slate). To me there seems to be a lot of similarity between this approach and the one taken by products like StreamBase and Cloudscale, but the authors differentiate themselves by being less schema oriented, more akin to the traditional MR style.
Serializable Snapshot Isolation in PostgreSQL (here)
Interesting paper on the implementation of serializable isolation using the snapshot model.
Other papers of note:
- Minuet: A Scalable Distributed Multiversion B-Tree (here)
- A Statistical Approach Towards Robust Progress Estimation (here)
- Efficient Multi-way Theta-Join Processing Using MapReduce (here)
- Avatara: OLAP for Web-scale Analytics Products (OLAP cubes over a NoSQL @LinkedIn) (here)
- 10 Year Best Paper Award: Approximate Frequency Counts over Data Streams (here)
Saturday, December 31st, 2011
- Intel managing to squeeze 50 cores on a single chip, breaking through the teraflop boundary as they do so: Brier Dudley’s Blog | Wow: Intel unveils 1 teraflop chip with 50-plus cores | Seattle Times Newspaper
- RISC architectures have had a renaissance thanks largely to the needs of the mobile sector. Could their low power consumption make them a serious contender for the enterprise space? x86 Faces Unexpected RISC Competition
- AMD announce 4 memory channels allowing massive addressable spaces up to 364GB per CMP : AMD’s Interlagos and Valencia finally emerge
- Anyone who follows my blog will know of my belief in large address spaces reshaping the landscape, certainly for enterprise applications. This article echoes these views: Megatrend: Cheap RAM Reshaping All of Computing | Dr Dobb’s
- IBM’s Lime is an interesting approach to simplifying the programming of secondary devices. See Lime paper and the related Liquid Metal project.
- JVM on FPGA: JOP: A Tiny Java Processor Core for FPGA
- An interesting paper on using FPGA for Monte Carlo Simulation: FPGA for monte carlos
High Performance Java
- An excellent talk about using memory efficiently in Java applications, showing that the costs are often higher than we think. It includes clear descriptions of the footprint of all Java objects and utilities: Building Memory Efficient Java Applications
- There has been a flurry of activity coming from Azul Systems recently. Most notably the release of Zing, their JVM with a pauseless garbage collector. Gil Tene’s talk about the State of the Art in GC from QCon SF 2011 is one of the best I’ve seen (QConSF 2011: State of the Art in Garbage Collection).
- Azul have also recently released jHiccup, an interesting utility that measures operating system stalls. Java Developer Tools: jHiccup Java Performance Analysis
- Charles Nutter’s comments on his favourite JVM flags including my favourite (-XX:+PrintOptoAssembly): Headius: My Favorite Hotspot JVM Flags
Distributed Data Storage
- A great paper from VLDB describing an approach for balancing replication and partitioning, something close to my own heart: Schism: a Workload-Driven Approach to Database Replication and Partitioning
- Hasso Plattner (the P in SAP) wrote this paper which provides an insightful view of where he believes the field should be going (and of course SAP’s solution Hana): Hasso Plattner on In-Memory OLAP & OLTP
- I enjoyed watching this talk about Mongo: InfoQ: Scaling with MongoDB
- An entertaining article from the Economist about David Gelernter’s predictions of the future of computing: Brain scan: Seer of the mirror world | The Economist
- Could Prezi really dislodge PowerPoint? Prezi
- Double Loop Learning – a different view on organizational learning. Chris Argyris.
- Worth reading if you are not familiar with the idea already: CQRS
- An interesting twist on the traditional storyboard approach Our Story Board is Better Than Yours… I’m a big fan of replacing estimation with uniformly sized stories.
- Booked your next holiday? What about a Code Retreat with Corey Haines
Tuesday, October 25th, 2011
High Performance Java
- Not exactly lightweight reading but one of the most detailed and influential papers on tuning your software for processing efficiency: What Developers Should Understand About Memory
- If you read the above and want to put some of it into action then VTune should be your next port of call. Diagnostic software for CPU cache hits etc: VTune™ Amplifier XE 2011 from Intel – Intel® Software Network
- When it really won’t go any faster, look at the Assembler: Deep dive into assembly code from Java | Java.net
- In anticipation of G1 (in case they ever get it finished) here’s the original paper with anticipated performance figures: G1 paper with figures
- A different approach to GC using processor specific minor collections (in Haskell): Multicore Garbage Collection with Local Heaps
Distributed Data Storage:
- The new Oracle NoSQL database – this is the best article I’ve read summarising its position in the market: DBMS Musings: Overview of the Oracle NoSQL Database
- The official Oracle NoSQL Whitepaper: Oracle NoSQL Database White Paper
- An interesting approach to data storage: an FPGA based data warehouse: FPGA Data Warehouse
- Google’s interesting SQL wrapped MapReduce framework: Tenzing A SQL Implementation On The MapReduce Framework
- The Actors Model – just in case you’re not familiar with it: Actors model for distribution
- Gluster – an open source distributed file system: Gluster
- Running Cuda natively on x86 processors: Running CUDA Code Natively on x86 Processors | Dr Dobb’s Journal
- Thinking about using 64bit JVMs with compressed pointers : 32-bit or 64-bit JVM? How about a Hybrid?
- Using different caches for read and write. A sensible pattern for Coherence implementations: Alexey Ragozin’s Blog
- OCZ Z-Drive – an interesting and competitively priced alternative to FusionIO:
- The architecture of the transputer. An interesting reflection on a couple of Bristol’s finest exports (other than Portishead): the Transputer and the Occam programming language. David May, parallel processing pioneer • reghardware
- Is your brain like an iPhone? Is Your Brain Like an iPhone? Which App is Running Now? – Novato, CA Patch
- Just be still for once: No Shame in Stillness « Under the Apricot Tree
- Of the huge amount of writing about Steve Jobs I thought the Economist’s coverage was the best: Steve Jobs: The magician | The Economist
- Scott Marcar’s thought-provoking dialog on technology through a financial crisis: The Long Haul: Scott Marcar Leads RBS’ Tech Team Through the Financial Crisis- WatersTechnology.com
- Short but thought-provoking article on company culture: Why You Should Question Your Culture – Ron Ashkenas – Harvard Business Review
Wednesday, July 20th, 2011
Because the future will inevitably be in-memory databases:
- SAP (slightly weirdly) is leading the way with Hana
- SSD makes a new kind of database possible
- The move away from clusters is not restricted to the enterprise
- More drinking of the Hana Kool-Aid
- Fusion IO
- Phase Change Memory breakthrough at IBM
Other interesting stuff:
- Interesting retrospective on computing giants of the past and future (in typical Economist style)
- A mathematician’s lament
- The next generation of Map Reduce
- Where Google may be going wrong
Sunday, February 20th, 2011
- Nice talk covering optimising code in a single JVM: LMAX
- Biased locking in Hotspot: biased_locking_in_hotspot
- Good overview of caching: intel-cpu-caches
- Good overview of lock free algorithms: lock-free-algorithms
- Nice overview of the key NoSQL players: cassandra-vs-mongodb-vs-couchdb-vs-redis
- Google’s layering of ACID over BigTable (at least ACID inside a partition):
- Typically Economist: economist.com
Monday, January 3rd, 2011
More discussions on the move to in memory storage:
- RAM is my friend
- LMAX – How to Do 100K TPS at Less than 1ms Latency
- The problems with ACID, and how to fix them without going NoSQL
- Basho Riak: An Open Source Scalable Data Store
- Facebook’s belief in HBase
- Numbers Everyone Should Know
- Google Dremel Paper
- Facebook’s New Year Performance Stats