Parallel Distributed Collections API
Josh Suereth and Daniel Mahler
This talk outlines the construction of a Parallel Distributed Collections API (Cascade) and outlines the challenges associated with developing a Scala API over the existing Java solution (Flume). The details of the JavaFlume library are discussed, including the relevant parallel collections abstractions and the parallel operations allowed against these collections. The current implementation of JavaFlume works with distributed sharded files and Google's BigTable.
The optimisation engine for Flume is its defining feature. High level operations like sort, map, reduce and join can be reduced into a series of map-reductions and executed against a cluster of machines. This allows users of the library to develop complex parallel operations in peicemeal fashion and construct a pipeline of data processing.
Cascade is a Scala built on top of Flume to take advantage of its functional nature. Challenges in developing Cascade and practical solutions will be examined in depth. Cascade presents a new way of performing Map Reduce operations that is innovate and elegant.