Paper review: MapReduce and Dryad

September 30, 2011

This is another combined paper review, because the ideas are again pretty similar, and it's a useful compare/contrast. The first is the famous MapReduce paper from Google, and the second is Microsoft's response, Dryad.

  • "MapReduce: Simplified Data Processing on Large Clusters", Dean and Ghemawat. Published at OSDI 2004.
  • "Dryad: Distributed Data-Parallel Programs from Sequential Building Blocks", Isard et al. Published at Eurosys 2007.

Main ideas from MapReduce

MapReduce is a parallel data processing framework initially designed for a very specific task: scanning large amounts of textual data to build a web search index. It was essentially codesigned with GFS for this purpose. As a result, it boils computation down to just two phases: map, followed by reduce. Programmers write just two functions, one for each phase, and the framework runs them in massively parallel fashion: mappers run the map function, and their output is fed to reducers, which run the reduce function. Mappers are scheduled for data locality, moving computation to where the data is stored to minimize network communication. The map phase performs some data-parallel operation, while the reduce phase aggregates the mappers' results to produce the final output.
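
To make the model concrete, here's a minimal single-process sketch in Python of the paper's canonical word-count example. This is a toy of my own, not Google's actual C++ API: `run_mapreduce` stands in for the framework, and the `wc_*` functions are the two pieces the programmer writes.

```python
from collections import defaultdict

# The two functions a MapReduce programmer writes, for the paper's
# canonical example: counting word occurrences.

def wc_map(doc_name, contents):
    # Map: emit an intermediate (word, 1) pair for every word seen.
    for word in contents.split():
        yield word, 1

def wc_reduce(word, counts):
    # Reduce: aggregate every value emitted under the same key.
    yield word, sum(counts)

# Toy single-process driver standing in for the framework, which in
# reality shards the input across thousands of mappers and shuffles
# intermediate pairs to reducers by key.
def run_mapreduce(inputs, map_fn, reduce_fn):
    groups = defaultdict(list)
    for name, contents in inputs:
        for key, value in map_fn(name, contents):
            groups[key].append(value)            # the "shuffle" phase
    return {k: v for key in groups for k, v in reduce_fn(key, groups[key])}

docs = [("d1", "to be or not to be"), ("d2", "be happy")]
print(run_mapreduce(docs, wc_map, wc_reduce))
# {'to': 2, 'be': 3, 'or': 1, 'not': 1, 'happy': 1}
```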

The cool parts of this paper are twofold: first, that such a simple, limited programming model can accommodate such a wide variety of tasks; and second, that almost all of the complexity of running code on thousands of machines can be abstracted away from the programmer.

Regarding the programming model, it's something that can be taught in a matter of days. Map and reduce are familiar from functional programming languages, and very small amounts of code can do very powerful things. The model is quite limited (it only works for data-parallel operations), but when dealing with big data, your operations basically have to be data-parallel to complete in any reasonable time. Google used MapReduce for a wide variety of tasks, so the proof is in the pudding.
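
For instance, another of the paper's listed uses, the inverted index, is just a different pair of user functions plugged into the same toy driver sketched above:

```python
# Building an inverted index: only the two user functions change;
# the toy run_mapreduce driver and docs from above are reused.

def ii_map(doc_name, contents):
    # Emit (word, document) for every word occurrence.
    for word in contents.split():
        yield word, doc_name

def ii_reduce(word, doc_names):
    # Collect the distinct documents containing each word.
    yield word, sorted(set(doc_names))

print(run_mapreduce(docs, ii_map, ii_reduce))
# {'to': ['d1'], 'be': ['d1', 'd2'], 'or': ['d1'], ...}
```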

The distributed, fault-tolerant framework is what really drew me in. A single master manages all the workers (which potentially means a single point of failure), but a master failure is far less likely than a worker failure. Worker failures are handled transparently, by re-executing the failed worker's tasks on another machine. This is possible because the output from the map stage is written to disk before being read by the reducers, and mapper input is durably stored as well, so tasks can always be safely restarted. Another feature I liked is how the framework chops off the latency tail caused by "straggler" workers: near the end of a job, it launches backup copies of the remaining tasks and takes whichever copy finishes first.
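
Here's a toy sketch of that backup-task idea, under my own simplifying assumptions (the real scheduler tracks per-task progress and only duplicates the stragglers once a job is nearly done):

```python
import concurrent.futures as cf
import random
import time

def task(x):
    # Stand-in for a map task; occasionally a straggler runs ~50x slower.
    time.sleep(5.0 if random.random() < 0.1 else 0.1)
    return x * x

def run_with_backup(pool, fn, arg):
    # Launch two copies and keep whichever finishes first. Re-running
    # is safe because each task is a deterministic function of durably
    # stored input; in the real system each attempt writes to a
    # private file, so the loser's output is simply discarded.
    copies = [pool.submit(fn, arg), pool.submit(fn, arg)]
    done, _ = cf.wait(copies, return_when=cf.FIRST_COMPLETED)
    return next(iter(done)).result()

with cf.ThreadPoolExecutor(max_workers=8) as pool:
    print([run_with_backup(pool, task, x) for x in range(4)])  # [0, 1, 4, 9]
```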

In sum, MapReduce is a simple yet powerful framework that saw heavy use at Google across a wide variety of tasks.

Main ideas from Dryad

Dryad is what some people see as "MapReduce done right", though that's a contentious claim. It's a more general framework in two important ways. First, it allows more general styles of computation: more than just two phases, and more than just map and reduce elements in the graph. Second, it allows communication between stages to happen over more than just files stored in the DFS: Dryad lets sockets, shared memory, and pipes be used as channels between elements. A Dryad job ultimately looks like a DAG of user-defined vertices, with data flowing between them over the programmer's choice of channels. This has a number of benefits: more efficient communication, and the ability to chain together multiple stages and express more complicated computations.
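
A sketch of what describing such a job might look like, with the caveat that the names and API here are entirely mine (the real interface is C++ with overloaded graph-composition operators):

```python
from dataclasses import dataclass, field

# Toy sketch of a Dryad-style job description: a DAG of user-defined
# vertices, where each edge independently chooses its channel type.

@dataclass
class Vertex:
    program: str                          # the user-defined code to run
    inputs: list = field(default_factory=list)

def connect(src, dst, channel="file"):
    # channel might be "file", "tcp", or "fifo" (shared memory); the
    # graph's shape is independent of each edge's transport.
    dst.inputs.append((src, channel))
    return dst

# A join-then-aggregate job that MapReduce can't express in one pass:
logs = Vertex("parse_logs")
users = Vertex("parse_users")
joined = Vertex("join")
connect(logs, joined, channel="tcp")
connect(users, joined, channel="tcp")
result = connect(joined, Vertex("aggregate"), channel="fifo")
```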

This generality leads to a number of complications. Dryad does subsume the MapReduce paradigm, but with generality comes complexity. The programming model is nowhere near as simple (the authors cite "a couple weeks" to get started), and to me it feels like doing the work of a database: designing all the elements and the communication between them in a physical query plan, then optimizing it. The paper even includes a direct comparison against SQL Server, showing that the two produce similar query plans, with Dryad coming out a bit faster. Doing this really isn't simple at all, and the example queries they show do nothing to suggest otherwise. I consider dataflow programming (which is essentially what this is) difficult for most programmers to reason about.

Dryad also incorporates the same kind of fault tolerance as MapReduce, and is able to restart failed vertices correctly. It also has this idea of "dynamic runtime optimization", where the job graph is rewritten as it runs (for example, inserting aggregation trees based on where the data actually lives), which sounds very database-like. That's hard for MapReduce to do, because user-supplied map and reduce functions are opaque to the system, the equivalent of UDFs in a database.

Comparison and evaluation

My impression is that people were pretty unhappy with Dryad when it came out. It's not nearly as elegant as MapReduce, there aren't any cool operational insights, and it feels very "me too". However, as stated in the Dryad paper, programmers aren't really meant to interface with Dryad directly; they're supposed to use things like DryadLINQ, which turns declarative LINQ queries into Dryad execution graphs, exactly what everyone wanted. The same is true for MapReduce: FlumeJava has seen heavy use at Google, Hive dominates Hadoop workloads at Facebook, and Pig plays the same role at Yahoo. As nice and "simple" as MapReduce is compared to Dryad, no one is directly programming against either these days; instead everyone does the DB-like thing and uses declarative query languages.
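
As a toy illustration of what these layers buy you (my own invention, not the actual DryadLINQ, FlumeJava, or Hive API): the user writes a high-level query, and a planner lowers it to dataflow stages.

```python
# Toy "declarative layer": the user chains high-level operators and
# the system lowers them to dataflow stages.

class Pipeline:
    def __init__(self, records):
        self.stages = [("read", records)]

    def group_by(self, key_fn):
        self.stages.append(("shuffle", key_fn))  # lowers to a map + shuffle
        return self

    def count(self):
        self.stages.append(("reduce", len))      # lowers to a reduce
        return self

    def run(self):
        # A real planner would emit a Dryad graph or a MapReduce job
        # here; this toy just evaluates the plan in-process.
        (_, records), (_, key_fn), (_, agg) = self.stages
        groups = {}
        for r in records:
            groups.setdefault(key_fn(r), []).append(r)
        return {k: agg(v) for k, v in groups.items()}

# The moral equivalent of SELECT word, COUNT(*) FROM words GROUP BY word:
words = "to be or not to be".split()
print(Pipeline(words).group_by(lambda w: w).count().run())
# {'to': 2, 'be': 2, 'or': 1, 'not': 1}
```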

Dryad also correctly identified the real flaws in MapReduce, flaws that have to be papered over and hacked around to get the same kind of performance and generality. Hadoop is going to have to become more memory-aware to eke out additional performance, and there are "workflow management" tools whose whole job is chaining multiple MapReduce jobs into multi-stage pipelines. As long as the user never has to worry about the details, declarative execution engines built on top of Dryad rather than Hadoop have an advantage.
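
That chaining amounts to something like the following sketch, reusing the toy driver and word-count functions from earlier: each stage is a whole MapReduce job whose output is materialized (to GFS/HDFS in reality, a dict here) and re-read by the next job, which is exactly the round trip through the DFS that Dryad's socket and FIFO channels avoid.

```python
# Job 2 consumes job 1's materialized output: invert (word, count)
# pairs to build a histogram of how many words occur N times.

def hist_map(word, count):
    yield count, word

def hist_reduce(count, words):
    yield count, len(words)

counts = run_mapreduce(docs, wc_map, wc_reduce)                   # job 1
histogram = run_mapreduce(counts.items(), hist_map, hist_reduce)  # job 2
print(histogram)  # {2: 1, 3: 1, 1: 3}
```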

In terms of future relevance, I think that the basic idea of hiding faults and communication from the programmer is totally the right idea. It's way easier to write programs within a Dryad or MapReduce framework than with something like MPI, which hid nothing. The DB community had it right, though, in calling for declarative query languages, and Hadoop and MapReduce these days are essentially being used as distributed query execution engines. I think we're going to see a wider variety of query languages in the future, since there's a tradeoff between generality and simplicity; I doubt Hive and FlumeJava are the final word. There's also room for other types of query execution engines; Pregel's BSP is an example.
