CS 385 Lab 13, Spring 2009

Due date: Wednesday, May 6, 1:00 PM

Reading

In preparation for this lab, you should read the following article:

Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters." Proceedings of the Symposium on Operating System Design and Implementation (OSDI), 2004. [pdf]

This article describes a programming model for processing large-scale data sets in parallel on distributed systems. This interface is based on features found in functional languages, but it should be understandable even if you haven't taken programming languages. Here are my tips for reading the article:

Read the abstract and introduction. Try to put each idea into your own words and to put it in the context of your experiences with parallel programming. Note the things you don't understand: is the problem with terminology, or do you not follow the argument? Generally, it's worth understanding the introduction of a paper as well as you possibly can, or you're likely to get lost later.
Look over the conclusions. This paper discusses MapReduce in the context of large-scale clusters; are the concerns expressed in the conclusion relevant to shared-memory multicore processors? Is MapReduce even relevant for multicore processors? (This is more than a rhetorical question; see this 2007 paper if you're interested in the answer.)
Read Section 2. Read the introduction to this section, and make sure you understand the concept of key/value pairs. If you've never encountered this idea before, search for "associative arrays" online. Then go over Section 2.1 in detail. Read Section 2.2, and go over the examples in Section 2.3: what problems are they trying to solve? How do they use the MapReduce framework to solve them?
Read Section 3. The hardware details aren't important to us (although they are interesting); focus on the division of labor among masters and workers as well as how fault tolerance is implemented.
Skim Section 4. Thoroughly understand at least one of the subsections.
Sections 5, 6, and 7 are strictly for the curious. This paper is well worth reading in its entirety; these sections are just out of the scope of our discussion in this class.

Writeup

You will turn in your writeup electronically, as a text file. To get a good grade on the writeup, your answers should be in your own words and should show that you have worked to understand the article and that you are prepared for discussion.

Answer all of the questions in the following list:

What issues or concerns in this paper are specific to large-scale distributed systems and are different for shared-memory multicores?
For the word count program in Section 2.1, what would be the output of the map stage and the reduce stage when applied to a file with contents "word1 word1 word2"?
Write pseudo-MapReduce code for any of the examples in Section 2.3.
What is the master's job? What role does it play in fault tolerance?
In your own words, summarize one of the refinements described in Section 4.

Submitting your writeup

Make sure that your writeup is named yourlastnameL13.txt, and then copy it to ~srivoire/cs385/submit. Wait 2 minutes, and then check that it was correctly submitted by visiting http://rivoire.cs.sonoma.edu/cs385/lab13sub.txt.