CS 385 Lab 13, Spring 2009

Due date: Wednesday, May 6, 1:00 PM

Reading

In preparation for this lab, you should read the following article:

Jeffrey Dean and Sanjay Ghemawat, "MapReduce: Simplified Data Processing on Large Clusters." Proceedings of the Symposium on Operating Systems Design and Implementation (OSDI), 2004. [pdf]

This article describes a programming model for processing large-scale data sets in parallel on distributed systems. The model is based on features found in functional languages, but it should be understandable even if you haven't taken a programming languages course. Here are my tips for reading the article:

Writeup

You will turn in your writeup electronically, as a text file. To get a good grade on the writeup, your answers should be in your own words and should show that you have worked to understand the article and that you are prepared for discussion.

Answer all of the questions in the following list:

  1. Which issues or concerns in this paper are specific to large-scale distributed systems and would be different for shared-memory multicores?
  2. For the word count program in Section 2.1, what would be the output of the map stage and the reduce stage when applied to a file with contents "word1 word1 word2"?
  3. Write pseudo-MapReduce code for any of the examples in Section 2.3.
  4. What is the master's job? What role does it play in fault tolerance?
  5. In your own words, summarize one of the refinements described in Section 4.
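
As a reading aid for the questions above, the map, group-by-key, and reduce steps of the programming model can be sketched in a single Python process. This is only a minimal sketch under stated assumptions, not the paper's implementation: it ignores distribution, the master, and fault tolerance entirely. The names map_fn, reduce_fn, and run_mapreduce are hypothetical, and the toy job (maximum temperature per city) is deliberately not one of the paper's examples.

```python
from collections import defaultdict

def map_fn(key, value):
    """Map: turn one input record into intermediate (key, value) pairs.
    Assumption: key is a line number (unused), value is a line like "paris 21"."""
    city, temp = value.split()
    yield city, int(temp)

def reduce_fn(key, values):
    """Reduce: merge all intermediate values that share one key."""
    yield key, max(values)

def run_mapreduce(records, map_fn, reduce_fn):
    """Hypothetical single-process driver: map, group by key, then reduce."""
    # Map phase, with grouping by intermediate key done as pairs are emitted.
    intermediate = defaultdict(list)
    for k, v in records:
        for ik, iv in map_fn(k, v):
            intermediate[ik].append(iv)
    # Reduce phase: visit intermediate keys in sorted order.
    output = []
    for ik in sorted(intermediate):
        output.extend(reduce_fn(ik, intermediate[ik]))
    return output

lines = ["paris 21", "oslo 9", "paris 25", "oslo 12"]
result = run_mapreduce(list(enumerate(lines)), map_fn, reduce_fn)
print(result)  # [('oslo', 12), ('paris', 25)]
```

In the system the paper describes, the map and reduce calls run on many machines and the grouping step moves data across the network; the sketch keeps only the data flow.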

Submitting your writeup

Make sure that your writeup is named yourlastnameL13.txt, and then copy it to ~srivoire/cs385/submit. Wait 2 minutes, and then check that it was correctly submitted by visiting http://rivoire.cs.sonoma.edu/cs385/lab13sub.txt.