MapReduce
When working with partners, the collaboration policy applies to a pair as a whole, rather than to individuals. Indeed, you are required to collaborate.
We strongly recommend a pair-programming strategy called driver/navigator. The driver is the person with the keyboard; the navigator is the person discussing, reading each line as it’s written, and providing feedback. Every fifteen minutes, swap roles. It’s a really good habit to get used to.
Please turn in only one submission, agreeing with your partner who will do the submitting.
You may not use a late pass for this assignment!
You will have to implement your own version of the essence of Google’s MapReduce.MapReduce was one of the most important innovations in the 2000s. Dave Patterson, one of the world’s leading computer systems researchers, called it the first “instruction” for cloud computing.
One of the things you will discover is that their operator is horribly mis-named: it isn’t the composition of map and reduce. Nevertheless, we’ll stick to their naming to respect (unconventional) convention.
1 Theme Song
The 3 R’s by Jack Johnson
2 Background
The MapReduce pattern allows you to divide problems into a series of independent, equivalent tasks that can be parallelized. Specifically, all the map tasks can be run at the same time, as can all of the reduce tasks, because the result of each task does not depend on any of the other tasks. Because your implementation will run on only a single processor, you won’t see any performance gains. However, working through the central idea should help familiarize you with it so you recognize applications that can use it later.
data Tv-pair<A, B>:
| tv(tag :: A, value :: B)
end
3 Your MapReduce
fun map-reduce<A, B, M, N, O>(
input :: List<Tv-pair<A, B>>,
mapper :: (Tv-pair<A, B> -> List<Tv-pair<M, N>>),
reducer :: (Tv-pair<M, List<N>> -> Tv-pair<M, O>))
-> List<Tv-pair<M, O>>
4 An Example Scenario: Word Counting
A word count mapper:
fun wc-map(
file :: Tv-pair<String, String>)
-> List<Tv-pair<String, Number>>
wc-map consumes a file, which is a Tv-pair with the filename as its tag and the file’s contents as its value. It produces a List<Tv-pair> where the tags are word tokens and the values are the corresponding counts of the tokens. wc-map produces the count 1 for every word token, because it counts words as independent tokens as it encounters them.
A word count reducer:
fun wc-reduce(
word-counts :: Tv-pair<String, List<Number>>)
-> Tv-pair<String, Number>
It takes a Tv-pair<String, List<Number>> in which the tag represents a word and the value is a list of word counts for that word (in this case, as we know, a list of 1s, though it won’t always be so simple!). It produces a Tv-pair<String, Number> in which the tag is a word, and the value is the number of times that word appeared in all of the input.
It’s the reducer’s job to sum the word token counts into a word type count. wc-map may be run independently on each file, and wc-reduce may be run independently on each word. This is the division of the problem into a series of equivalent tasks which can be run at the same time, and from here we gain the performance advantage over the undivided problem. MapReduce would allow the words from the files to be read into the mapper, and parse that output for use to the reducer.
Take in a list of documents.
Tokenize each word.
Find all instances of each word among all the documents and combine them into one Tv-pair per word.
Pass these Tv-pairs of each word and its list of word counts to the reducer.
Produce a word count for each word among all documents.
Once you and your partner are satisfied with your map-reduce, use it to solve the following problems.
5 Anagrams
Your input consists of a List<Tv-pair<String, String>> in which the tag is the file name, and the value is the string contents of the file (like in the word count example). For example, one item that could be part of an input would be tv("words.txt", "star rats tarts"). Your final output should be a List<Tv-pair<String, List<String>>> in which each tag represents a unique anagram group, and the value is a list of unique words that appear in the input that are all anagrams of each other. We will define anagrams to be a set of words that are made up of the same set of characters (case-sensitive), with the same number of each character. Your functions should be named anagram-map and anagram-reduce.
Note that you are responsible for creating your own tagging convention to uniquely label each anagram group as you see fit. This also means that, when writing public tests (as opposed to ones internal to your implementation), you can’t expect a particular tagging convention!
6 Nile 2.0
If you thought you were done with Nile, then you must be in de-nile!
recommend(title :: String,
book-records :: List<Tv-pair<String, List<String>>)
-> Tv-pair<Number, List<String>>
popular-pairs(book-records :: List<Tv-pair<String, List<String>>>)
-> Tv-pair<Number, List<BookPair>>
7 Clarifications
fun count<A>(target :: A, a :: List<A>, eq-checker :: (A, A -> Boolean))
-> Number:
doc: "counts quantity of target in a"
fun el-checker(el, cnt):
if eq-checker(el, target):
cnt + 1
else:
cnt
end
end
a.foldl(el-checker, 0)
where:
count(3, [list: 1, 2, 3], lam(x, a): x == a end) is 1
count(3, [list: 3, 2, 3], lam(x, a): x == a end) is 2
count(4, [list: 1, 2, 3], lam(x, a): x == a end) is 0
end
fun lst-same-els<A>(
a :: List<A>,
b :: List<A>,
eq-checker :: (A, A -> Boolean))
-> Boolean:
doc: "checks whether two lists have the same elements in the same quantity"
fun same-count(el, acc):
acc and (count(el, a, eq-checker) == count(el, b, eq-checker))
end
(a.length() == b.length()) and a.foldl(same-count, true)
where:
lst-same-els([list: 1, 2, 3], [list: 1, 2, 3], lam(x, a): x == a end)
is true
lst-same-els([list: 1, 2, 3], [list: 1, 2, 3], lam(x, a): x == a end)
is true
lst-same-els([list: 1, 2, 3], [list: 2, 1, 3], lam(x, a): x == a end)
is true
lst-same-els([list: 1, 2, 2], [list: 2, 1], lam(x, a): x == a end)
is false
lst-same-els([list: 1, 2, 2], [list: 2, 1, 1], lam(x, a): x == a end)
is false
end
fun recommend-equiv<A>(
t1 :: Tv-pair<Number, List<A>>,
t2 :: Tv-pair<Number, List<A>>)
-> Boolean:
doc: "checks whether two recommendations are equivalent"
(t1.tag == t2.tag) and lst-same-els(t1.value, t2.value, lam(x, y): x == y end)
where:
recommend-equiv(tv(0, empty), tv(0, empty)) is true
recommend-equiv(tv(1, [list: "a"]), tv(1, [list: "a"])) is true
recommend-equiv(tv(1, [list: "a", "b"]), tv(1, [list: "a", "b"]))
is true
recommend-equiv(tv(1, [list: "b", "a"]), tv(1, [list: "a", "b"]))
is true
recommend-equiv(tv(1, [list: "b", "a"]), tv(2, [list: "a", "b"]))
is false
end
You are welcome to use these functions in your test suites. You should use the lst-same-els function to help you write tests for the Anagrams section of the assignment, and you should use the recommend-equiv function to help you write tests for the Nile section of the assignment.
You should not modify these functions, nor should you use them in your implementations.
8 Analysis
How would JoinLists be useful for MapReduce? (One paragraph or less.)
Give an example of another problem that you could use MapReduce to solve beyond Anagrams and Nile. Provide the contracts for your input data, mapper, and reducer.
9 Files
Feel free to write your tests for map-reduce using
wc-mapper and wc-reducer. Note that you should be
testing functions in an implementation-independent manner—
10 Handing In
Please have only one person from each pair submit the form.