TweeSearch 1
1 Format
You should treat this assignment like an exam. Therefore, you should do this work without consulting course staff except for critical issues (broken links, possible assignment typo, etc.).
2 Theme Song
Rockin’ Robin by Bobby Day
3 Problem
Imagine you have a collection of past tweets, and are given a new tweet. You want to find all the old tweets that are similar to the current one.
How do we test for “similarity”? Easy: we use DocDiff! That is, we will compute the overlap from DocDiff and compare it against some specified threshold (which will be in the range \([0, 1]\)).
Tweets are represented as follows:
data Tweet: tweet(author :: String, content :: String) end
When computing overlap, use only the alphanumeric characters (retaining both capital and lower-case letters) and spaces of each Tweet. Keep the tweet itself intact, so that when you return searched tweets, the entire text is present.
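The character filtering described above could be sketched roughly as follows. This is only an illustration, not a required design: the helper names is-kept-char and clean are hypothetical, and the sketch assumes that "alphanumeric" means the ASCII letters and digits.

```pyret
data Tweet:
  | tweet(author :: String, content :: String)
end

# Hypothetical helper: true for ASCII digits, ASCII letters, and the space.
# Assumes c is a one-character string.
fun is-kept-char(c :: String) -> Boolean:
  cp = string-to-code-point(c)
  is-digit = (cp >= 48) and (cp <= 57)
  is-upper = (cp >= 65) and (cp <= 90)
  is-lower = (cp >= 97) and (cp <= 122)
  is-space = (cp == 32)
  is-digit or is-upper or is-lower or is-space
end

# Hypothetical helper: returns a filtered copy of s.
# The case of the letters is left unchanged.
fun clean(s :: String) -> String:
  string-explode(s).filter(is-kept-char).join-str("")
end

check:
  t = tweet("bobby", "Rockin' Robin, tweet-tweet!")
  clean(t.content) is "Rockin Robin tweettweet"
  # The tweet itself is untouched, so its full text can still be returned.
  t.content is "Rockin' Robin, tweet-tweet!"
end
```

Because clean builds a new string rather than a new Tweet, the original tweets remain available for the result.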
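Define the following function, which takes the new tweet, the list of past tweets, and the threshold:
search :: Tweet, List<Tweet>, Number -> List<Tweet>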
Your overlap function from DocDiff takes two lists of strings. To create the lists, split the content of a tweet into words. In the tweet, a space separates one word from the next.
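As a small illustration of the splitting step, the built-in string-split-all can break a string at every space. The helper name to-words is hypothetical, and how runs of consecutive spaces should be treated is not specified here, so this sketch does not address that case.

```pyret
# Hypothetical helper: splits tweet content into words at each space.
fun to-words(content :: String) -> List<String>:
  string-split-all(content, " ")
end

check:
  to-words("Rockin Robin tweet") is [list: "Rockin", "Robin", "tweet"]
end
```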
The result is all past tweets whose overlap with the new tweet is at least threshold (inclusive). The result should be sorted from the tweets with the highest overlap to those with the lowest (down to threshold).
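The thresholding and ordering can be pictured with a toy example. Everything named here is hypothetical: select-and-rank is not the required search function, and toy-score is a made-up stand-in that only plays the role of "overlap with the new tweet" so that the sketch runs on its own. It is not the DocDiff overlap. The order of tweets with equal scores is not specified above, so the example avoids ties.

```pyret
data Tweet:
  | tweet(author :: String, content :: String)
end

# Hypothetical helper: keeps the tweets scoring at least threshold
# (inclusive), ordered from highest score to lowest.
fun select-and-rank(
    past :: List<Tweet>,
    score :: (Tweet -> Number),
    threshold :: Number) -> List<Tweet>:
  kept = past.filter(lam(t): score(t) >= threshold end)
  kept.sort-by(
    lam(t1, t2): score(t1) > score(t2) end,
    lam(t1, t2): score(t1) == score(t2) end)
end

# Made-up stand-in score, used only for this illustration.
fun toy-score(t :: Tweet) -> Number:
  string-length(t.content) / 10
end

check:
  low = tweet("x", "aa")         # toy score 0.2
  high = tweet("y", "aaaaaaaa")  # toy score 0.8
  mid = tweet("z", "aaaaa")      # toy score 0.5, exactly at the threshold
  select-and-rank([list: low, high, mid], toy-score, 0.5) is [list: high, mid]
end
```

Note that the tweet scoring exactly 0.5 is kept, since the threshold is inclusive.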