TweeSearch
1 What is This Assignment’s Purpose?
Large software systems are not written in one fell swoop; rather, they are built in stages. As these stages are released and used by people, their requirements keep changing. Some software methodologies, like agile methods, even expressly instruct developers to follow this strategy, with the hope of eventually delivering a system close to the client’s needs.
We do not have time to simulate the full scope of these activities in the time available in this course. However, in this assignment, we try to give you a taste for it. You will submit work in several stages. At each stage you will be given slightly different and changing requirements, and must turn in a solution that matches that stage. There are three stages in all, ordered in increasing difficulty.
In addition to this, the assignment also gets you to think about modeling real-world data. Each stage represents a model of a Twitter-like system, and subsequent stages incorporate a more sophisticated model of tweets and responses.
2 Theme Song
A Parliament of Owls Determine the Fates of Greater Men No Less Than 5 Stories Above Us in a Dream by Fever the Ghost
3 Format
You should treat this assignment like an exam. Therefore, you should do this work without consulting course staff except for critical issues (broken links, possible assignment typo, etc.).
You should not try to peek ahead to the next stages through any means; doing so would be a violation of the Academic Code.
Unlike in previous assignments, you will be given only one chance to submit at each stage (like a company making a code release of a new version). Note that although Gradescope allows for resubmissions, we are able to check the history of your submissions. If you submit more than once, we will only grade the earliest submission. So please submit with care.
We will be evaluating your code in part based on how effectively it can be reused in more general contexts.
4 Problem 1
Imagine you have a collection of past tweets, and are given a new tweet. You want to find all the old tweets that are similar to the current one.
How do we test for “similarity”? Easy: we use DocDiff! That is, we will use overlap with some specified threshold (which will be in the range \([0, 1]\)).
data Tweet: tweet(author :: String, content :: String) end
consider only the alphanumeric characters (retaining both capital and lower-case letters) and spaces of each Tweet (no punctuation), and
keep the tweet intact, so when returning searched tweets, the entire text is present.
search :: Tweet, List<Tweet>, Number -> List<Tweet>
Your overlap function from DocDiff takes two lists of strings. To create the lists, split the content of a tweet into words. In the tweet, a space separates one word from the next.
The result is all past tweets that have an overlap of at least threshold (inclusive). The result should be sorted from the tweets with highest to those with lowest (up to threshold) overlap.
Here is your Template for Problem 1.
꧁༺ Stop Here! ༻꧂
Submit before continuing to read!
5 Problem 2
data Tweet:
| tweet(
author :: String,
content :: String,
parent :: Option<Tweet>)
end
You will again define search, with the same parameters. However, the way we perform the search is somewhat different, as defined below.
First, we define the notion of a tweet’s relevance. If a tweet has no parent, then its relevance is exactly its overlap. If a tweet does have a parent, then its relevance is \[(0.25 \times r_p) + (0.75 \times o_t)\] where \(r_p\) is the relevance of its parent and \(o_t\) is the overlap of the tweet itself with the new tweet.
The third argument to search is the relevance threshold (inclusive).
All tweets that satisfy the relevance threshold should be included in the result, which is again sorted in order of most to least relevant.
Finally, note that multiple tweets can have the same parent (because two tweets may have responded to the parent tweet). This should remind you of our discussion about DAGs. You need to eliminate duplicates in your response, where duplicates are defined by identical.
꧁༺ Stop Here! ༻꧂
Submit before continuing to read!
Here is your Template for Problem 2.
6 Problem 3
In the previous version, we had a somewhat awkward way of dealing with the true structure of tweets. This is because we defined tweets as ascending chains, which meant we could end up inadvertently traversing duplicates, which we had to then filter. A better structure would be to have a descending tree.
Therefore, we will change the definition of a Tweet, eliminating the parent field and replacing it with the field children, which is of type List<Tweet>.
The given list of old tweets is guaranteed to contain only and all the roots.
Below, we redefine what relevance means.
A tweet’s relevance is now conditionally defined depending on whether or not it has a parent.
If the tweet does not have a parent, its relevance is \[(0.75 \times o_t) + (0.25 \times s)\] where \(o_t\) is the overlap of the tweet itself with the new tweet, and \(s\) is a size ratio (defined below).
If the tweet does have a parent, its relevance is \[(0.20 \times r_p) + (0.60 \times o_t) + (0.20 \times s)\] where \(r_p\) is the relevance of its parent, \(o_t\) is the overlap of the tweet itself with the new tweet, and \(s\) is its size ratio.
What is the size ratio? It’s the ratio of the size of the current tweet’s subtree to the total size of all the trees provided. The size of a tweet’s tree is defined as 1 (for the tweet itself) plus the sum of the sizes of each of the child tweets. (The parent and other tweets do not count.)
Here is your Template for Problem 3.
7 Starter
8 Socially Responsible Computing
Read/View
Read (or watch) this.danah boyd is a Brown CS alum.
Write
Come up with a few pairs of queries that search for the same content on Youtube but, with subtle changes in wording, come up with results quite different in character. Since we can’t be sure to get the same results you do, tell us what the queries produce and why you think those outcomes are significantly different.