Shell
In this assignment you will do some basic text analysis on real data.
1 The Context
We often want to use computers to answer questions about real-world data sources, many of which are collections of text. In this assignment we will do a simple version of this, both learning new tools and exposing ourselves to some of the intricacies of real-world data analysis.
Specifically, given a well-known book, we will ask which person is named most frequently in it.
2 Language
In the assignment, we also want you to learn a new tool: the Unix shell. The shell is extremely powerful, and being able to do basic tasks quickly in the shell is a kind of “superpower” that lets you automate many tasks on your computer.
There are many Unix shells. We will specifically use the one called bash (which stands for the Bourne-again shell, a joke based on an earlier shell, sh, designed by Stephen Bourne). If you are a Unix geek with a strong preference for some other shell, you’re welcome to that in your spare time. Here, you must make sure your work runs in bash.
3 Warning
While we are fans of the power of shells, they are also extremely brittle and dangerous. It is not too hard to, say, accidentally delete all the files on your computer and not be able to retrieve them. None of the commands we give below will do this, but if you Google for answers and copy shell commands from the Web, this is a real possibility. Don’t copy arbitrary commands off the Web!
4 How to Run a Shell
On Linux or OS X, use a Terminal application to get a shell. On Windows, depending on the version, it would be a good idea to install Cygwin, which provides bash for Windows. (There are other solutions, such as bash on Ubuntu on Windows, and there are other ways of using Cygwin, such as through PuTTY, that you should consider if you plan to continue using these tools. But you should not need any of these for this assignment: just Cygwin will suffice.)
Note: You can also use a Brown account, especially a computer science account, to access a Unix shell. If you’re a Windows user, you may find this easier than dealing with Cygwin.
Once you’re at the Terminal, you can run commands at it just like in Pyret’s interaction panel. For instance:
[Figure: a terminal interaction]
(Note: in the figure, stelvio is the name of the computer, ~ is the current directory, and > is the prompt. This part will look different for you.)
If your commands go into an infinite loop, you can stop evaluation using control-C.
5 Model
The execution model of the shell is essentially (not quite, but close enough for the purposes of this exercise) a form of lazy evaluation. You can think of each shell command as a lazy stream transformer: it consumes a (possibly finite) stream, based on the values in this stream generates new values, and produces these in a stream. The initial command works with an empty stream, while the output of the last command is printed to the console.
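As a rough illustration of this model, here is a minimal sketch of a three-stage pipeline. The fruit names are made-up sample data, and printf is used only to start the pipeline with some input.

```shell
# printf starts the pipeline: it consumes nothing and emits three lines.
# sort reorders the lines, uniq drops adjacent duplicates,
# and wc -l counts the lines that remain.
# Only the output of the last stage (wc -l) comes out of the pipeline.
distinct_count=$(printf 'pear\napple\npear\n' | sort | uniq | wc -l)
echo "$distinct_count"
```

Each | connects the output stream of one command to the input stream of the next, so no intermediate files are needed.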
You can learn more about the shell from chapter 7.2 of PLAI first edition (note: not second edition). We also found slides 1–41 of this presentation particularly clear and useful. Beware that there are numerous tutorials on the Web, many of which are grossly incomplete or buggy.
6 Useful Commands
As you read the above documents, pay particularly close attention to the commands cat, grep, sed, sort, tr, uniq, and wc. These will suffice to solve this problem and help check your solution. Other Unix commands may also be helpful and may even lead to better solutions, but you are not permitted to use them without permission (which needs to be accompanied by a particularly good reason). In particular, you are not allowed to switch to some other language (such as Awk) to solve the problem. Our goal is to get you acquainted with the Unix shell, to aid your development as effective computer users.
You may also find it helpful to use “redirections”: appending > followed by a filename writes the output of that command to that file, while < reads input from it. This lets you save the intermediate results rather than have to re-run the whole computation, though when you turn in your submission be sure to remove these dependencies, since we can’t see those files.
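A small sketch of both redirections follows. The scratch directory and the file name sorted.txt are assumptions made for illustration, as is the sample data; mktemp is used only to avoid overwriting any of your real files.

```shell
# Scratch directory (an assumption, so that no existing file is clobbered).
scratch=$(mktemp -d)

# Save an intermediate result: > writes the output of sort to a file.
printf 'pear\napple\npear\n' | sort > "$scratch/sorted.txt"

# Later, pick up where we left off: < feeds that file to uniq as its input.
distinct=$(uniq < "$scratch/sorted.txt")
echo "$distinct"
```

Once the later stages work, the two halves can be joined back into a single pipeline so that the script no longer depends on the saved file.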
Finally, you can get help with all Unix commands using the man command (short for “manual”). Just type
man wc
for instance to get more information about how to use wc, including its command-line options (e.g., look up what wc -l is supposed to do, and try it out).
7 Task
As stated earlier, we want you to find the most frequently named person in a book. This includes the book’s title. We define a name to be a word that begins with a capital letter in the English alphabet and names a person in that book. (This is an extremely Anglo-centric definition and one we use only for the purposes of making this assignment. If you want to get pedantic, do understand that just about everything you think you know about names may be wrong.)
Do this for each of the following books:
Frankenstein; Or, The Modern Prometheus by Mary Wollstonecraft Shelley
Heart of Darkness by Joseph Conrad
Jane Eyre by Charlotte Brontë
The Legend of Sleepy Hollow by Washington Irving
Moby Dick; Or, The Whale by Herman Melville
Narrative of the Life of Frederick Douglass, an American Slave by Frederick Douglass
Pride and Prejudice by Jane Austen
A Tale of Two Cities by Charles Dickens
Ulysses by James Joyce
Essentially, you will want to convert the document into one where each word is on its own line. Once you’ve done that, you can reduce this to a problem that is mostly solved in the above reference materials.
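To make the idea concrete, here is one possible sketch on a single made-up sentence (not taken from any of the books). It only splits on spaces; real text also contains punctuation, capitalized non-names, and other complications that this sketch does not handle.

```shell
# A one-line stand-in for a book (invented sample text).
sample='the whale saw the ship'

# tr -s ' ' '\n' turns each run of spaces into one newline: one word per line.
# sort groups equal words together, uniq -c counts each group,
# and sort -rn puts the largest count first.
counts=$(printf '%s\n' "$sample" | tr -s ' ' '\n' | sort | uniq -c | sort -rn)
echo "$counts"
```

The first line of the output is the most frequent word with its count; the exact amount of leading whitespace that uniq -c prints differs between systems.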
Names might appear with additional letters at the end (how?). You have to handle these correctly when counting: i.e., make sure to count them.
The most frequent capitalized words may not be names. You could try to eliminate those automatically; for instance, you might remove words that appear in an English dictionary. If you did that, however, you would almost certainly produce the wrong answer for a book about Judge Learned Hand. Instead, therefore, to provide an answer to us in the form, you have to just read the output and use your judgment, perhaps referring back to the original document to make sure about what is and isn’t a name. You can’t solve this problem entirely algorithmically.
The document may contain metadata that disrupts your analysis. You need to either change the document’s format before starting (perhaps even manually, using a text editor, so long as you aren’t doing anything repetitious) or eliminate this in the output.
In short: when doing a real-world data analysis, of which you will do a great deal in the future in computer science, always confirm your findings in as many ways as possible. All the skills you’ve developed writing oracles and catching chaffs will become especially important in this setting.
8 Useful Shell Programming Hints
8.1 Scripts
You can run your entire program on the Unix command line, just like you could write an entire program at the Pyret interaction prompt. However, it’s useful to save your commands to a file, just like putting them in the Pyret definition panel. You do this by creating Unix scripts (i.e., programs).
To do this, create a new file, say mostname.sh (the .sh is a convention indicating a shell script, just like .arr is a convention indicating Pyret code). In that file, put the line
#!/bin/bash
at the top. On the following lines, write the commands of your script.
At this point you have a script that in principle can be run just like any of the other Unix commands. However, for security reasons, Unix will not run it until you say to permit it to be run. Thus, at the Unix shell, make the script “executable” (i.e., say it can be run), which from then on makes it runnable even if you make changes to it:
> chmod +x mostname.sh
(but don’t type >: that stands for your Unix shell prompt). Now, you can run your script as if it were a built-in Unix command:
> ./mostname.sh
The ./ means “run the program of this name in the current directory”.
At this point, you may very well have a particular filename embedded in your script: perhaps it looks like
cat frankenstein.txt | ...
You can replace that filename with "$1", which means “the first name specified on the command line” (in other words, the first parameter; Unix shell scripts access parameters by position rather than by giving them names). Then you can run the script with the name of the file you want to process:
> ./mostname.sh frankenstein.txt
but also
> ./mostname.sh ulysses.txt
and so on.
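The whole sequence (write the script, make it executable, run it with a filename) can be sketched as follows. The script name countlines.sh, the file sample.txt, and the scratch directory are hypothetical; the script just counts lines and is not a solution to the assignment.

```shell
# Scratch directory (an assumption, so that no existing file is overwritten).
demo_dir=$(mktemp -d)

# A two-line sample input file (made-up contents).
printf 'first line\nsecond line\n' > "$demo_dir/sample.txt"

# Write a tiny script. Quoting 'EOF' keeps "$1" from being expanded here,
# so it is stored in the script literally.
cat > "$demo_dir/countlines.sh" <<'EOF'
#!/bin/bash
# "$1" is the first argument given on the command line.
cat "$1" | wc -l
EOF

# Mark the script as executable, then run it from its own directory.
chmod +x "$demo_dir/countlines.sh"
( cd "$demo_dir" && ./countlines.sh sample.txt )
```

Running the same script with a different filename processes that file instead, without any change to the script.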
Note that on some operating systems, when you try to run the script, you might get an error like this: /bin/bash: bad interpreter: No such file or directory. This means bash is elsewhere in your system. In that case, try
#!/usr/local/bin/bash
If neither one works, ask us, telling us what operating system you’re using (to which we might suggest you try the Brown Unix systems...).
8.2 Searching
It can be tricky to figure out the syntax to provide grep. If you want to search for capital letters, you’re looking for things in the range A to Z, which you would write as '[A-Z]'. (Note: This actually searches for any uses of capital letters anywhere in a line.) If you need help with other patterns, ask us.
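The difference between matching anywhere in a line and matching at its start can be sketched like this. The three sample words are invented, and setting LC_ALL=C is an assumption added here so that the range A to Z means exactly the 26 ASCII capitals regardless of your system’s language settings.

```shell
# Three made-up sample lines.
lines=$(printf 'hello\nHello\nheLLo\n')

# '[A-Z]' matches a capital letter anywhere in the line: Hello and heLLo.
# (LC_ALL=C is an assumption that pins down what the range A-Z covers.)
anywhere=$(printf '%s\n' "$lines" | LC_ALL=C grep '[A-Z]')

# '^[A-Z]' anchors the match to the start of the line: only Hello.
at_start=$(printf '%s\n' "$lines" | LC_ALL=C grep '^[A-Z]')

echo "$anywhere"
echo "$at_start"
```

Here ^ stands for the beginning of the line, so the second pattern accepts only lines whose first character is a capital letter.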
8.3 Ends of Lines
When you try to verify your answer, you may find that names at the end of a line are mysteriously being dropped. (If you don’t find this, either this isn’t a problem on your operating system, or you aren’t cross-checking your answers well enough.)
If you do have this problem, you will want to add this command
sed -e 's/^M//'
to your pipeline. It means “replace all ^Ms with nothing” (i.e., get rid of all of them), but where ^M should be replaced by the “control-M” character. (^M is the “carriage return”. A typewriter lurks in the heart of your supercomputer.) You can get that character by typing the sequence control-V control-M (control-V means “the next thing I type, I want it verbatim”).
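The following sketch shows why a trailing carriage return makes a name go missing from a count, and two ways of stripping it that avoid typing the control character by hand. The sample text is invented; the tr variant and the printf trick for building the control-M character are alternatives suggested here, not part of the original instructions.

```shell
# Two lines that look identical on screen, but the first ends in a carriage return.
sample=$(printf 'Ahab\r\nAhab\n')

# With the carriage return present, uniq sees two different words.
before=$(printf '%s\n' "$sample" | sort | uniq | wc -l | tr -d ' ')

# tr -d '\r' deletes every carriage return, so the two lines now match.
after=$(printf '%s\n' "$sample" | tr -d '\r' | sort | uniq | wc -l | tr -d ' ')

# The same effect with sed, building the control-M character with printf
# instead of typing control-V control-M.
cr=$(printf '\r')
after_sed=$(printf '%s\n' "$sample" | sed -e "s/$cr//" | sort | uniq | wc -l | tr -d ' ')

echo "distinct lines before: $before, after tr: $after, after sed: $after_sed"
```

Because the carriage return is invisible in most terminals, a word that carries one looks the same as a word without it but is counted separately.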
9 Handing In
The form asks you to provide
- for each book:
  - the most common name
  - its count
- the script you wrote