Data Scripting

    1 What is This Assignment’s Purpose?

    2 Theme Song

    3 Programming with Tables

    4 Program Plans

    5 Testing Plan

    6 Starter

    7 Virtual Art Store

    8 Titanic

    9 Socially Responsible Computing

1 What is This Assignment’s Purpose?

We live in a world awash with data. It’s important that we develop facility with writing small programs that can find useful information for us. This assignment gives you practice with that.

In addition, a lot of the world’s data are structured as tables. This assignment therefore also gets you familiar with tables and writing programs over them.

2 Theme Song

Collection 1 by Piano Studio Ghibli

3 Programming with Tables

Tables are built into Pyret, so we can program with them directly. The Pyret documentation lists Pyret’s features for working with tables. You will find it helpful to read through this to familiarize yourself with the available support. You can also read chapters 7, 8, and 9 of DCIC for a more pedagogic introduction.
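As a quick taste of what programming with tables directly looks like, here is a minimal sketch. The pets table and its contents are hypothetical and have nothing to do with the assignment's data; the operations shown (sieve, order, extract, row-n) are a small sample of what the documentation covers.

```pyret
# Hypothetical data, for illustration only.
pets = table: name :: String, legs :: Number
  row: "rex", 4
  row: "tweety", 2
  row: "felix", 4
end

# Keep only the rows whose legs column is 4.
four-legged = sieve pets using legs: legs == 4 end

# Sort all rows alphabetically by name.
by-name = order pets: name ascending end

# Pull one column out of a table as a list.
four-legged-names = extract name from four-legged end

check:
  four-legged.length() is 2
  by-name.row-n(0)["name"] is "felix"
  four-legged-names is [list: "rex", "felix"]
end
```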

4 Program Plans

We again want you to construct program plans for your programs before you implement them. We provide a block-based interface to table operations that you can use for this purpose. Please see our instructions on this.

We want plans for the functions get-art-in-1, get-art-in-2, and get-art-in-3 in Virtual Art Store, as well as for the code you will write in Titanic. Please first construct a plan and save it, and then begin implementation. You will be asked to upload the plan separately.

5 Testing Plan

We want you to again create a Testing Plan for one of the problems in Virtual Art Store. This one, however, is more subtle.

As you write the sub-problems that you want to test—the standard testing plan—you will find yourself making assumptions about the structure of the tables. Different table structures will lead to different properties. We are intentionally leaving this open for you to think about.

Therefore, you should also write down your assumptions about table structure. These are often called well-formedness conditions. Essentially, what you are saying is, the testing plan only applies to programs that are run over tables that have this structure. If the given tables don’t, then all bets are off. Put differently, these are just contracts over the input table’s shape.
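To make the idea of a well-formedness condition concrete, here is a small sketch on a hypothetical table that is deliberately unrelated to the assignment's tables. The table, the condition, and the function name scores-well-formed are all assumptions chosen for illustration; which conditions matter for your own tables is still yours to decide.

```pyret
# Hypothetical table, unrelated to the assignment's tables.
scores = table: student :: String, score :: Number
  row: "ana", 91
  row: "raj", 78
end

# Well-formedness condition (an assumption we choose to state):
# every score lies between 0 and 100 inclusive.
fun scores-well-formed(t :: Table) -> Boolean:
  out-of-range = sieve t using score:
    (score < 0) or (score > 100)
  end
  out-of-range.length() == 0
end

check:
  scores-well-formed(scores) is true
end
```

A testing plan written against this table would then only promise anything for inputs on which scores-well-formed returns true.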

6 Starter

Template

7 Virtual Art Store

An online store sells rights to digital content created by artists. Artists come from all over the world, as do clients. This requires currency conversions between the two.

The store has two tables. One is for artwork:

table: id :: Number, cost :: Number, currency :: String

This tracks, for each artwork (which has a unique id), its cost and the currency in which that cost is quoted. The other is for currency conversion:

table: from-c :: String, to-c :: String, conv-rate :: Number

This holds the multiplicative factor (conversion rate) to convert from the first currency to the second.
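For concreteness, here is a sketch of what two such tables might look like. Every id, cost, and rate below is made up for illustration; the real tables may hold different data.

```pyret
# Hypothetical sample data: the ids, costs, and rates are made up.
artwork = table: id :: Number, cost :: Number, currency :: String
  row: 1, 100, "EUR"
  row: 2, 250, "USD"
end

conversion = table: from-c :: String, to-c :: String, conv-rate :: Number
  row: "EUR", "USD", 1.1
  row: "USD", "INR", 80
end

# conv-rate is a multiplicative factor: a cost quoted in from-c,
# multiplied by conv-rate, gives the cost in to-c.
eur-cost = artwork.row-n(0)["cost"]
eur-to-usd = conversion.row-n(0)["conv-rate"]

check "100 EUR at 1.1 USD per EUR is 110 USD":
  eur-cost * eur-to-usd is 110
end
```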

  1. Write the function

    get-art-in-1 :: ArtworkTable, ConversionTable, Number, String -> Number

    that takes the artwork table, the conversion table, an artwork’s id, and a desired currency. It looks up the artwork’s price in its listed currency and, if that is not the desired one, uses the conversion table to find the conversion rate and converts the price. You can assume that every artwork queried is listed exactly once in the artwork table, and that every pair of currencies you need is listed exactly once in the currency conversion table.

  2. Unfortunately, in practice, errors creep in. Either table may be faulty: entries may be deleted by accident, or there may be duplicate entries. If there are missing or duplicate entries for either the input artwork id or the necessary conversion factor, you should raise an exception. You should only compute and return a numeric answer if none of these happens. Call this function get-art-in-2 (with the same signature). Are there any abstractions you can write that can help you clean up your code?

  3. Sometimes, the currency conversion table may not list the conversion from A to B, but it may list that from B to A. If that happens, use the inverse of the conversion table’s ratio. Call this function get-art-in-3 (with the same signature).

  4. Of course, sometimes you can get a conversion ratio by composing several: e.g., the table may have neither A-to-C nor C-to-A, but it may have a chain of conversions (e.g., A to B to D to C, potentially including inverses). Don’t write a program plan or implementation for this; instead, write only a Testing Plan.

  5. In this problem, we’ve represented currency as strings. Another option is to represent it as a datatype: e.g.,

    data Currency: USD | EUR | CHF | INR | … end

    What are the strengths and weaknesses of each representation? In a separate document, explain the trade-offs between these two representations. (There is no program plan or code needed for this entry.)

8 Titanic

There are a few different versions of databases of passengers who sailed on the Titanic. (There is some ambiguity in this term, because the Titanic made a few stops even on its maiden voyage.) Here is one such database.

The following code enables you to load this into Pyret:

# GS must be bound for the loader below; omit this line if your starter file already imports it.
import gdrive-sheets as GS

titanic-raw-loader =
  GS.load-spreadsheet("1ZqZWMY_p8rvv44_z7MaKJxLUI82oaOSkClwW057lr3Q")

titanic-raw = load-table:
  survived :: Number,
  pclass :: Number,
  raw-name :: String,
  sex :: String,
  age :: Number,
  sib-sp :: Number,
  par-chil :: Number,
  fare :: Number
  source: titanic-raw-loader.sheet-by-name("titanic", true)
end

Relative to the dataset, we define the following concepts:
  • A male is a passenger whose sex field is "male"; a female is one whose sex field is "female". (We are reflecting the standards of that time, not claiming any form of normativity.)

  • A title is the part of the raw-name field up to but not including the first period.

  • A first name is the part of raw-name between the first and second spaces, skipping over any leading parenthesis.
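To illustrate how the title and first-name definitions read on a single string, here is a minimal sketch. The sample name is invented and its "Title. First Last" layout is an assumption; it is not taken from the dataset, and the sketch does not handle the leading-parenthesis case.

```pyret
# Hypothetical name in an assumed "Title. First Last" layout;
# it is not taken from the dataset.
sample = "Dr. Ada Example"

# Title: everything before the first period.
sample-title = string-substring(sample, 0, string-index-of(sample, "."))

# First name: the piece between the first and second spaces.
sample-first = string-split-all(sample, " ").get(1)

check:
  sample-title is "Dr"
  sample-first is "Ada"
end
```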

Your task is to determine the following:
  • The six most popular male first names.

  • The six most popular female first names.

  • The frequencies of the titles.

What to submit:
  1. Write a program plan (using blocks) for how you would approach these tasks.

  2. Write a program that computes these values. Use sensible variable names and/or comments to make clear which parts of your program compute what.

  3. Answer the following questions on Gradescope:

    1. List the 6 most common male first names in descending order of frequency.

    2. Do the same for the 6 most common female first names.

    3. Describe any observations you have about the above two answers.

    4. Describe any observations you have about the titles.

9 Socially Responsible Computing

Read/View

Read this article.

Write

Why do you think we assigned this reading? Can you think of any falsehoods that the author missed?

In some sense, all user profiles are reductionist in that they condense a human being into an object with several predefined traits. What human characteristics (beyond names) do you think software developers tend to reduce in ways that are harmful to people who carry specific traits?

Optional Readings

This article contains illustrative examples for the previous one, in case you need them. It’s always good to have examples!

This thread introduces you to naming conventions for Vietnamese and other Southeast Asian names.