Data Processing Hunt

In this lab, you will use higher-order functions to process data sets and answer questions about them. However, what data sets you process and what questions you want to answer are up to you!

Choosing Your Datasets

In this homework, you will process a number of data sets as outlined in the requirements below. You are free to choose whatever data sets you would like to analyze: statistics on football games, Netflix movie ratings, baby names, Wikipedia articles, books by classic authors, or anything else.

You can find data sets with an ordinary web search. Adding “data set” to your query usually narrows the results to what you want. Here are some data set repositories to give you some ideas:

If you don’t have a strong preference on the subject you want to analyze, the R data sets and the U.S. Government data sets are good places to start.

The requirements of the data sets you choose are:

Choosing Your Questions

For each data set you choose, you must answer a number of questions about it according to the requirements. These questions can be anything you want, subject to the complexity requirements listed below. The only major restriction is that the answers to these questions must be discoverable by using mapping, filtering, and reducing operations over lists.
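As a rough illustration of the three kinds of operations, here is a minimal sketch using Python's built-in map and filter together with functools.reduce. The list ages and the helper functions are made up for this example and do not come from any particular data set.

```python
from functools import reduce

# Hypothetical sample data: ages in years (invented for illustration).
ages = [34, 17, 52, 8, 41]

def double(x):
    return x * 2

def is_adult(age):
    return age >= 18

def add(total, x):
    return total + x

# Map: transform every element of the list.
doubled = list(map(double, ages))

# Filter: keep only the elements that satisfy a predicate.
adults = list(filter(is_adult, ages))

# Reduce: fold the remaining elements into a single value.
total_adult_age = reduce(add, adults, 0)

print(doubled)
print(adults)
print(total_adult_age)
```

Each step takes a list and produces either a new list (map, filter) or a single value (reduce), which is what makes the operations easy to chain.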

The program you write for this homework will load in your data sets and programmatically answer the questions you pose, printing these questions and answers to the console in the process.

The DataHunt Program

Write your program in a file called datahunt.py. For each question that you ask of your data sets, your program will print a block of text containing the question and answer. Format each block like this example:

What is the most popular female baby name in 1906?
> Mary
(Source: SSA's data set on baby names)

On the first line, you should print the question. On the second line, you should print the answer, distinguished from the question by a greater-than sign (>) at the start of the line. The answer should not be hard-coded into your program; instead, you should generate it programmatically by analyzing the data set. The final line should cite the source of the answer, i.e., the name of your data set, in parentheses as shown above. (This question and answer are taken from the Social Security Administration’s data set on baby names.)
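One possible way to produce a block in this format is sketched below. The function name format_block, the sample records, and the source string are all assumptions made for illustration; the counts are invented and are not real SSA figures.

```python
from functools import reduce

def format_block(question, answer, source):
    """Build the three-line block: question, '> answer', '(Source: ...)'."""
    return f"{question}\n> {answer}\n(Source: {source})"

# Made-up (name, sex, count) records; NOT real data.
records = [("Mary", "F", 3), ("John", "M", 4), ("Helen", "F", 2)]

def more_popular(a, b):
    # Keep whichever record has the larger count.
    return a if a[2] >= b[2] else b

# Filter to female names, then reduce to the record with the highest count.
females = list(filter(lambda r: r[1] == "F", records))
top = reduce(more_popular, females)

# The answer comes from the computation above, not from a literal string.
block = format_block(
    "What is the most popular female baby name in the sample?",
    top[0],
    "made-up sample records",
)
print(block)
```

Keeping the formatting in one small function means every question in the program prints in the same shape, and the answer argument is always a computed value.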

Requirements

  1. You must pose 12 questions and then print the answers using your program.
  2. You must use at least 3 data sets to answer these 12 questions, 4 questions per data set.
  3. Each question must be complex enough that you use at least 2 chained list operations to answer that question. For example, you might transform a list, filter it, and then take its length.
  4. For each data set, you must use each of the three list operations (transform, filter, and reduce) at least once in your answers. Specializations of these functions (e.g., sum for reduce) do not count towards this requirement.
  5. In addition, for each data set, you must not ask questions that have “similar answers”. For example, if you ask what the minimum age is for a data set, you cannot also ask about the maximum age.
  6. At least one data set must contain multiple components per entry in the data set. You are not required to use all of these components in the answers to the questions you pose for that data set.
  7. For at least one data set, you should use the DrawingPanel class to visualize the data in some way. That is, you should create a drawing based on the data present in the data set.
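To make requirements 3 and 6 more concrete, here is a small sketch that loads entries with several components each and then chains a transform, a filter, and a length. The CSV text, its column names (title, year, rating), and the function load_rows are hypothetical; a real program would read an actual data file instead.

```python
import csv
import io

# Made-up CSV text standing in for a real data file.
SAMPLE_CSV = """title,year,rating
Alpha,1999,4.5
Beta,2005,3.0
Gamma,2010,4.0
"""

def load_rows(text):
    """Parse CSV text into a list of dicts, one dict per entry."""
    return list(csv.DictReader(io.StringIO(text)))

rows = load_rows(SAMPLE_CSV)

# Transform: pull the rating component out of each entry as a float.
ratings = list(map(lambda row: float(row["rating"]), rows))

# Filter: keep only the ratings of 4.0 or higher.
high = list(filter(lambda r: r >= 4.0, ratings))

# Take the length to count how many entries passed the filter.
count_high = len(high)

print(count_high)
```

Each entry here has three components, but only rating is used, which requirement 6 allows.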