The purpose of the Cheat-Checker assignments (hw04 and hw05) are to gain practice with functions, arrays, and strings, and learn about applications of string matching. The specific goals of this assignment are:
You are invited to complete the mid-semester feedback form. Professor Eaton and the TAs would love to hear how we can improve your CIS 110 experience for the second half of the semester.
In Cheat-Checker, Part I, you wrote a program, CompareStrings, that compares two strings for similarity by aligning them and finding the lowest cost. For extra credit, you may also have chosen to write a program, CompareDocuments, that compares two documents in a similar way.
Part II builds upon CompareDocuments to visualise the range of similarity in a given pool of documents to highlight the pairs that are most similar. This is an example of another common task when considering protein matching and cheat-checking.
We provide you with a reference implementation of CompareDocuments, so you do not need to rely on the code you wrote for your last homework.
Protein matching
Comparing a protein from two different species can give us some information about how closely related they are, but it is even more interesting to compare the protein among numerous species. That helps us answer questions like, "Which two species are most closely related?" or "Which two species are most distant?" We can also build a phylogenetic tree that hierarchically groups species based on how similar the protein is, and thereby gain insight into the entire evolutionary tree. Computational biologists build similar trees of different genes to help learn about their functions—very similar genes often control similar functions. (This type of tree is called a minimum spanning tree, and is at the heart of many compression formats including JPEG, as well as being an essential part of route planning and many other algorithms. You will learn all about minimum spanning trees in CIS 121.) Two examples of phylogenetic trees based on the cytochrome c protein are shown below.
Cheat-checking
A typical cheat-checker compares all pairs of code submissions using techniques similar to those you implemented in CompareDocuments of the previous assignment, returning a sorted list of the top matches. But we're also interested in the distribution of top matches, i.e. the histogram of matching costs.
Any pair of independently written programs should have fairly similar matching costs (actually, you would expect the distribution of costs to form a bell curve); thus, assignment pairs with abnormally low matching costs relative to the cost distribution are likely to be copied.
The images below show the kind of histogram your program will generate, as well as a sample report generated by a cheat-checker.
Your program in hw05 will have five main parts:
A histogram is a representation of the distribution of numerical data. A histogram divides the range of data into bins and displays the frequencies of each bin in a tabular or graphical form.
Let's say we have the following data on the heights, in centimeters, of some of the CIS 110 TAs:
This list is not useful because it does not help us visualize the distribution or find any anomalies: we cannot see at a glance whether there are any TAs that are abnormally tall or short. However, we could group these data to see which heights are the most common. This is a tabular histogram:
Height range / cm | Frequency |
145–150 | 1 |
150–155 | 0 |
155–160 | 1 |
160–165 | 3 |
165–170 | 3 |
170–175 | 8 |
175–180 | 2 |
180–185 | 2 |
We now can easily see that most TAs have heights between 170 and 175 cm, and that there is a clear outlier who has a height between 145 and 150 cm.
We could further improve the presentation of these data by drawing a visual histogram:
We could also have used this histogram, with four bins instead of eight, if we had needed less detail:
Height range / cm | Frequency |
145–155 | 1 |
155–165 | 4 |
165–175 | 11 |
175–184 | 4 |
We provide you with a compiled reference implementation of the program that you will write. You can run this reference implementation to explore how your program is supposed to work. You can also use the functions it provides as drop-in replacements for many of the parts that you will write so that you end each of the subsequent assignment parts with a complete, working program. This makes your program much easier to test.
Download the RankSimilarityReference.class reference implementation. In addition, download the data files cytochrome_c.txt (a data set of cytochrome c proteins for many different species) and costs-cheatchecker.txt (an anonymized data set containing cheat-checking data).
Run java RankSimilarityReference costs-cheatchecker.txt and java RankSimilarityReference cytochrome_c.txt. A histogram should appear visualising the cost data. Press 'p' to print the matches up to your mouse position, 'i' to invert the bar heights to highlight the shortest bars, and 'r' to recompute the histogram with a number of bins determined by your mouse position. Play around with the program to understand how it works.
In this assignment, we will be using 2D arrays.
We strongly recommend that you read Section 1.4 of the textbook on this topic. A summary of 2D arrays is also provided on the booksite: http://introcs.cs.princeton.edu/java/14array/.
A 2D array is an array of arrays. You can visualize a 2D array by thinking of a table (i.e. a grid).
The type of a 2D array of ints is int[][]. Thus, this declares a 2D array of ints:
int[][] a;
To instantiate our int[][], we need to specify the number of rows in the 2D array. Let's say a has 10 rows:
a = new int[10][];
If we also know that each of the row in a has the same number of columns (say, 3), then we have a rectangular array, and we can write:
a = new int[10][3];
Thus, we can refer to the item in row 1 and column 2 as a[1][2]. For instance, we can write the following:
a[1][2] = 78; System.out.println(a[1][2]); // prints 78
Conventionally, the first index refers to the number of rows and the second index to the number of columns.
Because a 2D array is an array of 1D arrays, each row in the 2D array is a 1D array. Thus, we can get the second row of a as a 1D array:
int[] x = a[1]; // x is the 1D array {0, 0, 78} System.out.println(x.length); // prints 3
As with 1D arrays, Java initializes all entries in 2D arrays of ints to 0, doubles to 0.0, and booleans to false.
To access each of the elements in a 2D array, we need nested loops:
double[][] b = new double[M][N]; for (int i = 0; i < M; i++) for (int j = 0; j < N; j++) b[i][j] = 0;
Setting values at compile-time You can set the values in a 2D array at compile-time. This only works when declaring and instantiating arrays at the same time.
int[][] a = { { 99, 85, 98 }, { 98, 57, 78 }, { 92, 77, 76 }, { 94, 32, 11 }, { 99, 34, 22 }, { 90, 46, 54 }, { 76, 59, 88 }, { 92, 66, 89 }, { 97, 71, 24 }, { 89, 29, 38 } };
Ragged arrays Not all of the rows in a 2D array need to have the same length—an array with rows of non-uniform length is known as a ragged array. We can instantiate a ragged array without specifying the number of columns:
int[][] c = new int[3][];
which then requires us to instantiate each row individually as an int[]:
c[0] = new int[5]; c[1] = new int[3]; c[2] = new int[7];
before we are able to assign values to the entries of the 2D array:
c[1][2] = 4;
This code prints the contents of c:
for (int i = 0; i < c.length; i++) { for (int j = 0; j < c[i].length; j++) { // notice the use of c[i].length to get the length of the ith array System.out.println(c[i][j] + " "); } }
You will write a program RankSimilarity.java with functions to match all pairs of documents, sort the matches by cost, compute a histogram of the sorted matches, and plot the match costs in an interactive histogram. It must conform to the following API:
public class RankSimilarity -------------------------------------------------------------------------------------------------------- /* Computes the cost values for all pairs of documents * Returns a 2D array with one row for each document pair * docs is an array of documents; each document is a String[] in docs */ public static int[][] computeCosts(String[][] docs) /* Sorts costs in increasing order of cost value using insertion sort */ public static void insertionSortCosts(int[][] costs) /* Computes the histogram array of the cost values in costs * Returns an array giving the number of matches in each histogram bin * numBins is the number of bins for the histogram (will be at least 1) * Computes modified inverse of bin values if inverse is true */ static int[] buildHistogram(int[][] costs, int numBins, boolean inverse) /* Display a static visualization of histogram */ public void drawHistogram(double[] histogram) /* Computes the histogram from costs and displays an interactive visualization */ public void interactHistogram(int[][] costs) /* Read in the protein file, cost matrix, or list of documents, * compute all match costs, sort them, and create a histogram */ public static void main(String[] args)
Write and test computeCosts() first, before starting to work on the other functions. computeCosts() takes a set of documents as a 2D array. It returns a list of the costs associated with each possible pair of documents as a 2D array. More details are explained below.
We have given you a main() method in the skeleton RankSimilarity.java that you will use for testing. Please read all of main(). Make certain that you understand every step of what happens in main(). If you need help understanding main(), feel free to ask.
Make sure that you make a genuine attempt to understand main(). If you have questions in office hours that suggest you haven't made an effort to understand main() on your own, CIS110 Staff will likely ask you to re-read main().
Below is the API for the RankHelpers class that is called by the provided main().
public class RankHelpers -------------------------------------------------------------------------------------------------------- /* Read in an array of precomputed cost values from the file at filename * Assumes that the file at filename is in the same format as costs-cheatchecker.txt * Returns the 2D costs array * (This skips the cost value computation, which is useful when there is a large amount of data to process, * or when the data are confidential) */ public static int[][] readCosts(String filename) /* Read in a list of proteins from the file at filename * Assumes that the file at filename is in the format of cytochrome_c.txt * Returns an array of documents, comprising the lines of filename * (This treats each line in filename as an individual single-line document) */ public static String[][] readProteinData(String filename) /* Read in the documents at the locations in the list filenames * Returns an array of these documents * (This requires a large number of documents, at least 30, to give meaningful results) */ public static String[][] readDocumentArray(String[] filenames)
computeCosts() takes a parameter, docs, a 2D array of Strings. docs contains the contents of the set of documents for which the costs need to be computed. The number of documents in the set is n.
As in hw04, each document is represented as a 1D array of Strings, where each String is a line in the document. Naturally, an array of multiple documents would be a 2D array of Strings. Thus, docs[i] is a 1D array of Strings representing document i, and docs[i][j] is a String representing the jth line of document i.
This is a possible visual representation of docs. You can view the input, docs, as a table, where the lines of each document are represented in the columns, and the documents in rows.
(The computer does not store a 2D array in this tabular format, so this is not what docs looks like in the computer's memory. Below, we arbitrarily chose to display documents as rows and lines as columns, since it is conventional to display the first index in rows and the second as columns. We could have chosen to transpose rows and columns in the below table.)
docs[0]: String[]: document 0
|
|||||
docs[1]: String[]: document 1
|
|||||
docs[2]: String[]: document 2
|
|||||
... | |||||
docs[docs.length - 1]: String[]: last document
|
(This section may seem like explicit instructions, but there are more clarifications about the instructions in the below sections as well.)
computeCosts() returns a 2D array of ints that contains n(n-1)/2 rows, each with 3 columns. Below, we will refer to this output array as costs. Each row represents the result of matching a pair of documents in docs. For a given row in costs, the first column is the index in docs of the first document in the pair; the second column is the index in docs of the second document in the pair. The third column is the cost value from comparing these two documents.
Be sure not to have duplicate entries: if the first document in the pair is at index f in docs, and the second document is at index s, your resulting costs array should include only the pair where f < s, excluding the pair where s < f.
Also, be sure not to compare a document against itself.
For now, the order of the rows in your output can be arbitrary.
This is a possible visual representation of costs. You can view your output, costs as a table, where each row represents a match between each pair of documents. Because the document pairs do not need to be in any particular order in the costs array, we represent the first index of the 2D array as the indices a, b, c, d, e, and z. The following is simply a representation to give you an idea of what costs would look like.
... | |||
costs[a]: int[]: docs[0] and docs[1]
|
|||
... | |||
costs[b]: int[]: docs[0] and docs[2]
|
|||
... | |||
costs[c]: int[]: docs[0] and docs[3]
|
|||
... | |||
costs[d]: int[]: docs[1] and docs[2]
|
|||
... | |||
costs[e]: int[]: docs[1] and docs[3]
|
|||
... | |||
costs[z]: int[]: docs[n-1] and docs[n]
|
|||
... |
You will need to use CompareDocumentsReference.compareDocuments() to compute the cost of comparing two documents. The API for CompareDocumentsReference.compareDocuments() is as follows:
public class CompareDocumentsReference -------------------------------------------------------------------------------------------------------- /* Return the cost of the best alignment of two documents by aligning all pairs * of lines. Lines in one document that match a "gap" at the * beginning or end of the other document, should be treated as * matching an emtpy line (""). */ public static int compareDocuments(String[] firstDocument, String[] secondDocument, boolean printBestAlignment)
You should call CompareDocumentsReference.compareDocuments() with printBestAlignment set to false.
Write a loop to iterate through each pair of documents (docs[f], docs[s]) in docs.
Instantiate and populate the costs array.
To test your computeCosts() function, run your file with cytochrome_c.txt as the only command-line argument. As you can see in your computeCosts() function, the skeleton file returned the reference's costs.
Make sure that the graph generated by your costs array and the reference's costs array are the same graph.
Allow a little bit of time for the run command to execute. It shouldn't take longer than a minute.
insertionSortCosts() takes a 2D int array, costs, that has the same format as the output of computeCosts(), described in Section 4C, and uses insertion sort to sort the rows of the 2D array in increasing order of cost value (i.e. sort by the third column). More details are explained below.
Review the lecture slides and Chapter 4.2 of the textbook, Sedgewick and Wayne. Carefully read the insertion sort algorithm.
Implement the insertion sort algorithm to sort costs by the cost value column.
You will need to swap entire rows of costs. Recall that each row is of type int[] and the third column refers to costs. You should not write a new function to swap rows.
If you run your code with cytochrome_c.txt, you should get a histogram that looks very similar to the following. Of course, you can use any colors that you want.
buildHistogram() takes a sorted array of costs, an integer numBins, and a boolean inverse as its parameters, and returns an array of doubles, which we will refer to as histogram, where this histogram array has length numBins. Each value in histogram represents a count of the number of rows in costs falling into each bin where a bin is represented by each index of histogram. If the parameter inverse is true, buildHistogram() takes the reciprocal of each nonzero value in histogram. More details are explained below.
Declare the array that you will return, histogram, and write a loop that iterates through each row in costs and partitions the cost values into the bins in histogram.
To highlight pairs of documents with abnormally low costs, if the boolean parameter inverse is true, we take a modified inverse of each value in histogram after computing the histogram values:
The effect of inverting the histogram is that very tall bars in the histogram disappear (1 / very large number is close to 0), while very short bars suddenly get tall (the lowest non-zero count a bin could have is 1, and 1 / 1 = 1). For cheat-checking, we aren't interested in the masses of assignment pairs that are equally similar—the ones that correspond to tall histogram bars. Instead, we want to see the very few assignment pairs that are much more similar than all other pairs. In terms of the histogram, we expect most bins corresponding to low cost/high similarity to have a zero count, but a few may have a nonzero (but small) count corresponding to plagiarized assignments. On the standard histogram, the counts in those bins are still so low that we can't see the bars, but inverting them makes the bars large. So the plagiarized assignments tend to pop out, and the non-plagiarized ones tend to disappear.
If you run your code with cytochrome_c.txt, you should get a histogram that looks very similar to the following.
Now that you have a histogram of costs, you can write a function to visualize it: drawHistogram(). More details are explained below.
If you run your code with cytochrome_c.txt, you should get a histogram that looks very similar to the following. Of course, you can use any colors that you want.
You will now add a few interactive features to the histogram you drew in the previous section. interactHistogram() takes an array of costs as an argument, and calls buildHistogram() and drawHistogram() in response to interaction events (key presses and mouse moves). More details are explained below.
Draw a vertical line at the mouse's x-coordinate from the top to the bottom of the plot (within the bounds of the histogram y-axis).
Certain behaviors will be triggered in your program when the user presses certain keys. StdDraw has a facility to detect key presses.
Your program will start by drawing a histogram with 20 bins and with inverse set to false. The user can change this by pressing the following keys.
# of bins = 100 * mouse's x location / 5
Extra credit on this assignment is open-ended. We provide many ideas here, although you will be able to earn at most 3 extra credit points (and remember that the impact of extra credit on your grade is limited). Some of these ideas are relatively easy, and others are harder. Finish the assignment first, submit it, then work on these ideas as they interest you.
You should implement your extra credit in a program RankSimilarityEC.java that you submit as a separate file from your main submission. This simplifies grading and gives you more latitude to change any part of the program you want. You can also submit a zip file, extra.zip, containing any additional java files, data, images, etc. that are needed to run you extra credit.
Submit RankSimilarity.java and your completed readme_rank.txt.
Make sure that in the file you submit, each function runs the code that you wrote and does not run the reference implementation (which is the default implementation of the skeleton file).
You may also submit RankSimilarityEC.java for extra credit, along with any additional material needed for your extra credit implementation in extra.zip, but these are not required.
If you have not already done so, please fill out the anonymous Mid-Semester Feedback Form to help us improve the second half of the course.