We have put together several stub files –– click here to download them.
8 files to submit: HashMap.java
, HashMapTest.java
, Trie.java
, TrieTest.java
, ITerm.java
, Autocomplete.java
, AutocompleteTest.java
, QuantitativeAnalysis.pdf
Please do not submit any additional files.
Be sure to follow these guidelines:
package src;
).[compilation]
, and
two free submissions on the autograder marked [full]
.java.util.HashMap
: A Rite of PassageWe have used hash-based data structures throughout this class on various programming assignments, and we have proven certain properties about them in lecture and recitation. The time has now come to embark on a rite of passage that every budding computer scientist must take. It is time to implement a hash table.
But seriously, too many people go around blissfully using this magical
all-operations-expected-$O(1)$ data structure unaware of how it works, how to
get the most out of it, and when something else might be better. In this
homework, to ensure you aren’t that person, you will implement a
production-grade hash-map which will conform to the Java 7 Map
interface (see a section later below on the Java 8 changes).
We provide a IHashMap
which your implementation should implement
from, which reduces the work for a few complex Map
methods. Before you
get started, take a look at the java.util.Map
interface
to familiarize yourself with the API.
A prefix trie1 is an ordered tree data structure, which stores string keys by storing characters in nodes. The “prefix” part of the name comes from the fact that common prefixes between keys share the same path from the root in the trie. For a naive trie, there is a fair amount of overhead, since every character lives in its own node. One example of an optimization that addresses this object overhead issue is a PATRICIA trie, in which each nodes are compressed into its parent when appropriate in order to optimize space and memory usage. Unfortunately, you will not be implementing PATRICIA tries for this assignment.
1: The term “trie” comes from retrieval, and is most commonly pronounced “try”.
We’ve provided a test harness in TestHarness.java
that records execution
duration and memory usage of both map implementations. The harness tests two
datasets, dictionary.txt
(350,000 words) and phonenumbers.txt
(8,000 phone
numbers, all of the form (215)-898-xxxx, encoded as letters and thus starting
with “cbfiji”). You might need to move these files to the root of your project
(or wherever your working directory is, normally not in the src
folder)
to avoid FileNotFoundExceptions
. Further setup instructions can be found below. Before you do run the harness,
you should take a look at each dataset and form a hypothesis on the
CPU times and memory usages of each implementation on each dataset; the results
might surprise you!
IMPORTANT: Historically, many students have experienced issues setting up
jamm.jar
and TestHarness.java
. Please ensure that you can get this up and
running as soon as possible, since it’s required for the quantitative analysis
questions below. You can, however, do the programming part of this homework
without this step.
You MUST set up your project with Java 8
, otherwise there is no guarantee that TestHarness.java
will run properly.
When you first import the project into your workspace, you may see some error
messages associated with the TestHarness.java
file. Here are setup instructions you
need to follow:
jamm.jar
is added as a library to your project (File → Project Structure → Libraries → Click on the plus sign in the right pane → Select the jar file).-javaagent:jamm.jar
to the VM options field.
For your reference, the window should look something like this:
HashMap.java
, run TestHarness by right-clicking on the file and
choosing “Run TestHarness.main()”.
It will report the CPU time and memory usage of your implementation! If you
receive an error, restart IntelliJ and try again (you will use this part again
in Part 3).Acquire jamm.jar
from the assignment’s stub files.
Make sure the jamm.jar
file is in the root (i.e., the folder you will
click delete on if you want this whole project gone) of your project (not the
src
folder). Also, make sure the dictionary.txt
and phonenumbers.txt
files are at the root.
Right click on your project and go to Build Path $\rightarrow$ Configure
Build Path $\rightarrow$ Libraries $\rightarrow$ Add JARs. Select jamm.jar
from the file selector drop down and hit “Okay”.
If everything up to this point was done correctly, your directory should look like this:
Right-click on TestHarness.java
and go to Run As $\rightarrow$ Run
Configurations. In Java Application, under the “arguments” tab, add the
string -javaagent:jamm.jar
to the VM arguments field, then Apply and Run.
Once you’ve implemented the methods in HashMap.java
, run TestHarness.java
.
It will report the CPU time and memory usage of your implementation! If you
receive an error, restart Eclipse and try again (you will use this part again
in Part 3).
Note: If you are using OS X, you may receive the following error in the console:
Class JavaLaunchHelper is implemented in both /Library/Java/JavaVirtualMachines
/jdk1.8.0\_102.jdk/Contents/Home/bin/java and /Library/Java/JavaVirtualMachines/
jdk1.8.0\_102.jdk/Contents/Home/jre/lib/libinstrument.dylib. One of the two will
be used. Which one is undefined.
This is fine, as the test harness will still run as long as everything else is set up properly.
HashMap
(40 points)Files to submit: HashMap.java
, HashMapTest.java
Recall the method of chaining discussed in lecture and in recitation.
A hash table first hashes an Object
’s hash code into a bucket index appropriate
for the length of the backing array. Each bucket is a linked list of map entries that
can be added to, modified in place, or removed from. Note that when you add an element to a bucket, you
should add the element to the front of the bucket (i.e., the head of the
linked list).
We have provided a fair bit of skeleton code for you, namely the constructors, the table buckets, and the hashing methods. You need to worry about the following eight method stubs:
get(K key)
containsKey(K key)
put(K key, V value)
resize(int newCapacity)
remove(K key)
containsValue(V value)
clear()
entryIterator()
Each method stub contains further instructions. It is critical that you read both the Javadoc specification (best done by using IntelliJ to read the spec) and the implementation comments for each method as this will answer any potential questions you have about implementation details. They contain necessary information on both the external and internal behavior of these methods, and hints on how to go about implementing them.
It is also critical that you understand the provided code and methods. You will
want to pay special attention to the threshold
and loadFactor
variables,
the hash(int h, int length)
method, and the Entry
inner class. Your solution
will explicitly invoke these entities.
For threshold
/resizing, you should resize when you realize that adding the next element would cause you to reach the threshold. For example if threshold
is 3, you should resize when you’re about to add the third element.
You should make sure you are correctly handling null keys and null values (for example, null keys should be hashed to index zero). Note that non-null keys can also be hashed to index zero. Be careful to explicitly handle and test those, and don’t be afraid to repeat some code if you want to explicitly isolate the null cases.
Finally, entryIterator()
can return the elements in any order, but must be lazy (you cannot just put everything
into a list and return it).
IMPORTANT: Do not throw any ClassCastException
s. Whenever the HashMap
Javadocs say to throw a ClassCastException
, you can safely ignore this.
In Java 8, HashMaps actually are no longer implemented using simple linked lists for chaining. Instead, Java now uses balanced binary search trees as the buckets. This decision comes with certain trade-offs. Namely, it reduces the worst case access time for any element from $O(n)$ to $O(\log n)$ because insert, search, and delete in balanced binary search trees take $O(\log n)$ time. However, it also adds increased overhead in terms of space and does not change the expected $O(1)$ access time.
If you are interested in the specifics of the Java 8 implementation, it uses Red-Black trees (which are balanced binary search trees, meaning the tree is of height $O(\log n))$. These trees are only used when the HashMap passes certain thresholds: when there are more than 8 elements in one bucket and the number of slots in the HashTable is at least 64. For more information about these specifics, check out the discussion here.
hashCode()
When writing unit tests for your HashMap
, you may find it useful to force
collisions between Object
s that you are inserting in the map. Since your
implementation will require the use of Java’s hashCode()
method, you will have
to override this method when testing your chaining.
Suppose you want to force a collision between two objects in your unit test. You can do the following:
@Test
public void testCollision() {
Object obj1 = new Object() {
@Override
public int hashCode() {
return 5;
}
};
Object obj2 = new Object() {
@Override
public int hashCode() {
return 5;
}
};
map.put(obj1, "foo");
map.put(obj2, "bar");
//...
}
An alternative, more clean approach would be to create a class where you can set
the hashCode
at construction:
static class MockHashObject {
private final int hashCode;
public MockHashObject(int hashCode) {
this.hashCode = hashCode;
}
@Override
public int hashCode() {
return hashCode;
}
}
Either of these approaches works for testing your chaining.
We are providing you a visualizer to help you get a feel of what the underlying table behind your hashmap implementation looks like.
You should be able to run the visualizer by right-clicking on HashMapVisualizer.java
and running the main() method. If not, make sure graphJar.jar
is added as a library to your project (File → Project Structure → Libraries → Click on the plus sign in the right pane → Select the jar file).
This is the same setup required to run the Huffman visualizer and the Graph Algorithm visualizer.
For the visualizer we need you to add the graphJar.jar
file to your build path. To do so, right click on your project in Eclipse → Build Path → Configure Build Path → Click on Classpath → Click on Add External JARs… on the right pane → select GraphJar.jar → Open → Apply and Close.
This is the same setup required to run the Huffman visualizer and the Graph Algorithm visualizer.
The visualizer supports all the various functionalities that HashMap supports. Note that setting either field (key/value) as null
allows it to be interpreted as null
by the visualizer.
Try adding the entries whose keys are FB
and Ea
. Since both return the same hash, they should show to collide in your visualizer!
If you run into any difficulties using the visualizer, please post on Piazza.
TrieMap
(45 points)Files to submit: Trie.java
, TrieTest.java
This should be a straightforward “standard” trie implementation, as described
in the lecture notes.
Your trie should support inserting strings which are prefixes of each other (ex: pen and penguin).
We have provided a skeleton for you in the form of a nested class Node
and a
few helper methods, and we’ve left you to worry about the following method
stubs:
put(CharSequence key, V value)
get(CharSequence key)
containsKey(CharSequence key)
containsValue(Object value)
remove(CharSequence key)
clear()
countPrefixes(CharSequence prefix)
allValuesWithPrefix(CharSequence prefix)
Each method stub contains further instructions. It is critical that you
read both the Javadoc specification (best done by using IntelliJ to read
the spec) and the implementation comments for each method as this will
answer any potential questions you have about implementation details. They
contain necessary information on behavior of these methods, hints on how to
go about implementing them, and important explanations of differences from
the HashMap
implementation.
The trickiest part of this implementation will be correct removal of keys. You might find it helpful to work out examples of each method on paper before going forward with your implementation.
A few things to note:
null
keys or values. The empty string is still valid as a key,
however.Node
class given to you (and you should!) to meet the runtime requirements of the methods. (Hint: consider adding more instance variables).allValuesWithPrefix
and countPrefixes
will return duplicate values. For instance if you have the key value pairs ("apple", 5)
and ("app", 5)
in your trie then allValuesWithPrefix("ap") = {5, 5}
and countPrefixes("ap") = 2
Similarly to the HashMap, we are also providing a visualizer for you to help you see the structure of your TrieMap.
You should be able to run the visualizer by right-clicking on TrieMapVisualizer.java
and running the main() method. If not, make sure graphJar.jar
is added as a library to your project (File → Project Structure → Libraries → Click on the plus sign in the right pane → Select the jar file).
This is the same setup required to run the Huffman visualizer and the Graph Algorithm visualizer.
For the visualizer we need you to add the graphJar.jar
file to your build path. To do so, right click on your project in Eclipse → Build Path → Configure Build Path → Click on Classpath → Click on Add External JARs… on the right pane → select GraphJar.jar → Open → Apply and Close.
This is the same setup required to run the Huffman visualizer and the Graph Algorithm visualizer.
Due to the limited amount of screen space, we will not be displaying every possible node. Please read the following very carefully to understand how the visualizer interprets your underlying datastructure.
Adding the entries {“pen”, 1} {“penguin”, 2} {“pencil”, 3} will hence produce the trie below!
Files to submit: TrieMap.java
.
Implement the entryIterator()
method to return the entries in lexicographic
order with respect to the keys. You must write this as a true lazy iterator—an
implementation that simply dumps all the elements into a collection and
retrieves an iterator from the collection will be awarded no credit. The iterator
only stores enough state to do its job. Constraints:
If this last constraint about space usage is too difficult, you can ignore that constraint for half credit. For more information, refer to the JavaDoc of the iterator.
This is for Kudos points only, and does not count for your grade. Therefore it is completely optional!
Autocomplete
(10 Points)Files to submit: ITerm.java
, Autocomplete.java
, AutocompleteTest.java
Now that you have an implementation of a Trie, it’s time to see it in action. You will use your Trie implementation to create an autocomplete tool. Autocomplete is used every day to save users time when searching for a single word among a large dictionary of possible words. It takes what the user has already typed as the prefix to generate word suggestions. To handle these prefix-based searches effectively, Tries are often used!
First, a reminder on comparators. A comparator is a special type of Java object that allows us to organize the objects in an ordering that we get to define. We have provided you with an object called Term
that we will use in our autocomplete implementation. Please implement the byReverseWeightOrder()
method in ITerm.java
.
Lastly, use your Trie implementation to complete the required methods in Autocomplete.java
. Note there shouldn’t be much code to actually write here. The Autocomplete
class will act as a wrapper to your Trie implementation.
To run the autocomplete gui you need to right-click on AutocompleteGUI.java
→ More Run/Debug → Modify Run Configuration… Then under the feild that says “program arguments” you need to write the filename in quotes followed by the number of suggestions. It should look something like this:
Click “apply” and close out of the window and you can now run the GUI by right clicking on AutocompleteGUI.java
→ Run ‘AutocompleteGUI.main()’. It should look something like this:
Please note that we provided two files to work with the AutocompleteGUI which are pokemon.txt
and wikitionary.txt
. If you with to use other files please copy the format that those two files follow in order for the GUI to work correctly.
Please answer the following questions in a LaTeX typeset file called
QuantitativeAnalysis.pdf
:
Note: You must have followed the setup instructions for TestHarness.java
and jamm.jar
in Part 0 to be able to answer these
questions.
Important: You must provide the actual program output for question 1, and actual numbers for question 5. If you do not, you will receive no credit for this section.
After finish your implementation of the HashMap and TrieMap, click run on
TestHarness.java
. Note: this will take a few minutes to run on your
computer.
Please copy and paste the output of TestHarness.java
as an answer
to this question, and place it inside of a verbatim
environment in your LaTeX file.
For dictionary.txt
, which implementation (between your implementations of TrieMap
and HashMap
) had a better running
time? Which implementation had better space usage? What about for
phonenumbers.txt
? Was this what you were expecting? Why or why not?
It was mentioned in lecture that tries are more space-efficient than hash
tables due to the fact that they compress common prefixes. Did your
implementation reflect this for dictionary.txt
? What about for
phonenumbers.txt
? If so, why? If not, what could you potentially do
to improve the memory consumption of your TrieMap
?
Based on your answer to 3, does Big-Oh notation tell us anything about the actual running time or space usage of an algorithm on a data set? Why or why not? What might an implication of this be for software development?
Add a call to initChildren()
to the end of the Node
constructor in
TrieMap.java
, then re-run the test harness, and consider the results for
TrieMap
and dictionary.txt
. The difference here is that now we initialize
the children array as soon as the node is constructed, instead of waiting
until we actually add a child (the “lazy” way). How much memory, both
absolutely and relatively, does the lazy initialization save us?
(Give actual numbers.) Would you say the lazy initialization is a
worthwhile optimization? Why or why not?
Please remove this change once you’re done answering Q5.
As mentioned above in the writeup, Java 8’s implementation of HashMap
s differs to
what you implemented in this assignment. How do your results compare with Java’s
implementation of a HashMap
in terms of both
space and time? What do you attribute the differences to?
How do your results compare with Java’s implementation of a TreeMap
in terms of both
space and time? What about the results between TreeMap
and java.util.HashMap
? What do you
attribute the large disparity in runtime to? Hopefully now you understand why HashMap
s are so
popular and much more widely in use :)
NOTE: A TreeMap
is not a TrieMap
. These are very different. A TreeMap
is implemented
as a red-black balanced binary search tree.
Again, as a reminder: you must provide the actual program output for question 1, and actual numbers for question 5. If you do not, you will receive no credit for this section.
The above parts together are worth a total of 90 points. Style and code coverage are graded on a subtractive basis with deductions as specified in the Setup and Logistics section of the writeup, and you will be graded according to the CIS 1210 style guide. Gradescope will give you your grade on this section immediately upon submission to either autograder.
IMPORTANT: Please DO NOT use any external files (.txt, etc.) to write your JUnit tests for this assignment. Our code coverage tool will fail, and you will not be able to submit your code due to failing test cases.
You will need to write comprehensive unit test cases for each of the classes, inner classes, and methods you implement (including helper methods!) in order to ensure correctness. Make sure you consider edge cases and exceptional cases in addition to typical use cases. Use multiple methods instead of cramming a bunch of asserts into a single test method. Your test cases will be auto-graded for code coverage. Be sure to read the testing guide for some pointers on what level of testing we expect from you. Finally, due to the Code Coverage tools we use, please be sure to use JUnit 4 as it is the only version supported.
Note: You will not be able to write JUnit test cases for any private
methods. Instead, make them package-private by leaving off any privacy modifier
(i.e., do not write public
or private
).