Date: 2000 Apr 19
Due Thursday, 2000 Apr 27, at the beginning of class.
2000Apr25: In CS1321-2 today, it appeared the hash function scheme presented in lecture was broken. It is not. Read this exposition for an explanation why.
Read chapter 12.
You and possibly a CS1321 classmate are to implement an open-chained hash table and incorporate it into a program to find duplicate WWW pages. You may solve the problem in any way except that you may not use any code available in written or electronic form.
Default function arguments can specify values for function arguments that usually have the same value. For example, the function int foo(int x, int y = 3) can be called using either foo(4,5) or foo(4). In the former case, the parameter values are x = 4 and y = 5. In the latter case, y's parameter value is 3. See also the textbook pp. 60-62.
A hash table is a data structure supporting insertion, removal, and querying of elements in expected constant time by using a hash function. A hash function converts an element into a number specifying where the element should be stored. You are to implement an open-chained hash table storing URL objects, which conceptually are names of WWW pages. The algorithms and hash function to use were presented during lecture.
(The hash table described below has been designed to support finding URLs with the same contents. Other hash tables, e.g., the STL hash table, are similarly designed but details may vary. For example, most hash tables store key-value pairs.)
The hashTable class should support the operations listed in Table 1. These operations are similar to but not identical to these provided by the STL hash tables.
All hashTable operations use == or != to compare keys.
The insert function ensures that the given key is in the hash table. If a matching key, i.e., one equivalent using ==, is already in the hash table, the given key is not inserted. Thus, all keys in the hash table are always unique. insert returns a pair in which
As discussed in lecture, our hash function converts a string, e.g., a URL's contents, to a position in the hash table's main vector. There exist many different hash functions, but we describe one that has proven to work well in practice.
Conceptually, our hash function
In practice, converting the string into a base-256 number yields a very large number that can overflow the largest integer that most computers can store. Instead, we combine the first two steps. After converting each string's character into its ASCII code, multiply by the golden ratio. After each multiplication or addition drop the integral portion of the result. This will ensure that the numbers will always have reasonable size.
A good hash function spreads out a hash table's contents, but, if any significant number of table positions have too many entries, using the hash table will take too much time. Thus, if the hash table becomes too full, we should enlarge the table. If twice the number of items in the hash table exceeds the size of the hash table's main vector, double the vector's size. If the number of items is less than one-eighth of the main vector's size, halve its size.
When resizing the hash table, every URL in the hash table must be rehashed since the hash function's values depend on the table's size. One implementation strategy s to insert the URLs into a newly created vector. Another strategy is to always have two vectors in the hash table object and a variable indicating which vector contains the URLs. To resize, the other vector is cleared, resized to the new size, the URLs are copied into it, and marked as the vector containing the URLs.
Hash tables are commonly used as dictionaries, i.e., data structures supporting insertion, querying, and removal. They are also useful when determining duplication. For example, one problem faced by WWW search engines is identification of mirrored WWW pages, i.e., duplicates of other WWW pages, because it is not desirable to present the user with a long list of duplicate WWW pages. Directly comparing all pairs of WWW pages takes too long. Instead, WWW pages can be hashed and pages compared only if they hash to the same location. This reduces the running time from O(n2) to O(n), where nis the number of WWW pages.
findDuplicates.cc uses a hashTable containing URLs to identify duplicates. Given the name of a file containing a list of URLs as its one command-line argument, the program identifies all duplicate WWW pages. For example, if three URLS ``foo,'' ``bar,'' and ``baz'' have the same WWW contents, two matches will be printed.
A URL (uniform resource locator) object stores a name. Conceptually, this is the name of a WWW page, ftp file, etc. To avoid the slowness of Trinity's Internet problems, we just use filenames, not WWW addresses.
Two URLs are == if and only if their files have exactly the same contents. Broken links, i.e., URLs without a file in the local directory, are not equal to any other URL. The hash table should hash the URL's contents, not its name.
As far as Jeffrey D. Oldham can tell, WWW search engines use technology similar to this homework to eliminate duplicates. The scheme, however, is fooled by ``almost mirrors'' which differ in small ways, e.g., hypertext links or use counters. Recent research shows that careful choice of hash functions can help automatically group related WWW pages. Other uses remain undiscovered.
Using STL containers and functions may ease implementation. We suggest using a vector of vectors of strings. Let the main vector be the vector of vectors.
Useful vector functions include:
The number of distinct characters is UCHAR_MAX, which is defined in limits.h.
A possible order for implementation is
The cmp UNIX command compares two specified filenames, printing whether the files differ or not. For more information, see the ``diff'' info pages accessible via ``Control-h i'' inside Emacs.
If you find using URLs confusing, try implementing hashTable storing another type, e.g., key_type, testing it. Then revise to use URLs.
We have provided code to find duplicate WWW pages and a URL class. Add code to the hash table to implement the hash table storing URLs. The Makefile may be useful.
As for previous homeworks, working with other people during your planning phase is encouraged. For this homework, you are permitted to write the code, i.e., program, with one other person in CS1321. To learn the material, both of you should be actively involved in the programming. Failure to do so will almost certainly hurt your comprehension of the material in the rest of the course.
Each one- or two-person team of programmers should submit only its completed implementation, consisting of the file hashTable.h. You do not need to send any other files. Please send only text documents, do not send Microsoft Word documents, PDF documents, HTML documents, etc. Please include both your names and email addresses at the top of your program.
We will test your program using our own input. Please be sure your code compiles without warning when using g++ -Wall -pedantic.
See the submission details for information how to email the programs. Note that the CS computer configuration was changed over spring break. If you want to receive an automated reply acknowledging your submission, please read the submission WWW page. If a team of two are in different sections, submit exactly once to one of the two permissible email addresses.