CS 1321 (Principles of Algorithm Design II):
Homework #3

Assigned:
February 1, 2000
Due:
February 11, 2000, at the beginning of class
Credit:
20 points.

Hints and tips

Homework solution tip

Solving this homework will probably require as much preparation time as time spent programming. (Our solution required writing approximately two pages of code.)

Here is one way to approach a homework that has more pages of explanation than lines of code.

  1. In a first pass, skim through the homework, trying to understand how the search engine works and what you are required to do. Skim through the provided code, particularly the comments.
  2. In a second pass, read carefully, drawing pictures of the search engine and how information moves through it. Try to understand the math as much as possible. In your drawing, try to determine what code you need to write.
  3. Now that you have identified what the task is, carefully read through the provided code, trying to understand what the variable types are, what code is provided, and what code needs to be written.
  4. Plan what functions you will write. Using English, describe what each function should do.
  5. Determine how you will test your code.
  6. Write throw-away programs using the STL containers described below.
  7. Write the necessary code.

Remember that discussions with classmates are allowed, up to the point at which you begin writing code.

Programming tip

When trying to use code you have never previously used, try testing it out in a small ``throw-away'' program. Doing so helps ensure that any difficulties come from the new code, not from other parts of a larger program. Writing small throw-away programs may seem like a waste of time, but, for me, they actually save time. Here is a code template for a throw-away program.
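As a minimal sketch of such a throw-away experiment, the code below tries out one unfamiliar feature (here, a vector's push_back) and prints what happened. The function name tryPushBack and the values 7 and 9 are made up for illustration; replace the body with whatever you want to test.

```cpp
#include <cassert>
#include <iostream>
#include <vector>

// Hypothetical experiment: does push_back really grow a vector one slot at a time?
// Swap the body for whatever unfamiliar feature you want to try out.
std::vector<int>::size_type tryPushBack() {
    std::vector<int> v;                 // starts with size zero
    v.push_back(7);
    v.push_back(9);
    assert(!v.empty());                 // back() is only legal on a non-empty vector
    std::cout << "size = " << v.size() << ", last = " << v.back() << '\n';
    return v.size();
}
```

Calling tryPushBack() from a one-line main and compiling with g++ -Wall -pedantic is enough to run the experiment.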

Some useful STL containers

We provide a short explanation for four STL containers that we use for the search engine:
pair:
a group of two values.
vector:
an array that can grow and shrink.
string:
string of characters that can grow and shrink.
hash table:
table permitting quick lookup.

Pair

The C++ Standard Template Library uses pairs to group together two values. For example, one can group together a string and an integer using pair<string,int>("hello",3). Here is some example code.

Constructors:
There are at least three different ways to create a pair. To create a pair whose two values are default-initialized (here, 0.0 and 0), use

	pair<double,int>()
	
To specify the pair's contents upon creation, use
	pair<double,int>(3.4,12)
	
If you do not want to explicitly write the pair's types, you can use
	make_pair(3.4,12)
	
and the compiler will deduce the pair's types from the arguments (here, double and int).

Predicates:
none

Selectors:
To access a pair p's first item, use
	p.first
	
To access a pair p's second item, use
	p.second
	
Be sure to #include <utility> near the top of your file.
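The constructors and selectors above can be exercised with a short sketch. The helper names makeWordCount and countAgain are hypothetical, chosen only to show make_pair, .first, and .second at work on a word and its count.

```cpp
#include <cassert>
#include <string>
#include <utility>

// Hypothetical helper: group a word with the number of times it was seen.
std::pair<std::string, int> makeWordCount(const std::string& word, int count) {
    assert(count >= 0);                      // a count is never negative
    return std::make_pair(word, count);      // types deduced from the arguments
}

// Hypothetical helper: the selectors yield references, so .second can be changed.
void countAgain(std::pair<std::string, int>& wc) {
    wc.second += 1;
}
```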

Vector

A vector is an array that can change its size. Anything one can do to an array, one can also do to a vector. Here is some example code.

Constructors:
To create a vector of size zero that can hold elements with type T, use

	vector<T> v
	
To create a vector of size n, use
	vector<T> v(n)
	
To change a vector so it can hold n elements, use
	v.resize(n);
	
After doing this, accessing positions 0, 1, ..., n - 1 is legal.

Predicates:
If a vector v is empty,
	v.empty()
	
yields true. If
	v.size()
	
yields s, then positions 0, 1, ..., s - 1 may be accessed.

Selectors:
To access or change the element at position i, assuming the vector has size at least i + 1, use
	v[i]
	
An alternative syntax for accessing or changing the element is v.at(i). This has the nice feature that, if the vector does not have a position i, it throws an out_of_range exception (which stops the program if it is not caught) rather than silently returning a garbage value.

	v.push_back(item)
	
enlarges the vector by one position, inserting item into the last position.
	v.pop_back()
	
shrinks the vector by one position, removing the element in the last position. It takes no argument.

v.begin() and v.end() yield iterators for the vector's beginning and one past the end.

Be sure to #include <vector> near the top of your file.
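A small sketch tying these operations together, assuming a vector of ints. The function names countUpTo, sumAll, and hasPosition are invented for this example; they use push_back, the begin()/end() iterators, and at(), respectively.

```cpp
#include <cassert>
#include <stdexcept>
#include <vector>

// Build a vector holding 0, 1, ..., n - 1 by growing it one position at a time.
std::vector<int> countUpTo(int n) {
    assert(n >= 0);
    std::vector<int> v;
    for (int i = 0; i < n; ++i)
        v.push_back(i);
    return v;
}

// Add up the elements by walking from begin() to one past the end.
int sumAll(const std::vector<int>& v) {
    int total = 0;
    for (std::vector<int>::const_iterator it = v.begin(); it != v.end(); ++it)
        total += *it;
    return total;
}

// at() throws std::out_of_range for a missing position; catch it to test legality.
bool hasPosition(const std::vector<int>& v, std::vector<int>::size_type i) {
    try {
        (void)v.at(i);
        return true;
    } catch (const std::out_of_range&) {
        return false;
    }
}
```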

String

A string is like a C-style string, but with more operators and no maximum length. Anything one can do to C-style strings, one can do to strings, and more. Be sure to #include <string>.

code                   meaning
string s;              creates a string with no characters
string t("hello");     creates a string containing ``hello''
string t = "hello";    another way of doing the same thing
s = t;                 makes s equal ``hello''
cout << s[1];          prints the letter `e'
s = "good";            changes s's contents
s = s + "bye";         changes s to ``goodbye''
cout << s + t;         prints ``goodbyehello''
s.empty();             yields false because s has characters
t.size();              yields 5 because it has five characters
t.push_back('s');      appends `s' to ``hello''
t.clear();             shrinks t to the empty string
t.c_str();             converts t to a C-string.

When using the .c_str() function to convert from a string to a C-style string, be sure to use the result immediately. The pointer it returns stays valid only until the string is next modified or destroyed, so it may ``magically'' disappear by the time a later statement is executed. This function is seldom needed, but it is useful when using the .open(filename) function for file streams, which takes only a const char *.
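As a sketch of the one place .c_str() is typically needed, the hypothetical function countWords below opens a file named by a string and counts its whitespace-separated words. The function name and the -1 return value for an unopenable file are choices made for this example only.

```cpp
#include <cassert>
#include <fstream>
#include <string>

// Count the whitespace-separated words in the named file.
// Returns -1 if the file cannot be opened (a convention chosen for this sketch).
int countWords(const std::string& filename) {
    assert(!filename.empty());
    std::ifstream in;
    in.open(filename.c_str());   // pointer is used at once, while filename is unchanged
    if (!in)
        return -1;
    int n = 0;
    std::string word;            // grows as needed to hold each word
    while (in >> word)
        ++n;
    return n;
}
```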

Hash table

Hash tables (textbook, ch. 12) permit quickly finding a pair by specifying the pair's first component. In our search engine, we use a hash table to map from a word to its position in a vector. For example, suppose our hash table is called ht and vector position 12 corresponds to "bonjour", i.e., the pair ("bonjour",12). Here is C++ code to determine "bonjour"'s vector position.

The .size() member function yields the number of elements in the hash table. Be sure to #include <hash_map> near the top of your file.
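A sketch of the lookup just described. It assumes std::unordered_map as a stand-in for hash_map (the two share the find, end, and size member functions used here); the type name WordTable and the helpers isKeyword and positionOf are hypothetical.

```cpp
#include <cassert>
#include <string>
#include <unordered_map>
#include <vector>

// Assumption: std::unordered_map stands in for the homework's hash_map.
typedef std::unordered_map<std::string, std::vector<double>::size_type> WordTable;

// find() yields an iterator; it equals end() when the word is not in the table.
bool isKeyword(const WordTable& ht, const std::string& word) {
    return ht.find(word) != ht.end();
}

// The iterator points at a pair: first is the word, second is its vector position.
std::vector<double>::size_type positionOf(const WordTable& ht, const std::string& word) {
    WordTable::const_iterator it = ht.find(word);
    assert(it != ht.end());      // callers should check isKeyword first
    return it->second;
}
```

With the pair ("bonjour",12) stored in ht, positionOf(ht, "bonjour") yields 12.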

Problem statement

The problem is to finish writing a simple Web search engine. Before describing what code needs to be written, we present the model and the algorithmic ideas.

The mathematical model

We call two Web documents similar if they contain many of the same words used with similar frequency. Each Web document is modeled using a very long vector, with each vector component representing the occurrence frequency of a particular word in the document. For example, if the word ``molasses'' occurs twice as frequently as the word ``jam,'' the molasses component will be twice as large as the jam component.

Technically, two Web documents are similar if the angle between them is small. To understand what this means, first consider the dot product of two vectors. The dot product of two vectors is the sum of the pairwise multiplication of vector components. For example, (3, 4, 5) . (6, 7, 8) = 3*6 + 4*7 + 5*8. The dot product is large if the two documents have many of the same words. For the computation, we actually use the relative frequency of words within a document, e.g., ``molasses'' forms 20% of the document's words while ``jam'' forms 10%. Using the formula for dot product $A \cdot B = |A| |B| \cos\theta$, we see that the angle $\theta$ is small if $(A/|A|) \cdot (B/|B|)$ is large.
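The dot product and the cosine of the angle can be sketched in a few lines. The function names dot and cosine are illustrative and are not taken from the provided code; both assume the two vectors have the same number of components.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Sum of the pairwise products of the components.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::vector<double>::size_type i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}

// cos(theta) = (A . B) / (|A| |B|); values near 1 mean a small angle.
double cosine(const std::vector<double>& a, const std::vector<double>& b) {
    double lengthA = std::sqrt(dot(a, a));
    double lengthB = std::sqrt(dot(b, b));
    assert(lengthA > 0.0 && lengthB > 0.0);   // the angle is undefined for a zero vector
    return dot(a, b) / (lengthA * lengthB);
}
```

For the example above, dot applied to (3, 4, 5) and (6, 7, 8) yields 18 + 28 + 40 = 86.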

Search engine algorithms

The two parts of a search engine are:

  1. Preprocessing the documents to produce a vector for each Web document.
  2. Given a list of search words, finding the closest Web documents.

Preprocessing the documents requires collecting the documents, extracting the documents' words for use as vector components, and then computing each document's vector. For this homework, we just used the wget command to snarf a collection of documents. We then collected all of the documents' words into a hash table and finally converted each document into a vector.

To convert a document into a vector, for every word we read from a document, we increment the word's component in the vector. To determine the component number, we ask the hash table for the word's component. For example, if the hash table is called ht, we can determine hello's component (an integer) using ht.find("hello")->second; ht.find itself yields an iterator to the pair, not the integer. We then normalize the vector A by scaling by the reciprocal of its length $|A| = \sqrt{A \cdot A}$. That is, we multiply every component of A by $1/|A|$.

For example, if a document contains only the words ``bonjour'' (three times) and ``hello'' (four times) and the components of ``bonjour'' and ``hello'' are 12 and 20, respectively, then the unnormalized document vector will have a 3 in component 12, a 4 in component 20, and zeroes everywhere else. (The number of vector components is determined by the number of words in the hash table.) The normalized vector then has 0.6 in component 12, 0.8 in component 20, and zeroes everywhere else.
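The bonjour/hello example can be reproduced with a short sketch. The names countComponents and normalize are hypothetical, and leaving an all-zero vector unchanged is an assumption made here, not something the provided code is known to do.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

typedef std::vector<double>::size_type Component;

// One increment per word occurrence; occurrences lists each word's component number.
std::vector<double> countComponents(Component numComponents,
                                    const std::vector<Component>& occurrences) {
    std::vector<double> v(numComponents, 0.0);
    for (std::vector<Component>::size_type i = 0; i < occurrences.size(); ++i) {
        assert(occurrences[i] < numComponents);
        v[occurrences[i]] += 1.0;
    }
    return v;
}

// Multiply every component by 1/|A|, where |A| = sqrt(A . A).
void normalize(std::vector<double>& v) {
    double lengthSquared = 0.0;
    for (Component i = 0; i < v.size(); ++i)
        lengthSquared += v[i] * v[i];
    if (lengthSquared == 0.0)
        return;                       // assumption: leave an all-zero vector alone
    double scale = 1.0 / std::sqrt(lengthSquared);
    for (Component i = 0; i < v.size(); ++i)
        v[i] *= scale;
}
```

Here the length is sqrt(3*3 + 4*4) = 5, so the components become 3/5 and 4/5.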

Given a list L of search words, we wish to determine the closest Web documents. To do so, we first construct a search vector from L. For each word w in L, we use the hash table to increase w's component by one. Then, we normalize by scaling by the reciprocal of its length. A document is similar if its dot product with the search vector is large. Search words not in the hash table can be ignored.
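One possible way to build the search vector, sketched under the same assumption as before that std::unordered_map stands in for hash_map. The names WordTable and makeSearchVector are hypothetical. Words missing from the table are skipped, and a query with no known words is left as all zeroes.

```cpp
#include <cassert>
#include <cmath>
#include <string>
#include <unordered_map>
#include <vector>

// Assumption: std::unordered_map stands in for the homework's hash_map.
typedef std::unordered_map<std::string, std::vector<double>::size_type> WordTable;

std::vector<double> makeSearchVector(const std::vector<std::string>& searchWords,
                                     const WordTable& ht,
                                     std::vector<double>::size_type numComponents) {
    std::vector<double> q(numComponents, 0.0);
    for (std::vector<std::string>::size_type i = 0; i < searchWords.size(); ++i) {
        WordTable::const_iterator it = ht.find(searchWords[i]);
        if (it == ht.end())
            continue;                         // not in the hash table: ignore the word
        assert(it->second < numComponents);
        q[it->second] += 1.0;                 // repeated words count repeatedly
    }
    double lengthSquared = 0.0;
    for (std::vector<double>::size_type k = 0; k < q.size(); ++k)
        lengthSquared += q[k] * q[k];
    if (lengthSquared > 0.0) {                // normalize unless no word was known
        double scale = 1.0 / std::sqrt(lengthSquared);
        for (std::vector<double>::size_type k = 0; k < q.size(); ++k)
            q[k] *= scale;
    }
    return q;
}
```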

(Aside: Although I had not previously considered this, I suppose a user could type the same search word repeatedly. For example, searching for ``hello hello hello hello hello goodbye goodbye'' specifies that ``hello'' is 2.5 times more important than ``goodbye''. Is adding the importance of each word to a search query a useful feature? Do any current search engines provide this feature?)

Details

Coding the search engine

We will provide software for preprocessing sets of documents, the results of the preprocessing, and a few sets of example documents. Your job is to finish writing the code that queries the user for search words, determines which documents are most similar, and prints the results.

For each set of documents we provide, we will provide a file containing the set's words and, for each document, the vector components. We will also provide code to read in the file, storing its contents in a hash table and in a vector of document-vector pairs with one component per document.

Although you probably do not need to know how to preprocess the documents to complete the assignment, we describe it here for completeness and so you can process your own set of documents if you desire. The prepareDatabase program takes one or two command-line arguments:

  1. the name of a file listing all the documents to be included in the database. Documents are specified by filename.
  2. optionally, a prefix to affix to the beginning of each document filename. For example, if the document's filename is ``zmrcs10.txt'' and the prefix is ``ftp://metalab.unc.edu/pub/docs/books/gutenberg/etext99/'' then the combined name is ``ftp://metalab.unc.edu/pub/docs/books/gutenberg/etext99/zmrcs10.txt'', which looks a lot like a Web address.

This program preprocesses the documents, sending the keywords it selects and each document's vector to the standard output. Save this output in a file to give to your search engine.

Your job is to finish the search engine code that queries the user for search words, computes how close each document is to the search vector, and prints the closest documents. More specifically, this code is supposed to:

  1. Read the document database information into the KeywordMapping hash table and the vector<DocVec> collection of documents and their vectors.
  2. Prompt the user for any positive number of search words (allowing repeats) as well as the desired number n of similar documents.
  3. Find the n closest documents, printing them in order from most similar to least similar. It would also be nice to print their scores.
Code is provided for some portions of these tasks.
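Step 3 might be sketched as follows. The typedefs follow the description of DocVec and DocScore given for types.h, but the actual declarations there may differ; the function names higherScore and closestDocuments are hypothetical. Sorting every score and then truncating is one simple choice, not necessarily the approach of our solution.

```cpp
#include <algorithm>
#include <cassert>
#include <string>
#include <utility>
#include <vector>

// Assumed shapes, based on the descriptions of the types in types.h.
typedef std::pair<std::string, std::vector<double> > DocVec;
typedef std::pair<std::string, double> DocScore;

// Orders scores from most similar (largest dot product) to least similar.
static bool higherScore(const DocScore& a, const DocScore& b) {
    return a.second > b.second;
}

// Score every document against the search vector and keep the n best.
std::vector<DocScore> closestDocuments(const std::vector<DocVec>& docs,
                                       const std::vector<double>& searchVector,
                                       std::vector<DocScore>::size_type n) {
    std::vector<DocScore> scores;
    for (std::vector<DocVec>::size_type i = 0; i < docs.size(); ++i) {
        const std::vector<double>& d = docs[i].second;
        assert(d.size() == searchVector.size());
        double s = 0.0;                               // dot product with the search vector
        for (std::vector<double>::size_type k = 0; k < d.size(); ++k)
            s += d[k] * searchVector[k];
        scores.push_back(DocScore(docs[i].first, s));
    }
    std::sort(scores.begin(), scores.end(), higherScore);
    if (scores.size() > n)
        scores.resize(n);                             // drop everything past the n best
    return scores;
}
```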

Types

Using STL containers can easily lead to very long names for types. For example, hash_map <const string, vector<double>::size_type> is the type of the hash table translating words to vector components. Instead of typing this forty-eight character type name, we say

    typedef hash_map<const string, vector<double>::size_type> KeywordMapping;
creating a new equivalent type named KeywordMapping. Thus, declaring variables with a type of KeywordMapping is the same as using the forty-eight character type.

(The syntax for the type definition statement typedef is

    typedef type new-synonym;
)

types.h contains several type definitions:

DocVec:
is a pair of a document's name, which is a string, and its vector. A document vector contains doubles. We guarantee that every component has a value.
DocScore:
is a pair of a document's name, which is a string, and its dot product with the search vector, a double.
KeywordMapping:
is a hash table translating a word, i.e., a string, to its component in a vector if the word is in the hash table.

What files do I need?

Your job is to finish writing the search engine code in search-engine.cc. You will also need the type declaration file types.h. Be sure it is called ``types.h'' and is in the current directory. To compile, use a command similar to g++ -Wall -pedantic search-engine.cc -o search-engine.

If you want to process your own set of documents, download prepareDatabase.cc and types.h. Compile using a command similar to g++ -Wall -pedantic prepareDatabase.cc -o prepareDatabase.

To ease compilation, use the Makefile. To create an executable called search-engine, use

    make search-engine

Sample document sets

You will need to generate your own data to test your code, but we have provided three sets of documents and one program to generate random sets.

Each set's database file for use with search-engine has a ``.db'' suffix. Compressed archive files ending with a ``.tgz'' suffix are provided in case you want to copy a set of documents to your own computer. To extract the files, use the command tar xzvf filename.

Please do not copy the large set of documents to your home directory on the CS department's computers. If any significant fraction of students do so, the department's disk will quickly fill up. Instead copy just the database file to your home directory. If you really do want all the files, please store them in the directory called /tmp so that (1) they will not fill the computer science disk and (2) they will automatically be erased when the machine is rebooted.

How our search engine is simpler than commercial engines

Our search engine uses one approach to solve the most important and most difficult task performed by search engines: determining which Web documents are closely related to each other. Our code, however, uses a very simple ranking scheme similar to what AltaVista probably once used. More complicated ranking schemes can yield more usable results such as those returned by Google. Also, our preprocessor makes a very limited attempt to filter out uninteresting words by omitting all words of three or fewer characters, but it does not stem suffixes from words such as ``played'' and ``playing'' so that they match. It makes a heuristic attempt to remove punctuation and ignore case. We also do not filter unacceptable Web documents from the document pool.

Commercial search engines must accept more complicated input syntax such as boolean operators (usually), have at least 99.9% uptime, be able to handle a large number of simultaneous queries, and deal with network issues.

What to submit and how

Please submit only your completed search-engine.cc. You need not send any other files. Please send only text documents; do not send Microsoft Word documents, PDF documents, HTML documents, etc.

We will test your code on our sample data. Please be sure it compiles without warnings when using g++ -Wall -pedantic.

See the guidelines for programming assignments for instructions on how to e-mail the programs. For this assignment use a subject line of ``cs1321 homework 3''.