CSCI 1321 (Principles of Algorithm Design II), Spring 2001:
Homework 3

Assigned:
February 9, 2001.

Due:
February 19, 2001, at 5pm.

Credit:
40 points.


Hints and tips

How to approach this homework

Solving this homework will probably require as much preparation time as time spent programming. (Our solution required writing approximately two pages of code.)

Here is one way to approach a homework that has more pages of explanation than lines of code.

Remember that discussions with classmates are allowed, up to the point at which you begin writing code.

Programming tip

When trying to use code (e.g., a library function or class) you have never used before, try it out first in a small "throw-away" program. Doing so helps ensure that, when you use the library function or class in a larger program, any problems are caused by new code in the larger program and not by your imperfect understanding of how to use the library function or class. Writing small throw-away programs may seem like a waste of time, but in the long run it can save time. Examples of such throw-away programs are the sample programs vector-use-example.cpp and pair-use-example.cpp.

Some useful STL containers and functions

The C++ Standard Template Library (STL) defines a number of useful container classes. This section describes four of these classes that we use for the search engine, namely pairs, vectors, strings, and hash tables.

It also describes the STL sort() function, which you may find useful.

Pairs

The STL provides a class pair to group together two values. For example, one can group together a string and an integer using pair<string, int>("hello", 3). See pair-use-example.cpp for more simple examples of use. Be sure to #include <utility> in your program.

Creating pairs:

There are at least three different ways to create a pair.

Working with pairs:
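As a minimal sketch, here are three common ways to create a pair, followed by access to its components through the members first and second. The helper name makeEntry is hypothetical, and these three ways are illustrative; they may differ from the ones shown in pair-use-example.cpp.

```cpp
#include <string>
#include <utility>

// Hypothetical helper: groups a word with its vector position.
std::pair<std::string, int> makeEntry(const std::string& word, int position) {
    // Way 1: call the constructor directly.
    std::pair<std::string, int> a(word, position);

    // Way 2: let make_pair() deduce the component types.
    std::pair<std::string, int> b = std::make_pair(word, position);

    // Way 3: default-construct, then assign the two components.
    std::pair<std::string, int> c;
    c.first = word;
    c.second = position;

    // All three hold the same values; pairs compare component by component.
    return (a == b && b == c) ? a : std::pair<std::string, int>();
}
```

Calling makeEntry("hello", 3) yields a pair whose first component is the string and whose second component is the integer.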

Vectors

An STL vector is an array that can change size. Anything one can do to an array, one can also do to a vector. See vector-use-example.cpp for more simple examples of use. Be sure to #include <vector> in your program.

Creating vectors:

Working with vectors:
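A minimal sketch of growing a vector with push_back() and reading it with array-style indexing. The helper names squares and sum are hypothetical and are not part of vector-use-example.cpp.

```cpp
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the squares 0, 1, 4, ... of the first n integers.
std::vector<int> squares(std::size_t n) {
    std::vector<int> v;                          // starts empty
    for (std::size_t i = 0; i < n; ++i)
        v.push_back(static_cast<int>(i * i));    // grows by one element
    return v;
}

// Hypothetical helper: sums the elements using array-style indexing.
int sum(const std::vector<int>& v) {
    int total = 0;
    for (std::vector<int>::size_type i = 0; i < v.size(); ++i)
        total += v[i];
    return total;
}
```

Unlike an array, the vector does not need its final size in advance; size() always reports the current number of elements.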

Strings

An STL string is like a C-style string (array of char) but with more operators and no maximum length. Anything one can do to a C-style string, one can do to a string, and more. Below are some examples of use; see string-use-example.cpp for more examples. Be sure to #include <string>.

Code                  Meaning
string s;             Creates a string with no characters.
string t("hello");    Creates a string containing "hello".
string t = "hello";   Creates a string containing "hello".
s = t;                Makes s equal "hello".
cout << s[1];         Prints the letter 'e'.
s = "good";           Changes s's contents to "good".
s = s + "bye";        Changes s to "goodbye".
cout << s + t;        Prints "goodbyehello".
s.empty();            Yields false because s has characters.
t.size();             Yields 5 because t has five characters.
t.push_back('s');     Appends 's' to t, making it "hellos".
t.clear();            Shrinks t to the empty string.
t.c_str();            Converts t to a C-style string.

When using the .c_str() function to convert from a string to a C-style string, be sure to use the result immediately. The result may become invalid as soon as the string is modified or destroyed, so do not save it for later use. This function is seldom needed, but it is useful when calling the .open(filename) function of an ifstream or ofstream, which requires as input a C-style string (array of char), not a string.
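A minimal sketch of the one place .c_str() is typically needed, opening a file whose name is held in a string. The helper name canOpen is hypothetical.

```cpp
#include <fstream>
#include <string>

// Hypothetical helper: reports whether the named file can be opened for reading.
bool canOpen(const std::string& filename) {
    std::ifstream in;
    // The result of c_str() is used immediately, inside the same expression.
    in.open(filename.c_str());
    return in.is_open();
}
```

Because the C-style string is passed straight to open() and never stored, there is no chance of using it after it has become invalid.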

Hash tables

Hash tables (described in chapter 12 of the textbook) permit quickly finding a pair by specifying the pair's first component. In our search engine, we use a hash table to map from a word to its position in a vector. For example, suppose our hash table is called hm and vector position 12 corresponds to "bonjour", i.e., the table holds the pair ("bonjour",12). Here is C++ code to determine "bonjour"'s vector position:


    hash_map<string,int> hm;        // create the hash table
    // code for storing values in the hash table omitted 
    hash_map<string,int>::const_iterator pos = hm.find("bonjour");
                                    // try to find "bonjour"
    if (pos == hm.end())
      cout << "bonjour not in the hash table\n";
    else
      cout << "bonjour's int is " << (*pos).second << ".\n";

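Be sure to #include <hash_map> near the top of your file. (hash_map is an SGI extension to the STL rather than part of the C++ standard, so the location of this header can vary by compiler.)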
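The lookup pattern above can be wrapped in a small function. This sketch swaps in std::unordered_map as a stand-in for hash_map, on the assumption that a C++11 compiler is available; both offer the same find() and end() interface. The names WordTable and lookup are hypothetical.

```cpp
#include <string>
#include <unordered_map>

// Stand-in for the handout's hash_map<string,int>; assumes a C++11 compiler.
typedef std::unordered_map<std::string, int> WordTable;

// Hypothetical helper: returns the int stored for word, or -1 if it is absent.
int lookup(const WordTable& table, const std::string& word) {
    WordTable::const_iterator pos = table.find(word);
    if (pos == table.end())
        return -1;           // word is not in the table
    return pos->second;      // same as (*pos).second
}
```

Using find() rather than the [] operator matters here: [] would silently insert a missing word, while find() leaves the table unchanged.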

Sorting

The STL sort() function sorts all or some of the elements of a suitable container class into ascending order. Here is an example of using this function to sort a vector:


    vector<int> v(10);
    // code to put elements into v omitted
    sort(v.begin(), v.end());

The two parameters passed to sort() are STL iterators; the first one points to the first element to be sorted, and the second points just past the last element to be sorted. begin() and end(), as the code suggests, are member functions of the vector class.

The function requires that a sensible < operator be defined for the elements to be sorted. This will be the case for simple types (e.g., int or double) and some library classes (e.g., string). If there is no appropriate < operator, you can write your own comparison function and pass it to sort() as a third parameter. This function should behave like less-than, taking two parameters (the objects to be compared) and returning a bool whose value is true if the first object is "smaller" (should appear first in a sorted list) and false otherwise. See print-sorted.cpp for an example of using this feature to do a case-insensitive sort of strings.

Be sure to #include <algorithm> in your program.
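A minimal sketch of passing a comparison function to sort() as its third parameter, here for a case-insensitive sort of strings. The helper names lowered, lessIgnoringCase, and sortIgnoringCase are hypothetical, and the code is not taken from print-sorted.cpp.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: returns a lower-case copy of s.
std::string lowered(const std::string& s) {
    std::string result(s);
    for (std::string::size_type i = 0; i < result.size(); ++i)
        result[i] = static_cast<char>(
            std::tolower(static_cast<unsigned char>(result[i])));
    return result;
}

// Behaves like less-than, but ignores case.
bool lessIgnoringCase(const std::string& a, const std::string& b) {
    return lowered(a) < lowered(b);
}

// Sorts words in place, ignoring case.
void sortIgnoringCase(std::vector<std::string>& words) {
    std::sort(words.begin(), words.end(), lessIgnoringCase);
}
```

With the default < operator, "Banana" would sort before "apple" because upper-case letters precede lower-case ones; with lessIgnoringCase, "apple" comes first.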

Problem statement

The problem is to finish writing a simple Web search engine. Before describing what code needs to be written, we present the model and the algorithmic ideas.

The mathematical model

We call two Web documents similar if they contain many of the same words used with similar frequency. Each Web document is modeled using a very long vector, with each vector component representing the occurrence frequency of a particular word in the document. For example, if the word "molasses" occurs twice as frequently as the word "jam," the molasses component will be twice as large as the jam component.

Technically, two Web documents are similar if the angle between their vectors is small. To understand what this means, first consider the dot product of two vectors, which is the sum of the pairwise products of the vector components. For example, $(3, 4, 5) \cdot (6, 7, 8) = 3 \cdot 6 + 4 \cdot 7 + 5 \cdot 8 = 86$. The dot product is large if the two documents have many of the same words. For the computation, we actually use the relative frequency of words within a document, e.g., "molasses" forms 20% of the document's words while "jam" forms 10%. Using the formula for the dot product, $A \cdot B = |A| |B| \cos\theta$, we see that the angle $\theta$ is small if $(A/|A|) \cdot (B/|B|)$ is large, that is, close to 1.
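A minimal sketch of the dot product and of the cosine of the angle between two vectors. The function names dot and cosine are hypothetical, and both functions assume the two vectors have the same number of components.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Sum of the pairwise products of the components of a and b.
double dot(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::size_t i = 0; i < a.size(); ++i)
        sum += a[i] * b[i];
    return sum;
}

// Cosine of the angle between a and b, i.e., (A/|A|) . (B/|B|).
// Assumes neither vector is all zeroes.
double cosine(const std::vector<double>& a, const std::vector<double>& b) {
    return dot(a, b) / (std::sqrt(dot(a, a)) * std::sqrt(dot(b, b)));
}
```

A vector compared with itself gives a cosine of 1 (angle zero), while two vectors with no nonzero components in common give a cosine of 0.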

Search engine algorithms

The two parts of a search engine are preprocessing the documents and answering search queries.

Preprocessing the documents requires collecting the documents, extracting the documents' words for use as vector components, and then computing each document's vector. For this homework, we just used the wget command to snarf a collection of documents. We then collected all of the documents' words into a hash table and finally converted each document into a vector.

To convert a document into a vector, for every word we read from a document, we increment the word's component in the vector. To determine the component number, we ask the hash table for the word's component. For example, if the hash table is called ht, we can determine hello's component (an integer) from the iterator returned by ht.find("hello"): the iterator's second member holds the component number. We then normalize the vector $A$ by scaling it by the reciprocal of its length $|A| = \sqrt{A \cdot A}$. That is, we multiply every component of $A$ by $1/|A|$.

(Aside: Observe that in the above discussion "vector" means a mathematical vector, not a C++ vector. A C++ vector is like an array, and its length is the number of elements. A mathematical vector is a sequence of numbers, and its length is as defined in the preceding paragraph. Its number of elements, in contrast, determines its dimensionality -- vectors with two elements are two-dimensional, vectors with three elements are three-dimensional, etc.)

For example, if a document contains only the words "bonjour" (three times) and "hello" (four times) and the components of "bonjour" and "hello" are 12 and 20, respectively, then the unnormalized document vector will have a 3 in component 12, a 4 in component 20, and zeroes everywhere else. (The number of vector components is determined by the number of words in the hash table.) The normalized vector then has 0.6 in component 12, 0.8 in component 20, and zeroes everywhere else.
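The normalization step in this example can be sketched as follows. The function name normalize is hypothetical, and leaving an all-zero vector unchanged is an assumption made here to avoid dividing by zero.

```cpp
#include <cmath>
#include <cstddef>
#include <vector>

// Scales v in place by 1/|v|, where |v| = sqrt(v . v).
// Leaves an all-zero vector unchanged to avoid dividing by zero.
void normalize(std::vector<double>& v) {
    double lengthSquared = 0.0;
    for (std::size_t i = 0; i < v.size(); ++i)
        lengthSquared += v[i] * v[i];
    if (lengthSquared == 0.0)
        return;
    const double scale = 1.0 / std::sqrt(lengthSquared);
    for (std::size_t i = 0; i < v.size(); ++i)
        v[i] *= scale;
}
```

For the vector with 3 in component 12 and 4 in component 20, the length is sqrt(9 + 16) = 5, so the components become 3/5 = 0.6 and 4/5 = 0.8.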

Given a list L of search words, we wish to determine the closest Web documents. To do so, we first construct a search vector from L. For each word w in L, we use the hash table to increase w's component by one. Then, we normalize by scaling by the reciprocal of its length. A document is similar if its dot product with the search vector is large. Search words not in the hash table can be ignored.

(Aside: Observe that a user could type the same search word repeatedly. The effect will be to make the repeated words more significant in finding matches -- for example, searching for "hello hello hello hello hello goodbye goodbye" specifies that "hello" is 2.5 times more important than "goodbye". It is not clear that this is a useful feature, and we do not know whether current search engines provide such a feature. We will agree to simply allow the user to repeat words and let the chips fall where they may.)
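A minimal sketch of building the unnormalized search vector from a list of search words, allowing repeats and ignoring words that are not in the hash table. The names WordIndex and countWords are hypothetical, and std::unordered_map is an assumed stand-in for the handout's hash table so that the sketch compiles with a C++11 compiler; the provided search-engine.cpp may organize this differently.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>
#include <vector>

// Stand-in for the handout's KeywordMapping (an assumption, see above).
typedef std::unordered_map<std::string, std::size_t> WordIndex;

// Adds one to the component of each search word; words that are not in
// the table are ignored. The result is not yet normalized.
std::vector<double> countWords(const std::vector<std::string>& words,
                               const WordIndex& index,
                               std::size_t dimension) {
    std::vector<double> counts(dimension, 0.0);
    for (std::size_t i = 0; i < words.size(); ++i) {
        WordIndex::const_iterator pos = index.find(words[i]);
        if (pos != index.end() && pos->second < dimension)
            counts[pos->second] += 1.0;
    }
    return counts;
}
```

Five occurrences of "hello" and two of "goodbye" give components 5 and 2, a ratio of 2.5, and normalizing the vector afterward preserves that ratio.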

Details

Coding the search engine

We provide software for preprocessing sets of documents, the results of the preprocessing, and a few sets of example documents. We also provide a partially written program, including the code required to read in the database file and store its contents in a hash table and in a vector of document-vector pairs with one component per document. Your job is to finish writing the code that queries the user for search words, determines which documents are most similar, and prints the results.

The incomplete program is search-engine.cpp. This program takes one command-line argument (the name of the file containing the database information); it prompts the user to enter search words, computes how close each document in the database is to the search vector, and prints those that are closest. More specifically, this code is supposed to do the following:

  1. Read the document database information into the KeywordMapping hash table and the vector<DocVec> collection of documents and their vectors.

  2. Prompt the user for any positive number of search words (allowing repeats) as well as the desired number n of similar documents.

  3. Find the n closest documents and print them in order from most similar to least similar. It would also be nice to print their scores.
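One way to approach step 3, sketched under the assumption that a score has already been computed for each document: sort (score, name) pairs so that larger scores come first, then keep the first n. The names Scored, higherScore, and closest are hypothetical and do not appear in the provided code, which has its own types and helper functions.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <utility>
#include <vector>

// Hypothetical type: a similarity score paired with a document name.
typedef std::pair<double, std::string> Scored;

// Comparison function for sort(): larger scores come first.
bool higherScore(const Scored& a, const Scored& b) {
    return a.first > b.first;
}

// Returns the n highest-scoring entries, most similar first.
// If fewer than n entries exist, all of them are returned.
std::vector<Scored> closest(std::vector<Scored> scored, std::size_t n) {
    std::sort(scored.begin(), scored.end(), higherScore);
    if (n < scored.size())
        scored.resize(n);
    return scored;
}
```

Keeping the score in each pair makes it easy to print the scores alongside the document names.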

Code is provided for some portions of these tasks, in the form of several functions with extensive comments describing their pre- and postconditions. Review these comments before starting to write your own code; our sample solution makes use of all of the provided functions. To see where you must, or may, add code, look for comments containing the word "ADD", for example

// ====> ADD code here <====

Types

Using STL containers can easily lead to very long names for types. For example,

hash_map <const string, vector<double>::size_type>

is the type of the hash table translating words to vector components. Instead of typing this 48-character name, we can say

typedef hash_map<const string, vector<double>::size_type> KeywordMapping;

and create a new equivalent type named KeywordMapping. Thus, declaring variables with a type of KeywordMapping is the same as using the 48-character type.

The syntax for the type definition statement typedef is

typedef type new-synonym;
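A minimal sketch of defining and using a typedef. The names WordCounts and totalCount are hypothetical and are not among the definitions in types.h.

```cpp
#include <string>
#include <utility>
#include <vector>

// Hypothetical synonym: a list of (word, count) pairs.
typedef std::vector<std::pair<std::string, int> > WordCounts;

// Adds up the counts; the parameter type uses the short synonym.
int totalCount(const WordCounts& counts) {
    int total = 0;
    for (WordCounts::size_type i = 0; i < counts.size(); ++i)
        total += counts[i].second;
    return total;
}
```

The synonym and the long type are interchangeable: a variable declared with the full vector-of-pairs type can be passed to totalCount() unchanged.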

types.h contains several type definitions, including the following:

What files do I need?

First, you will need the incomplete search-engine program search-engine.cpp. You will also need the type declaration file types.h. Be sure it is called types.h and is in the same directory as search-engine.cpp. To compile, use a command similar to

g++ -Wall -pedantic search-engine.cpp -o search-engine

If you want to process your own set of documents, download prepareDatabase.cpp and types.h. Compile using a command similar to

g++ -Wall -pedantic prepareDatabase.cpp -o prepareDatabase

and refer to the last section of this document for instructions on how to use the program.

To ease compilation, you can use a makefile, as described in the preceding homework.

Sample document sets

To help you test your code, we provide three sets of documents and one program to generate random sets. For each set of documents we provide, we provide a file containing the set's words and, for each document, the vector components. I.e., each set contains a database file (suffix .db) for use with search-engine, plus the original text files used to create it, and a compressed archive file (suffix .tgz) in case you want to copy a set of documents to your own computer (to extract the files, use the command tar xzvf filename).

For each set, the only file you need to copy into your own directory is the one ending .db. Please do not copy the entire set(s) of documents into your home directory on the department's computers; this wastes space on the file server's disk and on the daily backup tapes. If you really do want all the files, please store them in the directory called /tmp so that (1) they do not take up space on the file server's disk and (2) they will automatically be erased when the machine is rebooted.

Here are the document sets and program:

What to submit and how

Submit your source code as described in the Guidelines for Programming Assignments. For this assignment, use a subject header of "cs1321 hw3", and submit a single file containing your revised version of search-engine.cpp. You do not need to send types.h or any of the database files; I will provide these files when I test your code.

Miscellaneous tips and remarks

How our search engine is simpler than commercial engines

Our search engine uses one approach to solve the most important and most difficult task performed by search engines: determining which Web documents are closely related to each other. Our code, however, uses a very simple ranking scheme similar to what Altavista probably used at one time. More complicated ranking schemes can yield more usable results such as those returned by Google. Also, our preprocessor makes only a very limited attempt to filter out uninteresting words, by omitting all words of three or fewer characters, and it does not stem words such as "played" and "playing" so that they match. It makes a heuristic attempt to remove punctuation and ignore case. We also do not filter unacceptable Web documents from the document pool.

Commercial search engines must accept more complicated input syntax such as boolean operators (usually), have at least 99.9% uptime, be able to handle a large number of simultaneous queries, and deal with network issues.

Using the prepareDatabase program

You probably do not need to know how to preprocess the documents to complete the assignment, but we describe it here for completeness and so you can process your own set of documents if you desire. The prepareDatabase program takes one or two command-line arguments:

This program preprocesses the documents, sending the keywords it selects and each document's vector to the standard output. Save this output in a file to give to your search engine. (The .db files we provide were generated in this way.)

Where to find out about STL classes and functions

Appendix H of our textbook briefly documents many of the STL classes and functions we will use in this course. For more details, see SGI's online Standard Template Library Programmer's Guide.



Footnotes

© 2001 Jeffrey D. Oldham and Berna L. Massingill. All rights reserved. This document may not be redistributed in any form without the express permission of at least one of the authors.


Berna Massingill
2001-03-01