Date: 2000 Feb 11
An undocumented feature of the database preparation program is that words with three or fewer characters are ignored. Also, the first occurrence of a punctuation mark marks the end of a word. For example, the apostrophe in ``don't'' causes the database preparation program to consider only ``don,'' which is too short to be included in the database.
To summarize, search only for words with four or more characters.
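To check whether a search word would survive these two rules, a minimal sketch follows. It is not taken from prepareDatabase.cc; the helper names truncateAtPunctuation and isSearchable are hypothetical, and we assume "punctuation mark" means whatever std::ispunct reports.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper (not from prepareDatabase.cc): keep only the part of
// the word that comes before its first punctuation mark.
std::string truncateAtPunctuation(const std::string& word) {
    std::string::size_type i = 0;
    while (i < word.size() &&
           !std::ispunct(static_cast<unsigned char>(word[i]))) {
        ++i;
    }
    return word.substr(0, i);
}

// Hypothetical helper: true if the truncated word has four or more
// characters, i.e. it is long enough to appear in the database.
bool isSearchable(const std::string& word) {
    return truncateAtPunctuation(word).size() >= 4;
}
```

Under these assumptions, isSearchable("don't") is false because only "don" remains, while isSearchable("hello") is true.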
Each line in the database file ends with a space. Without this space, the distributed code is not guaranteed to work correctly.
The second form for sort requires a third parameter to compare elements in the container that are being sorted. The parameter can be a function that acts like less-than <. That is, the function should take two arguments and return a bool indicating whether the first argument is strictly less than the second argument. Using such a function with sort will cause the container to be sorted from smallest to largest. For an example, see the program sorting in a case-insensitive way.
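As a rough illustration of the three-parameter form, here is a small sketch that sorts strings while ignoring case. It is not the course's example program; the names toLower, lessIgnoringCase, and sortIgnoringCase are made up for this sketch.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Illustrative helper: return a lowercase copy of s.
std::string toLower(std::string s) {
    for (std::string::size_type i = 0; i < s.size(); ++i) {
        s[i] = static_cast<char>(std::tolower(static_cast<unsigned char>(s[i])));
    }
    return s;
}

// Acts like less-than: true only if a is strictly less than b, ignoring case.
bool lessIgnoringCase(const std::string& a, const std::string& b) {
    return toLower(a) < toLower(b);
}

// Sort from smallest to largest by passing the comparison function
// as the third parameter to sort.
void sortIgnoringCase(std::vector<std::string>& words) {
    std::sort(words.begin(), words.end(), lessIgnoringCase);
}
```

Note that the comparison must return false when the two arguments are equal. A function that acts like less-than-or-equal is not a valid comparison for sort.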
Daniel White reported that some vector components in the distributed database files were zero for all documents.
Problem Solution (JDO): Copy newly created databases: hello-goodbye.db, etext90.db, etext90.db and, if desired, prepareDatabase.cc.
Problem Description (JDO): During the two phases of database preparation, words were canonicalized during the ``word collection phase'' but not during the phase when documents' vectors were created.
C++ vectors were named after mathematical vectors, but the two are not the same thing, and the shared name can be confusing.
C++ vector | mathematical vector |
like an array | a sequence of numbers |
length means the number of components | length is Euclidean length |
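The Euclidean length of the vector (3, 4, 0, 5, 6, 0) is $\sqrt{3^2 + 4^2 + 0^2 + 5^2 + 6^2 + 0^2} = \sqrt{86} \approx 9.27$, even though the vector has six components.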
In this program, we use C++ vectors to store mathematical vectors.
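To make the distinction concrete, here is a small sketch that computes the Euclidean length of a mathematical vector stored in a C++ vector. The function name euclideanLength and the choice of double components are assumptions for illustration, not necessarily what the distributed code uses.

```cpp
#include <cmath>
#include <vector>

// Illustrative helper: the Euclidean length is the square root of the sum
// of the squared components. This is different from v.size(), which is
// only the number of components.
double euclideanLength(const std::vector<double>& v) {
    double sumOfSquares = 0.0;
    for (std::vector<double>::size_type i = 0; i < v.size(); ++i) {
        sumOfSquares += v[i] * v[i];
    }
    return std::sqrt(sumOfSquares);
}
```

For a C++ vector holding (3, 4, 0, 5, 6, 0), size() returns 6, while euclideanLength returns the square root of 86.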
The short answer is
Jeffrey may have misled CS1321-1 during Tuesday's class. Using for_each or transform to multiply each vector component by a particular number will not work using what we currently know. The same is true for performing the dot product between the search vector and each of the document vectors. Just write a loop to do the work.
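The loops can look something like the following sketch. The function names scale and dotProduct, and the use of double components, are assumptions made for illustration; adapt them to however your program stores its vectors.

```cpp
#include <cassert>
#include <vector>

// Multiply every component of v by factor, in place.
void scale(std::vector<double>& v, double factor) {
    for (std::vector<double>::size_type i = 0; i < v.size(); ++i) {
        v[i] *= factor;
    }
}

// Dot product: the sum of the products of corresponding components.
// Both vectors are assumed to have the same number of components.
double dotProduct(const std::vector<double>& a, const std::vector<double>& b) {
    assert(a.size() == b.size());
    double sum = 0.0;
    for (std::vector<double>::size_type i = 0; i < a.size(); ++i) {
        sum += a[i] * b[i];
    }
    return sum;
}
```

To compare the search vector with every document, call dotProduct once per document vector inside another loop.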
A reminder on how to declare iterators: