Assignment #5

Assignment #5 Description

I'm hoping that this is fun for you. Basically you are going to write a simple search engine. The style you are going to use is actually something that struck me while I was taking a course on natural language processing in graduate school. Part of that class included discussion of search engines and how to pull information out of the web. My thought was that a KD-tree would be an efficient mechanism for doing "clustering" this is where you group together pages that deal with similar topics. To do this, you will treat the word count vector as a location in a very high dimensional space. Then you will build a KD-tree based on that. When someone enters a search string, you treat that as a vector and then search in the tree for the sites that are near the search string. You can sort them by "distance" to get relevance.

C++: On the C++ side you get to write the KD-tree implementation. You have lots of flexibility in the details of how this works. One problem is that you probably don't want to use just the word counts as straight values. You almost certainly want to use normalized values and some people find it helps if you take the log of the word count values first. In addition, you might find that you get better results if you throw out commonly occurring short words. You can decide how to do that. One way is to just make a list. Words like "the" and "a" don't really imact the meaning of a page, but they can impact the location of a normalized vector in a high dimensional space. Another approach is to actually count up words in the site and just ignore any short words that occur too often as they probably don't add much meaning.

Once you have the tree built you will need to be able to search it. A KD-tree can be searched for a single element in roughly the same way you search a normal binary tree. However, in this case we don't want to find just the single node that contains the "result", we'd like to find all the nodes that intersect with a sphere of a certain radius around the vector we are actually searching for. This way we will pick up several leaf nodes and the natural clustering of the KD-tree might pull in other interesting hits. Of course, they might also be complete garbage but the frequency of those can be reduced by reducing the size of the sphere. Also remember to present the final results in sorted order of distance from the search vector.

Extra credit: Put the KD-tree on disk and be able to search it efficiently without loading it into memory. This would be required for a large search engine.

Java: All you have to do here is make your GUI so that it will work with the search engine. You have to give the user configuration options both for building the tree and for searching it. Have the search build a "sub-site" that you can display in your graph. Note that this can very easily be a non-connected graph so you will have to deal with that possibility.

Extra credit: Add some other interesting feature to your GUI that hadn't been there before. Make sure that I can't miss it. You might even consider doing a JOptionPane.showMessageDialog at the beginning to tell me what feature I should try.