Assignment #2 Description


Now that you know a bit more about C++ and have a foundation in the Java part, it is time for you to write some code that will glue the two together, as well as some more significant data structures in C++.

C++: On the C++ side you are going to be doing a fair number of things. The simple tasks are to write implementations of both a sorted list and an unsorted list, using both the STL vector and your own doubly linked list. Over the course of the semester you will be writing a number of containers, both sorted and unsorted. For that reason, you will probably want to think a fair bit about the interface for them. What types of operations make sense for sorted and unsorted containers? You don't want too many methods, because that reduces flexibility, but too few reduces your power. The CLR text might be helpful here.
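As a starting point, here is one possible sketch of what the two interfaces might look like. The method names, and the decision to template on the element type and use abstract base classes, are just my own assumptions to make the idea concrete; picking the actual set of operations is the part I want you to think about.

    // One possible interface sketch -- the operations listed here are only
    // suggestions, not the required set.
    template <typename T>
    class UnsortedList {
    public:
        virtual ~UnsortedList() {}
        virtual void add(const T& item) = 0;             // position unspecified
        virtual bool remove(const T& item) = 0;          // remove first match
        virtual bool contains(const T& item) const = 0;  // linear search
        virtual T& get(int index) = 0;                   // access by position
        virtual int size() const = 0;
    };

    template <typename T>
    class SortedList {
    public:
        virtual ~SortedList() {}
        virtual void add(const T& item) = 0;             // insert in sorted order
        virtual bool remove(const T& item) = 0;
        virtual bool contains(const T& item) const = 0;  // can use binary search
        virtual const T& get(int index) const = 0;       // read-only: edits could break the ordering
        virtual int size() const = 0;
    };

Notice how the sorted version only hands back const references; that is one example of the kind of trade-off the interface decision forces on you.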

In addition to this, you will need to write C++ code that uses direct access files to store off the information from the web pages that your spider finds. I want you to store these as indexed direct access files. This means that you will have one file that stores the data for each of the web pages as variable-length records, and another file that stores the offset where each record begins. Writing this code shouldn't be all that challenging, and I would recommend that you write it as a well-encapsulated class so that it will be easy for you to port to other purposes later in the semester. Templating it might be wise.
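To make the file layout concrete, here is a minimal sketch of the write side, assuming each record in the data file is preceded by its length and the index file is just a sequence of 8-byte offsets. The file names, the exact layout, and the class name are my own choices, not requirements.

    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <vector>

    // Sketch: the data file holds variable-length records (length, then bytes);
    // the index file holds the byte offset where each record begins.
    class RecordWriter {
    public:
        RecordWriter(const std::string& base)
            : data(base + ".dat", std::ios::binary),
              index(base + ".idx", std::ios::binary),
              offset(0) {}

        // Append one record and remember where it started.
        void appendRecord(const std::vector<char>& bytes) {
            std::int64_t start = offset;
            std::int32_t len = static_cast<std::int32_t>(bytes.size());
            data.write(reinterpret_cast<const char*>(&len), sizeof(len));
            data.write(bytes.data(), len);
            index.write(reinterpret_cast<const char*>(&start), sizeof(start));
            offset += sizeof(len) + len;
        }

    private:
        std::ofstream data, index;
        std::int64_t offset;   // byte offset where the next record will start
    };

Reading record n back is the mirror image: seek to n * 8 in the index file to get the offset, seek there in the data file, read the length, then read that many bytes.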

To do the word counts you will also have to have a mapping from words to integer indexes. This mapping will also need to be stored off in a file. This could be done in Java, but I'm going to recommend that you do it in C++ because that will fit in better with later assignments. If you want, this file can be plain text so that both languages can easily load and use it.
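Here is a minimal sketch of how that mapping might look, assuming a plain text format of one word per line so that the line number doubles as the index (the class name and format are my own choices):

    #include <cstddef>
    #include <fstream>
    #include <map>
    #include <string>
    #include <vector>

    // Sketch of a word <-> index mapping that saves as plain text, one word per
    // line, so the line number is the word's index.
    class WordIndex {
    public:
        // Returns the word's index, adding the word if it is new.
        int indexOf(const std::string& word) {
            std::map<std::string, int>::iterator it = lookup.find(word);
            if (it != lookup.end()) return it->second;
            int idx = static_cast<int>(words.size());
            words.push_back(word);
            lookup[word] = idx;
            return idx;
        }

        const std::string& wordAt(int idx) const { return words[idx]; }

        void save(const std::string& fileName) const {
            std::ofstream out(fileName.c_str());
            for (std::size_t i = 0; i < words.size(); ++i) out << words[i] << '\n';
        }

        void load(const std::string& fileName) {
            std::ifstream in(fileName.c_str());
            std::string word;
            while (std::getline(in, word)) indexOf(word);
        }

    private:
        std::vector<std::string> words;     // index -> word
        std::map<std::string, int> lookup;  // word -> index
    };

Keeping it one word per line means the Java side can read it back just as easily if it ever needs to.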

Extra credit: One of the advantages of having separate files for the index and the data is that you can have more than one index file into a particular data file. In fact, unless your program is trying to access records by number (which this one will), the original index file isn't really required, because you could just read through the entire file to get to any record; each record does tell you how long it is. For extra credit you can create a second index file that might be a bit more useful. Make your second index file so that the entries are sorted alphabetically by URL. This will give you a fast way to look up a page by URL instead of by the order in which you happened to spider it.
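Here is a rough sketch of how the second index might be built once you have the URL and starting offset of every record (the entry layout here is my own assumption, not a required format):

    #include <algorithm>
    #include <cstddef>
    #include <cstdint>
    #include <fstream>
    #include <string>
    #include <utility>
    #include <vector>

    // Sketch: given (URL, offset) pairs for every record in the data file,
    // write a second index file whose entries are sorted alphabetically by URL.
    void writeUrlIndex(std::vector<std::pair<std::string, std::int64_t> > entries,
                       const std::string& fileName) {
        std::sort(entries.begin(), entries.end());  // pairs sort by URL first
        std::ofstream out(fileName.c_str(), std::ios::binary);
        for (std::size_t i = 0; i < entries.size(); ++i) {
            // One entry: URL length, URL bytes, then the record's offset.
            std::int32_t len = static_cast<std::int32_t>(entries[i].first.size());
            out.write(reinterpret_cast<const char*>(&len), sizeof(len));
            out.write(entries[i].first.data(), len);
            out.write(reinterpret_cast<const char*>(&entries[i].second),
                      sizeof(entries[i].second));
        }
    }

A lookup can then load these entries and binary search them by URL to jump straight to the right record in the data file.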

 

Java: Perhaps the greatest challenge in this assignment comes from the fact that the data you want to write out to these binary files is coming from the web spider that you wrote in Java for the first assignment. For this assignment you will be adding a little more logic to that spider so that you have something interesting to save off. For each web page, I want you to keep track of its URL, all the pages it links to (these can be stored as integer indexes and should not be full URLs), and a sparse vector telling how many times the words on that page occur. This last element should only count words that do not occur inside tags in the HTML. So from the time you see a '<' to the time you see a '>' you won't be counting anything. Outside the tags you should keep track of how many times different words occur. To do this you will keep a list of words so that each word you have found has an integer index into the list. When you find a new word you will add it to the end of the list and it will get the next number in order. (Note that you could probably speed this up significantly by keeping the words in some structure that supports efficient lookup, for example one that keeps them sorted.)
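The spider itself is Java, but the counting logic reads the same in either language; here is a sketch of the tag-skipping idea in C++ to match the other sketches. It counts by string for clarity; in the real program each word would be turned into its integer index through the word list described above.

    #include <cctype>
    #include <cstddef>
    #include <map>
    #include <string>

    // Sketch: count words in the page text, ignoring everything from '<' to '>'.
    std::map<std::string, int> countWords(const std::string& page) {
        std::map<std::string, int> counts;
        std::string word;
        bool inTag = false;
        for (std::size_t i = 0; i < page.size(); ++i) {
            char c = page[i];
            if (c == '<') {                       // entering a tag: finish any word
                inTag = true;
                if (!word.empty()) { ++counts[word]; word.clear(); }
            } else if (c == '>') {
                inTag = false;
            } else if (!inTag && std::isalpha(static_cast<unsigned char>(c))) {
                word += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
            } else if (!inTag && !word.empty()) { // word boundary outside a tag
                ++counts[word];
                word.clear();
            }
        }
        if (!word.empty()) ++counts[word];        // last word on the page
        return counts;
    }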

So you will start by adding two things to your basic spider from last time. As it parses a page, any time it finds a link it should note that link in such a way that it can associate it with an integer index, either then or later. In addition, it should keep a list of the counts of all the words it has come across on that page. This should be a sparse list, that is, a list that only stores the non-zero counts. You will probably want to keep a separate data structure that keeps track of all the words found so far and has two lookup functions: given a word, it should be able to return that word's index number, and given an index number, it should be able to tell you the word. The page then has a list of pairs where each pair holds a word index and a count.
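Put together, the data kept for each page might look something like this (a sketch only; the field names are mine, and the Java side would hold the equivalent fields):

    #include <string>
    #include <utility>
    #include <vector>

    // Sketch of the per-page data: links are integer page indexes, and the word
    // counts are a sparse list of (word index, count) pairs -- words that never
    // appear on the page simply are not listed.
    struct PageData {
        std::string url;
        std::vector<int> links;                        // indexes of pages linked to
        std::vector<std::pair<int, int> > wordCounts;  // (word index, count)
    };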

After the spider has found all the pages, all the links, and done all the counts, it should communicate with your C++ code to write it all out to a binary file.
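The assignment does not dictate how the two languages talk to each other; one common route is JNI. As a sketch only, here is what the C++ side of such a call could look like, assuming a hypothetical Java class named PageStore with a native writePage method (the class name, method name, and argument types are all my own choices, not part of the assignment):

    #include <jni.h>
    #include <string>
    #include <vector>

    // Hypothetical JNI entry point matching a Java declaration such as
    //     public native void writePage(String url, int[] links,
    //                                  int[] wordIdx, int[] counts);
    // in a class named PageStore.
    extern "C" JNIEXPORT void JNICALL
    Java_PageStore_writePage(JNIEnv* env, jobject,
                             jstring url, jintArray links,
                             jintArray wordIdx, jintArray counts) {
        // Copy the Java data across into ordinary C++ types.
        const char* rawUrl = env->GetStringUTFChars(url, 0);
        std::string pageUrl(rawUrl);
        env->ReleaseStringUTFChars(url, rawUrl);

        jsize numLinks = env->GetArrayLength(links);
        std::vector<jint> linkVec(numLinks);
        env->GetIntArrayRegion(links, 0, numLinks, linkVec.data());

        jsize numWords = env->GetArrayLength(wordIdx);
        std::vector<jint> idxVec(numWords), cntVec(numWords);
        env->GetIntArrayRegion(wordIdx, 0, numWords, idxVec.data());
        env->GetIntArrayRegion(counts, 0, numWords, cntVec.data());

        // ... pack these into one variable-length record and append it to the
        // data file, writing the record's starting offset to the index file ...
    }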

Once you have that done, I want you to write a separate Java program with a little GUI that shows a JList of all the URLs that were spidered. When the user clicks on one, it should show the user the list of URLs that page linked to and the word counts on that page. Here the word counts should be displayed as (string, integer) pairs. This program should not take data directly from the spider. Instead, it should use the C++ code to read in the data that the spider stored.

Extra credit: If you did the C++ extra credit, then it should be fairly easy for you to write an efficient addition to the GUI so that the user can type in a URL and it will use the sorted index to find that URL and get back the links and counts. The data display can use the same part of the GUI as the normal list selection; it is just a different way for the user to request that data.