Assignment #3 Description


Last assignment you integrated your C++ and Java code just enough so that you could save off the results of your spidering into a binary file that you could later search through for the things that you found. For this assignment you will strengthen that link between the two parts of your code and add some functionality on both sides. If you did the extra credit for the last assignment, you created a reasonably fast way for people to search for a web page by its URL. This time you will write even faster code for that. You will also write some Java code that uses this, as well as some code to make a nice graphical presentation of the relationships between the pages that you have spidered.

C++: On the C++ side you need to write a hash that can be used to look up web pages by full URLs in roughly constant time. It should also be usable with other data types that satisfy certain requirements (in other words, it needs to be templated). Depending on the way that you have designed your C++ implementation and the interface to Java, this hash will either be stored in memory on the C++ side, or it will be written to disk as a binary file that can be quickly accessed. The main function that you have to write outside of the hash itself is one that, when given a URL, returns to the caller (the Java code in particular) the data that you have stored for that page.
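
One possible shape for this, sketched with simple chaining (which is only one of the collision strategies; the names HashTable, put, and get are made up, not required by the assignment):

```cpp
// Minimal sketch of a templated, chained hash table keyed on the URL string.
// HashTable, put, and get are placeholder names; your interface can differ.
#include <cstddef>
#include <functional>
#include <optional>
#include <string>
#include <utility>
#include <vector>

template <typename K, typename V, typename Hash = std::hash<K>>
class HashTable {
public:
    explicit HashTable(std::size_t buckets = 1024) : table(buckets) {}

    void put(const K& key, const V& value) {
        auto& bucket = table[Hash{}(key) % table.size()];
        for (auto& entry : bucket) {
            if (entry.first == key) { entry.second = value; return; }
        }
        bucket.emplace_back(key, value);
    }

    // Returns the stored value, or nothing if the key was never inserted.
    std::optional<V> get(const K& key) const {
        const auto& bucket = table[Hash{}(key) % table.size()];
        for (const auto& entry : bucket) {
            if (entry.first == key) return entry.second;
        }
        return std::nullopt;
    }

private:
    std::vector<std::vector<std::pair<K, V>>> table;
};

// Example use: HashTable<std::string, PageData> pages; where PageData is
// whatever record your spider stores for each page.
```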

Extra credit: As we have discussed, there are many ways to do hashing, especially when it comes to deciding what to do when two items collide. Implement a second approach and see how it compares to the first one in the number of items it checks on average before finding the right one.

Java: I want you to put a pretty face on your web spider data, and this is where you will really start to do that. You are going to write code that draws the graph of the website (at least part of it). Technically we haven't talked about graphs yet, but the basic idea is that you will draw a box for each page that was spidered and draw lines (or arrows if you are ambitious) from a page to all the other pages it linked to. As you might guess, the idea of drawing the whole graph is nice, but it isn't exactly practical for any site of real size. So I want you to only draw pages that are within two "clicks" of the "main" site. By this I mean that you only include the main site, those it links to, and those they link to. You should not include a given page more than once. The boxes should have labels in them that give either the full URL, just the ending page name, or even just the number of that page in your spidering.
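
One way to gather that set of pages, as a sketch only, is a short breadth-first walk that stops two links out from the main page; the Page type and its getLinks() method here are stand-ins for however you actually store your spider data:

```java
import java.util.*;

// Sketch: gather the main page plus everything within two "clicks" of it,
// without repeating any page. Page is a placeholder for your own page record.
public class TwoClickCollector {
    public interface Page {
        List<Page> getLinks();   // the pages this page links to
    }

    public static Set<Page> withinTwoClicks(Page main) {
        Set<Page> seen = new LinkedHashSet<>();
        seen.add(main);
        List<Page> frontier = List.of(main);
        for (int depth = 0; depth < 2; depth++) {   // two clicks out from main
            List<Page> next = new ArrayList<>();
            for (Page p : frontier) {
                for (Page linked : p.getLinks()) {
                    if (seen.add(linked)) {          // add() is false if already seen
                        next.add(linked);
                    }
                }
            }
            frontier = next;
        }
        return seen;
    }
}
```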

The image is still likely to be larger than any screen you have available. For that reason, the JPanel subtype that draws it should return the image size from getPreferredSize(), and you should embed the panel in a JScrollPane so that the user can scroll around to see the whole thing.
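
A minimal sketch of that setup is below; GraphPanel and its hard-coded size are placeholders, and in your program the size should come from wherever you lay out the boxes:

```java
import java.awt.*;
import javax.swing.*;

// Sketch of a panel that reports the full image size so a JScrollPane can
// scroll over it. GraphPanel and the fixed dimensions are placeholders.
public class GraphPanel extends JPanel {
    private int imageWidth = 3000;    // compute these from your box layout
    private int imageHeight = 2000;

    @Override
    public Dimension getPreferredSize() {
        return new Dimension(imageWidth, imageHeight);
    }

    @Override
    protected void paintComponent(Graphics g) {
        super.paintComponent(g);
        // Draw the boxes and the lines between them here.
    }

    public static void main(String[] args) {
        SwingUtilities.invokeLater(() -> {
            JFrame frame = new JFrame("Spider Graph");
            frame.add(new JScrollPane(new GraphPanel()));
            frame.setSize(800, 600);
            frame.setDefaultCloseOperation(JFrame.EXIT_ON_CLOSE);
            frame.setVisible(true);
        });
    }
}
```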

To find the site to use as the "main" site, the user will type in a URL. What they type in should be sent to the C++ hash code to find that one site quickly. From there you can decide how you will communicate that site, and those within two clicks of it, to the Java program so that it can be drawn. If they type in a different URL you should find it and draw the new image for it.
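
As a sketch, the wiring might look like the following; the lookup function stands in for however you reach the C++ hash (a native call, running the C++ program and reading its output, etc.), and GraphPanel, Page, and setMainPage() are the same placeholders used in the sketches above:

```java
import java.util.function.Function;
import javax.swing.JTextField;

// Sketch: a text field that, when the user presses Enter, looks the URL up
// and redraws the graph around that page. All of the types are placeholders.
public class UrlEntry {
    public static JTextField makeUrlField(GraphPanel panel,
                                          Function<String, Page> lookup) {
        JTextField field = new JTextField(40);
        field.addActionListener(e -> {
            Page main = lookup.apply(field.getText().trim());
            if (main != null) {
                panel.setMainPage(main);   // rebuild the two-click graph
                panel.revalidate();        // the preferred size may have changed
                panel.repaint();
            }
        });
        return field;
    }
}
```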

Extra credit: For extra credit here you will make your graph interactive. This involves two parts. First, if the user clicks on a box you should display the full information for that box, including the URL and word counts. Second, in addition to the user being able to type in a URL, if they double-click on the box for a site other than the current "main", that site becomes the new "main" and the graph is updated accordingly.
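
A sketch of the mouse handling, meant to sit in the panel's constructor, might look like this; boxAt(), showInfo(), and setMainPage() are placeholders for however you map a click to a box, show its details, and switch the "main" page:

```java
// Sketch only: register a listener that distinguishes single and double clicks.
// boxAt(), showInfo(), and setMainPage() are hypothetical helpers of the panel.
addMouseListener(new java.awt.event.MouseAdapter() {
    @Override
    public void mouseClicked(java.awt.event.MouseEvent e) {
        Page clicked = boxAt(e.getX(), e.getY());
        if (clicked == null) return;             // the click missed every box
        if (e.getClickCount() == 2) {
            setMainPage(clicked);                // double click: new "main" page
            revalidate();
            repaint();
        } else {
            showInfo(clicked);                   // single click: URL and word counts
        }
    }
});
```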

You will be using this graph, or derivatives of it, a fair bit in the next 5 assignments, so it is probably worth putting some effort into it. If done well it can really make your application stand out. In fact, you should look ahead now to see what extensions will be expected and try to design your code so that those will be fairly easy to do.