Assignment #4 Description


For this assignment you are going to expand the capabilities of your program further so that it takes its data directly from the web for processing.

Objective 1 - Modify your HTMLPage class so that it can accept data taken directly off the web by the WebURL class that I'm giving you, as opposed to reading the data from a file as in the last assignment. Then make the HTMLPage class keep track of the links off of that page. Keep in mind that some links are locally referenced (relative), so you will have to do a little work to get the full link. The WebURL class gives you some help in doing this. To simplify things, you should only look at links of the form <a href="url">, and don't go to pages with .gif, .jpg, or .png extensions. Think about where this will go inside the code that you wrote for assignment #3. You need to add functions to your HTMLPage class that let you find out how many links there were on the page and get the URL for each one so that WebURL can use it to open that page.
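As a rough illustration of the link handling described in Objective 1, here is a minimal sketch in C++. The function names (hasSkippedExtension, extractLinks, resolveLink) are hypothetical and are not part of the WebURL class, whose interface is not shown here. The sketch assumes the literal lowercase form <a href="url">. Its resolveLink is only a simplified stand-in for whatever help WebURL provides: it does not handle "../", fragments, or non-http schemes such as mailto.

```cpp
#include <algorithm>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper: true when the URL ends in one of the extensions the
// assignment says to skip (.gif, .jpg, .png), compared case-insensitively.
bool hasSkippedExtension(const std::string& url) {
    std::string lower(url);
    std::transform(lower.begin(), lower.end(), lower.begin(),
                   [](unsigned char c) { return static_cast<char>(std::tolower(c)); });
    const std::string skipped[] = {".gif", ".jpg", ".png"};
    for (const std::string& ext : skipped) {
        if (lower.size() >= ext.size() &&
            lower.compare(lower.size() - ext.size(), ext.size(), ext) == 0) {
            return true;
        }
    }
    return false;
}

// Hypothetical helper: collects the quoted value of every <a href="..."> in
// the page text, leaving out empty values and skipped extensions.
std::vector<std::string> extractLinks(const std::string& html) {
    const std::string marker = "<a href=\"";
    std::vector<std::string> links;
    std::string::size_type pos = 0;
    while ((pos = html.find(marker, pos)) != std::string::npos) {
        const std::string::size_type start = pos + marker.size();
        const std::string::size_type end = html.find('"', start);
        if (end == std::string::npos) break;  // unterminated attribute
        const std::string url = html.substr(start, end - start);
        if (!url.empty() && !hasSkippedExtension(url)) links.push_back(url);
        pos = end + 1;
    }
    return links;
}

// Simplified stand-in for relative-link resolution (an assumption, not the
// WebURL behavior): absolute links pass through, "/path" is attached to the
// scheme and host, anything else is attached to the page's directory.
std::string resolveLink(const std::string& pageUrl, const std::string& link) {
    if (link.find("://") != std::string::npos) return link;  // already absolute
    const std::string::size_type schemeEnd = pageUrl.find("://");
    if (schemeEnd == std::string::npos) return link;  // no usable base URL
    const std::string::size_type pathStart = pageUrl.find('/', schemeEnd + 3);
    const std::string root = pageUrl.substr(0, pathStart);  // scheme + host
    if (!link.empty() && link[0] == '/') return root + link;
    if (pathStart == std::string::npos) return root + "/" + link;
    return pageUrl.substr(0, pageUrl.rfind('/') + 1) + link;
}
```

With these helpers, a link a.html found on http://example.com/dir/page.html would resolve to http://example.com/dir/a.html. In your own design this logic would most likely live inside HTMLPage, at the point where your assignment #3 code already scans the page text.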

Objective 2 - Now you have the tools to build a little web spider (class WebSpider), that is, a piece of code that starts at a given page and goes out and pulls down the other pages that it links to. It then repeats this for each of those pages. The term comes from the fact that it "crawls" around the web between pages. This type of behavior can cause some serious problems. For one thing, the number of pages can grow very quickly. Also, there can be cycles, where you follow a path and get back to the web page you started at. To prevent the number of pages from getting too large, I want you to have your spider stop at 3 levels deep, that is, 3 steps away from the page you start at. You should also make your spider keep track of where it has been so that it doesn't go through cycles. Your spider needs to be able to do a count for "an entire site" and return to you which substrings appeared at that site and how many times each.
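One possible shape for the crawl in Objective 2 is a breadth-first traversal with a depth limit and a set of visited URLs. The sketch below is only an illustration under stated assumptions: the crawl function and the LinkFetcher type are hypothetical names, and the step that would really download a page and ask HTMLPage for its links is passed in as a callback so that the traversal logic can be read (and tried out) on its own.

```cpp
#include <functional>
#include <queue>
#include <set>
#include <string>
#include <utility>
#include <vector>

// Hypothetical callback type: given a page URL, returns the full URLs of the
// links on that page. In the real program this would use WebURL and HTMLPage.
using LinkFetcher = std::function<std::vector<std::string>(const std::string&)>;

// Visits pages breadth-first starting at startUrl, following links at most
// maxDepth steps away from it, and returns the URLs in the order visited.
std::vector<std::string> crawl(const std::string& startUrl, int maxDepth,
                               const LinkFetcher& linksOf) {
    std::vector<std::string> order;
    std::set<std::string> visited;                    // guards against cycles
    std::queue<std::pair<std::string, int>> pending;  // (url, depth) pairs
    visited.insert(startUrl);
    pending.push(std::make_pair(startUrl, 0));
    while (!pending.empty()) {
        const std::pair<std::string, int> current = pending.front();
        pending.pop();
        order.push_back(current.first);
        if (current.second >= maxDepth) continue;  // do not expand past the limit
        for (const std::string& next : linksOf(current.first)) {
            // insert() reports whether the URL was new, so each page is queued once.
            if (visited.insert(next).second) {
                pending.push(std::make_pair(next, current.second + 1));
            }
        }
    }
    return order;
}
```

Calling crawl(start, 3, fetcher) matches the 3-level limit above. The site-wide substring count could then be built by adding each visited page's counts into one running total as the page is processed.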


Submission executable - The executable that you create for this assignment will be very similar to the one for the last assignment. It should take a command line argument, but this time that will be the URL of a web page that you want it to process. It should print out all the URLs that it visits and a total count for them all as displayed in the last assignment.

Extension - This one requires not just a little code, but also some writing. You now have the ability to do some experimenting with the web and to explore a bit of its structure from different starting points. For this extension you don't have to worry about substring counting, but you do want to parse web pages and find the links in them. Let your spider go to different "depths" and see how much each extra level increases the number of new pages you find. Also, keep track of how many cycles of different lengths you get. That is, count how many different ways you can start at one page and end up back at it after traversing n links. Don't be too rigorous about that second part because it can get very complex very quickly. Turn in a little writeup of your observations with plots of those things (it probably shouldn't exceed 2 pages or so).
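The two measurements for the extension could be sketched as follows, assuming the spider has already recorded which pages link to which in an in-memory map. LinkGraph, newPagesPerDepth, and countWalks are hypothetical names. countWalks counts every sequence of exactly n links that ends at the target, including ones that pass through the start page along the way, and its running time grows quickly with n, so it is only practical for small n.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <string>
#include <vector>

// Hypothetical record of the crawl: page URL -> URLs that page links to.
using LinkGraph = std::map<std::string, std::vector<std::string>>;

// Returns counts[d] = number of pages first reached exactly d links from start.
std::vector<std::size_t> newPagesPerDepth(const LinkGraph& graph,
                                          const std::string& start, int maxDepth) {
    std::vector<std::size_t> counts;
    std::set<std::string> seen;
    seen.insert(start);
    std::vector<std::string> frontier(1, start);  // pages at the current depth
    for (int depth = 0; depth <= maxDepth && !frontier.empty(); ++depth) {
        counts.push_back(frontier.size());
        std::vector<std::string> next;
        for (const std::string& page : frontier) {
            const LinkGraph::const_iterator it = graph.find(page);
            if (it == graph.end()) continue;  // page has no recorded links
            for (const std::string& link : it->second) {
                if (seen.insert(link).second) next.push_back(link);
            }
        }
        frontier.swap(next);
    }
    return counts;
}

// Counts the link sequences of exactly `length` steps that lead from
// `current` to `target`. With current == target this is the number of ways
// to leave a page and be back at it after `length` links.
std::size_t countWalks(const LinkGraph& graph, const std::string& current,
                       const std::string& target, int length) {
    if (length == 0) return current == target ? 1 : 0;
    const LinkGraph::const_iterator it = graph.find(current);
    if (it == graph.end()) return 0;
    std::size_t total = 0;
    for (const std::string& link : it->second) {
        total += countWalks(graph, link, target, length - 1);
    }
    return total;
}
```

The per-depth counts and the values of countWalks(graph, start, start, n) for a few small n are the kind of numbers you could plot in the writeup.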


Looking ahead - After this assignment you have the basic functionality to handle both the web pages and the counting of substrings. For the last two assignments we will extend this so that it does comparisons between different web pages and does optimization work to determine what substrings different companies should be charged for. We will also make some changes that will intentionally test the flexibility of your code.


Once again I would like the written design handed in to me. The code need only be mailed on the date it is due, but this time I would also like you to return both the original design and the revised design. It would be nice if you could mark in the revised design where you made alterations. This can be done by putting in any character string that stands out. Making sure it isn't a common string also lets you search for it before the next assignment and remove it easily. I would recommend something like "!ALTERED!" so that it is also clear to me what it means. Of course, this has to be in a comment if it is part of your header file.
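To show what such a marker might look like inside a header file, here is a small sketch. The member names (addLink, linkCount, linkAt, links_) are hypothetical and are not a required interface. The only point is that each alteration carries the marker inside a comment, so the header still compiles and a search for the marker finds every change.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Sketch of a revised header. Member names are hypothetical.
class HTMLPage {
public:
    // !ALTERED! added for assignment #4: records one link found on the page.
    void addLink(const std::string& url) { links_.push_back(url); }

    // !ALTERED! added for assignment #4: number of links found on the page.
    std::size_t linkCount() const { return links_.size(); }

    // !ALTERED! added for assignment #4: URL of the i-th link (0-based).
    const std::string& linkAt(std::size_t i) const { return links_.at(i); }

private:
    std::vector<std::string> links_;  // !ALTERED! new data member
};
```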