Assignment #1 Description


This assignment is intended to get you up to speed in C++ while you also lay the foundation for what you will be doing in Java.

C++: For the C++ section I want you to do the following things. The final code that you submit doesn't have to show your iterations of building things up, just the final implementations. There are three requirements on these classes, and you will have a total of at least four classes. (1) Write an array-based stack and an array-based queue. (2) Write a list-based stack and a list-based queue. (3) Template them so that they can be used with any type. In addition to writing the classes that implement these things, you will also need to write a function that you can pass a stack to and do something with it, as well as one that you can pass a queue to and do something with it. Those functions can be part of the test suite that you turn in before the implementations.
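Since this C++ work is laying the foundation for the Java part, here is a rough Java analogue of what one of these classes and its companion function might look like. This is only a sketch: the class name ArrayStack, the growth strategy, and the function drainAndSum are all my own choices, not part of the assignment's requirements.

```java
import java.util.NoSuchElementException;

// Hypothetical Java analogue of the templated array-based stack:
// generic in its element type, growing its array when it fills up.
class ArrayStack<T> {
    private Object[] data = new Object[4];
    private int top = 0;

    public void push(T value) {
        if (top == data.length) {                  // grow when full
            Object[] bigger = new Object[data.length * 2];
            System.arraycopy(data, 0, bigger, 0, top);
            data = bigger;
        }
        data[top++] = value;
    }

    @SuppressWarnings("unchecked")
    public T pop() {
        if (top == 0) throw new NoSuchElementException("stack is empty");
        return (T) data[--top];
    }

    public boolean isEmpty() { return top == 0; }
}

public class StackDemo {
    // A function that takes a stack and does something with it,
    // like the ones the assignment asks for in the test suite.
    static int drainAndSum(ArrayStack<Integer> s) {
        int sum = 0;
        while (!s.isEmpty()) sum += s.pop();
        return sum;
    }

    public static void main(String[] args) {
        ArrayStack<Integer> s = new ArrayStack<>();
        s.push(1); s.push(2); s.push(3);
        System.out.println(drainAndSum(s)); // prints 6
    }
}
```

Your C++ versions would use templates and (for the extra credit) new/delete where this uses generics and System.arraycopy.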

For extra credit: make pure virtual base classes Stack and Queue that the implementations derive from. It is also worth extra credit if your array-based versions grow dynamically, using new and delete for memory allocation. Note in your code's comments where I should look for the extra credit.


Java: For the Java part you are doing something that is a fair bit more detailed: I want you to write a simple web spider. The idea is that the user should be able to give the program a URL on the command line. The program should then go find all the sites that page links to, all the sites those sites link to, and so on. Because that can basically go on forever, I don't want you to visit any sites that aren't "below" the original site. To make sure it is working, just have your program print out a URL every time you find one, whether you will visit it or not. Make sure you don't revisit any URLs, because revisiting can lead to infinite loops.
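The traversal described above can be sketched as a breadth-first walk with a visited set. This is only one way to structure it: the names Spider, isBelow, and LinkSource are mine, the "below" test here is a simple string-prefix check (a reasonable first approximation), and fetchLinks stands in for your own page-downloading and parsing code.

```java
import java.net.URL;
import java.util.ArrayDeque;
import java.util.Deque;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Sketch of the spider's traversal: print every URL found, but only
// follow the ones below the start page, and never follow one twice.
public class Spider {
    // Approximate "below": the candidate's text starts with the start URL.
    static boolean isBelow(URL start, URL candidate) {
        return candidate.toString().startsWith(start.toString());
    }

    static void crawl(URL start, LinkSource source) {
        Set<String> visited = new HashSet<>();
        Deque<URL> toVisit = new ArrayDeque<>();
        toVisit.add(start);
        visited.add(start.toString());
        while (!toVisit.isEmpty()) {
            URL page = toVisit.poll();
            for (URL link : source.fetchLinks(page)) {
                System.out.println(link);          // print every URL found
                if (isBelow(start, link) && visited.add(link.toString())) {
                    toVisit.add(link);             // only new, in-site links get visited
                }
            }
        }
    }

    // Stand-in for the code that reads a page and extracts its <a href> links.
    interface LinkSource {
        List<URL> fetchLinks(URL page);
    }
}
```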

So if I were to enter "java pda.group1.assn1 http://www.trinity.edu/", your program would print out every web page that page links to. It would also go to each of those pages that are in subdirectories of www.trinity.edu and print out all of their links. I should warn you that this takes a while, because Dr. Jensen in Business Administration has a LOT of web pages out there.

In order to do this you will make extensive use of the URL class in java.net. It has nice constructors that help you deal with creating relative links, which is nice because doing that by hand can be a real pain. You only need to deal with links from tags of the form <a href="link">.
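In particular, the two-argument constructor new URL(context, spec) resolves an href relative to the page it appeared on. The URLs below are just illustrative:

```java
import java.net.MalformedURLException;
import java.net.URL;

// The two-argument URL constructor resolves a relative href against the
// page it was found on; an absolute href simply replaces the context.
public class RelativeLinks {
    public static void main(String[] args) throws MalformedURLException {
        URL page = new URL("http://www.cs.trinity.edu/~mlewis/CSCI2320-F03/index.html");
        URL rel  = new URL(page, "assn1.html");              // relative href
        URL abs  = new URL(page, "http://www.trinity.edu/"); // absolute href
        System.out.println(rel); // http://www.cs.trinity.edu/~mlewis/CSCI2320-F03/assn1.html
        System.out.println(abs); // http://www.trinity.edu/
    }
}
```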

Only worry about things that the Java URL class deals well with. If it throws an IOException when you create it or when you try to read it, just ignore that link. Also, don't keep the anchors in a link. An anchor is the part after a "#" in a URL. Anchors just point to different parts of the same page, so there is no point in reading those pages in multiple times for this program. To keep things short, you should also exclude links that end with ".jpg", ".gif", ".png", ".txt", or ".pdf", in either lowercase or uppercase. We will probably extend this last bit of functionality later on.
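These two cleanup rules are small string checks. A possible sketch, with helper names of my own choosing:

```java
// Helpers for the cleanup rules: drop the anchor part of a URL, and
// skip links with the excluded extensions in any mix of case.
public class LinkFilter {
    // "http://a/b.html#sec2" -> "http://a/b.html"
    static String stripAnchor(String url) {
        int hash = url.indexOf('#');
        return hash < 0 ? url : url.substring(0, hash);
    }

    // Lowercasing once handles ".JPG", ".Pdf", etc. in one test each.
    static boolean isExcluded(String url) {
        String lower = url.toLowerCase();
        return lower.endsWith(".jpg") || lower.endsWith(".gif")
            || lower.endsWith(".png") || lower.endsWith(".txt")
            || lower.endsWith(".pdf");
    }
}
```

Since this extension list will probably grow later, you might prefer to keep it in an array or set rather than a chain of tests.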

For testing purposes you could try the course homepage "http://www.cs.trinity.edu/~mlewis/CSCI2320-F03/" which is pretty small. To go up a bit you can drop off the course name as the end and do my entire site. A slightly bigger site is Dr. Howland's where mlewis is replaced with jhowland. If you want a real challenge try the Trinity CS or the full Trinity site. (Note that the Trinity CS site is not a subset of the Trinity site because the machine names are different.) You can also try some companies if you want. Let me know if you find some places that are interesting.

For extra credit on the Java part, you can write your code so that it loads and parses websites in parallel. You can make the number of threads fixed or have it be a user option, but make sure I know how to set it. I should note that doing this will require some synchronization, because the different threads will almost certainly be working on the same data structures. I suggest this because threads can make your program run much faster on a site with any significant lag time (anything off campus, for example): while one thread is paused waiting for data from the site, another can be parsing the data it already has or requesting a different page.
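One way to arrange that, sketched below under my own assumptions: a fixed pool of worker threads pulls pages from a shared blocking queue, and the visited set is synchronized so two threads never claim the same URL. The names ParallelSpider and claim are hypothetical, and the fetch-and-parse step is left as a comment.

```java
import java.net.URL;
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;

// Sketch of a parallel spider: workers share one queue of pages to
// visit and one synchronized set of URLs already claimed.
public class ParallelSpider {
    private final BlockingQueue<URL> toVisit = new LinkedBlockingQueue<>();
    private final Set<String> visited = Collections.synchronizedSet(new HashSet<>());

    // add() returns false if another thread already claimed this URL,
    // which is exactly the check that prevents revisits under concurrency.
    boolean claim(URL u) { return visited.add(u.toString()); }

    void start(int numThreads) {
        for (int i = 0; i < numThreads; i++) {
            Thread worker = new Thread(() -> {
                try {
                    while (true) {
                        URL page = toVisit.take();   // block until work arrives
                        // Fetch page, parse its links, claim() each one,
                        // and add the new in-site links back onto toVisit.
                    }
                } catch (InterruptedException e) {
                    // interrupted: let the worker exit
                }
            });
            worker.setDaemon(true);
            worker.start();
        }
    }
}
```

While one worker blocks on the network, the others keep parsing or requesting pages, which is where the speedup comes from.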