Project Description

Web Site Analyzer


As was the case with PAD1 and PAD2, a significant portion of the work that you do for this course will involve you writing programs. Here again the nature of what you write is going to change. It will retain some of the features of what you did in PAD2 in that you will be working on a single large project for the entire semester. This time that project might not be quite so cohesive because the demands on the material that we have to cover are a bit more specific. Also, this time you will be doing your work in small groups. You will be working with the same group over the entire semester. This is because the scope of what you will be doing is a bit larger than what can be expected from an individual. Also, it will help to teach you to deal with cooperation in your coding which is essential in professional programming.

In order to see the real advantages of different data structures on a modern computer, you have to have a large data set. These machines don't care if you do something that is O(n^2) or even O(n^3) when n is less than 1000 in size. However, if n becomes 100,000+ then the importance of good data structures grows significantly. The problem with this is that somehow you have to get a large data set to play with. I didn't feel like typing one up, nor did I think it would be a good use of your time to do that. I also don't see much point in having you deal with randomly generated data because it has no inherent meaning. Fortunately, there is a huge repository of data that exists for us known as the world-wide-web. It already exists, it has meaning, and it is certainly big enough to not only stress, but completely kill a single computer if you don't use very intelligent algorithms and data structures. The only thing you have to do is write code that can help "harvest" this data for you. Once you have harvested it, you will be able to play lots of games with the processing of it and see what types of data structures are most ideal for the process. Inevitably you will also pick up a deeper understanding of the Java and C++ programming languages as you go.

Below you will find links to the descriptions of each of the assignments. Each assignment will include two parts which I refer to as the application part and the data structure part. The two will work together to give you a full application. You will have a large amount of control over the design of both of these parts though some pieces of the data structure part will be more uniformly defined to enable me to systematically test them. I expect that each part will also be written in a different language. Part of your task with this project is getting the two pieces to work together. The application can be written in C++ or Java. However, I believe that given the requirements you will find it much easier to do in Java. The Data structure part MUST be written in C++. You can write a version of some of the data structures in Java if that helps your design, but you must get them working in C++ for credit.

I strongly recommend that you read through as much description of all the assignments as is posted at the very beginning. This is because you should tailor your design so that it allows you to make the extensions that will come up in later assignments. Some designs will make the end of the semester more difficult than others.

Data Structures: As the title implies, one of the major objectives of this course is to teach you about various data structures. This includes having you implement and use a number of them. These data structures will be used to help you search through and perform analysis on the data that you gather from the web. Some of these analyses will be quite involved and might require a significant amount of time. This is part of the reason that you are writing this code in C++ instead of Java. Not only does it force you to learn the language, C++ is typically faster for large scale data processing.

Application: This part of the project will include a GUI to interact with the user and the program as a whole and a web spider that will go out and collect that data that you will be analyzing. The spider will need to be able to visit multiple web pages and pull information off of them. It will then save this information in some format. You can choose whether this will be done in a Java binary format (which makes it much harder to read in C++), in a C++ binary format (which makes it harder for Java), or as a text file (which can be hard for Java and slow for both). You can put things in binary files that are only accessed by one language. At least one assignment will require you to do this in C++ with direct access files. It will be very difficult to read those files from Java, but that won't matter because they will only be used by C++.

Interfacing the two: One of the big challenges of this project is that you have to get the data structures code and the application code to communicate. If you write your application in C++ this isn't a big problem, but writing a web spider and a GUI in portable C++ code is a fairly formidable task. The two main methods that you could use to do this are to have separate programs that communicate with one another via shared format data files, or having the routines in one language directly call those in the other. The former can most easily be implemented using text files and could be nicely implemented using XML. The second method would require the use of Java Native Interface (JNI). Both of these are advanced topics that you will be expected to research a bit. How you do this is also a significant part of the decisions you have to make as part of your design.

Extra Credit: For every assignment there will be the possibility of getting extra credit. Some of it is particular to the description of the assignment, but in addition to that I will give extra credit to the group that produces the fastest data structures for completing a given set of tasks. I will also give extra credit points to the group that I feel has the strongest design or the easiest to use application. Using portable technologies like XML will help your group in the latter area.

Assignment Descriptions: Below are brief descriptions of each of the assignments you will do as part of this project. Each includes a link to a longer description giving you the details of what you have to complete for the assignment.

Assignment 1: Stack, queue and beginning spider.

Assignment 2: Sorted and unsorted lists with linked list and vector. Do word counts and save off data from spider in direct access file with index..

Assignment 3: Hash to quickly search for web site by full URL. Website graph drawer.

Assignment 4: AVL-tree so that you can quickly find all contents of a directory. Store the tree as a direct access file. Allow the sub-site to be drawn in the graph drawer.

Assignment 5: KD-tree for doing searches on the site by word content. They should be able to expand how large they want the hit volume to be. This produces a sub-site that can be drawn.

Assignment 6: Do word pairs and a sparse representation of that.

Assignment 7: Graph algorithm for shortest path between pages

Assignment 8: Graph algorithm or DP (Canceled this semester)

Submission Guidelines: This section describes what you will need to submit for each of the assignments and how I expect you to submit it. I do not care what type of computer, OS, or IDE you use for doing your program. However, your code will have to compile and run under Linux with g++ and J2SDK 1.4 when I do the grading. Whether you do all of your development there, or just check compatability at the end is up to you. If you use Eclipse with Cygwin you should be able to work under Windows and have little to no compatability problems.

Last semester you had to submit design documents generated in Together before each assignment. This semester that requirement is going to change a bit. Now you need to submit both design documents and test code the class day before each assignment is due. Having the test code and the code stubs to create documentation will give you enough so that it complies. It will not run at that point though. If you are using Together for editing you can certainly use their documentation generating tools. Eclipse also has a feature to invoke javadoc for Java code. If you aren't using those you can always go to the command line and use javadoc directly for the java code and something like doxygen for the C++ files. Both are installed on the lab machines. (Note that doxygen can be used for Java files as well and if you set the JAVADOC_AUTOBRIEF flag to true the comments for Doxygen can look just like what you are used to for Java.) I'm not going to be as strict about what I require in your documentation comments this semester, but the requirement of writing compiling test code will counterbalance that.

The test code should be submitted like normal code. It will also be included in the final submission. Note that this means the final submission should always be at least your second submission for a given assignment. The purpose of the test code is exactly what the name implies. It should test as much of the functionality of the final code as possible. If the code works properly it should do nothing. If it fails, then the test code should notify you of the failure. You won't be able to write simple tests for your GUI code, but you should be able to do the tests on everything in the C++ code as well as much of what you write in the Java code. You might also need to create some small datafiles that will be processed as part of the tests. Those should be submitted with the code.

The HTML documentation should go in your Local/HTML-Documents directory under Sol. Only one group member needs to post it and you can just let me know who will be doing that. Under that directory you should create a PDAProject directory (make sure it has global read and execute rights: chmod a+rx PDAProject). Then you will create a directory under that for each of the assignments named Assn1, Assn2, ... Put the documentation in there and make sure that I can get both the Java and C++ versions without too much difficulty.

For the code submission you will be using the same program that you used last semester (if you had me as your professor). Only one member of your group needs to actually submit code. How you deal with working on code in pairs is up to you, however, I would suggest that you use something like CVS. In fact, Eclipse will interface with CVS so that you can create a CVSrepository under one of your Sol user accounts and then connect to it to check in and out source code to work on. CVS is a standard tool used in industry whenever more than one person is working on a particular piece of code so taking time to become familiar with it could be very helpful to you in the future to learn it. Even if you have this though I would strongly suggest that you and your partner do most of your coding at times when you can be together, whether it be in a dorm room or in a lab. When you do this there are two models that you can follow. You can pair program where one of you is the "driver" at the computer typing away while the other is the navigator watching of the driver's shoulder and thinking about where things are going and how they will fit together. This style is championed by the "Extreme Programming" approach (which also uses the tests-first mentality). It can take a bit longer to write the code because only one person is writing, but often produces better code that has fewer bugs which makes up for that. You shouldswitch who is the "driver" regularly if you take this approach. Alternately you can work in parallel on two computers sitting next to one another. You'll want to be close by so that you can constantly communicate about what you are doing and make sure that one person isn't changing things that will cause problems for the other person. The project was designed so that one person can work on C++ aspects while the other works on Java aspects. If you break it up that way you need to switch who is working in which language at least for each assignment if not more often because having one C++ expert and one Java expert doesn't really satisfy the learning objectives for either of you.

I'm going to be fairly strict about some details of how your code is submitted. In Java you should be using packages to keep the different parts of your code organized. Packages are placed in different directories so the code itself will be in some subdirectory of what you submit. Your C++ code can be placed in the root directory though it might be wiser to create a subdirectory for the C++ code as well to help organize things (it will also be easier that way if you are using Eclipse). In the root directory that you submit you will have to have a makefile and that makefile has to have at least four things in it. A 'make all' should compile all the code and put it in places relative to the main directory. A 'make test' should run all the test that you have written for your code, that includes every assignment. Sometimes you will change parts of your code that make an old test invalid. You should update that test so that it is valid instead of throwing it away. The test should not involve GUI components and should only have text output if they fail. A 'make run' should run the primary application. Lastly I'd like a 'make clean' option that will remove all the files created by compiling so that the project can be compiled from scratch.

When you submit your assignments I want at least three things to be included. One is the directory structure for your Java packages. In addition to that you should have the directory for your C++ code. Lastly you should have a makefile in the root directory that has the options described above. Below are links with more details later on what you can assume about the state of the machine that you compile into for things like the Xerces XML parser or the LD_LIBRARY_PATH variable and the settings you can put in your makefile to help things flow more smoothly. Your makefile will not need to set those things as long as you put your compiled code in the right spots.

The class period after an assignment is due I expect you to turn in a short "group evaluation" to me. These serve several purposes. First, they are supposed to help me keep track of who is doing what in your pair and the methods you are using in your collaboration. Second, they are supposed to protect you so that when I assign grades, I can adjust things based on your personal contribution. Third, they allow me to deal with situations where the pair dynamic is simply not working. This project is bigger than one student can really tackle during the course of the semester. You will need to work together well, and efficiently to succeed.

XML help

JNI help

Group Evaluation Questions