Assignment #3 Description


Now that you have some basic functionality in your code base, we will turn back to the original objective of the project. Remember that the reason you have a file with over 400,000 four-character substrings is that you want to find these substrings in pieces of text published by different companies so that you can charge them for it. In this assignment you are going to take your first steps toward getting this to work.

Write an HTMLPage class. This class represents a single HTML page and the text that is in it. You want to add functionality to this class so that you can search for the presence of substrings of different lengths inside of it. The catch is that you want to ignore letter case, whitespace, and all characters that aren't valid characters in your substrings. I also want you to try to ignore HTML tags. You aren't going to be all that rigorous about that aspect. You will just ignore all text that comes after a '<' character and before the next '>'. This might produce incorrect results on some pages, but for now it will be sufficient. I will leave it up to you to decide what the interface for this class should be like and how you want it to interact with an SSContainer that is full of SubStrs. The nature of this interaction will become clearer below.
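As a rough illustration of the filtering rules above, here is a minimal sketch of a free function that reduces raw page text to the characters you would search. The name normalizeHtmlText is hypothetical, and the sketch assumes the valid substring characters are the letters a-z; adjust it if your substrings use a different alphabet. How you fold this into HTMLPage is still your design decision.

```cpp
#include <cctype>
#include <string>

// Hypothetical helper, not part of any provided code.
// Assumption: the only valid substring characters are the letters a-z.
std::string normalizeHtmlText(const std::string& raw) {
    std::string out;
    bool inTag = false;  // true while we are between '<' and the next '>'
    for (unsigned char c : raw) {
        if (inTag) {
            if (c == '>') inTag = false;  // tag ends, resume collecting text
        } else if (c == '<') {
            inTag = true;                 // tag starts, skip until '>'
        } else if (std::isalpha(c)) {
            // Keep letters only, folded to lower case.
            out += static_cast<char>(std::tolower(c));
        }
        // Whitespace, digits, and punctuation are dropped.
    }
    return out;
}
```

Because whitespace and punctuation are removed, a substring can match across word boundaries, which is what the assignment asks for.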

The HTMLPage class will wind up having a few constructors that take different arguments depending on what you want to use it for. Right now you will be reading in your data from a file, so your constructor will take either a string for the file name or a FILE* if the file has already been opened. For testing purposes you can get hold of different web pages using the wget command (or any other method you prefer). For information on wget, run "man wget". Note that you can also do tests on files that you didn't get from the web. In fact, I would strongly recommend this, especially early on when you just want to debug your code. You can write small files with vi to test. If you have them include a little bit of HTML, it will help you determine whether your code really works without using a file that takes forever to process.

You will then write code so that the number of occurrences of each substring in the page can be counted up. This will require making some additions to the SubStr and SubStr4 classes, mainly adding a counter and methods to access it. This is a general-use counter that should not be saved off to file. You will add other counters later that do get saved off to file. I recommend that the methods for processing the file and those dealing with the counters on many SubStr objects be put into SubStrHandler. After all, the major function of that class is to handle things dealing with many substrings.
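One possible way to do the counting is to slide a fixed-length window over the already-filtered page text and bump the counter of every window that matches a known substring. The sketch below uses a std::unordered_map as a stand-in for your SSContainer and the counters in SubStr; the names CountTable and countOccurrences are hypothetical and only illustrate the idea.

```cpp
#include <cstddef>
#include <string>
#include <unordered_map>

// Stand-in for an SSContainer of SubStr objects (an assumption for this
// sketch): each known substring maps to its general-use counter.
using CountTable = std::unordered_map<std::string, int>;

// Slides a window of length len over text and increments the counter of
// every window that is already a key in table. Other windows are ignored.
void countOccurrences(const std::string& text, std::size_t len, CountTable& table) {
    if (len == 0 || text.size() < len) return;
    for (std::size_t i = 0; i + len <= text.size(); ++i) {
        auto it = table.find(text.substr(i, len));
        if (it != table.end()) ++it->second;
    }
}
```

Note that this counts overlapping occurrences. With a linked-list container the lookup inside the loop becomes a linear search, which is one of the trade-offs you can explore.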


Submission executable - To help with grading of this assignment, because you are building up more code, I want you to create an executable that does something very specific. It should take one command line argument that is the name of a file to be read in. To have your program take command line arguments, have main take two parameters. The first one is an int and the second one is a char**. The second one is an array of null-terminated C-style strings, and the first one is how many of those strings are in the array. A common signature of main is something like int main(int argn, char **argv). Note that argn is always at least 1 because the first element in argv is the name of the executable. Basically you will want to open the file with a line like FILE *fin=fopen(argv[1],"rt");. Then you will proceed to read in the contents of that file and use them for an HTMLPage. You will also read in your binary file of substrings to an appropriate SSContainer (you will all have SSLinkList; if you have SSArray, that is an option as well).
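Here is a minimal sketch of the argument handling described above, with the checks you will probably want before using argv[1]. The helper name openInputFile is hypothetical. The sketch uses the standard mode string "r"; the "rt" form is accepted on some platforms but is not part of standard C.

```cpp
#include <cstdio>

// Hypothetical helper: opens the file named by the first command line
// argument. Returns nullptr if the argument is missing or fopen fails.
std::FILE* openInputFile(int argn, char** argv) {
    if (argn < 2) {
        // argv[0] is the executable name; argv[1] would be the file name.
        std::fprintf(stderr, "usage: %s <html-file>\n",
                     argn > 0 ? argv[0] : "program");
        return nullptr;
    }
    std::FILE* fin = std::fopen(argv[1], "r");
    if (fin == nullptr) std::perror(argv[1]);  // report why the open failed
    return fin;
}

// A main using it might look like this:
//   int main(int argn, char** argv) {
//       std::FILE* fin = openInputFile(argn, argv);
//       if (fin == nullptr) return 1;
//       /* build the HTMLPage from fin, count, print */
//       std::fclose(fin);
//       return 0;
//   }
```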

What you need to do is process the entire specified file and print out how many times each substring occurred in that file, from most common to least common. Remember to ignore the proper English words. You don't have to output any of the counts to your binary file yet. Just make sure that after the whole input file has been processed you output a series of lines that look something like this:
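One way to produce the "most common to least common" ordering is to copy the substring/count pairs into a vector and sort it by count. The sketch below is only an illustration: the names Entry and formatCounts are hypothetical, and breaking ties alphabetically is my assumption, since the assignment does not say how ties should be ordered.

```cpp
#include <algorithm>
#include <sstream>
#include <string>
#include <utility>
#include <vector>

using Entry = std::pair<std::string, int>;  // (substring, count)

// Returns one "substring count" line per entry, most common first.
// Assumption: ties are broken alphabetically by substring.
std::string formatCounts(std::vector<Entry> entries) {
    std::sort(entries.begin(), entries.end(),
              [](const Entry& a, const Entry& b) {
                  if (a.second != b.second) return a.second > b.second;
                  return a.first < b.first;
              });
    std::ostringstream out;
    for (const Entry& e : entries) {
        out << e.first << ' ' << e.second << '\n';
    }
    return out.str();
}
```

Printing the returned string with std::fputs or std::cout gives output in the format described here.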

abeh 3
gika 3
ilhe 2
weem 2
...

Here again, testing with your own small input before using wget might be helpful, so that you have a file where you can do the counts yourself first and make sure the printed answers are correct.

Extension - Try some different ways of implementing this, either with different implementations of the methods in SubStrHandler, in the way you run through the HTMLPage, or with different SSContainers if you have written more than one. Have your program print out a line describing what it is doing, then print out some timing information on how long it took to use that method at the end of the list of substring occurrences. To get timing information use the clock() function, and do "man clock" and "man 2 times" for usage information.
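A minimal sketch of timing one method with clock() follows. The helper name cpuSeconds is hypothetical. Keep in mind that clock() reports processor time used by your program rather than elapsed wall-clock time, so time spent waiting on disk may not show up.

```cpp
#include <ctime>

// Hypothetical helper: runs work() once and returns the processor time
// it used, in seconds, as measured by clock().
template <typename Work>
double cpuSeconds(Work work) {
    std::clock_t start = std::clock();
    work();
    std::clock_t stop = std::clock();
    // clock() counts in ticks; CLOCKS_PER_SEC converts ticks to seconds.
    return static_cast<double>(stop - start) / CLOCKS_PER_SEC;
}
```

You would pass the code to be timed as a lambda or function, for example the call that counts all substrings in the page, and print the returned value after the list of occurrences.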


Looking ahead - If you find you have some extra time, you might consider putting in some work that looks forward to the future. In an upcoming assignment you will have to enhance your HTMLPage class and the loading of it so that you will have a minimally functional web spider. I'm going to give you library code so that you can directly pull down web pages in your program instead of saving them and reading from the file. With this ability you will be able to give your program a page and have it search down a few levels to the other pages that can be reached from that one. You will do this by looking for the anchor tags in the HTML files that have href=??? in them. Consider adding this ability to the parsing of the page in your HTMLPage class now, as well as methods to get at the URLs of the pages it links to. You will have to take into account that many links are "local", so they assume the first part of the address. You won't have to follow links that use something like JavaScript to generate the call for the page.
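If you want to experiment with link extraction early, the following is a rough sketch of pulling out href values. The function name extractHrefs is hypothetical, and the approach is deliberately naive: it only handles quoted values, it does not check that the href belongs to an anchor tag, and it does not resolve "local" links against the address of the current page.

```cpp
#include <cctype>
#include <cstddef>
#include <string>
#include <vector>

// Hypothetical helper: collects the values of href="..." and href='...'
// attributes. This is a naive scan, not a real HTML parser.
std::vector<std::string> extractHrefs(const std::string& html) {
    // Lower-case copy so that HREF= and href= are both found. The copy has
    // the same length as html, so positions are valid in both strings.
    std::string lower;
    lower.reserve(html.size());
    for (unsigned char c : html) lower += static_cast<char>(std::tolower(c));

    std::vector<std::string> links;
    std::size_t pos = 0;
    while ((pos = lower.find("href=", pos)) != std::string::npos) {
        pos += 5;  // move past "href="
        if (pos >= html.size()) break;
        char quote = html[pos];
        if (quote != '"' && quote != '\'') continue;  // skip unquoted values
        std::size_t end = html.find(quote, pos + 1);
        if (end == std::string::npos) break;          // no closing quote
        links.push_back(html.substr(pos + 1, end - pos - 1));
        pos = end + 1;
    }
    return links;
}
```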


Once again I would like the written design handed in to me. The code need only be mailed on the date it is due, but this time I would also like you to return both the original design and the revised design. It would be nice if you could mark in the revised design where you made alterations. This can be done by putting in any character string that stands out. Making sure it isn't common also allows you to search for it before the next assignment and remove it easily. I would recommend something like "!ALTERED!" so that it is also clear to me what it means. Of course, this has to be in a comment if it is part of your header file.