Assignment #7


For this assignment I want you to write a bit of code in ML that could actually be useful at some point. It will also be educational in forcing you to learn about something that has become very significant in computer science in the last decade: XML. XML stands for eXtensible Markup Language and all it is is a standard text format for encoding information in tree-like structures. Though this description seems very simple, many technologies have built up around XML and it drives much of the internals of many web sites. For a longer description of what XML is click here. The site www.xml.com can also be a general resource for seeing all the things that are being done with XML, but don't spend too much time getting into details because this assignment will stick with the very simple basics. For this assignment you will write an XML parser and serializer in ML along with some basic functions to access and alter the information in your "DOM tree".

If you want the definitive source on what an XML document really is (and this might be helpful) you should go to the W3C recommendation for it. The link in the next paragraph is more approachable, but not as complete or definitive. Read it first and then come back to this one for clarifications.

So, what goes into an XML document? Here is a description of what an XML document looks like. The basic idea is that you have some header information then a tree of elements with text potentially inside of elements. You also can have comments in the document. All elements in XML have to be closed. So if you have an element <myElem> then later on you have to have an element </myElem>. If nothing would go between the two, you might see the alternate syntax <myElem/>.

The starting tag for an element can also contain attributes. These are name value pairs separated by an equal. In XML the value must always be inside of double quotes. For example <myElem myValue="a value">...</myElem>.

Inside elements you can also have text mixed in with other elements. You need to keep tract of that.

Comments can go in XML documents and can be found outside of the root element. They have begin with "<!--" and end with "-->". Similar to comments are processing instructions. These begin with "<?" and end with "?>". You should keep track of both of these so that you can output back XML files that are as similar as possible to the original.

The last type of entry in an XML file is CDATA. This is basically a way for you to put large chunks of text in the document that aren't processed at XML. A simple example would be a function in a normal programming language where symbols like "<" and ">" might appear. To feel the assignment reasonable, I'm going to say you don't have to handle CDATA in your parse. If you get an XML file with CDATA in it, delete it by hand.

You will not have to make your parser be a validating parser, so you don't have to worry about DTDs or XSchemas. So you only have a few header elements followed by a tree structure of elements and text. Basically, you are going to produce a DOM parser and your own simple DOM (Document Object Model). This needs to be able to read in an XML file and give you back a tree with the parsed information. You will provide functions for going through that tree and making alterations as well. Lastly, you will provide a function to write out an XML file given a document tree.

readXML fileName
This takes the string fileName and returns the datatype that you are using for you XML document.

writeXML fileName doc
This takes a string and an object of your XML document type and writes that document out to a file.

alterDoc doc alterFunc
This takes a doc and a function that operates on elements/comments/whatever. It applies that function to all the parts of the document and builds a new document from the results. This function does not have to be able to remove nodes from the document tree though you can make it do so if you wish.

pruneDoc doc predicateFunc
This is the function that can remove nodes from the document tree, but it doesn't alter any nodes. If an element is removed, all things inside it go away. The predicate function that is passed in takes an element/comment/whatever and returns true if it should be kept in the document.

Note that these functions use the curried form in ML where you don't have a comma separated list in parentheses. To check if your code works, I want you to save and use one of Yahoo's RSS feed files and write the following two functions to process it.

notText node
This function returns true for everything except text. Basically, it takes all text between the elements/comments/ PIs out and leaves only what is left.

matchElement elemNameList node
This function takes a list of names and a node. It returns true for everything that isn't an element. For an element it returns true if the element type name is in the list.

setPermaLinkTrue node
This function takes a node and returns the same node if it isn't an element. If it is an element, it checks if that element has an attribute named isPermaLink (this is used a lot in the Yahoo RSS). If it is, it returns the same element but with the value of isPermaLink ste to "true".

An example of the use of XML is RSS feeds. If you want some extra data files to play with here. Also, as a hint for how to do this assignment, keep in mind that tree structures inherently work well with recursion. So you could probably have a recursive function called "readElement" that returns your data type for an element and recursively calls itself when it gets to elements inside of that element.

Hints - As with all significant programs, there are lots of ways to do this. Because this is fairly complex I'm going to give you a suggestion on one possible approach. You don't have to follow it. I found that having one function that reads a text inside of < and > as well as one function that reads text not inside of < and > was helpful. They each return a domNode as described below. You will have to watch out for having them inside of quotes, but since you don't have CDATA, that isn't a big problem. I would make a single datatype that can be any of the things that go in the XML file and possibly more. That includes elements, comments, text, and PIs.

As a simple hint because it gives you an odd error about a non-constructor in a pattern, if you have a pattern which is the cons of chars (and you probably should), put spaces between the cons operator and the char literals. For example (#"<" :: #"!" :: #"-" :: #"-" :: t). If you leave out the spaces ML has issues with the # being next to the ::.

Here's the BIG hint. You don't have to use this, but here are the type definitions that I'm using for my solution to this problem

type attribute = string * string;

datatype domNode = RemovedNode
| ElementNode of string * attribute list * domNode list
| ElementStart of string * attribute list
| ElementEnd of string
| CommentNode of string
| ProcessingInstruction of string
| TextNode of string;

type domDoc = domNode list;

So the readXML function returns a domDoc. The other functions are passed domDocs along with other information and do the right thing with them. For writeXML that just means writing it to a file. For alterDoc and pruneDoc it means that they return a new domDoc with the proper alterations. Feel free to use records instead of tuples if you want. The first option, RemovedNode, is only used later for the prune and later functions.

To give you some idea of the length of the code, my parsing code was 130 lines. That is by far the longest part of this program. The other three functions can be done in less space than that after readXML is done. Also, my code is longer than it has to be because I used patterns to match #" ", #"\t", and #"\n" in many places. I could have made it shorter with an if, but I decided to use the ML style with patterns instead. If I cut out comments and used an if I feel certain I could get my code below 100 well formatted lines of code. So basically this isn't a huge assignment, but it will require you to think a bit.

Also, I work with things as list of characters until the very last operations when I store them in the domNodes and at that point I convert to strings. This is basically because ML doesn't really offer ways to go through strings.

Extra Credit - Make your parser handle CDATA properly. If you do this you need a big comment at the top telling me to test it out.

Notes on future assignments - My plan is that the last four assignments will be linked to form a larger project. Most all of these build off of this assignment in that I want you to use XML as the basic format for loading and stroring information from those assignments. I haven't worked out all of the details, but there are at least five broad directions that I'm considering. It might interest you to know that when possible I'm trying to have these projects overlap. In the descriptions I'll tell you which aspects are needed for going toward which final ends. If you think of something, tell me and I can add it to my list (that's worth extra credit of course).

Graphics - The final goal for this is a ray tracing application that is optimized to work efficiently with large data sets. As you might guess, the data for the geometry will be loaded in from an XML file. The output will be a text file that I'll provide you with tools to render to a PNG file.

Physics - This project involves the creation of an optimized N-body gravitational simulation code. In theory, this could be used to do small simulations of solar systems, globular clusters, or galaxies.

Text Game - You might call this a SUD (Single User Dungeon). It would be multi-user, but networking is not a standard part of ML as far as I know and even if it were, we won't have time to cover it. If you have ever played a MUD though, this will look something like that where the user is in a text interface and walks around through a graph of text described rooms.

Advanced XML - If you do a little looking, you will quickly find that there is a whole heck of a lot more to XML than the basic parser we did. This path will have you build up some more complex tools though I'm not certain what yet. I was thinking along the lines of a validating parser or and XSLT processor as possibilities.

Data Structures/Algorithms - This is the one that isn't tied as strongly to XML, but I'll probably find ways to throw it in. The idea is that each assignment will have you write code to support a more advanced data structure or to perform some type of advanced algorithm. These could include graph algorithms and balanced trees.