This document is under review and not quite finished

Efficient XML Processing

Abstract

This document describes the application of the visitor pattern to XML processing. It uses a code generator that translates a DTD into an XML parser that buils a tree in memory which can be traversed with a visitor implementation. Main advantages of this approach are speed of processing, ease of writing filters for XML processing and maintenance. The technique has proven its value in industry for a few years.

Problem statement

With the increasing popularity of XML the need for processing XML documents becomes more and more important: data extraction (list of figure, index items), data manipulation (count number of items in an order, calculate total cost of a shopping basket), XML conversion (generate table of contents, generate list item and header numbers) are the main applications. The main methods for processing XML, SAX based and DOM based, each have their own pro's and con's: SAX procescing is fast but does not allow tree traversal while DOM processing tends to be slow and memory hungry. Here, a method is presented that combines the advantages of both methods to some extent while minimising the disadvantages.

Application of the visitor pattern to XML processing

The visitor design pattern is one of the basic patterns described in the GoF Design book[1]. It is a pattern that is very well suited to the situation where there is a tree structure with many different types of nodes. So, it seems to be natural to apply the pattern to XML processing: in XML processing we have a document that can be interpreted as a tree with a lot of different types of nodes. Every element translates into a different type.

TODO: (how the visitor pattern works)

Class diagram for XML processor

Example

Suppose, we have the following simple DTD containing four elements: doc, title, section and p. Note that the attributes of doc have various types: ID, CDATA, NUM and a choise


<!ATTRIBUTE doc
    id      ID                  #IMLIED
    author  CDATA               #IMPLIED
    version NUM                 #IMPLIED
    status  (proof|publishable) proof
>

<!ELEMENT title (#PCDATA)>
<!ATTRIBUTE section
    id      ID                  #IMLIED
>

<!ELEMENT section (title|p|#PCDATA)>
<!ATTRIBUTE section
    id      ID                  #IMLIED
>

<!ELEMENT p (#PCDATA)>
<!ATTRIBUTE p
    id      ID                  #IMLIED
>

The class diagram that comes with this DTD is the following:

Class diagram for processing instances of the example DTD

Suppose, we have the following instance, which conforms to the above DTD.

<?xml version="1.0">
<doc author='rrb' version='2' status='publishable'>
	<title>The visitor pattern</title>
	<section id='s1'>
		<title>Origin</title>
		<p>Bla, bla, bla.</p>
		<p>More bla, bla, bla.</p>
		<p>Bla, bla, etc.</p>
	</section>
	<section id='s2'>
		<title>Motivation</title>
		<p>Bla, bla, bla.</p>
		<p>Bla, bla, etc.</p>
	</section>
</doc>

When this file is fed to the XMLParser, it builds a tree in memory of where objects are created from the classes that come with each of the elements. The following diagram shows the objects created and their relations for the example file.

Tree built by the XMLParser on the example input

Code generator

There remains only one pragmatic problem to be solved to apply the visitor pattern: how to generate the code to parse an XML file and build the tree in memory.

The code generator takes as input a DTD and produces a Java/C++ file with the XMLBase derived classes, a BaseVisitor class and a XMLParser class. This requires a program that can read a DTD and resolve all issues with fragmented DTDs (i.e., the DTD parser needs to know how to find the location of the fragments referred to in the input DTD). James Clarks SP parser is such a parser. However, the source is hard to follow due to sparse commentary and documentation, though eventually it has been proven able to build a code generator based on SP.

A more transparent code generator can be based on the DOM?

Extentions

In practice, there are a lot of filters that require retrieval of an attribute that appears in various elements. For example, a filter for determining all IDs in a document. In the approach as described so far, it means that a method has to be written for every element that may have an ID attribute.

However, since we are generating code anyway, it is easy to add a method 'GetAttribute' that takes as argument the name of an attribute and returns the value of the attribute asked for or throws an exception when it does not exist. Likewise, it has been proven useful in practice to be able to get the name of the element in the form of a string or an enumerated type.

Code

TODO (have to solve copyright issues)

Pros and cons

Main benefits of this approach are

Main disadvantages of this approach are

Conclusion

A new method for XML processing is presented that has the benefits of speed of processing, ease of writing filters for XML processing and maintenance. It lends itself easily to extend with DOM like interfaces since a code generator takes care of the hard part: translating a DTD into an XMLParser en treebuilder.


1: Design Patterns: Elements of Reusable Object-Oriented Software by Erich Gamma, Richard Helm,Ralph Johnson, and John Vlissides. Addison Wesley. October 1994.

Comments please to Remco Bouckaert.


Last updated: 5 Dec 2000 webmaster