Discovering the joy of SAX in VB6
Published: 02 Jan 2003 12:58 GMT

Microsoft's XML Core Services, affectionately known as MSXML2, provides a useful XML toolkit that VB and COM developers can use in their applications. I'm now going to look at the SAX side of the XML parser coin.
What is SAX?
I don't have space here for more than a cursory discussion of how SAX works, but if you're interested, I'd encourage you to check out "Remedial XML: Learning to play SAX". Briefly, SAX, the Simple API for XML, is a serial push parser: a SAX parser pushes the elements of an XML document into its host application in the order in which it encounters them. SAX was originally created as an API for Java parsers, but it has since been ported to a variety of other languages and platforms, including Microsoft's COM implementation. SAX has advantages over DOM when you're dealing with a large document, because it never has to hold the whole document in memory, or when you're looking for one particular piece of information within a document. The price is that SAX is more work to program against than DOM: you have to keep track of context information yourself to know where you are in a document.
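To make the "push" idea concrete, here is a minimal sketch. It uses Python's standard xml.sax module rather than MSXML, purely because the callback names are the same SAX names and the snippet can be run anywhere. The EventLogger class and the sample document are made up for illustration.

```python
import xml.sax


class EventLogger(xml.sax.ContentHandler):
    """Records each callback in the order the parser pushes it."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        self.events.append(("start", name))

    def endElement(self, name):
        self.events.append(("end", name))


# Hypothetical sample document, used only for this illustration.
SAMPLE = b"<order><item>pen</item><item>ink</item></order>"

logger = EventLogger()
xml.sax.parseString(SAMPLE, logger)  # the parser drives; the handler only listens

for event in logger.events:
    print(event)
```

The application never asks the parser for the next node. It hands over a handler and the parser calls back in document order, so each child item is reported before the closing tag of its parent.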
Microsoft's SAX implementation
There are, in fact, two SAX implementations in MSXML2, one meant for VB programmers and the other for C++ developers. From a VB perspective, you'll need to master a handful of classes to get up and running with SAX:
SAXXMLReader: The parser itself
MSXML's VB-specific SAX parser is defined by the IVBSAXXMLReader interface. The SAXXMLReader class is a version-independent implementation of this interface and is the reader you should use in your applications to guarantee future compatibility with new versions of MSXML. You set the parser to work on a document by calling either the parse or the parseURL method. By itself, SAXXMLReader only parses documents; it doesn't inform you of their content. You'll need to implement a utility interface to actually make use of the parser.
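The same split between "parse something I hand you" and "parse something at this location" can be sketched with Python's xml.sax reader. This is an analogy and not the MSXML API: make_parser and parse are Python calls, and is_well_formed is a helper invented for this example. It also shows that a reader with no handler attached still parses, it just has nobody to report to.

```python
import io
import os
import tempfile
import xml.sax

DOC = b"<note><to>Ann</to></note>"  # hypothetical sample document

reader = xml.sax.make_parser()  # Python's counterpart to creating a reader

# Parse from an in-memory stream. No handler is attached, so the document
# is checked for well-formedness but nothing is reported to the application.
reader.parse(io.BytesIO(DOC))

# Parse from a location on disk, the rough counterpart of parsing by URL.
with tempfile.NamedTemporaryFile(suffix=".xml", delete=False) as tmp:
    tmp.write(DOC)
path = tmp.name
try:
    reader.parse(path)
finally:
    os.remove(path)


def is_well_formed(data):
    """Return True if the bytes parse cleanly, False on a parse error."""
    try:
        xml.sax.make_parser().parse(io.BytesIO(data))
        return True
    except xml.sax.SAXParseException:
        return False


print(is_well_formed(DOC))
print(is_well_formed(b"<note><to>Ann</note>"))  # mismatched closing tag
```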
The content handler
The IVBSAXContentHandler interface contains a set of methods called by the SAX parser to inform your application about the content in a document. I've listed a selection of the important methods in Table A.

Table A: Important content handler methods
startDocument: Invoked when the parser begins parsing a document.
startElement: Invoked for each element the parser encounters, when the parser reads the element's start tag. Input parameters indicate the local and fully qualified names of the element. Note that SAX uses a depth-first traversal: child elements are parsed before sibling elements.
characters: Invoked after startElement for elements that contain character data. The data is passed to the method as an input parameter. Because the VB implementation of the SAX parser is non-validating, this method receives white space as well.
endElement: Invoked after startElement and characters, when the parser reads the closing tag for an element.
processingInstruction: Invoked when the parser encounters a processing instruction. The content of the instruction is passed to this method via an input parameter.
endDocument: Invoked when the parser finishes parsing a document. At this point, the parser can be reused to parse a different document.
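You'll want to put real code in at least the startElement and characters methods of this interface, and pass an instance of the implementing class to SAXXMLReader via its contentHandler property. Because VB's Implements keyword requires a procedure for every member of an interface, the methods you don't care about still have to be present as empty stubs.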
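As a rough illustration of the order in which these callbacks arrive, here is a sketch using Python's xml.sax, whose ContentHandler uses the same SAX method names. It is a stand-in and not MSXML code: the VB signatures differ, setContentHandler plays the role of the contentHandler property, and TraceHandler and the sample document are hypothetical.

```python
import io
import xml.sax


class TraceHandler(xml.sax.ContentHandler):
    """Logs every callback as a (method name, detail) pair."""

    def __init__(self):
        super().__init__()
        self.calls = []

    def startDocument(self):
        self.calls.append(("startDocument", ""))

    def processingInstruction(self, target, data):
        self.calls.append(("processingInstruction", target))

    def startElement(self, name, attrs):
        self.calls.append(("startElement", name))

    def characters(self, content):
        # A parser may deliver one run of text in several chunks.
        self.calls.append(("characters", content))

    def endElement(self, name):
        self.calls.append(("endElement", name))

    def endDocument(self):
        self.calls.append(("endDocument", ""))


# Hypothetical sample: one processing instruction, one element with text.
DOC = b'<?app-hint cache="no"?><greeting>hi</greeting>'

handler = TraceHandler()
reader = xml.sax.make_parser()
reader.setContentHandler(handler)  # attach the handler before parsing
reader.parse(io.BytesIO(DOC))

for method, detail in handler.calls:
    print(method, repr(detail))
```

Any method you leave out of a Python handler falls back to a do-nothing default in the base class, which is the convenience the empty stubs give you by hand in VB.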
The trick with implementing a content handler is that SAX is stateless: the parser reports each event and then moves on. Your implementing class therefore has to keep track of the element that's currently being parsed (save the name you get from startElement) so it knows what to do with the element content it receives through the characters method. Also, the current VB implementation of SAX is non-validating, which has an interesting side effect: white space in a document is handed off to characters rather than to ignorableWhitespace, where you might expect it.
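Here is a minimal sketch of that bookkeeping, again written against Python's xml.sax as a stand-in for the VB interface. The TitleCollector class, the element names, and the sample document are all assumptions made for the example. The handler saves the name it gets from startElement, uses it in characters to decide whether the text matters, and counts the whitespace-only chunks that show up there. Python's default parser is also non-validating, so it behaves the same way on this point.

```python
import xml.sax


class TitleCollector(xml.sax.ContentHandler):
    """Collects the text of <title> elements by tracking the current element."""

    def __init__(self):
        super().__init__()
        self.current = None          # name saved from the last startElement
        self.titles = []
        self.whitespace_chunks = 0   # whitespace-only text seen in characters
        self.ignorable_calls = 0     # calls to ignorableWhitespace
        self._buffer = []

    def startElement(self, name, attrs):
        self.current = name
        if name == "title":
            self._buffer = []

    def characters(self, content):
        if self.current == "title":
            self._buffer.append(content)   # text may arrive in pieces
        elif content.strip() == "":
            self.whitespace_chunks += 1    # indentation and line breaks

    def ignorableWhitespace(self, whitespace):
        self.ignorable_calls += 1

    def endElement(self, name):
        if name == "title":
            self.titles.append("".join(self._buffer))
        self.current = None


# Hypothetical sample document with indentation between elements.
DOC = b"""<library>
  <book><title>SAX Basics</title><pages>120</pages></book>
  <book><title>DOM Deep Dive</title><pages>300</pages></book>
</library>"""

collector = TitleCollector()
xml.sax.parseString(DOC, collector)

print(collector.titles)
print(collector.whitespace_chunks, collector.ignorable_calls)
```

A single saved name is enough for a flat document like this one. If the same element name can appear at different depths, a stack of names pushed in startElement and popped in endElement gives you the full path instead.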










