Parsing XML: SAX or DOM

I tried both SAX and DOM for parsing an XML file. The difference in running time is huge: from 4 hours (DOM) to 9 seconds (SAX).

Different APIs

Different interfaces lead to different code complexity, but it all depends on the point of view.

DOM

You cannot really understand the Document Object Model without knowing what an XPath is: an XPath is a path expression that matches one or more XML elements in an XML document.

Example: //product[@title="casa"]

It matches every <product> element whose title attribute is "casa".

Every programming language and every library offers a different way to use XPath; however,
it always consists of requesting the list of nodes that match a given XPath.
Typically this list of nodes is then used to query other specific information from each node it contains.

For example, if <product> contains child elements or other attributes, one could request a specific
child, perhaps through another XPath.
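
To make the idea concrete, here is a minimal sketch in Python. It assumes the standard xml.etree.ElementTree module, which is a tree API with only a subset of XPath rather than a full DOM implementation; the <catalog> document, the <price> child and the function name are made up for illustration.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, invented for this example.
XML_TEXT = """<catalog>
  <product title="casa"><price>100</price></product>
  <product title="auto"><price>250</price></product>
  <product title="casa"><price>120</price></product>
</catalog>"""

def prices_for_title(xml_text, title):
    """Return the text of <price> for every <product> with the given title."""
    root = ET.fromstring(xml_text)  # the whole tree is built in memory
    # First request: the list of nodes matching the XPath.
    nodes = root.findall(f".//product[@title='{title}']")
    # Second request, per node: a specific child, through another path.
    return [node.findtext("price") for node in nodes]

print(prices_for_title(XML_TEXT, "casa"))  # ['100', '120']
```

The first call returns the list of matching nodes, and each node is then queried again for the child that is actually needed.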

SAX

SAX stands for Simple API for XML. The library provides a top-down recursive parser. The parser accepts as parameters the document and three or more callback functions, which it calls whenever an event occurs, passing arguments that depend on the context and on the event type.

The three basic callbacks are called for these events: element start, element end, character data.

(A complete library will also offer callbacks for CDATA start, CDATA end, and other events.)

The element start callback is called with the element name and the list of attributes.
The element end callback is called with the element name. The character callback is called with a chunk of character data.
(Refer to the documentation of the specific library for the exact signatures.)
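
As an illustration of the three basic callbacks, here is a small sketch using Python's standard xml.sax module, where they are named startElement, endElement and characters. The EventLogger class and the log_events helper are hypothetical names invented for this example; the handler only records what it is told.

```python
import xml.sax

class EventLogger(xml.sax.ContentHandler):
    """Records every event the parser reports, in order."""

    def __init__(self):
        super().__init__()
        self.events = []

    def startElement(self, name, attrs):
        # Called with the element name and its attributes.
        self.events.append(("start", name, dict(attrs.items())))

    def endElement(self, name):
        # Called with the element name only.
        self.events.append(("end", name))

    def characters(self, content):
        # Called with a chunk of character data; whitespace-only chunks are skipped.
        if content.strip():
            self.events.append(("chars", content))

def log_events(xml_text):
    handler = EventLogger()
    xml.sax.parseString(xml_text.encode("utf-8"), handler)
    return handler.events

for event in log_events('<product title="casa"><price>100</price></product>'):
    print(event)
```

Note that a parser may deliver the text of one element in several chunks, so real code should usually join them before using the value.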

Different perspectives

From a programming point of view there are two different perspectives. With DOM the document is an object and requests are made through methods, such as XPath queries. With SAX the document is read and parsed in order to record or use the information it contains.

Simply put, if what matters is all the data contained in the document, I tend to use SAX; if what matters is extracting a few pieces of data from the document, I tend to use DOM.

DOM Tricks

If I am using DOM, it is better to load the document into RAM (if it is not already there), because reading from a file is usually slower.

Another simple tip: do not call methods when they are not needed, and reduce the number of requests to the minimum.
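
One possible way to apply both tips, sketched in Python with the standard xml.etree.ElementTree module (a tree API, used here as a stand-in for a DOM library): parse the document once from memory and walk it a single time to build an index, instead of issuing one XPath request per title. The element names and the index_prices_by_title function are assumptions made for the example.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample document, already held in memory as a string.
CATALOG = """<catalog>
  <product title="casa"><price>100</price></product>
  <product title="auto"><price>250</price></product>
  <product title="casa"><price>120</price></product>
</catalog>"""

def index_prices_by_title(xml_text):
    """Parse once, traverse once, and answer later lookups from a dict."""
    root = ET.fromstring(xml_text)
    index = {}
    for product in root.iter("product"):  # a single walk over the tree
        title = product.get("title")
        index.setdefault(title, []).append(product.findtext("price"))
    return index

prices = index_prices_by_title(CATALOG)
print(prices["casa"])  # ['100', '120']
print(prices["auto"])  # ['250']
```

After the index is built, every lookup is a dictionary access and no further request goes to the tree.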

Performance

Even with the document loaded in RAM, performance can be very bad. Consider a 2 MByte XML document: with 200 XPath requests evaluated from the root, the parser has to read and process about 400 MBytes of characters (200 × 2 MBytes), which is a great number of operations.

With SAX the document is read only once, and there is no need to load it into RAM first (since the parser is recursive descent). All the information has to be recorded in some data structure local to the code. Once parsing has ended that data can be used, unless it was already used during parsing.
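
A sketch of what such a local data structure could look like, again with Python's standard xml.sax module. The ProductCollector class, the collect_products function and the <product>/<price> layout are hypothetical; the stream is an in-memory io.BytesIO only to keep the example self-contained, and an open file would be passed the same way.

```python
import io
import xml.sax

class ProductCollector(xml.sax.ContentHandler):
    """Builds a list of {'title', 'price'} records while the document streams by."""

    def __init__(self):
        super().__init__()
        self.products = []    # local structure filled during parsing
        self._current = None  # record for the <product> being read
        self._text = []       # character chunks of the current <price>

    def startElement(self, name, attrs):
        if name == "product":
            self._current = {"title": attrs.get("title"), "price": None}
        elif name == "price":
            self._text = []

    def characters(self, content):
        self._text.append(content)

    def endElement(self, name):
        if name == "price" and self._current is not None:
            self._current["price"] = "".join(self._text).strip()
        elif name == "product":
            self.products.append(self._current)
            self._current = None

def collect_products(stream):
    handler = ProductCollector()
    xml.sax.parse(stream, handler)  # single pass over the stream
    return handler.products

data = (b'<catalog><product title="casa"><price>100</price></product>'
        b'<product title="auto"><price>250</price></product></catalog>')
print(collect_products(io.BytesIO(data)))
```

The handler needs a little state to know where it is in the document, which is the extra complexity SAX asks for in exchange for reading the input only once.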

Conclusion

Even if DOM seems more appealing because of its object nature, SAX is surely better for extracting all the data from an XML document, even though it is more complex to use (local data structures are needed).

