First impressions of StAX
I have recently been working on this issue, converting some of Pulse's XML processing from using DOM to using StAX. Whilst the DOM API is simpler to work with, it is not so memory friendly and was becoming a problem for some of our customers.
I chose to try StAX rather than SAX because it was developed to be a middle ground between DOM and SAX, having the memory efficiency of a streaming parser like SAX whilst retaining a simpler API like DOM.
A quick note: This is not an exhaustive analysis of the pros and cons of the various XML APIs as this has been done before. Rather, this is a comparison of some of the things I made a mental note of whilst doing the conversion.
DOM
- Works with an in-memory tree representation of the XML document, and therefore has high memory requirements for large documents.
- Easy to use API that allows you to navigate the XML document in whatever way is most appropriate to your task. You can process an element as often as you like, go forwards and backwards, and search through the document.
This is what our code looked like before the conversion.
Note that using DOM is simply a matter of manipulating an in-memory tree.
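To illustrate the point about working with an in-memory tree, here is a minimal sketch of DOM-style processing. It is not the original Pulse code: the class name `DomSuitesSketch`, the `suiteNames` method, and the `suite` element with a `name` attribute are all assumptions made up for the example.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

// Hypothetical example class, not part of Pulse.
class DomSuitesSketch
{
    /** Returns the name attribute of every (assumed) suite element in the document. */
    static List<String> suiteNames(String xml)
    {
        try
        {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // The whole document is parsed into an in-memory tree before any processing starts.
            Document doc = builder.parse(new InputSource(new StringReader(xml)));

            // Simply ask the tree for the elements of interest; everything else is never visited.
            NodeList suites = doc.getDocumentElement().getElementsByTagName("suite");
            List<String> names = new ArrayList<>();
            for (int i = 0; i < suites.getLength(); i++)
            {
                Element suite = (Element) suites.item(i);
                names.add(suite.getAttribute("name"));
            }
            return names;
        }
        catch (Exception e)
        {
            throw new IllegalStateException("Unable to parse report", e);
        }
    }
}
```

The convenience comes at the cost noted above: the full tree has to exist in memory before the first suite can be read.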
SAX
- Implementations manage the state in the form of instance variables. This works well for simple documents, but becomes more difficult to manage as the document gets more complex.
- Processing of an element typically occurs when you encounter the element's end tag, as only then is all of the element's content available. Until you reach an end tag, you have no real idea of how far through an element you are.
- You only need to respond to elements that are of interest. That is, when you receive a callback for an element you do not care about, just do nothing and return.
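The points above can be sketched with a small SAX handler. This is an illustrative example rather than code from Pulse: the class name `SaxSuitesSketch`, the `summarise` method, and the `suite` and `case` element names are assumptions. It counts the cases in each suite, keeping its state in instance variables and recording the result at the suite's end tag.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class, not part of Pulse.
class SaxSuitesSketch extends DefaultHandler
{
    // Parsing state lives in instance variables between callbacks.
    private final List<String> summaries = new ArrayList<>();
    private boolean inSuite;
    private String currentName;
    private int currentCases;

    @Override
    public void startElement(String uri, String localName, String qName, Attributes attributes)
    {
        if ("suite".equals(qName))
        {
            inSuite = true;
            currentName = attributes.getValue("name");
            currentCases = 0;
        }
        else if (inSuite && "case".equals(qName))
        {
            currentCases++;
        }
        // Any other element is not of interest: do nothing and return.
    }

    @Override
    public void endElement(String uri, String localName, String qName)
    {
        // Only at the end tag is all of the suite's content known.
        if ("suite".equals(qName))
        {
            summaries.add(currentName + ":" + currentCases);
            inSuite = false;
        }
    }

    /** Returns one "name:caseCount" entry per (assumed) suite element. */
    static List<String> summarise(String xml)
    {
        try
        {
            SaxSuitesSketch handler = new SaxSuitesSketch();
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
            return handler.summaries;
        }
        catch (Exception e)
        {
            throw new IllegalStateException("Unable to parse report", e);
        }
    }
}
```

Even in this small case there are three fields to keep consistent across callbacks, which hints at how the bookkeeping grows with document complexity.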
StAX
- Implementations typically manage the state on the execution stack, with a new method call for each element that is encountered. This makes the code pretty easy to read as it is self-documenting.
- You need to process each and every (unfiltered) tag and event in the XML document. This is rather low level, and without care can lead to confusion and complications.
And this is what the code looked like after the conversion:
{
    expectStartElement(ELEMENT_SUITES, reader);
    reader.nextTag();
    while (reader.isStartElement())
    {
        if (isElement(getConfig().getSuiteElement(), reader))
        {
            processSuite(reader);
        }
        else
        {
            // ignore this element.
            nextElement(reader);
        }
    }
    expectEndElement(ELEMENT_SUITES, reader);
}
My use of StAX is a little more regimented. I begin and end each method with an assertion that I am at the element I expect to be (this has the advantage of documenting the implementation). The rest of the implementation is similar to its DOM counterpart, except that rather than simply asking for the elements I want to process, I need to loop over all the elements, skipping over those that are not of interest.
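The helper methods used in the snippet are not part of the StAX API. As a rough sketch of how they might be written on top of `XMLStreamReader`, assuming that the expect methods only assert the current position and that `nextElement` skips the current element and then moves to the following tag, something like the following would do. The class `StaxHelpersSketch`, the `suiteNames` method, and the element and attribute names are hypothetical, and the real Pulse implementations may well differ.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class, not part of Pulse.
class StaxHelpersSketch
{
    static boolean isElement(String name, XMLStreamReader reader)
    {
        return reader.isStartElement() && name.equals(reader.getLocalName());
    }

    // Assumed behaviour: assert the position without advancing the reader.
    static void expectStartElement(String name, XMLStreamReader reader) throws XMLStreamException
    {
        reader.require(XMLStreamConstants.START_ELEMENT, null, name);
    }

    static void expectEndElement(String name, XMLStreamReader reader) throws XMLStreamException
    {
        reader.require(XMLStreamConstants.END_ELEMENT, null, name);
    }

    /** Skips the current element and all nested content, then moves to the next tag. */
    static void nextElement(XMLStreamReader reader) throws XMLStreamException
    {
        int depth = 1;
        while (depth > 0)
        {
            int event = reader.next();
            if (event == XMLStreamConstants.START_ELEMENT)
            {
                depth++;
            }
            else if (event == XMLStreamConstants.END_ELEMENT)
            {
                depth--;
            }
        }
        // Now on the end tag of the skipped element: advance to the
        // next sibling's start tag or the parent's end tag.
        reader.nextTag();
    }

    /** Collects the name attribute of each (assumed) suite child of the suites root. */
    static List<String> suiteNames(String xml)
    {
        try
        {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            reader.nextTag(); // move from the start of the document to the root element
            expectStartElement("suites", reader);
            reader.nextTag();

            List<String> names = new ArrayList<>();
            while (reader.isStartElement())
            {
                if (isElement("suite", reader))
                {
                    names.add(reader.getAttributeValue(null, "name"));
                }
                // Whether processed or ignored, step past this element.
                nextElement(reader);
            }
            expectEndElement("suites", reader);
            return names;
        }
        catch (XMLStreamException e)
        {
            throw new IllegalStateException("Unexpected report structure", e);
        }
    }
}
```

Using `require` for the assertions means a document that does not match the expected structure fails at the first wrong tag with the parser's own location information, rather than silently producing partial results.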
Summary
Overall, I am happy with the way the conversion has turned out. A couple of things were unexpected. The first was that the StAX API did not include any higher-level utility functions for moving around at the element level; it only lets you step from the end tag of one element to the start tag of the next. The other was that it required a fair amount of effort to write the code such that it was resilient to unexpected data in the reports. Every tag has to be processed, after all.
This entry was posted on Thursday, October 29th, 2009 at 4:54 pm and is filed under Technology.
