Showdown – Java HTML Parsing Comparison

02/02/2008

I had to do some HTML parsing today, but unfortunately most HTML on the web is not well-formed like any markup I’d create. Missing end tags and other broken syntax throws a wrench into the situation. Luckily, others have already addressed this issue. Many times over in fact, leaving many to wonder which solution to implement.

Once you parse HTML, you can do some cool stuff with it like transform it or extract some information. For that reason it is sometimes used for screen scraping. So, to test the parsing libraries, I decided to do exactly that and see if I could parse the HTML well enough to extract links from it using an XQuery. The contenders were NekoHTML, HtmlCleaner, TagSoup, and jTidy. I know that there are many others I could have chosen from as well, but this seemed to be a good sampling and there’s only so much time in the day. I also chose 10 URLs to parse. Being a true Clevelander I picked the sites of a number of local attractions. I’m right near all of the stadiums, so the Quicken Loans Arena website was my first target. I sometimes jokingly refer to my city as the “Mistake on the Lake” and the pure awfulness of the HTML from my city did not fail me. The ten URLs I chose are:

http://www.theqarena.com

http://cleveland.indians.mlb.com

http://www.clevelandbrowns.com

http://www.cbgarden.org

http://www.clemetzoo.com

http://www.cmnh.org

http://www.clevelandart.org

http://www.mocacleveland.org

http://www.glsc.org

http://www.rockhall.com

I gave each library an InputStream created from a URL (referred to as urlIS in the code samples below) and expected an org.w3c.dom.Node in return once the parse operation was completed. I implemented each library in its own class extending from an AbstractScraper implementing a Scraper interface I created. This was a design tip fresh in my mind from reading my all-time favorite technical book: Effective Java by Josh Bloch. The implementation specific code for each library is below:

NekoHTML:

final DOMParser parser = new DOMParser();
try {
	parser.parse(new InputSource(urlIS));
	document = parser.getDocument();
} catch (SAXException e) {
	e.printStackTrace();
} catch (IOException e) {
	e.printStackTrace();
}

TagSoup:

final Parser parser = new Parser();
SAX2DOM sax2dom = null;
try {
	sax2dom = new SAX2DOM();
	parser.setContentHandler(sax2dom);
	parser.setFeature(Parser.namespacesFeature, false);
	parser.parse(new InputSource(urlIS));
} catch (Exception e) {
	e.printStackTrace();
}
document = sax2dom.getDOM();

jTidy:

final Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
tidy.setForceOutput(true);
document = tidy.parseDOM(urlIS, null);

HtmlCleaner:

final HtmlCleaner cleaner = new HtmlCleaner(urlIS);
try {
	cleaner.clean();
	document = cleaner.createDOM();
} catch (Exception e) {
	e.printStackTrace();
}

Finally, to judge the ability to parse the HTML, I ran the XQuery “//a” to grab all the <a> tags from the document. The only one of these parsing libraries I had used before was jTidy. It was able to extract the links from 5 of the 10 documents. However, the clear winner was HtmlCleaner. It was the only library to successfully clean 10/10 documents. Most of the others were not able to make it past even the very first link I provided, which was to Quicken Loans Arena site. HtmlCleaner’s full results:

Found 87 links at http://www.theqarena.com/
Found 156 links at http://cleveland.indians.mlb.com/
Found 96 links at http://www.clevelandbrowns.com/
Found 106 links at http://www.cbgarden.org/
Found 70 links at http://www.clemetzoo.com/
Found 23 links at http://www.cmnh.org/site/
Found 27 links at http://www.clevelandart.org/
Found 51 links at http://www.mocacleveland.org/
Found 27 links at http://www.glsc.org/
Found 90 links at http://www.rockhall.com/

One disclaimer that I will make is that I did not go out of my way to improve the performance of any of these libraries. Some of them had additional options that could be set to possibly improve performance. I did not delve into wading through the documentation to figure out what these options were and simply used the plain vanilla incantations. HtmlCleaner seems to offer me everything I need and was quick and easy to implement.  One drawback to HtmlCleaner is that it’s not available in a Maven repository.  Sometimes NekoHTML may be easier to use for this reason.  Note also that at the time of writing the last released version of jTidy was from 2000.  A newer .jar has since been made available that will likely perform better.