HTML Parsing using the Firefox DLLs

03/21/2008

One of my first posts was a comparison of HTML parsers. Today I found a particularly challenging document to parse. None of the parsers I had compared earlier were able to handle the malformed HTML in this table where the td elements were prematurely ended. The behavior of Neko and HtmlCleaner made the most sense (while still failing to clean the document) while the output from TagSoup and jTidy was a bit more strange.

However, I noticed that FireBug parsed the document correctly. So I did a bit of research into how I’d be able to use Firefox’s HTML parsing and found a project called Mozilla Parser that had been put together to do just that. Its setup is not quite as nice as the others, but is well documented. Follow the quick start to begin with. Then when you get to the portion where you write actual Java code you may want to follow the example below as it appears the API has been updated since the documentation was posted.

final String BASE_PATH = "C:\\Documents and Settings\\bjm733\\My Documents\\workspace\\MozillaHtmlParser\\";

try {
	File parserLibraryFile = new File(BASE_PATH + "native" + File.separator + "bin" + File.separator + "MozillaParser" + EnviromentController.getSharedLibraryExtension());
	String parseLibrary = parserLibraryFile.getAbsolutePath();
	MozillaParser.init(parseLibrary, BASE_PATH + "mozilla.dist.bin."+EnviromentController.getOperatingSystemName());
	MozillaParser parser = new MozillaParser();
	document = parser.parse("<html><body>hello world</body></html>");
} catch(Exception e) {
	e.printStackTrace();
}

The most unfortunate thing about this approach is that it is not pure Java, which can be a deal breaker in many situations. Also it’s not well maintained with responsive developers.