HTML Parsing Showdown – New Contender Takes Title
One of my first posts was a comparison of HTML parsers. Today I found a particularly challenging document to parse. None of the parsers I had compared earlier could handle the malformed HTML in this table, where the td elements were closed prematurely. The behavior of Neko and HtmlCleaner made the most sense (though both still failed to clean the document), while the output from TagSoup and jTidy was stranger.
However, I noticed that Firebug parsed the document correctly. So I did a bit of research into how I could use Firefox’s HTML parsing and found a project called Mozilla Parser that was put together to do just that. Its setup is not quite as nice as the others’, but it is well documented. Begin with the quick start. Then, when you get to the part where you write actual Java code, you may want to follow the example below, since the API appears to have been updated since the documentation was posted.
// Root of the Mozilla Parser checkout; adjust to your own machine.
final String BASE_PATH = "C:\\Documents and Settings\\bjm733\\My Documents\\workspace\\MozillaHtmlParser\\";
try {
    // Native bridge library (MozillaParser.dll on Windows).
    File parserLibraryFile = new File(BASE_PATH + "native" + File.separator + "bin" + File.separator + "MozillaParser" + EnviromentController.getSharedLibraryExtension());
    String parseLibrary = parserLibraryFile.getAbsolutePath();
    // init() takes the native library path and the Mozilla dist directory.
    MozillaParser.init(parseLibrary, BASE_PATH + "mozilla.dist.bin." + EnviromentController.getOperatingSystemName());
    MozillaParser parser = new MozillaParser();
    // "document" is assumed to be declared elsewhere, e.g. as a field.
    document = parser.parse("<html><body>hello world</body></html>");
} catch (Exception e) {
    e.printStackTrace();
}
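Since the setup depends on two hand-built paths, it may help to check them before calling MozillaParser.init. The following is only a rough sketch that uses plain java.io.File and does not touch the Mozilla Parser API at all; the class name ParserPathCheck and its findProblems method are made up for illustration. Note that it only confirms the paths exist. A library that is found but whose own dependencies cannot be loaded will pass this check and still fail at load time.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mozilla Parser: sanity-checks the two
// paths that are later handed to MozillaParser.init.
public class ParserPathCheck {

    // Returns one message per problem found; an empty list means both paths exist.
    static List<String> findProblems(File parserLibrary, File mozillaDistBin) {
        List<String> problems = new ArrayList<>();
        if (!parserLibrary.isFile()) {
            problems.add("Parser library not found: " + parserLibrary.getAbsolutePath());
        }
        if (!mozillaDistBin.isDirectory()) {
            problems.add("Mozilla dist directory not found: " + mozillaDistBin.getAbsolutePath());
        }
        return problems;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.out.println("usage: ParserPathCheck <parser library file> <mozilla dist directory>");
            return;
        }
        List<String> problems = findProblems(new File(args[0]), new File(args[1]));
        if (problems.isEmpty()) {
            System.out.println("Both paths exist.");
        } else {
            for (String problem : problems) {
                System.out.println(problem);
            }
        }
    }
}
```

Run it with the same two values you intend to pass to init; anything it prints points at a path to fix first.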

John Towell said,
September 13, 2008 at 12:49 am
I can’t get past the following exception. I have checked and double-checked my path many times. Any ideas?
com.dappit.Dapper.parser.ParserInitializationException
at com.dappit.Dapper.parser.MozillaParser.init(Unknown Source)
at com.fantasytruth.accuracy.ParserTest.testParser(ParserTest.java:21)
at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
at java.lang.reflect.Method.invoke(Unknown Source)
at junit.framework.TestCase.runTest(TestCase.java:154)
at junit.framework.TestCase.runBare(TestCase.java:127)
at junit.framework.TestResult$1.protect(TestResult.java:106)
at junit.framework.TestResult.runProtected(TestResult.java:124)
at junit.framework.TestResult.run(TestResult.java:109)
at junit.framework.TestCase.run(TestCase.java:118)
at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
Caused by: java.lang.UnsatisfiedLinkError: C:\dev-tools\MozillaHtmlParser\native\bin\MozillaParser.dll: Can't find dependent libraries
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(Unknown Source)
at java.lang.ClassLoader.loadLibrary(Unknown Source)
at java.lang.Runtime.load0(Unknown Source)
at java.lang.System.load(Unknown Source)
… 18 more
Audrey said,
February 12, 2009 at 8:23 am
Thanks for the comparison report. It looks like I am working on something similar a year later. I am thinking of trying Cobra HTML Parser http://lobobrowser.org/cobra.jsp because it is pure Java, is CSS and JavaScript aware, and looks more actively maintained than HTMLParser.
A. Shiraz said,
April 19, 2009 at 6:34 pm
I am trying this out and get the same dependency error as the others. I tried putting the following directories on the PATH:
C:\Documents and Settings\Shiraz>path
PATH=C:\Temp\set\MozillaParser-v-0-3-0\MozillaParser-v-0-3-0\dist\windows;C:\Temp\set\MozillaParser-v-0-3-0\MozillaParser-v-0-3-0\dist\windows\components
I then tried the following:
File parserLibraryFile = new File("C:/SET/lib/mparser/MozillaParser-v-0-3-0/dist/windows/MozillaParser"
        + EnviromentController.getSharedLibraryExtension());
String parserLibrary = parserLibraryFile.getAbsolutePath();
System.out.println("Loading Parser Library " + parserLibrary);
// mozilla.dist.bin directory
final File mozillaDistBinDirectory = new File(
        "C:/SET/lib/mparser/MozillaParser-v-0-3-0/dist/"
        + "windows");
String absPath = mozillaDistBinDirectory.getAbsolutePath();
MozillaParser.init(parserLibrary, absPath);
I still get the following error:
Operating system : Windows XP
Loading Parser Library C:\SET\lib\mparser\MozillaParser-v-0-3-0\dist\windows\MozillaParser.dll
com.dappit.Dapper.parser.ParserInitializationException
at com.dappit.Dapper.parser.MozillaParser.init(Unknown Source)
at first.ParserExample.main(ParserExample.java:30)
Caused by: java.lang.UnsatisfiedLinkError: C:\SET\lib\mparser\MozillaParser-v-0-3-0\dist\windows\MozillaParser.dll: Can't find dependent libraries
at java.lang.ClassLoader$NativeLibrary.load(Native Method)
at java.lang.ClassLoader.loadLibrary0(Unknown Source)
at java.lang.ClassLoader.loadLibrary(Unknown Source)
at java.lang.Runtime.load0(Unknown Source)
at java.lang.System.load(Unknown Source)
… 2 more