
HTML Parsing using the Firefox DLLs

One of my first posts was a comparison of HTML parsers. Today I found a particularly challenging document to parse. None of the parsers I had compared earlier could handle the malformed HTML in this table, where the td elements were closed prematurely. The behavior of Neko and HtmlCleaner made the most sense (though both still failed to clean the document), while the output from TagSoup and jTidy was somewhat stranger.

However, I noticed that Firebug parsed the document correctly. So I did a bit of research into how I could use Firefox's HTML parsing from Java and found a project called Mozilla Parser that was put together to do just that. Its setup is not quite as nice as the others, but it is well documented. Follow the quick start to begin with. Then, when you get to the part where you write actual Java code, you may want to follow the example below, as the API appears to have changed since the documentation was posted.

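import java.io.File;
import com.dappit.Dapper.parser.MozillaParser;        // package name as it appears in the library's stack traces
import com.dappit.Dapper.parser.EnviromentController; // assumed to be in the same package; the spelling is the library's own

// Root of the unpacked MozillaHtmlParser project; adjust for your machine.
final String BASE_PATH = "C:\\Documents and Settings\\bjm733\\My Documents\\workspace\\MozillaHtmlParser\\";

try {
	// Native parser library under <BASE_PATH>/native/bin, with the OS-specific extension.
	File parserLibraryFile = new File(BASE_PATH + "native" + File.separator + "bin" + File.separator + "MozillaParser" + EnviromentController.getSharedLibraryExtension());
	String parseLibrary = parserLibraryFile.getAbsolutePath();
	// The second argument is the Mozilla dist bin directory for the current OS.
	MozillaParser.init(parseLibrary, BASE_PATH + "mozilla.dist.bin."+EnviromentController.getOperatingSystemName());
	MozillaParser parser = new MozillaParser();
	// "document" is assumed to be declared earlier, with the type that parse() returns.
	document = parser.parse("<html><body>hello world</body></html>");
} catch(Exception e) {
	e.printStackTrace();
}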
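
Because init takes two file system paths (the native parser library and the Mozilla dist bin directory), it can be worth checking both before calling it. The following is only a minimal sketch using the standard library; the class name ParserPathCheck and the method findProblems are hypothetical and are not part of Mozilla Parser.

```java
import java.io.File;
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of Mozilla Parser: checks the two paths
// that would be passed to MozillaParser.init(...) before it is called.
public class ParserPathCheck {

    /**
     * Returns a description of each problem found with the two paths.
     * An empty list means the library file and the directory both exist.
     */
    public static List<String> findProblems(File parserLibrary, File mozillaDistBin) {
        List<String> problems = new ArrayList<>();
        if (!parserLibrary.isFile()) {
            problems.add("Parser library not found: " + parserLibrary.getAbsolutePath());
        }
        if (!mozillaDistBin.isDirectory()) {
            problems.add("Mozilla dist bin directory not found: " + mozillaDistBin.getAbsolutePath());
        }
        return problems;
    }

    public static void main(String[] args) {
        if (args.length < 2) {
            System.err.println("usage: ParserPathCheck <parser library> <mozilla dist bin directory>");
            return;
        }
        List<String> problems = findProblems(new File(args[0]), new File(args[1]));
        if (problems.isEmpty()) {
            System.out.println("Both paths exist.");
        } else {
            for (String problem : problems) {
                System.out.println(problem);
            }
        }
    }
}
```

This only confirms that the paths exist. It cannot tell whether the native library's own dependencies can be resolved; a failure of that kind shows up as an UnsatisfiedLinkError when the library is loaded.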

The most unfortunate thing about this approach is that it is not pure Java, which can be a deal breaker in many situations. The project also does not appear to be well maintained, and its developers are not very responsive.

6 Comments »

  1. John Towell said,

    September 13, 2008 at 12:49 am

    I can’t get past the following exception. I have checked and double-checked my path many times. Any ideas?

    com.dappit.Dapper.parser.ParserInitializationException
    at com.dappit.Dapper.parser.MozillaParser.init(Unknown Source)
    at com.fantasytruth.accuracy.ParserTest.testParser(ParserTest.java:21)
    at sun.reflect.NativeMethodAccessorImpl.invoke0(Native Method)
    at sun.reflect.NativeMethodAccessorImpl.invoke(Unknown Source)
    at sun.reflect.DelegatingMethodAccessorImpl.invoke(Unknown Source)
    at java.lang.reflect.Method.invoke(Unknown Source)
    at junit.framework.TestCase.runTest(TestCase.java:154)
    at junit.framework.TestCase.runBare(TestCase.java:127)
    at junit.framework.TestResult$1.protect(TestResult.java:106)
    at junit.framework.TestResult.runProtected(TestResult.java:124)
    at junit.framework.TestResult.run(TestResult.java:109)
    at junit.framework.TestCase.run(TestCase.java:118)
    at org.eclipse.jdt.internal.junit.runner.junit3.JUnit3TestReference.run(JUnit3TestReference.java:130)
    at org.eclipse.jdt.internal.junit.runner.TestExecution.run(TestExecution.java:38)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:460)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.runTests(RemoteTestRunner.java:673)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.run(RemoteTestRunner.java:386)
    at org.eclipse.jdt.internal.junit.runner.RemoteTestRunner.main(RemoteTestRunner.java:196)
    Caused by: java.lang.UnsatisfiedLinkError: C:\dev-tools\MozillaHtmlParser\native\bin\MozillaParser.dll: Can’t find dependent libraries
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(Unknown Source)
    at java.lang.ClassLoader.loadLibrary(Unknown Source)
    at java.lang.Runtime.load0(Unknown Source)
    at java.lang.System.load(Unknown Source)
    … 18 more

  2. Audrey said,

    February 12, 2009 at 8:23 am

    Thanks for the comparison report. Looks like I am working on something similar a year later. I am thinking of trying Cobra HTML Parser http://lobobrowser.org/cobra.jsp because it is pure java and is CSS and javascript aware and looks like it is more actively maintained than HTMLParser.

  3. A. Shiraz said,

    April 19, 2009 at 6:34 pm

    Trying this out and get the same error as the others (dependencies). I tried putting the following directories in path :

    C:\Documents and Settings\Shiraz>path
    PATH=C:\Temp\set\MozillaParser-v-0-3-0\MozillaParser-v-0-3-0\dist\windows;C:\
    Temp\set\MozillaParser-v-0-3-0\MozillaParser-v-0-3-0\dist\windows\components

    I then tried the following:

    File parserLibraryFile = new File("C:/SET/lib/mparser/MozillaParser-v-0-3-0/dist/windows/MozillaParser"
            + EnviromentController.getSharedLibraryExtension());
    String parserLibrary = parserLibraryFile.getAbsolutePath();
    System.out.println("Loading Parser Library " + parserLibrary);
    // mozilla.dist.bin directory
    final File mozillaDistBinDirectory = new File(
            "C:/SET/lib/mparser/MozillaParser-v-0-3-0/dist/"
            + "windows");
    String absPath = mozillaDistBinDirectory.getAbsolutePath();
    MozillaParser.init(parserLibrary, absPath);

    I still get the following error:

    Operating system : Windows XP
    Loading Parser Library C:\SET\lib\mparser\MozillaParser-v-0-3-0\dist\windows\MozillaParser.dll
    com.dappit.Dapper.parser.ParserInitializationException
    at com.dappit.Dapper.parser.MozillaParser.init(Unknown Source)
    at first.ParserExample.main(ParserExample.java:30)
    Caused by: java.lang.UnsatisfiedLinkError: C:\SET\lib\mparser\MozillaParser-v-0-3-0\dist\windows\MozillaParser.dll: Can’t find dependent libraries
    at java.lang.ClassLoader$NativeLibrary.load(Native Method)
    at java.lang.ClassLoader.loadLibrary0(Unknown Source)
    at java.lang.ClassLoader.loadLibrary(Unknown Source)
    at java.lang.Runtime.load0(Unknown Source)
    at java.lang.System.load(Unknown Source)
    … 2 more

  4. Johan said,

    July 31, 2009 at 12:59 am

    I’ve had it working with the following setup (Windows only)

    Append the following two directories to the PATH variable (properly prefixed, e.g., C:/)
    MozillaParser-v-0-3-0\dist\windows\mozilla\components
    MozillaParser-v-0-3-0\dist\windows\mozilla

    Set the following two variables:
    // From archive: http://sourceforge.net/projects/mozillaparser/files/mozillaparser/MozillaParser-v-0-3-0/MozillaParser-v-0-3-0.zip/download
    String parserLibrary = "C:\\MozillaParser-v-0-3-0\\dist\\windows\\MozillaParser.dll";

    // From archive: http://sourceforge.net/projects/mozillaparser/files/mozillaparser/Mozilla%20Components%20base%20v.0.1/mozilla-dist-bin-windows.zip/download
    String mozillaBin = "C:\\bin";

    Finally:
    MozillaParser.init(parserLibrary, mozillaBin);

    As a side note, the author wraps the above strings in File objects and then takes the absolute path from each File. This is probably a better approach, but it is not necessary to get it all working.

  5. Johan said,

    July 31, 2009 at 1:56 am

    For the record, Cobra (http://lobobrowser.org/cobra/java-html-parser.jsp) seems very promising. It offers a very helpful feature for extracting all links from a page: given an HTML page, Cobra downloads includes, stylesheets and external JavaScript automagically, and after the parsing is done, a simple routine returns all links that were found. Unfortunately, it had a serious flaw: it could not parse http://www.google.com. Somehow it fell into an infinite loop while parsing the JavaScript. This severely reduced the attractiveness of the parser.

  6. Anton said,

    January 27, 2010 at 1:46 pm

    Also, I tried parsing your test document with org.htmlparser and it seems to have parsed it okay, even with the weird tags.
