Archive for March, 2008

HTML Parsing using the Firefox DLLs

One of my first posts was a comparison of HTML parsers. Today I found a particularly challenging document to parse. None of the parsers I had compared earlier could handle the malformed HTML in this table, where the td elements were ended prematurely. The behavior of Neko and HtmlCleaner made the most sense (although they still failed to clean the document), while the output from TagSoup and jTidy was somewhat stranger.

However, I noticed that FireBug parsed the document correctly. So I did a bit of research into how I could use Firefox's HTML parsing and found a project called Mozilla Parser that had been put together to do just that. Its setup is not quite as nice as the others, but it is well documented. Follow the quick start to begin with. When you get to the portion where you write actual Java code, you may want to follow the example below instead, as it appears the API has been updated since the documentation was posted.

// Root of the Mozilla Parser checkout; adjust to your own machine.
final String BASE_PATH = "C:\\Documents and Settings\\bjm733\\My Documents\\workspace\\MozillaHtmlParser\\";

try {
	// Native parser library, e.g. <BASE_PATH>native/bin/MozillaParser<extension>
	File parserLibraryFile = new File(BASE_PATH + "native" + File.separator + "bin" + File.separator + "MozillaParser" + EnviromentController.getSharedLibraryExtension());
	String parseLibrary = parserLibraryFile.getAbsolutePath();
	// init() takes the native library path and the Mozilla distribution directory for this OS
	MozillaParser.init(parseLibrary, BASE_PATH + "mozilla.dist.bin." + EnviromentController.getOperatingSystemName());
	MozillaParser parser = new MozillaParser();
	// "document" is assumed to be declared by the surrounding code
	document = parser.parse("<html><body>hello world</body></html>");
} catch (Exception e) {
	e.printStackTrace();
}

The most unfortunate thing about this approach is that it is not pure Java, which can be a deal breaker in many situations. Also, the project does not appear to be well maintained, and its developers are not very responsive.
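
The path handling in the example above can be pulled out of the parser calls so that it can be checked on its own. This is only a minimal sketch and not part of Mozilla Parser: the class ParserPaths and both of its methods are hypothetical helpers, and the ".dll" and "win" values in main are placeholders for whatever EnviromentController returns on your platform.

```java
import java.io.File;

public class ParserPaths {

    // Hypothetical helper: <baseDir>/native/bin/MozillaParser<extension>.
    // The extension is passed in as a string; in the post it comes from
    // EnviromentController.getSharedLibraryExtension().
    public static File nativeLibrary(File baseDir, String sharedLibraryExtension) {
        File binDir = new File(new File(baseDir, "native"), "bin");
        return new File(binDir, "MozillaParser" + sharedLibraryExtension);
    }

    // Hypothetical helper: <baseDir>/mozilla.dist.bin.<osName>.
    // The OS name is passed in as a string; in the post it comes from
    // EnviromentController.getOperatingSystemName().
    public static File mozillaDist(File baseDir, String osName) {
        return new File(baseDir, "mozilla.dist.bin." + osName);
    }

    public static void main(String[] args) {
        // Use the first argument as the checkout directory, or the current directory.
        File base = new File(args.length > 0 ? args[0] : ".");
        // ".dll" and "win" are placeholder values, not confirmed library output.
        System.out.println(nativeLibrary(base, ".dll").getAbsolutePath());
        System.out.println(mozillaDist(base, "win").getAbsolutePath());
    }
}
```

Using the two-argument File constructor avoids concatenating File.separator by hand and does not depend on the base path ending in a separator.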


Wuala Invites Available

Wuala is a p2p backup and remote storage service. It gives you 1GB of backup space for free and more when you share disk space with others. If you have a large amount of data or multimedia to back up, then this is a much cheaper alternative to services such as Mozy. All your data is encrypted and replicated multiple times. There is a very interesting Google tech talk on the subject which shares some insights into the workings of Wuala. I found the portion dealing with erasure codes to be particularly interesting. Contact me if you'd like an invite to Wuala.

Also, you may have noticed the pace of my blogging has slowed. I've been extremely busy recently and expect this to continue for the next month or two. Nonetheless, I am going to make time to post a series of Struts 2 tutorials based on a presentation I gave a few months ago.


Change the NetBeans Default JDK

A client sent me some code today to update. He was using NetBeans, so I downloaded the IDE and fired it up to open the project he'd sent me. Unfortunately, the project wouldn't compile because he'd written the code in Java 6 while NetBeans was using Java 5. I couldn't find a NetBeans menu to update the setting, but found that the fix is to add the following to NetBeans' etc/netbeans.conf file:

# Default location of JDK, can be overridden by using --jdkhome <dir>:
netbeans_jdkhome="C:\Program Files\Java\jdk1.6.0_05"
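
After restarting NetBeans, one quick way to see which JDK a project actually runs on is to run a tiny class from inside the IDE. This is just a sketch: JdkCheck is a made-up class name, and it reports the JVM that executes the class, which I am assuming is the IDE's default JDK unless the project has been given its own Java platform.

```java
public class JdkCheck {

    // Returns a one-line description of the JVM that is running this code,
    // built from the standard java.version and java.home system properties.
    public static String describeRuntime() {
        return "java.version=" + System.getProperty("java.version")
                + ", java.home=" + System.getProperty("java.home");
    }

    public static void main(String[] args) {
        System.out.println(describeRuntime());
    }
}
```

If the printed java.home still points at the old JDK, the project is probably not picking up the setting from netbeans.conf.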
