RSS Entries RSS
RSS Subscribe by Email

Easy Java Bean toString() using BeanUtils

I often want to have a String description of my beans for debugging or logging purposes, but hate having to manually concatenate the fields in my class to create a toString() method.  This code snippet using Apache Commons (a.k.a. Jakarta Commons) is very helpful for just such occasions:

	public String toString() {
		try {
			return BeanUtils.describe(this).toString();
		} catch (Exception e) {
			Logger.getLogger(this.getClass()).error("Error converting object to String", e);
		}
		return super.toString();
	}

Comments (3)

HTML Parsing using the Firefox DLLs

One of my first posts was a comparison of HTML parsers. Today I found a particularly challenging document to parse. None of the parsers I had compared earlier were able to handle the malformed HTML in this table where the td elements were prematurely ended. The behavior of Neko and HtmlCleaner made the most sense (while still failing to clean the document) while the output from TagSoup and jTidy was a bit more strange.

However, I noticed that FireBug parsed the document correctly. So I did a bit of research into how I’d be able to use Firefox’s HTML parsing and found a project called Mozilla Parser that had been put together to do just that. Its setup is not quite as nice as the others, but is well documented. Follow the quick start to begin with. Then when you get to the portion where you write actual Java code you may want to follow the example below as it appears the API has been updated since the documentation was posted.

final String BASE_PATH = "C:\\Documents and Settings\\bjm733\\My Documents\\workspace\\MozillaHtmlParser\\";

try {
	File parserLibraryFile = new File(BASE_PATH + "native" + File.separator + "bin" + File.separator + "MozillaParser" + EnviromentController.getSharedLibraryExtension());
	String parseLibrary = parserLibraryFile.getAbsolutePath();
	MozillaParser.init(parseLibrary, BASE_PATH + "mozilla.dist.bin."+EnviromentController.getOperatingSystemName());
	MozillaParser parser = new MozillaParser();
	document = parser.parse("<html><body>hello world</body></html>");
} catch(Exception e) {
	e.printStackTrace();
}

The most unfortunate thing about this approach is that it is not pure Java, which can be a deal breaker in many situations. Also it’s not well maintained with responsive developers.

Comments (6)

Intro to URL Rewriting with Apache’s .htaccess

I have created an .htaccess file to do URL rewriting for every site I’ve ever created. If you’re not familiar with URL rewriting, it is used to modify a URL or redirect the user before the requested resource is fetched. One of its major uses is to make URLs human readable. That means your users can visit a pretty URL like http://www.tabworldonline.com/guitar/A/ and have it interpreted by the server as http://www.tabworldonline.com/artists.php?instrument=guitar&letter=A.

Most of the time, this file can be relatively simple. I would always recommend using one for URL canonicalization, which is a fancy term for making sure you have one unique URL for each page. For example, lumidant.com redirects to www.lumidant.com. This is beneficial for SEO because you want to ensure that search engines don’t split your ranking points between pages that are actually one and the same.

The code below is the .htaccess file from this site. The declarations in the file are regular expressions, which you might need to get a quick refresher on if you’re not familiar with. A few other things to be aware of include the fact that [NC] stands for no case and means that the text is not case-sensitive, [R=301] tells the server to do a 301 redirect, and [L] tells the server it can quit there and and not bother processing the rest of the file.

<IfModule mod_rewrite.c>

  RewriteEngine on

  # rewrite all lumidant.com requests to the lumidant subdirectory
  RewriteCond %{HTTP_HOST} ^(www\.)?lumidant\.com$
  # this is needed to stop infinite looping
  RewriteCond %{REQUEST_URI} !^/lumidant/.*$
  # don't redirect these directories to the lumidant subdirectory
  RewriteCond %{REQUEST_URI} !^/pinknews/.*$
  RewriteRule ^(.*)$ /lumidant/$1

  # if you're asking for a directory and there is no trailing slash then add one
  RewriteCond %{REQUEST_FILENAME} -d
  RewriteCond %{REQUEST_URI} !^.*/$
  RewriteRule ^/lumidant/(.*)$ http://www\.lumidant\.com%{REQUEST_URI}/ [R=301,L]

  # add a www if there's not one
  RewriteCond %{HTTP_HOST} ^lumidant\.com$ [NC]
  RewriteCond %{REQUEST_URI} !^/blog.*$
  RewriteRule ^lumidant/(.*)$ http://www\.lumidant\.com/$1 [R=301,L]

</IfModule>

This blog is currently hosted with BlueHost. For accounts with multiple domains, BlueHost places the add-on domains in subdirectories of the main domain. This can be confusing to maintain, so I moved all of the lumidant code to a subdirectory as well and then updated the .htaccess file to make this organization transparent to the end user.

The last few lines add a www to all non-www pages. While I could have placed this at the beginning of the file, the file would be executed again after the redirect causing possibly another redirect to be executed if a trailing slash needed to be added. Keep in mind while organizing the file that you’d like to minimize the number of redirects for many reasons including response times, reducing server load, and optimizing for search engines.

URL rewriting can be tricky at first, especially if you’re not familiar with regular expressions. If you’re working with redirections, then it may help to check the HTTP headers of your request to see what intermediate redirects are occurring.

Finally, if you’re not using Apache there are other alternatives to .htaccess. For example, I have used the UrlRewriteFilter in the past for Java web apps.

Comments

Suppressing Compile Warnings with Java Annotations

If you’ve used Java 1.5 Generics much then you’re probably familiar with the following compile warning: “Type safety: The expression of type List needs unchecked conversion to conform to List<String>” or similar. It turns out there’s a rather simple solution with annotations to ignore this problem:

@SuppressWarnings(“unchecked”)

A couple other possible uses of the annotation that might be of interest are:

@SuppressWarnings(“deprecation”)
@SuppressWarnings(“serial”)

These are compiler specific, so you may want to check out the full Eclipse list, which is a bit lengthier than Sun’s 7 options (all, deprecation, unchecked, fallthrough, path, serial, and finally).

Also, multiple statements can be combined into one as follows:

@SuppressWarnings({“unchecked”, “deprecation”})

Comments (1)

Showdown – Java HTML Parsing Comparison

I had to do some HTML parsing today, but unfortunately most HTML on the web is not well-formed like any markup I’d create. Missing end tags and other broken syntax throws a wrench into the situation. Luckily, others have already addressed this issue. Many times over in fact, leaving many to wonder which solution to implement.

Once you parse HTML, you can do some cool stuff with it like transform it or extract some information. For that reason it is sometimes used for screen scraping. So, to test the parsing libraries, I decided to do exactly that and see if I could parse the HTML well enough to extract links from it using an XQuery. The contenders were NekoHTML, HtmlCleaner, TagSoup, and jTidy. I know that there are many others I could have chosen from as well, but this seemed to be a good sampling and there’s only so much time in the day. I also chose 10 URLs to parse. Being a true Clevelander I picked the sites of a number of local attractions. I’m right near all of the stadiums, so the Quicken Loans Arena website was my first target. I sometimes jokingly refer to my city as the “Mistake on the Lake” and the pure awfulness of the HTML from my city did not fail me. The ten URLs I chose are:

http://www.theqarena.com

http://cleveland.indians.mlb.com

http://www.clevelandbrowns.com

http://www.cbgarden.org

http://www.clemetzoo.com

http://www.cmnh.org

http://www.clevelandart.org

http://www.mocacleveland.org

http://www.glsc.org

http://www.rockhall.com

I gave each library an InputStream created from a URL (referred to as urlIS in the code samples below) and expected an org.w3c.dom.Node in return once the parse operation was completed. I implemented each library in its own class extending from an AbstractScraper implementing a Scraper interface I created. This was a design tip fresh in my mind from reading my all-time favorite technical book: Effective Java by Josh Bloch. The implementation specific code for each library is below:

NekoHTML:

final DOMParser parser = new DOMParser();
try {
	parser.parse(new InputSource(urlIS));
	document = parser.getDocument();
} catch (SAXException e) {
	e.printStackTrace();
} catch (IOException e) {
	e.printStackTrace();
}

TagSoup:

final Parser parser = new Parser();
SAX2DOM sax2dom = null;
try {
	sax2dom = new SAX2DOM();
	parser.setContentHandler(sax2dom);
	parser.setFeature(Parser.namespacesFeature, false);
	parser.parse(new InputSource(urlIS));
} catch (Exception e) {
	e.printStackTrace();
}
document = sax2dom.getDOM();

jTidy:

final Tidy tidy = new Tidy();
tidy.setQuiet(true);
tidy.setShowWarnings(false);
tidy.setForceOutput(true);
document = tidy.parseDOM(urlIS, null);

HtmlCleaner:

final HtmlCleaner cleaner = new HtmlCleaner(urlIS);
try {
	cleaner.clean();
	document = cleaner.createDOM();
} catch (Exception e) {
	e.printStackTrace();
}

Finally, to judge the ability to parse the HTML, I ran the XQuery “//a” to grab all the <a> tags from the document. The only one of these parsing libraries I had used before was jTidy. It was able to extract the links from 5 of the 10 documents. However, the clear winner was HtmlCleaner. It was the only library to successfully clean 10/10 documents. Most of the others were not able to make it past even the very first link I provided, which was to Quicken Loans Arena site. HtmlCleaner’s full results:

Found 87 links at http://www.theqarena.com/
Found 156 links at http://cleveland.indians.mlb.com/
Found 96 links at http://www.clevelandbrowns.com/
Found 106 links at http://www.cbgarden.org/
Found 70 links at http://www.clemetzoo.com/
Found 23 links at http://www.cmnh.org/site/
Found 27 links at http://www.clevelandart.org/
Found 51 links at http://www.mocacleveland.org/
Found 27 links at http://www.glsc.org/
Found 90 links at http://www.rockhall.com/

One disclaimer that I will make is that I did not go out of my way to improve the performance of any of these libraries. Some of them had additional options that could be set to possibly improve performance. I did not delve into wading through the documentation to figure out what these options were and simply used the plain vanilla incantations. HtmlCleaner seems to offer me everything I need and was quick and easy to implement.  One drawback to HtmlCleaner is that it’s not available in a Maven repository.  Sometimes NekoHTML may be easier to use for this reason.  Note also that at the time of writing the last released version of jTidy was from 2000.  A newer .jar has since been made available that will likely perform better.

Comments (26)

Running Ant within Eclipse

I tried to run an Ant target within Eclipse today and got a fun error:

BUILD FAILED

java.lang.NoClassDefFoundError: com/sun/javadoc/Type

This happened because Ant could not find tools.jar, which contains classes to used to run javac and javadoc. The solution:

  • Open up Window->Preferences->Java->Installed JREs
  • Set your default JRE to a JDK/SDK
  • Open up the one you just set as default, click “Add External JARs”, and then add tools.jar located in the JDK lib directory.

Eclipse’s Edit JRE Screen

Comments (1)

Hibernate ASM Incompatibilities

A few times now I’ve tried to use ASM 2.2.3 in an application that was utilizing Hibernate. It doesn’t work the way you’d hope. The easiest solution I’ve found is to remove the CGLib and ASM (1.5.3), which come with Hibernate and instead replace them with a copy of cglib-nodep. Using the no dependencies version of CGLib you can safely use the newer ASM without conflict.

I’ve heard this problem was occurring for Spring users before version 2.5. The two places I’ve encountered it were alongside Groovy and today with CXF.  This solution worked in both instances and I am now using both Groovy and CXF alongside Hibernate without any Java exceptions being thrown.

Comments (2)