<?xml version="1.0" encoding="UTF-8"?><rss version="2.0"
	xmlns:content="http://purl.org/rss/1.0/modules/content/"
	xmlns:dc="http://purl.org/dc/elements/1.1/"
	xmlns:atom="http://www.w3.org/2005/Atom"
	xmlns:sy="http://purl.org/rss/1.0/modules/syndication/"
		>
<channel>
	<title>Comments on: Showdown &#8211; Java HTML Parsing Comparison</title>
	<atom:link href="http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/feed/" rel="self" type="application/rss+xml" />
	<link>http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/</link>
	<description>The software development weblog of Benjamin McCann.</description>
	<lastBuildDate>Sat, 20 Mar 2010 14:42:36 +0000</lastBuildDate>
	<generator>http://wordpress.org/?v=2.9.2</generator>
	<sy:updatePeriod>hourly</sy:updatePeriod>
	<sy:updateFrequency>1</sy:updateFrequency>
		<item>
		<title>By: deepak</title>
		<link>http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/comment-page-1/#comment-21444</link>
		<dc:creator>deepak</dc:creator>
		<pubDate>Thu, 04 Mar 2010 07:31:47 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-21444</guid>
		<description>Nice article with good coverage. I have found LOBO to be somewhat useful.

Sorry, I am taking this off topic.

Jonathan, your implementation looks interesting and useful for our startup, which is scraping stock prices from all over the internet. However, we are planning to use a third-party XQuery implementation, and I need an org.w3c.dom.Document. Your implementation has a custom Document. How can I convert your document model to an org.w3c.dom.Document tree? Any pointers would be much appreciated.</description>
		<content:encoded><![CDATA[<p>Nice article with good coverage. I have found LOBO to be somewhat useful.</p>
<p>Sorry, I am taking this off topic.</p>
<p>Jonathan, your implementation looks interesting and useful for our startup, which is scraping stock prices from all over the internet. However, we are planning to use a third-party XQuery implementation, and I need an org.w3c.dom.Document. Your implementation has a custom Document. How can I convert your document model to an org.w3c.dom.Document tree? Any pointers would be much appreciated.</p>
]]></content:encoded>
	</item>
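<!--
The conversion asked about above can be approached generically by walking the custom tree and creating matching W3C DOM nodes. The sketch below is only an illustration under stated assumptions: SimpleNode is a hypothetical stand-in for a parser's own element type (it is not jsoup's API), and W3cConverter is an invented helper name. Only the javax.xml.parsers and org.w3c.dom calls are real JDK APIs.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical stand-in for a parser's custom element type (assumption, not a real library class).
final class SimpleNode {
    final String tag;
    final String text; // may be null when the element has no direct text
    final Map<String, String> attributes = new LinkedHashMap<>();
    final List<SimpleNode> children = new ArrayList<>();

    SimpleNode(String tag, String text) {
        this.tag = tag;
        this.text = text;
    }
}

final class W3cConverter {
    // Builds a new, empty W3C document and copies the custom tree into it.
    static Document toW3c(SimpleNode root) throws ParserConfigurationException {
        Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();
        doc.appendChild(convert(doc, root));
        return doc;
    }

    // Recursively creates an Element with the same tag, attributes, text, and children.
    private static Element convert(Document doc, SimpleNode node) {
        Element element = doc.createElement(node.tag);
        node.attributes.forEach(element::setAttribute);
        if (node.text != null) {
            element.appendChild(doc.createTextNode(node.text));
        }
        for (SimpleNode child : node.children) {
            element.appendChild(convert(doc, child));
        }
        return element;
    }

    public static void main(String[] args) throws Exception {
        SimpleNode table = new SimpleNode("table", null);
        table.attributes.put("id", "prices");
        table.children.add(new SimpleNode("td", "42.10"));
        Document doc = toW3c(table);
        System.out.println(doc.getDocumentElement().getTagName());
    }
}
```

The resulting Document can then be handed to any tool that expects org.w3c.dom.Document. A real converter would also need to handle comments, mixed text and element content, and tag names that are not valid XML names.
-->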
	<item>
		<title>By: Jonathan Hedley</title>
		<link>http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/comment-page-1/#comment-20172</link>
		<dc:creator>Jonathan Hedley</dc:creator>
		<pubDate>Tue, 02 Feb 2010 20:21:27 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-20172</guid>
		<description>Hi Ben and readers,

Self plug: I&#039;ve just released a new open source HTML parser called jsoup @ http://jsoup.org/ . Its goal is to deal with all real-world HTML. The interface is designed a bit differently from the above parsers; it combines the best of DOM, CSS, and jQuery-like accessor methods.

I ran the URLs from the showdown through it:

List&lt;String&gt; urls = Arrays.asList(&quot;http://www.theqarena.com&quot;, &quot;http://cleveland.indians.mlb.com&quot;,
  &quot;http://www.clevelandbrowns.com&quot;, &quot;http://www.cbgarden.org&quot;, &quot;http://www.clemetzoo.com&quot;,
  &quot;http://www.cmnh.org/site/&quot;, &quot;http://www.clevelandart.org&quot;, &quot;http://www.mocacleveland.org&quot;,
  &quot;http://www.glsc.org&quot;, &quot;http://www.rockhall.com&quot;);

for (String urlString : urls) {
  URL url = new URL(urlString);
  Document doc = Jsoup.parse(url, 3*1000);
  Elements links = doc.select(&quot;a&quot;);

  System.out.println(String.format(&quot;Found %d links at %s&quot;, links.size(), urlString));
}

The results: (all parse successfully)
Found 71 links at http://www.theqarena.com
Found 330 links at http://cleveland.indians.mlb.com
Found 105 links at http://www.clevelandbrowns.com
Found 135 links at http://www.cbgarden.org
Found 86 links at http://www.clemetzoo.com
Found 320 links at http://www.cmnh.org/site/
Found 103 links at http://www.clevelandart.org
Found 63 links at http://www.mocacleveland.org
Found 34 links at http://www.glsc.org
Found 90 links at http://www.rockhall.com

Probably all of these sites have been updated since the original post, so the results are not totally comparable, but I think it&#039;s a pretty good showing, and the interface is (imho) a lot easier to use.</description>
		<content:encoded><![CDATA[<p>Hi Ben and readers,</p>
<p>Self plug: I&#8217;ve just released a new open source HTML parser called jsoup @ <a href="http://jsoup.org/" rel="nofollow">http://jsoup.org/</a> . Its goal is to deal with all real-world HTML. The interface is designed a bit differently from the above parsers; it combines the best of DOM, CSS, and jQuery-like accessor methods.</p>
<p>I ran the URLs from the showdown through it:</p>
<p>List&lt;String&gt; urls = Arrays.asList(&quot;http://www.theqarena.com&quot;, &quot;http://cleveland.indians.mlb.com&quot;,<br />
  &quot;http://www.clevelandbrowns.com&quot;, &quot;http://www.cbgarden.org&quot;, &quot;http://www.clemetzoo.com&quot;,<br />
  &quot;http://www.cmnh.org/site/&quot;, &quot;http://www.clevelandart.org&quot;, &quot;http://www.mocacleveland.org&quot;,<br />
  &quot;http://www.glsc.org&quot;, &quot;http://www.rockhall.com&quot;);</p>
<p>for (String urlString : urls) {<br />
  URL url = new URL(urlString);<br />
  Document doc = Jsoup.parse(url, 3*1000);<br />
  Elements links = doc.select(&quot;a&quot;);</p>
<p>  System.out.println(String.format(&quot;Found %d links at %s&quot;, links.size(), urlString));<br />
}</p>
<p>The results: (all parse successfully)<br />
Found 71 links at <a href="http://www.theqarena.com" rel="nofollow">http://www.theqarena.com</a><br />
Found 330 links at <a href="http://cleveland.indians.mlb.com" rel="nofollow">http://cleveland.indians.mlb.com</a><br />
Found 105 links at <a href="http://www.clevelandbrowns.com" rel="nofollow">http://www.clevelandbrowns.com</a><br />
Found 135 links at <a href="http://www.cbgarden.org" rel="nofollow">http://www.cbgarden.org</a><br />
Found 86 links at <a href="http://www.clemetzoo.com" rel="nofollow">http://www.clemetzoo.com</a><br />
Found 320 links at <a href="http://www.cmnh.org/site/" rel="nofollow">http://www.cmnh.org/site/</a><br />
Found 103 links at <a href="http://www.clevelandart.org" rel="nofollow">http://www.clevelandart.org</a><br />
Found 63 links at <a href="http://www.mocacleveland.org" rel="nofollow">http://www.mocacleveland.org</a><br />
Found 34 links at <a href="http://www.glsc.org" rel="nofollow">http://www.glsc.org</a><br />
Found 90 links at <a href="http://www.rockhall.com" rel="nofollow">http://www.rockhall.com</a></p>
<p>Probably all of these sites have been updated since the original post, so the results are not totally comparable, but I think it&#8217;s a pretty good showing, and the interface is (imho) a lot easier to use.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Borgy Manotoy</title>
		<link>http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/comment-page-1/#comment-17322</link>
		<dc:creator>Borgy Manotoy</dc:creator>
		<pubDate>Mon, 23 Nov 2009 09:19:42 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-17322</guid>
		<description>I tried adding this code, hoping that it would convert table tags into TABLE tags, but I had no luck.

	CleanerTransformations transformations = new CleanerTransformations();
	TagTransformation tagTransformation = new TagTransformation(&quot;table&quot;, &quot;TABLE&quot;, true);
	transformations.addTransformation(tagTransformation);
	cleaner.setTransformations(transformations);

This code does not transform &quot;table&quot; into &quot;TABLE&quot;, but if I transform &quot;table&quot; into &quot;TABLET&quot; or any value other than &quot;table&quot;, it does transform.

Is there a way to transform all tags into uppercase at once while retaining all the tag attributes? ^_^</description>
		<content:encoded><![CDATA[<p>I tried adding this code, hoping that it would convert table tags into TABLE tags, but I had no luck.</p>
<p>	CleanerTransformations transformations = new CleanerTransformations();<br />
	TagTransformation tagTransformation = new TagTransformation(&quot;table&quot;, &quot;TABLE&quot;, true);<br />
	transformations.addTransformation(tagTransformation);<br />
	cleaner.setTransformations(transformations);</p>
<p>This code does not transform &quot;table&quot; into &quot;TABLE&quot;, but if I transform &quot;table&quot; into &quot;TABLET&quot; or any value other than &quot;table&quot;, it does transform.</p>
<p>Is there a way to transform all tags into uppercase at once while retaining all the tag attributes? ^_^</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Borgy Manotoy</title>
		<link>http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/comment-page-1/#comment-17320</link>
		<dc:creator>Borgy Manotoy</dc:creator>
		<pubDate>Mon, 23 Nov 2009 08:42:55 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-17320</guid>
		<description>Thanks, Ben ^_^

I have a noob question regarding HtmlCleaner this time.

How can I transform tags from lowercase into uppercase? I want all the tags to be in uppercase, since the XSL file that will use them expects all tags in uppercase form, and I have no idea about XSLT ^_^

Thank you and God bless ^_^</description>
		<content:encoded><![CDATA[<p>Thanks, Ben ^_^</p>
<p>I have a noob question regarding HtmlCleaner this time.</p>
<p>How can I transform tags from lowercase into uppercase? I want all the tags to be in uppercase, since the XSL file that will use them expects all tags in uppercase form, and I have no idea about XSLT ^_^</p>
<p>Thank you and God bless ^_^</p>
]]></content:encoded>
	</item>
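<!--
One way to get uppercase tag names for the XSL step described above is to rename the elements after parsing, on a standard W3C DOM, rather than through HtmlCleaner's TagTransformation. This is a swapped-in technique, not HtmlCleaner's API: the sketch assumes the cleaned markup is already available as an org.w3c.dom.Document, and the class and method names (UppercaseTags, uppercaseElements, parse) are invented for illustration. Document.renameNode is a real DOM Level 3 method.

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import javax.xml.parsers.DocumentBuilderFactory;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

final class UppercaseTags {
    // Renames every element in the document to the uppercase form of its tag name.
    static void uppercaseElements(Document doc) {
        NodeList live = doc.getElementsByTagName("*");
        // Copy first: the NodeList is live and could change while nodes are renamed.
        List<Element> elements = new ArrayList<>();
        for (int i = 0; i < live.getLength(); i++) {
            elements.add((Element) live.item(i));
        }
        for (Element e : elements) {
            // renameNode changes only the name; attributes and children are kept.
            doc.renameNode(e, e.getNamespaceURI(), e.getTagName().toUpperCase(Locale.ROOT));
        }
    }

    // Helper for the demo: parses well-formed markup from a String.
    static Document parse(String xml) throws Exception {
        return DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<table border=\"1\"><tr><td>x</td></tr></table>");
        uppercaseElements(doc);
        Element root = doc.getDocumentElement();
        System.out.println(root.getTagName() + " border=" + root.getAttribute("border"));
    }
}
```

Attribute names are left untouched here; only element names change. Since XML names are case-sensitive, the XSL file will match the renamed elements only if it uses exactly the same uppercase spelling.
-->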
	<item>
		<title>By: Ben</title>
		<link>http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/comment-page-1/#comment-17174</link>
		<dc:creator>Ben</dc:creator>
		<pubDate>Thu, 19 Nov 2009 05:12:19 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-17174</guid>
		<description>Borgy,
If you need to turn a String into an InputStream you can do the following:
new ByteArrayInputStream(string.getBytes())</description>
		<content:encoded><![CDATA[<p>Borgy,<br />
If you need to turn a String into an InputStream you can do the following:<br />
new ByteArrayInputStream(string.getBytes())</p>
]]></content:encoded>
	</item>
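<!--
Ben's one-liner above works as written. One detail worth knowing is that String.getBytes() with no argument uses the platform default charset, so the bytes can differ between machines. A minimal sketch that pins the encoding to UTF-8 follows; the class and method names (StringStreams, toInputStream) are made up for the example.

```java
import java.io.ByteArrayInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.nio.charset.StandardCharsets;

final class StringStreams {
    // Wraps the UTF-8 bytes of the text in an in-memory stream.
    static InputStream toInputStream(String text) {
        return new ByteArrayInputStream(text.getBytes(StandardCharsets.UTF_8));
    }

    public static void main(String[] args) throws IOException {
        String html = "<p>caf\u00e9</p>";
        try (InputStream in = toInputStream(html)) {
            // Decode with the same charset that was used to encode.
            String roundTrip = new String(in.readAllBytes(), StandardCharsets.UTF_8);
            System.out.println(roundTrip.equals(html));
        }
    }
}
```

Whatever consumes the stream has to assume the same encoding, which is why fixing it explicitly on both sides is usually the safer choice.
-->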
	<item>
		<title>By: Borgy Manotoy</title>
		<link>http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/comment-page-1/#comment-17170</link>
		<dc:creator>Borgy Manotoy</dc:creator>
		<pubDate>Thu, 19 Nov 2009 01:07:12 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-17170</guid>
		<description>Nice one, I&#039;ve been looking for this for days already ^_^

Just a question: what if I already have the content that I want to parse? Which of these libraries can parse string content directly rather than from a URL?

I am using this code:

DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();
DocumentBuilder docBuilder = dbfac.newDocumentBuilder();
doc = docBuilder.parse(new java.io.StringBufferInputStream(content));

But I have no luck whenever the content is malformed. I need to convert this content into a Document or some other tree of nodes.

QUESTION:
------------
Which of the tools you mentioned can parse the content directly and make corrections if needed?
It would really be a great help.

Thanks and God bless ^_^
------------</description>
		<content:encoded><![CDATA[<p>Nice one, I&#8217;ve been looking for this for days already ^_^</p>
<p>Just a question: what if I already have the content that I want to parse? Which of these libraries can parse string content directly rather than from a URL?</p>
<p>I am using this code:</p>
<p>DocumentBuilderFactory dbfac = DocumentBuilderFactory.newInstance();<br />
DocumentBuilder docBuilder = dbfac.newDocumentBuilder();<br />
doc = docBuilder.parse(new java.io.StringBufferInputStream(content));</p>
<p>But I have no luck whenever the content is malformed. I need to convert this content into a Document or some other tree of nodes.</p>
<p>QUESTION:<br />
&#8212;&#8212;&#8212;&#8212;<br />
Which of the tools you mentioned can parse the content directly and make corrections if needed?<br />
It would really be a great help.</p>
<p>Thanks and God bless ^_^<br />
&#8212;&#8212;&#8212;&#8212;</p>
]]></content:encoded>
	</item>
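<!--
For the DocumentBuilder code in the comment above, a String can be parsed through a character stream instead of the deprecated StringBufferInputStream. The sketch below shows that variant and also makes the limitation visible: the JDK's XML parser is strict, so malformed markup is rejected with a SAXException rather than repaired. The names StringDomParser, parse, and isWellFormed are invented for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

final class StringDomParser {
    // Parses well-formed markup held in a String; throws SAXException otherwise.
    static Document parse(String content)
            throws ParserConfigurationException, SAXException, IOException {
        DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
        // A character stream sidesteps the byte and charset handling of StringBufferInputStream.
        return builder.parse(new InputSource(new StringReader(content)));
    }

    // Returns true when the strict XML parser accepts the content.
    static boolean isWellFormed(String content) {
        try {
            parse(content);
            return true;
        } catch (SAXException e) {
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) throws Exception {
        Document doc = parse("<html><body><p>hello</p></body></html>");
        System.out.println(doc.getDocumentElement().getTagName());
        // An unclosed br tag is fine for browsers but not for an XML parser.
        System.out.println(isWellFormed("<p>unclosed<br></p>"));
    }
}
```

Because the strict parser cannot correct broken input, real-world HTML generally has to pass through one of the lenient parsers discussed in the article before it can become a DOM tree.
-->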
	<item>
		<title>By: Ipodfilm</title>
		<link>http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/comment-page-1/#comment-15870</link>
		<dc:creator>Ipodfilm</dc:creator>
		<pubDate>Tue, 20 Oct 2009 11:05:38 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-15870</guid>
		<description>Such a nice site. I will visit it more often and read the comments. Thanks a lot.</description>
		<content:encoded><![CDATA[<p>Such a nice site. I will visit it more often and read the comments. Thanks a lot.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Xediks</title>
		<link>http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/comment-page-1/#comment-15506</link>
		<dc:creator>Xediks</dc:creator>
		<pubDate>Sat, 10 Oct 2009 22:01:30 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-15506</guid>
		<description>Great blog, I found everything I was looking for here.</description>
		<content:encoded><![CDATA[<p>Great blog, I found everything I was looking for here.</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: Samarjit Samanta</title>
		<link>http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/comment-page-1/#comment-11633</link>
		<dc:creator>Samarjit Samanta</dc:creator>
		<pubDate>Mon, 15 Jun 2009 15:30:53 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-11633</guid>
		<description>Hi, I support HtmlCleaner, though the syntax has changed by now. HtmlCleaner did the job for me. Here is what I learned while writing an HTML scraping program that, given an XPath, had to extract the innerHTML. I got it working with JavaScript in 2 days, but the rest follows...
HTML Parser: I just couldn&#039;t get the job done, as the APIs are so different and some useful things are missing. It is not even remotely similar to using a DOM parser. It does have Filters, but they are too difficult to manage. I needed to test the others quickly.
TagSoup: it did most of the job well, but it required cleaned HTML.
  something  Some other things needed to be cleaned up

So I had to add jTidy to clean up. But that required saving an intermediate file and then parsing again from it. Alas, my earlier XPath was no longer valid, because the DOM structure of the temporary HTML is different. That doesn&#039;t suit me.
Finally, I tried Mozilla parser, but at my office I don&#039;t have rights to the path variable, so I gave up. Discovering other ways to get started was too time-consuming.
HtmlCleaner did the job for me: no saving of intermediate files, and no creating XPaths from intermediate files. It parsed malformed HTML the way IE or Mozilla parses it (I know this because my JavaScript works in Mozilla as well as IE without any changes).
</description>
		<content:encoded><![CDATA[<p>Hi, I support HtmlCleaner, though the syntax has changed by now. HtmlCleaner did the job for me. Here is what I learned while writing an HTML scraping program that, given an XPath, had to extract the innerHTML. I got it working with JavaScript in 2 days, but the rest follows&#8230;<br />
HTML Parser: I just couldn&#8217;t get the job done, as the APIs are so different and some useful things are missing. It is not even remotely similar to using a DOM parser. It does have Filters, but they are too difficult to manage. I needed to test the others quickly.<br />
TagSoup: it did most of the job well, but it required cleaned HTML.<br />
  something  Some other things needed to be cleaned up</p>
<p>So I had to add jTidy to clean up. But that required saving an intermediate file and then parsing again from it. Alas, my earlier XPath was no longer valid, because the DOM structure of the temporary HTML is different. That doesn&#8217;t suit me.<br />
Finally, I tried Mozilla parser, but at my office I don&#8217;t have rights to the path variable, so I gave up. Discovering other ways to get started was too time-consuming.<br />
HtmlCleaner did the job for me: no saving of intermediate files, and no creating XPaths from intermediate files. It parsed malformed HTML the way IE or Mozilla parses it (I know this because my JavaScript works in Mozilla as well as IE without any changes).</p>
]]></content:encoded>
	</item>
	<item>
		<title>By: A. Shiraz</title>
		<link>http://www.benmccann.com/dev-blog/java-html-parsing-library-comparison/comment-page-1/#comment-8697</link>
		<dc:creator>A. Shiraz</dc:creator>
		<pubDate>Sun, 19 Apr 2009 02:33:14 +0000</pubDate>
		<guid isPermaLink="false">http://www.lumidant.com/blog/java-html-parsing-library-comparison/#comment-8697</guid>
		<description>I tried using Mozilla parser, but it seems too complicated to use. I just want the page text content and the images. I tried htmlparser, but that does not seem to be working properly: http://htmlparser.sourceforge.net/javadoc/org/htmlparser/parserapplications/SiteCapturer.html It does not load images or store them locally. I would just like to reproduce the HTML and get a handle on the images, please.</description>
		<content:encoded><![CDATA[<p>I tried using Mozilla parser, but it seems too complicated to use. I just want the page text content and the images. I tried htmlparser, but that does not seem to be working properly: <a href="http://htmlparser.sourceforge.net/javadoc/org/htmlparser/parserapplications/SiteCapturer.html" rel="nofollow">http://htmlparser.sourceforge.net/javadoc/org/htmlparser/parserapplications/SiteCapturer.html</a> It does not load images or store them locally. I would just like to reproduce the HTML and get a handle on the images, please.</p>
]]></content:encoded>
	</item>
</channel>
</rss>
