Comments on: Showdown – Java HTML Parsing Comparison

By: Peter

Peter — Thu, 08 Sep 2016 08:40:02 +0000

This is a very useful post! Thank you very much.
I like HtmlCleaner. It saves me a lot of time when parsing HTML files.

By: Lizabeth Gronowski

Lizabeth Gronowski — Sat, 14 Apr 2012 16:57:45 +0000

I’m particularly pleased to have found this site. Thank you for the time you put into writing this superb post. I appreciated every part of it, and I have bookmarked your site to check for new articles and blog posts.

By: Victor

Victor — Sat, 25 Jun 2011 08:13:31 +0000

Thanks for this excellent article. Three years after the initial comparison, the results are still useful for choosing the appropriate HTML parser for our needs. I will use HtmlCleaner, by the way…

By: marcelo camanho

marcelo camanho — Sat, 23 Apr 2011 00:34:05 +0000

utkarsh, you can use the following:

CleanerProperties props = new CleanerProperties();
// Not sure if you will need this, but I needed it.
props.setNamespacesAware(false);

// "clean" is the TagNode returned by HtmlCleaner.clean(...)
DomSerializer dom = new DomSerializer(props);
Document doc = dom.createDOM(clean);

By: utkarsh

utkarsh — Thu, 07 Apr 2011 05:05:41 +0000

Hi Ben, in your code for HtmlCleaner you are using
document=cleaner.createDOM();
However, this method is not present in the HtmlCleaner class.
Could you please help me?
Thanks

By: utkarsh

utkarsh — Tue, 22 Mar 2011 06:59:59 +0000

Thanks, Ben.

By: Ben

Ben — Mon, 21 Mar 2011 17:52:10 +0000

Hi Utkarsh,
urlIs is a java.io.InputStream. One way of getting an InputStream is to call URL.openStream(). If you’re reading from files on disk you’ll probably want to use a java.io.FileInputStream.
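
To make the two options concrete, here is a minimal sketch using only the standard library. The class and method names (InputStreamSources, fromFile, fromUrl, readAll) are made up for illustration and are not part of the article's code or of any parser library; the resulting stream would then be handed to whichever parser you are using. It assumes Java 9 or later for readAllBytes().

```java
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStream;
import java.net.URI;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical helper class, for illustration only.
public class InputStreamSources {

    // Open a stream over an HTML file stored on disk.
    static InputStream fromFile(String path) throws IOException {
        return new FileInputStream(path);
    }

    // Open a stream over a URL (http:, https:, or file:).
    static InputStream fromUrl(String url) throws IOException {
        return URI.create(url).toURL().openStream();
    }

    // Read the whole stream as UTF-8 text and close it afterwards.
    static String readAll(InputStream in) throws IOException {
        try (InputStream s = in) {
            return new String(s.readAllBytes(), StandardCharsets.UTF_8);
        }
    }

    public static void main(String[] args) throws IOException {
        // Write a small page to a temporary file so the example is self-contained.
        Path page = Files.createTempFile("page", ".html");
        Files.write(page, "<html><body><p>hello</p></body></html>"
                .getBytes(StandardCharsets.UTF_8));

        // The same file, reached once by path and once by file: URL.
        System.out.println(readAll(fromFile(page.toString())));
        System.out.println(readAll(fromUrl(page.toUri().toString())));

        Files.delete(page);
    }
}
```

Either way you end up with a plain java.io.InputStream, so the parsing code does not need to know whether the HTML came from the network or from disk.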

By: utkarsh

utkarsh — Mon, 21 Mar 2011 11:59:52 +0000

What is urlIS here? Which class is it an object of?
Actually, I am trying to parse an HTML page stored on disk.
How do I supply a FileReader object to the HtmlCleaner constructor?

By: Mario Gaitán

Mario Gaitán — Wed, 02 Mar 2011 04:39:50 +0000

Thanks for the post, great comparison!!!

By: David

David — Mon, 17 Jan 2011 06:52:46 +0000

Thanks to all for your posts and your time!!!
I really appreciate it!