JSoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

I had been using the HTMLParser library for many years, even though its development stopped long ago, because it met my needs. But when it came time to parse a particular site I wanted to check, I found its HTML hard to parse because the web developers had done a sloppy job of structuring the data.

That was when I decided to try JSoup. With very good results!

JSoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do. Other features include:

  • scrape and parse HTML from a URL, file, or string
  • find and extract data, using DOM traversal or CSS selectors
  • manipulate the HTML elements, attributes, and text
  • clean user-submitted content against a safe white-list, to prevent XSS attacks (see the sketch after this list)
  • output tidy HTML
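
As a quick aside on that last cleaning point, here is a minimal sketch of how I would use it: Jsoup.clean() with the library's built-in basic white-list. The sample markup here is my own, and note that newer jsoup releases renamed Whitelist to Safelist, so the import may differ in your version.

import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

String unsafe = "<p onclick=\"steal()\">Hello <b>world</b></p>";
// Tags and attributes not on the white-list are dropped,
// so the onclick handler disappears from the output
String safe = Jsoup.clean(unsafe, Whitelist.basic());
System.out.println(safe);  // <p>Hello <b>world</b></p>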

JSoup is also very easy to use. Its jQuery-like style for selecting portions of the HTML makes it very convenient.
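
To give you a taste of that selector style, here is a small, self-contained sketch (the sample HTML is mine):

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.select.Elements;

String html = "<body><p id=\"spoof\">This is a sample paragraph</p></body>";
Document doc = Jsoup.parse(html);
// jQuery-style CSS selector: the <p> element whose id is "spoof"
Elements matches = doc.select("p#spoof");
System.out.println(matches.outerHtml());
// For a live page, use Jsoup.connect(url).get() instead of parse()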

Goodbye old library. JSoup is what I will be using from now on!

There are many tools and libraries you can use to parse HTML pages. In Java, one of the popular ones is called HTML Parser, which is what I use. It is not an application but a Java library that you plug into your classpath when compiling and running an application that uses it. Go over to their site http://htmlparser.sourceforge.net/ and download it. The extracted archive contains the JAR file, samples, and documentation.

I mainly use HTML Parser for extraction, but you can also use it for transformation. One cool feature is its filters, which help immensely in pulling out only the HTML tags you need.

Here is sample code that uses the HasAttributeFilter class to keep only the tags that contain a given attribute. I use the FilterBean class in this example to fetch the page's content. You can also use the Parser class to do the same thing (a sketch of that follows the example). Using either is up to your preference.

import org.htmlparser.NodeFilter;
import org.htmlparser.beans.FilterBean;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.util.NodeList;

try {
  // Keep only the tags whose id attribute equals "spoof"
  NodeFilter[] nff = {new HasAttributeFilter("id", "spoof")};
  FilterBean fb = new FilterBean();
  fb.setFilters(nff);
  fb.setURL(link);  // link is the URL of the page to parse
  NodeList pageNodeList = fb.getNodes();
  System.out.println(pageNodeList.toHtml());
} catch (Exception e) {
  e.printStackTrace();  // don't swallow the exception silently
}
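
And for comparison, here is roughly how the same extraction looks with the Parser class. This is just a sketch based on the documented Parser API, with link again standing in for the page URL:

import org.htmlparser.Parser;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

try {
  Parser parser = new Parser(link);  // link is the page URL, as above
  // Collect every node that passes the filter
  NodeList pageNodeList =
      parser.extractAllNodesThatMatch(new HasAttributeFilter("id", "spoof"));
  System.out.println(pageNodeList.toHtml());
} catch (ParserException e) {
  e.printStackTrace();
}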

Suppose the page at our link contains the following HTML:

<body>
<p id="spoof">This is a sample paragraph</p>
<p id="officeid">Office id is 000123</p>
</body>

Once you execute that code, the output to System.out would be:

<p id="spoof">This is a sample paragraph</p>

The NodeList class is patterned after the Vector class and can be broken into separate tags; you just need to loop through it. The API documentation covers all the HTML Parser classes you can use for your parsing needs. Take another filter as an example, the TagNameFilter. If you replace HasAttributeFilter in the code above with this:

new TagNameFilter("p")

System.out will print both matching tags as one string:
<p id="spoof">This is a sample paragraph</p>
<p id="officeid">Office id is 000123</p>

If you need to access each <p> tag separately, you need to loop through the pageNodeList object like this:

for (int i = 0; i < pageNodeList.size(); i++) {
  // elementAt() already returns a Node, so no cast is needed
  System.out.println(pageNodeList.elementAt(i).toHtml());
}

There you have it. HTML parsing is so easy when using this helper library. It saves you the time and trouble of creating your own parser. Feel free to comment if you have questions and/or problems.
