There are many tools and libraries for parsing HTML pages. In Java, one of the popular ones is HTML Parser, which is what I use. It is not an application but a Java library that you add to your classpath when compiling and running your application. Go to the project site at http://htmlparser.sourceforge.net/ and download it. The extracted archive contains the library JAR file, samples, and documentation.

I mainly use HTML Parser for extraction, but you can also use it for transformation. One handy feature is its filters, which help immensely in getting only the HTML tags you need.

Here is sample code that uses the HasAttributeFilter class to keep only the tags whose id attribute has the value "spoof". I use the FilterBean class in this example to access the page’s content. You can also use the Parser class to do the same thing; which one you use is a matter of preference.

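import org.htmlparser.NodeFilter;
import org.htmlparser.beans.FilterBean;
import org.htmlparser.filters.HasAttributeFilter;
import org.htmlparser.util.NodeList;

// link is a String holding the URL of the page to parse
try {
  // Keep only tags whose id attribute equals "spoof"
  NodeFilter[] nff = {new HasAttributeFilter("id", "spoof")};
  FilterBean fb = new FilterBean();
  fb.setFilters(nff);
  fb.setURL(link);
  NodeList pageNodeList = fb.getNodes();
  System.out.println(pageNodeList.toHtml());
} catch (Exception e) {
  e.printStackTrace(); // report the failure instead of silently swallowing it
}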

Suppose the page at link contains the following HTML:

<body>
<p id="spoof">This is a sample paragraph</p>
<p id="officeid">Office id is 000123</p>
</body>

When you run that code, System.out prints:

<p id="spoof">This is a sample paragraph</p>

The NodeList class is patterned after the Vector class, so its contents can be accessed as separate tags; you just need to loop over them. The API documentation covers all the HTML Parser classes you can use for your parsing needs. Take another filter as an example, TagNameFilter. If you replace the HasAttributeFilter in the code with this:

new TagNameFilter("p")

System.out will print both tags as one string:
<p id="spoof">This is a sample paragraph</p>
<p id="officeid">Office id is 000123</p>
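
To make the difference between the two filters concrete, here is a conceptual sketch in plain Java. It does not use HTML Parser at all: the Tag class and the hasAttribute, tagName, select and samplePage helpers are hypothetical stand-ins I made up for illustration, and the real filters may differ in details such as case handling. It only shows the idea that a filter is a yes/no test applied to each tag on the page.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.function.Predicate;

// Conceptual sketch only: Tag and the helper methods are hypothetical
// stand-ins, not classes from the HTML Parser library.
final class FilterSketch {
    static final class Tag {
        final String name;
        final Map<String, String> attributes;
        final String html;

        Tag(String name, Map<String, String> attributes, String html) {
            this.name = name;
            this.attributes = attributes;
            this.html = html;
        }
    }

    // Accepts tags whose attribute is present and equal to the given value.
    static Predicate<Tag> hasAttribute(String attribute, String value) {
        return tag -> value.equals(tag.attributes.get(attribute));
    }

    // Accepts tags with the given name, ignoring case.
    static Predicate<Tag> tagName(String name) {
        return tag -> tag.name.equalsIgnoreCase(name);
    }

    // Returns the HTML of every tag the filter accepts, in page order.
    static List<String> select(List<Tag> page, Predicate<Tag> filter) {
        List<String> kept = new ArrayList<>();
        for (Tag tag : page) {
            if (filter.test(tag)) {
                kept.add(tag.html);
            }
        }
        return kept;
    }

    // The two paragraphs from the sample page.
    static List<Tag> samplePage() {
        return List.of(
            new Tag("p", Map.of("id", "spoof"),
                "<p id=\"spoof\">This is a sample paragraph</p>"),
            new Tag("p", Map.of("id", "officeid"),
                "<p id=\"officeid\">Office id is 000123</p>"));
    }

    public static void main(String[] args) {
        // The attribute test accepts one paragraph, the name test accepts both.
        System.out.println(select(samplePage(), hasAttribute("id", "spoof")));
        System.out.println(select(samplePage(), tagName("p")));
    }
}
```

The attribute test keeps one paragraph and the tag-name test keeps both, which mirrors the two outputs shown for the sample page.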

If you need to access each <p> tag separately, loop over the pageNodeList object like this:

for (int i = 0; i < pageNodeList.size(); i++) {
  // elementAt already returns a Node, so no cast is needed
  System.out.println(pageNodeList.elementAt(i).toHtml());
}
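
If the Vector comparison is unfamiliar, the same size() and elementAt() loop can be tried on a plain java.util.Vector without the library installed. This is only a rough sketch: the strings stand in for the <p> nodes, and VectorLoopSketch and joinByIndex are names I made up for illustration.

```java
import java.util.Vector;

// Stand-in for a NodeList: plain strings play the role of the <p> nodes.
final class VectorLoopSketch {
    // Visits every element by index and puts each one on its own line.
    static String joinByIndex(Vector<String> items) {
        StringBuilder out = new StringBuilder();
        for (int i = 0; i < items.size(); i++) {
            out.append(items.elementAt(i)).append('\n');
        }
        return out.toString();
    }

    public static void main(String[] args) {
        Vector<String> tags = new Vector<>();
        tags.add("<p id=\"spoof\">This is a sample paragraph</p>");
        tags.add("<p id=\"officeid\">Office id is 000123</p>");
        System.out.print(joinByIndex(tags));
    }
}
```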

There you have it. Parsing HTML is easy with this helper library, and it saves you the time and trouble of writing your own parser. Feel free to leave a comment if you have questions or problems.
