JSoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.
I had been using the HTMLParser library for many years, even though its development stopped long ago, because it met my needs. But when it came time to parse a particular site that I wanted to check, I found the HTML hard to work with because the site's developers had structured the data so poorly.
That was when I decided to try JSoup, with very good results!
JSoup implements the WHATWG HTML5 specification and parses HTML to the same DOM as modern browsers do. Other features include:
- scrape and parse HTML from a URL, file, or string
- find and extract data, using DOM traversal or CSS selectors
- manipulate the HTML elements, attributes, and text
- clean user-submitted content against a safe white-list, to prevent XSS attacks
- output tidy HTML
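To make a few of these features concrete, here is a minimal sketch rather than a definitive recipe. The class name, method names, and sample HTML are made up for illustration. It assumes jsoup 1.14.2 or later on the classpath, where the safe-list class is called Safelist (older releases call it Whitelist).

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.safety.Safelist;

public class JsoupFeatures {

    // Parse an HTML string and return the text of the first element
    // matching a CSS selector, or an empty string if nothing matches.
    static String firstText(String html, String cssQuery) {
        Document doc = Jsoup.parse(html);
        Element match = doc.selectFirst(cssQuery);
        return match == null ? "" : match.text();
    }

    // Manipulate attributes: add rel="nofollow" to every link in a fragment
    // and return the resulting body HTML.
    static String markLinks(String fragment) {
        Document doc = Jsoup.parseBodyFragment(fragment);
        doc.select("a[href]").attr("rel", "nofollow");
        return doc.body().html();
    }

    // Clean untrusted markup against jsoup's basic safe-list,
    // which drops elements such as <script>.
    static String sanitize(String untrusted) {
        return Jsoup.clean(untrusted, Safelist.basic());
    }

    public static void main(String[] args) {
        // Deliberately sloppy markup: unquoted attributes, unclosed tags.
        String html = "<div id=news><p class=title>Hello <b>world</b><p>Second";
        System.out.println(firstText(html, "p.title"));
        System.out.println(markLinks("<a href='/about'>About</a>"));
        System.out.println(sanitize("<p>Hi<script>alert(1)</script></p>"));
    }
}
```

To fetch from a URL instead of a string, Jsoup.connect(url).get() returns the same kind of Document; it is left out here so the sketch runs without network access.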
JSoup is very easy to use, too. Its jQuery-like style of searching for parts of an HTML document makes it very convenient.
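As a rough illustration of that selector style, the sketch below collects every link inside an element with a given id. The class name, the linksIn helper, the sample markup, and the example.com base URI are all invented for the example. It assumes jsoup is on the classpath.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

public class SelectorSketch {

    // Return the absolute URL of every link found inside the element
    // with the given id. The base URI lets jsoup resolve relative hrefs.
    static List<String> linksIn(String html, String baseUri, String id) {
        Document doc = Jsoup.parse(html, baseUri);
        List<String> urls = new ArrayList<>();
        // "#menu a[href]" reads just like a jQuery or CSS selector.
        for (Element link : doc.select("#" + id + " a[href]")) {
            urls.add(link.absUrl("href"));
        }
        return urls;
    }

    public static void main(String[] args) {
        String html = "<ul id=\"menu\">"
                + "<li><a href=\"/a\">A</a></li>"
                + "<li><a href=\"b.html\">B</a></li>"
                + "</ul>"
                + "<a href=\"/skip\">outside the menu</a>";
        for (String url : linksIn(html, "https://example.com/docs/", "menu")) {
            System.out.println(url);
        }
    }
}
```

The link outside the menu is skipped because the selector only matches descendants of the element with that id.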
Goodbye old library. JSoup is what I will be using from now on!