Jsoup

What is jsoup?

Java HTML parser that makes sense of real-world HTML soup.

jsoup is a Java library for working with real-world HTML. It provides a very convenient API for extracting and manipulating data, using the best of DOM, CSS, and jQuery-like methods.

jsoup implements the WHATWG HTML specification, and parses HTML to the same DOM as modern browsers do.

With jsoup you can:

parse HTML from a URL, file, or string
find and extract data, using DOM traversal or CSS selectors
manipulate the HTML elements, attributes, and text
clean user-submitted content against a safe white-list, to prevent XSS
output tidy HTML

jsoup is designed to deal with all varieties of HTML found in the wild, from pristine and validating markup to invalid, incomplete tag soup. In every case jsoup will create a sensible parse tree, and this robustness is one of its core strengths.

Package Description
org.jsoup Contains the main Jsoup class, which provides convenient static access to the jsoup functionality.
org.jsoup.examples Contains example programs and use of jsoup.
org.jsoup.helper
org.jsoup.nodes HTML document structure nodes.
org.jsoup.parser Contains the HTML parser, tag specifications, and HTML tokeniser.
org.jsoup.safety Contains the jsoup HTML cleaner, and whitelist definitions.
org.jsoup.select Packages to support the CSS-style element selector.

What can you achieve with jsoup?

jsoup implements the WHATWG HTML5 specification, and parses HTML to the same DOM as modern browsers do.

Jsoup API

jsoup includes many classes, but its three most important classes are:

org.jsoup.Jsoup

org.jsoup.nodes.Document

org.jsoup.nodes.Element

Jsoup.java

Method Description
static Connection connect(String url) Creates a new Connection to the URL, which can then be used to fetch and parse an HTML page.
static Document parse(File in, String charsetName) Parses the contents of a file as HTML, using the given character set.
static Document parse(File in, String charsetName, String baseUri) Parses the contents of a file as HTML, using the given character set; baseUri is used to resolve relative links.
static Document parse(String html) Parses the given HTML string into a Document.
static Document parse(String html, String baseUri) Parses the given HTML string into a Document; baseUri is used to resolve relative links.
static Document parse(URL url, int timeoutMillis) Fetches the given URL and parses the response as HTML.
static String clean(String bodyHtml, Whitelist whitelist) Returns safe HTML from untrusted input HTML, by parsing it and filtering it through a white-list of permitted tags and attributes. (Recent jsoup releases name this class Safelist.)
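As an illustration of the clean method, here is a minimal sketch of sanitizing untrusted markup. It assumes a recent jsoup release, where the filter class is called Safelist; on older releases the same call takes Whitelist.basic() instead. The class name CleanExample and the sample input are made up for this example.

```java
import org.jsoup.Jsoup;
import org.jsoup.safety.Safelist;

class CleanExample {
    // Keeps only the tags and attributes permitted by Safelist.basic()
    // (simple text formatting and links); everything else is dropped.
    static String sanitize(String untrustedHtml) {
        return Jsoup.clean(untrustedHtml, Safelist.basic());
    }

    public static void main(String[] args) {
        // Hypothetical user input containing a script and an event handler.
        String dirty = "<p>Hi <script>alert(1)</script>"
                + "<a href='http://example.com/' onclick='steal()'>link</a></p>";
        System.out.println(sanitize(dirty));
    }
}
```

The script element and the onclick attribute should not survive the cleaning, while the paragraph and the link itself are kept.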

Document.java

Methods Description
Element body() Accessor to the document’s body element.
Charset charset() Returns the charset used in this document.
void charset(Charset charset) Sets the charset used in this document.
Document clone() Create a stand-alone, deep copy of this node, and all of its children.
Element createElement(String tagName) Create a new Element, with this document’s base uri.
static Document createShell(String baseUri) Create a valid, empty shell of a document, suitable for adding more elements to.
Element head() Accessor to the document’s head element.
String location() Get the URL this Document was parsed from.
String nodeName() Get the node name of this node.
Document normalise() Normalise the document.
String outerHtml() Get the outer HTML of this node.
Document.OutputSettings outputSettings() Get the document’s current output settings.
Document outputSettings(Document.OutputSettings outputSettings) Set the document’s output settings.
Document.QuirksMode quirksMode()
Document quirksMode(Document.QuirksMode quirksMode)
Element text(String text) Set the text of the body of this document.
String title() Get the string contents of the document’s title element.
void title(String title) Set the document’s title element.
boolean updateMetaCharsetElement() Returns whether the element with charset information in this document is updated on changes through Document.charset(Charset) or not.
void updateMetaCharsetElement(boolean update) Sets whether the element with charset information in this document is updated on changes through Document.charset(Charset) or not.
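To see a few of these Document methods working together, the following sketch builds a small document from an empty shell. The class name DocumentExample and the sample text are assumptions made for illustration only.

```java
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class DocumentExample {
    static Document build() {
        // createShell gives an empty html/head/body skeleton.
        Document doc = Document.createShell("");
        doc.title("Demo page");
        // createElement makes a detached element; it must be appended to appear in the tree.
        Element paragraph = doc.createElement("p").text("Hello");
        doc.body().appendChild(paragraph);
        return doc;
    }

    public static void main(String[] args) {
        Document doc = build();
        System.out.println(doc.title());        // prints "Demo page"
        System.out.println(doc.body().text());  // prints "Hello"
        // Turn off pretty-printing to get compact output from outerHtml().
        doc.outputSettings().prettyPrint(false);
        System.out.println(doc.outerHtml());
    }
}
```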

Create Document from URL
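One way to do this is with Jsoup.connect, sketched below. The URL, the user-agent string, the 10 second timeout, and the class name FromUrlExample are all placeholder choices, not values from the original article.

```java
import java.io.IOException;

import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FromUrlExample {
    // Builds the connection without sending anything yet.
    static Connection prepare(String url) {
        return Jsoup.connect(url)
                .userAgent("Mozilla/5.0 (jsoup example)") // hypothetical user agent
                .timeout(10_000);                         // milliseconds
    }

    // get() performs the HTTP GET and parses the response body.
    static Document fetch(String url) throws IOException {
        return prepare(url).get();
    }

    public static void main(String[] args) throws IOException {
        Document doc = fetch("https://example.com/");
        System.out.println(doc.title());
        System.out.println(doc.location());
    }
}
```

Running main needs network access; get() throws an IOException if the request fails or the server returns an error status.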

Create Document from File
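A possible sketch using Jsoup.parse(File, String, String) follows. So that it runs anywhere, the example writes its own temporary file first; the base URI https://example.com/, the helper parseViaTempFile, and the class name FromFileExample are assumptions made for the demonstration.

```java
import java.io.File;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FromFileExample {
    // Hypothetical base URI, used only to resolve relative links in the file.
    static final String BASE_URI = "https://example.com/";

    static Document load(File file) throws IOException {
        return Jsoup.parse(file, "UTF-8", BASE_URI);
    }

    // Demo helper: writes the HTML to a temp file, parses it, then deletes the file.
    static Document parseViaTempFile(String html) {
        try {
            Path tmp = Files.createTempFile("jsoup-demo", ".html");
            try {
                Files.write(tmp, html.getBytes(StandardCharsets.UTF_8));
                return load(tmp.toFile());
            } finally {
                Files.delete(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        Document doc = parseViaTempFile("<html><head><title>From file</title></head>"
                + "<body><a href='/about'>About</a></body></html>");
        System.out.println(doc.title());                            // prints "From file"
        System.out.println(doc.select("a").first().absUrl("href")); // prints "https://example.com/about"
    }
}
```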

Create Document from String
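Parsing from a string is the simplest case. The sketch below uses Jsoup.parse(String); the class name FromStringExample and the sample markup are invented for illustration. The second call shows that unclosed tags still produce a usable tree.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FromStringExample {
    static Document parse(String html) {
        return Jsoup.parse(html);
    }

    public static void main(String[] args) {
        Document doc = parse("<html><head><title>First parse</title></head>"
                + "<body><p>Parsed HTML into a doc.</p></body></html>");
        System.out.println(doc.title());        // prints "First parse"
        System.out.println(doc.body().text());  // prints "Parsed HTML into a doc."

        // Tag soup: neither <p> nor <b> is closed, but both end up in the tree.
        Document soup = parse("<p>Unclosed <b>bold");
        System.out.println(soup.select("b").text()); // prints "bold"
    }
}
```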

Parsing HTML Fragment

A full HTML document includes a head and a body, but sometimes you only need to parse an HTML fragment. jsoup can parse a fragment and still give you a full HTML document, with head and body, that wraps it.
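A minimal sketch of fragment parsing, using Jsoup.parseBodyFragment, could look like this. The class name FragmentExample and the sample fragment are assumptions for the example.

```java
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;

class FragmentExample {
    // The fragment is parsed as body content; jsoup supplies html, head and body around it.
    static Document parseFragment(String fragment) {
        return Jsoup.parseBodyFragment(fragment);
    }

    public static void main(String[] args) {
        Document doc = parseFragment("<div><p>Lorem ipsum.</p>");
        System.out.println(doc.select("p").text());       // prints "Lorem ipsum."
        System.out.println(doc.body().children().size()); // prints 1 (the div)
        System.out.println(doc.body().html());            // the fragment, with the div closed
    }
}
```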

DOM Methods

jsoup has several methods similar to those of the DOM model used for parsing XML documents.

Methods Description
Element getElementById(String id) Find an element by ID, including or under this element.
Elements getElementsByTag(String tag) Finds elements, including and recursively under this element, with the specified tag name.
Elements getElementsByClass(String className) Find elements that have this class, including or under this element.
Elements getElementsByAttribute(String key) Find elements that have a named attribute set. Case insensitive.
Elements siblingElements() Get sibling elements.
Element firstElementSibling() Gets the first element sibling of this element.
Element lastElementSibling() Gets the last element sibling of this element.
 

Methods for retrieving data from an Element

Method Description
String attr(String key) Get an attribute’s value by its key.
void attr(String key, String value) Set an attribute. If the attribute already exists, it is replaced.
String id() Return The id attribute, if present, or an empty string if not.
String className() Gets the literal value of this element’s “class” attribute, which may include multiple class names, space separated. (E.g. on <div class="header gray"> returns "header gray".)
Set<String> classNames() Get all of the element’s class names. E.g. on element <div class="header gray">, returns a set of two elements, "header" and "gray". Note that modifications to this set are not pushed to the backing class attribute; use the classNames(java.util.Set) method to persist them.
String text() Gets the combined text of this element and all its children.
void text(String value) Set the text of this element.
String html() Retrieves the element’s inner HTML. E.g. on a <div><p>a</p></div>, would return <p>a</p>. (Whereas Node.outerHtml() would return <div><p>a</p></div>.)
void html(String value) Set this element’s inner HTML. Clears the existing HTML first.
Tag tag() Get the Tag for this element
String tagName() Get the name of the tag for this element. E.g. div
  ……

For example, the DOM methods can be used to parse an HTML document and retrieve information from it.
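The sketch below combines the DOM-style lookups with the data accessors listed in the tables. The sample page, the helper hrefs, and the class name DomMethodsExample are made up for this illustration.

```java
import java.util.ArrayList;
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

class DomMethodsExample {
    // Hypothetical sample page.
    static final String SAMPLE = "<html><head><title>Demo</title></head><body>"
            + "<div id=\"content\" class=\"header gray\">"
            + "<a href=\"/one\">One</a> <a href=\"/two\" title=\"second\">Two</a>"
            + "</div></body></html>";

    // Collects the href value of every <a> element in the document.
    static List<String> hrefs(String html) {
        Document doc = Jsoup.parse(html);
        List<String> result = new ArrayList<>();
        for (Element link : doc.getElementsByTag("a")) {
            result.add(link.attr("href"));
        }
        return result;
    }

    public static void main(String[] args) {
        Document doc = Jsoup.parse(SAMPLE);
        Element content = doc.getElementById("content");
        System.out.println(content.tagName());           // prints "div"
        System.out.println(content.className());         // prints "header gray"
        System.out.println(content.classNames().size()); // prints 2
        System.out.println(content.text());              // prints "One Two"

        Elements titled = doc.getElementsByAttribute("title");
        System.out.println(titled.size());               // prints 1
        System.out.println(hrefs(SAMPLE));               // prints [/one, /two]
    }
}
```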

Methods similar to jQuery

Do you want to find or manipulate elements using a CSS or jQuery-like selector syntax? jsoup gives you a few methods to do this:

Element.select(String selector)

Elements.select(String selector)

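import org.jsoup.Connection;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;

// get() performs the HTTP request and can throw java.io.IOException
Connection conn = Jsoup.connect("http://o7planning.org");
Document doc = conn.get();

// a elements with an href attribute
Elements links = doc.select("a[href]");

// img elements whose src ends with .png
Elements pngs = doc.select("img[src$=.png]");

// first div with class=masthead (null if there is none)
Element masthead = doc.select("div.masthead").first();

// a elements that are direct children of an h3 with class=r
Elements resultLinks = doc.select("h3.r > a");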

jsoup elements support a CSS (or jQuery) like selector syntax to find matching elements, which allows very powerful and robust queries.

The select method is available in a Document, Element, or in Elements. It is contextual, so you can filter by selecting from a specific element, or by chaining select calls.

Select returns a list of Elements (as Elements), which provides a range of methods to extract and manipulate the results.
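To illustrate the contextual behaviour described above, this sketch narrows a query by chaining select calls and then reads values from the resulting Elements. The sample markup, the helper linksInside, and the class name ChainedSelectExample are assumptions for the example.

```java
import java.util.List;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

class ChainedSelectExample {
    // Hypothetical sample markup with two containers.
    static final String SAMPLE =
            "<div class=\"masthead\"><a href=\"/home\">Home</a></div>"
            + "<div class=\"content\"><a href=\"/post\">Post</a><a name=\"anchor\">No href</a></div>";

    // Selects the container first, then only the links with an href inside it.
    static List<String> linksInside(String html, String containerSelector) {
        Document doc = Jsoup.parse(html);
        return doc.select(containerSelector).select("a[href]").eachAttr("href");
    }

    public static void main(String[] args) {
        System.out.println(linksInside(SAMPLE, "div.content"));  // prints [/post]

        // The same idea starting from a single Element.
        Element masthead = Jsoup.parse(SAMPLE).select("div.masthead").first();
        System.out.println(masthead.select("a[href]").text());   // prints "Home"
    }
}
```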

Selector overview

Selector Description
tagname find elements by tag, e.g. a
ns|tag find elements by tag in a namespace, e.g. fb|name finds <fb:name> elements
#id find elements by ID, e.g. #logo
.class find elements by class name, e.g. .masthead
[attribute] elements with attribute, e.g. [href]
[^attr] elements with an attribute name prefix, e.g. [^data-] finds elements with HTML5 dataset attributes
[attr=value] elements with attribute value, e.g. [width=500] (also quotable, like [data-name='launch sequence'])
[attr^=value], [attr$=value], [attr*=value] elements with attributes that start with, end with, or contain the value, e.g. [href*=/path/]
[attr~=regex] elements with attribute values that match the regular expression; e.g. img[src~=(?i)\.(png|jpe?g)]
* all elements, e.g. *
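A few of the selectors from this table can be tried against a small sample, as in the sketch below. The markup, the helper count, and the class name AttributeSelectorExample are invented for illustration.

```java
import org.jsoup.Jsoup;

class AttributeSelectorExample {
    // Hypothetical sample markup.
    static final String SAMPLE =
            "<img src=\"a.png\" width=\"500\"><img src=\"b.JPG\"><img src=\"c.gif\">"
            + "<a id=\"logo\" href=\"/path/to/x\" data-id=\"7\">x</a>"
            + "<a href=\"https://example.com/y\">y</a>";

    // Number of elements in the parsed document that match the selector.
    static int count(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).size();
    }

    public static void main(String[] args) {
        System.out.println(count(SAMPLE, "img"));             // prints 3
        System.out.println(count(SAMPLE, "#logo"));           // prints 1
        System.out.println(count(SAMPLE, "[href]"));          // prints 2
        System.out.println(count(SAMPLE, "[^data-]"));        // prints 1
        System.out.println(count(SAMPLE, "[width=500]"));     // prints 1
        System.out.println(count(SAMPLE, "[href*=/path/]"));  // prints 1
        // The regex needs a doubled backslash inside a Java string literal.
        System.out.println(count(SAMPLE, "img[src~=(?i)\\.(png|jpe?g)]")); // prints 2
    }
}
```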

 

Selector combinations

 Selector Description
el#id elements with ID, e.g. div#logo
el.class elements with class, e.g. div.masthead
el[attr] elements with attribute, e.g. a[href]
  Any combination, e.g. a[href].highlight
ancestor child child elements that descend from ancestor, e.g. .body p finds p elements anywhere under a block with class “body”
parent > child child elements that descend directly from parent, e.g. div.content > p finds p elements; and body > * finds the direct children of the body tag
siblingA + siblingB finds sibling B element immediately preceded by sibling A, e.g. div.head + div
siblingA ~ siblingX finds sibling X element preceded by sibling A, e.g. h1 ~ p
el, el, el group multiple selectors, find elements that match any of the selectors; e.g. div.masthead, div.logo
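The difference between the descendant, child, and sibling combinators is easiest to see side by side. The following sketch runs several of them on the same sample; the markup, the helper texts, and the class name CombinatorExample are assumptions for the example.

```java
import java.util.List;

import org.jsoup.Jsoup;

class CombinatorExample {
    // Hypothetical sample markup.
    static final String SAMPLE =
            "<div class=\"content\"><p>direct</p><section><p>nested</p></section></div>"
            + "<h1>Title</h1><p>after h1</p>"
            + "<div class=\"head\">h</div><div>next</div>";

    // Text of each element matching the selector, in document order.
    static List<String> texts(String html, String cssQuery) {
        return Jsoup.parse(html).select(cssQuery).eachText();
    }

    public static void main(String[] args) {
        System.out.println(texts(SAMPLE, ".content p"));       // prints [direct, nested]
        System.out.println(texts(SAMPLE, "div.content > p"));  // prints [direct]
        System.out.println(texts(SAMPLE, "div.head + div"));   // prints [next]
        System.out.println(texts(SAMPLE, "h1 ~ p"));           // prints [after h1]
        System.out.println(texts(SAMPLE, "h1, div.head"));     // prints [Title, h]
    }
}
```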

 

Pseudo selectors

Selector Description
:lt(n) find elements whose sibling index (i.e. its position in the DOM tree relative to its parent) is less than n; e.g. td:lt(3)
:gt(n) find elements whose sibling index is greater than n; e.g. div p:gt(2)
:eq(n) find elements whose sibling index is equal to n; e.g. form input:eq(1)
:has(selector) find elements that contain elements matching the selector; e.g. div:has(p)
:not(selector) find elements that do not match the selector; e.g. div:not(.logo)
:contains(text) find elements that contain the given text. The search is case-insensitive; e.g. p:contains(jsoup)
:containsOwn(text) find elements that directly contain the given text
:matches(regex) find elements whose text matches the specified regular expression; e.g. div:matches((?i)login)
:matchesOwn(regex) find elements whose own text matches the specified regular expression
Note that the above indexed pseudo-selectors are 0-based, that is, the first element is at index 0, the second at 1, et