Java XML

 

The JDK includes two long-standing built-in XML parsers, DOM and SAX, and each has its pros and cons. Here are a few examples that show how to create, modify, and read an XML file in Java with DOM, SAX, and JDOM.

In addition, an updated JAXB example shows how to convert objects to and from XML.

XML is the abbreviation for Extensible Markup Language and is an established data exchange format. XML was defined in 1998 by the World Wide Web Consortium (W3C).

An XML document consists of elements. Each element has a start tag, content, and an end tag. An XML document must have exactly one root element (i.e., one tag which encloses the remaining tags). XML is case sensitive: it distinguishes between uppercase and lowercase letters.

An XML file must be “well-formed”.

A well-formed XML file must satisfy the following conditions:

  • An XML document always starts with a prolog (see below for an explanation of what a prolog is)
  • Every opening tag has a closing tag.
  • All tags are properly nested.

An XML file is called valid if it is well-formed, contains a link to an XML schema, and is valid according to that schema.
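A well-formedness check can be sketched with the JDK's own DOM parser: if parsing raises a fatal error, the input is not well-formed. This is a minimal sketch, not a definitive implementation. The class name WellFormedCheck and the sample documents are made up for illustration, and the check covers well-formedness only (validity would additionally require a schema).

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilder;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical helper class, named for this illustration only.
class WellFormedCheck {

    /** Returns true if the given text parses as well-formed XML. */
    static boolean isWellFormed(String xml) {
        try {
            DocumentBuilder builder = DocumentBuilderFactory.newInstance().newDocumentBuilder();
            // DefaultHandler rethrows fatal errors instead of printing them to stderr.
            builder.setErrorHandler(new DefaultHandler());
            builder.parse(new InputSource(new StringReader(xml)));
            return true;
        } catch (SAXException e) {
            // The parser reports violations of well-formedness as fatal errors.
            return false;
        } catch (ParserConfigurationException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        String prolog = "<?xml version=\"1.0\"?>";
        System.out.println(isWellFormed(prolog + "<book><title>Java</title></book>"));
        System.out.println(isWellFormed(prolog + "<book><title>Java</book></title>"));
        System.out.println(isWellFormed(prolog + "<book></book><book></book>"));
    }
}
```

The first document satisfies all conditions; the second has incorrectly nested tags, and the third has two root elements.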


Comparison of XML to other formats

XML has the following characteristics, which make processing it with computer programs relatively easy compared to a binary or unstructured format:

  • XML is plain text
  • XML represents data without defining how the data should be displayed
  • XML can be transformed into other formats via XSL
  • XML can be easily processed via standard parsers
  • XML files are hierarchical

On the other hand, the XML format is relatively verbose: data represented as XML is relatively large compared to the same data in other formats. On the Internet, JSON or binary formats are frequently used in place of XML when data throughput is important.

 


Java XML overview

The Java platform offers several APIs for reading and writing XML.

Older Java versions supported only the DOM API (Document Object Model) and the SAX (Simple API for XML) API.

In DOM you access the XML document over an object tree. DOM can be used to read and write XML files.
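As a rough illustration of the write side, the following sketch builds a small object tree and serializes it with the JDK's javax.xml.transform API. The class name DomWriteSketch and the book/title/isbn names are assumptions made for this example only.

```java
import java.io.StringWriter;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerException;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import org.w3c.dom.Document;
import org.w3c.dom.Element;

// Hypothetical example class; element and attribute names are illustrative.
class DomWriteSketch {

    /** Builds <book isbn="..."><title>...</title></book> and returns it as text. */
    static String buildBookXml(String isbn, String title) {
        try {
            Document doc = DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument();

            // Create the root element and attach an attribute to it.
            Element book = doc.createElement("book");
            book.setAttribute("isbn", isbn);
            doc.appendChild(book);

            // Add a child element holding text content.
            Element titleElement = doc.createElement("title");
            titleElement.setTextContent(title);
            book.appendChild(titleElement);

            // Serialize the tree; the identity transformer copies DOM to text.
            Transformer transformer = TransformerFactory.newInstance().newTransformer();
            transformer.setOutputProperty(OutputKeys.OMIT_XML_DECLARATION, "yes");
            StringWriter out = new StringWriter();
            transformer.transform(new DOMSource(doc), new StreamResult(out));
            return out.toString();
        } catch (ParserConfigurationException | TransformerException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println(buildBookXml("12345", "Java & XML"));
    }
}
```

Note that the serializer escapes the ampersand in the title, so the caller passes plain text rather than pre-escaped markup.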

SAX (Simple API for XML) is a Java API for sequential reading of XML files. SAX can only read XML documents. SAX provides event-driven XML processing following the push-parsing model, i.e., you register listeners in the form of handlers with the parser, and these are notified through callback methods.

Both DOM and SAX are older APIs and I recommend not using them anymore.

StAX (Streaming API for XML) is an API for reading and writing XML documents. It was introduced in Java 6 and is considered better than SAX and DOM.
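Unlike SAX, StAX is a pull API: the application asks the parser for the next event instead of being called back. A minimal reading sketch, assuming a hypothetical document in which the interesting text lives in title elements (the class name StaxReadSketch is likewise made up):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

// Hypothetical example class; the "title" element name is an assumption.
class StaxReadSketch {

    /** Collects the text of every <title> element, in document order. */
    static List<String> readTitles(String xml) {
        List<String> titles = new ArrayList<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            // Pull events one at a time; nothing is loaded ahead of the cursor.
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT
                        && "title".equals(reader.getLocalName())) {
                    // getElementText() reads up to the matching end tag.
                    titles.add(reader.getElementText());
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException(e);
        }
        return titles;
    }

    public static void main(String[] args) {
        String xml = "<bookstore><book><title>A</title></book>"
                + "<book><title>B</title></book></bookstore>";
        System.out.println(readTitles(xml));
    }
}
```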

Java Architecture for XML Binding (JAXB) is a Java standard that defines how Java objects are converted to XML and vice versa (specified using a standard set of mappings). JAXB defines a programmer API for reading and writing Java objects to / from XML documents and a service provider which allows the selection of the JAXB implementation. JAXB applies a lot of defaults thus making reading and writing of XML via Java very easy.


Why XML?

HTML’s original objective was to let the document author focus on the contents of the document and leave its actual appearance to the browser, but this has gotten out of hand. Many HTML documents have more markup tags than content. Worse still, many of the markup tags deal with the appearance of the document (e.g., <font>) rather than the contents (e.g., <h1>).

HTML has grown into a huge and complex language, with more than a hundred markup tags in its latest version. On the one hand, despite these many tags, specific applications (such as e-commerce and mathematical formulas) are asking for more tags. On the other hand, many tags are not used frequently by many applications and could be removed. Furthermore, many of the HTML tags (e.g., <font>, <span>, <div>) are meant for presentation rather than the contents.


Objectives of XML

XML aims to:

  • Focus on the content rather than the appearance of the documents.
  • Resolve the conflicting demands on tags: on one hand, specialized applications need more tags; on the other hand, many tags are not frequently used and can be removed.

XML adopts the following principles to meet the above objectives:

  • XML has no pre-defined tags: The authors of the documents create their own tags to suit their applications. Hence, XML is flexible and extensible.
  • XML has strict syntax: HTML is sloppy and loose in syntax. HTML browsers need to correct sloppy HTML, which makes them complex and heavy. With a tighter syntax, an XML processor can be smaller, lighter, and faster.

Applications for XML

XML is useful for these applications:

  • Data exchange between computer systems: XML is platform- and computer-language-neutral and text-based, which greatly facilitates exchanging data between two computer systems. For example, two e-commerce partners can use an agreed-upon XML format to exchange purchase orders and invoices electronically and feed them directly into their computer systems.
  • Data storage: Unlike databases, which are platform- and language-dependent, XML provides a platform-neutral means of data storage.
  • Specialized publishing: XML can be used for marking up documents for specialized applications, such as e-commerce, scientific documents, mathematical formulas, and e-books, among others.

XML Syntax

There are currently two versions of XML specifications: XML 1.0 and XML 1.1, maintained by W3C.

  • An element is the basic unit of an XML document. Each XML element must have a start-tag and an end-tag. The tags are enclosed in angle brackets, e.g., <title>…</title>. Unlike HTML, the closing tag is mandatory. An empty-element tag (or standalone tag) must be properly closed, e.g., <out_of_print />.
  • An XML element includes its start-tag, enclosing character data and/or child elements, and the end-tag.
  • The element’s name can contain letters, numbers, and other Unicode characters, but NOT white spaces. The name must start with a letter, underscore "_", or colon ":", but cannot start with certain reserved words such as xml.
  • Each XML document must have one (and only one) root element.
  • XML elements must be properly nested. For example, <book><title>…</book></title> is incorrectly nested.
  • XML is case sensitive. For example, <book> and <Book> are considered two different tags.
  • The start-tag may contain attributes in the form of attribute_name="attribute_value". Attributes are used to provide extra information about the element. Unlike HTML, the attribute_value of an XML attribute must be properly quoted (either in double quotes or single quotes).
  • Certain characters, such as < and >, which are used in XML syntax, must be replaced with so-called entity references in the form of &name;. XML has five pre-defined entity references: &lt; (<), &gt; (>), &amp; (&), &quot; ("), and &apos; (').
  • An XML comment takes the form of <!-- comment texts -->, which is the same as in HTML.
  • Unlike HTML, white spaces in the text are preserved. New-line is represented by a Line Feed (LF) character (0AH).
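When XML text is assembled by hand rather than through a parser API, the five pre-defined entity references have to be applied explicitly. A minimal sketch, with a hypothetical helper named XmlEscapeSketch.escape (libraries such as DOM and StAX writers perform this step for you):

```java
// Hypothetical helper class, named for this illustration only.
class XmlEscapeSketch {

    /** Replaces the five characters that have pre-defined XML entity references. */
    static String escape(String text) {
        StringBuilder sb = new StringBuilder(text.length());
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            switch (c) {
                case '<':  sb.append("&lt;");   break;
                case '>':  sb.append("&gt;");   break;
                case '&':  sb.append("&amp;");  break;
                case '"':  sb.append("&quot;"); break;
                case '\'': sb.append("&apos;"); break;
                default:   sb.append(c);
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        // prints: a &lt; b &amp;&amp; &quot;c&quot;
        System.out.println(escape("a < b && \"c\""));
    }
}
```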

Well-Formed XML Documents

An XML document is well-formed, if its structure meets the XML specification, i.e., it is syntactically correct. A well-formed XML document exhibits a tree-like structure, and can be processed by an XML processor. For example, the tree structure of the “bookstore.xml” is as follows:

[Figure: tree structure of bookstore.xml]

Structure of XML Documents

An XML document comprises the following basic units:

  • Element: includes the start-tag, the enclosing character data and/or nested elements, and the end-tag.
  • Attribute: defined in the start-tag to provide extra information about the element, in the form of attribute_name="attribute_value".
  • Entity References: in the form of &name;, e.g., &lt; (<), &gt; (>), &amp; (&), &quot; ("), and &apos; (').
  • Character References: in the form of &#decimal-number; or &#xhex-code; for replacing any Unicode character, e.g., both &#169; and &#xA9; can be used for the copyright symbol ©.
  • PCDATA (Parsed Character Data): Text between start-tag and end-tag that will be examined by the parser for entity references and nested elements.
  • CDATA (Character Data): Text between start-tag and end-tag that will NOT be examined by the parser for entity references and nested tags. In a document, such a section is written as <![CDATA[ … ]]>.
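The difference between parsed and unparsed text can be observed by parsing two small documents and reading back the text the parser delivers. This is a sketch using the JDK's DOM parser; the class name CharRefSketch and the note element are made up for the example.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class, named for this illustration only.
class CharRefSketch {

    /** Parses the document and returns the text content of its root element. */
    static String textOf(String xml) {
        try {
            Document doc = DocumentBuilderFactory.newInstance()
                    .newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
            return doc.getDocumentElement().getTextContent();
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        // PCDATA: character and entity references are resolved by the parser.
        System.out.println(textOf("<note>&#169; &#xA9; &lt;tag&gt;</note>"));
        // CDATA: the content is passed through untouched.
        System.out.println(textOf("<note><![CDATA[<b>&amp;</b>]]></note>"));
    }
}
```

In the first document both character references become the copyright symbol and the entity references become angle brackets; in the second, the markup-like text inside the CDATA section comes back exactly as written.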

XML is Extensible

XML is classified as an extensible language as it does not have a pre-defined set of tags. You can create and extend your own tags to suit your application.

You can also expand on an existing set of tags without breaking your existing applications. For example, in the bookstore example, we can define more tags such as <number_of_pages>, <weight>, <dimension> for shipping purpose, without breaking the existing applications.

Best Practices

Naming Convention:

  • Names should be self-descriptive.
  • Names should be nouns, and may consist of a few words. Use underscore "_" to join the words, e.g., <first_name>, <last_name>.
  • Avoid the colon ":" character, which is reserved for namespaces. Avoid the dot ".", which could be confused with an object property. Avoid the dash "-", which could be confused with the subtraction operator.

You can use either an element or an attribute to carry information. In the above example, title could be an element, or an attribute inside the book element like ISBN. Generally, try to avoid attributes, as attributes are harder to read, cannot carry multiple values, and are not easily extended.

HTML vs. XML

  • XML defines data, HTML defines both data and presentation.
  • XML is case sensitive, HTML is not.
  • XML has strict syntax, HTML’s syntax is loose and sloppy.
    • An XML element must begin with a start-tag and end with an end-tag. An empty element’s tag must be closed with a forward slash "/". HTML’s end-tag may be omitted.
    • An XML document must have one (and only one) root element.
    • XML elements must be properly nested within the root element.
    • XML attribute values must be properly quoted.
    • The same attribute cannot appear more than once in the same element.

DOM XML Parser

DOM is the easiest Java XML parser to use. It parses an entire XML document and loads it into memory, modeling it as a tree of objects for easy node traversal. A DOM parser is slow and consumes a lot of memory when it loads an XML document that contains a lot of data.


When to use?

You should use a DOM parser when:

  • You need to know a lot about the structure of a document
  • You need to move parts of the document around (you might want to sort certain elements, for example)
  • You need to use the information in the document more than once
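Reordering is a good example of what an in-memory tree makes easy. The sketch below sorts book elements by their title text and returns the titles in the new document order. The bookstore/book/title structure and the class name DomSortSketch are assumptions for illustration, and every book is assumed to contain a title.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class; element names are illustrative.
class DomSortSketch {

    /** Text of the first <title> inside a book (assumed to exist). */
    static String titleOf(Element book) {
        return book.getElementsByTagName("title").item(0).getTextContent();
    }

    /** Sorts the <book> elements by title and returns the titles in document order. */
    static List<String> sortedTitles(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.getDocumentElement();

        // Copy the live NodeList before modifying the tree.
        NodeList found = root.getElementsByTagName("book");
        List<Element> books = new ArrayList<>();
        for (int i = 0; i < found.getLength(); i++) {
            books.add((Element) found.item(i));
        }
        books.sort(Comparator.comparing(DomSortSketch::titleOf));

        // appendChild moves an existing node, so this reorders the books.
        for (Element book : books) {
            root.appendChild(book);
        }

        List<String> titles = new ArrayList<>();
        NodeList ordered = root.getElementsByTagName("title");
        for (int i = 0; i < ordered.getLength(); i++) {
            titles.add(ordered.item(i).getTextContent());
        }
        return titles;
    }

    public static void main(String[] args) {
        System.out.println(sortedTitles("<bookstore><book><title>C</title></book>"
                + "<book><title>A</title></book><book><title>B</title></book></bookstore>"));
    }
}
```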

What do you get?

When you parse an XML document with a DOM parser, you get back a tree structure that contains all of the elements of your document. The DOM provides a variety of functions you can use to examine the contents and structure of the document.


Advantages

The DOM is a common interface for manipulating document structures. One of its design goals is that Java code written for one DOM-compliant parser should run on any other DOM-compliant parser without changes.


DOM interfaces

The DOM defines several Java interfaces. Here are the most common interfaces:

  • Node – The base datatype of the DOM.
  • Element – The vast majority of the objects you’ll deal with are Elements.
  • Attr – Represents an attribute of an element.
  • Text – The actual content of an Element or Attr.
  • Document – Represents the entire XML document. A Document object is often referred to as a DOM tree.

Common DOM methods

When you are working with the DOM, there are several methods you’ll use often:

  • getDocumentElement() – Returns the root element of the document.
  • getFirstChild() – Returns the first child of a given Node.
  • getLastChild() – Returns the last child of a given Node.
  • getNextSibling() – Returns the next sibling of a given Node.
  • getPreviousSibling() – Returns the previous sibling of a given Node.
  • getAttribute(attrName) – For a given Element, returns the value of the attribute with the requested name.
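These methods combine into a simple traversal loop. The following is a minimal sketch with a hypothetical class DomNavigationSketch and an assumed isbn attribute. Note that whitespace between tags shows up as Text nodes, which is why the loop checks the node type before casting.

```java
import java.io.IOException;
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.parsers.ParserConfigurationException;
import org.w3c.dom.Document;
import org.w3c.dom.Element;
import org.w3c.dom.Node;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;

// Hypothetical example class; the "isbn" attribute is an assumption.
class DomNavigationSketch {

    /** Returns the root name followed by each child element and its isbn attribute. */
    static String describe(String xml) {
        Document doc;
        try {
            doc = DocumentBuilderFactory.newInstance().newDocumentBuilder()
                    .parse(new InputSource(new StringReader(xml)));
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        Element root = doc.getDocumentElement();
        StringBuilder sb = new StringBuilder(root.getTagName());

        // Walk the direct children from first to last via sibling links.
        for (Node child = root.getFirstChild(); child != null; child = child.getNextSibling()) {
            if (child.getNodeType() == Node.ELEMENT_NODE) {   // skip whitespace Text nodes
                Element element = (Element) child;
                sb.append(' ').append(element.getTagName())
                  .append('[').append(element.getAttribute("isbn")).append(']');
            }
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String xml = "<bookstore>\n  <book isbn=\"1\"/>\n  <book isbn=\"2\"/>\n</bookstore>";
        System.out.println(describe(xml)); // prints: bookstore book[1] book[2]
    }
}
```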

SAX XML Parser

A SAX parser works differently from a DOM parser: it does not load the XML document into memory or create an object representation of it. Instead, the SAX parser uses callback methods (for example, on a handler that extends org.xml.sax.helpers.DefaultHandler) to inform clients of the XML document structure.


When to use?

You should use a SAX parser when:

  • You can process the XML document in a linear fashion from the top down
  • The document is not deeply nested
  • You are processing a very large XML document whose DOM tree would consume too much memory. Typical DOM implementations use about ten bytes of memory to represent one byte of XML.
  • The problem to be solved involves only part of the XML document
  • Data is available as soon as it is seen by the parser, so SAX works well for an XML document that arrives over a stream

Disadvantages of SAX

  • We have no random access to an XML document since it is processed in a forward-only manner
  • If you need to keep track of data the parser has seen or change the order of items, you must write the code and store the data on your own

ContentHandler Interface

This interface specifies the callback methods that the SAX parser uses to notify an application program of the components of the XML document that it has seen.

  • void startDocument() – Called at the beginning of a document.
  • void endDocument() – Called at the end of a document.
  • void startElement(String uri, String localName, String qName, Attributes atts) – Called at the beginning of an element.
  • void endElement(String uri, String localName, String qName) – Called at the end of an element.
  • void characters(char[] ch, int start, int length) – Called when character data is encountered.
  • void ignorableWhitespace(char[] ch, int start, int length) – Called when a DTD is present and ignorable whitespace is encountered.
  • void processingInstruction(String target, String data) – Called when a processing instruction is recognized.
  • void setDocumentLocator(Locator locator) – Provides a Locator that can be used to identify positions in the document.
  • void skippedEntity(String name) – Called when an unresolved entity is encountered.
  • void startPrefixMapping(String prefix, String uri) – Called when a new namespace mapping is defined.
  • void endPrefixMapping(String prefix) – Called when a namespace definition ends its scope.
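To see the callbacks in action, the sketch below records the events a SAX parser reports for a tiny document. It extends DefaultHandler (which implements ContentHandler with empty methods) and overrides only three callbacks. The class name SaxHandlerSketch and the event-log format are made up for this example. The parser may deliver one run of text in several characters() calls, so real handlers usually accumulate text in a buffer.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; the log format is illustrative.
class SaxHandlerSketch {

    /** Parses the document and returns the sequence of reported events. */
    static List<String> events(String xml) {
        List<String> log = new ArrayList<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                log.add("start:" + qName);
            }

            @Override
            public void endElement(String uri, String localName, String qName) {
                log.add("end:" + qName);
            }

            @Override
            public void characters(char[] ch, int start, int length) {
                // Only the slice [start, start + length) belongs to this event.
                String text = new String(ch, start, length).trim();
                if (!text.isEmpty()) {
                    log.add("text:" + text);
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return log;
    }

    public static void main(String[] args) {
        System.out.println(events("<book><title>Java</title></book>"));
    }
}
```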

Attributes Interface

This interface specifies methods for processing the attributes connected to an element.

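  • int getLength() – Returns the number of attributes.
  • String getQName(int index) – Returns the qualified name of the attribute at the given index.
  • String getValue(int index) – Returns the value of the attribute at the given index.
  • String getValue(String qname) – Returns the value of the attribute with the given qualified name, or null if there is no such attribute.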
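A short sketch of how these methods are typically used inside startElement: it copies the attributes of a chosen element into a map. The class name SaxAttributesSketch and the isbn/lang attributes are assumptions for illustration.

```java
import java.io.IOException;
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.parsers.ParserConfigurationException;
import javax.xml.parsers.SAXParserFactory;
import org.xml.sax.Attributes;
import org.xml.sax.InputSource;
import org.xml.sax.SAXException;
import org.xml.sax.helpers.DefaultHandler;

// Hypothetical example class; attribute names are illustrative.
class SaxAttributesSketch {

    /** Collects name/value pairs of all attributes on elements with the given name. */
    static Map<String, String> attributesOf(String xml, String elementName) {
        Map<String, String> result = new LinkedHashMap<>();
        DefaultHandler handler = new DefaultHandler() {
            @Override
            public void startElement(String uri, String localName, String qName, Attributes atts) {
                if (!qName.equals(elementName)) {
                    return;
                }
                // Index-based access walks over every attribute of the element.
                for (int i = 0; i < atts.getLength(); i++) {
                    result.put(atts.getQName(i), atts.getValue(i));
                }
            }
        };
        try {
            SAXParserFactory.newInstance().newSAXParser()
                    .parse(new InputSource(new StringReader(xml)), handler);
        } catch (ParserConfigurationException | SAXException | IOException e) {
            throw new IllegalStateException(e);
        }
        return result;
    }

    public static void main(String[] args) {
        String xml = "<book isbn=\"123\" lang=\"en\"><title>Java</title></book>";
        System.out.println(attributesOf(xml, "book"));
    }
}
```

When only one known attribute is needed, atts.getValue("isbn") inside startElement avoids the loop altogether.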

 

 


JDOM XML Parser

JDOM provides a way to represent an XML document for easy and efficient reading, manipulation, and writing. It’s an alternative to DOM and SAX.


When to use?

You should use a JDOM parser when:

  • You need to know a lot about the structure of a document
  • You need to move parts of the document around (you might want to sort certain elements, for example)
  • You need to use the information in the document more than once
  • You are a Java developer and want to leverage Java-optimized parsing of XML.

What do you get?

When you parse an XML document with a JDOM parser, you get back a tree structure that contains all of the elements of your document, which JDOM aims to provide without a large impact on the memory footprint of the application. JDOM provides a variety of utility functions you can use to examine the contents and structure of the document when the document is well structured and its structure is known.


Advantages

JDOM gives Java developers flexibility and makes XML parsing code easy to maintain. It is a lightweight and fast API.


JDOM classes

The JDOM defines several Java classes. Here are the most common classes:

  • Document – Represents the entire XML document. A Document object is often referred to as a DOM tree.
  • Element – Represents an XML element. An Element object has methods to manipulate its child elements, its text, attributes, and namespaces.
  • Attribute – Represents an attribute of an element. Attribute has methods to get and set the value of the attribute. It has a parent and an attribute type.
  • Text – Represents the text of an XML tag.
  • Comment – Represents the comments in an XML document.

Common JDOM methods

When you are working with the JDOM, there are several methods you’ll use often:

  • build(xmlSource) – Builds the JDOM document from the XML source (called on a builder such as SAXBuilder).
  • getRootElement() – Gets the root element of the XML.
  • getName() – Gets the name of the XML node.
  • getChildren() – Gets all the direct child nodes of an element.
  • getChildren(Name) – Gets all the direct child nodes with a given name.
  • getChild(Name) – Gets the first child node with the given name.