Sigurdhsson

III: HTML and XML

On the serializations of HTML5

A recent question on the Q&A site Stack Overflow brought to light the difference between the two serializations of HTML5 and the difference in behaviour when using them, specifically the corner cases that occur when one tries to embed foreign content of the XML variety (in this case SVG). While the conclusion of the discussion that followed was that the HTML5 specification is inconsistent in this corner case [1], it did provoke me to think about the use cases of the two different serializations of HTML5. It turns out that it’s pretty much the same thing as the HTML4.01/XHTML1.1 duality, but more formal.

Anyway, the purpose of this article is to explore the two serializations, identify their strengths and weaknesses and finally condense these into some kind of usage recommendations.

The serialization concept

As you may know, the serialization concept simply means that any given HTML5 document can be expressed as either HTML (a markup language heavily based on, but not a subset of, SGML) or as XML (a proper SGML subset). While there is no difference in semantics or content between two serializations of the same document, different parsing rules apply to them. For example, the XML serialization is parsed by an XML parser (possibly with less draconian error handling) and does not support the document.write() DOM method, while the HTML serialization is parsed in quite a different manner.
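
To make the difference in parsing rules concrete, here is a minimal Python sketch. Python is not used elsewhere in this article, and the names MARKUP, TagCollector, parse_as_xml and parse_as_html are made up for illustration. It feeds the same markup, with its end tags omitted, to the standard library's XML parser and to its lenient html.parser tokenizer. Note that html.parser is not a conforming HTML5 parser and builds no DOM tree; it merely stands in for a forgiving parser here.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser

# Markup with omitted end tags: acceptable as HTML, not well-formed as XML.
MARKUP = "<p>First paragraph<p>Second paragraph"


class TagCollector(HTMLParser):
    """Records every start tag the lenient tokenizer encounters."""

    def __init__(self):
        super().__init__()
        self.tags = []

    def handle_starttag(self, tag, attrs):
        self.tags.append(tag)


def parse_as_xml(markup):
    """Return True if the markup is well-formed XML, False otherwise."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        # An XML parser gives up at the first well-formedness error.
        return False


def parse_as_html(markup):
    """Return the start tags seen by the lenient tokenizer."""
    collector = TagCollector()
    collector.feed(markup)
    collector.close()
    return collector.tags
```

With this markup, parse_as_xml(MARKUP) returns False because the XML parser rejects the unclosed elements, while parse_as_html(MARKUP) happily reports both p start tags.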

Since there is no difference in content or semantics between two different serializations of the same document, we are free to embed things like SVG and MathML even in the HTML serialization of the document; something that wasn’t possible in the HTML4.01/XHTML1.1 duality. However, we will see later that this isn’t always a good idea.

It is worth noting that no matter what your document looks like, it is the MIME type that defines what it is. There is no such thing as sending an XML serialization as text/html; what you’re doing is in fact sending a (possibly invalid) HTML serialization. It is possible to write so-called polyglot documents, i.e. documents that are both valid XML serializations and valid HTML serializations of the same content, but this is difficult in most circumstances.
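
The rule that the MIME type decides can be sketched as a small Python function. This is only an illustration under simplifying assumptions: the function name serialization_for is hypothetical, and it recognises just the MIME types mentioned in this article, whereas a real browser knows more of them.

```python
def serialization_for(content_type):
    """Pick the serialization from a Content-Type header value alone.

    The document body is deliberately never inspected: only the MIME
    type matters. Parameters such as "; charset=utf-8" are ignored.
    """
    media_type = content_type.split(";", 1)[0].strip().lower()
    if media_type == "text/html":
        return "html"
    if media_type in ("application/xhtml+xml", "application/xml"):
        return "xml"
    return "unknown"
```

A document that looks like XHTML but arrives as "text/html; charset=utf-8" therefore ends up in the "html" branch, exactly as described above.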

HTML5 as HTML

The HTML serialization of HTML5 can be traced back to HTML4.01, and features among other things relaxed (but well-defined) error handling, “void” elements (elements which must not have an end tag) and elements with optional end tags. Concepts such as “self-closing” elements do not exist except in foreign content (i.e. SVG and MathML), but the syntax that describes self-closing elements in XML is allowed on void elements, for compatibility reasons. The HTML serialization is required to have a doctype and must have an HTML MIME type (text/html or the rather interesting text/html-sandboxed). A sample HTML serialization could look like this:

<!DOCTYPE html>
<html>
	<head>
		<meta charset="utf-8">
		<title>HTML5 document</title>
	</head>
	<body>
		<!-- Here be dragons! -->
	</body>
</html>

There are somewhat complicated rules regarding “foreign content” and the implied namespaces, syntax and so on of such elements.

Advantages

The HTML serialization has three major advantages:

One can also note that with HTML5 it is possible to embed both MathML and SVG directly in the content, which was not possible in HTML4.01. This could be considered an “advantage”, but only over earlier HTML serialization, not over the XML serialization in which the same thing is possible.

Disadvantages

Of course, the HTML serialization also has disadvantages:

In previous HTML versions, the relaxed error handling was also a disadvantage since different browsers handled invalid markup in different ways. In HTML5, it is clearly defined how invalid markup should be handled and as such all HTML will be parsed into the exact same DOM tree in all conforming browsers.

HTML5 as XML

The XML serialization of HTML5 can be traced back to XHTML1.1, and features most things regular XML does: (somewhat) draconian error handling, strict syntax rules, self-closing elements and so on. Concepts like “void” elements and doctypes do not exist at all in the XML serialization, and foreign content is required to specify its own namespace. The XML serialization may omit the doctype (since it has no effect) but must be served with an XML MIME type (application/xml or application/xhtml+xml) and must specify the correct namespace:

<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
	<head>
		<title>HTML5 document</title>
	</head>
	<body>
		<!-- Here be dragons! -->
	</body>
</html>
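
To see why the namespace matters to XML tools, here is a hedged Python sketch using the standard library's ElementTree. The names DOCUMENT, NS and document_title are assumptions made for this example, and the XML declaration is left out of the embedded document for brevity.

```python
import xml.etree.ElementTree as ET

# A document like the sample above, embedded as a string.
DOCUMENT = """<html xmlns="http://www.w3.org/1999/xhtml">
    <head>
        <title>HTML5 document</title>
    </head>
    <body>
        <!-- Here be dragons! -->
    </body>
</html>"""

# Prefix-to-namespace mapping used in the lookup; the prefix "h" is arbitrary.
NS = {"h": "http://www.w3.org/1999/xhtml"}


def document_title(markup):
    """Return the text of the XHTML title element, or None if absent."""
    root = ET.fromstring(markup)
    title = root.find("h:head/h:title", NS)
    return None if title is None else title.text
```

The lookup only succeeds because the elements live in the XHTML namespace. The same markup without the xmlns attribute is still well-formed XML, but document_title returns None for it, since its elements are then not HTML elements at all.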

Advantages

The advantages of the XML serialization mainly deal with the machine-assisted generation and interpretation of the markup:

One should note that generating well-formed XML-serialized HTML5 is hard unless you’re using XML tools. Since most web software currently uses simple string concatenation (or similar methods), not XML tools, making it generate XML-serialized HTML5 is a difficult task that likely requires complete rewrites.
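
The contrast between string concatenation and an XML tool can be sketched in Python as follows. Both function names are hypothetical, and ElementTree merely stands in for whatever XML serializer a given piece of software would actually use.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"


def build_by_concatenation(title):
    """Naive approach: the title is pasted into the markup unescaped."""
    return (
        f'<html xmlns="{XHTML_NS}"><head><title>{title}</title></head>'
        "<body/></html>"
    )


def build_with_xml_tools(title):
    """Build the same document as a tree and let the serializer escape it."""
    # Serialize XHTML as the default namespace instead of an ns0: prefix.
    ET.register_namespace("", XHTML_NS)
    html = ET.Element(f"{{{XHTML_NS}}}html")
    head = ET.SubElement(html, f"{{{XHTML_NS}}}head")
    ET.SubElement(head, f"{{{XHTML_NS}}}title").text = title
    ET.SubElement(html, f"{{{XHTML_NS}}}body")
    return ET.tostring(html, encoding="unicode")


def is_well_formed(markup):
    """Return True if an XML parser accepts the markup."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False
```

With a title such as "Tom & Jerry", the concatenated document is not well-formed because of the bare ampersand, whereas the tree-based version serializes it as &amp; and parses cleanly.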

Disadvantages

The strict, well-defined XML serialization is not without its flaws, however:

Conclusion

The conclusion must be that the advantages of the XML serialization rarely make up for the difficulty of generating error-free documents dynamically and the unfriendly, draconian error handling. Since it is harder to write (or generate) well-formed XML than it is to write (or generate) valid HTML, one should avoid doing so unless there are compelling reasons. As such, I would recommend that you default to the HTML serialization of HTML5, and use the XML serialization if and only if one of these statements is true:

  1. It is imperative that your markup can be read by XML tools.
  2. You’re generating your data using an XML serializer.
  3. You’re embedding SVG data in your HTML5 document.

The campaign for promoting XHTML had much to do with the validity of documents (which is neither ensured by XML nor made impossible by HTML), the uniformity of rendering (which is enforced by both serializations of HTML5) and the use of correct markup in the sense that semantics play a larger role than presentational markup (which is a spirit that is present in all aspects of the HTML5 standard). These three points are no longer relevant; there is little reason to use the XML serialization of HTML5 unless you wish to avoid some of the (very few) inconsistencies regarding foreign content, or if you for some reason need to use XML, want to use it or already generate XML content.

  1. The problem arose from the fact that scripts in the HTML serialization are required to run when the closing tag is parsed, not when the element is closed, which means that the self-closing tags of the foreign SVG content cause problems.
