III: HTML and XML
On the serializations of HTML5
A recent question on the Q&A site Stack Overflow brought to light the difference between the two serializations of HTML5 and the difference in behaviour when using them, specifically the corner cases that occur when one tries to embed foreign content of the XML variety (in this case SVG). While the conclusion of the discussion that occurred was that the HTML5 specification is inconsistent in this corner case[1], it did provoke me to think about the use cases of the two different serializations of HTML5. It turns out that it’s pretty much the same thing as the HTML4.01/XHTML1.1 duality, but more formal.
Anyway, the purpose of this article is to explore the two serializations, identify their strengths and weaknesses and finally condense these into some kind of usage recommendations.
The serialization concept
As you may know, the serialization concept simply means that any given HTML5 document can be expressed as either
HTML (a markup language heavily based on, but not a subset of SGML) or as XML (a proper SGML subset). While there is
no semantic or contentual difference between two serializations of the same document, different parsing rules apply
to them. For example, the XML serialization is parsed by an XML parser (possibly with less draconian error-handling),
and documents parsed that way must not support the document.write() DOM method, while the HTML serialization is
parsed in quite a different manner.
Since there is no difference in content or semantics between two different serializations of the same document, we are free to embed things like SVG and MathML even in the HTML serialization of the document; something that wasn’t possible in the HTML4.01/XHTML1.1 duality. However, we will see later that this isn’t always a good idea.
It is worth noting that no matter what your document looks like, it is the mime-type that defines what it is.
There is no such thing as sending an XML serialization as text/html; what you’re doing is in fact sending a
(possibly invalid) HTML serialization. It is possible to write so-called polyglot documents, i.e.
documents that are both valid XML serializations and valid HTML serializations of the same content, but this is
difficult in most circumstances.
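The rule that the mime-type decides the serialization can be sketched in a few lines of Python. This is only an illustration: the function name serialization_for is made up for this example, and the list of XML mime-types is limited to the two mentioned in this article.

```python
# Hypothetical helper: decide which serialization a response is, based only
# on its Content-Type header. The markup itself is never inspected.

HTML_TYPES = {"text/html"}
# Assumption: only the two XML mime-types named in this article are listed.
XML_TYPES = {"application/xhtml+xml", "application/xml"}


def serialization_for(content_type: str) -> str:
    """Return "html", "xml" or "other" for a Content-Type header value."""
    # Drop parameters such as "; charset=utf-8" and normalise the case.
    mime = content_type.split(";")[0].strip().lower()
    if mime in HTML_TYPES:
        return "html"
    if mime in XML_TYPES:
        return "xml"
    return "other"


if __name__ == "__main__":
    # XHTML-looking markup sent as text/html is still the HTML serialization.
    print(serialization_for("text/html; charset=utf-8"))
    print(serialization_for("application/xhtml+xml"))
```

Note that the function never looks at the document body, which is exactly the point: well-formed XML sent as text/html is handled as HTML.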
HTML5 as HTML
The HTML serialization of HTML5 can be traced back to HTML4.01, and features among other things relaxed (but
well-defined) error handling, “void” elements (elements which must not have an end tag) and elements with optional
end tags. Concepts such as “self-closing” elements do not exist except in foreign content (i.e. SVG and MathML),
but the syntax that describes self-closing elements in XML is allowed on void elements, for compatibility reasons.
The HTML serialization is required to have a doctype and must have an HTML mime-type (text/html or the rather
interesting text/html-sandboxed). A sample HTML serialization could look like this:
<!DOCTYPE html>
<html>
<head>
<meta charset="utf-8">
<title>HTML5 document</title>
</head>
<body>
<!-- Here be dragons! -->
</body>
</html>
There are somewhat complicated rules regarding “foreign content” and the implied namespaces, syntax and so on of such elements.
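To make the idea of void elements a little more concrete, here is a rough Python sketch that tracks which elements are still open while reading a document. It uses html.parser from the standard library, which is a lenient tokenizer and not an implementation of the HTML5 parsing algorithm. The class name OpenElementTracker is invented for this example, and the set of void elements is written out by hand and may not be complete.

```python
from html.parser import HTMLParser

# Assumption: a hand-written list of void elements, possibly incomplete.
VOID_ELEMENTS = {
    "area", "base", "br", "col", "embed", "hr", "img",
    "input", "link", "meta", "source", "track", "wbr",
}


class OpenElementTracker(HTMLParser):
    """Keep a stack of elements that have been opened but not yet closed."""

    def __init__(self):
        super().__init__()
        self.stack = []

    def handle_starttag(self, tag, attrs):
        # A void element never gets an end tag, so it is never "open".
        if tag in VOID_ELEMENTS:
            return
        self.stack.append(tag)

    def handle_endtag(self, tag):
        # Only pop when the end tag matches the innermost open element.
        if self.stack and self.stack[-1] == tag:
            self.stack.pop()


def open_elements(markup: str) -> list:
    tracker = OpenElementTracker()
    tracker.feed(markup)
    tracker.close()
    return tracker.stack


if __name__ == "__main__":
    # <meta> and <br> leave nothing open; the XML-style <br/> is tolerated.
    print(open_elements('<p><meta charset="utf-8">a<br>b<br/>c</p>'))
```

Both `<br>` and `<br/>` leave the stack untouched here, which mirrors the compatibility rule described above: the slash is allowed on void elements but does not change anything.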
Advantages
The HTML serialization has three major advantages:
- Simplicity. The HTML serialization has easy-to-remember syntax rules and many useful time-saving constructs such as void elements, optional end tags and such.
- User friendliness. Error handling in the HTML serialization is much more user-friendly than in the XML serialization. End users won’t see horribly failing pages, but graceful (and well-defined) degradation.
- Backwards compatibility. With its text/html mime-type and through design choices, the HTML serialization is backwards compatible both in the sense that many older HTML4.01 documents are valid HTML serializations, and in the sense that older tools will be able to parse many new HTML5 documents.
One can also note that with HTML5 it is possible to embed both MathML and SVG directly in the content, which was not possible in HTML4.01. This could be considered an “advantage”, but only over earlier versions of HTML, not over the XML serialization, in which the same thing is possible.
Disadvantages
Of course, the HTML serialization also has disadvantages:
- Inconsistency with foreign content. Foreign XML content (SVG, MathML) is not parsed using an XML parser, but normal HTML parsing rules apply. This means that self-closing elements aren’t allowed, and in particular that script elements (in SVG) are treated in an unintuitive manner. This mostly affects SVG, as many MathML generators avoid self-closing elements (or close them explicitly).
- Complicated parsing rules. The XML serialization has strict syntax rules that can easily be used to create a conforming parser. The HTML parsing algorithm given by the HTML5 specification is much more complicated, with lots of corner cases and sometimes unintuitive rules chosen so that they won’t “break” the web.
In previous HTML versions, the relaxed error handling was also a disadvantage since different browsers handled invalid markup in different ways. In HTML5, it is clearly defined how invalid markup should be handled and as such all HTML will be parsed into the exact same DOM tree in all conforming browsers.
HTML5 as XML
The XML serialization of HTML5 can be traced back to XHTML1.1, and features most things regular XML does: (somewhat)
draconian error handling, strict syntax rules, self-closing elements and so on. Concepts like “void” elements and
doctypes do not exist at all in the XML serialization, and foreign content is required to specify its own
namespace. The XML serialization may omit the doctype (since it has no effect) but must be served with an XML
mime-type (application/xml or application/xhtml+xml) and must specify the correct namespace:
<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>HTML5 document</title>
</head>
<body>
<!-- Here be dragons! -->
</body>
</html>
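As a rough check of the namespace requirement, the sample can be run through an ordinary XML parser. The Python sketch below uses xml.etree.ElementTree from the standard library. The helper names is_xhtml_root and title_of are invented for this example, and the XML declaration is written with the encoding pseudo-attribute that XML defines.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# The sample document as bytes, so the parser reads the declared encoding.
DOCUMENT = b"""<?xml version="1.0" encoding="utf-8"?>
<html xmlns="http://www.w3.org/1999/xhtml">
<head>
<title>HTML5 document</title>
</head>
<body>
<!-- Here be dragons! -->
</body>
</html>"""


def is_xhtml_root(markup: bytes) -> bool:
    """True if the root element is <html> in the XHTML namespace."""
    root = ET.fromstring(markup)
    # ElementTree spells namespaced names as "{namespace-uri}localname".
    return root.tag == f"{{{XHTML_NS}}}html"


def title_of(markup: bytes):
    """Return the text of head/title, or None if it is missing."""
    root = ET.fromstring(markup)
    title = root.find(f"{{{XHTML_NS}}}head/{{{XHTML_NS}}}title")
    return title.text if title is not None else None


if __name__ == "__main__":
    print(is_xhtml_root(DOCUMENT), title_of(DOCUMENT))
    # Without xmlns the root is just an element called "html" in no namespace.
    print(is_xhtml_root(b"<html><head><title>x</title></head></html>"))
```

To an XML tool, an html element without the namespace declaration is simply some element that happens to be named "html", which is why the declaration is mandatory in this serialization.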
Advantages
The advantages of the XML serialization mainly deal with the machine-assisted generation and interpretation of the markup:
- Machine-readability. XML is designed to be read by machines. The XML serialization is well-defined and simple to parse using existing XML tools.
- Machine-generability. XML is also easily generated by machines. Many tools exist that generate well-formed XML, and to generate web pages using these should, in theory, be easy.
- Consistency with foreign content. Since foreign XML content is embedded in an XML context (and using XML methods, with proper namespaces and such) if you’re using an XML serialization, the correct parsing rules apply to the foreign content and the result is what you’d expect. This means no tricky script problems, and self-closing elements working properly.
One should note that generating well-formed XML-serialized HTML5 is hard unless you’re using XML tools. Since most web software currently uses simple string concatenation (or similar methods), not XML tools, making it generate XML-serialized HTML5 is a difficult task that likely requires complete rewrites.
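The difference between the two approaches can be illustrated with a small Python sketch. It builds the same page twice: once with an XML tool (xml.etree.ElementTree from the standard library) and once with plain string formatting. The function names build_page and naive_page are made up for this example, and the sketch only covers text content, not attributes or nested markup.

```python
import xml.etree.ElementTree as ET

XHTML_NS = "http://www.w3.org/1999/xhtml"

# Serialize the XHTML namespace as the default namespace (no prefix).
ET.register_namespace("", XHTML_NS)


def build_page(title: str, body_text: str) -> str:
    """Generate the page with an XML tool; escaping is handled for us."""
    html = ET.Element(f"{{{XHTML_NS}}}html")
    head = ET.SubElement(html, f"{{{XHTML_NS}}}head")
    ET.SubElement(head, f"{{{XHTML_NS}}}title").text = title
    body = ET.SubElement(html, f"{{{XHTML_NS}}}body")
    ET.SubElement(body, f"{{{XHTML_NS}}}p").text = body_text
    return ET.tostring(html, encoding="unicode")


def naive_page(title: str, body_text: str) -> str:
    """Generate the page by string formatting; nothing is escaped."""
    return (
        f'<html xmlns="{XHTML_NS}"><head><title>{title}</title></head>'
        f"<body><p>{body_text}</p></body></html>"
    )


def is_well_formed(markup: str) -> bool:
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


if __name__ == "__main__":
    text = "Fish & chips <3"
    print(is_well_formed(build_page("Menu", text)))
    print(is_well_formed(naive_page("Menu", text)))
```

The naive version works until some piece of content contains an ampersand or an angle bracket, at which point an XML parser rejects the whole page.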
Disadvantages
The strict, well-defined XML serialization is not without its flaws, however:
- Draconian error-handling. The XML serialization has extremely unfriendly error handling. Unless you as a developer are able to absolutely guarantee the well-formedness of your pages (which is next to impossible if they are being dynamically generated without using XML tools), there’s always the risk that your visitors will be greeted with a horrible, nonsensical error about XML well-formedness. This is bad for business.
- Lack of legacy support. While older XHTML1.1 documents are still (sort of) valid XML serializations of HTML5, there is little to no support in older software for XML serializations of HTML. While browsers that lack support are uncommon these days, it is still something that has to be considered. This may be seen as a minor disadvantage.
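The contrast in error handling can be seen with a short Python sketch that feeds the same sloppy markup to a strict XML parser and to a lenient HTML tokenizer. Both come from the standard library. Note that html.parser is only a stand-in here: it is forgiving, but it does not implement the error recovery defined by the HTML5 specification. The names TextCollector, parse_as_xml and parse_leniently are invented for this example.

```python
import xml.etree.ElementTree as ET
from html.parser import HTMLParser


class TextCollector(HTMLParser):
    """Collect the text content and ignore the tags entirely."""

    def __init__(self):
        super().__init__()
        self.chunks = []

    def handle_data(self, data):
        self.chunks.append(data)


def parse_as_xml(markup: str) -> bool:
    """True if the markup is well-formed XML; any error is fatal."""
    try:
        ET.fromstring(markup)
        return True
    except ET.ParseError:
        return False


def parse_leniently(markup: str) -> str:
    """Return whatever text can be recovered, without raising an error."""
    collector = TextCollector()
    collector.feed(markup)
    collector.close()
    return "".join(collector.chunks)


# A bare ampersand and an unclosed <br>: common in hand-written markup.
BROKEN = "<p>Terms & conditions<br>apply</p>"

if __name__ == "__main__":
    print(parse_as_xml(BROKEN))
    print(parse_leniently(BROKEN))
```

The strict parser gives up on the whole document, while the lenient one still recovers the text, which is the difference your visitors would notice.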
Conclusion
The conclusion must be that the advantages of the XML serialization rarely make up for the difficulty of generating error-free documents dynamically and for the draconian, unfriendly error handling. Since it is harder to write (or generate) well-formed XML than it is to write (or generate) valid HTML, one should avoid doing so unless there are compelling reasons. As such, I would recommend that you default to the HTML serialization of HTML5, and use the XML serialization if and only if one of these statements is true:
- It is imperative that your markup can be read by XML tools.
- You’re generating your data using an XML serializer.
- You’re embedding SVG data in your HTML5 serialization.
The campaign for promoting XHTML had much to do with the validity of documents (which is neither ensured by XML nor made impossible by HTML), the uniformity of rendering (which is enforced by both serializations of HTML5) and the use of correct markup in the sense that semantics play a larger role than presentational markup (which is a spirit that is present in all aspects of the HTML5 standard). These three points are no longer relevant; there is little reason to use the XML serialization of HTML5 unless you wish to avoid some of the (very few) inconsistencies regarding foreign content, or if you for some reason need to use XML, want to use it or already generate XML content.
[1] The problem arose from the fact that scripts in the HTML serialization are required to run when the closing tag is parsed, not when the element is closed, which means that the self-closing tags of the foreign SVG content cause problems.