XML vs HTML – Underwater Book Club

One of my assignments this semester was to examine HTML and XML, two markup languages written by humans and processed by computers to codify information. The assignment was to explain how XML is better than HTML. That’s a loaded question, as we’ll see.

HTML, the HyperText Markup Language, is a markup language designed specifically for the interconnected documents of the web. Web pages, as documents, can have headings, paragraphs, strong and emphasized text, tables, and images. Importantly, they can have anchors, which make it easy to reference other parts of the same document, or even specific locations within other documents. You probably got to this page by following a link: a reference to this page from whatever page you were on before. Anchors allow for tables of contents, footnotes, and much more. HTML also includes support for metadata, or data about the document being marked up. That can include authorship, rights information, publication date, language, and so on. Many metadata standards can be encoded into HTML. At time of writing, UnderwaterBook.Club uses JSON-LD (ld+json) and Dublin Core metadata on every page. Tags in HTML are defined in a public open standard, and it’s difficult to find a computing device today that can’t render it in a useful way. In short, HTML encodes general-purpose interlinked documents for both computers and humans, backed by an open standard with a fixed vocabulary, and it has seen amazing success in the Internet era.
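Because the HTML vocabulary is fixed, generic software already knows what an anchor or a metadata tag is. Here is a minimal sketch of that idea using Python's standard-library `html.parser`. The sample page, its author, its date, and the `LinkAndMetaCollector` class are all made up for illustration; this is not the markup this site actually serves.

```python
from html.parser import HTMLParser

# Hypothetical page: two links (one within the page, one to another
# document) plus Dublin Core style metadata in <meta> tags.
PAGE = """<!DOCTYPE html>
<html lang="en">
<head>
  <title>Example page</title>
  <meta name="DC.creator" content="Example Author">
  <meta name="DC.date" content="2024-01-01">
</head>
<body>
  <h1 id="top">Example page</h1>
  <p>See the <a href="#notes">notes</a> or
     <a href="https://example.com/other#section">another page</a>.</p>
  <h2 id="notes">Notes</h2>
</body>
</html>
"""


class LinkAndMetaCollector(HTMLParser):
    """Collects link targets and named <meta> entries from a page."""

    def __init__(self):
        super().__init__()
        self.links = []
        self.meta = {}

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and "href" in attrs:
            self.links.append(attrs["href"])
        elif tag == "meta" and "name" in attrs:
            self.meta[attrs["name"]] = attrs.get("content", "")


collector = LinkAndMetaCollector()
collector.feed(PAGE)
print(collector.links)
print(collector.meta)
```

Nothing in the collector had to be taught what `a` or `meta` means for this particular page. Those meanings come from the HTML standard, so the same code works on any page.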

XML, or Extensible Markup Language, has a lot in common with HTML. It has tags wrapped in angle brackets, it supports nesting of tags to reflect hierarchical information, and it can be written by hand and understood to some extent by a human while still being understandable to a computer. XML is also an open standard, published by the W3C, which published the HTML specifications for many years as well. But there are some significant differences between HTML and XML, and the biggest is this: where HTML has a fixed vocabulary useful for describing general-purpose interlinked documents, XML has no fixed vocabulary. Users of XML can define any tag that suits their purpose. For my class assignment, we were asked to devise an XML representation of a poem, so instead of using a <body> tag like I would for HTML, I chose to use a <poem> tag. Instead of <p> tags for paragraphs, I used <stanza> tags, and each one was filled with <line> tags. Marking up the poem this way makes it crystal clear what the different parts of the document are and how they relate to each other, in a way that both humans and computers can easily read and write. The ability to use domain-specific vocabulary helps to clarify intent in a way that generic document elements can’t match. And computers can now understand that this isn’t just a document but a poem, with specific poem-related structures, which can help with tasks like metadata generation (counting the number of stanzas, for example). On the surface, XML seems like the clear winner in terms of information expression.
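To make the poem markup concrete, here is a small sketch using Python's standard-library `xml.etree.ElementTree`. The tag names `poem`, `stanza`, and `line` are the ones described above. The poem text is placeholder content, and the `poem_stats` helper is a made-up name for this example.

```python
import xml.etree.ElementTree as ET

# Placeholder poem using the <poem>/<stanza>/<line> vocabulary.
POEM_XML = """<poem>
  <stanza>
    <line>Placeholder line one</line>
    <line>Placeholder line two</line>
  </stanza>
  <stanza>
    <line>Placeholder line three</line>
    <line>Placeholder line four</line>
  </stanza>
</poem>
"""


def poem_stats(xml_text):
    """Count stanzas and lines in a poem marked up as above."""
    root = ET.fromstring(xml_text)
    stanzas = root.findall("stanza")
    line_count = sum(len(stanza.findall("line")) for stanza in stanzas)
    return {"stanzas": len(stanzas), "lines": line_count}


print(poem_stats(POEM_XML))  # prints {'stanzas': 2, 'lines': 4}
```

The counting only works because the code was written with this exact vocabulary in mind.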

But as a human, I was able to recognize the content as a poem composed of lines arranged in stanzas without needing markup, by virtue of how I saw it on the page. My task was to create the markup to describe the poem, and I was able to do so without any markup already existing. Conversely, while any XML parser can read the tags in the marked-up version that I produced, there is no computer program in existence that can understand them, because I invented all of the tags. XML doesn’t define what a poem or stanza or line means. A line in a poem is a very different thing from a line in a play, or a line in a song, or a line in a computer program. Software reading my document would have to be taught what a stanza is, because that information isn’t conveyed through tag definitions in a specification. Two metadata librarians could take the same poem and create two different tag vocabularies to describe it, perhaps choosing verse instead of stanza, and software written to understand one could not understand the other. The flexibility of XML allows users to create whatever tags they wish, but without a standard that defines a fixed vocabulary of tags and their meanings, interoperability between authors and computer systems doesn’t materialize.
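The two-librarians problem can be sketched in a few lines of Python with the standard-library `xml.etree.ElementTree`. Both documents below are hypothetical and use placeholder text. The `count_stanzas` function stands in for software written against the first librarian's vocabulary.

```python
import xml.etree.ElementTree as ET

# Same poem, two invented vocabularies: <stanza> versus <verse>.
LIBRARIAN_A = """<poem>
  <stanza><line>Placeholder line one</line></stanza>
  <stanza><line>Placeholder line two</line></stanza>
</poem>
"""

LIBRARIAN_B = """<poem>
  <verse><line>Placeholder line one</line></verse>
  <verse><line>Placeholder line two</line></verse>
</poem>
"""


def count_stanzas(xml_text):
    """Software that only knows librarian A's <stanza> tag."""
    return len(ET.fromstring(xml_text).findall("stanza"))


print(count_stanzas(LIBRARIAN_A))  # prints 2
print(count_stanzas(LIBRARIAN_B))  # prints 0
```

Both documents are perfectly well-formed XML and both parse without error. The second one simply yields a wrong answer, because nothing tells the program that `verse` means the same thing as `stanza`.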

This is obviously a problem, and technology offers a solution: DTDs, or Document Type Definitions. A DTD is how one can specify which tags are valid within a specific type of XML document, e.g. a Poem document. It describes how the tags can be nested and what attributes they can carry. What the tags mean is spelled out in the documentation that accompanies the DTD. When an XML document is combined with a DTD, software engineers can develop programs that produce Poem documents that can be consumed by any other program that implements the same DTD. Authors can be confident that the documents they write by hand can be consumed by software if they follow the DTD, and if they’re presented with a document and its documented DTD, they can understand the meanings of the tags used. A DTD and its documentation bridge the gap and give us the fixed vocabulary combined with meaning that XML in general does not.
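Here is a rough sketch of what a DTD for a Poem document could look like, embedded in the document itself. The DTD is my own guess at a reasonable definition, not one from the assignment. The expat parser in Python's standard library does not validate documents, so the hypothetical `undeclared_elements` helper below only reads the element declarations and reports any tag used without being declared. A real validating parser would also enforce the nesting rules.

```python
import xml.parsers.expat

# A poem with an internal DTD: a poem holds one or more stanzas,
# a stanza holds one or more lines, and a line holds text.
POEM_XML = """<?xml version="1.0"?>
<!DOCTYPE poem [
  <!ELEMENT poem (stanza+)>
  <!ELEMENT stanza (line+)>
  <!ELEMENT line (#PCDATA)>
]>
<poem>
  <stanza>
    <line>Placeholder line one</line>
    <line>Placeholder line two</line>
  </stanza>
</poem>
"""


def undeclared_elements(xml_text):
    """Return the tags used in the document that its DTD never declares."""
    declared = set()
    used = set()
    parser = xml.parsers.expat.ParserCreate()
    # Called once per <!ELEMENT ...> declaration in the DTD.
    parser.ElementDeclHandler = lambda name, model: declared.add(name)
    # Called once per opening tag in the document body.
    parser.StartElementHandler = lambda name, attrs: used.add(name)
    parser.Parse(xml_text, True)
    return used - declared


print(undeclared_elements(POEM_XML))
```

For the document above the result is an empty set. Swapping the `stanza` tags in the body for `verse` makes the helper report `verse`, which is the kind of mismatch a shared DTD lets software catch up front.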

But at the end of the day, when you have XML+DTD to describe a specific type of document, you end up with something very similar to HTML: a fixed vocabulary designed to represent a document of a specific type through the use of meaning-assigned tags. Without a DTD, XML is worthless for conveying meaning and for interoperability. With a DTD, it has most of the same drawbacks that HTML does, plus it requires writing custom software to use it, while HTML is ubiquitously supported. So is XML really better than HTML after all?