MU compared with ...

I have been outlining a hypothetical alternative to XML that I am calling MU. In this note I compare MU to some other mark-up notations.

MU compared with HTML

MU is intended to generalize what is a reasonable HTML-style document. The intention is that, with the addition of a standard MUD file and style sheets, a generic MU processor would be able to grok a large subset of extant HTML documents. Obviously there will still be documents that are not well-formed MU any more than they are well-formed HTML, and processors may still need to do some second-guessing.

There is no !DOCTYPE declaration, and no SGML-style DTD. The equivalent is the mud attribute of the media type. As a special case, we might define a text/html+mu media type so that

text/html+mu; charset=UTF-16

is the same as

text/mu; charset=UTF-16; mud=""

where the specified MUD file contains character and no-range tags sections that define the HTML syntax, view sections naming appropriate CSS style sheets, and a features list including all the HTML modules.
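The equivalence above can be sketched as a small rewriting step, here in Python. This is only an illustration: the function name expand_html_mu and the html_mud_url parameter are hypothetical, and the location of the standard HTML MUD file is left to the caller because it is not given here.

```python
def expand_html_mu(media_type, html_mud_url):
    """Rewrite a text/html+mu media type as the equivalent text/mu one.

    html_mud_url is a hypothetical parameter: the address of the MUD file
    that describes HTML. Any other media type is returned unchanged.
    Splitting on ';' is naive and assumes no quoted parameter contains one.
    """
    base, *params = [part.strip() for part in media_type.split(";")]
    if base.lower() != "text/html+mu":
        return media_type
    # Existing parameters (such as charset) are kept; mud is appended.
    params.append(f'mud="{html_mud_url}"')
    return "; ".join(["text/mu", *params])
```

Calling expand_html_mu("text/html+mu; charset=UTF-16", url) keeps the charset parameter and adds a mud parameter pointing at whatever url the caller supplies.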

Unlike HTML, there is no attempt to define the default charset as anything other than US-ASCII, thus avoiding conflict with the MIME RFCs. (On the other hand, if the document has no reported charset then processors' behaviour is undefined, and may include attempts to glean the charset from analysis of the byte patterns.)

MU documents can contain namespace-prefixed tags (declared in the MUD), but the prefix is only required when there is a name collision; tags without a prefix are taken as being in whichever namespace allows them to be recognized by one of the activated features. This allows for extensions (such as Apple's canvas tag) without having every other tag adorned with prefixes.
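One way a processor might apply that rule is sketched below in Python. Everything concrete here is an assumption made for illustration: the prefix:name spelling of a prefixed tag, the function name resolve_tag, and the shape of the namespaces and active_features arguments are not part of any MU definition.

```python
def resolve_tag(name, namespaces, active_features):
    """Return (namespace, local_name) for a tag name.

    namespaces: prefix -> namespace identifier, as declared in the MUD.
    active_features: list of (namespace, set_of_tag_names) pairs, one per
    activated feature. Both structures are hypothetical.
    """
    if ":" in name:
        # Assumed spelling of a prefixed tag: prefix:local
        prefix, local = name.split(":", 1)
        return namespaces[prefix], local
    # Unprefixed: find every namespace whose active feature knows the tag.
    matches = {ns for ns, tags in active_features if name in tags}
    if len(matches) == 1:
        return next(iter(matches)), name
    if not matches:
        raise LookupError(f"no active feature recognises {name!r}")
    raise LookupError(f"{name!r} is ambiguous; a prefix is required")
```

With one feature for an HTML vocabulary and one for an extension that defines canvas, an unprefixed canvas resolves to the extension's namespace, while a tag name defined by two active vocabularies raises an error until the author adds a prefix.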

MU would prefer that document metadata -- such as title, links, and suchlike -- live externally to the document data. In an HTTP or MIME context, the natural place for title and suchlike is the headers; on disc there would be a META resource containing the same information. In particular, a document should not be an authority for its own encoding; that leads to confusion. For HTML compatibility, there would have to be features that can be named in the MUD file to activate embedded metadata.

Many HTML processors use 'DOCTYPE switching' to select different quirks to match the bugs in other programs. With MU, you are supposed to be able to use the features section of the MUD file to assert the need for different processing models or whatever.

MU compared with XML

MU does not have a !DOCTYPE declaration, a PUBLIC identifier for its DTD, or a DTD. The functions of the XML DTD are:

  • defining default attributes (MU does not have default attribute values);
  • some checks for document validity (MU would use external validators, in the style of Relax NG and XSD);
  • defining entities (MU restricts entities to character entities and can define them in the MUD file).

In addition, SGML DTDs can alter the syntax by defining some element-types as empty; MU does that through the MUD file.

MU has namespaces, but they are defined in the MUD file rather than in-line (saving on typing, if nothing else). It also treats unqualified tag-names differently, allowing them to be in whichever vocabulary makes most sense. Some of the functions XML uses namespaces for are achieved through feature declarations in the MUD.

MU does not have XML's encoding pseudo-attribute; instead MU requires that external metadata be used to establish the conversion from bytes to characters. My META proposal is intended to shore up those operating systems that have problems in this regard.

MU does not have NOTATION definitions.

XML models the document as an ordered tree of nodes, the leaf nodes being text nodes or empty elements. MU considers a document to be a character sequence, with some ranges (possibly overlapping) tagged with optional attributes. Where XML is a semi-plausible format for exchanging non-document datasets (such as serialized database tables, RSS, and XML-RPC), in the world of MU, one might prefer YAML for those tasks.
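To make the contrast with a tree concrete, here is a minimal Python sketch of the range-based model. The class and field names (Range, Document, tags_at, partly_overlap) are hypothetical, chosen only for this illustration.

```python
from dataclasses import dataclass, field


@dataclass
class Range:
    tag: str
    start: int  # offset of the first character covered
    end: int    # offset one past the last character covered
    attrs: dict = field(default_factory=dict)  # optional attributes


@dataclass
class Document:
    text: str                                   # the character sequence
    ranges: list = field(default_factory=list)  # tagged ranges, in any order

    def tags_at(self, offset):
        """Names of all tags whose range covers the given offset."""
        return [r.tag for r in self.ranges if r.start <= offset < r.end]


def partly_overlap(a, b):
    """True when two ranges cross without either containing the other."""
    return a.start < b.start < a.end < b.end or b.start < a.start < b.end < a.end


doc = Document("The quick brown fox")
em = Range("em", 4, 15)           # covers "quick brown"
strong = Range("strong", 10, 19)  # covers "brown fox"
doc.ranges += [em, strong]
```

Here the em and strong ranges cross each other, which a tree of elements cannot express directly; in this model they are simply two entries in a flat list over the same text.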

MU compared with LMNL

LMNL is another mark-up language firmly rooted in the concept of marking up text. Like MU, tags mark ranges that may overlap. LMNL has a properly worked out processing model that MU would steal wholesale. LMNL uses a different microsyntax, with start tags written [par} and end tags {par]. This neatly addresses one of the annoyances of SGML-derived languages, the one-character difference between <foo> and </foo>. It also gives LMNL an unambiguous empty-range notation: [img].

LMNL has a much more powerful attribute syntax -- each annotation's value is potentially a LMNL text layer, whereas MU, like XML and HTML, limits attribute values to unstructured character data. I believe that all MU documents can be trivially converted to LMNL, and LMNL documents with simple enough attributes can be transformed to MU.
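As a rough illustration of the MU-to-LMNL direction, the Python sketch below writes a character sequence plus a list of tagged ranges using the [tag}, {tag] and [tag] forms described above. It rests on several assumptions: the function name to_lmnl and the (tag, start, end) tuple shape are hypothetical, attributes are ignored, and no escaping is done for text that itself contains brackets or braces.

```python
def to_lmnl(text, ranges):
    """Serialize text with tagged ranges using LMNL-style tags.

    ranges: iterable of (tag, start, end) with end exclusive.
    A range with start == end is written as an empty range, [tag].
    """
    events = {}
    for tag, start, end in ranges:
        if start == end:
            events.setdefault(start, []).append((1, f"[{tag}]"))
        else:
            events.setdefault(start, []).append((2, f"[{tag}}}"))
            events.setdefault(end, []).append((0, f"{{{tag}]"))
    out = []
    for offset in range(len(text) + 1):
        # At one offset: end tags first, then empty ranges, then start tags.
        for _, markup in sorted(events.get(offset, []), key=lambda e: e[0]):
            out.append(markup)
        if offset < len(text):
            out.append(text[offset])
    return "".join(out)
```

Because ranges are allowed to overlap, the serializer never has to reorder or split tags to force nesting; it only emits each tag at its own offset.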


2004-08-06: Added sections on XML and LMNL