This is a continuation of my pointless musing about a hypothetical alternative to XML called MU.
The importance of metadata
A MU instance cannot stand entirely alone -- no text file can, because you need to know the encoding used to convert the character data to a the sequence of bytes actually stored on disc or transmitted over the network. (As discussed in the previous entry, the MU data does not describe its own encoding.)
I will describe two ways to store metadata for MU documents, one using
MIME media-type parameters and the other as a separate resource in a
format called MUD. A MUD-savvy web servers will use the metadata in
the MUD to construct the Content-type
headers.
MIME-type parameters
In the fantasy world in which MU exists, the Internet media-type for
MU is text/mu
, with variations like text/html+mu
permitted in a manner similar to the +xml
convention.
There is some discussion as to where XML data should be
application/xml
or text/xml
. I am pretty sure that the reasons
for preferring the former is that (a) not all XML data is supposed to
be a document for reading by human beings, and (b) media-types
starting with text/
are subject to transcoding, and that would break
XML documents that mention their own encoding. MU is not intended as
a general-purpose serialization format, and can be transcoded (because
the program doing the transcoding will emend the metadata), so I think
that makes text/mu
OK. I might be wrong, though.
The charset
parameter specifies how to convert the bytes in to
characters. Thus one can have things like
text/mu; charset=ISO8859-1
text/mu; charset=UTF-8
text/mu; charset=UTF-16LE
And for compatibility with other RFCs I am too lazy to look up,
- an omitted
charset
means the MU document is in strict US-ASCII; - if the charset is UTF-16, then the data may be (a) a byte-order mark 0xFF 0xFE followed by UTF-16LE data, (b) a BOM 0xFE 0xFF followed by UTF-16BE data, or (c) UTF-16BE data with no BOM.
The first point diverges from HTML, which vaguely assumes ISO 8859-1
in the absence of a charset but then requires increasingly
hair-raising guesses depending on the byte values the parser
encounters. But MU documents must always be accompanied by a valid
media-type, and that MUST include a charset
if the encoding is not
US-ASCII.
Another parameter is mud
. This gives one or more URIs for MUD
resources to be used when interpreting this MU resource. This is a
list of one or more URIs separated by whitespace. For example,
text/mu; charset=UTF-8; mud="http://example.org/muds/formz.mud"
Note that MUDs for popular formats will of course be cached, so the standard URI should be used wherever possible.
MUD units
In this document I am not describing MUD features in complete detail as would be required by a real spec. The underlying format of the MUD unit is YAML. A formal specification of MUD, if such a thing extsited, might define a syntax that is a subset or profile of the full YAML syntax. For example:
media type: text/html+mu charset: UTF-8 features: - tag:alleged.org,2004:mu:link/1.0 - tag:alleged.org,2004:mu:object/1.0 namespaces: html: http://www.w3.org/1999/xhtml dc: http://purl.org/dc/1.0
The media type
property is MIME type and subtype, sans parameters.
The key charset
gives the character encoding of the MU. A MUD-savvy web
server would use these to generate a complete MIME media-type for the
MU resource, with a mud
parameter pointing at the MUD unit. Note
that the media type from the web server takes precedence over any
media type
and charset
parameters in the MUD (to allow for
transcoding).
The connection of MU to MUD for files on a computer's file system will
depend partly on the underlying operating system. On a Macintosh with
HFS, the MUD unit might conceivably be a MUD!
resource of the MU
document. Ditto for other operating systems which allow multipart
files natively -- although if I were to point out that this includes
NTFS it would probably suprise a lot of Windows NT developers!
Otherwise processors could automatically expect that a document
foo/bar.mu
will have a MUD resource available as foo/bar.mud
, or,
failing that, foo/DEFAULT.mud
.
MUD imports
The key imports
has as its value a list if URIs, each on a separate
line, introduced with a hyphen-space (YAML list syntax). For example:
imports: - http://www.example.org/2004/html.mud - http://www.example.org/2004/formz.mud - http://www.example.org/2004/mathml.mud
Imports means reading in the extra MUD files and merging their properties.
The key base
can be used to specify the base for partial URIs (same
idea as xml:base
).
The use of imports allows for a MUD embedded in a file (as a Macintosh
resource, say) to be very short, limited to a charset
and an
imports
parameter).
Namespaces
The purpose of namespaces is to allow sets of tags defined by
different organizations to be mixed without prior arrangement.
Namespaces are not intended to be used to indicate the
interpretation of the tags; for that, see the features
property
described below.
The namespaces
parameter is a mapping from prefixes to URIs. In the
MUD this is represented as namespace-colon-URI pairs, one per line,
indented (the YAML syntax for maps). For example:
namespaces: html: http://www.w3.org/1999/xhtml form: tag:example.org,2004:formz
Each of these maps a namespace prefix to the subject indicator for its namespace. No implication is made that there is a downloadable resource at this URI; they are used merely to supply a globally unique identifier.
There is no default namespace. Tags with no prefix are promiscuous so far as special meanings (specified by features, below) are concerned.
MU namespaces are simpler than XML namespaces in a couple of ways: they are specified outside the MU document (not on individual elements), and so apply uniformly throughout the document. A MU editor that is assembling a document from arbitrary fragments can collect together their namespaces and if necessary munge the prefixes to make them unique.
Mixed-namespace documents should not be needed as much with MU as with
XML, because MU does not need to use namespaces to trigger special
features. For example, MU can use href
for links (using the
features
property described below), rather than having to have
xlink:href
or xml:href
as separately namespaced attributes.
Also, inclusion of data in different formats is expected to be done through links rather than in-line.
Features
The key features
introduces a list of URIs that are subject
indicators for features required in the MU processor to correctly
render this document.
The term subject indicator
means a URI that does not refer to actual
downloadable resources, but is used as a token for a software feature
the processor must support to make sense of this document. The URIs
may well be compiled in to the plug-in implementing the feature.
For example, we might define the URI
tag:alleged.org.uk,2004:mu:link/1.0
to mean that href
and src
attributes have the same meanings as xml:href
and xml:src
in
Skunklink.
Features will generally be associated with a namespace, but that does not mean that the tags that are recognized by a feature must always have a namespace prefix: tags are offered to all feature implementations and features should match tags (a) with their namespace prefix, and (b) with no prefix.
For example, we might define a switch
tag in a namespace
tag:alleged.org,2004:mu:switch
and include the URI
tag:alleged.org,2004:mu:switch/1.0
amongst the features.
This allows us to have a MU fragment like the following:
<p> This is HTML with an image: <switch> <img feature="tag:alleged.org,2004:mu:svg/1.0" src="foo.svg"/> <img feature="tag:alleged.org,2004:mu:png/1.0" implementation="msie/5.5" src="foo-noalpha.png"/> <img feature="tag:alleged.org,2004:mu:png/1.0" src="foo.png"/> <span>FOO</span> </switch> And on with the text. </p>
The p
, span
and img
tags are recognized by the HTML feature,
say, so implicitly in the http://www.w3.org/1999/xhtml
namespace,
whereas the switch
tag, and feature
and implementation
attributes are recognized by the switch feature and are implicitly in
its namespace.
The idea here is to allow modularization of HTML but to only require namespace prefixes when there is ambiguity. This reduces the load on the MU author, which is just as well because we have seen that most HTML developers are terrified of XML namespaces!
If a tag is not recognized by any feature, then the default behaviour is to strip the tags and display its content. This includes the case where a feature is not supported by the application.
Finally, one of the feature URIs would be (say) tag:alleged.org,2004:css/3
,
and would mean that CSS 3 is used to describe how tags are displayed
(in the absence of other applicable features). I imagine that much of
the HTML compatibility could be expressed in CSS 3.
Title
Does this belong in the MUD or the MU? I think the MUD but this implies that every document needs a separate MUD resource.
Links
Links to other resources. Each link is itself a mapping:
links: - rel: stylesheet type: text/css href: foo.css - rel: alternate type: application/atom+xml href: bar.atom
I think I have gone on about my hypothetical MUD for long enough now.