CAP is the OASIS Common Alerting Protocol, a specification of an XML format for disseminating warnings of hurricanes, earthquakes, and suchlike. The CAP v1.1 format is mandated by the European R&D project I am working on. This is an inconvenience, because CAP is a badly flawed XML standard. I am going to discuss here some of the problems I have had with message identity as defined by CAP.
Message Identity
We want to be able to analyse the content of a thread of messages, meaning an alert message indicating an incident has occurred, and update messages for that incident. For that we need to be able to store and retrieve messages, which may arrive from a variety of sources, and for that they need identifiers that are globally unique.
Sadly for us, there is no guaranteed universal, unique identifier for CAP messages. This makes aggregation difficult.
Failed to use URIs
There is an element called identifier, which the standard says is
A number or string uniquely identifying this message, assigned by the sender
There is no central repository of identifiers, so nothing guarantees that different senders will not coincidentally choose the identifier msg0045. So the intention must be that identifier is unique in the scope of a namespace maintained by the sender.
This is OK so long as the sender is uniquely identified: then the combination of sender name and message identifier would be unique. But the sender element is described as
Identifies the originator of this alert. Guaranteed by the assigner to be unique globally; e.g., may be based on an Internet domain name
How are assigners (presumably the same as senders) to guarantee their sender ids are unique? There is no registration process described. Saying it may be ‘based on an internet domain name’ is not good enough, although in practice the examples they give are
KARO@CLETS.DOJ.CA.GOV
trinet@caltech.edu
KSTO@NWS.NOAA.GOV
hsas@dhs.gov
These all match the pattern of RFC 2822 addresses and message-ids, which would not have been an entirely unreasonable approach to making unique identifiers, but only if they had made this format mandatory. Doing so would have represented good practice if this standard had been drafted in 1985.
On the other hand, it is no longer 1985. In 2005 we have the World-Wide Web, founded on universal resource locators (URLs). There are plenty of precedents for using the URL syntax for universal resource identifiers (URIs), which make a better alternative for universally unique identifiers than any application-specific identifier scheme: URIs satisfy uniqueness and universality automatically, so anyone owning a domain name can mint their own URIs that are guaranteed not to clash with anyone else's. In the case of CAP, a better approach would have been to require senders and messages to be identified by URI, or at least to have senders identified by URI and messages assigned ids of a form suitable for combining with the sender URI to make a message URI.
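As a minimal sketch of that last option, assuming the sender is identified by an HTTP URI and that message URIs live under a "messages/" path beneath it: the function name, the path layout, and the example.org and example.net URIs are all my own inventions for illustration, not anything CAP defines.

```python
from urllib.parse import quote


def message_uri(sender_uri: str, identifier: str) -> str:
    """Combine a sender URI with a sender-scoped identifier (hypothetical scheme)."""
    # Make sure the sender URI ends with a slash before appending to it.
    base = sender_uri if sender_uri.endswith("/") else sender_uri + "/"
    # Percent-encode the identifier so it cannot introduce extra path segments.
    return base + "messages/" + quote(identifier, safe="")


# Two senders that both pick "msg0045" still end up with distinct message URIs.
first = message_uri("https://alerts.example.org/senders/ksto", "msg0045")
second = message_uri("https://alerts.example.net/senders/hsas", "msg0045")
print(first)
print(second)
```

The point of the sketch is only that uniqueness is inherited from the sender's domain name, so the sender is free to choose identifiers however it likes.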
References messages using datetime
To make matters worse, when they do want to identify messages -- such as in the references element -- they use a comma-separated triple of the form sender,identifier,sent where sender and identifier are as previously discussed and sent is
(1) The date and time is represented in [dateTime] format ... (2) Alphabetic timezone designators such as "Z" MUST NOT be used.
by which they mean the format specified in XML Schema except that the canonical format (UTC, indicated with a Z suffix) is forbidden.
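To show what a recipient has to do with such a triple, here is a rough sketch of a parser. It assumes that neither sender nor identifier contains a comma, and that the sent value is in a form Python's datetime.fromisoformat accepts; MessageRef and parse_reference are names I made up for the example.

```python
from datetime import datetime
from typing import NamedTuple


class MessageRef(NamedTuple):
    """One sender,identifier,sent triple (hypothetical helper type)."""
    sender: str
    identifier: str
    sent: datetime


def parse_reference(triple: str) -> MessageRef:
    # Assumption: sender and identifier never contain commas.
    parts = triple.strip().split(",")
    if len(parts) != 3:
        raise ValueError("expected sender,identifier,sent: %r" % triple)
    sender, identifier, sent_text = parts
    # CAP forbids the alphabetic "Z" designator, so reject it explicitly.
    if sent_text.upper().endswith("Z"):
        raise ValueError("alphabetic timezone designator not allowed: %r" % sent_text)
    sent = datetime.fromisoformat(sent_text)
    # A timestamp without an offset does not identify a moment in time.
    if sent.tzinfo is None:
        raise ValueError("timestamp has no UTC offset: %r" % sent_text)
    return MessageRef(sender, identifier, sent)


ref = parse_reference("KARA@CLETS.DOJ.CA.GOV,KAR0-0306113339-SW,2003-06-11T22:39:00-07:00")
print(ref.sender, ref.identifier, ref.sent.isoformat())
```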
The first thing to say here is to repeat the point that inventing your own system of resource identifiers is always less desirable than simply specifying that URIs should be used.
The second thing is that earlier we were told that the message identifier was unique in the scope of the sender; now we are told that identifiers must be further qualified with a timestamp. I guess we must infer that the same identifier might be reused with different timestamps on modified versions of the same message. In other words, this implies that messages are mutable, which complicates matters when aggregating messages. It also raises philosophical questions, such as whether two updates to the original alert are modifications of the same message (so should share an identifier) or separate messages in their own right. This issue is simply not addressed in the spec.
The next question is, are these the same message reference?
KARA@CLETS.DOJ.CA.GOV,KAR0-0306113339-SW,2003-06-11T22:39:00-07:00
KARA@CLETS.DOJ.CA.GOV,KAR0-0306113339-SW,2003-06-12T05:39:00+00:00
I would say yes, since they refer to the same moment in time. But this means that we cannot rely on string comparisons to compare message references, because we have to parse the timestamp portion. The standard could have mandated that the timestamp always be passed on verbatim, or always be expressed in some canonical form (e.g., UTC), but is silent on the subject. Will timestamps be altered by intermediaries? It seems likely, especially as most of my colleagues want to use XML schema to decode messages into data structures before dealing with them -- so the sent element will be parsed into a DateTime object, and re-formatted when the message is passed on. The DateTime class in Microsoft .NET converts timestamps to what it considers to be the local time zone of the computer it happens to be on. This is bad if the computer in question is a web server with a global audience. Similar things go wrong when you save a timestamp in a database and retrieve it, or format a timestamp using JavaScript, and so on. Also, timestamps obtained from DateTime.Now are recorded in decimicroseconds (units of 100 nanoseconds), which means they can differ from the timestamp retrieved from the database even if they look identical when printed out.
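One way a recipient might cope is to normalise every reference before comparing it. The sketch below is an assumption on my part rather than anything the standard prescribes: it converts the timestamp to UTC and drops sub-second precision, and the function names are hypothetical.

```python
from datetime import datetime, timezone
from typing import Tuple


def canonical_reference(reference: str) -> Tuple[str, str, str]:
    """Return (sender, identifier, sent) with sent normalised to UTC."""
    # Assumption: sender and identifier never contain commas.
    sender, identifier, sent_text = reference.strip().split(",")
    sent = datetime.fromisoformat(sent_text)
    if sent.tzinfo is None:
        raise ValueError("timestamp has no UTC offset: %r" % sent_text)
    # Assumption: whole seconds are precise enough to identify a message,
    # so any fractional part added by an intermediary is discarded.
    sent_utc = sent.astimezone(timezone.utc).replace(microsecond=0)
    return sender, identifier, sent_utc.isoformat()


def same_reference(a: str, b: str) -> bool:
    # Compare normalised triples rather than the raw strings.
    return canonical_reference(a) == canonical_reference(b)


local = "KARA@CLETS.DOJ.CA.GOV,KAR0-0306113339-SW,2003-06-11T22:39:00-07:00"
utc = "KARA@CLETS.DOJ.CA.GOV,KAR0-0306113339-SW,2003-06-12T05:39:00+00:00"
print(local == utc)                # False: the strings differ
print(same_reference(local, utc))  # True: both name the same moment
```

Every recipient has to make the same normalisation choices for this to work, which is exactly the kind of thing the standard should have settled.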
The Path Not Taken
The specification could start by discussing the relationship between the resources it discusses (the information about some event) and the representations of that resource (the XML messages sent to recipients). If the resource is mutable, so there can be updated representations in play, then the representations should have the same URI but different last-modified timestamps; there will be no need to list the old versions within the message, as the recipient can simply store the most-recent representation and discard the others.
For a given incident there may be several messages that cover different aspects of the event. For example, a hypothetical sequence:
- Storm predicted at future date xxx in area yyy
- Storm now predicted at future date xxx in areas yyy and zzz
- Storm imminent; take shelter as described here ...
- Storm has passed, extra resources needed
- Summary of storm recovery
These are still what people would call ‘updates’, but at another level they are all separate resources, and hence have separate URIs. A recipient that wants a full picture will need to remember the latest version of each message in the set. Someone who only wants to know the current status will discard all messages except the last.
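Under this model the recipient's bookkeeping becomes very simple. The following is only a sketch of the idea, assuming each representation arrives with a URI and a last-modified timestamp; the LatestStore class, its method names, and the example.org URIs are hypothetical.

```python
from datetime import datetime, timezone
from typing import Dict, Optional, Tuple


class LatestStore:
    """Keeps only the newest representation seen for each resource URI."""

    def __init__(self) -> None:
        self._latest: Dict[str, Tuple[datetime, str]] = {}

    def receive(self, uri: str, last_modified: datetime, body: str) -> bool:
        """Store the representation unless we already hold one at least as new."""
        current = self._latest.get(uri)
        if current is not None and current[0] >= last_modified:
            return False
        self._latest[uri] = (last_modified, body)
        return True

    def full_picture(self) -> Dict[str, str]:
        """Latest version of every message in the set."""
        return {uri: body for uri, (_, body) in self._latest.items()}

    def current_status(self) -> Optional[str]:
        """Only the most recently modified message, or None if the store is empty."""
        if not self._latest:
            return None
        _, body = max(self._latest.values(), key=lambda entry: entry[0])
        return body


store = LatestStore()
forecast = "https://alerts.example.org/storm/forecast"
shelter = "https://alerts.example.org/storm/shelter"
store.receive(forecast, datetime(2005, 9, 1, 1, tzinfo=timezone.utc), "area yyy")
store.receive(forecast, datetime(2005, 9, 1, 3, tzinfo=timezone.utc), "areas yyy and zzz")
store.receive(shelter, datetime(2005, 9, 1, 4, tzinfo=timezone.utc), "take shelter")
print(store.full_picture())
print(store.current_status())
```

No references element is needed here: the URI says which resource a message belongs to, and the timestamp says which representation is newest.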
Conclusion
Identity of messages in CAP is fuzzy and uses date-time values to distinguish between messages, which makes processing them needlessly complicated. Modern web standards should instead define when two representations (in this case, alert messages) can be different views of the same resource, or are different enough to represent distinct resources (with unique URIs).