Damian Cugley’s Weblog
Feel Mark Pilgrim’s distress at the excision of cite from XHTML 2.0’s Text module. The irony is that cite is one of the ‘semantic’ tags (‘logical’ tags, as they used to be called) that is actually used and supported by web browsers. Meanwhile fossils like dfn, kbd and samp are retained.
In proper English-language typography, italics are used for various purposes:
- Citing things like movies and books (The Fellowship of the Ring, 2001: A Space Odyssey), but not, for example, short stories (‘The Sentinel’);
- Names of ships (HMS Beagle, HMS Endeavour), but not, for example, pubs (The Beagle and Hounds);
- Foreign tags (ad nauseam, bête noire), but not when they have become English words in their own right (café, ångström);
- Words and letters mentioned rather than used (‘the word complex is often confused with complicated’, ‘mind your ps and qs’);
- Terms being introduced for the first time* (‘we use equivalence relation to mean a relation that is reflexive, symmetric, and transitive...’);
- Words and letters used as identifiers in mathematical work† (x, y, α, β), with special exceptions for some standard functions like sin and cos;
- Ditto for writings about computing by authors who think of computing as being related to maths (gcd(a, b), shortest_path, CLUNK); and
- To indicate emphasis.
The implication of the XHTML 2 draft is that all of the above actual, real-people uses of mark-up only deserve a single tag, em. If we want to have a single element whose semantic, logical, Platonic ur-meaning is ‘text that is printed in italics’, why not just use i and save us some typing?
Meanwhile there are separate XHTML tags for several esoteric usages that exist only in computer literature, and in fact only in computer manuals: code, samp, kbd, and (in some interpretations) var. Now, even computer-literate types are inconsistent in their use of a special typeface to distinguish ‘computer text’ from other text. For identifiers in programs one might argue that italics works nicely and is easier to read (Bjarne Stroustrup uses italics rather than typewriter in the third edition of The C++ Programming Language). Few, if any, find the time to distinguish between typing foo on the keyboard, the character sequence foo it generates, the program fragment foo and the variable foo it is parsed as. And if you are in that position, you probably ought to be using DocBook instead...
I’d wager that even back in the dawn of the WWW, non-computer-related text dominated the Web, starting with those particle-physics databases and the IMDB. The HTML features designed to support computer manuals are a fossil, left over from when the HTML vocabulary was lifted from GNU Texinfo (or something closely related thereto).
Idle speculation
There are precedents for formatting that which is normally italicized differently. The Texinfo conventions for cite, em, and var are _cite_, *em*, and VAR. Donald E. Knuth in The TeXbook distinguishes citation from emphasis, using oblique type for the former and true italic for the latter (the mad fool).
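To make the Texinfo conventions concrete, here is a minimal sketch in Python of a plain-text renderer for the three elements. The names RENDERERS and render_plain are my own invention for illustration, not part of Texinfo or any library, and the regular expression assumes simple, non-nested elements without attributes.

```python
import re

# Plain-text renderings following the Texinfo conventions quoted above.
# RENDERERS and render_plain are hypothetical names for this sketch.
RENDERERS = {
    "cite": lambda text: f"_{text}_",   # citation: underscores
    "em": lambda text: f"*{text}*",     # emphasis: asterisks
    "var": lambda text: text.upper(),   # variable: upper case
}

# Assumption: elements are not nested and carry no attributes.
_TAG = re.compile(r"<(cite|em|var)>(.*?)</\1>", re.DOTALL)


def render_plain(markup: str) -> str:
    """Replace cite, em and var elements with their plain-text form."""
    return _TAG.sub(lambda m: RENDERERS[m.group(1)](m.group(2)), markup)


if __name__ == "__main__":
    print(render_plain("Read <cite>Emma</cite> <em>now</em>, all <var>n</var> pages."))
```

The point of the sketch is only that three elements which a browser might all show in italics can be told apart once the output medium has no italics.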
For what it is worth, if I were king of XHTML for a day I would retain cite. Would it also be appropriate to extend it to other names of things like ship names? Many of the above uses of italics are really a form of quotation; they could use q, which after all has never been very successful at supplying quotation marks:
Strunk abhorred the phrase <q>student body</q> and suggested <q>studentry</q> instead.
producing something like
Strunk abhorred the phrase student body and suggested studentry instead.
In print this would be set with italics, but you can see how it could just as easily have used quotation marks. (In principle the use of typewriter text for computery stuff is also a form of quotation and could arguably be a variation on q!)
Actual reported speech would use actual marks of quotation, which in British tradition are ‘ and ’ (‘…’), and in American “ and ” (“…”).
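As a rough sketch of how a renderer might pick quotation marks by tradition, assuming only the two conventions just mentioned: the table QUOTE_MARKS and the function quote are hypothetical names of mine, not anything a browser or specification defines.

```python
# Opening and closing marks for the two traditions described above.
# QUOTE_MARKS and quote are hypothetical names for this sketch.
QUOTE_MARKS = {
    "british": ("‘", "’"),   # single curly quotes
    "american": ("“", "”"),  # double curly quotes
}


def quote(text: str, tradition: str = "british") -> str:
    """Wrap text in the quotation marks of the given tradition."""
    opening, closing = QUOTE_MARKS[tradition]
    return f"{opening}{text}{closing}"


if __name__ == "__main__":
    print(quote("studentry"))
    print(quote("studentry", "american"))
```

An unknown tradition raises a KeyError here; a real implementation would need some fallback, and would also have to deal with nested quotations, which this sketch ignores.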
This leaves dfn for definitions, var for variable names and similar (metasyntactic variables, formal parameter names, and mathematical symbols). Oh, and em to indicate emphasis only. Hmm. This almost makes sense.
Conclusion
I think I am sticking with XHTML 1.0 for now. To be honest I am still chary of this new-fangled application/xhtml+xml media-type (I still haven’t found out why they want application rather than text).
I think that even if XHTML 2 is not intended to be backward-compatible with XHTML 1, it nevertheless should be rich enough that documents may be converted between formats without loss of information. Folding cite into em on the face of it violates that principle.
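One way to state the principle: a conversion loses information whenever two different source elements land on the same target element, because the reverse conversion cannot tell them apart. The sketch below checks a conversion table for such collisions. The table XHTML1_TO_XHTML2 is a hypothetical, partial mapping that I made up to reflect the folding described above; it is not taken from the draft itself.

```python
from collections import defaultdict

# Hypothetical, partial conversion table (XHTML 1 element -> XHTML 2 element),
# assuming cite is folded into em as discussed above.
XHTML1_TO_XHTML2 = {
    "em": "em",
    "cite": "em",
    "dfn": "dfn",
    "var": "var",
}


def collisions(mapping: dict) -> dict:
    """Return each target element that more than one source element maps to."""
    sources = defaultdict(list)
    for src, dst in mapping.items():
        sources[dst].append(src)
    return {dst: sorted(srcs) for dst, srcs in sources.items() if len(srcs) > 1}


if __name__ == "__main__":
    # Any entry reported here marks a distinction the conversion cannot round-trip.
    print(collisions(XHTML1_TO_XHTML2))
```

An empty result would mean the table is invertible on the elements it lists; here em collects both em and cite, so a document converted to the new vocabulary and back could not recover which was which.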
Footnotes
* The element dfn was lifted from Texinfo to cover this case, but was not supported by browsers (it was not shown italicized), so no-one uses it.
† The element var was originally introduced to cover the mathematical and metasyntactic use (being lifted straight out of the Texinfo conventions), but Microsoft Internet Explorer’s designers got it wrong and used the monospace font for var, in effect changing its meaning. The XHTML 2 description tends towards the latter interpretation.
Update (1 November 2006). Linked to from Cafe con Leche XML News and Resources. Corrected the spelling of bête.
Since I wrote this, the Firefox programmers have elaborated their implementation of the q element to generate language-sensitive quotation marks.
Joe Clark has more on we computer scientists’ promotion of computer-specific HTML elements like kbd over additions that might be useful in non-technical publishing in an article ‘How not to fix HTML’.