2. Gaps in the System
There is a lot of grief out there about Unicode, mainly because people don't understand (or don't like) the way Unicode has approached character encoding. This is counterpolemic, in a sense: I describe the approach Unicode takes to its allocated codepoints, and why the standard could not always afford to live up to that ideal—and why the Consortium usually looks askance at any requests to keep not living up to the ideal.
A guiding principle of Unicode is that the purpose of encoding is not to make the text look nice, but to represent its alphabetical (or other systemic) content accurately. Unicode characters are meant to be such that a search engine can retrieve text efficiently; having it look nice on the screen is a secondary criterion, and one that should be left to mechanisms other than the allocation of novel codepoints. Let me give some examples.
It's pretty regular in quality typesetting to have a ligature for f + i, which happens to be encoded in Unicode as U+FB01 LATIN SMALL LIGATURE FI, ﬁ. So you dutifully replace all instances of f i in your text with ﬁ, your text looks snazzy and frilly, and you get a warm fuzzy feeling inside.
If I am searching for the word fire, I am going to type into my search engine the characters f i r e, and I expect to get all matching words back out. So my search engine comes across your text, chock full of ﬁ. And ﬁ is not the same thing as f i: U+FB01 is distinct from U+0066 U+0069. So my search engine fails to recognise ﬁre as an instance of f i r e. This is obviously not a good thing.
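The mismatch can be seen directly in any programming language; here is a minimal Python sketch of the problem:

```python
# U+FB01 LATIN SMALL LIGATURE FI is a distinct codepoint from f + i
text = "The \ufb01re spread quickly."  # contains the ligature ﬁ, not f i

# A naive substring search for the plain-text word fails
print("fire" in text)  # False: U+FB01 does not match U+0066 U+0069
```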
There is a conflict between the content of the text, which involves two characters, and the presentation, which involves a single glyph. We could place the onus on the search engine, which means that any piece of software searching a text has to already know that U+FB01 needs to be mapped onto U+0066 U+0069. And that, of course, is not the only exception there is. Not by any means.
The alternative is to make content rather than presentation decisive. If you mean an f followed by an i, then your plain text (defined to mean: devoid of formatting and frummery) should contain just an f followed by an i. A search engine shouldn't have to deal with your tricks of fancy printing; and a search engine should always be able to retrieve the original plain text back out from your fancy printing. If you want to see the ligature ﬁ on the printed page, don't do so at the expense of corrupting the original text: have the program generating the printout, or the screen display, ligate the two characters underlying into the one glyph, but leave the underlying characters themselves alone and accessible.
The ideal of Unicode is the second scenario: characters are to be treated as Platonic ideals, as distinguishable emic units, which a text processor can deal with and traverse without needing to have a Master's degree in forensic graphology. As we'll see, Unicode has not been able to live up to that ideal when it comes to backward compatibility. It is in fact saddled with a huge list of equivalences (decomposition mappings), precisely the nightmare scenario I described above.
U+FB01 = U+0066 U+0069 is a compatibility decomposition. What this means is, Unicode knows that ﬁ corresponds to f i, but warns you that if you break ﬁ down to f i, you will end up losing some information—in this instance, the fact that the characters are ligated. If you're writing a search engine, of course, you couldn't care less about the ligation. Other decompositions are canonical; this means that as far as Unicode is concerned, they are fully equivalent. This equivalence obtains, for instance, for U+00E2 Latin Small Letter A With Circumflex, â, which canonically decomposes to U+0061 Latin Small Letter A + U+0302 Combining Circumflex Accent, â. Can you see the difference? Well neither can Unicode...
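These mappings are shipped with every modern programming environment; Python's standard unicodedata module, for one, exposes the distinction between the two kinds of decomposition directly. A minimal sketch:

```python
import unicodedata

# Compatibility decomposition: the <compat> tag flags lossy information (here, ligation)
print(unicodedata.decomposition("\ufb01"))  # '<compat> 0066 0069'
# Canonical decomposition: no tag; the two forms are fully equivalent
print(unicodedata.decomposition("\u00e2"))  # '0061 0302'

# NFKD applies compatibility decompositions; NFD applies only canonical ones
assert unicodedata.normalize("NFKD", "\ufb01") == "fi"
assert unicodedata.normalize("NFD", "\u00e2") == "a\u0302"
assert unicodedata.normalize("NFD", "\ufb01") == "\ufb01"  # NFD leaves the ligature alone
```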
The Unicode Consortium is determined not to add any further to that list; the list of equivalences costs operating systems programmers a lot of money, and they are understandably less than enthusiastic about doubling that list to cope with the niceties of Ugaritic or Mongolian—or Ancient Greek, for that matter. And as I describe below, this policy affects the shopping list of many a classicist.
The second example involves Cyrillic. Now, if I'm representing the T of English and French on computer, it makes no sense to encode them as different codepoints, right? They're both t's, they both belong to the same script and alphabet, they look identical, there is no reason to separate them. And imagine the horrors if you did. You'd need to launch a distinct search for Thomas in French texts and Thomas in English texts. If you were writing about Thomas Jefferson in French, would you use the French t or the English? What would you do with Thomas Aquinas? Clearly the prospect is absurd: French and English share a script (even if they don't share every single letter in the script), so they should also share the codepoints for that script.
Likewise, it would be silly to have distinct codepoints depending on the typographical style or font you use. Thomas written in plain, italics, boldface, small caps, Helvetica, Georgia, Comic Sans or Zapfino should be the same text, and retrieved by the same search string.
The case of Russian (and other Cyrillic) versus Serbian and Macedonian [Slavonic] т is celebrated, and an excellent illustration of how Unicode codepoints are Platonic ideals, independent of how they are realised in display. In plain style, Russian and Serbian т look identical. That is, of course, because they are identical: they are both U+0442 Cyrillic Small Letter Te. There's a catch when you go to italics, however. Serbian uses as its italic form a glyph that looks like a w with a bar over it; Russian uses a glyph that looks like an m. (The story is that Serbian is using older cursive characters, which were formerly also current in Russia, but have been abandoned there.) Chances are that your italic Cyrillic unicode font (if you have one) follows the Russian model: т. But if you're Serbian, the Russian italic form is just wrong, and you are perfectly entitled to want your Serbian text to look like Serbian.
So you could do one of two things.
You could make Serbian т a distinct codepoint from Russian т. The comparison with English and French should hopefully have made the point that this would be madness. There are five Cyrillic characters whose italics differ between the two languages; in only one does the plain form also differ. The other twenty-eight characters the two languages share are identical. Should the entire Cyrillic alphabet be duplicated for Serbian? Obviously not—and of course Serbian would not be the only language this kind of thing would happen in. Should just the five letters be duplicated? But then how would you know which of the two is the lower case equivalent of U+0422 Cyrillic Capital Letter Te, Т? Why should I need to do two searches to get Russian and Serbian references to Тито, but only one for Вук?
Clearly, a te is a te, whether Russian or Serbian, italic or plain. And the real solution is to display Serbian text in a Serbian font, with Serbian italic forms, and Russian text in a Russian font, with Russian italics. (Or put your Serbian into Latin script, which does happen; but this shouldn't happen just because you can't get proper Cyrillic italics...)
This of course is cold comfort to a Serb right now, since the technologies to tag HTML with language codes (so that the browser knows how to render the given character) are thin on the ground, and the fonts smart enough to adjust their italics according to language are even thinner. But the alternative is intolerable, and unfortunately, the Serbs will need to be patient for a couple of years, until the proper technologies are in place.
People hate it when they hear that. The most concerted attacks on Unicode have been from East Asia, since Unicode has undertaken a similar conflation of Japanese, Chinese and Korean characters, which may appear different in the different languages' writing traditions, but are considered to be the same characters. (This is distinct from the traditional vs. simplified characters of Chinese, which Unicode does separate out.)
Whether to unify or disunify characters is a matter of degree and convention, and decisions can be unmade. For example, Coptic and Greek have long been considered the same script, since they share a substantial number of characters. However, it has recently been proposed by the International Association for Coptic Studies that the Consortium disunify the two languages' scripts, with a Coptic alpha, vita, ghamma &c distinct from the Greek alpha, beta, gamma; and it seems likely that the disunification will take place. One of the major arguments is that Coptic does not appear in any style other than uncial, and Coptic in a normal Greek font would be unrecognisable; the difference is in fact exploited in texts with both Coptic and Greek, which differentiate the two by their typographic style. (The need for Coptic characters to fit into the style of existing Greek characters has meant that font designers have already been making the Coptic-specific characters less uncial than they should be.) In this instance, the tradition of display wins out over the logical parsimony of emic distinctions that Unicode holds to.
This is however not the normal way things happen; and the reason it is not is the fallout from the compatibility codepoints Unicode is already stuck with.
If Unicode were designed from scratch, and with no constraints imposed from existing standards and implementations, it would do more with less. It would not, for example, be stuck with 11,000 Korean codepoints when the job could in theory have been done with a few dozen. (After all, Hangul is supposed to be a wondrously logical script, and there's nothing logical about the Hangul repertoire becoming comparable in size to that of Chinese.) And it would not have a separate codepoint for every combination of a letter and a diacritic. There is no need for separate, precomposed á, é, í, ó, ú, ý codepoints; we should be able to make do with just a, e, i, o, u, y, and a single acute diacritic common to all of them, making á a character composed of two codepoints.
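The Hangul point can be seen in any Unicode library: each of those thousands of precomposed syllable codepoints decomposes algorithmically into two or three jamo drawn from a few dozen. A Python sketch, using the syllable 한 (U+D55C) as an example:

```python
import unicodedata

# U+D55C HANGUL SYLLABLE HAN decomposes into three jamo codepoints
syllable = "\ud55c"
jamo = unicodedata.normalize("NFD", syllable)
print([hex(ord(c)) for c in jamo])  # ['0x1112', '0x1161', '0x11ab']

# NFC composes the jamo back into the single precomposed syllable
assert unicodedata.normalize("NFC", jamo) == syllable
```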
Similarly, there should be no need for dozens of Greek characters, covering every possible combination of base letter and diacritic (and inevitably missing those combinations not appearing in standard forms of the language.) Greek (usually) has 24 letters, capital and lower case, and seven diacritics; having the base letters combine with codepoints for each diacritic, rather than having single codepoints for letter-diacritics combinations, would shrink the Greek repertoire from around 400 characters to around 50.
And this would be a boon to people processing texts. Say I'm doing a diacritics-insensitive search for αυτ, which I expect to match all of: αὔτανδρος, αὐτάρκης, ἀϋτέω, αὖτε, ταῦτα, ταύτῃ. It should be self-evident which of the two possible encodings you'd prefer to program a text editor to recognise:
Word      | Composing Diacritics                                   | Precomposed Diacritics
αυτ       | 03B1 03C5 03C4                                         | 03B1 03C5 03C4
αὔτανδρος | 03B1 03C5 0313 0301 03C4 03B1 03BD 03B4 03C1 03BF 03C2 | 03B1 1F54 03C4 03B1 03BD 03B4 03C1 03BF 03C2
αὐτάρκης  | 03B1 03C5 0313 03C4 03B1 0301 03C1 03BA 03B7 03C2      | 03B1 1F50 03C4 1F71 03C1 03BA 03B7 03C2
ἀϋτέω     | 03B1 0313 03C5 0308 03C4 03B5 0301 03C9                | 1F00 03CB 03C4 1F73 03C9
αὖτε      | 03B1 03C5 0313 0342 03C4 03B5                          | 03B1 1F56 03C4 03B5
ταῦτα     | 03C4 03B1 03C5 0342 03C4 03B1                          | 03C4 03B1 1FE6 03C4 03B1
ταύτῃ     | 03C4 03B1 03C5 0301 03C4 03B7 0345                     | 03C4 03B1 1F7B 03C4 1FC3
With the encoding done as on the left hand side, you can search for U+03B1 U+03C5 U+03C4 with the reasonable expectation that those codepoints will actually turn up in the text; and you need only tell the search engine to ignore any intervening characters of the diacritic type (information Unicode contains in its definition). With the right hand side, your search engine—and ultimately your operating system—are burdened with a laundry list of equivalences: it needs to be told that, wherever it sees U+1F50, it should treat it as equivalent to U+03C5 U+0313; wherever it sees U+1F54, it should treat it as equivalent to U+03C5 U+0313 U+0301; and so forth.
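Either way, a search engine can reduce both encodings to the same base letters by decomposing first and then skipping combining marks (the "diacritic type" information Unicode records as general category Mn). A minimal Python sketch; strip_diacritics is an illustrative name of my own, not a standard function:

```python
import unicodedata

def strip_diacritics(text):
    """Decompose to NFD, then drop combining marks (general category Mn)."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

words = ["αὔτανδρος", "αὐτάρκης", "ἀϋτέω", "αὖτε", "ταῦτα", "ταύτῃ"]
# All six words match the bare search string αυτ once diacritics are ignored
print([w for w in words if "αυτ" in strip_diacritics(w)])
```

NFD does the laundry-list lookup for you, so the search code itself never needs to know that U+1F54 hides a υ.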
The problem with Unicode's puristic approach is that it was ahead of the technology. The technology to do things like spontaneously ligate f i on display is quite new; if you wanted to see a ligature even five years ago, you had no choice but to use a separate codepoint. And even now that operating systems have started getting smarter about combining codepoints together, the results look less than satisfactory. Some existing 8-bit fonts for Greek (such as SPIonic) involve backspacing diacritics rather than precomposed glyphs; because characters vary in width, it was absurd to expect the same acute to backspace over an iota and an omega.
And even if this is resolved in Unicode implementations (which with newer systems is nowadays the case), there are things about Greek the operating system is blissfully unaware of when it comes to diacritics. In particular, the default behaviour for multiple diacritics in general in Unicode is to stack one on top of the other, as occurs with breathings plus circumflex, and as occurs in Vietnamese. Greek requires that breathings appear to the left of acute and grave; Unicode fonts simply do not make this exception for U+0313 U+0301 combinations, unless the implementer specifically tells it to (and the font technology is a little more advanced than that commonly in use as of this writing).
Moreover, Unicode has always had to deal with legacy encodings. There is a huge amount of data in the world in which á has been encoded as a single codepoint; Unicode would not have gotten very far if it was not backwards-compatible with such data. This applies to any script using diacritics which has any tradition of computer encoding, which certainly includes polytonic Greek.
As a result, Unicode has made an "historic compromise". Unicode has admitted as codepoints the precomposed combinations that were already in use on computers—however obscure the encoding might have been. This is the reason behind some of the more puzzling combinations and digraphs in the Latin Extended B block. For example, U+01C6 Latin Small Letter DZ With Caron, ǆ, and U+01C8 Latin Capital Letter L With Small Letter J, ǈ, appear because Serbian programmers in the '70s, implementing automatic transliteration, were too lazy to write anything but a 1-to-1 mapping between Cyrillic and Latin letters. (The corresponding Cyrillic Serbian letters are U+045F Cyrillic Small Letter Dzhe, џ, and U+0409 Cyrillic Capital Letter Lje, Љ. See the "Latin digraph characters" thread on the Unicode mailing list for more information.)
This means that the common letter-diacritic combinations for any script that had a widespread existing implementation all have their own codepoints in Unicode. In fact, because such combinations will be optimal for display (so there's no need to teach the rendering engine about breathings next to acutes), the W3 consortium recommends such precomposed characters be used preferentially on the web (Normalisation Form C).
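Normalisation Form C composes a sequence wherever a precomposed codepoint exists for it; a sketch with Python's unicodedata:

```python
import unicodedata

# υ + psili + oxia composes to the single precomposed character U+1F54
assert unicodedata.normalize("NFC", "\u03c5\u0313\u0301") == "\u1f54"
# and NFD takes it back apart into base letter plus combining diacritics
assert unicodedata.normalize("NFD", "\u1f54") == "\u03c5\u0313\u0301"
```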
This would make life a living hell for anyone who ever does any programming with text. But Unicode has not let programmers down: as far as Unicode is concerned, U+1F54 really is to be treated as equivalent to U+03C5 U+0313 U+0301. All the precomposed characters it has admitted have been assigned canonical decompositions: a copy of that huge laundry list of equivalences sits in every copy of MacOSX, WinXP, Linux, and so forth.
And the operating system acts on those equivalences. Recall that Unicode conflates polytonic acute and monotonic tonos, and that this leads to confusion for older fonts which differentiate between U+03AC Greek Small Letter Alpha With Tonos and U+1F71 Greek Small Letter Alpha With Oxia. The reason is that the operating system is told that both composed characters have the same canonical decomposition: U+03B1 U+0301, Greek Small Letter Alpha + Combining Acute Accent. So internally, Unicode will treat the two identically: a search for U+03AC will return instances of U+1F71.
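The tonos/oxia equivalence can be checked directly, again with Python's unicodedata as a sketch:

```python
import unicodedata

tonos = "\u03ac"  # GREEK SMALL LETTER ALPHA WITH TONOS
oxia = "\u1f71"   # GREEK SMALL LETTER ALPHA WITH OXIA
# Both have the same canonical decomposition: alpha + combining acute
assert unicodedata.normalize("NFD", tonos) == "\u03b1\u0301"
assert unicodedata.normalize("NFD", oxia) == "\u03b1\u0301"
# NFC folds the oxia character into the tonos one: the two are interchangeable
assert unicodedata.normalize("NFC", oxia) == tonos
```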
With canonical decomposition, it would seem that everyone wins: users get their pretty combinations of glyphs, programmers get their platonic forms of characters to work with. The people who don't win are the Consortium itself, which has to keep managing a very messy laundry list, and the companies putting together operating systems: every copy of MacOSX and WinXP is burdened with thousands of mappings between precomposed characters (Korean, Latin, Greek, Cyrillic, and the rest) and their decompositions. Those mappings cost execution time, memory, and of course money. And no one wants a repeat of the Korean disaster.
So if a classicist comes to the Unicode Consortium, and asks them what possessed them not to include the vitally important Greek Capital Letter Upsilon With Psili and Varia, the answer is not going to be, "But of course, we shall be more than happy to burden our operating systems with even more mappings for the laundry list, just so five out of our fifty million users can read Sappho more pleasantly." The answer is going to be more like, "We only put those precomposed characters into Unicode in the first place because we were forced to. If you want Greek Capital Letter Upsilon With Psili and Varia, you know where to find U+03A5 Greek Capital Letter Upsilon, U+0313 Combining Comma Above, and U+0300 Combining Grave Accent."
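You can confirm the gap for yourself: NFC leaves the capital sequence untouched, because there is no precomposed codepoint for it to compose to, while the corresponding lowercase sequence does compose. A Python sketch:

```python
import unicodedata

capital = "\u03a5\u0313\u0300"  # Υ + psili + varia: no precomposed form exists
small = "\u03c5\u0313\u0300"    # υ + psili + varia: precomposed as U+1F52

assert unicodedata.normalize("NFC", capital) == capital  # stays decomposed
assert unicodedata.normalize("NFC", small) == "\u1f52"   # composes
```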
(And the answer will be brusque, at that: the precomposed characters only entered Unicode at ELOT's insistence in 1991, and have caused no end of headaches. Although officially composed and decomposed are equally valid representations of characters, the Consortium would be happier if everyone stuck with the decomposed characters, instead of lawyering with the laundry list.)
And when you complain about the diacritics being stacked on top of the capital upsilon, rather than appearing to the left like they're supposed to, you won't get much sympathy; you'll be pretty much directed to the same queue as the Serbs. Because mapping the component codepoints to glyphs (including glyphs without corresponding extant precomposed codepoints) is not an issue Unicode is concerned with. Unicode's business is the platonic forms; this is an issue font designers should be concerned with instead. Turning multiple codepoints into a single glyph—which font designers generically refer to as ligatures—is something more complex scripts, like Arabic, Devanagari and Mongolian, have to deal with all the time. Although this is taking a while for implementors to realise, the 'simple' scripts like Latin and Greek are not an exception: they too can have tricky mappings between codepoints and glyphs, the platonic underlying forms and what you actually see on the screen; and you should not assume a one-to-one mapping between the two.
Thus, Ken Whistler in a post on the Prosgegrammeni:
Many of us have felt all along that polytonic Greek should always be represented decomposed, and that the ELOT polytonic "character" encoding was a dangerous conflation of glyph design and character encoding concerns.
Implementations that use full decomposition for polytonic Greek and fonts that correctly map the accentual and diacritic combinations are the best bet for consistency *and* good presentation in the long run.
As of right now, I don't know of an operating system-font combination clever enough to do niftiness with U+03A5 U+0313 U+0300, as Ken foreshadows—though several fonts do have the needed glyphs tucked away in their inventories. In the long term, though, that is the only thing that can happen: classicists working on Aeolic will need to bring such fonts into being. The Unicode Consortium—which has enough unencoded scripts left to go, including a pretty much inexhaustible list in CJK Unified Ideographs Extension—is not going to come up with new encodings for characters you can already represent. Cyrillic does not need two te's, and Greek does not need four more capital upsilons (let alone eight, if we count the puzzling U+03D2 Upsilon With Hook Symbol).
The issue is, what do you do until the promised fonts and operating systems materialise. The newer font technologies have completely distinct glyph and codepoint repertoires; they allow for one-to-many mappings of codepoint or codepoint combination to glyph. But if a font can't have a glyph without a codepoint, then it needs to insert its glyph into a real codepoint. That is what current fonts with these glyphs do: they have a single codepoint assigned to the non-Unicode precomposed character. That codepoint can be in one of two places.
The Private Use Area (PUA) is intended for ad-hoc codepoints which people may come up with for their encoding needs (especially when they need to exchange data with another party), but which do not have a place in the standard. This may be either because they have not yet officially been accepted in the standard (as is the case for several minority scripts under proposal), or because there is no chance they will be accepted. The ConScript registry, which is intended for "made-up" scripts, exemplifies both. It was a temporary station for the scripts Shavian and Deseret, which were eventually included in the standard (even if as trial balloons). It is also as much of a home as the Klingon script pIqaD and the script of the Phaistos Disc will ever find, after being rejected from the standard.
Shavian is a phonetic script intended as an orthographic reform of English, and funded by G.B. Shaw's estate; Deseret was a script for English briefly used by Mormons. The Phaistos Disc script was rejected because it has only ever involved one document, with virtually nothing known about it, and could not plausibly be used productively; see e.g. Michael Everson's posting on the Unicode Mailing List. I discuss pIqaD separately on this site.
The alternative is to slot these glyphs into the slots in the Greek and Greek Extended areas where one would expect them (see below on the 'gaps').
To use a PUA codepoint is risky; you are assuming the other party has the same font as you, and assuming same fonts is the very problem Unicode is supposed to be getting us away from. To use a codepoint in an undefined area of a defined block is negligent: Unicode is under no obligation to keep that area undefined in perpetuity, and certainly not to fill the area with new precomposed glyphs that it objects to on principle. Fonts that fill such areas are in the wrong, and users should not encourage their bad habits.
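Stray PUA codepoints in received text are at least easy to detect, since Unicode assigns them general category Co. A minimal sketch (pua_codepoints is an illustrative name of my own):

```python
import unicodedata

def pua_codepoints(text):
    """Report Private Use Area characters, which may not survive a change of font."""
    return [hex(ord(ch)) for ch in text if unicodedata.category(ch) == "Co"]

print(pua_codepoints("abc\ue000def"))  # ['0xe000']
```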
The fonts to look askance at on this issue are: Aisa Unicode for Greek Extended; Vusillus (U+03F4 for capital lunate sigma, U+03F5 for capital Yot) and New Athena Unicode (U+03FF for capital Yot) for Greek. Vusillus in particular has been upstaged by subsequent developments, as U+03F4 and U+03F5 have since been assigned to other characters, and the capital lunate has ended up in U+03F9.
So what are these combinations of diacritics that didn't make it into Unicode—and why?
The codepoints in Greek Extended are set out on a grid, which has space for all possible character + diacritics combinations. Not all those precomposed combinations were filled by ELOT, so the non-canonical slots were left blank; the result is "gaps in the system". As noted, there is an overwhelming temptation to fill those slots with the pertinent glyphs; this is nonetheless bad form, as Unicode has no obligation to leave those slots unfilled, nor to refrain from filling them with other Greek codepoints should that prove necessary. In fact that is fairly likely to happen, since the Consortium is committed not to increase the count of precomposed codepoints.
The reason why there are no precomposed smooth breathings on capital upsilon (U+1F58, U+1F5A, U+1F5C, U+1F5E) is that in Attic Greek, the standard Greek dialect for surviving Classical literature, word-initial upsilon was always preceded by /h/ (with the exception of the odd interjection, and the original word for upsilon itself, ὖ). When the rough breathing ceased being pronounced, the convention of writing a rough breathing on initial upsilon persisted as traditional. Lowercase upsilon, on the other hand, has smooth breathing (U+1F50, U+1F52, U+1F54, U+1F56) because this can occur as the second letter of a word-initial diphthong, which need not have taken a rough breathing; e.g. οὐκέτι, οὖς.
The only time you will see capital upsilon with smooth breathing, then, is in Ancient dialects other than Attic, where initial upsilon need not take rough breathing, and in "spelling mistakes" (or at least non-traditional spellings) of subsequent periods of Greek. Given that the character is limited to titlecase contexts, which are not plentiful to begin with, such characters turn up very infrequently. In the entire TLG corpus as of June 2003 (a corpus of 83 million Greek words), I have identified the following counts:
Combination                                  | Glyph | TLG instances
Upsilon with smooth breathing                | Υ̓     | 13
Upsilon with smooth breathing and grave      | Υ̓̀     | 0
Upsilon with smooth breathing and acute      | Υ̓́     | 23
Upsilon with smooth breathing and circumflex | Υ̓͂     | 6
Of the instances, 4 are the second letters of diphthongs in accented all-caps words. The TLG is not an inscriptional database, and Ancient dialect has a much stronger presence in the inscriptions; nonetheless these combinations are quite marginal.
Epsilon and omicron in the standard Greek alphabet represent short vowels; and at no stage of Greek did short vowels have a circumflex pitch accent. There are thus gaps in the table where epsilon and omicron with circumflex would go: U+1F16, U+1F17, U+1F1E, U+1F1F, U+1F46, U+1F47, U+1F4E, U+1F4F, and two unpredictable gaps for lowercase epsilon and omicron with circumflex alone. (Haralambous has proposed (§3.3) U+1FB5 and U+1FC5 for the latter two.) These gaps could never be filled in any representation of Greek where epsilon and omicron stand only for short vowels.
But in many variants of the Greek alphabet—including that of Athens before 404 BC—epsilon and omicron were both short and long vowels, as alpha remained throughout Greek; epsilon thus also stood for eta and epsilon iota, and omicron for omega and omicron upsilon. (Once the diphthongs monophthongised, the graphemic diphthongs were also used conventionally to represent new long vowels: see discussion of Corinthian EI.) If an epigraphical transcription keeps the epsilon and omicron as they were in the original, but adds conventional accentuation as extrapolated from our knowledge of Greek, you will end up with circumflexes on omicron and epsilon (usually on top of macrons). The same holds in diplomatic editions of subsequent documents—those preserving the original spelling of a document, complete with spelling mistakes.
This is usually done for non-literary texts, such as legal deeds. Hence the name: diploma, as in legal document.
In the TLG corpus, there are:
The combination is not that infrequent in diplomatic transcriptions, however, particularly for omicron.
As I discuss elsewhere, there is an odd set of capital vowels with grave and acute accents in Greek Extended (e.g. U+1FBA Greek Capital Letter Alpha With Varia, U+1FBB Greek Capital Letter Alpha With Oxia). If this set of characters made any sense (which they might, but not as titlecase characters), then it would be worth pointing out that the alpha, eta, iota, upsilon and omega are missing a capital with circumflex version. Since however these characters don't really make sense, there's no real point asking for a circumflexed version of a character that doesn't make sense either.
The set of characters that take subscripts normally comprises only alpha, eta, and omega. There are two other potentially long vowels that could combine with iota to form a long diphthong, which would end up notated with subscripts. The first is iota; however, iota-iota groups in Greek have never formed a diphthong, but have always been regarded as either two syllables (e.g. διά + ἰσχυρίζομαι = διϊσχυρίζομαι 'rely on') or a single long vowel (e.g. Χίος + -ιος = Χίιος = Χῖος 'inhabitant of Chios'). This makes sense, of course: two short iotas make one long iota.
The other possibility is upsilon iota. The diphthong was quite infrequent, and was already breaking down to /yː/ in Attic; there is no mainstream tradition of distinguishing /yːi/ from /yi/ in Greek. Exceptionally, however, Eustathius of Thessalonica (12th century AD) used it once in grammatical explication:
Δῆλον δὲ ὅτι παρὰ τὰ γυῖα, ὃ τὰς χεῖρας ἰδίᾳ δηλοῖ πολλαχοῦ, γίνεται καὶ ἐγγύη, ἡ ὡσανεὶ ἐν χερσὶ τιθεῖσα τὸ κατεγγυηθέν, καὶ τὸ ἐγγυαλίζω, ἤγουν ἐν χερσὶ τίθημι, καὶ τὸ ὑπόγυͅον, ὃ καὶ ἐξ ὑπογύιου λέγεται, τὸ ἐγγύς, φασί, προσδόκιμον ἢ παραυτίκα γεγονὸς καί, ὡς εἰπεῖν, πρόχειρον ἢ μᾶλλον ὑποχείριον, ὅπερ ἐν ᾿Οδυσσείᾳ κεῖται.
Clearly from guîa "limbs", which often means "hands" in particular, is derived engúeː "surety", placing the guarantee as it were in the hand, and engualízoː "put in the palm", namely "place in the hand", and hupógu[i]on "under one's hand; sudden", or alternatively ex hupogúiou "on the spur of the moment", an expected or immediate event which is nearby and, so to speak, close at hand or rather "in hand"—as used in the Odyssey. (Eustathius of Thessalonica, Commentary on the Iliad, III 439)
Eustathius' is the only instance of upsilon with iota subscript in the TLG corpus. This makes sense as a pedantic emendation of /yː/ for earlier /yːi/, where, just as with ᾳ ῃ ῳ, the ι has become 'silent': the forms hupóguios and hupóguos both occur in Attic, attesting such variability. But this never seems to have caught on as a spelling convention, and for the single instance where it does occur in Greek, the combining subscript diacritic (U+0345 Combining Greek Ypogegrammeni) is more than adequate—especially if you can see hupóguion correctly in the excerpt above.
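As expected, there is no precomposed upsilon-with-ypogegrammeni for the combining diacritic to compose to, unlike eta; a quick Python check:

```python
import unicodedata

# υ + U+0345 COMBINING GREEK YPOGEGRAMMENI has no precomposed form
assert unicodedata.normalize("NFC", "\u03c5\u0345") == "\u03c5\u0345"
# η + U+0345 does compose, to U+1FC3 GREEK SMALL LETTER ETA WITH YPOGEGRAMMENI
assert unicodedata.normalize("NFC", "\u03b7\u0345") == "\u1fc3"
```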
Just as with upsilon, there is no capital rho with smooth breathing to correspond to U+1FEC Greek Capital Rho With Dasia—even though there is a beckoning gap at U+1FDC. The reason is again to do with Attic phonology: as already discussed, /r/ was devoiced at the beginning of a word, whence it was written with a rough breathing there (ῥ = /rh/ = [r̥]), which means it can appear with a rough breathing in titlecase position, as the first letter of a word. But rho with a smooth breathing only appeared before a rho with a rough breathing in some typographical traditions, indicating that the devoicing did not encompass both rhos; since no Greek word begins with double rho, there is no context in which rho with smooth breathing can appear in titlecase context, so there is no reason for the gap to be filled (despite Microsoft's inadvertent attempt with their glyph for U+1FEC in Palatino Linotype).
Though the rho ceased being devoiced, this was not represented in conventional orthography; Modern Greek ended up dropping breathings on rho altogether, rather than switching them to smooth. The only context in which rho with smooth breathing appears, once again, is in Ancient representations of psilotic dialects. The TLG corpus manages 57 instances, all of them Aeolic, with the exception of Rarus, father of Triptolemus, the first human to have been taught agriculture by Demeter; his name was apparently back-formed from the (non-Attic) Rarian field in Eleusis associated with the cult of Demeter. In the surviving corpus, Rarus and his field are predictably mentioned more by the grammarians, who loved a good exception, than by the classics themselves:
Τὸ ρ ἀρχόμενον λέξεως δασύνεσθαι θέλει, ῥά, ῥανίς, ῥάξ χωρὶς τοῦ Ρ̓ᾶρος (ἔστι δὲ ὄνομα κύριον) καὶ χωρὶς τῶν ἐξ αὐτοῦ οἷον «Ρ̓αρίδος Δηοῦς» καὶ Ρ̓αρία γῆ καὶ Ρ̓αριάς, σημαίνει δὲ τὴν Δήμητρα, καὶ Ρ̓άριον πεδίον ἐν Ἐλευσῖνι, ἐπὶ τούτων γάρ φασι τὸ μὲν πρῶτον ρ ψιλοῦσθαι, τὸ δὲ δεύτερον δασύνεσθαι. ἀλλὰ τὰ δύο ρρ ψιλωτέον ἐπὶ τούτων καὶ κατὰ τὴν πρώτην συλλαβὴν καὶ κατὰ τὴν δευτέραν. (Herodian, De prosodia catholica 3.1.547)
At the beginning of a word, r takes a rough breathing, as in rhá "then", rhanís "drop", rháx "grape", except for Rarus (a proper name) and the words derived from it, such as "Rarian Demeter" and "Rarian earth" and Rarias, which is a name for Demeter, and the Rarian field in Eleusis. It is said that for these the first r has a smooth breathing and the second a rough; but the two r's in these words should bear a smooth breathing both in the first and in the second syllable.
Not all authorities put up with the exception; the Suda preferred to call Demeter Rharias.
Created: 2003-06-03; Last revision: 2003-09-16