Ambrose Li

The Unihan code charts aren’t user-friendly

(updated )

In the past few days I have been working on a personal project that’s long overdue: the Meta-index to etymologically correct written forms of Cantonese words. It’s still a work-in-progress, but I already got to a point where I had to use data and charts published by the Unicode Consortium.

Let me first explain what I mean by “etymologically correct”. You see, we’re not supposed to write in Cantonese. (It’s like you spoke English but had to write in Latin.) Cantonese is a marginalized language.

Although we’re not supposed to write Cantonese down, sometimes we have to,‍[Note 1] or just want to. But we don’t know how to write many of the words that we use, so we made up new ones.

In the past two decades or so, people have been rediscovering that many of our words are actually in the dictionary; we just didn’t know. So people had an idea: instead of the words we made up, we should use these “correct words” (正字, ˉdziŋ ˍdzi).


One problem with these etymologically correct words is that many of them are so archaic that even when they are listed in Unicode there’s often no way to type them, and sometimes even when you find a code to type there’s no font that will show them.

So I found myself looking for a few of these pesky words in the Unicode Radical–Stroke Index.‍[Note 2] After a few successful tries I hit a brick wall and I ended up looking through the actual standard documents. One of these was Unicode Unified Ideographs Extension B,‍[Note 3] a gigantic PDF at 39 megabytes.‍[Note 4] It’s 741 pages. And it’s virtually impossible to look up anything.

Screen shot of the second of the file
The first page of the Unicode Unified Ideographs Extension B that is not the copyright page. There is no table of contents. No index. No section heads. The header gives the start and end code points on the page, and each character gives its radical and stroke count, but there are otherwise no useful pointers on the page.

First, you have no idea where to even start looking. Let’s say you want to find U+265BF.‍[Note 5] Which page is that? Page 500? Page 600? 700? No one knows.

Second, within the general traditional, radical-based groupings, it’s hard to figure out where you are. Let’s say you want to look for something that might belong to the “water” radical. Which page is that? Say you flipped to a random page and it’s a different radical. Do you flip back a few pages or forward? The radicals form a 214-letter “alphabet” that few can remember.‍[Note 6]

Or, say you flipped to a random page and it happened to be the “water” radical. Are you at the right place? Yes, stroke count is in the charts but it’s so small. Even if you don’t find this an issue how do you skip back or forward by say two stroke counts, when the entire grouping could be several dozen pages?

In short, it has all the hallmarks of an improperly indexed work. Location information is not perceivable, and when you do land at the wrong place, it’s impossible to recover.


The Unicode Consortium does not seem to understand or appreciate the traditional radical–stroke method of lookup. When I was tackling the first problem I encountered building my meta-index that CJK characters were not being sorted correctly I rediscovered Unihan Dictionary-like Data,‍[Note 7] a data set published by the Unicode Consortium. This data file mentions a technical report #38 for “documentation”, but the report includes this very judgmental comment:

To find a character using the radical-stroke system, one determines its radical and the number of residual strokes, then looks through the list of characters with those characteristics. This is a clumsy system compared to alphabetical lookup, but is one of the most widespread systems throughout East Asia. Unfortunately, it is also ambiguous.[Note 8] [Emphasis mine]

It seems since this traditional system is so “clumsy” and “ambiguous”, they’ve concluded it must be quite useless and therefore, in the actual code charts, there’s no need to including people who rely on it.

I cannot explain this cognitive dissonance. Perhaps they don’t have anyone who speaks a Chinese language and therefore don’t understand the usability issue at hand: Precisely because CJK ideographs are not “alphabetical”, these “clumsy” and “ambiguous” methods are essential.


To be fair, they do include us in a way, because radical–stroke count is what the Unicode Radical–Stroke Index does. But the reason I had to turn to the PDF files in the first place was that radical–stroke is rarely the only way to organize a large dictionary.

Looking up by radical and stroke count is indeed the primary way we look up a real dictionary, but let’s face it we do admit the system isn’t perfect, and that’s why most real dictionaries provide a second, fallback method to look up words: a table called the “index to difficult-to-look-up words” (難檢字表, ˏnan ˊgim ˍdzi ˊbiu) that allow lookups by stroke count only.

I have to point out that the Unicode Consortium has not given any thought to this fallback method at all. None of their interactive lookup tools supports looking up words by stroke count. If they truly find radical–stroke lookup so “clumsy”, why don’t they at least provide this standard fallback method? kTotalStrokes is in their data files; exposing this in their interactive tools would have made their tools a lot more usable.

Notes

  1. For example, a script for an ad or a play
  2. “Unihan Radical–Stroke Index,” http://​www​.unicode​.org/​charts/​unihanrsindex​.html.
  3. “Unicode Unified Ideographs Extension B” (Unicode Consortium, 2020), https://​www​.unicode​.org/​charts/​PDF/​U20000​.pdf.
  4. I’m using the original IEC definition, so a megabyte is 2²⁰ bytes, not 10⁶ bytes.
  5. To give you an idea of how big a problem this is, assuming the whole 20000–2A6DD range is filled, the chart contains 42,718 characters.
  6. My home form teacher in Grade 5 actually tried to teach us to memorize the radicals. I can still recite the first few dozen (out of the 214); they begin: ˈjɐt, ˊgwɐn, ˊdzy, ˉpit, ˍjyt, ˉkyt; ˍji, ˌtɐu, ˌjɐn, ˌjɐn, ˍjɐp, ˉbat, ˊgwiŋ;.... You can see the problem right there: some radicals have the same pronunciation. Also, if you can remember how to pronounce a radical it doesn’t mean you remember how to write it; I don’t remember ˉkyt or ˊgwiŋ, or one of the ˌjɐn’s.
  7. “Unihan Dictionary-like Data”, Version 9.0.0 (Unicode Consortium, 2016), ftp://​ftp​.unicode​.org/​Public/​9​.0​.0/​ucd/​Unihan​_​Dictionary​Like​Data​.txt. In Version 14.0.0 of the standard the stroke count data has been moved to a different file.
  8. Unicode Han Database (Unihan) (Unicode Standard Annex #38), Revision 29, ed. John H. Jenkins, Richard Cook, and Ken Lunde (Unicode Consortium, 2020), http://​www​.unicode​.org/​reports/​tr38/​, 3.6.

Tags

  • #Chinese languages
  • #dictionary lookup
  • #exclusion
  • #headings
  • #indexing
  • #perceptible information (universal design)
  • #radical-based lookup
  • #stroke count
  • #tolerance for error (universal design)
  • #Unicode