Ambrose Li

Google’s Noto has failed its mandate


I chose Lato for this blog because I needed certain letters for Cantonese, but Cantonese is not normally written in Latin letters. To display the CJK characters that are normally used to write Cantonese, I chose Noto, not only because it is the only open-licensed CJK font available on a public font server but also because, as its official download page says,

When text is rendered by a computer, sometimes characters are displayed as “tofu”. They are little boxes to indicate your device doesn’t have a font to display the text.

Google has been developing a font family called Noto, which aims to support all languages with a harmonious look and feel. Noto is Google’s answer to tofu. The name noto is to convey the idea that Google’s goal is to see “no more tofu”.[Note 1] [My emphasis]

Soon after starting work on the Meta-index to etymologically correct written forms of Cantonese words, I realized that Noto had already failed its mandate. This is what one screen of the meta-index currently looks like:

A screenshot showing that a number of CJK characters documented 14 years ago still appear as “tofu”

The screen cap shows 𣿂 (ˌsœ), 𣱐 (ˉfɐn) and 𨄇 (ˌtɐn) appearing as “tofu”.

If you keep your text “etymologically correct”, the verb 𨄇 (ˌtɐn) really does not have a viable substitute that’s not “tofu”. The same goes for the adjective 𡍲 (ˉdat) (not shown in the screen cap).

These words came from a book published 14 years ago, in 2007, and include fairly frequently used words like ˉdat and ˌtɐn, yet even a comprehensive face like Noto fails to display them correctly. If this has nothing to do with Cantonese being a marginalized language, I’d like to know your alternative explanation.
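One way to audit which characters a font would leave as tofu is to compare the text against the font's character map. The sketch below assumes that map is a plain dict from code point to glyph name (for a real font file, fontTools' `TTFont(path).getBestCmap()` returns a mapping of that shape). The names `find_tofu`, `describe`, and `toy_cmap` are hypothetical, and the toy map is only a stand-in, not Noto's actual coverage.

```python
def find_tofu(text, cmap):
    """Return the characters in `text` that have no entry in `cmap`.

    `cmap` is assumed to map Unicode code points (ints) to glyph names.
    Whitespace is skipped; each missing character is reported once.
    """
    missing = []
    for ch in text:
        if ch.isspace():
            continue
        if ord(ch) not in cmap and ch not in missing:
            missing.append(ch)
    return missing


def describe(ch):
    """Format a character's code point and the Unicode plane it lives in."""
    cp = ord(ch)
    return f"U+{cp:04X} (plane {cp >> 16})"


# Hypothetical character map that covers a single common character.
toy_cmap = {ord("中"): "uni4E2D"}

# The four characters discussed above, preceded by one covered character.
for ch in find_tofu("中𣿂𣱐𨄇𡍲", toy_cmap):
    print(ch, describe(ch))
```

Running the same check against a real font's character map would list exactly the characters that need a fallback font or a bug report.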


It is true that what we call Hong Kong Chinese is a written form of Mandarin, because standard written Chinese is a written form of Mandarin. And I can tell you that we have to go to school to learn how to write in a foreign language that we don’t speak (viz. Mandarin). But there has been more and more talk of writing in Cantonese, using Cantonese words and Cantonese grammar, eventually even in formal writing. Yet when it comes to writing Cantonese, if we follow the principle of being “etymologically correct”, Google’s Noto has failed its mandate of “no more tofu”, despite the fact that Noto Sans TC has “graduated” from the Google Fonts Early Access programme.[Note 2]

It is also true that many people equate “Chinese” with “Mandarin”, which is one reason I believe we should stop using the word “Chinese”. In fact, if you pick up a copy of the Chicago Manual of Style and look up Chinese language in the index, you’ll find only “romanization systems (Pinyin, Wade-Giles)”[Note 3] and nothing for other Chinese languages. If you flip to section 11.82, the section that talks about romanization systems, you’ll find

The Hanyu Pinyin romanization system, introduced in the 1950s, has largely supplanted both the Wade-Giles system and the place-name spellings of the Postal Atlas of China (last updated in the 1930s), making Pinyin the standard system for romanizing Chinese.‍[Note 4]

That statement seems to imply Chinese is a unified single language that can be handled by a single romanization system.

Is there such a thing as a Chinese language that is not Mandarin? Are there places in “China” (whatever that means) that aren’t using Pinyin? To Chicago’s editors, the answer to both questions seems to be no.


But let’s go on a detour.

Before Unicode there was Big5.[Note 5] It was an ad-hoc character set originally cobbled together in Taiwan as an interim solution; it was so poorly designed that it was literally impossible to write normal Taiwanese Chinese in it.

One very common word that used to appear frequently in Taiwanese Chinese was somehow excluded: The word was .‍[Note 6]

Many today do not even realize that, as recently as twenty years ago, some Taiwanese people still distinguished between and ; many today incorrectly believe that is only used in so-called “simplified Chinese”.

Why? Because people still associate “traditional Chinese” with Big5, and they know in Big5 it’s impossible to type .

But actually has disappeared from Taiwanese Chinese. How it happened is a valid question, and I’d say being excluded from Big5 killed it. No one would have been able to type it, and if people somehow managed to type it, the resulting files would become gibberish when sent to someone else.‍[Note 7] The word became useless. Therefore it died out.
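The "impossible to type" problem can be illustrated by asking a codec which characters it has no slot for. This is a minimal sketch using Python's built-in `big5` codec, which is assumed here to be a reasonable stand-in for the Big5 of the period; vendor extensions differed, so results for borderline characters may not match any particular historical system. The function name `unencodable` is hypothetical.

```python
def unencodable(text, encoding="big5"):
    """Return the characters in `text` that `encoding` cannot represent.

    Each character is encoded on its own; a UnicodeEncodeError means the
    character set has no code for it. Characters are reported once each.
    """
    missing = []
    for ch in text:
        try:
            ch.encode(encoding)
        except UnicodeEncodeError:
            if ch not in missing:
                missing.append(ch)
    return missing


# One everyday character plus one of the Cantonese characters from the
# screenshot earlier in this post.
print(unencodable("中𨄇"))
```

A character that lands in the returned list could only be stored by stepping outside the character set, which is where the foundry-specific code points came in.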


I hope that was not a useless detour, because that was about how poor character set design and incompatible fonts killed a perfectly fine Taiwanese word. What’s happening right now could eventually kill many perfectly fine Cantonese words.

I appreciate Google’s efforts in creating Noto. I know creating a CJK font is a herculean effort. But I hope Google will rectify the deficiency before it’s too late.

If Noto’s mandate really is “no more tofu”.

Because Chinese is not a single language, and “traditional Chinese” is not only used by Mandarin.

Notes

  1. “Google Noto Fonts,” https://www.google.com/get/noto/.
  2. “Early Access,” accessed January 27, 2021, https://fonts.google.com/earlyaccess. The typeface Noto Sans TC is listed under “The following fonts have graduated from this page to be included in the Google Fonts Catalog”.
  3. Chicago Manual of Style, 17th ed. (University of Chicago Press, 2017), 1035.
  4. Chicago, 651.
  5. Because the character set was invented in Taiwan, it had an official name in Chinese; that name was 五大碼. Wikipedia now files that character set under 大五碼, an ungrammatical misspelling that I believe was first used in NJStar, a piece of software invented by a PRC Mainlander based in Australia; in other words, an outsider who should never have touched Big5.

    (NJ, short for Nánjí, refers to the far south. Normally this means the South Pole, but in this context it is clearly a reference to Australia.)

  6. And there was another: 裏. But in this case, and (which was included in Big5) are at least true equivalents. and , however, are not.
  7. Foundries often included these accidentally excluded words in their typefaces, but because these words were not in Big5, they were assigned random code points specific to the foundry, much like how the private-use area is used today in Unicode.
