Research paper showing Google Translate’s inaccuracy instead shows that neither designers nor translators are valued (updated)
Originally posted as a comment on a translators’ Facebook group, for a post that links to the Verge article.
I recently read a one-year-old Verge article[Note 1] about how Google Translate is still not suitable for medical purposes. The gem isn’t the Verge article though, but the actual research paper it cites[Note 2]: the paper inadvertently shows how it, and many other research papers, end up with invalid results when researchers don’t bring in subject-matter experts who can spot flaws in the experiment design right away.
And by subject-matter experts I don’t mean experts in the researchers’ own disciplines, but outsiders — like artists, graphic designers, native speakers of languages the researchers don’t speak, translators.
While I welcome the paper’s conclusions, a few things in the paper are invalid enough that the whole thing should have been rejected by the peer reviewers unless revisions were made; instead, it was allowed to be published in its current form. I can already guess the makeup of the review panel — no graphic designer with Arabic/Farsi/Hebrew experience, no non-PRC Chinese speaker.
So, my comments on the paper:
The paper mentions that the experiment was designed for “written translations”,[Note 3] yet the choice of languages was based on “spoken languages”, including “Chinese (including Cantonese and Mandarin)”.[Note 4]
This perpetuates the common myth that Cantonese vs Mandarin is relevant for written translations. Unless the researchers were ultra-progressive and used written Cantonese,[Note 5] Cantonese vs Mandarin is completely irrelevant; the researchers should have split Chinese into PRC Mainland, Taiwanese and Hong Kong variants instead (but with the last two merged, since GT does not distinguish between them). Neither “Cantonese” nor “Mandarin” corresponds to any specific written form.
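To make the written-variant point concrete, here is a toy sketch. The example sentence is my own, not from the paper: the Mainland and Taiwan written forms of the same sentence diverge in both script and vocabulary, while both are simply “Mandarin” when read aloud.

```python
# Toy illustration (my own example sentence, not from the paper):
# written Chinese varies by region, not by "Cantonese" vs "Mandarin".
# Both lines mean "This software's quality is very good."
written_variants = {
    "PRC Mainland (Simplified)": "这个软件的质量很好",
    "Taiwan (Traditional)": "這個軟體的品質很好",
}
for region, sentence in written_variants.items():
    print(region, "→", sentence)
# The two forms differ not only in script but in word choice:
# 软件/軟體 ("software") and 质量/品質 ("quality").
```

A speaker could read either line aloud in Mandarin or in Cantonese; the spoken-language label tells you nothing about which written variant is on the page.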
The paper claims that “directionality of the written language was not accounted for by the software, i.e. that Farsi [...] was transposed to left to right by GT and was illegible.”[Note 6]
This is an outright factual error that suggests the researchers copied GT output into InDesign without checking their work. The thing is, GT does not transpose Farsi, or any other right-to-left language; one piece of software that is known to do so[Note 7] is InDesign,[Note 8] which isn’t mentioned in the paper. (The researchers couldn’t have used Word, PowerPoint or LibreOffice, since those handle RTL text correctly.) So the researchers omitted a detail they mistook for irrelevant, and the peer reviewers did not catch the error because none of them was a designer with RTL experience.
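For the technically inclined, the directionality claim is easy to check: Unicode text is stored in logical order, and each character carries its own bidirectional class, so laying the text out right to left is the rendering layer’s job, not the translator’s. A minimal sketch using Python’s standard library (the example word is mine, not from the paper):

```python
import unicodedata

# "سلام" (salâm, "hello") is stored in logical (reading) order.
# Arabic-script letters carry the Unicode bidi class 'AL'
# (Arabic Letter), which instructs any correct renderer to lay
# them out right to left. A string coming out of a translator
# is never "transposed"; only a broken rendering pipeline
# reverses it on the page.
word = "سلام"
bidi_classes = [unicodedata.bidirectional(ch) for ch in word]
print(bidi_classes)  # every letter here is class 'AL'
```

In other words, if the printed Farsi was reversed, the reversal happened after GT, in whatever software composed the page.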
The paper also mentions the use of a 5-point Likert scale,[Note 9] without explaining how, or whether, they addressed cultural differences in how such scales are interpreted.
(I suppose it’s at least a 5-point scale rather than a 4- or 6-point one, so their volunteers could express neutrality; but what does neutrality even mean for translation accuracy?)
While the use of a 5-point Likert scale doesn’t by itself invalidate the results, it is problematic because some cultures (including my own) do not use the word “strongly” the way anglophone North Americans do. This is so well known outside academia that the first person who told me about it was a French teacher; she was white and she came from France.
(I have experience typesetting RTL in InDesign (Hebrew) and live in a neighbourhood where both CJK and Middle Eastern scripts are used (Korean and Farsi).)