OCR beyond Pleco

Working with OCR to digitize texts on Thursday, I experienced OCR for the first time outside of the typical Chinese learner's experience with Pleco. Compared to recognition for dictionary purposes, the challenges in document digitization are similar but more pronounced because of the scale of the project. For example, the lack of spaces between words, even though a word may consist of more than one character, is often cited as a language-specific challenge for OCR in Chinese. In Pleco, this rarely prevents successful use, because the user can manually adjust the combination of characters until it makes sense. For document digitization, by contrast, the OCR guesses the combination for you, prepares the output as a document, and leaves you to correct any errors after the fact. Additionally, the ease with which Pleco's OCR identifies Chinese characters contrasts sharply with the difficulty ABBYY FineReader had recognizing Arabic. This difference shows both how useful machine learning has become and how unevenly its progress is spread across scripts. Given the struggles in recognizing handwriting, however, one must wonder whether ancient, more stylized Chinese texts pose problems similar to those of Arabic. As OCR cannot yet reliably read even English handwriting, it will be interesting to see which language acquires this capability first. Indeed, if and when it is accomplished, handwriting recognition could facilitate building a corpus of authors' notes, about their literature or their personal lives, and presenting this added context alongside the novels already available.
