NLP systems are limited by the availability of text data, and because machine-readable text exists only in a few hundred languages, most of the world’s languages are under-represented in modern language technologies.
Text data does exist in many more languages; however, it is locked away in printed books and handwritten documents, and training a high-performance optical character recognition (OCR) system to extract that text is challenging for most under-resourced languages.
In this talk, I will describe two methods for improving text recognition in low-resource settings through automatic OCR post-correction. The first is a multi-source encoder-decoder model with structural biases that learns efficiently from limited data. The second is a semi-supervised learning technique that uses raw unlabeled images to improve performance without additional manual annotation; it combines self-training with automatically derived lexica, encoded as weighted finite-state automata (WFSA), to improve post-correction. I will present an empirical evaluation on several under-resourced languages that demonstrates the effectiveness of the proposed approaches, and discuss future directions for using the extracted texts to extend multilingual NLP models to many more languages.
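
To make the first method concrete, the sketch below shows a minimal multi-source encoder-decoder for OCR post-correction in PyTorch: one encoder reads the first-pass OCR output, a second encoder reads an auxiliary source, and the decoder attends to both before predicting each corrected character. All module names, dimensions, and the single-layer LSTM choice are illustrative assumptions, and the structural biases mentioned in the abstract (e.g., copy and coverage mechanisms) are omitted for brevity; this is not the exact architecture from the talk.

```python
# Minimal sketch (not the talk's exact model) of a multi-source
# encoder-decoder for OCR post-correction. Two encoders read (1) the
# first-pass OCR output and (2) an auxiliary source; the decoder attends
# to both at every step. Sizes and layer choices are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F


class Attention(nn.Module):
    """Dot-product attention over one encoder's hidden states."""
    def forward(self, query, keys):
        # query: (batch, hidden), keys: (batch, src_len, hidden)
        scores = torch.bmm(keys, query.unsqueeze(2)).squeeze(2)      # (batch, src_len)
        weights = F.softmax(scores, dim=-1)
        return torch.bmm(weights.unsqueeze(1), keys).squeeze(1)      # (batch, hidden)


class MultiSourcePostCorrector(nn.Module):
    def __init__(self, vocab_size, emb_dim=64, hidden=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, emb_dim)
        self.enc_ocr = nn.LSTM(emb_dim, hidden, batch_first=True)    # first-pass OCR
        self.enc_aux = nn.LSTM(emb_dim, hidden, batch_first=True)    # auxiliary source
        self.attn = Attention()
        self.dec = nn.LSTMCell(emb_dim + 2 * hidden, hidden)
        self.out = nn.Linear(hidden, vocab_size)

    def forward(self, ocr_ids, aux_ids, target_ids):
        ocr_states, _ = self.enc_ocr(self.embed(ocr_ids))
        aux_states, _ = self.enc_aux(self.embed(aux_ids))
        batch, hidden = ocr_ids.size(0), ocr_states.size(-1)
        h = ocr_states.new_zeros(batch, hidden)
        c = ocr_states.new_zeros(batch, hidden)
        logits = []
        for t in range(target_ids.size(1)):  # teacher forcing over gold characters
            ctx = torch.cat([self.attn(h, ocr_states),
                             self.attn(h, aux_states)], dim=-1)
            h, c = self.dec(torch.cat([self.embed(target_ids[:, t]), ctx], dim=-1),
                            (h, c))
            logits.append(self.out(h))
        return torch.stack(logits, dim=1)  # (batch, tgt_len, vocab)
```

The two attention contexts are simply concatenated here; richer combination schemes are possible, but the key point is that the corrected text is predicted jointly from both sources rather than from the noisy OCR output alone.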
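
As a rough illustration of the semi-supervised idea, the sketch below derives a lexicon from the model's confident outputs and uses it to filter pseudo-labelled pages for self-training. A plain word-frequency dictionary stands in for the weighted finite-state automata used in the actual method, and the scoring function, thresholds, and data format are all assumptions made for this example.

```python
# Sketch of the semi-supervised step: build a lexicon from high-confidence
# corrected text, then keep only pseudo-labelled outputs that the lexicon
# scores highly enough for the next round of self-training. A frequency
# dictionary approximates the WFSA-based scoring; values are illustrative.
from collections import Counter


def build_lexicon(confident_outputs):
    """Derive a word-frequency lexicon from high-confidence corrected text."""
    counts = Counter(w for line in confident_outputs for w in line.split())
    total = sum(counts.values())
    return {w: c / total for w, c in counts.items()}


def lexical_score(hypothesis, lexicon, unk_penalty=1e-6):
    """Average per-word lexicon weight; a stand-in for a WFSA path weight."""
    words = hypothesis.split()
    if not words:
        return 0.0
    return sum(lexicon.get(w, unk_penalty) for w in words) / len(words)


def select_for_self_training(pseudo_labels, lexicon, threshold=1e-3):
    """Keep pseudo-labelled pages whose words the lexicon recognizes."""
    return [hyp for hyp in pseudo_labels if lexical_score(hyp, lexicon) >= threshold]


if __name__ == "__main__":
    confident = ["the river flows to the sea", "the sea is wide"]   # toy data
    lexicon = build_lexicon(confident)
    pseudo = ["the wide river", "xqz vbn kpr"]   # model outputs on unlabeled pages
    print(select_for_self_training(pseudo, lexicon))   # keeps only the first line
```

Selected pseudo-labelled pages would then be added to the training data and the post-correction model retrained, repeating the cycle without any additional manual annotation.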