Translation and Speech
African Language Technology
People who do not speak a major language such as English are at several disadvantages. They cannot access information or resources, cannot do business easily outside their local area, and cannot make their voices heard in many surveys or online forums.
Data collection is an important part of developing language technology. The Sunbird AI Language Translation (SALT) dataset is a multi-way parallel corpus of 25,000 sentences translated across six languages (English and five Ugandan languages), developed in collaboration with the Makerere AI Lab and the Makerere University Institute of Languages.
ABOUT THE PROJECT
Sunbird AI is building Natural Language Processing (NLP) technologies to provide language resources for social good. With our partners, including the Makerere AI Lab, we have built open local-language datasets as well as translation and speech systems.
The Sunbird Translate system can automatically translate text between English and any of five local languages (Acholi, Ateso, Luganda, Lugbara and Runyankole), in either direction, with state-of-the-art accuracy. We train multilingual models, which translate between multiple languages, using recent advances in deep neural networks. This makes the system robust and easy to extend to other languages and to improve with additional data. We are now working on turning these resources into social impact. Visit translate.sunbird.ai and try it out yourself.
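One common way to let a single multilingual model handle several translation directions is to prefix each source sentence with a token naming the target language. The sketch below illustrates that general idea only. The tag format, the language codes and the function names are assumptions made for illustration, not a description of how Sunbird Translate is implemented.

```python
def tag_example(source_text, target_lang):
    """Prefix the source sentence with a token naming the desired output language.

    The ">>code<<" tag format is an assumed convention for this sketch.
    """
    return f">>{target_lang}<< {source_text}"


def build_training_examples(pairs):
    """Turn (src_lang, tgt_lang, src_text, tgt_text) tuples into (model_input, model_output)."""
    return [
        (tag_example(src_text, tgt_lang), tgt_text)
        for _src_lang, tgt_lang, src_text, tgt_text in pairs
    ]


# Hypothetical sentence pairs; placeholder strings stand in for real text.
pairs = [
    ("eng", "lug", "<English sentence>", "<Luganda sentence>"),
    ("lug", "eng", "<Luganda sentence>", "<English sentence>"),
]

examples = build_training_examples(pairs)
for model_input, model_output in examples:
    print(model_input, "->", model_output)
```

Because the target language is part of the input, adding a new language under this scheme means adding new tagged examples rather than training a separate model for each direction.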
As with all Sunbird AI projects, the data, code and models are freely and openly available for others to use and extend. In particular, the Sunbird AI Language Translation (SALT) dataset, a multi-way parallel corpus of 25,000 sentences in six languages built with the Makerere AI Lab and the Makerere University Institute of Languages, was designed to cover a range of locally relevant topics such as healthcare, agriculture and society, which sets it apart from comparable datasets.
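A multi-way parallel corpus stores the same sentence in every language, so each record can supply training pairs for every translation direction, not only to and from English. The sketch below shows this under stated assumptions: the record layout, the ISO 639-3 style language codes and the function name are illustrative and are not taken from the SALT release.

```python
from itertools import permutations

# Assumed language codes: English, Acholi, Ateso, Luganda, Lugbara, Runyankole.
LANGUAGES = ["eng", "ach", "teo", "lug", "lgg", "nyn"]


def directed_pairs(record):
    """Return (src_lang, tgt_lang, src_text, tgt_text) for every ordered language pair.

    Languages with a missing or empty translation are skipped.
    """
    present = [lang for lang in LANGUAGES if record.get(lang)]
    return [
        (src, tgt, record[src], record[tgt])
        for src, tgt in permutations(present, 2)
    ]


# Hypothetical record; placeholder strings stand in for real translations.
record = {lang: f"<sentence in {lang}>" for lang in LANGUAGES}

pairs = directed_pairs(record)
print(len(pairs))  # 30 directed pairs from one six-language record
```

If every one of the 25,000 sentences were available in all six languages, this would give 25,000 × 30 = 750,000 directed sentence pairs.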
Our technical report has the details of collecting text data and training translation models.
Speech technology, in particular the ability of systems to recognise and produce speech in local languages, is important because local languages in our context are more often spoken than written.
Text-to-speech (TTS) models are normally trained on recordings of sentences made by a voice actor in a studio. Sunbird AI has been able to train a Luganda TTS model in a new way, using crowdsourced data from Mozilla Common Voice, so that the model was trained on the voices of hundreds of individuals. This is, to our knowledge, the only existing TTS system for any Ugandan language, and it is freely and openly available.