Parallel Text Language Corpus Dataset for Key Ugandan Languages

One of the ways that artificial intelligence shapes society is through language technology. Neural networks that can process language are the basis for being able to search the web, translate between languages, provide recommendations and carry out large scale analysis of text.

Machine translation has long been a central NLP task. Unfortunately, many African languages have not benefited from advances in NLP because of limited language resources. Uganda is home to 43 languages and dialects, most of them more spoken than written.

To contribute to language resources in Africa, we set out to collect parallel text sentences for the top five languages in Uganda, enough to enable a start on the machine translation task for Ugandan languages. Our approach was to build on existing efforts in this regard and make this the principal dataset for Ugandan language resources.

Uganda has 43 local/native languages used by large sections of the population. The map shows the spatial distribution of the languages and dialects spoken in Uganda. Within a region, the languages and dialects are relatively similar, and people can understand and speak across them; the differences become more pronounced between regions.

AFRICAN LANGUAGE TECHNOLOGY

Africa is a very linguistically diverse continent, with at least 1500 languages (compared to around 200 in Europe), for most of which no AI language technology has ever been developed. Almost all current effort on AI language technology is focused on English and a handful of other languages.

Another issue with the way this technology has evolved is that existing large models are generally trained with text trawled from the Internet, and have shown a tendency to reflect the harmful biases and divisiveness common in online speech.

Starting with our work in 2020 on social media monitoring, we’re interested in showing what’s possible with language AI technology in African languages, and how it can be done responsibly and inclusively.

PROJECT IMPLEMENTATION

This project started in October 2020 and was completed in January 2021. It was structured as a collaboration between Sunbird AI and the Makerere AI lab. The Makerere AI lab’s strengths are a solid track record of applied AI research and of working with the eventual clients of that research, particularly government agencies in Uganda, and with direct beneficiaries such as smallholder farmers. The lab is also located at Makerere University, the leading university in Uganda, and has access to a pool of good graduate and undergraduate research students.

THE DATASET

We collected a multilingual parallel text corpus of 60,000 phrases/sentences, comprising 10,000 English sentences/phrases and their corresponding translations in five under-resourced languages in Uganda: Luganda, Lugbara, Runyakitara, Acholi and Ateso.
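To make the shape of the corpus concrete, here is a minimal sketch of how one parallel record could be unpacked into translation pairs. The record layout (a dict keyed by language name) is an assumption for illustration, not the published file format, and the translations in any example would be placeholders.

```python
from typing import Dict, Iterator, List, Tuple

# Target languages named in the post; the record layout itself is hypothetical.
TARGET_LANGUAGES = ["Luganda", "Lugbara", "Runyakitara", "Acholi", "Ateso"]


def translation_pairs(records: List[Dict[str, str]]) -> Iterator[Tuple[str, str, str]]:
    """Yield (target_language, english, translation) for every available translation."""
    for record in records:
        english = record["English"]
        for language in TARGET_LANGUAGES:
            translation = record.get(language)
            if translation:  # skip missing or empty translations
                yield language, english, translation


def corpus_size(num_english: int, num_targets: int = len(TARGET_LANGUAGES)) -> int:
    """Total sentence count: the English side plus one translation per target language."""
    return num_english + num_english * num_targets
```

With 10,000 English sentences and five target languages, corpus_size(10_000) gives the 60,000 figure quoted above.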

The English sentences were obtained from the following data sources to capture the variety and context of use of language. The most likely use of this corpus will be for applications in these same source domains or similar.

  • Social media (Facebook and Twitter)
  • English transcripts from radio data
  • Online newspapers, articles, blogs and websites, e.g., Uganda Legal Information Institute (ULII)
  • Text contributions from the Makerere University NLP community
  • Farmer responses from surveys

To mitigate privacy and copyright concerns, we used some of the sources only as inspiration for creating similar but different phrases, taking account of any privacy and bias concerns. For social media, for example, we removed identifying tags and explicit references to sensitive attributes like religion and politics. The dataset can be downloaded here.
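As a rough illustration of the tag-removal step, the sketch below strips @mentions and links from a social media phrase. The patterns and the scrub function are assumptions made for this example, not the cleaning code we actually ran, and they do not address sensitive attributes such as religion or politics, which need human review.

```python
import re

# Hypothetical patterns for identifying material in social media text.
MENTION = re.compile(r"@\w+")        # user handles such as @someone
URL = re.compile(r"https?://\S+")    # links that may point back to a profile or post
WHITESPACE = re.compile(r"\s+")


def scrub(text: str) -> str:
    """Remove links and @mentions, then collapse the whitespace left behind."""
    text = URL.sub("", text)
    text = MENTION.sub("", text)
    return WHITESPACE.sub(" ", text).strip()
```

Links are removed before mentions so that an "@" inside a URL is not cut out on its own.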

Interview on ‘The Groove with Crystal: Podcast’

On 17th May 2021, Ernest Mwebaze, one of our directors and founders, was hosted by one of Uganda’s radio legends, Crystal Newman. In this podcast they discussed how we are using Artificial Intelligence to curb noise pollution in Kampala.

You can listen to it here: The Groove Podcast – Sunbird

Sunbird AI Sets Out to Help Curb Noise Pollution in Kampala

Ugandans’ widespread exposure to noise pollution continues unabated because of failures in the country’s monitoring and control framework.

Yet there is evidence of the effects of noise pollution on human health and the general environment. While KCCA and NEMA have made efforts to control noise pollution within the capital city, and the courts have weighed in on the issue, the problem needs more innovative and effective approaches. These should empower citizens to be at the forefront of noise monitoring and control using simple technology tools.

Further, it is Sunbird AI’s belief that agencies mandated to control and monitor noise pollution should be empowered with technological tools that enable response and enforcement in more efficient ways.

Against this background, Sunbird AI, an artificial intelligence company, has developed tools that use artificial intelligence to support the monitoring and enforcement of noise pollution controls in Uganda. Sunbird AI envisages that these tools will empower the public to be vigilant actors in detecting and reporting noise pollution.

Ten noise collection agents have been dispatched to 66 parishes within the 5 divisions of Kampala city. An additional 100 will be dispatched at the end of the month of May 2021. These agents will enable assessments on general levels of exposure and provide the requisite data regulatory agencies need for decision-making on generating best practices.

Ernest Mwebaze, Sunbird AI’s Director, shared that, ‘identifying areas of high noise pressure is a key element for an effective environmental management and for mitigating impacts, identifying noise hotspots and areas of potential conflicts helps gather baseline knowledge on noise-producing human activities and mapping these areas.’

‘Currently, our focus is limited to more urban and industrialized towns because they have more population and so more human activities going on that are highly considered to be responsible for the high noise production. For example, in Kampala and Wakiso, owing to the level of industrialization, the population, and traffic dynamics (road, air, and railway), the noise pollution continues to increase, unabated,’ Ernest added.

Sunbird AI’s Lydia Sanyu training the noise collection agents on how to use the artificial intelligence to capture noise levels

Equipping citizens to detect, report, and control noise pollution would go a long way toward empowering Ugandans to take part in decision-making on a critical issue that affects their lives and health. For the general public to be involved in regulating noise pollution, they need technology that helps them sense and measure their personal exposure to noise in their everyday environments. With Sunbird AI’s artificial intelligence technology, this is now possible.

Although KCCA and NEMA monitor and control noise pollution, there is a paucity of data and trends documented, and it is reported that there are no established systems to manage and track noise pollution data. This poses challenges and risks in noise pollution monitoring and designing mitigation mechanisms. Sunbird AI is keen on supporting KCCA and NEMA with artificial intelligence to manage and track noise pollution data.

Sunbird AI 2020 Annual Report

Good news: the 2020 Sunbird AI annual report is out.

Our annual report contains the work we did as an organization last year. Considering that 2020 was dominated by a global pandemic with lockdowns, curfews and stay-at-home orders among other things, we started out as a completely remote team. In the absence of the ability to move about physically to coordinate our initially planned projects, we turned to the work we could do at that moment: aiding in the analysis of COVID-related data, on social media and on radio.

We also began research and implementation of AI language technology for five Ugandan languages, starting with a dataset of translations from English to these languages.

You can read more about these projects and about our organisation by downloading the report here: Sunbird 2020 Annual Report.

Happy reading!

COVID-19 Analysis (March 2021)

March 2021 has been a special month: it marks a year since the world began to see the effects of a rapidly spreading global pandemic. From freezing air travel to closing schools to lockdowns to curfews, the pandemic began to change the way we lived our lives. Worse still were the hospitalizations and deaths of so many people, the losses among our families and friends.

One year later, we are still living through these issues in some form. The rollout of vaccinations offers some hope, but the effects of the pandemic are far from over.

For most of the second half of 2020, we worked with the Ministry of Health (Uganda) to do social media analysis on public discussions about COVID-19. At this one-year mark, the Ministry of Health requested a follow-up analysis to find out what the Ugandan public generally thinks of the current state of events, including the recent drop in recorded COVID-19 cases in Uganda, the rollout of vaccinations, the continuing curfew, and any other COVID-related issues.

We did the analysis on social media data (Twitter and Facebook), over a period of about 12 months, starting from March 2020 to the end of February 2021. 

To carry out analysis, we developed a two-part system:

  1. A pipeline to fetch tweets from the Twitter API and posts from Facebook (through the CrowdTangle API) and store them for analysis.
  2. A machine learning model (a BERT-based classifier we named SunBERT) that we trained on these tweets/posts to predict whether a tweet or post is COVID-related or not.

Using Twitter’s new Academic Research API, we collected over 1.9 million Ugandan tweets in the period between March 2020 and February 2021. Using the SunBERT classification model we developed, we found that approximately 50,000 of the 1.9 million tweets were related to COVID-19, and most of those were posted during March and April 2020. Below is the monthly distribution of COVID-related tweets from our analysis:

Let’s look at an analysis of COVID-related tweets in Uganda over this time period:

It is evident that the discussion of COVID-related issues on Twitter was very high when the pandemic had just begun in March 2020 and has been steadily falling as the Ugandan public has become less and less interested in the pandemic discussion.
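For readers who want to reproduce this kind of monthly breakdown, here is a minimal sketch of the aggregation step. It assumes each tweet has already been labelled by a classifier and is available as a (timestamp, is_covid) pair; that input format and the function name are assumptions for illustration, not our pipeline's actual interface. Over the whole period, roughly 50,000 of 1.9 million tweets works out to about 2.6%.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, Tuple


def monthly_covid_share(tweets: Iterable[Tuple[str, bool]]) -> Dict[str, Tuple[int, int, float]]:
    """Map 'YYYY-MM' to (covid_count, total_count, covid_percentage).

    Each tweet is (ISO timestamp, classifier label); the format is hypothetical.
    """
    totals: Counter = Counter()
    covid: Counter = Counter()
    for created_at, is_covid in tweets:
        month = datetime.fromisoformat(created_at).strftime("%Y-%m")
        totals[month] += 1
        if is_covid:
            covid[month] += 1
    return {
        month: (covid[month], totals[month], 100.0 * covid[month] / totals[month])
        for month in sorted(totals)
    }
```

The same function gives both the raw monthly counts for a bar chart and the percentage share used in the February case study.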

Comparing this trend alongside the number of new COVID cases in Uganda (according to Worldometers) reveals a surprising lack of correlation between the two. For example, there was a spike of cases in November but people had got tired of discussing COVID by then, as shown in the graph below:
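One way to put a number on "lack of correlation" is the Pearson correlation coefficient between the monthly COVID-related tweet counts and the monthly new-case counts. The sketch below is a plain-Python version under that assumption; we are not claiming this is the statistic behind the graph, and the two input series would have to come from your own data.

```python
from math import sqrt
from typing import Sequence


def pearson(xs: Sequence[float], ys: Sequence[float]) -> float:
    """Pearson correlation coefficient of two equal-length series."""
    if len(xs) != len(ys) or len(xs) < 2:
        raise ValueError("need two series of equal length with at least 2 points")
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant series")
    return covariance / sqrt(spread_x * spread_y)
```

A value near 1 would mean discussion rises and falls with case numbers, a value near 0 that the two move independently, and a negative value that discussion fell while cases rose.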

WHAT HAS BEEN DISCUSSED RECENTLY?

Despite the relatively few tweets about COVID-19 in Uganda, there were still quite a number of interesting ones that revealed some underlying sentiments that Ugandans had about the pandemic. Let’s explore this in the following case study:

A CASE STUDY OF FEBRUARY 2021

In February 2021, only around 0.6% of tweets in Uganda were related to Covid-19.

Of these, some messages suggested that the pandemic in Uganda was over, or not significant, as the examples below show:

“Uganda Covid free 💪🏽”

“Now we register only 12 new cases of Covid 19? Small small 12?”

“What was the fear for. Covid was really hyped”

 

There were also some other themes of discussion that came up repeatedly. Below are a few examples:

QUESTIONING THE NEED FOR CONTINUED RESTRICTIONS

“Why is there still a curfew in Uganda?”

“When will curfew be lifted? Asking on behalf of everyone.”

“Sincerely tweeting, why do we still have curfew in Uganda????”

 

VACCINE HESITANCY

“Are we sure vaccines are safe anyway?”

“Do we really need #COVID19 vaccine as UGANDA?”

 

TESTING – THE ASSOCIATED EXPENSE AND NEED TO TEST

“Testing for covid19 in uganda… it’s like a privilege!”

“I’ve tested for covid 8 times since covid came. never tested positive. tonight i sit here to think about my ka money 😫”

 

PRESENCE OF COVID – IS IT THERE OR NOT? / WE SHOULD LIVE WITH IT

“Do our leaders know that covid19 isn’t just  a “period” …. this thing is gonna hit us for a long time. We are gonna have to eventually learn how to live with it ….just like we did with HIV n a bunch of other diseases n conditions”

“Why would we import covid vaccines when covid does not exist in the country 🤔”

 

Let’s also take a look at the popular tweets within this time period. These seemed to mostly discuss the above topics, some by being very grave about it, and others by trying to present them in a comedic way. A look into a few of these tweets shows this:

POPULAR TWEETS FROM LAST 4 MONTHS (FEB 2021)

“One day I will tell you guys how my mom nearly married me off during the lockdown and how I had to sit her down and give her the “I am not like that kind of girl” speech 😹”

“I have seen great businesses and enterprises close during this Covid19. I have seen renown rich people struggle to provide for their families because their income has been frustrated. If you still have a meal everyday and a roof over your head, count yourself blessed.”

“If there is just one thing I pray for every day is that covid ends and SOPs for taxis are removed. The fact that the common man has to pay twice as much as they used to in order to go to work is heart breaking. I really pray that taxi fares go back to normal soon 🥺.”

“Aaaahhh… but African governments bought tents and cars when rich countries were investing in vaccine research.”

 

CONCLUSION

The analysis above has shown us the declining interest in discussing COVID-19 over the past months. One can only hope that this does not translate into the dismissal of Standard Operating Procedures (SOPs) and all other safety measures, for the sake of our health in the midst of this pandemic.

Radio Advert Analysis for Covid-19

Ever been seated there listening to the radio and then you hear an advert about how COVID-19 spreads and how to stay safe? At Sunbird AI, that’s music to our ears.

Our most recent project has been monitoring and analyzing Ministry of Health adverts on the spread of COVID-19 and related safety measures.

The analysis is to track whether COVID-19 adverts are played on radio stations and how frequently they are played. This is important because the broadcasting of information to the public about COVID-19 is a priority, in order to keep us healthy and safe.

This project was implemented in a number of steps:

GETTING RADIO DATA

First, we have to have the radio data, and that means listening to a whole lot of radio. More than 300 stations, to be exact. And yes, I’m joking: of course we did not manually listen to all that radio.

We went through a digital data collection process as described below:

  • Compilation of a file with streaming URLs for a number of radio stations whose streaming URLs were easily accessible
  • Writing a Python script to get to each of those streaming URLs and record the data for an hour at a time
  • Writing a cron job to run this script at the top of every hour, for most of the hours of the day
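The steps above can be sketched roughly as follows. This is a minimal illustration using only the Python standard library, not the script we ran: the function names, the chunk size, the file paths and the hour range in the cron line are all assumptions.

```python
import time
from typing import BinaryIO, Callable
from urllib.request import urlopen


def record_stream(source: BinaryIO, sink: BinaryIO, duration_seconds: float,
                  chunk_size: int = 8192,
                  clock: Callable[[], float] = time.monotonic) -> int:
    """Copy bytes from source to sink until the duration elapses or the stream ends.

    Returns the number of bytes written.
    """
    deadline = clock() + duration_seconds
    written = 0
    while clock() < deadline:
        chunk = source.read(chunk_size)
        if not chunk:  # the station closed the stream
            break
        sink.write(chunk)
        written += len(chunk)
    return written


def record_url(url: str, out_path: str, duration_seconds: float = 3600) -> int:
    """Record one streaming URL to a file for an hour by default."""
    with urlopen(url, timeout=30) as response, open(out_path, "wb") as out_file:
        return record_stream(response, out_file, duration_seconds)


# Hypothetical crontab entry: run at minute 0 of every hour from 06:00 to 22:00.
# 0 6-22 * * * /usr/bin/python3 /path/to/record_radio.py
```

Keeping the copy loop separate from the network call means the timing logic can be checked against an in-memory stream without touching a real station.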

STORING THE DATA

By now you could be wondering about the huge amount of storage space that all this data would take up with time. In order to save storage space, a data retention policy is required. A data retention policy within an organization is a set of guidelines that describes which data will be archived, how long it will be kept, what happens to the data at the end of the retention period, and other factors concerning the retention of the data. In our case here, we store the radio recordings on our server only for the current day. At the end of each day, the recordings are backed up to cloud storage and then deleted from the server.
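A retention rule like this one can be sketched in a few lines. The version below is only an illustration: the upload callable stands in for whatever cloud storage client is used (it is hypothetical), and "older than today" is judged by each file's modification time, which is an assumption about how the recordings are dated.

```python
from datetime import date, datetime
from pathlib import Path
from typing import Callable, List


def archive_old_recordings(directory: Path, upload: Callable[[Path], None],
                           today: date) -> List[Path]:
    """Upload, then delete, every recording last modified before `today`.

    `upload` is a placeholder for a cloud backup call; it should raise on
    failure so that a file is never deleted without a backup.
    """
    archived = []
    for path in sorted(directory.iterdir()):
        if not path.is_file():
            continue
        modified = datetime.fromtimestamp(path.stat().st_mtime).date()
        if modified < today:
            upload(path)   # back up first
            path.unlink()  # delete only after the upload returned
            archived.append(path)
    return archived
```

Deleting only after the upload call returns keeps the policy safe if a backup fails partway through the night.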

ANNOTATING

Before fingerprinting the recordings, we had to find a way to test the accuracy of that method. If the results of fingerprinting say that there are two instances of the advert in a certain recording, we had to know for sure how true that was.

This meant that we needed some labeled data, labeled by us human beings, in order to check whether the computer was right. We picked samples from our huge pile of data and annotated them using Audacity, an amazing audio editor.

A glimpse into what the data annotation process looks like:

Here is a sample of annotated data for an hour of radio:

As the image shows, there are a lot of different things that go on in just an hour of radio, but what we are looking out for are the Ministry of Health COVID-19 adverts that run for just about a minute. Now that we know that the advert features in this particular hour, we can run the fingerprinting and see if it comes up with the same result.
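To compare the annotations with the fingerprinting results, the labels first have to be read back in. Audacity exports a label track as tab-separated lines of start time, end time and label text, so a small parser is enough. The sketch below is an illustration under assumptions: the label name, the five-second tolerance and the idea that the fingerprinting step reports one offset in seconds per detection are ours for this example.

```python
from typing import List, Tuple

Interval = Tuple[float, float]  # (start_seconds, end_seconds)


def parse_audacity_labels(text: str, wanted_label: str) -> List[Interval]:
    """Return the intervals whose label matches, from an exported label track."""
    intervals = []
    for line in text.splitlines():
        if line.startswith("\\"):  # extra frequency-range lines, not labels
            continue
        parts = line.split("\t")
        if len(parts) < 3:
            continue
        start, end, label = float(parts[0]), float(parts[1]), parts[2].strip()
        if label == wanted_label:
            intervals.append((start, end))
    return intervals


def score_detections(annotated: List[Interval], detected_offsets: List[float],
                     tolerance: float = 5.0) -> Tuple[int, int, int]:
    """Return (true_positives, false_positives, missed) for detected offsets."""
    matched = set()
    false_positives = 0
    for offset in detected_offsets:
        hit = next(
            (i for i, (start, end) in enumerate(annotated)
             if i not in matched and start - tolerance <= offset <= end + tolerance),
            None,
        )
        if hit is None:
            false_positives += 1
        else:
            matched.add(hit)
    return len(matched), false_positives, len(annotated) - len(matched)
```

Each annotated advert can be matched by at most one detection, so a second detection inside the same interval counts as a false positive.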

FINGERPRINTING

First, what does fingerprinting even mean?

Audio fingerprinting is the process of condensing an audio signal into a compact digital summary (a fingerprint) by extracting the acoustically relevant characteristics of a piece of audio content.

The short version of this is that it finds a way of identifying a piece of audio.

For our project, we ran a fingerprinting script using a Python tool called dejavu, with the aim of identifying the instances of the COVID-19 adverts played within the radio recordings.

CONCLUSION

After this process, what we get is the ability to choose any radio recording and find out the number of times the COVID-19 adverts are played. This can be extended according to what is required at a given time, for example checking how frequently the adverts play over an entire day on a particular radio station, or checking how they are spread throughout the day.
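Once each detection is stored with its station and time, these follow-up questions become simple counting. Here is a minimal sketch, assuming detections are available as (station, ISO timestamp) pairs; that layout and the function names are assumptions for illustration, not the output format of the fingerprinting tool.

```python
from collections import Counter
from datetime import datetime
from typing import Dict, Iterable, List, Tuple

Detection = Tuple[str, str]  # (station name, ISO timestamp of a detected advert)


def plays_per_station_day(detections: Iterable[Detection]) -> Dict[Tuple[str, str], int]:
    """Count advert plays for every (station, 'YYYY-MM-DD') combination."""
    counts: Counter = Counter()
    for station, timestamp in detections:
        day = datetime.fromisoformat(timestamp).date().isoformat()
        counts[(station, day)] += 1
    return dict(counts)


def hourly_spread(detections: Iterable[Detection], station: str, day: str) -> List[int]:
    """Return 24 counts, one per hour, for one station on one day."""
    hours = [0] * 24
    for name, timestamp in detections:
        moment = datetime.fromisoformat(timestamp)
        if name == station and moment.date().isoformat() == day:
            hours[moment.hour] += 1
    return hours
```

The first function answers "how often did the advert play on this station today?", and the second shows whether the plays are bunched into a few hours or spread across the day.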

In that way, we achieve the goal of tracking the broadcasting of crucial COVID-19 information to the public, to make sure that safety information becomes common knowledge.