Bridging the AI language problem
Source: Kira Schacht, Deutsche Welle
AI tools like ChatGPT offer amazing opportunities, provided you happen to speak a language they support. But according to Mekdes Gebrewold, founder of the Ashagari consultancy in the Ethiopian capital Addis Ababa, even machine translation is out of reach in her language. “Tools like Google Translate are not well-constructed for Amharic,” she told DW. “So we pay professionals instead.”
Billions of people like Mekdes Gebrewold are unable to take advantage of AI-powered tools because of the language they speak. This applies not only to generative AI like ChatGPT or translation services like Google Translate, but also to a range of other tools: autocomplete, transcription services, voice assistants and content moderation on social media.
But some are working to change that.
Why do AI tools not work in many languages?
Modern AI tools are, in essence, advanced autocomplete tools that predict the most likely answer based on the input they get. These predictions rely on vast amounts of “training data” – digital collections of content that AI engineers use to build their models.
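To make the autocomplete comparison concrete, here is a deliberately tiny sketch in Python. It is not how ChatGPT is built (real systems use neural networks trained on billions of pages); it is a simple word-pair counter, and the function names and the three training sentences are invented for illustration.

```python
from collections import Counter, defaultdict

def train_bigram_model(sentences):
    """Count, for each word, which words follow it in the training data."""
    following = defaultdict(Counter)
    for sentence in sentences:
        words = sentence.lower().split()
        for current, nxt in zip(words, words[1:]):
            following[current][nxt] += 1
    return following

def predict_next(model, word):
    """Return the most frequent follower of `word`, or None if it was never seen."""
    counts = model.get(word.lower())
    if not counts:
        return None
    return counts.most_common(1)[0][0]

# Hypothetical training data: three short English sentences.
training_data = [
    "the cat sat on the mat",
    "the cat chased the mouse",
    "the cat slept",
]
model = train_bigram_model(training_data)

print(predict_next(model, "the"))    # cat
print(predict_next(model, "selam"))  # None: this word never appeared in the data
```

The second call shows the problem in miniature: a word the model has never seen in its training data gives it nothing to predict from.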
One important source of training data is the so-called Common Crawl, an openly available dataset consisting of billions of web pages from the internet. About 60% of the examples used to train ChatGPT’s version 3.5 come from this collection.
However, because of this reliance on training data, AI tools work poorly or not at all when data in a particular language is scarce. This is an issue since the internet is dominated by only a few languages, none more so than English, which alone accounts for almost half of all pages in the Common Crawl.
Together, Amharic and all other African, American, and Oceanian languages make up less than 0.1% of the Common Crawl. Amharic is what is known as a low-resource language, one with little digital data available. Around the world, billions of people speak these low-resource languages, a category that includes even major languages like Hindi, Arabic, and Bengali.
There is a clear pattern that shows which languages are left behind. European languages are vastly overrepresented compared to most Asian and all African ones. Dutch, for example, is spoken as a first language by just over 20 million people, similar to Amharic. Yet Dutch appears almost 700 times more often in the Common Crawl dataset, and hundreds of times more often than even Hindi with its over 300 million native speakers.
There are ways around this lack of data, though. Beyond the tech giants of Silicon Valley, machine-learning researchers all over the world are developing AI-powered tools for their own languages.
How to bridge AI’s language gap
Asmelash Teka Hadgu cofounded Lesan, a startup that creates machine translation and speech technology for the Ethiopian languages Amharic and Tigrinya. Without vast amounts of online resources to draw on, his team works directly with their community and finds creative ways to collect data.
“We work mainly with students who just love their language,” he told DW. “When we tell them we’re building this thing, they are inspired and want to contribute. So we set out tasks to gather content in our language. And we assist, and reward them financially.”
This requires a lot of manual labor. Contributors first identify high-quality datasets, such as trustworthy books or newspapers, then digitize and translate them into the target languages. Finally, they align the original and translated versions sentence by sentence to guide the machine learning process.
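The final alignment step can be sketched in a few lines of Python. This is an illustrative assumption, not Lesan's actual tooling: it supposes each sentence was translated one-to-one and simply pairs sentences by position. The function names are hypothetical, and an English-German pair stands in for the Ethiopian languages.

```python
import re

# Split after sentence-ending punctuation: Latin marks plus the
# Ethiopic full stop (U+1362) used in Amharic and Tigrinya text.
SENTENCE_END = re.compile(r"(?<=[.!?\u1362])\s+")

def split_sentences(text):
    """Break a passage into a list of sentences."""
    return [s.strip() for s in SENTENCE_END.split(text.strip()) if s.strip()]

def align(source_text, target_text):
    """Pair source and target sentences by position.

    Assumes a one-to-one translation. If the counts differ, a human
    has to resolve the mismatch, so we raise instead of guessing.
    """
    source = split_sentences(source_text)
    target = split_sentences(target_text)
    if len(source) != len(target):
        raise ValueError(
            f"cannot align: {len(source)} source vs {len(target)} target sentences"
        )
    return list(zip(source, target))

# Stand-in example texts (hypothetical).
english = "The market opens at dawn. Farmers bring coffee and grain."
german = "Der Markt öffnet bei Tagesanbruch. Bauern bringen Kaffee und Getreide."

for src, tgt in align(english, german):
    print(src, "|", tgt)
```

In practice translators merge and split sentences, which is one reason this step needs the manual review the contributors provide.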
With this approach, companies like Lesan cannot hope to rival the billions of pages of English content available, but they might not need to. Lesan, for instance, already outperforms Google Translate in both Amharic and Tigrinya.
“We’ve shown that you can build useful models by using small, carefully curated data sets,” said Asmelash Teka Hadgu. “We understand its limitations and capabilities. Meanwhile, Microsoft or Google usually build a single, gigantic model for all languages, so it’s almost impossible to audit.”
More languages need digital support
Lesan’s approach is not unique. Similar projects are being successfully implemented all over the world, even for languages with smaller digital footprints.
Ethnologue, a global database of languages run by the Christian NGO SIL International, lists Amharic among the languages with “Vital” language support. This means that at least some machine translation, spellcheck, and speech processing tools are available. Thousands of languages worldwide, including many with more than a million users, have even less content and fewer digital tools on offer.
Asmelash Teka Hadgu is part of a network of African AI pioneers. He is a research fellow at the Distributed AI Research Institute (DAIR), a group of researchers from Africa, Europe, and North America. He is also in regular contact with groups like GhanaNLP and the African grassroots collective Masakhane.
“We’re enabling ownership of these technologies by African founders,” he told DW. “This is being built and served by people from these communities. So the financial rewards will also go directly back to them.” Outside Africa, too, researchers are working on other languages, including Jamaican Patois, Catalan, Sudanese, and Māori.
And while tech companies like OpenAI, the maker of ChatGPT, keep their models secret and inscrutable, initiatives like the global AI collective Hugging Face have been sharing insights and AI models freely. This makes it easier for any researcher to create solutions for their languages.
“Talent is everywhere, opportunity is not,” said Asmelash Teka Hadgu. “If you want to create the best kind of machine translation technology, say, for a Ghanaian language, there will be a Ghanaian who feels passionately and can do it well. Let’s empower that.”