Machine translation technology has created a destructive cycle affecting small-language Wikipedia editions, where artificial intelligence systems train on poorly translated content and subsequently produce increasingly degraded translations.

The problem came to light when Kenneth Wehr took over management of Greenlandic Wikipedia four years ago and discovered that virtually all of its content had been created by non-speakers using machine translators, forcing him to delete most articles, MIT Technology Review reports.

The 26-year-old German, who studied Greenlandic after becoming fascinated with the autonomous Danish territory, found pages riddled with elementary errors, including an entry claiming Canada contained only 41 inhabitants. Many articles featured meaningless word strings generated by translation systems unable to process the language properly.

Volunteers managing four African-language editions estimate that between 40 and 60 per cent of their Wikipedia articles are uncorrected machine translations. An analysis of the Wikipedia edition in Inuktitut, an Indigenous language of Canada related to Greenlandic, suggests that over two-thirds of substantial pages contain machine-generated portions.

The phenomenon creates what researchers term a “linguistic doom loop” where AI systems learn from Wikipedia’s flawed content, subsequently producing worse translations that generate more corrupted pages. Wikipedia often represents the largest online linguistic resource for minority languages, making it a primary training source for translation models.

Kevin Scannell, a former Saint Louis University computer science professor who develops software for endangered languages, explained that AI models depend entirely on available text. “These models are built on raw data. They will try and learn everything about a language from scratch. There is no other input. There are no grammar books. There are no dictionaries. There is nothing other than the text that is inputted.”

Research indicates that Wikipedia made up over half the training data for AI translation models covering several African languages in 2020, whilst German studies from 2022 found that Wikipedia was the sole accessible online source for 27 under-resourced languages.

Abdulkadir Abdulkadir, who manages Fulfulde Wikipedia for pastoralists across the Sahel region, spends three hours daily correcting machine-translated agricultural information that could harm farmers if left uncorrected. Google Translate incorrectly suggests the Fulfulde word for January means June, whilst ChatGPT claims it represents August or September.

“It is going to be terrible, honestly,” Abdulkadir said regarding the language’s future. “Totally, completely no future.”

The crisis threatens languages already facing displacement pressures. Lucy Iwuala, who contributes to Igbo Wikipedia, views her work as cultural preservation. “This is my culture. This is who I am,” she said. “That is the essence of it all: to ensure that you are not erased.”
