Mohamed Hassan/PXHere

Machine translation technology has created a destructive cycle affecting small-language Wikipedia editions, where artificial intelligence systems train on poorly translated content and subsequently produce increasingly degraded translations.

The problem emerged prominently when Kenneth Wehr assumed management of Greenlandic Wikipedia four years ago and discovered virtually all content had been created by non-speakers using machine translators, forcing him to delete most articles, reports MIT Technology Review.

The 26-year-old German, who studied Greenlandic after becoming fascinated with the autonomous Danish territory, found pages riddled with elementary errors, including an entry claiming Canada contained only 41 inhabitants. Many articles featured meaningless word strings generated by translation systems unable to process the language properly.

Volunteers managing four African language editions estimate between 40 and 60 per cent of their Wikipedia articles consist of uncorrected machine translations. Analysis of Inuktitut Wikipedia, an Indigenous Canadian language related to Greenlandic, suggests over two-thirds of substantial pages contain machine-generated portions.

The phenomenon creates what researchers term a “linguistic doom loop” where AI systems learn from Wikipedia’s flawed content, subsequently producing worse translations that generate more corrupted pages. Wikipedia often represents the largest online linguistic resource for minority languages, making it a primary training source for translation models.

Kevin Scannell, former Saint Louis University computer science professor who develops software for endangered languages, explained that AI models depend entirely on available text. “These models are built on raw data. They will try and learn everything about a language from scratch. There is no other input. There are no grammar books. There are no dictionaries. There is nothing other than the text that is inputted.”

Research indicates Wikipedia comprised over half the training data for AI translation models covering several African languages in 2020, whilst 2022 German studies found Wikipedia was the sole accessible online source for 27 under-resourced languages.

Abdulkadir Abdulkadir, who manages Fulfulde Wikipedia for pastoralists across the Sahel region, spends three hours daily correcting machine-translated agricultural information that could harm farmers if left uncorrected. Google Translate incorrectly suggests the Fulfulde word for January means June, whilst ChatGPT claims it represents August or September.

“It is going to be terrible, honestly,” Abdulkadir said regarding the language’s future. “Totally, completely no future.”

The crisis threatens languages already facing displacement pressures. Lucy Iwuala, who contributes to Igbo Wikipedia, views her work as cultural preservation. “This is my culture. This is who I am,” she said. “That is the essence of it all: to ensure that you are not erased.”

Leave a Reply

Your email address will not be published. Required fields are marked *

You May Also Like

Bedtime doomscrolling costing millions of Americans a good night’s sleep

Millions of Americans are actively sacrificing a good night’s rest for one…

Humans beat AI at spotting deepfake videos but fail entirely with photos

As artificial intelligence gets better at generating fake imagery, a new study…

Grocery stores are new immigration ‘hot spots’ but communities fight back

As immigration enforcement reaches deep into everyday American life, once-safe business spaces…

40 million lost days: The real ‘human cost’ of the race for digital capacity

As data centres scale to power the AI era, it’s not just…