Machine translation technology has created a destructive cycle affecting small-language Wikipedia editions, where artificial intelligence systems train on poorly translated content and subsequently produce increasingly degraded translations.
The problem emerged prominently when Kenneth Wehr assumed management of Greenlandic Wikipedia four years ago and discovered virtually all content had been created by non-speakers using machine translators, forcing him to delete most articles, reports MIT Technology Review.
The 26-year-old German, who studied Greenlandic after becoming fascinated with the autonomous Danish territory, found pages riddled with elementary errors, including an entry claiming Canada contained only 41 inhabitants. Many articles featured meaningless word strings generated by translation systems unable to process the language properly.
Volunteers managing four African language editions estimate between 40 and 60 per cent of their Wikipedia articles consist of uncorrected machine translations. Analysis of Inuktitut Wikipedia, an Indigenous Canadian language related to Greenlandic, suggests over two-thirds of substantial pages contain machine-generated portions.
The phenomenon creates what researchers term a “linguistic doom loop” where AI systems learn from Wikipedia’s flawed content, subsequently producing worse translations that generate more corrupted pages. Wikipedia often represents the largest online linguistic resource for minority languages, making it a primary training source for translation models.
Kevin Scannell, former Saint Louis University computer science professor who develops software for endangered languages, explained that AI models depend entirely on available text. “These models are built on raw data. They will try and learn everything about a language from scratch. There is no other input. There are no grammar books. There are no dictionaries. There is nothing other than the text that is inputted.”
Research indicates Wikipedia comprised over half the training data for AI translation models covering several African languages in 2020, whilst 2022 German studies found Wikipedia was the sole accessible online source for 27 under-resourced languages.
Abdulkadir Abdulkadir, who manages Fulfulde Wikipedia for pastoralists across the Sahel region, spends three hours daily correcting machine-translated agricultural information that could harm farmers if left uncorrected. Google Translate incorrectly suggests the Fulfulde word for January means June, whilst ChatGPT claims it represents August or September.
“It is going to be terrible, honestly,” Abdulkadir said regarding the language’s future. “Totally, completely no future.”
The crisis threatens languages already facing displacement pressures. Lucy Iwuala, who contributes to Igbo Wikipedia, views her work as cultural preservation. “This is my culture. This is who I am,” she said. “That is the essence of it all: to ensure that you are not erased.”