Wikimedia Deutschland has launched a free vector database enabling developers to build generative AI applications using Wikidata’s 119 million open knowledge entries, marking the first time this data can be used directly for AI development.

The Embedding Project went live today at https://wd-vectordb.toolforge.org. It translates Wikidata’s structured data into vectors that large language models can draw on through retrieval-augmented generation. The technology supports searches in English, French and Arabic, with Spanish and Mandarin to follow by the end of the year.
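For readers unfamiliar with retrieval-augmented generation, the idea can be sketched in a few lines of Python: relevant entries are fetched first, then passed to the language model as cited context. This is a minimal illustration, not the project's code. The `retrieve` and `build_prompt` functions, the word-overlap scoring and the two sample entries are all assumptions made for the example, and no call to the live service or to any model is shown.

```python
# Minimal sketch of retrieval-augmented generation (RAG).
# Everything here is illustrative: the entries, the scoring and the
# function names are assumptions, not the Embedding Project's API.

ENTRIES = [
    {"id": "Q64", "text": "Berlin is the capital of Germany"},
    {"id": "Q42", "text": "Douglas Adams was an English writer"},
]


def retrieve(query, entries, top_k=2):
    """Placeholder retriever: rank entries by words shared with the query."""
    terms = set(query.lower().split())
    ranked = sorted(
        entries,
        key=lambda e: len(terms & set(e["text"].lower().split())),
        reverse=True,
    )
    return ranked[:top_k]


def build_prompt(query, retrieved):
    """Put the retrieved entries, tagged with their IDs, ahead of the question."""
    context = "\n".join(f"[{e['id']}] {e['text']}" for e in retrieved)
    return (
        "Answer using only the context below and cite entry IDs.\n\n"
        f"Context:\n{context}\n\nQuestion: {query}"
    )


question = "What is the capital of Germany"
prompt = build_prompt(question, retrieve(question, ENTRIES, top_k=1))
print(prompt)
```

The resulting prompt would then be sent to a language model of the developer's choice. Keeping the entry IDs in the context is what lets an answer be traced back to its source.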

The database employs a hybrid search approach that combines vector search, keyword search, and descriptive queries, with built-in reranking to surface the most relevant results. Around 24,000 volunteers worldwide maintain and expand Wikidata each month.
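A hybrid search of this kind can be illustrated with a toy example. The sketch below ranks three sample entries once by vector similarity and once by keyword overlap, then merges the two rankings with reciprocal rank fusion. The article does not say how the project combines its results, so reciprocal rank fusion is an assumption here, as are the bag-of-words "embedding", the sample texts and every function name. The reranking step is left out.

```python
import math
from collections import Counter

# Toy stand-ins for Wikidata entries; the texts are illustrative only.
DOCS = {
    "Q42": "Douglas Adams English writer and humorist",
    "Q64": "Berlin capital city of Germany",
    "Q2": "Earth third planet from the Sun",
}


def tokenize(text):
    return text.lower().split()


def embed(text):
    # Hypothetical embedding: a word-count vector, not a real embedding model.
    return Counter(tokenize(text))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def vector_rank(query):
    """Rank entry IDs by cosine similarity to the query vector."""
    q = embed(query)
    return sorted(DOCS, key=lambda d: cosine(q, embed(DOCS[d])), reverse=True)


def keyword_rank(query):
    """Rank entry IDs by the number of query words they contain."""
    terms = set(tokenize(query))
    return sorted(
        DOCS, key=lambda d: len(terms & set(tokenize(DOCS[d]))), reverse=True
    )


def reciprocal_rank_fusion(rankings, k=60):
    """Merge rankings: each list adds 1 / (k + position) to an entry's score."""
    scores = Counter()
    for ranking in rankings:
        for position, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + position)
    return [doc_id for doc_id, _ in scores.most_common()]


def hybrid_search(query, top_k=2):
    fused = reciprocal_rank_fusion([vector_rank(query), keyword_rank(query)])
    return fused[:top_k]


results = hybrid_search("capital of Germany")
print(results)
```

Fusing ranks rather than raw scores avoids having to put similarity values and keyword counts on a common scale, which is one reason the technique is a common choice for hybrid search.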

“We want to create an infrastructure that enables everyone to develop generative AI applications based on verifiable, free and open data,” says Lydia Pintscher, Portfolio Lead at Wikimedia Deutschland. “This is an important step toward a digital world in which technologies for the benefit of society are not a footnote but the norm.”

The project aims to reduce AI hallucinations by providing verified data sources, to increase transparency through traceable sourcing, and to offer more current information than statically trained models. The codebase is available under an open licence.

Wikimedia Deutschland has been developing the project since September 2024 in collaboration with DataStax, an IBM company that provides AI and data solutions, and Berlin-based Jina AI, which supplies the embedding system that transforms Wikidata into vectors. DataStax’s Astra DB vector database stores the data.

A free webinar on 9 October will demonstrate practical applications and usage tips for developers interested in the technology.
