Wikimedia Deutschland has launched a free vector database enabling developers to build generative AI applications using Wikidata’s 119 million open knowledge entries, marking the first time this data can be used directly for AI development.
The Embedding Project went live today at https://wd-vectordb.toolforge.org. It translates Wikidata’s structured data into vectors that large language models can draw on through retrieval-augmented generation. The technology supports searches in English, French and Arabic, with Spanish and Mandarin to follow by year end.
The database employs a hybrid search approach that combines vector search, keyword search, and descriptive queries, with built-in reranking to surface the most relevant results. Around 24,000 volunteers worldwide maintain and expand Wikidata each month.
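For readers unfamiliar with hybrid search, the general idea of merging a vector ranking with a keyword ranking can be sketched in a few lines of Python. The article does not say how the Wikidata vector database fuses or reranks its results, so this sketch substitutes reciprocal rank fusion, a common general-purpose technique, purely as an illustration. The function name, the constant k=60, and the item IDs in the sample lists are all assumptions chosen for the example, not details of the project.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of item IDs into a single ranked list.

    Illustrative only: the source does not state which fusion or
    reranking method the Wikidata vector database actually uses.
    """
    scores = defaultdict(float)
    for ranking in rankings:
        # Items near the top of any list contribute a larger score.
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first; ties are broken by ID so output is stable.
    return sorted(scores, key=lambda item_id: (-scores[item_id], item_id))


# Hypothetical result lists for one query (placeholder item IDs).
vector_hits = ["Q42", "Q5", "Q1"]    # ranked by embedding similarity
keyword_hits = ["Q42", "Q7", "Q5"]   # ranked by keyword match

fused = reciprocal_rank_fusion([vector_hits, keyword_hits])
print(fused)
```

An item that appears in both lists, such as Q5 here, ends up ahead of items found by only one search method, which is the behaviour a hybrid approach is meant to deliver.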
“We want to create an infrastructure that enables everyone to develop generative AI applications based on verifiable, free and open data,” says Lydia Pintscher, Portfolio Lead at Wikimedia Deutschland. “This is an important step toward a digital world in which technologies for the benefit of society are not a footnote but the norm.”
The project aims to reduce AI hallucinations by grounding answers in verified data, to increase transparency through traceable sourcing, and to offer more current information than statically trained models. The codebase is available under an open licence.
Wikimedia Deutschland has been developing the project since September 2024 in collaboration with DataStax, an IBM company that provides AI and data solutions, and Berlin-based Jina AI, which supplies the embedding system that transforms Wikidata into vectors. DataStax’s Astra DB vector database stores the data.
A free webinar on 9 October will demonstrate practical applications and offer usage tips for developers interested in the technology.