AI unlocks billions of plant records as researchers automate museum archives

Billions of plant specimens currently gathering dust in museum cabinets could soon be accessible to scientists worldwide after researchers successfully used artificial intelligence to automate the digitisation of natural history collections.

A new study from the University of North Carolina at Chapel Hill demonstrates that large language models (LLMs) can determine the original collection locations of plant specimens with near-human accuracy, solving a manual bottleneck that has kept vast amounts of ecological data offline.

The research team found that AI tools could complete this “georeferencing” process with an error margin of less than 10 kilometres whilst operating significantly faster and more cost-effectively than traditional methods.

Natural history collections are vital for tracking biodiversity loss, understanding species movement under climate change and analysing ecosystem shifts. However, of the estimated two to three billion herbarium specimens worldwide, only a small fraction has been digitised.

Without digital records and precise spatial data, these physical archives remain essentially useless for modern large-scale ecological research. Traditional georeferencing relies on manual interpretation, specialised software or multiple rounds of expert review, a process that has proven too slow and expensive to clear the global backlog.

Biggest bottlenecks

“Our study explores how large language models can take on one of the biggest bottlenecks in digitising plant collections,” said Yuyang Xie, first author and postdoctoral researcher in the Department of Biology at UNC. “We are pioneering the use of these tools for georeferencing, a breakthrough that will accelerate the digitisation of plant specimens and unlock new possibilities for ecological research.”

The study set out to determine whether AI could automate one of the most time-consuming steps in digitisation. The results confirmed that LLMs could outperform existing methods in accuracy, efficiency and scalability.

By accurately interpreting location descriptions from specimen labels, the technology allows researchers to rapidly process millions of records that would otherwise take decades to digitise manually.

“Recent advances in LLMs can potentially transform the georeferencing process, making it faster and more accurate,” said Xiao Feng, corresponding author and assistant professor in the Department of Biology at UNC. “This gives researchers unprecedented opportunities to advance our understanding of global biodiversity distributions.”

“This technology allows us to unlock millions of records that are currently sitting in cabinets,” said Xie. “With the power of LLMs, we can rapidly digitise plant specimen data that will be critical for addressing global environmental challenges.”