AI-ready datasets
Artificial intelligence depends on access to high-quality, reliable and structured data. Training models responsibly requires datasets that are governed, interoperable, traceable and enriched with authoritative metadata. Datasets with these qualities are often difficult for research and industry to obtain. Within the European ecosystem, this need has been recognised through the European AI policy agenda, which emphasises strengthening data ecosystems and improving access to trustworthy, reusable datasets for AI development.
As the official provider of publishing and reference data services for the European Union, the Publications Office plays a unique enabling role. By combining Cellar, Europe’s common repository for documents and metadata, with EU Vocabularies, Europe’s reference and semantic asset hub, we can generate curated, AI-ready corpora. These corpora draw on human-annotated resources, authoritative taxonomies, persistent identifiers, multilingual assets and FAIR-aligned publication practices, making them highly suitable for training, validating and benchmarking AI systems.
This page introduces the first set of corpora produced under this initiative:
- A curated evaluation corpus designed to benchmark auto-tagging solutions using real institutional documents enriched with Digital Europe Thesaurus (DET), corporate body and country tags. Built through a multi-phase methodological process, it offers a balanced sample of 4,341 documents selected from Cellar to ensure coverage, annotation richness, diversity and analytical depth.
- A training corpus derived from Cellar documents tagged with EuroVoc concepts, supporting the development of machine learning models for document classification, semantic tagging and entity recognition in institutional environments.
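As an illustration of how an evaluation corpus of this kind might be used, the sketch below scores the tags predicted by an auto-tagging solution against human-assigned tags using micro-averaged precision, recall and F1. It is a minimal example under stated assumptions, not the methodology applied by the Publications Office: the function name `micro_prf`, the document identifiers and the tag labels are all hypothetical, and the actual distribution format of the corpus is not assumed here.

```python
from typing import Dict, Set, Tuple


def micro_prf(
    gold: Dict[str, Set[str]],
    predicted: Dict[str, Set[str]],
) -> Tuple[float, float, float]:
    """Micro-averaged precision, recall and F1 over per-document tag sets.

    `gold` maps a document id to its human-assigned tags; `predicted` maps
    the same ids to the tags proposed by the system under evaluation.
    Documents missing from `predicted` count as having no predicted tags.
    """
    tp = fp = fn = 0
    for doc_id, gold_tags in gold.items():
        pred_tags = predicted.get(doc_id, set())
        tp += len(gold_tags & pred_tags)   # tags correctly proposed
        fp += len(pred_tags - gold_tags)   # tags proposed but not in gold
        fn += len(gold_tags - pred_tags)   # gold tags the system missed
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Hypothetical toy data: identifiers and tag labels are placeholders.
gold = {
    "doc-1": {"concept-a", "concept-b"},
    "doc-2": {"concept-c"},
}
predicted = {
    "doc-1": {"concept-a"},
    "doc-2": {"concept-c", "concept-d"},
}

precision, recall, f1 = micro_prf(gold, predicted)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```

Micro-averaging pools all tag decisions across documents, so frequently used concepts weigh more heavily in the score. A per-concept (macro) average may be a better choice when rare concepts matter as much as common ones.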
Together, these corpora demonstrate how Cellar and EU Vocabularies act as an AI-ready foundation: they transform public sector knowledge into reusable training assets for AI applications, support evaluation of machine learning solutions, and contribute to data spaces, AI factories and semantic services across Europe.
This page will continue to evolve with updates on datasets, usage guidance, governance principles and links for access, contributing to trustworthy European AI development grounded in authoritative public data.