Corpus for evaluating auto-tagging solutions
Introduction
The corpus for evaluating auto-tagging solutions is a curated set of documents extracted from the CELLAR repository, each associated with DET tags and, when present, resource type labels. It was developed to provide a stable, high-quality benchmark for comparing automatic tagging solutions on realistic institutional content. By focusing on representative materials from the existing ecosystem, the corpus should enable robust measurement of how well tagging systems perform on the kind of documents they would encounter in practice.
The main objective of the corpus is to support the evaluation of auto-tagging tools for a specific use case along three dimensions: tagging accuracy, consistency across heterogeneous documents, and scalability to large collections.
Methodology
The creation of the corpus was a multi-phase process designed to systematically refine a vast and noisy initial dataset into a high-quality, balanced corpus.
Phase 1: General Data Extraction from CELLAR
The foundational step involved a bulk extraction of document links, their corresponding metadata, and especially their EuroVoc tags, from the CELLAR repository. This initial trawl was designed to be broad, capturing a wide array of documents to form the raw material for the corpus. We restricted this extraction to English-language documents in PDF/A-1b or PDF/A-2a format.
Phase 2: Tag Taxonomy Mapping
Within the scope of the original use case, we decided to map the existing, broader tags from the EuroVoc thesaurus to the more specific and structured DET (Domain, Entity, Topic) taxonomy, as well as to the Country and Corporate Body tags of the authority tables. The resulting tag set contains 1,904 items, of which 1,406 appear at least once in the extracted documents.
The initial extraction, combined with the refined tag set, yielded a large, heterogeneous pool of 13,522 candidate documents with a median of three tags per document, which served as the basis for designing a balanced corpus.
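The mapping step can be illustrated with a minimal sketch. The mapping table, tag labels and document identifiers below are hypothetical toy values, not the actual EuroVoc-to-DET mapping; the sketch only shows the general idea of translating source tags, dropping unmapped ones, and then deriving the set of active tags and the median number of tags per document.

```python
from statistics import median

# Hypothetical EuroVoc -> DET mapping (assumption: the real mapping table differs).
EUROVOC_TO_DET = {
    "environmental policy": "Environment",
    "climate change": "Climate",
    "EU Member State": "EU Member State",
}


def map_tags(eurovoc_tags, mapping=EUROVOC_TO_DET):
    """Map source tags to target tags, dropping unmapped tags and duplicates."""
    mapped = []
    for tag in eurovoc_tags:
        target = mapping.get(tag)
        if target is not None and target not in mapped:
            mapped.append(target)
    return mapped


# Toy documents with their (hypothetical) EuroVoc tags.
docs = {
    "doc-1": ["environmental policy", "climate change", "fisheries"],
    "doc-2": ["EU Member State"],
    "doc-3": ["climate change", "EU Member State", "environmental policy"],
}

mapped_docs = {doc_id: map_tags(tags) for doc_id, tags in docs.items()}

# Tags that appear at least once after mapping ("active" tags).
active_tags = {tag for tags in mapped_docs.values() for tag in tags}

# Median number of mapped tags per document.
median_tags = median(len(tags) for tags in mapped_docs.values())

print(len(active_tags), median_tags)
```

In this toy example the unmapped tag "fisheries" is simply discarded; whether unmapped EuroVoc tags were dropped or handled differently in the actual process is an assumption of the sketch.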
Phase 3: Exploratory Data Analysis and Frequency Distribution
To further refine this pool, a detailed analysis of the 1,406 active tags was performed. As shown in Graph 1 below, the tag "EU Member State" is a clear outlier with over 2,500 occurrences.
This tag is so general that including it in a filtering model is unnecessary and would drown out more specific and informative tags. For the remaining tags, frequency decreases gradually with no clear cutoff, which made it difficult to separate "high-frequency" from "low-frequency" tags and called for a somewhat arbitrary selection strategy. We therefore had to decide which of the 1,406 active tags to use to refine the extraction.
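A frequency analysis of this kind can be sketched as follows. The tag names and documents are toy data, and counting each tag at most once per document is an assumption of the sketch rather than a documented property of the original analysis.

```python
from collections import Counter


def tag_frequencies(doc_tags):
    """Count in how many documents each tag occurs (once per document)."""
    counts = Counter()
    for tags in doc_tags.values():
        counts.update(set(tags))
    return counts


# Toy data: document id -> tags (hypothetical values).
doc_tags = {
    "doc-1": ["EU Member State", "Climate", "Energy"],
    "doc-2": ["EU Member State", "Climate"],
    "doc-3": ["EU Member State", "Trade"],
    "doc-4": ["EU Member State", "Energy", "Climate"],
}

freqs = tag_frequencies(doc_tags)

# Tags ranked from most to least frequent; the head of this list is
# where an outlier such as "EU Member State" would show up.
ranked = freqs.most_common()
for tag, count in ranked:
    print(f"{count:>4}  {tag}")
```

Plotting the ranked counts gives the kind of long-tailed curve described above, in which no obvious threshold separates frequent from infrequent tags.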
Phase 4: Scenario Modelling and Selection
Several extraction scenarios were evaluated, varying both the number of tags considered (for example, the 50 vs 150 most frequent DET tags) and the minimum number of such tags that a document must carry. Our goal was to find a trade-off between tag diversity and the volume of unique documents selected. The chosen scenario requires each document in the golden sample to contain at least three DET tags drawn from the 150 most frequent tags in the general extraction, excluding the “EU Member State” outlier. This scenario scales the original extraction down to 4,341 documents while still offering good coverage of the tag space.
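The selection rule described in this phase can be sketched as a small function. The function name, parameter names and toy data are illustrative assumptions; the defaults mirror the chosen scenario (150 most frequent tags, at least three per document, outlier excluded), while the toy run uses smaller values so that the effect is visible.

```python
from collections import Counter

OUTLIER = "EU Member State"


def select_documents(doc_tags, top_n=150, min_tags=3, excluded=(OUTLIER,)):
    """Keep documents carrying at least `min_tags` of the `top_n` most frequent tags."""
    excluded = set(excluded)
    counts = Counter()
    for tags in doc_tags.values():
        counts.update(set(tags) - excluded)
    top_tags = {tag for tag, _ in counts.most_common(top_n)}
    selected = {
        doc_id
        for doc_id, tags in doc_tags.items()
        if len(set(tags) & top_tags) >= min_tags
    }
    return selected, top_tags


# Toy data: document id -> tags (hypothetical values).
toy = {
    "a": ["EU Member State", "Climate", "Energy", "Trade"],
    "b": ["EU Member State", "Climate", "Energy"],
    "c": ["EU Member State", "Climate"],
    "d": ["Climate", "Energy", "Fisheries"],
}

# Compare two toy scenarios: same tag pool, different minimum tag count.
for top_n, min_tags in [(2, 2), (2, 1)]:
    selected, top_tags = select_documents(toy, top_n=top_n, min_tags=min_tags)
    print(top_n, min_tags, len(selected), sorted(top_tags))
```

Running the function over a grid of `top_n` and `min_tags` values and comparing the number of selected documents with the number of distinct tags they cover is one way to make the trade-off between volume and tag diversity explicit.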
The graph below shows the distribution of DET tags in the selected documents.
This selection strategy is designed to avoid overrepresenting a small set of highly frequent tags while still maintaining sufficient volume for robust benchmarking. It ensures that a broad portion of the active DET vocabulary is represented without diluting the corpus with extremely rare tags that would provide limited evaluative value.
Phase 5: Resource type annotations
Within this set, 835 documents additionally have a resource type annotation, enabling more fine‑grained analysis of tagging performance by document genre. The remaining documents lack a resource type label and thus contribute to the diversity of structural and editorial profiles within the corpus.
Among the 835 resource-typed documents, “annual activity report” is the most common type with 383 instances, followed by “project report” (114), “case study” (87) and “proceedings” (86), with smaller counts for types such as “database”, “evaluation study”, “educational resource”, “speech”, and several others. This mix reflects typical EU institutional outputs and provides an opportunity to assess how tagging solutions behave on both narrative documents (e.g. reports, speeches) and more structured or technical content (e.g. databases, management plans, e-learning resources).
Recap statistics and link to the corpus
Size: 4,341 documents
Format: PDF/A-1b or PDF/A-2a
Language: English
Tags: at least 3 of the 150 most frequent DET tags per document (excluding “EU Member State”)
Resource types (835 annotated): annual activity report (383), project report (114), case study (87), proceedings (86), plus smaller counts for other types; 3,506 documents without a resource type label
Links to the resources: