Publications Office of the EU
Corpus for training - EU Vocabularies
Corpus for training auto-tagging solutions

Like the corpus for evaluating auto-tagging solutions, the corpus for training auto-tagging solutions is a curated set of documents extracted from the CELLAR repository and associated with EuroVoc tags. For training purposes, however, this corpus was developed to provide a large and robust dataset with broad coverage of domains and tags, so that it captures their variance.

Methodology 

The methodology is close to the one used for building the corpus for evaluating auto-tagging solutions. The difference is one of focus: for the evaluation corpus we concentrated on domain representativeness and on mapping EuroVoc tags to DET tags, whereas here the aim was to build a large corpus based on EuroVoc tags only. DET tags are not used.

Phase 1: General Data Extraction from CELLAR 

The foundational step was a bulk extraction of document links, their corresponding metadata, and in particular their EuroVoc tags, from the CELLAR repository. This initial trawl was designed to be broad, capturing a wide array of documents to form the raw material for the corpus. We restricted the extraction to English-language documents in PDF/A-1b or PDF/A-2a format and collected 13,984 document links.

Phase 2: Exploratory Data Analysis and Frequency Distribution  

To further refine this pool, a detailed analysis of the 3,965 EuroVoc tags was performed. As shown in Graph 3 below, and in the same way as with the corpus for evaluation, the tag "EU Member State" is an important outlier with over 2,500 occurrences. 

These findings are similar to those for the DET tags in the evaluation corpus, so we decided to exclude “EU Member State” when refining the selection of EuroVoc tags.
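The frequency analysis described in this phase can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the source does not describe the tooling actually used, and the data structure (a mapping from document link to tag labels), the function names and the toy documents are all hypothetical.

```python
from collections import Counter

# Hypothetical toy data: document link -> EuroVoc tag labels.
# The real corpus has 13,984 links and 3,965 distinct tags.
docs = {
    "doc-a": ["EU Member State", "fisheries policy"],
    "doc-b": ["EU Member State", "import"],
    "doc-c": ["import", "fisheries policy", "EU Member State"],
    "doc-d": ["import"],
}


def tag_frequencies(doc_tags):
    """Count in how many documents each tag occurs."""
    counts = Counter()
    for tags in doc_tags.values():
        counts.update(set(tags))  # set() so a repeated tag counts once per document
    return counts


def ranked_tags(doc_tags, excluded=frozenset()):
    """Return (tag, count) pairs by descending count, ties broken alphabetically."""
    counts = tag_frequencies(doc_tags)
    kept = [(tag, n) for tag, n in counts.items() if tag not in excluded]
    return sorted(kept, key=lambda item: (-item[1], item[0]))


freqs = tag_frequencies(docs)
# Ranking with the outlier removed, as decided in this phase.
ranking = ranked_tags(docs, excluded={"EU Member State"})
```

Inspecting `freqs.most_common()` on the full data is one way to spot an outlier such as “EU Member State” before deciding to exclude it from the ranking.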

Phase 3: Scenario Modelling and Selection 

Several extraction scenarios were evaluated, varying both the number of most frequent tags considered and the minimum number of those tags that a document must carry. Our goal was to find a trade-off between tag diversity and the volume of unique documents selected. Three scenarios were retained, each excluding the “EU Member State” outlier.

Scenario 1: corpus of 10,272 document links tagged with at least 1 of the 50 most frequent EuroVoc tags.

Scenario 2: corpus of 8,678 document links tagged with at least 2 of the 100 most frequent EuroVoc tags.

Scenario 3: corpus of 10,071 document links tagged with at least 2 of the 150 most frequent EuroVoc tags.
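The selection rule shared by the three scenarios (keep a document if it carries at least a minimum number of the N most frequent tags, with the outlier excluded) can be sketched as follows. This is a minimal illustration, not the procedure actually used: the function `select_corpus`, its parameters and the toy data are hypothetical, and ties in tag frequency are broken alphabetically, a choice the source does not specify.

```python
from collections import Counter


def select_corpus(doc_tags, top_n, min_tags, excluded=frozenset()):
    """Return the sorted document links carrying at least `min_tags`
    of the `top_n` most frequent tags, ignoring `excluded` tags."""
    counts = Counter()
    for tags in doc_tags.values():
        counts.update(set(tags) - set(excluded))
    # Rank by descending document count; alphabetical tie-break is an assumption.
    ranked = sorted(counts, key=lambda tag: (-counts[tag], tag))
    top = set(ranked[:top_n])
    return sorted(
        doc for doc, tags in doc_tags.items()
        if len(top & set(tags)) >= min_tags
    )


# Hypothetical toy data: document link -> EuroVoc tag labels.
docs = {
    "doc-a": ["EU Member State", "import", "fisheries policy"],
    "doc-b": ["EU Member State", "import"],
    "doc-c": ["import", "fisheries policy", "aid to agriculture"],
    "doc-d": ["aid to agriculture"],
    "doc-e": ["EU Member State"],
}
outlier = {"EU Member State"}

# Scaled-down analogues of the three scenarios (top 1/2/3 instead of 50/100/150).
scenario_1 = select_corpus(docs, top_n=1, min_tags=1, excluded=outlier)
scenario_2 = select_corpus(docs, top_n=2, min_tags=2, excluded=outlier)
scenario_3 = select_corpus(docs, top_n=3, min_tags=2, excluded=outlier)
```

On the real data the calls would use `top_n=50, min_tags=1`, `top_n=100, min_tags=2` and `top_n=150, min_tags=2`. The toy example shows the same pattern as the figures above: raising the minimum from 1 to 2 tags shrinks the corpus, and widening the tag pool recovers part of the volume.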

Scenario 1 maximizes volume but risks bias towards very frequent tags. Scenario 2 improves document richness with moderate diversity. Scenario 3 offers the best balance between corpus size and tag coverage but demands more care in monitoring distribution and representativeness. 

All three scenarios offer a large collection of documents, and the choice between them depends on the use case.

Scenario 1 is well suited for developing broad, high-level auto-tagging models focused on common topics or systems designed to operate efficiently on vast corpora with limited tag granularity. 

Scenario 2 is appropriate for building models aiming to capture multidimensional content tagging and intermediate diversity of topics. This corpus suits applications requiring moderately granular tagging such as domain-specific content categorization, improved metadata quality, and enhanced document discoverability through more nuanced tag combinations. 

Scenario 3 is particularly useful for systems targeting detailed semantic distinctions, cross-domain adaptability, and robust performance with medium-frequency and niche tags.

Recap statistics and link to the corpus 

Size: Between 8,678 and 10,272 documents depending on the scenario 

Format: PDF/A-1b or PDF/A-2a 

Language: English 

Tags:

  • At least 1 out of the 50 most frequent EuroVoc tags for scenario 1.
  • At least 2 out of the 100 most frequent EuroVoc tags for scenario 2.
  • At least 2 out of the 150 most frequent EuroVoc tags for scenario 3.

Links to the resources: