ICPR 2020 Competition on Text Block Segmentation on a NewsEye Dataset

We propose a standard data science competition on Text Block Segmentation for analyzing the structure of historical newspaper pages. In contrast to many existing segmentation methods, which operate on pixels, the task is to cluster baselines or text lines into text blocks or paragraphs. Therefore, we introduce a new measure based on a Baseline Detection evaluation scheme. Common pixel-based approaches can nevertheless participate without restrictions. Working at the baseline level directly addresses the application scenario in which the text contained in a given image is to be extracted in blocks for further investigation.

Keywords: Document Image Analysis, Historical Documents, Layout Analysis, Text Block Segmentation, Baseline Detection

Competition type: Standard data science competition

Background and impact

This competition is to be carried out as part of NewsEye - A Digital Investigator for Historical Newspapers, a project within the European Union’s Horizon 2020 research and innovation programme.

The purpose of the NewsEye project is to enable historians and humanities scholars to investigate the large collections of historical newspapers provided by libraries. To keep this work efficient, the data processing steps should be as automatic as possible.

To this end, scanned newspaper pages have to be digitized automatically. This involves, in particular, detecting the baselines present in the image, recognizing the corresponding text and finally, as the goal of this competition, merging the individual lines into text blocks. Afterwards, these blocks can be analyzed by humanities scholars or used for other Document Analysis tasks such as Named Entity Recognition or Topic Modeling.
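
To make the input and output of the task concrete, the following is a minimal sketch of what "merging lines into text blocks" means at the baseline level. It is not the competition's method or evaluation scheme. The polyline representation, the function names and the two thresholds (`max_gap`, `min_overlap`) are assumptions chosen purely for illustration: baselines are grouped greedily when they lie close together vertically and overlap horizontally.

```python
from typing import List, Tuple

Point = Tuple[float, float]
Baseline = List[Point]  # polyline in image coordinates (y grows downward)


def x_range(baseline: Baseline) -> Tuple[float, float]:
    xs = [x for x, _ in baseline]
    return min(xs), max(xs)


def mean_y(baseline: Baseline) -> float:
    return sum(y for _, y in baseline) / len(baseline)


def horizontal_overlap(a: Baseline, b: Baseline) -> float:
    """Fraction of the shorter baseline's x-extent shared with the other one."""
    a_min, a_max = x_range(a)
    b_min, b_max = x_range(b)
    shared = min(a_max, b_max) - max(a_min, b_min)
    shorter = min(a_max - a_min, b_max - b_min)
    return max(shared, 0.0) / shorter if shorter > 0 else 0.0


def group_baselines(
    baselines: List[Baseline],
    max_gap: float = 40.0,      # assumed maximum line spacing in pixels
    min_overlap: float = 0.5,   # assumed minimum horizontal overlap
) -> List[List[int]]:
    """Greedy grouping; returns one list of baseline indices per text block."""
    # Process baselines from top to bottom of the page.
    order = sorted(range(len(baselines)), key=lambda i: mean_y(baselines[i]))
    blocks: List[List[int]] = []
    for i in order:
        for block in blocks:
            last = baselines[block[-1]]
            gap = mean_y(baselines[i]) - mean_y(last)
            if gap <= max_gap and horizontal_overlap(baselines[i], last) >= min_overlap:
                block.append(i)
                break
        else:
            # No existing block fits, so this baseline starts a new one.
            blocks.append([i])
    return blocks


# Hypothetical page: two columns with two lines each, plus one distant line.
page = [
    [(0, 10), (100, 10)],
    [(0, 30), (100, 30)],
    [(150, 10), (250, 10)],
    [(150, 30), (250, 30)],
    [(0, 200), (100, 200)],
]
print(group_baselines(page))  # [[0, 1], [2, 3], [4]]
```

The result is a partition of the baselines, which is the kind of output a baseline-level segmentation produces. A pixel-based method would instead output regions, to which the baselines would still have to be assigned.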

A similar competition has been organized before: the ICDAR Competition on Recognition of Documents with Complex Layouts. The main difference is that we work at the baseline level instead of the pixel level, which also creates the need for a new evaluation measure.