Competition details

Data

The data is taken from the NewsEye project and consists of historical newspaper pages (partially binarized) ranging from the 19th to 20th century provided by the Austrian National Library, i.e., especially newspapers in German language. The newspapers made available for this competition comprises the titles "Arbeiter Zeitung", "Illustrierte Kronen Zeitung", "Innsbrucker Nachrichten" and "Neue Freie Presse".

The training data contains a set of scanned pages. Furthermore, for every image we provide the coordinates of the baselines, the corresponding text of the lines and the text regions marking the text blocks in the well-established PAGE XML format. Additionally, baselines lying within the same block have a unique ID in the so-called "custom tag" (see Figures 1.a and 2.a for marked baselines and Figures 1.b and 2.b for marked text blocks).

(1.a) Upper part of a simple newspaper page with marked baselines (baselines belonging to the same block have the same color)
(1.b) Upper part of a simple newspaper page with marked text blocks
(2.a) Upper part of a complex newspaper page with marked baselines (baselines belonging to the same block have the same color)
(2.b) Upper part of a complex newspaper page with marked text blocks (red blocks indicate images, blue blocks indicate advertisements, violet blocks indicate tables)

Please note that a text block caputers a whole paragraph and the block outlines enclose the text very closely. Headlines are separately marked and blocks are not across columns. Furthermore, images can be ignored since they (usually) do not contain baselines and occurring tables and framed advertisements are handled as single text blocks (see Figure 2.b).

The following represents a snippet of a PAGE XML file where the baseline with ID "tl_223" forms a block together with all other lines with the block ID "a7"

    <TextLine id="tl_223" primaryLanguage="German" custom="readingOrder {index:5;} structure {id:a7; type:article;}">.

The type description "article" in the custom tag is a result of the NewsEye project. In connection with this competition an article means simply a text block. For the corresponding image and Page XML files for Figure 1.a/1.b respectively 2.a/2.b see the zip files below.

For each sample in the test data there is an image of the scanned newspaper page with its corresponding PAGE XML file containing the baselines (without any block ID's), the text and only a single text region surrounding the whole page. The single region should be ignored but is necessary because the PAGE XML format requires that every line is assigned to a region. The Ground Truth data, in which the baselines have again the block ID's, is, of course, confidential. Ground Truth means, in our context, the ideal of a system's output generated by humans.

Tasks

The task for the participants on the test data will be to assign baselines belonging to the same detected text block with the same block ID in the custom tag. All given lines must be assigned and it is possible that a single line forms a block, i.e., it has its own ID. However, the reading order of the lines within a block and the reading order of the blocks among themselves is irrelevant.

When doing page or image segmentation in general this is usually done on pixel level, but in the context of text extraction we are rather interested in text lines. This motivates the introduction of a new measure that evaluates the quality of a Text Block Segmentation system at baseline level. Hence, for the evaluation measure only the block ID's of the lines are crucial.

Please note that pixel-based approaches can also participate without restrictions, since the training data contains Ground Truth regions for corresponding methods. Afterwards, lines lying within the same detected region can simply be assigned with the same block ID in the custom tag. For entering such tags in PAGE XML files we refer to tools available in our GitHub repository (the organizers can of course be contacted for further help or support) and also to a small usage example.   

The competition is split into two tasks where each participant is free to choose one of them or both. A simple track with newspaper pages only with continuous text (40 pages training data, 10 pages test data, see Figures 1.a and 1.b) and a complex track with pages including additional tables, images or advertisements (40 pages training data, 10 pages test data, see Figures 2.a and 2.b).