This paper describes the first large-scale article detection and extraction efforts on the Finnish Digi newspaper material of the National Library of Finland (NLF), using data from one newspaper, Uusi Suometar 1869-1898. The historical digital newspaper archive environment of the NLF is based on the commercial docWorks software.

We present a complete method for article segmentation in old newspapers that deals with complex layout analysis of degraded documents. The analysis of the document image is performed in a two-stage scheme. In the first stage, pixels are labeled with a Conditional Random Field model in order to tag the areas of interest at a low logical level. This first logical representation of the document content is then analyzed in a second stage to obtain a higher-level logical representation, including article segmentation and reading order. This top-level structural analysis relies on an article separation grid applied recursively to the document image, allowing any type of Manhattan page layout to be analyzed, even complex structures with multiple columns and overlapping entities. The method, which benefits from both a local analysis using a probabilistic model trained with machine-learning procedures and a more global structural analysis using recursive rules, is evaluated on a dataset of daily local press document images covering several time periods and different page layouts, proving its effectiveness. The designed workflow can process large amounts of documents and generates digital objects in METS/ALTO format, facilitating the indexing and browsing of information in digital libraries.
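To give a feel for the first stage, here is a minimal sketch of pixel labeling with unary costs plus a Potts smoothness term. The paper trains a proper Conditional Random Field; this sketch instead uses Iterated Conditional Modes (ICM), a simple approximate decoder, purely to illustrate how pairwise terms pull noisy per-pixel decisions toward coherent regions. All names and parameters here are illustrative, not the authors' implementation.

```python
import numpy as np

def icm_label(unary, beta=1.0, iters=5):
    """Label each pixel by minimizing its unary cost plus a Potts
    penalty for disagreeing with its 4-neighbours (ICM decoding).

    unary: array of shape (H, W, K) giving the cost of each of the
    K labels at each pixel (e.g. text / separator / background).
    """
    H, W, K = unary.shape
    labels = unary.argmin(axis=2)  # start from the unary-only labelling
    for _ in range(iters):
        for y in range(H):
            for x in range(W):
                costs = unary[y, x].copy()
                for dy, dx in ((-1, 0), (1, 0), (0, -1), (0, 1)):
                    ny, nx = y + dy, x + dx
                    if 0 <= ny < H and 0 <= nx < W:
                        # Potts term: pay beta per disagreeing neighbour
                        costs += beta * (np.arange(K) != labels[ny, nx])
                labels[y, x] = costs.argmin()
    return labels
```

With a strong enough pairwise weight, an isolated pixel whose unary term slightly prefers the "wrong" label is flipped to agree with its neighbours, which is exactly the kind of low-level cleanup that makes the later structural stage more reliable.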
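The recursive structural stage can be approximated in spirit by a recursive XY-cut, which alternately splits a page region along horizontal and vertical whitespace runs and recurses into the resulting bands; this handles exactly the Manhattan layouts the abstract mentions. The sketch below is a generic XY-cut under assumed parameters, not the paper's article separation grid.

```python
import numpy as np

def _split_points(profile, min_gap):
    """Cut positions at the centres of whitespace runs >= min_gap."""
    cuts, run_start = [], None
    for i, v in enumerate(profile):
        if v == 0:
            if run_start is None:
                run_start = i
        else:
            if run_start is not None and i - run_start >= min_gap:
                cuts.append((run_start + i) // 2)
            run_start = None
    return cuts

def xy_cut(mask, x0=0, y0=0, min_gap=5):
    """Recursively split a binary page image (1 = ink) into blocks.

    Returns a list of (x, y, w, h) boxes in page coordinates.
    """
    rows = np.flatnonzero(mask.sum(axis=1))
    cols = np.flatnonzero(mask.sum(axis=0))
    if rows.size == 0:
        return []  # empty region: nothing to segment
    # Trim surrounding whitespace before looking for internal gaps.
    mask = mask[rows[0]:rows[-1] + 1, cols[0]:cols[-1] + 1]
    y0 += rows[0]
    x0 += cols[0]
    h, w = mask.shape
    # Try a horizontal cut (row profile) first, then a vertical one.
    for axis in (1, 0):
        profile = mask.sum(axis=axis)
        cuts = _split_points(profile, min_gap)
        if cuts:
            boxes, prev = [], 0
            for c in cuts + [len(profile)]:
                if axis == 1:   # horizontal cut: bands of rows
                    boxes += xy_cut(mask[prev:c, :], x0, y0 + prev, min_gap)
                else:           # vertical cut: bands of columns
                    boxes += xy_cut(mask[:, prev:c], x0 + prev, y0, min_gap)
                prev = c
            return boxes
    return [(x0, y0, w, h)]  # no cut possible: emit the block
```

Overlapping entities and headlines spanning several columns are what break plain XY-cut on real newspaper pages, which is why the paper combines the recursive rules with the learned pixel-level labeling rather than relying on whitespace alone.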
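For the output side, a stripped-down sketch of serializing one detected block into an ALTO-style fragment (the ALTO hierarchy is Layout > Page > PrintSpace > TextBlock > TextLine > String). A real METS/ALTO export carries far more metadata (styles, measurement units, METS structural maps); the helper name and word-tuple format here are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

ALTO_NS = "http://www.loc.gov/standards/alto/ns-v2#"

def block_to_alto(page_id, block_id, words):
    """Serialize one text block as a minimal ALTO fragment.

    words: list of (content, hpos, vpos, width, height) tuples,
    all placed on a single TextLine for simplicity.
    """
    ET.register_namespace("", ALTO_NS)
    alto = ET.Element(f"{{{ALTO_NS}}}alto")
    layout = ET.SubElement(alto, f"{{{ALTO_NS}}}Layout")
    page = ET.SubElement(layout, f"{{{ALTO_NS}}}Page", ID=page_id)
    space = ET.SubElement(page, f"{{{ALTO_NS}}}PrintSpace")
    block = ET.SubElement(space, f"{{{ALTO_NS}}}TextBlock", ID=block_id)
    line = ET.SubElement(block, f"{{{ALTO_NS}}}TextLine")
    for content, hpos, vpos, w, h in words:
        ET.SubElement(line, f"{{{ALTO_NS}}}String",
                      CONTENT=content, HPOS=str(hpos), VPOS=str(vpos),
                      WIDTH=str(w), HEIGHT=str(h))
    return ET.tostring(alto, encoding="unicode")
```

Keeping word-level coordinates in the ALTO output is what lets a digital library highlight search hits directly on the scanned page image.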