09 Improving OCR of old material

PhDr. Jiří Polišenský, Tomáš Foltýn

Every mass digitization project, whether already implemented or still in preparation, aims to make national cultural heritage available online in a very short time. Digitizing contemporary documents is a standard process that uses existing hardware and software. Many problems, especially for end users, stem from poor OCR results, which directly affect full-text searching and presentation.


OCR results can be affected by many negative factors: the degree of degradation of the original paper material, errors in the settings of scanning devices, and the fonts and characters used by printers in the past (especially Gothic characters), which are very hard to recognize. How to solve this last problem is a major open issue. Current solutions focus on developing software tools for adaptive knowledge databases of old language versions, on accurate segmentation of the text, and on new approaches to text recognition, for example those connected with vector treatment.


The organizers of this workshop decided to invite experts from research organizations and private companies to present their tools. The National Library of the Czech Republic, together with the company Elsyst Engineering, will present the application CODEG 4, which could help in establishing old-language databases. The National Library of the Czech Republic will also present Aletheia, a tool for segmentation and manual text correction that was developed within the IMPACT project. The developers of the most common OCR solution, ABBYY FineReader, are also invited to present the new features of its tenth version, which were funded by the same project. New methods for page splitting and border removal will be described by staff of the National Center for Scientific Research "Demokritos" in Greece. These tools were also developed in cooperation with the IMPACT project. Finally, researchers from the University of Salford may present new ideas about the recognition of typewritten texts.


After each presentation there will be time for questions from the audience and for subsequent discussion.


The organizers hope that the workshop will foster a fruitful dialogue among all interested experts and help them share information about OCR of old material.

Editor: Milan Janíček
Last modified: 5.4. 2011 07:04  
Contact: +420 232 002 515, milan.janicek@techlib.cz