User Collaboration for Improving Access to Historical Texts

Clemens Neudecker, Asaf Tzadok

Abstract


The paper will describe how web-based collaboration tools can engage users in the building of historical printed text resources created by mass digitisation projects. The drivers for developing such tools will be presented, identifying the benefits that can be derived for both the user community and cultural heritage institutions. The perceived risks, such as new errors introduced by the users, and the limitations of engaging with users in this way will be set out with the lessons that can be learned from existing activities, such as the National Library of Australia's newspaper website which supports collaborative correction of Optical Character Recognition (OCR) output.
The paper will present the work of the IMPACT (Improving Access to Text) project, a large-scale integrating project funded by the European Commission as part of the Seventh Framework Programme (FP7). One of the aims of the project is to develop tools that help improve OCR results for historical printed texts, specifically those works published before the industrial production of books from the middle of the 19th century.
Technological improvements to image processing and OCR engine technology are vital to improving access to historic text, but engaging the user community also has an important role to play. Utilising the intended user can help achieve the levels of accuracy currently found in born-digital materials. Improving OCR results will allow for better resource discovery and enhance performance by text mining and accessibility tools. The IMPACT project will specifically develop a tool that supports collaborative correction and validation of OCR results and a tool to allow user involvement in building historical dictionaries which can be used to validate word recognition. The technologies use the characteristics of human perception as a basis for error detection.

Keywords


digitisation; OCR; user collaboration; correction; crowd sourcing; IMPACT

Full Text:

html pdf




Creative Commons License
This work is licensed under a Creative Commons Attribution 3.0 License.