A comparison of four character-level string-to-string translation models for (OCR) spelling error correction

We treat isolated spelling error correction as a specific subproblem of the more general string-to-string translation problem. In this context, we investigate four general string-to-string transformation models that have been suggested in recent years and apply them within the spelling error correction paradigm. In particular, we investigate how a simple ‘k-best decoding plus dictionary lookup’ strategy performs and find that such an approach can significantly outperform baselines such as edit distance, weighted edit distance, and the noisy channel Brill and Moore model for spelling error correction. We also consider elementary combination techniques for our models, such as language model weighted majority voting and center string combination. Finally, we consider real-world OCR post-correction for a dataset sampled from medieval Latin texts.
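To make the ‘k-best decoding plus dictionary lookup’ idea and the center string combination more concrete, here is a minimal Python sketch. It is not the authors' implementation. The function names, the fallback to the top-ranked candidate when no candidate is in the dictionary, and the reading of "center string" as the output with the smallest summed edit distance to the other outputs are all assumptions made for illustration.

```python
from typing import Iterable, Sequence


def kbest_dictionary_lookup(candidates: Sequence[str], dictionary: Iterable[str]) -> str:
    """Return the highest-ranked candidate that is a dictionary word.

    `candidates` is assumed to be a model's k-best list, ordered best first.
    If no candidate is in the dictionary, fall back to the top-ranked
    candidate (an assumption for this sketch, not taken from the paper).
    """
    if not candidates:
        raise ValueError("candidates must not be empty")
    vocab = set(dictionary)
    for cand in candidates:
        if cand in vocab:
            return cand
    return candidates[0]


def edit_distance(a: str, b: str) -> int:
    """Unweighted Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def center_string(outputs: Sequence[str]) -> str:
    """Pick the output with the smallest total edit distance to all outputs.

    Assumed interpretation of 'center string combination': each string in
    `outputs` is the suggestion of one model for the same input word.
    """
    return min(outputs, key=lambda s: sum(edit_distance(s, t) for t in outputs))
```

For example, `kbest_dictionary_lookup(["recieve", "receive", "relieve"], {"receive", "relieve"})` skips the top-ranked non-word and returns the first candidate found in the dictionary, while `center_string` selects the suggestion that lies closest, in edit distance, to the suggestions of the other models.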

Metadata
Author:Steffen Eger, Tim vor der Brück, Alexander Mehler
URN:urn:nbn:de:hebis:30:3-438408
DOI:https://doi.org/10.1515/pralin-2016-0004
ISSN:1804-0462
ISSN:0032-6585
Parent Title (English):The Prague bulletin of mathematical linguistics
Publisher:Universita Karlova
Place of publication:Praha
Document Type:Article
Language:English
Date of Publication (online):2017/08/31
Year of first Publication:2016
Publishing Institution:Universitätsbibliothek Johann Christian Senckenberg
Release Date:2017/08/31
Volume:105
Issue:1
Page Number:23
First Page:77
Last Page:99
Note:
© 2016 Steffen Eger et al., published by De Gruyter Open. This work is licensed under the Creative Commons Attribution-NonCommercial-NoDerivatives 3.0 License. BY-NC-ND 3.0
HeBIS-PPN:432156925
Institutes:Computer Science and Mathematics / Computer Science
Dewey Decimal Classification:0 Computer science, information science, general works / 00 Computer science, knowledge, systems / 000 Computer science, information science, general works
Collections:University publications
Licence (German):Creative Commons - Attribution-NonCommercial-NoDerivatives 3.0