Automatic normalization for linguistic annotation of historical language data

This paper deals with spelling normalization of historical texts with regard to further processing with modern part-of-speech taggers. Different methods for this task are presented and evaluated on a set of historical German texts from the 15th–18th centuries, and specific problems inherent to the processing of historical data are discussed. A chain combination of word-based and character-based techniques is shown to perform best for normalization, while POS tagging of normalized data is shown to benefit from ignoring punctuation marks. With these techniques, and with 500 manually normalized tokens as training data for the normalization, the tagging accuracy on a 15th-century manuscript can be raised from 28.65% to 76.27%.

Metadata
Author: Marcel Bollmann
URN: urn:nbn:de:hebis:30:3-310764
URL: http://www.linguistics.ruhr-uni-bochum.de/bla/
ISSN: 2190-0949
Parent Title (German): Bochumer linguistische Arbeitsberichte ; 13
Series (Serial Number): Bochumer linguistische Arbeitsberichte : BLA (13)
Publisher: Ruhr-Universität Bochum, Sprachwiss. Inst.
Place of publication: Bochum
Document Type: Working Paper
Language: English
Year of Completion: 2013
Year of first Publication: 2013
Publishing Institution: Universitätsbibliothek Johann Christian Senckenberg
Release Date: 2013/11/14
SWD-Keyword: Annotation; Korpus <Linguistik>
Number of pages: 86
HeBIS PPN: 359874436
Dewey Decimal Classification: 410 Linguistics
Collections: Linguistics
Linguistic-Classification: Corpus linguistics
Licence (German): Veröffentlichungsvertrag für Publikationen (publication agreement)
