Memory-based vocalization of Arabic

  • Vocalization, or diacritization, is essential to many tasks in Arabic NLP. Arabic is generally written without short vowels, so a single written form can have several pronunciations, each carrying its own meaning(s). In the experiments reported here, we define vocalization as a classification problem in which we decide for each character in the unvocalized word whether it is followed by a short vowel. We investigate the importance of different types of context. Our results show that memory-based learning combined with only a word-internal context leads to a word error rate of 6.64%. If a lexical context is added, the results deteriorate slowly.
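The per-character classification described in the abstract can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and not the authors' setup: the window size, the padding symbol, the label scheme (the short vowel that follows a character, or an empty string for none), the Latin transliteration of the toy words, and the plain k-nearest-neighbour lookup with an overlap metric that stands in for memory-based learning.

```python
from collections import Counter

PAD = "_"  # hypothetical padding symbol for positions outside the word


def make_instances(unvocalized, labels=None, window=2):
    """Build one instance per character from a word-internal context window.

    Features are the focus character plus `window` characters on each side.
    The label is the short vowel following the character ("" for none).
    """
    padded = PAD * window + unvocalized + PAD * window
    instances = []
    for i in range(len(unvocalized)):
        features = tuple(padded[i:i + 2 * window + 1])
        label = labels[i] if labels is not None else None
        instances.append((features, label))
    return instances


def overlap(a, b):
    """Number of feature positions on which two instances agree."""
    return sum(x == y for x, y in zip(a, b))


class MemoryBasedClassifier:
    """Stores all training instances and classifies by nearest neighbours."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []

    def fit(self, instances):
        self.memory.extend(instances)

    def predict(self, features):
        # Rank stored instances by overlap with the query, most similar first.
        ranked = sorted(self.memory,
                        key=lambda inst: overlap(features, inst[0]),
                        reverse=True)
        votes = Counter(label for _, label in ranked[:self.k])
        return votes.most_common(1)[0][0]


def vocalize(word, clf, window=2):
    """Insert the predicted short vowel (if any) after each character."""
    out = []
    for ch, (features, _) in zip(word, make_instances(word, window=window)):
        out.append(ch + clf.predict(features))
    return "".join(out)


def word_error_rate(predicted, gold):
    """Fraction of words whose predicted vocalization differs from the gold form."""
    errors = sum(p != g for p, g in zip(predicted, gold))
    return errors / len(gold)


# Toy training data (illustrative only): unvocalized word and per-character labels.
train = [
    ("ktb", ["a", "a", "a"]),  # kataba
    ("drs", ["a", "a", "a"]),  # darasa
    ("qlm", ["a", "a", ""]),   # qalam
]
clf = MemoryBasedClassifier(k=1)
for word, labels in train:
    clf.fit(make_instances(word, labels))

print(vocalize("qlm", clf))
```

A lexical context, as mentioned in the abstract, could be modelled in this sketch by appending features drawn from neighbouring words to each instance; how the authors encoded it is not stated in this record.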
Author:Sandra Kübler, Emad Mohamed
Editor:Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias
Document Type:Preprint
Year of Completion:2008
Year of first Publication:2008
Publishing Institution:Universitätsbibliothek Johann Christian Senckenberg
Release Date:2008/10/21
GND Keyword:Arabic (Arabisch)
Page Number:4
Published in: Nicoletta Calzolari ; Khalid Choukri ; Bente Maegaard ; Joseph Mariani ; Jan Odijk ; Stelios Piperidis ; Daniel Tapias (eds.): Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC-2008), May 28-30, 2008, Marrakech, Morocco. Paris : ELRA, 2008, pp. 2322-2329, ISBN: 2-9517408-4-0
Source: Proceedings of the LREC Workshop on HLT and NLP within the Arabic World : Marrakesh, Morocco, May 2008
Institutes:No department specified / External
Dewey Decimal Classification:4 Language / 40 Language / 400 Language
Linguistics Classification:Computational linguistics
Licence (German):German copyright law (Deutsches Urheberrecht)