Memory-based vocalization of Arabic
- The problem of vocalization, or diacritization, is essential to many tasks in Arabic NLP. Arabic is generally written without short vowels, so a single written form can have several pronunciations, each carrying its own meaning(s). In the experiments reported here, we define vocalization as a classification problem in which we decide for each character in the unvocalized word whether it is followed by a short vowel. We investigate the importance of different types of context. Our results show that memory-based learning combined with only a word-internal context leads to a word error rate of 6.64%. If a lexical context is added, the results deteriorate slowly.
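
The per-character classification setup described in the abstract can be illustrated with a minimal sketch. Everything here is an assumption made for illustration, not the authors' implementation: the class and function names (`MemoryBasedClassifier`, `char_instances`, `vocalize`), the window size, the plain overlap metric, the use of the vowel itself as the class label, and the three toy training words in Buckwalter-style transliteration.

```python
from collections import Counter


def char_instances(word, labels, window=2):
    """Yield (features, label) per character, using word-internal context only.

    Features are the focus character plus `window` characters on each side;
    positions outside the word are padded with "_".
    """
    padded = "_" * window + word + "_" * window
    for i, label in enumerate(labels):
        yield tuple(padded[i:i + 2 * window + 1]), label


def overlap(a, b):
    """Count the feature positions on which two instances agree."""
    return sum(x == y for x, y in zip(a, b))


class MemoryBasedClassifier:
    """Hypothetical k-nearest-neighbour learner that stores all training instances."""

    def __init__(self, k=1):
        self.k = k
        self.memory = []

    def fit(self, instances):
        self.memory.extend(instances)

    def predict(self, features):
        # Rank stored instances by similarity and let the k nearest vote.
        ranked = sorted(self.memory,
                        key=lambda inst: overlap(inst[0], features),
                        reverse=True)
        votes = Counter(label for _, label in ranked[:self.k])
        return votes.most_common(1)[0][0]


def vocalize(clf, word, window=2):
    """Insert the predicted short vowel (if any) after each character."""
    placeholder = ["-"] * len(word)  # labels are unknown at prediction time
    out = []
    for (features, _), ch in zip(char_instances(word, placeholder, window), word):
        label = clf.predict(features)
        out.append(ch + ("" if label == "-" else label))
    return "".join(out)


# Toy training data (assumed for illustration): one label per character,
# "-" meaning that no short vowel follows.
TRAIN = [
    ("ktb", ["a", "a", "a"]),        # kataba
    ("drs", ["a", "a", "a"]),        # darasa
    ("ktAb", ["i", "a", "-", "-"]),  # kitaAb
]

clf = MemoryBasedClassifier(k=1)
for word, labels in TRAIN:
    clf.fit(char_instances(word, labels))

print(vocalize(clf, "ktb"))
```

A lexical context, as mentioned in the abstract, would add features from neighbouring words to each instance; this sketch deliberately omits that.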
Author: | Sandra Kübler, Emad Mohamed |
---|---|
URN: | urn:nbn:de:hebis:30-1110645 |
URL: | http://cl.indiana.edu/~skuebler/papers/vocal.pdf |
ISBN: | 2-9517408-4-0 |
Editor: | Nicoletta Calzolari, Khalid Choukri, Bente Maegaard, Joseph Mariani, Jan Odijk, Stelios Piperidis, Daniel Tapias |
Document Type: | Preprint |
Language: | English |
Year of Completion: | 2008 |
Year of first Publication: | 2008 |
Publishing Institution: | Universitätsbibliothek Johann Christian Senckenberg |
Release Date: | 2008/10/21 |
GND Keyword: | Arabic |
Page Number: | 4 |
Note: | Published in: Nicoletta Calzolari ; Khalid Choukri ; Bente Maegaard ; Joseph Mariani ; Jan Odijk ; Stelios Piperidis ; Daniel Tapias (eds.): Proceedings of the Sixth International Conference on Language Resources and Evaluation (LREC-2008), May 28-30, 2008, Marrakech, Morocco. Paris : ELRA, 2008, pp. 2322-2329, ISBN: 2-9517408-4-0 |
Source: | http://jones.ling.indiana.edu/~skuebler/papers/vocal.pdf ; (in:) Proceedings of the LREC Workshop on HLT and NLP within the Arabic World : Marrakesh, Morocco, May 2008 |
HeBIS-PPN: | 205689418 |
Institutes: | Department not specified / External |
Dewey Decimal Classification: | 4 Language / 40 Language / 400 Language |
Collections: | Linguistics |
Linguistics Classification: | Computational linguistics |
Licence (German): | German copyright law |