Nsong is a western Bantu language spoken in the neighbourhood of Kikwit (5°2'28"S 18°48'58"E, Kwilu District, Bandundu Province, Democratic Republic of the Congo) and encoded as B85d in the New Updated Guthrie List (Maho 2009). Mbuun, encoded as B87 by Guthrie (1971: 39) and spoken in the wider vicinity of Idiofa (4°57'35"S 19°35'40"E, Kwilu District, Bandundu Province, DRC), also belongs to this B80 or Tiene-Yanzi group. The two languages are closely related: they share a high percentage of fundamental and other vocabulary as well as several rather atypical phonological innovations (Bostoen & Koni Muluwa 2014; Koni Muluwa 2014; Koni Muluwa & Bostoen 2012). Preliminary elicitation-based research on Mbuun has indicated that the pre-verbal domain plays a crucial role in the marking of argument focus in that language (Bostoen & Mundeke 2011, 2012). In this paper, we assess whether this is also the case in Nsong on the basis of a text corpus which the first author collected, transcribed and annotated in 2013 and 2014 as part of an endangered language documentation project funded by the DoBeS program of the Volkswagen Foundation through a 3-year grant (2012-2015). More information on the project can be found at http://www.kwilubantu.ugent.be/. This Nsong text corpus consists exclusively of oral discourse and currently counts 48,022 tokens and 11,973 types. The team's 2013 fieldwork aimed at documenting Nsong speech events in as many different cultural settings as possible. As a result, the corpus comprises different text genres, such as political speeches, historical traditions, folk music, tales, proverbs, hunting language, ceremonial language used during circumcision and twin rites, and popular biological knowledge. In line with previous research on Mbuun, we concentrate here on mono-clausal argument focus constructions, even though preliminary research has indicated that bi-clausal focus structures are more common in the Nsong corpus.
We present work on the development of multilingual named entity grammars in a unification-based finite-state formalism (SProUT). Following an extended version of the MUC7 standard, we have developed Named Entity Recognition grammars for German, Chinese, Japanese, French, Spanish, English, and Czech. The grammars recognize person names, organizations, geographical locations, currency, time and date expressions. Subgrammars and gazetteers are shared as much as possible across the grammars of the different languages. Multilingual corpora from the business domain are used for grammar development and evaluation. We describe the annotation format, which covers named entities and other linguistic information. We also present an evaluation tool which provides detailed statistics and diagnostics, allows for partial matching of annotations, and supports user-defined mappings between different annotation and grammar output formats.
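To illustrate what partial matching of annotations can mean in an evaluation, the following is a minimal sketch, not the evaluation tool described in the abstract. It assumes annotations are `(start, end, label)` character spans; the function names `overlaps` and `score` and the sample spans are hypothetical.

```python
def overlaps(a, b):
    """True if two (start, end, label) spans share at least one character."""
    return a[0] < b[1] and b[0] < a[1]


def score(gold, predicted, partial=False):
    """Return (precision, recall) for predicted spans against gold spans.

    With partial=False a prediction counts only if its boundaries and label
    are identical to a gold span; with partial=True any overlap with a gold
    span of the same label counts.
    """
    def match(g, p):
        if g[2] != p[2]:
            return False
        return overlaps(g, p) if partial else (g[0], g[1]) == (p[0], p[1])

    matched_pred = sum(any(match(g, p) for g in gold) for p in predicted)
    matched_gold = sum(any(match(g, p) for p in predicted) for g in gold)
    precision = matched_pred / len(predicted) if predicted else 0.0
    recall = matched_gold / len(gold) if gold else 0.0
    return precision, recall


# Hypothetical data: the second prediction misses the first four characters.
gold = [(0, 12, "person"), (20, 35, "organization")]
predicted = [(0, 12, "person"), (24, 35, "organization")]

strict = score(gold, predicted)
lenient = score(gold, predicted, partial=True)
```

Under strict matching the truncated organization span counts as an error, while under partial matching it is accepted, which is why reporting both figures can help diagnose boundary problems in a grammar.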
This paper deals with spelling normalization of historical texts with a view to further processing with modern part-of-speech taggers. Different methods for this task are presented and evaluated on a set of historical German texts from the 15th to 18th centuries, and specific problems inherent to the processing of historical data are discussed. A chain combination of word-based and character-based techniques is shown to be best for normalization, while POS tagging of normalized data is shown to benefit from ignoring punctuation marks. With these techniques, and with 500 manually normalized tokens as training data for the normalization, the tagging accuracy on a manuscript from the 15th century can be raised from 28.65% to 76.27%.
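The idea of chaining a word-based and a character-based normalizer can be sketched as follows. This is an illustrative assumption about how such a chain might be wired together, not the method evaluated in the paper: the word-based step is a lookup table learned from manually normalized pairs, the character-based step is a list of substring rewrite rules, and the training pairs, rules, and function names are all hypothetical.

```python
from collections import Counter, defaultdict


def train_wordlist(pairs):
    """Map each historical form to its most frequent manual normalization."""
    counts = defaultdict(Counter)
    for historical, modern in pairs:
        counts[historical][modern] += 1
    return {h: c.most_common(1)[0][0] for h, c in counts.items()}


def apply_char_rules(word, rules):
    """Apply (old, new) substring rewrite rules in the given order."""
    for old, new in rules:
        word = word.replace(old, new)
    return word


def normalize(word, wordlist, rules):
    """Chain: use the word-based lookup first, fall back to character rules."""
    if word in wordlist:
        return wordlist[word]
    return apply_char_rules(word, rules)


# Hypothetical training pairs and rewrite rules, for illustration only.
pairs = [("vnd", "und"), ("vnd", "und"), ("jn", "in")]
rules = [("th", "t"), ("ey", "ei")]

wordlist = train_wordlist(pairs)
result = [normalize(w, wordlist, rules) for w in ["vnd", "theyl", "jn"]]
```

Putting the lookup first means that forms seen in the training data are resolved directly, and the character rules only generalize to unseen forms.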