OPUS 4 | Search

4 search hits

The immediate before the verb focus position in Nsong (Bantu B85d, DR Congo) : a corpus-based exploration (2014)

Nsong is a western Bantu language spoken in the neighbourhood of Kikwit (5°2'28"S 18°48'58"E, Kwilu District, Bandundu Province, DRC) and encoded as B85d in the New Updated Guthrie List (Maho 2009). Mbuun, encoded as B87 by Guthrie (1971: 39) and spoken in the wider vicinity of Idiofa (4°57'35"S 19°35'40"E, Kwilu District, Bandundu Province, Democratic Republic of the Congo), also belongs to this B80 or Tiene-Yanzi group. The two languages are closely related: they share a high percentage of fundamental and other vocabulary as well as several rather atypical phonological innovations (Bostoen & Koni Muluwa 2014; Koni Muluwa 2014; Koni Muluwa & Bostoen 2012). Preliminary elicitation-based research has shown that the pre-verbal domain plays a crucial role in the marking of argument focus in Mbuun (Bostoen & Mundeke 2011, 2012). In this paper, we assess whether this is also the case in Nsong on the basis of a text corpus which the first author collected, transcribed and annotated in 2013 and 2014 as part of an endangered language documentation project funded by the DoBeS program of the Volkswagen Foundation through a 3-year grant (2012-2015). More information on the project can be found at http://www.kwilubantu.ugent.be/. This Nsong text corpus consists exclusively of oral discourse and currently counts 48,022 tokens and 11,973 types. The team's 2013 fieldwork aimed at documenting Nsong speech events in as many different cultural settings as possible. As a result, the corpus comprises different text genres, such as political speeches, historical traditions, folk music, tales, proverbs, hunting language, ceremonial language used during circumcision and twin rites, and popular biological knowledge. In line with previous research on Mbuun, we concentrate here on mono-clausal argument focus constructions, even though preliminary research has indicated that bi-clausal focus structures are more common in the Nsong corpus.

Syntactic annotation of non-canonical linguistic structures (2007)

Hirschmann, Hagen ; Doolittle, Seanna ; Lüdeling, Anke

This paper deals with the syntactic annotation of corpora that contain both ‘canonical’ and ‘non-canonical’ sentences.

Corpora and evaluation tools for multilingual named entity grammar development (2003)

Bering, Christian ; Droźdźyński, Witold ; Erbach, Gregor ; Guasch, Clara ; Homola, Petr ; Lehmann, Sabine ; Li, Hong ; Krieger, Hans-Ulrich ; Piskorski, Jakub ; Schäfer, Ulrich ; Shimada, Atsuko ; Siegel, Melanie ; Xu, Feiyu ; Ziegler-Eisele, Dorothee

We present an effort to develop multilingual named entity grammars in a unification-based finite-state formalism (SProUT). Following an extended version of the MUC7 standard, we have developed Named Entity Recognition grammars for German, Chinese, Japanese, French, Spanish, English, and Czech. The grammars recognize person names, organizations, geographical locations, currency, and time and date expressions. Subgrammars and gazetteers are shared as much as possible among the grammars of the different languages. Multilingual corpora from the business domain are used for grammar development and evaluation. The annotation format (named entity and other linguistic information) is described. We present an evaluation tool which provides detailed statistics and diagnostics, allows for partial matching of annotations, and supports user-defined mappings between different annotation and grammar output formats.

Automatic normalization for linguistic annotation of historical language data (2013)

Bollmann, Marcel

This paper deals with spelling normalization of historical texts with a view to further processing with modern part-of-speech taggers. Different methods for this task are presented and evaluated on a set of historical German texts from the 15th–18th centuries, and specific problems inherent to the processing of historical data are discussed. A chain combination of word-based and character-based techniques is shown to work best for normalization, while POS tagging of normalized data is shown to benefit from ignoring punctuation marks. With these techniques, and with 500 manually normalized tokens as training data for the normalization, the tagging accuracy on a manuscript from the 15th century can be raised from 28.65% to 76.27%.
