Linguistik
Year of publication
- 2007 (269)
- 2008 (206)
- 2006 (203)
- 2010 (191)
- 2009 (190)
- 2005 (188)
- 2004 (168)
- 2003 (160)
- 2001 (156)
- 2014 (154)
- 2018 (147)
- 2011 (145)
- 2013 (145)
- 2000 (122)
- 2002 (114)
- 2016 (97)
- 2017 (96)
- 2012 (85)
- 2020 (84)
- 2015 (71)
- 2019 (44)
- 2021 (44)
- 1999 (42)
- 1998 (32)
- 1995 (30)
- 1994 (27)
- 1997 (27)
- 2022 (27)
- 1996 (20)
- 2023 (17)
- 1988 (16)
- 1991 (16)
- 1992 (16)
- 1989 (14)
- 1984 (12)
- 1983 (11)
- 1976 (10)
- 1993 (10)
- 1982 (9)
- 1985 (9)
- 1986 (9)
- 1987 (9)
- 1990 (9)
- 1975 (8)
- 1977 (8)
- 1980 (8)
- 1969 (7)
- 1970 (7)
- 1978 (7)
- 1979 (5)
- 1971 (4)
- 1972 (3)
- 1974 (3)
- 2024 (3)
- 1868 (2)
- 1913 (2)
- 1929 (2)
- 1967 (2)
- 1981 (2)
- 1724 (1)
- 1833 (1)
- 1887 (1)
- 1889 (1)
- 1891 (1)
- 1904 (1)
- 1912 (1)
- 1925 (1)
- 1930 (1)
- 1951 (1)
- 1957 (1)
- 1958 (1)
- 1968 (1)
- 1973 (1)
Document Type
- Article (1269)
- Part of a Book (784)
- Conference Proceeding (646)
- Working Paper (254)
- Review (181)
- Preprint (122)
- Book (109)
- Part of Periodical (64)
- Report (58)
- Doctoral Thesis (23)
Language
- English (1933)
- German (1062)
- Croatian (298)
- Portuguese (120)
- Turkish (43)
- Multiple languages (25)
- French (21)
- mis (16)
- Spanish (7)
- Polish (4)
Keywords
- Deutsch (437)
- Syntax (152)
- Linguistik (130)
- Englisch (123)
- Semantik (112)
- Spracherwerb (101)
- Phonologie (86)
- Rezension (77)
- Fremdsprachenlernen (69)
- Kroatisch (68)
Institute
- Extern (438)
- Institut für Deutsche Sprache (IDS) Mannheim (113)
- Neuere Philologien (43)
- Sprachwissenschaften (43)
- Universitätsbibliothek (5)
- Sprach- und Kulturwissenschaften (3)
- Gesellschaftswissenschaften (2)
- Medizin (2)
- Präsidium (2)
- SFB 268 (2)
A pragmatic explanation of the stage level/individual level contrast in combination with locatives
(2004)
One important difference between stage-level predicates (SLPs) and individual-level predicates (ILPs) is their behavior with respect to locative modifiers. It is commonly assumed that SLPs but not ILPs combine with locatives. The present study argues against a semantic account of this behavior (as advanced by e.g. Kratzer 1995, Chierchia 1995) and instead proposes a genuinely pragmatic explanation of the observed stage-level/individual-level contrast. The proposal is spelled out using Blutner's (1998, 2000) optimality-theoretic version of the Gricean maxims. Building on the observation that the respective locatives are not event-related but frame-setting modifiers, the preference for main predicates that express temporary properties is explained as a side effect of “synchronizing” the main predicate with the locative frame in the course of finding an optimal interpretation. By emphasizing the division of labor between grammar and pragmatics, the proposed solution takes a considerable load off semantics.
The study offers a discourse-based account of the Spanish copula forms ser and estar, which are generally considered to be lexical exponents of the stage-level/individual-level contrast. It argues against the popular view that the distinction between SLPs and ILPs rests on a fundamental cognitive division of the world that is reflected in the grammar. As it happens, conceptual oppositions like “temporary vs. permanent” or “arbitrary vs. essential” provide only a preference for the interpretation of estar and ser. In addition, the evidence for an SLP/ILP impact on the grammar turns out to be far less conclusive than is currently assumed. The study argues against event-based accounts of the ser/estar contrast in particular, showing that ser and estar pattern alike in failing all of the standard eventuality tests. The discourse-based account proposed instead assumes that ser and estar both display the same lexical semantics (which is identical to the semantics of English be, German sein, etc.); estar differs from ser only in presupposing a relation to a specific discourse situation. By using estar a speaker restricts his or her claim to a specific discourse situation, whereas by using ser, the speaker makes no such restriction. The preference for interpreting estar predications as denoting temporary properties and ser predications as denoting permanent properties follows from economy principles driving the pragmatic legitimation of estar's discourse dependence. The analysis proposed in this paper can also account for the observation that ser predications do not give rise to thetic judgements. The proposal is couched in terms of the framework of DRT.
In recent years, research in parsing has extended in several new directions. One of these directions is concerned with parsing languages other than English. Treebanks have become available for many European languages, but also for Arabic, Chinese, and Japanese. However, it was shown that parsing results on these treebanks depend on the types of treebank annotation used. Another direction in parsing research is the development of dependency parsers. Dependency parsing profits from the non-hierarchical nature of dependency relations, so lexical information can be included in the parsing process in a much more natural way. Machine-learning-based approaches in particular are very successful. The results achieved by these dependency parsers are very competitive, although comparisons are difficult because of the differences in annotation. For English, the Penn Treebank has been converted to dependencies. For this version, Nivre et al. report an accuracy rate of 86.3%, as compared to an F-score of 92.1 for Charniak's parser. The Penn Chinese Treebank is also available in both a constituent and a dependency representation. The best results reported for parsing experiments with this treebank give an F-score of 81.8 for the constituent version and 79.8% accuracy for the dependency version. The general trend in comparisons between constituent and dependency parsers is that the dependency parser performs slightly worse than the constituent parser. The only exception occurs for German, where F-scores for constituent-plus-grammatical-function parses range between 51.4 and 75.3, depending on the treebank, NEGRA or TüBa-D/Z. The dependency parser based on a converted version of TüBa-D/Z, in contrast, reached an accuracy of 83.4%, i.e. 12 percentage points better than the best constituent analysis including grammatical functions.
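The two evaluation figures being compared above (dependency accuracy vs. PARSEVAL F-score) can be sketched as follows; the head indices, span labels, and numbers are invented toy data, not the treebank results cited in the abstract:

```python
def attachment_accuracy(gold_heads, pred_heads):
    """Dependency accuracy: fraction of tokens assigned the correct head."""
    correct = sum(g == p for g, p in zip(gold_heads, pred_heads))
    return correct / len(gold_heads)

def parseval_f1(gold_spans, pred_spans):
    """PARSEVAL-style F1 over labeled constituent spans (label, start, end)."""
    gold, pred = set(gold_spans), set(pred_spans)
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)

# Toy 5-token sentence: heads are token indices, 0 = artificial root.
print(attachment_accuracy([2, 2, 0, 5, 3], [2, 2, 0, 3, 3]))  # 0.8

gold = {("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5)}
pred = {("S", 0, 5), ("NP", 0, 2), ("PP", 3, 5)}
print(parseval_f1(gold, pred))  # two of three spans match: F1 = 2/3
```

The mismatch between the two metrics is one reason the abstract calls cross-framework comparisons difficult.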
This paper profiles significant differences in syntactic distribution and differences in word class frequencies for two treebanks of spoken and written German: the TüBa-D/S, a treebank of transliterated spontaneous dialogues, and the TüBa-D/Z treebank of newspaper articles published in the German daily newspaper 'die tageszeitung' (taz). The approach can be used more generally as a means of distinguishing and classifying language corpora of different genres.
This paper presents an approach to the question of whether it is possible to construct a parser based on ideas from case-based reasoning. Such a parser would employ a partial analysis of the input sentence to select a (nearly) complete syntax tree and then adapt this tree to the input sentence. The experiments performed on German data from the TüBa-D/Z treebank and the KaRoPars partial parser show that a wide range of levels of generality can be reached, depending on which types of information are used to determine the similarity between the input sentence and the training sentences. The results show that it is possible to construct a case-based parser. The optimal setting among those presented here needs to be determined empirically.
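The retrieval step of such a case-based parser can be sketched as follows — a minimal illustration with a hypothetical three-entry case base of STTS-like POS sequences and a generic sequence-similarity measure, not KaRoPars' actual similarity function:

```python
from difflib import SequenceMatcher

# Hypothetical case base: POS-tag sequences paired with bracketed syntax trees.
CASE_BASE = [
    (("ART", "NN", "VVFIN"), "(S (NP (ART) (NN)) (VVFIN))"),
    (("ART", "NN", "VVFIN", "ART", "NN"), "(S (NP (ART) (NN)) (VVFIN) (NP (ART) (NN)))"),
    (("PPER", "VVFIN", "ADV"), "(S (PPER) (VVFIN) (ADV))"),
]

def retrieve_tree(pos_tags):
    """Return the stored tree whose POS sequence is most similar to the input."""
    def similarity(case_tags):
        return SequenceMatcher(None, case_tags, pos_tags).ratio()
    _, best_tree = max(CASE_BASE, key=lambda case: similarity(case[0]))
    return best_tree

# The retrieved tree would then be adapted to fit the input sentence.
print(retrieve_tree(("ART", "NN", "VVFIN", "ADV")))
```

Which information enters the similarity function (words, POS tags, chunks) is exactly the dimension the paper's experiments vary.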
Quantitative evaluation of parsers has traditionally centered around the PARSEVAL measures of crossing brackets, (labeled) precision, and (labeled) recall. However, it is well known that these measures do not give an accurate picture of the quality of the parser's output. Furthermore, we will show that they are especially unsuited for partial parsers. In recent years, research has concentrated on dependency-based evaluation measures. We will show in this paper that such a dependency-based evaluation scheme is particularly suitable for partial parsers. TüBa-D, the treebank used here for evaluation, contains all the necessary dependency information, so that the conversion of trees into a dependency structure does not have to rely on heuristics. Therefore, the dependency representations are not only reliable, they are also linguistically motivated and can be used for linguistic purposes.
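For contrast, the heuristic head-rule conversion that the abstract says TüBa-D can avoid might look like this minimal sketch (head table and labels are invented; trees are (label, children) tuples with integer token indices at the leaves):

```python
# Hypothetical head table: which child category supplies the head of each phrase.
HEAD_RULES = {"S": "VP", "VP": "V", "NP": "N"}

def head_token(tree):
    """Index of the lexical head of a (label, children) tree; leaves are (POS, index)."""
    label, children = tree
    if isinstance(children, int):
        return children
    wanted = HEAD_RULES.get(label)
    for child in children:
        if child[0] == wanted:
            return head_token(child)
    return head_token(children[0])  # fallback: leftmost child

def to_dependencies(tree, deps=None):
    """Attach the head of every non-head child to the head of its phrase."""
    if deps is None:
        deps = []
    label, children = tree
    if isinstance(children, int):
        return deps
    h = head_token(tree)
    for child in children:
        ch = head_token(child)
        if ch != h:
            deps.append((ch, h))  # (dependent, head)
        to_dependencies(child, deps)
    return deps

# "the cat sleeps": tokens 1..3.
tree = ("S", [("NP", [("D", 1), ("N", 2)]), ("VP", [("V", 3)])])
print(to_dependencies(tree))  # [(2, 3), (1, 2)]
```

A treebank that annotates heads explicitly makes the HEAD_RULES table, and its guesswork, unnecessary.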
The purpose of this paper is to describe the TüBa-D/Z treebank of written German and to compare it to the independently developed TIGER treebank (Brants et al., 2002). Both treebanks, TIGER and TüBa-D/Z, use an annotation framework that is based on phrase structure grammar and that is enhanced by a level of predicate-argument structure. The comparison between the annotation schemes of the two treebanks focuses on the different treatments of free word order and discontinuous constituents in German as well as on differences in phrase-internal annotation.
The earliest known extensive texts in Gullah (and perhaps African American Vernacular English as well) to appear in print were published in The Riverside Magazine for Young People in November, 1868, under the title "Negro Fables" (p. 505-507). These are four animal stories, which the editor of the magazine, Horace Elisha Scudder, described in his column only as having been "taken down from the lips of an old negro, in the vicinity of Charleston" (see Appendix for the editor's comments and the full text of the stories). The Story-Teller was evidently a genuine "man of words" (Abrahams, 1983), a true raconteur who could artistically embellish a simple traditional account (perhaps further embellished by the transcriber) in a variety of ways. That he commanded a certain range of Gullah is evident from particular signature features in the texts, but the absence of other typical Gullah features and the presence of shared Gullah/African American Vernacular English usages, together with the periodic appearance of standard English forms, demonstrate that these texts provide perhaps the earliest actual documentation (apart from early tertiary comments, cited e.g. in Feagin, 1997, p. 128-129) of register variation or style/code-switching among Gullah speakers. ...
The ACL 2008 Workshop on Parsing German features a shared task on parsing German. The goal of the shared task was to find reasons for the radically different behavior of parsers on the different treebanks and between constituent and dependency representations. In this paper, we describe the task and the data sets. In addition, we provide an overview of the test results and a first analysis.
The Conference on Computational Natural Language Learning features a shared task, in which participants train and test their learning systems on the same data sets. In 2007, as in 2006, the shared task has been devoted to dependency parsing, this year with both a multilingual track and a domain adaptation track. In this paper, we define the tasks of the different tracks and describe how the data sets were created from existing treebanks for ten languages. In addition, we characterize the different approaches of the participating systems, report the test results, and provide a first analysis of these results.
Recent approaches to Word Sense Disambiguation (WSD) generally fall into two classes: (1) information-intensive approaches and (2) information-poor approaches. Our hypothesis is that for memory-based learning (MBL), a reduced amount of data is more beneficial than the full range of features used in the past. Our experiments show that MBL combined with a restricted set of features and a feature selection method that minimizes the feature set leads to competitive results, outperforming all systems that participated in the SENSEVAL-3 competition on the Romanian data. Thus, with this specific method, a tightly controlled feature set improves the accuracy of the classifier, reaching 74.0% in the fine-grained and 78.7% in the coarse-grained evaluation.
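Memory-based learning is essentially k-nearest-neighbour classification with an overlap metric, so the effect of restricting the feature set can be sketched on toy WSD data (all feature names, words, and senses below are invented):

```python
from collections import Counter

def knn_predict(train, instance, features, k=1):
    """k-NN with the overlap metric, restricted to a subset of features."""
    def overlap(example):
        return sum(example[f] == instance[f] for f in features)
    neighbours = sorted(train, key=lambda ex: -overlap(ex[0]))[:k]
    return Counter(label for _, label in neighbours).most_common(1)[0][0]

# Toy WSD data for "bank": context features and senses are invented.
train = [
    ({"prev": "river", "next": "of", "pos": "NN"}, "shore"),
    ({"prev": "savings", "next": "account", "pos": "NN"}, "finance"),
    ({"prev": "the", "next": "account", "pos": "NN"}, "finance"),
]
instance = {"prev": "central", "next": "account", "pos": "NN"}

# Feature selection compares subsets; here: the full set vs. a single feature.
print(knn_predict(train, instance, ["prev", "next", "pos"]))  # finance
print(knn_predict(train, instance, ["next"]))  # finance
```

A feature-selection loop of the kind the abstract describes would evaluate such subsets on held-out data and keep the smallest one that does not hurt accuracy.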
Chunk parsing offers a particularly promising approach to robust, partial parsing aimed at broad data coverage. The goal of chunk parsing is a partial, non-recursive syntactic structure. This extremely efficient parsing approach can be realized as a cascade of finite-state transducers. This paper presents TüSBL, a system whose input consists of spontaneous spoken language, delivered to the parser in the form of a word-hypothesis graph from a speech recognizer. Chunk parsing is particularly well suited to such an application because it can robustly handle fragmentary or ill-formed utterances. In addition, a tree-construction component is presented that extends the partial chunk structures into complete trees with grammatical functions. The system is evaluated against manually verified system input, since the usual evaluation parameters are not suitable for this task.
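A cascade of the kind described above can be approximated with regular-expression rewriting over a POS-tag string; the two stages and tag patterns below are a simplified illustration, not TüSBL's actual grammar:

```python
import re

# Two hypothetical cascade stages over a POS-tag string (STTS-like tags):
# stage 1 builds noun chunks (NC), stage 2 prepositional chunks (PC).
STAGES = [
    ("NC", re.compile(r"ART (?:ADJA )*NN")),
    ("PC", re.compile(r"APPR NC")),
]

def chunk(tag_string):
    """Run each stage in order, rewriting matched tag spans to chunk labels."""
    for label, pattern in STAGES:
        tag_string = pattern.sub(label, tag_string)
    return tag_string

print(chunk("ART ADJA NN VVFIN APPR ART NN"))  # NC VVFIN PC
```

Because each stage only rewrites what it can match, unparseable material simply passes through — the robustness property the abstract highlights for spoken input.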
The purpose of this paper is to describe recent developments in the morphological, syntactic, and semantic annotation of the TüBa-D/Z treebank of German. The TüBa-D/Z annotation scheme is derived from the Verbmobil treebank of spoken German [4, 10], but has been extended along various dimensions to accommodate the characteristics of written texts. TüBa-D/Z uses as its data source the "die tageszeitung" (taz) newspaper corpus. The Verbmobil treebank annotation scheme distinguishes four levels of syntactic constituency: the lexical level, the phrasal level, the level of topological fields, and the clausal level. The primary ordering principle of a clause is the inventory of topological fields, which characterize the word order regularities among different clause types of German, and which are widely accepted among descriptive linguists of German [3, 6]. The TüBa-D/Z annotation relies on a context-free backbone (i.e. proper trees without crossing branches) of phrase structure combined with edge labels that specify the grammatical function of the phrase in question. The syntactic annotation scheme of the TüBa-D/Z is described in more detail in [12, 11]. TüBa-D/Z currently comprises approximately 15,000 sentences, with approximately 7,000 sentences in the correction phase. The latter will be released along with an updated version of the existing treebank before the end of this year. The treebank is available in an XML format, in the NEGRA export format [1], and in the Penn treebank bracketing format. The XML format contains all types of information described above, the NEGRA export format contains all sentence-internal information, while the Penn treebank format includes only those layers of information that can be expressed as pure tree structures. Over the course of the last year, more fine-grained linguistic annotations have been added along the following dimensions: (1) the basic Stuttgart-Tübingen tagset (STTS) [9] labels have been enriched with relevant features of inflectional morphology, (2) named entity information has been encoded as part of the syntactic annotation, and (3) a set of anaphoric and coreference relations has been added to link referentially dependent noun phrases. In the following sections, we describe each of these innovations in turn and demonstrate how the additional annotations can be incorporated into one comprehensive annotation scheme.
Part-of-speech tagging is generally performed with Markov models based on bigrams or trigrams. While Markov models concentrate strongly on the left context of a word, many languages require the inclusion of right context for correct disambiguation. We show for German that the best results are reached by a combination of left and right context. If only left context is available, then changing the direction of analysis and going from right to left improves the results. In a version of MBT (Daelemans et al., 1996) with default parameter settings, the inclusion of the right context improved POS tagging accuracy from 94.00% to 96.08%, thus corroborating our hypothesis. The version with optimized parameters reaches 96.73%.
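The kind of left-plus-right context the experiments compare can be sketched as a feature extractor (a generic sketch with invented feature names; MBT's actual feature encoding differs):

```python
def context_features(words, i, left=2, right=2):
    """Feature dict for position i: the word plus left and right neighbours.
    Out-of-range positions get sentence-boundary markers."""
    feats = {"word": words[i]}
    for d in range(1, left + 1):
        feats[f"left-{d}"] = words[i - d] if i - d >= 0 else "<s>"
    for d in range(1, right + 1):
        feats[f"right+{d}"] = words[i + d] if i + d < len(words) else "</s>"
    return feats

print(context_features(["die", "Katze", "schläft", "tief"], 1))
```

Setting `right=0` recovers the purely left-context condition of a standard left-to-right Markov tagger; varying `left` and `right` reproduces the comparison the abstract reports.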
The definition of similarity between sentences is formulated on the levels of words, POS tags, and chunks (Abney 91; Abney 96). The evaluation of this approach shows that while precision and recall based on the PARSEVAL measures (Black et al. 91) do not yet reach state-of-the-art parsers (F1=87.19 on syntactic constituents, F1=77.78 including function-argument structure), the parser shows a very reliable performance where function-argument structure is concerned (F1=96.52). The lower F-scores are very often due to unattached constituents.
The problem of vocalization, or diacritization, is essential to many tasks in Arabic NLP. Arabic is generally written without the short vowels, which leads to one written form having several pronunciations with each pronunciation carrying its own meaning(s). In the experiments reported here, we define vocalization as a classification problem in which we decide for each character in the unvocalized word whether it is followed by a short vowel. We investigate the importance of different types of context. Our results show that the combination of using memory-based learning with only a word internal context leads to a word error rate of 6.64%. If a lexical context is added, the results deteriorate slowly.
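Framing vocalization as a per-character classification task amounts to generating one yes/no instance per character; below is a toy sketch using a made-up Latin transliteration (real Arabic diacritization operates on Arabic script and distinguishes more diacritics):

```python
def char_instances(unvocalized, vocalized):
    """One (character, follows-short-vowel) instance per character, aligning
    the unvocalized word with its vocalized form; short vowels here are a/i/u."""
    instances = []
    j = 0
    for ch in unvocalized:
        j += 1  # consume the matching consonant in the vocalized form
        has_vowel = j < len(vocalized) and vocalized[j] in "aiu"
        if has_vowel:
            j += 1
        instances.append((ch, has_vowel))
    return instances

# Toy transliteration: "ktb" vocalized as "kutub" ("books").
print(char_instances("ktb", "kutub"))  # [('k', True), ('t', True), ('b', False)]
```

Each instance would then be enriched with word-internal (and optionally lexical) context features before being handed to the memory-based learner.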
This thesis first gives a brief overview of the fields of word-class tagging and machine learning (Chapter 1). It then presents Brill's (1992, 1993, 1994) approach of transformation-based error-driven tagging and adapts it for use with German-language corpora (Chapter 2). This is a rule-based system in which, in contrast to previously available systems, the rules are not crafted manually and supplied to the system; rather, the system acquires the rules itself from a few rule templates and a small pre-tagged training corpus. Chapter 3 presents the results of applying the system to parts of a German corpus. Finally, Chapter 4 introduces other tagging systems and compares them with Brill's (1993) system on the basis of eight criteria.
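A transformation of the kind such a system learns can be sketched as follows (the rule and the STTS-style tags are invented for illustration, not rules learned in the thesis):

```python
# Hypothetical learned transformation: (old tag, new tag, required previous tag).
RULES = [
    ("VVFIN", "VVINF", "VMFIN"),  # finite verb -> infinitive after a modal
]

def apply_rules(tags):
    """Apply each transformation in learned order to an initial tagging."""
    tags = list(tags)
    for old, new, prev in RULES:
        for i in range(1, len(tags)):
            if tags[i] == old and tags[i - 1] == prev:
                tags[i] = new
    return tags

print(apply_rules(["PPER", "VMFIN", "VVFIN"]))  # ['PPER', 'VMFIN', 'VVINF']
```

Learning consists of repeatedly instantiating such rule templates and keeping the instantiation that repairs the most errors on the training corpus.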
In syntax, the trend nowadays is towards lexicalized grammar formalisms. It is now widely accepted that dividing words into word classes may serve as a labor-saving mechanism, but at the same time it discards all detailed information on the idiosyncratic behavior of words. And that is exactly the type of information that may be necessary in order to parse a sentence. For learning approaches, however, lexicalized grammars represent a challenge for the very reason that they include so much detailed and specific information, which is difficult to learn. This paper presents an algorithm for learning a link grammar of German. The problem of data sparseness is tackled by using all the available information from partial parses as well as from an existing grammar fragment and a tagger. This is a report on work in progress, so no representative results are available yet.