Linguistik
This article examines the expression of natural gender in Icelandic nouns denoting human beings. Particular attention will be paid to the system's symmetry with regard to nouns denoting women and men. Our society consists almost exactly half of women and half of men. One would therefore assume that systems of terms denoting persons would also be symmetrically organised. Yet this assumption could not be further from the truth, and not just in isolated cases, but in many languages: I will attempt to show that Icelandic has numerous methods for referring to women, but also many barriers and idiosyncrasies.
This article is the first to examine how the phonological and prosodic structures of German given names have changed from 1945 to the present (2008) with respect to the marking of sex or gender. On the basis of the 20 most frequent given names, it shows how female and male names change diachronically with regard to their sonority, the vowels used (especially in syllables without primary stress), hiatuses, consonant clusters, the number of syllables, and the stress pattern. The most important result is that the given names of both sexes are structurally more similar today than ever before. An androgynization has thus taken place since the Second World War.
Zur Entstehung und Struktur ungebändigter Allomorphie : Pluralbildungsverfahren im Luxemburgischen
(2006)
From a pan-Germanic perspective, Luxembourgish displays an exceptional degree of plural allomorphy or, in the terms of H. GIRNTH (2000), of Heterogrammie. The overriding principle seems to be the clear marking of the category 'plural' directly on or within the noun. The morphological complexity concerns several dimensions: first, the large number of pluralization principles, which range from additive through modulatory and zero processes to subtractive techniques; second, the large number of concretely realized allomorphs. Finally, the maximal extension of the pure umlaut type, even to monosyllables, deserves emphasis. Even loanwords can still form their plural today by pure vowel alternation, and can do so on syllables bearing secondary stress. From a diachronic perspective, pure vowel alternation is an important endpoint of a development that has been moving in this direction for centuries. From a synchronic perspective, it is by now misguided to speak of umlaut, as one does for the German plural system, since the vowel alternation has long since become arbitrary to a degree that is almost ablaut-like. In sum, one gains the impression that Luxembourgish exploits almost any phonological change (for instance in subtractive plural formation) or, in the case of umlaut, even makes it productive through morphologization. The present study raises several questions that should be the subject of further research. First, precise quantitative surveys are needed to determine the use and distribution of the individual procedures. The productivity of the rules would also have to be investigated. Furthermore, it remains unclear exactly which rules govern the distribution of the allomorphs. Take English, for example, with its three plural allomorphs [IZ], [z] and [s]: their distribution is governed purely phonologically, by the final sound of the noun. If the noun ends in a sibilant, syllabic [IZ] follows (horse-s ['horsIz]); if it ends in a voiced sound, voiced [z] follows (dog-s); and after a voiceless sound, voiceless [s] follows (cat-s). German, which has a total of nine concrete plural allomorphs, hardly allows the plural to be inferred from the singular form, as the following three monosyllabic rhyming words of the same gender demonstrate: der Hund - die Hunde, der Grund - die Gründe, der Mund - die Münder. Prosodic criteria such as the position of stress, syllabic (number of syllables), phonological (final sound) and morphological criteria, including gender membership, do not always lead to the goal: for many nouns the plural must, as shown above, be learned along with the noun, i.e. it is part of the lexicon. As far as Luxembourgish is concerned, the set of governing factors appears to be more complex, but this is only a conjecture based on spot checks and would need to be substantiated.
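The purely phonological conditioning of the English plural allomorphs described above can be sketched in a few lines of Python. This is only an illustration and not part of the study: the function name `plural_allomorph`, the informal ASCII sound symbols and the (incomplete) sound classes are assumptions made for the example.

```python
# Toy sound classes in informal ASCII notation (assumed for illustration only).
SIBILANTS = {"s", "z", "S", "Z", "tS", "dZ"}   # e.g. the final sounds of horse, bush, church
VOICELESS = {"p", "t", "k", "f", "T"}          # voiceless non-sibilants, e.g. cat

def plural_allomorph(final_sound: str) -> str:
    """Pick the plural allomorph from the stem-final sound alone."""
    if final_sound in SIBILANTS:
        return "Iz"   # syllabic allomorph after sibilants
    if final_sound in VOICELESS:
        return "s"    # voiceless allomorph after voiceless sounds
    return "z"        # voiced allomorph elsewhere (voiced consonants and vowels)

# The three examples from the text, given with their stem-final sounds.
for word, final in [("horse", "s"), ("dog", "g"), ("cat", "t")]:
    print(word, "->", plural_allomorph(final))
```

No comparable function of the final sound can be written for the German examples Hund, Grund and Mund: the stems end in the same sounds, yet the plurals differ, which is the sense in which the plural is part of the lexicon.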
Outside the Indo-European languages, the category "adjective" […] is less widespread than a layperson would assume, and non-Indo-European languages show divisions of the world into nouns and verbs that differ greatly from those of the European languages. The subject of this paper is a hitherto undescribed distribution of concepts across parts of speech in Guarani, a language spoken mainly in Paraguay.
The Conference on Computational Natural Language Learning features a shared task, in which participants train and test their learning systems on the same data sets. In 2007, as in 2006, the shared task has been devoted to dependency parsing, this year with both a multilingual track and a domain adaptation track. In this paper, we define the tasks of the different tracks and describe how the data sets were created from existing treebanks for ten languages. In addition, we characterize the different approaches of the participating systems, report the test results, and provide a first analysis of these results.
Politeness has become a key qualification in intercultural competence and didactics. The paper presents parts of an empirical study of how verbal politeness develops and is shaped in critical incidents, investigating the way German and Turkish students of the German language deal with criticism and complimenting. The findings show that Turkish students of German as a foreign language avoid direct criticism and prefer manners considered to be polite in German. For them, complimenting is an expression of their own positive feelings and acts as "messages about oneself", whereas the German students prefer "meritorious praise" referring to merits. The differentiating effects of migration among the Turkish students are smaller than expected, perhaps because of the increase in transcultural knowledge. This should give new ideas for the didactics of politeness.
This paper proposes an annotation scheme that encodes honorifics (respectful words). Honorifics are used extensively in Japanese, reflecting the social relationship (e.g. social ranks and age) of the referents. This referential information is vital for resolving zero pronouns and improving machine translation outputs. Annotating honorifics is a complex task that involves identifying a predicate with honorifics, assigning ranks to referents of the predicate, calibrating the ranks, and connecting referents with their predicates.
The Germanistische Institutspartnerschaft (German Studies institute partnership) between Barnaul State Pedagogical University (Linguistic Institute) and the Europa-Universität Viadrina Frankfurt/Oder (Faculty of Cultural Studies) has existed since 1993. Over time, this cooperation has produced shared ideas about the most important measures that appear necessary for restructuring and modernizing German Studies teaching and research at the Russian university.
This working paper is the script of a lecture course that I gave during the winter semester 1986/87 at the Institut für Sprachwissenschaft of the Universität zu Köln. […] The working paper is divided into two parts. The first part, chapters 1 - 4, discusses the sociolinguistic problems that arise in investigating and describing a language, while the second part, chapters 5 - 11, deals with how a grammar should be written. The concern here is thus not the grammatical analysis of linguistic data but the presentation of a language, i.e. the writerly task of the linguist, of the grammarian in the proper sense.
The topic of this paper is an analysis of the category of transitivity in Croatian grammars. The analysis examines the relationship between subject and (auto)object. The paper shows how transitivity is described in the grammars and addresses the following questions: How can transitive verbs become intransitive, and what happens to their meaning? How do grammars classify verbs according to transitivity? How is pseudo-reflexivity interpreted? What does it mean that an action arises of its own accord? In what way is an additional interpretation of transitivity possible for true reflexive verbs with regard to the levels of analysis?
This report explores the question of compatibility between annotation projects including translating annotation formalisms to each other or to common forms. Compatibility issues are crucial for systems that use the results of multiple annotation projects. We hope that this report will begin a concerted effort in the field to track the compatibility of annotation schemes for part of speech tagging, time annotation, treebanking, role labeling and other phenomena.
Some requirements for a VERBMOBIL system capable of processing Japanese dialogue input have been explored. Based on a pilot study in the VERBMOBIL domain, dialogues between 2 participants and a professional Japanese interpreter have been analyzed with respect to a very typical and frequent feature: zero pronouns. Zero pronouns in Japanese texts or dialogues as well as overt pronouns in English texts or dialogues are an important element of discourse coherence. As to translation, this difference in the use of pronouns is a case of translation mismatch: information not explicitly expressed in the source language is needed in the target language. (Verb argument positions, normally obligatory in English, are rather frequently omitted in Japanese. Furthermore, verbs in Japanese are not marked with respect to features necessary for pronoun selection in English.)
In this paper, we will argue for a novel analysis of the auxiliary alternation in Early English, its development and subsequent loss which has broader consequences for the way that auxiliary selection is looked at cross-linguistically. We will present evidence that the choice of auxiliaries accompanying past participles in Early English differed in several significant respects from that in the familiar modern European languages. Specifically, while the construction with have became a full-fledged perfect by some time in the ME period, that with be was actually a stative resultative, which it remained until it was lost. We will show that this accounts for some otherwise surprising restrictions on the distribution of BE in Early English and allows a better understanding of the spread of HAVE through late ME and EModE. Perhaps more importantly, the Early English facts also provide insight into the genesis of the kind of auxiliary selection found in German, Dutch and Italian. Our analysis of them furthermore suggests a promising strategy for explaining cross-linguistic variation in auxiliary selection in terms of variation in the syntactico-semantic structure of the perfect. In this introductory section, we will first provide some background on the historical situation we will be discussing, then we will lay out the main claims for which we will be arguing in the paper.
The retreat of BE as perfect auxiliary in the history of English is examined. Corpus data are presented showing that the initial advance of HAVE was most closely connected to a restriction against BE in past counterfactuals. Other factors which have been reported to favor the spread of HAVE are either dependent on the counterfactual effect, or significantly weaker in comparison. It is argued that the effect can be traced to the semantics of the BE perfect, which denoted resultativity rather than anteriority proper. Related data from other older Germanic and Romance languages are presented, and finally implications for existing theories of auxiliary selection stemming from the findings presented are discussed.
This paper proposes a new sound rule for Proto-Slavic, according to which *g (from PIE *g, *gw, *gh, and *gwh) was lost before *m. This development was posterior to Winter’s law and the merger of voiced and aspirated stops in Slavic. The operation of the rule is illustrated by new etymologies of four Slavic words: *ama, *jama ‘hole, pit’, *těmę ‘sinciput’, *mąžь ‘husband, man’, and *remy ‘leather belt’.
The paper deals with nouns that denote measure and that regularly appear in the accusative even though, syntactically, some other form would be expected in that position. Through frequent use in the accusative form, these nouns lose their basic morphological property, inflectability, and with it their unambiguous membership in the noun word class, which raises the question of how to treat them in a dictionary.
From the descriptions in contemporary Croatian grammars, one might conclude either that Croatian has no coordinative compounds or that there are so few of them that they require no description. The article recalls that older grammars did discuss them and argues that, given their present-day number and their varied realizations (nominal, adjectival, adverbial, with the linking elements -o- and -0-), they very much deserve grammatical description. It shows which of their properties allow such compounds to be regarded as words rather than as word combinations, i.e. syntagms. The example of the language of Anka Žagar shows that the model of coordinative compounds, as a potential, can also take on linguistically creative variants within poetry.
The word-formation processes of Croatian are based on the concatenation of morphemes. The paper describes three formation processes that do not occur in the autochthonous, inherited Croatian lexicon: one that is likewise based on morphemic segmentation (infixation) and two that rest on different foundations (reduplication and lexical fusion). The paper has three aims: i) to point out certain inconsistencies in existing descriptions of Croatian morphology; ii) to describe individual borrowed and native Croatian lexemes and constructions in which these three formation types can be said to occur; iii) to predict whether, and to what extent, non-autochthonous formation processes can be imported from foreign languages, today primarily (or solely) from English.
The paper considers the ability of the Croatian possessive adjective to serve as the antecedent of a relative pronoun, an ability that is increasingly being lost in the Slavic languages, where the genitive takes the place of the possessive adjective in this function. Attested examples show that this possibility (still) exists in written Croatian. A survey conducted with native speakers shows, however, that only a minority of present-day speakers judge such constructions acceptable. The paper analyzes the typologically unusual properties of relative clauses with a possessive adjective as antecedent, in particular the fact that in them the possessive adjective behaves like a case form of the noun rather than its derivative. Keywords: possessive adjective, antecedent, relative clause, genitive, Slavic languages
Although calques activate a language's own expressive resources, they too are the target of purist reactions. The aim of the paper is to analyze the latent influence of English on various linguistic levels as a phenomenon present in Croatian and in other European languages. The examples show that this is a widespread phenomenon arising from literal and careless translation, ignorance of the norm of one's own language, and fashionable imitation of the English linguistic norm.
Mjesni govor Kacane
(2009)
The article presents the linguistic features of aliety and alterity (alijetetne, alteritetne), as well as the areal features, of the local dialect of Kacana, which territorially belongs to the Town of Vodnjan. According to the research results, this idiom belongs to the Southwestern Istrian, or Štakavian-Čakavian, dialect. The linguistic features of Kacana are the same as those of neighbouring Orbanići and of the other dialects of the Marčana area investigated so far, as well as those of the southern subgroup of the Barban local dialects, which leads to the conclusion that the branch of dialects with these features extends further to the west.
Dijalekti u Gorskom kotaru
(2010)
All three of our dialect groups, Kajkavian, Štokavian and Čakavian, are spoken in Gorski kotar, but few dialectologists study them. The paper gives an overview of the basic phonological and morphological characteristics recorded in previous research in the area. In addition to the recorded attestations of the features examined, a phonological transcription of one Gorski kotar idiom is appended to the paper.
Mediengestützter Deutschunterricht im türkischen universitären Bereich : eine Bestandsaufnahme
(2010)
A steady trend towards a multimedia lifestyle has emerged in all strata of society. Accordingly, rather than relying on printed course books alone, publishers encourage the use of multimedia that either accompanies a course book or is independent of one, and language learners prefer to learn with multimedia. Courses are therefore encouraged to draw on such support. This study aims to examine the scope and limits of computer-aided teaching of German as a foreign language, which has recently been flourishing in Turkish university education. The study was carried out in the preparatory classes of departments that offer four-year degree programmes. It reports the results of a survey on the use of course-book-dependent and course-book-independent multimedia in courses at Turkish universities. Evidence on the competence of German teachers and learners in using multimedia is presented and visualized in graphics. Problems of multimedia-aided German courses and proposed solutions are also presented.
The paper offers an analysis of stylistic and rhetorical figures in the poetry and travel writing of Fra Ivan Franjo Jukić, a committed Franciscan writer and fighter for the political independence of Bosnia. The author establishes that Jukić brings elements of vernacular speech into his literary expression, which is especially noticeable in his use of popular phrases and collocations. On the other hand, his choice of so-called bookish figures reveals the influence of the Franciscan tradition, especially the language of the older Franciscan chronicles.
The paper arose from the need to describe the Dubrovnik vernacular of the 17th and 18th centuries. In the morphological analysis it is important to bear in mind that the period and area described coincide with the beginnings of the formation of today's standard language. The analysis becomes purposeful when it is compared with the results of linguistic studies of the periods that preceded and followed it, up to the present day.
Target-like use of the article as the grammaticalized means of NP determination in German poses a major difficulty in second language acquisition, especially for learners of German whose native language has no articles. This study investigates NP determination on the basis of a corpus of spontaneous speech that provides acquisition data from an eight-year-old Russian learner of German in an early and a late phase of acquisition. The aim of the study is to gain insights into the developmental course, transfer phenomena and, in particular, reference-semantic and phonological determinants of article choice.
This article addresses one function of dialects, showing their importance for controlling everyday language. Using the example of Low German, a vernacular spoken in Northern Germany, the identity function is shown and explained. First, the understanding of biography is set out, followed by an overview of research on biographical studies in linguistics, especially in dialectology and Low German philology. The main part is an exemplary analysis of an interview with a dialect speaker. The aim of the article is to show in detail the identity function of dialects and what qualitative methods can contribute to linguistic research.
TT-MCTAG lets one abstract away from the relative order of co-complements in the final derived tree, which is more appropriate than classic TAG when dealing with flexible word order in German. In this paper, we present the analyses for sentential complements, i.e., wh-extraction, that-complementation and bridging, and we work out the crucial differences between these and the respective accounts in XTAG (for English) and V-TAG (for German).
The focus of this paper is the perspectivization of thematic roles generally and the recipient role specifically. Whereas perspective is defined here as the representation of something for someone from a given position (Sandig 1996: 37), perspectivization refers to the verbalization of a situation in the speech generation process (Storrer 1996: 233). In a prototypical act of giving, for example, the focus of perception (the attention of the external observer) may be on the person who gives (agent), the transferred object (patient) or the person who receives the transferred object (recipient). The languages of the world provide differing linguistic means to perspectivize such an act of giving, or better: to perspectivize the participants of such an action. In this article, the linguistic means of three selected continental West Germanic languages –German, Dutch and Luxembourgish– will be taken into consideration, with an emphasis on the perspectivization of the recipient role.
MED (Media EDitor) is a program designed to facilitate the transcription of digitized soundfiles into textfiles. It was written by Hans Drexler and Daan Broeder, Max Planck Institute for Psycholinguistics, Nijmegen, The Netherlands. [...] The aim of MED is to facilitate the transcription of sound into text using a single program. It works on the principle of the coexistence and interaction of two basic elements, the waveform display window and the text window. [...] This means that you no longer need to use both a sound editor and a word processor at the same time in order to transcribe digitized speech files. Instead, you can directly type the sound you hear (and see) via MED into the text window. Furthermore, you can directly link sound portions of the waveform display window to text portions of the text window, so that you can easily locate and listen to the original source of your transcription once the links have been set. In this function the waveform display window and the text window virtually interact with each other.
This article attempts a brief introduction to the cognitive sciences. It focuses on cognitive linguistics, which forms the linguistic component of the cognitive sciences and divides into two positions, and criticizes these positions for their lack of diagnostics. The two positions, neurolinguistics and cognitive linguistics, are presented in detail, and both cognitive-linguistic points of view are questioned as to their scientific validity. Cognitive linguistics is understood as a field of cognitive science. Cognitive science seeks to imitate the human brain in its research; from this area artificial intelligence research has also arisen, in which brain researchers and their colleagues from computer technology aim to develop artificial intelligence. The linguistic component of this research is contributed by cognitive linguistics.
In this paper, we investigate the role of sub-optimality in training data for part-of-speech tagging. In particular, we examine to what extent the size of the training corpus and certain types of errors in it affect the performance of the tagger. We distinguish four types of errors: If a word is assigned a wrong tag, this tag can belong to the ambiguity class of the word (i.e. to the set of possible tags for that word) or not; furthermore, the major syntactic category (e.g. "N" or "V") can be correctly assigned (e.g. if a finite verb is classified as an infinitive) or not (e.g. if a verb is classified as a noun). We empirically explore the decrease of performance that each of these error types causes for different sizes of the training set. Our results show that those types of errors that are easier to eliminate have a particularly negative effect on the performance. Thus, it is worthwhile concentrating on the elimination of these types of errors, especially if the training corpus is large.
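The four error types distinguished above arise from crossing two yes/no properties of a wrongly assigned tag. A minimal sketch of that cross-classification follows; the function names, the example tags (written in the style of the STTS tagset) and the assumption that the first letter of a tag names its major syntactic category are all introduced here for illustration and are not taken from the paper.

```python
def major_category(tag: str) -> str:
    """Assumption for this sketch: the first letter of a tag is its major category."""
    return tag[0]

def error_type(correct_tag: str, wrong_tag: str, ambiguity_class: set) -> tuple:
    """Describe a tagging error by the two properties named in the abstract.

    Returns (in_ambiguity_class, major_category_correct).
    """
    in_class = wrong_tag in ambiguity_class
    same_major = major_category(wrong_tag) == major_category(correct_tag)
    return in_class, same_major

# Hypothetical word whose possible tags are finite verb and infinitive.
possible_tags = {"VVFIN", "VVINF"}
print(error_type("VVFIN", "VVINF", possible_tags))  # finite verb tagged as infinitive
print(error_type("VVFIN", "NN", possible_tags))     # verb tagged as noun
```

Counting the four resulting combinations separately is what would allow the effect of each error type on tagger performance to be measured on its own.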
Quantitative evaluation of parsers has traditionally centered around the PARSEVAL measures of crossing brackets, (labeled) precision, and (labeled) recall. However, it is well known that these measures do not give an accurate picture of the quality of the parser's output. Furthermore, we will show that they are especially unsuited for partial parsers. In recent years, research has concentrated on dependency-based evaluation measures. We will show in this paper that such a dependency-based evaluation scheme is particularly suitable for partial parsers. TüBa-D, the treebank used here for evaluation, contains all the necessary dependency information so that the conversion of trees into a dependency structure does not have to rely on heuristics. Therefore, the dependency representations are not only reliable, they are also linguistically motivated and can be used for linguistic purposes.
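For readers unfamiliar with the PARSEVAL measures mentioned above, labeled precision and recall can be sketched as a comparison of two multisets of labeled brackets. The sketch below is a simplified illustration, not the evaluation software used in the paper: the function name `parseval`, the `(label, start, end)` bracket representation and the example sentence are assumptions, and the crossing-brackets measure is left out.

```python
from collections import Counter

def parseval(gold_brackets, parsed_brackets):
    """Labeled precision, recall and F-score over (label, start, end) brackets."""
    gold, parsed = Counter(gold_brackets), Counter(parsed_brackets)
    matched = sum((gold & parsed).values())   # brackets present in both analyses
    precision = matched / sum(parsed.values()) if parsed else 0.0
    recall = matched / sum(gold.values()) if gold else 0.0
    f_score = (2 * precision * recall / (precision + recall)
               if precision + recall else 0.0)
    return precision, recall, f_score

# Hypothetical five-token sentence; the parser labels the final PP as an NP.
gold = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("PP", 3, 5)]
parsed = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
print(parseval(gold, parsed))
```

Because the score only counts matching brackets, it says nothing about which grammatical relations were recovered, which is one motivation for the dependency-based evaluation the abstract argues for.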
Traditionally, parsers are evaluated against gold standard test data. This can cause problems if there is a mismatch between the data structures and representations used by the parser and the gold standard. A particular case in point is German, for which two treebanks (TiGer and TüBa-D/Z) are available with highly different annotation schemes for the acquisition of (e.g.) PCFG parsers. The differences between the TiGer and TüBa-D/Z annotation schemes make fair and unbiased parser evaluation difficult [7, 9, 12]. The resource (TEPACOC) presented in this paper takes a different approach to parser evaluation: instead of providing evaluation data in a single annotation scheme, TEPACOC uses comparable sentences and their annotations for 5 selected key grammatical phenomena (with 20 sentences per phenomenon) from both the TiGer and the TüBa-D/Z resources. This provides a comparable testsuite of 2 times 100 sentences, which allows us to evaluate TiGer-trained parsers against the TiGer part of TEPACOC, and TüBa-D/Z-trained parsers against the TüBa-D/Z part of TEPACOC for key phenomena, instead of comparing them against a single (and potentially biased) gold standard. To overcome the problem of inconsistency in human evaluation and to bridge the gap between the two different annotation schemes, we provide an extensive error classification, which enables us to compare parser output across the two different treebanks. In the remaining part of the paper we present the testsuite and describe the grammatical phenomena covered in the data. We discuss the different annotation strategies used in the two treebanks to encode these phenomena and present our error classification of potential parser errors.
In recent years, research in parsing has extended in several new directions. One of these directions is concerned with parsing languages other than English. Treebanks have become available for many European languages, but also for Arabic, Chinese, or Japanese. However, it was shown that parsing results on these treebanks depend on the types of treebank annotations used. Another direction in parsing research is the development of dependency parsers. Dependency parsing profits from the non-hierarchical nature of dependency relations, thus lexical information can be included in the parsing process in a much more natural way. Machine-learning-based approaches are especially successful (cf. e.g.). The results achieved by these dependency parsers are very competitive although comparisons are difficult because of the differences in annotation. For English, the Penn Treebank has been converted to dependencies. For this version, Nivre et al. report an accuracy rate of 86.3%, as compared to an F-score of 92.1 for Charniak's parser. The Penn Chinese Treebank is also available in a constituent and a dependency representation. The best results reported for parsing experiments with this treebank give an F-score of 81.8 for the constituent version and 79.8% accuracy for the dependency version. The general trend in comparisons between constituent and dependency parsers is that the dependency parser performs slightly worse than the constituent parser. The only exception occurs for German, where F-scores for constituent plus grammatical function parses range between 51.4 and 75.3, depending on the treebank, NEGRA or TüBa-D/Z. The dependency parser based on a converted version of TüBa-D/Z, in contrast, reached an accuracy of 83.4%, i.e. 12 percentage points better than the best constituent analysis including grammatical functions.
The problem of vocalization, or diacritization, is essential to many tasks in Arabic NLP. Arabic is generally written without the short vowels, which leads to one written form having several pronunciations with each pronunciation carrying its own meaning(s). In the experiments reported here, we define vocalization as a classification problem in which we decide for each character in the unvocalized word whether it is followed by a short vowel. We investigate the importance of different types of context. Our results show that the combination of using memory-based learning with only a word internal context leads to a word error rate of 6.64%. If a lexical context is added, the results deteriorate slowly.
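Casting vocalization as a per-character classification problem, as described above, means turning every character of the unvocalized word into one training instance whose features are the neighbouring characters. The following is a rough sketch of that instance construction only. It is not the authors' feature set: the toy Latin transliteration in which `a`, `i` and `u` stand for the short vowels, the window size, the padding symbol `_` and the label `-` for "no short vowel" are all assumptions made for the example.

```python
SHORT_VOWELS = set("aiu")   # toy transliteration: a, i, u stand for the short vowels

def make_instances(vocalized: str, window: int = 2):
    """Turn a vocalized word into (context, label) pairs, one per letter.

    The context is the letter plus `window` letters on each side within the
    unvocalized word (word-internal context only); the label is the short
    vowel that follows the letter, or "-" if none does.
    """
    letters, labels = [], []
    for ch in vocalized:
        if ch in SHORT_VOWELS and labels and labels[-1] == "-":
            labels[-1] = ch          # attach the vowel to the preceding letter
        else:
            letters.append(ch)       # any other symbol is treated as a letter
            labels.append("-")       # no short vowel seen after it (yet)
    padded = ["_"] * window + letters + ["_"] * window
    return [(tuple(padded[i:i + 2 * window + 1]), labels[i])
            for i in range(len(letters))]

for context, label in make_instances("kataba", window=1):
    print(context, label)
```

Instances of this shape could then be handed to a memory-based (nearest-neighbour) learner; adding a lexical context, as in the paper's second setting, would mean extending the feature tuple with neighbouring words.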
How to compare treebanks
(2008)
Recent years have seen an increasing interest in developing standards for linguistic annotation, with a focus on the interoperability of the resources. This effort, however, requires a profound knowledge of the advantages and disadvantages of linguistic annotation schemes in order to avoid importing the flaws and weaknesses of existing encoding schemes into the new standards. This paper addresses the question of how to compare syntactically annotated corpora and gain insights into the usefulness of specific design decisions. We present an exhaustive evaluation of two German treebanks with crucially different encoding schemes. We evaluate three different parsers trained on the two treebanks and compare results using EVALB, the Leaf-Ancestor metric, and a dependency-based evaluation. Furthermore, we present TePaCoC, a new testsuite for the evaluation of parsers on complex German grammatical constructions. The testsuite provides a well thought-out error classification, which enables us to compare parser output for parsers trained on treebanks with different encoding schemes and provides interesting insights into the impact of treebank annotation schemes on specific constructions like PP attachment or non-constituent coordination.
Prepositional phrase (PP) attachment is one of the major sources of errors in traditional statistical parsers. The reason for that lies in the type of information necessary for resolving structural ambiguities. For parsing, it is assumed that distributional information of parts-of-speech and phrases is sufficient for disambiguation. For PP attachment, in contrast, lexical information is needed. The problem of PP attachment has sparked much interest ever since Hindle and Rooth (1993) formulated the problem in a way that can be easily handled by machine learning approaches: In their approach, PP attachment is reduced to the decision between noun and verb attachment, and the relevant information is reduced to the two possible attachment sites (the noun and the verb) and the preposition of the PP. Brill and Resnik (1994) extended the feature set to the now standard 4-tuple, which also contains the noun inside the PP. Among many publications on the problem of PP attachment, Volk (2001; 2002) describes the only system for German. He uses a combination of supervised and unsupervised methods. The supervised method is based on the back-off model by Collins and Brooks (1995); the unsupervised part consists of heuristics such as "If there is a support verb construction present, choose verb attachment". Volk trains his back-off model on the Negra treebank (Skut et al., 1998) and extracts frequencies for the heuristics from the "Computerzeitung". The latter also serves as test data set. Consequently, it is difficult to compare Volk's results to other results for German, including the results presented here, since he not only uses a combination of supervised and unsupervised learning but also performs domain adaptation. Most of the researchers working on PP attachment seem to be satisfied with a PP attachment system; we have found hardly any work on integrating the results of such approaches into actual parsers. The only exceptions are Mehl et al. (1998) and Foth and Menzel (2006), both working with German data. Mehl et al. report a slight improvement of PP attachment from 475 correct PPs out of 681 PPs for the original parser to 481 PPs. Foth and Menzel report an improvement of overall accuracy from 90.7% to 92.2%. Both integrate statistical attachment preferences into a parser. First, we will investigate whether dependency parsing, which generally uses lexical information, shows the same performance on PP attachment as an independent PP attachment classifier does. Then we will investigate an approach that allows the integration of PP attachment information into the output of a parser without having to modify the parser: The results of an independent PP attachment classifier are integrated into the parse of a dependency parser for German in a postprocessing step.
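The 4-tuple formulation and the idea of backing off to smaller contexts can be illustrated with a minimal sketch. It is a simplified version of a back-off classifier in the spirit of Collins and Brooks (1995), not the system of Volk or the classifier evaluated in the paper. The function names (`sub_tuples`, `train`, `classify`) and the toy data are assumptions made for illustration only.

```python
from collections import Counter

def sub_tuples(v, n1, p, n2):
    """All back-off contexts of a (verb, noun, preposition, noun-in-PP) tuple.
    Every context keeps the preposition; the first element names the context type."""
    return [
        ("vnpn", v, n1, p, n2),
        ("vnp", v, n1, p), ("vpn", v, p, n2), ("npn", n1, p, n2),
        ("vp", v, p), ("np", n1, p), ("pn", p, n2),
        ("p", p),
    ]

# Back-off levels, from the most specific context to the preposition alone.
LEVELS = [("vnpn",), ("vnp", "vpn", "npn"), ("vp", "np", "pn"), ("p",)]

def train(examples):
    """examples: iterable of ((v, n1, p, n2), label) with label 'V' or 'N'."""
    total, verb = Counter(), Counter()
    for (v, n1, p, n2), label in examples:
        for key in sub_tuples(v, n1, p, n2):
            total[key] += 1
            if label == "V":
                verb[key] += 1
    return total, verb

def classify(model, v, n1, p, n2):
    """Use the most specific level with any evidence; default to noun attachment."""
    total, verb = model
    keys = {key[0]: key for key in sub_tuples(v, n1, p, n2)}
    for level in LEVELS:
        seen = sum(total[keys[name]] for name in level)
        if seen > 0:
            verb_share = sum(verb[keys[name]] for name in level) / seen
            return "V" if verb_share > 0.5 else "N"
    return "N"

# Hypothetical toy training data (not taken from any treebank).
TOY_DATA = [
    (("eat", "pizza", "with", "fork"), "V"),
    (("eat", "pizza", "with", "anchovies"), "N"),
    (("see", "man", "with", "telescope"), "V"),
    (("buy", "book", "about", "parsing"), "N"),
]
```

With `model = train(TOY_DATA)`, a call such as `classify(model, "eat", "salad", "with", "fork")` finds no evidence for the full 4-tuple and falls back to the triples, where the only matching context (verb, preposition, noun in PP) was seen with verb attachment.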
Parsing coordinations
(2009)
The present paper is concerned with statistical parsing of constituent structures in German. The paper presents four experiments that aim at improving parsing performance on coordinate structures: 1) reranking the n-best parses of a PCFG parser, 2) enriching the input to a PCFG parser with gold scopes for any conjunct, 3) reranking the parser output for all possible scopes for conjuncts that are permissible with regard to clause structure. Experiment 4 reranks a combination of parses from experiments 1 and 3. The experiments presented show that n-best parsing combined with reranking improves results by a large margin. Providing the parser with different scope possibilities and reranking the resulting parses results in an increase in F-score from 69.76 for the baseline to 74.69. While the F-score is similar to that of the first experiment (n-best parsing and reranking), the first experiment results in higher recall (75.48% vs. 73.69%) and the third one in higher precision (75.43% vs. 73.26%). Combining the two methods yields the best result, with an F-score of 76.69.
This paper presents a comparative study of probabilistic treebank parsing of German, using the Negra and TüBa-D/Z treebanks. Experiments with the Stanford parser, which uses a factored PCFG and dependency model, show that, contrary to previous claims for other parsers, lexicalization of PCFG models boosts parsing performance for both treebanks. The experiments also show that there is a big difference in parsing performance between models trained on the Negra treebank and models trained on the TüBa-D/Z treebank. Parser performance for the models trained on TüBa-D/Z is comparable to parsing results for English with the Stanford parser when trained on the Penn treebank. This comparison at least suggests that German is not harder to parse than its West-Germanic neighbor language English.
Chunk parsing has focused on the recognition of partial constituent structures at the level of individual chunks. Little attention has been paid to the question of how such partial analyses can be combined into larger structures for complete utterances. Such larger structures are not only desirable for a deeper syntactic analysis. They also constitute a necessary prerequisite for assigning function-argument structure. The present paper offers a similarity-based algorithm for assigning functional labels such as subject, object, head, complement, etc. to complete syntactic structures on the basis of prechunked input. The evaluation of the algorithm has concentrated on measuring the quality of functional labels. It was performed on a German and an English treebank using two different annotation schemes at the level of function-argument structure. The results of 89.73% correct functional labels for German and 90.40% for English validate the general approach.
Chunk parsing has focused on the recognition of partial constituent structures at the level of individual chunks. Little attention has been paid to the question of how such partial analyses can be combined into larger structures for complete utterances. The TüSBL parser extends current chunk parsing techniques by a tree-construction component that extends partial chunk parses to complete tree structures including recursive phrase structure as well as function-argument structure. TüSBL's tree construction algorithm relies on techniques from memory-based learning that allow similarity-based classification of a given input structure relative to a pre-stored set of tree instances from a fully annotated treebank. A quantitative evaluation of TüSBL has been conducted using a semi-automatically constructed treebank of German that consists of approximately 67,000 fully annotated sentences. The basic PARSEVAL measures were used although they were developed for parsers that have as their main goal a complete analysis that spans the entire input. This runs counter to the basic philosophy underlying TüSBL, which has as its main goal robustness of partially analyzed structures.
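Why the PARSEVAL measures sit uneasily with partial analyses can be seen from a minimal sketch of labeled bracket precision, recall, and F1 for a single sentence. The function name `parseval` and the bracket representation `(label, start, end)` are assumptions made for illustration; the actual evaluation in the paper may differ in details such as punctuation handling or averaging over sentences.

```python
from collections import Counter

def parseval(gold, predicted):
    """Labeled bracket precision, recall and F1 for one sentence.
    Each bracket is a (label, start, end) triple over token positions."""
    gold_c, pred_c = Counter(gold), Counter(predicted)
    # Multiset intersection: a bracket counts as matched at most as often as it occurs in both.
    matched = sum((gold_c & pred_c).values())
    precision = matched / sum(pred_c.values()) if pred_c else 0.0
    recall = matched / sum(gold_c.values()) if gold_c else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return precision, recall, f1

# Hypothetical example: a complete gold tree versus a correct but partial analysis.
GOLD = [("S", 0, 5), ("NP", 0, 2), ("VP", 2, 5), ("NP", 3, 5)]
PARTIAL = [("NP", 0, 2), ("NP", 3, 5)]
```

For this toy input, `parseval(GOLD, PARTIAL)` gives perfect precision but only half the recall: every bracket the partial analysis proposes is correct, yet the measure penalizes it for the brackets it deliberately leaves unattached.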
Chunk parsing offers a particularly promising approach to robust, partial parsing aimed at broad data coverage. The goal of chunk parsing is a partial, non-recursive syntactic structure. This extremely efficient parsing approach can be implemented as a cascade of finite-state transducers. This paper presents TüSBL, a system whose input consists of spontaneous spoken language, supplied to the parser by a speech recognizer in the form of a word hypothesis graph. Chunk parsing is particularly well suited to such an application because it can robustly handle fragmentary or ill-formed utterances. In addition, a tree construction component is presented that extends the partial chunk structures to complete trees with grammatical functions. The system is evaluated on manually checked system input, since the usual evaluation measures are not suitable for this purpose.
Machine learning is frequently used for the efficient annotation of large amounts of data. Research on machine learning methods is generally restricted to comparing different learning methods or determining the optimal size of the training data. So far, however, it has not been investigated to what extent linguistic knowledge can have a positive effect on the definition of the task. This is investigated here for the learning of base noun phrases with three different definitions. The definitions differ in the degree of linguistically motivated extensions that were added to a more practically motivated first definition. The investigations showed that the number of misclassified words can be reduced by one third.
In the last decade, the Penn treebank has become the standard data set for evaluating parsers. The fact that most parsers are solely evaluated on this specific data set leaves the question unanswered how much these results depend on the annotation scheme of the treebank. In this paper, we will investigate the influence which different decisions in the annotation schemes of treebanks have on parsing. The investigation uses the comparison of similar treebanks of German, NEGRA and TüBa-D/Z, which are subsequently modified to allow a comparison of the differences. The results show that deleted unary nodes and a flat phrase structure have a negative influence on parsing quality while a flat clause structure has a positive influence.
The ACL 2008 Workshop on Parsing German features a shared task on parsing German. The goal of the shared task was to find reasons for the radically different behavior of parsers on the different treebanks and between constituent and dependency representations. In this paper, we describe the task and the data sets. In addition, we provide an overview of the test results and a first analysis.
This paper presents an approach to the question of whether it is possible to construct a parser based on ideas from case-based reasoning. Such a parser would employ a partial analysis of the input sentence to select a (nearly) complete syntax tree and then adapt this tree to the input sentence. The experiments performed on German data from the TüBa-D/Z treebank and the KaRoPars partial parser show that a wide range of levels of generality can be reached, depending on which types of information are used to determine the similarity between input sentence and training sentences. The results indicate that it is possible to construct a case-based parser. The optimal setting among those presented here needs to be determined empirically.
In syntax, the trend nowadays is towards lexicalized grammar formalisms. It is now widely accepted that dividing words into word classes may serve as a labor-saving mechanism, but at the same time it discards all detailed information on the idiosyncratic behavior of words. And that is exactly the type of information that may be necessary in order to parse a sentence. For learning approaches, however, lexicalized grammars represent a challenge for the very reason that they include so much detailed and specific information, which is difficult to learn. This paper will present an algorithm for learning a link grammar of German. The problem of data sparseness is tackled by using all the available information from partial parses as well as from an existing grammar fragment and a tagger. This is a report about work in progress, so there are no representative results available yet.
The definition of similarity between sentences is formulated on the levels of words, POS tags, and chunks (Abney 91; Abney 96). The evaluation of this approach shows that while precision and recall based on the PARSEVAL measures (Black et al. 91) do not yet reach those of state-of-the-art parsers (F1=87.19 on syntactic constituents, F1=77.78 including function-argument structure), the parser shows a very reliable performance where function-argument structure is concerned (F1=96.52). The lower F-scores are very often due to unattached constituents.
Integration and social advancement are no longer possible today without solid language skills. What was neglected for decades is now being made up for, very successfully, through integration efforts abroad and in Germany. Unfortunately, German is only the first, though perhaps the most important, step towards successful integration. The next question should therefore be: why is there a lack of integration despite good knowledge of German?
Based on the author's own field research and on the literature, the paper presents the consonant system of the South Moslavina Kajkavian local dialects, its inventory, distribution, and origin, using the example of three local dialects: Kutinsko Selo, Osekovo, and Okešinec. The shared and distinguishing features of the three local dialects are presented. The South Moslavina Kajkavian local dialects belong to the South Moslavina or Lower Lonja dialect.
Deklinacija brojeva dva, oba, tri i četiri u kajkavskim pravnim tekstovima od 16. do 18. Stoljeća
(2007)
In this article the authors deal with the declension of the numerals dva 'two', oba 'both', tri 'three', and četiri 'four' in Kajkavian legal texts from the 16th to the 18th century. The corpus for the linguistic analysis comprises 23 texts from the 16th century, 40 texts from the 17th century, and 19 texts from the 18th century. In the analysis, particular attention is paid to the comparison of dual and plural forms in the declension of the numerals dva and oba, as well as to the development of plural forms in the declension of tri and četiri. The authors list all attested forms of the numerals dva, oba, tri, and četiri, compare their occurrence across the different time periods, and, based on the results of the linguistic analysis, propose a declension type for these numerals. The declension of the numerals in the oblique cases is examined with regard to whether the numerals are part of prepositional or non-prepositional expressions; a separate question is the frequency of indeclinable forms.
The possibility of recording family names by inventory, number of bearers, and spatial distribution with great precision on the basis of digital telephone connections has opened a new era in anthroponomastics. The treasure of 850,661 different family names registered in 28,205,713 private landline connections in 2005 is immense, and the research questions it raises are inexhaustible in scope and number. In this situation, two tasks arose as priorities: First, in view of population mobility growing from year to year, the effects of recent naming legislation, and the rapidly increasing replacement of localized landline connections by mobile phones, the name inventory had to be secured and archived now at the latest, on the basis of the most reliable source and in a legitimately usable manner. The historically grown name landscapes are only just still preserved, and with astonishing stability. After data protection issues had been clarified, the data were made available to the Deutscher Familiennamenatlas by Deutsche Telekom as of June 2005, and their use for onomastic research was regulated by a contract dated 28 June 2005.
Family names are the only area of the European languages whose pronounced regional diversity has so far been recorded only very inadequately. The historically grown name landscapes are still preserved with astonishing stability. For the Federal Republic of Germany, they are being documented on the basis of telephone connections (as of 2005) by the 'Deutscher Familiennamenatlas' (DFA), a project begun in 2005 in cooperation between the universities of Freiburg and Mainz and funded by the DFG. This article presents and justifies the preliminary work, the aims, and the overall design of the project, the systematics and representativeness of the selection of topics in the two main parts (grammatical and lexical), and the criteria and methods for the conceptual design and formal layout of the maps and commentaries. The preliminary work mentioned also already reveals perspectives for future analysis of the materials archived in the databases and of the structures of the name landscapes documented by example in the atlas.
Parni prijedlozi
(2007)
Far in the east, at the "edge of the world" as it were, in the Republic of Buryatia (Russian Federation), located beyond Lake Baikal and many thousands of kilometres from major European cities, the acquisition of German has a high standing, especially for teachers of German, trainers of German teachers, and students of German.
This work is concerned with the repetitions produced by students in DaF (German as a foreign language) conversation lessons. Repetitions in classroom discourse function differently from repetitions in everyday discourse. The contribution of students' repetitions to classroom discourse is demonstrated here on the basis of examples. Recordings from DaF conversation lessons were transcribed and reconstructed according to HIAT. The study is limited to the kinds of repetitions and their functions in these DaF conversation lessons. The findings should be taken into account deliberately in order to better understand and react to these repeating actions of the students, such as inquiry, correction, confirmation, precautionary self-control, and verification, most of which the students perform for a certain purpose, albeit unconsciously.
During his brief stay in St. Petersburg in 1912, I. Milčetić described the rich Berčić collection of Glagolitic manuscripts and printed books in the Russian National Library, but he did not manage to study every component of Berčić's material in detail. Milčetić's description of codex no. 1 (Klimantović's Miscellany, 1514) mentions the prologue of the Passion, but does not note that the prologue is followed by a fragment of a medieval play with the scene of Judas's betrayal of Jesus. In the Middle Ages this scene disturbed the common people most of all, since lies, betrayal, and hypocrisy were then hated above everything else. The paper describes and publishes for the first time fragments of an unknown redaction of the Passion of Jesus Christ (Muka Isuhrstova) from the St. Petersburg Berčić collection (shelf mark Bc 1), which represent the oldest recorded prologue and scene of a Croatian medieval Passion play known so far. The verses of the fragments are compared with the later cyclic Muka Spasitelja našega (Passion of Our Saviour) from the Glagolitic Zbornik prikazanja (Miscellany of Plays, 1556), with which the fragments correspond most closely within the corpus of Croatian medieval poetry.
Tree-local MCTAG with shared nodes : an analysis of word order variation in German and Korean
(2004)
Tree Adjoining Grammars (TAG) are known not to be powerful enough to deal with scrambling in free word order languages. The TAG-variants proposed so far in order to account for scrambling are not entirely satisfying. Therefore, an alternative extension of TAG is introduced based on the notion of node sharing. Considering data from German and Korean, it is shown that this TAG-extension can adequately analyse scrambling data, also in combination with extraposition and topicalization.
This paper proposes a corpus encoding standard that meets the needs of linguistic research using a variety of linguistic data structures. The standard was developed in SFB 441, a research project at the University of Tuebingen. The principal concern of SFB 441 is the empirical data structures which feed into linguistic theory building. SFB 441 consists of several projects, most of which are building corpora to empirically investigate various linguistic phenomena in various languages (e.g. modal verbs in German, forms of address and politeness in Russian). These corpora will form the components of the "Tuebingen collection of reusable, empirical, linguistic data structures (TUSNELDA)". The TUSNELDA annotation standard aims at providing a uniform encoding scheme for all subcorpora and texts of TUSNELDA such that they can be processed with uniform standardized tools. To guarantee maximal reusability we use XML for encoding. Previous SGML standards for text encoding were provided by the Text Encoding Initiative (TEI) and the Expert Advisory Group on Language Engineering Standards (Corpus Encoding Standard, CES). The TUSNELDA standard is based on TEI and XCES (XML version of CES) but takes into account the specific needs of the SFB projects, i.e. the peculiarities of the examined languages and linguistic phenomena.
This paper investigates the class of Tree-Tuple MCTAG with Shared Nodes, TT-MCTAG for short, an extension of Tree Adjoining Grammars that has been proposed for natural language processing, in particular for dealing with discontinuities and word order variation in languages such as German. It has been shown that the universal recognition problem for this formalism is NP-hard, but so far it was not known whether the class of languages generated by TT-MCTAG is included in PTIME. We provide a positive answer to this question, using a new characterization of TT-MCTAG.
This paper sets up a framework for LTAG (Lexicalized Tree Adjoining Grammar) semantics that brings together ideas from different recent approaches addressing some shortcomings of TAG semantics based on the derivation tree. Within this framework, several sample analyses are proposed, and it is shown that the framework allows the analysis of data that have been claimed to be problematic for derivation-tree-based LTAG semantics approaches.
Relative quantifier scope in German depends, in contrast to English, very much on word order. The scope possibilities of a quantifier are determined by its surface position, its base position and the type of the quantifier. In this paper we propose a multicomponent analysis for German quantifiers computing the scope of the quantifier, in particular its minimal nuclear scope, depending on the syntactic configuration it occurs in.
This paper presents an LTAG analysis of reflexives like himself and reciprocals like each other. These items need to find a c-commanding antecedent from which they retrieve (part of) their own denotation and with which they syntactically agree. The relation between anaphoric item and antecedent must satisfy the following important locality conditions (Chomsky (1981)).
This paper compares two approaches to computational semantics, namely semantic unification in Lexicalized Tree Adjoining Grammars (LTAG) and Lexical Resource Semantics (LRS) in HPSG. There are striking similarities between the frameworks that make them comparable in many respects. We will exemplify the differences and similarities by looking at several phenomena. We will show, first of all, that many intuitions about the mechanisms of semantic computations can be implemented in similar ways in both frameworks. Secondly, we will identify some aspects in which the frameworks intrinsically differ due to more general differences between the approaches to formal grammar adopted by LTAG and HPSG.
This paper investigates the relation between TT-MCTAG, a formalism used in computational linguistics, and RCG. RCGs are known to describe exactly the class PTIME; simple RCGs have even been shown to be equivalent to linear context-free rewriting systems, i.e., to be mildly context-sensitive. TT-MCTAG has been proposed to model free word order languages. In general, it is NP-complete. In this paper, we will put an additional limitation on the derivations licensed in TT-MCTAG. We show that TT-MCTAG with this additional limitation can be transformed into equivalent simple RCGs. This result is interesting for theoretical reasons (since it shows that TT-MCTAG in this limited form is mildly context-sensitive) and, furthermore, for practical reasons: We use the proposed transformation from TT-MCTAG to RCG in an actual parser that we have implemented.
This article studies the relation between multicomponent tree adjoining grammars with tree tuples (TT-MCTAG), a formalism used in computational linguistics, and range concatenation grammars (RCG). RCGs are known to describe exactly the class PTIME; it has furthermore been shown that "simple" RCGs are even equivalent to linear context-free rewriting systems (LCFRS), in other words, they are mildly context-sensitive. TT-MCTAG has been proposed to model free word order languages. In general, these languages are NP-complete. In this article, we define an additional constraint on the derivations licensed by the TT-MCTAG formalism. We then show how this restricted form of TT-MCTAG can be converted into an equivalent simple RCG. The result is interesting for theoretical reasons (since it shows that the restricted form of TT-MCTAG is mildly context-sensitive), but also for practical reasons (the transformation proposed here has been used to implement a parser for TT-MCTAG).
We present several parsing algorithms for Range Concatenation Grammar (RCG), including a new Earley-style algorithm, within the deductive parsing paradigm. Our work is motivated by the recent interest in this type of grammar and fills a gap in the existing literature.
We present a CYK and an Earley-style algorithm for parsing Range Concatenation Grammar (RCG), using the deductive parsing framework. The characteristic property of the Earley parser is that we use a technique of range boundary constraint propagation to compute the yields of non-terminals as late as possible. Experiments show that, compared to previous approaches, the constraint propagation helps to considerably decrease the number of items in the chart.
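The deductive parsing framework treats parsing as inference over chart items: axioms introduce items, an inference rule derives new items from existing ones, and a goal item signals success. The following minimal sketch shows this for a plain context-free grammar in Chomsky normal form, which is a much simpler stand-in and not the RCG algorithm of the paper; in particular, it has no range boundary constraint propagation. The function name `cyk_deductive` and the grammar encoding are assumptions made for illustration. The returned item count corresponds to the kind of chart size measure mentioned in the abstract.

```python
def cyk_deductive(words, lexical, binary, start="S"):
    """Agenda-driven CYK recognizer for a CFG in Chomsky normal form.
    lexical: dict word -> set of nonterminals (rules A -> word)
    binary:  dict (B, C) -> set of nonterminals A (rules A -> B C)
    Items are (A, i, j): nonterminal A spans words[i:j].
    Returns (recognized, number of chart items)."""
    n = len(words)
    chart = set()
    # Axioms: one item per lexical rule that matches a word.
    agenda = [(A, i, i + 1) for i, w in enumerate(words) for A in lexical.get(w, ())]
    while agenda:
        item = agenda.pop()
        if item in chart:
            continue
        chart.add(item)
        X, i, j = item
        # Inference rule: combine the new item with adjacent items already in the chart.
        for (Y, k, l) in list(chart):
            if l == i:  # Y spans k..i, directly to the left of X
                for A in binary.get((Y, X), ()):
                    agenda.append((A, k, j))
            if k == j:  # Y spans j..l, directly to the right of X
                for A in binary.get((X, Y), ()):
                    agenda.append((A, i, l))
    # Goal item: the start symbol spanning the whole input.
    return (start, 0, n) in chart, len(chart)

# Hypothetical toy grammar.
LEXICAL = {"she": {"NP"}, "eats": {"V"}, "fish": {"NP"}}
BINARY = {("V", "NP"): {"VP"}, ("NP", "VP"): {"S"}}
```

Calling `cyk_deductive("she eats fish".split(), LEXICAL, BINARY)` derives three lexical items, one VP item, and the goal item, so the chart ends up with five items.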
In this paper, we present an open-source parsing environment (Tübingen Linguistic Parsing Architecture, TuLiPA) which uses Range Concatenation Grammar (RCG) as a pivot formalism, thus opening the way to the parsing of several mildly context-sensitive formalisms. This environment currently supports tree-based grammars (namely Tree-Adjoining Grammars (TAG) and Multi-Component Tree-Adjoining Grammars with Tree Tuples (TT-MCTAG)) and allows computation not only of syntactic structures, but also of the corresponding semantic representations. It is used for the development of a tree-based grammar for German.
Developing linguistic resources, in particular grammars, is known to be a complex task in itself, because of (among other things) redundancy and consistency issues. Furthermore, some languages can prove hard to describe because of specific characteristics, e.g. the free word order in German. In this context, we present (i) a framework for describing tree-based grammars, and (ii) an actual fragment of a core multicomponent tree-adjoining grammar with tree tuples (TT-MCTAG) for German developed using this framework. This framework combines a metagrammar compiler and a parser based on range concatenation grammar (RCG) to check the consistency and the correctness of the grammar, respectively. The German grammar being developed within this framework already deals with a wide range of scrambling and extraction phenomena.
This paper proposes a compositional semantics for lexicalized tree adjoining grammars (LTAG). Tree-local multicomponent derivations allow separation of the semantic contribution of a lexical item into one component contributing to the predicate argument structure and a second component contributing to scope semantics. Based on this idea a syntax-semantics interface is presented where the compositional semantics depends only on the derivation structure. It is shown that the derivation structure allows an appropriate amount of underspecification. This is illustrated by investigating underspecified representations for quantifier scope ambiguities and related phenomena such as adjunct scope and island constraints.
In this paper we propose a compositional semantics for lexicalized tree-adjoining grammar (LTAG). Tree-local multicomponent derivations allow separation of the semantic contribution of a lexical item into one component contributing to the predicate argument structure and a second component contributing to scope semantics. Based on this idea a syntax-semantics interface is presented where the compositional semantics depends only on the derivation structure. It is shown that the derivation structure (and indirectly the locality of derivations) allows an appropriate amount of underspecification. This is illustrated by investigating underspecified representations for quantifier scope ambiguities and related phenomena such as adjunct scope and island constraints.
The work presented here addresses the question of how to determine whether a grammar formalism is powerful enough to describe natural languages. The expressive power of a formalism can be characterized in terms of i) the string languages it generates (weak generative capacity (WGC)) or ii) the tree languages it generates (strong generative capacity (SGC)). The notion of WGC is not enough to determine whether a formalism is adequate for natural languages. We argue that even SGC is problematic since the set of trees that a grammar formalism for natural languages should be able to generate is difficult to determine. The concrete syntactic structures assumed for natural languages depend very much on theoretical stipulations, and empirical evidence for syntactic structures is rather hard to obtain. Therefore, for lexicalized formalisms, we propose to consider the ability to generate certain strings together with specific predicate argument dependencies as a criterion for adequacy for natural languages.
Multicomponent Tree Adjoining Grammar (MCTAG) is a formalism that has been shown to be useful for many natural language applications. The definition of MCTAG, however, is problematic since it refers to the process of the derivation itself: a simultaneity constraint must be respected concerning the way the members of the elementary tree sets are added. Looking only at the result of a derivation (i.e., the derived tree and the derivation tree), this simultaneity is no longer visible and therefore cannot be checked. That is, this way of characterizing MCTAG does not allow one to abstract away from the concrete order of derivation. Therefore, in this paper, we propose an alternative definition of MCTAG that characterizes the trees in the tree language of an MCTAG via the properties of the derivation trees the MCTAG licenses.