Linguistik
Statistical machine translation (SMT) should benefit from linguistic information to improve performance, but current state-of-the-art systems rely purely on data-driven models. There are several reasons why prior efforts to build linguistically annotated models have failed or not even been attempted. Firstly, the practical implementation often requires too much work to be cost-effective. Secondly, where ad-hoc implementations have been created, they impose constraints too strict to be of general use. Lastly, many linguistically motivated approaches are language-dependent, tackling peculiarities of certain languages that do not apply to others. This thesis successfully integrates linguistic information about part-of-speech tags, lemmas and phrase structure to improve MT quality. The major contributions of this thesis are: 1. We enhance the phrase-based model to incorporate linguistic information as additional factors in the word representation. The factored phrase-based model allows us to make use of different types of linguistic information in a systematic way within the predefined framework. We show that this model improves translation by as much as 0.9 BLEU for small German-English training corpora, and 0.2 BLEU for larger corpora. 2. We extend the factored model to the factored template model to focus on improving reordering. We show that by generalising translation with part-of-speech tags, we can improve performance by as much as 1.1 BLEU on a small French-English system. 3. Finally, we switch from the phrase-based model to a syntax-based model with the mixed syntax model. This allows us to transition from word-level approaches using factors to multiword linguistic information such as syntactic labels and shallow tags. The mixed syntax model uses source-language syntactic information to inform translation.
We show that the model is able to explain translation better, leading to a 0.8 BLEU improvement over the baseline hierarchical phrase-based model for a small German-English task. Moreover, because the model requires only labels on continuous source spans and does not depend on a tree structure, other types of syntactic information can be integrated into it. We experimented with a shallow parser and saw a gain of 0.5 BLEU on the same dataset. Training with more data, we improve translation by 0.6 BLEU (1.3 BLEU out-of-domain) over the hierarchical baseline. During the development of these three models, we discovered that attempting to rigidly model translation as a linguistic transfer process results in degraded performance. However, by combining the advantages of standard SMT models with linguistically motivated models, we are able to achieve better translation performance. Our work shows the importance of balancing the specificity of linguistic information with the robustness of simpler models.
This dissertation investigated the development of the complementiser that from the demonstrative pronoun in the Germanic languages; each chapter dealt with a different aspect. In the introduction, the terms ‘reanalysis’ and ‘analogy’ and their relevance for grammaticalisation were explained, and the issues addressed in the chapters were presented. The second chapter introduced the Germanic language family and the languages relevant to this investigation, namely Gothic, Old English, Old Icelandic, Old Saxon and Old High German. Previous assumptions about the diachrony of that were presented and discussed. One of these proposals, which mainly draws on evidence from West Germanic, involves the idea that the source construction contained two independent main clauses with a demonstrative pronoun (that) at the end of the first clause (cf. e.g. Paul 1962, § 248). In contrast to this, the Gothic evidence showed that the source construction of the reanalysis of þatei was not a proper paratactic construction (at least in Gothic) but already a complex construction which contained a complementiser (ei) in the appositional subordinate clause (cf. also e.g. Longobardi 1994 for the diachrony of þatei). This contradiction raised the question of whether the analysis of the Gothic that-complementiser also applies to the diachrony of that in West Germanic. This issue was taken up in the third chapter, which presented an overview of subordination and complementisers in Northwest Germanic. The aim was to show that the Northwest Germanic languages also have a subordinating particle which functions like the Gothic ei, namely þe (OE), er/es (OI), the (OHG, OS). As a result, the subordinating particle could be observed in relative and adverbial clauses in all Northwest Germanic languages. In complement clauses, which are most crucial for the argumentation, the subordinating particle is found in Old English and Old Icelandic but not in Old Saxon.
In Old High German, there are only combinations of the with a following pronoun, theih and theiz, in ‘Otfrids Evangelienbuch’ (see Wunder 1965). Consequently, the presence of a subordinating particle is confirmed in North and West Germanic. The fact that the patterns of subordination are quite similar in all Germanic languages suggested that a unitary analysis of the development of that in Germanic was appropriate. In chapter four, the similarities and differences between the Germanic languages with respect to the development of that were explained. It was argued that the preconditions of the reanalysis were the same, whereas the consequences of the reanalysis are realised differently in each language. The most important precondition was that the appositional source construction (explained in more detail below) was generally available in Germanic. Since the demonstrative pronoun at the end of the matrix clause and the subordinating particle of the subordinate clause were adjacent, phonological combination might have been crucial for the subsequent reanalysis to take place. After reanalysis, however, different changes can be observed in the different languages. For instance, it appears that during the Old English period the final syllable of the form þætte was deleted (see chapter 4 for references), whereas the final –ei is still present in the Gothic þatei, and completely absent in Old High German and Old Saxon. The source structure of the reanalysis was discussed in detail in a separate subsection. The appositional source construction, which was already assumed for the reanalysis of Gothic þatei, was compared with analyses of clitic left dislocation which propose that two constituents with the same theta-role derive from a Big DP (see e.g. Grewendorf 2009, Belletti 2005).
Based on the Big DP analysis of Grewendorf (2009), it was claimed that the appositional clause, introduced by the subordinating particle, is generated in the Spec of a DP, and adjoined to this DP on the surface. It was argued that this whole complement DP-node occurred in an extraposed position in OV-languages so that the verb, when it stays in situ, does not appear between the demonstrative pronoun and the subordinating particle. The structure in (1) illustrates the syntactic source structure which is assumed to apply to the development of the complementiser that in Germanic. ...