Parser evaluation across text types

A statistical parser trained on one treebank is usually tested on another portion of the same treebank, partly because testing requires a comparable annotation format. But the user of a parser may not be interested in parsing sentences from the same newspaper over and over, and may even want syntactic annotations for a slightly different text type. Gildea (2001), for instance, found that a parser trained on the WSJ portion of the Penn Treebank performs less well on the Brown corpus (the subset that is available in the PTB bracketing format) than a parser trained only on the Brown corpus, although the latter was trained on only half as many sentences as the former. Additionally, a parser trained on both the WSJ and Brown corpora performs less well on the Brown corpus than on the WSJ corpus. This leads to the following questions, which we address in this paper:
- Is there a difference in the usefulness of techniques for improving parser performance between the same-corpus and the different-corpus case?
- Are different types of parsers (rule-based and statistical) equally sensitive to corpus variation?
To answer these questions, we compared the quality of the parses produced by a hand-crafted constraint-based parser and by a statistical PCFG-based parser that was trained on a treebank of German newspaper text.
Metadata
Author:Yannick Versley
URN:urn:nbn:de:hebis:30-1111538
URL:http://www.versley.de/versley_tlt05.pdf
Document Type:Preprint
Language:English
Year of Completion:2005
Year of first Publication:2005
Publishing Institution:Universitätsbibliothek Johann Christian Senckenberg
Release Date:2008/11/04
Page Number:12
Note:
Published in: Proceedings of the Fourth Workshop on Treebanks and Linguistic Theories, TLT 2005, Barcelona, Spain, pp. 209-220
Source:Working paper from the Fourth Workshop on Treebanks and Linguistic Theories (TLT 2005) ; http://www.versley.de/versley_tlt05.pdf
HeBIS-PPN:207012296
Institutes:No department specified / External
Dewey Decimal Classification:4 Language / 40 Language / 400 Language
Collections:Linguistics
Linguistics Classification:Computational linguistics
Licence (German):German copyright law (Deutsches Urheberrecht)