Parser evaluation across text types

A statistical parser trained on one treebank is usually tested on another portion of the same treebank, partly because testing requires a comparable annotation format. But the user of a parser may not be interested in parsing sentences from the same newspaper over and over, and may even want syntactic annotations for a slightly different text type. Gildea (2001), for instance, found that a parser trained on the WSJ portion of the Penn Treebank performs worse on the Brown corpus (the subset that is available in the PTB bracketing format) than a parser trained only on the Brown corpus, although the latter was trained on only half as many sentences as the former. Additionally, a parser trained on both the WSJ and Brown corpora performs worse on the Brown corpus than on the WSJ corpus.

This leads to the following questions, which we address in this paper:

- Do techniques used to improve parser performance differ in usefulness between the same-corpus and the different-corpus case?
- Are different types of parsers (rule-based and statistical) equally sensitive to corpus variation?

To answer these questions, we compared the parse quality of a hand-crafted constraint-based parser with that of a statistical PCFG-based parser trained on a treebank of German newspaper text.

Metadata
Author: Yannick Versley
URN: urn:nbn:de:hebis:30-1111538
Document Type: Article
Language: English
Date of Publication (online): 2008/11/04
Year of first Publication: 2005
Publishing Institution: Univ.-Bibliothek Frankfurt am Main
Release Date: 2008/11/04
Source: Working paper from the Fourth Workshop on Treebanks and Linguistic Theories (TLT 2005); http://www.versley.de/versley_tlt05.pdf
HeBIS PPN: 207012296
Dewey Decimal Classification: 400 Language
Collections: Linguistics
Linguistic-Classification: Computational linguistics
Licence (German): Veröffentlichungsvertrag für Publikationen
