570 Biowissenschaften; Biologie
Refine
Document Type
- Article (4)
- Conference Proceeding (3)
Has Fulltext
- yes (7)
Is part of the Bibliography
- no (7)
Keywords
- BIOfid (2)
- Biodiversity (2)
- Universitätsbibliothek Johann Christian Senckenberg (2)
- Annotation (1)
- Biologie (1)
- Botanik (1)
- Dokumentlieferung (1)
- Inter-annotator agreement (1)
- Literaturdatenbank (1)
- Named entity recognition (1)
Institute
- Universitätsbibliothek (7) (remove)
Biodiversity information is contained in countless digitized and unprocessed scholarly texts. Although automated extraction of these data has been gaining momentum for years, there are still innumerable text sources that are poorly accessible and require a more advanced range of methods to extract relevant information. To improve the access to semantic biodiversity information, we have launched the BIOfid project (www.biofid.de) and have developed a portal to access the semantics of German language biodiversity texts, mainly from the 19th and 20th century. However, to make such a portal work, a couple of methods had to be developed or adapted first. In particular, text-technological information extraction methods were needed, which extract the required information from the texts. Such methods draw on machine learning techniques, which in turn are trained by learning data. To this end, among others, we gathered the BIOfid text corpus, which is a cooperatively built resource, developed by biologists, text technologists, and linguists. A special feature of BIOfid is its multiple annotation approach, which takes into account both general and biology-specific classifications, and by this means goes beyond previous, typically taxon- or ontology-driven proper name detection. We describe the design decisions and the genuine Annotation Hub Framework underlying the BIOfid annotations and present agreement results. The tools used to create the annotations are introduced, and the use of the data in the semantic portal is described. Finally, some general lessons, in particular with multiple annotation projects, are drawn.
The Specialized Information Service Biodiversity Research (BIOfid) has been launched to mobilize valuable biological data from printed literature hidden in German libraries for over the past 250 years. In this project, we annotate German texts converted by OCR from historical scientific literature on the biodiversity of plants, birds, moths and butterflies. Our work enables the automatic extraction of biological information previously buried in the mass of papers and volumes. For this purpose, we generated training data for the tasks of Named Entity Recognition (NER) and Taxa Recognition (TR) in biological documents. We use this data to train a number of leading machine learning tools and create a gold standard for TR in biodiversity literature. More specifically, we perform a practical analysis of our newly generated BIOfid dataset through various downstream-task evaluations and establish a new state of the art for TR with 80.23% F-score. In this sense, our paper lays the foundations for future work in the field of information extraction in biology texts.
BIOfid is a specialized information service currently being developed to mobilize biodiversity data dormant in printed historical and modern literature and to offer a platform for open access journals on the science of biodiversity. Our team of librarians, computer scientists and biologists produce high-quality text digitizations, develop new text-mining tools and generate detailed ontologies enabling semantic text analysis and semantic search by means of user-specific queries. In a pilot project we focus on German publications on the distribution and ecology of vascular plants, birds, moths and butterflies extending back to the Linnaeus period about 250 years ago. The three organism groups have been selected according to current demands of the relevant research community in Germany. The text corpus defined for this purpose comprises over 400 volumes with more than 100,000 pages to be digitized and will be complemented by journals from other digitization projects, copyright-free and project-related literature. With TextImager (Natural Language Processing & Text Visualization) and TextAnnotator (Discourse Semantic Annotation) we have already extended and launched tools that focus on the text-analytical section of our project. Furthermore, taxonomic and anatomical ontologies elaborated by us for the taxa prioritized by the project’s target group - German institutions and scientists active in biodiversity research - are constantly improved and expanded to maximize scientific data output. Our poster describes the general workflow of our project ranging from literature acquisition via software development, to data availability on the BIOfid web portal (http://biofid.de/), and the implementation into existing platforms which serve to promote global accessibility of biodiversity data.
Biodiversity research heavily relies on recent and older literature, and the data contained therein. Despite great effort, large parts of the literature and the data it holds are still not available in appropriate formats needed for efficient compilation and analysis. As a part of the current funding strategy of the German Research Council (Deutsche Forschungsgemeinschaft, DFG), and resulting from an extensive dialogue with the scientific community in Germany, a "Specialised Information Service" (Fachinformationsdienst, FID) for Biodiversity Research will be established with the objective of making further segments of literature about biodiversity available in up-to-date formats. This project, starting 2017, is conducted by the University Library Johann Christian Senckenberg (Frankfurt/Main, Germany) together with the Senckenberg Gesellschaft für Naturforschung and the Text Technology Lab of the Goethe University (Frankfurt/Main).
The new Specialised Information Service for Biodiversity Research (FID Biodiversitätsforschung) comprises four core elements: (A) A text mining approach which encompasses advanced text technologies and a large body of 20th century literature; (B) the digitisation of selected German biodiversity literature; (C) a platform für Open Access journals; and (D) Acquisition of specialised print literature.
In order to promote the accessibility of biodiversity data in historic and contemporary literature, we introduce a new interdisciplinary project called BIOfid (FID=Fachinformationsdienst, a service for providing specialized information). The project aims at a mobilization of data available in print only by combining digitization of scientific biodiversity literature with the development of innovative text mining tools for complex, eventually semantic searches throughout the complete text corpus. A major prerequisite for the development of such search tools is the provision of sophisticated anatomy ontologies on the one hand, and of complete lists of species names (currently considered valid as well as all synonyms) at a global scale on the other hand. In the initial stage, we chose examples from German publications of the past 250 years dealing with the geographic distribution and ecology of vascular plants (Tracheophyta), birds (Aves), as well as moths and butterflies (Lepidoptera) in Germany. These taxa have been prioritized according to current demands of German research groups (about 50 sites) aiming at analyses and modeling of distribution patterns and their changes through time. In the long term, we aim at providing data and open source software applicable for any taxon and geographic region. For this purpose, a platform for open access journals for long-term availability of professional e-journals will be established. All generated data will also be made accessible through GFBio (German Federation for Biological Data). BIOfid is supported by the LIS-Scientific Library Services and Information Systems program of the German Research Foundation (DFG).
Die Virtuelle Fachbibliothek Biologie (www.vifabio.de) bündelt die Recherche nach wissenschaftlich hochwertigen Quellen aus Bibliotheken, Aufsatzbanken und Internet. Zentrales Element von vifabio ist dabei der Virtuelle Katalog: Mit einer Suchanfrage werden mehrere Kataloge zoologisch bzw. ornithologisch relevanter Bibliotheken, Zeitschriftendatenbanken wie Zoological Record (Nationallizenz 1864 bis 2007 für Nutzer in akademischen Einrichtungen), BioLIS und der Aufsatzkatalog OLC, sowie Landesbibliographien und der Internetquellen-Führer von vifabio durchsucht. Verlinkungen zur Elektronischen Zeitschriftenbibliothek Regensburg (EZB), zum Lieferdienst subito sowie zum Karlsruher Virtuellen Katalog (KVK) erleichtern den Zugang zum Volltext oder zum gedruckten Exemplar. Weitere Module von vifabio wie der Internetquellen-Führer bzw. der Datenbank-Führer eröffnen zusätzliche Rechercheoptionen.
Die Datenbank BioLIS wird durch die Universitätsbibliothek Johann Christian Senckenberg (Frankfurt/M.) kostenfrei online zur Verfügung gestellt. Sie weist deutsche biologische Zeitschriftenliteratur aus dem Zeit¬raum 1970 bis 1996 nach – damit ist BioLIS eine wesentliche Ergänzung zu der Datenbank „Biological Abstracts“. Die bibliografischen Angaben zu den nachgewiesenen Aufsätzen werden durch umfassende Schlagwörter und Namen behandelter Organismen ergänzt, so dass Spezialrecherchen insbesondere nach Literatur über bestimmte Organismen möglich sind.