BIOfid is a specialized information service currently being developed to mobilize biodiversity data dormant in printed historical and modern literature and to offer a platform for open-access journals on the science of biodiversity. Our team of librarians, computer scientists and biologists produces high-quality text digitizations, develops new text-mining tools and generates detailed ontologies that enable semantic text analysis and semantic search by means of user-specific queries. In a pilot project we focus on German publications on the distribution and ecology of vascular plants, birds, moths and butterflies, extending back to the Linnaean period about 250 years ago. The three organism groups were selected according to current demands of the relevant research community in Germany. The text corpus defined for this purpose comprises over 400 volumes with more than 100,000 pages to be digitized and will be complemented by journals from other digitization projects as well as copyright-free and project-related literature. With TextImager (Natural Language Processing & Text Visualization) and TextAnnotator (Discourse Semantic Annotation) we have already extended and launched tools covering the text-analytical part of our project. Furthermore, the taxonomic and anatomical ontologies we elaborate for the taxa prioritized by the project's target group (German institutions and scientists active in biodiversity research) are constantly improved and expanded to maximize scientific data output. Our poster describes the general workflow of our project, ranging from literature acquisition via software development to data availability on the BIOfid web portal (http://biofid.de/) and the integration into existing platforms that promote global accessibility of biodiversity data.
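The semantic search described above depends on ontologies that connect current and historical taxon names. The following is a minimal sketch of that idea, ontology-backed query expansion; the toy ontology and the expand_query helper are illustrative assumptions, not BIOfid's actual data model.

```python
# A toy ontology: each taxon maps to (synonyms/historical names, narrower taxa).
# Hypothetical entries for illustration; BIOfid's real taxonomies are far richer.
ONTOLOGY = {
    "Vanessa atalanta": ({"Pyrameis atalanta"}, set()),
    "Nymphalidae": (set(), {"Vanessa atalanta"}),
}

def expand_query(term):
    """Expand a taxon query with its synonyms and all narrower taxa."""
    synonyms, children = ONTOLOGY.get(term, (set(), set()))
    expanded = {term} | synonyms
    for child in children:
        expanded |= expand_query(child)  # walk down the taxonomy
    return expanded

print(expand_query("Nymphalidae"))
# -> {'Nymphalidae', 'Vanessa atalanta', 'Pyrameis atalanta'}
```

Expanding a family-level query to its species and their historical synonyms is what lets a modern query match 19th-century texts that use outdated nomenclature.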
The scientific innovation process encompasses the steps from problem definition through the development and evaluation of innovative solutions to their successful exploitation. The challenges posed by this process can be met by a powerful and flexible next-generation e-Science infrastructure that exploits leading-edge information and knowledge technologies and enables comprehensive, intelligent support of the process. This paper describes our vision of a knowledge-based e-Science infrastructure, based on the results of an in-depth study of researchers' requirements. Furthermore, it introduces the Fraunhofer e-Science Cockpit as a first implementation of this vision.
The World Wide Web is the largest information repository available today. However, this information is highly volatile, and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving are based on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limitations and to provide flexible, adaptive and intelligent content acquisition, relying on social media to create topical Web archives. In this article, we focus on ARCOMEM's crawling architecture. We introduce the overall architecture and describe its modules, such as the online analysis module, which computes a priority for the Web pages to be crawled, and the Application-Aware Helper, which takes the type of Web sites and applications into account to extract structure from crawled content. We also describe a large-scale distributed crawler that has been developed, as well as the modifications we implemented to adapt Heritrix, an open-source crawler, to the needs of the project. Our experimental results from real crawls show that ARCOMEM's crawling architecture is effective in acquiring focused information about a topic and leveraging information from social media.
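To make the priority-driven crawling concrete, here is a minimal sketch of a frontier that always fetches the highest-priority URL next. The score_page heuristic and the injected fetch_links function are hypothetical stand-ins for ARCOMEM's online analysis module and fetching infrastructure, not the project's actual interfaces.

```python
import heapq

def score_page(url, topic_terms):
    """Toy priority: fraction of topic terms that occur in the URL."""
    url = url.lower()
    return sum(t in url for t in topic_terms) / max(len(topic_terms), 1)

def crawl(seeds, topic_terms, fetch_links, budget=100):
    """Fetch pages most-relevant-first; fetch_links(url) returns outlinks."""
    frontier = [(-score_page(u, topic_terms), u) for u in seeds]
    heapq.heapify(frontier)
    seen = set(seeds)
    archived = []
    while frontier and budget > 0:
        _, url = heapq.heappop(frontier)  # highest-priority page first
        archived.append(url)
        budget -= 1
        for link in fetch_links(url):
            if link not in seen:
                seen.add(link)
                heapq.heappush(frontier, (-score_page(link, topic_terms), link))
    return archived
```

The essential difference from a breadth-first archival crawler such as stock Heritrix is the ordering: a priority queue driven by an online relevance estimate keeps the crawl focused on the topic instead of exhausting a domain.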
Herausforderungen für die nationale, regionale und thematische Webarchivierung und deren Nutzung
(2015)
The World Wide Web is established as a worldwide medium for information and communication. New technologies regularly extend the ways it can be used and allow even inexperienced users to publish content or take part in discussions. The Web is therefore also regarded as a good documentation of today's society. Owing to its dynamic nature, Web content is ephemeral, and new technologies and forms of use regularly pose new challenges for collecting Web content for Web archiving. While static pages dominated in the early days of Web archiving, today one frequently deals with dynamically generated content that integrates information from various sources. Besides classical domain-oriented Web harvesting, a growing interest from various research disciplines in thematic Web collections and their use and exploration can be observed. This article presents some challenges and proposed solutions for collecting thematic and dynamic content from the Web and social media. Furthermore, current problems of scholarly use are discussed, and it is shown how Web archives and other temporal collections can be searched more effectively.
The constantly growing amount of Web content and the success of the Social Web lead to increasing needs for Web archiving. These needs go beyond the pure preservation of Web pages. Web archives are turning into "community memories" that aim at building a better understanding of the public view on, e.g., celebrities, court decisions and other events. Due to the size of the Web, the traditional "collect-all" strategy is in many cases not the best method to build Web archives. In this paper, we present the ARCOMEM (From Collect-All Archives to Community Memories) architecture and implementation, which uses semantic information such as entities, topics and events, complemented with information from the Social Web, to guide a novel Web crawler. The resulting archives are automatically enriched with semantic meta-information to ease access and allow retrieval based on conditions that involve high-level concepts.
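As an illustration of the semantic guidance, the sketch below scores a fetched page by its overlap with a set of target entities. The extract_entities callable stands in for a full NLP pipeline and is an assumption for illustration, not ARCOMEM's actual interface.

```python
def semantic_score(text, target_entities, extract_entities):
    """Overlap between entities found on a page and the crawl specification."""
    found = extract_entities(text)  # assumed to return a set of entity strings
    if not found:
        return 0.0
    return len(found & target_entities) / len(found)

# Pages scoring above a threshold are kept and stored together with the
# extracted entities, so the resulting archive can later be queried by
# high-level concepts rather than by URL or keyword alone.
```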
Current research on theory and practice of digital libraries: best papers from TPDL 2019 & 2020
(2022)
This volume presents a special issue of selected papers from the 2019 and 2020 editions of the International Conference on Theory and Practice of Digital Libraries (TPDL). They cover different research areas within digital libraries, from ontologies and linked data to quality in Web archives and topic detection. We first provide a brief overview of both TPDL editions and then introduce the selected papers.
The web and the social web play an increasingly important role as information sources for Members of Parliament and their assistants, journalists, political analysts and researchers. They provide crucial background information, such as reactions to political events and comments made by the general public. The case study presented in this paper is driven by two European parliaments (the Greek and the Austrian parliament) and targets the effective exploration of political web archives. In this paper, we describe the semantic technologies deployed to ease the exploration of archived web and social web content and present evaluation results.
High-impact events, political changes and new technologies are reflected in our language and lead to the constant evolution of terms, expressions and names. Not knowing the names used in the past to refer to a named entity can severely decrease the performance of many computational linguistics algorithms. We propose NEER, an unsupervised method for named entity evolution recognition that is independent of external knowledge sources. We find time periods with a high likelihood of evolution. By analyzing only these time periods with a sliding-window co-occurrence method, we capture evolving terms in the same context. We thus avoid comparing terms from widely different periods in time and overcome a severe limitation of existing methods for named entity evolution, as shown by a high recall of 90% on the New York Times corpus. We compare several relatedness measures for filtering to improve precision. Furthermore, using machine learning with minimal supervision improves precision to 94%.
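The following sketch illustrates the sliding-window co-occurrence step on a single time window: terms that frequently appear in the entity's context become candidates for names that co-refer with it. Tokenized input, the window size and the count threshold are assumptions for illustration, not NEER's exact parameters.

```python
from collections import Counter

def cooccurrence_candidates(window_docs, entity, context=5, min_count=3):
    """window_docs: token lists from one time window with suspected evolution."""
    counts = Counter()
    for tokens in window_docs:
        for i, tok in enumerate(tokens):
            if tok == entity:
                # collect terms within `context` positions of the entity
                neighbours = tokens[max(0, i - context):i + context + 1]
                counts.update(t for t in neighbours if t != entity)
    # frequent context terms are candidates for co-referring (evolving) names
    return {t for t, c in counts.items() if c >= min_count}
```

Restricting this analysis to periods flagged as likely change points is what keeps the method from comparing contexts drawn from widely separated eras.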
We present a method for detecting word sense changes by utilizing automatically induced word senses. Our method works on the level of individual senses and allows a word, for example, to have one stable sense and then acquire a novel sense that later undergoes change. Senses are grouped based on polysemy to find linguistic concepts, and we can detect broadening and narrowing as well as novel (polysemous and homonymic) senses. We evaluate on a test set and report recall as well as estimates of the time between expected and detected change.
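A minimal sketch of the sense-level view: given per-period frequencies of automatically induced senses of one word, a sense that first gains support in a later period is flagged as novel. The data layout and threshold are illustrative assumptions; the sense induction step itself is outside this sketch.

```python
def novel_senses(sense_counts, min_support=5):
    """sense_counts[period][sense] -> frequency of an induced sense.

    Returns (sense, period) pairs where a sense first reaches support
    after the earliest period, i.e. candidates for novel senses."""
    periods = sorted(sense_counts)
    seen, novel = set(), []
    for period in periods:
        for sense, count in sense_counts[period].items():
            if count >= min_support and sense not in seen:
                seen.add(sense)
                if period != periods[0]:  # absent earlier, present now
                    novel.append((sense, period))
    return novel

counts = {1990: {"mouse:animal": 40},
          2000: {"mouse:animal": 35, "mouse:device": 12}}
print(novel_senses(counts))  # -> [('mouse:device', 2000)]
```

Working per sense rather than per word is what allows the stable animal sense to coexist with the newly acquired device sense in the example above.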
The correspondence between the terminology used for querying and the terminology used in the content objects to be retrieved is a crucial prerequisite for effective retrieval technology. However, as terminology evolves over time, a growing gap opens up between older documents in (long-term) archives and the active language used to query such archives. Technologies for detecting and systematically handling terminology evolution are therefore required to ensure the "semantic" accessibility of (Web) archive content in the long run. As a starting point for dealing with terminology evolution, this paper formalizes the problem and discusses issues, first ideas and relevant technologies.
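A small sketch of why handling terminology evolution matters for retrieval: expanding a query with known historical variants keeps older archive content findable. The variant table here is a hand-made assumption; deriving such mappings automatically is the detection problem the paper formalizes.

```python
# Hand-made variant table (assumption); a real system would induce this
# mapping from the archive itself.
HISTORICAL_VARIANTS = {
    "saint petersburg": ["Petrograd", "Leningrad"],  # the city's former names
}

def expand_over_time(query_terms):
    """Add known historical variants so older archive content stays findable."""
    expanded = []
    for term in query_terms:
        expanded.append(term)
        expanded.extend(HISTORICAL_VARIANTS.get(term.lower(), []))
    return expanded

print(expand_over_time(["Saint Petersburg", "flood"]))
# -> ['Saint Petersburg', 'Petrograd', 'Leningrad', 'flood']
```

Without such expansion, a present-day query would silently miss every archived document written while the older names were in active use.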