Universitätsbibliothek
Refine
Document Type
- Article (8)
- Conference Proceeding (3)
- Part of a Book (2)
Has Fulltext
- yes (13) (remove)
Is part of the Bibliography
- no (13)
Keywords
- web archiving (3)
- Digital libraries (2)
- Special issue (2)
- TPDL conference (2)
- Biodiversity (1)
- Culturomics (1)
- Digital humanities (1)
- Information Extraction (1)
- Knowledge Networking (1)
- Knowledgebased analysis (1)
- Machine Learning (1)
- Maschinelles Lernen (1)
- Named Entity Changes (1)
- Namensänderungen (1)
- Natural language processing (1)
- Ontologies (1)
- Semantics (1)
- Specialized Information Service (1)
- Statistical analysis (1)
- Temporal Named Entity Evolution (1)
- Temporal text analysis (1)
- Terminology Evolution (1)
- Text mining (1)
- Web Archives (1)
- Zeitliche Namensevolution (1)
- architecture (1)
- content acquisition (1)
- crawling architecture (1)
- e-Science (1)
- eInfrastructure (1)
- eScience (1)
- enrichment (1)
- entity and event extraction (1)
- parliament libraries (1)
- semantic content analysis (1)
- social Web (1)
- text analysis (1)
- topic detection (1)
- web crawler (1)
Institute
The scientific innovation process embraces the steps from problem definition through the development and evaluation of innovative solutions to their successful exploitation. The challenges imposed by this process can be answered by the creation of a powerful and flexible next-generation e-Science infrastructure, which exploits leading edge information and knowledge technologies and enables a comprehensive and intelligent means of supporting this process. This paper describes our vision of a Knowledge-based eScience infrastructure, which is based on the results of an in-depth study of the researchers requirements. Furthermore, it introduces the Fraunhofer e-Science Cockpit as a first implementation of our vision.
The web and the social web play an increasingly important role as an information source for Members of Parliament and their assistants, journalists, political analysts and researchers. It provides important and crucial background information, like reactions to political events and comments made by the general public. The case study presented in this paper is driven by two European parliaments (the Greek and the Austrian parliament) and targets an effective exploration of political web archives. In this paper, we describe semantic technologies deployed to ease the exploration of the archived web and social web content and present evaluation results.
The World Wide Web is the largest information repository available today. However, this information is very volatile and Web archiving is essential to preserve it for the future. Existing approaches to Web archiving are based on simple definitions of the scope of Web pages to crawl and are limited to basic interactions with Web servers. The aim of the ARCOMEM project is to overcome these limitations and to provide flexible, adaptive and intelligent content acquisition, relying on social media to create topical Web archives. In this article, we focus on ARCOMEM’s crawling architecture. We introduce the overall architecture and we describe its modules, such as the online analysis module, which computes a priority for the Web pages to be crawled, and the Application-Aware Helper which takes into account the type of Web sites and applications to extract structure from crawled content. We also describe a large-scale distributed crawler that has been developed, as well as the modifications we have implemented to adapt Heritrix, an open source crawler, to the needs of the project. Our experimental results from real crawls show that ARCOMEM’s crawling architecture is effective in acquiring focused information about a topic and leveraging the information from social media.
Current research on theory and practice of digital libraries: best papers from TPDL 2019 & 2020
(2022)
This volume presents a special issue on selected papers from the 2019 & 2020 editions of the International Conference on Theory and Practice of Digital Libraries (TPDL). They cover different research areas within Digital Libraries, from Ontology and Linked Data to quality in Web Archives and Topic Detection. We first provide a brief overview of both TPDL editions, and we introduce the selected papers.
Current research on theory and practice of digital libraries: best papers from TPDL 2019 & 2020
(2022)
This volume presents a special issue on selected papers from the 2019 & 2020 editions of the International Conference on Theory and Practice of Digital Libraries (TPDL). They cover different research areas within Digital Libraries, from Ontology and Linked Data to quality in Web Archives and Topic Detection. We first provide a brief overview of both TPDL editions, and we introduce the selected papers.
Web archives created by the Internet Archive (IA) (https://archive.org), national libraries and other archiving services contain large amounts of information collected for a time period of over twenty years. These archives constitute a valuable source for research in many disciplines, including the digital humanities and the historical sciences by offering a unique possibility to look into past events and their representation on the Web.
Most Web archive services aim to capture the entire Web (IA) or national top-level domains and are therefore broad in their scope, diverse regarding the topics they contain and the time intervals they cover. Due to the large size and the broad scope it is difficult for interested researchers to locate relevant information in the archives as search facilities are very limited. Many users are more interested in studying smaller and topically coherent event-centric collections of documents contained in a Web archive [1,2]. Such collections can reflect specific events such as elections, or natural disasters, e.g. the Fukushima nuclear disaster (2011) or the German federal elections.
We present a method for detecting word sense changes by utilizing automatically induced word senses. Our method works on the level of individual senses and allows a word to have e.g. one stable sense and then add a novel sense that later experiences change. Senses are grouped based on polysemy to find linguistic concepts and we can find broadening and narrowing as well as novel (polysemous and homonymic) senses. We evaluate on a testset, present recall and estimates of the time between expected and found change.
Herausforderungen für die nationale, regionale und thematische Webarchivierung und deren Nutzung
(2015)
Das World Wide Web ist als weltweites Informations- und Kommunikationsmedium etabliert. Neue Technologien erweitern regelmäßig die Nutzungsformen und erlauben es auch unerfahrenen Nutzern, Inhalte zu publizieren oder an Diskussionen teilzunehmen. Daher wird das Web auch als eine gute Dokumentation der heutigen Gesellschaft angesehen. Aufgrund seiner Dynamik sind die Inhalte des Web vergänglich und neue Technologien und Nutzungsformen stellen regelmäßig neue Herausforderungen an die Sammlung von Webinhalten für die Webarchivierung. Dominierten in den Anfangstagen der Webarchivierung noch statische Seiten, so hat man es heute häufig mit dynamisch generierten Inhalten zu tun, die Informationen aus verschiedenen Quellen integrieren. Neben dem klassischen domainorientieren Webharvesting kann auch ein steigendes Interesse aus verschiedenen Forschungsdisziplinen an thematischen Webkollektionen und deren Nutzung und Exploration beobachtet werden. In diesem Artikel werden einige Herausforderungen und Lösungsansätze für die Sammlung von thematischen und dynamischen Inhalten aus dem Web und den sozialen Medien vorgestellt. Des Weiteren werden aktuelle Probleme der wissenschaftlichen Nutzung diskutiert und gezeigt, wie Webarchive und andere temporale Kollektionen besser durchsucht werden können.
High impact events, political changes and new technologies are reflected in our language and lead to constant evolution of terms, expressions and names. Not knowing about names used in the past for referring to a named entity can severely decrease the performance of many computational linguistic algorithms. We propose NEER, an unsupervised method for named entity evolution recognition independent of external knowledge sources. We find time periods with high likelihood of evolution. By analyzing only these time periods using a sliding window co-occurrence method we capture evolving terms in the same context. We thus avoid comparing terms from widely different periods in time and overcome a severe limitation of existing methods for named entity evolution, as shown by the high recall of 90% on the New York Times corpus. We compare several relatedness measures for filtering to improve precision. Furthermore, using machine learning with minimal supervision improves precision to 94%.
The correspondence between the terminology used for querying and the one used in content objects to be retrieved, is a crucial prerequisite for effective retrieval technology. However, as terminology is evolving over time, a growing gap opens up between older documents in (long-term) archives and the active language used for querying such archives. Thus, technologies for detecting and systematically handling terminology evolution are required to ensure "semantic" accessibility of (Web) archive content on the long run. As a starting point for dealing with terminology evolution this paper formalizes the problem and discusses issues, first ideas and relevant technologies.