- On the limit value of compactness of some graph classes (2018)
- In this paper, we study the limit of compactness, a graph index originally introduced for measuring structural characteristics of hypermedia. Applying compactness to large-scale small-world graphs, Mehler (2008) observed its limit behaviour to be equal to 1. The striking question raised by this finding was whether this limit behaviour resulted from the specifics of small-world graphs or was simply an artefact. In this paper, we determine the necessary and sufficient conditions for any sequence of connected graphs to result in a limit value of CB = 1, which can be generalized, with some additional considerations, to the case of disconnected graph classes (Theorem 3). This result applies to many well-known classes of connected graphs; here, we illustrate it by considering four examples. In fact, our proof-theoretical approach allows the limit value of compactness to be obtained quickly for many graph classes, sparing computational costs.
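The compactness index CB discussed above can be sketched in a few lines. This is a minimal illustration assuming the Botafogo-style definition based on pairwise shortest-path distances, with unreachable pairs penalized by a conversion constant K (K = n is a common, but not the only, choice); function names are ours, not the paper's:

```python
from collections import deque

def compactness(nodes, edges):
    """Compactness CB of an undirected graph: 1 for a complete graph,
    0 for a totally disconnected one. Distances come from BFS;
    unreachable pairs count as K = n (an assumed convention)."""
    n = len(nodes)
    adj = {v: set() for v in nodes}
    for u, v in edges:
        adj[u].add(v)
        adj[v].add(u)
    K = n  # penalty distance for unreachable pairs
    total = 0
    for s in nodes:
        dist = {s: 0}
        q = deque([s])
        while q:
            u = q.popleft()
            for w in adj[u]:
                if w not in dist:
                    dist[w] = dist[u] + 1
                    q.append(w)
        total += sum(dist.get(t, K) for t in nodes if t != s)
    max_sum = (n * n - n) * K   # all pairs unreachable
    min_sum = n * n - n         # complete graph: all distances 1
    return (max_sum - total) / (max_sum - min_sum)

# A complete graph on 4 nodes reaches the limit value 1.
print(compactness([0, 1, 2, 3],
                  [(0, 1), (0, 2), (0, 3), (1, 2), (1, 3), (2, 3)]))  # → 1.0
```

For sparser graphs the value drops below 1; the paper's contribution is characterizing which graph sequences nevertheless drive it to 1 in the limit.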
- Setup of BIOfid, a new specialised information service for biodiversity research (2017)
- In order to promote the accessibility of biodiversity data in historic and contemporary literature, we introduce a new interdisciplinary project called BIOfid (FID = Fachinformationsdienst, a service for providing specialized information). The project aims at mobilizing data available only in print by combining the digitization of scientific biodiversity literature with the development of innovative text-mining tools for complex, and eventually semantic, searches throughout the complete text corpus. A major prerequisite for the development of such search tools is the provision of sophisticated anatomy ontologies on the one hand, and of complete lists of species names (currently considered valid as well as all synonyms) at a global scale on the other. In the initial stage, we chose examples from German publications of the past 250 years dealing with the geographic distribution and ecology of vascular plants (Tracheophyta), birds (Aves), as well as moths and butterflies (Lepidoptera) in Germany. These taxa have been prioritized according to the current demands of German research groups (about 50 sites) aiming at analyses and modeling of distribution patterns and their changes through time. In the long term, we aim at providing data and open-source software applicable to any taxon and geographic region. For this purpose, a platform for open-access journals ensuring the long-term availability of professional e-journals will be established. All generated data will also be made accessible through GFBio (German Federation for Biological Data). BIOfid is supported by the LIS (Scientific Library Services and Information Systems) program of the German Research Foundation (DFG).
- BIOfid, a platform to enhance accessibility of biodiversity data (2018)
- With the ongoing loss of global biodiversity, long-term records of species distribution patterns are becoming increasingly important for investigating the causes and consequences of their change. The digitization of scientific literature, both modern and historical, has therefore been attracting growing attention in recent years. To meet this growing demand, the Specialised Information Service for Biodiversity Research (BIOfid) was launched in 2017 with the aim of increasing the availability and accessibility of biodiversity information. Closely tied to the research community, the interdisciplinary BIOfid team digitizes data sources of biodiversity-related research and provides a modern and professional infrastructure for hosting and sharing them. As a pilot project, German publications on the distribution and ecology of vascular plants, birds, moths and butterflies covering the past 250 years are prioritized. Large parts of the text corpus, defined in accordance with the needs of the relevant German research community, have already been transferred to a machine-readable format and will be publicly accessible soon. Software tools for text mining, semantic annotation and analysis that reflect current trends in machine learning are being developed to maximize bioscientific data output through user-specific queries that can be created via the BIOfid web portal (https://www.biofid.de/). To boost knowledge discovery, specific ontologies focusing on morphological traits and taxonomy are being prepared and will be continuously extended to keep up with an ever-expanding volume of literature sources.
- A comparison of four character-level string-to-string translation models for (OCR) spelling error correction (2016)
- We consider the isolated spelling error correction problem as a specific subproblem of the more general string-to-string translation problem. In this context, we investigate four general string-to-string transformation models that have been suggested in recent years and apply them within the spelling error correction paradigm. In particular, we investigate how a simple ‘k-best decoding plus dictionary lookup’ strategy performs in this context and find that such an approach can significantly outperform baselines for spelling error correction such as edit distance, weighted edit distance, and the noisy-channel model of Brill and Moore. We also consider elementary combination techniques for our models such as language-model-weighted majority voting and center string combination. Finally, we consider real-world OCR post-correction for a dataset sampled from medieval Latin texts.
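The edit-distance baseline and the dictionary-lookup idea mentioned above can be sketched as follows. This is an illustrative simplification, not the paper's translation models: here the dictionary itself is ranked by plain Levenshtein distance, whereas the paper's models emit their own k-best hypotheses which are then filtered against a dictionary:

```python
def levenshtein(a, b):
    """Standard dynamic-programming edit distance (the baseline)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,               # deletion
                           cur[j - 1] + 1,            # insertion
                           prev[j - 1] + (ca != cb))) # substitution
        prev = cur
    return prev[-1]

def correct(token, dictionary, k=3):
    """Return the k dictionary words closest to the (mis)spelled token.

    A k-best decoder would generate candidate strings itself; the
    dictionary lookup then keeps only well-formed words.
    """
    ranked = sorted(dictionary, key=lambda w: levenshtein(token, w))
    return ranked[:k]

print(correct("graphe", ["graph", "grape", "graphs", "glyph"], k=2))
# → ['graph', 'grape']
```

Weighted edit distance replaces the unit costs above with learned, character-dependent costs; the noisy-channel model additionally weights candidates by a language model.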
- Computational Humanities - bridging the gap between Computer Science and Digital Humanities : Report from Dagstuhl Seminar 14301 [July 20-25, 2014] (2014)
- Research in the field of Digital Humanities, also known as Humanities Computing, has seen a steady increase in recent years. Situated at the intersection of computer science and the humanities, present efforts focus on making resources such as texts, images, musical pieces and other semiotic artifacts digitally available, searchable and analysable. To this end, computational tools enabling textual search, visual analytics, data mining, statistics and natural language processing are harnessed to support the humanities researcher. The processing of large data sets with appropriate software opens up novel and fruitful approaches to questions in the traditional humanities. This report summarizes Dagstuhl Seminar 14301 on “Computational Humanities - bridging the gap between Computer Science and Digital Humanities”. 1998 ACM Subject Classification: I.2.7 Natural Language Processing; J.5 Arts and Humanities
- WikiNect: image schemata as a basis of gestural writing for kinetic museum wikis (2015)
- This paper provides a theoretical assessment of gestures in the context of authoring image-related hypertexts, using the example of the museum information system WikiNect. To this end, a first implementation of gestural writing based on image schemata is provided (Lakoff in Women, fire, and dangerous things: what categories reveal about the mind. University of Chicago Press, Chicago, 1987). Gestural writing is defined as a sort of coding in which propositions are expressed solely by means of gestures. In this respect, it is shown that image schemata allow for bridging between natural language predicates and gestural manifestations. Further, it is demonstrated that gestural writing primarily focuses on the perceptual level of image descriptions (Hollink et al. in Int J Hum Comput Stud 61(5):601–626, 2004). By exploring the metaphorical potential of image schemata, it is finally illustrated how to extend the expressiveness of gestural writing in order to reach the conceptual level of image descriptions. In this context, the paper paves the way for implementing museum information systems like WikiNect as systems of kinetic hypertext authoring based on full-fledged gestural writing.
- A network model of interpersonal alignment in dialog (2010)
- In dyadic communication, both interlocutors adapt to each other linguistically, that is, they align interpersonally. In this article, we develop a framework for modeling interpersonal alignment in terms of the structural similarity of the interlocutors’ dialog lexica. This is done by means of so-called two-layer time-aligned network series, that is, a time-adjusted graph model. The graph model is partitioned into two layers, so that the interlocutors’ lexica are captured as subgraphs of an encompassing dialog graph. Each constituent network of the series is updated utterance-wise. Thus, both the inherent bipartition of dyadic conversations and their gradual development are modeled. The notion of alignment is then operationalized within a quantitative model of structure formation based on the mutual information of the subgraphs that represent the interlocutors’ dialog lexica. By adapting and further developing several models of complex network theory, we show that dialog lexica evolve as a novel class of graphs that have not been considered before in the area of complex (linguistic) networks. Additionally, we show that our framework allows for classifying dialogs according to their alignment status. To the best of our knowledge, this is the first approach to measuring alignment in communication that explores the similarities of graph-like cognitive representations. Keywords: alignment in communication; structural coupling; linguistic networks; graph distance measures; mutual information of graphs; quantitative network analysis
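The utterance-wise update of a two-layer dialog graph can be sketched as follows. This is a deliberately simplified illustration: the data layout and edge scheme are our assumptions, and the Jaccard overlap at the end is a crude stand-in for the paper's mutual-information-based alignment measure:

```python
from itertools import combinations

def build_dialog_graph(utterances):
    """Build a two-layer dialog graph, updated utterance by utterance.

    `utterances` is a list of (speaker, [word, ...]) pairs. Each
    speaker's lexicon forms one layer (a subgraph of the encompassing
    dialog graph); words co-occurring in an utterance are linked
    within that speaker's layer (an illustrative edge scheme).
    """
    layers = {}  # speaker -> node set of that speaker's layer
    edges = {}   # speaker -> {frozenset({w1, w2}): co-occurrence count}
    for speaker, words in utterances:
        nodes = layers.setdefault(speaker, set())
        layer_edges = edges.setdefault(speaker, {})
        nodes.update(words)
        for w1, w2 in combinations(sorted(set(words)), 2):
            key = frozenset((w1, w2))
            layer_edges[key] = layer_edges.get(key, 0) + 1
    return layers, edges

def lexical_overlap(layers, a, b):
    """Jaccard overlap of two interlocutors' lexica: a simple proxy
    for the structural similarity the paper quantifies via the
    mutual information of the two subgraphs."""
    union = layers[a] | layers[b]
    return len(layers[a] & layers[b]) / len(union) if union else 0.0

dialog = [("A", ["hello", "world"]),
          ("B", ["hello", "there"]),
          ("A", ["there", "world"])]
layers, edges = build_dialog_graph(dialog)
print(round(lexical_overlap(layers, "A", "B"), 2))  # → 0.67
```

Tracking this overlap as the series grows, utterance by utterance, mirrors how the framework observes alignment developing gradually over the course of a dialog.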
- FIGURE: gesture annotation : annotation schema of the Frankfurt Image GestUREs (2016)
- This report documents the annotation format used in order to prepare the FIGURE corpus. For a description of data gathering, research rationale, agreement studies and first results please see the respective LREC publication (Lücking, Mehler, Walther, Mauri, and Kurfürst 2016).
- LSTMVoter : chemical named entity recognition using a conglomerate of sequence labeling tools (2019)
- Background: Chemical and biomedical named entity recognition (NER) is an essential preprocessing task in natural language processing. The identification and extraction of named entities from scientific articles is also attracting increasing interest in many scientific disciplines. Locating chemical named entities in the literature is an essential step in chemical text-mining pipelines, serving to identify chemical mentions, their properties, and the relations between them as discussed in the literature. In this work, we describe an approach to the BioCreative V.5 challenge regarding the recognition and classification of chemical named entities. For this purpose, we transform the NER task into a sequence labeling problem. We present a series of sequence labeling systems that we used, adapted and optimized in our experiments for solving this task; to this end, we experiment with hyperparameter optimization. Finally, we present LSTMVoter, a two-stage application of recurrent neural networks that integrates the optimized sequence labelers from our study into a single ensemble classifier. Results: LSTMVoter is a bidirectional long short-term memory (LSTM) tagger that utilizes a conditional random field layer in conjunction with attention-based feature modeling; feature information is modeled by means of an attention mechanism. In a series of experiments, LSTMVoter outperforms each of the extractors it integrates. On the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus, LSTMVoter achieves an F1-score of 90.04%; on the BioCreative V.5 chemical entity mention in patents corpus, it achieves an F1-score of 89.01%. Availability and implementation: Data and code are available at https://github.com/texttechnologylab/LSTMVoter.
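The ensemble idea behind LSTMVoter can be illustrated with a weighted majority vote over per-token labels. Note this is only a baseline sketch: LSTMVoter's actual second stage replaces the fixed vote with a BiLSTM-CRF that learns to weight the integrated taggers via an attention mechanism, and the label set and weights below are invented for illustration:

```python
from collections import Counter

def majority_vote(tagger_outputs, weights=None):
    """Combine per-token label sequences from several NER taggers.

    `tagger_outputs` is a list of equal-length label sequences, one
    per tagger. Each token's label is decided by a (weighted) vote.
    """
    if weights is None:
        weights = [1.0] * len(tagger_outputs)
    combined = []
    for i in range(len(tagger_outputs[0])):
        scores = Counter()
        for seq, w in zip(tagger_outputs, weights):
            scores[seq[i]] += w
        combined.append(scores.most_common(1)[0][0])  # highest-scoring label
    return combined

# Hypothetical IOB predictions from three taggers for three tokens.
preds = [
    ["B-CHEM", "I-CHEM", "O"],  # tagger 1
    ["B-CHEM", "O",      "O"],  # tagger 2
    ["B-CHEM", "I-CHEM", "O"],  # tagger 3
]
print(majority_vote(preds))  # → ['B-CHEM', 'I-CHEM', 'O']
```

Learning the weights per token and per tagger, as LSTMVoter's attention layer does, is what lets the ensemble beat every one of its constituent extractors.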