004 Data processing; Computer science
The aim of this thesis is to give users without programming skills or expertise in computer science access to automatic text processing. The specific focus is geotagging, that is, referencing various objects on a map. The basis is an ontological model whose structure is used to divide the objects into classes. Natural Language Processing (NLP) tools are used to process the text. NLP comprises methods for the machine processing of natural language. These methods make it possible to convert the unstructured information contained in texts into a structured form. The information obtained in this way can be used for further automated processing steps or provided directly to a user. If it is provided directly, it is crucial to present it in a form that can be understood without specialist or prior knowledge. In geography, a common approach is to process the obtained information visually, on the basis of various maps. Visualizations serve to illustrate information: they highlight the relevant aspects for the user and thus reduce the complexity of the information. It therefore makes sense to make the information gathered by NLP accessible to the user in the form of a visualization. In this thesis on geotagging and ontology-based visualization for the TextImaging, a tool is developed that bridges this gap. The texts are visualized on a map, making it possible to grasp the described geographic relationships at a glance. Combining the visualization on a map with the marking of the corresponding entities in the text yields a reliable and user-friendly visualization.
A final evaluation showed that the tool reduced both the time required and the number of incorrect annotations. The functions offered by the tool also make it interesting for follow-up work. One possibility is to use the developed annotators to perform ontology matching on the basis of specific texts. In the area of visualization, possible projects include visualizing historical texts on automatically determined maps appropriate to the period.
Multi-view microscopy techniques are used to increase the resolution along the optical axis for 3D imaging. Without them, the resolution is insufficient to resolve subcellular events. In addition, parts of the images of opaque specimens are often highly degraded or masked. Both problems motivate scientists to record the same specimen from multiple directions. The images then have to be digitally fused into a single high-quality image. Selective-plane illumination microscopy has proven to be a powerful imaging technique due to its unsurpassed acquisition speed and gentle optical sectioning. However, even with multi-view imaging techniques that illuminate and image the sample from multiple directions, light scattering inside tissues often severely impairs image contrast.
Here we show that for C. elegans embryos multi-view registration can be achieved based on segmented nuclei. However, segmenting nuclei in a high-density distribution such as that of the C. elegans embryo is challenging. We propose a method that uses a 3D Mexican hat filter for preprocessing and 3D Gaussian curvature in the post-processing step to separate nuclei. We applied this method successfully to 3 data sets of C. elegans embryos in 3 different views. The segmentation results outperform those of previous methods. Moreover, we provide a simple GUI for manual correction and for adjusting the parameters for different data.
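To make the preprocessing step more concrete, the following is a minimal sketch of a 3D Mexican hat kernel, taken here to be the negative Laplacian of a Gaussian. The abstract does not give the kernel size, the scale, or the implementation, so the function name `mexican_hat_kernel_3d`, the parameters `sigma` and `radius`, and the mean subtraction are assumptions for illustration only.

```python
import math


def mexican_hat_kernel_3d(sigma, radius):
    """Build a (2*radius+1)^3 Mexican hat kernel as nested lists.

    The kernel samples (3 - r^2/sigma^2) * exp(-r^2 / (2*sigma^2)), which is
    proportional to the negative Laplacian of a 3D Gaussian. The mean is
    subtracted so that a constant background gives a zero filter response.
    All parameter values are hypothetical, not taken from the thesis.
    """
    size = 2 * radius + 1
    kernel = [[[0.0] * size for _ in range(size)] for _ in range(size)]
    total = 0.0
    for z in range(size):
        for y in range(size):
            for x in range(size):
                r2 = (x - radius) ** 2 + (y - radius) ** 2 + (z - radius) ** 2
                value = (3.0 - r2 / sigma**2) * math.exp(-r2 / (2.0 * sigma**2))
                kernel[z][y][x] = value
                total += value
    mean = total / size**3
    for z in range(size):
        for y in range(size):
            for x in range(size):
                kernel[z][y][x] -= mean
    return kernel
```

Convolving an image stack with such a kernel enhances roughly spherical bright structures whose size matches `sigma` and suppresses smooth background, which is one plausible reason to use it before separating densely packed nuclei.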
We then propose a method that combines point and voxel registration for accurate multi-view registration of C. elegans embryos and does not need any special experimental preparation. We demonstrate the performance of our approach on data acquired from fixed C. elegans embryos. This multi-step approach is successfully evaluated by comparison with different methods and by using synthetic data. The proposed method can overcome the typically low resolution along the optical axis and enables stitching together the different parts of the embryo available through the different views. A tool for running the code and analyzing the results has been developed.
The goal of this thesis was to develop, implement, apply, and compare methods for simulating neuronal processes. Particular attention was paid to the question of where a full spatial resolution of the models is needed and where it can be dispensed with in favor of simplified low-dimensional models, which require considerably fewer resources and less mathematical expertise. In addition, particularly in describing the various models of the electrical behavior of nerve cells, the aim was to work out how the models relate to one another and the nature of the simplifying assumptions, in order to make clear where problems can arise when the less complex models are used.
A number of examples were then used to examine to what extent the simplification to a one-dimensional cable model and the omission of individual ion species can impair the realistic representation of cellular electrical behavior. It turned out that all models considered yield essentially the same results for the purely electrical behavior of neurons, so that a 1D cable model should be entirely sufficient and appropriate for simulating it in the vast majority of cases.
More accurate models are needed only when quantities are of interest that this model does not capture, such as the extracellular potential or the ion concentrations. In addition, a convergence study demonstrated by way of example that even a fairly coarse computational grid suffices to ensure correct results when simulating the purely electrical signals.
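As an illustration of what a one-dimensional cable model computes, here is a minimal sketch of the textbook passive cable equation, tau * dV/dt = lambda^2 * d2V/dx2 - V, solved with an explicit finite-difference scheme. This is not the thesis's model or code: the passive membrane, the voltage clamp at one end, the sealed far end, and all parameter values are assumptions chosen only to keep the example small.

```python
import math


def simulate_passive_cable(length=5.0, n_points=51, lam=1.0, tau=1.0,
                           v0=1.0, t_end=20.0, dt=0.002):
    """Explicit finite-difference solution of the passive cable equation.

    Voltage is measured relative to rest. The left end is clamped to v0 and
    the right end is sealed (zero axial current). Lengths are in units of the
    length constant lam, times in units of the time constant tau.
    All values are hypothetical illustration parameters.
    """
    dx = length / (n_points - 1)
    coeff = lam * lam / (dx * dx)
    v = [0.0] * n_points
    v[0] = v0
    steps = int(round(t_end / dt))
    for _ in range(steps):
        new_v = v[:]
        for i in range(1, n_points):
            left = v[i - 1]
            # Sealed end: mirror the neighbour so the gradient vanishes.
            right = v[i + 1] if i + 1 < n_points else v[i - 1]
            diffusion = coeff * (left - 2.0 * v[i] + right)
            new_v[i] = v[i] + dt / tau * (diffusion - v[i])
        v = new_v
    return v


if __name__ == "__main__":
    profile = simulate_passive_cable()
    # Analytic steady state for these boundary conditions, evaluated at x = 1.
    exact = math.cosh(4.0) / math.cosh(5.0)
    print(profile[10], exact)
```

With a grid spacing of one tenth of the length constant, the simulated steady state already agrees closely with the analytic profile cosh((L - x)/lambda) / cosh(L/lambda), which is in line with the observation above that coarse grids can suffice for purely electrical signals in such a model.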
The simulation of individual ion dynamics stands in sharp contrast to this. The investigation of the Poisson-Nernst-Planck model for the membrane potential already showed that considerably finer grids are needed to simulate the diffusive components of ion movement correctly.
This became even clearer in simulations of calcium waves in dendrites, which showed, among other insights, that accurate results require not only a fine axial (and temporal) resolution of the dendrite geometry but also adequate spatial resolution in the remaining dimensions. A one-dimensional cable representation of calcium dynamics is therefore subject to considerable error, and its use is strongly discouraged, at least in connection with ryanodine receptor channels. Representing channels as a continuous density in the membrane can also be problematic, as was further demonstrated.
More exact modeling of the channels, for example by also embedding probabilistic single-channel representations in the spatial model, should receive more attention in future work.
With a view to reusing functionality already implemented in the course of this work, specific parts of that functionality are described in more detail in a separate chapter. As a complex example of what is already feasible in terms of simulation, and at the same time as an application showing how as many as possible of the methods developed in this thesis can be combined, the calcium dynamics of a complete dendrite within a large active neuronal network was simulated.
From 19 to 21 September, the third Bad Homburg Conference took place at the Forschungskolleg Humanwissenschaften. The conference on artificial intelligence brought together perspectives from various academic disciplines and from practice. Speakers from computer science, law, medicine, philosophy, and brain research, among other fields, held discussions with representatives of societal practice: entrepreneurs, industry representatives, a detective chief inspector from the Hessian State Criminal Police Office, and a civil rights activist and policy adviser from the USA. The participants were welcomed by the director of the Forschungskolleg, Prof. Dr. Matthias Lutz-Bachmann; the Vice President of Goethe University, Prof. Dr. Simone Fulda; the Mayor of Bad Homburg, Meinhard Matern; and the Hessian Minister for Digital Strategy and Development, Prof. Dr. Kristina Sinemus. The closing commentary was given by Christoph Burchard, Professor of Criminal Law and Criminal Procedure, International and European Criminal Law, Comparative Law, and Legal Theory at Goethe University. UniReport had the opportunity to speak with Christoph Burchard after the conference.
Programmable hardware in the form of FPGAs has found its place in various high energy physics experiments over the past few decades. These devices provide highly parallel and fully configurable data transport, data formatting, and data processing capabilities with custom interfaces, even in rigid or constrained environments. Additionally, FPGA functionalities and the number of their logic resources have grown exponentially in the last few years, making FPGAs more and more suitable for complex data processing tasks. ALICE is one of the four main experiments at the LHC and specializes in the study of heavy-ion collisions. The readout chain of the ALICE detectors makes use of FPGAs at various places. The Read-Out Receiver Cards (RORCs) are one example of FPGA-based readout hardware, building the interface between the custom detector electronics and the commercial server nodes in the data processing clusters of the Data Acquisition (DAQ) system as well as the High Level Trigger (HLT). These boards are implemented as server plug-in cards with serial optical links towards the detectors. Experimental data is received via more than 500 optical links, already partly pre-processed in the FPGAs, and pushed towards the host machines. Computer clusters consisting of a few hundred nodes collect, aggregate, compress, reconstruct, and prepare the experimental data for permanent storage and later analysis. With the end of the first LHC run period in 2012 and the start of Run 2 in 2015, the DAQ and HLT systems were renewed and several detector components were upgraded for higher data rates and event rates. Increased detector link rates and obsolete host interfaces rendered it impossible to reuse the previous RORCs in Run 2.
This thesis describes the development, integration, and maintenance of the next generation of RORCs for ALICE in Run 2. A custom hardware platform, initially developed as a joint effort between the ALICE DAQ and HLT groups in the course of this work, found its place in the Run 2 readout systems of the ALICE and ATLAS experiments. The hardware fulfills all experiment requirements, matches its target performance, and has been running stably in the production systems since the start of Run 2. Firmware and software developments for the hardware evaluation, the design of the board, the mass production hardware tests, as well as the operation of the final board in the HLT, were carried out as part of this work. 74 boards were integrated into the HLT hardware and software infrastructure, with various firmware and software developments, to provide the main experimental data input and output interface of the HLT for Run 2. The hardware cluster finder, an FPGA-based data pre-processing core from the previous generation of RORCs, was ported to the new hardware. It has been improved and extended to meet the experimental requirements throughout Run 2. The throughput of this firmware component could be doubled and the algorithm extended, providing improved noise rejection and an increased overall mean data compression ratio compared to its previous implementation. The hardware cluster finder forms a crucial component in the HLT data reconstruction and compression scheme, with the processing performance of one board equivalent to around ten server nodes for comparable processing steps in software.
The work on the firmware development, especially on the hardware cluster finder, once more demonstrated that developing and maintaining data processing algorithms with the common low-level hardware description methods is tedious and time-consuming. Therefore, a high-level synthesis (HLS) hardware description method applying dataflow computing at an algorithmic level to FPGAs was evaluated in this context. The hardware cluster finder served as an example of a typical data processing algorithm in a high energy physics readout application. The existing and highly optimized low-level implementation provided a reference for comparisons in terms of throughput and resource usage. The cluster finder algorithm could be implemented in the dataflow description with comparably little effort, providing fast development cycles, compact code, and, at the same time, simplified extension and maintenance options. The performance results in terms of throughput and resource usage are comparable to the manual implementation. The dataflow environment proved to be highly valuable for design space explorations. An integration of the dataflow description into the HLT firmware and software infrastructure could be demonstrated as a proof of concept. A high-level hardware description could ease the design space exploration, the initial development, the maintenance, and the extension of hardware algorithms for high energy physics readout applications.
The World Wide Web is increasing the amount of freely accessible textual data, which has led to growing interest in research in the field of computational linguistics (CL). This area of research addresses the theoretical question of how language and knowledge must be represented in order to understand and produce language. For this purpose, mathematical models are being developed to capture the phenomena at various levels in human languages. A closely related field that is also attracting growing interest is Natural Language Processing (NLP), which is primarily concerned with developing effective and efficient data structures and algorithms that implement the mathematical models of CL.
With increasing interest in these areas, NLP tools are rapidly and frequently being developed incorporating different CL models to handle different levels of language. The open source trend has benefited all those in the scientific community who develop and use these tools. Due to yet undefined I/O standards for NLP, however, the rapid growth leads to a heterogeneous NLP landscape in which the specializations of the tools cannot benefit from each other because of interface incompatibility. In addition, the constantly growing amount of freely accessible text data requires a high-performance processing solution. This performance can be achieved by horizontal and vertical scaling of hardware and software. For these reasons the first part of this thesis deals with the homogenization of the NLP tool landscape, which is achieved by a standardized framework called TextImager. It is a cloud computing based multi-service, multi-server, multi-pipeline, multi-database, multi-user, multi-representation and multi-visual framework that already provides a variety of tools for various languages to process various levels of linguistic complexity. This makes it possible to answer research questions that require the processing of a large amount of data at several linguistic levels.
The integrated tools and the homogenized I/O data streams of the TextImager make it possible to combine the built-in tools in two dimensions: (1) the horizontal dimension, to achieve NLP task-specific improvement, and (2) the orthogonal dimension, to implement CL models that are based on multiple linguistic levels and thus rely on a combination of different NLP tools. The second part of this thesis therefore deals with the creation of models for the horizontal combination of tools in order to show the possibilities for improvement using the example of the NLP task of Named Entity Recognition (NER). The TextImager offers several tools for each NLP task, most of which have been trained on the same training basis but can produce different results. This means that each of the tools processes a subset of the data correctly and at the same time makes errors in another subset. In order to process as large a subset of the data as possible correctly, a horizontal combination of tools is therefore required. Machine learning-based voting mechanisms called LSTMVoter and CRFVoter were developed for this purpose, which allow a combination of the outputs of individual NLP tools so that better partial data results can be achieved. In this thesis the benefit of the voters is shown using the example of the NER task, whose results flow back into the TextImager tool landscape.
The third part of this thesis deals with the orthogonal combination of TextImager tools to accomplish verb sense disambiguation (VSD). It investigates the CL question of how verb senses should be modelled in order to disambiguate them computationally. Verb senses have a syntagmatic-paradigmatic relationship with surrounding words. Therefore, preprocessing on several linguistic levels and consequently an orthogonal combination of NLP tools is required to disambiguate verbs on a computational level. With TextImager’s integrated NLP landscape, it is now possible to perform these preprocessing steps to induce the information needed for VSD. The newly developed NLP tool for VSD has been integrated into the TextImager tool landscape, enabling the analysis of a further linguistic level.
This thesis presents a framework that homogenizes the NLP tool landscape in a cluster-based way. Methods for combining the integrated tools are implemented to improve the analysis of a specific linguistic level or to develop tools that open up new linguistic levels.
CRFVoter : gene and protein related object recognition using a conglomerate of CRF-based tools
(2019)
Background: Gene and protein related objects are an important class of entities in biomedical research, whose identification and extraction from scientific articles is attracting increasing interest. In this work, we describe an approach to the BioCreative V.5 challenge regarding the recognition and classification of gene and protein related objects. For this purpose, we transform the task as posed by BioCreative V.5 into a sequence labeling problem. We present a series of sequence labeling systems that we used and adapted in our experiments for solving this task. Our experiments show how to optimize the hyperparameters of the classifiers involved. To this end, we utilize various algorithms for hyperparameter optimization. Finally, we present CRFVoter, a two-stage application of Conditional Random Field (CRF) that integrates the optimized sequence labelers from our study into one ensemble classifier.
Results: We analyze the impact of hyperparameter optimization regarding named entity recognition in biomedical research and show that this optimization results in a performance increase of up to 60%. In our evaluation, our ensemble classifier based on multiple sequence labelers, called CRFVoter, outperforms each individual extractor’s performance. For the blinded test set provided by the BioCreative organizers, CRFVoter achieves an F-score of 75%, a recall of 71% and a precision of 80%. For the GPRO type 1 evaluation, CRFVoter achieves an F-score of 73%, a recall of 70% and the best precision (77%) among all task participants.
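The reported scores can be cross-checked against one another. The sketch below assumes that the F-score quoted above is the balanced F1 measure, the harmonic mean of precision and recall, which the abstract does not state explicitly. Because the published percentages are rounded, the recomputed values can only be expected to match after rounding.

```python
def f1_score(precision, recall):
    """Harmonic mean of precision and recall (balanced F-score)."""
    if precision + recall == 0:
        return 0.0
    return 2.0 * precision * recall / (precision + recall)


if __name__ == "__main__":
    # Values as reported in the abstract, expressed as fractions.
    print(round(100 * f1_score(0.80, 0.71)))  # blinded test set
    print(round(100 * f1_score(0.77, 0.70)))  # GPRO type 1 evaluation
```

Under this assumption both pairs of precision and recall reproduce the reported F-scores of 75% and 73%.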
Conclusion: CRFVoter is effective when multiple sequence labeling systems are to be used and performs better than the individual systems it combines.
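The idea of combining several sequence labelers can be illustrated with a much simpler stand-in than CRFVoter itself: a token-level majority vote. CRFVoter is described above as a two-stage application of CRFs, so a trained model, not a fixed rule, decides how the labelers' outputs are merged; the sketch below only shows the ensemble principle. The function name `majority_vote`, the tie-breaking rule, and the tag names are hypothetical.

```python
from collections import Counter


def majority_vote(tag_sequences):
    """Combine per-token tags from several labelers by majority vote.

    tag_sequences is a list of equally long tag lists, one per labeler.
    Ties are resolved in favour of the earliest labeler in the list whose
    tag reached the highest count (an arbitrary illustrative choice).
    """
    if not tag_sequences:
        raise ValueError("at least one tag sequence is required")
    length = len(tag_sequences[0])
    if any(len(seq) != length for seq in tag_sequences):
        raise ValueError("all tag sequences must have the same length")
    voted = []
    for labels in zip(*tag_sequences):
        counts = Counter(labels)
        best = max(counts.values())
        voted.append(next(label for label in labels if counts[label] == best))
    return voted


if __name__ == "__main__":
    labeler_a = ["B-GENE", "I-GENE", "O", "O"]
    labeler_b = ["B-GENE", "O", "O", "B-GENE"]
    labeler_c = ["O", "I-GENE", "O", "B-GENE"]
    print(majority_vote([labeler_a, labeler_b, labeler_c]))
```

A fixed vote treats all labelers as equally reliable at every token. A learned second stage, as in the approach described above, can instead weight labelers depending on context.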
LSTMVoter : chemical named entity recognition using a conglomerate of sequence labeling tools
(2019)
Background: Chemical and biomedical named entity recognition (NER) is an essential preprocessing task in natural language processing. The identification and extraction of named entities from scientific articles is also attracting increasing interest in many scientific disciplines. Locating chemical named entities in the literature is an essential step in chemical text mining pipelines for identifying chemical mentions, their properties, and relations as discussed in the literature. In this work, we describe an approach to the BioCreative V.5 challenge regarding the recognition and classification of chemical named entities. For this purpose, we transform the task of NER into a sequence labeling problem. We present a series of sequence labeling systems that we used, adapted and optimized in our experiments for solving this task. To this end, we experiment with hyperparameter optimization. Finally, we present LSTMVoter, a two-stage application of recurrent neural networks that integrates the optimized sequence labelers from our study into a single ensemble classifier.
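Casting NER as sequence labeling means assigning one tag to every token. A common encoding is the BIO scheme (begin, inside, outside); the abstract does not say which scheme the authors used, so the scheme, the function name `spans_to_bio`, the `CHEM` label, and the example sentence are assumptions for illustration.

```python
def spans_to_bio(tokens, spans):
    """Convert entity spans into one BIO tag per token.

    spans is a list of (start, end, label) tuples with token indices,
    where end is exclusive. Overlapping spans are not handled here.
    """
    tags = ["O"] * len(tokens)
    for start, end, label in spans:
        if not 0 <= start < end <= len(tokens):
            raise ValueError("span out of range: %r" % ((start, end, label),))
        tags[start] = "B-" + label
        for i in range(start + 1, end):
            tags[i] = "I-" + label
    return tags


if __name__ == "__main__":
    tokens = ["Treatment", "with", "acetylsalicylic", "acid", "reduced", "fever"]
    print(list(zip(tokens, spans_to_bio(tokens, [(2, 4, "CHEM")]))))
```

Once entities are encoded this way, any sequence labeler that predicts one tag per token can be trained on the data, and the outputs of several labelers become directly comparable.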
Results: We introduce LSTMVoter, a bidirectional long short-term memory (LSTM) tagger that utilizes a conditional random field layer in conjunction with attention-based feature modeling. Our approach explores information about features that is modeled by means of an attention mechanism. LSTMVoter outperforms each extractor integrated by it in a series of experiments. On the BioCreative IV chemical compound and drug name recognition (CHEMDNER) corpus, LSTMVoter achieves an F1-score of 90.04%; on the BioCreative V.5 chemical entity mention in patents corpus, it achieves an F1-score of 89.01%.
Availability and implementation: Data and code are available at https://github.com/texttechnologylab/LSTMVoter.
The website Sci-Hub enables users to download PDF versions of scholarly articles, including many articles that are paywalled at their journal’s site. Sci-Hub has grown rapidly since its creation in 2011, but the extent of its coverage has been unclear. Here we report that, as of March 2017, Sci-Hub’s database contains 68.9% of the 81.6 million scholarly articles registered with Crossref and 85.1% of articles published in toll access journals. We find that coverage varies by discipline and publisher, and that Sci-Hub preferentially covers popular, paywalled content. For toll access articles, we find that Sci-Hub provides greater coverage than the University of Pennsylvania, a major research university in the United States. Green open access to toll access articles via licit services, on the other hand, remains quite limited. Our interactive browser at https://greenelab.github.io/scihub allows users to explore these findings in more detail. For the first time, nearly all scholarly literature is available gratis to anyone with an Internet connection, suggesting the toll access business model may become unsustainable.
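The coverage figure can be turned into approximate absolute numbers. The short sketch below uses only the two rounded values quoted above (68.9% and 81.6 million), so the results are approximations and not the exact counts from the study; the function name is hypothetical.

```python
def coverage_in_millions(total_millions, share):
    """Return (covered, missing) article counts in millions."""
    covered = total_millions * share
    return covered, total_millions - covered


if __name__ == "__main__":
    covered, missing = coverage_in_millions(81.6, 0.689)
    print(round(covered, 1), round(missing, 1))
```

With these inputs, roughly 56.2 million Crossref-registered articles were in Sci-Hub's database as of March 2017 and roughly 25.4 million were not.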