004 Data processing; Computer science
From 19 to 21 September, the third Bad Homburg Conference took place at the Forschungskolleg Humanwissenschaften. The conference on artificial intelligence brought together perspectives from various academic disciplines and from practice. Speakers from computer science, law, medicine, philosophy and brain research, among other fields, held discussions with representatives of societal practice: entrepreneurs, industry representatives, a detective chief inspector from the Hessian State Criminal Police Office (Landeskriminalamt Hessen), and a civil rights activist and policy adviser from the USA. The participants were welcomed by the director of the Forschungskolleg, Prof. Dr. Matthias Lutz-Bachmann; the Vice President of Goethe University, Prof. Dr. Simone Fulda; the mayor of Bad Homburg, Meinhard Matern; and the Hessian Minister for Digital Strategy and Development, Prof. Dr. Kristina Sinemus. The closing commentary was given by Christoph Burchard, Professor of Criminal Law and Criminal Procedure, International and European Criminal Law, Comparative Law and Legal Theory at Goethe University. UniReport had the opportunity to speak with Christoph Burchard after the conference.
The website Sci-Hub enables users to download PDF versions of scholarly articles, including many articles that are paywalled at their journal’s site. Sci-Hub has grown rapidly since its creation in 2011, but the extent of its coverage has been unclear. Here we report that, as of March 2017, Sci-Hub’s database contains 68.9% of the 81.6 million scholarly articles registered with Crossref and 85.1% of articles published in toll access journals. We find that coverage varies by discipline and publisher, and that Sci-Hub preferentially covers popular, paywalled content. For toll access articles, we find that Sci-Hub provides greater coverage than the University of Pennsylvania, a major research university in the United States. Green open access to toll access articles via licit services, on the other hand, remains quite limited. Our interactive browser at https://greenelab.github.io/scihub allows users to explore these findings in more detail. For the first time, nearly all scholarly literature is available gratis to anyone with an Internet connection, suggesting the toll access business model may become unsustainable.
The subject of the work presented here is a virtual reality (VR) application that can visualize the structure of any text as a walkable, interactive city. In addition, the program offers a special text search that is not found in this form in conventional text processing programs. Thanks to the structural analysis and the use of several unusual analysis tools from TextImager [2], text2City supports not only the search for specific text patterns but also, for example, the determination of the text level (word, sentence, paragraph, etc.) and more. A further feature is the communication link between the TextAnnotator service [1] and text2City, which allows the user to annotate and can also make annotations by other people visible immediately. Running the program requires one of the two VR headsets Oculus Rift or HTC Vive, a VR-capable PC, and the Unity software.
Human readers have the ability to infer knowledge from text, even if that particular information is not explicitly stated. In this thesis, we address the phenomena of text-level implicit information and outline novel automated methods for its recovery.
The main focus of this work is on two types of unexpressed content: content that arises between sentences (implicit discourse relations) and content that arises within sentences (implicit semantic roles).
Traditional approaches mostly rely on costly rich linguistic features, e.g., sentiment or frame-based lexicons, and require heuristics or manual feature engineering.
As an improvement, we propose a collection of generic resource-lean methods, implemented in the form of statistical background knowledge or by means of neural architectures.
Our models are largely language-independent and produce state-of-the-art performance, e.g., in the classification of Chinese implicit discourse relations, or the detection of locally covert predicative arguments in free texts.
In novel experiments, we quantitatively demonstrate that both types of implicit information are mutually dependent insofar as, for instance, some implicit roles directly correlate with implicit discourse relations of similar properties.
We show that implicit information processing further benefits downstream applications and demonstrate its applicability to the higher-level task of narrative story understanding.
In the conclusion of the dissertation, we argue that implicit information processing is needed in order to realize the goal of true natural language understanding.
Relying on the theory of Saward (2010) and Disch (2015), we study political representation through the lens of representative claim-making. We identify a gap between the theoretical concept of claim-making and the empirical (quantitative) assessment of representative claims made in real-world representative contexts. Therefore, we develop a new approach to map and quantify representative claims in order to subsequently measure the reception and validation of the claims by the audience. To test our method, we analyse all the debates of the German parliament concerned with the introduction of the gender quota in German supervisory boards from 2013 to 2017 in a two-step process. First, we assess which constituencies the MPs claim to represent and how they justify their stance. Drawing on multiple correspondence analysis, we identify different claim patterns. Second, making use of natural language processing techniques and logistic regression on social media data, we measure if and how the asserted claims in the parliamentary debates are received and validated by the respective audience. We come to the conclusion that the constituency as ultimate judge of legitimacy has not been comprehensively conceptualized yet.
The World Wide Web is increasing the amount of freely accessible textual data, which has led to growing interest in research in the field of computational linguistics (CL). This area of research pursues the theoretical question of how language and knowledge must be represented in order to understand and produce language. For this purpose, mathematical models are being developed to capture the phenomena at various levels in human languages. Another field of research that is closely related to CL and is also attracting growing interest is Natural Language Processing (NLP), which is primarily concerned with developing effective and efficient data structures and algorithms that implement the mathematical models of CL.
With increasing interest in these areas, NLP tools are rapidly and frequently being developed incorporating different CL models to handle different levels of language. The open source trend has benefited all those in the scientific community who develop and use these tools. Due to yet undefined I/O standards for NLP, however, the rapid growth leads to a heterogeneous NLP landscape in which the specializations of the tools cannot benefit from each other because of interface incompatibility. In addition, the constantly growing amount of freely accessible text data requires a high-performance processing solution. This performance can be achieved by horizontal and vertical scaling of hardware and software. For these reasons the first part of this thesis deals with the homogenization of the NLP tool landscape, which is achieved by a standardized framework called TextImager. It is a cloud computing based multi-service, multi-server, multi-pipeline, multi-database, multi-user, multi-representation and multi-visual framework that already provides a variety of tools for various languages to process various levels of linguistic complexity. This makes it possible to answer research questions that require the processing of a large amount of data at several linguistic levels.
The integrated tools and the homogenized I/O data streams of the TextImager make it possible to combine the built-in tools in two dimensions: (1) the horizontal dimension to achieve NLP task-specific improvement and (2) the orthogonal dimension to implement CL models that are based on multiple linguistic levels and thus rely on a combination of different NLP tools. The second part of this thesis therefore deals with the creation of models for the horizontal combination of tools in order to show the possibilities for improvement using the example of the NLP task of Named Entity Recognition (NER). The TextImager offers several tools for each NLP task, most of which have been trained on the same training basis, but can produce different results. This means that each of the tools processes a subset of the data correctly and at the same time makes errors in another subset. In order to process as large a subset of the data as possible correctly, a horizontal combination of tools is therefore required. Machine learning-based voting mechanisms called LSTMVoter and CRFVoter were developed for this purpose, which allow a combination of the outputs of individual NLP tools so that better partial data results can be achieved. In this thesis the benefit of Voter is shown using the example of the NER task, whose results flow back into the TextImager tool landscape.
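The idea of a horizontal combination can be illustrated with a deliberately simple stand-in. The sketch below is not LSTMVoter or CRFVoter, which are learned models; it is a plain per-token majority vote, and the function name `majority_vote`, the three tagger outputs, and the example sentence are all hypothetical. It only shows how tools that err on different tokens can be combined into a better joint labelling.

```python
from collections import Counter
from typing import List, Sequence


def majority_vote(predictions: Sequence[Sequence[str]]) -> List[str]:
    """Combine per-token labels from several taggers by simple majority.

    predictions[i][j] is the label that tagger i assigned to token j.
    Ties are broken in favour of the tagger listed first.
    """
    if not predictions:
        return []
    length = len(predictions[0])
    if any(len(p) != length for p in predictions):
        raise ValueError("all taggers must label the same number of tokens")

    combined = []
    for labels in zip(*predictions):
        counts = Counter(labels)
        best = max(counts.values())
        # Pick the first label (in tagger order) that reaches the top count.
        combined.append(next(label for label in labels if counts[label] == best))
    return combined


# Hypothetical outputs of three NER tools for "Goethe studied in Leipzig".
# Each tagger makes a mistake on a different token.
tagger_a = ["B-PER", "O", "B-LOC", "B-LOC"]
tagger_b = ["B-PER", "O", "O", "O"]
tagger_c = ["O", "O", "O", "B-LOC"]

combined = majority_vote([tagger_a, tagger_b, tagger_c])
print(combined)
```

A learned voter can go further than this baseline by weighting tools differently per label or context, which a fixed majority rule cannot do.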
The third part of this thesis deals with the orthogonal combination of TextImager tools to accomplish verb sense disambiguation (VSD). The CL question investigated is how verb senses should be modelled in order to disambiguate them computationally. Verb senses have a syntagmatic-paradigmatic relationship with surrounding words. Therefore, preprocessing on several linguistic levels and consequently an orthogonal combination of NLP tools is required to disambiguate verbs on a computational level. With TextImager’s integrated NLP landscape, it is now possible to perform these preprocessing steps to induce the information needed for VSD. The newly developed NLP tool for VSD has been integrated into the TextImager tool landscape, enabling the analysis of a further linguistic level.
This thesis presents a framework that homogenizes the NLP tool landscape in a cluster-based way. Methods for combining the integrated tools are implemented to improve the analysis of a specific linguistic level or to develop tools that open up new linguistic levels.
Advances in flow cytometry enable the acquisition of large and high-dimensional data sets per patient. Novel computational techniques allow the visualization of structures in these data and, finally, the identification of relevant subgroups. Correct data visualizations and projections from the high-dimensional space to the visualization plane require the correct representation of the structures in the data. This work shows that frequently used techniques are unreliable in this respect. One of the most important methods for data projection in this area is the t-distributed stochastic neighbor embedding (t-SNE). We analyzed its performance on artificial and real biomedical data sets. t-SNE introduced a cluster structure for homogeneously distributed data that did not contain any subgroup structure. In other data sets, t-SNE occasionally suggested the wrong number of subgroups or projected data points belonging to different subgroups as if belonging to the same subgroup. As an alternative approach, emergent self-organizing maps (ESOM) were used in combination with U-matrix methods. This approach allowed the correct identification of homogeneous data, while in sets containing distance- or density-based subgroup structures, the number of subgroups and data point assignments were correctly displayed. The results highlight possible pitfalls in the use of a currently widely applied algorithmic technique for the detection of subgroups in high-dimensional cytometric data and suggest a robust alternative.
The aim of this thesis is to give users without programming skills or expertise in computer science access to the automatic processing of texts. Specifically, it concerns geotagging, that is, referencing various objects on a map. An ontological model serves as the basis; its structure is used to divide the objects into classes. Natural Language Processing tools are used to process the text. Natural Language Processing denotes methods for the machine processing of natural language. They make it possible to bring the unstructured information contained in texts into a structured form. The information obtained in this way can be used for further machine processing steps or made available directly to a user. If it is made available directly, it is crucial to present it in a form that is understandable even without expertise or prior knowledge. In geography, the approach often followed is to process the information obtained visually, on the basis of various maps. Visualizations serve here to illustrate information. They highlight the relevant aspects for the user and thus reduce the complexity of the information. It therefore makes sense to make the information gathered through Natural Language Processing accessible to the user in the form of a visualization. In this thesis on geotagging and ontology-based visualization for TextImaging, a tool is developed that bridges this gap. The texts are visualized on a map, which makes it possible to grasp described geographical relationships at a glance. Combining the visualization on a map with the marking of the corresponding entities in the text yields a reliable and user-friendly visualization.
A final evaluation showed that the tool reduced both the time required and the number of erroneous annotations. The functions offered by the tool also make it interesting for further work. One possibility is to use the developed annotators to perform ontology matching on the basis of particular texts. In the area of visualization, possible projects include the visualization of historical texts on the basis of automatically determined maps appropriate to the period.
Dancing is an activity that enhances people's mood and consists of feeling the music and expressing it in rhythmic movements of the body. Learning how to dance can be challenging because it requires proper coordination and an understanding of rhythm and beat. In this paper, we present the first implementation of the Dancing Coach (DC), a generic system designed to support the practice of dancing steps, which in its current state supports the practice of basic salsa dancing steps. However, the DC has been designed to allow the addition of more dance styles. We also present the first user evaluation of the DC, which consists of user tests with 25 participants. Results from the user tests show that participants stated they had learned the basic salsa dancing steps, how to move to the beat, and body coordination in a fun way. The results also point out some directions for improving future versions of the DC.