Refine
Document Type
- Doctoral Thesis (8)
- Bachelor Thesis (6)
- Master's Thesis (2)
Has Fulltext
- yes (16)
Is part of the Bibliography
- no (16)
Keywords
- Virtual Reality (2)
- Virtuelle Realität (2)
- AI Safety (1)
- Autonomous Driving (1)
- Bitcoin (1)
- Cognitive Maps (1)
- Cognitive Spatial Distortions (1)
- DNN Robustness (1)
- Deep Learning (1)
- LDPC Codes (1)
Institute
Die vorliegende Arbeit beschäftigt sich mit dem Thema Stemmatologie, d.h. primär der Rekonstruktion der Kopiergeschichte handschriftlich fixierter Dokumente. Zentrales Objekt der Stemmatologie ist das Stemma, eine visuelle Darstellung der Kopiergeschichte, welche i.d.R. graphtheoretisch als Baum bzw. gerichteter azyklischer Graph vorliegt, wobei die Knoten Textzeugen (d.s. die Textvarianten) darstellen während die Kanten für einzelne Kopierprozesse stehen. Im Mittelpunkt des Wissenschaftszweiges steht die Frage des Autorenoriginals (falls ein einziges solches existiert haben sollte) und die Frage der Rekonstruktion seines Textes. Das Stemma selbst ist ein Mittel zu diesem Hauptzweck (Cameron 1987). Der durch für manuelle Kopierprozesse kennzeichnende Abweichungen zunehmend abgewandelte Originaltext ist meist nicht direkt überliefert. Ziel der Arbeit ist es, die semi-automatische Stemmatologie umfassend zu beschreiben und durch Tools und analytische Verfahren weiterzuentwickeln. Der erste Teil der Arbeit beschreibt die Geschichte der computer-assistierten Stemmatologie inkl. ihrer klassischen Vorläufer und mündet in der Vorstellung eines einfachen Tools zur dynamischen graphischen Darstellung von Stemmata. Ein Exkurs zum philologischen Leitphänomen Lectio difficilior erörtert dessen mögliche psycholinguistische Ursachen im schnelleren lexikalischen Zugriff auf hochfrequente Lexeme. Im zweiten Teil wird daraufhin die existenziellste aller stemmatologischen Debatten, initiiert durch Joseph Bédier, mit mathematischen Argumenten auf Basis eines von Paul Maas 1937 vorgeschlagenen stemmatischen Models beleuchtet. Des Weiteren simuliert der Autor in diesem Kapitel Stemmata, um den potenziellen Einfluss der Distribution an Kopierhäufigkeiten pro Manuskript abzuschätzen.
Im nächsten Teil stellt der Autor ein eigens erstelltes Korpus in persischer Sprache vor, welches ebenso wie 3 der bekannten artifiziellen Korpora (Parzival, Notre Besoin, Heinrichi) qualitativ untersucht wird. Schließlich wird mit der Multi Modal Distance eine Methode zur Stemmagenerierung angewandt, welche auf externen Daten psycholinguistisch determinierter Buchstabenverwechslungswahrscheinlichkeiten beruht. Im letzten Teil arbeitet der Autor mit minimalen Spannbäumen zur Stemmaerzeugung, wobei eine vergleichende Studie zu 4 Methoden der Distanzmatrixgenerierung mit 4 Methoden zur Stemmaerzeugung durchgeführt, evaluiert und diskutiert wird.
Gegenstand der hier vorgestellten Arbeit ist eine Applikation für die virtuelle Realität (VR), die in der Lage ist, die Struktur eines beliebigen Textes als begehbare, interaktive Stadt zu visualisieren. Darüber hinaus bietet das Programm eine besondere Textsuche an, die so in anderen konventionellen Textverarbeitungsprogrammen nicht vorzufinden ist. Dank der strukturellen Analyse und der Verwendung einiger außergewöhnlicher Analysetools des TextImager [2], ermöglicht text2City nicht nur die Suche nach bestimmten Textmustern, sondern zum Beispiel auch die Bestimmung der Textebene (Wort, Satz, Absatz, etc.) und einiges mehr. Ein weiteres Feature ist die Kommunikationsverbindung zwischen dem TextAnnotator-Service [1] und text2City, die dem Benutzer die Möglichkeit zum Annotieren bietet, aber auch von anderen Personen durchgeführte Annotationen sofort sichtbar machen kann. Für die Ausführung des Programms ist eine der beiden VRBrillen, Oculus Rift oder HTC Vive, ein für VR geeigneter PC, sowie die Software Unity nötig.
The World Wide Web is increasing the number of freely accessible textual data, which has led to an increasing interest in research in the field of computational linguistics (CL). This area of research addresses theoretical research to answer the question of how language and knowledge must be represented in order to understand and produce language. For this purpose, mathematical models are being developed to capture the phenomena at various levels in human languages. Another field of research experiencing an increase in interest that is closely related to CL is Natural Language Processing (NLP), which is primarily concerned with developing effective and efficient data structures and algorithms that implement the mathematical models of CL.
With increasing interest in these areas, NLP tools are rapidly and frequently being developed incorporating different CL models to handle different levels of language. The open source trend has benefited all those in the scientific community who develop and use these tools. Due to yet undefined I/O standards for NLP, however, the rapid growth leads to a heterogeneous NLP landscape in which the specializations of the tools cannot benefit from each other because of interface incompatibility. In addition, the constantly growing amount of freely accessible text data requires a high-performance processing solution. This performance can be achieved by horizontal and vertical scaling of hardware and software. For these reasons the first part of this thesis deals with the homogenization of the NLP tool landscape, which is achieved by a standardized framework called TextImager. It is a cloud computing based multi-service, multi-server, multi-pipeline, multi-database, multi-user, multi-representation and multi-visual framework that already provides a variety of tools for various languages to process various levels of linguistic complexity. This makes it possible to answer research questions that require the processing of a large amount of data at several linguistic levels.
The integrated tools and the homogenized I/O data streams of the TextImager make it possible to combine the built-in tools in two dimensions: (1) the horizontal dimension to achieve NLP task-specific improvement (2) the orthogonal dimension to implement CL models that are based on multiple linguistic levels and thus rely on a combination of different NLP tools. The second part of this thesis therefore deals with the creation of models for the horizontal combination of tools in order to show the possibilities for improvement using the example of the NLP task of Named Entity Recognition (NER). The TextImager offers several tools for each NLP task, most of which have been trained on the same training basis, but can produce different results. This means that each of the tools processes a subset of the data correctly and at the same time makes errors in another subset. In order to process as large a subset of the data as possible correctly, a horizontal combination of tools is therefore required. Machine learning-based voting mechanisms called LSTMVoter and CRFVoter were developed for this purpose, which allow a combination of the outputs of individual NLP tools so that better partial data results can be achieved. In this thesis the benefit of Voter is shown using the example of the NER task, whose results flow
back into the TextImager tool landscape.
The third part of this thesis deals with the orthogonal combination of TextImager tools to accomplish the verb sense disambiguation (VSD). The CL question is investigated, how verb senses should be modelled in order to disambiguate them computatively. Verbsenses have a syntagmatic-paradigmatic relationship with surrounding words. Therefore, preprocessing on several linguistic levels and consequently an orthogonal combination of NLP tools is required to disambiguate verbs on a computational level. With TextImager’s integrated NLP landscape, it is now possible to perform these preprocessing steps to induce the information needed for the VSD. The newly developed NLP tool for the VSD has been integrated into the TextImager tool landscape, enabling the analysis of a further linguistic level.
This thesis presents a framework that homogenizes the NLP tool landscape in a cluster-based way. Methods for combining the integrated tools are implemented to improve the analysis of a specific linguistic level or to develop tools that open up new linguistic levels.
In dieser Arbeit werden 4,6 Millionen englische Tweets, welche das Keyword „Bitcoin“ enthalten, analysiert und der Zusammenhang zwischen dem Sentiment der Tweets und den Renditen des Bitcoin untersucht. Zur Bestimmung der Sentiment-Klassen werden Text-Klassifizierer mit verschiedenen Ansätzen, darunter auch auf Convolutional Neural Networks und Transformern basierende Modelle, in diesem Kontext evaluiert und optimiert. Es wird außerdem ein Meta-Modell konstruiert, welches beim Problem der Sentiment-Klassifikation von Tweets in drei Klassen {Positiv, Negativ, Neutral} in der betrachteten Domäne besser abschneidet, als die anderen begutachteten Modelle. Bezüglich des Zusammenhangs wird im Speziellen auch der Einfluss von Merkmalen der Tweets und ihrer Verfassern anhand der Distanzkorrelation untersucht.
When we browse via WiFi on our laptop or mobile phone, we receive data over a noisy channel. The received message may differ from the one that was sent originally. Luckily it is often possible to reconstruct the original message but it may take a lot of time. That’s because decoding the received message is a complex problem, NP-hard to be exact. As we continue browsing, new information is sent to us in a high frequency. So if lags are to be avoided and as memory is finite, there is not much time left for decoding. Coding theory tackles this problem by creating models of the channels we use to communicate and tailor codes based on the channel properties. A well known family of codes are Low-Density Parity-Check codes (LDPC codes), they are widely used in standards like WiFi and DVB-T2. In practical settings the complexity of decoding a received message can be heavily reduced by using LDPC codes and approximative decoding algorithms. This thesis lays out the basic construction of LDPC codes and a proper decoding using the sum-product algorithm. On this basis a neural network to improve decoding is introduced. Therefore the sum-product algorithm is transformed into a neural network decoder. This approach was first presented by Nachmani et al. and treated in detail by Navneet Agrawal in 2017. To find out how machine learning can improve the codes, the bit error rates of the trained neural network decoder are compared with the bit error rates of the classic sum-product algorithm approach. Experiments with static and dynamic training datasets of diverse sizes, various signal-to-noise ratios, a feed forward as well as a recurrent architecture show how to tune the neural network decoder even further. Results of the experiments are used to verify statements made in Agrawal’s work. In addition, corrections and improvements in the area of metrics are presented. An implementation of the neural network to facilitate access for others will be made available to the public.
This thesis will first introduce in more detail the Bayesian theory and its use in integrating multiple information sources. I will briefly talk about models and their relation to the dynamics of an environment, and how to combine multiple alternative models. Following that I will discuss the experimental findings on multisensory integration in humans and animals. I start with psychophysical results on various forms of tasks and setups, that show that the brain uses and combines information from multiple cues. Specifically, the discussion will focus on the finding that humans integrate this information in a way that is close to the theoretical optimal performance. Special emphasis will be put on results about the developmental aspects of cue integration, highlighting experiments that could show that children do not perform similar to the Bayesian predictions. This section also includes a short summary of experiments on how subjects handle multiple alternative environmental dynamics. I will also talk about neurobiological findings of cells receiving input from multiple receptors both in dedicated brain areas but also primary sensory areas. I will proceed with an overview of existing theories and computational models of multisensory integration. This will be followed by a discussion on reinforcement learning (RL). First I will talk about the original theory including the two different main approaches model-free and model-based reinforcement learning. The important variables will be introduced as well as different algorithmic implementations. Secondly, a short review on the mapping of those theories onto brain and behaviour will be given. I mention the most in uential papers that showed correlations between the activity in certain brain regions with RL variables, most prominently between dopaminergic neurons and temporal difference errors. I will try to motivate, why I think that this theory can help to explain the development of near-optimal cue integration in humans. The next main chapter will introduce our model that learns to solve the task of audio-visual orienting. Many of the results in this section have been published in [Weisswange et al. 2009b,Weisswange et al. 2011]. The model agent starts without any knowledge of the environment and acts based on predictions of rewards, which will be adapted according to the reward signaling the quality of the performed action. I will show that after training this model performs similarly to the prediction of a Bayesian observer. The model can also deal with more complex environments in which it has to deal with multiple possible underlying generating models (perform causal inference). In these experiments I use di#erent formulations of Bayesian observers for comparison with our model, and find that it is most similar to the fully optimal observer doing model averaging. Additional experiments using various alterations to the environment show the ability of the model to react to changes in the input statistics without explicitly representing probability distributions. I will close the chapter with a discussion on the benefits and shortcomings of the model. The thesis continues whith a report on an application of the learning algorithm introduced before to two real world cue integration tasks on a robotic head. For these tasks our system outperforms a commonly used approximation to Bayesian inference, reliability weighted averaging. The approximation is handy because of its computational simplicity, because it relies on certain assumptions that are usually controlled for in a laboratory setting, but these are often not true for real world data. This chapter is based on the paper [Karaoguz et al. 2011]. Our second modeling approach tries to address the neuronal substrates of the learning process for cue integration. I again use a reward based training scheme, but this time implemented as a modulation of synaptic plasticity mechanisms in a recurrent network of binary threshold neurons. I start the chapter with an additional introduction section to discuss recurrent networks and especially the various forms of neuronal plasticity that I will use in the model. The performance on a task similar to that of chapter 3 will be presented together with an analysis of the in uence of different plasticity mechanisms on it. Again benefits and shortcomings and the general potential of the method will be discussed. I will close the thesis with a general conclusion and some ideas about possible future work.
The main goal of this work was to create a network environment for the Unity Engine project StolperwegeVR, developed by the Text Technology Lab of Goethe University, in which you will be able to annotate one to several documents in a group. For this, basic network utils like seeing other users or moving objects had to be implemented which had to be easy to use and work with in the future.
Due to the resurrection of data-hungry models (such as deep convolutional neural nets), there is an increasing demand for large-scale labeled datasets and benchmarks in the computer vision fields (CV). However, collecting real data across diverse scene contexts along with high-quality annotations is often expensive and time-consuming, especially for detailed pixel-level label prediction tasks such as semantic segmentation, etc. To address the scarcity of real-world training sets, recent works have proposed the use of computer graphics (CG) generated data to train and/or characterize performance of modern CV systems. CG based virtual worlds provide easy access to ground truth annotations and control over scene states. Most of these works utilized training data simulated from video games and pre-designed virtual environments and demonstrated promising results. However, little effort has been devoted to the systematic generation of massive quantities of sufficiently complex synthetic scenes for training scene understanding algorithms. In this work, we develop a full pipeline for simulating large-scale datasets along with per-pixel ground truth information. Our simulation pipeline constitutes of mainly two components: (a) a stochastic scene generative model that automatically synthesizes traffic scene layouts by using marked point processes coupled with 3D CAD objects and factor potentials, (b) an annotated-image rendering tool that renders the sampled 3D scene as RGB image with a chosen rendering method along with pixel-level annotations such as semantic labels, depth, surface normals etc. This pipeline is capable of automatically generating and rendering a potentially infinite variety of outdoor traffic scenes that can be used to train convolutional neural nets (CNN).
However, several recent works, including our own initial experiments demonstrated that the CV models that are trained naively on simulated data lack generalization capabilities to real-world scenes. This opens up several fundamental questions about what is it lacking in simulated data compared to real data and how to use it effectively. Furthermore, there has been a long debate since 1980’s on the usefulness of CG generated data for tuning CV systems. Primarily, the impact of modeling errors and computational rendering approximations, due to various choices in the rendering pipeline, on trained CV systems generalization performance is still not clear. In this thesis, we take a case study in the context of traffic scenarios to empirically analyze the performance degradations when CV systems trained with virtual data are transferred to real data. We first explore system performance tradeoffs due to the choice of the rendering engine (e.g., Lambertian shader (LS), ray-tracing (RT), and Monte-Carlo path tracing (MCPT)) and their parameters. A CNN architecture, DeepLab, that performs semantic segmentation, is chosen as the CV system being evaluated. In our case study, involving traffic scenes, a CNN trained with CG data samples generated with photorealistic rendering methods (such as RT or MCPT), shows already a reasonably good performance on real-world testing data from CityScapes benchmark. Use of samples from an elementary rendering method, i.e., LS, degraded the performance of CNN by nearly 20%. This result conveys that training data must be photorealistic enough for better generalizability of the trained CNN models. Furthermore, the use of physics-based MCPT rendering improved the performance by 6% but at the cost of more than three times the rendering time. This MCPT generated dataset when augmented with just 10% of real-world training data from CityScapes dataset, the performance levels achieved are comparable to that of training CNN with the complete CityScapes dataset.
The next aspect we study in the thesis involves the impact of choice of parameter settings of scene generation model on the generalization performance of CNN models trained with the generated data. Towards this end, we first propose an algorithm to estimate our scene generation model parameters given an unlabeled real world dataset from the target domain. This unsupervised tuning approach utilizes the concept of generative adversarial training, which aims at adapting the generative model by measuring the discrepancy between generated and real data in terms of their separability in the space of a deep discriminatively-trained classifier. Our method involves an iterative estimation of the posterior density of prior distributions for the generative graphical model used in the simulation. Initially, we assume uniform distributions as priors over parameters of a scene described by our generative graphical model. As iterations proceed the uniform prior distributions are updated sequentially to distributions for the simulation model parameters that leads to simulated data with statistics that are closer to the distributions of the unlabeled target data.
...
Viele Methoden wurden in dieser Arbeit vorgestellt, die sich mit dem Hauptziel der automatischen Dokumentenanalyse auf semantischer Ebene befassen. Um das Hauptziel zu erreichen, mussten wir jedoch zunächst eine solide Basis entwickeln, um das Gesamtbild zu vervollständigen. So wurden verschiedene Methoden und Werkzeuge entwickelt, die verschiedene Aspekte des NLP abdecken. Das Zusammenspiel dieser Methoden ermöglichte es, unser Ziel erfolgreich zu erreichen. Neben der automatischen Dokumentenanalyse legen wir großen Wert auf die drei Prinzipien von Effizienz, Anwendbarkeit und Sprachunabhängigkeit. Dadurch waren die entwickelten Tools für die Anwendungen bereit. Die Größe und Sprache der zu analysierenden Daten ist kein Hindernis mehr, zumindest für die im Bezug auf die von Wikipedia unterstützten Sprachen.
Einen großen Beitrag dazu leistete TextImager, das Framework, dass für die zugrunde liegende Architektur verschiedener Methoden und die gesamte Vorverarbeitung der Texte verantwortlich ist. TextImager ist als Multi-Server und Multi-Instanz-Cluster konzipiert, sodass eine verteilte Verarbeitung von Daten ermöglicht wird. Hierfür werden Cluster-Management-Dienste UIMA-AS und UIMA-DUCC verwendet. Darüber hinaus ermöglicht die Multi-Service-Architektur von TextImager die Integration beliebiger NLP-Tools und deren gemeinsame Ausführung. Zudem bietet der TextImager eine webbasierte Benutzeroberfläche, die eine Reihe von interaktiven Visualisierungen bietet, die die Ergebnisse der Textanalyse darstellen. Das Webinterface erfordert keine Programmierkenntnisse - durch einfaches Auswählen der NLP-Komponenten und der Eingabe des Textes wird die Analyse gestartet und anschließend visualisiert, so dass auch Nicht-Informatiker mit diesen Tools arbeiten können.
Zudem haben wir die Integration des statistischen Frameworks R in die Funktionalität und Architektur von TextImager demonstriert. Hier haben wir die OpenCPU-API verwendet, um R-Pakete auf unserem eigenen R-Server bereitzustellen. Dies ermöglichte die Kombination von R-Paketen mit den modernsten NLP-Komponenten des TextImager. So erhielten die Funktionen der R-Pakete extrahierte Informationen aus dem TextImager, was zu verbesserten Analysen führte.
Darüber hinaus haben wir interaktive Visualisierungen integriert, um die von R abgeleiteten Informationen zu visualisieren.
Einige der im TextImager entwickelten Visualisierungen sind besonders herausragend und haben in vielen Bereichen Anwendung gefunden. Ein Beispiel dafür ist PolyViz, ein interaktives Visualisierungssystem, das die Darstellung eines multipartiten Graphen ermöglicht. Wir haben PolyViz anhand von zwei verschiedenen Anwendungsfällen veranschaulicht.
SemioGraph, eine Visualisierungstechnik zur Darstellung multikodaler Graphen wurde auch vorgestellt. Die visuellen und interaktiven Funktionen von SemioGraph wurden mit einer Anwendung zur Visualisierung von Worteinbettungen vorgestellt. Wir haben gezeigt, dass verschiedene Modelle zu völlig unterschiedlichen Grafiken führen können. So kann Semiograph bei der Suche nach Worteinbettungen für bestimmte NLP-Aufgaben helfen.
Inspiriert von all den Textvisualisierungen im TextImager ist die Idee für text2voronoi geboren. Hier stellten wir einen neuartigen Ansatz zur bildgetriebenen Textklassifizierung vor, der auf einem Voronoi-Diagram linguistischer Merkmale basiert. Dieser Klassifikationsansatz wurde auf die automatische Patientendiagnose angewendet und wir haben gezeigt, dass wir das traditionelle Bag-Of-Words-Modell sogar übertreffen. Dieser Ansatz ermöglicht es, die zugrunde liegenden Merkmale anschließend zu analysieren und damit einen ersten Schritt zur Lösung der Black Box zu machen.
Wir haben text2voronoi auf literarische Werke angewendet und die entstandenen Visualisierungen auf einer webbasierten Oberfläche (LitViz) präsentiert. Hier ermöglichen wir den Vergleich von Voronoi-Diagrammen der verschiedenen Literaturen und damit den visuellen Vergleich der Sprachstile der zugrunde liegenden Autoren.
Mit unserer Kompetenz in der Vorverarbeitung und der Analyse von Texten sind wir unserem Ziel der semantischen Dokumentenanalyse einen Schritt näher gekommen. Als nächstes haben wir die Auflösung der Sinne auf der Wortebene untersucht. Hier stellten wir fastSense vor, ein Disambigierungsframework, das mit großen Datenmengen zurecht kommt. Um dies zu erreichen, haben wir einen Disambiguierungskorpus erstellt, der auf Wikipedias 221965 Disambiguierungsseiten basiert, wobei die sich auf 825179 Sinne beziehen. Daraus resultierten mehr als 50 Millionen Datensätze, die fast 50 GB Speicherplatz benötigten. Wir haben nicht nur gezeigt, dass fastSense eine so große Datenmenge problemlos verarbeiten kann, sondern auch, dass wir mit unseren Wettbewerbern mithalten und sie bei einigen NLP-Aufgaben sogar übertreffen können.
Jetzt, da wir den Wörtern Sinne zuordnen können, sind wir der semantischen Dokumentenanalyse einen weiteren Schritt näher gekommen. Je mehr Informationen wir aus einem Text und seinen Wörtern gewinnen können, desto genauer können wir seinen Inhalt analysieren. Wir stellten zudem einen netzwerktheoretischen Ansatz zur Modellierung der Semantik großer Textnetzwerke am Beispiel der deutschen Wikipedia vor. Zu diesem Zweck haben wir einen Algorithmus namens text2ddc entwickelt, um die thematische Struktur eines Textes zu modellieren. Dabei basiert das Modell auf einem etablierten Klassifikationsschema, nämlich der Dewey Decimal Classification. Mit diesem Modell haben wir gezeigt, wie man aus der Vogelperspektive die Hervorhebung und Verknüpfung von Themen, die sich in Millionen von Dokumenten manifestiert, darstellt. So haben wir eine Möglichkeit geschaffen, die thematische Dynamik von Dokumentnetzwerken automatisch zu visualisieren. Die Trainings- und Testdaten, die wir in diesem Kapitel hatten, bestanden jedoch hauptsächlich aus kurzen Textausschnitten. Zudem haben wir DDC Korpora erstellt, indem wir Informationen aus Wikidata, Wikipedia und der von der Deutschen Nationalbibliothek verwalteten Gemeinsamen Normdatei (GND) vereinigt haben. Auf diese Weise konnten wir nicht nur die Datenmenge erhöhen, sondern auch Datensätze für viele bisher unzugängliche Sprachen erstellen. Wir haben text2ddc so weit optimiert, dass wir einen F-score von 87.4% erzielen für die 98 Klassen der zweiten DDC-Stufe. Die Vorverarbeitung von TextImager und die Disambiguierung durch fastSense hatten einen großen Einfluss darauf. Für jedes Textstück berechnet text2ddc eine Wahrscheinlichkeitsverteilung über die DDC-Klassen berechnen
Der klassifikatorinduzierte semantische Raum von text2ddc wurde auch zur Verbesserung weiterer NLP-Methoden genutzt. Dazu gehört auch text2wiki, ein Framework für automatisches Tagging nach dem Wikipedia-Kategoriensystem. Auch hier haben wir einen klassifikatorinduzierten semantischen Raum, aber diesmal basiert er auf dem Wikipedia-Kategoriensystem. Ein großer Vorteil dieses Modells ist die Präzision und Tiefe der behandelten Themen und das sich ständig weiterentwickelnde Kategoriesystem. Damit sind auch die Kriterien eines offenen Themenmodells erfüllt. Um die Vorteile von text2wiki zu demonstrieren, haben wir anschließend die von text2wiki bereitgestellten Themenvektoren verwendet, um text2ddc zu verbessern, so dass sich beide Systeme gegenseitig verbessern können. Die Synergie zwischen den erstellten Methoden in dieser Dissertation war entscheidend für den Erfolg jeder einzelnen Methode.
Principles of cognitive maps
(2021)
This thesis analyses the concept of a cognitive map in the research fields of geography. Cognitive mapping research is essential as it investigates the relations between cognitive maps and external representations of space that people regularly use by acquiring spatial knowledge, such as maps in geographic information systems. Moreover, cognitive maps, when expanded on semantic maps, explain the relations between people and things in a non-physically environment, where the considered space is not spanned by distance but with other non-spatially variables. Nevertheless, cognitive maps are often distorted. Although a good formation of a cognitive map is vital in navigation processes, cognitive distortions are barely investigated in the field of geography. By analyzing the relevant work, especially Tobler’s first law of geography, a new lexical variant of Tobler’s first law could be stated that could presumably describe a specific distortion in the processing of landmarks in cognitive maps.
Die folgende Arbeit handelt von einem Human Computer Interaction Interface, welches es gestattet, mit Hilfe von Gesten zu schreiben. Das System ermöglicht seinen Nutzern, neue Gesten hinzuzufügen und zu verwenden. Da Gesten besser erkannt werden können, je genauer die Darstellung der Hände ist, wird diese durch Datenhandschuhe an den Computer übertragen. Die Hände werden einerseits in der Virtual Reality (VR) dargestellt, damit sie der Nutzer sieht. Andererseits werden die Daten, die die Gestenerkennung benötigt, an das Interface weitergeleitet. Die Erkennung der Gesten wird mit Hilfe eines Neuronales Netz (NN) implementiert. Dieses ist in der Lage, Gesten zu unterscheiden, sofern es genügend Trainingsdaten erhalten hat. Die genutzten Gesten sind entweder einhändig oder beidhändig auszuführen. Die Aussagen der Gesten beziehen sich in dieser Arbeit vor allem auf relationale Operatoren, die Beziehungen zwischen Objekten ausdrücken, wie beispielsweise „gleich“ oder „größer gleich“. Abschließend wird in dieser Arbeit ein System geschaffen, das es ermöglicht, mit Gesten Sätze auszudrücken. Dies betrifft das sogenannte gestische Schreiben nach Mehler, Lücking und Abrami 2014. Zu diesem Zweck befindet sich der Nutzer in einem virtuellen Raum mit Objekten, die er verknüpfen kann, wobei er Sätze in einem relationalen Kontext manifestiert.
Deep learning with neural networks seems to have largely replaced traditional design of computer vision systems. Automated methods to learn a plethora of parameters are now used in favor of previously practiced selection of explicit mathematical operators for a specific task. The entailed promise is that practitioners no longer need to take care of every individual step, but rather focus on gathering big amounts of data for neural network training. As a consequence, both a shift in mindset towards a focus on big datasets, as well as a wave of conceivable applications based exclusively on deep learning can be observed.
This PhD dissertation aims to uncover some of the only implicitly mentioned or overlooked deep learning aspects, highlight unmentioned assumptions, and finally introduce methods to address respective immediate weaknesses. In the author’s humble opinion, these prevalent shortcomings can be tied to the fact that the involved steps in the machine learning workflow are frequently decoupled. Success is predominantly measured based on accuracy measures designed for evaluation with static benchmark test sets. Individual machine learning workflow components are assessed in isolation with respect to available data, choice of neural network architecture, and a particular learning algorithm, rather than viewing the machine learning system as a whole in context of a particular application. Correspondingly, in this dissertation, three key challenges have been identified: 1. Choice and flexibility of a neural network architecture. 2. Identification and rejection of unseen unknown data to avoid false predictions. 3. Continual learning without forgetting of already learned information. These latter challenges have already been crucial topics in older literature, alas, seem to require a renaissance in modern deep learning literature. Initially, it may appear that they pose independent research questions, however, the thesis posits that the aspects are intertwined and require a joint perspective in machine learning based systems. In summary, the essential question is thus how to pick a suitable neural network architecture for a specific task, how to recognize which data inputs belong to this context, which ones originate from potential other tasks, and ultimately how to continuously include such identified novel data in neural network training over time without overwriting existing knowledge.
Thus, the central emphasis of this dissertation is to build on top of existing deep learning strengths, yet also acknowledge mentioned weaknesses, in an effort to establish a deeper understanding of interdependencies and synergies towards the development of unified solution mechanisms. For this purpose, the main portion of the thesis is in cumulative form. The respective publications can be grouped according to the three challenges outlined above. Correspondingly, chapter 1 is focused on choice and extendability of neural network architectures, analyzed in context of popular image classification tasks. An algorithm to automatically determine neural network layer width is introduced and is first contrasted with static architectures found in the literature. The importance of neural architecture design is then further showcased on a real-world application of defect detection in concrete bridges. Chapter 2 is comprised of the complementary ensuing questions of how to identify unknown concepts and subsequently incorporate them into continual learning. A joint central mechanism to distinguish unseen concepts from what is known in classification tasks, while enabling consecutive training without forgetting or revisiting older classes, is proposed. Once more, the role of the chosen neural network architecture is quantitatively reassessed. Finally, chapter 3 culminates in an overarching view, where developed parts are connected. Here, an extensive survey further serves the purpose to embed the gained insights in the broader literature landscape and emphasizes the importance of a common frame of thought. The ultimately presented approach thus reflects the overall thesis’ contribution to advance neural network based machine learning towards a unified solution that ties together choice of neural architecture with the ability to learn continually and the capability to automatically separate known from unknown data.
People can describe spatial scenes with language and, vice versa, create images based on linguistic descriptions. However, current systems do not even come close to matching the complexity of humans when it comes to reconstructing a scene from a given text. Even the ever-advancing development of better and better Transformer-based models has not been able to achieve this so far. This task, the automatic generation of a 3D scene based on an input text, is called text-to-3D scene generation. The key challenge, and focus of this dissertation, now relate to the following topics:
(a) Analyses of how well current language models understand spatial information, how static embeddings compare, and whether they can be improved by anaphora resolution.
(b) Automated resource generation for context expansion and grounding that can help in the creation of realistic scenes.
(c) Creation of a VR-based text-to-3D scene system that can be used as an annotation and active-learning environment, but can also be easily extended in a modular way with additional features to solve more contexts in the future.
(d) Analyze existing practices and tools for digital and virtual teaching, learning, and collaboration, as well as the conditions and strategies in the context of VR.
In the first part of this work, we could show that static word embeddings do not benefit significantly from pronoun substitution. We explain this result by the loss of contextual information, the reduction in the relative occurrence of rare words, and the absence of pronouns to be substituted. But we were able to we have shown that both static and contextualizing language models appear to encode object knowledge, but require a sophisticated apparatus to retrieve it. The models themselves in combination with the measures differ greatly in terms of the amount of knowledge they allow to extract.
Classifier-based variants perform significantly better than the unsupervised methods from bias research, but this is also due to overfitting. The resources generated for this evaluation are later also an important component of point three.
In the second part, we present AffordanceUPT, a modularization of UPT trained on the HICO-DET dataset, which we have extended with Gibsonien/telic annotations. We then show that AffordanceUPT can effectively make the Gibsonian/telic distinction and that the model learns other correlations in the data to make such distinctions (e.g., the presence of hands in the image) that have important implications for grounding images to language.
The third part first presents a VR project to support spatial annotation respectively IsoSpace. The direct spatial visualization and the immediate interaction with the 3D objects should make the labeling more intuitive and thus easier. The project will later be incorporated as part of the Semantic Scene Builder (SeSB). The project itself in turn relies on the Text2SceneVR presented here for generating spatial hypertext, which in turn is based on the VAnnotatoR. Finally, we introduce Semantic Scene Builder (SeSB), a VR-based text-to-3D scene framework using Semantic Annotation Framework (SemAF) as a scheme for annotating semantic relations. It integrates a wide range of tools and resources by utilizing SemAF and UIMA as a unified data structure to generate 3D scenes from textual descriptions and also supports annotations. When evaluating SeSB against another state-of-the-art tool, it was found that our approach not only performed better, but also allowed us to model a wider variety of scenes. The final part reviews existing practices and tools for digital and virtual teaching, learning, and collaboration, as well as the conditions and strategies needed to make the most of technological opportunities in the future.
Die folgende Arbeit handelt von einer Text2Scene Anwendung, welche in der Virtual Reality (VR) umgesetzt wurde. Das System ermöglicht es den Usern aus einer Beschreibung einer Szene, diese virtuell nachzustellen. Dies bietet eine neue Art der Interaktion mit einem Text, die die visuelle Komponente hervorhebt und somit eine Geschichte auf neue Wege erfahrbar macht.
Dazu kann der User einen fertigen Text entweder vom Server zu laden oder einen eigenen erstellen, der dann automatisch verarbeitet wird. Dabei werden die vorhanden physischen Objekte im Text automatisch erkannt und dem User als 3D-Objekte in der virtuellen Umgebung zur Verfügung gestellt. Diese können dann manuell platziert werden und erzeugen dadurch die Szene, die im Ausgangstext beschrieben wurde. Das Ziel der Textverarbeitung ist eine möglichst genaue Beschreibung der Objekte, damit diese zielgerichtet in der Objektdatenbank gesucht werden können.
Bei der Textverarbeitung wird besonderer Wert auf das Erkennen von Teil-Ganz Beziehungen gelegt. Sodass Objekte, die im Text vorkommen und ein Holonym besitzen, automatisch mit diesem verknüpft werden. Gleichzeitig wird die Teil-Ganz Beziehung aber auch in die andere Richtung genauer betrachtet. Die Textverarbeitung soll ferner dazu in der Lage sein, Objekte genauer zu spezifizieren und an den Kontext des Textes anzupassen. Weiterhin wurde das Natural Language Processing (NLP) so ausgebaut, dass der Kontext des Textes erkannt wird und die Objekte entsprechend kategorisiert werden. Die Textverarbeitung wird mithilfe eines Neuronalen Netzes implementiert. Die verwendeten Tools zur Erkennung von Teil-Ganz Beziehungen, Kontext und Spezifikation von Objekten wurden anhand von Texteingaben nach der Genauigkeit der Ausgabe evaluiert.
Zur Nutzung der Textverarbeitung wurde eine virtuelle Szene entwickelt, die das Erstellen von eigenen Szenen aus vorher geladenen beziehungsweise eingegebenen Texten ermöglicht.
Dazu kann der Nutzer manuell oder automatisch Objekte laden lassen, die er dann platzieren kann.
High-resolution, compactness, scalability, efficiency – these are the critical requirements which imaging radar systems have to fulfil in applications such as environmental monitoring, cloud mapping, body sensing or autonomous driving. This thesis presents a modular millimetre-wave frequency modulated continuous-wave (FMCW) radar front-end solution intended for such applications. High-resolution is achieved by enlarging the operating frequency band of the radar system. This can be realized at millimetre-wave frequencies due to the large spectrum availability. Furthermore, the size of components decreasing with increasing frequency makes millimetre-wave systems a good candidate for compactness. However, the full integration of radar front-ends is a challenge at millimetre-wave frequencies due to poor signal integrity and spectral purity, which are essential for imaging applications. The proposed radar uses an alternative technique and tackles this limitation by featuring highly-integrable architectures, specifically the Hartley architecture for signal conversion and enhanced push-pull amplifier for harmonic suppression. The resolution of imaging radars can be further improved by increasing the number of transmitters and receivers. This has spurred the investigation of spectrum, time and energy-efficient multiplexing techniques for multi-input multi-output (MIMO) radar systems. The FMCW radar architecture proposed in this thesis is based on code-division technique using intra-pulse, also called intra-chirp modulation. This advanced scalable and non-complex solution, made possible by the latest achievements on direct digital synthesis for signal generation, guarantees signal integrity and compact size implementation. The proposed architecture is investigated by a thorough system analysis. A transmitter module and a receiver module for a 35 GHz imaging radar prototype are designed, fabricated and fully characterized to validate the feasibility of our novel approach for high-resolution highly-integrated MIMO front-ends.
AI-based computer vision systems play a crucial role in the environment perception for autonomous driving. Although the development of self-driving systems has been pursued for multiple decades, it is only recently that breakthroughs in Deep Neural Networks (DNNs) have led to their widespread application in perception pipelines, which are getting more and more sophisticated. However, with this rising trend comes the need for a systematic safety analysis to evaluate the DNN's behavior in difficult scenarios as well as to identify the various factors that cause misbehavior in such systems. This work aims to deliver a crucial contribution to the lacking literature on the systematic analysis of Performance Limiting Factors (PLFs) for DNNs by investigating the task of pedestrian detection in urban traffic from a monocular camera mounted on an autonomous vehicle. To investigate the common factors that lead to DNN misbehavior, six commonly used state-of-the-art object detection architectures and three detection tasks are studied using a new large-scale synthetic dataset and a smaller real-world dataset for pedestrian detection. The systematic analysis includes 17 factors from the literature and four novel factors that are introduced as part of this work. Each of the 21 factors is assessed based on its influence on the detection performance and whether it can be considered a Performance Limiting Factor (PLF). In order to support the evaluation of the detection performance, a novel and task-oriented Pedestrian Detection Safety Metric (PDSM) is introduced, which is specifically designed to aid in the identification of individual factors that contribute to DNN failure. This work further introduces a training approach for F1-Score maximization whose purpose is to ensure that the DNNs are assessed at their highest performance. Moreover, a new occlusion estimation model is introduced to replace the missing pedestrian occlusion annotations in the real-world dataset. Based on a qualitative analysis of the correlation graphs that visualize the correlation between the PLFs and the detection performance, this study identified 16 of the initial 21 factors as being PLFs for DNNs out of which the entropy, the occlusion ratio, the boundary edge strength, and the bounding box aspect ratio turned out to be most severely affecting the detection performance. The findings of this study highlight some of the most serious shortcomings of current DNNs and pave the way for future research to address these issues.