Simulations for cognitive vision

  • Due to the resurgence of data-hungry models (such as deep convolutional neural nets), there is an increasing demand for large-scale labeled datasets and benchmarks in the field of computer vision (CV). However, collecting real data across diverse scene contexts along with high-quality annotations is often expensive and time-consuming, especially for detailed pixel-level prediction tasks such as semantic segmentation. To address the scarcity of real-world training sets, recent works have proposed the use of computer graphics (CG) generated data to train and/or characterize the performance of modern CV systems. CG-based virtual worlds provide easy access to ground truth annotations and control over scene states. Most of these works utilized training data simulated from video games and pre-designed virtual environments and demonstrated promising results. However, little effort has been devoted to the systematic generation of massive quantities of sufficiently complex synthetic scenes for training scene understanding algorithms.

In this work, we develop a full pipeline for simulating large-scale datasets along with per-pixel ground truth information. Our simulation pipeline consists of two main components: (a) a stochastic scene generative model that automatically synthesizes traffic scene layouts using marked point processes coupled with 3D CAD objects and factor potentials (an illustrative sketch of such a sampler follows the abstract), and (b) an annotated-image rendering tool that renders the sampled 3D scene as an RGB image with a chosen rendering method, along with pixel-level annotations such as semantic labels, depth, and surface normals. This pipeline is capable of automatically generating and rendering a potentially infinite variety of outdoor traffic scenes that can be used to train convolutional neural nets (CNNs). However, several recent works, including our own initial experiments, demonstrated that CV models trained naively on simulated data lack the ability to generalize to real-world scenes. This raises fundamental questions about what simulated data lacks compared to real data and how it can be used effectively. Furthermore, the usefulness of CG-generated data for tuning CV systems has been debated since the 1980s. In particular, the impact of modeling errors and rendering approximations, arising from the various choices in the rendering pipeline, on the generalization performance of trained CV systems is still not clear.

In this thesis, we conduct a case study in the context of traffic scenarios to empirically analyze the performance degradation observed when CV systems trained on virtual data are transferred to real data. We first explore performance tradeoffs due to the choice of rendering method (e.g., Lambertian shader (LS), ray tracing (RT), and Monte Carlo path tracing (MCPT)) and its parameters. DeepLab, a CNN architecture for semantic segmentation, is chosen as the CV system under evaluation. In our case study on traffic scenes, a CNN trained with CG samples generated by photorealistic rendering methods (such as RT or MCPT) already shows reasonably good performance on real-world test data from the CityScapes benchmark. Using samples from an elementary rendering method, i.e., LS, degraded the CNN's performance by nearly 20%. This result indicates that training data must be sufficiently photorealistic for the trained CNN models to generalize well.
Furthermore, the use of physics-based MCPT rendering improved performance by 6%, but at the cost of more than three times the rendering time. When this MCPT-generated dataset is augmented with just 10% of the real-world training data from the CityScapes dataset, the performance achieved is comparable to that of a CNN trained on the complete CityScapes dataset. The next aspect we study in this thesis is the impact of the scene generation model's parameter settings on the generalization performance of CNN models trained with the generated data. To this end, we first propose an algorithm to estimate the scene generation model's parameters given an unlabeled real-world dataset from the target domain. This unsupervised tuning approach utilizes the concept of generative adversarial training: it adapts the generative model by measuring the discrepancy between generated and real data in terms of their separability in the feature space of a deep, discriminatively trained classifier (a simplified sketch of this tuning loop is given below). Our method involves an iterative estimation of the posterior density over the prior distributions of the generative graphical model used in the simulation. Initially, we assume uniform priors over the scene parameters described by our generative graphical model. As iterations proceed, these priors are sequentially updated toward distributions over the simulation model parameters that lead to simulated data whose statistics are closer to those of the unlabeled target data. ...
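
To make the scene generation component more concrete, the sketch below shows how a marked point process with simple pairwise factor potentials could place objects on a ground plane and refine the layout with Metropolis moves. This is a minimal, self-contained illustration: the asset list, the potentials, and all function names (sample_scene, pairwise_potential, etc.) are assumptions made for exposition, not the thesis implementation, which couples the sampled layouts to full 3D CAD models and a rendering engine.

```python
import math
import random

random.seed(0)

# Placeholder "marks"; the thesis attaches full 3D CAD models to each point.
CAD_ASSETS = ["car", "truck", "pedestrian", "tree", "building"]

def pairwise_potential(a, b):
    """Toy factor potential: forbid overlaps, mildly prefer spread-out objects."""
    (xa, ya, _), (xb, yb, _) = a, b
    d = math.hypot(xa - xb, ya - yb)
    return -1e3 if d < 2.0 else -1.0 / d

def scene_score(objects):
    """Sum of pairwise factor potentials (a log-probability up to a constant)."""
    return sum(pairwise_potential(a, b)
               for i, a in enumerate(objects)
               for b in objects[i + 1:])

def sample_scene(n_objects=10, n_steps=500, extent=50.0):
    """Place marked points (position + asset label) and refine by Metropolis moves."""
    scene = [(random.uniform(0, extent), random.uniform(0, extent),
              random.choice(CAD_ASSETS)) for _ in range(n_objects)]
    score = scene_score(scene)
    for _ in range(n_steps):
        idx = random.randrange(n_objects)
        proposal = list(scene)
        proposal[idx] = (random.uniform(0, extent), random.uniform(0, extent),
                         scene[idx][2])
        new_score = scene_score(proposal)
        # Metropolis acceptance on the factor-potential score
        if new_score >= score or random.random() < math.exp(new_score - score):
            scene, score = proposal, new_score
    return scene

layout = sample_scene()
print(layout[:3])
```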
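Similarly, the unsupervised simulator-tuning idea can be summarized by the toy sketch below. It replaces the deep discriminator and the renderer with stand-ins (a least-squares linear probe on two summary statistics and a Gaussian "renderer"), so all names and numbers are illustrative assumptions. Only the overall loop mirrors the approach described above: sample parameter candidates from the current prior, generate data, measure their separability from the unlabeled real data, and re-fit the prior around the candidates the classifier cannot separate.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for the renderer: maps two scene-prior parameters
# (e.g. mean object density, mean object height) to image statistics.
def render_batch(theta, n=200):
    return rng.normal(loc=theta, scale=0.3, size=(n, 2))

# Separability of generated vs. real statistics via a least-squares linear
# probe (the thesis uses a deep, discriminatively trained classifier instead).
def discriminator_accuracy(fake, real):
    X = np.vstack([fake, real])
    y = np.r_[np.zeros(len(fake)), np.ones(len(real))]
    Xb = np.c_[X, np.ones(len(X))]                      # add bias column
    w = np.linalg.lstsq(Xb, 2 * y - 1, rcond=None)[0]   # fit +/-1 targets
    return np.mean((Xb @ w > 0) == (y == 1))

# Unlabeled "real" target statistics (toy data with unknown parameters).
real_stats = rng.normal(loc=[2.0, 1.5], scale=0.3, size=(200, 2))

# Start from a broad uniform prior over the two parameters and iteratively
# shrink it around candidates whose renders the probe cannot separate from
# the real data (accuracy close to chance level, 0.5).
low, high = np.array([0.0, 0.0]), np.array([5.0, 5.0])
for it in range(10):
    candidates = rng.uniform(low, high, size=(50, 2))
    accs = np.array([discriminator_accuracy(render_batch(c), real_stats)
                     for c in candidates])
    best = candidates[np.argsort(accs)[:10]]            # least separable candidates
    low, high = best.min(axis=0), best.max(axis=0)      # updated (narrower) prior
    print(f"iteration {it}: prior range {low.round(2)} .. {high.round(2)}")
```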

Metadata
Author: Venkata Subbarao Veeravasarapu
URN: urn:nbn:de:hebis:30:3-551382
Place of publication: Frankfurt am Main
Referee: Visvanathan Ramesh, Constantin Rothkopf
Advisor: Visvanathan Ramesh
Document Type: Doctoral Thesis
Language: English
Date of Publication (online): 2020/07/08
Year of first Publication: 2018
Publishing Institution: Universitätsbibliothek Johann Christian Senckenberg
Granting Institution: Johann Wolfgang Goethe-Universität
Date of final exam: 2019/10/24
Release Date: 2020/07/13
Page Number: 113
HeBIS-PPN: 466787650
Institutes: Informatik und Mathematik
Dewey Decimal Classification: 0 Computer science, information, general works / 00 Computer science, knowledge, systems / 004 Data processing; computer science
Collections: Universitätspublikationen
Licence (German): Deutsches Urheberrecht