In our daily life, we carry out many tasks, such as typing, playing tennis, and playing the piano, without even noticing that sequence learning is involved. No matter how simple or complex they are, these tasks require the sequential planning and execution of a series of movements. As an ability of primary importance in one’s life, and one that everyone manages to learn, action-sequence learning has been studied by researchers from different fields: psychologists, neurophysiologists, and roboticists. Within sequence learning, perceptual versus motor learning and implicit versus explicit learning have been studied and discussed independently.
We are interested in infancy research because infants, with underdeveloped brain functions and limited motor abilities, have little experience with the world and have not yet built internal models of how to interpret it. A series of infant experiments in the 1980s provided evidence that infants can rapidly develop anticipatory eye movements for visual events. Even when infants have no control over these spatio-temporal patterns, they can respond prior to the onset of a visual event, a behavior referred to as "anticipation".
In this work, we applied a gaze-contingent paradigm using real-time eye tracking to put 6- and 8-month-old infants in direct control of their visual surroundings. This paradigm allows an infant to change an image on a screen by looking at a peripheral red disc, which functions as a switch. We found that infants quickly learn to perform eye movements to trigger the appearance of new stimuli and that they anticipate the consequences of their actions at an early stage of the experiment.
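As an illustration of the control logic behind such a gaze-contingent switch, the following Python sketch triggers a stimulus change once the gaze dwells on the disc region long enough. The screen coordinates, dwell threshold, and the `get_gaze_position`/`show_stimulus` helpers are hypothetical placeholders, not the actual experimental software.

```python
import math
import time

DISC_CENTER = (900, 540)   # hypothetical screen position of the red disc (px)
DISC_RADIUS = 60           # hypothetical disc radius (px)
DWELL_MS = 200             # hypothetical dwell time required to trigger a switch

def gaze_on_disc(gaze_xy):
    """Return True if the current gaze sample falls on the red disc."""
    dx = gaze_xy[0] - DISC_CENTER[0]
    dy = gaze_xy[1] - DISC_CENTER[1]
    return math.hypot(dx, dy) <= DISC_RADIUS

def run_trial(get_gaze_position, show_stimulus, stimuli):
    """Swap in the next stimulus once the infant fixates the disc long enough."""
    dwell_start = None
    for stimulus in stimuli:
        while True:
            gaze = get_gaze_position()          # one sample from the eye tracker
            if gaze is not None and gaze_on_disc(gaze):
                if dwell_start is None:
                    dwell_start = time.monotonic()
                elif (time.monotonic() - dwell_start) * 1000 >= DWELL_MS:
                    show_stimulus(stimulus)     # looking at the disc acts as a switch
                    dwell_start = None
                    break
            else:
                dwell_start = None              # fixation left the disc; reset timer
```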
The attention shift from learning one stimulus to the next novel stimulus is important in sequence learning. For the test phase of infant visual habituation with two objects, we propose a new theory to explain the familiarity-to-novelty shift: an infant’s interest in a stimulus is related to its learning progress, i.e., the improvement of performance. As a consequence, infants prefer the stimulus for which their current learning progress is maximal, naturally giving rise to a familiarity-to-novelty shift in certain situations. Our network model predicts that the familiarity-to-novelty shift emerges only for complex stimuli that produce bell-shaped learning curves after brief familiarization, but not for simple stimuli that produce exponentially decreasing learning curves or for long familiarization times, which is consistent with experimental results. The shift is thus dynamic, depending on both the infant’s learning efficiency and the task complexity.
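To illustrate the learning-progress account (a toy numerical sketch, not the thesis’s network model), one can contrast the progress signals derived from the two curve shapes; all curve parameters below are made-up assumptions:

```python
import numpy as np

t = np.linspace(0, 10, 200)   # looking time (arbitrary units)

# Assumed learning curves (error as a function of looking time):
# a complex stimulus yields a sigmoidal error curve, a simple stimulus
# yields an exponentially decreasing one.
error_complex = 1.0 / (1.0 + np.exp(2.0 * (t - 4.0)))
error_simple = np.exp(-0.8 * t)

# Learning progress = rate of improvement (negative slope of the error).
progress_complex = -np.gradient(error_complex, t)
progress_simple = -np.gradient(error_simple, t)

# Under the learning-progress hypothesis, the infant attends to whichever
# stimulus currently offers the larger progress. For the complex stimulus,
# progress first rises (familiarity preference) and later falls below what
# a novel stimulus offers (novelty preference); for the simple stimulus,
# progress only decreases, so no shift is predicted.
peak_time = t[np.argmax(progress_complex)]
print(f"progress on the complex stimulus peaks at t = {peak_time:.2f}")
print(f"progress on the simple stimulus is maximal at t = {t[np.argmax(progress_simple)]:.2f}")
```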
We know that, for both infants and adults, performance on certain motor-sequence tasks can be improved through practice. However, adults usually have to perform complex tasks in complicated environments, and learning multiple tasks is unavoidable in daily life. Existing research on multiple-task learning has produced puzzling and seemingly contradictory results. On the one hand, a wide variety of proactive and retroactive interference effects have been observed when multiple tasks have to be learned. On the other hand, some studies have reported facilitation and transfer of learning between different tasks.
To characterize the interactions in multiple-task learning and to find an optimal training schedule, we use a recurrent neural network to model a series of experiments on movement-sequence learning. The network learns to carry out the correct movement sequences through training and reproduces differences between training schedules, such as blocked vs. random training, observed in psychophysics experiments. The model also shows striking similarity to human performance and makes predictions about task similarity and different training schedules.
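A minimal sketch of how such schedules can be contrasted, assuming a simple Elman-style RNN in PyTorch with made-up task dimensions (this is not the thesis’s actual model):

```python
import torch
import torch.nn as nn

# Assumed setup: three movement sequences, each a series of key-press
# targets; the RNN must reproduce the correct sequence from a task cue.
n_keys, seq_len, n_tasks = 4, 5, 3
torch.manual_seed(0)
sequences = torch.randint(0, n_keys, (n_tasks, seq_len))  # target key per step

class SequenceRNN(nn.Module):
    def __init__(self, hidden=32):
        super().__init__()
        self.rnn = nn.RNN(input_size=n_tasks, hidden_size=hidden, batch_first=True)
        self.out = nn.Linear(hidden, n_keys)

    def forward(self, cue):
        # Hold the task cue as input on every time step.
        x = cue.unsqueeze(1).repeat(1, seq_len, 1)
        h, _ = self.rnn(x)
        return self.out(h)                        # (batch, seq_len, n_keys)

def make_schedule(kind, reps=20):
    """Blocked: AAAA BBBB CCCC...  Random: tasks interleaved at random."""
    if kind == "blocked":
        return [t for t in range(n_tasks) for _ in range(reps)]
    return torch.randint(0, n_tasks, (n_tasks * reps,)).tolist()

def train(kind):
    model = SequenceRNN()
    opt = torch.optim.Adam(model.parameters(), lr=1e-2)
    loss_fn = nn.CrossEntropyLoss()
    for task in make_schedule(kind):
        cue = torch.eye(n_tasks)[task].unsqueeze(0)   # one-hot task cue
        logits = model(cue)
        loss = loss_fn(logits.squeeze(0), sequences[task])
        opt.zero_grad(); loss.backward(); opt.step()
    return model

model_blocked = train("blocked")   # compare test error of the two schedules
model_random = train("random")
```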
In conclusion, this thesis presents studies of action-sequence learning in infants and in recurrent neural networks. We carried out a gaze-contingent experiment to study infants’ rapid anticipation of the outcomes of their own actions, and we constructed two recurrent neural network models: one explaining infant attention shifts in visual habituation, the other addressing task similarity and training schedules in motor-sequence learning in adults.
Due to the resurgence of data-hungry models (such as deep convolutional neural networks), there is an increasing demand for large-scale labeled datasets and benchmarks in the field of computer vision (CV). However, collecting real data across diverse scene contexts along with high-quality annotations is often expensive and time-consuming, especially for detailed pixel-level prediction tasks such as semantic segmentation. To address the scarcity of real-world training sets, recent works have proposed the use of computer graphics (CG) generated data to train and/or characterize the performance of modern CV systems. CG-based virtual worlds provide easy access to ground-truth annotations and control over scene states. Most of these works utilized training data simulated from video games and pre-designed virtual environments and demonstrated promising results. However, little effort has been devoted to the systematic generation of massive quantities of sufficiently complex synthetic scenes for training scene-understanding algorithms. In this work, we develop a full pipeline for simulating large-scale datasets along with per-pixel ground-truth information. Our simulation pipeline consists of two main components: (a) a stochastic scene-generative model that automatically synthesizes traffic-scene layouts using marked point processes coupled with 3D CAD objects and factor potentials, and (b) an annotated-image rendering tool that renders a sampled 3D scene as an RGB image with a chosen rendering method, along with pixel-level annotations such as semantic labels, depth, and surface normals. This pipeline is capable of automatically generating and rendering a potentially infinite variety of outdoor traffic scenes that can be used to train convolutional neural networks (CNNs).
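As a simplified, one-dimensional illustration of component (a) (the actual model operates on full 3D layouts), the sketch below samples a marked point process along a road segment and scores layouts with a pairwise factor potential; all constants and the rejection-style sampler are assumptions:

```python
import numpy as np

rng = np.random.default_rng(0)

ROAD_LENGTH = 100.0   # assumed road-segment length (m)
INTENSITY = 0.06      # assumed point-process intensity (objects per metre)
CAR_LENGTH = 4.5      # assumed object extent used by the overlap factor (m)

def sample_points():
    """Marked point process: Poisson number of objects, uniform positions,
    each point marked with an object type drawn from a categorical prior."""
    n = rng.poisson(INTENSITY * ROAD_LENGTH)
    positions = rng.uniform(0.0, ROAD_LENGTH, size=n)
    marks = rng.choice(["car", "van", "truck"], size=n, p=[0.7, 0.2, 0.1])
    return list(zip(positions, marks))

def factor_potential(layout):
    """Pairwise factor: heavy energy penalty when two objects overlap."""
    energy = 0.0
    for i in range(len(layout)):
        for j in range(i + 1, len(layout)):
            if abs(layout[i][0] - layout[j][0]) < CAR_LENGTH:
                energy += 10.0
    return energy

def sample_scene(max_tries=100):
    """Keep the lowest-energy layout among several proposals; this layout
    would then be instantiated with 3D CAD models and passed to the
    renderer for RGB images plus per-pixel labels."""
    return min((sample_points() for _ in range(max_tries)), key=factor_potential)

scene = sample_scene()
```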
However, several recent works, including our own initial experiments, demonstrated that CV models trained naively on simulated data lack generalization capability on real-world scenes. This raises fundamental questions about what simulated data lacks compared to real data and how to use it effectively. Furthermore, there has been a long debate since the 1980s about the usefulness of CG-generated data for tuning CV systems. In particular, the impact of modeling errors and computational rendering approximations, arising from various choices in the rendering pipeline, on the generalization performance of trained CV systems is still unclear. In this thesis, we take a case study in the context of traffic scenarios to empirically analyze the performance degradation when CV systems trained on virtual data are transferred to real data. We first explore performance tradeoffs due to the choice of the rendering engine (e.g., a Lambertian shader (LS), ray tracing (RT), and Monte Carlo path tracing (MCPT)) and its parameters. A CNN architecture for semantic segmentation, DeepLab, is chosen as the CV system under evaluation. In our case study involving traffic scenes, a CNN trained on CG data generated with photorealistic rendering methods (RT or MCPT) already shows reasonably good performance on real-world test data from the CityScapes benchmark. Using samples from an elementary rendering method, LS, degraded the CNN’s performance by nearly 20%. This result indicates that training data must be sufficiently photorealistic for the trained CNN models to generalize well. Furthermore, physics-based MCPT rendering improved performance by 6%, but at more than three times the rendering time. When this MCPT-generated dataset is augmented with just 10% of the real-world training data from the CityScapes dataset, the performance achieved is comparable to that of training the CNN on the complete CityScapes dataset.
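The data-mixing protocol in the last experiment might be sketched as follows; the 10% real-data fraction is the only detail taken from the text, and the dataset items and shuffling are placeholders:

```python
import random

def mixed_training_list(synthetic_items, real_items, real_fraction=0.10, seed=0):
    """Combine the full synthetic (e.g., MCPT-rendered) training set with a
    small random subset of the real training split, then shuffle the result
    so batches interleave both domains."""
    rng = random.Random(seed)
    n_real = int(len(real_items) * real_fraction)
    subset = rng.sample(real_items, n_real)      # 10% of the real data
    mixed = list(synthetic_items) + subset
    rng.shuffle(mixed)
    return mixed
```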
The next aspect we study in this thesis is the impact of the scene-generation model’s parameter settings on the generalization performance of CNN models trained on the generated data. To this end, we first propose an algorithm to estimate the parameters of our scene-generation model from an unlabeled real-world dataset from the target domain. This unsupervised tuning approach utilizes the concept of generative adversarial training: it adapts the generative model by measuring the discrepancy between generated and real data in terms of their separability in the feature space of a deep, discriminatively trained classifier. Our method iteratively estimates the posterior density over the prior distributions of the generative graphical model used in the simulation. Initially, we assume uniform priors over the scene parameters described by our generative graphical model. As iterations proceed, these uniform priors are sequentially updated toward distributions over the simulation-model parameters that lead to simulated data whose statistics are closer to those of the unlabeled target data.
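The iterative prior-updating idea can be sketched for a single scalar scene parameter as follows; `simulate`, `train_discriminator`, and the Gaussian refit are stand-ins for the full simulation and classifier machinery, not the thesis’s algorithm:

```python
import numpy as np

rng = np.random.default_rng(0)

def update_prior(prior_samples, real_features, simulate, train_discriminator,
                 n_keep=100):
    """One iteration: score each sampled parameter by how separable its
    simulated data is from the real data, keep the least separable ones,
    and refit a (Gaussian) prior around them."""
    scores = []
    for theta in prior_samples:
        fake_features = simulate(theta)          # render data with parameter theta
        # Lower discriminator accuracy => generated data is closer to real.
        scores.append(train_discriminator(real_features, fake_features))
    order = np.argsort(scores)                   # least-separable thetas first
    survivors = np.asarray(prior_samples)[order[:n_keep]]
    return survivors.mean(), survivors.std() + 1e-3

# Sketch of the outer loop: start from a uniform prior, then tighten it.
# theta_samples = rng.uniform(0.0, 1.0, size=1000)
# for _ in range(n_iterations):
#     mu, sigma = update_prior(theta_samples, real_feats, simulate, train_disc)
#     theta_samples = rng.normal(mu, sigma, size=1000)
```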
...