Unsupervised relational feature learning for vision

Konda, Kishore Reddy

This thesis contributes to the field of machine learning, with a specific focus on methods for learning relations between inputs. Learning relationships between images is the most common primitive in vision. Relationships across images play an important role in many vision tasks, among them motion estimation, activity recognition, stereo vision, multi-view geometry and visual odometry. Many of these tasks depend mainly on motion and disparity cues, which are inferred from the relations across multiple image pairs. The approaches presented in this thesis mainly deal with, but are not limited to, learning representations for motion and depth. This thesis by articles consists of five articles which present relational feature learning models along with their applications in computer vision.

In the first article, we present an approach for encoding motion in videos. To this end, we show that the detection of spatial transformations can be viewed as the detection of coincidence, or synchrony, between the given sequence of frames and a sequence of features which are related by the transformation we wish to detect. Learning to detect synchrony is possible by introducing "multiplicative interactions" into the hidden units of single-layer sparse coding models. We show that the learned motion representations, when employed for activity recognition, achieve competitive performance on multiple benchmarks.

Stereo vision is an important challenge in computer vision and is useful for many applications in that field. In the second article, we extend the energy-based learning models, which were previously used for motion encoding, to the context of depth perception. Given the common architecture of the models for encoding motion and depth, we show that it is possible to define a single model for learning a unified representation of both cues. Our experimental results show that learning a combined representation for depth and motion makes it possible to achieve state-of-the-art performance at the task of 3-D activity analysis, and to perform better than existing hand-engineered 3-D motion features.

The autoencoder is a popular unsupervised method for learning an efficient encoding of a given set of data samples. Regularized autoencoders, which are typically used to learn over-complete and sparse representations of the input data, were shown to fail on intrinsically high-dimensional data such as videos. In the third article, we investigate the reason for this behavior. It can be observed that regularized autoencoders typically learn negative hidden unit biases. We show that the learning of negative biases results from the hidden units being responsible for both the sparsity and the representation of the input data. As a result, the behavior of the model resembles that of clustering methods, which would require an exponentially large number of features to model intrinsically high-dimensional data. Based on this understanding, we propose a new activation function which decouples the two roles of the hidden layer and uses a linear encoding. This makes it possible to learn representations of data with very high intrinsic dimensionality. We also show that the gating connections in bi-linear models, and in the single-layer models from articles one and two of this thesis, can be thought of as a way to attain a linear encoding scheme, which allows them to learn good representations of videos.

Visual odometry is the task of inferring the egomotion of a moving object from visual information such as images and videos. It is used primarily for localization and has many applications in robotics and navigation. The work in article four was motivated by the idea of applying deep learning techniques, which are successful in many vision tasks, to visual odometry. The visual odometry task mainly requires inferring motion and depth information from visual input, which can then be mapped to velocity and change in direction. We use the relational feature models presented in articles one and two to infer a combined motion and depth representation from stereo video sequences. The combined representation is then mapped to discrete velocity and change-in-direction labels using convolutional neural networks. Our approach is an end-to-end deep learning-based architecture which uses a single type of computational model and learning rule. Preliminary results show that the architecture is capable of learning the mapping from input video to egomotion.

Activity recognition is a challenging computer vision task with many real-world applications. It is well known that applying computer vision research in real-time applications is hard. In the fifth article of this thesis, we present a real-time activity recognition system built on deep learning methods. Our approach uses energy-based relational feature learning models to compute local motion features directly from videos. A bag-of-words over the local motion features is used to analyze the activity in a given video sequence. We implement this system on a distributed computational platform and demonstrate its performance on the iCub robot. Using GPUs, we demonstrate real-time performance, which makes the deployment of activity recognition systems in real-world scenarios possible.

Author:	Kishore Reddy Konda
URN:	urn:nbn:de:hebis:30:3-394070
Place of publication:	Frankfurt am Main
Advisor:	Visvanathan Ramesh, Roland Memisevic
Document Type:	Doctoral Thesis
Language:	English
Date of Publication (online):	2016/02/12
Year of first Publication:	2016
Publishing Institution:	Universitätsbibliothek Johann Christian Senckenberg
Granting Institution:	Johann Wolfgang Goethe-Universität
Date of final exam:	2015/11/10
Release Date:	2016/02/12
Page Number:	152
HeBIS-PPN:	369824075
Institutes:	Computer Science and Mathematics / Computer Science
Dewey Decimal Classification:	5 Natural sciences and mathematics / 51 Mathematics / 510 Mathematics
Collections:	University publications
Licence (German):	German copyright law (Deutsches Urheberrecht)

Open Access
