Object recognition in active vision

Object recognition is such an everyday task that it seems almost mundane: we look at the spaces around us and name things seemingly effortlessly. Yet understanding how the process of object recognition unfolds remains a great challenge for vision science. Models derived from abstract stimuli have little predictive power for the way people explore "naturalistic" scenes, i.e. unaltered photographs of real scenes, and the objects in them. This thesis therefore focuses on how objects in such naturalistic scenes are recognized. People can, for instance, find objects in scenes much more efficiently than models derived from abstract stimuli would predict. To explain this kind of behavior, we describe scenes not solely in terms of physical characteristics (colors, contrasts, lines, orientations, etc.) but by the meaning of the whole scene (kitchen, street, bathroom, etc.) and of the objects within it (oven, fire hydrant, soap, etc.). Object recognition then refers to the process by which the visual system assigns meaning to an object.

The relationships between objects in a naturalistic scene are far from random. Objects do not typically float in mid-air and cannot occupy the same physical space; moreover, certain scenes typically contain certain objects, and a fire hydrant in a kitchen would strike the average observer as an anomaly. These "rules" can be described as the "grammar" of the scene. Scene grammar is involved in multiple aspects of scene and object perception; there is, for instance, evidence that overall scene category influences the identification of individual objects. Typically, experiments that directly target object recognition do not involve eye movements, while studies that do involve eye movements are aimed not at object recognition but at gaze allocation. Yet eye movements are abundant in everyday life, occurring roughly four times per second.
Here we therefore present two studies that use eye movements to investigate when object recognition takes place as people move their eyes from object to object in a scene. A third study applies novel methods for analyzing data from combined eye movement and neurophysiological (EEG) measurements.

One way to study object perception is to violate the grammar of a scene by placing an object in a scene in which it does not typically occur, and measuring how long people look at this so-called semantic inconsistency compared to an object one would expect in the given scene. Typically, people look at semantic inconsistencies longer and more often, signaling that these objects require extra processing. In Study 1 we make use of this behavior to ask whether object recognition still happens when it is not necessary for the task. We designed a search task that made it unnecessary to register object identities. Still, participants looked at inconsistent objects longer than at consistent objects, signaling that they did indeed process object and scene identities. Interestingly, the inconsistent objects were not remembered better than the consistent ones. We conclude that object and scene identities (their semantics) are processed in an obligatory fashion, even when people are engaged in a task that does not require it.

In Study 2, we investigate more closely when the first signs of object semantic processing become visible while people make eye movements. Although the finding that semantic inconsistencies are looked at longer and more often has been replicated many times, most of these replications consider gaze duration over a whole trial. The question of when during a trial differences between consistency conditions first occur has yielded mixed results: some studies only report effects of semantic consistency that accumulate over whole trials, whereas others report influences already on the duration of the very first fixation on an inconsistent object.
In Study 2 we argue that prior studies reporting first fixation duration effects may have suffered from methodological shortcomings, such as small trial and sample sizes, in addition to the use of non-robust statistics and data descriptions. We show that a subset of fixations may be influenced more than others (as indicated by more strongly skewed fixation duration distributions). Further analyses show that the relationship between the effect of object semantics on fixation durations and its effect on the oft-replicated cumulative measures is not straightforward (fixation duration distributions do not predict dwell-time effects), but the effects on both measures may be related in a different way. Possibly, the processing of object meaning unfolds over multiple fixations only when a single fixation does not suffice. However, it would be very valuable to be able to study how processing continues after a fixation ends.

Study 3 aims to make such a measure possible by combining EEG recordings with eye tracking. Analyzing combined eye tracking–EEG data is difficult because neural responses vary with eye movement characteristics; moreover, fixations follow one another in short succession, causing the neural responses to successive fixations to overlap in time. These issues make the well-established approach of averaging single-trial EEG data into ERPs problematic. As an alternative, we propose the use of multiple regression, explicitly modelling both temporal overlap and eye movement parameters. In Study 3 we show that such a method successfully estimates the influence of the covariates it is meant to control for. Moreover, we discuss and explore which additional covariates may be modeled, and in what way, in order to obtain confound-free estimates of EEG differences between conditions.
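The core idea of the regression approach can be illustrated with a toy simulation. The following is a minimal numpy sketch, not the pipeline used in the thesis: fixation onsets, the simulated response kernel, and all variable names are illustrative assumptions. It builds a time-expanded design matrix (one predictor column per latency around fixation onset) and solves a least-squares problem, which disentangles overlapping fixation-locked responses that simple averaging would smear together.

```python
import numpy as np

rng = np.random.default_rng(0)

srate = 100                     # Hz (illustrative)
n_samples = 2000                # 20 s of one simulated EEG channel
window = np.arange(-10, 50)     # estimation window: -100..+490 ms around fixation onset

# Hypothetical fixation onsets arriving roughly 3-6 per second,
# so the evoked responses overlap in time
onsets = np.cumsum(rng.integers(15, 35, size=60))
onsets = onsets[onsets < n_samples - window.max()]

# Simulate: each fixation evokes the same kernel; overlapping responses sum
true_kernel = np.exp(-0.5 * ((window - 15) / 6.0) ** 2)   # peak ~150 ms
eeg = np.zeros(n_samples)
for t in onsets:
    eeg[t + window] += true_kernel
eeg += 0.1 * rng.standard_normal(n_samples)               # measurement noise

# Time-expanded design matrix: row i marks which kernel latencies
# contribute to EEG sample i, given all fixation onsets
X = np.zeros((n_samples, window.size))
for t in onsets:
    X[t + window, np.arange(window.size)] += 1.0

# Ordinary least squares recovers the overlap-corrected response
beta, *_ = np.linalg.lstsq(X, eeg, rcond=None)
print(np.corrcoef(beta, true_kernel)[0, 1])  # close to 1
```

In the same framework, eye movement parameters (e.g. saccade amplitude) and condition labels can be added as further predictor columns, which is what makes the approach attractive for confound control.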
One important finding is that stimulus properties of physically variable stimuli, such as complex scenes, can influence EEG signals and deserve close consideration during experimental design or modelling. Overall, the method compares favorably to averaging approaches.

From the studies in this thesis, we learn that object recognition happens in an obligatory fashion, even when the task does not require it. We also learn that only a subset of first fixations on objects is affected by the processing of object meaning and its fit to its surroundings. The comparison between first fixation and first dwell effects suggests that, in active vision, the processing of object semantics sometimes unfolds over multiple fixations. And finally, we learn that regression-based methods for combined eye tracking–EEG analysis provide a plausible way forward for investigating how object recognition unfolds in active vision.

Metadata
Author: Tim Cornelissen
URN: urn:nbn:de:hebis:30:3-480976
Referees: Melissa Lê-Hoa Võ, Christian Fiebach
Advisor: Melissa Lê-Hoa Võ
Document Type: Doctoral Thesis
Language: English
Year of Completion: 2018
Year of First Publication: 2018
Publishing Institution: Universitätsbibliothek Johann Christian Senckenberg
Granting Institution: Johann Wolfgang Goethe-Universität
Date of Final Exam: 2018/07/11
Release Date: 2019/12/04
Page Number: 170
HeBIS-PPN: 45766035X
Institutes: Psychologie und Sportwissenschaften / Psychologie
Dewey Decimal Classification: 1 Philosophie und Psychologie / 15 Psychologie / 150 Psychologie
Collections: Universitätspublikationen
Licence (German): Archivex. zur Lesesaalplatznutzung § 52b UrhG