Object recognition is such an everyday task that it seems almost mundane. We look at the spaces around us and name things seemingly effortlessly. Yet understanding how the process of object recognition unfolds is a great challenge for vision science. Models derived from abstract stimuli have little predictive power for the way people explore "naturalistic" scenes and the objects in them; naturalistic here refers to unaltered photographs of real scenes. This thesis therefore focuses on the process of recognizing the objects in such naturalistic scenes. People can, for instance, find objects in scenes much more efficiently than models derived from abstract stimuli would predict. To explain this kind of behavior, we describe scenes not solely in terms of physical characteristics (colors, contrasts, lines, orientations, etc.) but by the meaning of the whole scene (kitchen, street, bathroom, etc.) and of the objects within the scene (oven, fire hydrant, soap, etc.). Object recognition then refers to the process by which the visual system assigns meaning to the object.
The relationship between objects in a naturalistic scene is far from random. Objects do not typically float in mid-air and cannot occupy the same physical space. Moreover, certain scenes typically contain certain objects: a fire hydrant in the kitchen would seem like an anomaly to the average observer. These "rules" can be described as the "grammar" of the scene. Scene grammar is involved in multiple aspects of scene and object perception. There is, for instance, evidence that the overall scene category influences the identification of individual objects. Typically, experiments that directly target object recognition do not involve eye movements, while studies that involve eye movements are not aimed at object recognition directly but at gaze allocation. Yet eye movements are abundant in everyday life, occurring roughly four times per second. Here we therefore present two studies that use eye movements to investigate when object recognition takes place as people move their eyes from object to object in a scene. A third study applies novel methods for analyzing data from combined eye movement and neurophysiology (EEG) measurements.
One way to study object perception is to violate the grammar of a scene by placing an object in a scene it does not typically occur in, and measuring how long people look at this so-called semantic inconsistency compared to an object one would expect in the given scene. Typically, people look at semantic inconsistencies longer and more often, signaling that such objects require extra processing. In Study 1 we make use of this behavior to ask whether object recognition still happens when it is not necessary for the task. We designed a search task that made it unnecessary to register object identities. Still, participants looked at inconsistent objects longer than at consistent objects, signaling that they did indeed process object and scene identities. Interestingly, the inconsistent objects were not remembered better than the consistent ones. We conclude that object and scene identities (their semantics) are processed in an obligatory fashion, even when people are engaged in a task that does not require it. In Study 2, we investigate more closely when the first signs of object semantic processing become visible while people make eye movements.
Although the finding that semantic inconsistencies are looked at longer and more often has been replicated frequently, many of these replications consider gaze duration over a whole trial. The question of when during a trial differences between consistency conditions emerge has yielded mixed results. Some studies only report effects of semantic consistency that accumulate over whole trials, whereas others report influences already on the duration of the very first fixation on inconsistent objects. In Study 2 we argue that prior studies reporting first fixation durations may have suffered from methodological shortcomings, such as small trial and sample sizes, in addition to the use of non-robust statistics and data descriptions. We show that a subset of fixations may be influenced more than others (as indicated by more skewed fixation duration distributions). Further analyses show that the relationship between the effect of object semantics on fixation durations and its effect on oft-replicated cumulative measures is not straightforward (fixation duration distributions do not predict dwell effects), but the effects on both measures may be related in a different way. Possibly, the processing of object meaning unfolds over multiple fixations only when one fixation does not suffice. It would therefore be very valuable to be able to study how processing continues after a fixation ends.
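The distributional argument above can be made concrete with a small sketch (simulated, hypothetical fixation durations; the variable names and parameters are illustrative, not those of Study 2): if only a subset of fixations is prolonged, the mean may shift little while the distribution's skew increases markedly.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical fixation durations in ms; ex-Gaussian-like shapes are
# typical for fixation duration distributions.
consistent = rng.normal(200, 30, 1000) + rng.exponential(50, 1000)

# In the "inconsistent" condition, suppose only ~25% of fixations are
# prolonged: the right tail grows, skewing the distribution.
inconsistent = consistent.copy()
subset = rng.random(1000) < 0.25
inconsistent[subset] += rng.exponential(120, subset.sum())

def skewness(x):
    """Standardized third moment (Fisher skewness)."""
    z = (x - x.mean()) / x.std()
    return (z ** 3).mean()

# The mean difference is modest, but the skew difference is clear.
print(f"mean diff: {inconsistent.mean() - consistent.mean():.1f} ms")
print(f"skew consistent:   {skewness(consistent):.2f}")
print(f"skew inconsistent: {skewness(inconsistent):.2f}")
```

This is one reason robust statistics and full distributional descriptions matter here: a mean-based analysis can underestimate an effect that lives in the tail of the distribution.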
Study 3 aims to make such a measure possible by combining EEG recordings with eye tracking measurements. Analyzing combined eye tracking–EEG data is difficult because neural responses vary with eye movement characteristics. Moreover, fixations follow one another in short succession, causing the neural responses to successive fixations to overlap in time. These issues make the well-established approach of averaging single-trial EEG data into ERPs problematic. As an alternative, we propose the use of multiple regression, explicitly modelling both the temporal overlap and the eye movement parameters. In Study 3 we show that such a method successfully estimates the influence of the covariates it is meant to control for. Moreover, we discuss and explore which additional covariates may be modeled, and in what way, in order to obtain confound-free estimates of EEG differences between conditions. One important finding is that stimulus properties of physically variable stimuli, such as complex scenes, can influence EEG signals and deserve close consideration during experimental design or modelling. Overall, the method compares favorably to averaging methods.
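The regression idea can be sketched in a toy simulation (all names and numbers are hypothetical; dedicated toolboxes such as unfold implement this approach with spline bases and additional covariates): overlapping fixation-locked responses are modeled with a design matrix containing one predictor per post-fixation time lag, so that ordinary least squares disentangles ("deconvolves") the responses that simple averaging mixes together.

```python
import numpy as np

rng = np.random.default_rng(1)
fs, n = 100, 6000                       # 100 Hz, 60 s of simulated "EEG"
kernel_len = 60                         # response lasts 600 ms

# Ground-truth fixation-locked response (a simple biphasic wave).
t = np.arange(kernel_len) / fs
true_kernel = np.sin(2 * np.pi * 2 * t) * np.exp(-t / 0.15)

# Fixation onsets ~150-340 ms apart: responses overlap heavily in time.
onsets = np.cumsum(rng.integers(15, 35, 200))
onsets = onsets[onsets < n - kernel_len]

# Simulated signal: superposition of overlapping responses plus noise.
eeg = np.zeros(n)
for o in onsets:
    eeg[o:o + kernel_len] += true_kernel
eeg += rng.normal(0, 0.1, n)

# Deconvolution: one predictor per post-onset lag; each row of X marks
# which lags of which (possibly several) responses contribute to it.
X = np.zeros((n, kernel_len))
for o in onsets:
    for lag in range(kernel_len):
        X[o + lag, lag] += 1.0
beta, *_ = np.linalg.lstsq(X, eeg, rcond=None)

# Naive averaging (the classic ERP) is distorted by the overlap.
erp = np.mean([eeg[o:o + kernel_len] for o in onsets], axis=0)

err_regression = np.abs(beta - true_kernel).mean()
err_averaging = np.abs(erp - true_kernel).mean()
print(f"mean abs error, regression: {err_regression:.3f}")
print(f"mean abs error, averaging:  {err_averaging:.3f}")
```

In this sketch the regression recovers the underlying response much more faithfully than averaging; adding eye movement covariates (e.g., saccade amplitude) amounts to adding further columns to the same design matrix.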
From the studies in this thesis, we learn directly that object recognition is a process that happens in an obligatory fashion, even when the task does not require it. We also learn that only a subset of first fixations on objects is affected by the processing of object meaning and its fit to the surroundings. Comparison between first fixation and first dwell effects suggests that, in active vision, object semantics processing sometimes unfolds over multiple fixations. And finally, we learn that regression-based methods for combined eye tracking–EEG analysis provide a plausible way forward for investigating how object recognition unfolds in active vision.
Our mind has the function of representing the physical and social world we are in, so that we can interact with it efficiently. This results in a constant and dynamic interaction between mind and world, which reaches a balance when representations are at the same time accurate with respect to what the world communicates to our organism and compatible with how our mind works.
A paradigmatic case of this interaction is offered by perception, the mental function that represents contingent aspects of the world built from what is captured by our senses. Indeed, the dominant philosophical view in cognitive science is that our perceptual states are representations of the world rather than direct access to that world. These representational perceptual states therefore encode the aspects of the world they represent, the very aspects that initiate perception by stimulating our sensory organs.
Perceptual representations are built using information from the sensory system, i.e., bottom-up information, but are also integrated with information previously acquired, i.e., top-down information, so that perception interacts with memory through language and other mental functions. Such organization is believed to reflect a general mechanism of our mind/brain, which is to acquire and use information to make efficient predictions about the future, continuously updating older information with present information.
This predictive processing works because the world is not random but shows a regular structure from which reliable expectations can be built. One way our minds make these predictions is by adapting to the structure of the world in an implicit, automatic, and unconscious way, a process that has been called Implicit Statistical Learning (ISL). ISL is a learning process that does not require awareness and happens in an incidental and spontaneous way, through mere exposure to the statistical regularities of the world. It is what happens when we learn a language during early childhood, and it allows us to become implicitly sensitive to the phonological structure of speech, or to associate speech patterns with objects and events to learn word meanings.
A specific case of ISL is the learning of spatial configurations in the visual world, which we apply to abstract arrays of items but, most importantly, also to more ecological settings such as the visual scenes we are immersed in during everyday life. The knowledge we acquire about the structure of visual scenes has been called “Scene Grammar”, because it informs us about the presence and position of objects in a similar way to what linguistic grammar tells us about the presence and position of words. So we implicitly acquire the semantics of scenes, learning which objects are consistent with a certain scene, as well as the syntax of scenes, learning where objects are consistently positioned within a certain scene.
More recent developments have proposed that scene grammar knowledge might be organized as a hierarchical system: objects are arranged in the scene, which offers the most general context; within a scene we can identify different spatial and functional clusters of objects, called “phrases”, that offer a second level of context; and within every phrase, objects have different statuses, with usually one object (the “anchor object”) strongly predicting which other objects appear within the phrase (the “local objects”) and where they are. However, these further aspects of the organization of objects in scenes remain poorly understood.
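A minimal sketch of this hierarchy as a data structure (the scene, phrases, and objects below are invented examples, not items from the studies): each scene contains phrases, and each phrase pairs one anchor object with the local objects it predicts.

```python
# Hypothetical encoding of the proposed hierarchy:
# scene -> phrases -> one anchor object plus its local objects.
scene_grammar = {
    "bathroom": {
        "sink_phrase":   {"anchor": "sink",   "local": ["soap", "toothbrush", "towel"]},
        "shower_phrase": {"anchor": "shower", "local": ["shampoo", "sponge"]},
    },
}

def predict_local_objects(scene, anchor):
    """Given an anchor object, predict which local objects to expect nearby."""
    for phrase in scene_grammar.get(scene, {}).values():
        if phrase["anchor"] == anchor:
            return phrase["local"]
    return []

print(predict_local_objects("bathroom", "sink"))  # ['soap', 'toothbrush', 'towel']
```

The point of the sketch is the asymmetry it encodes: anchors predict their local objects, not the other way around, which is the core of the proposed hierarchical organization.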
Another problem relates to the way we measure the structure of scenes in order to compare the organization of the visual world with its organization in the mind. Typically, to decide whether or not an object appears in a certain scene, and whether or not it appears in a certain position within that scene, researchers have based their decisions on intuition and common sense, perhaps validating those decisions with independent raters. But it has been shown that these decisions can often be limited, and that more complex information about the arrangement of objects in scenes can be lost.
A potential solution to this problem might be to use large sets of real-world images with object annotations and segmentations to measure statistics about how objects are arranged in the environment. This idea exploits the growing availability of such datasets, driven by developments in computer vision, and also parallels the established use of large text corpora in language research.
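A toy example of how such statistics could be extracted (the six "images" below are a hypothetical stand-in for a large annotated dataset such as ADE20K): from scene labels and object annotations one can estimate, for instance, how likely an object is given a scene, or the pointwise mutual information between object and scene.

```python
from collections import Counter
from math import log2

# Hypothetical mini-corpus: (scene category, objects annotated in the image).
annotations = [
    ("kitchen",  ["oven", "sink", "pot", "towel"]),
    ("kitchen",  ["oven", "fridge", "sink"]),
    ("bathroom", ["sink", "soap", "towel"]),
    ("bathroom", ["sink", "soap", "mirror"]),
    ("street",   ["fire hydrant", "car", "sign"]),
    ("street",   ["car", "sign", "tree"]),
]

n_images = len(annotations)
scene_counts = Counter(scene for scene, _ in annotations)
object_counts = Counter(obj for _, objs in annotations for obj in objs)
pair_counts = Counter((scene, obj) for scene, objs in annotations for obj in objs)

def p_object_given_scene(obj, scene):
    """Proportion of images of `scene` that contain `obj`."""
    return pair_counts[(scene, obj)] / scene_counts[scene]

def pmi(obj, scene):
    """Pointwise mutual information between object presence and scene."""
    p_joint = pair_counts[(scene, obj)] / n_images
    return log2(p_joint / ((object_counts[obj] / n_images) * (scene_counts[scene] / n_images)))

print(p_object_given_scene("oven", "kitchen"))          # 1.0: consistent object
print(p_object_given_scene("fire hydrant", "kitchen"))  # 0.0: inconsistent object
print(round(pmi("soap", "bathroom"), 2))                # 1.58: strongly associated
```

Such corpus-derived measures replace intuition-based consistency decisions with graded, data-driven ones, which can then be tested as predictors of behavioural responses.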
The goals of the current investigation were to extract object statistics from these image datasets and test whether they reliably predict behavioural responses during object processing, as well as to use these statistics to investigate more complex aspects of scene grammar, such as its hierarchical organization, to see whether this organization is reflected in the organization of objects in our mind.