Abstract: The human visual cortex enables visual perception through a cascade of hierarchical computations in cortical regions with distinct functionalities. Here, we introduce an AI-driven approach to discover the functional mapping of the visual cortex. We related human brain responses to scene images measured with functional MRI (fMRI) systematically to a diverse set of deep neural networks (DNNs) optimized to perform different scene perception tasks. We found a structured mapping between DNN tasks and brain regions along the ventral and dorsal visual streams. Low-level visual tasks mapped onto early brain regions, 3-dimensional scene perception tasks mapped onto the dorsal stream, and semantic tasks mapped onto the ventral stream. This mapping was of high fidelity, with more than 60% of the explainable variance in nine key regions being explained. Together, our results provide a novel functional mapping of the human visual cortex and demonstrate the power of the computational approach.
Author Summary: Human visual perception is a complex cognitive feat known to be mediated by distinct cortical regions of the brain. However, the exact function of these regions remains unknown, and thus it remains unclear how those regions together orchestrate visual perception. Here, we apply an AI-driven brain mapping approach to reveal visual brain function. This approach integrates multiple artificial deep neural networks trained on a diverse set of functions with functional recordings of the whole human brain. Our results reveal a systematic tiling of visual cortex by mapping regions to particular functions of the deep networks. Together this constitutes a comprehensive account of the functions of the distinct cortical regions of the brain that mediate human visual perception.
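As a rough illustration of the kind of DNN-to-brain comparison described above (a sketch, not the authors' released pipeline), the snippet below relates a task-specific DNN to an fMRI region of interest via representational similarity analysis and expresses the result relative to a simple split-half noise ceiling. All array shapes, the correlation-distance RDMs, and the Spearman comparison are assumptions.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(responses):
    """Representational dissimilarity matrix (condensed form) from a
    (n_images, n_features) response matrix, using correlation distance."""
    return pdist(responses, metric="correlation")

def explained_fraction(dnn_features, roi_split1, roi_split2):
    """Correlate a DNN RDM with an ROI RDM and normalize by a simple
    split-half noise ceiling. Inputs are (n_images, n_units) arrays;
    the two ROI splits are averages over disjoint halves of the fMRI runs."""
    roi_rdm = rdm((roi_split1 + roi_split2) / 2)
    model_corr, _ = spearmanr(rdm(dnn_features), roi_rdm)
    # Split-half reliability as a rough upper bound on any model's correlation.
    ceiling, _ = spearmanr(rdm(roi_split1), rdm(roi_split2))
    return model_corr / ceiling

# Hypothetical usage: 100 scene images, DNN features and ROI voxel patterns.
rng = np.random.default_rng(0)
feats = rng.standard_normal((100, 512))                 # e.g. last-layer DNN features
roi_a = rng.standard_normal((100, 300))                 # ROI voxels, odd runs
roi_b = roi_a + 0.5 * rng.standard_normal((100, 300))   # ROI voxels, even runs
print(explained_fraction(feats, roi_a, roi_b))
```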
The human brain achieves visual object recognition through multiple stages of linear and nonlinear transformations operating at a millisecond scale. To predict and explain these rapid transformations, computational neuroscientists employ machine learning modeling techniques. However, state-of-the-art models require massive amounts of data to properly train, and to the present day there is a lack of vast brain datasets that extensively sample the temporal dynamics of visual object recognition. Here we collected a large and rich dataset of high temporal resolution EEG responses to images of objects on a natural background. This dataset includes 10 participants, each with 82,160 trials spanning 16,740 image conditions. Through computational modeling we established the quality of this dataset in five ways. First, we trained linearizing encoding models that successfully synthesized the EEG responses to arbitrary images. Second, we correctly identified the recorded EEG data image conditions in a zero-shot fashion, using synthesized EEG responses to hundreds of thousands of candidate image conditions. Third, we show that both the high number of conditions and the trial repetitions of the EEG dataset contribute to the trained models’ prediction accuracy. Fourth, we built encoding models whose predictions generalize well to novel participants. Fifth, we demonstrate full end-to-end training of randomly initialized DNNs that output EEG responses for arbitrary input images. We release this dataset as a tool to foster research in visual neuroscience and computer vision.
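A linearizing encoding model of the kind described, i.e. a regularized linear mapping from image features to EEG responses, might be sketched as follows. The feature dimensionality, data shapes, and the fixed ridge penalty are placeholder assumptions rather than the released training procedure.

```python
import numpy as np
from sklearn.linear_model import Ridge

# Assumed shapes: X are image features (e.g. from a pretrained DNN),
# Y are EEG responses flattened over channels x time points.
n_train, n_test, n_feat, n_chan, n_time = 1000, 200, 512, 64, 100
rng = np.random.default_rng(0)
X_train = rng.standard_normal((n_train, n_feat))
X_test = rng.standard_normal((n_test, n_feat))
W_true = rng.standard_normal((n_feat, n_chan * n_time))
Y_train = X_train @ W_true + rng.standard_normal((n_train, n_chan * n_time))
Y_test = X_test @ W_true + rng.standard_normal((n_test, n_chan * n_time))

# One ridge regression predicts all channel/time responses jointly.
enc = Ridge(alpha=1e3).fit(X_train, Y_train)
Y_pred = enc.predict(X_test)

# Evaluate: Pearson correlation per channel/time point across test images.
Yp = (Y_pred - Y_pred.mean(0)) / Y_pred.std(0)
Yt = (Y_test - Y_test.mean(0)) / Y_test.std(0)
corr = (Yp * Yt).mean(0).reshape(n_chan, n_time)
print("mean encoding accuracy:", corr.mean())
```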
In the human brain, light arriving at the retina is transformed into meaningful representations that allow us to interact with the world. In a similar vein, a deep neural network (DNN) transforms RGB pixel values into meaningful representations relevant to the computer vision task it was trained to solve. In my research, I therefore aim to reveal insights into the visual representations in the human visual cortex and in DNNs solving vision tasks.
Over the past decade, DNNs have emerged as the state-of-the-art models for predicting neural responses in the human and monkey visual cortex. Research has shown that training on a task related to a brain region’s function leads to better predictivity than that of a randomly initialized network. Based on this observation, we proposed that DNNs trained on different computer vision tasks can be used to identify the functional mapping of the human visual cortex.
To validate our proposed idea, we first investigated a single brain region, the occipital place area (OPA), using DNNs trained on a scene parsing task and a scene classification task. From previous investigations of OPA’s function, we knew that it encodes navigational affordances, which require spatial information about the scene. Therefore, we hypothesized that OPA’s representation should be closer to a scene parsing model than to a scene classification model, as the scene parsing task explicitly requires spatial information about the scene. Our results showed that scene parsing models had representations closer to OPA than scene classification models did, thus validating our approach.
We then selected multiple DNNs performing a wide range of computer vision tasks, ranging from low-level tasks such as edge detection, to 3D tasks such as surface normal estimation, to semantic tasks such as semantic segmentation. We compared the representations of these DNNs with all the regions in the visual cortex, thus revealing the functional representations of different regions of the visual cortex. Our results converged strongly with previous investigations of these brain regions, validating the feasibility of the proposed approach for finding functional representations of the human brain. Our results also provided new insights into under-investigated brain regions, which can serve as starting hypotheses and promote further investigation into those regions.
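Conceptually, the mapping step in the two preceding paragraphs reduces to an aggregation over a task-by-region similarity matrix; the sketch below groups tasks into 2D, 3D, and semantic families and asks which family best matches each region. The similarity values and region names are made-up placeholders chosen only to mirror the pattern described, not reported results.

```python
import numpy as np

tasks = ["edges", "keypoints", "surface_normals", "depth", "semantic_seg", "scene_class"]
groups = {"2D": ["edges", "keypoints"],
          "3D": ["surface_normals", "depth"],
          "semantic": ["semantic_seg", "scene_class"]}
rois = ["V1", "OPA", "PPA"]

# Placeholder task-to-ROI similarities (e.g. noise-normalized RSA correlations).
sim = np.array([[0.61, 0.22, 0.18],   # edges
                [0.55, 0.25, 0.20],   # keypoints
                [0.20, 0.52, 0.30],   # surface_normals
                [0.18, 0.48, 0.28],   # depth
                [0.15, 0.30, 0.57],   # semantic_seg
                [0.12, 0.28, 0.60]])  # scene_class

for j, roi in enumerate(rois):
    # Average the similarity within each task family, then pick the best family.
    family_scores = {g: sim[[tasks.index(t) for t in ts], j].mean()
                     for g, ts in groups.items()}
    best = max(family_scores, key=family_scores.get)
    print(f"{roi}: best-matching task family = {best}")
```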
We applied the same approach to gain representational insights into the DNNs themselves. A DNN usually consists of multiple layers, with each layer performing a computation leading up to the final layer that makes the prediction for a given task. Training on different tasks could lead to very different representations. Therefore, we first investigated at which stage the representations of DNNs trained on different tasks start to differ. We further investigated whether DNNs trained on similar tasks develop similar representations and DNNs trained on dissimilar tasks develop more dissimilar ones. We selected the same set of DNNs used in the previous work, trained on the Taskonomy dataset on a diverse range of 2D, 3D, and semantic tasks. Then, given a DNN trained on a particular task, we compared the representations of multiple layers to the corresponding layers in other DNNs. From this analysis, we aimed to reveal where in the network architecture task-specific representations become prominent. We found that task specificity increases with depth in the DNN architecture and that similar tasks start to cluster into groups. We also found that the grouping obtained using representational similarity was highly correlated with a grouping based on transfer learning, suggesting an interesting application of the approach to model selection in transfer learning.
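One way to carry out the layer-wise comparison just described is sketched below: compute a similarity (here linear centered kernel alignment, an assumption) between corresponding layers of task-specific DNNs at each depth, then cluster tasks by the dissimilarity of their final-layer representations. The activation arrays and clustering settings are illustrative.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage, fcluster

def linear_cka(X, Y):
    """Linear centered kernel alignment between two (n_images, n_units) arrays."""
    X = X - X.mean(0)
    Y = Y - Y.mean(0)
    hsic = np.linalg.norm(X.T @ Y, "fro") ** 2
    return hsic / (np.linalg.norm(X.T @ X, "fro") * np.linalg.norm(Y.T @ Y, "fro"))

rng = np.random.default_rng(0)
n_images, n_layers, n_tasks = 200, 5, 4
# Hypothetical per-layer activations for DNNs trained on different tasks.
acts = [[rng.standard_normal((n_images, 256)) for _ in range(n_layers)]
        for _ in range(n_tasks)]

# Task specificity per depth: similarity of task 0 to every other task, layer by layer.
for layer in range(n_layers):
    sims = [linear_cka(acts[0][layer], acts[t][layer]) for t in range(1, n_tasks)]
    print(f"layer {layer}: mean cross-task similarity = {np.mean(sims):.3f}")

# Cluster tasks by pairwise dissimilarity of their final-layer representations.
last = [acts[t][-1] for t in range(n_tasks)]
dist = [1 - linear_cka(last[i], last[j])
        for i in range(n_tasks) for j in range(i + 1, n_tasks)]
print(fcluster(linkage(dist, method="average"), t=2, criterion="maxclust"))
```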
In previous work, several new measures were introduced to compare DNN representations. We therefore identified the commonalities among these measures and unified them into a single framework, referred to as duality diagram similarity. This work opens up new possibilities for similarity measures to understand DNN representations. While demonstrating a much higher correlation with transfer learning than previous state-of-the-art measures, we extend the framework to understanding layer-wise representations of models trained on the ImageNet and Places datasets using different tasks, and demonstrate its applicability to layer selection for transfer learning.
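Without reproducing the exact duality diagram formulation, the sketch below illustrates the kind of commonality such a framework builds on: several popular measures can be computed from the same centered images-by-images similarity structure, shown here for a CKA-style and an RSA-style score. Both the data and the specific measures are assumptions chosen for illustration.

```python
import numpy as np

def pairwise_structure(X):
    """Centered images-by-images similarity matrix from (n_images, n_units) features.
    Scores such as CKA and RSA-style correlations both operate on objects of this form."""
    Xc = X - X.mean(0)
    return Xc @ Xc.T

def cka_from_grams(Gx, Gy):
    # Frobenius inner product of the two structures, normalized.
    return np.sum(Gx * Gy) / (np.linalg.norm(Gx) * np.linalg.norm(Gy))

def rsa_from_grams(Gx, Gy):
    iu = np.triu_indices_from(Gx, k=1)          # off-diagonal pairwise terms only
    return np.corrcoef(Gx[iu], Gy[iu])[0, 1]

rng = np.random.default_rng(0)
A = rng.standard_normal((50, 128))              # layer of model A (hypothetical)
B = A @ rng.standard_normal((128, 64)) * 0.7    # related layer of model B (hypothetical)
Ga, Gb = pairwise_structure(A), pairwise_structure(B)
print(cka_from_grams(Ga, Gb), rsa_from_grams(Ga, Gb))
```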
In all the previous works, we used task-specific DNN representations to understand the representations in the human visual cortex and in other DNNs. We were able to interpret our findings in terms of computer vision tasks such as edge detection, semantic segmentation, and depth estimation; however, we were not able to map the representations to human-interpretable concepts. Therefore, in our most recent work, we developed a new method that associates individual artificial neurons with human-interpretable concepts.
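As one illustration of how individual units can be associated with nameable concepts (in the spirit of network dissection; whether this matches the thesis’s new method is not stated here), a unit’s thresholded activation maps can be scored against concept segmentation masks by intersection-over-union:

```python
import numpy as np

def unit_concept_iou(act_maps, concept_masks, quantile=0.99):
    """Score how well one unit's activations align with a visual concept.

    act_maps: (n_images, H, W) activation maps of a single unit.
    concept_masks: (n_images, H, W) boolean masks marking the concept (e.g. 'floor').
    Returns the intersection-over-union between the unit's top-activation regions
    and the concept masks, pooled over images.
    """
    thresh = np.quantile(act_maps, quantile)      # unit-specific activation threshold
    unit_mask = act_maps >= thresh
    inter = np.logical_and(unit_mask, concept_masks).sum()
    union = np.logical_or(unit_mask, concept_masks).sum()
    return inter / union if union else 0.0

# Hypothetical usage: a unit is labeled with the concept that maximizes its IoU.
rng = np.random.default_rng(0)
acts = rng.random((20, 14, 14))
concepts = {"floor": rng.random((20, 14, 14)) > 0.9,
            "wall": rng.random((20, 14, 14)) > 0.9}
scores = {name: unit_concept_iou(acts, mask) for name, mask in concepts.items()}
print(max(scores, key=scores.get), scores)
```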
Overall, the work in this thesis revealed new insights into the representations of the visual cortex and DNNs...
Grasping the meaning of everyday visual events is a fundamental feat of human intelligence that hinges on diverse neural processes ranging from vision to higher-level cognition. Deciphering the neural basis of visual event understanding requires rich, extensive, and appropriately designed experimental data. However, this type of data has hitherto been missing. To fill this gap, we introduce the BOLD Moments Dataset (BMD), a large dataset of whole-brain fMRI responses to over 1,000 short (3s) naturalistic video clips and accompanying metadata. We show that visual events interface with an array of processes, extending even to memory, and we reveal a match in hierarchical processing between brains and video-computable deep neural networks. Furthermore, we showcase that BMD successfully captures temporal dynamics of visual events at second resolution. BMD thus establishes a critical groundwork for investigations of the neural basis of visual event understanding.
Visual scene perception is mediated by a set of cortical regions that respond preferentially to images of scenes, including the occipital place area (OPA) and parahippocampal place area (PPA). However, the differential contribution of OPA and PPA to scene perception remains an open research question. In this study, we take a deep neural network (DNN)-based computational approach to investigate the differences in OPA and PPA function. In a first step, we search for a computational model that predicts fMRI responses to scenes in OPA and PPA well. We find that DNNs trained to predict scene components (e.g., wall, ceiling, floor) explain higher variance uniquely in OPA and PPA than a DNN trained to predict scene category (e.g., bathroom, kitchen, office). This result is robust across several DNN architectures. On this basis, we then determine whether particular scene components predicted by DNNs differentially account for unique variance in OPA and PPA. We find that variance in OPA responses uniquely explained by the navigation-related floor component is higher compared to the variance explained by the wall and ceiling components. In contrast, PPA responses are better explained by the combination of wall and floor, that is, scene components that together contain the structure and texture of the scene. This differential sensitivity to scene components suggests differential functions of OPA and PPA in scene processing. Moreover, our results further highlight the potential of the proposed computational approach as a general tool in the investigation of the neural basis of human scene perception.
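The unique-variance logic described above can be sketched with nested encoding models: fit the region’s responses from both feature sets together and from each set alone, then read off unique and shared contributions from differences in cross-validated R². The feature sets, shapes, and plain linear regression below are assumptions, not the study’s exact variance-partitioning setup.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

def r2(X, y):
    """Cross-validated R^2 of a linear encoding model."""
    return cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2").mean()

rng = np.random.default_rng(0)
n = 300
comp = rng.standard_normal((n, 40))   # scene-component DNN features (hypothetical)
cat = rng.standard_normal((n, 40))    # scene-category DNN features (hypothetical)
roi = comp[:, :5].sum(1) + 0.3 * cat[:, :5].sum(1) + rng.standard_normal(n)  # toy OPA-like signal

full = r2(np.hstack([comp, cat]), roi)
unique_comp = full - r2(cat, roi)     # variance only the component model explains
unique_cat = full - r2(comp, roi)     # variance only the category model explains
shared = full - unique_comp - unique_cat
print(f"unique(component)={unique_comp:.3f} unique(category)={unique_cat:.3f} shared={shared:.3f}")
```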
Studying the neural basis of human dynamic visual perception requires extensive experimental data to evaluate the large swathes of functionally diverse brain neural networks driven by perceiving visual events. Here, we introduce the BOLD Moments Dataset (BMD), a repository of whole-brain fMRI responses to over 1,000 short (3s) naturalistic video clips of visual events across ten human subjects. We use the videos’ extensive metadata to show how the brain represents word- and sentence-level descriptions of visual events and identify correlates of video memorability scores extending into the parietal cortex. Furthermore, we reveal a match in hierarchical processing between cortical regions of interest and video-computable deep neural networks, and we showcase that BMD successfully captures temporal dynamics of visual events at second resolution. With its rich metadata, BMD offers new perspectives and accelerates research on the human brain basis of visual event perception.
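The hierarchical correspondence mentioned above can be probed, in a rough sketch, by asking which DNN layer best matches each region of interest and whether the best-matching layer deepens along the cortical hierarchy. Layer names, regions, and the RSA-style comparison below are placeholders.

```python
import numpy as np
from scipy.spatial.distance import pdist
from scipy.stats import spearmanr

def rdm(X):
    return pdist(X, metric="correlation")   # condensed dissimilarity matrix

rng = np.random.default_rng(0)
n_videos = 100
# Hypothetical layer activations of a video-computable DNN and ROI response patterns.
layers = {f"layer{i}": rng.standard_normal((n_videos, 256)) for i in range(1, 6)}
rois = {"V1": rng.standard_normal((n_videos, 200)),
        "LOC": rng.standard_normal((n_videos, 200)),
        "STS": rng.standard_normal((n_videos, 200))}

layer_rdms = {name: rdm(acts) for name, acts in layers.items()}
for roi_name, roi_resp in rois.items():
    roi_rdm = rdm(roi_resp)
    corrs = {name: spearmanr(lr, roi_rdm)[0] for name, lr in layer_rdms.items()}
    best = max(corrs, key=corrs.get)
    print(f"{roi_name}: best-matching DNN layer = {best}")
# A hierarchical correspondence would show deeper best layers for higher-level ROIs.
```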
The human brain achieves visual object recognition through multiple stages of nonlinear transformations operating at a millisecond scale. To predict and explain these rapid transformations, computational neuroscientists employ machine learning modeling techniques. However, state-of-the-art models require massive amounts of data to properly train, and to the present day there is a lack of vast brain datasets that extensively sample the temporal dynamics of visual object recognition. Here we collected a large and rich dataset of high temporal resolution EEG responses to images of objects on a natural background. This dataset includes 10 participants, each with 82,160 trials spanning 16,740 image conditions. Through computational modeling we established the quality of this dataset in five ways. First, we trained linearizing encoding models that successfully synthesized the EEG responses to arbitrary images. Second, we correctly identified the recorded EEG data image conditions in a zero-shot fashion, using synthesized EEG responses to hundreds of thousands of candidate image conditions. Third, we show that both the high number of conditions and the trial repetitions of the EEG dataset contribute to the trained models’ prediction accuracy. Fourth, we built encoding models whose predictions generalize well to novel participants. Fifth, we demonstrate full end-to-end training of randomly initialized DNNs that output M/EEG responses for arbitrary input images. We release this dataset as a tool to foster research in visual neuroscience and computer vision.
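The zero-shot identification analysis can be illustrated as follows: correlate each recorded test response with synthesized responses for a large candidate set and check whether the true image condition ranks first. The candidate-set size, response dimensionality, and synthetic data below are placeholders.

```python
import numpy as np

rng = np.random.default_rng(0)
n_candidates, n_targets, dim = 2000, 100, 64 * 50   # candidate set, test images, channels*time

# Synthesized EEG responses for every candidate image (e.g. from an encoding model).
synth = rng.standard_normal((n_candidates, dim))
# Recorded responses for the test images = noisy versions of the first n_targets candidates.
recorded = synth[:n_targets] + 2.0 * rng.standard_normal((n_targets, dim))

# Correlate each recorded response with all synthesized candidates.
zs = lambda M: (M - M.mean(1, keepdims=True)) / M.std(1, keepdims=True)
corr = zs(recorded) @ zs(synth).T / dim              # (n_targets, n_candidates) Pearson r
predicted = corr.argmax(1)

accuracy = np.mean(predicted == np.arange(n_targets))
print(f"zero-shot identification accuracy: {accuracy:.2%} (chance = {1/n_candidates:.4%})")
```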