TY - JOUR
A1 - Thao, Ha Thi Phuong
A1 - Balamurali, B T
A1 - Roig Noguera, Gemma
A1 - Herremans, Dorien
T1 - AttendAffectNet-emotion prediction of movie viewers using multimodal fusion with self-attention
T2 - Sensors
N2 - In this paper, we tackle the problem of predicting the affective responses of movie viewers based on the content of the movies. Current studies on this topic focus on video representation learning and fusion techniques to combine the extracted features for predicting affect. Yet, these approaches typically ignore both the correlations between the different modality inputs and the correlations between temporal inputs (i.e., sequential features). To explore these correlations, a neural network architecture, namely AttendAffectNet (AAN), uses the self-attention mechanism to predict the emotions of movie viewers from different input modalities. In particular, visual, audio, and text features are considered for predicting emotions, which are expressed in terms of valence and arousal. We analyze three variants of our proposed AAN: Feature AAN, Temporal AAN, and Mixed AAN. The Feature AAN applies the self-attention mechanism in an innovative way to the features extracted from the different modalities (video, audio, and movie subtitles) of a whole movie, thereby capturing the relationships between them. The Temporal AAN takes the time domain of the movies and the sequential dependency of affective responses into account; here, self-attention is applied to the concatenated (multimodal) feature vectors representing subsequent movie segments. In the Mixed AAN, we combine the strong points of the Feature AAN and the Temporal AAN by applying self-attention first to the vectors of features obtained from the different modalities in each movie segment, and then to the feature representations of all subsequent (temporal) movie segments. We extensively trained and validated our proposed AAN on both the MediaEval 2016 dataset for the Emotional Impact of Movies Task and the extended COGNIMUSE dataset. Our experiments demonstrate that audio features play a more influential role than features extracted from video and movie subtitles when predicting the emotions of movie viewers on these datasets. The models that used all visual, audio, and text features simultaneously as inputs performed better than those using features extracted from each modality separately. In addition, the Feature AAN outperformed the other AAN variants on the above-mentioned datasets, highlighting the importance of taking different features as context to one another when fusing them. The Feature AAN also performed better than the baseline models when predicting the valence dimension.
KW - neural networks
KW - self-attention
KW - emotion prediction
KW - MediaEval 2016
KW - COGNIMUSE
KW - affective computing
KW - multimodal fusion
KW - computer vision
Y1 - 2021
UR - http://publikationen.ub.uni-frankfurt.de/frontdoor/index/index/docId/79501
UR - https://nbn-resolving.org/urn:nbn:de:hebis:30:3-795015
SN - 1424-8220
N1 - This research was funded by MOE Tier 2 grant number MOE2018-T2-2-161 and the SUTD President's Graduate International Fellowship.
VL - 21
IS - 24, art. 8356
SP - 1
EP - 25
PB - MDPI
CY - Basel
ER -
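
Annotation (not part of the RIS record): the abstract describes applying self-attention across feature vectors extracted from different modalities before regressing valence and arousal. The sketch below is a minimal illustration of that fusion idea in PyTorch; the model dimension, number of heads and layers, per-modality projection layers, mean pooling, and the stand-in feature sizes are all assumptions for illustration and do not reproduce the authors' AttendAffectNet implementation.

# Illustrative sketch only: self-attention fusion over per-modality feature
# vectors, in the spirit of the Feature AAN described in the abstract above.
# All dimensions and layer choices are assumptions, not the authors' settings.
import torch
import torch.nn as nn


class SelfAttentionFusion(nn.Module):
    def __init__(self, d_model=256, n_heads=4, n_modalities=3):
        super().__init__()
        # Project each modality's feature vector into a shared embedding space.
        self.proj = nn.ModuleList(
            [nn.LazyLinear(d_model) for _ in range(n_modalities)]
        )
        encoder_layer = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, batch_first=True
        )
        self.encoder = nn.TransformerEncoder(encoder_layer, num_layers=2)
        # Regression head producing two outputs: valence and arousal.
        self.head = nn.Linear(d_model, 2)

    def forward(self, modality_feats):
        # modality_feats: list of tensors, one per modality, each (batch, dim_i).
        tokens = torch.stack(
            [proj(x) for proj, x in zip(self.proj, modality_feats)], dim=1
        )  # (batch, n_modalities, d_model)
        fused = self.encoder(tokens)   # self-attention across modality tokens
        pooled = fused.mean(dim=1)     # average over modality tokens
        return self.head(pooled)       # (batch, 2): valence, arousal


# Usage with random tensors standing in for video, audio, and subtitle embeddings
# (the feature sizes here are arbitrary placeholders).
video = torch.randn(8, 2048)
audio = torch.randn(8, 1582)
text = torch.randn(8, 768)
model = SelfAttentionFusion()
valence_arousal = model([video, audio, text])
print(valence_arousal.shape)  # torch.Size([8, 2])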