The introduction of social robots in human living spaces has brought to attention the need for robots to be equipped with emotion recognition capabilities to facilitate natural and social human-robot interactions. This paper explores the recognition of continuous dimensional emotion from facial expressions.
It further investigates the use of principal component analysis (PCA), locality preserving projections (LPP) and factor analysis (FA) for reduction of the many features that are typically produced by facial feature extraction algorithms.
The reduced feature sets are modelled using Nonlinear AutoRegressive with eXogenous inputs Recurrent Neural Networks (NARX-RNN). The results show that PCA significantly outperforms both LPP and FA techniques, and that the NARX-RNN model is a powerful predictor of continuous emotion.
This section describes the affective computing framework (depicted in Figure 2) for analysis of facial expressions to inform a robot of a person’s cognitive state. A robot is typically fitted with a camera to capture its environment.
The image sequences captured by the camera are fed into a face tracker which attempts to locate a face in each image. Once the face is located, the facial expression features are extracted from the image sequence and a dimension reduction technique is applied prior to modelling the data.
A. Face Tracking:
Tracking of the face is achieved through the use of the GAVAM-CLM tracker, which combines non-rigid face tracking and rigid head pose tracking approaches for accurate location of the face. The non-rigid tracking approach locates facial landmarks of interest in an image, such as the corners of the eyes and the outline of the lips.
B. Feature Extraction:
The cropped image sequences from the previous subsection are passed to a temporal local binary pattern algorithm to extract facial features that will enable successful modelling of the facial expressions.
The temporal local binary pattern algorithm used in this work is an extension of the original local binary pattern (LBP) operator. It captures the motion and appearance of an image sequence and produces a feature descriptor that describes the dynamic textures.
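As a rough illustration of the original LBP operator that the temporal algorithm extends, the following sketch computes basic 3x3 LBP codes and their histogram for a single grayscale frame. It is a minimal sketch only: the function names, the neighbour ordering and the 256-bin histogram are assumptions made for this example, and the temporal extension used in this work is not reproduced here.

```python
import numpy as np


def lbp_8neighbour(image):
    """Basic 3x3 LBP codes for the interior pixels of a 2-D grayscale image."""
    image = np.asarray(image, dtype=float)
    rows, cols = image.shape
    center = image[1:-1, 1:-1]
    # Neighbour offsets (row, col), clockwise from the top-left pixel.
    # The ordering is an arbitrary choice made for this sketch.
    offsets = [(-1, -1), (-1, 0), (-1, 1), (0, 1),
               (1, 1), (1, 0), (1, -1), (0, -1)]
    codes = np.zeros(center.shape, dtype=np.uint8)
    for bit, (dr, dc) in enumerate(offsets):
        neighbour = image[1 + dr:rows - 1 + dr, 1 + dc:cols - 1 + dc]
        # Set this bit wherever the neighbour is at least as bright as the center.
        codes |= (neighbour >= center).astype(np.uint8) * np.uint8(1 << bit)
    return codes


def lbp_histogram(image):
    """Normalised 256-bin histogram of LBP codes, used as a texture descriptor."""
    codes = lbp_8neighbour(image)
    hist = np.bincount(codes.ravel(), minlength=256).astype(float)
    return hist / hist.sum()
```

A temporal variant would apply a comparable thresholding scheme across neighbouring frames as well as within a frame, so that the descriptor reflects motion in addition to appearance.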
C. Feature Reduction:
Machine learning algorithms are known to degrade in performance when faced with many features that are not necessary for predicting the desired output, a concept known as the curse of dimensionality.
Therefore, the selection or extraction of relevant features leads to efficient modelling of data. The reader is referred to section III for a description of the feature reduction techniques investigated in this work.
D. Emotion Modeling: NARX Recurrent Neural Network:
The modeling of emotion remains a challenging task due to the large variance in emotion expressions and the temporal nature of emotion, amongst other factors. This can be addressed by employing prediction models that capture the temporal dynamics of emotion as it unfolds.
One such model is the Nonlinear AutoRegressive with eXogenous inputs Recurrent Neural Network (NARX-RNN) which is a dynamic network with feedback connections (as depicted in Figure 4) that allow the model to retain information about past inputs and to learn correlations between temporally distant events.
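The feedback structure described above can be sketched as follows. This is an illustrative sketch under stated assumptions, not the network used in this work: the helper names, the delay parameters and the choice to use only strictly past inputs are hypothetical, and the nonlinear map `f` stands in for a trained network. The first function builds the tapped-delay regressors used when training against known outputs; the second runs the model in closed loop, feeding its own predictions back as past outputs.

```python
import numpy as np


def _as_2d(u):
    """Treat a 1-D exogenous input as a single-feature sequence."""
    u = np.asarray(u, dtype=float)
    return u[:, None] if u.ndim == 1 else u


def narx_regressors(u, y, input_delays, output_delays):
    """Build rows [u(t-du..t-1), y(t-dy..t-1)] with targets y(t)."""
    u = _as_2d(u)
    y = np.asarray(y, dtype=float)
    start = max(input_delays, output_delays)
    rows = []
    for t in range(start, len(y)):
        past_u = u[t - input_delays:t].ravel()   # delayed exogenous inputs
        past_y = y[t - output_delays:t]          # delayed true outputs
        rows.append(np.concatenate([past_u, past_y]))
    return np.array(rows), y[start:]


def narx_closed_loop(f, u, y_init, input_delays, output_delays):
    """Predict with feedback: earlier predictions replace the true past outputs."""
    u = _as_2d(u)
    start = max(input_delays, output_delays)
    y_hat = [float(v) for v in y_init[:start]]   # seed values for the delay line
    for t in range(start, len(u)):
        x = np.concatenate([u[t - input_delays:t].ravel(),
                            np.asarray(y_hat[t - output_delays:t])])
        y_hat.append(float(f(x)))
    return np.array(y_hat)
```

In this sketch the delay lines are what let the model use information from earlier time steps; longer delays expose more of the past at the cost of a larger regressor vector.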
Dimensionality reduction is an essential preprocessing technique for high-dimensional data as too many features could lead to overfitting of models and an increase in noise. There are two basic approaches to dimensionality reduction: feature selection and feature extraction.
Feature selection is a process of selecting a subset of relevant features from the original feature set, while feature extraction creates a new feature set by transforming the existing features into a lower dimension. The data transformation can be linear or nonlinear.
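As an example of a linear feature extraction, PCA can be sketched with a singular value decomposition: the data are centered and projected onto the directions of largest variance. The function names and return values below are assumptions made for this illustration and are not taken from the paper's implementation.

```python
import numpy as np


def pca_fit(X, n_components):
    """Return the feature mean, the top principal axes and their variances."""
    X = np.asarray(X, dtype=float)
    mean = X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by decreasing singular value.
    _, S, Vt = np.linalg.svd(X - mean, full_matrices=False)
    components = Vt[:n_components]
    explained_variance = S[:n_components] ** 2 / (X.shape[0] - 1)
    return mean, components, explained_variance


def pca_transform(X, mean, components):
    """Project samples onto the retained principal axes."""
    return (np.asarray(X, dtype=float) - mean) @ components.T
```

For instance, `pca_transform(X, *pca_fit(X, 20)[:2])` would map each row of a hypothetical feature matrix `X` to a 20-dimensional vector.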
EXPERIMENTS AND RESULTS
This section describes the database used and the experimental setup for the comparative analysis of the three linear techniques described in section III. It also reports the parameter search procedure for the different NARX recurrent neural networks used. The results are presented and discussed in section IV-C.
A. Emotion Database:
The emotion database used in this work forms part of the SEMAINE corpus, which was recorded to study natural social signals that occur between humans and artificially intelligent agents. It contains audiovisual recordings of humans who interact with four emotionally stereotyped characters, role-played by humans, portraying the following personalities: (i) even-tempered and sensible, (ii) happy and outgoing, (iii) angry and confrontational, and (iv) sad and depressive.
B. Experimental Setup:
An experiment was designed to determine the optimal dimension size of each feature reduction technique. Each technique was set up to reduce the features to the following dimensions: 10, 20, 40, 60, 80 and 100.
A NARX recurrent neural network was optimized for each feature reduction technique per dimension size. A grid-search was conducted to estimate the NARX-RNN model parameters using five-fold cross validation with Pearson’s correlation coefficient as the evaluation metric.
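The parameter search can be sketched as a grid search scored by the mean Pearson correlation over five folds. The sketch below makes several assumptions that should be read as such: a ridge regression is used as a stand-in for the NARX-RNN so that the example stays short, the parameter name `alpha` is hypothetical, and the folds are formed by shuffling samples, whereas how folds were formed in this work is not stated here.

```python
import itertools

import numpy as np


def kfold_indices(n_samples, k=5, seed=0):
    """Split shuffled sample indices into k roughly equal folds."""
    rng = np.random.default_rng(seed)
    return np.array_split(rng.permutation(n_samples), k)


def fit_ridge(X, y, alpha):
    """Stand-in model (NOT a NARX-RNN): closed-form ridge regression weights."""
    A = X.T @ X + alpha * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ y)


def grid_search(X, y, grid, k=5):
    """Return the parameter combination with the best mean held-out correlation."""
    folds = kfold_indices(len(y), k)
    names = list(grid)
    best_params, best_score = None, -np.inf
    for values in itertools.product(*(grid[name] for name in names)):
        params = dict(zip(names, values))
        scores = []
        for i in range(k):
            test = folds[i]
            train = np.concatenate([folds[j] for j in range(k) if j != i])
            w = fit_ridge(X[train], y[train], **params)
            # Pearson correlation between held-out predictions and targets.
            scores.append(np.corrcoef(X[test] @ w, y[test])[0, 1])
        mean_score = float(np.mean(scores))
        if mean_score > best_score:
            best_params, best_score = params, mean_score
    return best_params, best_score
```

Under these assumptions, the same search would be repeated for each feature reduction technique and each dimension size, giving one tuned model per combination.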
The feature reduction techniques were evaluated using the average Pearson’s correlation coefficient which is obtained by computing the correlation between the emotion predictions and ground truth for each video in the dataset, and then averaging over all videos in a specific emotion dimension. The performance of the investigated techniques was estimated over 30 independent runs.
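The evaluation metric described above can be sketched as follows: a Pearson correlation is computed per video between the predicted and annotated traces, and the per-video values are then averaged. The function names and the list-of-sequences input format are assumptions made for this illustration.

```python
import numpy as np


def pearson(a, b):
    """Pearson's correlation coefficient between two equal-length sequences."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    a_c = a - a.mean()
    b_c = b - b.mean()
    denom = np.sqrt((a_c ** 2).sum() * (b_c ** 2).sum())
    if denom == 0:
        return float("nan")  # undefined when either sequence is constant
    return float((a_c * b_c).sum() / denom)


def average_correlation(predictions, ground_truth):
    """Mean of the per-video correlations for one emotion dimension."""
    scores = [pearson(p, g) for p, g in zip(predictions, ground_truth)]
    return float(np.mean(scores))
```

Averaging per-video correlations, rather than pooling all frames into one sequence, weights each video equally regardless of its length.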
D. Data Analysis:
The annotation of continuous dimensional emotion is a very challenging task as it is highly subjective, and requires more attention and cognitive processing than non-real-time, discrete annotation tasks. This affects the learning ability of a model as it relies on accurate and consistent ground truth.
This paper compared three feature reduction techniques, namely principal component analysis (PCA), locality preserving projections (LPP) and factor analysis (FA), on the problem of continuous dimensional emotion recognition. Various dimension sizes were explored for each technique and a NARX recurrent neural network was optimized for each technique variant.
Experimental results showed that PCA significantly outperformed LPP and FA for both arousal and valence emotion dimensions. The large reduction of features and corresponding better performance confirm that feature reduction is a crucial step for building compact and accurate models, especially for incorporation in human-robot technologies.
Recently, the input modalities of emotion recognition systems have been extended to allow for detection of facial and vocal expressions, gestures and body postures. The multiple modalities often increase the accuracy and robustness of emotion systems since some modalities may carry complementary information. Thus, multiple feature sets representing each modality have to be obtained, which leads to a very large feature space of different forms.
Therefore, future work includes exploring dimensionality reduction methods that can transform the multiple feature sets into a unified space of lower dimension. The fusion of multiple kernel learning algorithms with dimension reduction techniques shows great promise, and provides a good starting point.
Source: University of Cambridge
Authors: Ntombikayise Banda | Andries Engelbrecht | Peter Robinson