Facial Expression Recognition with Identity and Spatial-temporal Integrated Learning
The spatial-temporal structure of expression frames plays a critical role in video-based facial expression recognition (FER). In this paper, we propose a 3D CNN-based framework that learns this spatial-temporal structure from expression frames. First, we train an identity network on identity-labeled data to capture the facial biometric features in expression frames. Second, we remove the influence of these biometric features from the expression features to construct typical facial expression (TFE) features. Third, we feed the TFE features to a 3D network to discover the spatial-temporal structure of the expression frames. Finally, we pass the resulting spatial-temporal vector through a fully connected layer to obtain the vector used for classification. The proposed method achieves accuracy comparable to the state of the art (88.54% on Oulu-CASIA) and is efficient enough for video-based FER.
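The data flow described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the abstract does not say how identity features are removed, so element-wise subtraction is assumed here, and the feature shapes, the single 3D convolution layer, the random weights, and the six-class output are all hypothetical choices made only to show how the stages connect.

```python
import numpy as np


def remove_identity(expr_feats, id_feats):
    """Build TFE features from expression and identity features.

    Assumption: identity influence is removed by element-wise subtraction.
    The abstract does not specify the actual operation.
    """
    return expr_feats - id_feats


def conv3d_valid(x, kernels):
    """Valid (no padding) 3D convolution without kernel flipping, as in CNNs.

    x:       (T, H, W) feature volume over time, height, width
    kernels: (K, kt, kh, kw) bank of K spatial-temporal filters
    returns: (K, T-kt+1, H-kh+1, W-kw+1)
    """
    # Every (kt, kh, kw) window of x, shape (T', H', W', kt, kh, kw).
    windows = np.lib.stride_tricks.sliding_window_view(x, kernels.shape[1:])
    # Dot each window with each kernel.
    return np.einsum("thwijk,nijk->nthw", windows, kernels)


def classify(tfe, kernels, fc_w, fc_b):
    """3D convolution -> ReLU -> global average pool -> fully connected layer."""
    act = np.maximum(conv3d_valid(tfe, kernels), 0.0)
    st_vec = act.mean(axis=(1, 2, 3))      # spatial-temporal vector, shape (K,)
    logits = fc_w @ st_vec + fc_b          # fully connected layer
    exp = np.exp(logits - logits.max())    # numerically stable softmax
    return st_vec, exp / exp.sum()


# Hypothetical sizes: 8 frames of 16x16 feature maps, 4 filters, 6 classes.
rng = np.random.default_rng(0)
T, H, W = 8, 16, 16
expr_feats = rng.normal(size=(T, H, W))    # stand-in for expression features
id_feats = rng.normal(size=(T, H, W))      # stand-in for identity-network output
kernels = rng.normal(size=(4, 3, 3, 3)) * 0.1
fc_w = rng.normal(size=(6, 4)) * 0.1
fc_b = np.zeros(6)

tfe = remove_identity(expr_feats, id_feats)
st_vec, probs = classify(tfe, kernels, fc_w, fc_b)
print(st_vec.shape, probs.shape)
```

In a real system the random arrays would be replaced by learned features and weights, and the single convolution by a deeper 3D network; the sketch only fixes the order of the stages: identity removal, spatial-temporal feature extraction, then classification.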