A new AI framework has been developed to enhance emotion analysis of posts on social media, where users often pair text with emojis, images, audio, or video to attract more attention. The framework leverages two stacked layers of transformers, a state-of-the-art AI architecture, for multimodal sentiment analysis.
The Chinese research team behind this framework aims to improve the understanding of how different modalities interact and reinforce each other when conveying emotions. By fusing the modalities in two stages, the framework captures information at multiple levels. The approach was tested on three open datasets (MOSI, MOSEI, and SIMS), where it outperformed or matched benchmark models.
The workflow of the framework involves feature extraction, two stages of information fusion, and emotion prediction. Text, audio, and video signals from source video clips are processed and encoded with additional context information to create context-aware representations. These representations then pass through the two fusion stages, where the interactions among text, audio, and video are refined for emotion prediction.
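To make that workflow concrete, here is a minimal PyTorch sketch of how such a pipeline could be wired together. Everything in it is an illustrative assumption rather than the team's actual implementation: the feature dimensions, the BiLSTM context encoders, and the simple linear fusion stages merely stand in for components the description does not spell out.

```python
import torch
import torch.nn as nn

# Hypothetical sketch of the described workflow: unimodal context encoding,
# two fusion stages, and an emotion prediction head. All names and sizes
# are illustrative assumptions, not the authors' implementation.
class TwoStageFusionModel(nn.Module):
    def __init__(self, text_dim=768, audio_dim=74, video_dim=35, d_model=128):
        super().__init__()
        # Context-aware encoders for each modality (assumed BiLSTMs)
        self.text_enc = nn.LSTM(text_dim, d_model, batch_first=True, bidirectional=True)
        self.audio_enc = nn.LSTM(audio_dim, d_model, batch_first=True, bidirectional=True)
        self.video_enc = nn.LSTM(video_dim, d_model, batch_first=True, bidirectional=True)
        # Stage 1: fuse the three modalities into a joint representation
        self.stage1 = nn.Linear(6 * d_model, 2 * d_model)
        # Stage 2: refine the joint representation together with the unimodal summaries
        self.stage2 = nn.Linear(8 * d_model, 2 * d_model)
        # Emotion prediction head (regression, as in MOSI/MOSEI-style labels)
        self.head = nn.Linear(2 * d_model, 1)

    def forward(self, text, audio, video):
        # Encode each modality with context, then mean-pool over time
        t, _ = self.text_enc(text)
        a, _ = self.audio_enc(audio)
        v, _ = self.video_enc(video)
        t, a, v = t.mean(dim=1), a.mean(dim=1), v.mean(dim=1)
        # First fusion stage: joint representation of all modalities
        fused1 = torch.relu(self.stage1(torch.cat([t, a, v], dim=-1)))
        # Second fusion stage: combine the joint and unimodal summaries
        fused2 = torch.relu(self.stage2(torch.cat([fused1, t, a, v], dim=-1)))
        return self.head(fused2)

model = TwoStageFusionModel()
text = torch.randn(4, 50, 768)   # (batch, seq_len, feature_dim)
audio = torch.randn(4, 50, 74)
video = torch.randn(4, 50, 35)
print(model(text, audio, video).shape)  # torch.Size([4, 1])
```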
At the core of the framework are the stacked transformers: bidirectional cross-modal transformers and a transformer encoder. These components enable cross-modal interaction and a more nuanced second stage of fusion. An attention-weight accumulation mechanism aggregates the attention weights produced by the different modalities during fusion, strengthening the extraction of shared information.
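The description suggests cross-modal attention running in both directions between a pair of modalities, with the resulting attention maps accumulated. The sketch below is an illustrative guess at one such bidirectional block, built from PyTorch's nn.MultiheadAttention; the accumulation rule (summing the two directions' attention maps) is an assumption for demonstration, not the paper's stated formula.

```python
import torch
import torch.nn as nn

# Illustrative sketch (not the authors' code) of a bidirectional cross-modal
# block: each modality queries the other, and the returned attention maps are
# accumulated so that cues both directions agree on are reinforced.
class BidirectionalCrossModalBlock(nn.Module):
    def __init__(self, d_model=256, n_heads=4):
        super().__init__()
        self.x_to_y = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.y_to_x = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.norm_x = nn.LayerNorm(d_model)
        self.norm_y = nn.LayerNorm(d_model)

    def forward(self, x, y):
        # x attends to y (queries from x, keys/values from y), and vice versa
        x_out, attn_xy = self.x_to_y(x, y, y, need_weights=True)
        y_out, attn_yx = self.y_to_x(y, x, x, need_weights=True)
        x = self.norm_x(x + x_out)
        y = self.norm_y(y + y_out)
        # Assumed attention-weight accumulation: align the two directions'
        # maps and sum them to highlight shared cross-modal information
        accumulated = attn_xy + attn_yx.transpose(1, 2)
        return x, y, accumulated

block = BidirectionalCrossModalBlock()
text = torch.randn(2, 50, 256)   # (batch, text_len, d_model)
audio = torch.randn(2, 50, 256)  # (batch, audio_len, d_model)
t, a, weights = block(text, audio)
print(t.shape, a.shape, weights.shape)
```

In a full model, a stack of such blocks for each modality pair would feed a standard transformer encoder for the second fusion stage, matching the two-stage structure outlined above.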
In the future, the research team plans to integrate more advanced transformers to improve computational efficiency and address challenges associated with the self-attention mechanism. By leveraging cutting-edge AI models and innovative fusion techniques, the new framework offers a promising approach to analyzing emotions expressed across the different modalities of social media posts.