MIT researchers, in collaboration with the MIT-IBM Watson AI Lab and IBM Research, have created a new machine-learning technique that blends multiple modalities to learn more like humans. The method learns from unlabeled audio and visual data and could benefit a range of applications, including speech recognition, audio generation engines, and object detection. The approach combines contrastive learning and masked data modeling, with the aim of replicating how humans perceive and understand the world.
The researchers used a contrastive audio-visual masked autoencoder (CAV-MAE), a neural network that learns to map audio and visual data into a shared space of meaningful latent representations. The model can be trained on large datasets of 10-second YouTube clips containing both audio and video. The researchers claim that CAV-MAE outperforms previous techniques because it explicitly emphasizes the association between the audio and visual data.
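To make the idea concrete, here is a minimal sketch (not the authors' implementation) of a dual-stream architecture in this spirit, assuming hypothetical class names, embedding sizes, and patch counts: audio is treated as a sequence of spectrogram patches and video as a sequence of frame patches, each projected into a common token space and then fused by a shared joint encoder.

```python
# Toy CAV-MAE-style dual-stream encoder; all names and dimensions are
# illustrative assumptions, not the published architecture.
import torch
import torch.nn as nn

class ToyAudioVisualEncoder(nn.Module):
    def __init__(self, patch_dim=256, embed_dim=128, num_layers=2, num_heads=4):
        super().__init__()
        # Modality-specific projections into a common token space.
        self.audio_proj = nn.Linear(patch_dim, embed_dim)
        self.video_proj = nn.Linear(patch_dim, embed_dim)
        # Shared joint encoder operating on the concatenated token sequence.
        layer = nn.TransformerEncoderLayer(
            d_model=embed_dim, nhead=num_heads, batch_first=True
        )
        self.joint_encoder = nn.TransformerEncoder(layer, num_layers=num_layers)

    def forward(self, audio_patches, video_patches):
        # audio_patches: (batch, n_audio_tokens, patch_dim)
        # video_patches: (batch, n_video_tokens, patch_dim)
        a = self.audio_proj(audio_patches)
        v = self.video_proj(video_patches)
        tokens = torch.cat([a, v], dim=1)       # fuse both modalities
        fused = self.joint_encoder(tokens)      # joint latent representation
        # Mean-pool each modality's tokens into a clip-level embedding.
        a_emb = fused[:, : a.shape[1]].mean(dim=1)
        v_emb = fused[:, a.shape[1]:].mean(dim=1)
        return a_emb, v_emb
```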
CAV-MAE combines two approaches: masked data modeling and contrastive learning. Masked data modeling hides a portion of the audio and visual data points and trains the model to recover them through a joint encoder/decoder; the training signal is a reconstruction loss that measures the difference between the reconstructed prediction and the original audio-visual input. Contrastive learning, in turn, aims to map similar representations close to one another. It does so by associating the relevant parts of the audio and video data, such as connecting mouth movements to the corresponding spoken words.
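The two training signals can be sketched as follows, assuming the toy encoder above: a masked-reconstruction loss computed only on the hidden patches, and an InfoNCE-style contrastive loss that pulls the audio and video embeddings of the same clip together while pushing apart mismatched pairs. This is a simplified illustration of the general technique, not the paper's exact losses.

```python
# Hedged sketch of the two objectives; function names and the
# temperature value are illustrative assumptions.
import torch
import torch.nn.functional as F

def masked_reconstruction_loss(pred_patches, true_patches, mask):
    # mask: boolean tensor marking which patches were hidden from the
    # encoder; the loss is computed only on those masked positions.
    diff = (pred_patches - true_patches) ** 2
    return diff[mask].mean()

def contrastive_loss(audio_emb, video_emb, temperature=0.07):
    # Normalize and compute pairwise similarities; matching audio/video
    # pairs (the diagonal) are positives, all other pairs are negatives.
    a = F.normalize(audio_emb, dim=-1)
    v = F.normalize(video_emb, dim=-1)
    logits = a @ v.t() / temperature
    targets = torch.arange(logits.shape[0], device=logits.device)
    return (F.cross_entropy(logits, targets) +
            F.cross_entropy(logits.t(), targets)) / 2
```

In a setup like this, pre-training would typically optimize a weighted sum of the two losses, which is one way the complementary nature of the objectives described below could be exploited.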
The researchers compared CAV-MAE-based models against other methods on audio-video retrieval and audio-visual event classification tasks. The results showed that contrastive learning and masked data modeling are complementary. CAV-MAE outperformed previous techniques in event classification and remained competitive with models trained using industry-level computational resources. Additionally, multi-modal pre-training significantly improved the fine-tuning of single-modality representations and boosted performance on audio-only event classification tasks.
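As an illustration of how a retrieval task like this could be evaluated with clip-level embeddings, here is a small sketch, assuming embeddings produced by the toy encoder above: videos are ranked by cosine similarity to an audio query.

```python
# Illustrative audio-to-video retrieval; the function name and shapes
# are assumptions made for this sketch.
import torch
import torch.nn.functional as F

def retrieve_videos(audio_query_emb, video_embs, top_k=5):
    # audio_query_emb: (embed_dim,), video_embs: (n_clips, embed_dim)
    sims = F.cosine_similarity(audio_query_emb.unsqueeze(0), video_embs, dim=-1)
    return torch.topk(sims, k=min(top_k, video_embs.shape[0])).indices
```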
The MIT researchers believe that CAV-MAE represents a significant breakthrough in self-supervised audio-visual learning. Its potential use cases range from action recognition in sports, education, entertainment, motor vehicles, and public safety to cross-lingual automatic speech recognition and audio and video generation. While the current method focuses on audio-visual data, the researchers aim to extend it to other sensory modalities.
As machine learning continues to evolve, techniques like CAV-MAE will become increasingly valuable. They will enable models to interpret and understand the world better, and the researchers are hopeful about the potential this holds for the future.