Microsoft AI Introduces MM-REACT: A Multimodal System for Advanced Reasoning and Action
Large Language Models (LLMs) have been reshaping the field of artificial intelligence, with effects that reach across economic and social life. Among these models, OpenAI’s ChatGPT has gained immense popularity in recent months for its ability to generate human-like text. Built on OpenAI’s GPT series of transformer models, ChatGPT has become a go-to tool for natural language processing tasks.
Simultaneously, computer vision has made remarkable progress of its own. Researchers at Microsoft have now introduced MM-REACT, a novel system paradigm that combines multiple vision experts with ChatGPT for advanced multimodal reasoning and action. The integration aims to overcome complex visual understanding challenges through a more flexible approach than training a single monolithic vision-language model.
MM-REACT has been developed to tackle a wide range of intricate visual tasks that existing vision and vision-language models often struggle with. To achieve this, the system uses a prompt design that incorporates various types of information: text descriptions, textualized spatial coordinates, and dense visual signals such as images and videos, which are represented through aligned file names. This design allows ChatGPT to process different information types in conjunction with visual input, leading to a more comprehensive and accurate understanding.
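To make this concrete, the sketch below shows one way such a prompt could be assembled in Python. The field labels, coordinate format, and example content are illustrative assumptions for clarity, not MM-REACT's actual prompt template:

```python
# Illustrative sketch of how a multimodal prompt might be assembled.
# The field labels and coordinate format are assumptions, not MM-REACT's
# actual prompt template.

def build_prompt(question: str, image_path: str, expert_outputs: list[str]) -> str:
    """Combine the user question, an image placeholder (its file path),
    and any serialized vision-expert outputs into one text prompt."""
    lines = [
        f"Image: {image_path}",   # dense visual signal, referenced by file name
        *expert_outputs,          # e.g. captions or textualized box coordinates
        f"Question: {question}",
    ]
    return "\n".join(lines)

print(build_prompt(
    question="What brand is the laptop on the desk?",
    image_path="images/desk_photo.jpg",
    expert_outputs=[
        "Caption: a laptop and a coffee mug on a wooden desk",
        "Objects: laptop (120, 45, 380, 290); mug (400, 200, 470, 300)",
    ],
))
```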
MM-REACT comprises ChatGPT and a pool of vision experts that together give the system its multimodal capabilities. By using file paths as placeholders and inputting them into ChatGPT, the system becomes capable of accepting images as input. Whenever specific information from an image is required, such as identifying a celebrity or locating object coordinates, ChatGPT calls on the relevant vision expert. The expert's output is then serialized as text and combined with the input to further activate ChatGPT. If no external experts are needed, the response is returned directly to the user.
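A minimal sketch of this dispatch loop might look like the following. The `chatgpt` callable, the placeholder expert functions, and the invocation convention are hypothetical stand-ins, since the article does not specify implementation details:

```python
# Minimal sketch of an MM-REACT-style dispatch loop. The `chatgpt` callable,
# the expert registry, and the keyword convention are hypothetical stand-ins.

def recognize_celebrity(image_path: str) -> str:
    return "celebrity: <name>"            # placeholder vision expert

def detect_objects(image_path: str) -> str:
    return "laptop (120, 45, 380, 290)"   # placeholder vision expert

EXPERTS = {
    "celebrity recognition": recognize_celebrity,
    "object detection": detect_objects,
}

def mm_react(question: str, image_path: str, chatgpt) -> str:
    # The image enters the conversation as a file-path placeholder.
    context = f"Image: {image_path}\nQuestion: {question}"
    while True:
        response = chatgpt(context)
        # Check whether the model asked for any registered expert.
        invoked = next((name for name in EXPERTS if name in response), None)
        if invoked is None:
            return response                         # no expert needed
        observation = EXPERTS[invoked](image_path)  # expert output, serialized as text
        # Feed the expert's observation back in to activate ChatGPT again.
        context += f"\n{response}\nObservation: {observation}"
```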
To enable ChatGPT to utilize the knowledge of the vision experts, specific instructions about each expert’s capabilities, input arguments, and outputs are added to the ChatGPT prompt. In-context examples for each expert are provided as well, along with special keywords that invoke the experts via regular-expression matching.
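The hypothetical snippet below illustrates the idea: the prompt prefix describes an expert and shows an in-context example, and a regular expression detects the invocation keyword in the model's output. The wording, the `<Assistant>` keyword format, and the expert name are illustrative assumptions, not the paper's verbatim prompt:

```python
# Hypothetical example of the keyword convention: the prompt tells ChatGPT
# how to call an expert, and a regex detects that call in its output.
import re

EXPERT_INSTRUCTIONS = """\
You may ask a vision assistant for help. Available expert:
- image captioning: input an image file path; output a one-sentence caption.
  To invoke it, write exactly: <Assistant> image captioning <file path>

Example:
User: What is in images/dog.jpg?
AI: <Assistant> image captioning images/dog.jpg
Observation: a golden retriever playing in a park
AI: The image shows a golden retriever playing in a park.
"""

# Matches an expert name followed by an image file path.
INVOCATION = re.compile(r"<Assistant> (?P<expert>[\w ]+?) (?P<path>\S+\.(?:jpg|jpeg|png))")

match = INVOCATION.search("<Assistant> image captioning images/dog.jpg")
if match:
    print(match.group("expert"), match.group("path"))
    # -> image captioning images/dog.jpg
```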
Experiments conducted on MM-REACT have demonstrated its effectiveness across the capabilities of interest. Zero-shot experiments have showcased its ability to solve a wide range of advanced visual tasks that require complex visual understanding. For instance, MM-REACT can solve linear equations displayed in images and reason about open-ended concepts, such as identifying a product in a photo and naming its ingredients.
In conclusion, the MM-REACT system paradigm seamlessly combines language and vision expertise, resulting in advanced visual intelligence. By leveraging the power of ChatGPT and incorporating a diverse set of vision experts, Microsoft AI has pushed the boundaries of what is possible in multimodal reasoning and action. This development holds tremendous potential for industries and research fields that require sophisticated visual understanding and reasoning capabilities.
As AI continues to evolve, it is essential to explore innovative approaches like MM-REACT that bring together different AI disciplines to tackle complex challenges. The integration of language and vision models offers new opportunities for AI-powered systems to comprehend and interact with the world more comprehensively. With further advancements in multimodal AI, we can look forward to remarkable transformations across various industries and domains.