A YouTuber has recreated Google’s controversial Gemini Ultra video, which seemingly showcased real-time responses to changes in a live video feed. In reality, Google staged the demonstration. In response, the YouTuber used OpenAI’s vision model GPT-4V to build a similar video and test what the technology can actually do.
Google introduced its Gemini family of artificial intelligence models, including the flagship Gemini Ultra, which supposedly responded in real time to changes in a video. Although the promotional video was impressive, it was later revealed that Google produced the results by having the model solve problems from still images over an extended period, rather than through true real-time processing.
To determine whether the AI-powered features showcased in Google’s video are feasible today, YouTuber Greg Technology built a simple app using OpenAI’s GPT-4V and put it through its paces. For comparison, Gemini Ultra was trained on a multimodal dataset spanning images, text, code, video, audio, and motion data, which is meant to let it comprehend the world much as humans do.
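The core of such an app is straightforward: capture a frame from the webcam, encode it, and send it to the vision model with a question. The sketch below is a minimal Python illustration of a single step of that loop, assuming OpenAI’s vision-enabled chat endpoint (the gpt-4-vision-preview model available at the time) and OpenCV for capture; it is an assumption-laden illustration, not Greg Technology’s actual code.

```python
import base64
import cv2  # pip install opencv-python
from openai import OpenAI  # pip install openai

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# Grab a single frame from the default webcam.
cap = cv2.VideoCapture(0)
ok, frame = cap.read()
cap.release()
if not ok:
    raise RuntimeError("Could not read from webcam")

# Encode the frame as a base64 JPEG, the format the vision endpoint accepts.
_, jpeg = cv2.imencode(".jpg", frame)
b64_frame = base64.b64encode(jpeg.tobytes()).decode("utf-8")

# Ask GPT-4V to describe what it sees in the frame.
response = client.chat.completions.create(
    model="gpt-4-vision-preview",  # the GPT-4V model name at the time
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Briefly describe what you see."},
            {"type": "image_url",
             "image_url": {"url": f"data:image/jpeg;base64,{b64_frame}"}},
        ],
    }],
    max_tokens=100,
)
print(response.choices[0].message.content)
```

Running this repeatedly on fresh frames is what gives the effect of “watching” a live video: the model never sees a video stream, only a rapid series of stills.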
Google’s video showed various actions being performed while Gemini provided a descriptive voiceover of what it could allegedly see. While the responses were accurate, they were derived from still images or segmented clips rather than generated in real time. Essentially, the video served more as a marketing tool than a technical demonstration.
In his two-minute video, Greg says he was excited by Google’s Gemini demo and disappointed to learn it lacked real-time functionality. According to Greg, GPT-4V, released a month earlier, could already do everything shown in the Gemini video, and in real time.
In his demo, the interaction with GPT-4V works much like ChatGPT’s voice mode, with responses delivered in a natural spoken tone. Beyond describing what it sees, the model recognizes hand gestures, identifies a drawing of a duck on water, and even plays rock, paper, scissors.
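The natural spoken tone can be achieved by piping each text reply through a text-to-speech model before playback. Here is a minimal sketch, assuming OpenAI’s tts-1 model and alloy voice and reusing the reply from the snippet above (Greg’s interface may handle this differently):

```python
from pathlib import Path
from openai import OpenAI

client = OpenAI()

# In a real loop this would be the model's description of the latest frame;
# the text here is a hypothetical placeholder.
reply_text = "I see a hand making a peace sign."

# Convert the reply to speech; looping and audio playback are omitted.
speech = client.audio.speech.create(
    model="tts-1",
    voice="alloy",
    input=reply_text,
)
speech.stream_to_file(Path("reply.mp3"))  # play this file to "speak" the reply
```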
Greg Technology has made the code for his ChatGPT video interface available on GitHub so that others can experiment with it and explore its capabilities.
To verify the video’s authenticity, the code furnished by Greg Technology was installed and tested; it successfully identified hand gestures and a glass coffee cup, and even supplied a book’s title and author. This is a testament to OpenAI’s lead in multimodal support: other models can analyze image content but still struggle with real-time video analysis.
As OpenAI continues to push the field forward, it remains at the forefront of AI models capable of comprehending multiple modes of data. While other models have made strides in image analysis, GPT-4V demonstrates a markedly better ability to handle video in real time.
In conclusion, recreating Google’s Gemini Ultra video with OpenAI’s GPT-4V highlights the real progress AI technology has made. By leveraging multimodal support, OpenAI has achieved results that currently surpass competing models. Despite the initial disappointment over Google’s staged video, the recreation shows the exciting possibilities ahead for artificial intelligence and video analysis.