On Wednesday, Google introduced Gemini, its highly anticipated general-purpose, multimodal generative AI model, claiming it is more powerful than OpenAI’s GPT-4. According to Demis Hassabis, co-founder and CEO of Google DeepMind, the company’s elite AI lab, Gemini can understand the world around us much as humans do, which Google argues sets it apart from other available models.
Gemini was reportedly trained with about 5 times the compute of GPT-4, a figure Google has not confirmed, which would allow faster training and potentially larger models. Google says Gemini Ultra is the first model to outperform human experts on MMLU (Massive Multitask Language Understanding), a popular benchmark for evaluating AI models’ knowledge and problem-solving abilities.
Starting December 13, developers and enterprise customers can access Gemini Pro through the Gemini API in Google AI Studio or Google Cloud Vertex AI. The most capable version of the model, Gemini Ultra, is expected to debut in 2024, pending thorough trust and safety checks.
Gemini comes in three sizes (Ultra, Pro, and Nano) and can run efficiently on everything from data centers to mobile devices. It combines different types of information, such as text, code, audio, images, and video, which Google says lets it comprehend and reason about diverse inputs better than existing multimodal models.
Google highlights that Gemini Ultra excels at tasks involving deliberate reasoning, surpassing previous state-of-the-art models. It also performs strongly on image benchmarks, demonstrating its native multimodality and complex reasoning abilities.
Unlike the standard approach of training separate components for different modalities and then combining them, Gemini was designed to be natively multimodal from the start. According to Google, this design enables it to understand and reason about varied inputs more effectively than its counterparts.
Gemini was trained to recognize and understand text, images, audio, and more at the same time. As a result, Google says, it excels at explaining reasoning in complex subjects such as math and physics.
Gemini’s sophisticated multimodal reasoning capabilities help it make sense of intricate written and visual information. By extracting insights from hundreds of thousands of documents, Google says, Gemini could help deliver breakthroughs in fields ranging from science to finance much faster than human researchers could.
Another standout feature of Gemini is its ability to understand, explain, and generate high-quality code in popular programming languages, which Google says makes it one of the leading foundation models for coding in the world.
Google trained Gemini on its AI-optimized infrastructure using its in-house Tensor Processing Units (TPUs). This reduces its dependence on GPUs, whose frequent shortages can constrain the training and serving of other models such as GPT-4.
The company invested considerable effort in making Gemini reliable and scalable to train, and efficient to serve to users. Google also emphasizes that it has added new protections to mitigate potential risks associated with Gemini’s multimodal capabilities, and that it considered safety at every stage of development.
Gemini is being rolled into various Google products and platforms. For instance, Google’s chatbot, Bard, will use a fine-tuned version of Gemini Pro for more advanced reasoning, planning, understanding, and more.
While the capabilities of generative AI models will keep improving, Google’s unveiling of Gemini raises the bar in this rapidly evolving field.