Google Unveils TPU v5p Pods to Accelerate AI Training
Google has announced its latest Tensor Processing Unit (TPU), the v5p, aimed at cutting the time required to train large language models. The TPU v5p is the successor to the TPU v5e and offers greater performance and scalability.
TPUs are specialized chips used by Google to power machine learning features in its various web products, such as Gmail, Google Maps, and YouTube. While initially used internally, Google has recently opened up its TPUs to the public for AI training and inference tasks.
The TPU v5p is touted as Google's most powerful accelerator yet, delivering 459 teraFLOPS of bfloat16 performance or 918 teraOPS of INT8. Each chip comes equipped with 95 GB of high-bandwidth memory that can transfer data at 2.76 TB/s.
What sets the TPU v5p apart is its ability to be clustered together in pods, with up to 8,960 accelerators in a single pod. These pods are connected using Google's inter-chip interconnect, allowing the chips to exchange data quickly enough to train a single model across the whole pod. Compared to previous generations, the TPU v5p allows for clusters 35 times larger than what was possible with the TPU v5e and more than twice as large as the TPU v4.
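Some quick back-of-the-envelope arithmetic with the figures above shows what that scale implies. This is a rough sketch based only on the numbers quoted in this article; the implied v5e pod size is derived from the "35 times larger" claim, not an official specification.

```python
# Back-of-the-envelope pod arithmetic using the figures quoted above.
V5P_POD_CHIPS = 8960          # chips per TPU v5p pod (from the announcement)
BF16_TFLOPS_PER_CHIP = 459    # bfloat16 teraFLOPS per v5p chip
INT8_TOPS_PER_CHIP = 918      # INT8 teraOPS per v5p chip

# Peak aggregate throughput of one fully populated pod (tera -> exa).
pod_bf16_exaflops = V5P_POD_CHIPS * BF16_TFLOPS_PER_CHIP / 1e6
pod_int8_exaops = V5P_POD_CHIPS * INT8_TOPS_PER_CHIP / 1e6

print(f"Peak BF16: {pod_bf16_exaflops:.2f} exaFLOPS per pod")
print(f"Peak INT8: {pod_int8_exaops:.2f} exaOPS per pod")

# The "35 times larger than v5e" claim implies a v5e pod of roughly:
v5e_pod_chips = V5P_POD_CHIPS / 35
print(f"Implied v5e pod size: {v5e_pod_chips:.0f} chips")
```

In other words, a fully populated v5p pod peaks at roughly 4.1 bfloat16 exaFLOPS, and the 35x scaling claim implies v5e pods of about 256 chips. These are peak theoretical figures; sustained training throughput is always lower.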
According to Google, the new hardware can train large language models such as OpenAI's 175-billion-parameter GPT-3 up to 1.9 times faster than the older TPU v4 when using BF16, and up to 2.8 times faster when using 8-bit integer calculations.
However, the enhanced performance and scalability come at a higher price. The TPU v5p accelerators cost $4.20 per chip-hour, compared to $3.22 per hour for the TPU v4 and $1.20 per hour for the TPU v5e. For users who are not in a rush to train or fine-tune their models, the TPU v5e remains the more cost-effective option.
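Hourly price alone can be misleading, though: if Google's claimed speedup holds for a given workload, the cost of the whole training job matters more than the hourly rate. The sketch below takes the quoted prices and the 2.8x INT8 speedup at face value; the 100-hour baseline run is purely illustrative.

```python
# Cost-per-job comparison using the hourly prices and speedup quoted above.
PRICE_PER_HOUR = {"v5p": 4.20, "v4": 3.22, "v5e": 1.20}
V5P_SPEEDUP_OVER_V4 = 2.8   # Google's claimed INT8 training speedup vs. v4

# Suppose a training run takes 100 hours on TPU v4 (illustrative figure).
v4_hours = 100
v4_cost = v4_hours * PRICE_PER_HOUR["v4"]

# The same run on v5p, assuming the claimed speedup holds end to end.
v5p_hours = v4_hours / V5P_SPEEDUP_OVER_V4
v5p_cost = v5p_hours * PRICE_PER_HOUR["v5p"]

print(f"v4:  {v4_hours:.0f} h -> ${v4_cost:.2f}")
print(f"v5p: {v5p_hours:.1f} h -> ${v5p_cost:.2f}")
```

Under those assumptions the v5p job finishes in about 36 hours and costs roughly $150 versus $322 on v4: the faster chip is cheaper per job whenever its speedup exceeds the price ratio (here 2.8 versus 4.20 / 3.22 ≈ 1.30). The v5e comparison depends entirely on how much slower a given model trains on it.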
In addition to the hardware upgrades, Google has introduced the concept of an AI hypercomputer. This system optimizes hardware, software, machine learning frameworks, and consumption models to reduce inefficiencies and bottlenecks commonly associated with demanding AI workloads.
To complement the new hardware, Google has also unveiled Gemini, a multimodal large language model capable of handling various forms of data, including text, images, video, audio, and code.
The introduction of the TPU v5p and the AI hypercomputer architecture marks another significant step by Google to advance AI technology and make it more accessible to developers and researchers. Although the higher price may deter some users, the added performance and scalability offer substantial benefits for those who need accelerated AI training.
As Google continues to innovate in the AI space, it will be interesting to see how these advancements further shape the future of machine learning and its applications across various industries.