
In a recent interview, US chip investment expert Gavin Baker offers an in-depth comparison of NVIDIA GPUs (Hopper, Blackwell) and Google TPUs across technology, performance, cost, and ecosystem synergy. He argues that while Google TPUs hold a temporary edge, NVIDIA's GPU ecosystem retains the stronger advantage in the long run.
GPUs are full-stack platforms, while TPUs are single-point ASICs.
Baker stated that the divergence in AI accelerators stems from fundamentally different design philosophies. NVIDIA's GPUs, from Hopper and Blackwell to the future Rubin, are built as full-stack platforms: from the GPU itself and NVLink interconnects to network cards, switches, and software layers such as CUDA and TensorRT, everything is handled by NVIDIA. After purchasing the GPUs, enterprises gain a complete environment that can be used directly for training and inference, with no need to assemble networks or rewrite software themselves.
In contrast, Google TPUs (v4, v5e, v6, v7) are essentially Application-Specific Integrated Circuits (ASICs), accelerators designed for specific AI computations. Google handles the front-end logic design, Broadcom takes on the back-end physical design, and TSMC fabricates the chips. Google must also integrate the other essential components around the TPU, such as switches, network cards, and the software ecosystem, making supply-chain coordination far more complex than for GPUs.
Overall, the advantage of GPUs lies not in the performance of a single chip, but in the completeness of the entire platform and ecosystem. This is also the starting point for the increasingly obvious competitive gap between the two.
Blackwell delivers a significant performance leap, putting greater pressure on TPU v6/v7.
Baker points out that the performance gap between GPUs and TPUs will widen markedly in 2024-2025. Blackwell spans the GB200 and the GB300, and the GB200 alone represents a major architectural leap, moving to a liquid-cooled design that draws 130 kW per rack with unprecedented overall complexity. Mass deployment has only been underway for three or four months, so the platform is still very new.
The next-generation GB300 can be directly inserted into GB200 racks, allowing for faster enterprise expansion. xAI, with its fastest data center construction speed, is considered one of the first customers to fully leverage Blackwell's performance. Baker uses this analogy:
"If the Hopper is described as the most advanced aircraft at the end of World War II, then the TPU v6/v7 is like the F-4 Phantom, which is two generations later. Blackwell, on the other hand, is like the F-35, belonging to a completely different level of performance."
This places TPU v6/v7 and Blackwell in different hardware tiers, and it also highlights that Google's Gemini 3 currently runs on TPU v6/v7, not Blackwell-class hardware. Although Google can train high-quality models like Gemini 3 on TPU v6/v7, the performance gap between the two architectures will become increasingly apparent as the Blackwell series ships at scale.
TPU was once the king of low-cost chips, but GB300 will change that.
Baker stated that the TPU's most crucial advantage in the past was the world's lowest training cost, and Google did indeed use this advantage to squeeze competitors' fundraising and operating room.
However, Baker points out that once the GB300 is deployed at scale, the lowest-cost training platforms on the market will shift to companies running GB300, especially teams like xAI with vertical-integration capabilities and self-built data centers. If OpenAI can overcome its computing bottlenecks and build its own hardware capabilities, it may also join the GB300 camp.
This means that once Google loses its cost leadership, its previous low-price strategy will be difficult to sustain, and control over training costs will shift from TPU dominance to being reset by the GB300.
GPU expansion offers faster collaboration, while TPU integration carries a heavier burden.
The faster large models progress, the greater the demand for large-scale GPU coordination, one of the key reasons GPUs have significantly outpaced TPUs in recent years. Baker points out that GPU clusters, through NVLink and NVIDIA's networking, can scale coordinated training to 200,000 to 300,000 GPUs, letting large models use larger training budgets. xAI's rapidly built mega data centers have also pushed NVIDIA to release optimized solutions earlier, accelerating the evolution of the entire GPU ecosystem.
In contrast, TPU integration is more complex than for GPUs, because Google must integrate the switches and network itself and coordinate the supply chains of Broadcom and TSMC.
GPUs are moving towards a one-year generation, while TPU iterations are constrained by the supply chain.
Baker noted that in response to competitive pressure from ASICs, both NVIDIA and AMD are accelerating their update cycles, moving GPUs toward a one-generation-per-year cadence. This pace is highly advantageous in the era of large models, since model scaling is virtually uninterrupted.
TPU iteration is more constrained. From v1 to v4, and then to v6, each generation took several years to mature. Future v8 and v9 generations will face even greater challenges, because the supply chain spans Google, Broadcom, TSMC, and others, making development and iteration slower than for GPUs. Over the next three years, the GPU advantage in iteration speed will therefore become increasingly apparent.
Three of the four giants are clearly aligning themselves with NVIDIA, while Google is clinging to its TPU.
Currently, the world's four leading model providers are OpenAI, Gemini (Google), Anthropic, and xAI, but the overall alignment is increasingly leaning towards NVIDIA.
Baker stated that Anthropic has signed a $5 billion long-term procurement contract with NVIDIA, officially aligning itself with the GPU camp. xAI is Blackwell's largest early customer and has invested heavily in building GPU data centers. OpenAI, on the other hand, faces excessive cost pressures due to the need to lease computing power from external suppliers, and is therefore hoping to address its long-standing computing power bottleneck through the Stargate project.
Among the four companies, Google is the only one that makes extensive use of TPUs, but it also faces pressure from declining cost competitiveness and the TPU's slower iteration speed. Overall, the computing landscape is a "three against one" situation, with OpenAI, Anthropic, and xAI clustered in the GPU camp and Google relatively isolated in the TPU camp.
This article, titled "US Chip Investment Expert: Google TPU Temporarily Holds the Upper Hand, But NVIDIA GPU Has a Greater Long-Term Advantage," first appeared on ABMedia.
