Musk is spending billions to build the world's largest supercomputing center: 100,000 H100s to train Grok and catch up with GPT-4o

36kr · 05-27

[Introduction] Musk, who had been quiet for a while, recently dropped big news: his artificial intelligence startup xAI will invest heavily in a supercomputing center to secure the training of Grok 2 and later versions. The "supercomputing factory" is expected to be completed in the fall of 2025 and will be four times the size of the current largest GPU cluster.

Not long ago, OpenAI, Google, and Microsoft held their events one after another, and competition in the AI world was in full swing.

How could Musk sit out such a lively moment?

He had been busy with Tesla and Starlink, but he seems to have freed up his hands recently and made a big announcement: he will build the world's largest supercomputing center.

In March this year, xAI released Grok 1.5. Rumors have circulated ever since that Grok 2 would arrive soon, but there has been no official word.

Is it because of insufficient computing power?

Yes. Even a billionaire may not be able to buy enough chips. In April this year, Musk himself said that a shortage of advanced chips was delaying the training and release of Grok 2.

He said training Grok 2 requires about 20,000 Hopper-architecture Nvidia H100 GPUs, and added that Grok 3 and later models will need 100,000 H100s.

Tesla's first-quarter earnings also showed the company constrained by computing power. At the time, Musk's plan was to deploy 85,000 H100 GPUs by the end of the year and to spend most of the $6 billion xAI raised from Sequoia Capital and other investors on chips.

Each H100 currently sells for about $30,000. Not counting construction costs and other server hardware, the chips alone would run roughly $2.8 billion.
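As a rough back-of-envelope sketch (the per-chip price is the article's figure; the two chip counts are the totals Musk has floated; actual spend depends on negotiated pricing):

```python
# Back-of-envelope chip cost. The ~$30,000 unit price is the article's figure;
# the chip counts are the totals Musk has mentioned (85,000 and 100,000).
# The article's $2.8B estimate sits between the two results below.
PRICE_PER_H100 = 30_000  # USD

for gpus in (85_000, 100_000):
    cost_billions = gpus * PRICE_PER_H100 / 1e9
    print(f"{gpus:,} H100s x ${PRICE_PER_H100:,} ~= ${cost_billions:.2f}B")
```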

According to Musk's estimate, this chip reserve is more than enough to train Grok 2.

But perhaps after mulling it over for a month, Musk decided this step was neither big nor bold enough. After all, xAI is positioned to compete head-on with heavyweights such as OpenAI and Google, and it cannot afford to be held back by computing power when training future models.

As a result, he recently publicly stated that xAI needs to deploy 100,000 H100s to train and run the next version of Grok.

Moreover, xAI plans to connect all of these chips into one enormous computer, which Musk calls the "Gigafactory of Compute".

Musk told investors this month that he hopes to have the supercomputer running by the fall of 2025 and that he will be "personally responsible for delivering the supercomputer on time," because it is critical to developing large language models.

The supercomputer may be built jointly by xAI and Oracle. In recent years, xAI has rented servers containing about 16,000 H100 chips from Oracle and has been the largest source of orders for those chips.

If xAI does not build out its own computing power, it is likely to spend $10 billion on cloud servers over the next few years; in the long run, a "supercomputing factory" of its own works out cheaper, as the rough comparison sketched below suggests.
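A minimal rent-versus-buy sketch of that claim. Only the ~$30,000 chip price comes from the article; the cloud hourly rate and time horizon are illustrative assumptions, and the buy figure covers chips only:

```python
# Rent-vs-buy sketch for 100,000 H100s. The cloud rate and horizon are
# assumptions for illustration; only the ~$30,000 purchase price is from
# the article, and the buy figure excludes the facility, power, and networking.
GPUS = 100_000
CLOUD_RATE = 2.5            # assumed $/GPU-hour for rented H100 capacity
HOURS_PER_YEAR = 24 * 365
YEARS = 3

rent = GPUS * CLOUD_RATE * HOURS_PER_YEAR * YEARS
buy = GPUS * 30_000

print(f"Rent for {YEARS} years: ${rent / 1e9:.1f}B")
print(f"Buy the chips:       ${buy / 1e9:.1f}B")
```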

The largest GPU cluster to date

Once completed, this "supercomputing factory" will be at least four times the size of the current largest GPU cluster.

For example, data published on Meta's website in March showed that it had brought up two clusters of 24,000 H100 GPUs each for Llama 3 training.

Although Nvidia has announced that the B100, built on the new Blackwell architecture, will enter production and begin shipping in the second half of this year, Musk's current plan is still to buy H100s.

Why not wait for the newest chips instead of buying huge volumes of a model about to be superseded? Nvidia CEO Jensen Huang has explained why: "In today's AI competition, time is very important."

Nvidia will roll out a new generation of products every year. If you wait for my next product, you lose training time and first-mover advantage.

The next company to hit a milestone will announce a breakthrough AI, while the next closest competitor will only improve on it by 0.3%. Which one will you choose?

This is why it is important to always be a technology leader. Your customers will build on you and trust that you will always be ahead. Time is important here.

That's why my customers are still building Hopper systems like crazy. Timing is everything. The next milestone is coming soon.

However, even if everything goes smoothly and the "supercomputing factory" is delivered on time under Musk's "personal responsibility," it is far from certain that the cluster will still hold a scale advantage by next fall.

Zuckerberg posted on Instagram in January that Meta would deploy another 350,000 H100s by the end of this year, which, combined with its existing hardware, would amount to the equivalent of 600,000 H100s in total; he did not say how many chips would sit in any single cluster.

And that number nearly doubled in less than half a year: before the release of Llama 3 in early May, reports said Meta had bought an additional 500,000 GPUs from Nvidia, bringing its total to 1 million, with a retail value of $30 billion.
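A quick consistency check on those reported figures, using only the numbers quoted above:

```python
# The reported totals are mutually consistent: $30B of retail value spread
# over 1,000,000 GPUs implies roughly the same ~$30,000-per-H100 price
# cited earlier in the article.
TOTAL_GPUS = 1_000_000
RETAIL_VALUE = 30e9  # USD, as reported

print(f"Implied average price per GPU: ${RETAIL_VALUE / TOTAL_GPUS:,.0f}")
```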

Meanwhile, Microsoft aims to have 1.8 million GPUs by the end of the year, and OpenAI is even more aggressive, hoping to use 10 million GPUs for its latest AI models. The two companies are also discussing the development of a $100 billion supercomputer containing millions of Nvidia GPUs.

Who will win in this battle of computing power?

It should be Nvidia.

And it is not just the H100. Nvidia CFO Colette Kress has mentioned a list of priority customers for the flagship Blackwell chip, including OpenAI, Amazon, Google, xAI, and more.

The soon-to-ship B100, along with the chips Nvidia will refresh every year after it, will keep flowing into the tech giants' supercomputing centers, helping them upgrade and iterate on their computing power.

Chips are in short supply, and so is electricity

When discussing Tesla's computing power problem, Musk also added that while chip shortages have been the main constraint on AI development so far, electricity supply will become crucial over the next year or two and may even overtake chips as the biggest limiting factor.

The most important factor to consider when choosing a location for the new “supercomputing factory” is power supply. A data center with 100,000 GPUs may require 100 megawatts of dedicated power.
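Where that figure of roughly 100 megawatts comes from can be sketched as follows; the per-GPU draw and overhead factors are assumptions for illustration, not figures from the article:

```python
# Rough power budget for a 100,000-GPU data center. Per-GPU draw, server
# overhead, and PUE are illustrative assumptions; only the ~100 MW total
# is the article's figure.
GPUS = 100_000
WATTS_PER_H100 = 700     # assumed board power for an SXM H100
SERVER_OVERHEAD = 1.15   # assumed CPUs, NICs, and storage per node
PUE = 1.25               # assumed cooling and facility overhead

total_mw = GPUS * WATTS_PER_H100 * SERVER_OVERHEAD * PUE / 1e6
print(f"Estimated facility load: ~{total_mw:.0f} MW")
```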

To provide this amount of power, the San Francisco Bay Area, where xAI's headquarters is located, is obviously not an ideal choice. In order to reduce costs, data centers are often built in remote areas where electricity is cheaper and more abundant.

For example, in addition to planning the $100 billion supercomputer, Microsoft and OpenAI are building a large data center in Wisconsin at a construction cost of roughly $10 billion, and Amazon Web Services has sited a data center in Arizona.

A very likely location for the "supercomputing factory" is Tesla's headquarters in Austin, Texas.

Dojo, the supercomputer Tesla announced last year, is deployed there. Built on custom chips, it is used to train Tesla's self-driving AI software and can also be offered to outside customers as a cloud service.

The first Dojo runs on 10,000 GPUs and cost about $300 million to build. Musk said in April that Tesla currently has a total of 35,000 GPUs used to train its self-driving system.

Training models in data centers is an extremely power-hungry process. Training GPT-3 is estimated to have consumed 1,287 megawatt-hours of electricity, roughly what 130 American households use in a year.
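The household comparison checks out under a typical-consumption assumption (the ~10 MWh per household per year below is an assumption, roughly in line with US averages):

```python
# Cross-check of the household comparison: 1,287 MWh for GPT-3 training
# divided by an assumed ~10 MWh of annual usage per US household lands
# near the 130-household figure quoted above.
GPT3_TRAINING_MWH = 1_287
HOUSEHOLD_MWH_PER_YEAR = 10  # assumed typical annual US household usage

print(f"~{GPT3_TRAINING_MWH / HOUSEHOLD_MWH_PER_YEAR:.0f} household-years of electricity")
```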

Musk is not the only CEO to notice the AI power problem. Sam Altman himself has invested $375 million in the startup Helion Energy, which aims to use nuclear fusion to provide a more environmentally friendly and lower-cost way to operate AI data centers.

Musk is not betting on nuclear fusion. He believes AI companies will soon be competing for step-down transformers, which take high-voltage power off the grid and step it down to the low voltages that computing hardware can use. "It is a huge drop to get from utility grid voltage (for example, 300 kilovolts) down to less than 1 volt."

After chips, the AI industry will need "transformers for Transformers."

References:

https://www.theinformation.com/articles/musk-plans-xai-supercomputer-dubbed-gigafactory-of-compute?rc=epv9gi

https://www.inc.com/ben-sherry/elon-musk-touts-nvidia-dominance-predicts-a-giant-leap-in-ai-power.html

https://finance.yahoo.com/news/jensen-huang-elon-musk-openai-182851783.html?guccounter=1

This article comes from the WeChat public account "New Intelligence" (ID: AI_era), edited by Qiao Yang and Hao Kun, and is published by 36Kr with authorization.
