Huang revealed three generations of GPUs in one go, crushed Moore's Law to build an AI empire, and mass-produced Blackwell to solve ChatGPT's global power consumption problem


Just now, Huang put on another high-profile show for the world: Blackwell, now in mass production, cuts the energy needed to train the 1.8-trillion-parameter GPT-4 to roughly 1/350 of what it was 8 years ago; NVIDIA's astonishing product cadence has blown straight past Moore's Law; and the roadmap for the next three generations after Blackwell was released in one go.

Just now, when Huang held Blackwell and showed it to the world, the audience was excited.

It is the world's largest chip so far!

This product in front of you embodies an astonishing amount of technology.

In Huang's words, it is "the most complex, highest-performance computer the world has ever built."

In 8 years, the energy consumption of training the 1.8-trillion-parameter GPT-4 has been reduced to 1/350, and the energy per inference token to 1/45,000.

The iteration speed of NVIDIA products has completely ignored Moore's Law.

As netizens said, it doesn’t matter, Huang has his own Moore’s Law.

With hardware in one hand and CUDA in the other, Huang confidently took aim at "computational inflation" and made a bold prediction: in the near future, every processing-intensive application will be accelerated, and every data center will be accelerated too.

The roadmap for the next three generations was also released: Blackwell Ultra (2025), Rubin (2026), and Rubin Ultra (2027).

Huang's mathematical formula of "the more you buy, the more you save" also made its appearance again.

A new era of computing begins

At the beginning of his speech, Huang first released a demonstration of the Omniverse simulation world.

He said, "Nvidia is at the intersection of computer graphics simulation and artificial intelligence. This is our soul."

All of this is simulation of the physical world, made possible by two foundational technologies, accelerated computing and artificial intelligence, which together will reshape the computer industry.

The computer industry has a history of more than 60 years, and now a new era of computing has begun.

In 1964, IBM's System/360 introduced the modern CPU, and general-purpose computing separated hardware from software through the operating system. Architectural compatibility, backward compatibility, and all the conventions we know today date from that point.

It was not until 1995 that the PC revolution began, bringing computing into every household and making it more democratic. In 2007, the iPhone was launched, putting a "computer" directly into your pocket and enabling cloud connectivity.

It can be seen that in the past 60 years, we have witnessed 2-3 important technological nodes that have driven the transformation of the computing industry.

Accelerated computing: GPU on one hand, CUDA on the other

And now, we are witnessing history once again. Huang said, "Two fundamental things are happening."

First of all, the performance scaling of processors has slowed dramatically, while the computation we need and the data we must process keep growing exponentially.

In Huang's words, we are experiencing "computational inflation."

Nvidia has been working on accelerated computing for the past 20 years. CUDA, for example, augments the CPU by offloading the kinds of work that a dedicated GPU does far better.

When we run an application, we don't want it to take 100 seconds, let alone 100 hours.

Therefore, NVIDIA pioneered heterogeneous computing, allowing the CPU and GPU to run in parallel, accelerating the processing time from 100 units to just 1 unit.

It can be seen that it has achieved a 100-fold speed increase, while the power consumption has only increased by 3 times and the cost is only 1.5 times the original.
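To make that concrete, here is a minimal sketch of the offload pattern using NumPy on the CPU and CuPy on an NVIDIA GPU; the matrix size, and whatever speedup you measure, are illustrative rather than NVIDIA's benchmark.

```python
# CPU-vs-GPU offload sketch: the same matrix multiply, first on the CPU
# (NumPy), then offloaded to the GPU (CuPy). Assumes an NVIDIA GPU with
# CUDA and the cupy package installed.
import time

import numpy as np
import cupy as cp

a = np.random.rand(4000, 4000).astype(np.float32)

t0 = time.perf_counter()
_ = a @ a                            # dense matmul on the CPU
cpu_s = time.perf_counter() - t0

a_gpu = cp.asarray(a)                # copy the array into GPU memory
t0 = time.perf_counter()
_ = a_gpu @ a_gpu                    # same matmul, offloaded to the GPU
cp.cuda.Stream.null.synchronize()    # wait for the asynchronous kernel
gpu_s = time.perf_counter() - t0

print(f"CPU: {cpu_s:.3f}s  GPU: {gpu_s:.3f}s  speedup: {cpu_s / gpu_s:.0f}x")
```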

Add $500 million worth of GPUs to a billion-dollar data center, and Nvidia turns it into an "AI factory."

With accelerated computing, many companies around the world can save hundreds of millions of dollars in processing data in the cloud. This also confirms Huang's "mathematical formula": the more you buy, the more you save.

Beyond GPUs, NVIDIA has also done something the rest of the industry finds hard to match: rewriting the software so it can exploit the accelerated hardware.

As shown in the figure below, there are dedicated CUDA libraries for domain after domain: deep learning (cuDNN), physics (Modulus), communications (Aerial RAN), gene sequencing (Parabricks), quantum-computing simulation (cuQuantum), data processing (cuDF), and more.

In other words, a world without CUDA would be like computer graphics without OpenGL, or data processing without SQL.

Now, the ecosystem using CUDA is all over the world. Just last week, Google announced that it would add cuDF to Google Cloud and accelerate Pandas, the world's most popular data science library.

Now, you can accelerate Pandas in Colab with a single click, and the data processing speed is incredible.
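Roughly, that one click amounts to the snippet below, a minimal sketch assuming a machine with an NVIDIA GPU and cuDF installed (in a notebook, the single magic line %load_ext cudf.pandas does the same job).

```python
# cuDF's pandas accelerator mode: load it before importing pandas, and the
# familiar pandas API runs on the GPU where supported, falling back to the
# CPU otherwise.
import cudf.pandas
cudf.pandas.install()   # must happen before `import pandas`

import pandas as pd

df = pd.DataFrame({"key": ["a", "b", "c"] * 1_000_000,
                   "value": range(3_000_000)})
print(df.groupby("key")["value"].mean())   # executed by cuDF on the GPU
```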

Huang said that promoting a new platform is a "chicken and egg" dilemma, and both developers and users are indispensable.

But after 20 years of development, CUDA has broken this dilemma and achieved a virtuous circle through 5 million developers around the world and users in countless fields.

The more people install CUDA and the more computations they run, the more they can improve performance and iterate to create more efficient and energy-efficient CUDA.

"AI Factory" Full Stack Reshaping

In 2012, the birth of the neural network AlexNet tied Nvidia to AI for the first time. As we all know, Geoffrey Hinton and his students trained AlexNet on two Nvidia GPUs.

Deep learning was born, and it has expanded the algorithms invented decades ago at an unimaginable speed.

However, as neural network architectures continue to scale and their appetite for data and computing grows, Nvidia has to reinvent everything.

After 2012, NVIDIA introduced the Tensor Core, invented NVLink, and built TensorRT, the Triton inference server, and the DGX supercomputer.

At that time, no one understood what Nvidia did, and no one was willing to pay for it.

As a result, in 2016, Huang personally gave Nvidia's first DGX supercomputer to OpenAI, a "small company" located in San Francisco.

Since then, Nvidia has continued to expand from a single supercomputer to a super-large data center.

Then, with the birth of the Transformer architecture in 2017, ever larger datasets could be used to train LLMs to recognize and learn patterns that unfold over long stretches of data.

Later, Nvidia built a larger supercomputer. In November 2022, ChatGPT, which was trained on tens of thousands of Nvidia GPUs, was born and can interact like humans.

This was the first time the world saw generative AI. It would output one token at a time, which could be an image, voice, text, video, or even a weather token. It was all about generation.

Huang said, "Everything we can learn can now be generated. We have now entered a new era of generative AI."

The computer that was originally a supercomputer has now become a data center. It can output tokens and has transformed into an "AI factory."

And this "AI factory" is creating and producing things of huge value.

In the late 19th century, Nikola Tesla invented the AC generator; now Nvidia is creating an AI generator that outputs tokens.

What NVIDIA brings to the world is that accelerated computing is leading a new round of industrial revolution.

For the first time, a roughly $3 trillion IT industry is creating something that can directly serve the world's $100 trillion of industry.

The transformation from traditional software factories to today's AI factories has achieved upgrades from CPU to GPU, from retrieval to generation, from instructions to large models, and from tools to skills.

It can be seen that generative AI has driven the reshaping of the entire stack.

From Blackwell GPU to Super "AI Factory"

Next, let’s take a look at how NVIDIA turns the most powerful Blackwell chips on the planet into super “AI factories”.

Pay attention, the one below is a mass-produced motherboard equipped with a Blackwell GPU.

The point Huang is indicating is the Grace CPU.

And here we can clearly see two Blackwell chips connected together.

Over the past eight years, the compute (FLOPS) of NVIDIA's chips has increased 1,000-fold.

At the same time, Moore's Law seems to have gradually become ineffective in the past eight years.

Even compared to the best moments of Moore's Law, Blackwell's computing power improvement is amazing.
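A quick, illustrative sanity check on that comparison: a textbook Moore's-Law pace of doubling every two years compounds to only 16x over eight years, far short of the 1,000x above.

```python
# Moore's-Law baseline: performance doubling every two years, over 8 years.
print(2 ** (8 / 2))   # 16.0, versus the ~1,000x NVIDIA is claiming
```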

This will directly lead to a significant reduction in costs.

For example, the energy needed to train a GPT-4 with 1.8 trillion parameters on 8 trillion tokens is cut to roughly 1/350!

On Pascal, that training run would consume about 1,000 gigawatt-hours, which calls for a gigawatt-class data center (1 gigawatt = 1,000 megawatts).

If such a data center actually existed, training GPT-4 would still take a full month.

A 100-megawatt data center would take about a year.

This is why an LLM like ChatGPT would have been impossible to exist eight years ago.

Now with Blackwell, the previous 1,000 GWh can be directly reduced to 3 GWh.

It can be said that Blackwell was born for reasoning and generating tokens. It directly reduces the energy per token by 45,000 times.

In the Pascal era, generating a single token cost as much energy as running two 200-watt light bulbs for two days, and GPT-4 needs about three tokens to produce one word. At that rate, the experience of chatting with GPT-4 as we do today would have been impossible.

Now, we can use only 0.4 joules per token, and we can produce amazing tokens with very little energy.
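Taking the keynote's figures at face value, the arithmetic holds up; a quick back-of-the-envelope script:

```python
# Back-of-the-envelope check on the energy figures quoted above.
pascal_gwh, blackwell_gwh = 1000, 3
print(pascal_gwh / blackwell_gwh)    # ~333x, the "roughly 1/350" claim

# A 1,000 GWh training run on a 100 MW (0.1 GW) data center:
print(pascal_gwh / 0.1 / 24 / 365)   # ~1.1 years, i.e. "about a year"

# 0.4 J/token today, 45,000x better than Pascal, implies back then:
print(0.4 * 45_000)                  # 18,000 J per token
```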

The backdrop to Blackwell's birth is the exponential growth in the scale of models and of the computing they demand, with each new exponent pushing computing into a new stage.

As we expand from DGX to large AI supercomputers, Transformer can be trained on large-scale data sets.

The next generation of AI needs to understand the physical world. However, most AI today does not understand the laws of physics. One solution is to let AI learn from video data, and another is to synthesize data.

The third method is to let computers learn from each other, essentially the same principle behind AlphaGo.

How to solve the huge computing demand? The current solution is that we need a bigger GPU.

And Blackwell was born for this purpose.

There are several important technological innovations in Blackwell.

The first is the size of the chip.

Nvidia connected two of the largest dies it can currently produce with a 10 TB/s chip-to-chip link, placed them on a single compute node, and connected them to a Grace CPU.

The Grace CPU's memory is used for fast checkpointing during training; during inference and generation, it stores context memory.

The second innovation is secure AI: the server can be required to protect AI models from theft or tampering while they are in use.

In addition, Blackwell uses the 5th generation NVLink.

Moreover, it carries a first-generation RAS engine (reliability, availability, serviceability).

With this system, we can test every transistor, flip-flop, on-chip memory, and off-chip memory, so we can determine on the spot if a chip is faulty.

This matters because at the scale of a supercomputer with 100,000 GPUs, the mean time between failures is measured in minutes.

Therefore, if we don’t invent technology to improve the reliability of supercomputers, it will not be possible for them to run for a long time, and it will not be possible to train models that can run for months.

If you improve reliability, you increase model uptime, which obviously has a direct impact on cost.

Finally, Huang said that a data-processing decompression engine is another of the most important things NVIDIA had to build.

With dedicated compression and decompression engines on chip, data can be pulled from storage 20 times faster than today.

Super air-cooled DGX & brand-new liquid-cooled MGX

Blackwell was a significant leap forward, but it wasn't big enough for Huang.

Nvidia not only wants to make chips, but also to manufacture servers equipped with the most advanced chips. With Blackwell's DGX supercomputer, it has achieved a leap in capabilities in all aspects.

The latest DGX with integrated Blackwell draws only 10 times the power of the previous Hopper generation, yet delivers 45 times the FLOPS.

The air-cooled DGX Blackwell below has 8 GPUs.

The matching heat sink is equally striking: the system dissipates a full 15 kW, entirely air-cooled.

If you prefer liquid cooling, Nvidia has a new model, the MGX.

A single MGX integrates 72 Blackwell GPUs and carries the latest fifth-generation NVLink, with 130 TB/s of aggregate bandwidth.

NVLink connects all of these GPUs to one another, so the 72 GPUs of an MGX behave like one giant GPU.

After introducing the chip, Huang turned to NVIDIA's NVLink technology, an important reason NVIDIA's systems can keep growing larger.

As the number of LLM parameters increases and the memory consumption increases, it is almost impossible to fit the model into a single GPU, and a cluster must be built. Among them, GPU communication technology is as important as computing power.

NVIDIA's NVLink is the world's most advanced GPU interconnect, with staggering data transfer rates.

Today's DGX has 72 GPUs where the previous generation had 8: a 9-fold increase in GPU count, an 18-fold increase in bandwidth, and a 45-fold increase in AI FLOPS, while power has grown only 10-fold, to 100 kilowatts.

The NVLink chip below is also a miracle.

People realize its importance because it connects all of these GPUs together, and that is what makes a trillion-parameter LLM runnable.

50 billion transistors, 74 ports at 400 Gb/s each, and 7.2 TB/s of cross-sectional bandwidth: a miracle in itself.

More importantly, NVLink has mathematics built into it: it can perform reductions inside the fabric, which is especially important for deep learning.
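To see what those in-fabric reductions accelerate, here is a minimal sketch of a gradient all-reduce across two GPUs using PyTorch's NCCL backend, which rides on NVLink where it is available; the two-GPU setup and tensor values are illustrative assumptions.

```python
# All-reduce sketch: each GPU holds a "gradient"; after all_reduce(SUM),
# every GPU holds the element-wise sum. NCCL performs the reduction over
# NVLink/NVSwitch when present.
import os

import torch
import torch.distributed as dist
import torch.multiprocessing as mp


def worker(rank: int, world_size: int) -> None:
    os.environ["MASTER_ADDR"] = "127.0.0.1"
    os.environ["MASTER_PORT"] = "29500"
    dist.init_process_group("nccl", rank=rank, world_size=world_size)
    torch.cuda.set_device(rank)

    grad = torch.ones(4, device=f"cuda:{rank}") * (rank + 1)
    dist.all_reduce(grad, op=dist.ReduceOp.SUM)   # summed across all GPUs
    print(f"rank {rank}: {grad.tolist()}")        # both ranks print [3.0, ...]
    dist.destroy_process_group()


if __name__ == "__main__":
    mp.spawn(worker, args=(2,), nprocs=2)         # assumes 2 local GPUs
```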

Interestingly, NVLink technology has greatly broadened our imagination of GPUs.

For example, in the traditional concept, a GPU should look like this.

But with NVLink, GPUs can also become this large.

The spine supporting the 72 GPUs is made up of 5,000 NVLink cables; by driving copper directly instead of using optical transceivers, it saves 20 kW of transmission power that the chips can spend on computation instead.

What Huang is holding in his hand is an NVLink backbone, in his own words an "electromechanical miracle."

Still, NVLink only connects GPU chips to one another within a system, so Huang said again, "This is not grand enough."

To connect different hosts in a supercomputing center, the most advanced technology is InfiniBand.

However, many data centers' infrastructure and ecosystems were built on the Ethernet they have always run, and the cost of starting over is too high.

Therefore, in order to help more data centers smoothly enter the AI era, NVIDIA has developed a series of Ethernet switches compatible with AI supercomputers.

NVIDIA has leveraged its leading position in network-level RDMA, congestion control, adaptive routing, and noise isolation to transform Ethernet into a network suitable for point-to-point communication between GPUs.

This also means that the era of data centers with millions of GPUs is coming.

28 million developers worldwide, instant deployment of LLM

Inside Nvidia's AI factory runs a new kind of software called NIM (Nvidia Inference Microservices), which packages accelerated inference.

Huang said, "What we created is AI in containers."

This container contains a lot of software, including the Triton inference server for inference services, optimized AI models, cloud native stacks, and more.

On stage, Huang once again demonstrated an all-round AI model that can move freely between modalities. With NIM, none of this is a problem.

It can provide a simple, standardized way to add generative AI to applications, greatly improving developer productivity.

Now, 28 million developers around the world can download NIM and host it in their own data centers.

In the future, developers will be able to easily build generative AI applications in minutes instead of weeks.

At the same time, NIM also supports Meta Llama 3-8B, which can generate up to 3 times more tokens on accelerated infrastructure.

This allows companies to generate more responses using the same computing resources.
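As a rough sketch of what calling a self-hosted NIM looks like: NIM containers expose an OpenAI-compatible API, so the standard openai client can talk to them; the local URL, port, and prompt below are illustrative assumptions.

```python
# Chatting with a locally hosted Llama 3-8B NIM through its
# OpenAI-compatible endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1",  # the NIM container
                api_key="not-needed-locally")

resp = client.chat.completions.create(
    model="meta/llama3-8b-instruct",   # the Llama 3-8B NIM mentioned above
    messages=[{"role": "user",
               "content": "Explain in one line what a NIM is."}],
)
print(resp.choices[0].message.content)
```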

Applications of all kinds will be built on NIM, including digital humans, intelligent agents, digital twins, and so on.

Huang said, "NVIDIA NIM is integrated into various platforms, developers can access it anywhere and run it anywhere - it is helping the technology industry make generative AI within reach."

Intelligent agents team up to create a trillion-dollar market

And intelligent agents are the most important applications in the future.

Huang said that almost every industry needs customer service agents, and there is a trillion-dollar market prospect.

As you can see, on top of the NIM containers, some agents handle reasoning, working out the task and breaking it into subtasks, while others retrieve information, search, or even use tools.

All intelligent agents form a team.

In the future, every company will have a large number of NIM agents, which can be connected to form a team to accomplish impossible tasks.
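That division of labor might be sketched like the toy below; every name and function here is hypothetical scaffolding standing in for LLM-backed NIM agents, not an NVIDIA API.

```python
# Toy "team of agents" pattern: a planner decomposes a task, and worker
# agents (each of which could wrap its own NIM endpoint) handle subtasks.
from dataclasses import dataclass
from typing import Callable


@dataclass
class Agent:
    name: str
    run: Callable[[str], str]


def planner(task: str) -> list[str]:
    # A reasoning LLM would do this decomposition in a real system.
    return [f"research: {task}", f"draft: {task}"]


workers = {
    "research": Agent("retriever", lambda q: f"[retrieved facts about {q}]"),
    "draft": Agent("writer", lambda q: f"[drafted reply about {q}]"),
}

task = "customer ticket about a failed payment"
for subtask in planner(task):
    role, _, payload = subtask.partition(": ")
    print(workers[role].run(payload))
```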

A "shell" for GPT-4o? Huang has built it

When it comes to human-computer interaction, Huang and Sam Altman can be said to have the same ideas.

He said that although we can use text or voice prompts to give instructions to AI, in many applications, we still need a more natural and human-like way of interaction.

This points to Huang's vision of digital humans who can be more engaging and empathetic than today's LLMs.

Although GPT-4o has achieved unparalleled human-like interaction, what it lacks is a "body".

This time, Huang has thought it all out for OpenAI.

In the future, brand ambassadors may not necessarily be “real people”; AI can do the job perfectly.

From customer service to advertising, gaming and other industries, the possibilities brought by digital humans will be endless.

CG technology combined with generative AI can render lifelike human faces in real time.

Low-latency digital-human processing is delivered across more than 100 regions around the world.

This is the magic provided by NVIDIA ACE, which provides the corresponding AI tools for creating lifelike digital humans.

Now, Nvidia plans to deploy the ACE PC NIM microservice on 100 million RTX AI PCs and laptops.

This includes NVIDIA’s first small language model, Nemotron-3 4.5B, which is designed to run on-device with similar precision and accuracy to cloud-based LLMs.

In addition, the new ACE Digital Human AI suite also includes NVIDIA Audio2Gesture, which generates body gestures based on audio tracks and is coming soon.

Huang said, "Digital humans will revolutionize every industry. The breakthroughs in multimodal LLM and neurographics provided by ACE bring us closer to the future of intent-driven computing, where interaction with computers will be as natural as interaction with humans."

Preview of the next generation chip Rubin

The launch of the Hopper and Blackwell series marks that NVIDIA has gradually built a complete AI supercomputing technology stack, including CPU, GPU chips, NVLink GPU communication technology, and server networks consisting of NICs and switches.

If you wanted, you could have an entire data center running on Nvidia technology.

This is big enough and full-stack enough, right? But Huang said that we need to speed up our iterations to keep up with the update speed of GenAI.

NVIDIA recently announced that it will adjust the iteration rate of GPU from once every two years to once a year, in order to push the boundaries of all technologies at the fastest speed.

In today's speech, Huang confirmed once more that GPUs will be upgraded annually, though he hinted that he might come to regret saying so.

In any case, we now know that Nvidia will launch Blackwell Ultra soon, and the next-generation Rubin series next year.

From Twin Earth to Embodied AI Robots

In addition to chips and supercomputing servers, Huang also released a project that no one expected - the digital twin earth "Earth-2".

This is perhaps one of the most ambitious projects in the world.

And judging from Huang's tone, Earth-2 has been in the works for several years, and this year's major breakthrough made him feel it was finally time to show it off.

Why build a digital twin of the entire earth? To move social life onto an online platform, like Zuckerberg's metaverse?

No, Huang's vision is grander.

He hopes that the simulation of Earth-2 can predict the future of the entire planet, thereby helping us better cope with climate change and various extreme weather conditions, such as predicting the landing point of a typhoon.

Earth-2 incorporates CorrDiff, a generative AI model trained on WRF numerical simulations, which generates weather models at 12 times higher resolution, from 25-kilometer grids down to 2 kilometers.

Not only is the resolution higher, but it also runs 1,000 times faster and 3,000 times more energy efficient than physical simulations, so it can run continuously on a server and make real-time predictions.

Moreover, the next step for Earth-2 is to increase the prediction accuracy from 2 kilometers to tens of meters, while taking into account the infrastructure within the city, and even predicting when strong winds will blow on the streets.
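A quick arithmetic note on those resolution claims: 25 km to 2 km is 12.5x finer along each axis, which means roughly 156 times as many grid cells over the same area.

```python
# Resolution arithmetic behind the CorrDiff claim (illustrative only).
coarse_km, fine_km = 25, 2
print(coarse_km / fine_km)          # 12.5x finer per axis (the "~12x" figure)
print((coarse_km / fine_km) ** 2)   # ~156x more grid cells per unit area
```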

Moreover, Nvidia wants to create a digital twin not only of the Earth, but also of the entire physical world.

In this era of rapid development of AI, Huang boldly predicted the next wave - physical AI, or embodied AI.

They not only need to have super high cognitive abilities to understand humans and the physical world, but also have extreme mobility to complete various real-world tasks.

Imagine this cyberpunk future: a group of robots come together, communicate and collaborate like humans, and create more robots in factories.

And it’s not just robots. Everything that moves will be autonomous!

Driven by multimodal AI, they can learn and perceive the world, understand human instructions, and evolve planning, navigation, and movement skills to complete various complex tasks.

So how do you train these robots? Letting them run wild in the real world would be much more expensive than training an LLM.

This is where the digital twin world comes in handy.

Just as LLMs can be aligned with human values through RLHF, robots can endlessly trial-and-error, learn, and imitate human behavior inside a digital twin world that obeys the laws of physics, ultimately reaching general intelligence.

Nvidia's Omniverse can be used as a platform for building digital twins, integrating Gen AI models, physical simulation, and dynamic real-time rendering technology to become a "robot gym."
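As a toy illustration of that "robot gym" loop, here is a minimal trial-and-error rollout in a physics simulator; Gymnasium's cart-pole environment stands in for a full Omniverse/Isaac simulation, purely as an illustrative substitution.

```python
# Trial-and-error in simulation: act, observe, fall over, reset, repeat.
import gymnasium as gym

env = gym.make("CartPole-v1")   # a stand-in for a physics-accurate robot sim
obs, info = env.reset(seed=0)

for step in range(500):
    action = env.action_space.sample()   # a trained policy would act here
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:          # the "robot" failed; try again
        obs, info = env.reset()
env.close()
```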

Nvidia, which aims to be a full-stack company, is not content with the "operating system" alone: it will also provide supercomputers for training models, along with Jetson Thor and Orin for running them.

In order to adapt to robotic systems in different application scenarios, NVIDIA's Omniverse will gradually expand into a Warehouse ecosystem.

This ecosystem will be all-encompassing, from SDKs and APIs that work with applications, to interfaces for running edge AI computing, to the most basic customizable chips.

On the full-stack front, NVIDIA simply wants to ship the entire suite itself and leave others no opening.

To make this era of AI robots feel all the more real, nine robots appeared on stage alongside Huang at the end of the demonstration.

As Huang said, "This is not the future, this is all happening now."

This article comes from the WeChat public account "New Intelligence" (ID: AI_era), author: New Intelligence, and is authorized to be published by 36Kr.
