OpenAI, in collaboration with NVIDIA, AMD, and Microsoft, has launched the "MRC Network Protocol," aimed at eliminating the network bottleneck in training runs that span hundreds of thousands of GPUs.


In the arms race to build frontier AI models, the compute bottleneck often lies not in the GPU itself, but in keeping tens of thousands of GPUs exchanging data in lockstep.

On May 5, 2026, OpenAI released a groundbreaking infrastructure update that shook the tech world: they collaborated with chip and cloud giants such as AMD, Broadcom, Intel, Microsoft, and NVIDIA to successfully develop a network protocol called "MRC (Multipath Reliable Connection)," and have open-sourced the specifications to the entire industry through the Open Computing Project (OCP).

The fatal flaw in large-model training: a single stalled packet can bring the whole network to a standstill

In its announcement, OpenAI pointed out that frontier-model training depends on extremely fast, reliable data transfer between GPUs. In traditional network architectures, a single delayed packet or failed device can stall the entire synchronous training step, leaving expensive GPUs idle. Historically, a single connection failure often meant interrupted training, forced restarts, or long waits for routes to reconverge, at enormous cost.

To address a problem that only grows worse as clusters scale (such as the rumored Stargate supercomputer), OpenAI decided to redesign the network layer from the ground up.

MRC's three core design innovations

The MRC protocol achieves ultra-low latency and high fault tolerance through three fundamental changes to the underlying architecture:

  • Multi-plane network topology: This involves breaking down a network interface with speeds up to 800Gb/s into multiple smaller connections (e.g., eight 100Gb/s connections) and connecting them to different switches to form parallel "planes". This allows the system to connect more than 100,000 GPUs with only 2 layers of switches (traditional architectures require 3-4 layers), significantly reducing deployment costs, power consumption, and component count.
  • Adaptive packet spraying: Unlike traditional single-path transmission, which risks congesting a single route, MRC sprays a flow's packets across hundreds of paths. Built-in dynamic load balancing automatically shifts traffic away from congested paths; if a switch is overloaded, "packet trimming" kicks in, forwarding only the packet header so the receiver can trigger fast retransmission, which cuts down on spurious loss detection.
  • Static source routing (SRv6) replaces dynamic routing: MRC boldly abandons the traditional BGP dynamic routing protocol and instead lets the sender embed the complete path directly in each packet. Switches simply follow static forwarding tables, eliminating the failure modes of dynamic route convergence. When a fault occurs, MRC can steer around the bad path in microseconds, leaving training jobs virtually unaffected.
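A quick back-of-envelope check makes the multi-plane claim plausible. Assuming commodity 51.2 Tb/s switch ASICs exposing 512 x 100G ports (an assumption, since OpenAI has not named its switch hardware) and each 800 Gb/s NIC split into eight 100G lanes, one per plane:

```python
# Back-of-envelope check of the multi-plane scaling claim, under assumed
# numbers: 51.2 Tb/s switches exposing 512 x 100G ports (assumption), and
# each GPU's 800G NIC split into eight 100G lanes, one per plane.
SWITCH_RADIX = 512   # 100G ports per switch (assumed)
PLANES = 8           # eight 100G lanes per 800G NIC, as in the article

# Per plane, a 2-tier leaf/spine fabric: each leaf splits its ports half
# down (to GPUs) and half up (to spines), and one spine tier can stitch
# together up to SWITCH_RADIX leaves, giving radix^2 / 2 endpoints.
gpus_per_plane = SWITCH_RADIX ** 2 // 2
print(gpus_per_plane)   # 131072, comfortably over 100,000 GPUs in 2 tiers
```

Since every GPU appears once in each of the eight planes, per-plane endpoint capacity is the cluster size, which is how a 2-tier fabric can reach the article's 100,000-GPU figure without a third switching layer.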
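The spraying-plus-trimming idea can be sketched in a few lines. Everything here is invented for illustration (the path-selection policy, the congestion model, and all names); the real MRC wire format and algorithms are not public.

```python
# Hypothetical sketch of adaptive packet spraying with packet trimming.
# Policy, congestion model, and names are illustrative only.
NUM_PATHS = 8  # e.g., one path per network plane

def spray(packets, path_congested):
    """Spread a flow's packets round-robin across NUM_PATHS paths.

    On a congested path the switch "trims" the packet: only the header
    arrives, letting the receiver request fast retransmission instead
    of waiting for a timeout.
    """
    delivered, trimmed = [], []
    for seq, _payload in enumerate(packets):
        path = seq % NUM_PATHS          # naive spraying policy
        if path_congested[path]:
            trimmed.append(seq)         # header-only delivery -> fast NACK
        else:
            delivered.append(seq)
    for seq in trimmed:                 # immediate respray on healthy paths
        delivered.append(seq)
    return sorted(delivered)

# All 16 packets arrive even though path 1 is congested.
print(spray(["chunk"] * 16, [False, True] + [False] * 6))
```

The key property the sketch shows: congestion on one path costs a fast retransmit of a few packets, not a stall of the whole flow.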
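The source-routing idea can also be sketched: the sender embeds the full hop list in the packet, so switches hold no dynamic routing state and failover is just the sender choosing a different precomputed path. The classes and device names below are hypothetical, not the actual MRC packet format.

```python
# Hypothetical sketch of static source routing: the sender embeds the
# complete path; switches only index into it. Names are illustrative.
from dataclasses import dataclass

@dataclass
class Packet:
    payload: bytes
    path: list      # ordered output-port numbers, chosen entirely by sender
    hop: int = 0    # index of the next hop to take

class Switch:
    def __init__(self, ports):
        self.ports = ports  # port number -> next Switch, or endpoint name

    def forward(self, pkt):
        # No route lookup and no BGP: just follow the embedded path.
        out = pkt.path[pkt.hop]
        pkt.hop += 1
        return self.ports[out]

def deliver(pkt, node):
    while isinstance(node, Switch):
        node = node.forward(pkt)
    return node

# Two disjoint precomputed paths to gpu-b; if leaf_a fails, the sender
# simply re-emits on the leaf_b path -- nothing has to reconverge.
spine = Switch({0: "gpu-b"})
leaf_a, leaf_b = Switch({1: spine}), Switch({1: spine})
print(deliver(Packet(b"grad", [1, 0]), leaf_a))   # gpu-b
print(deliver(Packet(b"grad", [1, 0]), leaf_b))   # gpu-b
```

This is what makes microsecond failover possible in principle: the fallback path already exists at the sender, so no routing protocol has to recompute anything.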

Deployed on the world's largest GB200 supercomputer

This technology is not just theoretical. OpenAI has confirmed that MRC is now fully deployed on all of its largest NVIDIA GB200 supercomputers, including the site built in partnership with Oracle Cloud in Abilene, Texas, and Microsoft's Fairwater supercomputer, and is already being used to train multiple next-generation frontier models. OpenAI emphasizes:

"In a production environment, even if multiple connections jitter every minute, or if the first-layer switch needs to be restarted, the training operation is almost unaffected and no longer requires special coordination of maintenance time."


