According to Zhidongxi on April 13, more than a month after the release of GPT-4.5, the most expensive model in OpenAI's history, OpenAI co-founder and CEO Sam Altman held a 45-minute, information-dense conversation with three core technical members of the GPT-4.5 team, disclosing for the first time many previously unknown details about the model, such as the seriously overdue development schedule, frequent computing-cluster failures, and unpredictable paths to improvement.
The GPT-4.5 project was launched two years ago. It is OpenAI's most thorough plan to date, involving the collaboration of hundreds of people; Altman said that OpenAI was almost "all-hands-on-deck" for this project.
During development, the OpenAI team ran into many "catastrophic problems". The 100,000-card cluster exposed hidden, low-probability, deep-seated infrastructure faults, and to balance timeliness and performance, OpenAI's system team had to "repair while training". Among them was a hidden small bug that caused the cluster to report errors frequently; it was not found until the training was roughly 40% complete.
However, this also helped OpenAI build a far more capable technology stack: today, a GPT-4-level model can be replicated by a team of only 5-10 people. The performance improvement from GPT-4 to GPT-4.5 was roughly 10x, and it delivered "intelligence that is enhanced across the board in ways that are difficult to quantify", which surprised even OpenAI's own employees.
The OpenAI team has realized that achieving the next 10-fold or even 100-fold performance improvement no longer hinges on computing power; the key is data efficiency, that is, developing methods that use more computing power to learn more knowledge from the same amount of data.
At the same time, the system is moving from a single-cluster to a multi-cluster architecture. Future training may involve collaborative learning on the scale of 10 million GPUs, which will require much better fault tolerance.
During the conversation, the OpenAI employees also discussed the relationship between the long-tail effect of data and the Scaling Law, the advantages of deep co-design between the machine learning and systems teams, the essence of unsupervised learning, and a troubleshooting culture of "never letting go of any anomaly", giving a full picture of OpenAI's thinking and gains during the development of GPT-4.5.
In addition to Altman, the three OpenAI employees who participated in this conversation were Alex Paino (responsible for the pre-training machine learning algorithm of GPT-4.5), Amin Tootoonchian (OpenAI Chief System Architect) and Daniel Selsam (researching data efficiency and algorithms).
The following is Zhidongxi's complete compilation of the video of Altman's conversation with the OpenAI GPT-4.5 team (to improve readability, Zhidongxi made certain additions, deletions, and edits without altering the original intent):
01.
The GPT-4.5 project was launched two years ago.
The project took much longer than expected
Sam Altman: What does it take to build such a large model (GPT-4.5)?
Alex Paino: We started this project about two years ago. At the time, OpenAI was about to bring a new large computing cluster online, and our team saw this as an opportunity, doing a great deal of work to determine what capabilities the model needed to include and running a large number of de-risking tests.
We developed a long plan for this, covering the entire technology stack from systems to machine learning. Reducing risk and preparing for training was a long execution process, and the training run itself was also a very large project.
Amin Tootoonchian: I think this process requires close collaboration between the machine learning team and the system team from the beginning until we know what model we want to train and then start training.
We have made predictions in both machine learning and systems to try to minimize the gap between expectations and reality, but because we work at a fast pace and use the latest computing resources, model training is difficult to plan perfectly in advance.
We almost always start training with many unsolved problems and try to overcome challenges and make progress as we go. The main solution is to add more computing resources.
The final stage is execution, which requires a lot of energy and motivation from many people over a long period of time to complete the training process.
Sam Altman: How big of a gap do you think there is between our expectations and reality?
Amin Tootoonchian: On the system side, we are usually far from the desired state at the beginning. We are always faced with a choice: whether to delay the launch and wait for the problems to be solved, or to launch early and solve the problems along the way. It is always a trade-off to avoid unreasonably delaying the process.
But there are almost always unexpected problems, and what we have to do is handle those points as well as possible, deal with the unknowns, and lay out a plan for the model training.
Alex Paino: In this project, our goal was to make GPT-4.5, which means making it 10 times smarter than GPT-4. That was the initial goal we set about two years ago.
There were a lot of things going on in that process, and we kept asking ourselves: can we do better than that, or will we fall short of expectations? It was a very complicated process, but in the end, in terms of the effective compute we put in, we got a model that we think is 10 times smarter than GPT-4.
Amin Tootoonchian: In terms of execution, the GPT-4.5 project took far more time than we initially expected.
02.
Training a GPT-4-level model today
takes only 5 to 10 people
Sam Altman: Why did the cluster encounter so many problems when it expanded from 10,000 cards to 100,000 cards?
Amin Tootoonchian: I think that if the system developers are sharp enough, most of the problems can be observed at a small scale.
Some problems are not unique to large-scale training; they are common in nature, but they become catastrophic once the scale increases, especially when the team has not anticipated that they would deteriorate to that extent.
Sam Altman: What were some of the things that had disastrous consequences?
Amin Tootoonchian: I think the infrastructure issues are well known: the failure rate, the variety of failure types, and the total volume of failures are all high. The 100,000-card cluster is a huge sample pool, so we also uncovered problems that the computing power provider itself had never observed.
The network is one part of it, and individual accelerators can have problems. But that's the beauty of this system - almost all of the components need to work as expected to produce the expected results. Our job is to minimize these problems.
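To get an intuition for why sheer scale surfaces faults that providers never see, here is a rough back-of-the-envelope calculation; the per-device failure rate is an assumed number chosen only for illustration, not a figure quoted in the conversation.
```latex
% Assumed per-accelerator failure rate: once per 10^5 device-hours (illustrative only)
\lambda_{\text{device}} = 10^{-5}\ \text{failures/hour}
\quad\Rightarrow\quad
\lambda_{\text{cluster}} = 10^{5}\ \text{devices} \times 10^{-5}\ \text{failures/hour}
= 1\ \text{failure per hour}
```
In other words, a fault mode that is essentially invisible on any single machine becomes a routine, hourly event across a 100,000-accelerator fleet.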
Sam Altman: It is indeed difficult to work at the limits of cluster size, but I also noticed that it became much easier to do things that are no longer on the cutting edge of technology. Training GPT-4.5 required hundreds of people, and almost all of OpenAI's employees were involved.
But if you were to pick a minimal team from OpenAI today and retrain GPT-4 from scratch using everything we know and all the systems work, how many people would it take?
Alex Paino: I think it would take about 5 to 10 people to make a GPT-4 level model now. In the process of completing GPT-4.5, the technology stack has been greatly improved.
We actually did something similar while training GPT-4.5: we trained GPT-4o, a GPT-4-level model, retraining it using a lot of the same material from the GPT-4.5 research program. Far fewer people were needed for that training run.
03.
Data efficiency is the key to breakthroughs in large models.
New generation hardware brings many challenges
Sam Altman: What's your perspective, Dan? Why is it hard to train large models?
Daniel Selsam: I think doing anything new is hard. I think even just finding out that someone else has done something makes it a lot easier because the hardest part is having the conviction to do something in the first place. I think just knowing that something is possible is a really strong cheat code that makes things a lot easier.
Alex Paino: Every time we scale GPT pre-training up by another 10x, we discover interesting new things that you wouldn't necessarily expect.
Sam Altman: What will it take to achieve the next 10x or 100x increase in pre-training scale?
Daniel Selsam: Data efficiency. The Transformer architecture (GPT) is very efficient in utilizing data. It absorbs and compresses information very well and generalizes it. Its biggest feature is that it can absorb information efficiently using computing resources.
However, the depth of insight it can gain from data is limited. When computing power grows rapidly, but data grows relatively slowly, data becomes a bottleneck for this standard model. This requires algorithmic innovation to develop methods that can use more computing power to learn more knowledge from the same amount of data.
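One common way to make this bottleneck concrete is the parametric loss fit used in the Chinchilla scaling-law literature; applying this framing to GPT-4.5 is our illustration rather than something stated in the conversation, and the constants are fitted per setup.
```latex
L(N, D) \;\approx\; E \;+\; \frac{A}{N^{\alpha}} \;+\; \frac{B}{D^{\beta}}
```
Here N is the parameter count and D the number of training tokens. More compute lets N grow and shrinks the A/N^alpha term, but if D is effectively fixed, the B/D^beta term becomes an irreducible floor; in these terms, data-efficiency research is whatever lowers that floor by extracting more from the same D.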
Sam Altman: What else do you think we need to keep expanding?
Amin Tootoonchian: My answer is about the system. I think the huge amount of work required for GPT-4.5 is essentially a necessary consequence of the model specifications. We cannot train GPT-4.5 with exactly the same technical architecture as GPT-4.
In terms of state management, since the required computing resources have exceeded the capacity of a single cluster, we have to turn to a multi-cluster training architecture. To achieve this goal, we must integrate multiple different workflows in a short period of time.
Although this has helped us achieve a breakthrough, to achieve the next order of magnitude of performance improvement, we still need to solve several known but temporarily shelved technical problems - these problems cannot be avoided. It is precisely this kind of technical trade-off that continues to extend the development cycle of the perfect system, and we are always making strategic choices in the pursuit of the best implementation plan.
It should be clear that the system itself is not the ultimate goal, and its actual output value is the core consideration. As far as the next 10x performance improvement is concerned, I think a breakthrough in fault tolerance is crucial. We need to build a fault tolerance mechanism that deeply collaborates with the workload to significantly reduce operational anxiety. The operational complexity of current ultra-large-scale systems is fundamentally different from previous systems.
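As a minimal sketch of the kind of workload-aware fault tolerance being described (hypothetical structure in a PyTorch-style setup, not OpenAI's actual mechanism): checkpoint frequently, detect the failure, and roll back to the last good state instead of restarting the run.
```python
import torch

def train_with_recovery(model, optimizer, data_loader,
                        ckpt_path="ckpt.pt", ckpt_every=500):
    """Toy fault-tolerant loop (illustrative only): checkpoint periodically,
    and on a runtime failure reload the last good state instead of restarting
    the whole run. Real multi-cluster systems also detect dead hosts and
    re-route their work."""
    step = 0
    try:  # resume from an existing checkpoint, if any
        state = torch.load(ckpt_path)
        model.load_state_dict(state["model"])
        optimizer.load_state_dict(state["optimizer"])
        step = state["step"]
    except FileNotFoundError:
        pass

    for batch in data_loader:
        try:
            loss = model(batch).mean()      # stand-in for the real loss computation
            loss.backward()
            optimizer.step()
            optimizer.zero_grad()
            step += 1
            if step % ckpt_every == 0:
                torch.save({"model": model.state_dict(),
                            "optimizer": optimizer.state_dict(),
                            "step": step}, ckpt_path)
        except RuntimeError as err:         # e.g. a failed collective or a bad device
            print(f"step {step}: failure ({err}), rolling back to last checkpoint")
            state = torch.load(ckpt_path)
            model.load_state_dict(state["model"])
            optimizer.load_state_dict(state["optimizer"])
            step = state["step"]
```
The point of "co-design with the workload" is that recovery like this is cheap only if checkpointing, failure detection, and the training loop are built together rather than bolted on afterwards.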
Sam Altman: Do you know what fraction of the failures during GPT-4.5 training were caused by specific components?
Amin Tootoonchian: I don't have specific numbers to share, but generally speaking, in the early stages of deploying a new generation of hardware, the system operation often faces many technical challenges that are not fully understood. We choose to move forward with the project when the problems are not fully understood, which leads to a high failure rate in the early stages.
But experience shows that as root causes are identified and addressed, failure rates drop dramatically. This phenomenon essentially reflects a deepening of our understanding of infrastructure—some call it infrastructure cleanup or understanding the fundamentals of infrastructure.
The early stages of execution are almost always quite painful , and as we progress through the project we continue to discover and address new failure modes, but eventually the failure rate decreases and the uptime increases.
This is essentially a question of priority trade-offs: in the early stages of an infrastructure's life cycle, the risk of failure is hard to estimate accurately, while an excessive pursuit of the ultimate ideal state (an idealized "city-state" design, as it was originally phrased) can leave the system with extremely poor availability early on.
04.
Computing resources are no longer the main bottleneck.
The algorithm has not yet reached the theoretical upper limit
Sam Altman: Although inference models are a key component of our future technology stack, let's focus on the development boundaries of traditional pre-trained models for now. Assuming we have unlimited GPU computing power, unlimited network bandwidth, and unlimited electricity supply, we are still limited by the current technical bottlenecks - including system reliability issues, the lack of fault-tolerant training methods, and the limitations of existing data sets.
According to the evolution law that each major GPT version number achieves a 100-fold scale increase, based on the current technical boundaries, what level can the development of pre-trained models reach? Specifically for the GPT series of models, with our existing knowledge system, what kind of model can we train in theory? Can we make GPT-5.5?
Alex Paino: From the perspective of machine learning and algorithm development, we have not yet reached a clear theoretical ceiling. In fact, we are just beginning to explore algorithms that are more data efficient and how to make better use of existing data resources. This situation is very interesting - even models like GPT-4 are still largely developed under the conditions of limited computing resources, which also determines the direction of most previous research.
But the situation is completely different now. Since GPT-4.5, data rather than computation is becoming the main constraint in some key dimensions. This shift makes related research less exciting.
Sam Altman: But this is an amazing development that the world may not have fully realized yet: computing resources are no longer the main bottleneck for the best models we can build. This is a very significant shift, after all, we have lived in a computationally constrained environment for too long.
05.
The overall performance improvement of the model is predictable.
The path of intelligence improvement is difficult to predict
Sam Altman: What are the most interesting machine learning lessons we learned while training GPT-4.5? Tell us what you want to share.
Amin Tootoonchian: In general, the most thought-provoking situations are those that deviate from our expectations—especially when we try to understand why actual performance deviates from the expected curve.
Alex Paino: One of the most surprising things we found was that different machine learning components scaled very differently. Some parts scaled very well, and some didn’t. This was something we really realized during the actual training process. This experience taught us a lot.
Daniel Selsam: I think the two core features of the GPT paradigm are, first, that test loss (a measure of how well the model performs on unseen test data) can be accurately predicted, and second, that model performance improves predictably as scale increases. Even more remarkably, the reduction in test loss translates, in mysterious ways that are hard to quantify, into intelligence that is enhanced across the board.
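As a minimal sketch of how "predictable test loss" is used in practice (the numbers are synthetic and the simple power-law-plus-floor form is an assumption; real fits are far more careful): fit the curve on small runs, then extrapolate to the target compute budget before committing to it.
```python
import numpy as np
from scipy.optimize import curve_fit

# Synthetic (compute, test-loss) points from small-scale runs -- illustrative only.
compute = np.array([1e18, 3e18, 1e19, 3e19, 1e20])
loss    = np.array([3.10, 2.92, 2.75, 2.61, 2.50])

def scaling_curve(c, a, b, floor):
    # Power law with an irreducible loss floor: L(C) = a * (C / 1e18)^(-b) + floor
    return a * (c / 1e18) ** (-b) + floor

params, _ = curve_fit(scaling_curve, compute, loss, p0=(1.0, 0.2, 2.0), maxfev=20000)
a, b, floor = params
print(f"fit: a={a:.3f}, b={b:.3f}, floor={floor:.3f}")

# Extrapolate to a much larger (hypothetical) training budget.
target = 1e22
print(f"predicted test loss at {target:.0e} FLOPs: {scaling_curve(target, *params):.3f}")
```
The second, "mysterious" part is precisely what such a fit does not capture: how a given drop in test loss shows up as broadly better capabilities.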
Sam Altman: Are you absolutely optimistic about this? Do you totally agree with this?
Daniel Selsam: What I want to say is that we found a particularly interesting phenomenon from the GPT-4.5 test - after retesting, the many sophisticated capabilities demonstrated by the model completely exceeded everyone's expectations.
We are sure that it will become smarter in various ways that are difficult to define in advance, and after actual deployment, these subtle improvements can be observed from user satisfaction: stronger common sense reserves, more accurate contextual understanding, and more delicate semantic grasp - this is the magic brought by those additional test losses. In my opinion, Scaling Law has been perfectly verified in this dimension.
06.
Machine learning works closely with the systems team.
Not "sweeping the snow in front of one's own door"
Sam Altman: What has been the most positive moment of the entire training process? What is your favorite memory? Obviously there has been a lot of pain, but hopefully that pain has eased a little.
Alex Paino: I did have one of those moments. We were doing a lot of machine learning stuff during training, and I think some of the changes we made during the run had a pretty good impact, probably better than expected, and that was a very exciting moment for us.
Amin Tootoonchian: For me, while we were training, we were also building infrastructure at the same time. We firmly believed that we could get over this performance cliff, and we had a plan, and everyone was executing, but it would take a long time. It was hard work, definitely harder than I thought. My predictions were wrong, and I underestimated how long it would take to solve these problems.
I still remember the moment when the team finally solved those key problems and achieved significant performance improvements. You could clearly feel the energy of the entire team shift - everyone was suddenly full of energy and sprinted towards the final goal with a new motivation.
The most amazing thing is that the estimated completion time shown on our status tracker has been continuously reduced from the initial two years to a clear time point. This visible progress has boosted team morale immeasurably. I think this is the beauty of it.
I want to emphasize that the work on ML never stops. Even after training starts, this process of ML co-design continues. Not only does the ML team proactively follow up on issues that were once marked as “follow up,” but they also continue to deliver improvements that actually optimize training time.
This perfectly reflects our team spirit: no one draws a boundary around their own work and minds only their own business; instead there is truly seamless collaboration, and that cohesion is our greatest advantage.
07.
GPT-4.5 pre-training was the most thorough plan yet.
Never let go of any anomaly
Daniel Selsam: There has been a lot of discussion about how challenging this undertaking was and how much of it could be predicted in advance. But the fact is that it all rested on extremely careful planning - can you elaborate on that?
Alex Paino: This is definitely the most thorough plan we have ever had. As I said, we started preparing for this project a year before the official training started. During this period, we conducted many large-scale risk control test runs.
We paid special attention to introducing every improvement incrementally: starting from a high-confidence base configuration - which you can think of as a mature architecture similar to GPT-4 that we had fully mastered at the machine learning level - and then stacking new features on top like building blocks.
The key is to rigorously verify the scalability of each improvement at different scales: not only to see performance improvements, but also to ensure that these improvements can continue to be effective as the model scales up. Many improvements work well when tested on a small scale, but fail in large-scale applications.
Therefore, we remained highly vigilant throughout the process and continuously iterated and improved our scaling-law methodology. Through this risk-control practice, we accumulated a lot of valuable experience that will continue to guide the development of future GPT-series models.
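A minimal sketch of the "verify at multiple scales" discipline described above (every number below is invented for illustration): the question is not just whether a change helps at the smallest scale, but whether the gain holds as scale increases.
```python
# Hypothetical loss measurements for a baseline vs. a candidate change,
# each trained at a ladder of compute scales (numbers are invented).
scales    = [1e18, 1e19, 1e20]
baseline  = [3.10, 2.75, 2.50]
candidate = [3.02, 2.68, 2.43]

deltas = [b - c for b, c in zip(baseline, candidate)]
print("loss improvement at each scale:", deltas)

# A change is worth carrying into the big run only if its benefit does not
# vanish as scale grows (ideally the gap stays flat or widens).
strictly_shrinking = all(later < earlier for earlier, later in zip(deltas, deltas[1:]))
print("improvement shrinking with scale:", strictly_shrinking)
```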
Amin Tootoonchian: I remember one particularly interesting moment that I still miss. You know, every time we start a training run we run into various bugs; that is nothing new. The key is to make sure progress is not blocked, and to keep confirming whether the run really is on track and whether these bugs will have a fatal impact on the health of the training.
Although we were initially very sure that there was a major flaw, the monitoring system we had built allowed us to accurately identify the root cause of the problem: Was it a hardware failure? What kind of hardware failure? Was it data corruption? Was it a bug in the machine learning model itself? Or was it a race condition in the code?
At the time, we had multiple discussion forums open at the same time, with a wide variety of symptoms. After a series of bug fixes, we were stuck: there were multiple unresolved issues in front of us, and everyone was wondering - are these caused by different bugs? Or is it just one bug?
Later, we held a poll and asked team members to vote for the most likely root cause. The least favored option turned out to be the truth: the problem was in the torch.sum function upstream in PyTorch, a simple summation operation.
This bug was particularly interesting. Remember that we mainly use Triton kernels and fall back to torch operations only in some insignificant edge cases. The torch.sum bug triggered by our specific code path would occasionally cause an illegal memory access depending on the data distribution: it miscalculated a memory offset.
The most dramatic thing is that when an engineer finally located the problem and submitted a fix, all the errors with different symptoms disappeared. Everyone excitedly changed the name of the Slack channel from "multiple bug theory" to "single bug theory", and the scene was particularly joyful.
How long had this bug been lurking? It had existed since the early stages of training, but it was not discovered until the progress bar had passed about 40%. The discovery itself was dramatic: a complex sequence of kernel calls was involved, and it was the second call that triggered the illegal memory access.
Although the crash frequency was extremely low (it occurred only once every few hundred or even thousands of training steps), it would have been easy to dismiss as an occasional failure, but our team's motto is: never let go of any anomaly. The most exciting part of this story is precisely that refusal to give up.
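To make the "rare fallback path" concrete, here is a hedged sketch of the dispatch pattern described above (the fast-kernel function is a stand-in, not OpenAI's code): a bug hiding in the rarely taken branch only fires when a batch happens to route through it, which is exactly how a fault can stay hidden for 40% of a run.
```python
import torch

def fused_row_sum(x: torch.Tensor) -> torch.Tensor:
    # Stand-in for a custom Triton kernel; in this sketch it simply calls PyTorch.
    return x.sum(dim=-1)

def row_sum(x: torch.Tensor) -> torch.Tensor:
    """Dispatch sketch: fast custom kernel on the common path, torch.sum only
    for rare edge-case shapes. A low-frequency bug in the fallback is triggered
    only on the occasional batch that takes this branch."""
    if x.is_contiguous() and x.shape[-1] % 128 == 0:
        return fused_row_sum(x)          # common path, exercised every step
    return torch.sum(x, dim=-1)          # rare path, exercised once in a while
```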
08.
We are still far from an ideal system
Sam Altman: What else do you need to do after GPT-4.5 pre-training starts?
Alex Paino: We all need to watch the loss curve frequently. In addition, we need to continuously optimize the system and improve the co-design that was not completed before the training started. We closely monitor various statistical indicators during the training process to ensure that there are no unexpected abnormal trends. At the same time, we explore possible improvements from the perspective of machine learning. Although the work at the data level will be temporarily reduced after the pre-training starts, there is still a lot of work to be done.
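A minimal sketch of the kind of statistical watch-keeping described here (the smoothing constant and threshold are arbitrary illustrative choices): smooth the noisy per-step loss and flag excursions, rather than eyeballing the raw curve.
```python
def watch_loss(loss_stream, alpha=0.01, spike_factor=1.5):
    """Flag steps where the raw loss jumps well above its exponential
    moving average -- a crude stand-in for real training telemetry."""
    ema = None
    for step, loss in enumerate(loss_stream):
        ema = loss if ema is None else (1 - alpha) * ema + alpha * loss
        if loss > spike_factor * ema:
            yield step, loss, ema   # candidate anomaly for a human to inspect

# Usage with a fake loss stream (illustrative numbers only):
fake_losses = [2.80, 2.79, 2.78, 2.77, 4.50, 2.76, 2.75]
for step, loss, ema in watch_loss(fake_losses):
    print(f"step {step}: loss {loss:.2f} vs EMA {ema:.2f}")
```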
Amin Tootoonchian: I think machine learning relies heavily on correctness judgment. After pre-training starts, facing a large amount of noise signals, we are like fortune tellers interpreting tea leaves, and we need to judge whether the system is healthy. This is our responsibility.
Sam Altman: At the system level, what will limit our ability to train models? Is it the chip, processor, memory, network, or power?
Amin Tootoonchian: The beauty of the system is that when you do co-design, the workload can adapt to the infrastructure that you build. There is no universal statement here that the network is the bottleneck, or the memory bandwidth is the bottleneck, or something like that. Even for the same form factor model, we can choose to shift resource requirements, we can choose to create a more balanced system, but having more memory bandwidth is always beneficial. It's hard to answer this question without qualifications.
When designing GPT-4.5, we may need the system to have a certain property, and such properties do not appear on their own; they can only be produced through deliberate effort. That is why co-design matters for shaping the model architecture and its elements, and it ties the systems side and machine learning together to some extent. If the system ends up with a property we do not want, that becomes a constraint. My ideal situation is for everything to be decoupled, to give each side the greatest possible room.
Sometimes things are linked together, and we have to meet the infrastructure's requirements, or things simply are the way they are. Much of the time we need a balanced system with balanced communication, and the best lever we have is all of this co-design.
Sam Altman: How close are we to this ideal system goal?
Amin Tootoonchian: We are far from that goal. The process of building a system is always like this: start with an idealized view of how things should work, and then reconcile those differences with the resources available.
I think we don't do theory for theory's sake; we just discuss what we want the system to be, make it happen, and get as close to that ideal as possible. That is probably the most exciting part of the systems field. People may say a design is elegant, but ultimately history will tell us whether the choice was right or wrong.
Sam Altman: If you could get the answer to one machine learning problem before your next big training run, what would it be?
Alex Paino: I want to know what algorithms we should use under limited data and in specific fields. Although this is a broad question, it is indeed the most critical one.
Sam Altman: Will there be simultaneous pre-training on 10 million GPUs or more in the future?
Alex Paino: I think there will be, but it may not be the traditional pre-training model. Its form may be completely different from existing technology, but it will still retain the core of unsupervised learning.
Amin Tootoonchian: I prefer semi-synchronous mode. Due to the limitations of physical laws, full synchronization is not realistic.
Daniel Selsam: I think it's more likely to be decentralized. There will certainly be 10 million GPUs working together on an AI system that learns and performs tasks, but like the various parts of the brain, they don't necessarily communicate with each other.
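A minimal sketch of the semi-synchronous flavor being contrasted with full synchronization (a local-SGD-style scheme with hypothetical structure; it is not a description of any OpenAI system): each replica takes several local steps, then the group reconciles.
```python
import torch

def semi_synchronous_training(replicas, optimizers, batch_streams, sync_every=32):
    """Toy local-SGD-style scheme (illustrative only): each replica trains
    independently for `sync_every` steps, then all replicas average their
    parameters. Fully synchronous training would instead average gradients
    on every single step, which physics makes hard at extreme scale."""
    for step, batches in enumerate(zip(*batch_streams)):
        for model, opt, batch in zip(replicas, optimizers, batches):
            loss = model(batch).mean()       # stand-in for the real loss
            loss.backward()
            opt.step()
            opt.zero_grad()
        if (step + 1) % sync_every == 0:
            with torch.no_grad():            # reconcile: average parameters
                for params in zip(*(m.parameters() for m in replicas)):
                    mean = torch.stack([p.detach() for p in params]).mean(dim=0)
                    for p in params:
                        p.copy_(mean)
```
A fully decentralized version would relax even this periodic averaging, with groups of replicas exchanging information only occasionally, closer to the brain analogy above.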
09.
Algorithm improvements produce a cumulative effect.
Driving data efficiency
Sam Altman: How far is the data efficiency of the most advanced algorithms compared to humans? Is there any hope of catching up in the future?
Daniel Selsam: It is difficult to compare the two directly. The gap in language learning is definitely huge. The key lies in how to define the amount of information received by human visual nerves. I think that in general, the data efficiency of algorithms is much lower than that of humans.
For decades, deep learning has focused on computing efficiency. Beyond the growth in data and computing power, the real surprise is the cumulative effect of algorithmic improvements: each 10% or 20% improvement in the algorithms has a significant effect on data efficiency. Until now there has been no comparable mobilization around data efficiency, because it was not worth it while data was not the bottleneck and computing power was the limiting factor.
We are now entering a new phase of AI research in which we will start to stack up data-efficiency wins. I think it would be foolish to predict right now that we will hit an insurmountable barrier. The human brain certainly works differently from our algorithms, and we should be cautious about that. But I think it is fair to be optimistic about where the algorithms can go.
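As a small worked example of how such gains compound (the 15% figure and the count of ten improvements are assumptions chosen only for the arithmetic, not numbers from the conversation):
```latex
\underbrace{(1.15)^{10}}_{\text{ten stacked 15\% improvements}} \;\approx\; 4.05,
\qquad \text{i.e. roughly a } 4\times \text{ effective gain from compounding alone.}
```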
Sam Altman: Is there any correlation between larger-scale pre-training and stronger learning and reasoning capabilities of the model?
Alex Paino: What we observe is that better pre-training and unsupervised learning tend to raise the model's overall intelligence and help a great deal with generalization. That goes hand in hand with reasoning ability, whereas reasoning training may improve intelligence a bit more slowly. I think they are complementary.
Sam Altman: Pre-training seems to be general enough to do a lot of things, whereas training a model only makes it good at a certain kind of thing, is that right?
Alex Paino: That's interesting, but it's not surprising when you look at the data they were trained on. Pre-training datasets are very large, and we optimize for breadth and diversity. With reinforcement learning, you give the model a clear, good reward signal and a good training environment, and I think it is hard to match that breadth of data.
Daniel Selsam: I agree, but I think there is another factor. Pre-training is essentially compressing data, and compressing data means finding connections between different things; it is about analogies, and it is more abstract. Reasoning is a skill that requires careful thinking about a specific class of problems, and it also yields solutions to many types of problems. But in pre-training, when you compress data across different fields, you learn more abstract knowledge.
10.
The essence of intelligence is compression.
The long-tail effect of data keeps the Scaling Law effective
Sam Altman: Why does unsupervised learning work?
Daniel Selsam: The key is compression. An idealized form of intelligence is Solomonoff induction; roughly speaking, it considers all possible programs but gives more weight to the simpler ones.
The essence of today's pre-training is a compression process: it approximates this ideal by looking for the simplest program that explains all the data humans have produced so far.
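For readers who want the analogy stated formally, the Solomonoff prior can be written as follows; it is the standard idealized, uncomputable construction, and pre-training is at best a loose approximation of it.
```latex
% Solomonoff prior: weight every program p that, run on a universal machine U,
% produces output beginning with the observed data x, favoring shorter programs.
M(x) \;=\; \sum_{p \,:\, U(p) \text{ outputs a string beginning with } x} 2^{-\ell(p)}
```
The "simplest program that explains the data" intuition in the paragraph above corresponds to the short programs that dominate this sum.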
Sam Altman: How can next token prediction help achieve compression?
Daniel Selsam: There is a paradox in statistics: why can deep networks generalize when they do not appear to compress anything? Normally, when you have a lot of data and a small model, the model must have compressed the data in order to have learned something.
In pre-training, both the data and the model are enormous, so some people think this kind of training is just memorization and interpolation. What they overlook is another way of understanding compression: prequential compression. It works like a compressor: even though the model's weights are huge, the compressed encoding never needs to store them, because the results of next-token prediction can themselves be used to encode the information and improve compression efficiency.
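The prequential view can be written down directly; this is the standard formulation, with p_theta the model's next-token distribution trained only on the prefix seen so far.
```latex
% Prequential code length: encode each token with the model trained only on
% the tokens already transmitted; the weights themselves are never stored.
\mathrm{CodeLength}(x_{1:T}) \;=\; \sum_{t=1}^{T} -\log_2 p_{\theta(x_{<t})}\!\left(x_t \mid x_{<t}\right)
\;+\; \ell(\text{training program})
```
Because sender and receiver can both run the same deterministic training procedure on the already-transmitted prefix, the weights never need to be stored or sent, which is why even an enormous model can implement a valid compressor.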
Sam Altman: The process of training GPT-4.5 consumed a lot of manpower, time, and money. This can actually be seen as an experiment to verify the Scaling Law, and the results proved that it is effective and will continue for a long time. Why can the Scaling Law be called the law of the universe?
Daniel Selsam: The stronger the compression, the more powerful the intelligence, and that has deep philosophical implications. Why does training a larger model for longer produce better compression? This touches on a lot of theory; the explanation I like involves sparse representations.
In reality, key concepts follow a power-law distribution. For example, the 100th most important concept might appear only once in every 100 documents, a clear long-tail effect. This distribution means that capturing all of the key concepts requires large-scale data and computing power, and it is also what keeps the Scaling Law effective over the long term.
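A small simulation of the long-tail point (the Zipf exponent and the dataset sizes are arbitrary illustrative choices): the number of distinct "concepts" seen keeps growing slowly with dataset size, which is one intuition for why more data and compute keep paying off.
```python
import numpy as np

rng = np.random.default_rng(0)

# Pretend each document exposes one "concept" drawn from a Zipf (power-law)
# distribution; the exponent 1.2 and the corpus sizes below are arbitrary.
for n_docs in (10_000, 100_000, 1_000_000):
    concepts = rng.zipf(1.2, size=n_docs)
    print(f"{n_docs:>9} documents -> {np.unique(concepts).size:>7} distinct concepts seen")
```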
Editor | Panken
This article comes from the WeChat official account "Zhidongxi" (ID: zhidxcom), written by Chen Junda and Chen Jiayang, and is published by 36Kr with authorization.



