After Ilya's verdict, GPT-5 is revealed to have repeatedly failed its training runs, each taking months, with data having to be rebuilt by hand from scratch

GPT-5 is reported to be far from meeting expectations.

OpenAI has just wrapped up 12 consecutive days of release events, but the much-anticipated GPT-5/4.5 was nowhere to be seen, and now a leak has surfaced via The Wall Street Journal.

GPT-5 has been through at least two rounds of training, each lasting several months, and each run has thrown up new problems.

OpenAI is hiring people specifically to write code and solve math problems to create data for GPT-5 from scratch. It is also using synthetic data from o1. But the process is not efficient enough to meet GPT-5's pre-training requirements.

According to market estimates, a single six-month training run costs $500 million in compute alone. With two bumpy training runs behind it, GPT-5's accumulated costs must be astronomical.

Ilya's recent pronouncement at NeurIPS 2024 on the impending end of pre-training seems to be further validated...

This also echoes an earlier leak from The Information: as the GPT series' evolution slows, OpenAI is adjusting its strategy, for example by launching the o1 and o3 series.

Currently, OpenAI has not responded to the latest leak.

But is GPT-5 something OpenAI is holding back, or something it simply cannot ship? The answer is looking increasingly certain.

Massive data and compute cannot crack GPT-5's pre-training

According to the Wall Street Journal's report, OpenAI has high expectations for GPT-5.

It should be able to conduct scientific exploration and handle routine human tasks such as scheduling and booking flights. OpenAI also hopes it will make fewer mistakes, or at least acknowledge when it errs, i.e., hallucinate less.

This is consistent with earlier reports. OpenAI's former CTO Mira Murati once vividly compared GPT-5's level of intelligence to that of a doctoral student.

This means GPT-5 should achieve high-level performance in certain specific fields, with the deep understanding, reasoning, and professional knowledge of a graduate or doctoral student. By comparison, GPT-3 is a toddler learning to walk, and GPT-4 a high school student.

In October this year, OpenAI's latest $6.6 billion fundraising round pushed its valuation to $157 billion. Investors' continued investment is also believed to be due to their belief that GPT-5 will be able to make a major breakthrough.

But the release of GPT-5 has remained up in the air.

Altman previously stated that GPT-5 has no definite release date; it will ship whenever it is ready, which could be 2025 or 2026.

Looking back, the launch of GPT-5 has been bumpy all along.

In 2023, OpenAI was reported to have abandoned a model codenamed Arrakis. The reason for abandoning it was that the model could not reduce the demand for computing resources while maintaining performance, and did not achieve the expected training efficiency.

This indirectly confirms that training even larger models still requires far more compute and far more time.

By design, GPT-5 is clearly meant to be a "behemoth".

Development of GPT-5 began when GPT-4 was released, which is now more than 18 months ago.

Its internal codename is Orion. According to the original plan, Microsoft wanted to see GPT-5 by mid-2024.

The Wall Street Journal revealed that GPT-5 has gone through at least two rounds of large-scale training, each taking several months, with new problems encountered every time.

In the best case, Orion performs better than OpenAI's current products. But the improvement is modest relative to the cost consumed.

It is estimated that a 6-month training session would consume $500 million just in computing costs. In comparison, the training cost of GPT-4 exceeded $100 million.

To get a better model, more data is needed.

Public data resources have been exhausted, so OpenAI decided to hire people to build data from scratch. According to the report, it has hired software engineers and mathematicians specifically to write code and solve math problems for GPT-5 to learn from.

The AI community has long believed that learning from code improves a model's ability to solve other kinds of problems.

At the same time, OpenAI is also collaborating with some physicists to let GPT-5 learn how scientists understand problems in their field.

But the problem is, this is too slow.

OpenAI has also taken the path of AI-generated synthetic data. It is said that GPT-5 has used data synthesized by o1.

This paradigm may already have been shown to work.

Anthropic next door has also been reported to use AI-generated synthetic data to train models. Their approach is to keep the best-performing synthetic data internally, as model performance is directly proportional to the quality of the synthetic data.
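
To make that filter-and-keep idea concrete, here is a minimal sketch in Python. The `generate` and `quality_score` callables are hypothetical stand-ins; neither Anthropic's nor OpenAI's actual pipeline has been disclosed.

```python
# Minimal sketch of a quality-filtered synthetic-data pipeline.
# `generate` and `quality_score` are hypothetical stand-ins.
def build_synthetic_set(generate, quality_score, prompts, keep_ratio=0.1):
    # Generate one candidate sample per prompt.
    samples = [(p, generate(p)) for p in prompts]
    # Rank candidates by the scorer (e.g., a reward or judge model).
    samples.sort(key=lambda s: quality_score(*s), reverse=True)
    # Keep only the top-scoring fraction for training.
    cutoff = max(1, int(len(samples) * keep_ratio))
    return samples[:cutoff]
```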

That's roughly the latest information on GPT-5.

But to be honest, who still cares about GPT-5 these days (just kidding)?

After all, OpenAI has launched the Reasoning Scaling Law with the o1 and o3 series.

The recently released o3 has set a new record on the ARC-AGI benchmark. The latest results show that on 400 public tasks, the best performance of o3 has reached 91.5%.

In terms of core mechanism, o3 also offers new insights: it achieves knowledge reorganization at test time by searching and executing in the LLM's token space.

With the release of the o3 series, the prophecy of AGI remains very attractive.

o3 tops the ARC-AGI test: how far is it from AGI?

A brief introduction to the ARC-AGI dataset: each question consists of grids of colored blocks (described in text, with numbers standing for colors). The model observes three input-output examples per question, then fills in a new blank grid according to the pattern.
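
For concreteness, here is what such a task looks like as data, in a minimal Python sketch. The structure mirrors the publicly documented ARC format (2-D grids of integer color codes); the tiny task and its "pattern" are made up for illustration.

```python
# An ARC-style task: grids are 2-D lists of ints, each int a color code.
# The model sees the "train" pairs and must produce the missing test output.
task = {
    "train": [
        {"input": [[0, 1], [0, 0]], "output": [[1, 0], [0, 0]]},
        {"input": [[0, 0], [0, 1]], "output": [[0, 0], [1, 0]]},
    ],
    "test": [
        {"input": [[1, 0], [0, 0]]}  # expected output: [[0, 1], [0, 0]]
    ],
}

# The (made-up) pattern here: mirror each row left-to-right.
def solve(grid):
    return [row[::-1] for row in grid]

for pair in task["train"]:
    assert solve(pair["input"]) == pair["output"]
print(solve(task["test"][0]["input"]))  # -> [[0, 1], [0, 0]]
```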

Examples like these are relatively simple, but the actual problems can get considerably harder.

The ARC-AGI test set includes 400 public questions and 100 private questions.

On the public questions, the high-efficiency version of o3 achieved an accuracy of 82.8%, consuming 111 million tokens at an average cost of $17 per task.

The low-efficiency version (with 172 times the compute of the high-efficiency version) reached an accuracy of 91.5%, but consumed a staggering 9.5 billion tokens.
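
Working out the implied totals from those reported figures (a back-of-the-envelope calculation, assuming cost scales linearly with compute):

```python
# Back-of-the-envelope totals from the reported ARC-AGI figures.
tasks = 400
high_cost_per_task = 17          # USD, reported average
high_tokens = 111_000_000        # total tokens, high-efficiency run

print(high_cost_per_task * tasks)        # ~$6,800 total for the run
print(high_tokens // tasks)              # ~277,500 tokens per task

low_tokens = 9_500_000_000               # total tokens, low-efficiency run
print(low_tokens // tasks)               # ~23.7M tokens per task
# Assuming cost scales with the reported 172x compute multiplier:
print(high_cost_per_task * tasks * 172)  # ~$1.17M, a rough extrapolation
```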

Additionally, OpenAI also trained a version specifically for ARC-AGI, using 75% of the public dataset.

This version achieved 76% accuracy on the private test set in the low-compute mode, and 88% in the high-compute mode.

The low-compute version's cost also falls within the ARC-AGI-Pub rules (under $10,000), making it the top-ranked model on the public leaderboard.

The 88% high-compute version is too expensive, but it still shows that the performance on new tasks does improve with increased computing power.

Prior to this, GPT-3's accuracy was zero, GPT-4o's was 5%, and o1's best was just over 30%.

François Chollet, one of the initiators of the ARC challenge, a former senior Google engineer, and the father of Keras, believes that o3 can adapt to tasks it has never encountered before and arguably approaches human level on ARC-AGI.

Of course, the cost is also very high. Even in low-compute mode, each task costs $17-20, while the organizers pay humans an average of only $5 per problem to solve the same tasks.

Cost aside, Chollet points out that o3's improvement over the GPT series proves the importance of architecture; he believes such results could not be obtained by simply throwing more compute at GPT-4.

So, does the ARC-AGI test mean that o3 has achieved AGI? Chollet believes it does not.

The test found that o3 still fails on some very simple tasks, indicating a fundamental difference from human intelligence.

In addition, the next generation of ARC-AGI, ARC-AGI-2, is about to launch. Early tests suggest it will pose a major challenge to o3: even in high-compute mode its score may drop below 30%, while a smart human could still score over 95%.

But whether or not this counts as AGI, o3's results are unprecedented. Some even argue that on ARC-style tasks the human advantage lies in visual reasoning; if the shapes were described in text, the way the model sees them, humans would not necessarily beat the AI.

And in one case where o3 "failed", some have questioned whether the standard answer itself was wrong.

In this problem, the transformation rule is to connect two blue squares in the same row or column with a line, and to recolor blue the red cells that the line passes through.

The difference between the "standard answer" and o3's attempt comes down to whether the area in the green frame gets colored blue.

In the three examples, every part that turns from red to blue is crossed by a connecting line; but in this problem the line only passes through the bottom of a 3x4 red area, so o3 concluded that the rest of the area should not be colored blue.
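
The cell-level reading that o3 appears to have applied can be written down directly. Below is a minimal sketch in Python; the integer color codes for blue and red are assumptions, and the disputed "standard answer" would instead recolor the entire contiguous red region the line touches.

```python
# o3's (cell-level) reading of the rule: only red cells that the
# connecting line actually passes through turn blue.
BLUE, RED = 1, 2  # assumed color codes

def connect_blues(grid):
    out = [row[:] for row in grid]
    blues = [(r, c) for r, row in enumerate(grid)
             for c, v in enumerate(row) if v == BLUE]
    for i, (r1, c1) in enumerate(blues):
        for r2, c2 in blues[i + 1:]:
            if r1 == r2:    # same row: recolor red cells between them
                for c in range(min(c1, c2) + 1, max(c1, c2)):
                    if grid[r1][c] == RED:
                        out[r1][c] = BLUE
            elif c1 == c2:  # same column
                for r in range(min(r1, r2) + 1, max(r1, r2)):
                    if grid[r][c1] == RED:
                        out[r][c1] = BLUE
    return out
```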

So how did o3 achieve this?

Some believe it was through prompts, but ARC challenge organizer Greg Kamradt and OpenAI researcher Brandon McKinzie both denied this, saying the prompts given to o3 were very simple.

Additionally, Chollet speculates that o3's core mechanism is searching over and executing natural-language programs in token space: guided by some evaluator model, the search may operate in the space of possible descriptions of the steps needed to solve the task.
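
As a rough illustration of that speculation (and it is only speculation; OpenAI has not disclosed o3's mechanism), the loop might look something like this, where `propose`, `score`, and `run` are hypothetical stand-ins for sampling candidate solution descriptions from the model, ranking them with an evaluator, and executing them:

```python
# Hypothetical sketch of search over natural-language "programs":
# sample candidate step-by-step solution descriptions, rank them with an
# evaluator model, and accept one that reproduces all training examples.
def search_solution(task, propose, score, run, n_candidates=64, top_k=4):
    candidates = [propose(task) for _ in range(n_candidates)]
    ranked = sorted(candidates, key=lambda p: score(task, p), reverse=True)
    for program in ranked[:top_k]:
        if all(run(program, ex["input"]) == ex["output"]
               for ex in task["train"]):
            return run(program, task["test"][0]["input"])
    # Fall back to the evaluator's top pick if nothing verifies.
    return run(ranked[0], task["test"][0]["input"])
```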

According to Chollet, o3 has achieved knowledge reorganization during testing, and in general, o3 has built a new paradigm towards AGI.

NVIDIA AI scientist Jim Fan believes that the essence of o3 is "relaxing single-point RL super-intelligence to cover more points in the useful problem space".

That is, trading depth for breadth: relaxing reinforcement learning on individual tasks in exchange for generality across more tasks.

Fan cites examples like AlphaGo and Boston Dynamics' robots, super-AIs that perform extremely well on specific tasks.

But o3 is no longer an expert confined to single-point tasks; it is an expert that performs excellently across a much larger set of useful tasks.

However, Fan also notes that o3 still cannot cover the full distribution of human cognition, and we are still inside Moravec's paradox.

(Moravec's paradox states that high-level reasoning requires very little computation, but low-level skills that humans take for granted require enormous computational resources.)

The ARC challenge organizers' finding that o3 fails on some very simple tasks seems to confirm exactly this view.

Finally, on AGI, Fan said that we have achieved a huge milestone and have a clear roadmap, but there is still more to be done.

One More Thing

As part of the 12 days of releases, OpenAI also published a paper on safety on the final day.

The paper introduces a method called deliberative alignment, which directly teaches the reasoning model human-written, interpretable safety specifications and trains it to explicitly reason about those specifications before answering.

As a result, the trained models can adhere to OpenAI's safety policies with high precision without the need for human-labeled CoT or answers.
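
In spirit, the idea looks like the sketch below (a prompt-level illustration only; in the paper this behavior is trained in, not merely prompted, and the policy text here is invented for illustration, with `ask` a hypothetical chat client):

```python
# Prompt-level illustration of deliberative alignment's core idea:
# give the model the written safety spec and have it reason about the
# spec explicitly before producing an answer.
SAFETY_SPEC = """\
1. Refuse requests for instructions that enable serious harm.
2. Provide general information, not professional diagnoses.
3. If a request is ambiguous, ask a clarifying question first."""

def deliberative_answer(ask, user_message):
    prompt = (
        "Safety policy:\n" + SAFETY_SPEC + "\n\n"
        "Step 1: reason explicitly about which policy rules apply to the "
        "request and whether to answer, refuse, or clarify.\n"
        "Step 2: give the final response.\n\n"
        "Request: " + user_message
    )
    return ask(prompt)
```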

OpenAI found that o1 significantly outperforms other state-of-the-art models such as GPT-4o on a series of internal and external safety benchmarks, and its performance saturates on many challenging safety datasets.

This finding reveals that reasoning will become a new path to improve model safety.

Reference links:

[1]https://www.wsj.com/tech/ai/openai-gpt5-orion-delays-639e7693?st=ng5hBi

[2]https://x.com/mckbrando/status/1870285050555810198

[3]https://x.com/DrJimFan/status/1870542485023584334

[4]https://arcprize.org/blog/oai-o3-pub-breakthrough

This article is from the WeChat public account "Quantum" (author: Focus on Frontier Technology), published by 36Kr with authorization.
