Why Are GPT-5 and Opus 3.5 Still Unreleased? A New Conjecture: They Already Exist, Distilled into Smaller Models to Sell

"From now on, the base model may be running in the background, allowing other models to accomplish feats they could not do on their own - like an old hermit passing down wisdom from a secret mountain cave."

In the past few months, the media, AI community, and the general public have all been paying attention to the progress of OpenAI's next-generation large model "GPT-5".

We all know OpenAI is working on new models, and that those models may have hit difficulties that keep them from launching as planned. But if someone told you GPT-5 was already out there, quietly shaping the world, what would you think?

Suppose the following: OpenAI has already built GPT-5 but is keeping it internal, because the return on investment from using it in-house far exceeds what it would earn from releasing it to millions of ChatGPT users. And the returns they are after are not monetary. The idea itself is simple; the challenge lies in connecting the scattered clues. Recently, technology analyst Alberto Romero dug into this line of thought.

First, let's be clear: this is not a 100% reliable guess. The evidence is public, but no leaks or insider rumors confirm the idea. The author has no exclusive information; if he did, he would be bound by a non-disclosure agreement anyway. Still, at least logically, the speculation is quite convincing.

Let's see what the article says.

Original article: https://www.thealgorithmicbridge.com/p/this-rumor-about-gpt-5-changes-everything

I. The Mysterious Disappearance of Opus 3.5

Before getting to GPT-5, we must first visit its distant cousin: Anthropic's equally missing Claude Opus 3.5.

As you know, the three major overseas AI research labs - OpenAI, Google DeepMind, and Anthropic - each offer a family of large models covering a range of prices, latencies, and performance levels. OpenAI offers GPT-4o, GPT-4o mini, o1, and o1-mini; Google DeepMind offers Gemini Ultra, Pro, and Flash; and Anthropic has Claude Opus, Sonnet, and Haiku.

Their goal is clear: to cater to as many customers as possible. Some prioritize top-notch performance, while others seek affordable yet sufficiently good solutions, and so far, everything has been going well.

But in October 2024, something strange happened. Everyone expected Anthropic to announce Claude Opus 3.5 as its answer to GPT-4o (launched in May 2024). Instead, on October 22nd, they released an updated version of Claude Sonnet 3.5 (which people began calling Sonnet 3.6). Opus 3.5 was nowhere to be found, as if Anthropic suddenly had no direct competitor to GPT-4o. Research progress, it seemed, had hit a snag. Here is what people were saying at the time, and what actually happened to Opus 3.5:

On October 28th, rumors surfaced that Sonnet 3.6 was... an intermediate checkpoint from a failed training run of the much-anticipated Opus 3.5. A post on the r/ClaudeAI subreddit claimed that Claude 3.5 Opus had been abandoned, linking to the Anthropic models page; to this day, that page does not mention Opus 3.5. Some speculated the removal was a strategic move to preserve investor confidence ahead of an upcoming funding round.

On November 11th, Anthropic CEO Dario Amodei dispelled the rumors on the Lex Fridman podcast, saying, "There's no exact date, but as far as we know, the plan is still to release Claude 3.5 Opus." The tone was cautious, but the confirmation was there.

On November 13th, a Bloomberg report confirmed the previous rumors: "After training, Anthropic found that 3.5 Opus performed better in evaluations than the previous version, but the advantage was not significant enough given the model's size and the cost of building and running it." Dario seems to have not given a date because, although the training of Opus 3.5 did not fail, the results were not as satisfactory as expected. Note that the focus is on the cost-to-performance ratio, not just performance alone.

On December 11th, semiconductor expert Dylan Patel and his Semianalysis team provided the final plot twist, offering an explanation that weaves all the data points into a coherent story: "Anthropic finished training Claude 3.5 Opus, and it performed well, scaling appropriately... yet Anthropic didn't release it. Instead of releasing it publicly, Anthropic used Claude 3.5 Opus to generate synthetic data and for reward modeling, significantly improving Claude 3.5 Sonnet alongside user data."

In short, Anthropic did train the Claude Opus 3.5 large model. They shelved the name because the results were not yet good enough to carry it. Dario believed further training attempts could improve the results, so he avoided giving a release date. Bloomberg confirmed the results were better than the existing models, just not enough to justify the inference cost. Dylan and his team uncovered the link between the mysterious Sonnet 3.6 and the missing Opus 3.5: the latter was being used internally to generate synthetic data that improved the former.

II. Better Models Become Smaller and Cheaper?

The process of using a powerful, expensive model to generate data to boost the performance of a slightly weaker but cheaper model is called distillation. This is a common practice. This technique allows AI labs to elevate their small models to levels that cannot be achieved by additional pre-training alone.
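For a concrete picture of the technique, here is a minimal sketch of the classic soft-label formulation (Hinton et al., 2015) in PyTorch. It illustrates distillation in general, not any lab's actual recipe, which is not public; the `teacher`, `student`, and temperature below are illustrative placeholders.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits: torch.Tensor,
                      teacher_logits: torch.Tensor,
                      temperature: float = 2.0) -> torch.Tensor:
    """Soft-label loss: push the student's output distribution toward the teacher's."""
    soft_targets = F.softmax(teacher_logits / temperature, dim=-1)
    student_log_probs = F.log_softmax(student_logits / temperature, dim=-1)
    # KL divergence between the two distributions; the T^2 factor keeps
    # gradient magnitudes comparable across temperatures.
    return F.kl_div(student_log_probs, soft_targets,
                    reduction="batchmean") * temperature ** 2

# Typical training step (teacher frozen, only the student is updated):
# with torch.no_grad():
#     teacher_logits = teacher(batch)   # the expensive model runs once, offline
# loss = distillation_loss(student(batch), teacher_logits)
# loss.backward()
```

In practice labs can go further than matching logits, for example by having the teacher generate whole synthetic training corpora, which is closer to what Semianalysis describes Anthropic doing.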

There are various methods of distillation beyond the sketch above, but we won't go into the details. What you need to remember is that a powerful "teacher" model transforms a "student" model from small, cheap, fast, and weak into small, cheap, fast, and powerful. Distillation turns the powerful model into a gold mine. Dylan explained why it makes sense for Anthropic to do this with the Opus 3.5 / Sonnet 3.6 pair:

"The inference cost (compared to the old Sonnet) did not change significantly, but the model's performance improved. Why release 3.5 Opus when, on a cost basis, it makes no economic sense to do so, relative to releasing a 3.5 Sonnet further trained with that 3.5 Opus?"

We're back to the cost issue: distillation can boost performance while keeping inference costs low. This immediately solves the main problem reported by Bloomberg. Anthropic's choice not to release Opus 3.5 was not just because of poor performance, but because it was more valuable internally. (Dylan says this is why the open-source community can catch up to GPT-4 so quickly - they're directly mining the gold from OpenAI's gold mine.)

The most astonishing part is that Sonnet 3.6 is not merely good: it reached SOTA level, outperforming GPT-4o. Thanks to distillation from Opus 3.5 (among possibly other things; five months is a long time in AI), Anthropic's mid-range model surpassed OpenAI's flagship. Suddenly, high cost as a proxy for high performance was proven wrong.

What happened to "bigger is better"? OpenAI CEO Sam Altman had warned that era was over. The moment top labs turned secretive, carefully guarding their precious knowledge, they stopped sharing numbers. Parameter count stopped being a reliable metric, and we wisely shifted our attention to benchmark performance. OpenAI's last official disclosure of model size was GPT-3's 175 billion parameters in 2020. By June 2023, rumors circulated that GPT-4 was a mixture-of-experts model totaling around 1.8 trillion parameters. Semianalysis corroborated this in a detailed analysis in July 2023, putting GPT-4 at roughly 1.76 trillion parameters.

Then in December 2024, a year and a half later, Ege Erdil, a researcher at Epoch AI (an organization focused on the future impact of AI), estimated in an article titled "Frontier language models have become much smaller" that the leading AI models, including GPT-4o and Sonnet 3.6, are far smaller than GPT-4, even though both outperform it on benchmarks:

"...The parameter count of current state-of-the-art models like GPT-4o and Claude 3.5 Sonnet may be an order of magnitude smaller than GPT-4's: 4o has around 200 billion parameters and 3.5 Sonnet around 400 billion... Given the rough way I arrived at these numbers, these estimates could easily be off by a factor of two."

How could he arrive at these numbers when the labs release no architectural details? He explains his reasoning at length, but the method is not what matters here. What matters is that the fog is clearing: Anthropic and OpenAI seem to be following a similar trajectory. Their latest models are not only better, but smaller and cheaper than the previous generation. We know Anthropic's approach was to distill Opus 3.5 into Sonnet 3.6. But what about OpenAI?

III. The Forces Driving AI Labs Are Universal

People may think Anthropic's distillation approach stems from a situation unique to it, namely the disappointing Opus 3.5 training results. But Anthropic's situation is by no means an exception: Google DeepMind and OpenAI have also reportedly seen unsatisfying results from their latest training runs. (It should be stressed that "unsatisfying" does not mean the models are worse.) The reasons do not matter much for us: diminishing returns from insufficient data, inherent limitations of the Transformer architecture, a plateau in the pre-training scaling laws, and so on. The point is that Anthropic's supposedly unique situation is actually quite common.

But remember the Bloomberg report: performance can only be judged good or bad once cost is factored in. Why should cost matter so much? Ege Erdil gives the reason: demand for AI exploded after the ChatGPT/GPT-4 craze.

The pace of generative AI adoption has been so fast that labs have struggled to keep up, leading to ever-increasing losses. This situation has prompted them all to reduce the cost of inference (training is done only once, but inference cost grows proportionally with the number of users and usage). If 300 million people use your AI product every week, the operating expenses could suddenly be the death of you.
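To see why this asymmetry bites, here is a toy back-of-the-envelope model of those economics. Every number is an illustrative assumption, not a figure reported by any lab; only the structure matters: training is paid once, while the inference bill grows with usage.

```python
# Toy economics: one-time training cost vs. a usage-proportional inference bill.
# All values are illustrative assumptions, not reported figures.
TRAINING_COST = 500e6             # $ for one frontier training run (assumed)
SERVING_COST_PER_M_TOKENS = 60.0  # $ per million output tokens (assumed)
WEEKLY_USERS = 300e6              # the "300 million weekly users" figure
TOKENS_PER_USER_PER_WEEK = 2_000  # average usage per user (assumed)

weekly_tokens = WEEKLY_USERS * TOKENS_PER_USER_PER_WEEK
weekly_inference_cost = weekly_tokens / 1e6 * SERVING_COST_PER_M_TOKENS

print(f"weekly inference bill: ${weekly_inference_cost:,.0f}")
print(f"weeks until inference spend exceeds the training run: "
      f"{TRAINING_COST / weekly_inference_cost:.1f}")
```

With these made-up numbers, the inference bill overtakes a $500 million training run in about three months. That is why cutting the per-token serving cost, for instance by serving a distilled model, matters far more than trimming the training budget.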

Whatever reasons pushed Anthropic to distill Sonnet 3.6 from Opus 3.5, they weigh on OpenAI many times over. And distillation works because it turns two ubiquitous problems into an advantage: serving smaller models solves the inference-cost problem, and withholding the large model avoids the public backlash a disappointing release would invite.

Ege Erdil believes OpenAI may have chosen another approach: overtraining, that is, training smaller models on more data than the compute-optimal amount: "When inference becomes the dominant or primary part of your spend on the model, the better approach is... to train smaller models on more tokens." But overtraining is no longer viable: AI labs have already exhausted the high-quality data sources for pre-training, as both Elon Musk and Ilya Sutskever acknowledged in recent weeks.
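Here is a rough sketch of that overtraining trade-off, using the commonly cited Chinchilla heuristic (roughly 20 training tokens per parameter at the compute-optimal point) and the standard FLOPs approximations. The compute budget is arbitrary, and nothing here reflects any lab's actual configuration.

```python
# Overtraining: at a fixed training-compute budget C, a smaller model trained
# on more tokens costs the same to train but less to serve.
# Standard approximations: C_train ~ 6*N*D, C_infer ~ 2*N FLOPs per token,
# and the Chinchilla heuristic D ~ 20*N at the compute-optimal point.
C = 1e25  # training budget in FLOPs (arbitrary illustrative value)

# Compute-optimal size: solve 6 * N * (20 * N) = C for N.
n_opt = (C / 120) ** 0.5
d_opt = 20 * n_opt

# Overtrained alternative: half the parameters, double the tokens.
n_small = n_opt / 2
d_small = C / (6 * n_small)  # same training compute, twice the data

print(f"compute-optimal: {n_opt:.2e} params on {d_opt:.2e} tokens")
print(f"overtrained:     {n_small:.2e} params on {d_small:.2e} tokens")
print(f"relative inference cost per token: {n_small / n_opt:.0%}")
```

The catch is the `d_small` line: the smaller model needs twice the tokens, and as just noted, that extra high-quality pre-training data no longer exists. That is exactly what pushes labs from overtraining toward distillation.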

Returning to distillation, Ege Erdil concludes: "I think GPT-4o and Claude 3.5 Sonnet are very likely distilled from larger models."

So far, all the evidence suggests that OpenAI is doing the same thing as Anthropic (distillation), for the same reasons (underwhelming performance and cost control), with its own trained-and-hidden model playing the role of Opus 3.5. This is an important finding. But don't get ahead of me: Opus 3.5 is still hidden away. Where is OpenAI's equivalent model? Is it sitting in the company's basement? Can you guess its name...?

IV. Venturing into Uncharted Territory Requires Forging One's Own Path

My analysis started with the Opus 3.5 story at Anthropic because there was a lot of information about it. Then I used the concept of distillation to build a bridge to OpenAI, and explained why the potential forces driving Anthropic are also driving OpenAI. However, a new obstacle has emerged in our theory: since OpenAI is the pioneer, they may face obstacles that Anthropic and its competitors have not yet encountered.

One major obstacle is the hardware required to train GPT-5. Sonnet 3.6 is comparable to GPT-4o, but it arrived five months later. We should assume GPT-5 is on another level: more powerful and larger, with a higher inference cost and a higher training cost. A single training run might cost $500 million. Is that even possible with existing hardware?

Ege's answer is yes. Serving 300 million people is an unbearable burden, but training is a piece of cake:

"In theory, even our current hardware would be sufficient to support models much larger than GPT-4: for example, a 50x scaled-up version of GPT-4, with around 1 trillion parameters, could cost around $3,000 per million output tokens, with an output speed of 10-20 tokens per second. However, for this to be achieved, these large models must deliver immense economic value to the customers using them."

However, even Microsoft, Google, and Amazon (the backers of OpenAI, DeepMind, and Anthropic, respectively) cannot justify that inference cost. So how do they solve the problem? Simple: the models only need to "deliver immense economic value" if the labs plan to release them to the public. So they choose not to release them.

They train it. They find that it "performs better than their current products." But they have to accept that it "has not progressed enough to justify the enormous cost of keeping it running." (Sound familiar? Those phrases come from a Wall Street Journal report on GPT-5 from a month ago, remarkably similar to Bloomberg's wording about Opus 3.5.)

They report results that are not too good (more or less accurate, they can always play with the narrative here). They keep it as a large teacher model internally, using it to distill smaller student models. Then they release these smaller models. We get Sonnet 3.6 and GPT-4o and o1, and we're very happy that they're cheap and quite good. Even as we grow more impatient, our expectations for Opus 3.5 and GPT-5 remain unchanged. And their pockets continue to shine like gold mines.

V. Microsoft, OpenAI, and AGI

When I reached this point in my investigation, I still didn't quite believe it. Of course, all the evidence suggests that this is entirely reasonable for OpenAI, but there is a gap between reasonable - even likely - and real. I won't fill that gap for you - after all, it's just speculation. But I can further strengthen the argument.

Is there additional evidence that OpenAI is operating this way? Beyond poor performance and mounting losses, do they have other reasons to withhold GPT-5? What can we infer from OpenAI executives' public statements about GPT-5? Aren't they risking their reputation by repeatedly delaying the model? After all, OpenAI is the face of the AI revolution, while Anthropic operates in its shadow. Anthropic can afford such moves; for OpenAI, there may be a price to pay.

Speaking of money, let's dig into the relevant details of the OpenAI-Microsoft partnership. First, what everyone knows: the AGI clause. In OpenAI's blog post about its structure, there are five governance clauses describing how it operates, its relationship with the non-profit, with the board, and with Microsoft. The fifth clause defines AGI as "highly autonomous systems that outperform humans at most economically valuable work" and stipulates that once the OpenAI board declares AGI has been achieved, "such a system is excluded from IP licenses and other commercial terms with Microsoft, which only apply to pre-AGI technology."

Needless to say, neither company wants the partnership to break down. OpenAI wrote the clause, but will do whatever it takes to avoid having to invoke it. One way is to delay releasing any system that might be labeled AGI. "But GPT-5 is surely not AGI," you might say. To which I would reply with a second fact that almost no one knows: OpenAI and Microsoft have a secret definition of AGI: an "AI system that can generate at least $100 billion in profits." However irrelevant for scientific purposes, this definition legally frames their partnership.

If OpenAI hypothetically withholds GPT-5 under the excuse that it is "not ready yet," then beyond controlling costs and heading off public backlash, they gain one more thing: they avoid having to declare whether it crosses the threshold to be classified as AGI. And although $100 billion in profit is an extraordinary figure, nothing prevents eager customers from eventually generating that much on top of the model. To be clear about the other side of it: if OpenAI forecast that GPT-5 could bring in $100 billion a year in recurring revenue, they wouldn't mind triggering the AGI clause and parting ways with Microsoft.

Most public reaction to OpenAI not releasing GPT-5 rests on the assumption that it is simply not good enough. Even if that's true, no skeptic has stopped to consider that OpenAI may have internal uses for it more valuable than anything external access provides. There is a huge difference between creating an excellent model and creating an excellent model that can cheaply serve 300 million people. If you can't do it, you don't. But also, if you don't need to, you don't. They used to give us access to their best models because they needed our data. They no longer need it as much. Nor are they chasing our money; it's Microsoft that wants money, not them. They want AGI, then ASI. They want a legacy.

VI. The Wise Old Hermit Passing Wisdom from the Cave

The article is nearing its end. I believe I have laid out enough arguments to build a solid case: OpenAI very likely has GPT-5 running internally, just as Anthropic has Opus 3.5. It is even possible that OpenAI will never release GPT-5. The public now measures performance against o1/o3, not just GPT-4o or Claude Sonnet 3.6, and as OpenAI pushes on test-time scaling laws, the bar GPT-5 must clear keeps rising. How could they release a GPT-5 that genuinely surpasses o1, o3, and the coming o-series models, given the speed at which those are being produced? Besides, they no longer need our money or our data.

Training new base models - GPT-5, GPT-6, and beyond - will always make sense internally for OpenAI, just not necessarily as products. That era may be ending. The only goal that matters to them now is to keep generating better data for the next generation of models. From now on, base models may run in the background, empowering other models to accomplish feats they could never manage on their own - like a wise old hermit passing down wisdom from a secret mountain cave, except the cave is a massive data center. And whether we see him or not, we will experience the consequences of his wisdom.

Even if GPT-5 is eventually released, this fact suddenly seems almost irrelevant. If OpenAI and Anthropic have indeed launched recursive self-improvement efforts (although still with human involvement), what they publicly give us doesn't matter. They will keep going further and further - just like the universe is expanding so fast that the light from distant galaxies can no longer reach us.

Perhaps this is how OpenAI managed to jump from o1 to o3 in just three months, and how they will jump to o4 and o5. It may also be why they have been so excited on social media lately. Because they have implemented a new, improved mode of operation.

Do you really think that getting close to AGI means you'll get to use ever more powerful AI whenever you want? That they will release every advance for us to play with? Of course you don't believe that. When they say their models will put them so far ahead that no one can catch up, they mean it. Each new generation of models is an escape-velocity engine. Starting from the stratosphere, they are already waving us goodbye.

Whether they will come back remains to be seen.

Excerpt from The Algorithmic Bridge

This article is from the WeChat public account "Jiqizhixin", author: Alberto Romero, translated by Jiqizhixin, published with authorization from 36Kr.
