"I don't need a better model anymore": A glimpse into the diverse world of AI under a trending Reddit post.

Author: Friday, TechFlow

Anthropic has just delivered a report card that is impeccable on paper.

Released on June 9, Claude Fable 5 is the company's first Mythos-level model to be made publicly available. It scored 80.3% on SWE-Bench Pro, a benchmark of real-world software engineering tasks, beating Anthropic's previous flagship, Opus 4.8, by about 11 percentage points and GPT-5.5 by more than 20 percentage points.

But users' reactions put a damper on the enthusiasm.

Three days after its release, a trending post on the r/artificial forum (305,000 weekly visits) was titled, "Claude Fable made me realize I don't need a better model." The poster, Axi0m-22, said he used Fable for security research and daily work for a while, then almost immediately switched back to Opus for coding and Haiku for miscellaneous tasks. He offered an analogy: it is like watching the iPhone 17 launch while holding an iPhone 14. "You know the new one is better, but you're thinking: 'Never mind, mine is fine.'"

The top-voted comments are dominated by "good enough": Model fatigue has become the mainstream sentiment.

The top-ranked comment, with 42 upvotes, stated: "Aside from a larger context window, I haven't felt the need for a more powerful model since Opus 4.5."

Another user, hyprlab, received 13 upvotes for his comment: "I don't see any benefit to my workflow from switching to a model that burns tokens even more aggressively. Opus 4.8's high-intensity mode is already comfortable enough."

Behind statements like these lies a shared concern: cost.

Fable 5's API is priced at $10 per million input tokens, nearly double that of Opus 4.8. User siromega37 bluntly stated, "Higher token consumption, but no return on investment. I think we're seeing a plateau, and the bubble will eventually burst."

User hobopwnzor offered a more systematic interpretation: "We've been at the top of the S-curve for a while now. Recent progress has mainly come from tool calls and peripheral engineering, not from the capabilities of the model itself."

Safety guardrails have become the biggest point of contention: "90% of their intended use is rejected outright."

If "good enough" is merely a sentiment, complaints about the safety guardrails point to a specific product issue.

According to Anthropic's official documentation, Fable 5 shares the same underlying model as Mythos 5, which is only available to a limited number of organizations. The difference is that Fable adds a security classifier: requests involving high-risk areas such as cybersecurity are blocked and handed off to Opus 4.8 instead. The official description states that this mechanism is tuned conservatively, triggers in fewer than 5% of sessions on average, and may mistakenly flag harmless requests.

In the comments under this Reddit post, the perceived trigger rate is clearly much higher than 5%. User jradoff, who received 17 upvotes, said he asked Fable to check the security of his code, and "it basically refused to process anything related to security," after which he was downgraded to Opus. Another comment with 12 upvotes was even more blunt: "90% of what you want to do with it will be rejected, it's practically useless."

Paying users are even more resentful. kaitava, a user on the $200 subscription tier, wrote: "I paid double the usage fee to have it run a security audit, and I was downgraded to Opus. Now I don't like anything about it, and I'm just waiting for OpenAI to catch up."

For a flagship product sold on a leap in capability, the "usability cost of security" is becoming a core variable in users' decisions to pay for it.

The opposing view: Heavy users experience it like "night and day."

The post did not lack dissenters, and they shared a clear profile: the heavier the workload, the higher the rating.

User Phylaras's comment received 15 upvotes: "Fable has made a real difference for me. It has caught previously undiscovered bugs in complex tasks that require huge context windows." A user who claims to run high-energy physics simulations said that a single simulation model often has 8,000 to 10,000 lines of code, with hundreds of models interacting with each other. "Having a model that can work independently and continuously and understand the details of the environment is something I look forward to."

The most vehement rebuttal came from user Navetz: "Honestly, anyone who's used this model would think this post is insane. To me, it's incredibly intelligent, and I've been using it constantly. I explained to my non-technical friends: it's like going from a college player to an NBA starter."

Some offered a middle ground. User ready-eddy suggests using Fable as a "planner and fixer" rather than a daily "builder," unless you don't mind spending money. Another comment reads more like a user manual: using Fable for spreadsheet calculations is choosing the wrong model, and using Haiku to run complex tasks with 16 agents is also choosing the wrong model. "There are no inherently bad models, only models used in the wrong scenarios."

With benchmark scores and user experience decoupled, will publicly available AI keep getting stronger?

One of the most interesting comments in this debate shifted the focus from the product to the industry structure.

User KedMcJenna proposed a "public AI freeze theory": models accessible to ordinary people may remain at their current level indefinitely, while corporate and government elites will continue to acquire stronger private models. "We know of at least Mythos, and there are likely even stronger models that we will never hear about."

The comment has a factual basis: Mythos 5 is indeed not open to the public and is currently only available to cyber defense agencies and critical infrastructure companies through the Project Glasswing program.

When benchmark scores and public opinion are considered together, the conclusions are not contradictory.

Benchmarks measure the upper limit of capability, while the top-voted comments on Reddit reflect the ceiling of everyday needs. Since most users' needs were already met in the Opus 4.6 era, more powerful models can only prove themselves in extreme scenarios such as physics simulations and extremely long contexts. The question model vendors face is no longer "can we do it?" but "who needs it, how much are they willing to pay, and how much safety friction can they tolerate?"

Within three days of its release, Fable 5 has earned two very different report cards, one from benchmark rankings and one from public discourse. Which is closer to the truth depends on how quickly Anthropic adjusts its security classifier and on whether heavy users vote with their wallets.
