No further training or fine-tuning was needed: an auxiliary system has pushed GPT-5.2's accuracy to a record-breaking 75%.


Could it be that the ceiling of AI capability is no longer set by the base model, but by the "inference orchestration" built around it?

Without altering the LLM itself, an agentic system can dramatically increase AI's effective intelligence.

Some reached this conclusion after reviewing the latest evaluation from Poetiq, a startup focused on AI reasoning and self-improving systems.


Recently, Poetiq announced that it ran GPT-5.2 X-High through its system (which it calls a meta-system) on the ARC-AGI-2 test set, a benchmark commonly used to measure how current state-of-the-art (SOTA) models perform on complex abstract reasoning tasks.

The results show that, on Poetiq's test platform, GPT-5.2 X-High scores 75% on the full PUBLIC-EVAL dataset, roughly 15 percentage points above the previous state-of-the-art (SOTA) score, at a cost of under $8 per problem.

The PUBLIC-EVAL set here is one portion of the ARC benchmark. PUBLIC-EVAL covers basic reasoning tasks along with standard NLP and mathematical reasoning tests; its datasets are open and standardized, making it suitable for evaluating a wide range of models. The rest of the ARC benchmark contains more complex and challenging problems that probe abstract reasoning, common-sense reasoning, and creative capability, testing the reasoning limits of top-tier models.

The following figure shows the performance distribution of various state-of-the-art (SOTA) models on the PUBLIC-EVAL dataset:

Poetiq also specifically emphasized that it did not perform any retraining or model-specific optimizations on GPT-5.2.

In a short time, GPT-5.2 has delivered significant gains in both accuracy and cost compared with the other models Poetiq has previously run on the PUBLIC-EVAL dataset.

Poetiq further projects that if this strong PUBLIC-EVAL performance carries over to the official ARC Prize SEMI-PRIVATE test, the "GPT-5.2 X-High + Poetiq" configuration will outperform any previously tested system configuration.

Greg Kamradt, president of the ARC Prize, said: "It's great to see Poetiq release the GPT-5.2 X-High results. If they can maintain this performance, their system looks like it handles model swapping very well. However, the results are not fully validated until the infrastructure issues with the OpenAI API are resolved."

Model switching here means the system swaps between different models to meet different task requirements, without large-scale adjustment or retraining of either the system or the models.
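The idea can be sketched in a few lines: if the orchestration layer only talks to models through a single narrow interface, swapping the underlying model becomes a one-line change. This is a minimal illustrative sketch; the names (`LLMBackend`, `solve_task`, `make_stub_backend`) are hypothetical and not Poetiq's actual API.

```python
# Minimal sketch of "model switching": the orchestrator depends only on an
# interface (prompt -> completion), never on a concrete model or SDK.
# All names here are illustrative assumptions, not Poetiq's real code.
from typing import Callable

# A backend is just a function from prompt to completion.
LLMBackend = Callable[[str], str]

def make_stub_backend(name: str) -> LLMBackend:
    """Stand-in for a real API client (e.g. an OpenAI or Gemini SDK call)."""
    return lambda prompt: f"[{name}] answer to: {prompt}"

def solve_task(task: str, backend: LLMBackend) -> str:
    """The orchestration logic sees only the interface, so no retraining
    or system rework is needed when the model behind it changes."""
    return backend(f"Solve this ARC-style task: {task}")

# Swapping models is a one-line change at the call site:
print(solve_task("rotate the grid", make_stub_backend("gpt-5.2-x-high")))
print(solve_task("rotate the grid", make_stub_backend("gemini-3")))
```

In a real system each backend would wrap a vendor SDK, but the orchestration code above would not change.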

OpenAI President Greg Brockman also reposted the results, noting that GPT-5.2 surpasses human benchmark performance on ARC-AGI-2.

The comments section raised more questions about the new test results, such as "How long does each task take on average?"

Poetiq responded: "We don't collect those statistics right now. The easiest problems finish in roughly 8 to 10 minutes, while the hardest must complete within 12 hours to stay inside the time limit. So there is definitely room for improvement."

Others pointed out that "most of the improvement seems to come from the testing framework and coordination mechanisms rather than any model-specific tuning. With no training changes, the improvement on ARC-AGI-2 is about 15%, indicating there is still a lot of headroom in search, routing, and termination logic alone."

The question remains: why is X-High cheaper per task than High in this setup? Does it converge faster by finding the correct solution earlier, or does the testing framework prune invalid inference paths more aggressively?

Poetiq confirmed the former: "X-High simply converges to the correct answer faster than High."

A six-person team built the meta-system.

Poetiq is a team of six researchers and engineers, with several core members from Google DeepMind.

Ian Fischer (Co-founder & Co-CEO): Formerly a senior researcher at Google DeepMind;

Shumeet Baluja (Co-founder & Co-CEO): A senior expert who also came from Google/DeepMind.

The key to Poetiq's success lies in its meta-system.

The meta-system does not depend on any specific large model: rather than training or fine-tuning the model itself, it can wrap any frontier model (such as Gemini 3, GPT-5.1, or Grok). This means it can adapt quickly and improve performance as new models are released.

The Poetiq meta-system builds an iterative reasoning process that differs from traditional one-shot answer generation, with two main mechanisms:

Iterative problem-solving loop: the system does not pose the problem to the model just once. It uses an LLM to generate a candidate solution, receives and analyzes feedback, then calls the LLM again to improve the solution. This multi-step, self-improving process lets the system gradually build and continuously refine the final answer.

Self-auditing: the system autonomously audits its own progress, judging when it has gathered enough information and whether the current solution is satisfactory, and then decides to terminate the process. This self-monitoring is crucial for avoiding unnecessary computation and keeping overall costs down.
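The two mechanisms above can be sketched together as a propose-audit-refine loop. This is a toy illustration under stated assumptions: the function names, the confidence scoring in [0, 1], and the stopping threshold are all hypothetical, not Poetiq's actual implementation.

```python
# Illustrative sketch of the iterative loop plus self-audit described above.
# `propose` stands in for an LLM call; `audit` stands in for the system's
# self-assessment of the candidate. Both are hypothetical stand-ins.
from typing import Callable

def iterative_solve(
    propose: Callable[[str, str], str],  # (task, feedback) -> candidate solution
    audit: Callable[[str, str], float],  # (task, candidate) -> confidence in [0, 1]
    task: str,
    threshold: float = 0.9,
    max_rounds: int = 5,
) -> str:
    feedback = ""
    best, best_score = "", -1.0
    for _ in range(max_rounds):
        candidate = propose(task, feedback)   # call the LLM with prior feedback
        score = audit(task, candidate)        # self-audit the candidate
        if score > best_score:
            best, best_score = candidate, score
        if score >= threshold:                # good enough: terminate early,
            break                             # avoiding wasted computation
        feedback = f"Previous attempt scored {score:.2f}; refine it."
    return best

# Toy demo: each round the "model" gets closer to the target answer 42,
# and the loop stops as soon as the audit score reaches the threshold.
attempts = iter(["40", "41", "42"])
propose = lambda task, fb: next(attempts)
audit = lambda task, cand: 1.0 - abs(int(cand) - 42) / 10
print(iterative_solve(propose, audit, "find the answer"))
```

The early-exit check is what keeps cost bounded: without it, the loop would always spend the full `max_rounds` worth of model calls even after a satisfactory answer appears.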

Poetiq also emphasized that all of the meta-system's adaptation work was completed before the new model was released, and that the system never touched the ARC-AGI task set directly. It nonetheless achieved performance gains across model versions and model families, suggesting the meta-system's reasoning strategies generalize well.

It is this flexible, powerful, and recursive architecture that enables a small team like Poetiq to achieve a series of state-of-the-art (SOTA) results in a very short time.

Regarding the meta-system, one commenter wrote: "It's fantastic. Building intelligence on top of the model rather than inside it means new models can be adapted in a few hours, which is brilliant. Working with open-source models and migrating successfully to new closed models shows that what has been captured is a fundamental property of the reasoning process itself, rather than model-specific quirks."

Reference link: https://poetiq.ai/posts/arcagi_verified/

This article is from the WeChat public account "Machine Heart" , edited by Du Wei and Chen Chen, and published with authorization from 36Kr.
