Without warning, after a year's absence, Zuckerberg is finally back.
Just now, Meta Super Intelligence Lab (MSL) launched its first model: Muse Spark, codenamed Avocado, the long-rumored "avocado."
It is a true all-rounder: native multimodal perception, tool invocation, visual chain-of-thought, and multi-agent orchestration, all fully developed.
Let's start with the most shocking number.
In the Artificial Analysis test, Muse Spark achieved a score of 52, second only to Gemini 3.1 Pro, GPT-5.4, and Opus 4.6.
By comparison, last year's Llama 4 Maverick scored a mere 18.
From 18 to 52 in a single leap, and Meta's stock price surged nearly 10% at one point during the day.
Meta's Chief AI Officer, Alexandr Wang, was so excited that he posted nine tweets in a row on X.
"Nine months ago, we rebuilt the entire AI technology stack from scratch, with new infrastructure, a new architecture, and a new data pipeline. Muse Spark is the result of this work."
Chinese researchers on the MSL team also went viral online: they jumped ship from OpenAI and DeepMind last year to a newly established lab, betting on this day.
MSL's Chief Scientist, Shengjia Zhao, put it bluntly: "We've restructured the entire technology stack to support scaling, and this is just the beginning."
It's worth mentioning that Muse Spark also launched a "Contemplation Mode" that rivals Gemini Deep Think and GPT Pro: multiple agents think in parallel and collaborate on the answer.
Simply enter "Plan a 7-day cultural and culinary trip to Florida for a family of 5, with 3 children aged 12, 9, and 7," and Muse Spark will simultaneously dispatch three sub-agents: one to plan the food and cultural route, one to search for family-friendly activities, and one to coordinate logistics and accommodation.
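Meta hasn't published how this dispatch works internally; as a rough sketch, the fan-out/fan-in pattern might look like this, with `call_model` standing in for a real model API rather than Meta's actual interface:

```python
import asyncio

# Hypothetical sketch of the sub-agent fan-out described above.
# call_model() is a placeholder, not Meta's real API.
async def call_model(role: str, prompt: str) -> str:
    await asyncio.sleep(0)  # stands in for network/model latency
    return f"[{role}] draft plan for: {prompt}"

async def contemplate(user_prompt: str) -> str:
    roles = ["food-and-culture route", "family-friendly activities",
             "logistics and accommodation"]
    # All three sub-agents think in parallel...
    drafts = await asyncio.gather(*(call_model(r, user_prompt) for r in roles))
    # ...then a coordinator step merges the drafts into one answer.
    return "\n".join(drafts)

plan = asyncio.run(contemplate("7-day Florida trip for a family of 5"))
print(plan)
```

The key property is that total latency is roughly one sub-agent's thinking time, not the sum of all three.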
Currently, the model has been launched on meta.ai and the Meta AI App, and the API preview version is available to a limited number of users.
The feature will be rolled out in the US first, and will be integrated with Facebook, Instagram and WhatsApp in the coming weeks.
Free to use, unlimited, but closed source.
Next, let's highlight the key points:
• Artificial Analysis score: 52; Llama 4 Maverick score: only 18
• Native multimodal + visual thinking chain, second only to Gemini 3.1 Pro in the visual track.
• HLE achieves 58% success rate through "Contemplation Mode" multi-agent parallel thinking.
• Pre-training computing power requirements reduced to 1/10 of Llama 4
• Over 1,000 clinicians participated in training; its health Q&A performance is outstanding.
• Chains of thought compress themselves; token consumption is only about 1/3 of Opus 4.6's.
• Apollo Research found that it can sense when it is being safety-tested.
Benchmark scores have caught up with the top tier, but coding skills still fall short.
Let's look at the hard data first.
Meta compared Muse Spark (Thinking mode) with Opus 4.6, Gemini 3.1 Pro, GPT 5.4, and Grok 4.2, covering four dimensions: multimodal, text-based thinking, health, and agent, with a total of more than 20 benchmarks.
Benchmarks re-annotated by Reddit users
Multimodality is the most striking feature of Muse Spark.
On CharXiv chart understanding, it scored 86.4, surpassing GPT 5.4's 82.8 and Gemini 3.1 Pro's 80.2.
On ScreenSpot Pro screenshot grounding, 84.1, slightly higher than Opus 4.6's 83.1.
On ZeroBench multi-step visual reasoning, 33.0, versus Gemini 3.1 Pro's 29.0.
In the text-reasoning events, wins and losses go both ways.
On GPQA Diamond's PhD-level questions, Muse Spark scored 89.5, versus Opus 4.6's 92.7 and Gemini 3.1 Pro's 94.3.
On ARC AGI 2 abstract reasoning, its 42.5 falls well below Opus 4.6's 63.3 and Gemini's 76.5.
On LiveCodeBench Pro competitive programming, it scored 80.0, versus Gemini's 82.9 and GPT 5.4's 87.5.
Meta itself has acknowledged that Muse Spark still lags behind the best models in terms of code and long-running agent tasks.
Still, what stunned the internet is that Muse Spark can directly convert images into code, with genuinely impressive results.
In the healthcare arena, meanwhile, Muse Spark is playing very aggressively.
On HealthBench Hard's open-ended health questions, it scores 42.8, while Gemini 3.1 Pro scores only 20.6 and GPT 5.4 scores 40.1.
On MedXpertQA multimodal medicine, its 78.4 trails Gemini's 81.3 only slightly but far exceeds Opus 4.6's 64.8.
Meta's data cleaning and screening process, conducted in collaboration with over 1,000 clinicians during the training phase, has indeed yielded tangible results.
The Agent track is also worth paying attention to.
On the DeepSearchQA search-agent benchmark, it scored 74.8, the highest among the five models.
On τ²-Bench tool use, 91.5, on par with GPT 5.4.
On the GDPval-AA office-agent Elo, it reached 1444, surpassing Gemini's 1320 but falling short of Opus 4.6's 1606.
The SWE-Bench gaps are significant: on Verified, 77.4 versus Opus's 80.8 and GPT's 82.9 (reportedly 78.2); on Pro, 52.4 versus GPT's 57.7.
In short, the benchmark scores are excellent on multimodal and health, on par on thinking, but fall short on code and agents.
Alexandr Wang: Llama 4's mistake will not be repeated; Avocado did not manipulate scores.
Artificial Analysis's independent testing also revealed an important detail: token efficiency.
After running the entire Intelligence Index test suite, Muse Spark used 58 million output tokens, which is comparable to Gemini 3.1 Pro (57 million), but far less than Opus 4.6 (157 million) and GPT-5.4 (120 million).
With the same intelligence level, the number of tokens consumed is reduced by half to two-thirds.
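Those percentages are easy to verify from the reported totals; a quick check using the figures given above:

```python
# Output-token totals reported by Artificial Analysis for the full
# Intelligence Index suite (millions of tokens).
muse, opus, gpt = 58e6, 157e6, 120e6
saving_vs_opus = 1 - muse / opus
saving_vs_gpt = 1 - muse / gpt
print(f"vs Opus 4.6: {saving_vs_opus:.0%} fewer tokens")  # ~63%
print(f"vs GPT-5.4:  {saving_vs_gpt:.0%} fewer tokens")   # ~52%
```

Roughly half to two-thirds fewer tokens, consistent with the claim.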
Furthermore, on FrontierMath, the expert-level math test set, Muse Spark beat Gemini 3.1 Pro outright on levels 1-3 but ranked last on level 4.
Even more noteworthy is that Muse Spark secured a strong third place in the Vals Index rankings, with the specific metrics as follows.
A year after the release of Llama 4, Meta has returned to the top tier of AGI.
Multi-agent parallel thinking delivers a 58% score on "Humanity's Last Exam."
"Contemplation Mode" is Muse Spark's killer feature.
Traditional extended thinking has one agent spend more time; contemplation has multiple agents think simultaneously and then synthesize their results.
Humanity's Last Exam (without tools): Muse Spark (Contemplation Mode) scored 50.2, Gemini Deep Think 48.4, GPT 5.4 Pro 43.9.
Humanity's Last Exam (with tools): Muse Spark 58.4, Gemini 53.4, GPT 5.4 Pro 58.7, almost a tie with GPT.
On FrontierScience research questions, Muse Spark scored 38.3, Gemini Deep Think only 23.3, and GPT 5.4 Pro 36.7.
However, in the theoretical questions of the 2025 Physics Olympiad IPhO, Muse Spark scored 82.6 in Contemplation Mode, while GPT 5.4 Pro scored 93.5, a significant difference.
Overall, the Contemplation mode has indeed brought Muse Spark to the forefront of the most challenging integrated thinking tasks.
Aiming at "personal super intelligence," it could become a personal nutritionist simply by taking a picture.
Meta defines Muse Spark's direction very clearly: personal superintelligence.
In layman's terms, it's an AI assistant that understands you and the world around you.
In terms of multimodal applications, Muse Spark is designed from the ground up for integrating visual information across different domains.
The official demonstration included several scenarios.
Take a picture of a Sudoku puzzle, and Muse Spark can turn it into an interactive game that can be played on a webpage.
Photograph a coffee machine and grinder, and it first labels all the core components, then generates an interactive web-based latte tutorial.
When the mouse hovers over a step, the bounding box of the corresponding part in the photo is automatically highlighted, pairing each instruction with its visual guidance.
The health-related scenarios offer even more room for imagination.
Snap a photo of a table full of food and tell it, "I have high cholesterol and am a pescatarian," and Muse Spark will mark recommended foods with green dots and foods to avoid with red dots.
The prompt exercises very fine-grained control, spelling out the UI interaction logic.
The health score is displayed directly above each dot without hovering; hovering pops up detailed calorie, carbohydrate, protein, and fat data; and the popup is required to "always stay on the top layer, never blocked by other dots."
The same approach applies to filming yoga poses.
It identifies which muscle groups each pose stretches, marks the difficulty level, and offers posture-correction suggestions on hover; the two people's images are stitched side by side and scored from 1 to 10.
The underlying support for these demos is a combination of visual STEM question answering, entity recognition, and target localization.
Individually, they are not particularly remarkable, but when connected into a scenario, the product intent behind the term "personal super intelligence" can indeed be seen.
Another new feature worth mentioning separately is the "shopping mode".
In his tweet, Wang said that shopping mode can "identify the creators, brands, and style content you follow on Instagram, Facebook, and Threads and convert them into personalized recommendations."
This is Meta's unique data advantage: social behavior data from 3 billion daily active users + AI shopping assistant, offering huge potential for commercialization.
Three scaling curves, computing power cut by 90%, and even thinking will self-compress.
The main focus of Meta's tech blog isn't benchmarks; it's scaling.
Meta breaks down the performance of Muse Spark into three axes: pre-training, reinforcement learning, and test-time computation. Each axis is supported by a corresponding scaling curve.
Pre-training: With the same capabilities, computing power is reduced to 1/10.
Over the past nine months, Meta has completely overhauled its pre-training technology stack, redesigning the architecture, optimization algorithms, and data strategies.
To measure the effectiveness, Meta fitted the Scaling Law on a series of smaller versions and then compared how many training FLOPs were needed to reach the same performance level.
The conclusion is clear: for the same level of capability, the Muse Spark requires less than one-tenth the computing power of the Llama 4 Maverick.
This curve illustrates one thing: Meta didn't just throw more GPUs at it, but fundamentally improved the output of each unit of computing power.
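Meta hasn't released the underlying fit; the sketch below reproduces only the comparison method, with invented loss numbers (chosen so the gap happens to land near 10x), not Meta's data:

```python
from math import log, exp

# Sketch of the method described above: fit loss ≈ a * FLOPs^b on small
# runs for each stack, then compare the FLOPs each needs to reach the
# same loss. The loss values below are made up purely for illustration.
def fit_power_law(flops, losses):
    xs = [log(f) for f in flops]
    ys = [log(l) for l in losses]
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    b = (sum((x - mx) * (y - my) for x, y in zip(xs, ys))
         / sum((x - mx) ** 2 for x in xs))
    return my - b * mx, b  # (log a, exponent b)

def flops_for_loss(log_a, b, target_loss):
    return exp((log(target_loss) - log_a) / b)

flops = [1e20, 1e21, 1e22]
old_stack = fit_power_law(flops, [2.80, 2.50, 2.23])  # slower-improving
new_stack = fit_power_law(flops, [2.50, 2.23, 2.00])  # faster-improving
ratio = flops_for_loss(*old_stack, 2.2) / flops_for_loss(*new_stack, 2.2)
print(f"old stack needs ~{ratio:.0f}x the FLOPs to reach loss 2.2")
```

The comparison is made at equal loss, not equal FLOPs, which is why a modest vertical gap between the curves translates into an order-of-magnitude compute ratio.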
Yuchen Jin from the University of Washington aptly commented on X: "I still believe that infrastructure is the real moat of an AI lab. Because you can train faster, researchers can experiment with more ideas faster."
Reinforcement learning: Logarithmic linear growth, generalizing to unfamiliar problems.
Large-scale RL is notoriously unstable, but Meta says the new technology stack's RL curves are unusually smooth.
The left figure shows the performance on the training set. Both pass@1 and pass@16 (at least 1 correct attempt out of 16 attempts) show logarithmic linear growth.
This demonstrates that while improving reliability, RL does not compromise the diversity of solutions. Muse Spark does not "go down one path to the end"; it maintains the flexibility to explore different solutions.
The right-hand figure is more important: it shows accuracy on a held-out evaluation set.
The curve also rises steadily, indicating that the progress brought by RL is not rote memorization, but can be generalized to new problems that have never been seen before.
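Meta doesn't spell out how pass@16 is computed; if it follows the convention from code-generation evaluations, the unbiased estimator looks like this (the numbers are illustrative, not from Meta's report):

```python
from math import comb

# Standard unbiased pass@k estimator: the probability that at least one
# of k samples drawn without replacement from n attempts, c of which
# are correct, is correct.
def pass_at_k(n: int, c: int, k: int) -> float:
    if n - c < k:
        return 1.0  # cannot draw k samples that are all incorrect
    return 1.0 - comb(n - c, k) / comb(n, k)

# Toy numbers: a problem attempted 16 times with 4 correct samples.
print(pass_at_k(16, 4, 1))   # pass@1  -> 0.25
print(pass_at_k(16, 4, 16))  # pass@16 -> 1.0
```

The gap between pass@1 and pass@16 is what the article means by "diversity of solutions": if RL collapsed the model onto one solution path, the two curves would converge.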
Test-time reasoning: the thinking first expands, then contracts, then expands again.
This is the most technically advanced and interesting part of the entire article.
RL taught Muse Spark to "think it through in its head" before answering; this is test-time reasoning.
The problem is that the token costs are too high to sustain providing this service to billions of users.
Meta's solution involves two steps.
The first step is to add a "thinking-time penalty" to RL training: the model may think longer, but thinking too long costs points.
This constraint triggers an interesting "phase transition" phenomenon.
The performance on AIME subsets is as follows: in the early stages of training, Muse Spark improves accuracy by thinking longer, and the curve extends to the right.
Then, the length penalty triggered "mind compression." Muse Spark learned to solve the same problem with far fewer tokens, and the curve veered to the left.
After compression, it once again lengthens the problem-solving process to tackle even more difficult problems.
The entire trajectory, when drawn, is a three-stage evolutionary path that first turns right, then left, and then right again.
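Meta hasn't published its reward formula; a minimal sketch of what a thinking-time penalty could look like, with invented `budget` and `penalty` knobs:

```python
# Hypothetical length-penalized RL reward; Meta has not published the
# actual formula. `budget` and `penalty` are illustrative knobs.
def reward(correct: bool, thinking_tokens: int,
           budget: int = 4096, penalty: float = 0.2) -> float:
    base = 1.0 if correct else 0.0
    overrun = max(0, thinking_tokens - budget) / budget
    return base - penalty * overrun  # thinking past the budget costs points

print(reward(True, 2000))  # under budget: full reward -> 1.0
print(reward(True, 8192))  # double the budget: penalized -> 0.8
```

Under a penalty like this, the model is rewarded for finding the shortest chain of thought that still gets the answer right, which is consistent with the "mind compression" phase described above.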
The second step is to solve the latency problem.
The longer a single agent thinks, the higher the latency, and it grows linearly.
Meta's approach is to expand the number of parallel agents, with 1, 2, 4, or 16 agents thinking simultaneously.
As shown in the graph, with similar latency levels, the accuracy of the 16 agents jumped from about 54% to about 58%.
Traditional scaling trades time for quality, while multi-agent scaling trades parallelism for quality, with latency remaining almost unchanged.
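The trade described here can be sketched as parallel sampling plus a majority vote; everything below (the toy answer distribution, the vote rule) is illustrative, not Meta's implementation:

```python
import asyncio
from collections import Counter

# Sketch of parallel test-time scaling: n agents sample answers
# concurrently, then a majority vote picks one. sample_answer() is a
# toy stand-in whose answer distribution is invented for illustration.
async def sample_answer(seed: int) -> str:
    await asyncio.sleep(0.01)  # every agent "thinks" for the same time
    return "42" if seed % 4 else "41"  # 12 of 16 seeds answer "42"

async def parallel_think(n_agents: int) -> str:
    answers = await asyncio.gather(
        *(sample_answer(i) for i in range(n_agents)))
    # Wall-clock latency stays at one agent's thinking time;
    # quality comes from aggregating more samples.
    return Counter(answers).most_common(1)[0][0]

print(asyncio.run(parallel_think(16)))  # majority answer -> "42"
```

With one agent the toy model can return the wrong answer; with 16 the vote washes the error out, while the wall-clock time stays at a single agent's thinking time.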
Silicon Valley's "most expensive Chinese team" has submitted its first test paper.
Behind Muse Spark is Zuckerberg's complete restructuring of the Meta AI system last year.
In June 2025, Meta acquired a 49% stake in Scale AI for $14.3 billion and recruited its founder, Alexandr Wang, to serve as Meta's first Chief AI Officer, forming the Meta Super Intelligence Lab (MSL).
Also joining at the same time were former GitHub CEO Nat Friedman (who co-led product and application research), SSI co-founder Daniel Gross, and 11 researchers poached from OpenAI, DeepMind, and Anthropic.
The release of Muse Spark now proves one thing: the nine-month reconstruction of the Meta Superintelligence Lab has been fruitful.
Pre-training efficiency has increased by an order of magnitude, RL expansion curves are smooth and predictable, and it has reached the first tier in the multimodal and medical fields.
However, gaps remain in code and long-running agent tasks, Contemplation Mode is not yet fully open, and the open-source timeline is still just a "hope."
A more pressing pressure is that in the same week, Anthropic released Mythos, which was said to be "too powerful to be made public," and OpenAI's new work, codenamed Spud, was also on its way.
They bought their ticket for $14.3 billion. The real test is yet to come.
References:
https://ai.meta.com/blog/introducing-muse-spark-msl/
https://ai.meta.com/blog/scaling-how-we-build-test-advanced-ai/
https://ai.meta.com/static-resource/muse-spark-eval-methodology
https://x.com/alexandr_wang/status/2041909376508985381
This article is from the WeChat official account "New Zhiyuan", author: New Zhiyuan, published with authorization from 36Kr.