Full review: How did Manus come about?

03-12


"The problem with the agent may be "alignment", not the basic model's capabilities."

The entrepreneurial story that inspired me most last year was that of Zhang Luyu, the founder of Dify.

The first time I met him was at the 2023 "Xixi Forum" event. Among the many star-studded names at the scene, Zhang Luyu was not very eye-catching. When I met him again in 2024, Dify was already a different story - an entrepreneur without a glamorous background, who created one of the most successful AI open source products in the world while everyone was questioning the business model.

The stories that happened to this company in the past year, such as its unexpected popularity in the Japanese market, which is conservative and "easy to defend but hard to attack", deepened my understanding of entrepreneurship. There are many surprises, and even more luck. Ultimately, you need the ability to find a way through constant change and through things that go against your wishes.

Now, a similar story is happening to another highly-watched entrepreneur - Manus.im's Xiao Hong and his team.

Four months ago, Xiao Hong described something that puzzled him: "The team is good at going from 0 to 1 and is very capable of seizing opportunities, but once it starts going from 1 to N, it is not in such good shape."

Most of his past entrepreneurial projects achieved relatively stable and considerable revenue, and his previous company was successfully acquired. In 2023, his new company, "Butterfly Effect", launched the browser plug-in Monica.im, which stood out amid the AI narrative of the "Hundred Model War" and became one of the fastest-growing AI applications thanks to its excellent product experience. He looks like an entrepreneur who has had a smooth journey, and he did all of this by the age of 32.

But in fact, he did not take much pleasure in it. In Xiao Hong's view, the so-called "serial entrepreneur with successful exits" label and the so-called pleasure of repeatedly going from 0 to 1 are like a besieged city: the ability to seize opportunities from 0 to 1 is strong and satisfying, but he also worries about whether he will have to do it all over again.

In 2024, industry insiders believed that AI assistants with memory functions, like Monica.im, would face pressure from strong competitors such as Doubao, and that things would not be as easy as they were in 2023. Monica.im had a good 0 to 1, but might not be able to make the push from 1 to N.

He was puzzled because "the team really needs to do more difficult things next, things with a higher ceiling", and to explore something that can go from 1 to N.

Earlier, many people paying attention to Monica.im assumed that this "more difficult thing with a higher ceiling" referred to the AI browser that had been rumored for a long time but the team had not yet released.

Now it seems that I was wrong.

The more difficult thing turned out to be this: abandoning the AI browser that had already been built, looking for the AI product that could be the next "ChatGPT moment", settling on the goal of a general agent, and building Manus.im.

How innovative Manus is, and how far it can go, has become a hot topic. But what is most worth watching is still the process of finding a direction when "things go against one's wishes". Manus.im may not take this team from 1 to N, or even replicate the momentum of Monica.im. But as the company's name, "Butterfly Effect", suggests, many small actions and decisions end up having a profound impact on the future. "Connect the dots": tomorrow's road is hidden in today's experience.

01 Manus’ unique product experience comes from the lessons learned from making an “AI browser”

Since mid-to-late last year, it has been a half-open secret in the industry that the "Butterfly Effect" team was working on an AI browser. The product that was officially unveiled to the public, however, was Manus, which has attracted overwhelming attention.

If you have personally experienced Manus or watched the demo video, you will feel that it is significantly different from chatbots or some agent-like applications: Manus can perform tasks asynchronously and in parallel.

When you open an app like Doubao, Kimi, or Computer Use and ask it a question, you have to wait for it to finish replying. If you talk to it while it is replying or performing a task, the previous reply or task is interrupted, so you can only have a strictly alternating, A-B-A-B conversation with it.

However, in Manus.im, although it still looks like a chatbot, you can hand it 20 requests and have it work on them simultaneously. You can do anything else on your computer in the meantime, such as watching videos, writing documents, or playing games, without interrupting its work. Once the tasks are completed, or if problems arise during execution, Manus can notify you. If you see its thinking go off track in the middle of a task, you can add a prompt in the dialog box at any time, and it will continue thinking and executing with the new context.

The experience is asynchronous and parallelizable, and it really is like having a team of real interns doing your work.
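The asynchronous, parallel behavior described above can be pictured with a minimal Python sketch. It is only an illustration and not Manus's implementation: `run_task` and `run_all` are hypothetical names, and each "task" simply yields control a few times where a real agent would be waiting on its tools.

```python
import asyncio


async def run_task(name: str, steps: int) -> str:
    """Hypothetical stand-in for one agent task running in its own session."""
    for _ in range(steps):
        # A real task would await a tool call here; this sketch only yields control.
        await asyncio.sleep(0)
    return f"{name}: finished after {steps} steps"


async def run_all(requests: dict[str, int]) -> list[str]:
    """Start every request at once and collect a notification as each one finishes."""
    tasks = [
        asyncio.create_task(run_task(name, steps))
        for name, steps in requests.items()
    ]
    notifications = []
    # as_completed yields results in completion order, not submission order.
    for finished in asyncio.as_completed(tasks):
        notifications.append(await finished)
    return notifications
```

Calling `asyncio.run(run_all({"market report": 3, "flight search": 1}))` returns one notification per request. None of the requests blocks the others, which is the property the article contrasts with turn-by-turn chat.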

In fact, Manus's product architecture design for asynchronous experience originated from a lesson the team learned from their previous unreleased product, the AI browser. This was also the reason why the team invested a lot of energy but decided to stop working on the browser in October last year.

The Browser Company announced on October 25, 2024 that it would stop developing new features for the Arc browser and decided to transfer resources to a new browser, Dia, aiming to create a simpler and easier-to-use AI browser. ｜Source: Arc official website

"In the AI browser, AI is constantly interrupting users." Because it is designed for single-user scenarios, once AI is used, you can't use it anymore. When AI starts working, you can only watch it work, which is difficult to get started. Watching AI snatch your mouse or computer, you not only dare not snatch it back, but also fear that if you accidentally touch the keyboard or mouse, the entire process will crash and you will have to start over.

This led the team to make two decisions:

Using a computer directly for computer use is not feasible in the short term.
AI should use a browser, but not in your browser. It should have its own browser, which is best in the cloud, and finally give you feedback on the results.

In an interview with Zhang Xiaojun of Tencent Technology, Xiao Hong mentioned that when the team was summarizing the product forms from Jasper to ChatGPT to Monica to Cursor to Devin, they found that the "human programmer" Devin was very consistent with the architecture of this asynchronous experience.

It is not like Windsurf, which sometimes asks you to confirm whether a library should be installed on your computer, or runs a command-line operation and asks you to answer yes or no because the operation might really damage your computer or conflict with something. It makes you type "yes" before proceeding to the next step, which in effect shifts the blame to you.

So, in the Manus team's view, "A chatbot should have a computer in the cloud, and on that computer it should run the code it writes and use the browser to look up whatever it needs to check. Because it is a virtual server, it does not matter if it breaks; you can just get another one. It can even release the server once the current task is completed."
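The "disposable computer" idea can be sketched locally in a few lines of Python. This is a rough illustration under stated assumptions: a temporary directory plus a child process stands in for the cloud virtual server, the function name `run_in_disposable_workspace` is hypothetical, and a temporary directory is not a security sandbox.

```python
import subprocess
import sys
import tempfile
from pathlib import Path


def run_in_disposable_workspace(code: str, timeout_s: float = 30.0) -> str:
    """Run model-written code in a throwaway directory and return its stdout.

    Assumption: the temporary directory plays the role of the cloud machine.
    It isolates files only; it does NOT isolate the process from the host.
    """
    with tempfile.TemporaryDirectory() as workdir:
        script = Path(workdir) / "task.py"
        script.write_text(code, encoding="utf-8")
        result = subprocess.run(
            [sys.executable, str(script)],
            cwd=workdir,
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
        # Leaving the `with` block deletes the workspace, which mirrors
        # "releasing the server" once the task is done.
        return result.stdout
```

If the script breaks something inside its workspace, nothing of value is lost: the next task simply gets a fresh directory, just as the team describes getting another virtual server.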

It is worth noting that, whereas Devin targets hardcore engineers in a vertical field, the Manus team chose to build a general-purpose, consumer-grade AI assistant, available on both web and app. It can call tools and complete all kinds of tasks in work and life according to instructions, and in the future it is meant to deliver task results at a price consumers can afford.

02 Less Structure, More Intelligence

With a clear idea and goal, the next step is to realize this idea. How did Manus do it?

In the view of its product partner Zhang Tao, this requires equipping the large model with a computer, providing it with system permissions (access to private APIs such as code repositories, professional data query websites, etc.), and providing certain training.

In this way, the AI can open the browser on this computer, take actions to invoke tools, observe the effect of its actions on the real world through the feedback the tools return, and then think about the next step, act, and observe again. This is how the AI completes tasks through exploration and research. Along the way, Manus also comes to understand your requirements better and better under your "training". In the future, even if you do not spell out your requirements, it will be able to "read your mind" based on the knowledge accumulated from each task.
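The think, act, observe cycle described above can be written down as a small loop. The sketch below is an assumption-laden illustration rather than Manus's code: `run_agent`, `toy_policy`, and the `"finish"` convention are hypothetical, and the policy is a hard-coded function where a real system would call a language model.

```python
from typing import Callable

# One completed step: (tool name, argument, observation returned by the tool).
Step = tuple[str, str, str]
Policy = Callable[[str, list[Step]], tuple[str, str]]
Tool = Callable[[str], str]


def run_agent(goal: str, policy: Policy, tools: dict[str, Tool], max_steps: int = 10):
    """Repeat think -> act -> observe until the policy decides to finish."""
    history: list[Step] = []
    for _ in range(max_steps):
        tool_name, argument = policy(goal, history)         # think
        if tool_name == "finish":
            return argument, history
        observation = tools[tool_name](argument)            # act
        history.append((tool_name, argument, observation))  # observe
    return None, history  # step budget exhausted without an answer


def toy_policy(goal: str, history: list[Step]) -> tuple[str, str]:
    """Hypothetical policy: search once, then answer with what was observed."""
    if not history:
        return "search", goal
    return "finish", history[-1][2]
```

The growing `history` is what lets each decision depend on the feedback from earlier actions, which is the point of the loop the article describes.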

Li Bojie, a Huawei "Genius Youth" and founder of Logenic AI, believes that Manus has a unique advantage over other products: it solves problems the way a geek programmer would. | Image source: WeChat screenshot

The concept of Manus products gradually became clear during the product practice of its team: Less Structure, More Intelligence.

This idea has also given the Manus team many "Aha, wait!" moments of surprise. For example, this is what happened inside the team in January this year:

Manus was given a question from the GAIA test set: "In a YouTube video in a National Geographic-like style, various penguins move in and out of the frame. How many kinds of penguins appear in a single frame at the same time? How many kinds of penguins are there?"

Then, something magical happened.

Manus first opened the video link, and its first action was "Press K". It then took screenshots one by one to record which penguins appeared in which frame, and found that the largest number of penguin species in a single frame was 3. Manus then went back to check, and its next action was "Press 3"... After this final check, the answer it gave was 3.

As the people behind Manus, the team should know its capabilities well, yet the reality for them is that "there are always surprises." The surprise was not only that Manus got the question right, but also that even people who have used computers and YouTube for many years may not know what the "K" and "3" keys do.

Puzzled by what they had just seen, the team retraced Manus's steps. "K" is the play/pause key, which let Manus pause the video and take screenshots one by one to record which kind of penguin appeared in which frame. "3" is also a shortcut: the keys 0 to 9 jump to 0% to 90% of the progress bar respectively, so 3 jumps to 30%. With it, Manus could locate that exact second of the video and then tell humans how many kinds of penguins were in the picture.
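The arithmetic behind the number-key shortcut is simple enough to write out. In this small sketch the function name and the 200-second duration are made up for illustration; only the rule that digit d jumps to d × 10% of the video comes from the text above.

```python
def seek_position_seconds(digit: int, duration_seconds: float) -> float:
    """Return the timestamp a number key jumps to: digit d maps to d * 10% of the video."""
    if not 0 <= digit <= 9:
        raise ValueError("digit must be between 0 and 9")
    return duration_seconds * digit / 10
```

For a hypothetical 200-second video, `seek_position_seconds(3, 200)` gives 60.0, so pressing "3" lands one minute in.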

"This process is different from the traditional Chatbot. First, it can watch YouTube pictures instead of subtitles. Second, we even found that it was using YouTube shortcut keys. We were shocked that it answered the question." Xiao Hong also mentioned this scene in a previous interview with Tencent Technology.

Suddenly, I discovered that Manus is not only better than humans at programming, but also has much more knowledge than people can imagine about the Web and apps that people use every day. As an omniscient AI, it can understand all avenues and means in any tool and then choose the best method.

This made the team once again feel "Less Structure, More Intelligence" - minimize artificial restrictions on AI and let AI play its role through its own evolution, rather than teaching it what to do.

At the bottom of the Manus official website, the most important discovery behind Manus is quietly presented: "Less Structure, More Intelligence". ｜Screenshot source: Manus

On the day Manus launched, Peak, co-founder and chief scientist of "Butterfly Effect", explained and expanded on the idea. "Less Structure, More Intelligence" is the most important first principle behind Manus:

When your data is high-quality enough, the model is smart enough, the architecture is flexible enough, and the engineering is solid enough, then concepts such as Computer Use, Deep Research, and Coding Agent will change from product features to naturally emerging capabilities.

Returning to the first principles also allows us to have a new thinking about product form:

· AI browser is not to add AI to the browser, but to make a browser for AI;
· AI search does not recall and summarize from the index, but allows AI to obtain information with the user's permissions;
· Operating the GUI does not take away control of the user's device, but allows AI to have its own virtual machine;
· Writing code is not the ultimate goal, but a universal medium to solve various problems;
· The difficulty in generating a website is not in building the framework, but in making the content meaningful;
· Attention is not all you need. Only by liberating users' attention can we redefine DAU.
· ···

Through the discovery and practice of "Less Structure, More Intelligence" time and time again, Manus has produced results beyond expectations, including a pass@1 score in the GAIA benchmark that exceeded OpenAI Deep Research's score under cons@64; at the same time, in internal tests, Manus can directly cover 76% of the scenarios of dedicated agent products in Y Combinator W25.
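To make the comparison between the two metrics concrete, here is a minimal sketch of how they are commonly computed. The definitions are an assumption, since the article does not spell them out: pass@1 scores a single attempt per question, while cons@64 scores the majority answer among 64 sampled attempts. Exact string matching is a simplification of real benchmark grading, and the function names are hypothetical.

```python
from collections import Counter


def pass_at_1(first_answers: list[str], truths: list[str]) -> float:
    """Fraction of questions answered correctly on a single attempt."""
    correct = sum(answer == truth for answer, truth in zip(first_answers, truths))
    return correct / len(truths)


def cons_at_k(sampled_answers: list[list[str]], truths: list[str]) -> float:
    """Fraction of questions where the most frequent of k sampled answers is correct."""
    correct = 0
    for samples, truth in zip(sampled_answers, truths):
        majority_answer, _count = Counter(samples).most_common(1)[0]
        correct += majority_answer == truth
    return correct / len(truths)
```

Under these definitions, cons@64 gives a system 64 tries and a vote on every question, so beating it with a single attempt is the stronger result.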

03 "The problem with agents may be 'alignment', not the base model's capability"

Now, the value of these insights is being discussed on a larger scale:

Clement Delangue, co-founder and CEO of Hugging Face, highlighted on X a finding from Peak that is worth thinking about: an agent's ability is not bottlenecked by the base model; rather, like the difference between GPT-3 and InstructGPT (ChatGPT), it is a matter of alignment. Some open-source base models are simply trained to "answer every question in a single turn, however complex the question", which is what a chatbot scenario requires. Just a little post-training on agent trajectories can immediately make a huge difference. ｜Screenshot source: X

Manus does not introduce MCP (Model Context Protocol), but allows AI to write its own code to call APIs to handle a variety of long-tail tasks. ｜Screenshot source: X

In the discussion about Manus in the past few days, the question I heard most often was: Is "general AI Agent" feasible and where are the boundaries?

In Peak's view, the way humans interact with the world is actually quite standardized, through eyes, hands, and ears, so if the action space is well defined, it should be possible to embed an agent into any process originally performed by humans.

Since humans can use various tools to complete deep operations in vertical fields, if an agent has good enough knowledge, has been properly trained, and has a good interface to interact with the world, it should be able to work like a human, and even let the agent use a SaaS product. For example, the house-hunting case presented on the Manus.im official website is actually letting AI work with a SaaS product dedicated to the real estate field.

He believes that what should be clearly defined is the boundaries of the agent's use of tools, rather than the group of people it serves. Manus is not simulating a person who does a specific job, nor is it a role-based intelligent agent such as R&D or product manager. Instead, it is simulating a person who can get things done, simulating how an intern works.

Manus's multi-agent system refers to the separation of planning and execution.

For the executor, Manus adopted Claude, which currently leads in programming, long-horizon planning, and step-by-step problem solving, and the team is also post-training a series of Qwen models.

Yesterday, Manus also entered into a strategic partnership with Alibaba's Tongyi Qianwen (Qwen), committing to implement all of Manus's functions on domestic models and computing platforms. ｜Image source: Manus

Manus did a lot of work on the planner part.

The off-the-shelf APIs and models currently on the market are essentially aligned for chatbot scenarios: no matter how complex the user's question is, the training objective is to answer it clearly in a single reply. That is the exact opposite of the planning an agent requires.

Therefore, if an existing model is used directly in an agent scenario without this "alignment", it will always rush for a quick result and produce a "mushy" answer within a single round of dialogue, much like the many bullet-point summaries we see.
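The separation of planning and execution mentioned in this section can be illustrated with a toy sketch. Everything here is hypothetical: `toy_planner` splits a goal on the word "then" where a real planner would be a model aligned for multi-step planning, and `toy_executor` only echoes each step where a real executor would call tools.

```python
from dataclasses import dataclass, field


@dataclass
class Plan:
    goal: str
    steps: list[str] = field(default_factory=list)


def toy_planner(goal: str) -> Plan:
    """Hypothetical planner: break the goal into steps instead of answering in one reply."""
    steps = [part.strip() for part in goal.split(" then ") if part.strip()]
    return Plan(goal=goal, steps=steps)


def toy_executor(step: str) -> str:
    """Hypothetical executor: carry out exactly one step and report back."""
    return f"done: {step}"


def run(goal: str) -> list[str]:
    """Plan first, then execute the steps one at a time."""
    plan = toy_planner(goal)
    return [toy_executor(step) for step in plan.steps]
```

A chatbot-aligned model corresponds to skipping `toy_planner` and producing one reply for the whole goal, which is the behavior the team argues is wrong for agents.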

"The alignment method should be different. Our team believes that different data are needed for special alignment," said Xiao Hong.

Last October, Peak also documented on Zhihu the progress and setbacks of a hobby project that attempted to reproduce OpenAI o1: the open-source Steiner model. In fact, this project was precisely the preliminary research for the step-by-step planning of the Manus planner.

In general, Manus simulates a person who does things. This is the team's product definition of Manus as a general AI assistant. As for thinking about its boundaries, the team is probably still exploring and needs more user use cases.

In an interview with Tencent Technology published before Manus was released, Xiao Hong shared his early thinking on how general Manus should be. "A very core issue, or an important responsibility of a product manager, is to manage user expectations. Suppose people assume it can do everything in the world, for example: How can I make $1 million? That is not something an agent should be asked to carry out. But if we can give more and more specific examples so that everyone's expectations become more reasonable, everyone will use it more smoothly."

04 "Shells have their uses": the team that understands shells best

In the early morning of February 27, Manus product partner Zhang Tao and chief scientist Ji Yichao (Peak) both burst into tears when they saw Manus.im's leaderboard results. Manus's performance on the GAIA benchmark exceeded that of OpenAI's Deep Research, and it achieved this unexpected result at about 1/10 of what it cost OpenAI to achieve its ranking ($2/task).

At a time when the whole industry had come to agree that agents are the next competitive battleground, this team of a few dozen people became one of the first to ship a general agent product. It also stands out in product engineering and front-end interaction experience.

The positive feedback of getting something done is better than anything else. There is no better motivation for a startup team than this. But before that, how did Manus come about? Why was it created by this team?

"Today's model capabilities are able to complete some complex and multi-step tasks. It's just that there is no such product, so people can't experience it." The insights mentioned by Xiao Hong in a previous interview with Tencent Technology can be used to understand this problem.

At the same time, "There are not many teams that have the opportunity to try to build agent products, because it requires many complex capabilities. The team needs to have worked on chatbots, on some AI programming, and on browser-related things, because the agent needs to call a browser. It also needs a good sense of the boundaries of LLMs: what level they have reached today and what level they will reach next. First, there are not many companies that have all of these capabilities at once, and the companies that do may be busy with a very specific business. It just so happens that some of our colleagues have the time to do these things together."

"exactly".

Discover at the right time that the model capabilities have reached the level of being able to act as an agent, without having to wait for a large end-to-end model like an Operator to be released;
It also happened to be discovered that the problem was with alignment;
I also happened to have developed all the extended functions of chatbots and AI browsers;
At the same time, because I have been working on large-scale model application products in the so-called "shelling", I have a keen sense of LLM;

The "Butterfly Effect" team has achieved all the elements required to create such a universal agent at the moment, so now we have a universal agent that is relatively complete in the industry.

When asked about the decisive moment in starting Manus, Peak gave more details. He said, "There is no 'clean' pivot in entrepreneurship," and everything is coherent and has no clear boundaries.

"When I was working on a product, I would also pay close attention to the external situation." There were several things at that time. First, when I was working on the browser, I worked on the client-side model. Later, I found that the scenarios required by the browser were very, very wide, with different features. During the process, I found that the speed at which the base model became stronger was accelerating, so strong that the gap between it and the agent might be an alignment issue. Although the outside world may feel that the large language model is gradually converging and hitting a wall.

At the same time, the outside world was also changing. At the beginning of last year, Cursor became popular, followed by Windsurf and Devin. This corresponds to the same thread behind it. Agents became popular in the programming field, and the way they became popular was progressive. Cursor is a copilot for programmers, improving programming efficiency. Starting with Windsurf, some automated processes have gradually been introduced, allowing you to have stronger automation capabilities on your local machine. Devin has reached a new level of automation.

VC trends are consistent with this. For example, last year and the year before, YC invested in two types of companies: cloud-based browsers, such as Browserbase, and lightweight AI sandbox virtual machines, such as E2B.

This shows that "the model layer is maturing rapidly, and so is the infrastructure layer. In addition, as we saw external products gradually gaining acceptance, we felt this was a direction worth going all in on. It was a very gradual and smooth process. Also, what we had accumulated on browsers, such as Chromium, could be migrated over seamlessly. This is why we dared to build browsers in the cloud."

In summary, Manus grew out of a keen sense for both user needs and models, together with the experience accumulated from building so-called "shells". Many of Monica's scenarios required post-training of models. Meanwhile, the most important lesson, "less structure, more intelligence", was reinforced while building the AI browser. The team found that model capabilities had reached the level needed for agents and that the remaining problem was alignment. Then came three months of rapid evolution for Manus.

Previously, the "Butterfly Effect" team was questioned about the value of building a "shell". Without developing its own large model, it created Monica by integrating existing large models and combining functions such as chat, search, reading, writing, and translation. It also covered many task-execution scenarios by connecting to APIs one by one. By the end of last year, its user base had reached tens of millions.

Now that Doubao, Quark, and Yuanbao are vigorously promoting their own Monica-like products, and a small team has used existing technologies to create the first general-purpose, consumer-grade agent, it is time to take a fresh look at the "shell".

What exactly is a "shell", and what does "building a shell" mean?

In Xiao Hong's view, all breakthroughs are brought about by models: progress is fundamentally model-driven, and the model comes first. The purpose of the shell is to present innovations in model technology in a way that users can perceive, packaging the model's new capabilities in the form users can perceive most readily.

By this definition, the DeepSeek app (including its display of the chain of thought) is the shell of DeepSeek-R1, Cursor is the shell of Anthropic's Claude 3.5 Sonnet, Perplexity is the shell of GPT-4, and ChatGPT is the shell of InstructGPT.

As model capabilities evolve rapidly, the "shell" needs to evolve too. After each generation of model capabilities arrives, it is not necessarily the original vendor that presents its user-perceivable value; it may be a third party, just as Cursor presents the user-perceivable value of Claude 3.5 Sonnet.

On March 5, the second anniversary of Monica.im's release, why were these few dozen people able to create a product experience that surpassed the various Deep Research products and OpenAI's Operator? The answer lies in their understanding and practice of the shell.

How to make the best shell for a new model that can be used as an agent?

As the builder of Manus, Zhang Tao believes that "looking at the entire architecture from the backend, we see that there is a lot of unfinished work to be done in every place, and each of those places is the key to success and makes the product different."

In the team's view, the most important advantage is the pace of innovation. Both applications and models have reached a relatively saturated state, and the only core capability left is moving fast, even though things like the "data flywheel" and "network effects" have not yet been validated.

"In a brand-new field, everything is uncertain and unknown. The most important thing is the speed of innovation: you rely on exploring and experimenting in many directions to quickly find the right path." The Manus team is flexible in its management philosophy, organizational structure, and processes. When a new opportunity arises, it can pool the limited resources of the whole company, make decisions very quickly, and adjust based on feedback from mistakes.

From left to right are "Butterfly Effect" chief scientist Peak, CEO Xiao Hong, and product partner Zhang Tao | Image source: Internet

As for expectations of Manus, Xiao Hong believes that "it is worth a try even if there is only a window of opportunity." His thinking has also changed dramatically over the past year. For example, he now believes that "when I realize that I am ahead of the times, I should be more aggressive, super aggressive. Looking back today, I feel that Monica was not aggressive enough in 2023." "If you know that you are innovating and leading, you should be aggressive."

I don't know whether Manus can take Xiao Hong and his team through the leap from 1 to N, but this team, which understands "shells" best, believes in the unity of mind and hand in creation, and in the butterfly effect that creation brings. The name Manus comes from MIT's motto, "Mens et manus", which emphasizes the unity of mind and hand: you cannot just look, you have to act and make an impact on the real world; that is real knowledge.

In the future, as more of the data behind Manus is open sourced, a wider range of butterfly effects will be further released.
