Author: Wan Chen
The elegant writing of DeepSeek-R1, the Ghibli-style art of GPT-4o, OpenAI o3's geographical location inference from images...
These are the breakout AI products of the past two months. They show that reinforcement learning is finally generalizing and that multimodal models are becoming increasingly usable. It also means that 2025 has truly become the year in which Agent applications are deployed at an accelerating pace.
The team behind Manus, the AI Agent that went viral earlier, once revealed that by the end of last year Claude 3.5 Sonnet had reached the level required for Agent development in long-horizon planning and step-by-step problem solving, which was the precondition for Manus's birth.
Now, with the further maturation of deep thinking models and multimodal models, there will certainly be more Agents capable of handling complex tasks.
Based on this judgment, on April 17th, ByteDance's cloud and AI service platform Volcano Engine released a stronger model for the enterprise market, the Doubao 1.5 Deep Thinking Model. This is also the first public appearance of the reasoning model behind ByteDance's AI application, the Doubao App. Alongside it, Volcano Engine launched Doubao Text-to-Image Model 3.0 and an upgraded visual understanding model.
Regarding this model release, Volcano Engine's President Tan Dai believes that "the deep thinking model is the foundation for building Agents. The model must have the ability to think, plan, and reflect well, and must support multimodality, just like humans have visual and auditory capabilities, so Agents can better handle complex tasks."
As AI gains end-to-end autonomous decision-making and execution capabilities and moves into core production processes, Volcano Engine has also prepared an architecture and tools that let Agents operate in the digital and physical worlds: the OS Agent solution and the AI cloud-native inference suite, which help enterprises build and deploy Agent applications faster and more economically.
In Tan Dai's view, developing an Agent is like developing a website or an app. Model APIs alone cannot solve the whole problem; many AI cloud-native components are also needed. Cloud-native computing had core defining elements such as containers and elasticity, and AI cloud-native will have similar key elements. Through continuous thinking, exploration, and rapid action on AI cloud-native, including middleware, evaluation, monitoring, observability, data processing, security, and model-adjacent components such as Sandbox, Volcano Engine aims to become the best infrastructure choice for the AI era.
01 Doubao Deep Thinking Model: Thinking, Searching, and Reasoning Like Humans
Since the release of DeepSeek-R1 at the beginning of the year, many consumer-facing (ToC) applications have integrated the R1 reasoning model, with the Doubao App a notable exception. The "Deep Thinking" mode launched in the Doubao App in early March is backed by ByteDance's self-developed Doubao Deep Thinking Model.
Now this reasoning model, the Doubao 1.5 Deep Thinking Model, is officially released and can be tried out and called on the Volcano Ark platform.
With web search mode switched on, Doubao works through a problem the way a person would: thinking, searching, then thinking again, until the problem is solved.
Here's an example in a shopping scenario, where Doubao recommends a suitable camping equipment set under given budget and size constraints.
In this problem, Doubao first broke down the considerations and planned what information it needed, then identified what was missing and searched the web for it. It ran three rounds of search: first for price and performance, to make sure the budget and requirements were met; then for children's specific needs; and finally for weather conditions and detailed reviews. It kept thinking and searching until it had all the context needed for a decision, and then gave a reasoned answer.
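The think-then-search loop described above can be sketched in a few lines. This is only an illustrative sketch, not Doubao's actual implementation: the function `think_and_search`, the stub `fake_search`, and the `max_rounds` limit are all hypothetical names introduced here.

```python
from typing import Callable


def think_and_search(
    required: list[str],
    search: Callable[[str], str],
    max_rounds: int = 5,
) -> dict[str, str]:
    """Alternate between checking what is missing and searching for it."""
    context: dict[str, str] = {}
    for _ in range(max_rounds):
        # "Think": decide which required facts are still missing.
        missing = [topic for topic in required if topic not in context]
        if not missing:
            break  # enough context to answer
        # "Search": fetch one missing fact per round, then re-evaluate.
        topic = missing[0]
        context[topic] = search(topic)
    return context


def fake_search(topic: str) -> str:
    """Hypothetical stand-in for a real web search call."""
    return f"results for {topic}"


# The three search rounds from the camping example.
topics = ["price and performance", "children's needs", "weather and reviews"]
context = think_and_search(topics, fake_search)
```

In a real system the "think" step would be the model's own reasoning about what it still needs, rather than a fixed checklist.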
Besides thinking and searching, the Doubao Deep Thinking Model also possesses visual reasoning capabilities, thinking not just based on text, but also based on visual images.
Take ordering food as an example. With the May Day Golden Week approaching, travelers abroad will no longer need to upload photos to translation software to read menus: the Doubao Deep Thinking Model can help them order directly from a photo of the menu.
In the following example, the Doubao Deep Thinking Model first converted the currency to keep within budget, then considered the preferences of the elderly and children, carefully avoided dishes they might be allergic to, and directly proposed a set of dishes.
By combining web search, thinking, reasoning, and multimodality, the Doubao 1.5 Deep Thinking Model demonstrates comprehensive reasoning capabilities that let it solve more complex problems.
According to the technical report, the Doubao 1.5 Deep Thinking Model achieves high completion rates on reasoning tasks in professional domains. For instance, it scored on par with OpenAI o3-mini-high on the AIME 2024 mathematics reasoning test, and its performance is close to o1 on programming competitions and scientific reasoning tests. In general tasks such as creative writing and humanities knowledge Q&A, the model also generalizes well and can handle a broader range of usage scenarios.
The Doubao Deep Thinking Model also features low latency. Its technical report shows that the model uses an MoE architecture with 200B total parameters, of which only 20B are activated, achieving results comparable to top-tier models with a relatively small number of active parameters. Backed by efficient algorithms and a high-performance inference system, the Doubao model API service sustains high concurrency with latency as low as 20 milliseconds.
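To make the 200B/20B figures concrete, the sketch below computes the activated fraction and shows a toy top-k expert router, the general mechanism by which an MoE layer uses only some of its parameters per token. Only the two parameter counts come from the article; the number of experts, the value of k, the gate scores, and the function `route_top_k` are assumptions for illustration and are not taken from the technical report.

```python
import math

TOTAL_PARAMS_B = 200   # total parameters in billions (from the article)
ACTIVE_PARAMS_B = 20   # activated parameters in billions (from the article)

# Share of the model's parameters that is activated.
active_fraction = ACTIVE_PARAMS_B / TOTAL_PARAMS_B  # 0.1, i.e. 10%


def route_top_k(gate_scores: list[float], k: int) -> dict[int, float]:
    """Pick the k highest-scoring experts and softmax-normalise their weights."""
    top = sorted(range(len(gate_scores)),
                 key=lambda i: gate_scores[i], reverse=True)[:k]
    exps = {i: math.exp(gate_scores[i]) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


# Hypothetical gate scores for one token over 8 experts; only 2 are used.
weights = route_top_k([0.1, 2.0, -1.0, 1.5, 0.3, 0.0, -0.5, 0.9], k=2)
```

Because only the selected experts run for a given token, compute per token scales with the activated parameters rather than the total, which is one reason an MoE model can be served with lower latency than a dense model of the same total size.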
It also has multimodal capabilities, applicable to various scenarios. For example, it can understand complex enterprise project management flowcharts, quickly locate key information, and with strong instruction-following abilities, strictly answer customer questions according to the flowchart; when analyzing aerial photos, it can combine terrain features to judge regional development feasibility.
In addition to the reasoning model, the Doubao large model family received two other updates. For text-to-image, Doubao launched the upgraded version 3.0, which delivers better text layout, photo-realistic image generation, and 2K high-definition output.
The new version not only handles small and long text better, both of which were previously hard to render, but also improves overall image layout. For instance, the two posters "Current Form" and "Harvest Plan" generated on the far left have finer details and more natural layouts, and are ready for immediate use.
The other upgrade is the Doubao 1.5 Visual Understanding Model. The new version has two key updates: more precise visual positioning and smarter video understanding.
In visual positioning, the Doubao 1.5 Visual Understanding Model supports box positioning and point positioning for multiple targets, small targets, and general targets, and it supports counting based on positioning, describing the positioned content, and 3D positioning. The improved visual positioning capabilities further expand application scenarios such as offline store inspections, GUI agents, robot training, and autonomous driving training.
Video understanding has also improved significantly, including memory, summarization, speed perception, and long-video understanding. Enterprises can build more interesting commercial applications on top of it. In home scenarios, for example, video understanding can be combined with vector search to run semantic searches over home surveillance footage.
For example, in the following case, a cat owner can simply search "What did the kitten do at home today?" and the system quickly returns semantically related video clips for viewing.
With reasoning models that have visual understanding and ample reasoning capability in reserve, many previously impossible tasks become achievable, unlocking more scenarios. For instance, cameras with such functionality are likely to be more popular, and AI glasses, AI toys, smart cameras, and door locks will have new room to develop.
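The "video understanding plus vector search" idea can be sketched as follows: each clip is assumed to have a text description (in practice, produced by a visual understanding model), descriptions and the query are turned into vectors, and clips are ranked by cosine similarity. Everything here is a hypothetical illustration: the clip IDs and descriptions are made up, and `embed` is a toy bag-of-words stand-in for a learned embedding model.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy bag-of-words 'embedding'; a real system would use a learned model."""
    cleaned = text.lower().replace("?", "").replace(".", "")
    return Counter(cleaned.split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse word-count vectors."""
    dot = sum(a[w] * b[w] for w in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0


# Hypothetical clip descriptions, as a video understanding model might write them.
clips = {
    "clip_0815": "the kitten climbed the sofa and knocked over a cup",
    "clip_1230": "a courier left a parcel by the front door",
    "clip_1710": "the kitten slept on the windowsill",
}


def search_clips(query: str, top_k: int = 2) -> list[str]:
    """Return the IDs of the top_k clips most similar to the query."""
    q = embed(query)
    ranked = sorted(clips, key=lambda cid: cosine(q, embed(clips[cid])),
                    reverse=True)
    return ranked[:top_k]


results = search_clips("What did the kitten do at home today?")
```

With a real embedding model the match would rest on meaning rather than shared words, so a query about "the cat" could still retrieve clips described with "kitten".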
02 Cloud Enters the Agentic AI Era
Recently, OpenAI researcher Yao Shunyu (a core author of Deep Research and Operator) pointed out in "The Second Half of AI" that reinforcement learning has finally found a generalizable path: it is no longer effective only in specific domains such as defeating human chess players, but reaches near-human competition level in software engineering, creative writing, IMO-level mathematics, mouse and keyboard operation, and more. In this situation, chasing leaderboard scores, even on ever more complex leaderboards, becomes easier, but that style of evaluation is already outdated.
Now, the competition is about the ability to define problems. In other words, what problems will AI solve in real life?
In 2025, the answer is productivity Agents. AI application scenarios are rapidly entering the Agentic AI era, with AI gradually becoming able to complete professional and time-consuming tasks. Against this backdrop, Volcano Engine has built a series of infrastructure offerings that let enterprises "define their own general Agent".
The most important piece is the model. It must be capable of autonomous planning and reflection and of end-to-end autonomous decision-making and execution, so that it can move into core production processes. It also needs multimodal reasoning capabilities, so that it can use its "ears, mouth, and eyes" together to complete tasks in the real world.
Beyond the model, the infrastructure technology stack also needs to keep evolving. For example, as the MoE architecture proves more efficient and gradually becomes the mainstream model architecture, scheduling and serving MoE models requires more sophisticated and flexible cloud computing architecture and tools.
For the enterprise general Agent scenario, Volcano Engine has launched a better architecture and toolset, the OS Agent solution, which lets large models operate in the digital and physical worlds. Examples include Agents operating a browser and searching product pages to compare iPhone prices, or Agents editing video and composing music in Jianying on a remote computer.
Currently, Volcano Engine's OS Agent solution includes the Doubao UI-TARS model, as well as veFaaS function services, cloud servers, cloud phones, and other products, enabling Agents to operate code, browsers, computers, phones, and more. Among these, the Doubao UI-TARS model integrates screen visual understanding, logical reasoning, interface element positioning, and operation. It breaks through the limitation of traditional automation tools, which depend on preset rules, and gives Agents a model foundation for intelligent interaction that is closer to how humans operate.
In the general Agent scenario, Volcano Engine enables enterprises, individuals, or specific domains to define and explore Agents according to their needs through this OS Agent solution.
For vertical Agents, Volcano Engine will explore the domains where it has its own advantages. It has previously launched the intelligent programming assistant Trae and the data product Data Agent, the latter maximizing data processing capabilities by building a data flywheel.
On the other hand, as Agents spread, model inference consumption will grow. To meet large-scale inference demand, Volcano Engine has built ServingKit, an AI cloud-native inference suite that makes model deployment faster and inference cheaper, with GPU consumption reduced by 80% compared with traditional solutions.
In Tan Dai's view, to meet the needs of the AI era, Volcano Engine will continue to focus on three things: continuously optimizing models to stay competitive; constantly reducing cost, which covers lowering fees and latency and raising throughput; and making products easier to put into production, through developer tools such as Coze and HiAgent and cloud-native components such as OS Agent. By maintaining product and technology leadership, it expects market share to lead as well. Previously, IDC's "China Public Cloud Large Model Service Market Structure Analysis, 1Q25" showed that Volcano Engine ranks first with a 46.4% market share.
In December last year, the Doubao large model averaged 4 trillion Token calls per day. By the end of March this year, that figure had exceeded 12.7 trillion, more than 106 times the volume when the Doubao large model was first released less than a year earlier. Going forward, as deep thinking models and visual reasoning mature further and AI cloud infrastructure is optimized, Agents will drive even larger Token call volumes.
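As a quick check on these figures, the short calculation below derives what the stated numbers imply. The three input values come from the article; the derived values are simple arithmetic, and because the article says "over 106 times", the implied launch volume should be read as an approximate upper bound rather than a reported figure.

```python
DEC_DAILY_TOKENS = 4e12       # daily average in December (from the article)
MARCH_DAILY_TOKENS = 12.7e12  # daily average by end of March (from the article)
GROWTH_SINCE_LAUNCH = 106     # "over 106 times" since first release

# Daily volume at first release implied by the 106x multiple (about 120 billion).
implied_launch_tokens = MARCH_DAILY_TOKENS / GROWTH_SINCE_LAUNCH

# Growth over roughly one quarter, December to end of March (about 3.2x).
growth_since_december = MARCH_DAILY_TOKENS / DEC_DAILY_TOKENS

print(f"Implied launch volume: {implied_launch_tokens / 1e9:.0f} billion tokens/day")
print(f"Growth since December: {growth_since_december:.3f}x")
```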