
According to the Step-Audio team, the model can produce speech with different emotions, dialects, languages, singing, and personalized styles to match different scenarios, and can hold natural, high-quality conversations with users.
At the same time, the voices it generates are not only lifelike, natural, and emotionally intelligent, but also capable of high-quality voice cloning and role-playing.
In short, Step-Audio is built to satisfy application demand in scenarios such as film and TV entertainment, social networking, and gaming.
The Step open-source ecosystem is snowballing
How to put it? In one word: relentless.
Step really has been grinding, especially on multi-modal models, its forte:
Since their debut, the multi-modal models in the Step series have topped authoritative benchmarks and competitions both at home and abroad.
In the past three months alone, they have taken first place several times.
On November 22 last year, the multi-modal understanding model Step-1V appeared on the latest Chatbot Arena leaderboard, matching Gemini-1.5-Flash-8B-Exp-0827 in total score and ranking first among Chinese large models in the vision category.
In January this year, the newly released Step-1o series took first place on the real-time multi-modal leaderboard of the domestic evaluation platform OpenCompass ("Compass").
On the same day, the multi-modal model Step-1o-vision ranked first among domestic large models in the vision category on the latest Chatbot Arena leaderboard.
Second, Step's multi-modal models are not only strong in performance and quality, but also shipped at a high cadence:
To date, Step has released 11 multi-modal large models in succession.
Last month it released 6 models in 6 days, covering the full spectrum of language, speech, vision, and reasoning, further cementing its title as a multi-modal leader.
This month, 2 more multi-modal models have been open-sourced.
If this pace holds, it will keep proving its status as a "full-stack multi-modal player".
Riding on this multi-modal strength, the market and developers have, since 2024, widely recognized and adopted the Step API, forming a large user base.
On the consumer side, the tea chain Cha Bai Dao has connected thousands of stores nationwide to the multi-modal understanding model Step-1V, exploring large-model applications in the tea beverage industry for intelligent quality inspection and AIGC marketing.
Public data shows that, on average, more than 1 million cups of Cha Bai Dao tea reach consumers each day under the watch of the model's intelligent inspection.
Step-1V also saves Cha Bai Dao supervisors an average of 75% of their daily self-inspection and verification time, giving consumers more reassuring, higher-quality service.
Among independent developers, the much-discussed AI app "Stomach Book" and the AI mental-wellness app "Forest Chat Room" both chose the Step multi-modal API after A/B testing most domestic models.
(Whispering: because it delivers the highest conversion rate.)
Concretely, in the second half of 2024, call volume on Step's multi-modal large model API grew more than 45-fold.
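For developers, getting started looks much like calling any OpenAI-compatible API. Below is a minimal sketch; the endpoint URL and the model ID step-1v-8k are assumptions for illustration, so check StepFun's official platform docs for the current values.

```python
# Minimal sketch of calling a Step multi-modal model through an
# OpenAI-compatible endpoint. The base URL and model ID below are
# assumptions for illustration; consult StepFun's official docs.
from openai import OpenAI

client = OpenAI(
    api_key="YOUR_STEP_API_KEY",            # key issued by the StepFun platform
    base_url="https://api.stepfun.com/v1",  # assumed OpenAI-compatible endpoint
)

response = client.chat.completions.create(
    model="step-1v-8k",  # assumed multi-modal (vision) model ID
    messages=[{
        "role": "user",
        "content": [
            {"type": "text", "text": "Describe any quality issues visible in this drink."},
            {"type": "image_url", "image_url": {"url": "https://example.com/cup.jpg"}},
        ],
    }],
)
print(response.choices[0].message.content)
```

If the interface does follow the OpenAI chat-completions convention, existing tooling built on that SDK can switch over by swapping the base_url alone, which would help explain the rapid uptake.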
Moreover, what Step has open-sourced this time are its own multi-modal models, the area where it is most proficient.
We note that Step, having built up reputation and traction among the market and developers, is using this open-source release to pursue deeper integration on the model side.
On the one hand, Step-Video-T2V adopts the most open and permissive MIT license, allowing free modification and commercial use.
You could call it "completely open".
On the other hand, Step says it is "fully lowering the barrier to entry for the industry".
Take Step-Audio as an example: unlike open-source offerings on the market that require redeployment and secondary development, Step-Audio is a complete real-time dialogue solution; a simple deployment is all it takes to get real-time dialogue running, as the sketch below shows.
From a standing start, you can enjoy the end-to-end experience.
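As a concrete picture of that "simple deployment", here is a minimal sketch of fetching the open-sourced weights before running the official repo's inference scripts; the Hugging Face repo ID stepfun-ai/Step-Audio-Chat is an assumption for illustration, so use whatever ID the official release publishes.

```python
# Minimal sketch: downloading the open-sourced Step-Audio weights.
# The repo ID is an assumption for illustration; use the ID published
# in the official release notes.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="stepfun-ai/Step-Audio-Chat",  # assumed Hugging Face repo ID
    local_dir="./Step-Audio-Chat",
)
print(f"Weights downloaded to {local_dir}")
# From here, the inference scripts in the official repo can be pointed
# at local_dir to start a real-time dialogue session.
```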
All in all, around Step and its multi-modal trump card, a distinctive open-source technology ecosystem has begun to take shape.
In this ecosystem, technology, creativity, and commercial value intertwine, jointly driving the development of multi-modal technology.
And with Step's continued model R&D, developers coming on board at a rapid clip, and ecosystem partners lending their support, the "snowball effect" in Step's ecosystem is already under way and gathering momentum.
China's open-source forces are speaking with strength
There was a time when, asked to name the leaders in open-source large models, people would think of Meta's LLaMA and Albert Gu's Mamba.
Now, without question, the open-source forces of China's large model community have shone on the global stage, rewriting those "stereotypes" with strength.
January 20, on the eve of the Year of the Snake Spring Festival, was a day of clashing AI titans at home and abroad.
Most eye-catching of all, DeepSeek-R1 arrived that day, with reasoning performance on par with OpenAI's o1 at only 1/3 the cost.
The impact was such that Nvidia shed $589 billion (about 4.24 trillion yuan) in market value in a single day, the largest single-day loss in US stock market history.
More important, and more dazzling: what made R1 a phenomenon exciting millions was not only its excellent reasoning and affordable price, but above all the fact that it is open source.
One stone stirred up a thousand waves; even OpenAI, long mocked as "no longer open", saw CEO Sam Altman come out repeatedly to speak in public.
Altman said: "On the question of open-weight AI models, (in my view) we have been on the wrong side of history."
He also said: "The world really needs open-source models; they can provide a lot of value to people. I'm glad there are some excellent open-source models in the world."
Now, Step is also starting to open-source its new trump cards.
And open-sourcing has been the intention from the start.
Officially, the purpose of open-sourcing Step-Video-T2V and Step-Audio is to promote the sharing of, and innovation in, large model technology, and to drive the inclusive development of artificial intelligence.
The moment they were open-sourced, the models showed their strength across multiple benchmark suites.
At today's open-source large model table, DeepSeek bets on strong reasoning, Step focuses on multi-modality, and a variety of other players keep growing...
Their strength is not only at the forefront of the open-source world; it is formidable across the entire large model landscape.
China's open-source forces, having made their entrance, are pressing further ahead.
Take Step's open-source release this time: the breakthrough lies in multi-modal technology, and the change lies in the choice logic of global developers.
Technical influencers active in open-source communities such as EleutherAI have come forward to test Step's models, "thanking Chinese open source".
Wang Tiezhen, head of Hugging Face China, stated outright that Step will be the next "DeepSeek".
From "technological breakthrough" to "ecosystem openness", China's large models are walking an increasingly steady path.
Seen in that light, Step's open-sourcing of these two models may be just a footnote to the AI competition of 2025.
More fundamentally, it demonstrates the technological self-confidence of China's open-source forces and sends a signal:
In the future world of large AI models, China will be neither absent nor left behind.