Doubao App's latest voice mode, now in grayscale (limited) rollout, can sing, which GPT-4o cannot

01-21

Written by Zhou Xinyu

Edited by Su Jianxun

Doubao's first update of 2025 focuses on its voice call feature.

On January 20, 2025, Doubao released its latest "end-to-end" large voice model and updated the real-time voice call feature of the Doubao app to run on it.

Previously, Doubao's voice call feature used a cascaded solution of ASR (automatic speech recognition) + LLM (large language model) + TTS (text-to-speech). The updated end-to-end large voice model integrates speech recognition, understanding, and generation into a single model.

In testing by "Intelligent Emergence," the highlight of Doubao's updated voice model is that it reproduces human-like expression and emotional output during voice interaction. The fluency and intelligence of its dialogue have also improved considerably.

For example, Doubao's new "Soul Singer" and "Versatile Celebrity" voice call modes put it a step ahead of GPT-4o by enabling singing and role-playing.

△ Doubao's updated voice call modes.

Doubao has learned to sing and role-play

One major change is that Doubao has extended its voice role-playing to celebrities, book characters, and film and television characters. This capability is offered in the "Versatile Celebrity" voice call mode.

For example, when the author asked Doubao to "imitate Yu Shuxin's voice and say a New Year's greeting," Doubao responded with "Hmph, I don't want to imitate her! I'm just me, a different firework," capturing her "little drama queen" vibe.

Demonstration video: https://pan.baidu.com/s/1i9DvF3o2wjq_jyGMuF_lgQ?pwd=yrn8

Doubao's contextual memory is also impressive. After the author tried roles such as Song Dandan, Lin Daiyu, and Zhen Huan in the same conversation and then asked Doubao to imitate Yu Shuxin again, it immediately sounded aggrieved: "Why are you asking me to imitate her again?"

Demonstration video: https://pan.baidu.com/s/1gmHHEkqcrwAfiY01uy8-Uw?pwd=3b7a

Currently, most voice models on the market still require users to write fairly specialized text prompts to create a song, or must first generate music from user-supplied text or audio. They cannot "sing spontaneously" in the middle of a natural voice conversation.

Doubao's newly launched "Soul Singer" mode lets it break into song spontaneously during a conversation.

For example, when asked to sing an upbeat song, Doubao immediately sang a rendition of Taylor Swift's "Love Story," though it mistakenly called the song "Lose Control" and its pitch was a bit off.

Demonstration video: https://pan.baidu.com/s/1vN4GpKdVtGEn4bYiV3uOkQ?pwd=kj8j

Doubao can now also compose songs. For example, when told "Sing me a song with the lyrics 'more year-end bonuses,'" it immediately performed one. The lyrics were a bit crude, but the response was very fast.

Demonstration video: https://pan.baidu.com/s/1VZAL7F6h0cH6x8pDDB1muw?pwd=3seb

Role-playing and singing show that Doubao's human-likeness, naturalness of interaction, and emotional expressiveness have all moved up a level.

For example, when asked to tell a ghost story, Doubao shifts its tone to follow the plot, which makes for a very atmospheric experience.

Demonstration video: https://pan.baidu.com/s/13g20MBVW1ydmtuL-dd3qSw?pwd=g3kb

This update also adds two new personality modes: "Bullied Little Doubao" and "Praise Master."

"Bullied Little Doubao" is officially described as presenting a pitiful persona. In our conversation, though, it came across more as a "green tea little Doubao" ("green tea" being Chinese internet slang for someone who plays innocent in a calculated way).

Demonstration video: https://pan.baidu.com/s/1cixSfFb89KVC1wBKogGOyg?pwd=vcxr

What is notable is that "Bullied Little Doubao" keeps its "aggrieved" persona no matter what instructions it is given. For example, when asked to be sarcastic, even its most sarcastic reply still carries a whiff of "green tea":

"Oh, I wouldn't dare, you're the master, and I'm just a pitiful little thing for you to command, how could I have any other thoughts?"

Demonstration video: https://pan.baidu.com/s/1y4JBcUIjOMQKozUeufvXCg?pwd=b746

Compared with the voice call feature released in August, Doubao's emotional perception has also improved. From a simple "aha," it can pick up on the user's cheerful mood.

Demonstration video: https://pan.baidu.com/s/1UKAra3EOhL0l_1OPFoRdAg?pwd=m1rb

Doubao's own emotional expression has become more human-like as well. Teasing it with a "guess my gender" game feels like joking with a real person online.

Demonstration video: https://pan.baidu.com/s/1eTlUjDLENsnWGE2mEzSLEg?pwd=rusa

Mastering voice interaction: the entry ticket to the anthropomorphic AI track

Since the release of OpenAI's GPT-4o in May 2024, most AI voice call functions on the market have been using the cascaded solution of ASR (automatic speech recognition) + LLM (large language model) + TTS (text-to-speech).

For example, Doubao's original voice call feature combined the speech recognition model Seed-ASR, the speech synthesis model Seed-TTS, and RTC (real-time audio and video) technology to achieve real-time interaction in conversational scenarios.

The drawback of cascading multiple models is that the interaction still does not feel natural enough: information is inevitably lost in the "speech-to-text-to-speech" process.
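The information loss described above can be illustrated with a toy sketch. Everything here is hypothetical: `Speech`, `asr`, `llm`, `tts`, and `end_to_end_reply` are stand-in stubs invented for illustration, not Doubao's models or any real API. The sketch only assumes that transcription keeps the words and drops tone of voice, so the later stages never see it.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Speech:
    """Toy stand-in for an audio clip: the words plus one paralinguistic cue."""
    words: str
    emotion: str = "neutral"  # hypothetical label for tone of voice


def asr(audio: Speech) -> str:
    # Transcription keeps only the words; the emotion field is dropped here.
    return audio.words


def llm(transcript: str) -> str:
    # Placeholder text model: it only ever sees the transcript.
    return f"You said: {transcript}"


def tts(text: str) -> Speech:
    # Synthesis has no record of how the user sounded, so it stays neutral.
    return Speech(words=text)


def cascaded_reply(audio: Speech) -> Speech:
    # ASR -> LLM -> TTS, each stage consuming only the previous stage's output.
    return tts(llm(asr(audio)))


def end_to_end_reply(audio: Speech) -> Speech:
    # A single (stub) model receives the whole input and can mirror its emotion.
    return Speech(words=f"You said: {audio.words}", emotion=audio.emotion)


user = Speech(words="aha", emotion="cheerful")
print(cascaded_reply(user).emotion)
print(end_to_end_reply(user).emotion)
```

In this sketch the cascaded path loses the user's mood at the first stage, while the single-model path can still condition its reply on it.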

This has limited where traditional voice interaction can be applied. The industry has deployed AI voice interaction mainly in education, customer service, and other scenarios that demand expertise but little human-likeness.

End-to-end solutions, however, are gradually becoming mainstream. For example, Zhipu released GLM-4-Voice in October 2024, and Mianbi Intelligence released MiniCPM-o 2.6, billed as an "edge-side GPT-4o," in January 2025. Both use an end-to-end approach in which a single model handles visual understanding as well as speech understanding and generation.

According to "Intelligent Emergence," this update to Doubao's voice call feature stems mainly from a change in the underlying model technology: the original cascade of multiple cooperating models has been replaced by an end-to-end solution that goes directly from speech understanding to speech generation. The result is lower latency and marked improvements in naturalness, emotional expression, and even singing.

Better voice capabilities will also widen where AI can be deployed, from professional fields such as education and customer service to broader scenarios such as emotional companionship, psychological counseling, and dubbing.

AI emotional companionship and role-playing in particular have already shown strong money-making potential. For example, the recently launched AI idol role-playing app "Lovey Dovey" quickly topped the iOS charts in South Korea and is popular among idol fans. Talkie, a role-playing app from MiniMax, one of the "Six Little Tigers" of AI, had 29.77 million monthly active users as of December 2024, according to the AI Product Rankings.

Lovey Dovey Dialogue 1

Lovey Dovey Dialogue 2

Stronger role-playing, emotional perception, and emotional expression at the voice level are a key step toward richer forms of AI-human interaction and a greater sense of immersion. The market that emotional interaction can open up is, in turn, pushing the technology toward "anthropomorphism."
