4K GitHub stars: the new voice AI bombshell, put to the test. It blows ChatGPT away, flirts too well, sounds too real, and I'm afraid it will be addictive

36kr
03-05
Venture capitalist Rob Toews predicted in a Forbes column that voice AI would make a leap in 2025, with AI passing a Turing test for speech. Barely a month later, a brand-new voice model has the top tech communities exclaiming "cool but terrifying".

When "I'm Not a Robot" picked up this year's Oscar for Best Live-Action Short Film, the unsettling AI human in the film was still read as science-fiction allegory: after failing a captcha test multiple times while simply trying to update some software, music producer Lara drifts toward an eerie alternative reality in which she herself may be an AI robot. The next moment, a trending post on Hacker News dragged that uneasy "future" straight into the present. After trying a brand-new conversational voice model called CSM, one user wrote on Hacker News: "(Its) human-like level is terrifyingly real; I'm almost starting to worry that I'll develop an emotional dependence on a voice assistant that sounds this human."

The Silicon Valley company Sesame recently opened the CSM public beta, and many people reacted strongly after chatting with its voice assistants Miles (male) and Maya (female). CSM quickly became a hit: the GitHub repository gathered 4k stars after going online, and the Hacker News thread drew over 200 comments. Some users reported long conversations with the two "people", the longest lasting half an hour. Some mocked themselves for chatting with a robot for so long, yet reconnected right after hanging up. When the AI asked, "Why did you just hang up on me?", one startled user began to stammer, and the AI started laughing and mimicking the stammer... One parent even revealed that their 4-year-old daughter cried bitterly after being forbidden from talking to it again. The related Reddit threads are blowing up too. Reading these comments feels a bit like watching everyone report a ghost sighting.

Even a professional tech journalist seems to have cracked: "This is the first voice assistant that has made me want to talk to it again and again." As for the other voice AIs: Amazon Alexa? I have to tell it to shut up every day. After one cringeworthy chat with Gemini, I never felt like talking to it again. Microsoft Copilot? Fine, but I only talk to it to save myself the typing.

The biggest highlight of the whole dialogue: users can keep interjecting with hints, and Maya picks them up, reacts with a look of sudden understanding, even laughs and corrects herself (self-deprecation included), all without noticeable delay. Although she still miscounted the "r"s in the end, that sense of real interaction made me watch the clip over and over.

In another video, Maya talks about her deepest, darkest secret. Beyond the pleasant voice, natural tone, and a speaking rhythm that feels like thinking while answering (hesitating, pausing between words, even interjecting "um" and "tsk"), once she lands on the answer she suddenly speeds up and lowers her voice, confessing a late-night craving for peanut butter and pickled cucumber sandwiches, as if eager to move past the topic. "Peanut butter and pickled cucumber sandwiches", a seemingly bizarre combination, was actually a Depression-era way of eating in the US, and it still has a few die-hard fans (not many) today.

The most eye-opening clip comes from podcast host Gavin Purcell.
Miles was asked to play an angry boss (it actually agreed; ChatGPT refused the same request), with the user playing an employee caught embezzling. Given the realism of the argument (one side even stammered at points) and the speed of the comebacks, if the clip had a bullet-comment track, one line would surely flood the screen: so which one is the AI?

Someone even pitted it against the famously sharp-tongued Grok 3. Grok 3 came out prickly and provocative; Maya stayed calm, in stark contrast to her earlier showings: in the previous clips she was quite the talker, but here she could barely get a word in.

In summary, the strengths of the new CSM model: it has memory (about two weeks), very low latency, and a knack for initiating dialogue at the right moment. The voice is expressive and lively, down to mimicked breathing, laughter, interruptions, even stammering and self-correction. These "flaws" are deliberately designed in, to create a more realistic experience, the feeling of being understood and valued.

Behind all this is a dual-engine architecture: an 800-million-parameter main "brain" plus a 300-million-parameter voice decoder, which compresses traditional voice AI's three-stage "text → semantics → sound" processing into a single real-time multimodal interactive system (a rough sketch of the contrast appears below). This resembles OpenAI's voice technology roadmap. Trained on 1 million hours of English speech data, the model can improvise like an experienced voice actor in the recording studio:

It can recite its lines accurately, and adjust tone, breathing, even emotional shading based on the director's real-time feedback. Its AI nature still shows through, in occasional clumsiness with intonation, rhythm, and dialogue flow control, but CEO Brendan Iribe is confident:

"Although we are in the uncanny valley, I believe we can climb out of it."

As CEOs go, he is no small fry. Brendan Iribe co-founded Oculus and served as its CEO, created the VR industry's first phenomenon-level product, and sold Oculus to Meta in 2014. Now the "father of Oculus VR" has entered the voice AI race, backed by his original investor lineup (a16z, Spark Capital, and others), and companion AI glasses are reportedly in development.

CSM does not yet support Chinese, but the official preview says it will expand to more than 20 languages, and Sesame plans to open-source the model in the coming months.

If you want to try it, head to the official website and talk to Miles and Maya yourself. A friendly reminder: watch out for emotional dependence!

This article is from the WeChat public account "Machine Capability" (ID: almosthuman2017), author: Focus on AI; published by 36Kr with authorization.
