Google launches Gemini 3.1 Flash TTS: audio tags make AI voiceovers more expressive, with support for 70+ languages and a free trial in Google AI Studio


Google is turning its attention to voice: Gemini 3.1 Flash TTS officially launched on the 15th. Its new "audio tags" feature is meant to let developers direct every detail of an AI voice with text instructions, much as a film director would.

According to Google's official announcement, Gemini 3.1 Flash TTS is rolling out through three channels starting today: developers can try it first through the Gemini API and Google AI Studio; enterprise users can access it through Vertex AI; and Google Workspace and personal account users can use it directly in Google Vids. The launch also adds 16 new languages.

An Elo score of 1,211

In terms of quality, Google cites third-party data to support its claims: on the Artificial Analysis TTS leaderboard, which is built from thousands of blind human preference tests, 3.1 Flash TTS achieved an Elo score of 1,211 and was placed in the "most attractive quadrant," indicating that it combines high-quality speech generation with low cost. It supports more than 70 languages and natively handles multi-speaker dialogue.
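To give a sense of what an Elo figure means in a blind preference test, here is a minimal sketch. It assumes the leaderboard follows the standard Elo logistic model with a 400-point scale, which the announcement does not confirm. The rival rating of 1,111 is purely hypothetical and chosen only to show a 100-point gap.

```python
def elo_expected_score(rating_a: float, rating_b: float, scale: float = 400.0) -> float:
    """Expected share of head-to-head wins for A over B under the standard Elo model."""
    return 1.0 / (1.0 + 10.0 ** ((rating_b - rating_a) / scale))


flash_tts_rating = 1211.0            # figure quoted in the article
hypothetical_rival_rating = 1111.0   # assumption, for illustration only

preference_rate = elo_expected_score(flash_tts_rating, hypothetical_rival_rating)
print(f"{preference_rate:.3f}")  # prints 0.640
```

Under these assumptions, a 100-point lead corresponds to being preferred in roughly 64% of pairwise comparisons, and two equally rated models would split preferences evenly.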

Audio tags: handing the director's chair to developers

The most important technical update is "Audio Tags," which let developers embed natural-language commands directly in the text input, giving fine-grained control over the AI voice instead of relying on the model to guess the tone. Google breaks the experience down into three layers:

Scene direction: Developers define the setting and give specific dialogue instructions, so that different characters stay in character across multiple turns of dialogue, with natural transitions in tone.

Speaker-level precision: Character voices are shaped through unique Audio Profiles, while Director's Notes dynamically adjust pacing, tone, and accent; Inline Tags let a speaker change expression partway through a sentence.

Seamless export: Once the performance parameters are settled, they can be exported directly as Gemini API code, keeping the voice consistent across projects and platforms.
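The layers described above can be pictured as pieces of a single text prompt. The sketch below is only an illustration of that idea: the `Speaker`, `Line`, and `build_tts_prompt` names, the prompt layout, and the bracketed inline tag syntax (such as `[whispers]`) are all assumptions made for this example and are not taken from Google's documentation. It builds the text only and makes no API call.

```python
from dataclasses import dataclass


@dataclass
class Speaker:
    name: str             # label used in the transcript
    audio_profile: str    # who the character is and how the voice sounds
    directors_notes: str  # pacing, tone, and accent guidance


@dataclass
class Line:
    speaker: str
    text: str  # may contain inline tags; the "[...]" syntax is hypothetical


def build_tts_prompt(scene: str, speakers: list[Speaker], lines: list[Line]) -> str:
    """Assemble scene, per-speaker direction, and transcript into one prompt string."""
    known = {s.name for s in speakers}
    for line in lines:
        if line.speaker not in known:
            raise ValueError(f"unknown speaker: {line.speaker}")

    parts = [f"Scene: {scene}", ""]
    for s in speakers:
        parts.append(f"Audio profile for {s.name}: {s.audio_profile}")
        parts.append(f"Director's notes for {s.name}: {s.directors_notes}")
        parts.append("")
    parts.append("Transcript:")
    for line in lines:
        parts.append(f"{line.speaker}: {line.text}")
    return "\n".join(parts)


prompt = build_tts_prompt(
    scene="A quiet library just before closing time.",
    speakers=[
        Speaker("Ana", "A calm librarian in her fifties.", "Slow pace, warm tone."),
        Speaker("Ben", "An anxious student.", "Fast pace, slightly breathless."),
    ],
    lines=[
        Line("Ana", "[whispers] We close in five minutes."),
        Line("Ben", "I only need one more page, [sighs] I promise."),
    ],
)
print(prompt)
```

Keeping the direction in plain data structures like this is one way to reuse the same speaker definitions across projects; the resulting string would then be supplied as the text input of a TTS request.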

Early adopters such as StyleUAI, HeyGen, Invideo AI, and Sierra have given positive feedback, noting that the technology can transform ordinary text into emotionally resonant audio performances.

SynthID watermarking tags all AI-generated audio

All audio generated by Gemini 3.1 Flash TTS also carries a built-in SynthID watermark: a subtle, imperceptible marker woven directly into the audio waveform that can be reliably detected, helping to identify AI-generated content and curb the spread of misinformation. It is part of Google's ongoing work on provenance for AI-generated content.

Overall, the positioning of 3.1 Flash TTS is clear: it fills in the voice piece of the Gemini ecosystem with a combination of high quality, low cost, and strong controllability. Audio tags bring the director-style voice control that was previously available only in professional recording studios to developers worldwide through an API.
