According to Zhidongxi on August 29, early this morning OpenAI released GPT-RealTime, a speech-to-speech model built for developers, and simultaneously updated the Realtime API with remote MCP server support, image input, and SIP (Session Initiation Protocol) phone calling.
OpenAI claims this is its most advanced speech-to-speech model to date: GPT-RealTime is better at following complex instructions, calling tools accurately, and producing more natural, expressive speech. The model can read out strings of letters and numbers naturally, switch languages seamlessly mid-sentence, and even pick up non-verbal signals such as laughter.
Today OpenAI also released two new voices, Cedar and Marin, which are available exclusively in the Realtime API.
In terms of pricing, the generally available Realtime API and the new GPT-RealTime model are open to all developers starting today. GPT-RealTime costs US$32 (approximately RMB 228) per million audio input tokens, US$0.40 (approximately RMB 2.85) per million cached input tokens, and US$64 (approximately RMB 456) per million audio output tokens, 20% lower than gpt-4o-realtime-preview.
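For a sense of scale, here is a back-of-the-envelope calculation with these rates; the token counts below are made up purely for illustration:

```python
# Per-million-token prices quoted above, in USD.
AUDIO_IN, CACHED_IN, AUDIO_OUT = 32.00, 0.40, 64.00

def session_cost(audio_in: int, cached_in: int, audio_out: int) -> float:
    """Return the USD cost of one session given token counts per category."""
    return (audio_in * AUDIO_IN
            + cached_in * CACHED_IN
            + audio_out * AUDIO_OUT) / 1_000_000

# A hypothetical session: 50k audio-in, 20k cached-in, 30k audio-out tokens.
print(f"${session_cost(50_000, 20_000, 30_000):.2f}")  # -> $3.53
```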
OpenAI has also added fine-grained control over conversation context, allowing developers to set intelligent token limits and truncate multiple turns at once, which significantly reduces the cost of long conversations.
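The announcement does not spell out the exact configuration fields, but a sketch of where such context controls would live in a session looks roughly like this. The endpoint, model name, and `session.update` event follow OpenAI's published Realtime API material, while the two context-control field names are hypothetical placeholders:

```python
import json
import os

import websocket  # pip install websocket-client

# Open a Realtime API WebSocket session. Endpoint and model name follow
# OpenAI's announcement; the beta also required an "OpenAI-Beta: realtime=v1"
# header, which the GA version may no longer need.
ws = websocket.create_connection(
    "wss://api.openai.com/v1/realtime?model=gpt-realtime",
    header=[f"Authorization: Bearer {os.environ['OPENAI_API_KEY']}"],
)

# Illustrative only: the article describes token limits and multi-turn
# truncation, but these two field names are hypothetical placeholders,
# not confirmed Realtime API fields.
ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "max_output_tokens": 1024,  # cap tokens generated per response
        "truncation": "auto",       # let the API trim old turns to fit
    },
}))
```

The later snippets in this article reuse this `ws` connection.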
Last October, OpenAI released a public beta of the Realtime API, and thousands of developers have used it and offered feedback since then.
Judging from the comments under OpenAI's posts on X, some users have high expectations for the new model, saying voice applications will become more interesting, while some developers report that the model's voices still sound quite robotic and that the older voice characters are only slightly more expressive.
Progress on speech models is accelerating both in China and abroad. Earlier this month, MiniMax, one of China's six leading large-model startups, released Speech 2.5, a speech generation model covering more than 40 languages. Earlier this year, the Doubao app updated its real-time voice call feature, now free to all users, which can mimic different voices and detect emotions. On the same day as OpenAI's release, Microsoft launched MAI-Voice-1, its first highly expressive, natural speech generation model, capable of generating audio with different interpretations of the same prompt.
01. Buying a home, booking tickets, scheduling doctor visits: talking to AI like a friend
On its blog, OpenAI showcased examples of voice assistants built with five partner companies.
The first is Zillow, a US real-estate information platform. OpenAI's new model can converse naturally with users, helping them filter listings by lifestyle needs or analyze purchase prices.
Second, as T-Mobile's mobile assistant, the AI can switch topics quickly and is not thrown off even when the user interrupts mid-sentence to start a new subject.
Third is the ticket marketplace StubHub, where OpenAI's new model can help users complete payments and guide them through problems encountered along the way.
Fourth is booking doctor appointments by phone: on Oscar Health's platform, the new model can help users confirm available appointment times, pre-visit instructions, and clinic addresses.
Finally, there is the insurance technology company Lemonade. When users run into insurance questions while buying a car, the AI assistant can walk them through the purchase, gather their requirements during the conversation, and then complete the transaction using the personal and bank-card information stored on file.
02. Capture laughter, seamlessly switch languages and adjust tone
OpenAI has improved GPT-RealTime's audio quality, its understanding of user input, and its instruction following.
For voice agents to hold sustained conversations, a model needs human-like intonation, emotion, and rhythm to create a pleasant conversational experience. The blog post notes that GPT-RealTime produces more natural, high-quality speech and can follow fine-grained instructions such as "speak quickly and professionally" or "speak sympathetically with a French accent."
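In practice, such instructions are sent as part of the session configuration. A minimal sketch, reusing the `ws` connection opened earlier and the `session.update` event from the beta Realtime API schema:

```python
import json

# `ws` is the Realtime API connection opened in the earlier sketch.
# Steer the voice with a fine-grained style instruction; the
# "instructions" field follows the beta Realtime API session schema.
ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "instructions": "Speak quickly and professionally.",
    },
}))
```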
In terms of understanding user input, GPT-RealTime can capture non-verbal cues such as laughter, switch languages mid-sentence, and adjust its tone accordingly. According to OpenAI's internal evaluations, the model is also more accurate at detecting alphanumeric sequences, such as phone numbers, in languages including Spanish, Chinese, Japanese, and French.
In the Big Bench Audio evaluation, GPT-RealTime achieved an accuracy of 82.8%, surpassing the model OpenAI released in December 2024. Big Bench Audio is a benchmark dataset for assessing the reasoning capabilities of language models that support audio input.
When building a speech-to-speech application, developers give the model a set of behavioral instructions covering how to speak, what to say in specific situations, and what to do or avoid. OpenAI focused on improving how faithfully the model follows these instructions, so that even a short instruction carries more signal.
On the MultiChallenge audio benchmark, which measures instruction-following accuracy, GPT-RealTime scored 30.5%, a significant improvement over the previous model's 20.6%. MultiChallenge assesses how well large models handle multi-turn conversations with humans; OpenAI selected a subset of test questions suitable for audio, converted them to speech using text-to-speech (TTS), and produced the audio version of the assessment.
To build a robust voice agent on a speech-to-speech model, the model must be able to call the right tools at the right time. OpenAI improved function calling along three dimensions: calling relevant functions, calling them at the right moment, and calling them with appropriate arguments. In the ComplexFuncBench audio evaluation, which measures function-calling performance, GPT-RealTime scored 66.5%, up from the 49.7% scored by the model OpenAI released in December 2024.
Additionally, OpenAI has improved asynchronous function calling: long-running calls no longer interrupt the conversational flow, and the model can keep talking naturally while waiting for results. This behavior is natively supported in GPT-RealTime, so developers don't need to update their code.
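To make the shape of this concrete, here is a rough sketch of registering one tool and answering the model's call for it, reusing the `ws` connection from earlier. The event names (`session.update`, `response.function_call_arguments.done`, `conversation.item.create`, `response.create`) follow the beta Realtime API schema, and `lookup_order` is a hypothetical function invented for illustration:

```python
import json

# `ws` is the Realtime API connection opened in the earlier sketch.
# Register a tool; `lookup_order` is a made-up example function.
ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "tools": [{
            "type": "function",
            "name": "lookup_order",
            "description": "Look up the status of an order by its ID.",
            "parameters": {
                "type": "object",
                "properties": {"order_id": {"type": "string"}},
                "required": ["order_id"],
            },
        }],
        "tool_choice": "auto",
    },
}))

# When the model decides to call the tool, return a stub result and ask
# it to continue speaking (event names per the beta schema).
while True:
    event = json.loads(ws.recv())
    if event.get("type") == "response.function_call_arguments.done":
        args = json.loads(event["arguments"])
        result = {"order_id": args["order_id"], "status": "shipped"}  # stub
        ws.send(json.dumps({
            "type": "conversation.item.create",
            "item": {
                "type": "function_call_output",
                "call_id": event["call_id"],
                "output": json.dumps(result),
            },
        }))
        ws.send(json.dumps({"type": "response.create"}))
```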
03. Preserving voice nuances and four new Realtime API features
Unlike the traditional multi-model pipeline that chains speech-to-text and text-to-speech, the Realtime API processes and generates audio directly through a single model and a single API, which reduces latency, preserves the nuances of speech, and makes responses more natural and expressive.
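Concretely, the client streams audio in and gets audio back over the same connection. A sketch, again reusing `ws` and the beta event names (`input_audio_buffer.append`, `input_audio_buffer.commit`, `response.create`); the audio file and its format are assumptions for illustration:

```python
import base64
import json

# `ws` is the Realtime API connection opened in the earlier sketch.
# Stream a chunk of raw 16-bit PCM audio (the beta default was 24 kHz
# mono) into the input buffer, then commit it as a finished user turn.
with open("question.pcm", "rb") as f:  # hypothetical audio file
    chunk = f.read()

ws.send(json.dumps({
    "type": "input_audio_buffer.append",
    "audio": base64.b64encode(chunk).decode("ascii"),
}))
ws.send(json.dumps({"type": "input_audio_buffer.commit"}))

# Ask for a spoken reply; in the beta schema the audio streams back as
# response.audio.delta events carrying base64 chunks.
ws.send(json.dumps({"type": "response.create"}))
```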
New features of the Realtime API include:
For remote MCP server support, developers can enable MCP in a session by passing the URL of a remote MCP server in the session configuration. Once connected, the API handles tool calls automatically, so developers don't need to wire up the integration by hand.
This setup lets developers point a session at a different MCP server and have it work immediately, as in the sketch below.
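A guess at what that configuration looks like, reusing `ws`; the tool shape (`type: "mcp"`, `server_label`, `server_url`) mirrors the MCP tool in OpenAI's Responses API, and the server URL is made up, so check the Realtime docs for the exact fields:

```python
import json

# `ws` is the Realtime API connection opened in the earlier sketch.
# Point the session at a remote MCP server; the API then discovers and
# calls that server's tools automatically. Field names mirror the
# Responses API's MCP tool and may differ here; the URL is hypothetical.
ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "tools": [{
            "type": "mcp",
            "server_label": "acme_tools",
            "server_url": "https://mcp.example.com/sse",
        }],
    },
}))
```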
In terms of image input, developers can add images, photos, and screenshots to a Realtime API session alongside audio or text. The model can then ground the conversation in what the user is actually seeing, letting users ask questions such as "What do you see?" or "Read the text in this screenshot."
Rather than treating images like a live video stream, the system works more like adding pictures to a conversation: the developer's app decides which images to share with the model and when, keeping control over what the model sees and when it responds.
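One plausible shape for attaching a screenshot, using the `conversation.item.create` event from the beta schema; the `input_image` content type exists in OpenAI's other APIs, and its exact form in the Realtime API is an assumption here:

```python
import base64
import json

# `ws` is the Realtime API connection opened in the earlier sketch.
# Attach a screenshot to the conversation as part of a user message.
# The "input_image" content shape follows OpenAI's other APIs and is an
# assumption for the Realtime API; the file name is made up.
with open("screenshot.png", "rb") as f:
    image_b64 = base64.b64encode(f.read()).decode("ascii")

ws.send(json.dumps({
    "type": "conversation.item.create",
    "item": {
        "type": "message",
        "role": "user",
        "content": [
            {"type": "input_text",
             "text": "Read the text in this screenshot."},
            {"type": "input_image",
             "image_url": f"data:image/png;base64,{image_b64}"},
        ],
    },
}))
ws.send(json.dumps({"type": "response.create"}))
```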
OpenAI has also added features to make the Realtime API easier to integrate, including Session Initiation Protocol (SIP) support and reusable prompts.
SIP support connects developers' applications directly to the public telephone network, PBX systems, office phones and other SIP endpoints through the Realtime API.
Reusable prompts allow developers to save and reuse prompts, including developer messages, tools, variables, and example user/assistant messages, across Realtime API sessions, following the same usage pattern as the Responses API.
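In the Responses API, a saved prompt is referenced by ID with optional variables, and a Realtime session presumably references one the same way. A sketch under that assumption, with a made-up prompt ID:

```python
import json

# `ws` is the Realtime API connection opened in the earlier sketch.
# Reference a saved, reusable prompt by ID. The {"id", "variables"} shape
# mirrors the Responses API's prompt object; its presence in the Realtime
# session config is an assumption, and the ID is made up.
ws.send(json.dumps({
    "type": "session.update",
    "session": {
        "prompt": {
            "id": "pmpt_example123",
            "variables": {"customer_name": "Ada"},
        },
    },
}))
```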
04. Conclusion: Building multi-layered safeguards to prevent model abuse
To prevent abuse of real-time voice conversations, the Realtime API includes multiple layers of safety mitigations. OpenAI runs active classifiers over Realtime API conversations, meaning sessions detected violating harmful-content guidelines can be terminated. Developers can also add their own safeguards using the Agents SDK.
At present, ultra-realistic real-time voice conversation has already shown a wide range of applications: Doubao's real-time voice calls and Baidu's newly launched digital employees both use voice as the primary way of interacting with users. In addition, OpenAI's new speech-to-speech model demonstrates stronger reasoning and more natural vocal expression, enabling it to handle complex multi-step requests and power AI agents across different domains.
This article comes from the WeChat public account "Zhidongxi" (ID: zhidxcom), author: Cheng Qian, editor: Li Shuiqing. It is published by 36Kr with authorization.