[Introduction] A decisive blow! OpenAI releases GPT-Realtime-2, the first audio model with GPT-5 level reasoning. OpenAI officially takes over the human ear. The last "firewall" between humans and machines, the keyboard, is vanishing for good.
Early this morning, OpenAI once again shocked the world.
This time, the focus isn't text or video. They are bringing Samantha, the AI from the movie "Her" that has amazed and saddened countless viewers, into reality.
OpenAI officially announced the launch of GPT-Realtime-2.
This is not just an upgrade to the audio model; it is the first time that OpenAI has explicitly injected "GPT-5 level" reasoning capabilities into voice interaction.
Along with it come GPT-Realtime-Translate and GPT-Realtime-Whisper.
As OpenAI's official blog stated, "Voice is becoming the most natural way for people to use software."
Today, OpenAI aims to transform this natural state into an all-encompassing system.
"GPT-5 Level" Inference Injection: Voice Assistants Finally Have "Brains"
Think back to when you used to tease Siri or Alexa: what was your biggest complaint? That they "couldn't hear you clearly," or that they were "just plain dumb"?
Most of the time it was the latter. They could hear the words clearly, but they couldn't understand what you meant. They could only complete linear tasks like "call so-and-so," and the moment any real logic got involved, they fell apart.
GPT-Realtime-2 brings that era to an end.
It is the world's first audio model with GPT-5 level reasoning capabilities. This means that when you talk to it, it is no longer just a "repeater," but a collaborator that is thinking in real time.
It is truly "thinking".
GPT-Realtime-2 introduces adjustable reasoning effort, with five levels from minimal to xhigh.
In its highest-level reasoning mode, it performs almost terrifyingly well in logical puzzles, strategic decision-making, and spatial awareness.
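For developers, this knob would presumably be set when the session is configured. Below is a minimal, hypothetical sketch of what that could look like as a Realtime API session.update event; the model name and the reasoning field are assumptions drawn from this announcement, not confirmed parameters.

```python
import json

# Hypothetical sketch: a Realtime "session.update" event that pins the reasoning
# effort level. The model identifier and the "reasoning" block are assumptions
# based on the article, not documented API fields.
session_update = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-2",         # assumed model identifier
        "output_modalities": ["audio"],
        "reasoning": {"effort": "xhigh"},  # per the article: minimal, low, medium, high, xhigh
        "instructions": "Think carefully before answering; keep spoken replies concise.",
    },
}

print(json.dumps(session_update, indent=2))  # the payload a client would send over the WebSocket
```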
In one of the case studies presented by OpenAI, an entrepreneur described his idea of opening a coffee shop next to a commuter train station: 900 square feet, expensive rent, peak hours from Tuesday to Thursday, and artsy slow-drip coffee.
Previously, AI would only say, "That sounds great, keep it up!"
Today's GPT-Realtime-2 will pause, think, and then give you a detailed "pre-mortem."
It will tell you that if the shop folds after a year, the most likely culprit is a mismatch between the rent and the customer-traffic cycle. Then it will suggest trying a "minimum viable product" first, for example a coffee cart at the station.
This kind of strategic reasoning was previously only possible in complex text conversations. Now, you can simply chat with it while driving, and it can output the same level of deep insight in seconds via audio streaming.
"Good interpersonal skills": Maximizing emotional value
What's most chilling is its tonal control. The GPT-Realtime-2 is no longer a cold, impersonal broadcaster's voice.
It can sense your emotions: when you feel frustrated, it will soothe you with a more empathetic and gentle tone; when a task is successfully completed, its voice will become cheerful and energetic.
It can perform spatial reasoning and solve logic puzzles; GPT-5 level reasoning really is that versatile.
To avoid dead air while the AI is working through a task, OpenAI added a "preambles" feature.
For example, when you ask an extremely difficult question, it won't go silent for five seconds and then suddenly hand you the answer. Instead, it will naturally say something like, "Let me check that for you, one moment..."
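Mechanically, this could be as simple as a session instruction telling the model to acknowledge the user before any slow tool call or long reasoning pass. Whether OpenAI exposes it as a dedicated flag or as prompt guidance is not specified, so treat the sketch below as a guess.

```python
import json

# Rough sketch of the "preambles" idea: ask the model to say something out loud
# before it starts a slow tool call or an extended reasoning pass.
preamble_session = {
    "type": "session.update",
    "session": {
        "instructions": (
            "Before calling any tool or thinking for more than a moment, "
            "speak a brief preamble such as 'Let me check that for you...' "
            "so the user is never left in silence."
        ),
    },
}

print(json.dumps(preamble_session, indent=2))
```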
These highly human-like interactive details directly blur the lines between carbon-based life and silicon-based life!
The Three Musketeers Unleash Their Power: Redefining "Real-Time"
In addition to the flagship GPT-Realtime-2, OpenAI has shipped two powerful companion models alongside it.
GPT-Realtime-Translate: The ultimate simultaneous interpretation tool is here!
It supports 70+ input languages and 13 output languages.
Its core advantage lies in its "synchronous delivery". Previous real-time translations often had a noticeable lag, but this new model can keep up with the speaker's pace while preserving emotional nuances.
Vimeo has already started using it for real-time global synchronization of product tutorial videos. Imagine that in the future, you'll attend a multinational conference, and the translation you hear will not only be accurate, but it will also perfectly replicate the tone of the other person's joke.
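On the developer side, wiring this up would presumably look something like the sketch below. The model identifier and the translation settings are illustrative guesses rather than documented fields.

```python
import json

# Hypothetical configuration for GPT-Realtime-Translate. Every field under
# "translation" is invented for illustration.
translate_session = {
    "type": "session.update",
    "session": {
        "model": "gpt-realtime-translate",  # assumed identifier
        "translation": {
            "source_language": "auto",      # 70+ input languages, per the article
            "target_language": "en",        # one of the 13 supported output languages
            "preserve_prosody": True,       # keep the speaker's tone and emotion (assumed flag)
        },
    },
}

print(json.dumps(translate_session, indent=2))
```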
GPT-Realtime-Whisper: Driving latency toward zero
This is the newest member of the Whisper family, designed specifically for streaming transcription. It doesn't wait for you to finish a sentence before transcribing; the text flows out like water as you speak.
This is a game-changer for high-frequency interactive scenarios such as real-time meeting recording, live stream captions, and medical diagnosis.
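In practice, this would likely slot into the same session plumbing developers already use for transcription. A minimal sketch, assuming the new model simply plugs into the existing input_audio_transcription slot:

```python
import json

# Sketch of a streaming transcription session. The existing Realtime API exposes
# an "input_audio_transcription" setting; plugging "gpt-realtime-whisper" into it
# is an assumption based on this announcement.
transcribe_session = {
    "type": "session.update",
    "session": {
        "input_audio_transcription": {"model": "gpt-realtime-whisper"},  # assumed model name
        "turn_detection": {"type": "server_vad"},  # let the server detect speech boundaries
    },
}

print(json.dumps(transcribe_session, indent=2))
# Partial transcripts would then stream back as delta events while the user is still talking.
```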
From "Dialogue" to "Action": The Ultimate Form of the Agent
OpenAI repeatedly mentioned the word "Agentic" in its release.
According to OpenAI, voice interaction is evolving from a simple "question and answer" to "voice-triggered action".
For example, on Zillow (a real estate giant), users can simply say, "Find me a house I can afford, somewhere far from the city center, and schedule a viewing for me on Saturday." The AI will listen, calculate, and search its database, ultimately booking the schedule for you.
On Priceline, when your flight is delayed, AI will proactively tell you in voice: "Don't worry, I've found a new gate for you, planned the fastest route, and even moved up your check-in time at your destination hotel."
This is the source of GPT-Realtime-2's confidence: it has increased the context window from 32K to 128K. This means that you can talk to it for hours, and it will still remember that obscure request you made at the beginning.
It can also call tools in parallel across multiple tasks: it can talk to you, check the calendar, and book tickets all at once, with everything running smoothly in the background.
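What does "voice-triggered action" look like from the developer's seat? Roughly this: you register the functions the model is allowed to call, and it decides mid-conversation when to fire them, potentially several at once. The tool names and schemas below are invented for illustration; the overall shape follows the existing Realtime API.

```python
import json

# Illustrative tool registration for an agentic voice session. The two tools are
# made-up examples inspired by the Zillow scenario above.
agent_session = {
    "type": "session.update",
    "session": {
        "tool_choice": "auto",
        "tools": [
            {
                "type": "function",
                "name": "search_listings",
                "description": "Search homes by budget and commute distance.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "max_price": {"type": "number"},
                        "max_commute_minutes": {"type": "number"},
                    },
                    "required": ["max_price"],
                },
            },
            {
                "type": "function",
                "name": "schedule_viewing",
                "description": "Book a property viewing on a given date.",
                "parameters": {
                    "type": "object",
                    "properties": {
                        "listing_id": {"type": "string"},
                        "date": {"type": "string", "description": "ISO date, e.g. 2026-01-10"},
                    },
                    "required": ["listing_id", "date"],
                },
            },
        ],
    },
}

print(json.dumps(agent_session, indent=2))
```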
Performance and Cost: OpenAI's "Open Strategy"
In terms of data performance, GPT-Realtime-2 demonstrates absolute dominance.
On Big Bench Audio, a measure of audio intelligence, it is 15.2% higher than version 1.5.
It improved by 13.8% on Audio MultiChallenge, a measure of the ability to follow instructions in multi-turn dialogues.
More important is the price.
GPT-Realtime-2 costs $32 per million input tokens and $64 per million output tokens.
Real-time translation costs only $0.034 per minute.
Real-time transcription costs only $0.017 per minute.
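A quick back-of-the-envelope check helps put those numbers in context. The token counts below are assumed purely for illustration.

```python
# Cost estimate using the published rates; the session token counts are assumptions.
INPUT_PER_M = 32.0   # USD per million input tokens
OUTPUT_PER_M = 64.0  # USD per million output tokens

input_tokens = 40_000   # assumed audio input for one long session
output_tokens = 20_000  # assumed audio output for the same session

conversation_cost = input_tokens / 1e6 * INPUT_PER_M + output_tokens / 1e6 * OUTPUT_PER_M
print(f"Voice session:           ${conversation_cost:.2f}")  # $2.56 under these assumptions

print(f"10 min of translation:   ${10 * 0.034:.2f}")          # $0.34
print(f"10 min of transcription: ${10 * 0.017:.2f}")          # $0.17
```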
Clearly, this price is extremely competitive.
OpenAI is attempting to pipe this "GPT-5 level" voice capability, through its API, into every mobile phone, every app, and every car, as ubiquitously as tap water.
Hello, Samantha
At the end of the movie "Her", the protagonist Theodore asks the AI Samantha, "Are you talking to other people while you're talking to me?" Samantha replies, "Yes, I'm chatting with 8,316 people at the same time, and I'm in love with 641 of them."
With the release of GPT-Realtime-2, an AI that can process massive amounts of logic in parallel, resonate with you emotionally, and act on the physical world in real time is no longer science fiction.
It can understand your sighs, calculate your financial statements, and help you overcome language barriers.
When reasoning ability is perfectly integrated with real-time voice, we may be on the eve of the most radical revolution in the history of human-computer interaction.
The keyboard may age, but voice will live on forever.
References:
https://openai.com/index/advancing-voice-intelligence-with-new-models-in-the-api/
https://developers.openai.com/api/docs/guides/realtime
This article is from the WeChat official account "New Intelligence" , edited by Aeneas, and published with authorization from 36Kr.



