Voice AI That Reasons in Real-Time Just Became an API Call
Three new voice models from OpenAI on May 7. Three different jobs at three different prices, spread across an order of magnitude.
The headline figure most reporting is quoting — RM0.15 a minute — is the cheapest of the three. It also cannot run a customer service call centre on its own.
Most Malaysian businesses haven't noticed the distinction yet.
OpenAI released GPT-Realtime-2, GPT-Realtime-Translate, and GPT-Realtime-Whisper. They sound like variations of the same thing. They are not.
GPT-Realtime-2 is the reasoning agent. GPT-5-class intelligence inside a live voice loop — holds context across a 128K window, handles interruptions, calls tools, manages a full conversation. Priced at USD32 per million audio input tokens and USD64 per million output tokens, which works out to roughly RM1 to RM2 for a four-minute call once you account for system prompts and conversation context.
GPT-Realtime-Translate is a continuous live translator. Speech in language A comes out as speech in language B, from 70+ input languages into 13 output languages. It does not reason, does not answer questions, does not manage a conversation. OpenAI's own documentation is explicit: use Translate when you want to translate what a human says, use Realtime-2 when you want an assistant. Priced at USD0.034 per minute (roughly RM0.15).
GPT-Realtime-Whisper is streaming speech-to-text. Transcription only. USD0.017 per minute.
Three jobs. Three prices. Mixing them up changes the business case by an order of magnitude.
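To put the three prices on one scale, here is a minimal sketch that converts the per-minute prices into the cost of a four-minute call. The exchange rate of 4.4 is an assumption, chosen because it is what USD0.034 ≈ RM0.15 implies. The Realtime-2 range is the per-call estimate quoted above, not something derived from token counts.

```python
# Assumption: exchange rate implied by USD0.034 being roughly RM0.15.
USD_TO_MYR = 4.4

# Per-minute prices quoted above, in USD.
PER_MINUTE_USD = {
    "GPT-Realtime-Translate": 0.034,
    "GPT-Realtime-Whisper": 0.017,
}

# Per-call estimate for Realtime-2 quoted above (RM, four-minute call).
REALTIME_2_CALL_MYR = (1.0, 2.0)


def per_call_myr(usd_per_minute: float, minutes: float = 4.0) -> float:
    """Cost in ringgit of one call on a model that bills per minute."""
    return usd_per_minute * minutes * USD_TO_MYR


for model, price in PER_MINUTE_USD.items():
    print(f"{model}: RM{per_call_myr(price):.2f} per four-minute call")

low, high = REALTIME_2_CALL_MYR
print(f"GPT-Realtime-2: RM{low:.2f} to RM{high:.2f} per four-minute call")
```

Changing `minutes` or `USD_TO_MYR` is the quickest way to re-run the comparison for your own call lengths and exchange rate.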
Take a Penang-based support centre serving an E&E manufacturer's Malaysian and Taiwanese clients. A 100-seat centre at RM3,800 per agent per month costs RM380,000 a month before overtime, management, or training. Two honest paths exist, not one.
Path A: keep the agents, add Translate. English-speaking agents serve Mandarin and Tamil customers through a real-time translation layer at RM0.15/min. You still pay the agents, but you stop hiring premium-priced multilingual ones. Savings come from the hiring mix, not headcount reduction. For a centre that needs 30% multilingual hires at a 25% salary premium, the freed-up budget is meaningful but not transformative: 30 agents at a RM950 premium each is roughly RM28,500 a month, before Translate usage fees.
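A back-of-envelope check of Path A, as a sketch rather than a forecast. It assumes RM3,800 is base pay with the 25% premium paid on top. It also takes the number of translated minutes per month as an input, since the article gives no figure for it.

```python
AGENTS = 100
BASE_SALARY_MYR = 3_800        # assumption: base pay, premium is paid on top
MULTILINGUAL_SHARE = 0.30      # share of seats that need multilingual hires
SALARY_PREMIUM = 0.25          # premium paid for multilingual agents
TRANSLATE_MYR_PER_MIN = 0.15   # Translate price quoted above


def monthly_premium() -> float:
    """Salary premium currently paid for multilingual hires, per month."""
    multilingual_agents = AGENTS * MULTILINGUAL_SHARE
    return multilingual_agents * BASE_SALARY_MYR * SALARY_PREMIUM


def net_saving(translated_minutes: float) -> float:
    """Premium saved minus Translate fees for the given monthly minutes."""
    return monthly_premium() - translated_minutes * TRANSLATE_MYR_PER_MIN


def break_even_minutes() -> float:
    """Translated minutes per month at which fees cancel the saved premium."""
    return monthly_premium() / TRANSLATE_MYR_PER_MIN


print(f"Premium saved: RM{monthly_premium():,.0f} a month")
print(f"Break-even: {break_even_minutes():,.0f} translated minutes a month")
# 50,000 minutes is a hypothetical volume, used only to show the call.
print(f"Net at 50,000 minutes: RM{net_saving(50_000):,.0f}")
```

Under these assumptions the premium is about RM28,500 a month, and Translate fees cancel it at roughly 190,000 translated minutes. The real saving depends on how many minutes actually need translation.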
Path B: replace Tier-1 calls with Realtime-2. At 60 calls per agent per day and 22 working days, the full 100-seat workload is 132,000 calls a month. At RM1 to RM2 per four-minute call, the API bill lands between roughly RM132,000 and RM264,000 a month, against RM380,000 in wages. The maths also assumes Realtime-2 closes every call without escalation, which it won't. The realistic deployment routes the predictable slice to Realtime-2 and keeps a smaller human team for the rest.
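The Path B arithmetic can be sketched as below. The call volume, wage bill, and per-call range come from the figures above. The escalation rate is a hypothetical input, and the model assumes an escalated call costs the current per-call wage cost on top of the API fee.

```python
AGENTS = 100
CALLS_PER_AGENT_PER_DAY = 60
WORKING_DAYS = 22
MONTHLY_WAGES_MYR = 380_000
API_COST_PER_CALL_MYR = (1.0, 2.0)  # low and high per-call estimates


def monthly_calls() -> int:
    """Total calls handled by the centre in a month."""
    return AGENTS * CALLS_PER_AGENT_PER_DAY * WORKING_DAYS


def api_bill(cost_per_call: float, calls: int) -> float:
    """API cost if every call is closed by the model with no escalation."""
    return cost_per_call * calls


def blended_cost(cost_per_call: float, escalation_rate: float) -> float:
    """Monthly cost when every call starts on the API and some escalate.

    escalation_rate is hypothetical. Escalated calls are charged the
    current per-call wage cost in addition to the API fee.
    """
    calls = monthly_calls()
    human_cost_per_call = MONTHLY_WAGES_MYR / calls
    return calls * (cost_per_call + escalation_rate * human_cost_per_call)


calls = monthly_calls()
low, high = (api_bill(c, calls) for c in API_COST_PER_CALL_MYR)
print(f"{calls:,} calls a month")
print(f"API bill with no escalation: RM{low:,.0f} to RM{high:,.0f}")
print(f"RM1.50 a call, 30% escalated: RM{blended_cost(1.5, 0.30):,.0f}")
```

The blended figure is the one to watch. It shows how fast the gap to the RM380,000 wage bill narrows as the escalation rate rises.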
The two paths solve different problems. Path A removes a hiring constraint while keeping human judgment. Path B removes the human from the predictable slice of calls. Most call centres will end up running both — and quietly running Whisper across every call for QA and analytics on top.
The capability that changed on May 7 isn't translation. That existed. It's reasoning inside the audio loop. GPT-Realtime-2 doesn't follow a decision tree. It holds context across a full conversation, handles mid-sentence topic changes, and carries a 128K token context window. The gap between "automated voice response" and "the system actually understood what I was asking" just closed — but at Realtime-2 prices, not Translate prices.
Who this actually matters to:
→ Malaysian GBS operators and BPO companies: your multilingual premium just got a competitor at RM0.15/min for the translation layer alone; the question is whether your buyers know yet, and how to position the human-judgment value you still uniquely offer
→ Malaysian banks with phone banking lines (Maybank, CIMB, RHB, AmBank): Realtime-2's reasoning handles the mid-complexity escalation tier where most of your call centre cost is concentrated; that's the right model for autonomous resolution, not Translate, and the per-call economics work above a certain call volume
→ Malaysian software developers and product teams: adding a voice interface to an existing product dropped from a six-figure integration project to a documented API; pick the right model for the job, because on the per-call figures above Realtime-2 costs roughly two to three times what Translate does per minute
→ Malaysian F&B operators, clinics, and retailers with booking or inquiry lines: a kopitiam can now afford a Bahasa Malaysia and Mandarin reservation agent on Realtime-2 for the cost of a few teh tarik a day; the barrier is awareness and integration, not budget
MULTIPLE PERSPECTIVES
The structural shift is the bundling, not the headline price. Before May 7, building a voice agent typically meant stitching Whisper for transcription, GPT-4 or Claude for reasoning, and ElevenLabs for speech synthesis into a fragile pipeline with three vendors and three failure modes. OpenAI now does all of it inside one model, with reasoning happening between transcription and synthesis rather than between vendor APIs. That collapses integration cost, latency, and engineering surface area simultaneously. The price compression matters; the pipeline compression matters more.
The specific Malaysian advantage isn't translation pricing — translation is global and the price will be matched. It's the language combination. No major consumer market mixes Bahasa Malaysia, English, Mandarin, and Tamil at scale the way Malaysia does. A Malaysian developer building voice products for Malaysian consumers is solving a harder problem than developers in most markets. That complexity, when handled well, becomes a product edge for 12 to 18 months before global voice AI catches up with local nuance and code-switching.
The counterintuitive risk: the companies most exposed aren't the large call centres. They have budget and awareness to test, fail, and iterate. The vulnerable operations are mid-sized businesses — a regional logistics operator handling 300 daily calls with 8 staff, a clinic chain with appointment lines — that assumed voice AI wasn't priced for them. As of May 7, it is. Whether their competitors find out first is the variable.
If your business handles significant customer interaction by phone, ask two questions: which slice of those calls is predictable enough that Realtime-2 could resolve it end-to-end, and which slice needs a human whose judgment you'd rather augment with Translate than replace?
If the predictable slice is above 60%, the pilot economics on Realtime-2 are straightforward. Test against your 20 most common call types. Compare the API cost per call against your current per-call staff cost. Account for handoff to humans on the hard cases.
If it's below 60%, you don't have a voice AI problem. You have a workflow problem. Solve that first. In the meantime, layer Translate over your existing human agents to remove the multilingual hiring constraint while you wait. The API will still be there.
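The 60% rule of thumb can be sketched as a small triage over a call log. This is an illustration only. The call-type labels, the `predictable` set, and the choice to treat exactly 60% as "below" are all assumptions, and deciding which call types count as predictable remains a human judgment.

```python
from collections import Counter

PREDICTABLE_THRESHOLD = 0.60  # the rule of thumb above
TOP_N = 20                    # "your 20 most common call types"


def predictable_share(call_types, predictable, top_n=TOP_N):
    """Share of all calls in the top_n call types that are marked predictable."""
    counts = Counter(call_types)
    total = sum(counts.values())
    if total == 0:
        return 0.0
    hits = sum(n for kind, n in counts.most_common(top_n) if kind in predictable)
    return hits / total


def recommend(share):
    """Map the predictable share to the next step suggested above."""
    if share > PREDICTABLE_THRESHOLD:
        return "pilot Realtime-2 on the predictable call types"
    return "fix the workflow first; layer Translate over human agents"


# Hypothetical call log: one label per call.
log = ["balance"] * 50 + ["card_block"] * 20 + ["dispute"] * 30
share = predictable_share(log, predictable={"balance", "card_block"})
print(f"Predictable share: {share:.0%}")
print(recommend(share))
```

In practice the log would come from your telephony or CRM export, and the pilot would then compare API cost per call against current staff cost per call for the predictable types only.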
Voice AI just split into three different jobs at three different price points. The advantage goes to whoever picks the right one for the right use case before their competitor figures out the difference.

— Tony
Sharing what I learn building real things with AI.