November 13, 2025 • Technology
For years, we treated AI as a batch-first problem. You'd send a request, wait three to five seconds, and get a response. That was fine for chatbots answering questions or image generators creating assets. But in November 2025, something shifted. ElevenLabs released Scribe v2 Realtime with sub-150-millisecond latency across 90 languages. At the same time, researchers at Google DeepMind published SIMA 2, an AI agent that can think and act in real-time inside 3D environments. These aren't isolated releases. They're symptoms of a fundamental change in how AI systems are being built and deployed.
I spent the last month testing these tools in production scenarios, and the difference is stark. When you can get a transcription, an agent decision, or a voice response in 150 milliseconds instead of 500 or 1000, the entire user experience transforms. The AI stops feeling like a service you call and starts feeling like something you collaborate with.
Latency isn't just about speed. It's about the interaction model. When latency exceeds about 300 milliseconds, humans perceive a noticeable delay, and the sense of responsiveness breaks. Below that threshold, conversational AI starts to feel real-time. Below 150 milliseconds, it feels instant.
Consider a voice-powered customer support agent. At 2-second latency (typical last year), the conversation feels stilted: the human speaks, waits, the agent responds, then there's another pause. The rhythm is off. At 150 milliseconds, the agent can listen and respond within a single breath. That's the difference between a tool and a conversation partner.
The same applies to embodied AI. SIMA 2 can now navigate a 3D environment and respond to instructions in real-time because it can process visual input, reason about the scene, and execute actions all within a tight loop. At 500-millisecond latency, the agent would feel sluggish, making decisions based on stale information. At 150ms, it can actually feel present.
ElevenLabs' Scribe v2 Realtime hits 150ms latency through a combination of architectural decisions. First, predictive transcription. Instead of waiting for the audio stream to fully arrive, the model begins predicting the next word based on partial audio. It's taking intelligent guesses about what comes next, much like autocomplete in a text editor.
Second, streaming support. The API accepts audio in chunks and returns transcriptions as they become available. There's no batching delay, no waiting for a full sentence or paragraph. You stream, it transcribes, continuously.
Third, built-in voice activity detection. The system knows when the human has stopped speaking, so it can finalize a transcript without waiting for a timeout. That alone eliminates hundreds of milliseconds of wasted time.
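To make the streaming model concrete, here's a minimal sketch of what consuming a realtime transcription service looks like from the client side. The WebSocket URL and the partial/final message schema are hypothetical placeholders, not the actual Scribe v2 Realtime API; what matters is the shape of the loop: send small audio chunks as they're captured, render partial results immediately, and treat VAD-finalized segments as stable.

```python
# Minimal sketch of a streaming transcription client.
# The endpoint URL and message schema are hypothetical stand-ins,
# not the actual Scribe v2 Realtime API; check the vendor docs for the real contract.
import asyncio
import json
import websockets

async def stream_audio(audio_chunks):
    """audio_chunks: an async iterator yielding ~20 ms PCM chunks (bytes)."""
    async with websockets.connect("wss://example.com/v1/transcribe/stream") as ws:

        async def sender():
            async for chunk in audio_chunks:
                await ws.send(chunk)  # raw audio goes up as soon as it's captured
            await ws.send(json.dumps({"type": "end"}))  # signal end of stream

        async def receiver():
            async for message in ws:
                event = json.loads(message)
                if event["type"] == "partial":
                    # Provisional text; may be revised as more audio arrives
                    print("partial:", event["text"], end="\r")
                elif event["type"] == "final":
                    # Voice activity detection closed the segment; text is stable
                    print("\nfinal:  ", event["text"])

        await asyncio.gather(sender(), receiver())
```

The design point is that nothing waits for anything else: audio goes up chunk by chunk, partial text comes back the moment the model has a guess, and the final transcript lands as soon as the speaker pauses.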
The model itself runs on edge hardware in many cases, and enterprise customers can deploy Scribe v2 on their own infrastructure. There's no round trip to a distant data center; the compute happens close to the audio source.
What surprised me while testing this was the accuracy trade-off, or rather the lack of one. Usually, systems this fast sacrifice accuracy for speed. Scribe v2 maintains 93.5% accuracy across 30 languages while hitting 150ms. That's not a compromise. That's a genuine breakthrough.
SIMA 2 from Google DeepMind shows what real-time latency enables for embodied AI. The agent integrates Gemini's reasoning with the ability to act in a virtual environment. It can navigate, follow instructions, and adapt in real-time.
In one test, researchers showed SIMA 2 a newly generated 3D world it had never seen. The agent oriented itself, understood what the human was asking for, and executed a multi-step plan to accomplish the goal. All of this happened with low-latency decision-making. If each decision took a second, the agent would fail: it would lose track of context, and the task would time out. At 150ms per decision cycle, it works.
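SIMA 2's internals aren't public, so treat the following as a generic sketch of a tight perceive-reason-act loop rather than DeepMind's implementation. The environment interface, policy, and 150ms budget are all illustrative; the point is that a slow decision step leaves the agent acting on stale observations.

```python
# Generic sketch of a real-time perceive-reason-act loop with a fixed decision budget.
# env and policy are illustrative interfaces, not SIMA 2's actual architecture.
import time

DECISION_BUDGET_S = 0.150  # target: one decision roughly every 150 ms

def run_agent(env, policy, steps=1000):
    for _ in range(steps):
        start = time.monotonic()
        observation = env.observe()          # grab the freshest frame / state
        action = policy.decide(observation)  # must finish well inside the budget
        env.act(action)

        elapsed = time.monotonic() - start
        if elapsed > DECISION_BUDGET_S:
            # The world kept moving while we were thinking; the action may be stale.
            print(f"budget exceeded: {elapsed * 1000:.0f} ms")
        else:
            time.sleep(DECISION_BUDGET_S - elapsed)  # hold a steady decision cadence
```

At a 1-second decision time, the loop above overruns its budget on every step; at 150ms it keeps pace with the environment.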
This has immediate applications in robotics. A robot in a warehouse needs to pick an item, navigate around obstacles, and hand it to a human, all while responding to spoken corrections. That requires latency measured in tens to hundreds of milliseconds, not seconds.
The same principle applies to voice assistants. Siri is reportedly being rebuilt around a 1.2-trillion-parameter Gemini model. When you ask it to do something multi-step ("Check my calendar, find a gap on Thursday, and book a call with Sarah"), the assistant needs to reason about each step and give you feedback or ask for clarification in real-time. Latency below 200ms makes that feel natural.
One pattern I'm seeing across November's releases is that voice is becoming the primary interface for AI. It's not that text is disappearing. It's that voice handles real-time, conversational interactions, while text handles batch work or detailed composition.
ElevenLabs is powering voice agents across industries. A recruiter can have a voice conversation with an AI that screens candidates in real-time. A therapist can use an AI that listens and responds with therapeutic prompts. A software engineer can pair-program with an AI that discusses design decisions aloud.
The key enabler is latency. At 2-second transcription latency, none of these feel natural. At 150ms, they do.
I tested this with a voice-powered note-taking workflow. I speak freely, and the AI transcribes, summarizes, and asks clarifying questions in real-time. At 500ms latency, it's jarring. I forget what I was saying. At 150ms, it's additive. The AI feels like a note-taking partner, not a bottleneck.
Here's where the November infrastructure announcements become relevant. Real-time AI at scale demands compute power and distribution. You can't run Scribe v2 Realtime on a single server in Ohio and serve the world. Latency would be too high.
This is why OpenAI committed $38 billion to AWS for compute infrastructure. It's why Google is investing $40 billion in Texas data centers. It's why Meta is spending $600 billion on AI infrastructure by 2028. Real-time systems need low-latency edge deployment or distributed inference.
Google's announcement of the Gemini 2.5 Computer Use model is relevant here. This model can interact with web browsers and mobile UIs in real-time. It can click, type, and scroll based on what it sees. That requires latency measured in hundreds of milliseconds, not seconds, or the interaction falls apart.
From a developer perspective, this means the infrastructure game is shifting. It's not enough to have a powerful model. You need to deploy it close to users, optimize for inference speed, and handle streaming requests. The major cloud and AI providers are racing to control this infrastructure because whoever can deliver the lowest-latency experience wins.
Real-time systems also consume more energy. An AI system that processes requests in batches can afford to wait and optimize. A real-time system must process immediately. It can't wait for a GPU to fill up with work. It has to start processing the moment a request arrives.
This means more inference servers running, even when lightly loaded. It means more redundancy to guarantee latency SLAs. It means higher power consumption per request.
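A rough back-of-the-envelope shows why. The forward-pass times below are invented for illustration, not measured from any real system, but the shape is typical: large batches amortize the GPU's fixed costs, and a tight latency SLA forces you to give that amortization up.

```python
# Illustrative only: hypothetical GPU forward-pass times per batch size.
batch_forward_ms = {1: 20, 8: 30, 32: 60}

for batch, ms in batch_forward_ms.items():
    throughput = batch / (ms / 1000)  # requests per second per GPU
    print(f"batch={batch:>2}  ~{throughput:>4.0f} req/s/GPU  compute time ~{ms} ms")

# With these made-up numbers, batching to 32 gives roughly 10x the throughput per GPU.
# A real-time system can't sit around waiting for 31 other requests to arrive,
# so it runs small batches and pays for the idle capacity.
```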
That's one reason companies are investing so heavily in energy infrastructure alongside compute. Google's Texas investment includes 6,200 megawatts of new energy capacity. That's not just for training models. A significant portion goes to running inference at latencies that matter for real-time applications.
If you're building AI-powered applications today, latency should be a first-class requirement. Here's how I think about it:
For batch work (data analysis, report generation, content creation), aim for sub-5-second latency. That's fast enough to feel synchronous. For streaming or conversational work, target sub-500ms. For real-time interactions (voice, robotics, live UI updates), you want sub-200ms if possible.
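I find it useful to write these down as explicit targets, the same way I'd capture any other requirement. The thresholds below are the rules of thumb from this article, not an industry standard.

```python
# Rough latency targets per interaction type; thresholds are this article's rules of thumb.
LATENCY_TARGETS_MS = {
    "batch": 5000,          # data analysis, report generation, content creation
    "conversational": 500,  # chat, streaming text
    "real_time": 200,       # voice, robotics, live UI updates
}

def meets_target(kind: str, measured_p95_ms: float) -> bool:
    """Check a measured p95 latency against the target for this interaction type."""
    return measured_p95_ms <= LATENCY_TARGETS_MS[kind]
```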
The tools available now support these targets. Using Scribe v2 for speech-to-text gives you 150ms latency out of the box. Using Gemini 2.5 Computer Use for UI interactions gives you sub-500ms latency for most web tasks. Using SIMA 2 for embodied reasoning gives you real-time decision-making.
The trade-off is cost and complexity. Achieving 150ms latency across the world requires infrastructure spending. But if your use case demands real-time interaction, it's worth it. The improvement in user experience is dramatic.
November 2025 marks a transition. The AI industry spent the last two years focused on model capability. GPT-4, Claude 3, and Gemini 2 proved that scale and training can unlock reasoning, coding, and analysis capabilities. That race isn't over, but it has matured.
The new race is responsiveness. Can you take that capable model and make it fast enough to interact with humans and environments in real-time? Can you deploy it across enough infrastructure to serve billions of requests at 150ms latency? Can you do it cost-effectively?
This shift changes what matters for developers. Before, you picked the most capable model, regardless of latency. Now you're trading capability against latency and increasingly picking systems optimized for responsiveness. Smaller models trained for specific tasks are becoming competitive with larger ones because they hit latency targets that larger models can't.
This is also why Anthropic and DeepSeek are getting attention. Claude 4 and DeepSeek R1 are capable, but they're also being optimized for inference speed. The models are designed to be deployed at latency and cost constraints that matter in production.
The most visible impact of low-latency AI in the next year will be voice agents. Every company is building them. Every developer I know is experimenting with them.
This is viable now because of Scribe v2 and similar real-time transcription systems. You can build a voice interface that actually feels responsive. The agent listens, understands in 150ms, reasons about the request, and responds. From the user's perspective, it's instant.
Early voice assistants felt slow and clunky. They were transcribing at 2-3 seconds of latency, reasoning for another 2-3 seconds, then generating speech. By the time you heard a response, it felt disconnected from the conversation.
Voice agents built on Scribe v2 and real-time reasoning models feel present. That changes user adoption and trust.
150ms is the current frontier for general-purpose systems. But some applications need to go faster. Robotics applications in dynamic environments often need sub-100ms latency. Video understanding that requires real-time adjustments (like a robot catching an object or a surgical assistant guiding a procedure) needs sub-50ms latency.
We're not there yet for general-purpose AI. But the research is accelerating. Microsoft's work on MMCTAgent (multi-modal critical thinking) for video reasoning is pushing toward sub-200ms on complex video understanding tasks. That's the next frontier after Scribe v2.
For most developers today, 150ms is sufficient. It unblocks real-time voice, responsive agents, and interactive systems. The next frontier is interesting but not urgent unless you're building robots or real-time visual systems.
If you're evaluating AI systems for a new project, check the latency spec, not just the capability benchmarks. A model that's 2% more capable but twice as slow might be the wrong choice for your use case.
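Measuring that yourself takes a few minutes. Here's a sketch that works against any candidate system; `call_model` is a placeholder for whatever client call the provider actually exposes, and the percentile math is the part worth keeping, since tail latency is what users feel.

```python
# Measure end-to-end latency of a candidate model before committing to it.
# call_model is a placeholder for your provider's client call.
import statistics
import time

def measure_latency(call_model, prompt: str, n: int = 50) -> dict:
    samples_ms = []
    for _ in range(n):
        start = time.monotonic()
        call_model(prompt)  # one full round trip
        samples_ms.append((time.monotonic() - start) * 1000)
    samples_ms.sort()
    return {
        "p50_ms": statistics.median(samples_ms),
        "p95_ms": samples_ms[int(0.95 * (n - 1))],  # tail latency is what users notice
        "max_ms": samples_ms[-1],
    }
```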
If you're building voice-first interfaces, real-time transcription changes everything. The experience is fundamentally better than batch transcription at any latency above 500ms.
If you're interested in AI agents, understand that responsiveness is now a differentiator. SIMA 2 works because it can respond in real-time. An agent that pauses for 3 seconds between decisions feels broken, even if the decisions are good.
The AI industry in November 2025 is pivoting from a "more capable" game to a "more responsive" game. That's where the competitive advantage lies now.