ElevenLabs' New v3 Model Adds Emotional Intelligence to AI Voices: Here's What Changes
ElevenLabs has released Eleven v3 (alpha), a text-to-speech model designed to generate voices with unprecedented emotional range and expressiveness across more than 70 languages, supporting features like audio tags for tone control and multi-speaker dialogue generation. The model represents a significant shift from speech that is merely intelligible toward what the company calls "text-to-performance," where AI voices can now whisper, sigh, laugh, and interrupt naturally.
What Makes Eleven v3 Different From Previous Models?
The core innovation lies in how creators can now direct emotional delivery. Instead of hoping the AI interprets emotion from text alone, users can embed inline audio tags like [whispers], [sighs], [shouts], and [laughs] directly into scripts. For example, a creator could write: "[whispers] Something's coming... [sighs] I can feel it." The model then generates speech that matches those emotional cues with cinematic precision.
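As a concrete illustration, here is a minimal sketch that sends a tagged script to ElevenLabs' standard text-to-speech endpoint using Python's requests library. The "eleven_v3" model identifier is an assumption to verify against current documentation, and (as noted later in this article) public API access to v3 was still rolling out at the time of writing.

```python
# Minimal sketch: synthesizing a script with inline audio tags.
# The "eleven_v3" model ID is an assumption -- verify it against the
# current ElevenLabs API docs before relying on it.
import requests

API_KEY = "your-api-key"    # placeholder
VOICE_ID = "your-voice-id"  # any voice from your library

script = "[whispers] Something's coming... [sighs] I can feel it."

response = requests.post(
    f"https://api.elevenlabs.io/v1/text-to-speech/{VOICE_ID}",
    headers={"xi-api-key": API_KEY},
    json={"text": script, "model_id": "eleven_v3"},  # tags ride along inside the text
)
response.raise_for_status()

with open("line.mp3", "wb") as f:
    f.write(response.content)  # the endpoint returns encoded audio bytes
```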
The model also introduces a new Text-to-Dialogue API endpoint that handles multi-speaker conversations automatically. Developers provide a structured array of speaker turns, and the system generates a single cohesive audio file with natural pacing, emotional shifts, and even interruptions that sound like real human conversation. This eliminates the need to stitch together separate voice recordings.
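The announcement doesn't spell out the exact request shape, so the following is a hypothetical sketch based on that description: the /v1/text-to-dialogue path, the inputs and model_id field names, and the voice IDs are all assumptions to check against the official documentation.

```python
# Hypothetical Text-to-Dialogue request. Endpoint path, field names,
# and voice IDs are assumptions inferred from the article's description
# of "a structured array of speaker turns".
import requests

API_KEY = "your-api-key"  # placeholder

turns = [
    {"voice_id": "voice-alice", "text": "[excited] You made it!"},
    {"voice_id": "voice-bob", "text": "[laughs] Barely. Traffic was brutal."},
    {"voice_id": "voice-alice", "text": "[whispers] Come in, everyone's waiting."},
]

response = requests.post(
    "https://api.elevenlabs.io/v1/text-to-dialogue",  # assumed path
    headers={"xi-api-key": API_KEY},
    json={"inputs": turns, "model_id": "eleven_v3"},  # assumed field names
)
response.raise_for_status()

# Per the announcement, one cohesive audio file comes back,
# rather than per-speaker clips that need stitching.
with open("dialogue.mp3", "wb") as f:
    f.write(response.content)
```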
Compared to earlier versions, Eleven v3 delivers a 68% reduction in errors when processing complex text like chemical formulas and phone numbers, making it more reliable for professional narration work. However, the model requires more prompt engineering than previous versions, and it currently has higher latency, making it better suited for pre-recorded content like videos, audiobooks, and media tools rather than real-time conversational agents.
How to Use Eleven v3 for Your Projects
- Audio Tags for Emotional Control: Embed bracketed commands like [excited], [whispers], and [sighs] directly into your script to direct the AI's emotional delivery without manual editing or multiple takes.
- Multi-Speaker Dialogue Generation: Use the Text-to-Dialogue API to provide speaker turns as JSON objects, and the model automatically generates overlapping conversations with natural interruptions and emotional consistency.
- Global Language Support: Access the model across 70+ languages including Mandarin Chinese, Japanese, Spanish, Arabic, Hindi, and many others, enabling creators to produce content for international audiences without hiring voice actors in each region.
- Professional Voice Selection: Choose from purpose-built voice collections organized by use case, such as Announcers for trailers, Radio Hosts for podcasts, or Support voices for customer service applications.
For users who need real-time responsiveness, ElevenLabs recommends staying with v2.5 Turbo or Flash models, which deliver 75-millisecond latency suitable for conversational agents. A real-time version of v3 is in development.
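In code, that guidance amounts to picking a model ID by latency budget. A minimal sketch, assuming the publicly documented model identifiers (verify the exact names against ElevenLabs' current model list):

```python
# Illustrative model selection by latency budget. The model IDs are
# assumptions -- check ElevenLabs' current model list for exact names.
def pick_model(realtime: bool) -> str:
    """Choose speed for live agents, expressiveness for pre-recorded work."""
    if realtime:
        return "eleven_flash_v2_5"  # low-latency tier (~75 ms, per ElevenLabs)
    return "eleven_v3"              # richer emotional delivery, higher latency

print(pick_model(realtime=True))   # -> eleven_flash_v2_5
print(pick_model(realtime=False))  # -> eleven_v3
```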
What Does This Mean for Content Creators and Businesses?
The release signals a broader shift in how ElevenLabs positions itself. The company has evolved from a text-to-speech startup into what it describes as "the audio layer" of the internet, now offering agents, music generation, transcription, and video synchronization alongside voice synthesis. Eleven v3 is the flagship model for high-stakes narration work, while the company maintains specialized models for speed-focused applications.
"Since launching Multilingual v2, we've seen voice AI adopted in professional film, game development, education, and accessibility. But the consistent limitation wasn't sound quality, it was expressiveness. More exaggerated emotions, conversational interruptions, and believable back-and-forth were difficult to achieve," stated Piotr Dabkowski, Co-Founder of Research at ElevenLabs.
The pricing reflects ElevenLabs' push to make the technology accessible. During the launch period through June 2026, the model is available at 80% off standard rates for self-serve users accessing it through the UI, making it roughly five times cheaper than normal pricing. Enterprise users also receive 80% off business plan pricing during the promotional period.
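The arithmetic behind that comparison is simple: an 80% discount leaves one-fifth of the list price, which is where the roughly-five-times-cheaper figure comes from, as a quick check shows.

```python
# Sanity check: an 80% discount leaves 20% of the list price,
# and 1 / 0.20 = 5, i.e. roughly five times cheaper than standard rates.
discount = 0.80
remaining_fraction = 1 - discount  # 0.20 of list price
print(1 / remaining_fraction)      # 5.0
```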
One important caveat: Professional Voice Clones (PVCs), which allow users to clone specific voices, are not yet fully optimized for v3 and may produce lower quality output compared to earlier models. ElevenLabs recommends using Instant Voice Clones (IVCs) or designed voices from the library during this research preview phase, with PVC optimization coming in the near future.
Where Is This Technology Headed?
The broader context shows ElevenLabs building toward a comprehensive AI audio infrastructure. The company now offers ElevenAgents for conversational AI that can take real actions mid-conversation, Eleven Music for generating studio-grade instrumental and vocal tracks, and Scribe v2 for real-time speech-to-text transcription. These tools work together as an integrated stack, allowing creators and businesses to handle voice, music, transcription, and video synchronization from a single platform.
Companies like Meta, Chess.com, Twilio, and Klarna are already using ElevenLabs' infrastructure for customer support, content creation, and interactive experiences. The shift toward emotionally intelligent voices with v3 suggests the technology is moving beyond functional automation toward experiences that feel genuinely responsive and human-like.
Access to Eleven v3 is available today through the ElevenLabs website, with public API access coming soon; developers who want early API access can contact sales.