A New Transcription Startup Just Undercut ElevenLabs by 90%: Here's Why It Matters
A Boston-based startup called Modulate has launched a transcription service that costs roughly one-tenth as much as industry leaders like ElevenLabs, while delivering better accuracy on real-world conversations with multiple speakers, accents, and background noise. The service, called Velma Transcribe, could make speech-to-text far more affordable for developers and enterprises building voice-enabled applications.
What Makes Velma Transcribe So Much Cheaper?
Modulate attributes the lower cost to its Ensemble Listening Model (ELM) architecture, which orchestrates multiple specialized transcription models working together rather than relying on a single large model. According to the company, this approach improves accuracy while reducing computational overhead and cost. The result is pricing that undercuts each of the major competitors listed below.
To put the cost difference in perspective, Velma Transcribe charges approximately $0.03 per hour of audio transcribed. Compare that to ElevenLabs Scribe v2 at $0.40 per hour, Deepgram Nova-3 at $0.31 per hour, Deepgram Nova-2 at $0.26 per hour, and AssemblyAI Universal-3 Pro at $0.21 per hour. The gap ranges from $0.18 to $0.37 per hour, so for organizations processing on the order of a million hours of audio per year, it could translate to savings in the hundreds of thousands of dollars annually.
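The arithmetic behind that comparison can be sketched in a few lines of Python. The per-hour prices are the ones quoted above; the annual volume of one million hours is a hypothetical workload chosen only for illustration, and the function names are not part of any vendor API.

```python
# Per-hour prices as quoted in this article (USD per hour of audio).
PRICE_PER_HOUR = {
    "Velma Transcribe": 0.03,
    "ElevenLabs Scribe v2": 0.40,
    "Deepgram Nova-3": 0.31,
    "Deepgram Nova-2": 0.26,
    "AssemblyAI Universal-3 Pro": 0.21,
}


def annual_cost(provider: str, hours_per_year: float) -> float:
    """Yearly transcription bill for a provider at a given audio volume."""
    return PRICE_PER_HOUR[provider] * hours_per_year


def annual_savings(baseline: str, hours_per_year: float,
                   candidate: str = "Velma Transcribe") -> float:
    """Yearly amount saved by switching from baseline to candidate."""
    return annual_cost(baseline, hours_per_year) - annual_cost(candidate, hours_per_year)


def percent_cheaper(baseline: str, candidate: str = "Velma Transcribe") -> float:
    """How much cheaper the candidate is per hour, as a percentage of the baseline price."""
    return 100 * (1 - PRICE_PER_HOUR[candidate] / PRICE_PER_HOUR[baseline])


# Hypothetical workload: one million hours of audio per year (an assumption).
HOURS_PER_YEAR = 1_000_000

for provider in PRICE_PER_HOUR:
    if provider == "Velma Transcribe":
        continue
    saved = annual_savings(provider, HOURS_PER_YEAR)
    pct = percent_cheaper(provider)
    print(f"{provider}: saves ${saved:,.0f} per year ({pct:.1f}% cheaper per hour)")
```

At list prices the per-hour discount against ElevenLabs Scribe v2 works out to 92.5%, which is consistent with the "roughly one-tenth" framing. Actual bills would depend on volume tiers and contract terms that this sketch does not model.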
How Does Velma Transcribe Handle Messy, Real-World Conversations?
Traditional transcription systems often struggle with the kinds of audio that actually exist in the real world: multiple people talking over each other, regional accents, background noise, and interruptions. Modulate engineered Velma specifically for these challenging scenarios. On the AMI Meeting Corpus, a widely used benchmark for complex multi-speaker conversational audio, Velma avoided over 40% of the errors made by ElevenLabs and over 70% of the errors made by OpenAI's GPT-4o-transcribe.
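A statement like "avoided over 40% of the errors" is a relative comparison, not an absolute error rate. The short Python sketch below shows how such a figure is derived from two error counts on the same test set. The counts used here are invented for illustration and are not results from the AMI benchmark.

```python
def fraction_of_errors_avoided(baseline_errors: int, candidate_errors: int) -> float:
    """Share of the baseline system's errors that the candidate system did not make.

    Both counts must come from the same test set, otherwise the ratio is meaningless.
    """
    if baseline_errors <= 0:
        raise ValueError("baseline_errors must be positive")
    return 1 - candidate_errors / baseline_errors


# Hypothetical counts: the baseline makes 1,000 word errors, the candidate 600.
avoided = fraction_of_errors_avoided(baseline_errors=1000, candidate_errors=600)
print(f"Errors avoided relative to baseline: {avoided:.0%}")
```

Read this way, avoiding over 40% of a competitor's errors means making fewer than 60% as many errors as that competitor on the same audio. It says nothing on its own about how many errors either system makes in absolute terms.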
"We've tuned Velma for conversational audio, including emotion and accent detection, leading to materially lower error rates on meeting and call data while delivering dramatic cost savings versus incumbent providers. That combination makes high-quality transcription practical at scale," said Carter Huffman, CTO and Cofounder of Modulate.
What Features Does Velma Transcribe Include?
- Multilingual Support: Handles transcription in 70 of the world's most commonly spoken languages, making it viable for global enterprises and applications.
- Emotion Detection: Identifies and labels over 20 distinct emotions within speech, providing insights beyond just the words being said.
- Accent Detection: Recognizes and tags 20+ different accents, improving accuracy on diverse speaker populations.
- PII Redaction: Automatically detects and removes personally identifiable information like names, phone numbers, and social security numbers for privacy-safe workflows.
- Speaker Diarization: Distinguishes between different speakers in a conversation, labeling who said what.
- Real-Time Streaming: Delivers sub-second latency with partial transcripts for live applications and AI agent pipelines.
- Zero Data Retention: Ensures privacy by not storing audio or transcripts after processing.
These capabilities address pain points that enterprises face when trying to extract value from voice data at scale. Call centers, social platforms, and voice-enabled AI agents can now afford to transcribe and analyze every conversation, not just a sample.
How to Get Started With Velma Transcribe
- Check Pricing: Visit modulate.ai/pricing to see usage-based pricing optimized for high-volume workloads and calculate potential savings for your use case.
- Test on Your Audio: Start with batch transcription to evaluate accuracy on your specific types of conversations before committing to production deployment.
- Explore Streaming Capabilities: If you need real-time transcription for live applications, test the sub-second streaming latency with partial transcript support.
- Evaluate Enterprise Features: Assess whether emotion detection, accent identification, and PII redaction add value to your specific application or compliance requirements.
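For the "Test on Your Audio" step, a common way to compare providers is word error rate (WER): the word-level edit distance between a trusted reference transcript and the service's output, divided by the number of reference words. The Python sketch below is a minimal, provider-agnostic implementation. It assumes you already have both transcripts as plain strings, it does not call any Modulate API, and its normalization (lowercasing and whitespace splitting) is deliberately simplistic.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference length.

    Counts substitutions, deletions, and insertions needed to turn the
    reference word sequence into the hypothesis word sequence.
    """
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference transcript must contain at least one word")

    # prev[j] holds the edit distance between the reference words processed
    # so far (excluding the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]  # turning i reference words into an empty hypothesis: i deletions
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (ref_word != hyp_word)
            deletion = prev[j] + 1
            insertion = curr[j - 1] + 1
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical transcripts for illustration only.
reference = "please send the invoice to accounting by friday"
hypothesis = "please send the invoice to a counting by friday"
print(f"WER: {word_error_rate(reference, hypothesis):.3f}")
```

In practice, punctuation and number formatting should be normalized the same way for every provider before scoring, since differences in formatting would otherwise be counted as errors.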
Modulate is positioning Velma Transcribe as the first step in a broader developer API strategy. The company plans to release additional capabilities for synthetic voice detection, emotion analysis, and deeper conversational intelligence. Together, these tools could enable applications like fraud detection, customer sentiment analysis, compliance monitoring, and real-time decision support.
"The industry has spent years teaching AI how to generate and respond. The next frontier is teaching it how to listen," said Mike Pappas, CEO and Cofounder of Modulate. "Most systems today rely on transcription, reducing rich conversations to flat text and losing the signals humans naturally understand. Velma is the listening layer for AI, giving developers and enterprises the 'ears' needed to build voice-native applications that can capture the nuance and intent within spoken dialogue."
Why This Matters for the Voice AI Industry
The dramatic cost reduction could democratize voice intelligence in the same way that cheaper cloud computing democratized machine learning. Organizations that previously couldn't afford to transcribe and analyze all their voice data can now do so economically. This shift has implications for customer service, compliance, healthcare, and any industry where understanding conversations is valuable.
Velma Transcribe is available today with both batch and streaming transcription endpoints. Modulate holds ISO 27001 security certification, which the company says makes the service suitable for enterprise deployments that require strict data protection standards.