Building a Voice Model for Code: How a 5-Person Team Hit 450K Sessions/Day
Lena Vollmer
A technical breakdown of Aqua Voice's Avalon ASR model, the data pipeline behind it, and the engineering decisions that took a tiny team to the #1 commercial spot on the OpenASR Leaderboard.
If you dictate a command to Claude Code right now using standard speech recognition, it will likely transcribe "kubectl" as "cube control" and "PyTorch" as "pie torch." Most speech AI is trained on audiobooks, news broadcasts, and parliamentary proceedings. Nobody is dictating audiobooks into Cursor.
That domain gap is why wrapping an API around Whisper wasn't going to work for Aqua Voice. Instead, our five-person team built and trained Avalon, a custom ASR model designed specifically for technical speech. Today, it's the #1 commercial model on the OpenASR Leaderboard. This is a technical breakdown of how we built our data pipeline, the architecture decisions that got us to 450K sessions a day, and the expensive mistakes we made along the way.
The Domain Gap in ASR Training Data
The bulk of publicly available transcription training data comes from audiobooks, parliamentary proceedings, news broadcasts, and meeting recordings. Decades of human-made transcripts exist for these domains. Models trained on this data benchmark beautifully on academic splits.
But the training data distribution of most ASR models is misaligned with how people actually talk to computers: dictating prompts, referencing specific library names, mixing natural language with code identifiers. We decided to close this gap, not by fine-tuning Whisper but by training our own model from scratch.
Avalon benchmark results: testing speech recognition for code
OpenASR Leaderboard (industry-standard benchmark suite, 7 datasets; word error rate, lower is better):
Avalon: 6.24 WER (public release)
Mistral: 6.88 WER
Deepgram: 6.91 WER
OpenAI Whisper Large v3: 7.44 WER
Avalon is the #1 commercial model and #6 overall. These benchmarks are run by a third party (Hugging Face) on standardized test sets. No post-processing, no screen context, no LLM cleanup. Raw model output vs. human labels.
AISpeak Benchmark (coding and AI-specific terms):
We built this because existing benchmarks don't test what matters for our users. AISpeak consists of clips sourced from Twitch streams and YouTube where speakers use terms like "Claude Code," "MCP," "git checkout dev." For each clip, we check whether the model gets the key technical term right.
| Model | Accuracy on Key Terms (AISpeak-10) |
|---|---|
| Avalon | 97.4% |
| ElevenLabs Scribe | 78.8% |
| Whisper Large v3 | 65.1% |
| NVIDIA Canary 1B | 51.5% |
We know custom benchmarks invite skepticism. The Avalon model card (PDF) has full methodology, and we plan to release the AISpeak dataset publicly so others can reproduce results and test their own models.
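The scoring rule described above, checking whether the model gets the key technical term right in each clip, can be sketched roughly as follows. This is a hypothetical harness for illustration, not the actual AISpeak methodology (which is in the model card); the normalization and matching rules are assumptions.

```python
import re

def normalize(text: str) -> str:
    """Lowercase and collapse punctuation/whitespace for loose matching."""
    return re.sub(r"[^a-z0-9]+", " ", text.lower()).strip()

def key_term_correct(transcript: str, key_term: str) -> bool:
    """True if the clip's key technical term survives transcription.

    Illustrative rule only: a substring check after normalization.
    """
    return normalize(key_term) in normalize(transcript)

def key_term_accuracy(results: list[tuple[str, str]]) -> float:
    """results: (transcript, key_term) pairs, one per clip."""
    hits = sum(key_term_correct(t, k) for t, k in results)
    return hits / len(results)

clips = [
    ("run cube control get pods", "kubectl"),                  # miss
    ("then git checkout dev and rebase", "git checkout dev"),  # hit
]
print(key_term_accuracy(clips))  # 0.5
```

Per-clip binary scoring like this is forgiving of filler words around the term but strict about the term itself, which matches the benchmark's stated focus.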
Latency (end-of-speech to text, clips under 30s, M3 Max MacBook Pro / 64GB RAM):
Aqua Voice: 965ms
Wispr Flow: 1,399ms
SuperWhisper: 2,407ms
Training the voice model: our custom data pipeline
The counterintuitive finding: optimizing for developer/technical speech didn't just improve technical accuracy. It improved everything.
From our Avalon launch post:
"It turned out that the data pipelines we stood up for both publicly available audio and for pre-existing audio datasets not only boosted the performance on this important domain, but also made the model better across the board."
We did not use customer audio for training unless users explicitly opted in. Instead, we built data collection pipelines focused on how people actually talk when working at a computer. The key insight was that this distribution of speech is underrepresented in existing datasets, and filling that gap benefited general transcription quality too.
Avalon uses a custom encoder-decoder architecture. While it shares DNA with the Whisper family, it is not a fine-tune. Avalon operates with a significantly lower parameter count, which is part of how we hit our strict sub-1000ms latency budget.
But architecture isn't our moat; our data pipeline is. To fix the domain gap, we built a synthetic pipeline for developer speech. We mined open-source coding tutorials, Twitch programming streams, and technical conference talks. The hard part wasn't sourcing the audio; it was the alignment. Standard forced-alignment tools fail spectacularly on code syntax. We had to build a custom pseudo-labeling pipeline that could reliably map spoken English (e.g., "dunder init") to written code (__init__) before feeding it into the training loop. By heavily weighting this technical distribution, the model learned the semantic difference between natural language and terminal commands.
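The spoken-to-written mapping at the heart of that pseudo-labeling step can be sketched as a rewrite table. This is a toy illustration under assumed rules; the real pipeline is far larger and the entries here are examples, not Aqua's actual mapping.

```python
# Toy spoken-form -> written-form rewrites for pseudo-labeling transcripts.
# Entries are illustrative; a production table would be much larger and
# context-sensitive (e.g. only rewriting inside code-like spans).
SPOKEN_TO_WRITTEN = {
    "dunder init": "__init__",
    "pie torch": "PyTorch",
    "cube control": "kubectl",
}

def rewrite_spoken_code(transcript: str) -> str:
    """Replace known spoken forms with their written code equivalents."""
    out = transcript
    for spoken, written in SPOKEN_TO_WRITTEN.items():
        out = out.replace(spoken, written)
    return out

print(rewrite_spoken_code("call dunder init on the pie torch module"))
# call __init__ on the PyTorch module
```

A naive global `str.replace` like this is exactly where such pipelines get hard: the same phrase can be natural language in one clip and a code identifier in the next, which is why the post describes the alignment, not the sourcing, as the difficult part.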
Beyond the Model: Screen Context
Avalon is one piece of a system. The product also uses screen context to improve real-world accuracy, and this is worth explaining because the implementation is non-obvious.
When you dictate near a code editor, Aqua reads what's visible on screen (via macOS accessibility APIs and screen capture with user permission). If you say something that sounds like a variable name, the system can match it against identifiers actually present in your code. You say "canonical title on the context response model" near a file containing a ContextResponse class, and the output is canonical_title on ContextResponse with correct casing.
This is a separate system from Avalon itself. The raw ASR output goes through a context-matching step, then through LLM post-processing that handles formatting, stumble correction, and user-defined instructions.
Important distinction: The OpenASR and AISpeak benchmarks reflect raw Avalon output only. Screen context and LLM post-processing are product features that improve real-world experience but are not part of the benchmark numbers.
Scaling an ASR architecture with a 5-person team
Operating as a five-person team forced our architectural hand. We didn't have the headcount to manage a sprawling microservices cluster or throw infinite compute at our training runs. Brett owns the Avalon model end-to-end, while Jack and Mark handle product engineering. Because our CEO, Finn, is dyslexic and relies on dictation daily, our feedback loop was brutally short: if a latency regression shipped, the CEO felt it within five minutes. This constraint kept us laser-focused on vertical integration rather than chasing research fads.
The usage numbers, as of February 2026:
13,000+ daily active users
450,000+ sessions per day
43.6 million all-time sessions
Average user replaces 29% of all typing (growing 4.5%/week)
Median active span: 10 hours/day
39.4% of DAU use it morning AND evening
The engagement numbers are what convinced us the hypothesis was right. Voice input for computer work is a distinct, underserved use case. When you solve it well, people use it all day, not as a novelty but as a primary input method.
The Japan Story
Something we didn't plan for: Aqua went viral in Japan without any marketing.
The structural reason: voice input is dramatically faster than typing in non-Latin scripts. Japanese users type romanized approximations, select from candidate characters, and confirm. Voice skips all of that. The speedup is roughly 3x.
A clarification that's important: Avalon (our proprietary model) currently ships for English only. Japanese users use our Whisper-based pipeline, which is optimized for latency and integrated with the same screen context and post-processing systems. The multilingual version of Avalon (with dedicated Japanese optimization) is in training and rolling out soon.
Japan is now our largest market by user count, entirely through word of mouth. This tells us the opportunity is bigger than English-speaking developers. There are roughly 2 billion people who type in non-Latin scripts. For them, voice input isn't a productivity hack. It's a fundamentally faster input method.
What We Got Wrong
Some honest mistakes along the way:
Quality at the expense of latency. Early versions prioritized transcription accuracy over speed. Turns out users would rather have slightly less perfect text instantly than perfect text after a noticeable delay. We learned this the hard way watching engagement metrics.
Underestimating the LLM layer. We initially thought raw ASR quality was everything. It's not. The post-processing layer that corrects stumbles, handles formatting, and applies user instructions accounts for a huge portion of perceived quality. Two systems working together beat either one alone.
Context hallucination. When we first shipped the screen-reading context feature, we were too aggressive with the matching heuristics. If a user mumbled near an open code editor, the LLM post-processing would try to aggressively "correct" the noise into whatever variable names were on screen, resulting in random code snippets being injected into chat windows. We had to completely rebuild the confidence thresholding between the raw ASR output and the context-matching layer to stop the system from being overly helpful.
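The thresholding fix described above can be sketched as a two-signal gate: only substitute an on-screen identifier when both the raw ASR confidence and the match score clear a floor. The thresholds, signature, and names here are illustrative assumptions, not Aqua's actual system.

```python
def apply_context_match(asr_text: str, asr_conf: float,
                        candidate: str, match_score: float,
                        asr_conf_floor: float = 0.5,
                        match_floor: float = 0.85) -> str:
    """Substitute an on-screen identifier only when both signals are strong.

    Illustrative gate: if the raw ASR is low-confidence (mumbling), keep
    its output rather than letting context matching invent code, even if
    the candidate identifier happens to score well.
    """
    if asr_conf >= asr_conf_floor and match_score >= match_floor:
        return candidate
    return asr_text

# Mumbled audio near an editor: high match score, but the gate holds.
print(apply_context_match("uh mm", 0.2, "ContextResponse", 0.9))
# uh mm
print(apply_context_match("context response", 0.9, "ContextResponse", 0.95))
# ContextResponse
```

Requiring the conjunction of both floors is the point: an aggressive match score alone is exactly the failure mode that injected random code into chat windows.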
Why Not Just Use Whisper?
Three reasons:
Accuracy on technical terms. 65.1% vs. 97.4% on coding vocabulary is the difference between a useful tool and an annoying one. Developers won't adopt voice input that mangles their code.
Latency. 965ms end-to-end vs. multi-second delays with local Whisper. For a press-and-talk interface, every millisecond matters.
Vertical integration. Avalon + screen context + LLM post-processing + per-app custom instructions is a system. You can't replicate it by wrapping Whisper in an API. Most competitors took the wrapper approach, and the technical ceiling is lower.
What's Next
Multilingual Avalon (Japanese, Spanish, German, French, Russian with dedicated optimization, 43 others via Whisper-level support) is in training
iOS launches March 1, 2026
Avalon API is available for developers building their own voice interfaces
Aqua Voice is available for Mac and Windows. The Avalon model card has full technical details.