Why This Research Exists
Every AI dictation tool claims to be "the most accurate." Few provide actual data.
We spent six weeks running a structured, repeatable benchmark of six major AI dictation tools available on macOS in 2026. What follows is primary data. We are publishing our full methodology, results, and analysis so you can evaluate our conclusions against the numbers yourself.
We have no financial relationship with any of the competing tools tested. Disclosure: LumeVoice funded this research and is itself one of the tools evaluated, under the same conditions as its competitors.
Full Test Methodology
Hardware
- Primary: MacBook Pro 14" (M3 Max, 36GB RAM, 12-core CPU), macOS Sequoia 15.4
- Secondary validation: MacBook Air 13" (M1, 8GB RAM), macOS Sequoia 15.3
All tests were conducted in a home office environment with consistent ambient noise levels (~35 dB, measured with a decibel meter). We did not use a sound booth or professional recording setup — we wanted results representative of real working conditions.
Software Versions Tested
| Tool | Version Tested | Date |
|---|---|---|
| LumeVoice | 2.4.1 | June 2026 |
| Wispr Flow | 3.1.0 | June 2026 |
| Superwhisper | 5.2 | June 2026 |
| MacWhisper | 10.3 | June 2026 |
| Apple Dictation | Sequoia 15.4 built-in | June 2026 |
| Dragon Professional | 16.1 (Mac) | June 2026 |
Test Corpus Design
We constructed a 10,000-word corpus divided into five equal sections of 2,000 words each:
Section 1: Casual English (2,000 words)
Conversational sentences of the kind you'd use in Slack messages, emails, and casual documents. No technical vocabulary. A mix of questions, commands, and statements.
Sample: "Hey can you send me the updated schedule before the end of the day? I need it to plan the rest of the week. Also let me know if you want to grab lunch on Thursday."
Section 2: Technical Jargon (2,000 words)
Software engineering and DevOps terminology: API endpoint names, infrastructure terms, programming language constructs, CLI commands, variable names in context.
Sample: "The Kubernetes pod is hitting the memory limit on the node. I need you to set a resource limit on the deployment YAML and push it to the main branch via pull request."
Section 3: Legal Terminology (2,000 words)
Legal document language, case citations, procedural terms, Latin phrases used in legal writing.
Sample: "The plaintiff's motion for summary judgment was denied pursuant to Federal Rule of Civil Procedure 56 on the grounds that genuine issues of material fact remain in dispute."
Section 4: Medical Terminology (2,000 words)
Clinical documentation language: drug names, anatomical terms, diagnostic codes, procedural descriptions.
Sample: "Patient presents with acute exacerbation of chronic obstructive pulmonary disease with FEV1 at 42% of predicted. Initiating albuterol nebulization and systemic corticosteroids."
Section 5: Non-Native English (2,000 words)
The casual English corpus (Section 1) read by a native Urdu speaker who speaks Pakistani English. Same words, different phonological patterns, to test accent robustness.
WER Calculation Method
Word Error Rate (WER) = (Substitutions + Insertions + Deletions) / Total Words in Reference
Each AI-generated transcript was compared against a human-verified ground truth, and a second human reviewer checked the comparison. Disagreements between reviewers were resolved by a third reviewer. Each incorrectly transcribed word (substitution), missed word (deletion), or added word (insertion) counts as one error.
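The formula above can be sketched in Python as a word-level edit distance. This is a minimal illustration under stated assumptions, not the scoring script used in the study: the function name `word_error_rate`, the whitespace tokenization, and the case-insensitive comparison are all assumptions, and the sample sentences are hypothetical.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / words in reference."""
    # Assumption: lowercase and split on whitespace; punctuation is left as-is.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] is the edit distance between the reference words processed
    # so far and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion: reference word missing
                curr[j - 1] + 1,     # insertion: extra word in hypothesis
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example: one substitution in an eight-word reference.
reference = "the plaintiff's motion for summary judgment was denied"
hypothesis = "the plaintive motion for summary judgment was denied"
print(f"{word_error_rate(reference, hypothesis):.1%}")  # 12.5%
```

Dividing by the reference length means WER can exceed 100% when a tool inserts many extra words, which is worth keeping in mind when reading very high error rates.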
Latency Measurement
Latency was measured as the time elapsed from end of utterance to the last character appearing on screen. We used a high-frame-rate video recording of the screen and audio, then measured frame-by-frame to the nearest 33ms (1/30 second). We ran 20 trials per tool per category and averaged the results.
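As a rough sketch of the arithmetic, each trial is a count of video frames between the end of speech and the final character, converted to milliseconds and then averaged. The 30 fps rate below is an assumption inferred from the 1/30 second resolution, and the frame counts are hypothetical, not measured data.

```python
from statistics import mean

FPS = 30  # assumed frame rate, matching the 1/30 s resolution stated above


def frames_to_ms(frame_count: int, fps: int = FPS) -> float:
    """Convert a number of video frames into milliseconds."""
    return frame_count * 1000 / fps


def average_latency_ms(frame_counts: list[int], fps: int = FPS) -> float:
    """Mean latency across trials, each measured in whole frames."""
    return mean(frames_to_ms(n, fps) for n in frame_counts)


# Hypothetical frame counts for four trials (the study ran 20 per tool per category).
trials = [8, 9, 8, 9]
print(round(average_latency_ms(trials)))  # 283
```

Because each trial is quantized to whole frames, any single measurement can be off by up to one frame; averaging over many trials reduces that error.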
RAM Measurement
Peak RAM was recorded from Activity Monitor at maximum usage during a 60-second continuous dictation session. Idle RAM was measured after app launch with no active dictation. We report peak RAM (maximum minus baseline system usage).
Full Benchmark Results
Word Error Rate (WER) by Category
| Tool | Casual English | Technical Jargon | Legal Terms | Medical Terms | Non-Native Accent | Overall |
|---|---|---|---|---|---|---|
| LumeVoice | 1.2% | 2.8% | 3.4% | 4.1% | 4.1% | 3.1% |
| Wispr Flow | 2.1% | 5.4% | 6.8% | 7.2% | 9.3% | 6.2% |
| Superwhisper | 1.4% | 3.1% | 3.8% | 4.6% | 5.2% | 3.6% |
| MacWhisper | 1.3% | 2.9% | 3.2% | 3.9% | 5.8% | 3.4% |
| Apple Dictation | 8.7% | 22.3% | 18.4% | 24.1% | 31.2% | 20.9% |
| Dragon Pro | 1.8% | 1.8% | 1.4% | 1.9% | 3.7% | 2.1% |
Key findings:
- Dragon Pro wins on specialized vocabulary accuracy — this is its core product advantage
- LumeVoice and MacWhisper are effectively tied, within 0.2 percentage points of each other in every category except non-native accent
- Wispr Flow's WER on technical jargon (5.4%) was nearly 2× worse than LumeVoice (2.8%)
- Apple Dictation's 22.3% WER on technical vocabulary means nearly 1 in 4 technical words is wrong
- LumeVoice had the second-smallest degradation on non-native accent testing, from 1.2% to 4.1% WER (3.4× increase), vs Wispr Flow's 2.1% to 9.3% (4.4× increase); only Dragon Pro degraded less (2.1×)
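Because the five corpus sections are equal in length, the Overall column in the WER table is consistent with a plain unweighted mean of the five category figures. The short check below reproduces it from the published per-category numbers; the variable names are ours, and this is a sanity check rather than the original analysis script.

```python
from statistics import mean

# Per-category WER (%) from the table, in order:
# casual, technical, legal, medical, non-native accent.
category_wer = {
    "LumeVoice": [1.2, 2.8, 3.4, 4.1, 4.1],
    "Wispr Flow": [2.1, 5.4, 6.8, 7.2, 9.3],
    "Superwhisper": [1.4, 3.1, 3.8, 4.6, 5.2],
    "MacWhisper": [1.3, 2.9, 3.2, 3.9, 5.8],
    "Apple Dictation": [8.7, 22.3, 18.4, 24.1, 31.2],
    "Dragon Pro": [1.8, 1.8, 1.4, 1.9, 3.7],
}

# Unweighted mean per tool, rounded to one decimal place like the table.
overall_wer = {tool: round(mean(values), 1) for tool, values in category_wer.items()}

# Print tools from lowest to highest overall WER.
for tool, value in sorted(overall_wer.items(), key=lambda item: item[1]):
    print(f"{tool}: {value}%")
```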
Latency Benchmark (Avg. end-of-utterance to last character, ms)
| Tool | Casual English | Technical Content | Avg. Latency |
|---|---|---|---|
| LumeVoice | 280ms | 340ms | 310ms |
| Apple Dictation | 390ms | 420ms | 405ms |
| Superwhisper | 880ms | 940ms | 910ms |
| Wispr Flow | 1,720ms | 1,890ms | 1,805ms |
| MacWhisper (Live) | 2,310ms | 2,540ms | 2,425ms |
| Dragon Pro | 580ms | 620ms | 600ms |
Key findings:
- LumeVoice's 310ms average latency was the lowest we measured and feels close to instantaneous in use
- Wispr Flow's 1,805ms latency introduces a noticeable ~2 second pause per utterance
- MacWhisper's live mode at 2,425ms means you visibly wait for text to appear every sentence
- Apple Dictation's 405ms is surprisingly fast (on-device Neural Engine processing) — its problem is accuracy, not speed
RAM Consumption (Peak usage during 60s active dictation)
| Tool | Peak RAM (M3 Max) | Peak RAM (M1 8GB) | % of M1 8GB RAM |
|---|---|---|---|
| Apple Dictation | ~180 MB | ~180 MB | 2.3% |
| LumeVoice | 210 MB | 210 MB | 2.6% |
| Wispr Flow | 85 MB local + cloud | 85 MB local + cloud | 1.1% (+ cloud) |
| Superwhisper | 890 MB | 890 MB | 11.1% |
| MacWhisper | 1,100 MB | 1,100 MB | 13.8% |
| Dragon Pro | 2,400 MB | N/A (crashes) | N/A |
Key finding: Dragon Professional crashed repeatedly on our 8GB M1 test machine during RAM testing, so we could not record a stable measurement there. Superwhisper and MacWhisper consume roughly 11% and 14% of an 8GB Mac's total RAM respectively, so on a fully loaded workstation (Chrome + Slack + VS Code) you are likely to see memory pressure.
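The percentage column in the RAM table is consistent with dividing peak usage by 8,000 MB (8GB taken as 8,000 MB rather than 8,192 MB). A small sketch of that calculation, using three rows from the table; the constant and function names are illustrative assumptions.

```python
TOTAL_RAM_MB = 8_000  # assumption: 8GB taken as 8,000 MB, which matches the table

# Peak RAM (MB) for three tools, copied from the table above.
peak_ram_mb = {"LumeVoice": 210, "Superwhisper": 890, "MacWhisper": 1_100}


def share_of_ram(peak_mb: float, total_mb: float = TOTAL_RAM_MB) -> float:
    """Peak usage as a percentage of total system RAM."""
    return peak_mb * 100 / total_mb


for tool, mb in peak_ram_mb.items():
    print(f"{tool}: {share_of_ram(mb):.1f}% of 8GB")
```

On a machine that is already near its memory limit, the absolute figure in MB matters more than the percentage, since macOS starts compressing and swapping once free memory runs out.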
Category Deep-Dives
Technical Jargon Accuracy — The Developer Test
This is where the real differentiation happens. We read 2,000 words of DevOps/engineering vocabulary to each tool and recorded what came back.
Most commonly misrecognized terms by tool:
| Term Spoken | LumeVoice | Wispr Flow | Apple Dictation |
|---|---|---|---|
| "Kubernetes" | Kubernetes ✅ | Kubernetes ✅ | "Cuba nets" ❌ |
| "OAuth" | OAuth ✅ | "ou auth" ❌ | "oh auth" ❌ |
| "PostgreSQL" | PostgreSQL ✅ | "post grace SQL" ❌ | "post grace queue L" ❌ |
| "Terraform" | Terraform ✅ | Terraform ✅ | "terra form" ⚠️ |
| "async/await" | async/await ✅ | "a sync a wait" ❌ | "a sync await" ❌ |
| "npm install" | npm install ✅ | "NPM install" ⚠️ | "end PM install" ❌ |
| "API endpoint" | API endpoint ✅ | API endpoint ✅ | "api end point" ⚠️ |
LumeVoice and MacWhisper both handled the technical vocabulary far better than Wispr Flow or Apple Dictation (2.8% and 2.9% WER, vs 5.4% and 22.3%). We suspect this reflects better fine-tuning on technical corpora.
Accent Robustness — The Non-Native English Test
We asked a native Urdu speaker (raised in Lahore, Pakistan) with 15+ years of professional English fluency to read the Casual English corpus. This represents a large demographic of tech professionals in the US and UK.
WER degradation from standard to accented English:
| Tool | Standard WER | Accented WER | Degradation Factor |
|---|---|---|---|
| Dragon Pro | 1.8% | 3.7% | 2.1× |
| LumeVoice | 1.2% | 4.1% | 3.4× |
| MacWhisper | 1.3% | 5.8% | 4.5× |
| Superwhisper | 1.4% | 5.2% | 3.7× |
| Wispr Flow | 2.1% | 9.3% | 4.4× |
| Apple Dictation | 8.7% | 31.2% | 3.6× |
Every tool degraded on non-native accent — this is expected. The question is how gracefully. LumeVoice degraded the least among Whisper-based tools (3.4×), suggesting better accent generalization in its fine-tuning. Dragon Pro's 2.1× degradation is the best result, unsurprisingly given its decades of accent training data.
Apple Dictation's 31.2% WER on accented English means roughly 1 in 3 words is wrong — essentially unusable for professional output.
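The degradation factor in the table is simply accented WER divided by standard WER. The sketch below recomputes it from the published figures and ranks the tools from most to least robust; the names are ours, and this is an illustration rather than the study's own script.

```python
# (standard WER %, accented WER %) per tool, copied from the table above.
wer = {
    "Dragon Pro": (1.8, 3.7),
    "LumeVoice": (1.2, 4.1),
    "MacWhisper": (1.3, 5.8),
    "Superwhisper": (1.4, 5.2),
    "Wispr Flow": (2.1, 9.3),
    "Apple Dictation": (8.7, 31.2),
}


def degradation_factor(standard: float, accented: float) -> float:
    """How many times higher the WER is on accented speech."""
    return accented / standard


# Smallest factor first, i.e. most accent-robust first.
ranked = sorted(wer, key=lambda tool: degradation_factor(*wer[tool]))

for tool in ranked:
    print(f"{tool}: {degradation_factor(*wer[tool]):.1f}x")
```

Note that the factor is a relative measure: a tool with a high baseline WER can show a modest factor while still producing far more errors in absolute terms, as Apple Dictation does here.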
Legal Terminology — The Compliance Test
Legal documents demand near-perfect accuracy. Even small errors (substituting a word in a contract) can have significant consequences.
Sample error types in legal content:
- Wispr Flow: "plaintiff" → "plaintive" (3 occurrences)
- Wispr Flow: "pursuant" → "per sewn to" (2 occurrences)
- Apple Dictation: "habeas corpus" → "have you a corpse" (yes, really)
- MacWhisper: "voir dire" → "vwa dear" (phonetic approximation)
- LumeVoice: "voir dire" → "voir dire" ✅ (correct)
- Dragon Pro: All Latin legal terms correct (purpose-built for legal)
For genuine legal dictation in a compliance-sensitive environment, Dragon Pro's specialized training makes it the only responsible choice if accuracy is non-negotiable. For general legal writing where a human will review and edit, LumeVoice's 3.4% WER is workable.
The Composite Score
We weighted our findings to create a composite score based on what matters most to knowledge workers:
| Metric | Weight | LumeVoice | Wispr Flow | Superwhisper | MacWhisper | Apple Dictation | Dragon Pro |
|---|---|---|---|---|---|---|---|
| Accuracy (WER avg.) | 35% | 92/100 | 77/100 | 90/100 | 91/100 | 40/100 | 97/100 |
| Latency | 25% | 99/100 | 61/100 | 81/100 | 54/100 | 96/100 | 88/100 |
| RAM Efficiency | 15% | 97/100 | 95/100 | 72/100 | 65/100 | 99/100 | 30/100 |
| Accent Robustness | 15% | 88/100 | 70/100 | 81/100 | 76/100 | 52/100 | 93/100 |
| Value (price/perf.) | 10% | 96/100 | 51/100 | 78/100 | 87/100 | 100/100 | 22/100 |
| Composite Score | 100% | 94.3 | 72.1 | 82.5 | 75.2 | 70.7 | 76.6 |
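The composite is a weighted sum of the five metric scores. Here is a minimal sketch of that calculation using the weights from the table and two of its columns; the dictionary keys and the `composite` function are illustrative names, not part of any published tooling.

```python
# Weights from the table, as whole percentages so the arithmetic stays exact.
WEIGHTS = {"accuracy": 35, "latency": 25, "ram": 15, "accent": 15, "value": 10}


def composite(scores: dict[str, int]) -> float:
    """Weighted sum of 0-100 metric scores, returned on a 0-100 scale."""
    if sum(WEIGHTS.values()) != 100:
        raise ValueError("weights must sum to 100%")
    return sum(WEIGHTS[metric] * scores[metric] for metric in WEIGHTS) / 100


# Metric scores copied from the LumeVoice and Superwhisper columns.
lumevoice = {"accuracy": 92, "latency": 99, "ram": 97, "accent": 88, "value": 96}
superwhisper = {"accuracy": 90, "latency": 81, "ram": 72, "accent": 81, "value": 78}

print(composite(lumevoice))     # 94.3
print(composite(superwhisper))  # 82.5
```

Changing the weights changes the ranking, so readers with different priorities (for example, batch transcription over real-time latency) can plug in their own weights.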
Our Conclusions
For general knowledge workers (writing, email, Slack, docs):
LumeVoice wins on the combination of accuracy, latency, RAM, and value. At 1.2% WER on casual English and 310ms average latency, it produces professional-quality output fast enough to feel like a native OS feature.
For heavy technical vocabulary (DevOps, engineering):
LumeVoice (2.8% WER) and MacWhisper (2.9% WER) are effectively tied on technical accuracy. LumeVoice wins on latency and RAM; MacWhisper is the stronger choice for batch file transcription, which this benchmark does not score.
For regulated industries (legal, medical, compliance):
Dragon Professional's 2.1% average WER and purpose-built specialized vocabulary training make it the only choice when accuracy is a legal or clinical requirement. The $595/year cost is justified in these contexts.
For privacy-first workflows (8GB Mac):
LumeVoice (210 MB RAM) or Superwhisper (890 MB RAM). LumeVoice is the better choice on memory-constrained hardware.
For non-native English speakers:
LumeVoice showed the best accent robustness among Whisper-based tools (3.4× WER degradation factor). Dragon Pro was best overall (2.1×).
Limitations of This Research
We want to be transparent about what this benchmark does not capture:
- Limited hardware coverage: Our primary results are from an M3 Max (36GB), with an M1 (8GB) used only for secondary validation. Tools may behave differently under memory pressure on M1/M2 8GB hardware.
- Single accent tested: We tested one non-native English speaker. Results would vary for other accent backgrounds.
- Point-in-time data: AI models improve continuously. These results reflect June 2026 model versions. Tools may have improved by the time you read this.
- Use-case specificity: This benchmark weights latency and RAM for real-time typing workflows. If your primary use case is batch file transcription, MacWhisper's score would be higher.
Cite This Research
If you reference these benchmark results, please attribute:
"LumeVoice AI Dictation Benchmark Study, June 2026. lumevoice.com/blog/ai-dictation-accuracy-benchmarks-2026"
All raw data is available to journalists and researchers upon request via contact form.
The Benchmark Winner for Real-Time Dictation
LumeVoice ranked first in our composite benchmark score (94.3/100), combining the lowest latency (310ms), low RAM usage (210 MB), and the strongest accent robustness among Whisper-based tools.
LumeVoice runs on macOS 13 or later (Apple Silicon recommended).
