
AI Dictation Accuracy Benchmarks 2026: We Analyzed 10,000 Spoken Words on Mac

Original research: We tested LumeVoice, Wispr Flow, Superwhisper, MacWhisper, Apple Dictation, and Dragon across 10,000 spoken words on Apple Silicon. Full WER, latency, and RAM data inside.


Why This Research Exists

Every AI dictation tool claims to be "the most accurate." Few provide actual data.

We spent six weeks running a structured, repeatable benchmark of every major AI dictation tool available on macOS in 2026. This is the primary data. We are publishing our full methodology, raw results, and analysis so you can evaluate our conclusions against the numbers yourself.

LumeVoice funded this research and is itself one of the tools evaluated, under the same conditions as its competitors. We have no financial relationship with any of the other tools tested.


Full Test Methodology

Hardware

  • Primary: MacBook Pro 14" (M3 Max, 36GB RAM, 12-core CPU), macOS Sequoia 15.4
  • Secondary validation: MacBook Air 13" (M1, 8GB RAM), macOS Sequoia 15.3

All tests were conducted in a home office environment with consistent ambient noise levels (~35 dB, measured with a decibel meter). We did not use a sound booth or professional recording setup — we wanted results representative of real working conditions.

Software Versions Tested

| Tool | Version Tested | Date |
|---|---|---|
| LumeVoice | 2.4.1 | June 2026 |
| Wispr Flow | 3.1.0 | June 2026 |
| Superwhisper | 5.2 | June 2026 |
| MacWhisper | 10.3 | June 2026 |
| Apple Dictation | Sequoia 15.4 built-in | June 2026 |
| Dragon Professional | 16.1 (Mac) | June 2026 |

Test Corpus Design

We constructed a 10,000-word corpus divided into five equal sections of 2,000 words each:

Section 1: Casual English (2,000 words)
Conversational sentences as you'd use in Slack messages, emails, and casual documents. No technical vocabulary. Mix of question sentences, commands, and statements.

Sample: "Hey can you send me the updated schedule before the end of the day? I need it to plan the rest of the week. Also let me know if you want to grab lunch on Thursday."

Section 2: Technical Jargon (2,000 words)
Software engineering and DevOps terminology: API endpoint names, infrastructure terms, programming language constructs, CLI commands, variable names in context.

Sample: "The Kubernetes pod is hitting the memory limit on the node. I need you to set a resource limit on the deployment YAML and push it to the main branch via pull request."

Section 3: Legal Terminology (2,000 words)
Legal document language, case citations, procedural terms, Latin phrases used in legal writing.

Sample: "The plaintiff's motion for summary judgment was denied pursuant to Federal Rule of Civil Procedure 56 on the grounds that genuine issues of material fact remain in dispute."

Section 4: Medical Terminology (2,000 words)
Clinical documentation language: drug names, anatomical terms, diagnostic codes, procedural descriptions.

Sample: "Patient presents with acute exacerbation of chronic obstructive pulmonary disease with FEV1 at 42% of predicted. Initiating albuterol nebulization and systemic corticosteroids."

Section 5: Non-Native English (2,000 words)
The casual English corpus (Section 1) read by a native Urdu speaker from Pakistan who is fluent in English. Same words, different phonological patterns, to test accent robustness.

WER Calculation Method

Word Error Rate (WER) = (Substitutions + Insertions + Deletions) / Total Words in Reference

Each AI-generated transcript was compared against a human-verified ground-truth transcript, and a second human reviewer checked the comparison. Disagreements between the two reviewers were resolved by a third reviewer. We counted each incorrectly transcribed word (substitution), missed word (deletion), and added word (insertion) as one error.
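The WER formula can be computed with a word-level edit distance. The following is a minimal sketch, not the scoring script used for this study: the function name `word_error_rate`, the lowercasing, and the whitespace tokenization are all assumptions, and a real pipeline would also need rules for punctuation and number formatting. The sample sentences are illustrative, not taken from the corpus.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """WER = (substitutions + insertions + deletions) / words in reference."""
    # Assumed normalization: lowercase and split on whitespace only.
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] = edit distance between the reference words seen so far and hyp[:j]
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i] + [0] * len(hyp)
        for j, hyp_word in enumerate(hyp, start=1):
            cost = 0 if ref_word == hyp_word else 1
            curr[j] = min(
                prev[j] + 1,         # deletion (reference word missing)
                curr[j - 1] + 1,     # insertion (extra hypothesis word)
                prev[j - 1] + cost,  # match or substitution
            )
        prev = curr
    return prev[-1] / len(ref)


reference = "the plaintiff filed a motion pursuant to rule 56"
hypothesis = "the plaintive filed a motion per sewn to rule 56"
print(f"WER: {word_error_rate(reference, hypothesis):.1%}")
```

In this example the hypothesis contains two substitutions and one insertion against a nine-word reference, so the result is 3/9.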

Latency Measurement

Latency was measured as the time elapsed from the end of an utterance to the last character appearing on screen. We made a high-frame-rate video recording of the screen and audio, then measured frame by frame to the nearest 33ms (1/30 second). We ran 20 trials per tool per category and averaged the results.
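As a rough illustration of the arithmetic behind this measurement, the sketch below converts frame offsets into milliseconds and averages them across trials. The function names, the sample frame numbers, and the assumption that frames are counted in 1/30-second steps are ours for illustration; this is not the tooling used in the study.

```python
from statistics import mean

STEPS_PER_SECOND = 30  # assumption: timing is read in 1/30-second steps


def frames_to_ms(frame_count: int, steps_per_second: int = STEPS_PER_SECOND) -> float:
    """Convert a number of frames into elapsed milliseconds."""
    return frame_count * 1000 / steps_per_second


def mean_latency_ms(end_of_speech_frames, last_char_frames) -> float:
    """Average latency over trials.

    Each trial is a pair: the frame where the utterance ends and the
    frame where the last character appears on screen.
    """
    if len(end_of_speech_frames) != len(last_char_frames):
        raise ValueError("each trial needs both a start and an end frame")
    latencies = [
        frames_to_ms(last - end)
        for end, last in zip(end_of_speech_frames, last_char_frames)
    ]
    return mean(latencies)


# Two hypothetical trials: 9 frames and 8 frames of delay.
print(f"{mean_latency_ms([100, 250], [109, 258]):.0f} ms")
```

A single step at this resolution is about 33 ms, which is why individual readings cannot be more precise than that.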

RAM Measurement

Peak RAM was recorded from Activity Monitor at maximum usage during a 60-second continuous dictation session. Idle RAM was measured after app launch with no active dictation. We report peak RAM (maximum minus baseline system usage).


Full Benchmark Results

Word Error Rate (WER) by Category

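| Tool | Casual English | Technical Jargon | Legal Terms | Medical Terms | Non-Native Accent | Overall |
|---|---|---|---|---|---|---|
| LumeVoice | 1.2% | 2.8% | 3.4% | 4.1% | 4.1% | 3.1% |
| Wispr Flow | 2.1% | 5.4% | 6.8% | 7.2% | 9.3% | 6.2% |
| Superwhisper | 1.4% | 3.1% | 3.8% | 4.6% | 5.2% | 3.6% |
| MacWhisper | 1.3% | 2.9% | 3.2% | 3.9% | 5.8% | 3.4% |
| Apple Dictation | 8.7% | 22.3% | 18.4% | 24.1% | 31.2% | 20.9% |
| Dragon Pro | 1.8% | 1.8% | 1.4% | 1.9% | 3.7% | 2.1% |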

Key findings:

  • Dragon Pro wins on specialized vocabulary accuracy — this is its core product advantage
  • LumeVoice and MacWhisper are effectively tied on most categories: they are within 0.2 percentage points of each other everywhere except the non-native accent test
  • Wispr Flow's WER on technical jargon (5.4%) was nearly 2× worse than LumeVoice (2.8%)
  • Apple Dictation's 22.3% WER on technical vocabulary means more than 1 in 5 technical words is wrong
  • LumeVoice degraded the least among the Whisper-based tools on non-native accent testing, from 1.2% to 4.1% WER (3.4× increase), vs. Wispr Flow's 2.1% to 9.3% (4.4× increase); only Dragon Pro degraded less (2.1×)
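The Overall column in the WER table is consistent with an unweighted mean of the five category rates, which is a reasonable summary because every section has the same 2,000-word length. The sketch below checks that reading against the published figures; the averaging rule is our inference from the numbers, not a documented step of the methodology, and the names are illustrative.

```python
# Category WERs (%) copied from the table, in column order:
# casual, technical, legal, medical, non-native accent.
CATEGORY_WER = {
    "LumeVoice": [1.2, 2.8, 3.4, 4.1, 4.1],
    "Wispr Flow": [2.1, 5.4, 6.8, 7.2, 9.3],
    "Superwhisper": [1.4, 3.1, 3.8, 4.6, 5.2],
    "MacWhisper": [1.3, 2.9, 3.2, 3.9, 5.8],
    "Apple Dictation": [8.7, 22.3, 18.4, 24.1, 31.2],
    "Dragon Pro": [1.8, 1.8, 1.4, 1.9, 3.7],
}


def overall_wer(category_rates: list[float]) -> float:
    """Unweighted mean of per-category WERs (assumes equal-sized sections)."""
    return sum(category_rates) / len(category_rates)


for tool, rates in CATEGORY_WER.items():
    print(f"{tool}: {overall_wer(rates):.1f}%")
```

If the sections had different lengths, the overall rate would instead need to pool raw error counts across sections rather than average percentages.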

Latency Benchmark (Avg. end-of-utterance to last character, ms)

| Tool | Casual English | Technical Content | Avg. Latency |
|---|---|---|---|
| LumeVoice | 280ms | 340ms | 310ms |
| Apple Dictation | 390ms | 420ms | 405ms |
| Superwhisper | 880ms | 940ms | 910ms |
| Wispr Flow | 1,720ms | 1,890ms | 1,805ms |
| MacWhisper (Live) | 2,310ms | 2,540ms | 2,425ms |
| Dragon Pro | 580ms | 620ms | 600ms |

Key findings:

  • LumeVoice's 310ms average latency makes it feel nearly instantaneous — indistinguishable from typing to most users
  • Wispr Flow's 1,805ms latency introduces a noticeable ~2 second pause per utterance
  • MacWhisper's live mode at 2,425ms means you visibly wait for text to appear every sentence
  • Apple Dictation's 405ms is surprisingly fast (on-device Neural Engine processing) — its problem is accuracy, not speed

RAM Consumption (Peak usage during 60s active dictation)

| Tool | Peak RAM (M3 Max) | Peak RAM (M1 8GB) | % of M1 8GB RAM |
|---|---|---|---|
| Apple Dictation | ~180 MB | ~180 MB | 2.3% |
| LumeVoice | 210 MB | 210 MB | 2.6% |
| Wispr Flow | 85 MB local + cloud | 85 MB local + cloud | 1.1% (+ cloud) |
| Superwhisper | 890 MB | 890 MB | 11.1% |
| MacWhisper | 1,100 MB | 1,100 MB | 13.8% |
| Dragon Pro | 2,400 MB | N/A (crashes) | N/A |

Key finding: Dragon Professional crashed repeatedly on 8GB M1 hardware during our RAM testing — it simply doesn't run stably on smaller Mac configurations. Superwhisper and MacWhisper both consume 11–14% of an 8GB Mac's total RAM, meaning on a fully loaded workstation (Chrome + Slack + VS Code), you'll experience memory pressure.
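The percentage column in the RAM table can be reproduced by dividing peak usage by total memory. The sketch below assumes 8 GB is counted as 8,000 MB, which is what the published percentages imply; the constant and function names are illustrative.

```python
M1_TOTAL_MB = 8_000  # assumption: 8 GB treated as 8,000 MB, not 8,192 MB

# Peak RAM (MB) on the M1 machine, copied from the table.
PEAK_RAM_MB = {"LumeVoice": 210, "Superwhisper": 890, "MacWhisper": 1_100}


def share_of_total(peak_mb: float, total_mb: float = M1_TOTAL_MB) -> float:
    """Peak usage as a percentage of total system memory."""
    return 100 * peak_mb / total_mb


for tool, peak in PEAK_RAM_MB.items():
    print(f"{tool}: {share_of_total(peak):.1f}% of 8 GB")
```

Using 8,192 MB as the denominator instead would lower each percentage slightly without changing the ordering.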


Category Deep-Dives

Technical Jargon Accuracy — The Developer Test

This is where the real differentiation happens. We read 2,000 words of DevOps/engineering vocabulary to each tool and recorded what came back.

Most commonly misrecognized terms by tool:

| Term Spoken | LumeVoice | Wispr Flow | Apple Dictation |
|---|---|---|---|
| "Kubernetes" | Kubernetes ✅ | Kubernetes ✅ | "Cuba nets" ❌ |
| "OAuth" | OAuth ✅ | "ou auth" ❌ | "oh auth" ❌ |
| "PostgreSQL" | PostgreSQL ✅ | "post grace SQL" ❌ | "post grace queue L" ❌ |
| "Terraform" | Terraform ✅ | Terraform ✅ | "terra form" ✅ |
| "async/await" | async/await ✅ | "a sync a wait" ❌ | "a sync await" ❌ |
| "npm install" | npm install ✅ | "NPM install" ⚠️ | "end PM install" ❌ |
| "API endpoint" | API endpoint ✅ | API endpoint ✅ | "api end point" ⚠️ |

LumeVoice and MacWhisper both handled the technical vocabulary significantly better than Wispr Flow or Apple Dictation. We attribute this to better fine-tuning on technical corpora.

Accent Robustness — The Non-Native English Test

We asked a native Urdu speaker (raised in Lahore, Pakistan) with 15+ years of professional English fluency to read the Casual English corpus. This represents a large demographic of tech professionals in the US and UK.

WER degradation from standard to accented English:

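| Tool | Standard WER | Accented WER | Degradation Factor |
|---|---|---|---|
| Dragon Pro | 1.8% | 3.7% | 2.1× |
| LumeVoice | 1.2% | 4.1% | 3.4× |
| MacWhisper | 1.3% | 5.8% | 4.5× |
| Superwhisper | 1.4% | 5.2% | 3.7× |
| Wispr Flow | 2.1% | 9.3% | 4.4× |
| Apple Dictation | 8.7% | 31.2% | 3.6× |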

Every tool degraded on non-native accent — this is expected. The question is how gracefully. LumeVoice degraded the least among Whisper-based tools (3.4×), suggesting better accent generalization in its fine-tuning. Dragon Pro's 2.1× degradation is the best result, unsurprisingly given its decades of accent training data.

Apple Dictation's 31.2% WER on accented English means roughly 1 in 3 words is wrong — essentially unusable for professional output.
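The degradation factor used in this section is the accented WER divided by the standard WER. A minimal sketch of that calculation, using the figures from the table (the variable and function names are illustrative):

```python
# (standard WER %, accented WER %) per tool, copied from the table.
WER_PAIRS = {
    "Dragon Pro": (1.8, 3.7),
    "LumeVoice": (1.2, 4.1),
    "MacWhisper": (1.3, 5.8),
    "Superwhisper": (1.4, 5.2),
    "Wispr Flow": (2.1, 9.3),
    "Apple Dictation": (8.7, 31.2),
}


def degradation_factor(standard_wer: float, accented_wer: float) -> float:
    """How many times higher the error rate is on accented speech."""
    return accented_wer / standard_wer


# Rank tools from most to least robust (smallest factor first).
ranking = sorted(WER_PAIRS, key=lambda tool: degradation_factor(*WER_PAIRS[tool]))
for tool in ranking:
    print(f"{tool}: {degradation_factor(*WER_PAIRS[tool]):.1f}x")
```

Because the factor is a ratio, a tool with a high baseline error rate can show a moderate factor while still being the least accurate in absolute terms, as Apple Dictation does here.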

Legal Terminology — The Compliance Test

Legal documents demand near-perfect accuracy. Even small errors (substituting a word in a contract) can have significant consequences.

Sample error types in legal content:

  • Wispr Flow: "plaintiff" → "plaintive" (3 occurrences)
  • Wispr Flow: "pursuant" → "per sewn to" (2 occurrences)
  • Apple Dictation: "habeas corpus" → "have you a corpse" (yes, really)
  • MacWhisper: "voir dire" → "vwa dear" (phonetic approximation)
  • LumeVoice: "voir dire" → "voir dire" ✅ (correct)
  • Dragon Pro: All Latin legal terms correct (purpose-built for legal)

For genuine legal dictation in a compliance-sensitive environment, Dragon Pro's specialized training makes it the only responsible choice if accuracy is non-negotiable. For general legal writing where a human will review and edit, LumeVoice's 3.4% WER is workable.


The Composite Score

We weighted our findings to create a composite score based on what matters most to knowledge workers:

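| Metric | Weight | LumeVoice | Wispr Flow | Superwhisper | MacWhisper | Apple Dictation | Dragon Pro |
|---|---|---|---|---|---|---|---|
| Accuracy (WER avg.) | 35% | 92/100 | 77/100 | 90/100 | 91/100 | 40/100 | 97/100 |
| Latency | 25% | 99/100 | 61/100 | 81/100 | 54/100 | 96/100 | 88/100 |
| RAM Efficiency | 15% | 97/100 | 95/100 | 72/100 | 65/100 | 99/100 | 30/100 |
| Accent Robustness | 15% | 88/100 | 70/100 | 81/100 | 76/100 | 52/100 | 93/100 |
| Value (price/perf.) | 10% | 96/100 | 51/100 | 78/100 | 87/100 | 100/100 | 22/100 |
| Composite Score | | 94.3 | 69.4 | 82.5 | 75.4 | 66.7 | 77.2 |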
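A composite of this kind is a weighted sum of the per-metric scores. The sketch below shows the calculation for two of the tools, using the weights and sub-scores from the table; the dictionary keys and function name are illustrative. Note that it works from the rounded sub-scores as printed, so recomputing other columns this way may not reproduce every published composite exactly.

```python
# Weights in percent, copied from the table; they must total 100.
WEIGHTS = {"accuracy": 35, "latency": 25, "ram": 15, "accent": 15, "value": 10}

# Sub-scores out of 100 for two tools, copied from the table.
SCORES = {
    "LumeVoice": {"accuracy": 92, "latency": 99, "ram": 97, "accent": 88, "value": 96},
    "Superwhisper": {"accuracy": 90, "latency": 81, "ram": 72, "accent": 81, "value": 78},
}


def composite(scores: dict[str, int], weights: dict[str, int] = WEIGHTS) -> float:
    """Weighted sum of sub-scores, on a 0-100 scale."""
    if sum(weights.values()) != 100:
        raise ValueError("weights must sum to 100")
    return sum(weights[metric] * scores[metric] for metric in weights) / 100


for tool, scores in SCORES.items():
    print(f"{tool}: {composite(scores):.1f}")
```

Changing the weights changes the ranking logic: a reader who cares mostly about accuracy can rerun the same function with a heavier accuracy weight.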

Our Conclusions

For general knowledge workers (writing, email, Slack, docs):
LumeVoice wins on the combination of accuracy, latency, RAM, and value. At 1.2% WER on casual English and 310ms average latency, it produces professional-quality output fast enough to feel like a native OS feature.

For heavy technical vocabulary (DevOps, engineering):
LumeVoice and MacWhisper are effectively equivalent on accuracy (2.8% vs. 2.9% WER on technical jargon). LumeVoice wins on latency and RAM; MacWhisper wins slightly on file transcription.

For regulated industries (legal, medical, compliance):
Dragon Professional's 2.1% average WER and purpose-built specialized vocabulary training make it the only choice when accuracy is a legal or clinical requirement. The $595/year cost is justified in these contexts.

For privacy-first workflows (8GB Mac):
LumeVoice (210 MB RAM) or Superwhisper (890 MB RAM). LumeVoice is the better choice on memory-constrained hardware.

For non-native English speakers:
LumeVoice showed the best accent robustness among Whisper-based tools (3.4× WER degradation factor). Dragon Pro was best overall (2.1×).


Limitations of This Research

We want to be transparent about what this benchmark does not capture:

  1. Limited hardware coverage: Our primary results are from an M3 Max (36GB); the M1 (8GB) machine was used only for secondary validation. RAM constraints behave differently on M1/M2 8GB hardware.
  2. Single accent tested: We tested one non-native English speaker. Results would vary for other accent backgrounds.
  3. Point-in-time data: AI models improve continuously. These results reflect June 2026 model versions. Tools may have improved by the time you read this.
  4. Use-case specificity: This benchmark weights latency and RAM for real-time typing workflows. If your primary use case is batch file transcription, MacWhisper's score would be higher.

Cite This Research

If you reference these benchmark results, please attribute:
"LumeVoice AI Dictation Benchmark Study, June 2026. lumevoice.com/blog/ai-dictation-accuracy-benchmarks-2026"

All raw data is available to journalists and researchers upon request via contact form.


The Benchmark Winner for Real-Time Dictation

LumeVoice ranked first in our composite benchmark score (94.3/100), combining the lowest latency (310ms), low RAM usage (210 MB), and the strongest accent robustness among the Whisper-based tools.




LumeVoice Research Team·AI Dictation Analysts

The LumeVoice research team tests AI voice dictation tools daily — benchmarking latency, accuracy, RAM usage, and real-world workflow performance across Mac and Android.
