RESEARCH

Benchmarking dictation honestly

January 8, 2026 · 8 min read

Word error rate is the number everyone quotes and the number that hides the errors that actually hurt. A tool can win on WER and still feel worse to use. So we measure something else.

Not all errors are equal

Mishearing “their” as “there” is one error and a shrug. Mangling a person’s name, a product, or a CLI flag, or dropping a “not”, is also one error, but it changes the meaning or breaks the code. WER weights them the same. Users do not.
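To make that concrete, here is a minimal sketch of the standard WER calculation: word-level edit distance divided by the number of reference words. The example sentences are invented for illustration, and the lowercase whitespace tokenization is a simplifying assumption, not a description of our scoring pipeline.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: (substitutions + deletions + insertions) / reference length."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen so far
    # and the first j hypothesis words (classic Levenshtein, one row at a time).
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Hypothetical example sentences, eight reference words.
reference = "do not pass their flag to the server"
harmless = "do not pass there flag to the server"  # homophone slip
harmful = "do pass their flag to the server"       # dropped "not"

print(wer(reference, harmless))  # 0.125
print(wer(reference, harmful))   # 0.125
```

Both transcripts score 0.125, although only the second one reverses the meaning of the sentence.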

What we track instead

We score keyword accuracy (the terms in your dictionary and the proper nouns that carry meaning), punctuation and casing on real prose, and intervention rate: how often you had to touch the keyboard to fix something. That last one tracks how good the tool actually feels far better than raw WER does.
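As a rough sketch of how two of these scores could be computed, the functions below are illustrative assumptions, not our published definitions. `keyword_accuracy` takes the share of reference keyword occurrences that survive in the transcript, matched case-sensitively on whitespace tokens. `intervention_rate` takes the share of utterances that needed at least one manual fix. The function names, inputs, and sample data are all hypothetical.

```python
from collections import Counter


def keyword_accuracy(reference: str, hypothesis: str, keywords: set[str]) -> float:
    """Fraction of keyword occurrences in the reference that also appear in the hypothesis.

    Matching is case-sensitive on whitespace tokens, since casing matters
    for names and CLI flags (an assumption of this sketch).
    """
    ref_counts = Counter(reference.split())
    hyp_counts = Counter(hypothesis.split())
    expected = sum(ref_counts[kw] for kw in keywords)
    if expected == 0:
        raise ValueError("reference contains none of the keywords")
    # A keyword occurrence counts as found at most as often as the reference has it.
    found = sum(min(ref_counts[kw], hyp_counts[kw]) for kw in keywords)
    return found / expected


def intervention_rate(fixes_per_utterance: list[int]) -> float:
    """Fraction of utterances where the user made at least one manual fix."""
    if not fixes_per_utterance:
        raise ValueError("need at least one utterance")
    touched = sum(1 for fixes in fixes_per_utterance if fixes > 0)
    return touched / len(fixes_per_utterance)


# Hypothetical data: one misspelled name out of three keywords,
# and two of four utterances needing keyboard fixes.
keywords = {"Priya", "kubectl", "--dry-run"}
reference = "ask Priya to run kubectl with --dry-run"
hypothesis = "ask Pria to run kubectl with --dry-run"

print(keyword_accuracy(reference, hypothesis, keywords))
print(intervention_rate([0, 2, 0, 1]))  # 0.5
```

Here the transcript gets seven of eight words right, yet keeps only two of the three keywords, which is the gap a keyword score is meant to expose.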

Measured on real speech

Benchmarks built from clean read-aloud audio flatter everyone. We test on messy, real dictation: filler words, self-corrections, accents, background noise, mid-sentence language switches. That is the input the product gets, so that is the input we grade on.

We publish the methodology, not just a percentage. A number you cannot interrogate is not a measurement; it is marketing.