Word error rate is the number everyone quotes and the number that hides the errors that actually hurt. A tool can win on WER and still feel worse to use. So we measure something else.
Mishearing “their” as “there” is one error and a shrug. Mangling a person’s name, a product name, or a CLI flag, or dropping a “not”, is also one error, but it changes the meaning or breaks the code. WER weights them the same. Users do not.
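To make that concrete, here is a minimal sketch of WER as word-level edit distance divided by reference length. The function name and the sample sentences are made up for illustration, and the lowercase-and-split normalization is an assumption; real scorers normalize text more carefully.

```python
def word_error_rate(reference: str, hypothesis: str) -> float:
    """Word-level Levenshtein distance divided by the reference word count."""
    ref = reference.lower().split()
    hyp = hypothesis.lower().split()
    if not ref:
        raise ValueError("reference must contain at least one word")

    # prev[j] holds the edit distance between the reference words seen
    # so far (before the current one) and the first j hypothesis words.
    prev = list(range(len(hyp) + 1))
    for i, ref_word in enumerate(ref, start=1):
        curr = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            substitution = prev[j - 1] + (0 if ref_word == hyp_word else 1)
            deletion = prev[j] + 1       # reference word missing from hypothesis
            insertion = curr[j - 1] + 1  # extra word in hypothesis
            curr.append(min(substitution, deletion, insertion))
        prev = curr
    return prev[-1] / len(ref)


# Illustrative sentences, not benchmark data.
reference = "do not pass the force flag to their script"
harmless = "do not pass the force flag to there script"  # homophone slip
harmful = "do pass the force flag to their script"        # dropped "not"

wer_harmless = word_error_rate(reference, harmless)
wer_harmful = word_error_rate(reference, harmful)
print(wer_harmless, wer_harmful)
```

Both hypotheses contain exactly one word-level error against a nine-word reference, so they get the same score, even though only one of them reverses the instruction.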
We score keyword accuracy (the terms in your dictionary and the proper nouns that carry meaning), punctuation and casing on real prose, and intervention rate: how often you have to touch the keyboard to fix something. That last one tracks how good the tool actually feels far better than raw WER.
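Two of these metrics are simple enough to sketch. What follows is one possible reading of the definitions above, not a published scorer: the function names, the case-sensitive whole-token matching, and counting interventions per utterance are all assumptions made for illustration.

```python
import re


def keyword_accuracy(keywords: list[str], hypothesis: str) -> float:
    """Fraction of expected keywords that appear intact in the transcript.

    Matching is case-sensitive and requires token boundaries (assumption),
    so "not" does not count as found inside "note".
    """
    if not keywords:
        raise ValueError("keywords must not be empty")
    hits = 0
    for keyword in keywords:
        # No word character or hyphen may touch either side of the match.
        pattern = r"(?<![\w-])" + re.escape(keyword) + r"(?![\w-])"
        if re.search(pattern, hypothesis):
            hits += 1
    return hits / len(keywords)


def intervention_rate(utterances: list[tuple[str, str]]) -> float:
    """Fraction of utterances the user edited after dictation.

    Each item is (raw_transcript, text_after_user_edits); any difference
    counts as one intervention (assumption).
    """
    if not utterances:
        raise ValueError("utterances must not be empty")
    edited = sum(1 for raw, final in utterances if raw != final)
    return edited / len(utterances)


# Illustrative data only.
keywords = ["Anjali", "kubectl", "--force"]
transcript = "Anjali said to run kubectl with dash dash force"
kw_score = keyword_accuracy(keywords, transcript)

utterances = [
    ("Ship it on Friday.", "Ship it on Friday."),
    ("Ask and jolly about the flag.", "Ask Anjali about the flag."),
    ("Do not merge yet.", "Do not merge yet."),
    ("Thanks, talk soon.", "Thanks, talk soon."),
]
rate = intervention_rate(utterances)
print(kw_score, rate)
```

In this toy data the transcript gets the name and the tool right but spells out the flag, so two of three keywords survive, and one of four utterances needed a manual fix.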
Benchmarks built from clean read-aloud audio flatter everyone. We test on messy, real dictation: filler words, self-corrections, accents, background noise, mid-sentence language switches. That is the input the product gets, so that is the input we grade on.
We publish the methodology, not just a percentage. A number you cannot interrogate is not a measurement; it is marketing.