Why string metrics aren't enough
On English → Plains Cree, a naive frontier LLM tops the chrF++ table — while nothing checks whether its words are real Cree. A pipeline that routes every word through a morphological analyzer scores lower on chrF++ but guarantees its morphology. Same task, opposite story:
chrF++ alone rewards confident hallucination. The Arena scores what string metrics can't see: FST acceptance, equivalence classes, semantic checks.
Run the harness
One command installs the evaluation harness. Any method that implements translate(entries, config) can compete — prompted LLMs, coached pipelines, FSTs, fine-tunes, rule systems.
curl -fsSL champollion.dev/harness | bashPrefer to read it first? The script is plain bash — python3 + pipx, no sudo, ever.