MISSION CONTROL

The Arena

Open benchmarks for translating into the languages everyone else skips. Fingerprinted runs, versioned corpora, metrics that catch what fluency hides. Anyone can compete.

Why string metrics aren't enough

On English → Plains Cree, a naive frontier LLM tops the chrF++ table, yet nothing checks whether its words are real Cree. A pipeline that routes every word through a morphological analyzer scores lower on chrF++ but has its morphology verified. Same task, opposite story:

Naive frontier LLM
chrF++ 47.6
morphology unverified: fluent-looking output, no guarantee the words exist

FST-gated pipeline
chrF++ 43.2
91.5% morphologically valid words: every surface form checked by a finite-state transducer

chrF++ alone rewards confident hallucination. The Arena scores what string metrics can't see: FST acceptance, equivalence classes, semantic checks.
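
To make "FST acceptance" concrete, here is a minimal sketch of how a share of morphologically valid words could be computed. It is not the Arena's scoring code: the function name `fst_acceptance_rate`, the `accepts` callback, and the toy lexicon are all illustrative assumptions, and the lexicon lookup merely stands in for a real finite-state transducer.

```python
from typing import Callable, Iterable


def fst_acceptance_rate(words: Iterable[str], accepts: Callable[[str], bool]) -> float:
    """Return the fraction of words that the analyzer accepts as well-formed."""
    words = list(words)
    if not words:
        return 0.0
    valid = sum(1 for word in words if accepts(word))
    return valid / len(words)


# Stand-in for a real FST: a toy lexicon lookup with placeholder tokens.
# (Hypothetical; a real analyzer would check each surface form morphologically.)
TOY_LEXICON = {"alpha", "beta", "gamma"}


def toy_accepts(word: str) -> bool:
    return word.lower() in TOY_LEXICON


rate = fst_acceptance_rate("alpha beta delta gamma".split(), toy_accepts)
print(f"{rate:.1%}")  # 75.0%
```

A string-overlap metric can stay high when a word such as "delta" is merely close to a valid form; an acceptance check like this one counts it as invalid regardless of how plausible it looks.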

Run the harness

One command installs the evaluation harness. Any method that implements translate(entries, config) can compete — prompted LLMs, coached pipelines, FSTs, fine-tunes, rule systems.

curl -fsSL champollion.dev/harness | bash

Prefer to read it first? The script is plain bash and needs only python3 + pipx. No sudo, ever.
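
As a rough idea of what implementing translate(entries, config) might look like, here is a minimal rule-system sketch. Only the function name and its two parameters come from the text above. The entry shape (a mapping with "id" and "source" keys), the "glossary" config key, and the return type (one string per entry, in order) are assumptions made for illustration; the rules define the real contract.

```python
from typing import Any, List, Mapping, Sequence


def translate(entries: Sequence[Mapping[str, Any]], config: Mapping[str, Any]) -> List[str]:
    """Toy rule system: replace each source token found in a glossary.

    Assumed (hypothetical) shapes: each entry has a "source" string, and
    config may carry a "glossary" dict mapping lowercase source tokens to
    target-language tokens.
    """
    glossary = config.get("glossary", {})
    outputs = []
    for entry in entries:
        tokens = entry["source"].split()
        # Tokens missing from the glossary are passed through unchanged.
        outputs.append(" ".join(glossary.get(token.lower(), token) for token in tokens))
    return outputs


demo = translate(
    [{"id": "1", "source": "Hello world"}],
    {"glossary": {"hello": "HELLO_TGT"}},  # placeholder target token
)
print(demo)  # ['HELLO_TGT world']
```

A prompted LLM or a fine-tune would slot in the same way under these assumptions: same signature, different body.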

Read the rules