The Arena | champollion

Why string metrics aren't enough

On English → Plains Cree, a naive frontier LLM tops the chrF++ table, yet nothing checks whether its words are real Cree. A pipeline that routes every word through a morphological analyzer scores lower on chrF++ but verifies its morphology word by word. Same task, opposite story:

Naive frontier LLM

chrF++ 47.6

morphology unverified — fluent-looking output, no guarantee the words exist

FST-gated pipeline

chrF++ 43.2

91.5% morphologically valid words — every surface form checked by a finite-state transducer

chrF++ alone rewards confident hallucination. The Arena scores what string metrics can't see: FST acceptance, equivalence classes, semantic checks.
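
As a rough illustration of what an FST-acceptance score measures, the sketch below counts the share of output tokens that an analyzer accepts. The `accepts` callback, the whitespace tokenization, and the toy accepted set are assumptions made up for this example; a real run would query an actual finite-state transducer, and the Arena's own metric definition may differ.

```python
from typing import Callable, Iterable


def fst_acceptance_rate(
    sentences: Iterable[str],
    accepts: Callable[[str], bool],
) -> float:
    """Return the fraction of whitespace-separated tokens that `accepts` approves."""
    total = 0
    valid = 0
    for sentence in sentences:
        for token in sentence.split():
            total += 1
            if accepts(token):
                valid += 1
    # An empty output has no valid words; report 0.0 rather than dividing by zero.
    return valid / total if total else 0.0


# Hypothetical stand-in for an FST lookup: a fixed set of accepted surface forms.
# These are placeholder strings, not words of any real language.
TOY_ACCEPTED = {"alpha", "beta", "gamma"}

outputs = ["alpha beta", "gamma delta"]
rate = fst_acceptance_rate(outputs, lambda w: w in TOY_ACCEPTED)
print(f"{rate:.1%}")  # 3 of 4 tokens are in the toy set
```

A string metric such as chrF++ would score "delta" on character overlap with the reference alone; an acceptance check like this one flags it no matter how plausible it looks.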

Run the harness

One command installs the evaluation harness. Any method that implements translate(entries, config) can compete — prompted LLMs, coached pipelines, FSTs, fine-tunes, rule systems.
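
To make the entry point concrete, here is a minimal sketch of a method that implements translate(entries, config). Only the function name and its two parameters come from the text above; the shape of each entry (`id`, `source`), the `glossary` config key, and the return format are assumptions for illustration, so check the harness documentation for the real contract.

```python
from typing import Any


def translate(
    entries: list[dict[str, Any]],
    config: dict[str, Any],
) -> list[dict[str, Any]]:
    """Toy method: look each source string up in a glossary passed via config."""
    # Assumed config key; the real harness may pass different settings.
    glossary = config.get("glossary", {})
    results = []
    for entry in entries:
        source = entry["source"]  # assumed field name
        # Fall back to echoing the source when the glossary has no match.
        results.append(
            {"id": entry["id"], "translation": glossary.get(source, source)}
        )
    return results


entries = [{"id": 1, "source": "hello"}, {"id": 2, "source": "water"}]
config = {"glossary": {"hello": "<target for hello>"}}
result = translate(entries, config)
print(result)
```

A prompted LLM, an FST, or a rule system would replace the glossary lookup with its own logic while keeping the same signature.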

curl -fsSL champollion.dev/harness | bash

Prefer to read it first? The script is plain bash. It uses python3 and pipx and never calls sudo.

Read the rules

How it works

The full pipeline — datasets, runs, run cards, trust tiers.

Scoring specification

Composite score, metric definitions, tier thresholds.

Benchmark specification

What counts as a valid benchmark run and why.

Prize specification

How contests and verification will work.