One command
Paste this into a terminal. It explains itself, asks before doing anything, installs the harness if you don’t have it, helps you add an OpenRouter key if you don’t have one, then runs the highest-value open benchmarks up to your budget — showing you the exact runs and estimated cost before a single token is spent. Change the budget and the command updates.
curl -fsSL champollion.dev/give | bash
Your API key stays on your machine — the harness talks straight to OpenRouter, we never see it, and you shouldn’t share it with anyone (including us). Running needs no account with us; publishing your results asks for one OAuth sign-in so the run card carries your name. Nothing system-level is touched (no sudo), and pipx uninstall mt-eval-harness removes it completely. The script is plain bash you can read first: champollion.dev/give. New to terminals entirely? The step-by-step walkthrough is coming to this page — every command above is also explained in the contributor guide.
The fast path: hand it to your agent
If you use Claude Code or another coding agent, this is a paste-one-prompt contribution. The agent installs the harness, picks a queue item, runs it with your key, and publishes the report (you’ll approve an OAuth sign-in for attribution).
Install the Champollion mt-eval harness (curl -fsSL champollion.dev/harness | bash). Fetch https://champollion.dev/queue.json and show me the top 3 open items. Using my OpenRouter key (OPENROUTER_API_KEY), execute the run_command of the item I pick, then run `mt-eval publish` on the generated report JSON and show me the published run card.
Prefer to drive it yourself? Two commands replace the agent: curl -fsSL champollion.dev/harness | bash to install, and curl -fsSL champollion.dev/queue | bash to see the queue with ready-to-paste run commands. Both the installer and the queue viewer are plain bash you can read first; the queue viewer only displays the queue and never spends your tokens.
The queue, right now
Prioritized open (corpus, model, condition) combinations — ranked by expected chain value: how much each run strengthens the whole language mesh per estimated dollar (the formula is public and every rank is re-derivable by hand). Two people running the same item is harmless: run-card fingerprints deduplicate identical runs, and independent replications are useful data, so there’s no sign-up and no claim-locking.
The contribution ladder
Install the harness, pick any open queue item, paste its command, and publish the report. That’s a real, fingerprinted data point on a language pair nobody has measured yet. No MT background needed.
Write a coaching file — grammar rules, a small glossary, style notes for the target language — and pass it with --coaching-file. The harness injects it as the system prompt and records the full text in the run card, so your prompt craft is reproducible. Beating the naive baseline on a low-resource pair is a genuine finding.
Implement translate(entries, config) and the harness will benchmark anything inside it: FST-gated generation, dictionary lookup, retrieval, chained models. Declared dependency classes (S/O/A1/A2) keep methods comparable and auditable.
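As a rough illustration of the custom-method rung, here is a minimal sketch of a translate(entries, config) function that does plain glossary lookup. Only the function name and its two parameters come from the text above. The entry fields ("id", "source"), the "glossary" config key, and the return shape are assumptions made for this example, not the harness's documented contract, so check the contributor guide for the real one.

```python
from typing import Any


def translate(entries: list[dict[str, Any]], config: dict[str, Any]) -> list[dict[str, Any]]:
    """Word-by-word glossary lookup; words missing from the glossary pass through.

    ASSUMPTION: the "id"/"source" entry fields, the "glossary" config key, and
    the returned "translation" field are hypothetical names for illustration.
    """
    # Normalize glossary keys once so lookups are case-insensitive.
    glossary = {k.lower(): v for k, v in config.get("glossary", {}).items()}
    results = []
    for entry in entries:
        words = entry["source"].split()
        translated = [glossary.get(word.lower(), word) for word in words]
        results.append({"id": entry["id"], "translation": " ".join(translated)})
    return results


if __name__ == "__main__":
    demo = translate(
        [{"id": "1", "source": "good morning friend"}],
        {"glossary": {"good": "bon", "morning": "matin"}},
    )
    print(demo)
```

A real method would replace the lookup with whatever pipeline you want benchmarked. The point of the sketch is only that everything happens inside one function the harness can call.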
Which API key do I need?
The harness makes its calls through OpenRouter — set OPENROUTER_API_KEY (environment variable or a local .env file) and one key reaches every model in the queue lineup: Claude, GPT, and Gemini alike. If your tokens live with Anthropic, OpenAI, or Google directly, an OpenRouter account is the bridge — the harness does not yet accept direct provider keys (the run-card schema reserves an api_provider field for when it does, but today every run is an OpenRouter run). Cost tracking, model validation, and pricing snapshots all come from the same OpenRouter metadata, so what the leaderboard reports as run cost is what your key was billed.
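Before spending anything, it can help to confirm that the key is actually visible. The sketch below checks the environment variable first and then a local .env file, mirroring the two locations named above. The function name and the simple KEY=VALUE parsing are assumptions for this example, and the harness's own loader may behave differently. The script never prints the key itself.

```python
import os
from pathlib import Path
from typing import Mapping, Optional


def find_openrouter_key(
    env: Mapping[str, str] = os.environ, dotenv_path: str = ".env"
) -> Optional[str]:
    """Return OPENROUTER_API_KEY from the environment, else from a .env file.

    ASSUMPTION: this hand-rolled KEY=VALUE parser is illustrative only.
    """
    key = env.get("OPENROUTER_API_KEY")
    if key:
        return key
    path = Path(dotenv_path)
    if not path.is_file():
        return None
    for raw in path.read_text(encoding="utf-8").splitlines():
        line = raw.strip()
        # Skip blanks, comments, and lines without an assignment.
        if not line or line.startswith("#") or "=" not in line:
            continue
        name, _, value = line.partition("=")
        if name.strip() == "OPENROUTER_API_KEY":
            # Drop optional surrounding quotes; treat an empty value as unset.
            return value.strip().strip("'\"") or None
    return None


if __name__ == "__main__":
    # Report presence only; the key itself is never echoed.
    print("key found" if find_openrouter_key() else "OPENROUTER_API_KEY is not set")
```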
What your run counts as
Community submissions publish at the self-benchmarked tier — plainly labeled as “submitted by the person who ran it.” That’s not a caveat; it’s the design. Every run card carries the dataset hash, model, condition, full system prompt, and cost, so anyone can re-run your exact configuration and check the result. Elevated tiers (verification) are granted by review, not by self-assertion.
Your submitter name appears on the leaderboard row. That is the recognition on offer today — we won’t promise badges, bounties, or programs that don’t exist yet.
Each run card is fingerprinted (SHA-256 over dataset hash, model, condition, and system prompt). Identical re-runs deduplicate on publish; near-duplicates with different prompts are separate, comparable experiments.
Every queued corpus is marked do_not_train and carries its license (CC-BY family, Tatoeba-derived) in the run card. Non-commercially-licensed corpora are excluded from the open queue entirely.
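The run-card fingerprinting described above can be sketched as follows. The four inputs and the use of SHA-256 come from the text. The canonical JSON serialization and the field names are assumptions for this example, so the digests it produces will not necessarily match the ones the harness computes.

```python
import hashlib
import json


def run_fingerprint(dataset_hash: str, model: str, condition: str, system_prompt: str) -> str:
    """SHA-256 over the four run-card fields.

    ASSUMPTION: sorted-key compact JSON is a hypothetical canonical form chosen
    here; the harness's actual serialization is not specified in this page.
    """
    payload = json.dumps(
        {
            "dataset_hash": dataset_hash,
            "model": model,
            "condition": condition,
            "system_prompt": system_prompt,
        },
        sort_keys=True,
        ensure_ascii=False,
        separators=(",", ":"),
    )
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


if __name__ == "__main__":
    # All values below are made-up placeholders.
    a = run_fingerprint("abc123", "example/model", "coached", "Use formal register.")
    b = run_fingerprint("abc123", "example/model", "coached", "Use formal register.")
    c = run_fingerprint("abc123", "example/model", "coached", "Use informal register.")
    print(a == b)  # identical runs share a fingerprint
    print(a == c)  # a different prompt yields a different fingerprint
```

Because the system prompt is one of the hashed fields, changing a single word in a coaching file produces a new fingerprint, which is what keeps near-duplicates as separate experiments.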
Trust tiers, dataset rules, and scoring are specified on mtevalarena.org. See your result on the leaderboard after publishing.