Method Leaderboard

Benchmarking translation methods for Indigenous and low‑resource languages with reproducible, fingerprinted evaluation.

⚠️ These are automated proxy scores, not validated quality judgments. Community review determines deployment readiness. See the scoring specification for methodology details.

Have a method to submit? Build a plugin and submit your scores →

Trust Levels
Self-benchmarked: Active
Verified: Coming soon
Community: Coming soon
Corpus Size (n)
n < 100: below the significance floor; score gaps within ~5 chrF++ are noise
n < 50: below the development-set floor; orderings are indicative only
Small corpora always stay on the board. Expand a flagged row for a per-pair “help build this corpus” link. Our dev corpora are rebuilt from Tatoeba releases, so sentences added upstream flow into the next build.
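
The two corpus-size floors can be read as simple flags on each row. The sketch below is illustrative only: the function names and flag strings are hypothetical, and treating "within ~5 chrF++" as an inclusive 5.0-point band is an assumption, not the harness's actual rule.

```python
# Thresholds taken from the corpus-size notes above.
SIGNIFICANCE_FLOOR = 100   # n < 100: score gaps within ~5 chrF++ are noise
DEV_SET_FLOOR = 50         # n < 50: orderings are indicative only
NOISE_BAND_CHRF = 5.0      # assumption: "~5" read as an inclusive 5.0 band


def corpus_flags(n):
    """Return the (hypothetical) flag names that apply to a corpus of n sentences."""
    flags = []
    if n < SIGNIFICANCE_FLOOR:
        flags.append("below-significance-floor")
    if n < DEV_SET_FLOOR:
        flags.append("below-dev-set-floor")
    return flags


def gap_flagged_as_noise(score_a, score_b, n):
    """True when a score gap falls inside the noise band on a small corpus.

    For n >= 100 this returns False, which only means the page's rule of
    thumb does not apply, not that the gap is statistically significant.
    """
    return n < SIGNIFICANCE_FLOOR and abs(score_a - score_b) <= NOISE_BAND_CHRF
```

A corpus of 30 sentences would carry both flags, one of 75 only the first, and one of 100 or more neither.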

⚠️ LLM outputs are non-deterministic. Scores represent point-in-time measurements under specific model versions and API configurations. Model providers may update weights, decoding strategies, or safety filters at any time, which can cause score drift between runs.

How It Works

  1. Fingerprinted Pipelines: Each submission is tied to a specific Git commit and pipeline configuration, ensuring results can be traced back to the exact code that produced them.
  2. Versioned Datasets: Evaluation datasets are content-hashed and versioned. Scores are only comparable within the same dataset version, preventing silent data contamination.
  3. Standardised Harness: All metrics are computed by the shared champollion evaluation harness, eliminating implementation differences between submissions.
  4. Open Submission: Anyone can submit results by opening a pull request with their method's JSON entry and pipeline fingerprint. Verified and Community trust tiers will be available soon.
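
As a rough illustration of the first two steps and the JSON entry in the last one, the sketch below derives a pipeline fingerprint from a commit hash plus configuration, content-hashes a dataset, and checks whether two entries are comparable. It is not the champollion harness's implementation: the use of SHA-256, every function name, and every JSON field name here are assumptions made for the example.

```python
import hashlib
import json


def content_hash(sentences):
    """SHA-256 over the sentences in order, one per line (assumed scheme)."""
    digest = hashlib.sha256()
    for sentence in sentences:
        digest.update(sentence.encode("utf-8") + b"\n")
    return digest.hexdigest()


def pipeline_fingerprint(commit, config):
    """Hash the Git commit together with a canonical form of the config.

    Sorting keys makes the fingerprint independent of dict ordering.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    return hashlib.sha256(f"{commit}\n{canonical}".encode("utf-8")).hexdigest()


def comparable(entry_a, entry_b):
    """Scores are only comparable within the same dataset version."""
    return entry_a["dataset_hash"] == entry_b["dataset_hash"]


def build_entry(method, commit, config, sentences, chrf_pp):
    """Assemble a hypothetical submission entry; field names are illustrative."""
    return {
        "method": method,
        "pipeline_fingerprint": pipeline_fingerprint(commit, config),
        "dataset_hash": content_hash(sentences),
        "n": len(sentences),
        "chrf_pp": chrf_pp,
    }
```

Calling `json.dumps(build_entry(...), indent=2)` would give a file of the kind a pull request might carry, though the real entry schema is defined by the project, not by this sketch.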