Method Leaderboard | champollion

Trust Levels

Self-benchmarked: Active

Self-benchmarked: Coming soon

Corpus Size (n)

n < 100: below the significance floor; score gaps within ~5 chrF++ are noise

n < 50: below the development-set floor; orderings indicative only

Small corpora always stay on the board. Expand a flagged row for a per-pair “help build this corpus” link — our dev corpora rebuild from Tatoeba releases, so sentences added upstream flow into the next build.
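
The two corpus-size floors above can be read as a simple flagging rule. The following is a minimal sketch, not the champollion implementation: the function names, flag strings, and the exact 5.0 cut-off (the page only says "~5 chrF++") are assumptions made for illustration.

```python
from typing import List

# Thresholds taken from the text above; the names are hypothetical.
SIGNIFICANCE_FLOOR = 100  # n below this: gaps within ~5 chrF++ are noise
DEV_SET_FLOOR = 50        # n below this: orderings are indicative only
NOISE_BAND_CHRF = 5.0     # assumed hard cut-off for the "~5 chrF++" band


def corpus_flags(n: int) -> List[str]:
    """Return the flags that apply to a corpus of n sentences."""
    flags = []
    if n < SIGNIFICANCE_FLOOR:
        flags.append("below-significance-floor")
    if n < DEV_SET_FLOOR:
        flags.append("below-dev-floor")
    return flags


def gap_within_noise(score_a: float, score_b: float, n: int) -> bool:
    """True when the page's rule would call the score gap noise.

    Only the n < 100 case is described in the text; for larger corpora
    this sketch returns False, meaning "not flagged by this rule", which
    is not the same as statistically significant.
    """
    return n < SIGNIFICANCE_FLOOR and abs(score_a - score_b) <= NOISE_BAND_CHRF


if __name__ == "__main__":
    print(corpus_flags(30))
    print(gap_within_noise(41.0, 44.5, 80))
```

A corpus with n < 50 carries both flags, since it is also below the significance floor.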

⚠️ LLM outputs are non-deterministic. Scores represent point-in-time measurements under specific model versions and API configurations. Model providers may update weights, decoding strategies, or safety filters at any time, which can cause score drift between runs.

How It Works

1. Fingerprinted Pipelines: Each submission is tied to a specific Git commit and pipeline configuration, ensuring results can be traced back to the exact code that produced them.
2. Versioned Datasets: Evaluation datasets are content-hashed and versioned. Scores are only comparable within the same dataset version, preventing silent data contamination.
3. Standardised Harness: All metrics are computed by the shared champollion evaluation harness, eliminating implementation differences between submissions.
4. Open Submission: Anyone can submit results by opening a pull request with their method's JSON entry and pipeline fingerprint. Verified and Community trust tiers will be available soon.
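
As a rough illustration of the fingerprinting and content-hashing steps, here is one way such identifiers could be derived. This is a sketch under assumptions: the choice of SHA-256, the function names, and the `dataset_hash` field are hypothetical and do not describe champollion's actual schema or harness.

```python
import hashlib
import json
from typing import Any, Dict, Iterable


def content_hash(sentences: Iterable[str]) -> str:
    """Hash dataset contents so any change to the data changes the hash.

    Each sentence is length-prefixed so that ["a", "b"] and ["ab"]
    produce different digests.
    """
    h = hashlib.sha256()
    for sentence in sentences:
        data = sentence.encode("utf-8")
        h.update(len(data).to_bytes(8, "big"))
        h.update(data)
    return h.hexdigest()


def pipeline_fingerprint(git_commit: str, config: Dict[str, Any]) -> str:
    """Combine a Git commit and a pipeline config into one identifier.

    The config is serialised with sorted keys so that key order does
    not affect the fingerprint.
    """
    canonical = json.dumps(config, sort_keys=True, separators=(",", ":"))
    payload = f"{git_commit}\n{canonical}".encode("utf-8")
    return hashlib.sha256(payload).hexdigest()


def comparable(entry_a: Dict[str, Any], entry_b: Dict[str, Any]) -> bool:
    """Scores are comparable only within the same dataset version."""
    return entry_a["dataset_hash"] == entry_b["dataset_hash"]


if __name__ == "__main__":
    dataset = ["Bonjour.", "Merci beaucoup."]
    entry = {
        "dataset_hash": content_hash(dataset),
        "fingerprint": pipeline_fingerprint("abc1234", {"beam": 4, "model": "demo"}),
    }
    print(json.dumps(entry, indent=2))
```

Canonical serialisation is the important design choice here: without sorted keys, two identical configurations written in a different key order would yield different fingerprints.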