Research | champollion

Specifications

Benchmark specification

What constitutes a valid run: datasets, fingerprints, run cards.

Scoring specification

Composite score construction, metric definitions, quality tiers.

Significance testing

Bootstrap confidence intervals and paired comparison methodology.

Corpus design

How evaluation corpora are built, versioned, and contamination-checked.
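
The significance-testing entry above names bootstrap confidence intervals and paired comparison. As a rough illustration of the general technique only, here is a minimal sketch of a paired percentile bootstrap over per-segment score differences. The function name `paired_bootstrap`, its parameters, and the choice of a percentile interval are assumptions made for this example; the linked spec defines the actual methodology.

```python
import random


def paired_bootstrap(scores_a, scores_b, n_resamples=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean per-segment difference (A - B).

    Hypothetical helper: scores_a and scores_b are per-segment scores for two
    systems on the same segments, in the same order.
    """
    if len(scores_a) != len(scores_b) or not scores_a:
        raise ValueError("need two non-empty, equal-length score lists")

    rng = random.Random(seed)  # fixed seed so the interval is reproducible
    n = len(scores_a)

    # Pairing: compare the two systems segment by segment.
    diffs = [a - b for a, b in zip(scores_a, scores_b)]
    observed = sum(diffs) / n

    # Resample segments with replacement and record the mean difference.
    means = []
    for _ in range(n_resamples):
        sample = [diffs[rng.randrange(n)] for _ in range(n)]
        means.append(sum(sample) / n)
    means.sort()

    # Percentile interval: cut alpha/2 from each tail.
    lo = means[int((alpha / 2) * n_resamples)]
    hi = means[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return observed, lo, hi


observed, lo, hi = paired_bootstrap([1, 2, 3, 4, 5], [0, 0, 0, 0, 0])
print(f"mean difference {observed:.2f}, 95% CI [{lo:.2f}, {hi:.2f}]")
```

If the interval excludes zero, the difference between the two systems is unlikely to be a resampling artifact at the chosen level. Resampling the paired differences, rather than each system independently, keeps segment difficulty from inflating the variance.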

Corpora

The dataset registry currently tracks 48 development corpora across low-resource language pairs, each with a license, provenance notes, and a do-not-train flag where the source requires it. Held-out test sets stay sealed; dev sets are open for iteration. Contamination findings are published, not buried — see the corpus design spec for the audit trail.

Citation & licensing

The language-card layer draws on 332 registered upstream sources — Glottolog, WALS, Grambank, PHOIBLE, Lexibank, and friends — each tracked with its license and attribution requirements. Cards record per-field provenance (_fieldSources), so any fact can be traced, challenged, and corrected.
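
To make per-field provenance concrete, here is a minimal sketch of how a card carrying a `_fieldSources` map might be queried. Only the `_fieldSources` name comes from the text above. The card layout, the field names, the placeholder values, the source identifiers, and both helper functions are assumptions for illustration; the language card spec defines the real schema.

```python
# Hypothetical card: placeholder values, not a real language record.
card = {
    "name": "Examplish",
    "family": "Example family",
    "speakers": 1000,
    # Maps each field name to the upstream sources that back it (assumed shape).
    "_fieldSources": {
        "name": ["glottolog"],
        "family": ["glottolog"],
    },
}


def sources_for(card, field):
    """Return the list of sources recorded for one field (empty if none)."""
    return list(card.get("_fieldSources", {}).get(field, []))


def unsourced_fields(card):
    """Return data fields that have no recorded source, sorted by name."""
    provenance = card.get("_fieldSources", {})
    return sorted(
        key for key in card
        if not key.startswith("_") and not provenance.get(key)
    )


print(sources_for(card, "family"))  # sources backing the "family" field
print(unsourced_fields(card))       # fields that still need a citation
```

A check like `unsourced_fields` is one way a fact could be flagged for challenge or correction: any field without a recorded source is a candidate for review.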

Citation procedure

How facts enter a card and how sources are recorded.

Language card spec

The full schema for the 7,959-card dataset.

Get in touch

Collaboration, corpus partnerships, corrections, or skepticism — all welcome. Open an issue on GitHub or start with the corpus partnership spec.