
RESEARCHER

Reproduce everything

Every benchmark run is fingerprinted to a git commit. Every corpus is versioned and content-hashed. Every fact on every language card names its source. That's the whole methodology.
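As a rough illustration of what fingerprinting a run and content-hashing a corpus could look like, here is a minimal sketch. The function names (`content_hash`, `current_commit`, `run_fingerprint`) and the record layout are assumptions made for this example, not the project's actual tooling; the choice of SHA-256 is likewise assumed.

```python
import hashlib
import subprocess
from pathlib import Path


def content_hash(corpus_dir: Path) -> str:
    """Hash every file under corpus_dir (hypothetical layout) with SHA-256.

    Files are visited in sorted relative-path order, and each path is mixed
    into the digest, so renames and content changes both alter the result.
    """
    digest = hashlib.sha256()
    files = [p for p in corpus_dir.rglob("*") if p.is_file()]
    for path in sorted(files, key=lambda p: p.relative_to(corpus_dir).as_posix()):
        digest.update(path.relative_to(corpus_dir).as_posix().encode("utf-8"))
        digest.update(b"\0")  # separator between path and content
        digest.update(path.read_bytes())
        digest.update(b"\0")  # separator between files
    return digest.hexdigest()


def current_commit() -> str:
    """Return the commit hash of HEAD for the repository in the working directory."""
    result = subprocess.run(
        ["git", "rev-parse", "HEAD"],
        capture_output=True,
        text=True,
        check=True,
    )
    return result.stdout.strip()


def run_fingerprint(commit: str, corpus_hashes: dict[str, str]) -> dict:
    """Bundle the code version and corpus versions into one record (assumed shape)."""
    return {"commit": commit, "corpora": dict(sorted(corpus_hashes.items()))}
```

A run record would then be built with something like `run_fingerprint(current_commit(), {"my-corpus": content_hash(Path("corpora/my-corpus"))})`, where the corpus name and path are placeholders. Storing that record next to the scores lets anyone check out the same commit and confirm they hold the same bytes.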

Specifications

Corpora

The dataset registry currently tracks 48 development corpora across low-resource language pairs, each with a license, provenance notes, and a do-not-train flag where the source requires it. Held-out test sets stay sealed; dev sets are open for iteration. Contamination findings are published, not buried — see the corpus design spec for the audit trail.
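To make the registry fields concrete, a registry entry might be modelled along these lines. This is only a sketch: the `CorpusEntry` class, its field names, and the `trainable` helper are hypothetical, chosen to mirror the license, provenance, split, and do-not-train properties described above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CorpusEntry:
    """One registry row (hypothetical schema, not the project's actual one)."""
    name: str
    language_pair: str
    license: str
    provenance: str
    split: str  # "dev" (open for iteration) or "test" (sealed)
    do_not_train: bool = False


def trainable(entries: list[CorpusEntry]) -> list[CorpusEntry]:
    """Keep only corpora that may be trained on: not flagged, and not a sealed test set."""
    return [e for e in entries if not e.do_not_train and e.split != "test"]
```

Making the flag an explicit field means a training pipeline can filter on it mechanically instead of relying on someone remembering the license terms.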

Citation & licensing

The language-card layer draws on 332 registered upstream sources — Glottolog, WALS, Grambank, PHOIBLE, Lexibank, and friends — each tracked with its license and attribution requirements. Cards record per-field provenance (_fieldSources), so any fact can be traced, challenged, and corrected.
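A sketch of how per-field provenance could be traced and audited is shown below. Only the `_fieldSources` key comes from the text above; the card shape, the source registry layout, and the helper names `trace` and `unsourced_fields` are assumptions for illustration.

```python
def trace(card: dict, field: str, source_registry: dict) -> dict:
    """Look up the registered source record behind one field of a card."""
    source_id = card["_fieldSources"][field]
    return source_registry[source_id]


def unsourced_fields(card: dict) -> list[str]:
    """List data fields that have no entry in _fieldSources."""
    sources = card.get("_fieldSources", {})
    return sorted(
        key for key in card
        if not key.startswith("_") and key not in sources
    )


# Hypothetical data: the IDs, values, and license strings are placeholders.
source_registry = {
    "source-a": {"name": "Example Source A", "license": "example-license"},
}
card = {
    "name": "Example language",
    "family": "Example family",
    "_fieldSources": {"family": "source-a"},
}
```

Here `trace(card, "family", source_registry)` returns the record for `source-a`, and `unsourced_fields(card)` reports `name` as lacking a source, which is the kind of gap a check like this would surface for correction.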

Get in touch

Collaboration, corpus partnerships, corrections, or skepticism — all welcome. Open an issue on GitHub or start with the corpus partnership spec.