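Language Card Specification
Single source of truth. This document defines the canonical shape of every language card. Every card must contain every top-level field listed here, even if the value is null or []. A card with missing fields does not conform. This uniformity lets automated tools, linters, enrichment scripts, and human reviewers trust the structure of a card.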
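Design Principles
- Uniform shape. All 8,000+ cards have the same top-level fields. Unknown values are null, empty arrays are [], and empty objects are null (not {}). Code therefore never has to ask "does this field exist?", only "does it have a value?"
- Source everything. Every factual claim traces to a named, versioned, primary source. An unsourced claim is an unverifiable claim. The dataSources field (and the per-field source annotations in sub-objects) make provenance explicit.
- Preserve disagreement. When authorities disagree (Wikidata says 50,000 speakers, Ethnologue says 20,000), we store both and label each with its source. We do not average, resolve, or pick a side. Users can handle the nuance.
- Null means unknown, not inapplicable. If a field is null, it means "we have not yet found data on this." If a field is genuinely not applicable (for example, grammatical gender for a sign language), the value should explain that: { "grammatical": false, "inclusiveGuidance": "Not applicable: American Sign Language has no grammatical gender." }
- Merge only. Enrichment scripts add data and never overwrite. Human-curated values take precedence over automated data.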
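The merge-only principle can be illustrated with a small helper. This is a minimal Python sketch, not the project's enrichment code: the name `merge_only` is hypothetical, and treating both null and [] as "no value yet" is an assumption drawn from the uniform-shape rule.

```python
from copy import deepcopy


def merge_only(card: dict, enrichment: dict) -> dict:
    """Return a copy of `card` with gaps filled from `enrichment`.

    Assumption: a field is a gap when it is None or an empty list.
    Values that are already set are never replaced.
    """
    result = deepcopy(card)
    for field, new_value in enrichment.items():
        current = result.get(field)
        if current is None or current == []:
            result[field] = deepcopy(new_value)
        elif isinstance(current, dict) and isinstance(new_value, dict):
            # Recurse so nested gaps are filled without touching set values.
            result[field] = merge_only(current, new_value)
    return result


card = {"code": "crk", "script": "Cans", "macroarea": None, "countries": []}
enriched = merge_only(
    card,
    {"script": "Latn", "macroarea": "North America", "countries": ["CA", "US"]},
)
print(enriched)
```

Here the curated script value "Cans" survives, while the null macroarea and the empty countries array are filled in. The input card is left unmodified.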
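Three-Layer Architecture
| Layer | Location | Purpose |
|---|---|---|
| Language card | shared/language-cards/<code>.json | Per-language configuration: identity, classification, resources, everything |
| Genus card | shared/language-cards/genera/<genus>.json | Shared runtime properties for related languages (curated, not auto-generated) |
| Language tree | shared/language-cards/language-tree.json | The full Glottolog hierarchy; reference data for the Lab UI and language discovery |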
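Inheritance Model
When a card sets "extends": "family-dravidian", the runtime uses _deepMerge() (in lib/registers.js) to merge the parent card into the child card. This lets a genus card define shared registers, formality systems, and gender guidance that flow down to all member languages, without duplicating the data across hundreds of individual cards.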
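Merge Semantics
| Child value | Behavior | Reason |
|---|---|---|
| null | Inherit from parent | null means "I do not define this", so the parent's value flows down |
| Non-null | Override parent | The child's data is more specific, so it takes precedence |
| Nested object | Recursive merge | Child fields override; parent fields are retained |
| Array | Full replacement | Arrays are not merged item by item; the child's array wins |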
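The four rows of the table can be sketched in a few lines of Python. This is an illustration of the stated semantics only, not a port of _deepMerge() from lib/registers.js; the name `deep_merge` is hypothetical, and the identity-field exception is deliberately left out.

```python
from copy import deepcopy


def deep_merge(parent: dict, child: dict) -> dict:
    """Merge `parent` into `child` following the table's four rules."""
    merged = deepcopy(parent)
    for key, child_value in child.items():
        parent_value = merged.get(key)
        if child_value is None:
            # null child: inherit (keep the parent's value, or None if absent)
            merged.setdefault(key, None)
        elif isinstance(child_value, dict) and isinstance(parent_value, dict):
            # nested object: recurse so untouched parent fields are retained
            merged[key] = deep_merge(parent_value, child_value)
        else:
            # scalars and arrays: the child's value replaces the parent's
            merged[key] = deepcopy(child_value)
    return merged


parent = {
    "formality": {"system": "T-V", "default": "formal"},
    "countries": ["CA"],
    "dir": "ltr",
}
child = {
    "formality": {"system": None, "default": "informal"},
    "countries": ["US"],
    "dir": None,
}
merged = deep_merge(parent, child)
print(merged)
```

Under a literal reading of the table, a child's [] also replaces the parent's array, since [] is non-null. Whether the real runtime treats an empty array as "unset" is not stated here.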
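Identity Fields (Never Inherited)
Some fields belong to the card itself and must never be inherited from a parent:
code, extends, _migration, aliases, iso639_1, iso639_3
Even if a parent card defines aliases: ["macro-code"], the child card does not inherit those aliases. These fields always hold the child's own value (including null when unset).
Reason: without this rule, every Cree language would inherit aliases: ["cre"] from the macrolanguage parent, making every variety an alias of the macrolanguage.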
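One way to picture the identity rule is as a final step applied after any parent/child merge. The sketch below is an assumption about how such a step could look, not the runtime's actual code; `restore_identity` and the card key "macrolanguage-cree" are hypothetical names used only for illustration.

```python
IDENTITY_FIELDS = ("code", "extends", "_migration", "aliases", "iso639_1", "iso639_3")


def restore_identity(child: dict, merged: dict) -> dict:
    """Overwrite identity fields in `merged` with the child's own values.

    A field the child leaves unset comes back as None, never as the
    parent's value.
    """
    result = dict(merged)
    for field in IDENTITY_FIELDS:
        result[field] = child.get(field)
    return result


parent = {"code": "macrolanguage-cree", "aliases": ["cre"], "dir": "ltr"}
child = {"code": "crk", "extends": "macrolanguage-cree", "aliases": None, "dir": None}

# A naive merge treats aliases like any other field and leaks the parent's value.
naive = {**parent, **{k: v for k, v in child.items() if v is not None}}
fixed = restore_identity(child, naive)
print(naive["aliases"], fixed["aliases"], fixed["dir"])
```

The non-identity field dir is still inherited from the parent, while aliases falls back to the child's own null.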
Example: How the Cree Card Resolves
┌───────────────────────┐
│  family-algic.json    │  formality: null, registers: null
│  (no registers)       │
└──────────┬────────────┘
           │ extends
┌──────────┴────────────┐
│  genus-cree.json      │  formality: { system: "obviative-animate", ... }
│  (sourced registers)  │  registers: { formal: {...}, informal: {...} }
└──────────┬────────────┘
           │ extends
┌──────────┴────────────┐
│  crk.json             │  code: "crk", extends: "genus-cree"
│  (Plains Cree)        │  formality: null → inherits from genus-cree
│                       │  registers: null → inherits from genus-cree
│                       │  script: "Cans" → own value, no inheritance
│                       │  code: "crk" → identity field, never inherited
└───────────────────────┘
At runtime, getLanguageCard("crk") returns a merged object containing genus-cree's registers, family-algic's properties (if any), and crk's own identity and metadata.
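The order in which cards are merged follows from walking the extends links. The sketch below assumes the cards are already loaded into an in-memory dict keyed by card code; how getLanguageCard() actually loads files is not described here, and `extends_chain` is a hypothetical helper.

```python
def extends_chain(code: str, cards: dict) -> list:
    """Return card keys from the most distant ancestor down to `code`."""
    chain = []
    seen = set()
    current = code
    while current is not None:
        if current in seen:
            raise ValueError(f"Cycle in extends chain at {current!r}")
        seen.add(current)
        chain.append(current)
        current = cards[current].get("extends")
    chain.reverse()  # ancestors first, so later merges are more specific
    return chain


cards = {
    "family-algic": {"code": "family-algic", "extends": None},
    "genus-cree": {"code": "genus-cree", "extends": "family-algic"},
    "crk": {"code": "crk", "extends": "genus-cree"},
}
order = extends_chain("crk", cards)
print(order)  # ['family-algic', 'genus-cree', 'crk']
```

Merging the cards in this order, one onto the next, would give the most specific card the last word. The cycle check guards against two cards that extend each other.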
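Genus Card Template
Genus cards live in shared/language-cards/genera/ and define shared properties for a language group. They follow the same schema as regular cards, but with different conventions: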
{
  // Identity: genus cards use a prefixed code, NOT an ISO 639-3 code
  "code": "genus-cree",            // "genus-", "family-", or "macrolanguage-" prefix
  "name": "Cree Languages",        // Human-readable group name
  "extends": "family-algic",       // Genus cards can extend family cards (chaining)

  // Formality: shared across the group, sourced from typological databases
  "formality": {
    "system": "obviative-animate",
    "description": "Cree languages use an obviative/proximate system...",
    "default": "formal",
    "source": "WALS 37A, 38A + Wolfart 1973"
  },

  // Registers: shared presets, if the group shares a formality system
  "registers": {
    "formal": {
      "label": "Formal (Proximate)",
      "description": "...",
      "prompt": "...",
      "isDefault": true
    },
    "informal": {
      "label": "Informal",
      "description": "...",
      "prompt": "..."
    }
  },

  // Gender: shared grammatical gender behavior
  "gender": {
    "grammatical": false,          // Cree doesn't have grammatical gender
    "inclusiveGuidance": null      // so no inclusive guidance needed
  },

  // Everything else is null: individual cards provide their own
  // classification, geography, resources, etc.
  "classification": null,
  "methodSupport": null,
  // ...
}
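Key rule: a genus card must contain only data that is genuinely shared across the whole group and sourced from an authoritative reference. If the formality system varies between members, it belongs on the individual cards, not on the genus card.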
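Some of the genus-card conventions lend themselves to a mechanical check. The following is a hypothetical lint sketch, not an existing project tool. It assumes a populated formality object should carry a source and that its default should name a key in the same card's registers; a genus card that inherits its registers from a family card would need the merged view instead.

```python
GROUP_PREFIXES = ("genus-", "family-", "macrolanguage-")


def lint_group_card(card: dict) -> list:
    """Return human-readable problems found in a genus or family card."""
    problems = []
    code = card.get("code") or ""
    if not code.startswith(GROUP_PREFIXES):
        problems.append(f"code {code!r} lacks a group prefix")
    formality = card.get("formality")
    registers = card.get("registers") or {}
    if formality is not None:
        if not formality.get("source"):
            problems.append("formality is populated but has no source")
        default = formality.get("default")
        if default is not None and default not in registers:
            problems.append(f"formality.default {default!r} is not a register key")
    return problems


genus_cree = {
    "code": "genus-cree",
    "formality": {
        "system": "obviative-animate",
        "default": "formal",
        "source": "WALS 37A, 38A + Wolfart 1973",
    },
    "registers": {"formal": {"label": "Formal (Proximate)"}, "informal": {"label": "Informal"}},
}
print(lint_group_card(genus_cree))
```

Whether shared data is "genuinely shared across the whole group" remains a human judgment; a lint like this only catches the structural slips.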
Canonical Template
Every card must have this exact top-level shape. Sub-object schemas are documented in the field reference below.
{
// ═══════════════════════════════════════════════════════════════════════
// § 1. IDENTITY
// Who is this language? What codes identify it?
// Sources: ISO 639-3 registry, ISO 639-1, BCP 47/IANA.
// ═══════════════════════════════════════════════════════════════════════
"code": "xxx", // REQUIRED. ISO 639-3 code. This IS the card ID and filename.
"name": "English Name", // REQUIRED. English reference name from ISO 639-3 registry.
"nativeName": null, // Endonym (name in the language itself). Source: Wikidata P1705.
// Examples: "nêhiyawêwin / ᓀᐦᐃᔭᐍᐏᐣ", "日本語", "Esperanto".
"alternateNames": [], // Other names this language is known by. Source: Glottolog, Ethnologue.
// Not aliases (those are code-level). These are name-level variants.
// Example: ["Qafar af", "Afaraf", "'Afar Af"] for Afar (aar).
"iso639_3": "xxx", // REQUIRED. Three-letter ISO 639-3 code. Same as `code`.
"iso639_1": null, // Two-letter ISO 639-1 code (e.g., "en", "fr"). null if none.
"bcp47": null, // IETF BCP 47 tag. Often same as iso639_1. Can include subtags
// (e.g., "iu-Cans-CA"). null if unknown.
"aliases": [], // Alternative code-level identifiers that resolve to this card.
// Example: ["fil"] for tgl (Tagalog), ["iu"] for iku (Inuktitut).
// Used by code resolution: user types "fil", system loads tgl.json.
"isoScope": "I", // REQUIRED. ISO 639-3 scope:
// "I" = Individual language
// "M" = Macrolanguage (e.g., Chinese, Arabic, Cree)
// "S" = Special (e.g., mis, mul, zxx)
"isoType": "L", // REQUIRED. ISO 639-3 type:
// "L" = Living "E" = Extinct "A" = Ancient
// "H" = Historical "C" = Constructed
"macrolanguage": null, // If this language is part of a macrolanguage, the macrolanguage
// ISO 639-3 code (e.g., "cre" for Plains Cree, "ara" for Arabic
// varieties). Source: ISO 639-3 macrolanguages.tab.
"extends": null, // Genus card key if shared properties are inherited from a genus
// card (e.g., "genus-cree", "genus-eskimo-aleut").
// null for most languages.
// ═══════════════════════════════════════════════════════════════════════
// § 2. CLASSIFICATION
// Where does this language sit in the family tree?
// Source: Glottolog. NEVER hand-build classifications.
// ═══════════════════════════════════════════════════════════════════════
"glottocode": null, // Glottolog identifier (e.g., "plai1258", "stan1293").
// null if the language is not in Glottolog.
"classification": null, // Genealogical classification from Glottolog. When populated:
// {
// "family": "Algic", // Top-level family. null for isolates.
// "familyGlottocode": "algi1248", // Glottocode of the family.
// "genus": "Plains Creeic", // WALS-style genus.
// "genusGlottocode": "plai1264", // Glottocode of the genus.
// "ancestry": ["Algic", "Algonquian-Blackfoot", "Algonquian",
// "Cree-Montagnais-Naskapi", "Cree", "Plains Creeic"]
// }
// For isolates: family = language name, genus = language name,
// ancestry = [language name].
"isIsolate": false, // true if a language isolate (no known genetic relatives).
// Source: Glottolog CLDF.
// ═══════════════════════════════════════════════════════════════════════
// § 3. GEOGRAPHY
// Where is this language spoken?
// Sources: Glottolog (coordinates, countries), census data, Ethnologue.
// ═══════════════════════════════════════════════════════════════════════
"macroarea": null, // Glottolog macroarea. One of: "Africa", "Australia",
// "Eurasia", "North America", "Papunesia", "South America".
// null if unknown. Source: Glottolog CLDF.
"coordinates": null, // Representative geographic point. When populated:
// { "lat": 52.1, "lng": -106.6, "source": "glottolog-5.3" }
// This is a representative point, not a boundary.
"countries": [], // ISO 3166-1 alpha-2 country codes where this language is spoken.
// Example: ["CA", "US"]. Source: Glottolog.
"regions": [], // Detailed regional breakdown with admin codes & speaker estimates.
// Each entry:
// {
// "country": "Canada",
// "countryCode": "CA",
// "officialStatus": "recognized", // official, co-official,
// // recognized, none
// "region": "Saskatchewan, Alberta, Manitoba",
// "speakerEstimate": "~20,000",
// "coordinates": [-106.6, 52.1], // [lng, lat]
// "admin1Codes": ["CA-SK", "CA-AB", "CA-MB"]
// }
"arealContext": null, // Linguistic area / Sprachbund membership. DISTINCT from
// contactInfluences (which is language-specific contact history).
// This field captures zone-level typological convergence patterns
// — i.e., what linguistic area the language exists within and
// what features are common across that area.
// {
// "zone": "Mainland Southeast Asian Sprachbund",
// "arealFeatures": "Tonal convergence, classifier systems,
// topic-prominence, monosyllabicity trend.",
// "typicalContacts": ["Classical Chinese", "Sanskrit/Pali"],
// "source": "areal-linguistics (Enfield 2005)"
// }
// NOT the same as contactInfluences. A language can exist within
// a convergence area without having specific contact history with
// any particular language in that area.
// ═══════════════════════════════════════════════════════════════════════
// § 4. WRITING SYSTEMS
// How is this language written?
// Sources: Wikidata P282, ISO 15924, manual research.
// Note: Some languages have NO standardized orthography. Some have
// competing orthographies. Some use multiple scripts routinely (e.g.,
// Serbian: Cyrillic + Latin; Japanese: Kanji + Hiragana + Katakana).
// Sign languages may use notation systems (SignWriting, HamNoSys) or
// none at all.
// ═══════════════════════════════════════════════════════════════════════
"script": null, // Primary ISO 15924 script code (e.g., "Latn", "Cyrl", "Cans",
// "Jpan"). null if no written form or unknown.
"scriptUnicodeName": null, // Unicode script block name derived from the script field.
// e.g., "Latin", "Cyrillic", "Canadian_Aboriginal", "CJK".
// Used by code_switching metric plugin. Auto-populated by
// enrich-script-unicode-names.mjs. null if script is null.
"scripts": [], // All writing systems with detail. Array of:
// {
// "code": "Cans",
// "name": "Unified Canadian Aboriginal Syllabics",
// "primary": true
// }
// A language with multiple scripts has multiple entries.
// A language with no written form has [].
"dir": null, // Writing direction: "ltr" (left-to-right) or "rtl" (right-to-left).
// null if no written form or unknown.
"scriptConverter": null, // Script converter key if we have a converter for this language
// (e.g., "crk" for SRO↔Syllabics). null for most languages.
"orthographicStatus": null, // Writing system standardization status. When populated:
// {
// "status": "standardized",
// // "standardized" — official/agreed orthography exists
// // "competing" — multiple orthographies in active use
// // "emerging" — orthography under development
// // "none" — primarily oral, no standard writing
// "notes": "Uses SIL-developed Latin orthography since 1960s.",
// "source": "ethnologue" // or "manual-curation"
// }
// Crucial for LRLs where orthographic variation directly impacts
// MT training data quality and evaluation consistency.
// ═══════════════════════════════════════════════════════════════════════
// § 5. DEMOGRAPHICS & VITALITY
// How many people speak this language? Is it endangered?
// Sources: Census, Ethnologue, UNESCO Atlas, Wikidata, Glottolog AES.
//
// CRITICAL: Store ALL estimates separately with source attribution.
// Never average or "resolve" conflicting data. Speaker counts are
// politically contested for many languages. Present the evidence,
// let the reader assess.
// ═══════════════════════════════════════════════════════════════════════
"speakerEstimates": [], // Array of speaker count estimates from different authorities.
// Each entry:
// {
// "source": "wikidata", // or "ethnologue-28",
// // "census-ph-2020", etc.
// "count": 20000, // Point estimate. null if range-only.
// "date": "2026-06-07", // When this data was retrieved.
// "countRange": { "min": 15000, "max": 25000 }, // Optional range.
// "note": "Wikidata has 2 estimates: 15,000 and 25,000"
// }
// Empty array means we have not yet found speaker count data.
"vitality": null, // Endangerment / vitality assessment. When populated:
// {
// "unescoStatus": "severely-endangered",
// // Enum: "safe", "vulnerable", "definitely-endangered",
// // "severely-endangered", "critically-endangered",
// // "extinct"
// "aesStatus": "shifting",
// // Glottolog AES label (free text from AES data).
// "egids": "6b",
// // Ethnologue Expanded Graded Intergenerational Disruption
// // Scale. Levels: 0 (international) to 10 (extinct).
// "trend": "declining",
// // Qualitative trend: "stable", "growing", "declining",
// // "shifting", "moribund", "awakening"
// "source": "glottolog-aes-5.3",
// "notes": "Intergenerational transmission breaking down."
// }
// ═══════════════════════════════════════════════════════════════════════
// § 5.5. DOCUMENTATION & DIGITAL PRESENCE
// How well-documented is this language? What digital footprint does it
// have? These fields answer the practical question: "What can I
// actually DO with this language?"
// Sources: Glottolog (references), Wikipedia, Common Voice, Tatoeba.
// ═══════════════════════════════════════════════════════════════════════
"documentationDepth": null, // How well-documented is this language in the literature?
// {
// "referenceCount": 42,
// // Number of published references in Glottolog.
// "med": "grammar",
// // Most Extensive Description type. One of:
// // "long_grammar", "grammar", "grammar_sketch",
// // "dictionary", "phonology", "text", "wordlist",
// // "comparative", "minimal", "unknown"
// "source": "glottolog-5.3"
// }
"digitalPresence": null, // Digital footprint across web platforms. When populated:
// {
// "wikipedia": {
// "edition": true, // Has its own Wikipedia edition?
// "articleCount": 75000, // Number of articles.
// "editionCode": "crk", // Wikipedia subdomain code.
// "source": "wikimedia-api-2026"
// },
// "commonVoice": {
// "validatedHours": 12.5,
// "totalHours": 25.0,
// "speakers": 45,
// "sentences": 1200,
// "source": "common-voice-20.0"
// },
// "tatoeba": {
// "sentenceCount": 342,
// "source": "tatoeba-2026"
// }
// }
"dialectCount": null, // Number of recognized dialects in Glottolog.
// Derived from child_dialect_count in languoid.csv.
// Simple integer. null if 0 or unknown.
// Source: glottolog-5.3.
// ═══════════════════════════════════════════════════════════════════════
// § 6. FORMALITY, REGISTERS & GENDER
// How does politeness work in this language? What translation registers
// do we offer? How should gender be handled?
//
// This section drives Champollion's register-preset system — the
// mechanism by which users select formal/informal/professional tone.
// These fields require genuine linguistic research, not automation.
// ═══════════════════════════════════════════════════════════════════════
"formality": null, // Formality system description. When populated:
// {
// "system": "T-V",
// // One of: "T-V", "speech-levels", "keigo", "particles",
// // "register-levels", "register-and-code-switching",
// // "code-switching", "none"
// "description": "French uses a vous/tu distinction...",
// "default": "formal-vous" // Key into the `registers` object.
// }
"registers": null, // Translation register presets. When populated, keyed by preset ID:
// {
// "formal-vous": {
// "label": "Formal (vouvoiement)",
// "description": "One sentence: when to use this preset.",
// "prompt": "The actual LLM system prompt instruction that
// steers translation tone. Must name specific
// linguistic features (pronouns, verb forms, particles).",
// "deeplFormality": "prefer_more"
// // Only if methodSupport.deepl.formality is true.
// // One of: "prefer_more", "prefer_less", "default".
// }
// }
"gender": null, // Grammatical gender and inclusive guidance. When populated:
// {
// "grammatical": true, // Does the language have gram. gender?
// "inclusiveGuidance": "Use gender-neutral forms when possible.
// Prefer 'iel' (neologism) or rephrase to
// avoid gendered agreement."
// }
// For languages without grammatical gender (Turkish, Finnish):
// { "grammatical": false, "inclusiveGuidance": null }
"codeSwitching": null, // Code-switching behavior (for languages where mixing with another
// language is the norm, not an error). When populated:
// {
// "contactLanguage": "Spanish",
// "contactIso639_3": "spa",
// "mixedVarietyName": "Jopará", // null if no named mixed variety
// "prevalence": "dominant", // "rare", "common", "dominant"
// "morphologicalIntegration": true,
// "pipelineStrategy": "hybrid-fst",
// "notes": "Jopará IS the everyday language of most Paraguayans..."
// }
// ═══════════════════════════════════════════════════════════════════════
// § 7. LINGUISTIC PROFILE
// What makes this language what it is? What are the specific challenges
// for machine translation? What rules govern its typography?
// What languages have shaped it through contact?
//
// These fields require genuine linguistic expertise. For many languages
// (especially low-resource), this section will remain null until a
// qualified researcher or community member contributes.
// ═══════════════════════════════════════════════════════════════════════
"linguisticChallenges": null, // MT-relevant challenges, keyed by challenge ID.
// When populated:
// {
// "polysynthesis": "Cree is highly polysynthetic. A single verb
// can incorporate subject, object, tense...",
// "animacy": "Verb conjugation changes based on whether the
// subject/object is animate or inanimate...",
// "neologisms": "Avoid literal translations of modern software
// concepts. Maintain Cree metaphorical logic..."
// }
// Aim for 3–6 challenges per language when researched.
"contactInfluences": [], // How other languages have shaped this one. Array of:
// {
// "source": "English",
// "sourceIso639_3": "eng", // null if proto-language/unknown
// "type": "superstrate",
// // Enum: "superstrate", "substrate", "adstrate",
// // "learned_borrowing", "lexical_borrowing",
// // "relexification"
// "domains": ["education", "government", "technology"],
// "depth": "heavy",
// // Enum: "light", "moderate", "heavy", "structural",
// // "defining"
// "period": "1870–present",
// "notes": "Residential school era and ongoing...",
// "citation_needed": false
// // true if no published academic source found.
// // See language-card-citation-procedure.md.
// }
"rules": null, // Typography, plural, and capitalization rules. When populated:
// {
// "typography": {
// "quoteStart": "\u201c",
// "quoteEnd": "\u201d",
// "usesSpaces": true, // false for CJK, Thai, Lao, Khmer
// "punctuationSpacing": {
// "doublePunctuation": "none" // "thin-nbsp" for French
// }
// },
// "plurals": {
// "categories": ["one", "other"]
// // From CLDR. Possible values:
// // "zero", "one", "two", "few", "many", "other"
// },
// "capitalization": {
// "hasCase": true
// // true for Latin, Cyrillic, Greek, Armenian scripts.
// // false for CJK, Arabic, Devanagari, etc.
// }
// }
// Source: CLDR + ISO 15924 derivation.
"typologicalProfile": null, // Grambank typological features. When populated:
// {
// "featuresDocumented": 195,
// "featuresCoverage": 1, // 0.0–1.0 fraction of features
// "wordOrderDominant": "SVO",
// "hasDefiniteArticle": true,
// "hasIndefiniteArticle": true,
// "hasGenderSystem": true,
// "hasCaseMorphology": true,
// "hasEvidentiality": false,
// "hasToneSystem": false,
// "source": "grambank-1.0.3"
// }
// Auto-populated by enrich-grambank-typology.mjs.
"phonologicalInventory": null, // PHOIBLE phoneme inventory. When populated:
// {
// "consonants": 24,
// "vowels": 16,
// "tones": 0,
// "totalPhonemes": 40,
// "isTonal": false,
// "inventorySize": "moderately-large",
// // Enum: "small", "moderately-small", "average",
// // "moderately-large", "large"
// "source": "phoible-2.0"
// }
// Auto-populated by enrich-phoible-phonemes.mjs.
// ═══════════════════════════════════════════════════════════════════════
// § 8. ENCYCLOPEDIC
// General knowledge about the language for human context. History,
// dialect situation, institutional resources, representative sayings.
// This section is for understanding, not computation.
// ═══════════════════════════════════════════════════════════════════════
"encyclopedic": null, // General knowledge. When populated:
// {
// "family": "Algic", // Redundant with classification
// // but useful for human readers.
// "dialects": {
// "split": true, // Is there significant variation?
// "classification": "Plains Cree (y-dialect)",
// "variants": ["crk", "cwd", "csw"] // ISO codes of variants
// },
// "demographics": {
// "speakers": "Approx. 20,000 active speakers",
// "regions": ["Saskatchewan", "Alberta", "Manitoba"]
// },
// "history": "Plains Cree is the most widely spoken Algonquian
// language in western Canada...",
// "resources": {
// "wikipedia": "https://en.wikipedia.org/wiki/Plains_Cree",
// "foundations": [{ "name": "ALTLab", "url": "https://..." }],
// "dictionaries": [{ "name": "itwêwina", "url": "https://..." }]
// }
// }
"culturalAphorism": null, // A representative saying, proverb, or teaching in the language.
// When populated:
// {
// "text": "ê-wîcêhtonaniwahk kâ-kî-isi-wâpahtamâhk ôma pimâtisiwin",
// "transliteration": null, // Romanized form if non-Latin script.
// "translation": "Through helping each other we come to understand
// this life",
// "literal": "By-helping-one-another we-have-come-to-see this life",
// "source": "Cree teaching, documented in nêhiyawêwin educational
// resources"
// }
// Choose sayings that reveal something about the language's
// worldview or structure. Must be sourced.
"varieties": [], // For macrolanguages or languages with significant dialectal
// variation, the individual varieties with their own tool coverage.
// Each entry:
// {
// "name": "Cusco Quechua",
// "iso639_3": "quz",
// "region": "Cusco, Peru",
// "fstCoverage": true,
// "corpusCoverage": true,
// "nllbCoverage": false,
// "mutualIntelligibility": "Primary variety for this card",
// "notes": "SQUOIA FST was built for this variety."
// }
// ═══════════════════════════════════════════════════════════════════════
// § 9. DIGITAL RESOURCES & TOOLING
// What NLP tools, corpora, models, and datasets exist for this language?
// What translation APIs support it? What eval benchmarks are available?
//
// This is Champollion's operational core — these fields determine what
// we can actually DO with this language.
// ═══════════════════════════════════════════════════════════════════════
"resources": null, // NLP resources available for this language. When populated:
// {
// "fsts": [{ // Finite-state transducers
// "name": "GiellaLT Plains Cree FST (lang-crk)",
// "url": "https://github.com/giellalt/lang-crk/releases",
// "type": "morphological-analyzer"
// }],
// "corpora": [{ // Text corpora
// "name": "EDTeKLA Cree Language Textbook Corpus",
// "type": "parallel", // "parallel", "monolingual"
// "pairs": ["en-crk"],
// "url": "https://...",
// "exposure": "open-web" // "open-web", "restricted",
// // "holdout"
// }],
// "models": [{ // Pre-trained models
// "name": "NLLB-200 (crk_Cans)",
// "url": "https://...",
// "type": "nmt"
// }],
// "tools": [], // Other NLP tools
// "wordlists": [{ // Standardized wordlists
// "name": "Lexibank",
// "conceptCount": 200,
// "source": "lexibank"
// }],
// "treebanks": [{ // Syntactic treebanks
// "name": "UD_Korean-GSD",
// "tokens": 80000,
// "source": "universal-dependencies-2.14"
// }]
// }
// IMPORTANT: Only actual NLP/digital resources belong here.
// "This language has a WALS entry" is NOT a resource — that
// goes in databaseCoverage.
"databaseCoverage": null, // Which typological/reference databases cover this language.
// Separated from resources to avoid conflating "has a database
// entry" with "has usable NLP tooling."
// {
// "wals": true,
// "grambank": true,
// "phoible": true,
// "cldr": true,
// "lexibank": true,
// "commonVoice": true,
// "source": "derived"
// }
"corpusAvailability": null, // What text/parallel corpora exist for NLP use?
// {
// "bibleTranslation": {
// "textAvailable": true,
// "audioAvailable": true,
// "source": "bible-brain-api"
// },
// "opusCorpora": ["wikimedia", "ubuntu", "gnome"],
// "source": "multi-source"
// }
"keyboardSupport": null, // Input method / keyboard availability. When populated:
// {
// "keymanKeyboards": 3,
// // Number of Keyman keyboards available.
// "cldrKeyboard": true,
// // CLDR has keyboard layout data.
// "source": "keyman-api + cldr"
// }
"methodSupport": { // REQUIRED. Which Champollion translation methods support this
// language. Each method is an object with at minimum
// { "supported": boolean }.
"googleTranslate": { "supported": false },
"deepl": { "supported": false },
"microsoftTranslator": { "supported": false },
"libreTranslate": { "supported": false },
"nllb": { "supported": false },
// When NLLB is supported, include the code:
// { "supported": true, "code": "crk_Cans" }
"llm": { "supported": true }
// LLM is always true (quality varies by language).
// Optional: "verifiedDate": "2026-06-07" for audit trail.
},
"metricModelSupport": null, // Which MT evaluation models produce reliable scores.
// When populated:
// {
// "xlmr": "high", // "high", "medium", or "low"
// // XLM-R training representation tier.
// "africomet": false // true if AfriCOMET covers this language.
// }
// Drives automatic COMET model selection in metrics_comet.py.
// Auto-populated by enrich-metric-model-support.mjs.
"metricPlugins": null, // Which per-language metric plugin packs are available.
// When populated:
// {
// "formalityMarkers": true // Formality marker resource file exists
// // at plugins/resources/formality/{code}.json
// }
// Each key corresponds to a resource pack in
// arena/mt_eval_harness/plugins/resources/{packName}/.
// To add a new metric pack for a language, create the resource
// file and set the flag here. No code changes required.
"evalPack": null, // Evaluation dependency pack for language-specific metrics.
// When populated, declares the Python dependencies and
// post-install steps required by this language's eval standards.
// The harness uses this for dependency gating: if deps are
// missing, the harness warns the user and skips LYSS metrics
// (rather than crashing).
// When populated:
// {
// "pythonDeps": {
// "pyhfst": "pyhfst>=1.4", // PyPI package specs
// "requests": "requests>=2.28",
// "spacy": "spacy>=3.7"
// },
// "postInstall": [ // Commands to run after pip
// {
// "command": "spacy download en_core_web_md",
// "label": "spaCy English model (for LYSS-sem)"
// }
// ],
// "requiresFst": true, // true if GiellaLT FST needed
// "description": "LYSS equivalence linter + FST validation"
// }
"evalMetrics": null, // Language-specific evaluation metrics (LYSS standards).
// When populated, the harness dynamically imports these
// MetricPlugin classes from eval_standards/<lang>/ and applies
// them to every run targeting this language — regardless of
// which method (contestant) is being evaluated.
// Keyed by metric ID:
// {
// "lyss-eq": {
// "module": "eval_standards.crk.metrics",
// "class": "CrkLinterMetric",
// "description": "LYSS deterministic variant-class linter"
// },
// "lyss-sem": {
// "module": "eval_standards.crk.metrics",
// "class": "CrkSemanticMetric",
// "description": "LYSS FST-based semantic validator",
// "dependencies": ["spacy>=3.7"],
// "spacy_models": ["en_core_web_md"]
// }
// }
// Architecture: eval standards are referees, not contestants.
// They live in the harness (eval_standards/), not in method
// plugins. This ensures all methods are scored equally.
// Discovery: plugin_discovery.py reads this field via
// language_cards.get_eval_metrics() and instantiates metrics
// using importlib. Dependencies are checked against evalPack.
"omt1600": null, // Meta's OMT-1600 (One Model for Translation) coverage assessment.
// When populated:
// {
// "covered": true,
// "tier": "R1", // Meta's resource tier
// "evalMetrics": ["chrF++", "BLASER-3"],
// "notes": "Plains Cree: no web-crawled bitext..."
// }
"evalDatasets": [], // Evaluation dataset IDs available for this language.
// Example: ["flores-plus-devtest", "edtekla-dev-v1"].
// Empty means no standardized eval set exists.
"pipelineReadiness": null, // Assessment of readiness for Champollion's translation pipeline.
// When populated:
// {
// "tier": "tier-2-feasible",
// // "watch-list" — cataloged but no path to translation
// // "tier-3-cataloged" — basic metadata present
// // "tier-2-feasible" — tools exist, pipeline possible
// // "tier-1-ready" — pipeline operational
// "hasFST": true,
// "hasParallelCorpus": true,
// "hasEvalBenchmark": true,
// "blockers": ["Syllabics post-processing validation"],
// "notes": "FST-gated pipeline operational. EDTeKLA corpus..."
// }
// ═══════════════════════════════════════════════════════════════════════
// § 10. PROVENANCE & METADATA
// Where does this data come from? Who reviewed it? When was it
// generated? What's its overall quality level?
//
// This section exists to make the card auditable. Every automated
// enrichment, every human review, every source consulted should
// leave a trace here.
// ═══════════════════════════════════════════════════════════════════════
"dataSources": [], // REQUIRED. Sources consulted for this card's data.
// Can be a flat array (backwards-compatible):
// ["iso639-3-2024", "glottolog-5.3", "wikidata"]
//
// Or a structured per-field object (preferred for new cards):
// {
// "classification": ["glottolog-5.3"],
// "vitality": ["glottolog-aes-5.3", "unesco-atlas-2024"],
// "speakerEstimates": ["wikidata", "census-ca-2021"],
// "rules": ["cldr-48"],
// "methodSupport": ["google-translate-2026-06"]
// }
"supportTier": "cataloged", // Auto-derived tier summarizing the card's depth:
// "cataloged" — identity + classification only
// "emerging" — + vitality + speakerEstimates
// "developing" — + resources + methodSupport
// "supported" — full research: registers, challenges, etc.
"humanReviewed": null, // null until a qualified human reviews the card. When populated:
// {
// "reviewer": "Prof. Kenneth Jamandre",
// "affiliation": "University of the Philippines Diliman",
// "date": "2026-06-08",
// "scope": "full", // "full", "partial", "vitality-only"
// "notes": "Verified speaker count, vitality assessment,
// and contact influences for Tagalog."
// }
"notes": null, // Free-text notes about this language or this card's data quality.
// Example: "Low-resource language under active development.
// Translation pipeline uses FST-gated approach."
"firstDocumented": null, // Year of first known documentation. Negative for BCE.
// Example: -1500 (Sanskrit, ~1500 BCE), 1787 (some languages).
// Source: Glottolog CLDF.
"lastDocumented": null, // Year of last known documentation (relevant for extinct languages).
// Source: Glottolog CLDF.
"_generated": null // Auto-populated by enrichment scripts. When populated:
// {
// "by": "generate-all-cards.mjs",
// "at": "2026-06-07T12:34:56Z",
// "sources": ["iso639-3", "glottolog-5.3", "wikidata"],
// "completeness": "partial",
// // "partial" — has identity + classification + coords
// // "substantial" — + vitality + speakerEstimates + script
// // "complete" — all automatable fields populated
// "lastEnriched": "2026-06-07"
// }
}
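A linter for the uniform-shape rule could look roughly like the sketch below. It is an illustration under stated assumptions, not the project's linter: `validate_card` is a hypothetical name, the caller supplies the set of template field names, and the code-equals-iso639_3 check applies to regular language cards only (genus cards use prefixed codes).

```python
REQUIRED = ("code", "name", "iso639_3", "isoScope", "isoType",
            "methodSupport", "dataSources")
ISO_SCOPES = {"I", "M", "S"}
ISO_TYPES = {"L", "E", "A", "H", "C"}


def validate_card(card: dict, template_fields: set) -> list:
    """Return a list of shape problems; an empty list means the card passes."""
    problems = []
    for field in sorted(template_fields - set(card)):
        problems.append(f"missing top-level field: {field}")
    for field, value in card.items():
        if value == {}:
            problems.append(f"{field} is {{}}; use null for an empty object")
    for field in REQUIRED:
        if field in card and card[field] is None:
            problems.append(f"required field is null: {field}")
    scope, iso_type = card.get("isoScope"), card.get("isoType")
    if scope is not None and scope not in ISO_SCOPES:
        problems.append(f"isoScope {scope!r} is not one of I, M, S")
    if iso_type is not None and iso_type not in ISO_TYPES:
        problems.append(f"isoType {iso_type!r} is not one of L, E, A, H, C")
    if card.get("code") != card.get("iso639_3"):
        problems.append("code and iso639_3 differ")
    return problems


# A subset of the template's field names, kept short for the example.
TEMPLATE_FIELDS = {"code", "name", "iso639_3", "isoScope", "isoType",
                   "methodSupport", "dataSources", "vitality"}
card = {
    "code": "crk", "name": "Plains Cree", "iso639_3": "crk",
    "isoScope": "I", "isoType": "L",
    "methodSupport": {"llm": {"supported": True}},
    "dataSources": [], "vitality": {},
}
problems = validate_card(card, TEMPLATE_FIELDS)
print(problems)
```

In this example the only finding is the empty object in vitality, which the template says should be null.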
Field Reference
§ 1. Identity fields
| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| code | string | ✅ | ✅ | ISO 639-3 registry |
| name | string | ✅ | ✅ | ISO 639-3 registry |
| nativeName | string \| null | — | ✅ | Wikidata P1705 |
| alternateNames | string[] | — | ✅ | Glottolog, Ethnologue |
| iso639_3 | string | ✅ | ✅ | ISO 639-3 registry |
| iso639_1 | string \| null | — | ✅ | ISO 639-1 |
| bcp47 | string \| null | — | Partial | IANA subtag registry |
| aliases | string[] | — | ❌ | Manual curation |
| isoScope | string | ✅ | ✅ | ISO 639-3 registry |
| isoType | string | ✅ | ✅ | ISO 639-3 registry |
| macrolanguage | string \| null | — | ✅ | ISO 639-3 macrolanguages.tab |
| extends | string \| null | — | ❌ | Manual curation |
§ 2. Classification fields
| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| glottocode | string \| null | — | ✅ | Glottolog |
| classification | object \| null | — | ✅ | Glottolog |
| isIsolate | boolean | — | ✅ | Glottolog CLDF |
§ 3. Geography fields
| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| macroarea | string \| null | — | ✅ | Glottolog CLDF |
| coordinates | object \| null | — | ✅ | Glottolog |
| countries | string[] | — | ✅ | Glottolog |
| regions | object[] | — | ❌ | Census, Ethnologue, manual |
| arealContext | object \| null | — | ✅ | Coordinates + linguistic area zones |
§ 4. Writing system fields
| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| script | string \| null | — | ✅ | Wikidata P282 |
| scriptUnicodeName | string \| null | — | ✅ | Derived from script via the ISO 15924 → Unicode mapping |
| scripts | object[] | — | Partial | Wikidata, manual |
| dir | string \| null | — | ✅ | Derived from the script |
| scriptConverter | string \| null | — | ❌ | Manual |
| orthographicStatus | object \| null | — | Partial | Ethnologue, manual |
§ 5. Demographics and vitality fields
| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| speakerEstimates | object[] | — | ✅ | Wikidata, Ethnologue, census data |
| vitality | object \| null | — | ✅ | Glottolog AES, UNESCO |
§ 5.5 Documentation and digital presence fields
| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| documentationDepth | object \| null | — | ✅ | Glottolog references |
| digitalPresence | object \| null | — | ✅ | Wikipedia, Common Voice, Tatoeba |
| dialectCount | number \| null | — | ✅ | Glottolog |
§ 6. Formality, register, and gender fields
| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| formality | object \| null | — | ❌ | Linguistic research |
| registers | object \| null | — | ❌ | Linguistic research |
| gender | object \| null | — | ❌ | Linguistic research |
| codeSwitching | object \| null | — | ❌ | Linguistic research |
§ 7. Linguistic profile fields
| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| linguisticChallenges | object \| null | — | ❌ | Linguistic research |
| contactInfluences | object[] | — | ❌ | Published linguistic literature |
| rules | object \| null | — | ✅ | CLDR |
| typologicalProfile | object \| null | — | ✅ | Grambank 1.0.3; auto-populated by enrich-grambank-typology.mjs |
| phonologicalInventory | object \| null | — | ✅ | PHOIBLE 2.0; auto-populated by enrich-phoible-phonemes.mjs |
§ 8. Encyclopedic fields
| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| encyclopedic | object \| null | — | ❌ | Manual research |
| culturalAphorism | object \| null | — | ❌ | Community contribution |
| varieties | object[] | — | ❌ | Manual research |
§ 9. Digital resource fields
| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| resources | object \| null | — | Partial | Manual + automated |
| databaseCoverage | object \| null | — | ✅ | Derived from enrichment |
| corpusAvailability | object \| null | — | ✅ | Bible Brain, OPUS, Lexibank |
| keyboardSupport | object \| null | — | ✅ | Keyman API, CLDR |
| methodSupport | object | ✅ | Partial | API verification |
| metricModelSupport | object \| null | — | ✅ | XLM-R paper, AfriCOMET paper |
| metricPlugins | object \| null | — | ✅ | Card enrichment; declares which metric plugin packs apply (e.g. { formalityMarkers: true }) |
| omt1600 | object \| null | — | ✅ | Meta evaluation |
| evalDatasets | string[] | — | ✅ | Dataset registry |
| pipelineReadiness | object \| null | — | Partial | Derived + manual |
resources.fsts[].install: FST entries in the resources object may include an install sub-object with the fields repo, releaseTag, assetPattern, format, and maturity, plus an optional bundlePattern. This replaces the former hard-coded GIELLALT_FST_REGISTRY dictionary. See get_fst_install_info() in language_cards.py.
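A minimal sketch of what an `install` sub-object might look like, together with a check for the non-optional fields. The field names come from the paragraph above; every value is a made-up placeholder, and `missingInstallFields` is a hypothetical helper rather than part of the codebase.

```javascript
// Field names come from the spec; the values below are made-up placeholders.
const REQUIRED_INSTALL_FIELDS = ["repo", "releaseTag", "assetPattern", "format", "maturity"];

// Returns the required install fields that are missing or null.
function missingInstallFields(install) {
  if (install === null || typeof install !== "object") return [...REQUIRED_INSTALL_FIELDS];
  return REQUIRED_INSTALL_FIELDS.filter(
    (f) => install[f] === undefined || install[f] === null
  );
}

const exampleInstall = {
  repo: "example-org/lang-xyz",      // hypothetical
  releaseTag: "nightly",             // hypothetical
  assetPattern: "lang-xyz-*.tar.gz", // hypothetical
  format: "hfst",                    // hypothetical
  maturity: "experimental",          // hypothetical
  // bundlePattern is optional and omitted here
};

console.log(missingInstallFields(exampleInstall));
```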
§ 10. Provenance fields
| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| dataSources | array \| object | ✅ | ✅ | Automated + manual |
| supportTier | string | — | ✅ | Derived from card completeness |
| humanReviewed | object \| null | — | ❌ | Human reviewers |
| notes | string \| null | — | ❌ | Manual |
| firstDocumented | number \| null | — | ✅ | Glottolog CLDF |
| lastDocumented | number \| null | — | ✅ | Glottolog CLDF |
| _generated | object \| null | — | ✅ | Enrichment scripts |
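Language code policy
Champollion uses ISO 639-3 as the canonical identifier. Codes from other standards are registered as aliases and resolved to ISO 639-3 codes at runtime.
| Priority | Standard | Example | Field | Use |
|---|---|---|---|---|
| 1 (canonical) | ISO 639-3 | crk | code | Card filename, config key, API parameter |
| 2 (alias) | ISO 639-1 | iu | aliases[] | Accepted in the CLI, resolved to ISO 639-3 |
| 3 (alias) | BCP 47 | fil | aliases[] | Accepted in the CLI, resolved to ISO 639-3 |
| Reference | Glottocode | plai1258 | glottocode | Classification only, not used at runtime |
Resolution order: when a user supplies a code, the lookup proceeds as follows.
1. Direct match on card.code → found
2. Match in card.aliases[] → found, return the canonical card
3. Match on card.iso639_1 → found (fallback)
4. Not found → error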
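The resolution order can be sketched as a three-stage lookup. This is an illustration, not the project's `resolveCode()`: the second parameter (`cards`, an in-memory array of parsed card objects) and the error message are assumptions made to keep the example self-contained.

```javascript
// Sketch of the lookup order; `cards` is assumed to be an array of parsed card objects.
function resolveCode(input, cards) {
  // 1. Direct match on the canonical code.
  const direct = cards.find((c) => c.code === input);
  if (direct) return direct.code;
  // 2. Match in the curated alias list.
  const byAlias = cards.find(
    (c) => Array.isArray(c.aliases) && c.aliases.includes(input)
  );
  if (byAlias) return byAlias.code;
  // 3. Fallback: match on the ISO 639-1 field.
  const byIso1 = cards.find((c) => c.iso639_1 === input);
  if (byIso1) return byIso1.code;
  // 4. Nothing matched.
  throw new Error(`Unknown language code: ${input}`);
}

const cards = [
  { code: "fra", aliases: ["fr"], iso639_1: "fr" },
  { code: "deu", aliases: [], iso639_1: "de" },
];
console.log(resolveCode("fr", cards)); // "fra" (alias match)
console.log(resolveCode("de", cards)); // "deu" (iso639_1 fallback)
```

Checking `code` across all cards before looking at any alias means a canonical code always wins over an alias that happens to collide with it.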
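Migration history: ISO 639-1 → ISO 639-3
Before v8, card filenames used ISO 639-1 codes where available (fr.json, de.json, ja.json). In the 639-3 migration, every card was renamed to its ISO 639-3 equivalent:
| Before | After | Reason |
|---|---|---|
| fr.json | fra.json | 639-3 is canonical |
| de.json | deu.json | 639-3 is canonical |
| zh.json | cmn.json | Macrolanguage → default individual language |
| ar.json | arb.json | Macrolanguage → Modern Standard Arabic |
| ms.json | zsm.json | Macrolanguage → Standard Malay |
What happened to the old codes?
- The old 639-1 code is kept in card.iso639_1
- The old 639-1 code is also listed in card.aliases[]
- resolveCode("fr") returns "fra" at runtime, so existing usage stays backward compatible
- Users can still write "fr" in their config; it resolves transparently
What changed architecturally:
- _deepMerge() now skips null values (the child inherits from the parent)
- _deepMerge() now treats identity fields specially (code, extends, and aliases are never inherited)
- formality.default is now derived from the register flagged isDefault: true
- 205 Grambank-derived cards received a structural formality.default fix
- 38 genus/family/macrolanguage cards provide inheritance targets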
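The merge behaviour described here (null inherits, nested objects merge recursively, arrays are replaced wholesale, identity fields are never inherited) can be sketched as follows. This is an illustration written for this document, not the actual `_deepMerge()` in `lib/registers.js`; in particular, applying the identity-field rule only at the top level of the card is an assumption.

```javascript
// Identity fields as listed in this spec; they always keep the child's own value.
const IDENTITY_FIELDS = new Set([
  "code", "extends", "_migration", "aliases", "iso639_1", "iso639_3",
]);

const isPlainObject = (v) =>
  v !== null && typeof v === "object" && !Array.isArray(v);

// Illustrative merge, not the real _deepMerge() from lib/registers.js.
function deepMerge(parent, child, isTopLevel = true) {
  const out = {};
  const keys = new Set([...Object.keys(parent), ...Object.keys(child)]);
  for (const key of keys) {
    const p = parent[key];
    const c = child[key];
    if (isTopLevel && IDENTITY_FIELDS.has(key)) {
      out[key] = c === undefined ? null : c; // never inherited
    } else if (c === null || c === undefined) {
      out[key] = p === undefined ? null : p; // null child inherits from parent
    } else if (isPlainObject(c) && isPlainObject(p)) {
      out[key] = deepMerge(p, c, false); // nested objects merge recursively
    } else {
      out[key] = c; // scalars and arrays: the child replaces the parent
    }
  }
  return out;
}

// Made-up cards modelled on the Cree example.
const genus = {
  code: "genus-cree",
  aliases: ["cre"],
  formality: { system: "obviative-animate" },
};
const crk = { code: "crk", extends: "genus-cree", aliases: null, formality: null, script: "Cans" };
const resolved = deepMerge(genus, crk);
console.log(resolved.code, resolved.aliases, resolved.formality.system);
```

With these inputs, `resolved` keeps the child's `code` and its `null` aliases while picking up `formality` from the genus card.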
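Edge cases
Sign languages
Sign languages (for example ase, American Sign Language) are legitimate languages with ISO 639-3 codes. They have a geography and speaker counts, but:
- script is usually null (no standard written form)
- scripts may include "Sgnw" (SignWriting) if a notation system is in use
- dir is null
- linguisticChallenges should cover spatial grammar, classifiers, and similar features
- gender.grammatical is usually false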
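Ancient and historical languages
Languages such as Latin (lat, isoType H) and Sanskrit (san, isoType H) are still used in specific contexts (liturgical, scholarly) but have no native speakers:
- vitality may note "no native speakers" with "trend": "stable" (not declining: the community that uses the language is stable, just small)
- speakerEstimates should note that these are L2 speakers, not L1
- firstDocumented / lastDocumented place them in time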
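Constructed languages
Esperanto (epo, isoType C), Lojban, and others:
- classification may point to a "constructed" family or be null
- contactInfluences reflects the source material (for example, Esperanto draws on Romance, Germanic, and Slavic languages)
- vitality is unusual: the speaker community is growing but has no native homeland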
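Macrolanguages
Arabic (ara), Chinese (zho), Cree (cre), and Quechua (que) are macrolanguages that contain multiple individual languages:
- isoScope: "M"
- varieties should list the individual languages with their ISO codes
- methodSupport should reflect what the macrolanguage card supports (usually the standardized variety)
- Individual varieties should also have their own cards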
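Languages without a standardized orthography
Many languages (especially those with oral traditions) have no standardized writing system, or have competing orthographies:
- script is null
- scripts is []
- dir is null
- notes should explain the orthographic situation
- linguisticChallenges should note how this affects MT (for example, no training data)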
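Diglossia
Languages such as Arabic (MSA versus the dialects) or Guaraní (Jopará versus pure Guaraní):
- codeSwitching captures the mixed-variety situation
- registers can provide presets for the different levels
- varieties can list the diglossic pair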
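Contact influence types
| Type | Meaning | Example |
|---|---|---|
| superstrate | A dominant language imposed on the community | French → English (after 1066) |
| substrate | A local language influencing the imposed language | Celtic → English |
| adstrate | Neighboring languages influencing each other | Norse → English |
| learned_borrowing | Borrowing through education or scholarship | Latin → English |
| lexical_borrowing | Direct vocabulary borrowing through contact | Spanish → Filipino |
| relexification | Large-scale vocabulary replacement | Portuguese → Papiamento |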
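Contact influence depth
| Depth | Meaning |
|---|---|
| light | A few loanwords, minimal structural influence |
| moderate | Significant vocabulary in specific domains |
| heavy | Pervasive vocabulary and some structural features |
| structural | Grammar, syntax, and phonology affected |
| defining | Core identity shaped by contact (creoles, mixed languages) |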
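The two tables above amount to closed vocabularies, so a linter-style check is straightforward. A minimal sketch, assuming each `contactInfluences` entry stores these values under `type` and `depth` keys; those key names and the helper are guesses, not confirmed schema.

```javascript
const CONTACT_TYPES = [
  "superstrate", "substrate", "adstrate",
  "learned_borrowing", "lexical_borrowing", "relexification",
];
const CONTACT_DEPTHS = ["light", "moderate", "heavy", "structural", "defining"];

// Assumes each entry carries `type` and `depth` keys; those key names are a guess.
function contactInfluenceErrors(entry) {
  const errors = [];
  if (!CONTACT_TYPES.includes(entry.type)) {
    errors.push(`unknown type: ${entry.type}`);
  }
  if (!CONTACT_DEPTHS.includes(entry.depth)) {
    errors.push(`unknown depth: ${entry.depth}`);
  }
  return errors;
}

console.log(contactInfluenceErrors({ type: "substrate", depth: "moderate" }));
console.log(contactInfluenceErrors({ type: "loan", depth: "deep" }));
```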
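Writing good register presets
A good preset prompt:
- Names the formality feature explicitly (for example "해요체", "vous-form", "siz-form")
- Explains the specific pronouns or verb forms to use
- Gives context on when this register is appropriate
- Mentions script considerations, if applicable
Do not put gender-inclusive guidance in preset prompts. Gender guidance belongs in card.gender.inclusiveGuidance, which is injected separately.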
❌ Bad: "Standard Thai. Professional register."
✔ Good: "Professional Thai. Use คุณ (khun) for second person, เรา (rao)
for first person when needed. Clear, concise phrasing
appropriate for digital interfaces."
Preset naming conventions
Preset keys should be descriptive, lowercase, and hyphen-separated:
- T-V languages: formal-vous, informal-tu, formal-Sie, casual-du
- Speech levels: polite-haeyo, formal-hapsyo, casual-hae
- Neutral: professional, neutral-professional
- Code-switching: taglish-professional, pure-filipino
Enrichment procedure
Per-card processing order
When enriching a card, consult sources in this order. Record every source consulted, even if it returned no data.
1. ISO 639-3 registry → code, name, isoScope, isoType
2. ISO 639-3 macrolanguages.tab → macrolanguage
3. Glottolog languoid.csv → glottocode, classification, coordinates, countries
4. Glottolog CLDF → macroarea, isIsolate, firstDocumented, lastDocumented
5. Glottolog AES → vitality (endangerment status)
6. Wikidata SPARQL → nativeName, speakerEstimates, script, scripts, dir
7. CLDR → rules (typography, plurals, casing)
8. NLLB-200 / FLORES+ → methodSupport.nllb, evalDatasets
9. API verification → remaining methodSupport entries
10. ML model papers → metricModelSupport (XLM-R training data, AfriCOMET coverage). Script: node scripts/enrich-metric-model-support.mjs
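The ordering and logging rule can be sketched as a loop over sources that fills gaps only and records every source it touches. This is a hypothetical outline, not one of the repository's enrichment scripts: the `{ name, fetch }` source shape, `enrichCard`, and the log format are all assumptions.

```javascript
// `sources` is an ordered list of { name, fetch } objects; fetch(code) returns a
// partial card object or null. Both the shape and the names are hypothetical.
function enrichCard(card, sources) {
  const enriched = { ...card };
  const log = [];
  for (const source of sources) {
    const data = source.fetch(card.code) || {};
    const added = [];
    for (const [field, value] of Object.entries(data)) {
      const current = enriched[field];
      const isEmpty =
        current === null ||
        current === undefined ||
        (Array.isArray(current) && current.length === 0);
      // Merge only: fill gaps, never overwrite an existing value.
      if (isEmpty && value !== null && value !== undefined) {
        enriched[field] = value;
        added.push(field);
      }
    }
    // Log the source even when it contributed nothing.
    log.push({ source: source.name, fieldsAdded: added });
  }
  return { card: enriched, consulted: log };
}

// Stub sources with made-up return values, in the order given above.
const stubSources = [
  { name: "iso639-3", fetch: () => ({ name: "Plains Cree" }) },
  { name: "glottolog", fetch: () => null },
  { name: "wikidata", fetch: () => ({ name: "Other name", script: "Cans" }) },
];
const result = enrichCard({ code: "crk", name: null, script: null }, stubSources);
console.log(result.consulted);
```

Because earlier sources fill a field first and later ones cannot overwrite it, the list order doubles as a precedence order in this sketch.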
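Conflict handling
When sources disagree:
- Store both values, each annotated with its source
- Do not average them or pick a side
- Record the disagreement in the relevant note field
- Only when a computation needs a single value, prefer the most recent primary source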
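The last rule, choosing one value only at computation time, can be sketched as a read-only selection over the stored estimates. The entry shape `{ value, source, year, primary }` is an assumption made for this example, and the sources, years, and flags below are placeholders rather than real data.

```javascript
// Each estimate is assumed to look like { value, source, year, primary }.
// These key names are illustrative, not the card schema.
function pickForComputation(estimates) {
  const primary = estimates.filter((e) => e.primary);
  const pool = primary.length > 0 ? primary : estimates;
  if (pool.length === 0) return null;
  // Most recent wins; the stored array itself is left untouched.
  return pool.reduce((best, e) => (e.year > best.year ? e : best));
}

// Placeholder data: both values stay on the card, each with its source.
const speakerEstimates = [
  { value: 50000, source: "source-a", year: 2019, primary: false },
  { value: 20000, source: "source-b", year: 2016, primary: true },
];
const chosen = pickForComputation(speakerEstimates);
console.log(chosen.source, speakerEstimates.length);
```

The card keeps both estimates; only the caller that needs a single number sees the selection.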
Validation
Run the linter after any enrichment or manual edit:
node scripts/lint-language-cards.mjs # all cards
node scripts/lint-language-cards.mjs --lang crk # single card
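PR checklist
When submitting a new or modified language card:
- The file is named <code>.json in shared/language-cards/
- All top-level fields from the canonical template are present
- classification is populated from Glottolog (not hand-built)
- dataSources lists every source consulted
- methodSupport entries are verified against the actual API language lists
- contactInfluences entries have a published source or citation_needed: true
- linguisticChallenges has 3–6 MT-relevant challenges (if researched)
- rules is populated from CLDR (if locale data exists)
- The linter passes with no errors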
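The first two checklist items are mechanical and can be sketched in a few lines. This is not the behaviour of `lint-language-cards.mjs`; `checklistErrors` is a hypothetical helper, and `template` stands in for the canonical card template loaded as parsed JSON.

```javascript
// `template` stands in for the canonical card template; both objects are parsed JSON.
function checklistErrors(fileName, card, template) {
  const errors = [];
  if (fileName !== `${card.code}.json`) {
    errors.push(`file should be named ${card.code}.json`);
  }
  for (const field of Object.keys(template)) {
    // A field counts as present even when its value is null or [].
    if (!Object.prototype.hasOwnProperty.call(card, field)) {
      errors.push(`missing top-level field: ${field}`);
    }
  }
  return errors;
}

// Tiny stand-in template with three fields only.
const miniTemplate = { code: null, name: null, notes: null };
console.log(checklistErrors("crk.json", { code: "crk", name: "Plains Cree", notes: null }, miniTemplate));
console.log(checklistErrors("cr.json", { code: "crk", name: "Plains Cree" }, miniTemplate));
```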
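Professional references
| Standard | Maintainer | How we use it |
|---|---|---|
| ISO 639-3 | SIL International | Canonical language codes, macrolanguage relationships |
| Glottolog | Max Planck Institute | Classification, coordinates, AES endangerment |
| WALS | Max Planck Institute | Genus definitions, typological features |
| ISO 15924 | Unicode/ISO | Script codes |
| CLDR | Unicode Consortium | Locale data, plural rules, typography |
| Wikidata | Wikimedia Foundation | Speaker counts, endonyms, script data |
| Ethnologue | SIL International | EGIDS, speaker estimates, DLS |
| UNESCO Atlas | UNESCO | Endangerment classification |
| Katig Collective | UP Diliman | Philippine language capsules |
See also: the language card citation procedure for detailed per-source guidance.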