
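Language Card Specification

Single source of truth. This document defines the canonical shape of every language card. Every card must contain every top-level field listed here, even when the value is null or []. A card with missing fields does not conform to the specification. This uniformity lets automated tools, linters, enrichment scripts, and human reviewers trust the structure of a card.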

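Design principles

  1. Uniform shape. All 8,000+ cards have the same top-level fields. Unknown values are null, empty arrays are [], and empty objects are null (not {}). Code therefore never has to ask "does this field exist?", only "does it have a value?"

  2. Source everything. Every factual claim traces back to a named, versioned, primary source. An unsourced claim is an unverifiable claim. The dataSources field (together with the per-field source annotations in sub-objects) makes provenance explicit.

  3. Preserve disagreement. When authorities disagree (Wikidata says 50,000 speakers, Ethnologue says 20,000), we store both and label each with its source. We do not average, resolve, or pick a side. Users can handle the nuance.

  4. Null means unknown, not not-applicable. A null field means "we have not found data for this yet." If a field genuinely does not apply (for example, grammatical gender for a sign language), the value should say so: { "grammatical": false, "inclusiveGuidance": "Not applicable: American Sign Language has no grammatical gender." }

  5. Merge only. Enrichment scripts add data and never overwrite it. Human-curated values take precedence over automated data.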
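The merge-only rule in principle 5 can be sketched as follows. This is a minimal illustration, not the project's enrichment code: the names `enrich` and `is_empty` are hypothetical, and the sketch assumes that "no value yet" means null or an empty array, as described in principle 1.

```python
from copy import deepcopy
from typing import Any


def is_empty(value: Any) -> bool:
    """A field has no value yet when it is null or an empty array."""
    return value is None or value == []


def enrich(card: dict, new_data: dict) -> dict:
    """Return a copy of `card` with empty fields filled from `new_data`.

    Existing values are never overwritten, so curated data always wins.
    Fields that are not part of the card's shape are ignored.
    """
    result = deepcopy(card)
    for field, value in new_data.items():
        if field in result and is_empty(result[field]):
            result[field] = deepcopy(value)
    return result


# Hypothetical card fragment and hypothetical automated data.
card = {"code": "crk", "script": "Cans", "macroarea": None, "countries": []}
fetched = {"script": "Latn", "macroarea": "North America", "countries": ["CA", "US"]}

enriched = enrich(card, fetched)
print(enriched["script"])     # the existing value is kept
print(enriched["macroarea"])  # the null field is filled
```

Returning a copy instead of mutating the input keeps the original card available for diffing, which can help when auditing what an enrichment run changed.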


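Three-layer architecture

Layer           Location                                   Purpose
Language cards  shared/language-cards/<code>.json          Per-language configuration: identity, classification, resources, everything
Genus cards     shared/language-cards/genera/<genus>.json  Shared runtime properties for related languages (curated, not auto-generated)
Language tree   shared/language-cards/language-tree.json   Full Glottolog hierarchy: reference data for the Lab UI and language discovery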

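Inheritance model

When a card sets "extends": "family-dravidian", the runtime uses _deepMerge() (in lib/registers.js) to merge the parent card into the child card. This lets a genus card define shared registers, formality systems, and gender guidance that flow down to every member language, without repeating the data across hundreds of individual cards.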

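Merge semantics

Child value    Behavior              Reason
null           Inherits from parent  null means "I don't define this", so the parent's value flows down
non-null       Overrides parent      The child's data is more specific and takes precedence
Nested object  Merged recursively    Child fields override; parent fields are kept
Array          Replaced entirely     Arrays are not merged item by item; the child's array wins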

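Identity fields (never inherited)

Some fields belong to the card itself and must never be inherited from a parent:

code, extends, _migration, aliases, iso639_1, iso639_3

Even if the parent card defines aliases: ["macro-code"], the child card does not inherit those aliases. These fields always hold the child card's own value (including null when unset).

Why: Without this rule, every Cree language would inherit aliases: ["cre"] from its macrolanguage parent, making every variety an alias of the macrolanguage.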
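The merge semantics and the identity-field rule can be sketched together in a few lines. This is an illustration written in Python, not the actual _deepMerge() in lib/registers.js; the function name `deep_merge` and its `top_level` parameter are hypothetical, and the sketch assumes that null inheritance also applies inside nested objects.

```python
from copy import deepcopy

# Identity fields: always the child's own value, never the parent's.
IDENTITY_FIELDS = ("code", "extends", "_migration", "aliases", "iso639_1", "iso639_3")


def deep_merge(parent: dict, child: dict, top_level: bool = True) -> dict:
    """Merge `parent` into `child` and return a new dict."""
    merged = deepcopy(parent)
    for key, value in child.items():
        if top_level and key in IDENTITY_FIELDS:
            continue  # handled after the loop
        if value is None:
            merged.setdefault(key, None)  # null: keep the parent's value
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value, top_level=False)
        else:
            merged[key] = deepcopy(value)  # scalars and arrays replace
    if top_level:
        for field in IDENTITY_FIELDS:
            merged[field] = deepcopy(child.get(field))  # null when unset
    return merged


# Hypothetical parent and child fragments.
parent = {
    "code": "genus-cree",
    "aliases": ["cre"],
    "formality": {"system": "obviative-animate", "default": "formal"},
    "gender": {"grammatical": False, "inclusiveGuidance": None},
    "countries": ["CA"],
}
child = {
    "code": "crk",
    "extends": "genus-cree",
    "aliases": [],
    "formality": None,
    "gender": {"inclusiveGuidance": "Hypothetical child-specific guidance."},
    "countries": ["CA", "US"],
}

merged = deep_merge(parent, child)
print(merged["aliases"])    # the child's own list, not the parent's
print(merged["formality"])  # inherited from the parent
```

The `top_level` flag exists because names such as "code" also appear inside nested objects (for example an NLLB code under methodSupport), where they are ordinary data and should merge normally.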

Example: how the Cree card resolves

┌───────────────────────┐
│ family-algic.json     │  formality: null, registers: null
│ (no registers)        │
└──────────┬────────────┘
           │ extends
┌──────────┴────────────┐
│ genus-cree.json       │  formality: { system: "obviative-animate", ... }
│ (sourced registers)   │  registers: { formal: {...}, informal: {...} }
└──────────┬────────────┘
           │ extends
┌──────────┴────────────┐
│ crk.json              │  code: "crk", extends: "genus-cree"
│ (Plains Cree)         │  formality: null → inherits from genus-cree
│                       │  registers: null → inherits from genus-cree
│                       │  script: "Cans" → own value, no inheritance
│                       │  code: "crk" → identity field, never inherited
└───────────────────────┘

At runtime, getLanguageCard("crk") returns a merged object containing genus-cree's registers, family-algic's properties (if any), and crk's own identity and metadata.
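To make the chain in the diagram concrete, the sketch below walks the extends links and reports which card in the chain supplies a given top-level field. The helpers `extends_chain` and `origin_of` are hypothetical and are not part of getLanguageCard(); the in-memory `cards` dictionary stands in for the JSON files, and the cycle check is an added safety assumption. For nested objects the real merge may combine several levels, so "origin" here means only the nearest card with a non-null value.

```python
from typing import Dict, List, Optional


def extends_chain(code: str, cards: Dict[str, dict]) -> List[str]:
    """Return card keys from the requested card up to its root ancestor."""
    chain: List[str] = []
    current: Optional[str] = code
    while current is not None:
        if current in chain:
            raise ValueError(f"cyclic extends chain: {chain + [current]}")
        if current not in cards:
            raise KeyError(f"unknown card: {current}")
        chain.append(current)
        current = cards[current].get("extends")
    return chain


def origin_of(field: str, code: str, cards: Dict[str, dict]) -> Optional[str]:
    """Name the nearest card in the chain that sets a non-null `field`."""
    for key in extends_chain(code, cards):
        if cards[key].get(field) is not None:
            return key
    return None


# Stand-ins for the three files in the diagram.
cards = {
    "family-algic": {"code": "family-algic", "extends": None,
                     "formality": None, "registers": None},
    "genus-cree": {"code": "genus-cree", "extends": "family-algic",
                   "formality": {"system": "obviative-animate"},
                   "registers": {"formal": {}, "informal": {}}},
    "crk": {"code": "crk", "extends": "genus-cree",
            "formality": None, "registers": None, "script": "Cans"},
}

print(extends_chain("crk", cards))
print(origin_of("formality", "crk", cards))
print(origin_of("script", "crk", cards))
```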

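Genus card template

Genus cards live in shared/language-cards/genera/ and define shared properties for a group of languages. They follow the same schema as regular cards, but with different conventions: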

{
  // Identity: genus cards use a prefixed code, NOT an ISO 639-3 code
  "code": "genus-cree",        // "genus-", "family-", or "macrolanguage-" prefix
  "name": "Cree Languages",    // Human-readable group name
  "extends": "family-algic",   // Genus cards can extend family cards (chaining)

  // Formality: shared across the group, sourced from typological databases
  "formality": {
    "system": "obviative-animate",
    "description": "Cree languages use an obviative/proximate system...",
    "default": "formal",
    "source": "WALS 37A, 38A + Wolfart 1973"
  },

  // Registers: shared presets, if the group shares a formality system
  "registers": {
    "formal": {
      "label": "Formal (Proximate)",
      "description": "...",
      "prompt": "...",
      "isDefault": true
    },
    "informal": {
      "label": "Informal",
      "description": "...",
      "prompt": "..."
    }
  },

  // Gender: shared grammatical gender behavior
  "gender": {
    "grammatical": false,       // Cree doesn't have grammatical gender
    "inclusiveGuidance": null   // so no inclusive guidance needed
  },

  // Everything else is null: individual cards provide their own
  // classification, geography, resources, etc.
  "classification": null,
  "methodSupport": null,
  // ...
}

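Key rule: A genus card must contain only data that is genuinely shared across the whole group and drawn from an authoritative reference. If the formality system varies between members, it belongs on the individual cards, not on the genus card.

Canonical template

Every card must have this exact top-level shape. Sub-object schemas are documented in the field reference below.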

{
// ═══════════════════════════════════════════════════════════════════════
// § 1. IDENTITY
// Who is this language? What codes identify it?
// Sources: ISO 639-3 registry, ISO 639-1, BCP 47/IANA.
// ═══════════════════════════════════════════════════════════════════════

"code": "xxx", // REQUIRED. ISO 639-3 code. This IS the card ID and filename.
"name": "English Name", // REQUIRED. English reference name from ISO 639-3 registry.
"nativeName": null, // Endonym (name in the language itself). Source: Wikidata P1705.
// Examples: "nêhiyawêwin / ᓀᐦᐃᔭᐍᐏᐣ", "日本語", "Esperanto".
"alternateNames": [], // Other names this language is known by. Source: Glottolog, Ethnologue.
// Not aliases (those are code-level). These are name-level variants.
// Example: ["Qafar af", "Afaraf", "'Afar Af"] for Afar (aar).
"iso639_3": "xxx", // REQUIRED. Three-letter ISO 639-3 code. Same as `code`.
"iso639_1": null, // Two-letter ISO 639-1 code (e.g., "en", "fr"). null if none.
"bcp47": null, // IETF BCP 47 tag. Often same as iso639_1. Can include subtags
// (e.g., "iu-Cans-CA"). null if unknown.
"aliases": [], // Alternative code-level identifiers that resolve to this card.
// Example: ["fil"] for tgl (Tagalog), ["iu"] for iku (Inuktitut).
// Used by code resolution: user types "fil", system loads tgl.json.
"isoScope": "I", // REQUIRED. ISO 639-3 scope:
// "I" = Individual language
// "M" = Macrolanguage (e.g., Chinese, Arabic, Cree)
// "S" = Special (e.g., mis, mul, zxx)
"isoType": "L", // REQUIRED. ISO 639-3 type:
// "L" = Living "E" = Extinct "A" = Ancient
// "H" = Historical "C" = Constructed
"macrolanguage": null, // If this language is part of a macrolanguage, the macrolanguage
// ISO 639-3 code (e.g., "cre" for Plains Cree, "ara" for Arabic
// varieties). Source: ISO 639-3 macrolanguages.tab.
"extends": null, // Genus card key if shared properties are inherited from a genus
// card (e.g., "genus-cree", "genus-eskimo-aleut").
// null for most languages.

// ═══════════════════════════════════════════════════════════════════════
// § 2. CLASSIFICATION
// Where does this language sit in the family tree?
// Source: Glottolog. NEVER hand-build classifications.
// ═══════════════════════════════════════════════════════════════════════

"glottocode": null, // Glottolog identifier (e.g., "plai1258", "stan1293").
// null if the language is not in Glottolog.
"classification": null, // Genealogical classification from Glottolog. When populated:
// {
// "family": "Algic", // Top-level family. null for isolates.
// "familyGlottocode": "algi1248", // Glottocode of the family.
// "genus": "Plains Creeic", // WALS-style genus.
// "genusGlottocode": "plai1264", // Glottocode of the genus.
// "ancestry": ["Algic", "Algonquian-Blackfoot", "Algonquian",
// "Cree-Montagnais-Naskapi", "Cree", "Plains Creeic"]
// }
// For isolates: family = language name, genus = language name,
// ancestry = [language name].
"isIsolate": false, // true if a language isolate (no known genetic relatives).
// Source: Glottolog CLDF.

// ═══════════════════════════════════════════════════════════════════════
// § 3. GEOGRAPHY
// Where is this language spoken?
// Sources: Glottolog (coordinates, countries), census data, Ethnologue.
// ═══════════════════════════════════════════════════════════════════════

"macroarea": null, // Glottolog macroarea. One of: "Africa", "Australia",
// "Eurasia", "North America", "Papunesia", "South America".
// null if unknown. Source: Glottolog CLDF.
"coordinates": null, // Representative geographic point. When populated:
// { "lat": 52.1, "lng": -106.6, "source": "glottolog-5.3" }
// This is a representative point, not a boundary.
"countries": [], // ISO 3166-1 alpha-2 country codes where this language is spoken.
// Example: ["CA", "US"]. Source: Glottolog.
"regions": [], // Detailed regional breakdown with admin codes & speaker estimates.
// Each entry:
// {
// "country": "Canada",
// "countryCode": "CA",
// "officialStatus": "recognized", // official, co-official,
// // recognized, none
// "region": "Saskatchewan, Alberta, Manitoba",
// "speakerEstimate": "~20,000",
// "coordinates": [-106.6, 52.1], // [lng, lat]
// "admin1Codes": ["CA-SK", "CA-AB", "CA-MB"]
// }

"arealContext": null, // Linguistic area / Sprachbund membership. DISTINCT from
// contactInfluences (which is language-specific contact history).
// This field captures zone-level typological convergence patterns
// — i.e., what linguistic area the language exists within and
// what features are common across that area.
// {
// "zone": "Mainland Southeast Asian Sprachbund",
// "arealFeatures": "Tonal convergence, classifier systems,
// topic-prominence, monosyllabicity trend.",
// "typicalContacts": ["Classical Chinese", "Sanskrit/Pali"],
// "source": "areal-linguistics (Enfield 2005)"
// }
// NOT the same as contactInfluences. A language can exist within
// a convergence area without having specific contact history with
// any particular language in that area.

// ═══════════════════════════════════════════════════════════════════════
// § 4. WRITING SYSTEMS
// How is this language written?
// Sources: Wikidata P282, ISO 15924, manual research.
// Note: Some languages have NO standardized orthography. Some have
// competing orthographies. Some use multiple scripts routinely (e.g.,
// Serbian: Cyrillic + Latin; Japanese: Kanji + Hiragana + Katakana).
// Sign languages may use notation systems (SignWriting, HamNoSys) or
// none at all.
// ═══════════════════════════════════════════════════════════════════════

"script": null, // Primary ISO 15924 script code (e.g., "Latn", "Cyrl", "Cans",
// "Jpan"). null if no written form or unknown.
"scriptUnicodeName": null, // Unicode script block name derived from the script field.
// e.g., "Latin", "Cyrillic", "Canadian_Aboriginal", "CJK".
// Used by code_switching metric plugin. Auto-populated by
// enrich-script-unicode-names.mjs. null if script is null.
"scripts": [], // All writing systems with detail. Array of:
// {
// "code": "Cans",
// "name": "Unified Canadian Aboriginal Syllabics",
// "primary": true
// }
// A language with multiple scripts has multiple entries.
// A language with no written form has [].
"dir": null, // Writing direction: "ltr" (left-to-right) or "rtl" (right-to-left).
// null if no written form or unknown.
"scriptConverter": null, // Script converter key if we have a converter for this language
// (e.g., "crk" for SRO↔Syllabics). null for most languages.
"orthographicStatus": null, // Writing system standardization status. When populated:
// {
// "status": "standardized",
// // "standardized" — official/agreed orthography exists
// // "competing" — multiple orthographies in active use
// // "emerging" — orthography under development
// // "none" — primarily oral, no standard writing
// "notes": "Uses SIL-developed Latin orthography since 1960s.",
// "source": "ethnologue" // or "manual-curation"
// }
// Crucial for LRLs where orthographic variation directly impacts
// MT training data quality and evaluation consistency.

// ═══════════════════════════════════════════════════════════════════════
// § 5. DEMOGRAPHICS & VITALITY
// How many people speak this language? Is it endangered?
// Sources: Census, Ethnologue, UNESCO Atlas, Wikidata, Glottolog AES.
//
// CRITICAL: Store ALL estimates separately with source attribution.
// Never average or "resolve" conflicting data. Speaker counts are
// politically contested for many languages. Present the evidence,
// let the reader assess.
// ═══════════════════════════════════════════════════════════════════════

"speakerEstimates": [], // Array of speaker count estimates from different authorities.
// Each entry:
// {
// "source": "wikidata", // or "ethnologue-28",
// // "census-ph-2020", etc.
// "count": 20000, // Point estimate. null if range-only.
// "date": "2026-06-07", // When this data was retrieved.
// "countRange": { "min": 15000, "max": 25000 }, // Optional range.
// "note": "Wikidata has 2 estimates: 15,000 and 25,000"
// }
// Empty array means we have not yet found speaker count data.

"vitality": null, // Endangerment / vitality assessment. When populated:
// {
// "unescoStatus": "severely-endangered",
// // Enum: "safe", "vulnerable", "definitely-endangered",
// // "severely-endangered", "critically-endangered",
// // "extinct"
// "aesStatus": "shifting",
// // Glottolog AES label (free text from AES data).
// "egids": "6b",
// // Ethnologue Expanded Graded Intergenerational Disruption
// // Scale. Levels: 0 (international) to 10 (extinct).
// "trend": "declining",
// // Qualitative trend: "stable", "growing", "declining",
// // "shifting", "moribund", "awakening"
// "source": "glottolog-aes-5.3",
// "notes": "Intergenerational transmission breaking down."
// }

// ═══════════════════════════════════════════════════════════════════════
// § 5.5. DOCUMENTATION & DIGITAL PRESENCE
// How well-documented is this language? What digital footprint does it
// have? These fields answer the practical question: "What can I
// actually DO with this language?"
// Sources: Glottolog (references), Wikipedia, Common Voice, Tatoeba.
// ═══════════════════════════════════════════════════════════════════════

"documentationDepth": null, // How well-documented is this language in the literature?
// {
// "referenceCount": 42,
// // Number of published references in Glottolog.
// "med": "grammar",
// // Most Extensive Description type. One of:
// // "long_grammar", "grammar", "grammar_sketch",
// // "dictionary", "phonology", "text", "wordlist",
// // "comparative", "minimal", "unknown"
// "source": "glottolog-5.3"
// }

"digitalPresence": null, // Digital footprint across web platforms. When populated:
// {
// "wikipedia": {
// "edition": true, // Has its own Wikipedia edition?
// "articleCount": 75000, // Number of articles.
// "editionCode": "crk", // Wikipedia subdomain code.
// "source": "wikimedia-api-2026"
// },
// "commonVoice": {
// "validatedHours": 12.5,
// "totalHours": 25.0,
// "speakers": 45,
// "sentences": 1200,
// "source": "common-voice-20.0"
// },
// "tatoeba": {
// "sentenceCount": 342,
// "source": "tatoeba-2026"
// }
// }

"dialectCount": null, // Number of recognized dialects in Glottolog.
// Derived from child_dialect_count in languoid.csv.
// Simple integer. null if 0 or unknown.
// Source: glottolog-5.3.

// ═══════════════════════════════════════════════════════════════════════
// § 6. FORMALITY, REGISTERS & GENDER
// How does politeness work in this language? What translation registers
// do we offer? How should gender be handled?
//
// This section drives Champollion's register-preset system — the
// mechanism by which users select formal/informal/professional tone.
// These fields require genuine linguistic research, not automation.
// ═══════════════════════════════════════════════════════════════════════

"formality": null, // Formality system description. When populated:
// {
// "system": "T-V",
// // One of: "T-V", "speech-levels", "keigo", "particles",
// // "register-levels", "register-and-code-switching",
// // "code-switching", "none"
// "description": "French uses a vous/tu distinction...",
// "default": "formal-vous" // Key into the `registers` object.
// }

"registers": null, // Translation register presets. When populated, keyed by preset ID:
// {
// "formal-vous": {
// "label": "Formal (vouvoiement)",
// "description": "One sentence: when to use this preset.",
// "prompt": "The actual LLM system prompt instruction that
// steers translation tone. Must name specific
// linguistic features (pronouns, verb forms, particles).",
// "deeplFormality": "prefer_more"
// // Only if methodSupport.deepl.formality is true.
// // One of: "prefer_more", "prefer_less", "default".
// }
// }

"gender": null, // Grammatical gender and inclusive guidance. When populated:
// {
// "grammatical": true, // Does the language have gram. gender?
// "inclusiveGuidance": "Use gender-neutral forms when possible.
// Prefer 'iel' (neologism) or rephrase to
// avoid gendered agreement."
// }
// For languages without grammatical gender (Turkish, Finnish):
// { "grammatical": false, "inclusiveGuidance": null }

"codeSwitching": null, // Code-switching behavior (for languages where mixing with another
// language is the norm, not an error). When populated:
// {
// "contactLanguage": "Spanish",
// "contactIso639_3": "spa",
// "mixedVarietyName": "Jopará", // null if no named mixed variety
// "prevalence": "dominant", // "rare", "common", "dominant"
// "morphologicalIntegration": true,
// "pipelineStrategy": "hybrid-fst",
// "notes": "Jopará IS the everyday language of most Paraguayans..."
// }

// ═══════════════════════════════════════════════════════════════════════
// § 7. LINGUISTIC PROFILE
// What makes this language what it is? What are the specific challenges
// for machine translation? What rules govern its typography?
// What languages have shaped it through contact?
//
// These fields require genuine linguistic expertise. For many languages
// (especially low-resource), this section will remain null until a
// qualified researcher or community member contributes.
// ═══════════════════════════════════════════════════════════════════════

"linguisticChallenges": null, // MT-relevant challenges, keyed by challenge ID.
// When populated:
// {
// "polysynthesis": "Cree is highly polysynthetic. A single verb
// can incorporate subject, object, tense...",
// "animacy": "Verb conjugation changes based on whether the
// subject/object is animate or inanimate...",
// "neologisms": "Avoid literal translations of modern software
// concepts. Maintain Cree metaphorical logic..."
// }
// Aim for 3–6 challenges per language when researched.

"contactInfluences": [], // How other languages have shaped this one. Array of:
// {
// "source": "English",
// "sourceIso639_3": "eng", // null if proto-language/unknown
// "type": "superstrate",
// // Enum: "superstrate", "substrate", "adstrate",
// // "learned_borrowing", "lexical_borrowing",
// // "relexification"
// "domains": ["education", "government", "technology"],
// "depth": "deep",
// // Enum: "light", "moderate", "heavy", "structural",
// // "defining"
// "period": "1870–present",
// "notes": "Residential school era and ongoing...",
// "citation_needed": false
// // true if no published academic source found.
// // See language-card-citation-procedure.md.
// }

"rules": null, // Typography, plural, and capitalization rules. When populated:
// {
// "typography": {
// "quoteStart": "\u201c",
// "quoteEnd": "\u201d",
// "usesSpaces": true, // false for CJK, Thai, Lao, Khmer
// "punctuationSpacing": {
// "doublePunctuation": "none" // "thin-nbsp" for French
// }
// },
// "plurals": {
// "categories": ["one", "other"]
// // From CLDR. Possible values:
// // "zero", "one", "two", "few", "many", "other"
// },
// "capitalization": {
// "hasCase": true
// // true for Latin, Cyrillic, Greek, Armenian scripts.
// // false for CJK, Arabic, Devanagari, etc.
// }
// }
// Source: CLDR + ISO 15924 derivation.

"typologicalProfile": null, // Grambank typological features. When populated:
// {
// "featuresDocumented": 195,
// "featuresCoverage": 1, // 0.0–1.0 fraction of features
// "wordOrderDominant": "SVO",
// "hasDefiniteArticle": true,
// "hasIndefiniteArticle": true,
// "hasGenderSystem": true,
// "hasCaseMorphology": true,
// "hasEvidentiality": false,
// "hasToneSystem": false,
// "source": "grambank-1.0.3"
// }
// Auto-populated by enrich-grambank-typology.mjs.

"phonologicalInventory": null, // PHOIBLE phoneme inventory. When populated:
// {
// "consonants": 24,
// "vowels": 16,
// "tones": 0,
// "totalPhonemes": 40,
// "isTonal": false,
// "inventorySize": "moderately-large",
// // Enum: "small", "moderately-small", "average",
// // "moderately-large", "large"
// "source": "phoible-2.0"
// }
// Auto-populated by enrich-phoible-phonemes.mjs.

// ═══════════════════════════════════════════════════════════════════════
// § 8. ENCYCLOPEDIC
// General knowledge about the language for human context. History,
// dialect situation, institutional resources, representative sayings.
// This section is for understanding, not computation.
// ═══════════════════════════════════════════════════════════════════════

"encyclopedic": null, // General knowledge. When populated:
// {
// "family": "Algic", // Redundant with classification
// // but useful for human readers.
// "dialects": {
// "split": true, // Is there significant variation?
// "classification": "Plains Cree (y-dialect)",
// "variants": ["crk", "cwd", "csw"] // ISO codes of variants
// },
// "demographics": {
// "speakers": "Approx. 20,000 active speakers",
// "regions": ["Saskatchewan", "Alberta", "Manitoba"]
// },
// "history": "Plains Cree is the most widely spoken Algonquian
// language in western Canada...",
// "resources": {
// "wikipedia": "https://en.wikipedia.org/wiki/Plains_Cree",
// "foundations": [{ "name": "ALTLab", "url": "https://..." }],
// "dictionaries": [{ "name": "itwêwina", "url": "https://..." }]
// }
// }

"culturalAphorism": null, // A representative saying, proverb, or teaching in the language.
// When populated:
// {
// "text": "ê-wîcêhtonaniwahk kâ-kî-isi-wâpahtamâhk ôma pimâtisiwin",
// "transliteration": null, // Romanized form if non-Latin script.
// "translation": "Through helping each other we come to understand
// this life",
// "literal": "By-helping-one-another we-have-come-to-see this life",
// "source": "Cree teaching, documented in nêhiyawêwin educational
// resources"
// }
// Choose sayings that reveal something about the language's
// worldview or structure. Must be sourced.

"varieties": [], // For macrolanguages or languages with significant dialectal
// variation, the individual varieties with their own tool coverage.
// Each entry:
// {
// "name": "Cusco Quechua",
// "iso639_3": "quz",
// "region": "Cusco, Peru",
// "fstCoverage": true,
// "corpusCoverage": true,
// "nllbCoverage": false,
// "mutualIntelligibility": "Primary variety for this card",
// "notes": "SQUOIA FST was built for this variety."
// }

// ═══════════════════════════════════════════════════════════════════════
// § 9. DIGITAL RESOURCES & TOOLING
// What NLP tools, corpora, models, and datasets exist for this language?
// What translation APIs support it? What eval benchmarks are available?
//
// This is Champollion's operational core — these fields determine what
// we can actually DO with this language.
// ═══════════════════════════════════════════════════════════════════════

"resources": null, // NLP resources available for this language. When populated:
// {
// "fsts": [{ // Finite-state transducers
// "name": "GiellaLT Plains Cree FST (lang-crk)",
// "url": "https://github.com/giellalt/lang-crk/releases",
// "type": "morphological-analyzer"
// }],
// "corpora": [{ // Text corpora
// "name": "EDTeKLA Cree Language Textbook Corpus",
// "type": "parallel", // "parallel", "monolingual"
// "pairs": ["en-crk"],
// "url": "https://...",
// "exposure": "open-web" // "open-web", "restricted",
// // "holdout"
// }],
// "models": [{ // Pre-trained models
// "name": "NLLB-200 (crk_Cans)",
// "url": "https://...",
// "type": "nmt"
// }],
// "tools": [], // Other NLP tools
// "wordlists": [{ // Standardized wordlists
// "name": "Lexibank",
// "conceptCount": 200,
// "source": "lexibank"
// }],
// "treebanks": [{ // Syntactic treebanks
// "name": "UD_Korean-GSD",
// "tokens": 80000,
// "source": "universal-dependencies-2.14"
// }]
// }
// IMPORTANT: Only actual NLP/digital resources belong here.
// "This language has a WALS entry" is NOT a resource — that
// goes in databaseCoverage.

"databaseCoverage": null, // Which typological/reference databases cover this language.
// Separated from resources to avoid conflating "has a database
// entry" with "has usable NLP tooling."
// {
// "wals": true,
// "grambank": true,
// "phoible": true,
// "cldr": true,
// "lexibank": true,
// "commonVoice": true,
// "source": "derived"
// }

"corpusAvailability": null, // What text/parallel corpora exist for NLP use?
// {
// "bibleTranslation": {
// "textAvailable": true,
// "audioAvailable": true,
// "source": "bible-brain-api"
// },
// "opusCorpora": ["wikimedia", "ubuntu", "gnome"],
// "source": "multi-source"
// }

"keyboardSupport": null, // Input method / keyboard availability. When populated:
// {
// "keymanKeyboards": 3,
// // Number of Keyman keyboards available.
// "cldrKeyboard": true,
// // CLDR has keyboard layout data.
// "source": "keyman-api + cldr"
// }

"methodSupport": { // REQUIRED. Which Champollion translation methods support this
// language. Each method is an object with at minimum
// { "supported": boolean }.
"googleTranslate": { "supported": false },
"deepl": { "supported": false },
"microsoftTranslator": { "supported": false },
"libreTranslate": { "supported": false },
"nllb": { "supported": false },
// When NLLB is supported, include the code:
// { "supported": true, "code": "crk_Cans" }
"llm": { "supported": true }
// LLM is always true (quality varies by language).
// Optional: "verifiedDate": "2026-06-07" for audit trail.
},

"metricModelSupport": null, // Which MT evaluation models produce reliable scores.
// When populated:
// {
// "xlmr": "high", // "high", "medium", or "low"
// // XLM-R training representation tier.
// "africomet": false // true if AfriCOMET covers this language.
// }
// Drives automatic COMET model selection in metrics_comet.py.
// Auto-populated by enrich-metric-model-support.mjs.

"metricPlugins": null, // Which per-language metric plugin packs are available.
// When populated:
// {
// "formalityMarkers": true // Formality marker resource file exists
// // at plugins/resources/formality/{code}.json
// }
// Each key corresponds to a resource pack in
// arena/mt_eval_harness/plugins/resources/{packName}/.
// To add a new metric pack for a language, create the resource
// file and set the flag here. No code changes required.

"evalPack": null, // Evaluation dependency pack for language-specific metrics.
// When populated, declares the Python dependencies and
// post-install steps required by this language's eval standards.
// The harness uses this for dependency gating: if deps are
// missing, the harness warns the user and skips LYSS metrics
// (rather than crashing).
// When populated:
// {
// "pythonDeps": {
// "pyhfst": "pyhfst>=1.4", // PyPI package specs
// "requests": "requests>=2.28",
// "spacy": "spacy>=3.7"
// },
// "postInstall": [ // Commands to run after pip
// {
// "command": "spacy download en_core_web_md",
// "label": "spaCy English model (for LYSS-sem)"
// }
// ],
// "requiresFst": true, // true if GiellaLT FST needed
// "description": "LYSS equivalence linter + FST validation"
// }

"evalMetrics": null, // Language-specific evaluation metrics (LYSS standards).
// When populated, the harness dynamically imports these
// MetricPlugin classes from eval_standards/<lang>/ and applies
// them to every run targeting this language — regardless of
// which method (contestant) is being evaluated.
// Keyed by metric ID:
// {
// "lyss-eq": {
// "module": "eval_standards.crk.metrics",
// "class": "CrkLinterMetric",
// "description": "LYSS deterministic variant-class linter"
// },
// "lyss-sem": {
// "module": "eval_standards.crk.metrics",
// "class": "CrkSemanticMetric",
// "description": "LYSS FST-based semantic validator",
// "dependencies": ["spacy>=3.7"],
// "spacy_models": ["en_core_web_md"]
// }
// }
// Architecture: eval standards are referees, not contestants.
// They live in the harness (eval_standards/), not in method
// plugins. This ensures all methods are scored equally.
// Discovery: plugin_discovery.py reads this field via
// language_cards.get_eval_metrics() and instantiates metrics
// using importlib. Dependencies are checked against evalPack.

"omt1600": null, // Meta's OMT-1600 (One Model for Translation) coverage assessment.
// When populated:
// {
// "covered": true,
// "tier": "R1", // Meta's resource tier
// "evalMetrics": ["chrF++", "BLASER-3"],
// "notes": "Plains Cree: no web-crawled bitext..."
// }

"evalDatasets": [], // Evaluation dataset IDs available for this language.
// Example: ["flores-plus-devtest", "edtekla-dev-v1"].
// Empty means no standardized eval set exists.

"pipelineReadiness": null, // Assessment of readiness for Champollion's translation pipeline.
// When populated:
// {
// "tier": "tier-2-feasible",
// // "watch-list" — cataloged but no path to translation
// // "tier-3-cataloged" — basic metadata present
// // "tier-2-feasible" — tools exist, pipeline possible
// // "tier-1-ready" — pipeline operational
// "hasFST": true,
// "hasParallelCorpus": true,
// "hasEvalBenchmark": true,
// "blockers": ["Syllabics post-processing validation"],
// "notes": "FST-gated pipeline operational. EDTeKLA corpus..."
// }

// ═══════════════════════════════════════════════════════════════════════
// § 10. PROVENANCE & METADATA
// Where does this data come from? Who reviewed it? When was it
// generated? What's its overall quality level?
//
// This section exists to make the card auditable. Every automated
// enrichment, every human review, every source consulted should
// leave a trace here.
// ═══════════════════════════════════════════════════════════════════════

"dataSources": [], // REQUIRED. Sources consulted for this card's data.
// Can be a flat array (backwards-compatible):
// ["iso639-3-2024", "glottolog-5.3", "wikidata"]
//
// Or a structured per-field object (preferred for new cards):
// {
// "classification": ["glottolog-5.3"],
// "vitality": ["glottolog-aes-5.3", "unesco-atlas-2024"],
// "speakerEstimates": ["wikidata", "census-ca-2021"],
// "rules": ["cldr-48"],
// "methodSupport": ["google-translate-2026-06"]
// }

"supportTier": "cataloged", // Auto-derived tier summarizing the card's depth:
// "cataloged" — identity + classification only
// "emerging" — + vitality + speakerEstimates
// "developing" — + resources + methodSupport
// "supported" — full research: registers, challenges, etc.

"humanReviewed": null, // null until a qualified human reviews the card. When populated:
// {
// "reviewer": "Prof. Kenneth Jamandre",
// "affiliation": "University of the Philippines Diliman",
// "date": "2026-06-08",
// "scope": "full", // "full", "partial", "vitality-only"
// "notes": "Verified speaker count, vitality assessment,
// and contact influences for Tagalog."
// }

"notes": null, // Free-text notes about this language or this card's data quality.
// Example: "Low-resource language under active development.
// Translation pipeline uses FST-gated approach."

"firstDocumented": null, // Year of first known documentation. Negative for BCE.
// Example: -1500 (Sanskrit, ~1500 BCE), 1787 (some languages).
// Source: Glottolog CLDF.

"lastDocumented": null, // Year of last known documentation (relevant for extinct languages).
// Source: Glottolog CLDF.

"_generated": null // Auto-populated by enrichment scripts. When populated:
// {
// "by": "generate-all-cards.mjs",
// "at": "2026-06-07T12:34:56Z",
// "sources": ["iso639-3", "glottolog-5.3", "wikidata"],
// "completeness": "partial",
// // "partial" — has identity + classification + coords
// // "substantial" — + vitality + speakerEstimates + script
// // "complete" — all automatable fields populated
// "lastEnriched": "2026-06-07"
// }
}
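A linter for the uniform-shape rule could look roughly like the sketch below. It is an assumption-laden illustration, not an existing tool: `shape_problems` is a hypothetical name, the caller passes in the list of expected top-level fields, and the non-null check covers only the fields the template marks REQUIRED. The iso639_3 check applies to language cards, not to genus cards, which use prefixed codes.

```python
from typing import List

# Fields the template marks as REQUIRED.
REQUIRED_NON_NULL = ("code", "name", "iso639_3", "isoScope", "isoType",
                     "methodSupport", "dataSources")


def shape_problems(card: dict, expected_fields: List[str]) -> List[str]:
    """Return human-readable problems; an empty list means the shape is fine."""
    problems: List[str] = []
    for field in expected_fields:
        if field not in card:
            problems.append(f"missing field: {field}")
        elif card[field] == {}:
            problems.append(f"empty object must be null: {field}")
        elif field in REQUIRED_NON_NULL and card[field] is None:
            problems.append(f"required field is null: {field}")
    # The template states that iso639_3 is the same as code.
    if card.get("code") and card.get("iso639_3") and card["code"] != card["iso639_3"]:
        problems.append("iso639_3 must equal code")
    return problems


# Hypothetical subset of the template's fields, for brevity.
fields = ["code", "name", "nativeName", "iso639_3", "classification",
          "methodSupport", "dataSources"]
card = {
    "code": "crk",
    "name": "Plains Cree",
    "iso639_3": "crk",
    "classification": {},   # should be null, not {}
    "methodSupport": {"llm": {"supported": True}},
    "dataSources": [],
}

for problem in shape_problems(card, fields):
    print(problem)
```

A fuller check would pass every top-level field from the template and might also flag unexpected extra fields.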

Field reference

§ 1. Identity fields

Field                  Type            Required  Automatable  Source
code                   string                                 ISO 639-3 registry
name                   string                                 ISO 639-3 registry
nativeName             string | null                          Wikidata P1705
alternateNames         string[]                               Glottolog, Ethnologue
iso639_3               string                                 ISO 639-3 registry
iso639_1               string | null                          ISO 639-1
bcp47                  string | null             Partial      IANA subtag registry
aliases                string[]                               Manual curation
isoScope               string                                 ISO 639-3 registry
isoType                string                                 ISO 639-3 registry
macrolanguage          string | null                          ISO 639-3 macrolanguages.tab
extends                string | null                          Manual curation

§ 2. Classification fields

Field                  Type            Required  Automatable  Source
glottocode             string | null                          Glottolog
classification         object | null                          Glottolog
isIsolate              boolean                                Glottolog CLDF

§ 3. Geography fields

Field                  Type            Required  Automatable  Source
macroarea              string | null                          Glottolog CLDF
coordinates            object | null                          Glottolog
countries              string[]                               Glottolog
regions                object[]                               Census, Ethnologue, manual
arealContext           object | null                          Coordinates + linguistic area zones

§ 4. Writing system fields

Field                  Type            Required  Automatable  Source
script                 string | null                          Wikidata P282
scriptUnicodeName      string | null                          Derived from script via ISO 15924 → Unicode mapping
scripts                object[]                  Partial      Wikidata, manual
dir                    string | null                          Derived from script
scriptConverter        string | null                          Manual
orthographicStatus     object | null             Partial      Ethnologue, manual

§ 5. Demographics and vitality fields

Field                  Type            Required  Automatable  Source
speakerEstimates       object[]                               Wikidata, Ethnologue, census
vitality               object | null                          Glottolog AES, UNESCO

§ 5.5. Documentation and digital presence fields

Field                  Type            Required  Automatable  Source
documentationDepth     object | null                          Glottolog references
digitalPresence        object | null                          Wikipedia, Common Voice, Tatoeba
dialectCount           number | null                          Glottolog

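§ 6. Formality, Register, and Gender Fields

| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| formality | object \| null | | | Linguistic research |
| registers | object \| null | | | Linguistic research |
| gender | object \| null | | | Linguistic research |
| codeSwitching | object \| null | | | Linguistic research |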

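§ 7. Linguistic Profile Fields

| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| linguisticChallenges | object \| null | | | Linguistic research |
| contactInfluences | object[] | | | Published linguistics literature |
| rules | object \| null | | | CLDR |
| typologicalProfile | object \| null | | | Grambank 1.0.3 (auto-populated by enrich-grambank-typology.mjs) |
| phonologicalInventory | object \| null | | | PHOIBLE 2.0 (auto-populated by enrich-phoible-phonemes.mjs) |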

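§ 8. Encyclopedic Fields

| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| encyclopedic | object \| null | | | Manual research |
| culturalAphorism | object \| null | | | Community contributions |
| varieties | object[] | | | Manual research |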

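§ 9. Digital Resource Fields

| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| resources | object \| null | | Partial | Manual + automated |
| databaseCoverage | object \| null | | | Derived from enrichment |
| corpusAvailability | object \| null | | | Bible Brain, OPUS, Lexibank |
| keyboardSupport | object \| null | | | Keyman API, CLDR |
| methodSupport | object | | Partial | API validation |
| metricModelSupport | object \| null | | | XLM-R paper, AfriCOMET paper |
| metricPlugins | object \| null | | | Card enrichment; declares which metric plugin packs apply (e.g. { formalityMarkers: true }) |
| omt1600 | object \| null | | | Meta evaluation |
| evalDatasets | string[] | | | Dataset registry |
| pipelineReadiness | object \| null | | Partial | Derived + manual |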

resources.fsts[].install: FST entries in the resources object may contain an install sub-object with the fields repo, releaseTag, assetPattern, format, maturity, and an optional bundlePattern. This replaces the former hard-coded GIELLALT_FST_REGISTRY dictionary. See get_fst_install_info() in language_cards.py.
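As a rough illustration of how such an install sub-object might be read, here is a minimal Python sketch. It is not the actual get_fst_install_info() from language_cards.py: the helper name fst_install_entries, the error handling, and every sample value below are hypothetical. Only the field names come from the paragraph above.

```python
from typing import Any, Optional

# Field names taken from the spec text; everything else is illustrative.
REQUIRED_INSTALL_FIELDS = ("repo", "releaseTag", "assetPattern", "format", "maturity")
OPTIONAL_INSTALL_FIELDS = ("bundlePattern",)


def fst_install_entries(card: dict[str, Any]) -> list[dict[str, Optional[str]]]:
    """Collect the install sub-object of each FST entry in card["resources"]["fsts"].

    Hypothetical helper: entries without an install object are skipped, and an
    install object that lacks a required field raises ValueError.
    """
    resources = card.get("resources") or {}
    entries = []
    for fst in resources.get("fsts") or []:
        install = fst.get("install")
        if install is None:
            continue
        missing = [f for f in REQUIRED_INSTALL_FIELDS if install.get(f) is None]
        if missing:
            raise ValueError(f"install object is missing fields: {missing}")
        info = {f: install[f] for f in REQUIRED_INSTALL_FIELDS}
        for f in OPTIONAL_INSTALL_FIELDS:
            info[f] = install.get(f)  # None when the optional field is absent
        entries.append(info)
    return entries


# Made-up sample card; none of these values come from a real registry.
sample_card = {
    "code": "xyz",
    "resources": {
        "fsts": [
            {
                "install": {
                    "repo": "example-org/lang-xyz",
                    "releaseTag": "v0.0.0",
                    "assetPattern": "*.hfstol",
                    "format": "hfstol",
                    "maturity": "beta",
                }
            },
            {"install": None},  # an FST entry with no install info is skipped
        ]
    },
}

install_info = fst_install_entries(sample_card)
```

Normalizing the optional bundlePattern to None mirrors the card convention that every field is present and unknown values are null.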

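§ 10. Provenance Fields

| Field | Type | Required | Automatable | Source |
|---|---|---|---|---|
| dataSources | array \| object | | | Automated + manual |
| supportTier | string | | | Derived from card completeness |
| humanReviewed | object \| null | | | Human reviewers |
| notes | string \| null | | | Manual |
| firstDocumented | number \| null | | | Glottolog CLDF |
| lastDocumented | number \| null | | | Glottolog CLDF |
| _generated | object \| null | | | Enrichment scripts |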

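Language Code Policy

Champollion uses ISO 639-3 as the canonical identifier. Other standard codes are registered as aliases and resolve to ISO 639-3 codes at runtime.

| Priority | Standard | Example | Field | Usage |
|---|---|---|---|---|
| 1 (canonical) | ISO 639-3 | crk | code | Card filenames, config keys, API parameters |
| 2 (alias) | ISO 639-1 | iu | aliases[] | Accepted in the CLI, resolved to ISO 639-3 |
| 3 (alias) | BCP 47 | fil | aliases[] | Accepted in the CLI, resolved to ISO 639-3 |
| Reference | Glottocode | plai1258 | glottocode | Classification only, not used at runtime |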

Resolution order: when a user supplies a code:

  1. Direct match on card.code → found
  2. Match in card.aliases[] → found, return the canonical card
  3. Match on card.iso639_1 → found (fallback)
  4. No match → error
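The resolution order above can be sketched in Python as follows. This is an illustration rather than the project's resolver: the function name resolve_code, the UnknownLanguageCode exception, the in-memory card list, and case-sensitive exact matching are all assumptions.

```python
class UnknownLanguageCode(KeyError):
    """Raised when no card matches the supplied code (hypothetical error type)."""


def resolve_code(user_code: str, cards: list[dict]) -> str:
    """Return the canonical ISO 639-3 code for user_code, following the order above."""
    # 1. Direct match on card["code"].
    for card in cards:
        if card["code"] == user_code:
            return card["code"]
    # 2. Match in card["aliases"]: return the canonical card's code.
    for card in cards:
        if user_code in (card.get("aliases") or []):
            return card["code"]
    # 3. Fallback: match on card["iso639_1"].
    for card in cards:
        if card.get("iso639_1") == user_code:
            return card["code"]
    # 4. Nothing matched.
    raise UnknownLanguageCode(user_code)


# Minimal illustrative cards; real cards carry every top-level field.
cards = [
    {"code": "fra", "aliases": ["fr"], "iso639_1": "fr"},
    {"code": "crk", "aliases": [], "iso639_1": None},
    {"code": "iku", "aliases": [], "iso639_1": "iu"},  # reachable only via step 3
]

canonical = resolve_code("fr", cards)
```

Running each step over all cards before moving to the next keeps a direct code match ahead of any alias that happens to share the same string.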

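Migration History: ISO 639-1 → ISO 639-3

Before v8, card filenames used ISO 639-1 codes where available (fr.json, de.json, ja.json). In the 639-3 migration, every card was renamed to its ISO 639-3 equivalent:

| Before | After | Reason |
|---|---|---|
| fr.json | fra.json | 639-3 is canonical |
| de.json | deu.json | 639-3 is canonical |
| zh.json | cmn.json | Macrolanguage → default individual language |
| ar.json | arb.json | Macrolanguage → Modern Standard Arabic |
| ms.json | zsm.json | Macrolanguage → Standard Malay |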

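What happened to the old codes?

  • The old 639-1 code is stored in card.iso639_1
  • The old 639-1 code is also listed in card.aliases[]
  • resolveCode("fr") returns "fra" at runtime, so existing usage stays backward compatible
  • Users can still write "fr" in configuration; it resolves transparently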

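What changed architecturally:

  • _deepMerge() now skips null values (they inherit from the parent)
  • _deepMerge() now has an identity-field set (code, extends, and aliases are never inherited)
  • formality.default is now derived from the register flagged isDefault: true
  • 205 Grambank-derived cards received a structural formality.default fix
  • 38 genus/family/macrolanguage cards provide inheritance targets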
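The two _deepMerge() changes listed above can be re-expressed in Python as a hedged sketch. The real function is JavaScript in lib/registers.js and is not reproduced here. The name deep_merge, the top_level flag, and the assumption that identity fields are only special at the top level of a card are mine; the identity-field list is the one given in the inheritance section of this document.

```python
import copy
from typing import Any

# Identity fields as listed in the inheritance section of this document.
IDENTITY_FIELDS = frozenset(
    {"code", "extends", "_migration", "aliases", "iso639_1", "iso639_3"}
)


def deep_merge(parent: dict[str, Any], child: dict[str, Any], *, top_level: bool = True) -> dict[str, Any]:
    """Merge a parent card into a child card following the documented semantics."""
    merged: dict[str, Any] = {}
    for key in {**parent, **child}:
        child_value = child.get(key)
        parent_value = parent.get(key)
        if top_level and key in IDENTITY_FIELDS:
            # Identity fields are always the child's own value, even when None.
            merged[key] = child_value
        elif child_value is None:
            # A null child value inherits from the parent.
            merged[key] = copy.deepcopy(parent_value)
        elif isinstance(child_value, dict) and isinstance(parent_value, dict):
            # Nested objects merge recursively.
            merged[key] = deep_merge(parent_value, child_value, top_level=False)
        else:
            # Scalars and arrays: the child's value replaces the parent's.
            merged[key] = child_value
    return merged


# Illustrative fragments only; the real cards carry many more fields.
genus_cree = {
    "code": "genus-cree",
    "extends": "family-algic",
    "aliases": ["cre"],
    "formality": {"system": "obviative-animate"},
    "script": None,
}
crk = {
    "code": "crk",
    "extends": "genus-cree",
    "aliases": None,
    "formality": None,
    "script": "Cans",
}

resolved = deep_merge(genus_cree, crk)
```

In this sketch resolved keeps crk's own code, extends, and null aliases, inherits formality from the genus card, and keeps its own script.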

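Edge Cases

Sign Languages

Sign languages (e.g. ASE, American Sign Language) are legitimate languages with ISO 639-3 codes. They have geography and speaker counts, but:

  • script is usually null (no standard written form)
  • scripts may include "Sgnw" (SignWriting) if a notation system is in use
  • dir is null
  • linguisticChallenges should address spatial grammar, classifiers, and similar features
  • gender.grammatical is usually false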

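Ancient and Historical Languages

Languages such as Latin (lat, isoType H) and Sanskrit (san, isoType H) are still used in specific contexts (liturgical, scholarly) but have no native speakers:

  • vitality may note "no native speakers" with "trend": "stable" (not declining: the community that uses the language is stable, just small)
  • speakerEstimates should note that these are L2 speakers, not L1
  • firstDocumented / lastDocumented place them in time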

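Constructed Languages

Esperanto (epo, isoType C), Lojban, and others:

  • classification may point to a "constructed" family or be null
  • contactInfluences reflects the source material (e.g. Esperanto draws on Romance, Germanic, and Slavic languages)
  • vitality is unusual: the speaker community is growing but has no native homeland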

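Macrolanguages

Arabic (ara), Chinese (zho), Cree (cre), and Quechua (que) are macrolanguages that contain multiple individual languages:

  • isoScope: "M"
  • varieties should list the individual languages with their ISO codes
  • methodSupport should reflect what the macrolanguage card supports (usually the standardized variety)
  • Individual varieties should also have their own cards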

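Languages Without a Standardized Orthography

Many languages (especially those with oral traditions) have no standardized writing system, or have competing orthographies:

  • script is null
  • scripts is []
  • dir is null
  • notes should explain the orthographic situation
  • linguisticChallenges should note how this affects MT (e.g. no training data)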

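Diglossia

Languages such as Arabic (MSA vs. dialects) or Guaraní (Jopará vs. pure Guaraní):

  • codeSwitching captures the mixed-variety situation
  • registers can provide presets for the different levels
  • varieties can list the diglossic pair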

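Contact Influence Types

| Type | Meaning | Example |
|---|---|---|
| superstrate | Dominant language imposed on a community | French → English (after 1066) |
| substrate | Local language influences the imposed language | Celtic → English |
| adstrate | Neighboring languages influence each other | Norse → English |
| learned_borrowing | Borrowing through education or scholarship | Latin → English |
| lexical_borrowing | Direct lexical borrowing through contact | Spanish → Filipino |
| relexification | Large-scale vocabulary replacement | Portuguese → Papiamento |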

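Contact Influence Depth

| Depth | Meaning |
|---|---|
| light | A few loanwords, minimal structural influence |
| moderate | Significant vocabulary in specific domains |
| heavy | Pervasive vocabulary and some structural features |
| structural | Grammar, syntax, and phonology are affected |
| defining | Core identity is shaped by contact (creoles, mixed languages) |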

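Writing Good Register Presets

A good preset prompt:

  • Names the formality feature explicitly (e.g. "해요체", "vous-form", "siz-form")
  • Explains which specific pronouns or verb forms to use
  • Gives context on when this register is appropriate
  • Mentions script considerations, if applicable

Do not put gender-inclusive guidance in preset prompts. Gender guidance belongs in card.gender.inclusiveGuidance, which is injected separately.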

❌ Bad: "Standard Thai. Professional register."
✔ Good: "Professional Thai. Use คุณ (khun) for second person, เรา (rao)
for first person when needed. Clear, concise phrasing
appropriate for digital interfaces."

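Preset Naming Conventions

Preset keys should be descriptive, lowercase, and hyphen-separated:

  • T-V languages: formal-vous, informal-tu, formal-Sie, casual-du
  • Speech levels: polite-haeyo, formal-hapsyo, casual-hae
  • Neutral: professional, neutral-professional
  • Code-switching: taglish-professional, pure-filipino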

Enrichment Procedure

Per-Card Processing Order

When enriching a card, consult sources in this order. Record every source consulted, even if it returned no data.

  1. ISO 639-3 registry → code, name, isoScope, isoType
  2. ISO 639-3 macrolanguages.tab → macrolanguage
  3. Glottolog languoid.csv → glottocode, classification, coordinates, countries
  4. Glottolog CLDF → macroarea, isIsolate, firstDocumented, lastDocumented
  5. Glottolog AES → vitality (endangerment status)
  6. Wikidata SPARQL → nativeName, speakerEstimates, script, scripts, dir
  7. CLDR → rules (typography, plurals, casing)
  8. NLLB-200 / FLORES+ → methodSupport.nllb, evalDatasets
  9. API validation → remaining methodSupport entries
  10. ML model papers → metricModelSupport (XLM-R training data, AfriCOMET coverage). Script: node scripts/enrich-metric-model-support.mjs
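A minimal Python sketch of the ordering and logging rules above, combined with the merge-only principle (enrichment adds data and never overwrites). It is not one of the project's enrichment scripts: the function names, the (name, fetch) source shape, and the log entry layout are hypothetical, and the fetchers are stubs standing in for real registry lookups.

```python
from typing import Any, Callable

# A source is a (name, fetch) pair. fetch(code) returns {field: value}, or an
# empty dict when the source has nothing for this language. Hypothetical shape.
Source = tuple[str, Callable[[str], dict[str, Any]]]


def is_empty(value: Any) -> bool:
    """Unknown values are None; empty arrays are []."""
    return value is None or value == []


def enrich_card(card: dict[str, Any], sources: list[Source]) -> list[dict[str, Any]]:
    """Fill empty fields of card in place, in source order; return a consultation log."""
    log = []
    for name, fetch in sources:
        data = fetch(card["code"])
        filled = []
        for field, value in data.items():
            # Merge-only: never overwrite a value that is already present.
            if is_empty(card.get(field)) and not is_empty(value):
                card[field] = value
                filled.append(field)
        # Every consulted source is logged, even when it returned nothing.
        log.append({"source": name, "returnedData": bool(data), "filled": filled})
    return log


# Stub fetchers with illustrative data.
sources: list[Source] = [
    ("iso639-3", lambda code: {"name": "Plains Cree"}),
    ("glottolog", lambda code: {"glottocode": "plai1258", "name": "Cree, Plains"}),
    ("wikidata", lambda code: {}),
]
card = {"code": "crk", "name": None, "glottocode": None, "countries": []}
consulted = enrich_card(card, sources)
```

Because earlier sources fill a field first and later ones cannot overwrite it, the processing order doubles as a priority order in this sketch; human-curated values already on the card are likewise left alone.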

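Conflict Handling

When sources disagree:

  1. Store both values, each annotated with its source
  2. Do not average or pick a side
  3. Note the disagreement in the relevant note field
  4. Only when a single value is needed for a computation, prefer the most recent primary source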
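The conflict rules above might look like this in Python, using speaker estimates as the example. The entry keys (count, source, year), the helper names, and the years are assumptions made for illustration; the sketch also assumes every stored entry comes from a primary source. The two counts echo the Wikidata/Ethnologue example in the design principles.

```python
from typing import Any, Optional


def add_estimate(estimates: list[dict[str, Any]], count: int, source: str, year: int) -> None:
    """Rules 1 and 2: append a sourced estimate; never replace or average existing ones."""
    estimates.append({"count": count, "source": source, "year": year})


def disagreement_note(estimates: list[dict[str, Any]]) -> Optional[str]:
    """Rule 3: describe the disagreement, or return None when the counts agree."""
    if len({e["count"] for e in estimates}) < 2:
        return None
    parts = ", ".join(f'{e["source"]}: {e["count"]}' for e in estimates)
    return f"Sources disagree on speaker count ({parts})."


def single_value_for_computation(estimates: list[dict[str, Any]]) -> Optional[int]:
    """Rule 4: only when one number is required, take the most recent estimate."""
    if not estimates:
        return None
    return max(estimates, key=lambda e: e["year"])["count"]


speaker_estimates: list[dict[str, Any]] = []
add_estimate(speaker_estimates, 50000, "wikidata", 2021)    # year is made up
add_estimate(speaker_estimates, 20000, "ethnologue", 2016)  # year is made up

note = disagreement_note(speaker_estimates)
value = single_value_for_computation(speaker_estimates)
```

Keeping the selection in a separate function means the stored data always retains both estimates; choosing one value is a read-time decision only.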

Validation

Run the linter after any enrichment pass or manual edit:

node scripts/lint-language-cards.mjs # all cards
node scripts/lint-language-cards.mjs --lang crk # single card

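PR Checklist

When submitting a new or modified language card:

  • The file is named <code>.json and lives in shared/language-cards/
  • All top-level fields from the canonical template are present
  • classification is populated from Glottolog (not hand-built)
  • dataSources lists every source consulted
  • methodSupport entries are verified against the actual API language lists
  • contactInfluences entries have a published source or citation_needed: true
  • linguisticChallenges has 3–6 MT-relevant challenges (if researched)
  • rules is populated from CLDR (if locale data exists)
  • The linter passes with no errors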
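Two of the structural items above can be pre-checked with a few lines of Python before running the linter. This is a hypothetical helper, not a replacement for scripts/lint-language-cards.mjs, and the template shown is a tiny illustrative subset of the canonical field list. The second check reflects the design principle that an empty object is stored as null, not {}.

```python
from typing import Any


def missing_top_level_fields(card: dict[str, Any], template: dict[str, Any]) -> list[str]:
    """Return template fields absent from the card (a null value still counts as present)."""
    return [field for field in template if field not in card]


def empty_object_fields(card: dict[str, Any]) -> list[str]:
    """Return fields set to {}; the spec stores an empty object as null instead."""
    return [field for field, value in card.items() if value == {}]


# Illustrative subset only; the canonical template has many more fields.
template = {"code": None, "name": None, "aliases": [], "vitality": None, "notes": None}
draft_card = {"code": "crk", "name": "Plains Cree", "aliases": [], "vitality": {}}

missing = missing_top_level_fields(draft_card, template)
empty_objects = empty_object_fields(draft_card)
```

Here the draft card would fail both checks: notes is missing and vitality is {} rather than null.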

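Professional References

| Standard | Maintainer | Our use |
|---|---|---|
| ISO 639-3 | SIL International | Canonical language codes, macrolanguage relationships |
| Glottolog | Max Planck Institute | Classification, coordinates, AES endangerment |
| WALS | Max Planck Institute | Genus definitions, typological features |
| ISO 15924 | Unicode/ISO | Script codes |
| CLDR | Unicode Consortium | Locale data, plural rules, typography |
| Wikidata | Wikimedia Foundation | Speaker counts, endonyms, script data |
| Ethnologue | SIL International | EGIDS, speaker estimates, DLS |
| UNESCO Atlas | UNESCO | Endangerment classification |
| Katig Collective | UP Diliman | Philippine language capsules |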

See also: the Language Card Citation Procedure for detailed source-by-source guidance.