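Language Card Specification

Single source of truth. This document defines the canonical shape of every language card. Every card must include every top-level field listed here, even when the value is null or []. A card with a missing field is non-conformant. This uniformity lets automation tools, linters, enrichment scripts, and human reviewers rely on the card structure.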
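The conformance rule can be sketched as a simple presence check. This is a minimal illustration, not the project's actual linter: the function name findMissingFields and the abbreviated field list are assumptions, and the full list of required fields is the one given in the canonical template.

```javascript
// Hypothetical, abbreviated field list for illustration only.
// The authoritative list is the canonical template in this document.
const TOP_LEVEL_FIELDS = [
  "code", "name", "nativeName", "alternateNames",
  "iso639_3", "iso639_1", "aliases", "extends",
];

// Returns the names of required fields that are absent from the card.
// A field set to null or [] counts as present; only a missing key fails.
function findMissingFields(card, requiredFields = TOP_LEVEL_FIELDS) {
  return requiredFields.filter(
    (field) => !Object.prototype.hasOwnProperty.call(card, field)
  );
}

const draftCard = {
  code: "crk", name: "Plains Cree", nativeName: null,
  alternateNames: [], iso639_3: "crk", iso639_1: null,
};
const missing = findMissingFields(draftCard);
// `missing` lists "aliases" and "extends": the draft card is non-conformant.
```

Checking for key presence rather than truthiness matters here, because null and [] are valid values under this specification.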

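Design Principles

Uniform shape. All 8,000+ cards have the same top-level fields. Unknown values are null, empty arrays are [], and empty objects are null (not {}). Code therefore never needs to ask "does this field exist?", only "is the value populated?"
Provenance for everything. Every factual claim traces to a named, versioned primary source. A claim without a source is a claim that cannot be verified. The dataSources field (and the per-field source annotations in sub-objects) makes provenance explicit.
Preserve disagreement. When authoritative sources differ (Wikidata says 50,000 speakers, Ethnologue says 20,000), we store both, each with its source attribution. We do not average, resolve, or pick a side. Users can explore the nuance themselves.
null means unknown, not inapplicable. A null field means "we have not yet found data for this." If a field genuinely does not apply (e.g., grammatical gender for a sign language), the value itself must say so: { "grammatical": false, "inclusiveGuidance": "Not applicable — ASL does not have grammatical gender." }
Merge only. Enrichment scripts add data; they never overwrite it. Human-curated values take precedence over automated data.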
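The "uniform shape", "preserve disagreement", and "merge only" principles could be sketched as follows. The helper names isPopulated, enrichCard, and addSpeakerEstimate are hypothetical and do not come from the project's enrichment scripts; the sketch only shows one way the stated rules might be applied.

```javascript
// Hypothetical helpers, not part of the actual enrichment scripts.

// With a uniform shape, "is the value populated?" is the only check needed:
// null means unknown and [] means no entries yet.
function isPopulated(value) {
  if (value === null || value === undefined) return false;
  if (Array.isArray(value)) return value.length > 0;
  return true; // false and 0 are real values, so they count as populated
}

// Merge only: copy an automated value into the card only where the card
// has nothing yet. Existing (possibly human-curated) values are kept.
function enrichCard(card, automated) {
  const result = { ...card };
  for (const [field, value] of Object.entries(automated)) {
    if (!isPopulated(result[field]) && isPopulated(value)) {
      result[field] = value;
    }
  }
  return result;
}

// Preserve disagreement: append an estimate from a new source next to the
// existing ones instead of averaging or replacing them.
function addSpeakerEstimate(card, estimate) {
  const existing = card.speakerEstimates ?? [];
  if (existing.some((e) => e.source === estimate.source)) return card;
  return { ...card, speakerEstimates: [...existing, estimate] };
}

const curated = { code: "crk", nativeName: "nêhiyawêwin", macroarea: null, speakerEstimates: [] };
const enriched = enrichCard(curated, { nativeName: "Plains Cree", macroarea: "North America" });
const withWikidata = addSpeakerEstimate(enriched, { source: "wikidata", count: 50000 });
const withBoth = addSpeakerEstimate(withWikidata, { source: "ethnologue-28", count: 20000 });
```

In this sketch the curated nativeName survives enrichment, the null macroarea is filled in, and both speaker counts are stored side by side.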

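Three-Layer Architecture

Layer	Location	Purpose
Language card	`shared/language-cards/<code>.json`	Per-language configuration: identity, classification, resources, and everything else
Genus card	`shared/language-cards/genera/<genus>.json`	Shared runtime properties for related languages (curated, not auto-generated)
Language tree	`shared/language-cards/language-tree.json`	Full Glottolog hierarchy: reference data for the Lab UI and language browsing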

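Inheritance Model

When a card sets "extends": "family-dravidian", the runtime merges the parent card into the child using _deepMerge() (in lib/registers.js). A genus card can therefore define shared registers, formality systems, and gender guidance that flow down to every member language, without duplicating the data across hundreds of individual cards.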

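Merge Semantics

Child value	Behavior	Why
`null`	Inherit from parent	`null` means "not defined here", so the parent's value flows in
Non-null	Override parent	The child's data is more specific, so it wins
Nested object	Recursive merge	Child fields override; parent fields are preserved
Array	Replace wholesale	Arrays are not merged item by item; the child array wins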
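The four rows of the table can be read as a small recursive function. The following is an illustrative sketch only: deepMergeSketch is a hypothetical name, and the real _deepMerge() in lib/registers.js may differ in details. In particular, treating an empty child array [] as a replacement (rather than as "not defined") is an assumption drawn from the "Array" row.

```javascript
// Illustrative sketch; not the actual _deepMerge() from lib/registers.js.
function isPlainObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

function deepMergeSketch(parent, child) {
  // Row 1: a null child inherits the parent's value.
  if (child === null || child === undefined) return parent;
  // Row 3: two nested objects merge recursively, key by key.
  if (isPlainObject(parent) && isPlainObject(child)) {
    const merged = { ...parent };
    for (const key of Object.keys(child)) {
      merged[key] = deepMergeSketch(parent[key] ?? null, child[key]);
    }
    return merged;
  }
  // Rows 2 and 4: non-null scalars and arrays replace the parent wholesale.
  // Assumption: an empty child array [] also replaces the parent array.
  return child;
}

const parentCard = {
  formality: { system: "T-V", default: "formal" },
  registers: { formal: { label: "Formal" } },
  countries: ["CA", "US"],
  script: "Latn",
};
const childCard = {
  formality: null,                               // inherited
  registers: { informal: { label: "Informal" } }, // merged with parent's
  countries: ["CA"],                              // replaces parent's array
  script: "Cans",                                 // overrides parent's value
};
const mergedCard = deepMergeSketch(parentCard, childCard);
```

Here mergedCard keeps the parent's formality object, carries both the formal and informal registers, and takes countries and script from the child.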

Identity Fields (Never Inherited)

Some fields belong to the card itself and must never be inherited from a parent:

code, extends, _migration, aliases, iso639_1, iso639_3

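Even if a parent card defines aliases: ["macro-code"], the child card does not inherit that alias. These fields always carry the child's own value (including null when unset).

Why: without this rule, every Cree language would inherit aliases: ["cre"] from its macrolanguage parent, turning every variety into an alias for the macrolanguage.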

Example: How a Cree Card Resolves

┌───────────────────────┐
│  family-algic.json    │  formality: null, registers: null
│  (no registers)       │
└──────────┬────────────┘
           │ extends
┌──────────┴────────────┐
│  genus-cree.json      │  formality: { system: "obviative-animate", ... }
│  (sourced registers)  │  registers: { formal: {...}, informal: {...} }
└──────────┬────────────┘
           │ extends
┌──────────┴────────────┐
│  crk.json             │  code: "crk", extends: "genus-cree"
│  (Plains Cree)        │  formality: null → inherits from genus-cree
│                       │  registers: null → inherits from genus-cree
│                       │  script: "Cans"  → own value, no inheritance
│                       │  code: "crk"     → identity field, never inherited
└───────────────────────┘

At runtime, getLanguageCard("crk") returns a merged object: genus-cree's registers, plus family-algic's properties (if any), plus crk's own identity and metadata.
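The resolution of an extends chain, including the identity-field rule, might look like the sketch below. It is an assumption-laden illustration rather than the real getLanguageCard(): resolveCard, merge, and the in-memory cards map are hypothetical, the card contents are trimmed versions of the diagram, and the aliases: ["cre"] value on genus-cree is invented to show why identity fields are not inherited.

```javascript
const IDENTITY_FIELDS = ["code", "extends", "_migration", "aliases", "iso639_1", "iso639_3"];

// Compact restatement of the merge semantics: null inherits, objects merge
// recursively, everything else (including arrays) is replaced by the child.
function merge(parent, child) {
  if (child === null || child === undefined) return parent ?? null;
  const bothObjects = [parent, child].every(
    (v) => v !== null && typeof v === "object" && !Array.isArray(v)
  );
  if (!bothObjects) return child;
  const out = { ...parent };
  for (const key of Object.keys(child)) out[key] = merge(parent[key], child[key]);
  return out;
}

// Hypothetical resolver: walks the extends chain from the root down, then
// restores the child's own identity fields (null when the child has none).
function resolveCard(code, cards, seen = new Set()) {
  const card = cards[code];
  if (!card) throw new Error(`Unknown card: ${code}`);
  if (seen.has(code)) throw new Error(`Circular extends chain at: ${code}`);
  seen.add(code);
  if (!card.extends) return card;
  const resolved = merge(resolveCard(card.extends, cards, seen), card);
  for (const field of IDENTITY_FIELDS) resolved[field] = card[field] ?? null;
  return resolved;
}

// Trimmed, illustrative cards mirroring the diagram above.
const cards = {
  "family-algic": { code: "family-algic", extends: null, aliases: [], formality: null, registers: null },
  "genus-cree": {
    code: "genus-cree", extends: "family-algic", aliases: ["cre"], // hypothetical alias
    formality: { system: "obviative-animate" },
    registers: { formal: { label: "Formal (Proximate)" }, informal: { label: "Informal" } },
  },
  crk: { code: "crk", extends: "genus-cree", aliases: null, formality: null, registers: null, script: "Cans" },
};
const plainsCree = resolveCard("crk", cards);
```

In this sketch plainsCree takes formality and registers from genus-cree, keeps its own code and script, and ends up with aliases set to null rather than the parent's ["cre"].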

Genus Card Template

Genus cards live in shared/language-cards/genera/ and define the shared properties of a language group. They follow the same schema as regular cards, but under different rules:

{
  // Identity — genus cards use a prefixed code, NOT an ISO 639-3 code
  "code": "genus-cree",           // "genus-", "family-", or "macrolanguage-" prefix
  "name": "Cree Languages",      // Human-readable group name
  "extends": "family-algic",     // Genus cards can extend family cards (chaining)

  // Formality — shared across the group, sourced from typological databases
  "formality": {
    "system": "obviative-animate",
    "description": "Cree languages use an obviative/proximate system...",
    "default": "formal",
    "source": "WALS 37A, 38A + Wolfart 1973"
  },

  // Registers — shared presets, if the group shares a formality system
  "registers": {
    "formal": {
      "label": "Formal (Proximate)",
      "description": "...",
      "prompt": "...",
      "isDefault": true
    },
    "informal": {
      "label": "Informal",
      "description": "...",
      "prompt": "..."
    }
  },

  // Gender — shared grammatical gender behavior
  "gender": {
    "grammatical": false,       // Cree doesn't have grammatical gender
    "inclusiveGuidance": null   //   so no inclusive guidance needed
  },

  // Everything else is null — individual cards provide their own
  // classification, geography, resources, etc.
  "classification": null,
  "methodSupport": null,
  // ...
}

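Key rule: a genus card must contain only data that is genuinely shared across the whole group and sourced from authoritative references. If the formality system varies between members, it belongs on the individual cards, not on the genus.

Canonical Template

Every card must have exactly this top-level shape. Sub-object schemas are documented in the field reference below.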

{
  // ═══════════════════════════════════════════════════════════════════════
  //  § 1. IDENTITY
  //  Who is this language? What codes identify it?
  //  Sources: ISO 639-3 registry, ISO 639-1, BCP 47/IANA.
  // ═══════════════════════════════════════════════════════════════════════

  "code":          "xxx",       // REQUIRED. ISO 639-3 code. This IS the card ID and filename.
  "name":          "English Name",  // REQUIRED. English reference name from ISO 639-3 registry.
  "nativeName":    null,        // Endonym (name in the language itself). Source: Wikidata P1705.
                                // Examples: "nêhiyawêwin / ᓀᐦᐃᔭᐍᐏᐣ", "日本語", "Esperanto".
  "alternateNames": [],         // Other names this language is known by. Source: Glottolog, Ethnologue.
                                // Not aliases (those are code-level). These are name-level variants.
                                // Example: ["Qafar af", "Afaraf", "'Afar Af"] for Afar (aar).
  "iso639_3":      "xxx",      // REQUIRED. Three-letter ISO 639-3 code. Same as `code`.
  "iso639_1":      null,        // Two-letter ISO 639-1 code (e.g., "en", "fr"). null if none.
  "bcp47":         null,        // IETF BCP 47 tag. Often same as iso639_1. Can include subtags
                                // (e.g., "iu-Cans-CA"). null if unknown.
  "aliases":       [],          // Alternative code-level identifiers that resolve to this card.
                                // Example: ["fil"] for tl (Tagalog), ["iu"] for iku (Inuktitut).
                                // Used by code resolution: user types "fil", system loads tl.json.
  "isoScope":      "I",        // REQUIRED. ISO 639-3 scope:
                                //   "I" = Individual language
                                //   "M" = Macrolanguage (e.g., Chinese, Arabic, Cree)
                                //   "S" = Special (e.g., mis, mul, zxx)
  "isoType":       "L",        // REQUIRED. ISO 639-3 type:
                                //   "L" = Living    "E" = Extinct    "A" = Ancient
                                //   "H" = Historical    "C" = Constructed
  "macrolanguage": null,        // If this language is part of a macrolanguage, the macrolanguage
                                // ISO 639-3 code (e.g., "cre" for Plains Cree, "ara" for Arabic
                                // varieties). Source: ISO 639-3 macrolanguages.tab.
  "extends":       null,        // Genus card key if shared properties are inherited from a genus
                                // card (e.g., "genus-cree", "genus-eskimo-aleut").
                                // null for most languages.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 2. CLASSIFICATION
  //  Where does this language sit in the family tree?
  //  Source: Glottolog. NEVER hand-build classifications.
  // ═══════════════════════════════════════════════════════════════════════

  "glottocode":      null,      // Glottolog identifier (e.g., "plai1258", "stan1293").
                                // null if the language is not in Glottolog.
  "classification":  null,      // Genealogical classification from Glottolog. When populated:
                                // {
                                //   "family": "Algic",              // Top-level family. null for isolates.
                                //   "familyGlottocode": "algi1248", // Glottocode of the family.
                                //   "genus": "Plains Creeic",       // WALS-style genus.
                                //   "genusGlottocode": "plai1264",  // Glottocode of the genus.
                                //   "ancestry": ["Algic", "Algonquian-Blackfoot", "Algonquian",
                                //                "Cree-Montagnais-Naskapi", "Cree", "Plains Creeic"]
                                // }
                                // For isolates: family = language name, genus = language name,
                                // ancestry = [language name].
  "isIsolate":       false,     // true if a language isolate (no known genetic relatives).
                                // Source: Glottolog CLDF.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 3. GEOGRAPHY
  //  Where is this language spoken?
  //  Sources: Glottolog (coordinates, countries), census data, Ethnologue.
  // ═══════════════════════════════════════════════════════════════════════

  "macroarea":     null,        // Glottolog macroarea. One of: "Africa", "Australia",
                                // "Eurasia", "North America", "Papunesia", "South America".
                                // null if unknown. Source: Glottolog CLDF.
  "coordinates":   null,        // Representative geographic point. When populated:
                                // { "lat": 52.1, "lng": -106.6, "source": "glottolog-5.3" }
                                // This is a representative point, not a boundary.
  "countries":     [],          // ISO 3166-1 alpha-2 country codes where this language is spoken.
                                // Example: ["CA", "US"]. Source: Glottolog.
  "regions":       [],          // Detailed regional breakdown with admin codes & speaker estimates.
                                // Each entry:
                                // {
                                //   "country": "Canada",
                                //   "countryCode": "CA",
                                //   "officialStatus": "recognized",  // official, co-official,
                                //                                    // recognized, none
                                //   "region": "Saskatchewan, Alberta, Manitoba",
                                //   "speakerEstimate": "~20,000",
                                //   "coordinates": [-106.6, 52.1],   // [lng, lat]
                                //   "admin1Codes": ["CA-SK", "CA-AB", "CA-MB"]
                                // }

  "arealContext":  null,         // Linguistic area / Sprachbund membership. DISTINCT from
                                // contactInfluences (which is language-specific contact history).
                                // This field captures zone-level typological convergence patterns
                                // — i.e., what linguistic area the language exists within and
                                // what features are common across that area.
                                // {
                                //   "zone": "Mainland Southeast Asian Sprachbund",
                                //   "arealFeatures": "Tonal convergence, classifier systems,
                                //     topic-prominence, monosyllabicity trend.",
                                //   "typicalContacts": ["Classical Chinese", "Sanskrit/Pali"],
                                //   "source": "areal-linguistics (Enfield 2005)"
                                // }
                                // NOT the same as contactInfluences. A language can exist within
                                // a convergence area without having specific contact history with
                                // any particular language in that area.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 4. WRITING SYSTEMS
  //  How is this language written?
  //  Sources: Wikidata P282, ISO 15924, manual research.
  //  Note: Some languages have NO standardized orthography. Some have
  //  competing orthographies. Some use multiple scripts routinely (e.g.,
  //  Serbian: Cyrillic + Latin; Japanese: Kanji + Hiragana + Katakana).
  //  Sign languages may use notation systems (SignWriting, HamNoSys) or
  //  none at all.
  // ═══════════════════════════════════════════════════════════════════════

  "script":        null,        // Primary ISO 15924 script code (e.g., "Latn", "Cyrl", "Cans",
                                // "Jpan"). null if no written form or unknown.
  "scriptUnicodeName": null,    // Unicode script block name derived from the script field.
                                // e.g., "Latin", "Cyrillic", "Canadian_Aboriginal", "CJK".
                                // Used by code_switching metric plugin. Auto-populated by
                                // enrich-script-unicode-names.mjs. null if script is null.
  "scripts":       [],          // All writing systems with detail. Array of:
                                // {
                                //   "code": "Cans",
                                //   "name": "Unified Canadian Aboriginal Syllabics",
                                //   "primary": true
                                // }
                                // A language with multiple scripts has multiple entries.
                                // A language with no written form has [].
  "dir":           null,        // Writing direction: "ltr" (left-to-right) or "rtl" (right-to-left).
                                // null if no written form or unknown.
  "scriptConverter": null,      // Script converter key if we have a converter for this language
                                // (e.g., "crk" for SRO↔Syllabics). null for most languages.
  "orthographicStatus": null,   // Writing system standardization status. When populated:
                                // {
                                //   "status": "standardized",
                                //       // "standardized" — official/agreed orthography exists
                                //       // "competing"    — multiple orthographies in active use
                                //       // "emerging"     — orthography under development
                                //       // "none"         — primarily oral, no standard writing
                                //   "notes": "Uses SIL-developed Latin orthography since 1960s.",
                                //   "source": "ethnologue" // or "manual-curation"
                                // }
                                // Crucial for LRLs where orthographic variation directly impacts
                                // MT training data quality and evaluation consistency.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 5. DEMOGRAPHICS & VITALITY
  //  How many people speak this language? Is it endangered?
  //  Sources: Census, Ethnologue, UNESCO Atlas, Wikidata, Glottolog AES.
  //
  //  CRITICAL: Store ALL estimates separately with source attribution.
  //  Never average or "resolve" conflicting data. Speaker counts are
  //  politically contested for many languages. Present the evidence,
  //  let the reader assess.
  // ═══════════════════════════════════════════════════════════════════════

  "speakerEstimates": [],       // Array of speaker count estimates from different authorities.
                                // Each entry:
                                // {
                                //   "source": "wikidata",              // or "ethnologue-28",
                                //                                      // "census-ph-2020", etc.
                                //   "count": 20000,                    // Point estimate. null if range-only.
                                //   "date": "2026-06-07",              // When this data was retrieved.
                                //   "countRange": { "min": 15000, "max": 25000 },  // Optional range.
                                //   "note": "Wikidata has 2 estimates: 15,000 and 25,000"
                                // }
                                // Empty array means we have not yet found speaker count data.

  "vitality":      null,        // Endangerment / vitality assessment. When populated:
                                // {
                                //   "unescoStatus": "severely-endangered",
                                //       // Enum: "safe", "vulnerable", "definitely-endangered",
                                //       //       "severely-endangered", "critically-endangered",
                                //       //       "extinct"
                                //   "aesStatus": "shifting",
                                //       // Glottolog AES label (free text from AES data).
                                //   "egids": "6b",
                                //       // Ethnologue Expanded Graded Intergenerational Disruption
                                //       // Scale. Levels: 0 (international) to 10 (extinct).
                                //   "trend": "declining",
                                //       // Qualitative trend: "stable", "growing", "declining",
                                //       //                     "shifting", "moribund", "awakening"
                                //   "source": "glottolog-aes-5.3",
                                //   "notes": "Intergenerational transmission breaking down."
                                // }

  // ═══════════════════════════════════════════════════════════════════════
  //  § 5.5. DOCUMENTATION & DIGITAL PRESENCE
  //  How well-documented is this language? What digital footprint does it
  //  have? These fields answer the practical question: "What can I
  //  actually DO with this language?"
  //  Sources: Glottolog (references), Wikipedia, Common Voice, Tatoeba.
  // ═══════════════════════════════════════════════════════════════════════

  "documentationDepth": null,    // How well-documented is this language in the literature?
                                 // {
                                 //   "referenceCount": 42,
                                 //       // Number of published references in Glottolog.
                                 //   "med": "grammar",
                                 //       // Most Extensive Description type. One of:
                                 //       // "long_grammar", "grammar", "grammar_sketch",
                                 //       // "dictionary", "phonology", "text", "wordlist",
                                 //       // "comparative", "minimal", "unknown"
                                 //   "source": "glottolog-5.3"
                                 // }

  "digitalPresence":  null,      // Digital footprint across web platforms. When populated:
                                 // {
                                 //   "wikipedia": {
                                 //     "edition": true,      // Has its own Wikipedia edition?
                                 //     "articleCount": 75000, // Number of articles.
                                 //     "editionCode": "crk",  // Wikipedia subdomain code.
                                 //     "source": "wikimedia-api-2026"
                                 //   },
                                 //   "commonVoice": {
                                 //     "validatedHours": 12.5,
                                 //     "totalHours": 25.0,
                                 //     "speakers": 45,
                                 //     "sentences": 1200,
                                 //     "source": "common-voice-20.0"
                                 //   },
                                 //   "tatoeba": {
                                 //     "sentenceCount": 342,
                                 //     "source": "tatoeba-2026"
                                 //   }
                                 // }

  "dialectCount":     null,      // Number of recognized dialects in Glottolog.
                                 // Derived from child_dialect_count in languoid.csv.
                                 // Simple integer. null if 0 or unknown.
                                 // Source: glottolog-5.3.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 6. FORMALITY, REGISTERS & GENDER
  //  How does politeness work in this language? What translation registers
  //  do we offer? How should gender be handled?
  //
  //  This section drives Champollion's register-preset system — the
  //  mechanism by which users select formal/informal/professional tone.
  //  These fields require genuine linguistic research, not automation.
  // ═══════════════════════════════════════════════════════════════════════

  "formality":     null,        // Formality system description. When populated:
                                // {
                                //   "system": "T-V",
                                //       // One of: "T-V", "speech-levels", "keigo", "particles",
                                //       //         "register-levels", "register-and-code-switching",
                                //       //         "code-switching", "none"
                                //   "description": "French uses a vous/tu distinction...",
                                //   "default": "formal-vous"   // Key into the `registers` object.
                                // }

  "registers":     null,        // Translation register presets. When populated, keyed by preset ID:
                                // {
                                //   "formal-vous": {
                                //     "label": "Formal (vouvoiement)",
                                //     "description": "One sentence: when to use this preset.",
                                //     "prompt": "The actual LLM system prompt instruction that
                                //               steers translation tone. Must name specific
                                //               linguistic features (pronouns, verb forms, particles).",
                                //     "deeplFormality": "prefer_more"
                                //       // Only if methodSupport.deepl.formality is true.
                                //       // One of: "prefer_more", "prefer_less", "default".
                                //   }
                                // }

  "gender":        null,        // Grammatical gender and inclusive guidance. When populated:
                                // {
                                //   "grammatical": true,         // Does the language have gram. gender?
                                //   "inclusiveGuidance": "Use gender-neutral forms when possible.
                                //                        Prefer 'iel' (neologism) or rephrase to
                                //                        avoid gendered agreement."
                                // }
                                // For languages without grammatical gender (Turkish, Finnish):
                                // { "grammatical": false, "inclusiveGuidance": null }

  "codeSwitching":  null,       // Code-switching behavior (for languages where mixing with another
                                // language is the norm, not an error). When populated:
                                // {
                                //   "contactLanguage": "Spanish",
                                //   "contactIso639_3": "spa",
                                //   "mixedVarietyName": "Jopará",   // null if no named mixed variety
                                //   "prevalence": "dominant",       // "rare", "common", "dominant"
                                //   "morphologicalIntegration": true,
                                //   "pipelineStrategy": "hybrid-fst",
                                //   "notes": "Jopará IS the everyday language of most Paraguayans..."
                                // }

  // ═══════════════════════════════════════════════════════════════════════
  //  § 7. LINGUISTIC PROFILE
  //  What makes this language what it is? What are the specific challenges
  //  for machine translation? What rules govern its typography?
  //  What languages have shaped it through contact?
  //
  //  These fields require genuine linguistic expertise. For many languages
  //  (especially low-resource), this section will remain null until a
  //  qualified researcher or community member contributes.
  // ═══════════════════════════════════════════════════════════════════════

  "linguisticChallenges": null,  // MT-relevant challenges, keyed by challenge ID.
                                 // When populated:
                                 // {
                                 //   "polysynthesis": "Cree is highly polysynthetic. A single verb
                                 //                    can incorporate subject, object, tense...",
                                 //   "animacy": "Verb conjugation changes based on whether the
                                 //              subject/object is animate or inanimate...",
                                 //   "neologisms": "Avoid literal translations of modern software
                                 //                 concepts. Maintain Cree metaphorical logic..."
                                 // }
                                 // Aim for 3–6 challenges per language when researched.

  "contactInfluences": [],       // How other languages have shaped this one. Array of:
                                 // {
                                 //   "source": "English",
                                 //   "sourceIso639_3": "eng",       // null if proto-language/unknown
                                 //   "type": "superstrate",
                                 //       // Enum: "superstrate", "substrate", "adstrate",
                                 //       //       "learned_borrowing", "lexical_borrowing",
                                 //       //       "relexification"
                                 //   "domains": ["education", "government", "technology"],
                                 //   "depth": "heavy",
                                 //       // Enum: "light", "moderate", "heavy", "structural",
                                 //       //       "defining"
                                 //   "period": "1870–present",
                                 //   "notes": "Residential school era and ongoing...",
                                 //   "citation_needed": false
                                 //       // true if no published academic source found.
                                 //       // See language-card-citation-procedure.md.
                                 // }

  "rules":          null,        // Typography, plural, and capitalization rules. When populated:
                                 // {
                                 //   "typography": {
                                 //     "quoteStart": "\u201c",
                                 //     "quoteEnd": "\u201d",
                                 //     "usesSpaces": true,        // false for CJK, Thai, Lao, Khmer
                                 //     "punctuationSpacing": {
                                 //       "doublePunctuation": "none"  // "thin-nbsp" for French
                                 //     }
                                 //   },
                                 //   "plurals": {
                                 //     "categories": ["one", "other"]
                                 //       // From CLDR. Possible values:
                                 //       // "zero", "one", "two", "few", "many", "other"
                                 //   },
                                 //   "capitalization": {
                                 //     "hasCase": true
                                 //       // true for Latin, Cyrillic, Greek, Armenian scripts.
                                 //       // false for CJK, Arabic, Devanagari, etc.
                                 //   }
                                 // }
                                 // Source: CLDR + ISO 15924 derivation.

  "typologicalProfile": null,   // Grambank typological features. When populated:
                                // {
                                //   "featuresDocumented": 195,
                                //   "featuresCoverage": 1,     // 0.0–1.0 fraction of features
                                //   "wordOrderDominant": "SVO",
                                //   "hasDefiniteArticle": true,
                                //   "hasIndefiniteArticle": true,
                                //   "hasGenderSystem": true,
                                //   "hasCaseMorphology": true,
                                //   "hasEvidentiality": false,
                                //   "hasToneSystem": false,
                                //   "source": "grambank-1.0.3"
                                // }
                                // Auto-populated by enrich-grambank-typology.mjs.

  "phonologicalInventory": null, // PHOIBLE phoneme inventory. When populated:
                                // {
                                //   "consonants": 24,
                                //   "vowels": 16,
                                //   "tones": 0,
                                //   "totalPhonemes": 40,
                                //   "isTonal": false,
                                //   "inventorySize": "moderately-large",
                                //       // Enum: "small", "moderately-small", "average",
                                //       //       "moderately-large", "large"
                                //   "source": "phoible-2.0"
                                // }
                                // Auto-populated by enrich-phoible-phonemes.mjs.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 8. ENCYCLOPEDIC
  //  General knowledge about the language for human context. History,
  //  dialect situation, institutional resources, representative sayings.
  //  This section is for understanding, not computation.
  // ═══════════════════════════════════════════════════════════════════════

  "encyclopedic":    null,       // General knowledge. When populated:
                                 // {
                                 //   "family": "Algic",             // Redundant with classification
                                 //                                  // but useful for human readers.
                                 //   "dialects": {
                                 //     "split": true,               // Is there significant variation?
                                 //     "classification": "Plains Cree (y-dialect)",
                                 //     "variants": ["crk", "cwd", "csw"]  // ISO codes of variants
                                 //   },
                                 //   "demographics": {
                                 //     "speakers": "Approx. 20,000 active speakers",
                                 //     "regions": ["Saskatchewan", "Alberta", "Manitoba"]
                                 //   },
                                 //   "history": "Plains Cree is the most widely spoken Algonquian
                                 //              language in western Canada...",
                                 //   "resources": {
                                 //     "wikipedia": "https://en.wikipedia.org/wiki/Plains_Cree",
                                 //     "foundations": [{ "name": "ALTLab", "url": "https://..." }],
                                 //     "dictionaries": [{ "name": "itwêwina", "url": "https://..." }]
                                 //   }
                                 // }

  "culturalAphorism": null,      // A representative saying, proverb, or teaching in the language.
                                 // When populated:
                                 // {
                                 //   "text": "ê-wîcêhtonaniwahk kâ-kî-isi-wâpahtamâhk ôma pimâtisiwin",
                                 //   "transliteration": null,       // Romanized form if non-Latin script.
                                 //   "translation": "Through helping each other we come to understand
                                 //                   this life",
                                 //   "literal": "By-helping-one-another we-have-come-to-see this life",
                                 //   "source": "Cree teaching, documented in nêhiyawêwin educational
                                 //              resources"
                                 // }
                                 // Choose sayings that reveal something about the language's
                                 // worldview or structure. Must be sourced.

  "varieties":      [],          // For macrolanguages or languages with significant dialectal
                                 // variation, the individual varieties with their own tool coverage.
                                 // Each entry:
                                 // {
                                 //   "name": "Cusco Quechua",
                                 //   "iso639_3": "quz",
                                 //   "region": "Cusco, Peru",
                                 //   "fstCoverage": true,
                                 //   "corpusCoverage": true,
                                 //   "nllbCoverage": false,
                                 //   "mutualIntelligibility": "Primary variety for this card",
                                 //   "notes": "SQUOIA FST was built for this variety."
                                 // }

  // ═══════════════════════════════════════════════════════════════════════
  //  § 9. DIGITAL RESOURCES & TOOLING
  //  What NLP tools, corpora, models, and datasets exist for this language?
  //  What translation APIs support it? What eval benchmarks are available?
  //
  //  This is Champollion's operational core — these fields determine what
  //  we can actually DO with this language.
  // ═══════════════════════════════════════════════════════════════════════

  "resources":      null,        // NLP resources available for this language. When populated:
                                 // {
                                 //   "fsts": [{                     // Finite-state transducers
                                 //     "name": "GiellaLT Plains Cree FST (lang-crk)",
                                 //     "url": "https://github.com/giellalt/lang-crk/releases",
                                 //     "type": "morphological-analyzer"
                                 //   }],
                                 //   "corpora": [{                  // Text corpora
                                 //     "name": "EDTeKLA Cree Language Textbook Corpus",
                                 //     "type": "parallel",          // "parallel", "monolingual"
                                 //     "pairs": ["en-crk"],
                                 //     "url": "https://...",
                                 //     "exposure": "open-web"       // "open-web", "restricted",
                                 //                                  // "holdout"
                                 //   }],
                                 //   "models": [{                   // Pre-trained models
                                 //     "name": "NLLB-200 (crk_Cans)",
                                 //     "url": "https://...",
                                 //     "type": "nmt"
                                 //   }],
                                 //   "tools": [],                   // Other NLP tools
                                 //   "wordlists": [{                // Standardized wordlists
                                 //     "name": "Lexibank",
                                 //     "conceptCount": 200,
                                 //     "source": "lexibank"
                                 //   }],
                                 //   "treebanks": [{                // Syntactic treebanks
                                 //     "name": "UD_Korean-GSD",
                                 //     "tokens": 80000,
                                 //     "source": "universal-dependencies-2.14"
                                 //   }]
                                 // }
                                 // IMPORTANT: Only actual NLP/digital resources belong here.
                                 // "This language has a WALS entry" is NOT a resource — that
                                 // goes in databaseCoverage.

  "databaseCoverage": null,      // Which typological/reference databases cover this language.
                                 // Separated from resources to avoid conflating "has a database
                                 // entry" with "has usable NLP tooling."
                                 // {
                                 //   "wals": true,
                                 //   "grambank": true,
                                 //   "phoible": true,
                                 //   "cldr": true,
                                 //   "lexibank": true,
                                 //   "commonVoice": true,
                                 //   "source": "derived"
                                 // }

  "corpusAvailability": null,    // What text/parallel corpora exist for NLP use?
                                 // {
                                 //   "bibleTranslation": {
                                 //     "textAvailable": true,
                                 //     "audioAvailable": true,
                                 //     "source": "bible-brain-api"
                                 //   },
                                 //   "opusCorpora": ["wikimedia", "ubuntu", "gnome"],
                                 //   "source": "multi-source"
                                 // }

  "keyboardSupport":  null,      // Input method / keyboard availability. When populated:
                                 // {
                                 //   "keymanKeyboards": 3,
                                 //       // Number of Keyman keyboards available.
                                 //   "cldrKeyboard": true,
                                 //       // CLDR has keyboard layout data.
                                 //   "source": "keyman-api + cldr"
                                 // }

  "methodSupport":  {            // REQUIRED. Which Champollion translation methods support this
                                 // language. Each method is an object with at minimum
                                 // { "supported": boolean }.
    "googleTranslate":     { "supported": false },
    "deepl":               { "supported": false },
    "microsoftTranslator": { "supported": false },
    "libreTranslate":      { "supported": false },
    "nllb":                { "supported": false },
                                 // When NLLB is supported, include the code:
                                 // { "supported": true, "code": "crk_Cans" }
    "llm":                 { "supported": true }
                                 // LLM is always true (quality varies by language).
                                 // Optional: "verifiedDate": "2026-06-07" for audit trail.
  },

  "metricModelSupport": null,   // Which MT evaluation models produce reliable scores.
                                // When populated:
                                // {
                                //   "xlmr": "high",          // "high", "medium", or "low"
                                //                            // XLM-R training representation tier.
                                //   "africomet": false        // true if AfriCOMET covers this language.
                                // }
                                // Drives automatic COMET model selection in metrics_comet.py.
                                // Auto-populated by enrich-metric-model-support.mjs.

  "metricPlugins":   null,      // Which per-language metric plugin packs are available.
                                // When populated:
                                // {
                                //   "formalityMarkers": true  // Formality marker resource file exists
                                //                             // at plugins/resources/formality/{code}.json
                                // }
                                // Each key corresponds to a resource pack in
                                // arena/mt_eval_harness/plugins/resources/{packName}/.
                                // To add a new metric pack for a language, create the resource
                                // file and set the flag here. No code changes required.

  "evalPack":       null,        // Evaluation dependency pack for language-specific metrics.
                                 // When populated, declares the Python dependencies and
                                 // post-install steps required by this language's eval standards.
                                 // The harness uses this for dependency gating: if deps are
                                 // missing, the harness warns the user and skips LYSS metrics
                                 // (rather than crashing).
                                 // When populated:
                                 // {
                                 //   "pythonDeps": {
                                 //     "pyhfst": "pyhfst>=1.4",    // PyPI package specs
                                 //     "requests": "requests>=2.28",
                                 //     "spacy": "spacy>=3.7"
                                 //   },
                                 //   "postInstall": [               // Commands to run after pip
                                 //     {
                                 //       "command": "spacy download en_core_web_md",
                                 //       "label": "spaCy English model (for LYSS-sem)"
                                 //     }
                                 //   ],
                                 //   "requiresFst": true,           // true if GiellaLT FST needed
                                 //   "description": "LYSS equivalence linter + FST validation"
                                 // }

  "evalMetrics":    null,        // Language-specific evaluation metrics (LYSS standards).
                                 // When populated, the harness dynamically imports these
                                 // MetricPlugin classes from eval_standards/<lang>/ and applies
                                 // them to every run targeting this language — regardless of
                                 // which method (contestant) is being evaluated.
                                 // Keyed by metric ID:
                                 // {
                                 //   "lyss-eq": {
                                 //     "module": "eval_standards.crk.metrics",
                                 //     "class": "CrkLinterMetric",
                                 //     "description": "LYSS deterministic variant-class linter"
                                 //   },
                                 //   "lyss-sem": {
                                 //     "module": "eval_standards.crk.metrics",
                                 //     "class": "CrkSemanticMetric",
                                 //     "description": "LYSS FST-based semantic validator",
                                 //     "dependencies": ["spacy>=3.7"],
                                 //     "spacy_models": ["en_core_web_md"]
                                 //   }
                                 // }
                                 // Architecture: eval standards are referees, not contestants.
                                 // They live in the harness (eval_standards/), not in method
                                 // plugins. This ensures all methods are scored equally.
                                 // Discovery: plugin_discovery.py reads this field via
                                 // language_cards.get_eval_metrics() and instantiates metrics
                                 // using importlib. Dependencies are checked against evalPack.

  "omt1600":        null,        // Meta's OMT-1600 (One Model for Translation) coverage assessment.
                                 // When populated:
                                 // {
                                 //   "covered": true,
                                 //   "tier": "R1",                  // Meta's resource tier
                                 //   "evalMetrics": ["chrF++", "BLASER-3"],
                                 //   "notes": "Plains Cree: no web-crawled bitext..."
                                 // }

  "evalDatasets":   [],          // Evaluation dataset IDs available for this language.
                                 // Example: ["flores-plus-devtest", "edtekla-dev-v1"].
                                 // Empty means no standardized eval set exists.

  "pipelineReadiness": null,     // Assessment of readiness for Champollion's translation pipeline.
                                 // When populated:
                                 // {
                                 //   "tier": "tier-2-feasible",
                                 //       // "watch-list"       — cataloged but no path to translation
                                 //       // "tier-3-cataloged" — basic metadata present
                                 //       // "tier-2-feasible"  — tools exist, pipeline possible
                                 //       // "tier-1-ready"     — pipeline operational
                                 //   "hasFST": true,
                                 //   "hasParallelCorpus": true,
                                 //   "hasEvalBenchmark": true,
                                 //   "blockers": ["Syllabics post-processing validation"],
                                 //   "notes": "FST-gated pipeline operational. EDTeKLA corpus..."
                                 // }

  // ═══════════════════════════════════════════════════════════════════════
  //  § 10. PROVENANCE & METADATA
  //  Where does this data come from? Who reviewed it? When was it
  //  generated? What's its overall quality level?
  //
  //  This section exists to make the card auditable. Every automated
  //  enrichment, every human review, every source consulted should
  //  leave a trace here.
  // ═══════════════════════════════════════════════════════════════════════

  "dataSources":   [],           // REQUIRED. Sources consulted for this card's data.
                                 // Can be a flat array (backwards-compatible):
                                 //   ["iso639-3-2024", "glottolog-5.3", "wikidata"]
                                 //
                                 // Or a structured per-field object (preferred for new cards):
                                 //   {
                                 //     "classification": ["glottolog-5.3"],
                                 //     "vitality": ["glottolog-aes-5.3", "unesco-atlas-2024"],
                                 //     "speakerEstimates": ["wikidata", "census-ca-2021"],
                                 //     "rules": ["cldr-48"],
                                 //     "methodSupport": ["google-translate-2026-06"]
                                 //   }

  "supportTier":   "cataloged",  // Auto-derived tier summarizing the card's depth:
                                 //   "cataloged"   — identity + classification only
                                 //   "emerging"    — + vitality + speakerEstimates
                                 //   "developing"  — + resources + methodSupport
                                 //   "supported"   — full research: registers, challenges, etc.

  "humanReviewed": null,         // null until a qualified human reviews the card. When populated:
                                 // {
                                 //   "reviewer": "Prof. Kenneth Jamandre",
                                 //   "affiliation": "University of the Philippines Diliman",
                                 //   "date": "2026-06-08",
                                 //   "scope": "full",             // "full", "partial", "vitality-only"
                                 //   "notes": "Verified speaker count, vitality assessment,
                                 //             and contact influences for Tagalog."
                                 // }

  "notes":         null,         // Free-text notes about this language or this card's data quality.
                                 // Example: "Low-resource language under active development.
                                 //           Translation pipeline uses FST-gated approach."

  "firstDocumented": null,       // Year of first known documentation. Negative for BCE.
                                 // Example: -1500 (Sanskrit, ~1500 BCE) or 1787 (first documented in 1787 CE).
                                 // Source: Glottolog CLDF.

  "lastDocumented":  null,       // Year of last known documentation (relevant for extinct languages).
                                 // Source: Glottolog CLDF.

  "_generated":    null          // Auto-populated by enrichment scripts. When populated:
                                 // {
                                 //   "by": "generate-all-cards.mjs",
                                 //   "at": "2026-06-07T12:34:56Z",
                                 //   "sources": ["iso639-3", "glottolog-5.3", "wikidata"],
                                 //   "completeness": "partial",
                                 //       // "partial"     — has identity + classification + coords
                                 //       // "substantial" — + vitality + speakerEstimates + script
                                 //       // "complete"    — all automatable fields populated
                                 //   "lastEnriched": "2026-06-07"
                                 // }
}
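
The evalPack comments in the template describe dependency gating: when a declared dependency is missing, the harness warns and skips LYSS metrics rather than crashing. The sketch below shows one way such a check could look. The function names `missing_eval_deps` and `should_run_lyss` are hypothetical, and the sketch assumes that each key in pythonDeps is also the importable module name; the real harness may resolve packages differently.

```python
import importlib.util


def missing_eval_deps(eval_pack):
    """Return the pip specs from evalPack.pythonDeps whose module cannot be found.

    Assumption: each key in pythonDeps is also the importable module name.
    """
    if not eval_pack:  # evalPack is null: nothing to gate on
        return []
    deps = eval_pack.get("pythonDeps") or {}
    return [
        spec
        for module, spec in deps.items()
        if importlib.util.find_spec(module) is None
    ]


def should_run_lyss(eval_pack):
    """Warn and skip (rather than crash) when dependencies are missing."""
    missing = missing_eval_deps(eval_pack)
    if missing:
        print("warning: skipping LYSS metrics, missing:", ", ".join(missing))
        return False
    return True
```

Returning the missing specs instead of raising keeps the decision to skip in one place and lets the warning name exactly what to install.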

Field reference

§ 1. Identity fields

Field	Type	Required	Automatable	Source
`code`	`string`	✅	✅	ISO 639-3 registry
`name`	`string`	✅	✅	ISO 639-3 registry
`nativeName`	`string \| null`	—	✅	Wikidata P1705
`alternateNames`	`string[]`	—	✅	Glottolog, Ethnologue
`iso639_3`	`string`	✅	✅	ISO 639-3 registry
`iso639_1`	`string \| null`	—	✅	ISO 639-1
`bcp47`	`string \| null`	—	Partial	IANA subtag registry
`aliases`	`string[]`	—	❌	Manual curation
`isoScope`	`string`	✅	✅	ISO 639-3 registry
`isoType`	`string`	✅	✅	ISO 639-3 registry
`macrolanguage`	`string \| null`	—	✅	ISO 639-3 macrolanguages.tab
`extends`	`string \| null`	—	❌	Manual curation

§ 2. Classification fields

Field	Type	Required	Automatable	Source
`glottocode`	`string \| null`	—	✅	Glottolog
`classification`	`object \| null`	—	✅	Glottolog
`isIsolate`	`boolean`	—	✅	Glottolog CLDF

§ 3. Geography fields

Field	Type	Required	Automatable	Source
`macroarea`	`string \| null`	—	✅	Glottolog CLDF
`coordinates`	`object \| null`	—	✅	Glottolog
`countries`	`string[]`	—	✅	Glottolog
`regions`	`object[]`	—	❌	Census, Ethnologue, manual
`arealContext`	`object \| null`	—	✅	Coordinates + linguistic area zones

§ 4. Script fields

Field	Type	Required	Automatable	Source
`script`	`string \| null`	—	✅	Wikidata P282
`scriptUnicodeName`	`string \| null`	—	✅	Derived from `script` via ISO 15924 → Unicode mapping
`scripts`	`object[]`	—	Partial	Wikidata, manual
`dir`	`string \| null`	—	✅	Derivable from script
`scriptConverter`	`string \| null`	—	❌	Manual
`orthographicStatus`	`object \| null`	—	Partial	Ethnologue, manual

§ 5. Demographics and vitality fields

Field	Type	Required	Automatable	Source
`speakerEstimates`	`object[]`	—	✅	Wikidata, Ethnologue, census
`vitality`	`object \| null`	—	✅	Glottolog AES, UNESCO

§ 5.5 Documentation and digital presence fields

Field	Type	Required	Automatable	Source
`documentationDepth`	`object \| null`	—	✅	Glottolog references
`digitalPresence`	`object \| null`	—	✅	Wikipedia, Common Voice, Tatoeba
`dialectCount`	`number \| null`	—	✅	Glottolog

§ 6. Formality, register, and gender fields

Field	Type	Required	Automatable	Source
`formality`	`object \| null`	—	❌	Linguistic research
`registers`	`object \| null`	—	❌	Linguistic research
`gender`	`object \| null`	—	❌	Linguistic research
`codeSwitching`	`object \| null`	—	❌	Linguistic research

§ 7. Linguistic profile fields

Field	Type	Required	Automatable	Source
`linguisticChallenges`	`object \| null`	—	❌	Linguistic research
`contactInfluences`	`object[]`	—	❌	Published linguistics
`rules`	`object \| null`	—	✅	CLDR
`typologicalProfile`	`object \| null`	—	✅	Grambank 1.0.3 (auto-populated by `enrich-grambank-typology.mjs`)
`phonologicalInventory`	`object \| null`	—	✅	PHOIBLE 2.0 (auto-populated by `enrich-phoible-phonemes.mjs`)

§ 8. Encyclopedic fields

Field	Type	Required	Automatable	Source
`encyclopedic`	`object \| null`	—	❌	Manual research
`culturalAphorism`	`object \| null`	—	❌	Community contribution
`varieties`	`object[]`	—	❌	Manual research

§ 9. Digital resources fields

Field	Type	Required	Automatable	Source
`resources`	`object \| null`	—	Partial	Manual + automated
`databaseCoverage`	`object \| null`	—	✅	Derived from enrichment
`corpusAvailability`	`object \| null`	—	✅	Bible Brain, OPUS, Lexibank
`keyboardSupport`	`object \| null`	—	✅	Keyman API, CLDR
`methodSupport`	`object`	✅	Partial	API verification
`metricModelSupport`	`object \| null`	—	✅	XLM-R paper, AfriCOMET paper
`metricPlugins`	`object \| null`	—	✅	Card enrichment; declares which metric plugin packs apply (e.g. `{ formalityMarkers: true }`)
`omt1600`	`object \| null`	—	✅	Meta assessment
`evalDatasets`	`string[]`	—	✅	Dataset registry
`pipelineReadiness`	`object \| null`	—	Partial	Derived + manual

resources.fsts[].install: an FST entry in the resources object may include an install sub-object with the following fields: repo, releaseTag, assetPattern, format, maturity, and optionally bundlePattern. This replaces the earlier hardcoded GIELLALT_FST_REGISTRY dict. See get_fst_install_info() in language_cards.py.
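
As an illustration of how a consumer might read the install sub-object, here is a minimal Python sketch. It is not the actual get_fst_install_info() implementation, whose signature is not shown in this document. The helper name `first_fst_install` is hypothetical, and treating the five non-optional fields as required is an assumption.

```python
# Fields the install sub-object is expected to carry; bundlePattern is optional.
INSTALL_FIELDS = ("repo", "releaseTag", "assetPattern", "format", "maturity")


def first_fst_install(card):
    """Return the first complete install sub-object in card["resources"]["fsts"].

    Returns None when resources is null, when there are no FST entries,
    or when no entry has an install object with every expected field.
    """
    resources = card.get("resources") or {}  # resources may be null
    for fst in resources.get("fsts") or []:
        install = fst.get("install")
        if install and all(install.get(field) for field in INSTALL_FIELDS):
            return install
    return None
```

Because the lookup reads only from the card, adding FST support for a new language becomes a data change rather than a code change, which is the point of retiring the hardcoded registry.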

§ 10. Provenance fields

Field	Type	Required	Automatable	Source
`dataSources`	`array \| object`	✅	✅	Auto + manual
`supportTier`	`string`	—	✅	Derived from card completeness
`humanReviewed`	`object \| null`	—	❌	Human reviewer
`notes`	`string \| null`	—	❌	Manual
`firstDocumented`	`number \| null`	—	✅	Glottolog CLDF
`lastDocumented`	`number \| null`	—	✅	Glottolog CLDF
`_generated`	`object \| null`	—	✅	Enrichment scripts

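Language code policy

Champollion uses ISO 639-3 as its canonical identifier. Codes from other standards are registered as aliases and resolved to the ISO 639-3 code at runtime.

Priority	Standard	Example	Field	Use
1 (canonical)	ISO 639-3	`crk`	`code`	Card filename, config key, API parameter
2 (alias)	ISO 639-1	`iu`	`aliases[]`	Accepted on the CLI, resolved to ISO 639-3
3 (alias)	BCP 47	`fil`	`aliases[]`	Accepted on the CLI, resolved to ISO 639-3
Reference	Glottocode	`plai1258`	`glottocode`	Classification only, not for runtime use

Resolution order: when a user supplies a code, the checks run in this sequence:

Direct match on card.code → found
Match in card.aliases[] → found, return the canonical card
Match on card.iso639_1 → found (fallback)
No match → error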
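
The resolution order above can be sketched in Python as follows. This is an illustration, not the runtime's resolveCode(); the function name `resolve_code`, the `cards` mapping from canonical code to card, and the choice of KeyError for the error case are assumptions.

```python
def resolve_code(user_code, cards):
    """Resolve a user-supplied code to a canonical ISO 639-3 card code.

    `cards` maps each canonical code to its card dict.
    """
    if user_code in cards:  # 1. direct match on card.code
        return user_code
    for code, card in cards.items():  # 2. match in card.aliases[]
        if user_code in (card.get("aliases") or []):
            return code
    for code, card in cards.items():  # 3. fallback: card.iso639_1
        if card.get("iso639_1") == user_code:
            return code
    raise KeyError(f"unknown language code: {user_code!r}")  # 4. not found
```

All aliases are checked before any iso639_1 fallback, so a curated alias always takes precedence over the fallback field.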

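Migration history: ISO 639-1 → ISO 639-3

Before v8, card filenames used ISO 639-1 codes where available (fr.json, de.json, ja.json). In the 639-3 migration, every card was renamed to its ISO 639-3 equivalent:

Before	After	Reason
`fr.json`	`fra.json`	639-3 is canonical
`de.json`	`deu.json`	639-3 is canonical
`zh.json`	`cmn.json`	Macrolanguage → default individual language
`ar.json`	`arb.json`	Macrolanguage → Modern Standard Arabic
`ms.json`	`zsm.json`	Macrolanguage → Standard Malay

What happened to the old codes?

The old 639-1 code is kept in card.iso639_1
The old 639-1 code is also listed in card.aliases[]
resolveCode("fr") returns "fra" at runtime, so existing usage stays backward compatible
Users can still write "fr" in their configuration; it is resolved transparently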

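What changed architecturally:

_deepMerge() now skips null values (they inherit from the parent)
_deepMerge() now respects identity fields (code, extends, aliases, and the other identity fields are never inherited)
formality.default is now derived from the register's isDefault: true flag
205 Grambank-derived cards received a structural formality.default fix
38 genus/family/macrolanguage cards provide inheritance targets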
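
To make the two _deepMerge() changes concrete, here is a minimal Python sketch of the merge rules described in the merge semantics table and the identity-field list earlier in this document. It is not the code in lib/registers.js. The name `deep_merge` and the decision to apply the identity rule only at the top level of the card are assumptions.

```python
IDENTITY_FIELDS = {"code", "extends", "_migration", "aliases", "iso639_1", "iso639_3"}


def deep_merge(parent, child, top_level=True):
    """Merge a parent card into a child card; the child wins unless its value is None."""
    merged = {}
    for key in parent.keys() | child.keys():
        child_val = child.get(key)
        if top_level and key in IDENTITY_FIELDS:
            merged[key] = child_val  # never inherited, even when None
        elif child_val is None:
            merged[key] = parent.get(key)  # null inherits from the parent
        elif isinstance(child_val, dict) and isinstance(parent.get(key), dict):
            merged[key] = deep_merge(parent[key], child_val, top_level=False)
        else:
            merged[key] = child_val  # scalars and arrays: child replaces parent
    return merged
```

With this shape, a child whose aliases is null keeps null instead of picking up the macrolanguage parent's aliases, while a null script or a null nested field still inherits the parent's value.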

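Edge cases

Sign languages

Sign languages (e.g. ase, American Sign Language) are legitimate languages with ISO 639-3 codes. They have geography and speaker counts, but:

script is usually null (no standard written form)
scripts may include "Sgnw" (SignWriting) when a notation system is in use
dir is null
linguisticChallenges should cover spatial grammar, classifiers, and similar features
gender.grammatical is usually false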

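Ancient and historical languages

Languages such as Latin (lat, isoType H) and Sanskrit (san, isoType H) are still used in specific contexts (liturgy, scholarship) but have no native speakers:

vitality may record "no native speakers" together with "trend": "stable" (not declining: the community that uses the language is small but stable)
speakerEstimates should note that these are L2 speakers, not L1
firstDocumented / lastDocumented place the language in time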

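Constructed languages

Esperanto (epo, isoType C), Lojban, and others:

classification may point to a "constructed" family or be null
contactInfluences reflects the source material (e.g. Esperanto borrows from Romance, Germanic, and Slavic languages)
vitality is unusual: the speaker community is growing, but there is no native-speaker homeland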

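Macrolanguages

Arabic (ara), Chinese (zho), Cree (cre), and Quechua (que) are macrolanguages that span multiple individual languages:

isoScope: "M"
varieties should list the individual languages with their ISO codes
methodSupport should reflect what the macrolanguage card supports (usually the standardized variety)
Individual varieties should also have their own cards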

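Languages without a standardized orthography

Many languages (especially those with oral traditions) have no standardized writing system, or have competing orthographies:

script is null
scripts is []
dir is null
notes should describe the orthographic situation
linguisticChallenges should note how this affects MT (e.g. no training data)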

Diglossia

Languages such as Arabic (MSA vs. dialects) or Guaraní (Jopará vs. pure Guaraní):

codeSwitching captures the mixed-variety situation
registers can provide presets for the different levels
varieties can list the diglossic pair

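Contact influence types

Type	Meaning	Example
`superstrate`	A dominant language imposed on a community	French → English (after 1066)
`substrate`	The native language influences the imposed language	Celtic → English
`adstrate`	Neighboring languages with mutual influence	Norse → English
`learned_borrowing`	Borrowing through education or scholarship	Latin → English
`lexical_borrowing`	Direct vocabulary borrowing through contact	Spanish → Filipino
`relexification`	Large-scale vocabulary replacement	Portuguese → Papiamentu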

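Contact influence depth

Depth	Meaning
`light`	A few loanwords, minimal structural influence
`moderate`	Substantial vocabulary in specific domains
`heavy`	Extensive vocabulary and some structural features
`structural`	Grammar, syntax, and phonology are affected
`defining`	Core identity shaped by contact (creoles, mixed languages)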

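Writing good register presets

A good preset prompt:

Names the formality feature explicitly (e.g. "해요체", "vous-form", "siz-form")
Describes the specific pronouns or verb forms to use
Gives context for when this register is appropriate
Mentions script considerations where relevant

Don't: do not put gender-inclusion guidance in a preset prompt. Gender guidance belongs in card.gender.inclusiveGuidance and is injected separately.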

❌ Bad:  "Standard Thai. Professional register."
✔ Good: "Professional Thai. Use คุณ (khun) for second person, เรา (rao)
         for first person when needed. Clear, concise phrasing
         appropriate for digital interfaces."

Preset naming conventions

Preset keys should be descriptive and written in lowercase with hyphens:

T-V languages: formal-vous, informal-tu, formal-Sie, casual-du
Speech levels: polite-haeyo, formal-hapsyo, casual-hae
Neutral: professional, neutral-professional
Code-switching: taglish-professional, pure-filipino

Enrichment procedure

Per-card processing order

When enriching a card, consult sources in the following order. Document every source you consulted, even if it returned no data.

ISO 639-3 registry → code, name, isoScope, isoType
ISO 639-3 macrolanguages.tab → macrolanguage
Glottolog languoid.csv → glottocode, classification, coordinates, countries
Glottolog CLDF → macroarea, isIsolate, firstDocumented, lastDocumented
Glottolog AES → vitality (endangerment status)
Wikidata SPARQL → nativeName, speakerEstimates, script, scripts, dir
CLDR → rules (typography, plurals, capitalization)
NLLB-200 / FLORES+ → methodSupport.nllb, evalDatasets
API verification → remaining methodSupport entries
ML model papers → metricModelSupport (XLM-R training data, AfriCOMET coverage). Script: node scripts/enrich-metric-model-support.mjs
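
A single enrichment step that follows the merge-only principle, and that records the source even when it returned nothing, might look like the Python sketch below. The function name `apply_enrichment` is hypothetical, and the sketch assumes the flat-array form of dataSources; cards that use the structured per-field object would need a different bookkeeping step.

```python
def apply_enrichment(card, source_id, values):
    """Fill empty fields from one source without overwriting populated values.

    Returns the list of fields that were actually filled.
    """
    filled = []
    for field, value in values.items():
        if value is None:
            continue  # the source had nothing for this field
        if card.get(field) in (None, []):  # only unknown or empty fields are filled
            card[field] = value
            filled.append(field)
    sources = card.setdefault("dataSources", [])
    if source_id not in sources:  # record the source even if nothing was filled
        sources.append(source_id)
    return filled
```

Running the sources in the listed order with a step like this means earlier, more authoritative sources fill a field first, and later sources can only add to what is still empty.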

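Conflict handling

When sources disagree:

Store both values, each with its provenance
Do not average them or pick one side
Note the discrepancy in the relevant note field
Prefer the most recent primary source only when a single value is needed for computation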
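
The rules above can be sketched in Python as follows. The entry shape (`count`, `source`, `year`) and both function names are hypothetical, since the speakerEstimates entry format is defined elsewhere in this document. The sketch also assumes every stored estimate comes from a primary source, so "most recent" reduces to comparing years.

```python
def add_estimate(card, count, source, year):
    """Append an estimate with its provenance; existing estimates are never replaced."""
    card.setdefault("speakerEstimates", []).append(
        {"count": count, "source": source, "year": year}
    )


def single_value_for_computation(estimates):
    """Pick one number only when a computation needs it: the most recent estimate."""
    if not estimates:
        return None
    return max(estimates, key=lambda entry: entry["year"])["count"]
```

The stored list keeps every disagreeing value, and the single-value helper is a read-only view, so choosing a number for a calculation never changes what the card records.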

Validation

After enrichment or a manual edit, run the linter:

node scripts/lint-language-cards.mjs              # all cards
node scripts/lint-language-cards.mjs --lang crk    # single card

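PR checklist

When submitting a new or modified language card:

File named <code>.json in shared/language-cards/
Every top-level field from the standard template is present
classification is populated from Glottolog (not hand-built)
dataSources lists every source consulted
methodSupport entries are verified against the actual API language lists
contactInfluences entries have a published source or citation_needed: true
linguisticChallenges has 3–6 MT-relevant challenges (if researched)
rules is populated from CLDR (where locale data exists)
The linter passes with no errors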
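
For a quick local check of the "every top-level field is present" item before running the linter, a sketch like the following could help. It is not a substitute for lint-language-cards.mjs and does not reproduce its rules. The function name `shape_problems` and the message strings are hypothetical, and `template` is assumed to be the standard template loaded as a dict.

```python
def shape_problems(card, template):
    """List uniform-shape violations: missing top-level keys and {} used instead of null."""
    problems = [
        f"missing field: {name}" for name in sorted(set(template) - set(card))
    ]
    problems += [
        f"empty object (use null): {name}"
        for name, value in sorted(card.items())
        if value == {}
    ]
    return problems
```

A field set to null or [] counts as present here; only an absent key or an empty object is reported, matching the uniform-shape principle stated at the top of this document.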

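Reference sources

Standard	Maintainer	How we use it
ISO 639-3	SIL International	Canonical language codes, macrolanguage relationships
Glottolog	Max Planck Institute	Classification, coordinates, AES endangerment
WALS	Max Planck Institute	Genus definitions, typological features
ISO 15924	Unicode/ISO	Script codes
CLDR	Unicode Consortium	Locale data, plural rules, typography
Wikidata	Wikimedia Foundation	Speaker counts, endonyms, script data
Ethnologue	SIL International	EGIDS, speaker estimates, DLS
UNESCO Atlas	UNESCO	Endangerment classification
Katig Collective	UP Diliman	Philippine language capsules

Note: for detailed per-source guidance, see the language card citation procedure.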
