Language Card Specification

Single source of truth. This document defines the canonical shape of every language card. Every card MUST contain all top-level fields listed here, even when the value is null or []. A card with a missing field is non-conformant. This uniformity is what lets automated tools, linters, enrichment scripts, and human reviewers rely on the card's structure.

Design Principles

Uniform shape. All 8,000+ cards have the same top-level fields. Unknown values are null, empty arrays are [], and empty objects are null (not {}). Code therefore never has to ask "does this field exist?", only "is it populated?"
Trace everything. Every factual claim is traceable to a named, versioned, primary source. A claim without a source cannot be verified. The dataSources field (and per-field source annotations in sub-objects) makes provenance explicit.
Preserve disagreements. When authorities disagree (Wikidata says 50,000 speakers, Ethnologue says 20,000), we store both with source attribution. We do not average, resolve, or pick sides. Users can weigh the nuance themselves.
Null means unknown, not inapplicable. If a field is null, it means "we have not found data for this yet." If a field genuinely does not apply (for example, grammatical gender for a sign language), the value should say so: { "grammatical": false, "inclusiveGuidance": "Not applicable: ASL has no grammatical gender." }
Merge only. Enrichment scripts add data and never overwrite. Manually curated values take priority over automated data.
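
The "uniform shape" and "merge only" principles can be sketched in a few lines of JavaScript. This is a minimal illustration, not the project's enrichment code: isPopulated and enrichMergeOnly are hypothetical helper names, and the sample card and patch values are invented.

```javascript
// Hypothetical helpers (names are assumptions, not part of the project).

// A field counts as populated when it is neither null/undefined nor an
// empty array. Because every card carries every top-level field, callers
// only ever ask "is it populated?", never "does it exist?".
function isPopulated(value) {
  if (value === null || value === undefined) return false;
  if (Array.isArray(value)) return value.length > 0;
  return true;
}

// Merge-only enrichment: a value from `patch` is copied only when the
// card's own value is unpopulated. Existing values, which may be
// hand-curated, are never overwritten. The input card is not mutated.
function enrichMergeOnly(card, patch) {
  const result = { ...card };
  for (const [key, value] of Object.entries(patch)) {
    if (!isPopulated(result[key]) && isPopulated(value)) {
      result[key] = value;
    }
  }
  return result;
}

// Invented sample data, for illustration only.
const card = { code: "xxx", nativeName: "Curated Name", glottocode: null, countries: [] };
const patch = { nativeName: "Automated Name", glottocode: "abcd1234", countries: ["CA"] };

const enriched = enrichMergeOnly(card, patch);
console.log(enriched);
```

In this sketch the curated nativeName survives, while the null glottocode and the empty countries array are filled from the patch.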

Three-Layer Architecture

Layer	Location	Purpose
Language cards	`shared/language-cards/<code>.json`	Per-language configuration: identity, classification, resources, everything
Genus cards	`shared/language-cards/genera/<genus>.json`	Shared runtime properties for related languages (curated, not auto-generated)
Language tree	`shared/language-cards/language-tree.json`	Full Glottolog hierarchy: reference data for the Lab UI and language discovery

Inheritance Model

When a card sets "extends": "family-dravidian", the runtime merges the parent card into the child using _deepMerge() (in lib/registers.js). This lets genus cards define shared registers, formality systems, and gender guidance that flow down to every member language, without duplicating data across hundreds of individual cards.

Merge Semantics

Child value	Behavior	Why
`null`	Inherit from parent	`null` means "I don't define this", so the parent's value flows through
Non-null	Override parent	The child's data is more specific, so it takes priority
Nested object	Recursive merge	Child fields override; parent fields are preserved
Array	Replace entirely	Arrays are not merged item by item; the child's array wins
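
The four rules in the table can be sketched as a small recursive function. This is an illustrative re-implementation written from the table alone; the real _deepMerge() in lib/registers.js may differ in its details. The name deepMergeSketch and the sample parent and child objects are assumptions made for this example.

```javascript
// Illustrative sketch of the merge table; NOT the project's _deepMerge().

function isPlainObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

function deepMergeSketch(parent, child) {
  const merged = { ...parent };
  for (const [key, childValue] of Object.entries(child)) {
    const parentValue = parent[key];
    if (childValue === null) {
      // Rule 1: null inherits. Keep the parent's value; if the parent has
      // no such key, keep the field present as null.
      if (!(key in parent)) merged[key] = null;
    } else if (isPlainObject(childValue) && isPlainObject(parentValue)) {
      // Rule 3: nested objects merge recursively.
      merged[key] = deepMergeSketch(parentValue, childValue);
    } else {
      // Rules 2 and 4: non-null scalars override, arrays replace entirely.
      merged[key] = childValue;
    }
  }
  return merged;
}

// Invented sample data, for illustration only.
const parent = {
  formality: { system: "T-V", default: "formal" },
  registers: { formal: { label: "Formal" } },
  countries: ["FR", "BE"],
  dir: "ltr",
};
const child = {
  formality: null,                                // inherit from parent
  registers: { informal: { label: "Informal" } }, // recursive merge
  countries: ["CA"],                              // replace the whole array
  dir: "rtl",                                     // override
};

const merged = deepMergeSketch(parent, child);
console.log(JSON.stringify(merged, null, 2));
```

Here merged keeps the parent's formality object, holds both the formal and informal registers, and takes countries and dir from the child.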

Identity Fields (Never Inherited)

Some fields belong to the card itself and must NEVER be inherited from a parent:

code, extends, _migration, aliases, iso639_1, iso639_3

Even if a parent card defines aliases: ["macro-code"], a child card will NOT inherit those aliases. These fields always hold the child's own values (including null if unset).

Why: Without this rule, every Cree language would inherit aliases: ["cre"] from the macrolanguage parent, turning each variety into an alias of the macrolanguage.

Example: How a Cree Card Is Resolved

┌───────────────────────┐
│  family-algic.json    │  formality: null, registers: null
│  (no registers)       │
└──────────┬────────────┘
           │ extends
┌──────────┴────────────┐
│  genus-cree.json      │  formality: { system: "obviative-animate", ... }
│  (sourced registers)  │  registers: { formal: {...}, informal: {...} }
└──────────┬────────────┘
           │ extends
┌──────────┴────────────┐
│  crk.json             │  code: "crk", extends: "genus-cree"
│  (Plains Cree)        │  formality: null → inherits from genus-cree
│                       │  registers: null → inherits from genus-cree
│                       │  script: "Cans"  → own value, no inheritance
│                       │  code: "crk"     → identity field, never inherited
└───────────────────────┘

At runtime, getLanguageCard("crk") returns a merged object with the registers from genus-cree + the properties of family-algic (if any) + crk's own identity and metadata.
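
The resolution of this chain, including the identity-field rule, can be sketched as follows. This is a simplified model under stated assumptions: the cards live in an in-memory object rather than in JSON files, their values are abbreviated and partly invented (in particular the aliases values, chosen to exercise the identity rule), and resolveCard and merge are hypothetical names, not the project's getLanguageCard() or _deepMerge().

```javascript
// Sketch only: an in-memory stand-in for the card files.
const IDENTITY_FIELDS = ["code", "extends", "_migration", "aliases", "iso639_1", "iso639_3"];

const cards = {
  "family-algic": {
    code: "family-algic", extends: null, aliases: [],
    formality: null, registers: null, script: null,
  },
  "genus-cree": {
    code: "genus-cree", extends: "family-algic",
    aliases: ["cre"], // invented here to show that aliases do not flow down
    formality: { system: "obviative-animate" },
    registers: { formal: { label: "Formal (Proximate)" }, informal: { label: "Informal" } },
    script: null,
  },
  crk: {
    code: "crk", iso639_3: "crk", extends: "genus-cree",
    aliases: null, // left unset on purpose
    formality: null, registers: null, script: "Cans",
  },
};

const isPlainObject = (v) => v !== null && typeof v === "object" && !Array.isArray(v);

// Compact merge: null inherits, objects merge recursively, the rest overrides.
function merge(parent, child) {
  const out = { ...parent };
  for (const [key, value] of Object.entries(child)) {
    if (value === null) {
      if (!(key in out)) out[key] = null;
      continue;
    }
    out[key] = isPlainObject(value) && isPlainObject(parent[key])
      ? merge(parent[key], value)
      : value;
  }
  return out;
}

// Walk the extends chain from the root down, then restore identity fields.
function resolveCard(code, seen = new Set()) {
  const card = cards[code];
  if (!card) throw new Error(`Unknown card: ${code}`);
  if (seen.has(code)) throw new Error(`Cycle in extends chain at ${code}`);
  seen.add(code);
  if (card.extends === null) return { ...card };

  const resolved = merge(resolveCard(card.extends, seen), card);
  // Identity fields are always the child's own values, including null.
  for (const field of IDENTITY_FIELDS) {
    resolved[field] = field in card ? card[field] : null;
  }
  return resolved;
}

const crk = resolveCard("crk");
console.log(crk);
```

In this sketch crk ends up with the formality and registers of genus-cree and its own script, while aliases stays null instead of picking up ["cre"] from the parent.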

Genus Card Template

Genus cards live in shared/language-cards/genera/ and define shared properties for a group of languages. They follow the same schema as regular cards, but with different conventions:

{
  // Identity — genus cards use a prefixed code, NOT an ISO 639-3 code
  "code": "genus-cree",           // "genus-", "family-", or "macrolanguage-" prefix
  "name": "Cree Languages",      // Human-readable group name
  "extends": "family-algic",     // Genus cards can extend family cards (chaining)

  // Formality — shared across the group, sourced from typological databases
  "formality": {
    "system": "obviative-animate",
    "description": "Cree languages use an obviative/proximate system...",
    "default": "formal",
    "source": "WALS 37A, 38A + Wolfart 1973"
  },

  // Registers — shared presets, if the group shares a formality system
  "registers": {
    "formal": {
      "label": "Formal (Proximate)",
      "description": "...",
      "prompt": "...",
      "isDefault": true
    },
    "informal": {
      "label": "Informal",
      "description": "...",
      "prompt": "..."
    }
  },

  // Gender — shared grammatical gender behavior
  "gender": {
    "grammatical": false,       // Cree doesn't have grammatical gender
    "inclusiveGuidance": null   //   so no inclusive guidance needed
  },

  // Everything else is null — individual cards provide their own
  // classification, geography, resources, etc.
  "classification": null,
  "methodSupport": null,
  // ...
}

Key rule: Genus cards must contain ONLY data that is genuinely shared across the whole group and sourced from authoritative references. If a formality system varies between members, it belongs in the individual cards, not in the genus card.
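
Some of the genus-card conventions shown in the template lend themselves to a mechanical check. The sketch below is one reading of those conventions, not the project's linter: checkGroupCard is a hypothetical name, and the three checks (code prefix, formality.default pointing at an existing register, formality carrying a source) are assumptions drawn from the template comments. Whether data is genuinely shared across the group still needs human review.

```javascript
// Hypothetical convention check for genus/family/macrolanguage cards.
const GROUP_PREFIXES = ["genus-", "family-", "macrolanguage-"];

function checkGroupCard(card) {
  const problems = [];

  // Group cards use a prefixed code, not an ISO 639-3 code.
  if (!GROUP_PREFIXES.some((prefix) => String(card.code).startsWith(prefix))) {
    problems.push(`code "${card.code}" lacks a group prefix`);
  }

  // formality.default is treated as a key into the registers object.
  if (card.formality && card.registers && !(card.formality.default in card.registers)) {
    problems.push(`formality.default "${card.formality.default}" is not a key of registers`);
  }

  // Shared formality data should name where it comes from.
  if (card.formality && !card.formality.source) {
    problems.push("formality has no source");
  }

  return problems;
}

// Abbreviated version of the template above.
const good = {
  code: "genus-cree",
  formality: { system: "obviative-animate", default: "formal", source: "WALS 37A, 38A + Wolfart 1973" },
  registers: { formal: { label: "Formal (Proximate)" }, informal: { label: "Informal" } },
};

// Invented counter-example that breaks all three checks.
const bad = {
  code: "crk",
  formality: { system: "obviative-animate", default: "polite" },
  registers: { formal: { label: "Formal (Proximate)" } },
};

const goodProblems = checkGroupCard(good);
const badProblems = checkGroupCard(bad);
console.log(goodProblems, badProblems);
```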

Canonical Template

Every card MUST have exactly this top-level shape. Sub-object schemas are documented in the Field Reference below.

{
  // ═══════════════════════════════════════════════════════════════════════
  //  § 1. IDENTITY
  //  Who is this language? What codes identify it?
  //  Sources: ISO 639-3 registry, ISO 639-1, BCP 47/IANA.
  // ═══════════════════════════════════════════════════════════════════════

  "code":          "xxx",       // REQUIRED. ISO 639-3 code. This IS the card ID and filename.
  "name":          "English Name",  // REQUIRED. English reference name from ISO 639-3 registry.
  "nativeName":    null,        // Endonym (name in the language itself). Source: Wikidata P1705.
                                // Examples: "nêhiyawêwin / ᓀᐦᐃᔭᐍᐏᐣ", "日本語", "Esperanto".
  "alternateNames": [],         // Other names this language is known by. Source: Glottolog, Ethnologue.
                                // Not aliases (those are code-level). These are name-level variants.
                                // Example: ["Qafar af", "Afaraf", "'Afar Af"] for Afar (aar).
  "iso639_3":      "xxx",      // REQUIRED. Three-letter ISO 639-3 code. Same as `code`.
  "iso639_1":      null,        // Two-letter ISO 639-1 code (e.g., "en", "fr"). null if none.
  "bcp47":         null,        // IETF BCP 47 tag. Often same as iso639_1. Can include subtags
                                // (e.g., "iu-Cans-CA"). null if unknown.
  "aliases":       [],          // Alternative code-level identifiers that resolve to this card.
                                // Example: ["fil"] for tgl (Tagalog), ["iu"] for iku (Inuktitut).
                                // Used by code resolution: user types "fil", system loads tgl.json.
  "isoScope":      "I",        // REQUIRED. ISO 639-3 scope:
                                //   "I" = Individual language
                                //   "M" = Macrolanguage (e.g., Chinese, Arabic, Cree)
                                //   "S" = Special (e.g., mis, mul, zxx)
  "isoType":       "L",        // REQUIRED. ISO 639-3 type:
                                //   "L" = Living    "E" = Extinct    "A" = Ancient
                                //   "H" = Historical    "C" = Constructed
  "macrolanguage": null,        // If this language is part of a macrolanguage, the macrolanguage
                                // ISO 639-3 code (e.g., "cre" for Plains Cree, "ara" for Arabic
                                // varieties). Source: ISO 639-3 macrolanguages.tab.
  "extends":       null,        // Genus card key if shared properties are inherited from a genus
                                // card (e.g., "genus-cree", "genus-eskimo-aleut").
                                // null for most languages.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 2. CLASSIFICATION
  //  Where does this language sit in the family tree?
  //  Source: Glottolog. NEVER hand-build classifications.
  // ═══════════════════════════════════════════════════════════════════════

  "glottocode":      null,      // Glottolog identifier (e.g., "plai1258", "stan1293").
                                // null if the language is not in Glottolog.
  "classification":  null,      // Genealogical classification from Glottolog. When populated:
                                // {
                                //   "family": "Algic",              // Top-level family. null for isolates.
                                //   "familyGlottocode": "algi1248", // Glottocode of the family.
                                //   "genus": "Plains Creeic",       // WALS-style genus.
                                //   "genusGlottocode": "plai1264",  // Glottocode of the genus.
                                //   "ancestry": ["Algic", "Algonquian-Blackfoot", "Algonquian",
                                //                "Cree-Montagnais-Naskapi", "Cree", "Plains Creeic"]
                                // }
                                // For isolates: family = language name, genus = language name,
                                // ancestry = [language name].
  "isIsolate":       false,     // true if a language isolate (no known genetic relatives).
                                // Source: Glottolog CLDF.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 3. GEOGRAPHY
  //  Where is this language spoken?
  //  Sources: Glottolog (coordinates, countries), census data, Ethnologue.
  // ═══════════════════════════════════════════════════════════════════════

  "macroarea":     null,        // Glottolog macroarea. One of: "Africa", "Australia",
                                // "Eurasia", "North America", "Papunesia", "South America".
                                // null if unknown. Source: Glottolog CLDF.
  "coordinates":   null,        // Representative geographic point. When populated:
                                // { "lat": 52.1, "lng": -106.6, "source": "glottolog-5.3" }
                                // This is a representative point, not a boundary.
  "countries":     [],          // ISO 3166-1 alpha-2 country codes where this language is spoken.
                                // Example: ["CA", "US"]. Source: Glottolog.
  "regions":       [],          // Detailed regional breakdown with admin codes & speaker estimates.
                                // Each entry:
                                // {
                                //   "country": "Canada",
                                //   "countryCode": "CA",
                                //   "officialStatus": "recognized",  // official, co-official,
                                //                                    // recognized, none
                                //   "region": "Saskatchewan, Alberta, Manitoba",
                                //   "speakerEstimate": "~20,000",
                                //   "coordinates": [-106.6, 52.1],   // [lng, lat]
                                //   "admin1Codes": ["CA-SK", "CA-AB", "CA-MB"]
                                // }

  "arealContext":  null,         // Linguistic area / Sprachbund membership. DISTINCT from
                                // contactInfluences (which is language-specific contact history).
                                // This field captures zone-level typological convergence patterns
                                // — i.e., what linguistic area the language exists within and
                                // what features are common across that area.
                                // {
                                //   "zone": "Mainland Southeast Asian Sprachbund",
                                //   "arealFeatures": "Tonal convergence, classifier systems,
                                //     topic-prominence, monosyllabicity trend.",
                                //   "typicalContacts": ["Classical Chinese", "Sanskrit/Pali"],
                                //   "source": "areal-linguistics (Enfield 2005)"
                                // }
                                // NOT the same as contactInfluences. A language can exist within
                                // a convergence area without having specific contact history with
                                // any particular language in that area.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 4. WRITING SYSTEMS
  //  How is this language written?
  //  Sources: Wikidata P282, ISO 15924, manual research.
  //  Note: Some languages have NO standardized orthography. Some have
  //  competing orthographies. Some use multiple scripts routinely (e.g.,
  //  Serbian: Cyrillic + Latin; Japanese: Kanji + Hiragana + Katakana).
  //  Sign languages may use notation systems (SignWriting, HamNoSys) or
  //  none at all.
  // ═══════════════════════════════════════════════════════════════════════

  "script":        null,        // Primary ISO 15924 script code (e.g., "Latn", "Cyrl", "Cans",
                                // "Jpan"). null if no written form or unknown.
  "scriptUnicodeName": null,    // Unicode script block name derived from the script field.
                                // e.g., "Latin", "Cyrillic", "Canadian_Aboriginal", "CJK".
                                // Used by code_switching metric plugin. Auto-populated by
                                // enrich-script-unicode-names.mjs. null if script is null.
  "scripts":       [],          // All writing systems with detail. Array of:
                                // {
                                //   "code": "Cans",
                                //   "name": "Unified Canadian Aboriginal Syllabics",
                                //   "primary": true
                                // }
                                // A language with multiple scripts has multiple entries.
                                // A language with no written form has [].
  "dir":           null,        // Writing direction: "ltr" (left-to-right) or "rtl" (right-to-left).
                                // null if no written form or unknown.
  "scriptConverter": null,      // Script converter key if we have a converter for this language
                                // (e.g., "crk" for SRO↔Syllabics). null for most languages.
  "orthographicStatus": null,   // Writing system standardization status. When populated:
                                // {
                                //   "status": "standardized",
                                //       // "standardized" — official/agreed orthography exists
                                //       // "competing"    — multiple orthographies in active use
                                //       // "emerging"     — orthography under development
                                //       // "none"         — primarily oral, no standard writing
                                //   "notes": "Uses SIL-developed Latin orthography since 1960s.",
                                //   "source": "ethnologue" // or "manual-curation"
                                // }
                                // Crucial for LRLs where orthographic variation directly impacts
                                // MT training data quality and evaluation consistency.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 5. DEMOGRAPHICS & VITALITY
  //  How many people speak this language? Is it endangered?
  //  Sources: Census, Ethnologue, UNESCO Atlas, Wikidata, Glottolog AES.
  //
  //  CRITICAL: Store ALL estimates separately with source attribution.
  //  Never average or "resolve" conflicting data. Speaker counts are
  //  politically contested for many languages. Present the evidence,
  //  let the reader assess.
  // ═══════════════════════════════════════════════════════════════════════

  "speakerEstimates": [],       // Array of speaker count estimates from different authorities.
                                // Each entry:
                                // {
                                //   "source": "wikidata",              // or "ethnologue-28",
                                //                                      // "census-ph-2020", etc.
                                //   "count": 20000,                    // Point estimate. null if range-only.
                                //   "date": "2026-06-07",              // When this data was retrieved.
                                //   "countRange": { "min": 15000, "max": 25000 },  // Optional range.
                                //   "note": "Wikidata has 2 estimates: 15,000 and 25,000"
                                // }
                                // Empty array means we have not yet found speaker count data.

  "vitality":      null,        // Endangerment / vitality assessment. When populated:
                                // {
                                //   "unescoStatus": "severely-endangered",
                                //       // Enum: "safe", "vulnerable", "definitely-endangered",
                                //       //       "severely-endangered", "critically-endangered",
                                //       //       "extinct"
                                //   "aesStatus": "shifting",
                                //       // Glottolog AES label (free text from AES data).
                                //   "egids": "6b",
                                //       // Ethnologue Expanded Graded Intergenerational Disruption
                                //       // Scale. Levels: 0 (international) to 10 (extinct).
                                //   "trend": "declining",
                                //       // Qualitative trend: "stable", "growing", "declining",
                                //       //                     "shifting", "moribund", "awakening"
                                //   "source": "glottolog-aes-5.3",
                                //   "notes": "Intergenerational transmission breaking down."
                                // }

  // ═══════════════════════════════════════════════════════════════════════
  //  § 5.5. DOCUMENTATION & DIGITAL PRESENCE
  //  How well-documented is this language? What digital footprint does it
  //  have? These fields answer the practical question: "What can I
  //  actually DO with this language?"
  //  Sources: Glottolog (references), Wikipedia, Common Voice, Tatoeba.
  // ═══════════════════════════════════════════════════════════════════════

  "documentationDepth": null,    // How well-documented is this language in the literature?
                                 // {
                                 //   "referenceCount": 42,
                                 //       // Number of published references in Glottolog.
                                 //   "med": "grammar",
                                 //       // Most Extensive Description type. One of:
                                 //       // "long_grammar", "grammar", "grammar_sketch",
                                 //       // "dictionary", "phonology", "text", "wordlist",
                                 //       // "comparative", "minimal", "unknown"
                                 //   "source": "glottolog-5.3"
                                 // }

  "digitalPresence":  null,      // Digital footprint across web platforms. When populated:
                                 // {
                                 //   "wikipedia": {
                                 //     "edition": true,      // Has its own Wikipedia edition?
                                 //     "articleCount": 75000, // Number of articles.
                                 //     "editionCode": "crk",  // Wikipedia subdomain code.
                                 //     "source": "wikimedia-api-2026"
                                 //   },
                                 //   "commonVoice": {
                                 //     "validatedHours": 12.5,
                                 //     "totalHours": 25.0,
                                 //     "speakers": 45,
                                 //     "sentences": 1200,
                                 //     "source": "common-voice-20.0"
                                 //   },
                                 //   "tatoeba": {
                                 //     "sentenceCount": 342,
                                 //     "source": "tatoeba-2026"
                                 //   }
                                 // }

  "dialectCount":     null,      // Number of recognized dialects in Glottolog.
                                 // Derived from child_dialect_count in languoid.csv.
                                 // Simple integer. null if 0 or unknown.
                                 // Source: glottolog-5.3.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 6. FORMALITY, REGISTERS & GENDER
  //  How does politeness work in this language? What translation registers
  //  do we offer? How should gender be handled?
  //
  //  This section drives Champollion's register-preset system — the
  //  mechanism by which users select formal/informal/professional tone.
  //  These fields require genuine linguistic research, not automation.
  // ═══════════════════════════════════════════════════════════════════════

  "formality":     null,        // Formality system description. When populated:
                                // {
                                //   "system": "T-V",
                                //       // One of: "T-V", "speech-levels", "keigo", "particles",
                                //       //         "register-levels", "register-and-code-switching",
                                //       //         "code-switching", "none"
                                //   "description": "French uses a vous/tu distinction...",
                                //   "default": "formal-vous"   // Key into the `registers` object.
                                // }

  "registers":     null,        // Translation register presets. When populated, keyed by preset ID:
                                // {
                                //   "formal-vous": {
                                //     "label": "Formal (vouvoiement)",
                                //     "description": "One sentence: when to use this preset.",
                                //     "prompt": "The actual LLM system prompt instruction that
                                //               steers translation tone. Must name specific
                                //               linguistic features (pronouns, verb forms, particles).",
                                //     "deeplFormality": "prefer_more"
                                //       // Only if methodSupport.deepl.formality is true.
                                //       // One of: "prefer_more", "prefer_less", "default".
                                //   }
                                // }

  "gender":        null,        // Grammatical gender and inclusive guidance. When populated:
                                // {
                                //   "grammatical": true,         // Does the language have gram. gender?
                                //   "inclusiveGuidance": "Use gender-neutral forms when possible.
                                //                        Prefer 'iel' (neologism) or rephrase to
                                //                        avoid gendered agreement."
                                // }
                                // For languages without grammatical gender (Turkish, Finnish):
                                // { "grammatical": false, "inclusiveGuidance": null }

  "codeSwitching":  null,       // Code-switching behavior (for languages where mixing with another
                                // language is the norm, not an error). When populated:
                                // {
                                //   "contactLanguage": "Spanish",
                                //   "contactIso639_3": "spa",
                                //   "mixedVarietyName": "Jopará",   // null if no named mixed variety
                                //   "prevalence": "dominant",       // "rare", "common", "dominant"
                                //   "morphologicalIntegration": true,
                                //   "pipelineStrategy": "hybrid-fst",
                                //   "notes": "Jopará IS the everyday language of most Paraguayans..."
                                // }

  // ═══════════════════════════════════════════════════════════════════════
  //  § 7. LINGUISTIC PROFILE
  //  What makes this language what it is? What are the specific challenges
  //  for machine translation? What rules govern its typography?
  //  What languages have shaped it through contact?
  //
  //  These fields require genuine linguistic expertise. For many languages
  //  (especially low-resource), this section will remain null until a
  //  qualified researcher or community member contributes.
  // ═══════════════════════════════════════════════════════════════════════

  "linguisticChallenges": null,  // MT-relevant challenges, keyed by challenge ID.
                                 // When populated:
                                 // {
                                 //   "polysynthesis": "Cree is highly polysynthetic. A single verb
                                 //                    can incorporate subject, object, tense...",
                                 //   "animacy": "Verb conjugation changes based on whether the
                                 //              subject/object is animate or inanimate...",
                                 //   "neologisms": "Avoid literal translations of modern software
                                 //                 concepts. Maintain Cree metaphorical logic..."
                                 // }
                                 // Aim for 3–6 challenges per language when researched.

  "contactInfluences": [],       // How other languages have shaped this one. Array of:
                                 // {
                                 //   "source": "English",
                                 //   "sourceIso639_3": "eng",       // null if proto-language/unknown
                                 //   "type": "superstrate",
                                 //       // Enum: "superstrate", "substrate", "adstrate",
                                 //       //       "learned_borrowing", "lexical_borrowing",
                                 //       //       "relexification"
                                 //   "domains": ["education", "government", "technology"],
                                 //   "depth": "deep",
                                 //       // Enum: "light", "moderate", "heavy", "structural",
                                 //       //       "defining"
                                 //   "period": "1870–present",
                                 //   "notes": "Residential school era and ongoing...",
                                 //   "citation_needed": false
                                 //       // true if no published academic source found.
                                 //       // See language-card-citation-procedure.md.
                                 // }

  "rules":          null,        // Typography, plural, and capitalization rules. When populated:
                                 // {
                                 //   "typography": {
                                 //     "quoteStart": "\u201c",
                                 //     "quoteEnd": "\u201d",
                                 //     "usesSpaces": true,        // false for CJK, Thai, Lao, Khmer
                                 //     "punctuationSpacing": {
                                 //       "doublePunctuation": "none"  // "thin-nbsp" for French
                                 //     }
                                 //   },
                                 //   "plurals": {
                                 //     "categories": ["one", "other"]
                                 //       // From CLDR. Possible values:
                                 //       // "zero", "one", "two", "few", "many", "other"
                                 //   },
                                 //   "capitalization": {
                                 //     "hasCase": true
                                 //       // true for Latin, Cyrillic, Greek, Armenian scripts.
                                 //       // false for CJK, Arabic, Devanagari, etc.
                                 //   }
                                 // }
                                 // Source: CLDR + ISO 15924 derivation.

  "typologicalProfile": null,   // Grambank typological features. When populated:
                                // {
                                //   "featuresDocumented": 195,
                                //   "featuresCoverage": 1,     // 0.0–1.0 fraction of features
                                //   "wordOrderDominant": "SVO",
                                //   "hasDefiniteArticle": true,
                                //   "hasIndefiniteArticle": true,
                                //   "hasGenderSystem": true,
                                //   "hasCaseMorphology": true,
                                //   "hasEvidentiality": false,
                                //   "hasToneSystem": false,
                                //   "source": "grambank-1.0.3"
                                // }
                                // Auto-populated by enrich-grambank-typology.mjs.

  "phonologicalInventory": null, // PHOIBLE phoneme inventory. When populated:
                                // {
                                //   "consonants": 24,
                                //   "vowels": 16,
                                //   "tones": 0,
                                //   "totalPhonemes": 40,
                                //   "isTonal": false,
                                //   "inventorySize": "moderately-large",
                                //       // Enum: "small", "moderately-small", "average",
                                //       //       "moderately-large", "large"
                                //   "source": "phoible-2.0"
                                // }
                                // Auto-populated by enrich-phoible-phonemes.mjs.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 8. ENCYCLOPEDIC
  //  General knowledge about the language for human context. History,
  //  dialect situation, institutional resources, representative sayings.
  //  This section is for understanding, not computation.
  // ═══════════════════════════════════════════════════════════════════════

  "encyclopedic":    null,       // General knowledge. When populated:
                                 // {
                                 //   "family": "Algic",             // Redundant with classification
                                 //                                  // but useful for human readers.
                                 //   "dialects": {
                                 //     "split": true,               // Is there significant variation?
                                 //     "classification": "Plains Cree (y-dialect)",
                                 //     "variants": ["crk", "cwd", "csw"]  // ISO codes of variants
                                 //   },
                                 //   "demographics": {
                                 //     "speakers": "Approx. 20,000 active speakers",
                                 //     "regions": ["Saskatchewan", "Alberta", "Manitoba"]
                                 //   },
                                 //   "history": "Plains Cree is the most widely spoken Algonquian
                                 //              language in western Canada...",
                                 //   "resources": {
                                 //     "wikipedia": "https://en.wikipedia.org/wiki/Plains_Cree",
                                 //     "foundations": [{ "name": "ALTLab", "url": "https://..." }],
                                 //     "dictionaries": [{ "name": "itwêwina", "url": "https://..." }]
                                 //   }
                                 // }

  "culturalAphorism": null,      // A representative saying, proverb, or teaching in the language.
                                 // When populated:
                                 // {
                                 //   "text": "ê-wîcêhtonaniwahk kâ-kî-isi-wâpahtamâhk ôma pimâtisiwin",
                                 //   "transliteration": null,       // Romanized form if non-Latin script.
                                 //   "translation": "Through helping each other we come to understand
                                 //                   this life",
                                 //   "literal": "By-helping-one-another we-have-come-to-see this life",
                                 //   "source": "Cree teaching, documented in nêhiyawêwin educational
                                 //              resources"
                                 // }
                                 // Choose sayings that reveal something about the language's
                                 // worldview or structure. Must be sourced.

  "varieties":      [],          // For macrolanguages or languages with significant dialectal
                                 // variation, the individual varieties with their own tool coverage.
                                 // Each entry:
                                 // {
                                 //   "name": "Cusco Quechua",
                                 //   "iso639_3": "quz",
                                 //   "region": "Cusco, Peru",
                                 //   "fstCoverage": true,
                                 //   "corpusCoverage": true,
                                 //   "nllbCoverage": false,
                                 //   "mutualIntelligibility": "Primary variety for this card",
                                 //   "notes": "SQUOIA FST was built for this variety."
                                 // }

  // ═══════════════════════════════════════════════════════════════════════
  //  § 9. DIGITAL RESOURCES & TOOLING
  //  What NLP tools, corpora, models, and datasets exist for this language?
  //  What translation APIs support it? What eval benchmarks are available?
  //
  //  This is Champollion's operational core — these fields determine what
  //  we can actually DO with this language.
  // ═══════════════════════════════════════════════════════════════════════

  "resources":      null,        // NLP resources available for this language. When populated:
                                 // {
                                 //   "fsts": [{                     // Finite-state transducers
                                 //     "name": "GiellaLT Plains Cree FST (lang-crk)",
                                 //     "url": "https://github.com/giellalt/lang-crk/releases",
                                 //     "type": "morphological-analyzer"
                                 //   }],
                                 //   "corpora": [{                  // Text corpora
                                 //     "name": "EDTeKLA Cree Language Textbook Corpus",
                                 //     "type": "parallel",          // "parallel", "monolingual"
                                 //     "pairs": ["en-crk"],
                                 //     "url": "https://...",
                                 //     "exposure": "open-web"       // "open-web", "restricted",
                                 //                                  // "holdout"
                                 //   }],
                                 //   "models": [{                   // Pre-trained models
                                 //     "name": "NLLB-200 (crk_Cans)",
                                 //     "url": "https://...",
                                 //     "type": "nmt"
                                 //   }],
                                 //   "tools": [],                   // Other NLP tools
                                 //   "wordlists": [{                // Standardized wordlists
                                 //     "name": "Lexibank",
                                 //     "conceptCount": 200,
                                 //     "source": "lexibank"
                                 //   }],
                                 //   "treebanks": [{                // Syntactic treebanks
                                 //     "name": "UD_Korean-GSD",
                                 //     "tokens": 80000,
                                 //     "source": "universal-dependencies-2.14"
                                 //   }]
                                 // }
                                 // IMPORTANT: Only actual NLP/digital resources belong here.
                                 // "This language has a WALS entry" is NOT a resource — that
                                 // goes in databaseCoverage.

  "databaseCoverage": null,      // Which typological/reference databases cover this language.
                                 // Separated from resources to avoid conflating "has a database
                                 // entry" with "has usable NLP tooling."
                                 // {
                                 //   "wals": true,
                                 //   "grambank": true,
                                 //   "phoible": true,
                                 //   "cldr": true,
                                 //   "lexibank": true,
                                 //   "commonVoice": true,
                                 //   "source": "derived"
                                 // }

  "corpusAvailability": null,    // What text/parallel corpora exist for NLP use?
                                 // {
                                 //   "bibleTranslation": {
                                 //     "textAvailable": true,
                                 //     "audioAvailable": true,
                                 //     "source": "bible-brain-api"
                                 //   },
                                 //   "opusCorpora": ["wikimedia", "ubuntu", "gnome"],
                                 //   "source": "multi-source"
                                 // }

  "keyboardSupport":  null,      // Input method / keyboard availability. When populated:
                                 // {
                                 //   "keymanKeyboards": 3,
                                 //       // Number of Keyman keyboards available.
                                 //   "cldrKeyboard": true,
                                 //       // CLDR has keyboard layout data.
                                 //   "source": "keyman-api + cldr"
                                 // }

  "methodSupport":  {            // REQUIRED. Which Champollion translation methods support this
                                 // language. Each method is an object with at minimum
                                 // { "supported": boolean }.
    "googleTranslate":     { "supported": false },
    "deepl":               { "supported": false },
    "microsoftTranslator": { "supported": false },
    "libreTranslate":      { "supported": false },
    "nllb":                { "supported": false },
                                 // When NLLB is supported, include the code:
                                 // { "supported": true, "code": "crk_Cans" }
    "llm":                 { "supported": true }
                                 // LLM is always true (quality varies by language).
                                 // Optional: "verifiedDate": "2026-06-07" for audit trail.
  },

  "metricModelSupport": null,   // Which MT evaluation models produce reliable scores.
                                // When populated:
                                // {
                                //   "xlmr": "high",          // "high", "medium", or "low"
                                //                            // XLM-R training representation tier.
                                //   "africomet": false        // true if AfriCOMET covers this language.
                                // }
                                // Drives automatic COMET model selection in metrics_comet.py.
                                // Auto-populated by enrich-metric-model-support.mjs.

  "metricPlugins":   null,      // Which per-language metric plugin packs are available.
                                // When populated:
                                // {
                                //   "formalityMarkers": true  // Formality marker resource file exists
                                //                             // at plugins/resources/formality/{code}.json
                                // }
                                // Each key corresponds to a resource pack in
                                // arena/mt_eval_harness/plugins/resources/{packName}/.
                                // To add a new metric pack for a language, create the resource
                                // file and set the flag here. No code changes required.

  "evalPack":       null,        // Evaluation dependency pack for language-specific metrics.
                                 // When populated, declares the Python dependencies and
                                 // post-install steps required by this language's eval standards.
                                 // The harness uses this for dependency gating: if deps are
                                 // missing, the harness warns the user and skips LYSS metrics
                                 // (rather than crashing).
                                 // When populated:
                                 // {
                                 //   "pythonDeps": {
                                 //     "pyhfst": "pyhfst>=1.4",    // PyPI package specs
                                 //     "requests": "requests>=2.28",
                                 //     "spacy": "spacy>=3.7"
                                 //   },
                                 //   "postInstall": [               // Commands to run after pip
                                 //     {
                                 //       "command": "spacy download en_core_web_md",
                                 //       "label": "spaCy English model (for LYSS-sem)"
                                 //     }
                                 //   ],
                                 //   "requiresFst": true,           // true if GiellaLT FST needed
                                 //   "description": "LYSS equivalence linter + FST validation"
                                 // }

  "evalMetrics":    null,        // Language-specific evaluation metrics (LYSS standards).
                                 // When populated, the harness dynamically imports these
                                 // MetricPlugin classes from eval_standards/<lang>/ and applies
                                 // them to every run targeting this language — regardless of
                                 // which method (contestant) is being evaluated.
                                 // Keyed by metric ID:
                                 // {
                                 //   "lyss-eq": {
                                 //     "module": "eval_standards.crk.metrics",
                                 //     "class": "CrkLinterMetric",
                                 //     "description": "LYSS deterministic variant-class linter"
                                 //   },
                                 //   "lyss-sem": {
                                 //     "module": "eval_standards.crk.metrics",
                                 //     "class": "CrkSemanticMetric",
                                 //     "description": "LYSS FST-based semantic validator",
                                 //     "dependencies": ["spacy>=3.7"],
                                 //     "spacy_models": ["en_core_web_md"]
                                 //   }
                                 // }
                                 // Architecture: eval standards are referees, not contestants.
                                 // They live in the harness (eval_standards/), not in method
                                 // plugins. This ensures all methods are scored equally.
                                 // Discovery: plugin_discovery.py reads this field via
                                 // language_cards.get_eval_metrics() and instantiates metrics
                                 // using importlib. Dependencies are checked against evalPack.

  "omt1600":        null,        // Meta's OMT-1600 (One Model for Translation) coverage assessment.
                                 // When populated:
                                 // {
                                 //   "covered": true,
                                 //   "tier": "R1",                  // Meta's resource tier
                                 //   "evalMetrics": ["chrF++", "BLASER-3"],
                                 //   "notes": "Plains Cree: no web-crawled bitext..."
                                 // }

  "evalDatasets":   [],          // Evaluation dataset IDs available for this language.
                                 // Example: ["flores-plus-devtest", "edtekla-dev-v1"].
                                 // Empty means no standardized eval set exists.

  "pipelineReadiness": null,     // Assessment of readiness for Champollion's translation pipeline.
                                 // When populated:
                                 // {
                                 //   "tier": "tier-2-feasible",
                                 //       // "watch-list"       — cataloged but no path to translation
                                 //       // "tier-3-cataloged" — basic metadata present
                                 //       // "tier-2-feasible"  — tools exist, pipeline possible
                                 //       // "tier-1-ready"     — pipeline operational
                                 //   "hasFST": true,
                                 //   "hasParallelCorpus": true,
                                 //   "hasEvalBenchmark": true,
                                 //   "blockers": ["Syllabics post-processing validation"],
                                 //   "notes": "FST-gated pipeline operational. EDTeKLA corpus..."
                                 // }

  // ═══════════════════════════════════════════════════════════════════════
  //  § 10. PROVENANCE & METADATA
  //  Where does this data come from? Who reviewed it? When was it
  //  generated? What's its overall quality level?
  //
  //  This section exists to make the card auditable. Every automated
  //  enrichment, every human review, every source consulted should
  //  leave a trace here.
  // ═══════════════════════════════════════════════════════════════════════

  "dataSources":   [],           // REQUIRED. Sources consulted for this card's data.
                                 // Can be a flat array (backwards-compatible):
                                 //   ["iso639-3-2024", "glottolog-5.3", "wikidata"]
                                 //
                                 // Or a structured per-field object (preferred for new cards):
                                 //   {
                                 //     "classification": ["glottolog-5.3"],
                                 //     "vitality": ["glottolog-aes-5.3", "unesco-atlas-2024"],
                                 //     "speakerEstimates": ["wikidata", "census-ca-2021"],
                                 //     "rules": ["cldr-48"],
                                 //     "methodSupport": ["google-translate-2026-06"]
                                 //   }

  "supportTier":   "cataloged",  // Auto-derived tier summarizing the card's depth:
                                 //   "cataloged"   — identity + classification only
                                 //   "emerging"    — + vitality + speakerEstimates
                                 //   "developing"  — + resources + methodSupport
                                 //   "supported"   — full research: registers, challenges, etc.

  "humanReviewed": null,         // null until a qualified human reviews the card. When populated:
                                 // {
                                 //   "reviewer": "Prof. Kenneth Jamandre",
                                 //   "affiliation": "University of the Philippines Diliman",
                                 //   "date": "2026-06-08",
                                 //   "scope": "full",             // "full", "partial", "vitality-only"
                                 //   "notes": "Verified speaker count, vitality assessment,
                                 //             and contact influences for Tagalog."
                                 // }

  "notes":         null,         // Free-text notes about this language or this card's data quality.
                                 // Example: "Low-resource language under active development.
                                 //           Translation pipeline uses FST-gated approach."

  "firstDocumented": null,       // Year of first known documentation. Negative for BCE.
                                 // Example: -1500 (Sanskrit, ~1500 BCE), 1787 (some languages).
                                 // Source: Glottolog CLDF.

  "lastDocumented":  null,       // Year of last known documentation (relevant for extinct languages).
                                 // Source: Glottolog CLDF.

  "_generated":    null          // Auto-populated by enrichment scripts. When populated:
                                 // {
                                 //   "by": "generate-all-cards.mjs",
                                 //   "at": "2026-06-07T12:34:56Z",
                                 //   "sources": ["iso639-3", "glottolog-5.3", "wikidata"],
                                 //   "completeness": "partial",
                                 //       // "partial"     — has identity + classification + coords
                                 //       // "substantial" — + vitality + speakerEstimates + script
                                 //       // "complete"    — all automatable fields populated
                                 //   "lastEnriched": "2026-06-07"
                                 // }
}
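
The dependency gating that the evalPack comment describes (warn and skip LYSS metrics when dependencies are missing, rather than crash) could look roughly like the sketch below. This is an illustration under stated assumptions, not the harness's actual code: the function names `missing_eval_deps` and `should_run_lyss` are hypothetical, and the sketch assumes each key under `pythonDeps` is also the importable module name.

```python
import importlib.util
import warnings


def missing_eval_deps(card: dict) -> list[str]:
    """Return the pip specs from evalPack.pythonDeps whose module is not installed.

    Assumption: each pythonDeps key is the importable module name and
    each value is the pip spec to show to the user.
    """
    eval_pack = card.get("evalPack")
    if not eval_pack:  # evalPack is null: nothing to gate on
        return []
    missing = []
    for module_name, pip_spec in (eval_pack.get("pythonDeps") or {}).items():
        # find_spec returns None when a top-level module cannot be located
        if importlib.util.find_spec(module_name) is None:
            missing.append(pip_spec)
    return missing


def should_run_lyss(card: dict) -> bool:
    """Warn and skip LYSS metrics when dependencies are missing."""
    missing = missing_eval_deps(card)
    if missing:
        warnings.warn("Skipping LYSS metrics; missing: " + ", ".join(missing))
        return False
    return True
```

A card whose evalPack is null passes the gate trivially, which matches the uniform-shape rule: the check is "is it populated?", never "does the field exist?".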

Field Reference

§ 1. Identity Fields

Field	Type	Required	Automatable	Source
`code`	`string`	✅	✅	ISO 639-3 registry
`name`	`string`	✅	✅	ISO 639-3 registry
`nativeName`	`string | null`	—	✅	Wikidata P1705
`alternateNames`	`string[]`	—	✅	Glottolog, Ethnologue
`iso639_3`	`string`	✅	✅	ISO 639-3 registry
`iso639_1`	`string | null`	—	✅	ISO 639-1
`bcp47`	`string | null`	—	Partial	IANA subtag registry
`aliases`	`string[]`	—	❌	Manual curation
`isoScope`	`string`	✅	✅	ISO 639-3 registry
`isoType`	`string`	✅	✅	ISO 639-3 registry
`macrolanguage`	`string | null`	—	✅	ISO 639-3 macrolanguages.tab
`extends`	`string | null`	—	❌	Manual curation

§ 2. Classification Fields

Field	Type	Required	Automatable	Source
`glottocode`	`string | null`	—	✅	Glottolog
`classification`	`object | null`	—	✅	Glottolog
`isIsolate`	`boolean`	—	✅	Glottolog CLDF

§ 3. Geography Fields

Field	Type	Required	Automatable	Source
`macroarea`	`string | null`	—	✅	Glottolog CLDF
`coordinates`	`object | null`	—	✅	Glottolog
`countries`	`string[]`	—	✅	Glottolog
`regions`	`object[]`	—	❌	Census, Ethnologue, manual
`arealContext`	`object | null`	—	✅	Coordinates + linguistic-area zones

§ 4. Writing System Fields

Field	Type	Required	Automatable	Source
`script`	`string | null`	—	✅	Wikidata P282
`scriptUnicodeName`	`string | null`	—	✅	Derived from `script` via ISO 15924 → Unicode mapping
`scripts`	`object[]`	—	Partial	Wikidata, manual
`dir`	`string | null`	—	✅	Derivable from the script
`scriptConverter`	`string | null`	—	❌	Manual
`orthographicStatus`	`object | null`	—	Partial	Ethnologue, manual

§ 5. Demographics and Vitality Fields

Field	Type	Required	Automatable	Source
`speakerEstimates`	`object[]`	—	✅	Wikidata, Ethnologue, census
`vitality`	`object | null`	—	✅	Glottolog AES, UNESCO

§ 5.5 Documentation and Digital Presence Fields

Field	Type	Required	Automatable	Source
`documentationDepth`	`object | null`	—	✅	Glottolog references
`digitalPresence`	`object | null`	—	✅	Wikipedia, Common Voice, Tatoeba
`dialectCount`	`number | null`	—	✅	Glottolog

§ 6. Formality, Register, and Gender Fields

Field	Type	Required	Automatable	Source
`formality`	`object | null`	—	❌	Linguistic research
`registers`	`object | null`	—	❌	Linguistic research
`gender`	`object | null`	—	❌	Linguistic research
`codeSwitching`	`object | null`	—	❌	Linguistic research

§ 7. Linguistic Profile Fields

Field	Type	Required	Automatable	Source
`linguisticChallenges`	`object | null`	—	❌	Linguistic research
`contactInfluences`	`object[]`	—	❌	Published linguistics
`rules`	`object | null`	—	✅	CLDR
`typologicalProfile`	`object | null`	—	✅	Grambank 1.0.3; auto-populated by `enrich-grambank-typology.mjs`
`phonologicalInventory`	`object | null`	—	✅	PHOIBLE 2.0; auto-populated by `enrich-phoible-phonemes.mjs`

§ 8. Encyclopedic Fields

Field	Type	Required	Automatable	Source
`encyclopedic`	`object | null`	—	❌	Manual research
`culturalAphorism`	`object | null`	—	❌	Community contribution
`varieties`	`object[]`	—	❌	Manual research

§ 9. Digital Resource Fields

Field	Type	Required	Automatable	Source
`resources`	`object | null`	—	Partial	Manual + automated
`databaseCoverage`	`object | null`	—	✅	Derived from enrichment
`corpusAvailability`	`object | null`	—	✅	Bible Brain, OPUS, Lexibank
`keyboardSupport`	`object | null`	—	✅	Keyman API, CLDR
`methodSupport`	`object`	✅	Partial	API verification
`metricModelSupport`	`object | null`	—	✅	XLM-R paper, AfriCOMET paper
`metricPlugins`	`object | null`	—	✅	Card enrichment; declares which metric plugin packs apply (for example, `{ formalityMarkers: true }`)
`omt1600`	`object | null`	—	✅	Meta's assessment
`evalDatasets`	`string[]`	—	✅	Dataset registry
`pipelineReadiness`	`object | null`	—	Partial	Derived + manual

resources.fsts[].install: FST entries in the resources object may include an install sub-object with the fields repo, releaseTag, assetPattern, format, maturity, and optionally bundlePattern. This replaces the old hard-coded GIELLALT_FST_REGISTRY dict. See get_fst_install_info() in language_cards.py.

§ 10. Provenance Fields

Field	Type	Required	Automatable	Source
`dataSources`	`array | object`	✅	✅	Auto + manual
`supportTier`	`string`	—	✅	Derived from card completeness
`humanReviewed`	`object | null`	—	❌	Human reviewer
`notes`	`string | null`	—	❌	Manual
`firstDocumented`	`number | null`	—	✅	Glottolog CLDF
`lastDocumented`	`number | null`	—	✅	Glottolog CLDF
`_generated`	`object | null`	—	✅	Enrichment scripts
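
The tier descriptions in the template suggest that supportTier is cumulative: each tier requires everything the previous one does plus more. The sketch below shows one way such a derivation might be written. It is an assumption, not the project's actual rule: the helper names are hypothetical, and because the template's "supported" tier ends in "etc.", only the two fields it names (registers and linguisticChallenges) are checked here.

```python
def is_populated(value) -> bool:
    """A field is populated when it is neither null nor an empty array."""
    return value is not None and value != []


def derive_support_tier(card: dict) -> str:
    """Derive supportTier cumulatively from which fields are populated."""
    def has(*fields: str) -> bool:
        return all(is_populated(card.get(f)) for f in fields)

    tier = "cataloged"  # identity + classification only
    if has("vitality", "speakerEstimates"):
        tier = "emerging"
        if has("resources", "methodSupport"):
            tier = "developing"
            if has("registers", "linguisticChallenges"):
                tier = "supported"
    return tier
```

Since methodSupport is required and always present, the "developing" step in this sketch is effectively decided by whether resources is populated.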

Language Code Policy

Champollion uses ISO 639-3 as the canonical identifier. Other standard codes are recorded as aliases and resolve to the ISO 639-3 code at runtime.

Priority	Standard	Example	Field	Use
1 (canonical)	ISO 639-3	`crk`	`code`	Card filename, config keys, API parameters
2 (alias)	ISO 639-1	`iu`	`aliases[]`	Accepted in the CLI, resolved to ISO 639-3
3 (alias)	BCP 47	`fil`	`aliases[]`	Accepted in the CLI, resolved to ISO 639-3
Reference	Glottocode	`plai1258`	`glottocode`	Classification only, not for runtime use

Resolution order: when a user supplies a code, the lookup proceeds as follows:

Direct match on card.code → found
Match in card.aliases[] → found, returns the canonical card
Match on card.iso639_1 → found (fallback)
Not found → error
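
The resolution order above can be sketched as a small lookup function. This is a Python stand-in for illustration only; the document's own resolver is resolveCode(), and the `cards` index (canonical code mapped to card dict) and exact, case-sensitive matching are assumptions made here.

```python
def resolve_code(user_code: str, cards: dict[str, dict]) -> str:
    """Resolve a user-supplied code to a canonical ISO 639-3 card code.

    `cards` is a hypothetical in-memory index: canonical code -> card dict.
    """
    # 1. Direct match on card.code
    if user_code in cards:
        return user_code
    # 2. Match in card.aliases[]
    for code, card in cards.items():
        if user_code in (card.get("aliases") or []):
            return code
    # 3. Fallback: match on card.iso639_1
    for code, card in cards.items():
        if card.get("iso639_1") == user_code:
            return code
    # 4. Not found
    raise KeyError(f"Unknown language code: {user_code!r}")


# Illustrative cards, trimmed to the fields the resolver reads.
cards = {
    "fra": {"code": "fra", "iso639_1": "fr", "aliases": ["fr"]},
    "crk": {"code": "crk", "iso639_1": None, "aliases": []},
}
canonical = resolve_code("fr", cards)  # resolves to the canonical code "fra"
```

Checking card.code first means a canonical code can never be shadowed by another card's alias.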

Migration History: ISO 639-1 → ISO 639-3

Before v8, card filenames used ISO 639-1 codes when available (fr.json, de.json, ja.json). In the 639-3 migration, all cards were renamed to their ISO 639-3 equivalents:

Before	After	Why
`fr.json`	`fra.json`	639-3 is canonical
`de.json`	`deu.json`	639-3 is canonical
`zh.json`	`cmn.json`	Macrolanguage → default individual language
`ar.json`	`arb.json`	Macrolanguage → Modern Standard Arabic
`ms.json`	`zsm.json`	Macrolanguage → Standard Malay

What happened to the old codes?

The old 639-1 code is stored in card.iso639_1
The old 639-1 code is also listed in card.aliases[]
resolveCode("fr") returns "fra" at runtime, so existing callers keep working
Users can still write "fr" in their config; it resolves transparently

What changed architecturally:

_deepMerge() now skips null values (the value is inherited from the parent)
_deepMerge() now has a defined set of identity fields (code, extends, and aliases are never inherited)
formality.default is now derived from the register flags isDefault: true
205 Grambank-derived cards received a structural fix to formality.default
38 genus/family/macrolanguage cards provide inheritance targets
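
The two _deepMerge() rules listed above (null values inherit from the parent; identity fields are never inherited) can be illustrated with a minimal sketch. This is not the project's implementation: `deep_merge` is a hypothetical Python stand-in, and its handling of arrays (a non-null child value, including [], replaces the parent's) is an assumption.

```python
IDENTITY_FIELDS = {"code", "extends", "aliases"}  # never inherited


def deep_merge(parent: dict, child: dict, top_level: bool = True) -> dict:
    """Merge a child card over its parent; null child values inherit."""
    # Start from the parent, dropping identity fields at the top level.
    merged = {k: v for k, v in parent.items()
              if not (top_level and k in IDENTITY_FIELDS)}
    for key, value in child.items():
        if top_level and key in IDENTITY_FIELDS:
            merged[key] = value            # identity always comes from the child
        elif value is None:
            merged.setdefault(key, None)   # null: keep the parent's value if any
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value, top_level=False)
        else:
            merged[key] = value            # assumption: arrays/scalars replace
    return merged


parent = {"code": "cre", "extends": None, "aliases": ["cr"], "dir": "ltr",
          "formality": {"default": "neutral", "levels": 2}}
child = {"code": "crk", "extends": "cre", "aliases": [], "dir": None,
         "formality": {"default": None, "levels": 3}}
resolved = deep_merge(parent, child)
```

In this example the resolved card keeps the child's own code, extends, and aliases, inherits dir and formality.default from the parent, and takes formality.levels from the child.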

Edge Cases

Sign Languages

Sign languages (for example, ase, American Sign Language) are legitimate languages with ISO 639-3 codes. They have geography and speaker counts, but:

script is typically null (no standard written form)
scripts may include "Sgnw" (SignWriting) if a notation system is in use
dir is null
linguisticChallenges should address spatial grammar, classifiers, etc.
gender.grammatical is typically false

Ancient and Historical Languages

Languages such as Latin (lat, isoType H) and Sanskrit (san, isoType H) are still used in specific contexts (liturgical, academic) but have no native speakers:

vitality may note "no native speakers" with "trend": "stable" (not declining: the community that uses the language is stable, just small)
speakerEstimates should note that these are L2 speakers, not L1
firstDocumented / lastDocumented place them in time

Constructed Languages

Esperanto (epo, isoType C), Lojban, etc.:

classification may point to a "constructed" family or be null
contactInfluences reflects the source material (for example, Esperanto draws on Romance, Germanic, and Slavic)
vitality is unusual: a growing speaker community, but no native homeland

Macrolanguages

Arabic (ara), Chinese (zho), Cree (cre), and Quechua (que) are macrolanguages that span multiple individual languages:

isoScope: "M"
varieties should list the individual languages with their ISO codes
methodSupport should reflect what the macrolanguage card supports (usually the standardized variety)
Individual varieties should also have their own cards

Languages Without a Standardized Orthography

Many languages (especially those with an oral tradition) have no standardized writing system, or have competing orthographies:

script is null
scripts is []
dir is null
notes should explain the orthographic situation
linguisticChallenges should note how this affects MT (for example, no training data)

Diglossia

Languages such as Arabic (MSA vs. dialects) or Guarani (Jopará vs. pure Guarani):

codeSwitching captures the mixed-variety situation
registers can offer presets for the different levels
varieties can list the diglossic pair

Contact Influence Types

Type	Meaning	Example
`superstrate`	Dominant language imposed on a community	French → English (post-1066)
`substrate`	Native language influencing an imposed language	Celtic → English
`adstrate`	Neighboring language with mutual influence	Norse → English
`learned_borrowing`	Borrowings through education/scholarship	Latin → English
`lexical_borrowing`	Direct vocabulary borrowing through contact	Spanish → Filipino
`relexification`	Wholesale vocabulary replacement	Portuguese → Papiamentu

Contact Influence Depths

Depth	Meaning
`light`	A few loanwords, minimal structural impact
`moderate`	Significant vocabulary in specific domains
`heavy`	Pervasive vocabulary and some structural features
`structural`	Grammar, syntax, and phonology affected
`defining`	Core identity shaped by contact (creoles, mixed languages)

Writing Good Register Presets

Good prompt presets:

Explicitly name the formality feature (for example, "해요체", "vous-form", "siz-form")
Explain the specific pronoun or verb form to use
Give context for when this register is appropriate
Mention script considerations if applicable

Do not put gender-inclusive guidance in the preset prompt. Gender guidance belongs in card.gender.inclusiveGuidance, which is injected separately.

❌ Bad:  "Standard Thai. Professional register."
✔ Good: "Professional Thai. Use คุณ (khun) for second person, เรา (rao)
         for first person when needed. Clear, concise phrasing
         appropriate for digital interfaces."

Preset Naming Convention

Preset keys should be descriptive, lowercase, and hyphen-separated:

T-V languages: formal-vous, informal-tu, formal-Sie, casual-du
Speech levels: polite-haeyo, formal-hapsyo, casual-hae
Neutral: professional, neutral-professional
Code-switching: taglish-professional, pure-filipino

Enrichment Procedure

Per-Card Processing Order

When enriching a card, consult the sources in this order. Record every source consulted, even if it returned no data.

ISO 639-3 registry → code, name, isoScope, isoType
ISO 639-3 macrolanguages.tab → macrolanguage
Glottolog languoid.csv → glottocode, classification, coordinates, countries
Glottolog CLDF → macroarea, isIsolate, firstDocumented, lastDocumented
Glottolog AES → vitality (endangerment status)
Wikidata SPARQL → nativeName, speakerEstimates, script, scripts, dir
CLDR → rules (typography, plurals, capitalization)
NLLB-200 / FLORES+ → methodSupport.nllb, evalDatasets
API verification → remaining methodSupport entries
ML model papers → metricModelSupport (XLM-R training data, AfriCOMET coverage). Script: node scripts/enrich-metric-model-support.mjs
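
The ordering above, together with the merge-only principle and the rule that every consulted source is recorded, can be sketched as a simple loop. This is a hedged illustration, not one of the project's enrichment scripts: the source IDs, the `fetchers` mapping, and the `enrich` function are hypothetical, only the first seven steps (those that fill top-level fields) are modeled, and dataSources is assumed to be in its flat-array form.

```python
from typing import Callable

# (hypothetical source ID, top-level fields it may fill), in consultation order
ENRICHMENT_ORDER = [
    ("iso639-3", ["code", "name", "isoScope", "isoType"]),
    ("iso639-3-macrolanguages", ["macrolanguage"]),
    ("glottolog-languoid", ["glottocode", "classification", "coordinates", "countries"]),
    ("glottolog-cldf", ["macroarea", "isIsolate", "firstDocumented", "lastDocumented"]),
    ("glottolog-aes", ["vitality"]),
    ("wikidata", ["nativeName", "speakerEstimates", "script", "scripts", "dir"]),
    ("cldr", ["rules"]),
]


def enrich(card: dict, fetchers: dict[str, Callable[[str], dict]]) -> dict:
    """Consult sources in order, filling only empty fields."""
    for source_id, fields in ENRICHMENT_ORDER:
        fetch = fetchers.get(source_id)
        if fetch is None:
            continue  # source not consulted, so it is not recorded
        data = fetch(card["code"])
        for field in fields:
            is_empty = card.get(field) is None or card.get(field) == []
            if is_empty and data.get(field) is not None:
                card[field] = data[field]  # merge only: never overwrite
        if source_id not in card["dataSources"]:
            card["dataSources"].append(source_id)  # recorded even if no data
    return card
```

Because only empty fields are filled, a manually curated value that is already present keeps priority over whatever an automated source returns.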

Conflict Handling

When sources disagree:

Store both with source attribution
Do NOT average or pick sides
Note the discrepancy in the relevant note field
Prefer the most recent primary source only when a single value is needed for computation
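
The rules above can be illustrated with the speaker-count disagreement used earlier in this document (Wikidata 50,000 versus Ethnologue 20,000). This is a hedged sketch: the entry shape `{ count, source, year, note }`, the years, and both function names are assumptions, and recency is approximated here by the `year` field alone.

```javascript
// Hypothetical entry shape: { count, source, year, note }. Years are placeholders.
function addEstimate(estimates, incoming) {
  // Keep every attributed value; skip only exact repeats from the same source.
  const duplicate = estimates.some(
    (e) => e.source === incoming.source && e.count === incoming.count && e.year === incoming.year
  );
  return duplicate ? estimates : [...estimates, incoming];
}

// Used only when a computation needs one number. The stored list is not changed.
function singleValueForComputation(estimates) {
  if (estimates.length === 0) {
    return null;
  }
  const mostRecent = estimates.reduce((best, e) => (e.year > best.year ? e : best));
  return mostRecent.count;
}

let speakerEstimates = [];
speakerEstimates = addEstimate(speakerEstimates, {
  count: 50000,
  source: "Wikidata",
  year: 2015,
  note: "Differs from the Ethnologue figure.",
});
speakerEstimates = addEstimate(speakerEstimates, {
  count: 20000,
  source: "Ethnologue",
  year: 2020,
  note: "Differs from the Wikidata figure.",
});

console.log(speakerEstimates.length, singleValueForComputation(speakerEstimates));
```

Both estimates remain in the list with their attribution; the helper only selects a value at the point of computation and never writes it back.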

Validation

Run the linter after any enrichment or manual edit:

node scripts/lint-language-cards.mjs              # all cards
node scripts/lint-language-cards.mjs --lang crk    # single card

PR Checklist

When submitting a new or modified language card:

File named <code>.json in shared/language-cards/
All top-level fields from the canonical template are present
classification populated from Glottolog (not hand-built)
dataSources lists all sources consulted
methodSupport entries verified against actual API language lists
contactInfluences entries have published sources or citation_needed: true
linguisticChallenges has 3–6 MT-relevant challenges (if researched)
rules populated from CLDR (if locale data exists)
Linter passes with no errors
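
A few of the checklist items lend themselves to a quick local pre-check before running the linter. The sketch below is illustrative only and does not replace scripts/lint-language-cards.mjs: `precheckCard` is a hypothetical helper, `REQUIRED_FIELDS` is an abbreviated stand-in for the canonical template's full field list, and the `source` property on a contactInfluences entry is an assumed name.

```javascript
// Hypothetical pre-check for a few checklist items; the real linter is authoritative.
// REQUIRED_FIELDS is abbreviated. The canonical template defines the full list.
const REQUIRED_FIELDS = [
  "code",
  "classification",
  "dataSources",
  "methodSupport",
  "contactInfluences",
  "rules",
];

function precheckCard(fileName, card) {
  const problems = [];
  // File must be named <code>.json.
  if (fileName !== `${card.code}.json`) {
    problems.push(`file name ${fileName} does not match code ${card.code}`);
  }
  // Fields must be present even when their value is null or [].
  for (const field of REQUIRED_FIELDS) {
    if (!(field in card)) {
      problems.push(`missing top-level field: ${field}`);
    }
  }
  // Each contact influence needs a source or an explicit citation_needed: true.
  for (const [index, entry] of (card.contactInfluences ?? []).entries()) {
    const hasSource = typeof entry.source === "string" && entry.source.length > 0;
    if (!hasSource && entry.citation_needed !== true) {
      problems.push(`contactInfluences[${index}] has no source and no citation_needed flag`);
    }
  }
  return problems;
}

const draftCard = {
  code: "crk",
  classification: null,
  dataSources: [],
  methodSupport: null,
  contactInfluences: [{ citation_needed: true }, {}],
};
console.log(precheckCard("crk.json", draftCard));
```

The check uses `field in card` rather than a truthiness test because a null or empty value is valid under this specification while an absent field is not.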

Professional References

Standard	Maintained By	Our Use
ISO 639-3	SIL International	Canonical language codes, macrolanguage relationships
Glottolog	Max Planck Institute	Classification, coordinates, AES endangerment
WALS	Max Planck Institute	Genus definitions, typological features
ISO 15924	Unicode/ISO	Script codes
CLDR	Unicode Consortium	Locale data, plural rules, typography
Wikidata	Wikimedia Foundation	Speaker counts, endonyms, script data
Ethnologue	SIL International	EGIDS, speaker estimates, DLS
UNESCO Atlas	UNESCO	Endangerment classification
Katig Collective	UP Diliman	Philippine language capsules

See also: Language Card Citation Procedure for detailed source-by-source guidance.
