Language Card Specification

Single source of truth. This document defines the canonical structure of every language card. Every card MUST contain every top-level field listed here, even if the value is null or []. A card with a missing field is non-conformant. This uniformity is what allows automated tools, linters, enrichment scripts, and human reviewers to trust the card structure.
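The conformance rule can be checked mechanically. The following is a minimal sketch, not the project's actual linter: `REQUIRED_FIELDS` is a hypothetical, abbreviated list (the canonical template in this document is the authoritative one), and `checkConformance` is an illustrative name.

```javascript
// Hypothetical, abbreviated field list for illustration only.
// The canonical template in this spec is the authoritative list.
const REQUIRED_FIELDS = ["code", "name", "nativeName", "alternateNames", "aliases", "extends"];

// Returns a list of human-readable problems. An empty list means the
// card passes this (partial) structural check.
function checkConformance(card, requiredFields = REQUIRED_FIELDS) {
  const problems = [];
  for (const field of requiredFields) {
    // A field must be present even when its value is null or [].
    if (!Object.prototype.hasOwnProperty.call(card, field)) {
      problems.push(`missing field: ${field}`);
      continue;
    }
    const value = card[field];
    const isEmptyObject =
      value !== null &&
      typeof value === "object" &&
      !Array.isArray(value) &&
      Object.keys(value).length === 0;
    // Empty objects are represented as null, never {}.
    if (isEmptyObject) {
      problems.push(`empty object, use null: ${field}`);
    }
  }
  return problems;
}

// This card omits "extends", so the check reports one problem.
const exampleCard = { code: "crk", name: "Plains Cree", nativeName: null, alternateNames: [], aliases: [] };
console.log(checkConformance(exampleCard));
```

Returning a list of problems rather than throwing on the first one lets a linter report everything wrong with a card in a single pass.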

Design Principles

Uniform structure. All 8,000+ cards have the same top-level fields. Unknown values are null, empty arrays are [], and empty objects are null (not {}). As a result, code never has to ask "does this field exist?", only "is it populated?"
Source everything. Every factual claim traces back to a named, versioned, primary source. An unsourced claim cannot be verified. The dataSources field, together with the per-field source annotations in sub-objects, makes provenance explicit.
Preserve disagreement. When authorities disagree (Wikidata reports 50,000 speakers, Ethnologue 20,000), we store both values with attribution. We do not average, reconcile, or take a position. Users can weigh the evidence themselves.
Null means unknown, not inapplicable. A null field means "we have not found data for this yet." When a field genuinely does not apply (e.g., grammatical gender in a sign language), the value should say so: { "grammatical": false, "inclusiveGuidance": "Not applicable — ASL does not have grammatical gender." }
Merge only. Enrichment scripts add data but never overwrite it. Manually curated values take precedence over automated data.
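The "merge only" and "preserve disagreement" principles can be sketched together as follows. This is an illustration under stated assumptions, not the behavior of any actual enrichment script: `enrichCard` and `isUnfilled` are hypothetical names, and skipping an incoming estimate whose `source` label is already present is an assumed de-duplication rule.

```javascript
// A value counts as unfilled when it is null, undefined, or an empty array.
function isUnfilled(value) {
  return value === null || value === undefined || (Array.isArray(value) && value.length === 0);
}

// Returns a new card. Existing populated values are never overwritten.
function enrichCard(card, incoming) {
  const result = { ...card };
  for (const [key, value] of Object.entries(incoming)) {
    if (key === "speakerEstimates") continue; // handled separately below
    // Merge only: fill gaps, keep whatever is already there.
    if (isUnfilled(result[key]) && !isUnfilled(value)) {
      result[key] = value;
    }
  }
  // Preserve disagreement: append estimates from new sources side by side.
  // Nothing is averaged or replaced. (Assumption: one entry per source label.)
  const existing = card.speakerEstimates ?? [];
  const knownSources = new Set(existing.map((entry) => entry.source));
  const added = (incoming.speakerEstimates ?? []).filter((entry) => !knownSources.has(entry.source));
  result.speakerEstimates = [...existing, ...added];
  return result;
}

const curated = {
  code: "xxx",
  macroarea: null,
  script: "Latn", // curated value, must survive enrichment
  speakerEstimates: [{ source: "wikidata", count: 50000 }],
};
const automated = {
  macroarea: "Eurasia",
  script: "Cyrl",
  speakerEstimates: [{ source: "ethnologue-28", count: 20000 }],
};
console.log(enrichCard(curated, automated));
```

In this example the null `macroarea` is filled, the curated `script` stays "Latn", and both speaker estimates are kept with their sources.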

Three-Layer Architecture

Layer	Location	Purpose
Language cards	`shared/language-cards/<code>.json`	Per-language configuration: identity, classification, resources, everything else
Genus cards	`shared/language-cards/genera/<genus>.json`	Shared runtime properties for related languages (curated, not auto-generated)
Language tree	`shared/language-cards/language-tree.json`	Full Glottolog hierarchy: reference data for the Lab UI and language discovery

Inheritance Model

When a card sets "extends": "family-dravidian", the runtime merges the parent card into the child card using _deepMerge() (in lib/registers.js). This lets genus cards define shared registers, formality systems, and gender guidance that propagate to all member languages, without duplicating data across hundreds of individual cards.

Merge Semantics

Child value	Behavior	Reason
`null`	Inherit from parent	`null` means "I do not define this", so the parent's value passes through
Non-null	Override parent	The child's data is more specific, so it takes precedence
Nested object	Recursive merge	Child fields override; parent fields the child does not set are preserved
Array	Replace entirely	Arrays are not merged element by element; the child's array wins
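The four rows of the table can be expressed as a short recursive function. This is a sketch of the stated semantics only; it is not the actual `_deepMerge()` from lib/registers.js, and the names `deepMergeSketch` and `isPlainObject` are hypothetical.

```javascript
function isPlainObject(value) {
  return value !== null && typeof value === "object" && !Array.isArray(value);
}

// Merges a parent card into a child card and returns a new object.
function deepMergeSketch(parent, child) {
  const merged = { ...parent };
  for (const [key, childValue] of Object.entries(child)) {
    const parentValue = parent[key];
    if (childValue === null) {
      // null: inherit. Keep the parent's value, or record null if the parent has none.
      if (!(key in merged)) merged[key] = null;
    } else if (isPlainObject(childValue) && isPlainObject(parentValue)) {
      // Nested object: recurse so that parent fields the child omits survive.
      merged[key] = deepMergeSketch(parentValue, childValue);
    } else {
      // Non-null scalars and arrays: the child's value replaces the parent's.
      merged[key] = childValue;
    }
  }
  return merged;
}

const parentCard = {
  formality: { system: "T-V", default: "formal" },
  registers: { formal: { label: "Formal", isDefault: true } },
  countries: ["CA", "US"],
};
const childCard = {
  formality: null, // inherited from the parent
  registers: { formal: { label: "Formal (Proximate)" }, informal: { label: "Informal" } },
  countries: ["CA"], // replaces the parent's array
};
console.log(JSON.stringify(deepMergeSketch(parentCard, childCard), null, 2));
```

In the merged result, `registers.formal` carries the child's label and the parent's `isDefault`, while `countries` contains only the child's entries.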

Identity Fields (Never Inherited)

Some fields belong to the card itself and must NEVER be inherited from a parent card:

code, extends, _migration, aliases, iso639_1, iso639_3

Even if a parent card defines aliases: ["macro-code"], a child card will NOT inherit those aliases. These fields always hold the child's own values (including null if unset).

Reason: without this rule, every Cree language would inherit aliases: ["cre"] from the parent macrolanguage, turning every variety into an alias of the macrolanguage.
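One way to enforce this rule is to reset the identity fields to the child's own values after the merge has run. The sketch below assumes that approach; `restoreIdentity` and `inheritNulls` are hypothetical names, `inheritNulls` is a deliberately simplified top-level stand-in for the real merge, and the parent card shown is invented for illustration.

```javascript
const IDENTITY_FIELDS = ["code", "extends", "_migration", "aliases", "iso639_1", "iso639_3"];

// Simplified stand-in for the merge step: a null child value inherits the parent's.
function inheritNulls(parent, child) {
  const merged = { ...parent };
  for (const [key, value] of Object.entries(child)) {
    if (value !== null) merged[key] = value;
  }
  return merged;
}

// After merging, identity fields are taken from the child only,
// falling back to null when the child does not set them.
function restoreIdentity(merged, child) {
  const result = { ...merged };
  for (const field of IDENTITY_FIELDS) {
    result[field] = child[field] ?? null;
  }
  return result;
}

// Hypothetical parent card for illustration.
const macroCard = { code: "macrolanguage-cre", aliases: ["cre"], gender: { grammatical: false } };
const plainsCree = { code: "crk", extends: "macrolanguage-cre", aliases: null, gender: null };

const naive = inheritNulls(macroCard, plainsCree);
const resolved = restoreIdentity(naive, plainsCree);
console.log(naive.aliases);    // the naive merge leaks the parent's aliases
console.log(resolved.aliases); // identity handling resets them to the child's own null
```

The `gender` value still flows down from the parent; only the six identity fields are exempt from inheritance.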

Example: How a Cree Card Resolves

┌───────────────────────┐
│  family-algic.json    │  formality: null, registers: null
│  (no registers)       │
└──────────┬────────────┘
           │ extends
┌──────────┴────────────┐
│  genus-cree.json      │  formality: { system: "obviative-animate", ... }
│  (sourced registers)  │  registers: { formal: {...}, informal: {...} }
└──────────┬────────────┘
           │ extends
┌──────────┴────────────┐
│  crk.json             │  code: "crk", extends: "genus-cree"
│  (Plains Cree)        │  formality: null → inherits from genus-cree
│                       │  registers: null → inherits from genus-cree
│                       │  script: "Cans"  → own value, no inheritance
│                       │  code: "crk"     → identity field, never inherited
└───────────────────────┘

At runtime, getLanguageCard("crk") returns a merged object containing the registers from genus-cree, the properties of family-algic (if any), and crk's own identity and metadata.
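The resolution in the diagram amounts to walking the `extends` chain up to the root and then folding the cards from root to leaf. The following sketch illustrates that idea with an in-memory stand-in for the card files. It is not the implementation of `getLanguageCard`: the names `extendsChain` and `resolveCardSketch` are hypothetical, the cycle check is an added assumption, and the fold merges top-level fields only (nested merging and identity-field handling are left out for brevity).

```javascript
// In-memory stand-in for the three cards in the diagram (values abbreviated).
const CARDS = {
  "family-algic": { code: "family-algic", extends: null, formality: null, registers: null, script: null },
  "genus-cree": {
    code: "genus-cree",
    extends: "family-algic",
    formality: { system: "obviative-animate" },
    registers: { formal: { label: "Formal (Proximate)" }, informal: { label: "Informal" } },
    script: null,
  },
  "crk": { code: "crk", extends: "genus-cree", formality: null, registers: null, script: "Cans" },
};

// Returns the cards from the root ancestor down to the requested card.
function extendsChain(code, cards) {
  const chain = [];
  const seen = new Set();
  let current = code;
  while (current !== null && current !== undefined) {
    if (seen.has(current)) throw new Error(`extends cycle at "${current}"`);
    seen.add(current);
    const card = cards[current];
    if (!card) throw new Error(`unknown card "${current}"`);
    chain.unshift(card); // ancestors end up before their descendants
    current = card.extends;
  }
  return chain;
}

// Folds the chain root to leaf: a null value inherits whatever is already resolved.
function resolveCardSketch(code, cards) {
  return extendsChain(code, cards).reduce((resolved, card) => {
    const next = { ...resolved };
    for (const [key, value] of Object.entries(card)) {
      if (value !== null || !(key in next)) next[key] = value;
    }
    return next;
  }, {});
}

console.log(resolveCardSketch("crk", CARDS));
```

For "crk" the result carries the formality and registers of genus-cree together with crk's own `code` and `script`.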

Genus Card Template

Genus cards live in shared/language-cards/genera/ and define shared properties for a language group. They follow the same schema as regular cards, but with different conventions:

{
  // Identity — genus cards use a prefixed code, NOT an ISO 639-3 code
  "code": "genus-cree",           // "genus-", "family-", or "macrolanguage-" prefix
  "name": "Cree Languages",      // Human-readable group name
  "extends": "family-algic",     // Genus cards can extend family cards (chaining)

  // Formality — shared across the group, sourced from typological databases
  "formality": {
    "system": "obviative-animate",
    "description": "Cree languages use an obviative/proximate system...",
    "default": "formal",
    "source": "WALS 37A, 38A + Wolfart 1973"
  },

  // Registers — shared presets, if the group shares a formality system
  "registers": {
    "formal": {
      "label": "Formal (Proximate)",
      "description": "...",
      "prompt": "...",
      "isDefault": true
    },
    "informal": {
      "label": "Informal",
      "description": "...",
      "prompt": "..."
    }
  },

  // Gender — shared grammatical gender behavior
  "gender": {
    "grammatical": false,       // Cree doesn't have grammatical gender
    "inclusiveGuidance": null   //   so no inclusive guidance needed
  },

  // Everything else is null — individual cards provide their own
  // classification, geography, resources, etc.
  "classification": null,
  "methodSupport": null,
  // ...
}

Key rule: genus cards may ONLY contain data that is genuinely shared across the entire group and drawn from authoritative references. If a formality system varies between members, it belongs on the individual cards, not on the genus.
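Whether data is truly shared across a group needs human judgment, but two conventions from the template above can be checked automatically: the prefixed code and the presence of a source on populated formality data. The sketch below assumes exactly those two checks; `checkGroupCard` is a hypothetical name, not an existing project function.

```javascript
const GROUP_PREFIXES = ["genus-", "family-", "macrolanguage-"];

// Returns a list of problems found on a genus, family, or macrolanguage card.
function checkGroupCard(card) {
  const problems = [];
  const hasPrefix =
    typeof card.code === "string" && GROUP_PREFIXES.some((prefix) => card.code.startsWith(prefix));
  if (!hasPrefix) {
    problems.push("code must start with genus-, family-, or macrolanguage-");
  }
  // Shared formality data should name the reference it comes from.
  const formality = card.formality ?? null;
  if (formality !== null && !formality.source) {
    problems.push("formality is populated but has no source");
  }
  return problems;
}

console.log(checkGroupCard({
  code: "genus-cree",
  formality: { system: "obviative-animate", source: "WALS 37A, 38A + Wolfart 1973" },
}));
```

A card that passes these checks can still violate the key rule, so this kind of check complements curation rather than replacing it.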

Canonical Template

Every card MUST have exactly this top-level structure. Sub-object schemas are documented in the field reference below.

{
  // ═══════════════════════════════════════════════════════════════════════
  //  § 1. IDENTITY
  //  Who is this language? What codes identify it?
  //  Sources: ISO 639-3 registry, ISO 639-1, BCP 47/IANA.
  // ═══════════════════════════════════════════════════════════════════════

  "code":          "xxx",       // REQUIRED. ISO 639-3 code. This IS the card ID and filename.
  "name":          "English Name",  // REQUIRED. English reference name from ISO 639-3 registry.
  "nativeName":    null,        // Endonym (name in the language itself). Source: Wikidata P1705.
                                // Examples: "nêhiyawêwin / ᓀᐦᐃᔭᐍᐏᐣ", "日本語", "Esperanto".
  "alternateNames": [],         // Other names this language is known by. Source: Glottolog, Ethnologue.
                                // Not aliases (those are code-level). These are name-level variants.
                                // Example: ["Qafar af", "Afaraf", "'Afar Af"] for Afar (aar).
  "iso639_3":      "xxx",      // REQUIRED. Three-letter ISO 639-3 code. Same as `code`.
  "iso639_1":      null,        // Two-letter ISO 639-1 code (e.g., "en", "fr"). null if none.
  "bcp47":         null,        // IETF BCP 47 tag. Often same as iso639_1. Can include subtags
                                // (e.g., "iu-Cans-CA"). null if unknown.
  "aliases":       [],          // Alternative code-level identifiers that resolve to this card.
                                // Example: ["fil"] for tgl (Tagalog), ["iu"] for iku (Inuktitut).
                                // Used by code resolution: user types "fil", system loads tgl.json.
  "isoScope":      "I",        // REQUIRED. ISO 639-3 scope:
                                //   "I" = Individual language
                                //   "M" = Macrolanguage (e.g., Chinese, Arabic, Cree)
                                //   "S" = Special (e.g., mis, mul, zxx)
  "isoType":       "L",        // REQUIRED. ISO 639-3 type:
                                //   "L" = Living    "E" = Extinct    "A" = Ancient
                                //   "H" = Historical    "C" = Constructed
  "macrolanguage": null,        // If this language is part of a macrolanguage, the macrolanguage
                                // ISO 639-3 code (e.g., "cre" for Plains Cree, "ara" for Arabic
                                // varieties). Source: ISO 639-3 macrolanguages.tab.
  "extends":       null,        // Genus card key if shared properties are inherited from a genus
                                // card (e.g., "genus-cree", "genus-eskimo-aleut").
                                // null for most languages.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 2. CLASSIFICATION
  //  Where does this language sit in the family tree?
  //  Source: Glottolog. NEVER hand-build classifications.
  // ═══════════════════════════════════════════════════════════════════════

  "glottocode":      null,      // Glottolog identifier (e.g., "plai1258", "stan1293").
                                // null if the language is not in Glottolog.
  "classification":  null,      // Genealogical classification from Glottolog. When populated:
                                // {
                                //   "family": "Algic",              // Top-level family. null for isolates.
                                //   "familyGlottocode": "algi1248", // Glottocode of the family.
                                //   "genus": "Plains Creeic",       // WALS-style genus.
                                //   "genusGlottocode": "plai1264",  // Glottocode of the genus.
                                //   "ancestry": ["Algic", "Algonquian-Blackfoot", "Algonquian",
                                //                "Cree-Montagnais-Naskapi", "Cree", "Plains Creeic"]
                                // }
                                // For isolates: family = language name, genus = language name,
                                // ancestry = [language name].
  "isIsolate":       false,     // true if a language isolate (no known genetic relatives).
                                // Source: Glottolog CLDF.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 3. GEOGRAPHY
  //  Where is this language spoken?
  //  Sources: Glottolog (coordinates, countries), census data, Ethnologue.
  // ═══════════════════════════════════════════════════════════════════════

  "macroarea":     null,        // Glottolog macroarea. One of: "Africa", "Australia",
                                // "Eurasia", "North America", "Papunesia", "South America".
                                // null if unknown. Source: Glottolog CLDF.
  "coordinates":   null,        // Representative geographic point. When populated:
                                // { "lat": 52.1, "lng": -106.6, "source": "glottolog-5.3" }
                                // This is a representative point, not a boundary.
  "countries":     [],          // ISO 3166-1 alpha-2 country codes where this language is spoken.
                                // Example: ["CA", "US"]. Source: Glottolog.
  "regions":       [],          // Detailed regional breakdown with admin codes & speaker estimates.
                                // Each entry:
                                // {
                                //   "country": "Canada",
                                //   "countryCode": "CA",
                                //   "officialStatus": "recognized",  // official, co-official,
                                //                                    // recognized, none
                                //   "region": "Saskatchewan, Alberta, Manitoba",
                                //   "speakerEstimate": "~20,000",
                                //   "coordinates": [-106.6, 52.1],   // [lng, lat]
                                //   "admin1Codes": ["CA-SK", "CA-AB", "CA-MB"]
                                // }

  "arealContext":  null,         // Linguistic area / Sprachbund membership. DISTINCT from
                                // contactInfluences (which is language-specific contact history).
                                // This field captures zone-level typological convergence patterns
                                // — i.e., what linguistic area the language exists within and
                                // what features are common across that area.
                                // {
                                //   "zone": "Mainland Southeast Asian Sprachbund",
                                //   "arealFeatures": "Tonal convergence, classifier systems,
                                //     topic-prominence, monosyllabicity trend.",
                                //   "typicalContacts": ["Classical Chinese", "Sanskrit/Pali"],
                                //   "source": "areal-linguistics (Enfield 2005)"
                                // }
                                // NOT the same as contactInfluences. A language can exist within
                                // a convergence area without having specific contact history with
                                // any particular language in that area.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 4. WRITING SYSTEMS
  //  How is this language written?
  //  Sources: Wikidata P282, ISO 15924, manual research.
  //  Note: Some languages have NO standardized orthography. Some have
  //  competing orthographies. Some use multiple scripts routinely (e.g.,
  //  Serbian: Cyrillic + Latin; Japanese: Kanji + Hiragana + Katakana).
  //  Sign languages may use notation systems (SignWriting, HamNoSys) or
  //  none at all.
  // ═══════════════════════════════════════════════════════════════════════

  "script":        null,        // Primary ISO 15924 script code (e.g., "Latn", "Cyrl", "Cans",
                                // "Jpan"). null if no written form or unknown.
  "scriptUnicodeName": null,    // Unicode script block name derived from the script field.
                                // e.g., "Latin", "Cyrillic", "Canadian_Aboriginal", "CJK".
                                // Used by code_switching metric plugin. Auto-populated by
                                // enrich-script-unicode-names.mjs. null if script is null.
  "scripts":       [],          // All writing systems with detail. Array of:
                                // {
                                //   "code": "Cans",
                                //   "name": "Unified Canadian Aboriginal Syllabics",
                                //   "primary": true
                                // }
                                // A language with multiple scripts has multiple entries.
                                // A language with no written form has [].
  "dir":           null,        // Writing direction: "ltr" (left-to-right) or "rtl" (right-to-left).
                                // null if no written form or unknown.
  "scriptConverter": null,      // Script converter key if we have a converter for this language
                                // (e.g., "crk" for SRO↔Syllabics). null for most languages.
  "orthographicStatus": null,   // Writing system standardization status. When populated:
                                // {
                                //   "status": "standardized",
                                //       // "standardized" — official/agreed orthography exists
                                //       // "competing"    — multiple orthographies in active use
                                //       // "emerging"     — orthography under development
                                //       // "none"         — primarily oral, no standard writing
                                //   "notes": "Uses SIL-developed Latin orthography since 1960s.",
                                //   "source": "ethnologue" // or "manual-curation"
                                // }
                                // Crucial for LRLs where orthographic variation directly impacts
                                // MT training data quality and evaluation consistency.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 5. DEMOGRAPHICS & VITALITY
  //  How many people speak this language? Is it endangered?
  //  Sources: Census, Ethnologue, UNESCO Atlas, Wikidata, Glottolog AES.
  //
  //  CRITICAL: Store ALL estimates separately with source attribution.
  //  Never average or "resolve" conflicting data. Speaker counts are
  //  politically contested for many languages. Present the evidence,
  //  let the reader assess.
  // ═══════════════════════════════════════════════════════════════════════

  "speakerEstimates": [],       // Array of speaker count estimates from different authorities.
                                // Each entry:
                                // {
                                //   "source": "wikidata",              // or "ethnologue-28",
                                //                                      // "census-ph-2020", etc.
                                //   "count": null,                     // Point estimate. null if range-only (as here).
                                //   "date": "2026-06-07",              // When this data was retrieved.
                                //   "countRange": { "min": 15000, "max": 25000 },  // Optional range.
                                //   "note": "Wikidata has 2 estimates: 15,000 and 25,000"
                                // }
                                // Empty array means we have not yet found speaker count data.

  "vitality":      null,        // Endangerment / vitality assessment. When populated:
                                // {
                                //   "unescoStatus": "severely-endangered",
                                //       // Enum: "safe", "vulnerable", "definitely-endangered",
                                //       //       "severely-endangered", "critically-endangered",
                                //       //       "extinct"
                                //   "aesStatus": "shifting",
                                //       // Glottolog AES label (free text from AES data).
                                //   "egids": "6b",
                                //       // Ethnologue Expanded Graded Intergenerational Disruption
                                //       // Scale. Levels: 0 (international) to 10 (extinct).
                                //   "trend": "declining",
                                //       // Qualitative trend: "stable", "growing", "declining",
                                //       //                     "shifting", "moribund", "awakening"
                                //   "source": "glottolog-aes-5.3",
                                //   "notes": "Intergenerational transmission breaking down."
                                // }

  // ═══════════════════════════════════════════════════════════════════════
  //  § 5.5. DOCUMENTATION & DIGITAL PRESENCE
  //  How well-documented is this language? What digital footprint does it
  //  have? These fields answer the practical question: "What can I
  //  actually DO with this language?"
  //  Sources: Glottolog (references), Wikipedia, Common Voice, Tatoeba.
  // ═══════════════════════════════════════════════════════════════════════

  "documentationDepth": null,    // How well-documented is this language in the literature?
                                 // {
                                 //   "referenceCount": 42,
                                 //       // Number of published references in Glottolog.
                                 //   "med": "grammar",
                                 //       // Most Extensive Description type. One of:
                                 //       // "long_grammar", "grammar", "grammar_sketch",
                                 //       // "dictionary", "phonology", "text", "wordlist",
                                 //       // "comparative", "minimal", "unknown"
                                 //   "source": "glottolog-5.3"
                                 // }

  "digitalPresence":  null,      // Digital footprint across web platforms. When populated:
                                 // {
                                 //   "wikipedia": {
                                 //     "edition": true,      // Has its own Wikipedia edition?
                                 //     "articleCount": 75000, // Number of articles.
                                 //     "editionCode": "crk",  // Wikipedia subdomain code.
                                 //     "source": "wikimedia-api-2026"
                                 //   },
                                 //   "commonVoice": {
                                 //     "validatedHours": 12.5,
                                 //     "totalHours": 25.0,
                                 //     "speakers": 45,
                                 //     "sentences": 1200,
                                 //     "source": "common-voice-20.0"
                                 //   },
                                 //   "tatoeba": {
                                 //     "sentenceCount": 342,
                                 //     "source": "tatoeba-2026"
                                 //   }
                                 // }

  "dialectCount":     null,      // Number of recognized dialects in Glottolog.
                                 // Derived from child_dialect_count in languoid.csv.
                                 // Simple integer. null if 0 or unknown.
                                 // Source: glottolog-5.3.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 6. FORMALITY, REGISTERS & GENDER
  //  How does politeness work in this language? What translation registers
  //  do we offer? How should gender be handled?
  //
  //  This section drives Champollion's register-preset system — the
  //  mechanism by which users select formal/informal/professional tone.
  //  These fields require genuine linguistic research, not automation.
  // ═══════════════════════════════════════════════════════════════════════

  "formality":     null,        // Formality system description. When populated:
                                // {
                                //   "system": "T-V",
                                //       // One of: "T-V", "speech-levels", "keigo", "particles",
                                //       //         "register-levels", "register-and-code-switching",
                                //       //         "code-switching", "none"
                                //   "description": "French uses a vous/tu distinction...",
                                //   "default": "formal-vous"   // Key into the `registers` object.
                                // }

  "registers":     null,        // Translation register presets. When populated, keyed by preset ID:
                                // {
                                //   "formal-vous": {
                                //     "label": "Formal (vouvoiement)",
                                //     "description": "One sentence: when to use this preset.",
                                //     "prompt": "The actual LLM system prompt instruction that
                                //               steers translation tone. Must name specific
                                //               linguistic features (pronouns, verb forms, particles).",
                                //     "deeplFormality": "prefer_more"
                                //       // Only if methodSupport.deepl.formality is true.
                                //       // One of: "prefer_more", "prefer_less", "default".
                                //   }
                                // }

  "gender":        null,        // Grammatical gender and inclusive guidance. When populated:
                                // {
                                //   "grammatical": true,         // Does the language have gram. gender?
                                //   "inclusiveGuidance": "Use gender-neutral forms when possible.
                                //                        Prefer 'iel' (neologism) or rephrase to
                                //                        avoid gendered agreement."
                                // }
                                // For languages without grammatical gender (Turkish, Finnish):
                                // { "grammatical": false, "inclusiveGuidance": null }

  "codeSwitching":  null,       // Code-switching behavior (for languages where mixing with another
                                // language is the norm, not an error). When populated:
                                // {
                                //   "contactLanguage": "Spanish",
                                //   "contactIso639_3": "spa",
                                //   "mixedVarietyName": "Jopará",   // null if no named mixed variety
                                //   "prevalence": "dominant",       // "rare", "common", "dominant"
                                //   "morphologicalIntegration": true,
                                //   "pipelineStrategy": "hybrid-fst",
                                //   "notes": "Jopará IS the everyday language of most Paraguayans..."
                                // }

  // ═══════════════════════════════════════════════════════════════════════
  //  § 7. LINGUISTIC PROFILE
  //  What makes this language what it is? What are the specific challenges
  //  for machine translation? What rules govern its typography?
  //  What languages have shaped it through contact?
  //
  //  These fields require genuine linguistic expertise. For many languages
  //  (especially low-resource), this section will remain null until a
  //  qualified researcher or community member contributes.
  // ═══════════════════════════════════════════════════════════════════════

  "linguisticChallenges": null,  // MT-relevant challenges, keyed by challenge ID.
                                 // When populated:
                                 // {
                                 //   "polysynthesis": "Cree is highly polysynthetic. A single verb
                                 //                    can incorporate subject, object, tense...",
                                 //   "animacy": "Verb conjugation changes based on whether the
                                 //              subject/object is animate or inanimate...",
                                 //   "neologisms": "Avoid literal translations of modern software
                                 //                 concepts. Maintain Cree metaphorical logic..."
                                 // }
                                 // Aim for 3–6 challenges per language when researched.

  "contactInfluences": [],       // How other languages have shaped this one. Array of:
                                 // {
                                 //   "source": "English",
                                 //   "sourceIso639_3": "eng",       // null if proto-language/unknown
                                 //   "type": "superstrate",
                                 //       // Enum: "superstrate", "substrate", "adstrate",
                                 //       //       "learned_borrowing", "lexical_borrowing",
                                 //       //       "relexification"
                                 //   "domains": ["education", "government", "technology"],
                                 //   "depth": "heavy",
                                 //       // Enum: "light", "moderate", "heavy", "structural",
                                 //       //       "defining"
                                 //   "period": "1870–present",
                                 //   "notes": "Residential school era and ongoing...",
                                 //   "citation_needed": false
                                 //       // true if no published academic source found.
                                 //       // See language-card-citation-procedure.md.
                                 // }

  "rules":          null,        // Typography, plural, and capitalization rules. When populated:
                                 // {
                                 //   "typography": {
                                 //     "quoteStart": "\u201c",
                                 //     "quoteEnd": "\u201d",
                                 //     "usesSpaces": true,        // false for CJK, Thai, Lao, Khmer
                                 //     "punctuationSpacing": {
                                 //       "doublePunctuation": "none"  // "thin-nbsp" for French
                                 //     }
                                 //   },
                                 //   "plurals": {
                                 //     "categories": ["one", "other"]
                                 //       // From CLDR. Possible values:
                                 //       // "zero", "one", "two", "few", "many", "other"
                                 //   },
                                 //   "capitalization": {
                                 //     "hasCase": true
                                 //       // true for Latin, Cyrillic, Greek, Armenian scripts.
                                 //       // false for CJK, Arabic, Devanagari, etc.
                                 //   }
                                 // }
                                 // Source: CLDR + ISO 15924 derivation.

  "typologicalProfile": null,   // Grambank typological features. When populated:
                                // {
                                //   "featuresDocumented": 195,
                                //   "featuresCoverage": 1,     // 0.0–1.0 fraction of features
                                //   "wordOrderDominant": "SVO",
                                //   "hasDefiniteArticle": true,
                                //   "hasIndefiniteArticle": true,
                                //   "hasGenderSystem": true,
                                //   "hasCaseMorphology": true,
                                //   "hasEvidentiality": false,
                                //   "hasToneSystem": false,
                                //   "source": "grambank-1.0.3"
                                // }
                                // Auto-populated by enrich-grambank-typology.mjs.

  "phonologicalInventory": null, // PHOIBLE phoneme inventory. When populated:
                                // {
                                //   "consonants": 24,
                                //   "vowels": 16,
                                //   "tones": 0,
                                //   "totalPhonemes": 40,
                                //   "isTonal": false,
                                //   "inventorySize": "moderately-large",
                                //       // Enum: "small", "moderately-small", "average",
                                //       //       "moderately-large", "large"
                                //   "source": "phoible-2.0"
                                // }
                                // Auto-populated by enrich-phoible-phonemes.mjs.

  // ═══════════════════════════════════════════════════════════════════════
  //  § 8. ENCYCLOPEDIC
  //  General knowledge about the language for human context. History,
  //  dialect situation, institutional resources, representative sayings.
  //  This section is for understanding, not computation.
  // ═══════════════════════════════════════════════════════════════════════

  "encyclopedic":    null,       // General knowledge. When populated:
                                 // {
                                 //   "family": "Algic",             // Redundant with classification
                                 //                                  // but useful for human readers.
                                 //   "dialects": {
                                 //     "split": true,               // Is there significant variation?
                                 //     "classification": "Plains Cree (y-dialect)",
                                 //     "variants": ["crk", "cwd", "csw"]  // ISO codes of variants
                                 //   },
                                 //   "demographics": {
                                 //     "speakers": "Approx. 20,000 active speakers",
                                 //     "regions": ["Saskatchewan", "Alberta", "Manitoba"]
                                 //   },
                                 //   "history": "Plains Cree is the most widely spoken Algonquian
                                 //              language in western Canada...",
                                 //   "resources": {
                                 //     "wikipedia": "https://en.wikipedia.org/wiki/Plains_Cree",
                                 //     "foundations": [{ "name": "ALTLab", "url": "https://..." }],
                                 //     "dictionaries": [{ "name": "itwêwina", "url": "https://..." }]
                                 //   }
                                 // }

  "culturalAphorism": null,      // A representative saying, proverb, or teaching in the language.
                                 // When populated:
                                 // {
                                 //   "text": "ê-wîcêhtonaniwahk kâ-kî-isi-wâpahtamâhk ôma pimâtisiwin",
                                 //   "transliteration": null,       // Romanized form if non-Latin script.
                                 //   "translation": "Through helping each other we come to understand
                                 //                   this life",
                                 //   "literal": "By-helping-one-another we-have-come-to-see this life",
                                 //   "source": "Cree teaching, documented in nêhiyawêwin educational
                                 //              resources"
                                 // }
                                 // Choose sayings that reveal something about the language's
                                 // worldview or structure. Must be sourced.

  "varieties":      [],          // For macrolanguages or languages with significant dialectal
                                 // variation, the individual varieties with their own tool coverage.
                                 // Each entry:
                                 // {
                                 //   "name": "Cusco Quechua",
                                 //   "iso639_3": "quz",
                                 //   "region": "Cusco, Peru",
                                 //   "fstCoverage": true,
                                 //   "corpusCoverage": true,
                                 //   "nllbCoverage": false,
                                 //   "mutualIntelligibility": "Primary variety for this card",
                                 //   "notes": "SQUOIA FST was built for this variety."
                                 // }

  // ═══════════════════════════════════════════════════════════════════════
  //  § 9. DIGITAL RESOURCES & TOOLING
  //  What NLP tools, corpora, models, and datasets exist for this language?
  //  What translation APIs support it? What eval benchmarks are available?
  //
  //  This is Champollion's operational core — these fields determine what
  //  we can actually DO with this language.
  // ═══════════════════════════════════════════════════════════════════════

  "resources":      null,        // NLP resources available for this language. When populated:
                                 // {
                                 //   "fsts": [{                     // Finite-state transducers
                                 //     "name": "GiellaLT Plains Cree FST (lang-crk)",
                                 //     "url": "https://github.com/giellalt/lang-crk/releases",
                                 //     "type": "morphological-analyzer"
                                 //   }],
                                 //   "corpora": [{                  // Text corpora
                                 //     "name": "EDTeKLA Cree Language Textbook Corpus",
                                 //     "type": "parallel",          // "parallel", "monolingual"
                                 //     "pairs": ["en-crk"],
                                 //     "url": "https://...",
                                 //     "exposure": "open-web"       // "open-web", "restricted",
                                 //                                  // "holdout"
                                 //   }],
                                 //   "models": [{                   // Pre-trained models
                                 //     "name": "NLLB-200 (crk_Cans)",
                                 //     "url": "https://...",
                                 //     "type": "nmt"
                                 //   }],
                                 //   "tools": [],                   // Other NLP tools
                                 //   "wordlists": [{                // Standardized wordlists
                                 //     "name": "Lexibank",
                                 //     "conceptCount": 200,
                                 //     "source": "lexibank"
                                 //   }],
                                 //   "treebanks": [{                // Syntactic treebanks
                                 //     "name": "UD_Korean-GSD",
                                 //     "tokens": 80000,
                                 //     "source": "universal-dependencies-2.14"
                                 //   }]
                                 // }
                                 // IMPORTANT: Only actual NLP/digital resources belong here.
                                 // "This language has a WALS entry" is NOT a resource — that
                                 // goes in databaseCoverage.

  "databaseCoverage": null,      // Which typological/reference databases cover this language.
                                 // Separated from resources to avoid conflating "has a database
                                 // entry" with "has usable NLP tooling."
                                 // {
                                 //   "wals": true,
                                 //   "grambank": true,
                                 //   "phoible": true,
                                 //   "cldr": true,
                                 //   "lexibank": true,
                                 //   "commonVoice": true,
                                 //   "source": "derived"
                                 // }

  "corpusAvailability": null,    // What text/parallel corpora exist for NLP use?
                                 // {
                                 //   "bibleTranslation": {
                                 //     "textAvailable": true,
                                 //     "audioAvailable": true,
                                 //     "source": "bible-brain-api"
                                 //   },
                                 //   "opusCorpora": ["wikimedia", "ubuntu", "gnome"],
                                 //   "source": "multi-source"
                                 // }

  "keyboardSupport":  null,      // Input method / keyboard availability. When populated:
                                 // {
                                 //   "keymanKeyboards": 3,
                                 //       // Number of Keyman keyboards available.
                                 //   "cldrKeyboard": true,
                                 //       // CLDR has keyboard layout data.
                                 //   "source": "keyman-api + cldr"
                                 // }

  "methodSupport":  {            // REQUIRED. Which Champollion translation methods support this
                                 // language. Each method is an object with at minimum
                                 // { "supported": boolean }.
    "googleTranslate":     { "supported": false },
    "deepl":               { "supported": false },
    "microsoftTranslator": { "supported": false },
    "libreTranslate":      { "supported": false },
    "nllb":                { "supported": false },
                                 // When NLLB is supported, include the code:
                                 // { "supported": true, "code": "crk_Cans" }
    "llm":                 { "supported": true }
                                 // LLM is always true (quality varies by language).
                                 // Optional: "verifiedDate": "2026-06-07" for audit trail.
  },

  "metricModelSupport": null,   // Which MT evaluation models produce reliable scores.
                                // When populated:
                                // {
                                //   "xlmr": "high",          // "high", "medium", or "low"
                                //                            // XLM-R training representation tier.
                                //   "africomet": false        // true if AfriCOMET covers this language.
                                // }
                                // Drives automatic COMET model selection in metrics_comet.py.
                                // Auto-populated by enrich-metric-model-support.mjs.

  "metricPlugins":   null,      // Which per-language metric plugin packs are available.
                                // When populated:
                                // {
                                //   "formalityMarkers": true  // Formality marker resource file exists
                                //                             // at plugins/resources/formality/{code}.json
                                // }
                                // Each key corresponds to a resource pack in
                                // arena/mt_eval_harness/plugins/resources/{packName}/.
                                // To add a new metric pack for a language, create the resource
                                // file and set the flag here. No code changes required.

  "evalPack":       null,        // Evaluation dependency pack for language-specific metrics.
                                 // When populated, declares the Python dependencies and
                                 // post-install steps required by this language's eval standards.
                                 // The harness uses this for dependency gating: if deps are
                                 // missing, the harness warns the user and skips LYSS metrics
                                 // (rather than crashing).
                                 // When populated:
                                 // {
                                 //   "pythonDeps": {
                                 //     "pyhfst": "pyhfst>=1.4",    // PyPI package specs
                                 //     "requests": "requests>=2.28",
                                 //     "spacy": "spacy>=3.7"
                                 //   },
                                 //   "postInstall": [               // Commands to run after pip
                                 //     {
                                 //       "command": "spacy download en_core_web_md",
                                 //       "label": "spaCy English model (for LYSS-sem)"
                                 //     }
                                 //   ],
                                 //   "requiresFst": true,           // true if GiellaLT FST needed
                                 //   "description": "LYSS equivalence linter + FST validation"
                                 // }

  "evalMetrics":    null,        // Language-specific evaluation metrics (LYSS standards).
                                 // When populated, the harness dynamically imports these
                                 // MetricPlugin classes from eval_standards/<lang>/ and applies
                                 // them to every run targeting this language — regardless of
                                 // which method (contestant) is being evaluated.
                                 // Keyed by metric ID:
                                 // {
                                 //   "lyss-eq": {
                                 //     "module": "eval_standards.crk.metrics",
                                 //     "class": "CrkLinterMetric",
                                 //     "description": "LYSS deterministic variant-class linter"
                                 //   },
                                 //   "lyss-sem": {
                                 //     "module": "eval_standards.crk.metrics",
                                 //     "class": "CrkSemanticMetric",
                                 //     "description": "LYSS FST-based semantic validator",
                                 //     "dependencies": ["spacy>=3.7"],
                                 //     "spacy_models": ["en_core_web_md"]
                                 //   }
                                 // }
                                 // Architecture: eval standards are referees, not contestants.
                                 // They live in the harness (eval_standards/), not in method
                                 // plugins. This ensures all methods are scored equally.
                                 // Discovery: plugin_discovery.py reads this field via
                                 // language_cards.get_eval_metrics() and instantiates metrics
                                 // using importlib. Dependencies are checked against evalPack.

  "omt1600":        null,        // Meta's OMT-1600 (One Model for Translation) coverage assessment.
                                 // When populated:
                                 // {
                                 //   "covered": true,
                                 //   "tier": "R1",                  // Meta's resource tier
                                 //   "evalMetrics": ["chrF++", "BLASER-3"],
                                 //   "notes": "Plains Cree: no web-crawled bitext..."
                                 // }

  "evalDatasets":   [],          // Evaluation dataset IDs available for this language.
                                 // Example: ["flores-plus-devtest", "edtekla-dev-v1"].
                                 // Empty means no standardized eval set exists.

  "pipelineReadiness": null,     // Assessment of readiness for Champollion's translation pipeline.
                                 // When populated:
                                 // {
                                 //   "tier": "tier-2-feasible",
                                 //       // "watch-list"       — cataloged but no path to translation
                                 //       // "tier-3-cataloged" — basic metadata present
                                 //       // "tier-2-feasible"  — tools exist, pipeline possible
                                 //       // "tier-1-ready"     — pipeline operational
                                 //   "hasFST": true,
                                 //   "hasParallelCorpus": true,
                                 //   "hasEvalBenchmark": true,
                                 //   "blockers": ["Syllabics post-processing validation"],
                                 //   "notes": "FST-gated pipeline operational. EDTeKLA corpus..."
                                 // }

  // ═══════════════════════════════════════════════════════════════════════
  //  § 10. PROVENANCE & METADATA
  //  Where does this data come from? Who reviewed it? When was it
  //  generated? What's its overall quality level?
  //
  //  This section exists to make the card auditable. Every automated
  //  enrichment, every human review, every source consulted should
  //  leave a trace here.
  // ═══════════════════════════════════════════════════════════════════════

  "dataSources":   [],           // REQUIRED. Sources consulted for this card's data.
                                 // Can be a flat array (backwards-compatible):
                                 //   ["iso639-3-2024", "glottolog-5.3", "wikidata"]
                                 //
                                 // Or a structured per-field object (preferred for new cards):
                                 //   {
                                 //     "classification": ["glottolog-5.3"],
                                 //     "vitality": ["glottolog-aes-5.3", "unesco-atlas-2024"],
                                 //     "speakerEstimates": ["wikidata", "census-ca-2021"],
                                 //     "rules": ["cldr-48"],
                                 //     "methodSupport": ["google-translate-2026-06"]
                                 //   }

  "supportTier":   "cataloged",  // Auto-derived tier summarizing the card's depth:
                                 //   "cataloged"   — identity + classification only
                                 //   "emerging"    — + vitality + speakerEstimates
                                 //   "developing"  — + resources + methodSupport
                                 //   "supported"   — full research: registers, challenges, etc.

  "humanReviewed": null,         // null until a qualified human reviews the card. When populated:
                                 // {
                                 //   "reviewer": "Prof. Kenneth Jamandre",
                                 //   "affiliation": "University of the Philippines Diliman",
                                 //   "date": "2026-06-08",
                                 //   "scope": "full",             // "full", "partial", "vitality-only"
                                 //   "notes": "Verified speaker count, vitality assessment,
                                 //             and contact influences for Tagalog."
                                 // }

  "notes":         null,         // Free-text notes about this language or this card's data quality.
                                 // Example: "Low-resource language under active development.
                                 //           Translation pipeline uses FST-gated approach."

  "firstDocumented": null,       // Year of first known documentation. Negative for BCE.
                                 // Example: -1500 (Sanskrit, ~1500 BCE), 1787 (some languages).
                                 // Source: Glottolog CLDF.

  "lastDocumented":  null,       // Year of last known documentation (relevant for extinct languages).
                                 // Source: Glottolog CLDF.

  "_generated":    null          // Auto-populated by enrichment scripts. When populated:
                                 // {
                                 //   "by": "generate-all-cards.mjs",
                                 //   "at": "2026-06-07T12:34:56Z",
                                 //   "sources": ["iso639-3", "glottolog-5.3", "wikidata"],
                                 //   "completeness": "partial",
                                 //       // "partial"     — has identity + classification + coords
                                 //       // "substantial" — + vitality + speakerEstimates + script
                                 //       // "complete"    — all automatable fields populated
                                 //   "lastEnriched": "2026-06-07"
                                 // }
}

Field Reference

§ 1. Identity Fields

Field	Type	Required	Automatable	Source
`code`	`string`	✅	✅	ISO 639-3 registry
`name`	`string`	✅	✅	ISO 639-3 registry
`nativeName`	`string \| null`	—	✅	Wikidata P1705
`alternateNames`	`string[]`	—	✅	Glottolog, Ethnologue
`iso639_3`	`string`	✅	✅	ISO 639-3 registry
`iso639_1`	`string \| null`	—	✅	ISO 639-1
`bcp47`	`string \| null`	—	Partial	IANA subtag registry
`aliases`	`string[]`	—	❌	Manual curation
`isoScope`	`string`	✅	✅	ISO 639-3 registry
`isoType`	`string`	✅	✅	ISO 639-3 registry
`macrolanguage`	`string \| null`	—	✅	ISO 639-3 macrolanguages.tab
`extends`	`string \| null`	—	❌	Manual curation

§ 2. Classification Fields

Field	Type	Required	Automatable	Source
`glottocode`	`string \| null`	—	✅	Glottolog
`classification`	`object \| null`	—	✅	Glottolog
`isIsolate`	`boolean`	—	✅	Glottolog CLDF

§ 3. Geography Fields

Field	Type	Required	Automatable	Source
`macroarea`	`string \| null`	—	✅	Glottolog CLDF
`coordinates`	`object \| null`	—	✅	Glottolog
`countries`	`string[]`	—	✅	Glottolog
`regions`	`object[]`	—	❌	Census, Ethnologue, manual
`arealContext`	`object \| null`	—	✅	Coordinates + linguistic areal zones

§ 4. Writing System Fields

Field	Type	Required	Automatable	Source
`script`	`string \| null`	—	✅	Wikidata P282
`scriptUnicodeName`	`string \| null`	—	✅	Derived from `script` via ISO 15924 → Unicode mapping
`scripts`	`object[]`	—	Partial	Wikidata, manual
`dir`	`string \| null`	—	✅	Derivable from script
`scriptConverter`	`string \| null`	—	❌	Manual
`orthographicStatus`	`object \| null`	—	Partial	Ethnologue, manual

§ 5. Demographics & Vitality Fields

Field	Type	Required	Automatable	Source
`speakerEstimates`	`object[]`	—	✅	Wikidata, Ethnologue, census
`vitality`	`object \| null`	—	✅	Glottolog AES, UNESCO

§ 5.5 Documentation & Digital Presence Fields

Field	Type	Required	Automatable	Source
`documentationDepth`	`object \| null`	—	✅	Glottolog references
`digitalPresence`	`object \| null`	—	✅	Wikipedia, Common Voice, Tatoeba
`dialectCount`	`number \| null`	—	✅	Glottolog

§ 6. Formality, Register & Gender Fields

Field	Type	Required	Automatable	Source
`formality`	`object \| null`	—	❌	Linguistic research
`registers`	`object \| null`	—	❌	Linguistic research
`gender`	`object \| null`	—	❌	Linguistic research
`codeSwitching`	`object \| null`	—	❌	Linguistic research

§ 7. Linguistic Profile Fields

Field	Type	Required	Automatable	Source
`linguisticChallenges`	`object \| null`	—	❌	Linguistic research
`contactInfluences`	`object[]`	—	❌	Published linguistics
`rules`	`object \| null`	—	✅	CLDR
`typologicalProfile`	`object \| null`	—	✅	Grambank 1.0.3, auto-populated by `enrich-grambank-typology.mjs`
`phonologicalInventory`	`object \| null`	—	✅	PHOIBLE 2.0, auto-populated by `enrich-phoible-phonemes.mjs`

§ 8. Encyclopedic Fields

Field	Type	Required	Automatable	Source
`encyclopedic`	`object \| null`	—	❌	Manual research
`culturalAphorism`	`object \| null`	—	❌	Community contribution
`varieties`	`object[]`	—	❌	Manual research

§ 9. Digital Resource Fields

Field	Type	Required	Automatable	Source
`resources`	`object \| null`	—	Partial	Manual + automated
`databaseCoverage`	`object \| null`	—	✅	Derived from enrichment
`corpusAvailability`	`object \| null`	—	✅	Bible Brain, OPUS, Lexibank
`keyboardSupport`	`object \| null`	—	✅	Keyman API, CLDR
`methodSupport`	`object`	✅	Partial	API verification
`metricModelSupport`	`object \| null`	—	✅	XLM-R paper, AfriCOMET paper
`metricPlugins`	`object \| null`	—	✅	Card enrichment; declares which metric plugin packs apply (e.g. `{ formalityMarkers: true }`)
`omt1600`	`object \| null`	—	✅	Meta assessment
`evalDatasets`	`string[]`	—	✅	Dataset registry
`pipelineReadiness`	`object \| null`	—	Partial	Derived + manual

resources.fsts[].install: FST entries in the resources object may carry an install sub-object with the fields repo, releaseTag, assetPattern, format, maturity, and optionally bundlePattern. This replaces the former hard-coded GIELLALT_FST_REGISTRY dictionary. See get_fst_install_info() in language_cards.py.
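
As an illustration of the shape described above, the following minimal sketch checks one resources.fsts[] entry for the listed install fields. The helper name missing_install_fields and all sample values are hypothetical; this is not the actual get_fst_install_info() in language_cards.py, whose signature is not reproduced here.

```python
# Hypothetical sketch: the field names come from the paragraph above; the
# helper itself and the sample values are illustrative assumptions.
REQUIRED_INSTALL_FIELDS = ("repo", "releaseTag", "assetPattern", "format", "maturity")
OPTIONAL_INSTALL_FIELDS = ("bundlePattern",)


def missing_install_fields(fst_entry):
    """Return the expected install fields absent from one resources.fsts[] entry.

    Returns None when the entry carries no install sub-object at all.
    """
    install = fst_entry.get("install")
    if install is None:
        return None
    return [name for name in REQUIRED_INSTALL_FIELDS if name not in install]


example_entry = {
    "name": "GiellaLT Plains Cree FST (lang-crk)",
    "type": "morphological-analyzer",
    "install": {  # every value below is a made-up placeholder
        "repo": "giellalt/lang-crk",
        "releaseTag": "example-tag",
        "assetPattern": "example-*.hfstol",
        "format": "hfstol",
        "maturity": "example",
    },
}
```

An entry without an install sub-object is reported as None rather than as an error, since the sub-object is optional.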

§ 10. Provenance Fields

Field	Type	Required	Automatable	Source
`dataSources`	`array \| object`	✅	✅	Auto + manual
`supportTier`	`string`	—	✅	Derived from card completeness
`humanReviewed`	`object \| null`	—	❌	Human reviewer
`notes`	`string \| null`	—	❌	Manual
`firstDocumented`	`number \| null`	—	✅	Glottolog CLDF
`lastDocumented`	`number \| null`	—	✅	Glottolog CLDF
`_generated`	`object \| null`	—	✅	Enrichment scripts
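
Because dataSources may be either a flat array or a per-field object, consumers need to handle both shapes. A minimal sketch of one way to normalize them follows; the function names and the "_card" bucket key are assumptions of this example and not part of the spec.

```python
def normalize_data_sources(data_sources):
    """Return dataSources as a {field: [source, ...]} mapping.

    A flat list carries no per-field information, so it is filed under the
    hypothetical key "_card" (an assumption of this sketch).
    """
    if isinstance(data_sources, dict):
        return {field: list(sources) for field, sources in data_sources.items()}
    return {"_card": list(data_sources)}


def all_sources(data_sources):
    """Return every distinct source ID, in first-seen order."""
    seen = []
    for sources in normalize_data_sources(data_sources).values():
        for source in sources:
            if source not in seen:
                seen.append(source)
    return seen
```

With this, code that only needs the list of consulted sources can call all_sources() without caring which shape a given card uses.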

Language Code Policy

Champollion uses ISO 639-3 as the canonical identifier. Other standard codes are registered as aliases and resolved to the ISO 639-3 code at runtime.

Priority	Standard	Example	Field	Usage
1 (canonical)	ISO 639-3	`crk`	`code`	Card filename, config key, API parameter
2 (alias)	ISO 639-1	`iu`	`aliases[]`	Accepted in the CLI, resolved to ISO 639-3
3 (alias)	BCP 47	`fil`	`aliases[]`	Accepted in the CLI, resolved to ISO 639-3
Reference	Glottocode	`plai1258`	`glottocode`	Classification only, not used at runtime

Resolution order: when a user supplies a code, the following checks run in sequence:

Direct match on card.code → found
Match in card.aliases[] → found, returns the canonical card
Match on card.iso639_1 → found (fallback)
No match → error
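
The resolution order above can be sketched as follows. This is a hypothetical Python helper for illustration only, not the project's resolveCode(); it assumes cards are available as a list of dicts and that matching is exact and case-sensitive.

```python
def resolve_code(user_code, cards):
    """Resolve a user-supplied code to a canonical ISO 639-3 card code.

    `cards` is a list of card dicts. Raises KeyError when nothing matches.
    """
    if not user_code:
        raise KeyError("No language code given")
    # 1. Direct match on card.code
    for card in cards:
        if card["code"] == user_code:
            return card["code"]
    # 2. Match in card.aliases[]
    for card in cards:
        if user_code in (card.get("aliases") or []):
            return card["code"]
    # 3. Fallback: match on card.iso639_1
    for card in cards:
        if card.get("iso639_1") == user_code:
            return card["code"]
    # 4. Nothing matched
    raise KeyError(f"Unknown language code: {user_code!r}")
```

Each stage scans all cards before the next stage starts, so a direct match on one card always wins over an alias match on another.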

Migration History: ISO 639-1 → ISO 639-3

Before v8, card filenames used ISO 639-1 codes where available (fr.json, de.json, ja.json). In the 639-3 migration, every card was renamed to its ISO 639-3 equivalent:

Before	After	Reason
`fr.json`	`fra.json`	639-3 is canonical
`de.json`	`deu.json`	639-3 is canonical
`zh.json`	`cmn.json`	Macrolanguage → default individual language
`ar.json`	`arb.json`	Macrolanguage → Modern Standard Arabic
`ms.json`	`zsm.json`	Macrolanguage → Standard Malay

What happened to the old codes?

The old 639-1 code is stored in card.iso639_1
The old 639-1 code is also listed in card.aliases[]
resolveCode("fr") returns "fra" at runtime, so existing callers keep working
Users can still write "fr" in their configuration; it is resolved transparently

What changed architecturally:

_deepMerge() now skips null values (the value is inherited from the parent card)
_deepMerge() now has a set of identity fields (code, extends, aliases are never inherited)
formality.default is now derived from the registers' isDefault: true flags
205 Grambank-derived cards received a structural formality.default fix
38 genus/family/macrolanguage cards serve as inheritance targets
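
To make the two _deepMerge() rules concrete, here is a minimal Python sketch of a merge that skips null values and never inherits identity fields. It is an illustration under stated assumptions, not the project's implementation: the function name, the recursion into nested objects, and the choice to let a child's empty array replace the parent's value are all assumptions.

```python
import copy

# Identity fields are never inherited from a parent card (per the list above).
IDENTITY_FIELDS = {"code", "extends", "aliases"}


def deep_merge(parent, child, top_level=True):
    """Merge a child card over its parent.

    - None values in the child are skipped, so the parent's value shows through
    - identity fields always come from the child, never from the parent
    - nested dicts are merged recursively; all other values are replaced
    """
    merged = {}
    for key, value in parent.items():
        if top_level and key in IDENTITY_FIELDS:
            continue  # never inherit identity
        merged[key] = copy.deepcopy(value)
    for key, value in child.items():
        if top_level and key in IDENTITY_FIELDS:
            merged[key] = copy.deepcopy(value)  # taken as-is, even when None
        elif value is None:
            merged.setdefault(key, None)  # keep the field present; inherit if set
        elif isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value, top_level=False)
        else:
            merged[key] = copy.deepcopy(value)
    return merged
```

Skipping null is what makes inheritance work with the uniform card structure: a child card must still list every top-level field, so "not set here" has to be expressed as null rather than by omitting the key.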

Special Cases

Sign Languages

Sign languages (e.g. ase, American Sign Language) are legitimate languages with ISO 639-3 codes. They have geography and speaker counts, but:

script is typically null (no standardized written form)
scripts may contain "Sgnw" (SignWriting) if a notation system is in use
dir is null
linguisticChallenges should cover spatial grammar, classifiers, etc.
gender.grammatical is typically false

Ancient & Historical Languages

Languages such as Latin (lat, isoType H) and Sanskrit (san, isoType H) are still used in specific contexts (liturgical, academic) but have no native speakers:

vitality can record "no native speakers" with "trend": "stable" (not declining: the community that uses the language is stable, just small)
speakerEstimates should note that the speakers are L2, not L1
firstDocumented / lastDocumented place them in time

Constructed Languages

Esperanto (epo, isoType C), Lojban, etc.:

classification may point to a "constructed" family or be null
contactInfluences reflects the source material (e.g. Esperanto draws on Romance, Germanic, and Slavic languages)
vitality is unusual: a growing speaker community, but no native homeland

Macrolanguages

Arabic (ara), Chinese (zho), Cree (cre), and Quechua (que) are macrolanguages that comprise several individual languages:

isoScope: "M"
varieties should list the individual languages with their ISO codes
methodSupport should reflect what the macrolanguage card supports (usually the standardized variety)
Individual varieties should also have their own cards

Languages Without a Standardized Orthography

Many languages (especially those with primarily oral traditions) have no standardized writing system, or have competing orthographies:

script is null
scripts is []
dir is null
notes should explain the orthographic situation
linguisticChallenges should note how this affects MT (e.g. no training data)

Diglossia

Languages such as Arabic (MSA vs. dialects) or Guaraní (Jopará vs. pure Guaraní):

codeSwitching captures the mixed-variety situation
registers can offer presets for the different levels
varieties can list the diglossic pair

Contact Influence Types

Type	Meaning	Example
`superstrate`	Dominant language imposed on a community	French → English (after 1066)
`substrate`	Native language that influences an imposed language	Celtic → English
`adstrate`	Neighboring language with mutual influence	Norse → English
`learned_borrowing`	Borrowings through education/scholarship	Latin → English
`lexical_borrowing`	Direct vocabulary borrowing through contact	Spanish → Filipino
`relexification`	Wholesale replacement of the vocabulary	Portuguese → Papiamentu

Contact Influence Depths

Depth	Meaning
`light`	A few loanwords, minimal structural impact
`moderate`	Significant vocabulary in specific domains
`heavy`	Pervasive vocabulary and some structural features
`structural`	Grammar, syntax, and phonology affected
`defining`	Core identity shaped by contact (creoles, mixed languages)

Writing Good Register Presets

Good preset prompts:

Name the formality feature explicitly (e.g. "해요체", "vous form", "siz form")
Spell out the specific pronoun or verb form to use
Give context for when this register is appropriate
Mention script considerations where applicable

Do not put gender-inclusive guidance in the preset prompt. Gender guidance belongs in card.gender.inclusiveGuidance, which is injected separately.

❌ Bad:  "Standard Thai. Professional register."
✔ Good: "Professional Thai. Use คุณ (khun) for second person, เรา (rao)
         for first person when needed. Clear, concise phrasing
         appropriate for digital interfaces."

Preset naming convention

Preset keys should be descriptive, lowercase, and hyphen-separated:

T-V languages: formal-vous, informal-tu, formal-Sie, casual-du
Speech levels: polite-haeyo, formal-hapsyo, casual-hae
Neutral: professional, neutral-professional
Code-switching: taglish-professional, pure-filipino
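
The naming rule can be expressed as a single pattern. The sketch below is one possible reading of it, with a hypothetical helper name `isValidPresetKey`; it is not the project's actual lint rule. It takes the lowercase requirement literally, so a key such as formal-Sie is flagged. Whether capitalized forms like that are an intended exception is not stated here.

```javascript
// Lowercase alphanumeric words joined by single hyphens.
const PRESET_KEY = /^[a-z0-9]+(?:-[a-z0-9]+)*$/;

// Hypothetical helper, not part of the project's linter.
function isValidPresetKey(key) {
  return PRESET_KEY.test(key);
}

const samples = ["formal-vous", "polite-haeyo", "professional", "Formal_Vous", "formal-Sie"];
for (const key of samples) {
  // Prints each key followed by true or false.
  console.log(key, isValidPresetKey(key));
}
```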

Enrichment procedure

Processing order per card

When enriching a card, consult the sources in this order. Document every source you consult, even if it returned no data.

ISO 639-3 registry → code, name, isoScope, isoType
ISO 639-3 macrolanguages.tab → macrolanguage
Glottolog languoid.csv → glottocode, classification, coordinates, countries
Glottolog CLDF → macroarea, isIsolate, firstDocumented, lastDocumented
Glottolog AES → vitality (endangerment status)
Wikidata SPARQL → nativeName, speakerEstimates, script, scripts, dir
CLDR → rules (typography, plurals, capitalization)
NLLB-200 / FLORES+ → methodSupport.nllb, evalDatasets
API verification → remaining methodSupport entries
ML model papers → metricModelSupport (XLM-R training data, AfriCOMET coverage). Script: node scripts/enrich-metric-model-support.mjs
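
The ordering above, together with the rule that every consulted source is recorded even when it returns nothing, can be sketched as a small driver loop. Everything named here is an assumption for illustration: the `enrichCard` function, the `lookups` map of stub fetchers, and the `{ name, returnedData }` shape of a dataSources entry. The sketch also applies the merge-only principle by filling a field only when it is still null, and the glottocode shown is a placeholder rather than a real value.

```javascript
// Source names in the documented consultation order.
const SOURCE_ORDER = [
  "ISO 639-3 registry",
  "ISO 639-3 macrolanguages.tab",
  "Glottolog languoid.csv",
  "Glottolog CLDF",
  "Glottolog AES",
  "Wikidata SPARQL",
  "CLDR",
  "NLLB-200 / FLORES+",
  "API verification",
  "ML model papers",
];

// Hypothetical driver. `lookups` maps a source name to a function that
// returns an object of field values, or null when the source has nothing.
function enrichCard(card, lookups) {
  const result = { ...card, dataSources: [...(card.dataSources ?? [])] };
  for (const name of SOURCE_ORDER) {
    const lookup = lookups[name];
    if (!lookup) continue; // source not consulted in this run
    const data = lookup(card.code);
    // Record the consultation even when nothing came back.
    result.dataSources.push({ name, returnedData: data !== null });
    for (const [field, value] of Object.entries(data ?? {})) {
      // Merge only: never overwrite a value that is already populated.
      // Simplification: only null/undefined counts as unpopulated here.
      if (result[field] === null || result[field] === undefined) {
        result[field] = value;
      }
    }
  }
  return result;
}

const enriched = enrichCard(
  { code: "crk", name: "Curated name", glottocode: null, macroarea: null, dataSources: [] },
  {
    // Registered out of order on purpose; SOURCE_ORDER decides the sequence.
    "Glottolog CLDF": () => null, // consulted, but no data
    "Glottolog languoid.csv": () => ({ glottocode: "abcd1234", name: "Other" }), // placeholder values
  },
);
console.log(enriched);
```

Driving the loop from a fixed list rather than from the keys of `lookups` keeps the consultation order independent of how the caller happens to register its fetchers.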

Handling conflicts

When sources disagree:

Store both values with source attribution
Do NOT average them or take a position
Record the discrepancy in the relevant note field
Prefer the most recent primary source only when a single value is needed for a computation
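
A minimal sketch of these rules, using the Wikidata and Ethnologue speaker figures from the design principles. The field names `source`, `value`, and `year`, the year values themselves, and both helper functions are assumptions made up for illustration. The sketch also assumes every stored entry already comes from a primary source.

```javascript
// Two disagreeing estimates, both kept with their source.
// The years are illustrative only, not real publication dates.
const speakerEstimates = [
  { source: "Wikidata", value: 50000, year: 2021 },
  { source: "Ethnologue", value: 20000, year: 2023 },
];

// Hypothetical helper: pick one entry for a computation without
// modifying the stored list. Chooses the entry with the latest year.
function pickForComputation(estimates) {
  if (estimates.length === 0) return null;
  return estimates.reduce((best, e) => (e.year > best.year ? e : best));
}

// Hypothetical helper: text for the note field when values differ.
function describeDiscrepancy(estimates) {
  const distinct = new Set(estimates.map((e) => e.value));
  if (distinct.size < 2) return null;
  return "Sources disagree: " +
    estimates.map((e) => `${e.source} gives ${e.value}`).join("; ");
}

console.log(describeDiscrepancy(speakerEstimates));
console.log(pickForComputation(speakerEstimates).value);
```

The stored array is never reduced to a single entry. The preferred value is derived at the point of use, so the disagreement stays visible in the card.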

Validation

Run the linter after every enrichment or manual edit:

node scripts/lint-language-cards.mjs              # all cards
node scripts/lint-language-cards.mjs --lang crk    # single card

PR checklist

When submitting a new or modified language card:

File is named <code>.json in shared/language-cards/
All top-level fields from the canonical template are present
classification is populated from Glottolog (not hand-written)
dataSources lists every source consulted
methodSupport entries are verified against the actual API language lists
contactInfluences entries have published sources or citation_needed: true
linguisticChallenges has 3–6 MT-relevant challenges (where researched)
rules is populated from CLDR (where locale data exists)
Linter passes with no errors
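
Two of the checklist items lend themselves to a quick local check before opening a PR. The sketch below is not a description of what scripts/lint-language-cards.mjs does. The helper `checkChecklistItems` is hypothetical, and it assumes that a contactInfluences entry keeps its citation in a `source` field and that linguisticChallenges is an array that stays empty until it has been researched.

```javascript
// Hypothetical pre-submit check covering two checklist items.
function checkChecklistItems(card) {
  const problems = [];

  // Each contact influence needs a source or an explicit citation_needed flag.
  for (const [i, entry] of (card.contactInfluences ?? []).entries()) {
    if (!entry.source && entry.citation_needed !== true) {
      problems.push(`contactInfluences[${i}] has neither a source nor citation_needed: true`);
    }
  }

  // An empty list means "not yet researched" and is allowed;
  // a populated list should hold between 3 and 6 challenges.
  const challenges = card.linguisticChallenges ?? [];
  if (challenges.length > 0 && (challenges.length < 3 || challenges.length > 6)) {
    problems.push(`linguisticChallenges has ${challenges.length} entries, expected 3 to 6`);
  }

  return problems;
}

const problems = checkChecklistItems({
  contactInfluences: [
    { type: "superstrate", citation_needed: true },
    { type: "adstrate" },
  ],
  linguisticChallenges: ["a", "b"],
});
console.log(problems);
```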

Reference standards

Standard	Maintained by	How we use it
ISO 639-3	SIL International	Canonical language codes, macrolanguage relationships
Glottolog	Max Planck Institute	Classification, coordinates, AES endangerment
WALS	Max Planck Institute	Genus definitions, typological features
ISO 15924	Unicode/ISO	Script codes
CLDR	Unicode Consortium	Locale data, plural rules, typography
Wikidata	Wikimedia Foundation	Speaker counts, endonyms, script data
Ethnologue	SIL International	EGIDS, speaker estimates, DLS
UNESCO Atlas	UNESCO	Endangerment classification
Katig Collective	UP Diliman	Philippine language capsules

See also: Citation practices for language cards, for detailed source-specific guidance.
