GLOSSARY

The linguistics, in plain language

Every technical term the language cards use — what it means, and why it makes machine translation harder. 127 terms, each with a citation to the standard reference.

A

absolutive

also: absolutive case

The unmarked case in an ergative system, covering intransitive subjects and transitive objects. It is usually the citation form of the noun.

Why it matters for MT: Absolutive nouns look identical in two different roles, so MT must use the verb to disambiguate.

abugida

also: alphasyllabary · abugidas

A script where each symbol is a consonant with a built-in default vowel, and other vowels are marked by modifying the base sign — as in Devanagari, Ethiopic, or Thai. Between an alphabet and a syllabary.

Why it matters for MT: Abugida vowel marks are combining characters that naïve text processing can strip or reorder, corrupting words.
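
As a rough illustration of the point above, the sketch below inspects the Devanagari spelling of "Hindi" with Python's standard unicodedata module and then applies a deliberately naive cleanup. The helper name strip_marks is hypothetical; it stands in for any filter that drops Unicode mark characters.

```python
import unicodedata

# "Hindi" in Devanagari, spelled out as code points so the marks are visible:
# HA, VOWEL SIGN I, NA, VIRAMA, DA, VOWEL SIGN II
WORD = "\u0939\u093f\u0928\u094d\u0926\u0940"

for ch in WORD:
    # Categories starting with "M" are combining marks (Mn, Mc, Me).
    print(f"U+{ord(ch):04X} {unicodedata.category(ch)} {unicodedata.name(ch)}")

def strip_marks(text):
    """Hypothetical naive cleanup: drop every Unicode mark character."""
    return "".join(
        ch for ch in text if not unicodedata.category(ch).startswith("M")
    )

# Half of the code points are marks, so the cleanup keeps only the three
# bare consonants and the result is no longer the same word.
damaged = strip_marks(WORD)
```

Three of the six code points are vowel signs or the virama, so a filter that looks harmless on Latin text removes half of this word.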

adposition

also: adpositions · preposition · prepositions · postposition · postpositions

The cover term for prepositions (before their noun: 'in the house') and postpositions (after it: Japanese uchi de 'house in'). A language's choice correlates strongly with its verb–object order.

Why it matters for MT: Preposition-to-postposition conversion flips the bracketing of every spatial and temporal phrase.

advanced tongue root

also: ATR · ATR harmony · [+ATR]

A vowel quality made by pushing the tongue root forward, expanding the throat. Many African languages split their vowels into +ATR and -ATR sets and require words to stay within one set (ATR harmony).

Why it matters for MT: ATR distinctions are inconsistently marked in orthographies, splitting words across spellings.

affix

also: affixes · affixation

A morpheme attached to a word stem to modify its meaning or grammar. Prefixes attach before the stem, suffixes after, and a few languages use infixes inside the stem.

Why it matters for MT: Where affixes attach determines where subword tokenizers should split words for the MT model.

agglutinative

also: agglutination · agglutinating · agglutinative morphology

A word-building style where each grammatical meaning gets its own clearly separable affix, stacked one after another — as in Turkish or Swahili. Words can get long, but each piece has one job.

Why it matters for MT: Agglutinative words segment cleanly into subwords, but only if the tokenizer learns the language's affix order.

agreement

also: concord · object agreement · verbal agreement · agreement patterns

When one word's form must match another's grammatical features — verbs matching their subjects, adjectives matching their nouns in gender and number. Some languages mark agreement with both subject and object on the verb.

Why it matters for MT: Generated text must satisfy all agreement chains; a single wrong feature produces multiple visible errors.

animacy

also: animate · inanimate · animacy hierarchy

A grammatical distinction between living (or living-like) and non-living referents. In Algonquian languages every noun is grammatically animate or inanimate, and verbs change shape entirely depending on which they combine with.

Why it matters for MT: Animacy drives verb choice and agreement in many Indigenous languages, with no overt cue in an English source.

Example: Plains Cree (crk): verb conjugation changes completely based on whether subject/object nouns are animate or inanimate (card field linguisticChallenges.animacy).

article

also: articles · definite article · indefinite article · article system

A small word that marks a noun as definite ('the') or indefinite ('a'). About a third of the world's languages have no articles at all, and some use the numeral 'one' or demonstratives instead.

Why it matters for MT: Inserting and deleting articles is one of the most frequent edit categories when translating between languages that have articles and languages that do not.

aspect

also: aspectual · aspect marking

Grammatical marking for how an event unfolds in time — completed, ongoing, habitual, repeated — independent of when it happened. Russian verbs come in perfective/imperfective pairs; English uses 'was doing' vs 'did'.

Why it matters for MT: Aspect choices must be made on every verb when translating into aspect-marking languages, often without source cues.

B

benefactive

also: benefactives

Marking that an action is done for someone's benefit — 'I baked her a cake'. Some languages mark this with a verb affix (an applicative) rather than word order or a preposition.

Why it matters for MT: Beneficiaries can hide inside verb morphology, so MT must detect and re-express them as separate phrases.

C

chrF

also: chrF++ · chrf

An automatic MT evaluation metric that scores character n-gram overlap between a system's output and reference translations, combining precision and recall into an F-score. Because it works on characters rather than words, it is fairer to morphologically rich languages than word-based metrics.

Why it matters for MT: chrF is often the preferred surface metric for agglutinative and polysynthetic languages, where word-level matching breaks down.
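
To make the character-level idea concrete, here is a minimal sketch of a chrF-style score. It is a simplification under stated assumptions, not the reference implementation: it ignores spaces, averages precision and recall over n-gram lengths 1 to 6, uses beta = 2, and will not reproduce the exact numbers of standard tools. The function names are hypothetical.

```python
from collections import Counter

def char_ngrams(text, n):
    """Count character n-grams, ignoring spaces (a simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))

def simple_chrf(hypothesis, reference, max_n=6, beta=2.0):
    """Simplified chrF-style F-score in [0, 1]; beta > 1 favors recall."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is shorter than n characters
        overlap = sum((hyp & ref).values())  # clipped n-gram matches
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta**2) * p * r / (beta**2 * p + r)

# Turkish: "in our houses" vs "from our houses" differ by one final letter.
# A word-level exact match scores this pair as a complete miss.
word_match = int("evlerimizde" == "evlerimizden")
char_score = simple_chrf("evlerimizde", "evlerimizden")
```

The two Turkish forms share almost all of their character n-grams, so the character-level score gives partial credit where a whole-word comparison gives none.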

classifier

also: classifiers · numeral classifier · numeral classifiers · measure word · classifier system · counter system · counters · noun classifiers

A word required when counting or referring to nouns, sorting them by type — like English 'three head of cattle' or 'two sheets of paper', but obligatory for everything. Chinese, Japanese, Thai and many other languages cannot count without the right classifier.

Why it matters for MT: MT must select the correct classifier for each noun; the wrong one is instantly visible to native speakers.

click

also: clicks · click consonants · click consonant

Consonants made by sucking air in, like the English 'tsk-tsk' — but used as ordinary speech sounds. Khoisan and some southern Bantu languages (Zulu, Xhosa) have full sets of click consonants.

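Why it matters for MT: Clicks are written either with special characters (ǃ, ǂ), which tokenizers and fonts often mishandle, or with repurposed Latin letters (c, q, x), which look like ordinary consonants to a system trained on other languages.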

clitic

also: clitics · proclitic · enclitic

A small grammatical word that cannot stand on its own and leans on a neighboring word, like the 's in "the queen of England's hat". Clitics behave partly like words and partly like affixes.

Why it matters for MT: Clitics blur word boundaries, so tokenizers may attach them to the wrong host and scramble the grammar.

clusivity

also: inclusive · exclusive · inclusive/exclusive · inclusive-exclusive · inclusiveExclusive

A distinction between two kinds of 'we': inclusive (me + you, maybe others) and exclusive (me + others, but not you). Hundreds of languages make this distinction obligatorily.

Why it matters for MT: Translating English 'we' requires choosing inclusive or exclusive with no source-side cue — and the wrong choice can be socially serious.

code-switching

also: code switching · codeswitching · codeSwitching · code-mixing

Alternating between two or more languages within one conversation or sentence, as bilingual speakers naturally do. It follows grammatical patterns rather than being random mixing.

Why it matters for MT: Mixed-language input confuses language detection and produces garbled output unless the system handles both codes.

COMET

also: AfriCOMET · comet score

A neural MT evaluation metric that uses a trained multilingual model to predict human quality judgments, rather than counting surface overlap. Variants like AfriCOMET extend coverage to African languages.

Why it matters for MT: COMET correlates better with human judgment than surface metrics, but only for languages its underlying model has seen.

comitative

also: comitative case

A case or marker meaning 'together with someone'. Some languages use the same marker for 'with' (accompaniment) and 'and' (coordination), which can be ambiguous to outsiders.

Why it matters for MT: Comitative/coordination overlap means 'X with Y' and 'X and Y' can be the same construction in the source.

compounding

also: compound nouns · compounds · compound words

Joining two or more independent words into one new word, like 'tooth' + 'brush' → 'toothbrush'. German-style compounding can chain many words into very long single tokens.

Why it matters for MT: Novel compounds are unseen tokens that MT must split into known parts to translate.

conjunct order

also: conjunct · independent order · conjunct verb

In Algonquian languages like Plains Cree, verbs come in distinct inflectional 'orders': the independent order for main statements and the conjunct order mainly for subordinate clauses, questions, and certain discourse contexts. The two use entirely different ending sets.

Why it matters for MT: Choosing independent vs conjunct forms is a clause-type decision English gives no direct cue for.

Example: Plains Cree (crk): described in the card's formality notes and crk-translate documentation.

consonant mutation

also: initial mutation · consonant mutations

A grammatical change to a word's first consonant triggered by the word before it. Welsh cath 'cat' becomes gath after the article: y gath 'the cat'. The dictionary form and the spoken form can look quite different.

Why it matters for MT: Mutation makes the same word appear under several spellings, fragmenting its statistics in MT training data.

copula

also: zero copula · copular

The linking verb 'be' in sentences like 'she is a doctor'. Many languages omit it (zero copula) in the present tense — Russian and Arabic say literally 'she doctor'.

Why it matters for MT: MT must insert copulas translating out of zero-copula languages and delete them going in.

creole

also: creoles · creole language · creolization

A full natural language that grew out of intense language contact, typically drawing most vocabulary from one language (the lexifier) while developing its own grammar. Haitian Creole, Tok Pisin, and Papiamento are examples.

Why it matters for MT: Creoles look deceptively like their lexifier in writing, so MT systems trained on the lexifier mistranslate them systematically.

D

dative

also: dative case

The case marking the recipient or beneficiary — the 'to whom' of giving and telling. German wem, Latin cui; English expresses it with word order or 'to'.

Why it matters for MT: MT must decide between dative case marking and prepositional phrasing depending on the target language.

definiteness

also: definite · indefinite · specific articles

Whether a noun refers to something the listener can already identify ('the dog') or not ('a dog'). Many languages never mark this distinction; others mark it with articles, affixes, or word order.

Why it matters for MT: Translating from an article-less language forces MT to decide 'the' vs 'a' for every noun phrase from context alone.

demonstrative

also: demonstratives · deixis

Pointing words like 'this' and 'that'. Languages divide pointing space differently — some have a three-way near-me / near-you / far split, others add visibility or elevation.

Why it matters for MT: Demonstrative systems rarely map one-to-one, so MT must approximate distance and visibility distinctions.

dependent marking

also: dependent-marking · dependentMarking

Putting grammatical marking on the dependents of a phrase — case endings on nouns, possessive endings on the possessor — while the head stays unmarked. Most European languages lean this way.

Why it matters for MT: Dependent marking puts the grammatical signal in noun endings, which tokenizers must segment correctly.

derivation

also: derivational

Building new words from existing ones, like teach → teacher or happy → unhappy. Unlike inflection, derivation creates a new dictionary word rather than a grammatical variant of the same word.

Why it matters for MT: Productive derivation lets speakers coin words on the fly that an MT system has never seen.

diacritic

also: diacritics · accent marks · tone marks · combining marks

Marks added to letters — accents, tildes, dots, macrons — to indicate tone, length, nasality, or different sounds. Some orthographies depend on them completely; users often omit them in casual typing.

Why it matters for MT: Diacritic-stripped text is a pervasive data-quality problem: it merges distinct words and mismatches clean training data.
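
A small sketch of both halves of that problem, using only Python's standard unicodedata module. The helper strip_diacritics is a hypothetical stand-in for what casual typing or aggressive preprocessing does to text; the Portuguese pair is chosen only as a familiar example.

```python
import unicodedata

def strip_diacritics(text):
    """Hypothetical lossy cleanup: decompose, then drop nonspacing marks."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if unicodedata.category(ch) != "Mn")

# Portuguese: two different words that differ only in the accent.
grandmother = "av\u00f3"  # avó
grandfather = "av\u00f4"  # avô
merged = {strip_diacritics(grandmother), strip_diacritics(grandfather)}
# Both collapse to the same string, so the contrast is lost.

# The mismatch problem: the same visible letter can be stored two ways.
composed = "\u00e9"      # é as one code point
combining = "e\u0301"    # e followed by a combining acute accent
same_raw = composed == combining
same_nfc = (
    unicodedata.normalize("NFC", composed)
    == unicodedata.normalize("NFC", combining)
)
```

Normalizing to one form (NFC here) fixes the storage mismatch without losing information, whereas stripping marks cannot be undone.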

differential object marking

also: DOM · differential marking

Marking some direct objects but not others, usually depending on animacy or definiteness. Spanish adds 'a' before human objects (veo a María) but not things (veo la casa).

Why it matters for MT: MT must decide per-object whether the marker belongs, using animacy cues the source may not show.

diglossia

also: diglossic

A stable situation where a community uses two varieties of a language for different purposes — a 'high' variety for writing and formal speech (Modern Standard Arabic) and a 'low' one for daily life (the spoken dialects). Speakers switch by context, not by choice.

Why it matters for MT: Training data skews toward the written 'high' variety, so MT fails on the spoken variety people actually use.

directional

also: directionals · directional affixes · directional prefixes

Verb marking that builds direction of motion into the verb itself — toward the speaker, away, upriver, uphill. Common in Mayan, Tibeto-Burman, and many Papuan languages.

Why it matters for MT: Directional meaning packed into verbs must be unpacked into adverbs or prepositions in the target.

downstep

also: downdrift

In many African tone languages, a step-down in pitch that affects all following high tones in the phrase. It is a tonal landmark that can itself distinguish meanings.

Why it matters for MT: Downstep is essentially never written, so tonal information is systematically absent from text data.

dual

also: dual number · trial

A grammatical number meaning exactly two, distinct from both singular and plural. Slovene, Arabic and many Oceanic languages have it; a few languages add trial (exactly three) in pronouns.

Why it matters for MT: MT into dual-marking languages must detect two-ness that the source expresses only by counting words.

E

ejective

also: ejectives · glottalized consonant · glottalized consonants

A consonant produced with a burst of air from the closed glottis instead of the lungs, giving a sharp popping quality. Common in languages of the Caucasus, the Americas, and East Africa; written with an apostrophe (k', t').

Why it matters for MT: Ejective marks are often dropped in casual typing, merging distinct words in the input text.

endonym

also: autonym · native name · exonym · endonyms

A community's own name for its language (endonym/autonym), as opposed to the name outsiders use (exonym). 'Deutsch' is the endonym for what English calls 'German'; many Indigenous communities prefer their endonyms.

Why it matters for MT: Name mismatches across databases cause language-identification and data-merging errors in MT pipelines.

ergativity

also: ergative · ergative-absolutive · ergative–absolutive · ergative alignment · ergative case · split ergativity

An alignment system where the subject of an intransitive verb ('she sleeps') is marked like the object of a transitive one ('saw her'), while transitive subjects get special ergative marking. Basque, many Mayan, Australian and Caucasian languages work this way.

Why it matters for MT: Ergative marking reverses the role cues MT expects, causing who-did-what-to-whom errors.

evidentiality

also: evidential · evidentials · evidential marking

Grammar that forces speakers to say how they know something — saw it, heard about it, inferred it. In Quechua or Turkish, leaving out the evidential is like leaving out tense in English.

Why it matters for MT: MT into evidential languages must state an information source the original text never specifies.

F

focus system

also: voice focus · voiceFocus · focus marker · symmetrical voice · Austronesian voice · focus construction

A clause system, best known from Philippine languages like Tagalog, where verb morphology selects which participant — actor, patient, location, instrument — is the grammatical pivot of the sentence. Often called symmetrical voice.

Why it matters for MT: Focus choice changes verb form, marker placement and word order at once, so MT cannot map clauses word-by-word.

Example: Surfaced on cards as linguisticChallenges.voiceFocus for many Austronesian languages.

FST

also: finite-state transducer · finite state transducer · FST-based

A finite-state transducer: a rule-based computational model that maps between word forms and their grammatical analyses. For morphologically complex languages, hand-built FSTs can analyze and generate word forms that statistical systems never saw.

Why it matters for MT: FSTs provide reliable morphological analysis and validation for languages too data-poor to learn morphology from text alone.

Example: Plains Cree (crk): the eval pack uses an FST-based semantic validator (card field evalMetrics.lyss-sem).

fusional

also: fusional morphology · inflecting language

A word-building style where one affix fuses several grammatical meanings at once. The Spanish ending -ó in habló marks past tense, third person, and singular simultaneously — no part of it can be assigned to just one meaning.

Why it matters for MT: Fused endings cannot be decomposed by tokenizers, so each combination must be learned as a unit.

G

gemination

also: geminate · geminated · double consonant

Holding a consonant longer to make a different word — Italian pala 'shovel' vs palla 'ball'. Some scripts write the doubling, others leave it to the reader.

Why it matters for MT: When the script omits gemination, distinct words collapse into one spelling and MT loses the contrast.

genitive

also: genitive case · possessive case

The case that marks possession or close association, like English 's or 'of'. Languages differ in whether the genitive phrase comes before or after the noun it modifies.

Why it matters for MT: Genitive order differences ('the king's horse' vs 'horse of-king') require systematic reordering inside noun phrases.

gerund

also: gerunds

A verb form used as a noun, like 'swimming' in 'swimming is fun'. Other languages use infinitives, verbal nouns, or special converb forms where English uses gerunds.

Why it matters for MT: English gerunds map to several different constructions depending on the target language.

glottal stop

also: glottal · ʔ · okina · ʻokina

The catch in the throat in the middle of 'uh-oh'. In many languages it is a full consonant that distinguishes words — Hawaiian writes it as the ʻokina (ʻ), and dropping it changes meanings.

Why it matters for MT: Glottal stops are frequently omitted or typed with the wrong apostrophe character, fragmenting words across spellings.

Example: Hawaiian (haw): the ʻokina is a phonemic consonant; card orthography notes flag apostrophe-variant issues.
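
One way a pipeline might handle the apostrophe variants is to map the common look-alikes to the ʻokina (U+02BB) before training or lookup. The sketch below assumes the input is already known to be Hawaiian text; applied blindly to mixed-language text it would also rewrite genuine apostrophes and quotation marks. The function name and the choice of look-alikes are illustrative assumptions, not part of any card or tool mentioned here.

```python
OKINA = "\u02bb"  # MODIFIER LETTER TURNED COMMA, the ʻokina

# Characters commonly typed in place of the ʻokina (an assumed, partial list).
LOOKALIKES = {
    "\u0027": OKINA,  # ASCII apostrophe
    "\u2018": OKINA,  # left single quotation mark
    "\u2019": OKINA,  # right single quotation mark
    "\u0060": OKINA,  # grave accent / backtick
}
TABLE = str.maketrans(LOOKALIKES)

def normalize_okina(text):
    """Map apostrophe look-alikes to U+02BB. Assumes Hawaiian-only input."""
    return text.translate(TABLE)

variants = ["Hawai\u0027i", "Hawai\u2018i", "Hawai\u2019i", "Hawai\u02bbi"]
normalized = {normalize_okina(v) for v in variants}
```

Before normalization the four spellings are four different strings to a tokenizer; afterwards they are one.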

glottocode

also: Glottolog code · glottocodes

A unique identifier from the Glottolog database (like stan1293 for English) covering languages, dialects, and families. More fine-grained than ISO codes and revised continuously by linguists.

Why it matters for MT: Glottocodes let pipelines join typological databases precisely, even for varieties without ISO codes.

grammatical case

also: case · case system · case marking · cases · case morphology · borderline case-marking

Marking nouns and pronouns to show their role in the sentence — who acts, who is acted on, where, with what. English keeps only traces (I/me/my); Finnish has fifteen cases; many languages have none.

Why it matters for MT: Case-marking languages often allow much freer word order, so MT must read roles from endings rather than position, and generate the right endings in return.

grammatical gender

also: gender · gender system · gender agreement · noun gender · gendered

Sorting all nouns into classes (often called masculine/feminine/neuter) that force matching forms on articles, adjectives, and sometimes verbs. The assignment is grammatical, not biological — a German table is masculine, a Spanish one feminine.

Why it matters for MT: Translating into a gendered language forces choices about people and things the source leaves unspecified, a major source of MT bias.

grammatical number

also: plural marking · plurality · plural · number marking · obligatory plural

How a language marks how many — singular, plural, and sometimes dual (exactly two) or paucal (a few). Some languages mark number on every noun; others leave it to context entirely.

Why it matters for MT: When the source does not mark number, MT must guess it; when the target requires dual forms, MT must supply them.

grammatical voice

also: voice · voice system

The grammatical system controlling which participant is the subject — active, passive, middle, and in some language families much richer systems. Voice reshapes the whole clause around a chosen perspective.

Why it matters for MT: Voice mismatches require restructuring whole clauses, not substituting words.

H

head marking

also: head-marking · headMarking

Putting the grammatical marking on the head of a phrase — the verb shows who its subject and object are, the possessed noun shows who owns it. The dependents (the nouns themselves) can stay bare.

Why it matters for MT: Head-marking concentrates clause grammar on the verb, the mirror image of the dependent-marking languages most MT data comes from.

honorific

also: honorifics · respectful speech · respect forms · honorific system

Grammatical or lexical forms that encode respect toward the listener or the person discussed — Japanese keigo, Korean speech levels, special kin-respect vocabularies. Often obligatory, not optional politeness.

Why it matters for MT: Honorific selection requires social knowledge (who outranks whom) that the source text rarely states.

I

imperfective

also: imperfective aspect

Aspect presenting an event from the inside — ongoing, habitual, or repeated, like 'she was writing' or 'she used to write'. The counterpart of perfective.

Why it matters for MT: Imperfective readings (ongoing vs habitual) must be disambiguated by context for correct translation.

implosive

also: implosives

Consonants made with air briefly sucked inward at the throat, like the ɓ and ɗ of Hausa or Vietnamese. They sound like emphatic b/d to untrained ears but are distinct phonemes.

Why it matters for MT: Implosive letters (ɓ, ɗ) are often typed as plain b/d, merging distinct words in text corpora.

infix

also: infixes · infixation

An affix inserted inside a word stem rather than before or after it. Tagalog, for example, turns sulat 'write' into sumulat 'wrote' by inserting -um- after the first consonant.

Why it matters for MT: Infixes break the assumption that a word's stem is a contiguous string, which defeats simple subword segmentation.
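
The sulat → sumulat example can be sketched as a few lines of Python. This is a toy rule under simplifying assumptions (plain lowercase stems, -um- placed before the first vowel) and not a full model of Tagalog morphology; the function name insert_um is hypothetical.

```python
VOWELS = set("aeiou")

def insert_um(stem):
    """Toy rule: place -um- before the first vowel of the stem."""
    for i, ch in enumerate(stem):
        if ch in VOWELS:
            return stem[:i] + "um" + stem[i:]
    return stem  # no vowel found: leave the stem unchanged

inflected = insert_um("sulat")

# The stem is no longer a contiguous substring of the inflected word,
# which is what defeats prefix/suffix-style segmentation.
stem_is_contiguous = "sulat" in inflected
```

A segmenter that only peels material off the edges of a word can never recover sulat from sumulat, because the stem has been split in two.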

inflection

also: inflectional · inflectional synthesis · inflected

Changing a word's form to express grammar — tense, number, case, gender — without changing its core meaning, like English sing/sang or cat/cats. Highly inflected languages can mark six or more categories on a single word.

Why it matters for MT: Every inflected form is a distinct surface string that an MT system must learn, so heavy inflection means more rare word forms and more agreement errors.

instrumental

also: instrumental case · instrumentals

A case meaning 'using/by means of' — Russian marks 'with a hammer' with an ending instead of a preposition. Polysynthetic languages may build the instrument right into the verb.

Why it matters for MT: Instrumental meaning shifts between case endings, prepositions, and verb-internal marking across languages.

interrogative

also: interrogatives · question particle · polar question · question marker

The grammar of asking questions. Yes/no questions may be marked by a particle, a verb form, word-order change, or intonation alone; content questions differ in whether 'who/what' moves to the front.

Why it matters for MT: If the source marks questions only by intonation, written input gives MT no signal that a question is being asked.

ISO 639-3

also: ISO 639 · iso639_3 · language code · three-letter code

The international standard of three-letter codes for the world's languages (eng, crk, haw), maintained by SIL. It aims to cover every known language, living or extinct, with one code each.

Why it matters for MT: ISO 639-3 codes are the join keys of multilingual NLP — wrong or ambiguous codes silently corrupt datasets.

isolating

also: analytic language · isolating morphology

A word-building style where words are mostly single morphemes and grammar is expressed by word order and helper words instead of endings — as in Vietnamese or Mandarin. The opposite extreme from polysynthesis.

Why it matters for MT: Isolating languages shift the MT problem from word segmentation to word order and function-word choice.

K

kinship terms

also: kin terms · kinship terminology · kinship system

The vocabulary for family relations, which different languages slice very differently — separate words for older vs younger siblings, or for maternal vs paternal uncles. Some systems encode the speaker's own position too.

Why it matters for MT: English 'uncle' or 'cousin' may have no single equivalent, forcing MT to choose among precise kin terms without the needed family facts.

L

language isolate

also: isolate · isolates · isIsolate

A language with no demonstrated relatives — a family of one. Basque, Ainu, and Burushaski are famous examples; isolates are surprisingly common worldwide.

Why it matters for MT: Isolates cannot borrow training signal from related languages, removing a key low-resource MT strategy (transfer learning).

language vitality

also: vitality · endangerment status · endangered · endangered language · dormant · sleeping language · vigorous · threatened

How robustly a language is being passed to children and used across life domains. Scales like EGIDS and Glottolog's endangerment status run from 'vigorous' through 'threatened' and 'moribund' to 'dormant' (no fluent speakers, but potential for revival).

Why it matters for MT: Vitality predicts data availability and, for community-driven MT, what role technology should play (e.g. revitalization support, not replacement).

lenition

also: lenited

The softening of a consonant, often between vowels or under grammatical triggers — a 'k' weakening toward 'g' or 'h'. In Celtic languages lenition is part of the grammar, not just pronunciation.

Why it matters for MT: Grammatical lenition changes word spellings in context, so surface text diverges from dictionary forms.

lexifier

also: lexifier language · lexified

The language that supplied most of a creole's or pidgin's vocabulary — English for Jamaican Patois, French for Haitian Creole, Portuguese for Papiamento. The grammar, however, is the creole's own.

Why it matters for MT: Shared vocabulary with the lexifier masks deep grammatical differences that MT must not gloss over.

loanword

also: loanwords · borrowing · borrowings · calque · English loanwords

A word taken from another language and adapted to local sound patterns, like 'sushi' in English or 'le weekend' in French. A calque borrows the structure instead, translating piece by piece ('skyscraper' → French 'gratte-ciel').

Why it matters for MT: Deciding whether to keep, adapt, or translate a loanword is a recurring choice in localization, especially for technical terms.

locative

also: locative case

A case meaning 'at/in/on' a place, expressed by a noun ending rather than a preposition. Finnish and Hungarian split location into several precise locative cases (inside, on top, near, motion toward, motion from).

Why it matters for MT: One English preposition can map to several locative cases, forcing MT to choose by context.

M

macrolanguage

also: macrolanguages

An ISO 639-3 bookkeeping category: a single code (like 'ara' Arabic or 'zho' Chinese) that covers several distinct member languages treated as one for historical or political reasons. Each member also has its own code.

Why it matters for MT: Data labeled with a macrolanguage code mixes mutually unintelligible varieties, contaminating training and evaluation.
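
A data pipeline might guard against this by flagging records whose label is a macrolanguage code rather than a member code. The sketch below is illustrative only: MACRO_MEMBERS is a tiny hand-written excerpt, not the full ISO 639-3 macrolanguage table, and the function name and record layout are assumptions.

```python
# Tiny illustrative excerpt, NOT the full ISO 639-3 macrolanguage table.
MACRO_MEMBERS = {
    "ara": {"arb", "arz"},  # Standard Arabic, Egyptian Arabic
    "zho": {"cmn", "yue"},  # Mandarin Chinese, Yue Chinese (Cantonese)
}

def split_by_code_kind(records):
    """Separate (code, text) records labeled with a macrolanguage code."""
    macro, specific = [], []
    for code, text in records:
        target = macro if code in MACRO_MEMBERS else specific
        target.append((code, text))
    return macro, specific

records = [("zho", "sample 1"), ("cmn", "sample 2"), ("crk", "sample 3")]
macro, specific = split_by_code_kind(records)
# Records in `macro` need a closer look: "zho" alone does not say
# whether the text is Mandarin, Cantonese, or another member language.
```

Flagged records can then be routed to language identification or manual review instead of being merged silently with data for a specific member language.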

mood

also: modality · grammatical mood

Grammatical marking for the speaker's stance toward an event — fact, wish, command, possibility. Indicative, subjunctive and imperative are the familiar European moods; other languages mark finer shades.

Why it matters for MT: Mood selection (especially subjunctive) follows target-language rules that cannot be copied from the source.

mora

also: moraic · morae

A timing unit smaller than the syllable: a short syllable counts one mora, a long vowel or a final consonant adds another. Japanese rhythm, poetry, and even abbreviations count morae, not syllables.

Why it matters for MT: Mora-based phonology shapes how loanwords and names are adapted, affecting transliteration quality.

moribund

A vitality status meaning the language is no longer being learned by children; the remaining fluent speakers are all older adults. Without intervention, such a language becomes dormant within a generation.

Why it matters for MT: Moribund languages have shrinking speaker pools to validate MT output, raising the stakes of every data decision.

morpheme

also: morphemes

The smallest piece of a word that carries meaning. The English word 'unhappiness' contains three: un-, happy, and -ness. Languages differ enormously in how many morphemes they pack into one word.

Why it matters for MT: MT systems work on tokens, and a token that contains many morphemes hides grammar the system needs to translate correctly.

morphology

also: morphological complexity · morphologically complex · complex morphology

The study of how words are built from smaller meaningful parts, and the word-building system of a language itself. A morphologically complex language expresses with word endings what English expresses with separate words.

Why it matters for MT: Complex morphology multiplies the distinct word forms an MT system must learn from limited data.

morphosyntactic alignment

also: alignment · alignment of verbal person marking

The system a language uses to group the three core roles — intransitive subject, transitive subject, and object — for marking purposes. Nominative–accusative and ergative–absolutive are the two big patterns; some languages split between them or use animacy-driven systems.

Why it matters for MT: Alignment mismatch between source and target is a structural translation problem, not a vocabulary one.

N

nasal vowel

also: nasal vowels · nasalization · nasalized · vowel nasalization

A vowel pronounced with air flowing through the nose, as in French bon or Portuguese são. Where it is contrastive, oral and nasal vowels distinguish different words.

Why it matters for MT: Nasalization marks (ã, ę, ą) are commonly dropped in informal typing, merging word pairs.

negation

also: negative morpheme · negator · negative marker · standard negation

How a language says 'not'. Strategies include particles (English not), affixes on the verb, special negative verbs, and two-part constructions like French ne…pas. Position varies: before the verb, after it, or both.

Why it matters for MT: Negation errors invert meaning entirely, making correct negative placement one of the highest-stakes MT requirements.

nominative

also: accusative · nominative-accusative · nominative–accusative

In the most familiar alignment system, the subject of any verb takes nominative case and the direct object takes accusative. Most European languages work this way, so it is what MT training data overwhelmingly reflects.

Why it matters for MT: Systems trained mostly on nominative–accusative languages misassign roles when translating ergative languages.

noun class

#

also: noun classes · noun class agreement · noun-class

A gender-like system with many classes — Bantu languages typically have 10–20, sorting nouns by shape, animacy, size and more. Each class triggers its own agreement prefixes across the sentence.

Why it matters for MT: Every noun choice ripples agreement markers through the whole clause, so one wrong class produces many visible errors.

noun incorporation

#

also: incorporation · incorporated noun

Folding a noun into the verb to make one word, roughly like turning 'hunt seals' into 'seal-hunt' as a verb. Common in polysynthetic languages, where it changes the sentence's emphasis and grammar.

Why it matters for MT: An incorporated noun disappears from the sentence as a separate word, so alignment-based translation loses it.

Example: Surfaced on cards as linguisticChallenges.nounIncorporation (Grambank-based), e.g. Plains Cree (crk).

O

oblique

#

also: obliques · oblique argument · oblique phrase

Any phrase in the clause that is neither subject nor direct object — typically locations, instruments, recipients and other 'extras', often marked with a case or adposition. In WALS, 'X' in orders like VOX stands for the oblique phrase.

Why it matters for MT: Where obliques sit in the sentence varies by language and must be reordered correctly around verb and object.

obviation

#

also: obviative · fourth person · proximate

A system, central to Algonquian languages like Plains Cree, that ranks third persons in a stretch of discourse: one is 'proximate' (in focus) and any others are 'obviative' (marked as backgrounded, sometimes called fourth person). It tracks who is who without pronouns like 'he₁ vs he₂'.

Why it matters for MT: English has no obviation, so MT into Cree must invent proximate/obviative assignments and keep them consistent across sentences.

Example: Plains Cree (crk) marks obviative referents on nouns and verbs; see the crk card and crk-translate documentation.

orthography

#

also: orthographic · spelling system · orthographies · orthographic status

The agreed rules for writing a language in its script — which letters, diacritics, and spellings are correct. Some languages have multiple competing orthographies or none standardized at all.

Why it matters for MT: Competing or unstandardized orthographies split scarce training data into incompatible spelling variants.

P

paradigm

#

also: inflectional paradigm · paradigms

The full set of inflected forms a word can take, like a verb conjugation table. Paradigms range from a single form (English 'must', which never changes) to thousands in polysynthetic languages.

Why it matters for MT: Large paradigms guarantee that most word forms are rare or unseen in training data.

parallel corpus

#

also: parallel text · bitext · parallel corpora · parallel data

A collection of texts paired with their translations, aligned sentence by sentence. Parallel corpora are the primary fuel for training and evaluating MT systems.

Why it matters for MT: The size and domain of available parallel data are among the strongest predictors of MT quality for a language pair.

participle

#

also: participles · participial

A verb form that acts like an adjective or builds compound tenses — 'the running water', 'has eaten'. Languages differ in how many participles they have and what they are used for.

Why it matters for MT: Participial clauses often replace relative clauses in other languages, requiring structural conversion.

particle

#

also: particles · sentence-final particle · discourse particle · topic marker

A small, uninflected function word that adds grammatical or attitudinal meaning — question markers, topic markers, politeness softeners. East Asian languages make heavy use of sentence-final particles.

Why it matters for MT: Particles carry meaning (questionhood, attitude, topic) that MT must re-express by entirely different means.

passive

#

also: passive voice · passive constructions

A construction that promotes the object to subject and demotes or drops the doer: 'the window was broken (by the boy)'. Many languages lack a passive entirely or use other strategies to background the agent.

Why it matters for MT: Passive-less target languages force MT to restructure passives into actives, inventing or recovering the agent.

perfective

#

also: perfective aspect

Aspect presenting an event as a complete whole — 'she wrote the letter' viewed as one finished fact. Often paired with imperfective in a grammatical opposition.

Why it matters for MT: Choosing perfective vs imperfective wrongly is among the most common MT errors into Slavic languages.

pharyngeal

#

also: pharyngeals · pharyngeal consonants

Consonants made by squeezing the throat (pharynx), like Arabic ʿayn (ع). They are rare worldwide and hard for non-native speakers to hear or produce.

Why it matters for MT: Pharyngeals are romanized many ways (ʿ, ', 3, or nothing), creating spelling chaos in informal text.

phoneme

#

also: phonemes · phonemic · phoneme inventory · consonant inventory · vowel inventory

A speech sound that distinguishes words in a particular language — swap one phoneme for another and you get a different word (pat vs bat). A language's phoneme inventory ranges from about a dozen sounds to well over a hundred.

Why it matters for MT: Inventory size and content determine how foreign names and loanwords get reshaped in the language.

pidgin

#

also: pidgins

A simplified contact language with no native speakers, created for trade or work between groups with no common tongue. When children grow up speaking one natively, it becomes a creole.

Why it matters for MT: Pidgins have high variability and thin text data, making consistent MT especially hard.

pitch accent

#

also: pitch-accent

A system where pitch distinguishes words, but only one syllable per word carries the distinctive pitch — Japanese háshi 'chopsticks' vs hashí 'bridge'. Lighter than full tone, heavier than pure stress.

Why it matters for MT: Like tone, pitch accent is rarely written, so homographs multiply in text.

politeness

#

also: politeness distinctions · formality · formality system · politeness levels

The linguistic encoding of social relationships — through pronoun choice, verb endings, particles, or vocabulary. Languages range from no grammatical politeness to elaborate multi-level systems.

Why it matters for MT: A translation can be lexically perfect and still fail by choosing the wrong politeness level for the situation.

polypersonal agreement

#

also: polypersonalism · polypersonal

Verb agreement with more than one participant at once — the verb carries markers for both subject and object (and sometimes more). Basque, Georgian, and Algonquian languages do this systematically.

Why it matters for MT: The verb form encodes who acts on whom, so MT must resolve both roles before it can produce a single correct verb.

polysynthesis

#

also: polysynthetic · polysynthetic language

A word-building style where a single verb can contain what other languages express as a whole sentence — subject, object, location, instrument and more, all as parts of one word. Many Indigenous American languages work this way.

Why it matters for MT: Polysynthetic words rarely repeat exactly, so word-level MT sees an endless stream of unknown tokens.

Example: Plains Cree (crk): a single verb can incorporate subject/object pronouns, instrumentals, locations, and actions (card field linguisticChallenges.polysynthesis).

possession

#

also: possessive · possessives · alienable · inalienable · possessive affixes

How a language expresses 'my X / your X'. Many languages distinguish inalienable possession (body parts, kin — things you cannot give away) from alienable, marking them with different constructions.

Why it matters for MT: Inalienable possession often requires obligatory possessor marking that English sources omit.

prefix

#

also: prefixes · prefixing

An affix that attaches to the front of a word stem, like re- in 'rewrite'. Some languages, including many Bantu and Athabaskan languages, carry most of their grammar in strings of prefixes.

Why it matters for MT: Prefix-heavy languages put grammatical information at the start of words, the opposite of what suffix-trained tokenizers expect.

R

reduplication

#

also: reduplicated · reduplicative

Repeating all or part of a word to change its meaning — to mark plurals, intensity, or ongoing action. Indonesian orang 'person' becomes orang-orang 'people'.

Why it matters for MT: MT must recognize that a doubled word is grammar, not an accidental repetition to be deleted.
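
As a rough sketch of what "recognizing" a doubled word could look like, the Python snippet below flags full reduplication. It assumes the hyphenated spelling used in Indonesian (orang-orang); `is_full_reduplication` is a hypothetical name, and partial reduplication or unhyphenated orthographies are not covered.

```python
import re

# A word, a hyphen, then the very same word again (backreference \1).
FULL_REDUPLICATION = re.compile(r"^(\w+)-\1$")


def is_full_reduplication(token: str) -> bool:
    """True for tokens like 'orang-orang', False for ordinary hyphenated words."""
    return FULL_REDUPLICATION.match(token) is not None


for token in ("orang-orang", "orang", "orang-utan"):
    print(token, is_full_reduplication(token))
```

A preprocessing step could use such a check to protect these tokens from deduplication filters that delete repeated words.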

register

#

also: registers · speech register · register-levels · speech levels

A variety of a language tied to social context — formal, casual, ceremonial, technical. Some languages grammaticalize registers: Javanese has distinct vocabulary sets for different politeness levels.

Why it matters for MT: MT must hold a consistent register; mixing formal and casual forms in one output reads as broken or rude.

relative clause

#

also: relative clauses · relativization

A clause that modifies a noun: 'the book that I read'. Languages place it before or after the noun, and use strategies from relative pronouns to gaps to special verb forms.

Why it matters for MT: Prenominal relative clauses (Japanese, Turkish) require inverting long stretches of text relative to English order.

romanization

#

also: transliteration · romanized · latinization · scriptConverter

Writing a language in Latin letters instead of its native script, by rule (transliteration) or by sound. One language often has several competing romanization standards.

Why it matters for MT: Romanized and native-script text behave as different languages to an MT model unless explicitly converted.

root

#

also: roots · word root

The irreducible core of a word once all affixes are stripped away. In Semitic languages a root is often just three consonants (like k-t-b 'write' in Arabic) that vowel patterns turn into words.

Why it matters for MT: Languages whose roots are discontinuous (consonant skeletons) need specialized segmentation for MT to see word relationships.

root-and-pattern morphology

#

also: root pattern · rootPattern · templatic morphology · nonconcatenative morphology · root-pattern morphology

A word-building style, typical of Arabic and Hebrew, where a consonant root like k-t-b 'write' is threaded through vowel templates: kitāb 'book', kātib 'writer', maktab 'office'. The root and the pattern each carry meaning, but neither is a contiguous chunk.

Why it matters for MT: Standard subword tokenization cannot see the shared root across these forms, weakening generalization in Semitic-language MT.

Example: Surfaced on cards as linguisticChallenges.rootPattern (e.g. Arabic, Hebrew, Maltese).
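
The shared root in the forms above can be made visible with a toy Python sketch. It assumes a simplified romanization in which a, i, u and their long counterparts are the only vowels; `consonant_skeleton` and `contains_root` are hypothetical helpers, not part of any card tooling or morphological analyzer.

```python
import unicodedata

# Assumption: the vowel letters of this simplified romanization.
VOWELS = set(unicodedata.normalize("NFC", "aiuāīū"))


def consonant_skeleton(word: str) -> str:
    """Keep only the consonants of a romanized word, in order."""
    word = unicodedata.normalize("NFC", word)
    return "".join(ch for ch in word if ch not in VOWELS)


def contains_root(word: str, root: str) -> bool:
    """Check that the root consonants occur in order inside the skeleton."""
    remaining = iter(consonant_skeleton(word))
    return all(consonant in remaining for consonant in root)


for form in ("kitāb", "kātib", "maktab"):
    print(form, consonant_skeleton(form), contains_root(form, "ktb"))
```

The three forms share almost no contiguous substring, yet all of them contain k, t, b in order, which is the relationship a plain subword split does not expose.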

S

sandhi

#

also: tone sandhi · external sandhi

Sound changes that happen where words or morphemes meet, like 'don't you' becoming 'dontcha'. In tone languages, tones themselves can change in context (tone sandhi).

Why it matters for MT: Sandhi makes written or transcribed forms context-dependent, complicating consistent tokenization.

script

#

also: writing system · scripts

The set of symbols a language is written in — Latin, Cyrillic, Arabic, Han characters, and many more. One language can use several scripts (Serbian), and one script can serve hundreds of languages.

Why it matters for MT: Script identity drives every downstream text process: encoding, tokenization, and which MT models even accept the input.

serial verb construction

#

also: serial verbs · serialVerbs · verb serialization · serial verb

Stringing several verbs together in one clause with no 'and' or 'to' between them — 'take knife cut bread' for 'cut the bread with a knife'. Common in West African, Southeast Asian and creole languages.

Why it matters for MT: Serial verbs must be decomposed into prepositions or subordinate clauses when translating into European languages, and rebuilt going the other way.

stem

#

also: stems · word stem

The core part of a word that affixes attach to. In 'unbelievable', the stem of -able is 'believe'. In many languages a stem never appears alone and must carry at least some inflection.

Why it matters for MT: Identifying shared stems across word forms is how MT systems generalize from limited training data.

subjunctive

#

also: subjunctive mood

A verb mood for non-asserted content — wishes, doubts, hypotheticals, and clauses after certain verbs. Romance languages require it in many subordinate clauses where English uses plain forms.

Why it matters for MT: Subjunctive triggers are target-language-specific, so MT must apply grammar rules rather than translate forms.

suffix

#

also: suffixes · suffixing

An affix that attaches to the end of a word stem, like -ness in 'kindness'. Suffixing is the most common affixation strategy across the world's languages.

Why it matters for MT: Stacked suffixes create long, rare word forms that MT systems must segment correctly to translate.

suppletion

#

also: suppletive

When a word's inflected forms come from completely different roots, like go/went or good/better. The grammar treats them as one word even though they share no sounds.

Why it matters for MT: Suppletive forms cannot be derived by rule, so MT must have seen each one in training data.

syllabary

#

also: syllabic script · syllabaries

A script with one symbol per syllable rather than per sound — Japanese kana or Cherokee. Works best for languages with simple syllable structures.

Why it matters for MT: Syllabaries change the granularity of text: tokenizers see syllables, not consonants and vowels.

syllabics

#

also: Canadian Aboriginal Syllabics · UCAS · Cans

The script family used for Cree, Inuktitut, Ojibwe and other Indigenous Canadian languages, where a character's shape encodes the consonant and its orientation encodes the vowel. Invented in the 1840s and still in active community use.

Why it matters for MT: Many Cree/Inuktitut texts exist in both syllabics and roman orthography, so MT pipelines need reliable script conversion.

Example: Plains Cree (crk): card script is Cans (Canadian Aboriginal Syllabics) with a roman-orthography converter.
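
Before any conversion, a pipeline first has to tell which script a line is in. The Python sketch below is one hedged way to do that, using the two Unicode blocks "Unified Canadian Aboriginal Syllabics" (U+1400 to U+167F) and its "Extended" block (U+18B0 to U+18FF). The function names and the majority threshold are assumptions for illustration, and the actual card converter is not shown here.

```python
# Unicode blocks for Canadian Aboriginal Syllabics (main and Extended).
UCAS_BLOCKS = ((0x1400, 0x167F), (0x18B0, 0x18FF))


def is_syllabics_char(ch: str) -> bool:
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in UCAS_BLOCKS)


def guess_script(text: str) -> str:
    """Return 'Cans' if most letters are syllabics, else 'other' or 'unknown'."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return "unknown"
    syllabic = sum(is_syllabics_char(ch) for ch in letters)
    return "Cans" if syllabic > len(letters) / 2 else "other"


# Three arbitrary syllabics characters (E, I, O) versus a romanized word.
print(guess_script("\u1401\u1403\u1405"))
print(guess_script("nêhiyawêwin"))
```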

T

T-V distinction

#

also: tu/vous distinction · T-V · formal/informal pronouns · tu-vous · formal you

Having two (or more) words for 'you' depending on social distance, like French tu/vous or German du/Sie. The choice signals respect, intimacy, or hierarchy and is hard to undo once made.

Why it matters for MT: English 'you' gives no cue, so MT must choose formality from context — a frequent and socially visible error.

Example: French (fra): card formality data distinguishes tu/vous usage contexts.

tense

#

also: tenses · past tense · future tense · tense-aspect

Grammatical marking that locates an event in time — past, present, future. Some languages have no grammatical tense at all (Mandarin), while others distinguish several degrees of past remoteness.

Why it matters for MT: Tenseless source text forces MT to infer time reference; remoteness systems demand finer distinctions than the source provides.

tokenization

#

also: tokenizer · subword · subword segmentation · tokenize · tokenization and alignment

Splitting text into the units (tokens) a translation model actually processes. Modern systems split rare words into subword pieces; how well those pieces line up with real morphemes varies hugely by language.

Why it matters for MT: Bad tokenization is a root cause of MT failure for morphologically rich and low-resource languages.
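
How subword pieces can miss real morphemes is easiest to see with a toy segmenter. The Python sketch below uses greedy longest-match over a hand-made vocabulary; both vocabularies are invented for illustration, and this is not the algorithm of any particular tokenizer library.

```python
def greedy_segment(word: str, vocab: set[str]) -> list[str]:
    """Split a word by repeatedly taking the longest vocabulary piece."""
    pieces = []
    start = 0
    while start < len(word):
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No piece matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces


morpheme_like = {"un", "believ", "able"}
frequency_like = {"unbe", "un", "believ", "liev", "able"}

print(greedy_segment("unbelievable", morpheme_like))   # ['un', 'believ', 'able']
print(greedy_segment("unbelievable", frequency_like))  # ['unbe', 'liev', 'able']
```

With the second vocabulary the longer piece "unbe" wins, so the split cuts across the prefix and the stem. In a morphologically rich language the same effect repeats at every affix boundary.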

tone

#

also: tonal · tone system · tonal language · tones · lexical tone · toneSystem · contour tone

Using voice pitch to distinguish words: Mandarin mā 'mother' vs mǎ 'horse'. Simple systems contrast two levels; complex systems (many West African and Southeast Asian languages) use several levels and contours.

Why it matters for MT: Tone is usually invisible in romanized or unmarked text, collapsing distinct words into one spelling for the MT system.

treebank

#

also: treebanks · UD treebank · Universal Dependencies

A corpus of sentences hand-annotated with grammatical structure (parse trees). The Universal Dependencies project maintains treebanks in a shared format for 150+ languages.

Why it matters for MT: Treebank existence signals serious NLP infrastructure for a language and enables syntax-aware evaluation.

U

umlaut

#

also: i-mutation

A vowel change caused historically by a following vowel, surviving as grammar: German Apfel 'apple' → Äpfel 'apples'. Related to ablaut but with a different historical origin.

Why it matters for MT: Umlaut puts plural and other grammatical marking inside the stem, invisible to affix-based segmentation.

uvular

#

also: uvulars · uvular consonants · uvular consonant

Consonants made at the very back of the mouth against the uvula, like the Arabic q or French r. Rarer than velar k/g sounds and a signature of certain language areas.

Why it matters for MT: Uvulars are often romanized inconsistently (q/k/kh), splitting one word into several text forms.

V

valency

#

also: valence · valency patterns · valencyPatterns

How many participants a verb requires and how it encodes them — 'sleep' takes one, 'give' takes three. Languages disagree about which participants particular verbs take and how they are marked.

Why it matters for MT: Valency mismatches make literal translations assign the wrong roles or drop required participants.

vigesimal

#

also: base-20 · vigesimal system

A counting system based on twenty rather than ten. Maya, Yoruba, Nahuatl, and (partly) Danish count this way, and French keeps traces of it: 'eighty' is literally 'four twenties' (quatre-vingts).

Why it matters for MT: Number-word translation across different bases is error-prone, and numbers are high-stakes content.

Example: Surfaced on cards via numeralSystem.baseType from Numeralbank data.
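
The arithmetic behind a base-20 number word can be sketched in a few lines of Python. `base20_digits` is a hypothetical helper that only decomposes the value; real numeral systems add irregular word forms on top, and nothing here reflects how Numeralbank encodes its data.

```python
def base20_digits(n: int) -> list[int]:
    """Decompose a non-negative integer into base-20 digits, most significant first."""
    if n < 0:
        raise ValueError("expected a non-negative integer")
    digits = []
    while True:
        n, remainder = divmod(n, 20)
        digits.append(remainder)
        if n == 0:
            break
    return digits[::-1]


print(base20_digits(80))   # [4, 0]  -> four twenties
print(base20_digits(99))   # [4, 19] -> four twenties and nineteen
print(base20_digits(400))  # [1, 0, 0]
```

French quatre-vingt-dix-neuf 'ninety-nine' follows the second pattern: four twenties plus nineteen.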

vowel harmony

#

also: vowel-harmony

A rule that all vowels in a word must agree in some property, such as front/back or rounded/unrounded. In Turkish, suffix vowels change shape to match the stem: ev-ler 'houses' but at-lar 'horses'.

Why it matters for MT: Each suffix has several surface forms, multiplying the token variants MT must learn for one grammatical ending.
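
For the Turkish plural, the choice between the two surface forms can be written as a small rule. The Python sketch below assumes lowercase input and the regular two-way front/back harmony only; `plural` is a hypothetical helper, and exceptions such as some loanwords are deliberately ignored.

```python
import unicodedata

# Turkish vowels grouped by backness (normalized so composed letters compare equal).
FRONT_VOWELS = set(unicodedata.normalize("NFC", "eiöü"))
BACK_VOWELS = set(unicodedata.normalize("NFC", "aıou"))


def plural(stem: str) -> str:
    """Attach -ler after a front vowel and -lar after a back vowel."""
    stem = unicodedata.normalize("NFC", stem)
    for ch in reversed(stem):
        if ch in FRONT_VOWELS:
            return stem + "ler"
        if ch in BACK_VOWELS:
            return stem + "lar"
    raise ValueError(f"no vowel found in {stem!r}")


for stem in ("ev", "at", "göz", "kız"):
    print(stem, plural(stem))
```

The last vowel of the stem decides the suffix, so one grammatical ending already yields two token variants. Suffixes with four-way harmony double that again.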

vowel length

#

also: long vowel · long vowels · vowel-length

Holding a vowel longer to make a different word — Finnish tuli 'fire' vs tuuli 'wind'. Scripts mark it with double letters, macrons (ā), or not at all.

Why it matters for MT: When length marking is optional in practice (e.g. the Hawaiian kahakō or the Māori macron), the same word appears in multiple spellings.

Example: Hawaiian (haw): the kahakō (macron) marks long vowels and is often omitted in casual text.

W

word order

#

also: basic word order · dominant word order · constituent order · SOV · SVO · VSO · VOS · OVS · wordOrder · free word order · flexible word order

The typical arrangement of subject (S), object (O), and verb (V) in a plain statement. SOV (Japanese) and SVO (English) cover most languages; VSO (Welsh, Tagalog-type) is the third common type, and some languages have no fixed order at all.

Why it matters for MT: Word-order mismatch is the single largest driver of reordering errors between language pairs.

writing direction

#

also: right-to-left · RTL · left-to-right · LTR · writingDirection · bidirectional text

Which way the script runs: left-to-right (Latin), right-to-left (Arabic, Hebrew), or top-to-bottom (traditional Mongolian script, classical Chinese). Mixed-direction text needs special handling.

Why it matters for MT: RTL and bidirectional text break naive string handling — numbers, punctuation and embedded Latin words reorder unpredictably.
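
A minimal Python sketch of direction-aware handling is shown below. It uses the standard library's `unicodedata.bidirectional` and a first-strong-character heuristic in the spirit of the Unicode Bidirectional Algorithm's paragraph-level rule; `base_direction` is a hypothetical helper and is far from a full bidi implementation.

```python
import unicodedata


def base_direction(text: str) -> str:
    """Guess paragraph direction from the first strongly directional character."""
    for ch in text:
        bidi_class = unicodedata.bidirectional(ch)
        if bidi_class == "L":            # strong left-to-right (e.g. Latin letters)
            return "ltr"
        if bidi_class in ("R", "AL"):    # strong right-to-left (Hebrew, Arabic)
            return "rtl"
    return "neutral"                     # only digits, spaces, punctuation, ...


for sample in ("hello", "שלום", "123 ع", "123"):
    print(repr(sample), base_direction(sample))
```

Digits and spaces carry no strong direction, which is why the third sample is classified by the Arabic letter that follows them.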

These terms are marked with a dotted underline wherever they appear in card prose — hover one on any card in the Atlas or the trading-card atlas for the short version, then follow “More →” back here.