Glossary | champollion

A

ablaut

also: apophony · vowel gradation

Changing a vowel inside a word to change its grammar, like English sing/sang/sung. The word's consonant frame stays put while the vowel does the grammatical work.

Why it matters for MT: Ablaut hides inflection inside the stem where subword tokenizers cannot isolate it.

Source: SIL Glossary of Linguistic Terms

Related: root-and-pattern morphology · umlaut · inflection

absolutive

also: absolutive case

The unmarked case in an ergative system, covering intransitive subjects and transitive objects. It is usually the citation form of the noun.

Why it matters for MT: Absolutive nouns look identical in two different roles, so MT must use the verb to disambiguate.

Source: SIL Glossary of Linguistic Terms

Related: ergativity · grammatical case

abugida

also: alphasyllabary · abugidas

A script where each symbol is a consonant with a built-in default vowel, and other vowels are marked by modifying the base sign — as in Devanagari, Ethiopic, or Thai. Between an alphabet and a syllabary.

Why it matters for MT: Abugida vowel marks are combining characters that naïve text processing can strip or reorder, corrupting words.

Source: Wikipedia: Abugida (standard reference)

Related: syllabary · script · diacritic

adposition

also: adpositions · preposition · prepositions · postposition · postpositions

The cover term for prepositions (before their noun: 'in the house') and postpositions (after it: Japanese uchi de 'house in'). A language's choice correlates strongly with its verb–object order.

Why it matters for MT: Preposition-to-postposition conversion flips the bracketing of every spatial and temporal phrase.

Source: WALS Online, chapter 85 (Dryer)

Related: word order · locative · grammatical case

advanced tongue root

also: ATR · ATR harmony · [+ATR]

A vowel quality made by pushing the tongue root forward, expanding the throat. Many African languages split their vowels into +ATR and -ATR sets and require words to stay within one set (ATR harmony).

Why it matters for MT: ATR distinctions are inconsistently marked in orthographies, splitting words across spellings.

Source: SIL Glossary of Linguistic Terms

Related: vowel harmony · phoneme

affix

also: affixes · affixation

A morpheme attached to a word stem to modify its meaning or grammar. Prefixes attach before the stem, suffixes after, and a few languages use infixes inside the stem.

Why it matters for MT: Where affixes attach determines where subword tokenizers should split words for the MT model.

Source: SIL Glossary of Linguistic Terms

Related: prefix · suffix · infix · morpheme

agglutinative

also: agglutination · agglutinating · agglutinative morphology

A word-building style where each grammatical meaning gets its own clearly separable affix, stacked one after another — as in Turkish or Swahili. Words can get long, but each piece has one job.

Why it matters for MT: Agglutinative words segment cleanly into subwords, but only if the tokenizer learns the language's affix order.

Source: SIL Glossary of Linguistic Terms

Related: fusional · isolating · polysynthesis · morphology
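
To make the "one affix, one job" idea in the agglutinative entry concrete, here is a minimal sketch that peels suffixes off the Turkish word evlerimde ('in my houses') in a fixed slot order. The slot table, the labels, and the function name `segment` are illustrative assumptions, not a real morphological analyzer; the table covers only three slots and will over-segment stems that happen to end in one of these strings.

```python
# Toy suffix slots in the order they attach to a Turkish noun stem.
# Each slot lists its vowel-harmony variants. Illustrative and incomplete.
SLOTS = [
    ("PL", ("ler", "lar")),
    ("POSS.1SG", ("im", "ım", "um", "üm")),
    ("LOC", ("de", "da", "te", "ta")),
]


def segment(word):
    """Strip known suffixes from the right, outermost slot first.

    Returns (stem, [(suffix, label), ...]) with suffixes in surface order.
    """
    suffixes = []
    rest = word
    for label, forms in reversed(SLOTS):
        for form in forms:
            # Require at least one character of stem to remain.
            if rest.endswith(form) and len(rest) > len(form):
                suffixes.insert(0, (form, label))
                rest = rest[: -len(form)]
                break
    return rest, suffixes


stem, parts = segment("evlerimde")
print(stem, parts)
```

The sketch only works because the slot order is hard-coded, which is the point the entry makes: a segmenter has to know (or learn) the language's affix order before the pieces come apart cleanly.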

agreement

also: concord · object agreement · verbal agreement · agreement patterns

When one word's form must match another's grammatical features — verbs matching their subjects, adjectives matching their nouns in gender and number. Some languages mark agreement with both subject and object on the verb.

Why it matters for MT: Generated text must satisfy all agreement chains; a single wrong feature produces multiple visible errors.

Source: SIL Glossary of Linguistic Terms

Related: grammatical gender · polypersonal agreement · inflection

animacy

also: animate · inanimate · animacy hierarchy

A grammatical distinction between living (or living-like) and non-living referents. In Algonquian languages every noun is grammatically animate or inanimate, and verbs change shape entirely depending on which they combine with.

Why it matters for MT: Animacy drives verb choice and agreement in many Indigenous languages, with no overt cue in an English source.

Example: Plains Cree (crk): verb conjugation changes completely based on whether subject/object nouns are animate or inanimate (card field linguisticChallenges.animacy).

Source: SIL Glossary of Linguistic Terms

Related: grammatical gender · obviation · differential object marking

article

also: articles · definite article · indefinite article · article system

A small word that marks a noun as definite ('the') or indefinite ('a'). About a third of the world's languages have no articles at all, and some use the numeral 'one' or demonstratives instead.

Why it matters for MT: Article insertion and deletion is one of the most frequent edit categories in MT between languages that have articles and languages that lack them.

Source: WALS Online, chapter 38 (Dryer)

Related: definiteness · demonstrative

aspect

also: aspectual · aspect marking

Grammatical marking for how an event unfolds in time — completed, ongoing, habitual, repeated — independent of when it happened. Russian verbs come in perfective/imperfective pairs; English uses 'was doing' vs 'did'.

Why it matters for MT: Aspect choices must be made on every verb when translating into aspect-marking languages, often without source cues.

Source: WALS Online, chapter 65 (Dahl & Velupillai)

Related: perfective · imperfective · tense

B

benefactive

also: benefactives

Marking that an action is done for someone's benefit — 'I baked her a cake'. Some languages mark this with a verb affix (an applicative) rather than word order or a preposition.

Why it matters for MT: Beneficiaries can hide inside verb morphology, so MT must detect and re-express them as separate phrases.

Source: SIL Glossary of Linguistic Terms

Related: dative · valency

C

chrF

also: chrF++ · chrf

An automatic MT evaluation metric that scores character n-gram overlap between a system's output and reference translations, combined into an F-score. Because it works on characters rather than words, it is fairer to morphologically rich languages than word-based metrics.

Why it matters for MT: chrF is the preferred surface metric for agglutinative and polysynthetic languages where word-level matching breaks down.

Source: Popović 2015, chrF: character n-gram F-score (ACL Anthology)

Related: COMET · tokenization
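
The character-overlap idea behind chrF can be sketched in a few lines. This is a simplified illustration, not the reference implementation: the function names are hypothetical, whitespace is dropped before counting (an assumption made here for brevity), and the defaults of n-grams up to length 6 with beta = 2 are the commonly used settings rather than something this glossary specifies.

```python
from collections import Counter


def char_ngrams(text, n):
    """Count character n-grams after removing spaces (simplifying assumption)."""
    chars = text.replace(" ", "")
    return Counter(chars[i:i + n] for i in range(len(chars) - n + 1))


def chrf_sketch(hypothesis, reference, max_n=6, beta=2.0):
    """Average n-gram precision and recall over n = 1..max_n, then F-beta."""
    precisions, recalls = [], []
    for n in range(1, max_n + 1):
        hyp = char_ngrams(hypothesis, n)
        ref = char_ngrams(reference, n)
        if not hyp or not ref:
            continue  # one side is too short to have n-grams of this length
        overlap = sum((hyp & ref).values())  # clipped matching counts
        precisions.append(overlap / sum(hyp.values()))
        recalls.append(overlap / sum(ref.values()))
    if not precisions:
        return 0.0
    p = sum(precisions) / len(precisions)
    r = sum(recalls) / len(recalls)
    if p + r == 0:
        return 0.0
    return (1 + beta ** 2) * p * r / (beta ** 2 * p + r)


# One wrong letter in a long inflected word: an exact word match would
# score zero, while character overlap still gives partial credit.
print(chrf_sketch("evlerimde", "evlerinde"))
```

For real evaluation, an established implementation should be used instead of this sketch so that scores are comparable across papers.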

classifier

also: classifiers · numeral classifier · numeral classifiers · measure word · classifier system · counter system · counters · noun classifiers

A word required when counting or referring to nouns, sorting them by type — like English 'three head of cattle' or 'two sheets of paper', but obligatory for everything. Chinese, Japanese, Thai and many other languages cannot count without the right classifier.

Why it matters for MT: MT must select the correct classifier for each noun; the wrong one is instantly visible to native speakers.

Source: WALS Online, chapter 55 (Gil)

Related: noun class · grammatical number

click

also: clicks · click consonants · click consonant

Consonants made by sucking air in, like the English 'tsk-tsk' — but used as ordinary speech sounds. Khoisan and some southern Bantu languages (Zulu, Xhosa) have full sets of click consonants.

Why it matters for MT: Clicks are written either with dedicated letters (ǀ, ǁ, ǃ, ǂ) that tokenizers and fonts often mishandle, or, in Zulu and Xhosa, with the ordinary Latin letters c, q, and x.

Source: WALS Online, chapter 19 (Maddieson)

Related: phoneme · orthography

clitic

also: clitics · proclitic · enclitic

A small grammatical word that cannot stand on its own and leans on a neighboring word, like the 's in "the queen of England's hat". Clitics behave partly like words and partly like affixes.

Why it matters for MT: Clitics blur word boundaries, so tokenizers may attach them to the wrong host and scramble the grammar.

Source: SIL Glossary of Linguistic Terms

Related: affix · particle · tokenization

clusivity

also: inclusive · exclusive · inclusive/exclusive · inclusive-exclusive · inclusiveExclusive

A distinction between two kinds of 'we': inclusive (me + you, maybe others) and exclusive (me + others, but not you). Hundreds of languages make this distinction obligatorily.

Why it matters for MT: Translating English 'we' requires choosing inclusive or exclusive with no source-side cue — and the wrong choice can be socially serious.

Source: WALS Online, chapter 39 (Cysouw)

Related: grammatical number · T-V distinction

code-switching

also: code switching · codeswitching · codeSwitching · code-mixing

Alternating between two or more languages within one conversation or sentence, as bilingual speakers naturally do. It follows grammatical patterns rather than being random mixing.

Why it matters for MT: Mixed-language input confuses language detection and produces garbled output unless the system handles both codes.

Source: SIL Glossary of Linguistic Terms

Related: diglossia · loanword

COMET

also: AfriCOMET · comet score

A neural MT evaluation metric that uses a trained multilingual model to predict human quality judgments, rather than counting surface overlap. Variants like AfriCOMET extend coverage to African languages.

Why it matters for MT: COMET correlates better with human judgment than surface metrics, but only for languages its underlying model has seen.

Source: Rei et al. 2020, COMET (ACL Anthology)

Related: chrF

comitative

also: comitative case

A case or marker meaning 'together with someone'. Some languages use the same marker for 'with' (accompaniment) and 'and' (coordination), which can be ambiguous to outsiders.

Why it matters for MT: Comitative/coordination overlap means 'X with Y' and 'X and Y' can be the same construction in the source.

Source: WALS Online, chapter 63 (Stassen)

Related: grammatical case · instrumental

compounding

also: compound nouns · compounds · compound words

Joining two or more independent words into one new word, like 'tooth' + 'brush' → 'toothbrush'. German-style compounding can chain many words into very long single tokens.

Why it matters for MT: Novel compounds are unseen tokens that MT must split into known parts to translate.

Source: SIL Glossary of Linguistic Terms

Related: derivation · noun incorporation · tokenization

conjunct order

also: conjunct · independent order · conjunct verb

In Algonquian languages like Plains Cree, verbs come in distinct inflectional 'orders': the independent order for main statements and the conjunct order mainly for subordinate clauses, questions, and certain discourse contexts. The two use entirely different ending sets.

Why it matters for MT: Choosing independent vs conjunct forms is a clause-type decision English gives no direct cue for.

Example: Plains Cree (crk): described in the card's formality notes and crk-translate documentation.

Source: Wikipedia: Plains Cree (standard reference; see also Wolvengrey 2011)

Related: obviation · mood · agreement

consonant mutation

also: initial mutation · consonant mutations

A grammatical change to a word's first consonant triggered by the word before it. Welsh cath 'cat' becomes gath after the article: y gath 'the cat'. The dictionary form and the spoken form can look quite different.

Why it matters for MT: Mutation makes the same word appear under several spellings, fragmenting its statistics in MT training data.

Source: SIL Glossary of Linguistic Terms

Related: lenition · sandhi
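
The cath → gath example in the consonant mutation entry is one instance of Welsh soft mutation. The sketch below applies the standard soft-mutation table to a word's initial consonant so the fragmentation effect is visible. It is a rough illustration under stated assumptions: the function name `soft_mutate` is hypothetical, input is assumed to be lowercase, and the code does not model when mutation is triggered, only what it does to the spelling.

```python
# Welsh soft mutation of initial consonants: radical -> mutated form.
# "g" is deleted, so it maps to the empty string.
SOFT_MUTATION = {
    "ll": "l", "rh": "r",
    "p": "b", "t": "d", "c": "g",
    "b": "f", "d": "dd", "g": "", "m": "f",
}

# Digraphs that start with a mutating letter but are separate letters
# of the Welsh alphabet and are left unchanged by soft mutation.
UNMUTATED_DIGRAPHS = ("ch", "dd", "ff", "ng", "ph", "th")


def soft_mutate(word):
    """Return the soft-mutated spelling of a lowercase Welsh word."""
    if word.startswith(UNMUTATED_DIGRAPHS):
        return word
    # Try two-letter initials before one-letter ones.
    for initial in sorted(SOFT_MUTATION, key=len, reverse=True):
        if word.startswith(initial):
            return SOFT_MUTATION[initial] + word[len(initial):]
    return word


for radical in ("cath", "pont", "gardd", "mam", "basged"):
    print(radical, "->", soft_mutate(radical))
```

Note that both b and m mutate to f, so undoing the mutation is ambiguous without a dictionary lookup: a corpus form beginning with f may belong to either radical.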

copula

also: zero copula · copular

The linking verb 'be' in sentences like 'she is a doctor'. Many languages omit it (zero copula) in the present tense — Russian and Arabic say literally 'she doctor'.

Why it matters for MT: MT must insert copulas translating out of zero-copula languages and delete them going in.

Source: WALS Online, chapter 120 (Stassen)

Related: negation · word order

creole

also: creoles · creole language · creolization

A full natural language that grew out of intense language contact, typically drawing most vocabulary from one language (the lexifier) while developing its own grammar. Haitian Creole, Tok Pisin, and Papiamento are examples.

Why it matters for MT: Creoles look deceptively like their lexifier in writing, so MT systems trained on the lexifier mistranslate them systematically.

Source: APiCS Online (Michaelis et al., eds.)

Related: pidgin · lexifier · loanword

D

dative

also: dative case

The case marking the recipient or beneficiary — the 'to whom' of giving and telling. German wem, Latin cui; English expresses it with word order or 'to'.

Why it matters for MT: MT must decide between dative case marking and prepositional phrasing depending on the target language.

Source: SIL Glossary of Linguistic Terms

Related: grammatical case · benefactive

definiteness

also: definite · indefinite · specific articles

Whether a noun refers to something the listener can already identify ('the dog') or not ('a dog'). Many languages never mark this distinction; others mark it with articles, affixes, or word order.

Why it matters for MT: Translating from an article-less language forces MT to decide 'the' vs 'a' for every noun phrase from context alone.

Source: WALS Online, chapter 37 (Dryer)

Related: article · differential object marking

demonstrative

also: demonstratives · deixis

Pointing words like 'this' and 'that'. Languages divide pointing space differently — some have a three-way near-me / near-you / far split, others add visibility or elevation.

Why it matters for MT: Demonstrative systems rarely map one-to-one, so MT must approximate distance and visibility distinctions.

Source: WALS Online, chapter 41 (Diessel)

Related: article · definiteness

dependent marking

also: dependent-marking · dependentMarking

Putting grammatical marking on the dependents of a phrase — case endings on nouns, possessive endings on the possessor — while the head stays unmarked. Most European languages lean this way.

Why it matters for MT: Dependent marking puts the grammatical signal in noun endings, which tokenizers must segment correctly.

Source: WALS Online, chapter 23 (Nichols & Bickel)

Related: head marking · grammatical case

derivation

also: derivational

Building new words from existing ones, like teach → teacher or happy → unhappy. Unlike inflection, derivation creates a new dictionary word rather than a grammatical variant of the same word.

Why it matters for MT: Productive derivation lets speakers coin words on the fly that an MT system has never seen.

Source: SIL Glossary of Linguistic Terms

Related: inflection · compounding · morpheme

diacritic

also: diacritics · accent marks · tone marks · combining marks

Marks added to letters — accents, tildes, dots, macrons — to indicate tone, length, nasality, or different sounds. Some orthographies depend on them completely; users often omit them in casual typing.

Why it matters for MT: Diacritic-stripped text is a pervasive data-quality problem: it merges distinct words and mismatches clean training data.

Source: SIL Glossary of Linguistic Terms

Related: orthography · tone · vowel length
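
The data-quality problem described in the diacritic entry is easy to reproduce with Python's standard `unicodedata` module. The sketch below shows two separate effects: the same accented letter can be encoded in two ways that compare unequal until normalized, and stripping combining marks merges distinct words. The helper name `strip_diacritics` and the French example words are illustrative choices, not taken from the glossary's sources.

```python
import unicodedata


def strip_diacritics(text):
    """Decompose to NFD, then drop characters with a non-zero combining class."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# 1. Same visible letter, two encodings: precomposed vs. base + combining mark.
precomposed = "\u00e9"   # é as a single code point
combining = "e\u0301"    # e followed by COMBINING ACUTE ACCENT
print(precomposed == combining)                                # False
print(unicodedata.normalize("NFC", combining) == precomposed)  # True

# 2. Three distinct French words collapse into one spelling once stripped.
words = ["cote", "côte", "côté"]
print({strip_diacritics(w) for w in words})                    # {'cote'}
```

Normalizing to a single form (NFC is a common choice) before training or matching avoids the first problem. The second is lossy by nature and is better treated as something to detect than to apply.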

differential object marking

also: DOM · differential marking

Marking some direct objects but not others, usually depending on animacy or definiteness. Spanish adds 'a' before human objects (veo a María) but not things (veo la casa).

Why it matters for MT: MT must decide per-object whether the marker belongs, using animacy cues the source may not show.

Source: SIL Glossary of Linguistic Terms

Related: animacy · definiteness · grammatical case

diglossia

also: diglossic

A stable situation where a community uses two varieties of a language for different purposes — a 'high' variety for writing and formal speech (Modern Standard Arabic) and a 'low' one for daily life (the spoken dialects). Speakers switch by context, not by choice.

Why it matters for MT: Training data skews toward the written 'high' variety, so MT fails on the spoken variety people actually use.

Source: SIL Glossary of Linguistic Terms

Related: register · code-switching · macrolanguage

directional

also: directionals · directional affixes · directional prefixes

Verb marking that builds direction of motion into the verb itself — toward the speaker, away, upriver, uphill. Common in Mayan, Tibeto-Burman, and many Papuan languages.

Why it matters for MT: Directional meaning packed into verbs must be unpacked into adverbs or prepositions in the target.

Source: SIL Glossary of Linguistic Terms

Related: affix · valency

downstep

also: downdrift

In many African tone languages, a step-down in pitch that affects all following high tones in the phrase. It is a tonal landmark that can itself distinguish meanings.

Why it matters for MT: Downstep is essentially never written, so tonal information is systematically absent from text data.

Source: SIL Glossary of Linguistic Terms

Related: tone

dual

also: dual number · trial

A grammatical number meaning exactly two, distinct from both singular and plural. Slovene, Arabic and many Oceanic languages have it; a few languages add trial (exactly three) in pronouns.

Why it matters for MT: MT into dual-marking languages must detect two-ness that the source expresses only by counting words.

Source: SIL Glossary of Linguistic Terms

Related: grammatical number · clusivity

E

ejective

also: ejectives · glottalized consonant · glottalized consonants

A consonant produced with a burst of air from the closed glottis instead of the lungs, giving a sharp popping quality. Common in languages of the Caucasus, the Americas, and East Africa; written with an apostrophe (k', t').

Why it matters for MT: Ejective marks are often dropped in casual typing, merging distinct words in the input text.

Source: WALS Online, chapter 7 (Maddieson)

Related: glottal stop · phoneme · orthography

endonym

also: autonym · native name · exonym · endonyms

A community's own name for its language (endonym/autonym), as opposed to the name outsiders use (exonym). 'Deutsch' is the endonym for what English calls 'German'; many Indigenous communities prefer their endonyms.

Why it matters for MT: Name mismatches across databases cause language-identification and data-merging errors in MT pipelines.

Source: Glottolog

Related: ISO 639-3 · glottocode

ergativity

also: ergative · ergative-absolutive · ergative–absolutive · ergative alignment · ergative case · split ergativity

An alignment system where the subject of an intransitive verb ('she sleeps') is marked like the object of a transitive one ('saw her'), while transitive subjects get special ergative marking. Basque, many Mayan, Australian and Caucasian languages work this way.

Why it matters for MT: Ergative marking reverses the role cues MT expects, causing who-did-what-to-whom errors.

Source: WALS Online, chapter 98 (Comrie)

Related: absolutive · nominative · morphosyntactic alignment

evidentiality

also: evidential · evidentials · evidential marking

Grammar that forces speakers to say how they know something — saw it, heard about it, inferred it. In Quechua or Turkish, leaving out the evidential is like leaving out tense in English.

Why it matters for MT: MT into evidential languages must state an information source the original text never specifies.

Source: WALS Online, chapter 77 (de Haan)

Related: mood · tense

F

focus system

also: voice focus · voiceFocus · focus marker · symmetrical voice · Austronesian voice · focus construction

A clause system, best known from Philippine languages like Tagalog, where verb morphology selects which participant — actor, patient, location, instrument — is the grammatical pivot of the sentence. Often called symmetrical voice.

Why it matters for MT: Focus choice changes verb form, marker placement and word order at once, so MT cannot map clauses word-by-word.

Example: Surfaced on cards as linguisticChallenges.voiceFocus for many Austronesian languages.

Source: SIL Glossary of Linguistic Terms

Related: grammatical voice · word order

FST

also: finite-state transducer · finite state transducer · FST-based

A finite-state transducer: a rule-based computational model that maps between word forms and their grammatical analyses. For morphologically complex languages, hand-built FSTs can analyze and generate word forms that statistical systems never saw.

Why it matters for MT: FSTs provide reliable morphological analysis and validation for languages too data-poor to learn morphology from text alone.

Example: Plains Cree (crk): the eval pack uses an FST-based semantic validator (card field evalMetrics.lyss-sem).

Source: GiellaLT infrastructure (UiT)

Related: morphology · paradigm

fusional

also: fusional morphology · inflecting language

A word-building style where one affix fuses several grammatical meanings at once. The Spanish ending -ó in habló marks past tense, third person, and singular simultaneously — no part of it can be assigned to just one meaning.

Why it matters for MT: Fused endings cannot be decomposed by tokenizers, so each combination must be learned as a unit.

Source: SIL Glossary of Linguistic Terms

Related: agglutinative · isolating · inflection

G

gemination

also: geminate · geminated · double consonant

Holding a consonant longer to make a different word — Italian pala 'shovel' vs palla 'ball'. Some scripts write the doubling, others leave it to the reader.

Why it matters for MT: When the script omits gemination, distinct words collapse into one spelling and MT loses the contrast.

Source: SIL Glossary of Linguistic Terms

Related: vowel length · orthography

genitive

also: genitive case · possessive case

The case that marks possession or close association, like English 's or 'of'. Languages differ in whether the genitive phrase comes before or after the noun it modifies.

Why it matters for MT: Genitive order differences ('the king's horse' vs 'horse of-king') require systematic reordering inside noun phrases.

Source: WALS Online, chapter 86 (Dryer)

Related: grammatical case · possession · word order

gerund

also: gerunds

A verb form used as a noun, like 'swimming' in 'swimming is fun'. Other languages use infinitives, verbal nouns, or special converb forms where English uses gerunds.

Why it matters for MT: English gerunds map to several different constructions depending on the target language.

Source: SIL Glossary of Linguistic Terms

Related: participle

glottal stop

also: glottal · ʔ · okina · ʻokina

The catch in the throat in the middle of 'uh-oh'. In many languages it is a full consonant that distinguishes words — Hawaiian writes it as the ʻokina (ʻ), and dropping it changes meanings.

Why it matters for MT: Glottal stops are frequently omitted or typed with the wrong apostrophe character, fragmenting words across spellings.

Example: Hawaiian (haw): the ʻokina is a phonemic consonant; card orthography notes flag apostrophe-variant issues.

Source: SIL Glossary of Linguistic Terms

Related: ejective · diacritic · orthography
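
The apostrophe-variant problem from the glottal stop entry can be illustrated with a small normalizer that maps common look-alike characters to the ʻokina (U+02BB). This is a sketch under a strong assumption: the input is known to be Hawaiian text in which these characters stand for the ʻokina. Applied blindly to mixed-language text it would also rewrite genuine apostrophes and quotation marks. The name `normalize_okina` and the variant list are illustrative, not exhaustive.

```python
OKINA = "\u02bb"  # MODIFIER LETTER TURNED COMMA, the ʻokina

# Characters frequently typed in place of the ʻokina (illustrative list).
LOOKALIKES = ["'", "\u2018", "\u2019", "`"]

_TABLE = str.maketrans({ch: OKINA for ch in LOOKALIKES})


def normalize_okina(text):
    """Replace apostrophe-like characters with U+02BB.

    Assumes the text is Hawaiian; not safe for arbitrary input.
    """
    return text.translate(_TABLE)


spellings = ["Hawai'i", "Hawai\u2018i", "Hawai\u2019i", "Hawai`i", "Hawai\u02bbi"]
print(len(set(spellings)))                             # 5 distinct strings
print(len({normalize_okina(s) for s in spellings}))    # 1 after normalization
```

Without a step like this, each spelling is a different token to a tokenizer, which is the fragmentation the entry warns about.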

glottocode

also: Glottolog code · glottocodes

A unique identifier from the Glottolog database (like stan1293 for English) covering languages, dialects, and families. More fine-grained than ISO codes and revised continuously by linguists.

Why it matters for MT: Glottocodes let pipelines join typological databases precisely, even for varieties without ISO codes.

Source: Glottolog

Related: ISO 639-3 · endonym

grammatical case

also: case · case system · case marking · cases · case morphology · borderline case-marking

Marking nouns and pronouns to show their role in the sentence — who acts, who is acted on, where, with what. English keeps only traces (I/me/my); Finnish has fifteen cases; many languages have none.

Why it matters for MT: Case-marking languages often allow freer word order, so MT must read roles from endings rather than position and generate the right endings in return.

Source: WALS Online, chapter 49 (Iggesen)

Related: nominative · ergativity · oblique · morphosyntactic alignment

grammatical gender

also: gender · gender system · gender agreement · noun gender · gendered

Sorting all nouns into classes (often called masculine/feminine/neuter) that force matching forms on articles, adjectives, and sometimes verbs. The assignment is grammatical, not biological — a German table is masculine, a Spanish one feminine.

Why it matters for MT: Translating into a gendered language forces choices about people and things the source leaves unspecified, a major source of MT bias.

Source: WALS Online, chapter 30 (Corbett)

Related: noun class · agreement · animacy

grammatical number

also: plural marking · plurality · plural · number marking · obligatory plural

How a language marks how many — singular, plural, and sometimes dual (exactly two) or paucal (a few). Some languages mark number on every noun; others leave it to context entirely.

Why it matters for MT: When the source does not mark number, MT must guess it; when the target requires dual forms, MT must supply them.

Source: WALS Online, chapter 33 (Dryer)

Related: dual · clusivity · agreement

grammatical voice

also: voice · voice system

The grammatical system controlling which participant is the subject — active, passive, middle, and in some language families much richer systems. Voice reshapes the whole clause around a chosen perspective.

Why it matters for MT: Voice mismatches require restructuring whole clauses, not substituting words.

Source: SIL Glossary of Linguistic Terms

Related: passive · focus system · valency

H

head marking

also: head-marking · headMarking

Putting the grammatical marking on the head of a phrase — the verb shows who its subject and object are, the possessed noun shows who owns it. The dependents (the nouns themselves) can stay bare.

Why it matters for MT: Head-marking concentrates clause grammar on the verb, the mirror image of the dependent-marking languages most MT data comes from.

Source: WALS Online, chapter 23 (Nichols & Bickel)

Related: dependent marking · agreement · polysynthesis

honorific

also: honorifics · respectful speech · respect forms · honorific system

Grammatical or lexical forms that encode respect toward the listener or the person discussed — Japanese keigo, Korean speech levels, special kin-respect vocabularies. Often obligatory, not optional politeness.

Why it matters for MT: Honorific selection requires social knowledge (who outranks whom) that the source text rarely states.

Source: SIL Glossary of Linguistic Terms

Related: T-V distinction · register · politeness

I

imperfective

also: imperfective aspect

Aspect presenting an event from the inside — ongoing, habitual, or repeated, like 'she was writing' or 'she used to write'. The counterpart of perfective.

Why it matters for MT: Imperfective readings (ongoing vs habitual) must be disambiguated by context for correct translation.

Source: SIL Glossary of Linguistic Terms

Related: perfective · aspect

implosive

also: implosives

Consonants made with air briefly sucked inward at the throat, like the ɓ and ɗ of Hausa or Vietnamese. They sound like emphatic b/d to untrained ears but are distinct phonemes.

Why it matters for MT: Implosive letters (ɓ, ɗ) are often typed as plain b/d, merging distinct words in text corpora.

Source: WALS Online, chapter 7 (Maddieson)

Related: ejective · phoneme

infix

also: infixes · infixation

An affix inserted inside a word stem rather than before or after it. Tagalog, for example, turns sulat 'write' into sumulat 'wrote' by inserting -um- after the first consonant.

Why it matters for MT: Infixes break the assumption that a word's stem is a contiguous string, which defeats simple subword segmentation.

Source: SIL Glossary of Linguistic Terms

Related: affix · root-and-pattern morphology
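
The Tagalog example in the infix entry can be written as a short function, which also makes the "non-contiguous stem" problem checkable. This is a minimal sketch: the function name `um_infix` is hypothetical, vowel-initial stems receive um- at the front (as in alis → umalis), and stems that begin with a consonant cluster are not handled.

```python
VOWELS = "aeiou"


def um_infix(stem):
    """Insert -um- after the first consonant, or prefix it before a vowel."""
    if not stem:
        return stem
    if stem[0] in VOWELS:
        return "um" + stem
    return stem[0] + "um" + stem[1:]


form = um_infix("sulat")
print(form)              # sumulat
print("sulat" in form)   # False: the stem is no longer a contiguous substring
```

The second line of output is the tokenizer's problem in miniature: no contiguous substring of sumulat equals the stem sulat, so a segmenter that only cuts strings at boundaries cannot recover it.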

inflection

also: inflectional · inflectional synthesis · inflected

Changing a word's form to express grammar — tense, number, case, gender — without changing its core meaning, like English sing/sang or cat/cats. Highly inflected languages can mark six or more categories on a single word.

Why it matters for MT: Every inflected form is a separate token for an MT system, so heavy inflection means more rare words and more agreement errors.

Source: SIL Glossary of Linguistic Terms

Related: morphology · paradigm · agreement

instrumental

also: instrumental case · instrumentals

A case meaning 'using/by means of' — Russian marks 'with a hammer' with an ending instead of a preposition. Polysynthetic languages may build the instrument right into the verb.

Why it matters for MT: Instrumental meaning shifts between case endings, prepositions, and verb-internal marking across languages.

Source: SIL Glossary of Linguistic Terms

Related: grammatical case · polysynthesis

interrogative

also: interrogatives · question particle · polar question · question marker

The grammar of asking questions. Yes/no questions may be marked by a particle, a verb form, word-order change, or intonation alone; content questions differ in whether 'who/what' moves to the front.

Why it matters for MT: If the source marks questions only by intonation, written input gives MT no signal that a question is being asked.

Source: WALS Online, chapter 116 (Dryer)

Related: particle · word order

ISO 639-3

also: ISO 639 · iso639_3 · language code · three-letter code

The international standard of three-letter codes for the world's languages (eng, crk, haw), maintained by SIL. It aims to cover every known language, living or extinct, with one code each.

Why it matters for MT: ISO 639-3 codes are the join keys of multilingual NLP — wrong or ambiguous codes silently corrupt datasets.

Source: SIL ISO 639-3 Registration Authority

Related: glottocode · macrolanguage
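
Because language codes act as join keys, a cheap first defense is to check their shape before merging datasets. The sketch below distinguishes ISO 639-3 codes (three lowercase letters) from glottocodes (four lowercase alphanumerics followed by four digits). The function name `code_kind` is an illustrative assumption, and the check is purely syntactic: it cannot tell whether a well-formed code is actually assigned to a language.

```python
import re

ISO_639_3 = re.compile(r"[a-z]{3}")
GLOTTOCODE = re.compile(r"[a-z0-9]{4}[0-9]{4}")


def code_kind(code):
    """Classify a language identifier by shape only (no registry lookup)."""
    if ISO_639_3.fullmatch(code):
        return "iso639-3"
    if GLOTTOCODE.fullmatch(code):
        return "glottocode"
    return "unknown"


for code in ("crk", "haw", "stan1293", "en", "ENG"):
    print(code, "->", code_kind(code))
```

Two-letter codes such as "en" and uppercase variants come out as "unknown" here, which is often enough to surface a column that mixes code systems before it silently breaks a join.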

isolating

also: analytic language · isolating morphology

A word-building style where words are mostly single morphemes and grammar is expressed by word order and helper words instead of endings — as in Vietnamese or Mandarin. The opposite extreme from polysynthesis.

Why it matters for MT: Isolating languages shift the MT problem from word segmentation to word order and function-word choice.

Source: SIL Glossary of Linguistic Terms

Related: agglutinative · fusional · word order

K

kinship terms

also: kin terms · kinship terminology · kinship system

The vocabulary for family relations, which different languages slice very differently — separate words for older vs younger siblings, or for maternal vs paternal uncles. Some systems encode the speaker's own position too.

Why it matters for MT: English 'uncle' or 'cousin' may have no single equivalent, forcing MT to choose among precise kin terms without the needed family facts.

Source: SIL Glossary of Linguistic Terms

Related: honorific · clusivity

L

language isolate

also: isolate · isolates · isIsolate

A language with no demonstrated relatives — a family of one. Basque, Ainu, and Burushaski are famous examples; isolates are surprisingly common worldwide.

Why it matters for MT: Isolates cannot borrow training signal from related languages, removing a key low-resource MT strategy (transfer learning).

Source: Glottolog: language families

Related: glottocode · macrolanguage

language vitality

also: vitality · endangerment status · endangered · endangered language · dormant · sleeping language · vigorous · threatened

How robustly a language is being passed to children and used across life domains. Scales like EGIDS and Glottolog's endangerment status run from 'vigorous' through 'threatened' and 'moribund' to 'dormant' (no fluent speakers, but potential for revival).

Why it matters for MT: Vitality predicts data availability and, for community-driven MT, what role technology should play (e.g. revitalization support, not replacement).

Source: Glottolog: Agglomerated Endangerment Status

Related: moribund · language isolate

lenition

also: lenited

The softening of a consonant, often between vowels or under grammatical triggers — a 'k' weakening toward 'g' or 'h'. In Celtic languages lenition is part of the grammar, not just pronunciation.

Why it matters for MT: Grammatical lenition changes word spellings in context, so surface text diverges from dictionary forms.

Source: SIL Glossary of Linguistic Terms

Related: consonant mutation · sandhi

lexifier

also: lexifier language · lexified

The language that supplied most of a creole's or pidgin's vocabulary — English for Jamaican Patois, French for Haitian Creole, Portuguese for Papiamento. The grammar, however, is the creole's own.

Why it matters for MT: Shared vocabulary with the lexifier masks deep grammatical differences that MT must not gloss over.

Source: APiCS Online (Michaelis et al., eds.)

Related: creole · pidgin · loanword

loanword

also: loanwords · borrowing · borrowings · calque · English loanwords

A word taken from another language and adapted to local sound patterns, like 'sushi' in English or 'le weekend' in French. A calque borrows the structure instead, translating piece by piece ('skyscraper' → French 'gratte-ciel').

Why it matters for MT: Deciding whether to keep, adapt, or translate a loanword is a recurring choice in localization, especially for technical terms.

WOLD (World Loanword Database) ↗Related:code-switching lexifier

locative

also: locative case

A case meaning 'at/in/on' a place, expressed by a noun ending rather than a preposition. Finnish and Hungarian split location into several precise locative cases (inside, on top, near, motion toward, motion from).

Why it matters for MT: One English preposition can map to several locative cases, forcing MT to choose by context.

SIL Glossary of Linguistic Terms ↗Related:grammatical case adposition

M

macrolanguage

also: macrolanguages

An ISO 639-3 bookkeeping category: a single code (like 'ara' Arabic or 'zho' Chinese) that covers several distinct member languages treated as one for historical or political reasons. Each member also has its own code.

Why it matters for MT: Data labeled with a macrolanguage code mixes mutually unintelligible varieties, contaminating training and evaluation.
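
One way to catch the labeling problem is to flag macrolanguage codes before data enters a training set. The sketch below is illustrative only: the table is a small hand-written excerpt (a few member codes for 'ara' and 'zho'), and the function names are hypothetical, not part of any card or pipeline described here.

```python
# Partial, hand-written excerpt of ISO 639-3 macrolanguage membership.
# The real registry lists many more members per macrolanguage.
MACRO_MEMBERS = {
    "ara": {"arb", "arz", "ary"},  # Standard, Egyptian, Moroccan Arabic
    "zho": {"cmn", "yue", "wuu"},  # Mandarin, Yue (Cantonese), Wu
}


def is_macrolanguage(code):
    """True if the code names a macrolanguage in the excerpt above."""
    return code in MACRO_MEMBERS


def macro_of(code):
    """Return the macrolanguage a member code belongs to, or None."""
    for macro, members in MACRO_MEMBERS.items():
        if code in members:
            return macro
    return None


for label in ["ara", "arz", "yue"]:
    print(label, is_macrolanguage(label), macro_of(label))
```

Data tagged with a code for which `is_macrolanguage` returns True would then need a finer language-identification pass before use.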

SIL ISO 639-3: macrolanguage scope ↗Related:ISO 639-3 diglossia language isolate

mood

also: modality · grammatical mood

Grammatical marking for the speaker's stance toward an event — fact, wish, command, possibility. Indicative, subjunctive and imperative are the familiar European moods; other languages mark finer shades.

Why it matters for MT: Mood selection (especially subjunctive) follows target-language rules that cannot be copied from the source.

SIL Glossary of Linguistic Terms ↗Related:subjunctive evidentiality tense

mora

also: moraic · morae

A timing unit smaller than the syllable: a short syllable counts one mora, a long vowel or a final consonant adds another. Japanese rhythm, poetry, and even abbreviations count morae, not syllables.

Why it matters for MT: Mora-based phonology shapes how loanwords and names are adapted, affecting transliteration quality.
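
As a rough illustration of mora counting, the sketch below counts morae in a word written in Japanese kana. It assumes the input is pure kana and relies on the convention that each kana is one mora, except the small glide and vowel kana, which merge with the preceding sign. The function name is hypothetical.

```python
import unicodedata

# Small kana that merge with the preceding kana into one mora (e.g. きょ).
# The small tsu (っ), ん, and the length mark ー each count as a full mora,
# so they are deliberately not listed here.
SMALL_KANA = set("ゃゅょぁぃぅぇぉゎャュョァィゥェォヮ")


def count_morae(kana):
    """Count morae in a kana string (assumes the input is kana only)."""
    # NFC keeps voiced kana such as ぽ as a single code point.
    text = unicodedata.normalize("NFC", kana)
    return sum(1 for ch in text if ch not in SMALL_KANA)


print(count_morae("とうきょう"))  # 4 (to-u-kyo-u), though only two syllables
print(count_morae("にっぽん"))    # 4 (ni-p-po-n)
```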

SIL Glossary of Linguistic Terms ↗Related:vowel length pitch accent

moribund

A vitality status meaning the language is no longer being learned by children; the remaining fluent speakers are all older adults. Without intervention, such a language becomes dormant within a generation.

Why it matters for MT: Moribund languages have shrinking speaker pools to validate MT output, raising the stakes of every data decision.

Glottolog: Agglomerated Endangerment Status ↗Related:language vitality

morpheme

also: morphemes

The smallest piece of a word that carries meaning. The English word 'unhappiness' contains three: un-, happy, and -ness. Languages differ enormously in how many morphemes they pack into one word.

Why it matters for MT: MT systems work on tokens, and a token that contains many morphemes hides grammar the system needs to translate correctly.

SIL Glossary of Linguistic Terms ↗Related:affix root tokenization

morphology

also: morphological complexity · morphologically complex · complex morphology

The study of how words are built from smaller meaningful parts, and the word-building system of a language itself. A morphologically complex language expresses with word endings what English expresses with separate words.

Why it matters for MT: Complex morphology multiplies the distinct word forms an MT system must learn from limited data.

SIL Glossary of Linguistic Terms ↗Related:morpheme inflection agglutinative polysynthesis

morphosyntactic alignment

also: alignment · alignment of verbal person marking

The system a language uses to group the three core roles — intransitive subject, transitive subject, and object — for marking purposes. Nominative–accusative and ergative–absolutive are the two big patterns; some languages split between them or use animacy-driven systems.

Why it matters for MT: Alignment mismatch between source and target is a structural translation problem, not a vocabulary one.

WALS Online, chapter 100 (Siewierska) ↗Related:ergativity nominative agreement

N

nasal vowel

also: nasal vowels · nasalization · nasalized · vowel nasalization

A vowel pronounced with air flowing through the nose, as in French bon or Portuguese são. Where it is contrastive, oral and nasal vowels distinguish different words.

Why it matters for MT: Nasalization marks (ã, ę, ą) are commonly dropped in informal typing, merging word pairs.

WALS Online, chapter 10 (Hajek) ↗Related:diacritic phoneme

negation

also: negative morpheme · negator · negative marker · standard negation

How a language says 'not'. Strategies include particles (English not), affixes on the verb, special negative verbs, and two-part constructions like French ne…pas. Position varies: before the verb, after it, or both.

Why it matters for MT: Negation errors invert meaning entirely, making correct negative placement one of the highest-stakes MT requirements.

WALS Online, chapter 112 (Dryer) ↗Related:particle word order

nominative

also: accusative · nominative-accusative · nominative–accusative

In the most familiar alignment system, the subject of any verb takes nominative case and the direct object takes accusative. Most European languages work this way, so it is what MT training data overwhelmingly reflects.

Why it matters for MT: Systems trained mostly on nominative–accusative languages misassign roles when translating ergative languages.

WALS Online, chapter 98 (Comrie) ↗Related:ergativity grammatical case morphosyntactic alignment

noun class

also: noun classes · noun class agreement · noun-class

A gender-like system with many classes — Bantu languages typically have 10–20, sorting nouns by shape, animacy, size and more. Each class triggers its own agreement prefixes across the sentence.

Why it matters for MT: Every noun choice ripples agreement markers through the whole clause, so one wrong class produces many visible errors.

WALS Online, chapter 30 (Corbett) ↗Related:grammatical gender agreement classifier

noun incorporation

also: incorporation · incorporated noun

Folding a noun into the verb to make one word, roughly like turning 'hunt seals' into 'seal-hunt' as a verb. Common in polysynthetic languages, where it changes the sentence's emphasis and grammar.

Why it matters for MT: An incorporated noun disappears from the sentence as a separate word, so alignment-based translation loses it.

Example: Surfaced on cards as linguisticChallenges.nounIncorporation (Grambank-based), e.g. Plains Cree (crk).

SIL Glossary of Linguistic Terms ↗Related:polysynthesis compounding

O

oblique

also: obliques · oblique argument · oblique phrase

Any phrase in the clause that is neither subject nor direct object — typically locations, instruments, recipients and other 'extras', often marked with a case or adposition. In WALS, 'X' in orders like VOX stands for the oblique phrase.

Why it matters for MT: Where obliques sit in the sentence varies by language and must be reordered correctly around verb and object.

WALS Online, chapter 84 (Dryer & Gensler) ↗Related:grammatical case word order adposition

obviation

also: obviative · fourth person · proximate

A system, central to Algonquian languages like Plains Cree, that ranks third persons in a stretch of discourse: one is 'proximate' (in focus) and any others are 'obviative' (marked as backgrounded, sometimes called fourth person). It keeps several third persons apart where English would need indexed pronouns ('he₁' vs 'he₂').

Why it matters for MT: English has no obviation, so MT into Cree must invent proximate/obviative assignments and keep them consistent across sentences.

Example: Plains Cree (crk) marks obviative referents on nouns and verbs; see the crk card and crk-translate documentation.

SIL Glossary of Linguistic Terms ↗Related:animacy agreement conjunct order

orthography

also: orthographic · spelling system · orthographies · orthographic status

The agreed rules for writing a language in its script — which letters, diacritics, and spellings are correct. Some languages have multiple competing orthographies or none standardized at all.

Why it matters for MT: Competing or unstandardized orthographies split scarce training data into incompatible spelling variants.

SIL Glossary of Linguistic Terms ↗Related:script diacritic romanization

P

paradigm

also: inflectional paradigm · paradigms

The full set of inflected forms a word can take, like a verb conjugation table. Paradigms range from a single invariant form (English 'must') to thousands in polysynthetic languages.

Why it matters for MT: Large paradigms guarantee that most word forms are rare or unseen in training data.

SIL Glossary of Linguistic Terms ↗Related:inflection suppletion

parallel corpus

also: parallel text · bitext · parallel corpora · parallel data

A collection of texts paired with their translations, aligned sentence by sentence. Parallel corpora are the primary fuel for training and evaluating MT systems.

Why it matters for MT: Together, the size and domain of the available parallel data are the strongest predictor of MT quality for a language pair.

OPUS, the open parallel corpus collection ↗Related:treebank chrF

participle

also: participles · participial

A verb form that acts like an adjective or builds compound tenses — 'the running water', 'has eaten'. Languages differ in how many participles they have and what they are used for.

Why it matters for MT: Participial clauses often replace relative clauses in other languages, requiring structural conversion.

SIL Glossary of Linguistic Terms ↗Related:gerund relative clause

particle

also: particles · sentence-final particle · discourse particle · topic marker

A small, uninflected function word that adds grammatical or attitudinal meaning — question markers, topic markers, politeness softeners. East Asian languages make heavy use of sentence-final particles.

Why it matters for MT: Particles carry meaning (questionhood, attitude, topic) that MT must re-express by entirely different means.

SIL Glossary of Linguistic Terms ↗Related:clitic interrogative register

passive

also: passive voice · passive constructions

A construction that promotes the object to subject and demotes or drops the doer: 'the window was broken (by the boy)'. Many languages lack a passive entirely or use other strategies to background the agent.

Why it matters for MT: Passive-less target languages force MT to restructure passives into actives, inventing or recovering the agent.

WALS Online, chapter 107 (Siewierska) ↗Related:grammatical voice valency

perfective

also: perfective aspect

Aspect presenting an event as a complete whole — 'she wrote the letter' viewed as one finished fact. Often paired with imperfective in a grammatical opposition.

Why it matters for MT: Choosing the wrong aspect (perfective where imperfective is needed, or the reverse) is among the most common errors when translating into Slavic languages.

SIL Glossary of Linguistic Terms ↗Related:imperfective aspect

pharyngeal

also: pharyngeals · pharyngeal consonants

Consonants made by squeezing the throat (pharynx), like Arabic ʿayn (ع). They are rare worldwide and hard for non-native speakers to hear or produce.

Why it matters for MT: Pharyngeals are romanized many ways (ʿ, ', 3, or nothing), creating spelling chaos in informal text.

SIL Glossary of Linguistic Terms ↗Related:uvular glottal stop romanization

phoneme

also: phonemes · phonemic · phoneme inventory · consonant inventory · vowel inventory

A speech sound that distinguishes words in a particular language — swap one phoneme for another and you get a different word (pat vs bat). A language's phoneme inventory ranges from about a dozen sounds to well over a hundred.

Why it matters for MT: Inventory size and content determine how foreign names and loanwords get reshaped in the language.

WALS Online, chapter 1 (Maddieson) ↗Related:tone orthography

pidgin

also: pidgins

A simplified contact language with no native speakers, created for trade or work between groups with no common tongue. When children grow up speaking one natively, it becomes a creole.

Why it matters for MT: Pidgins have high variability and thin text data, making consistent MT especially hard.

APiCS Online (Michaelis et al., eds.) ↗Related:creole lexifier

pitch accent

also: pitch-accent

A system where pitch distinguishes words, but only one syllable per word carries the distinctive pitch — Japanese háshi 'chopsticks' vs hashí 'bridge'. Lighter than full tone, heavier than pure stress.

Why it matters for MT: Like tone, pitch accent is rarely written, so homographs multiply in text.

SIL Glossary of Linguistic Terms ↗Related:tone mora

politeness

also: politeness distinctions · formality · formality system · politeness levels

The linguistic encoding of social relationships — through pronoun choice, verb endings, particles, or vocabulary. Languages range from no grammatical politeness to elaborate multi-level systems.

Why it matters for MT: A translation can be lexically perfect and still fail by choosing the wrong politeness level for the situation.

WALS Online, chapter 45 (Helmbrecht) ↗Related:T-V distinction honorific register

polypersonal agreement

also: polypersonalism · polypersonal

Verb agreement with more than one participant at once — the verb carries markers for both subject and object (and sometimes more). Basque, Georgian, and Algonquian languages do this systematically.

Why it matters for MT: The verb form encodes who acts on whom, so MT must resolve both roles before it can produce a single correct verb.

WALS Online, chapter 102 (Siewierska) ↗Related:agreement polysynthesis obviation

polysynthesis

also: polysynthetic · polysynthetic language

A word-building style where a single verb can contain what other languages express as a whole sentence — subject, object, location, instrument and more, all as parts of one word. Many Indigenous American languages work this way.

Why it matters for MT: Polysynthetic words rarely repeat exactly, so word-level MT sees an endless stream of unknown tokens.

Example: Plains Cree (crk): a single verb can incorporate subject/object pronouns, instrumentals, locations, and actions (card field linguisticChallenges.polysynthesis).

SIL Glossary of Linguistic Terms ↗Related:noun incorporation agglutinative morphology

possession

also: possessive · possessives · alienable · inalienable · possessive affixes

How a language expresses 'my X / your X'. Many languages distinguish inalienable possession (body parts, kin — things you cannot give away) from alienable, marking them with different constructions.

Why it matters for MT: Inalienable possession often requires obligatory possessor marking that English sources omit.

WALS Online, chapter 58 (Nichols & Bickel) ↗Related:genitive head marking

prefix

also: prefixes · prefixing

An affix that attaches to the front of a word stem, like re- in 'rewrite'. Some languages, including many Bantu and Athabaskan languages, carry most of their grammar in strings of prefixes.

Why it matters for MT: Prefix-heavy languages put grammatical information at the start of words, the opposite of what tokenizers trained mostly on suffixing languages expect.

WALS Online, chapter 26 (Dryer) ↗Related:suffix affix

R

reduplication

also: reduplicated · reduplicative

Repeating all or part of a word to change its meaning — to mark plurals, intensity, or ongoing action. Indonesian orang 'person' becomes orang-orang 'people'.

Why it matters for MT: MT must recognize that a doubled word is grammar, not an accidental repetition to be deleted.
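
A minimal sketch of the recognition step, assuming full reduplication is written with a hyphen as in the Indonesian example above. Partial reduplication and unhyphenated spellings are out of scope, and the function name is hypothetical.

```python
import re

# A word, a hyphen, then the same word again: orang-orang.
FULL_REDUPLICATION = re.compile(r"\b(\w+)-\1\b")


def find_reduplications(text):
    """Return every hyphenated full reduplication found in the text."""
    return [match.group(0) for match in FULL_REDUPLICATION.finditer(text)]


print(find_reduplications("orang-orang itu"))  # ['orang-orang']
```

A cleaning step that deletes repeated words could consult such a check first, so that grammatical doubling is left intact.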

WALS Online, chapter 27 (Rubino) ↗Related:morphology grammatical number

register

also: registers · speech register · register-levels · speech levels

A variety of a language tied to social context — formal, casual, ceremonial, technical. Some languages grammaticalize registers: Javanese has distinct vocabulary sets for different politeness levels.

Why it matters for MT: MT must hold a consistent register; mixing formal and casual forms in one output reads as broken or rude.

SIL Glossary of Linguistic Terms ↗Related:T-V distinction honorific diglossia

relative clause

also: relative clauses · relativization

A clause that modifies a noun: 'the book that I read'. Languages place it before or after the noun, and use strategies from relative pronouns to gaps to special verb forms.

Why it matters for MT: Prenominal relative clauses (Japanese, Turkish) require inverting long stretches of text relative to English order.

WALS Online, chapter 90 (Dryer) ↗Related:word order participle

romanization

also: transliteration · romanized · latinization · scriptConverter

Writing a language in Latin letters instead of its native script, by rule (transliteration) or by sound. One language often has several competing romanization standards.

Why it matters for MT: Romanized and native-script text behave as different languages to an MT model unless explicitly converted.

SIL Glossary of Linguistic Terms ↗Related:script orthography tone

root

also: roots · word root

The irreducible core of a word once all affixes are stripped away. In Semitic languages a root is often just three consonants (like k-t-b 'write' in Arabic) that vowel patterns turn into words.

Why it matters for MT: Languages whose roots are discontinuous (consonant skeletons) need specialized segmentation for MT to see word relationships.

SIL Glossary of Linguistic Terms ↗Related:stem root-and-pattern morphology morpheme

root-and-pattern morphology

also: root pattern · rootPattern · templatic morphology · nonconcatenative morphology · root-pattern morphology

A word-building style, typical of Arabic and Hebrew, where a consonant root like k-t-b 'write' is threaded through vowel templates: kitāb 'book', kātib 'writer', maktab 'office'. The root and the pattern each carry meaning, but neither is a contiguous chunk.

Why it matters for MT: Standard subword tokenization cannot see the shared root across these forms, weakening generalization in Semitic-language MT.

Example: Surfaced on cards as linguisticChallenges.rootPattern (e.g. Arabic, Hebrew, Maltese).
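
The interleaving can be sketched as slot filling. In the illustrative code below, each 'C' in a template is a slot for the next root consonant. The template notation and the function name are assumptions made for this example, not an established standard.

```python
def apply_template(root, template):
    """Fill each 'C' slot in the template with the next root consonant."""
    consonants = iter(root)
    return "".join(next(consonants) if ch == "C" else ch for ch in template)


# The root k-t-b threaded through three vowel templates.
for template in ["CiCāC", "CāCiC", "maCCaC"]:
    print(apply_template("ktb", template))
# kitāb
# kātib
# maktab
```

The three outputs share no contiguous substring longer than one letter, which is why a subword tokenizer has nothing to latch onto.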

SIL Glossary of Linguistic Terms ↗Related:root ablaut tokenization

S

sandhi

also: tone sandhi · external sandhi

Sound changes that happen where words or morphemes meet, like 'don't you' becoming 'dontcha'. In tone languages, tones themselves can change in context (tone sandhi).

Why it matters for MT: Sandhi makes written or transcribed forms context-dependent, complicating consistent tokenization.

SIL Glossary of Linguistic Terms ↗Related:tone lenition consonant mutation

script

also: writing system · scripts

The set of symbols a language is written in — Latin, Cyrillic, Arabic, Han characters, and many more. One language can use several scripts (Serbian), and one script can serve hundreds of languages.

Why it matters for MT: Script identity drives every downstream text process: encoding, tokenization, and which MT models even accept the input.

Wikipedia: Writing system (standard reference) ↗Related:orthography abugida syllabary writing direction

serial verb construction

also: serial verbs · serialVerbs · verb serialization · serial verb

Stringing several verbs together in one clause with no 'and' or 'to' between them — 'take knife cut bread' for 'cut the bread with a knife'. Common in West African, Southeast Asian and creole languages.

Why it matters for MT: Serial verbs must be decomposed into prepositions or subordinate clauses when translating into European languages, and rebuilt going the other way.

SIL Glossary of Linguistic Terms ↗Related:grammatical voice particle

stem

also: stems · word stem

The core part of a word that affixes attach to. In 'unbelievable', the stem of -able is 'believe'. In many languages a stem never appears alone and must carry at least some inflection.

Why it matters for MT: Identifying shared stems across word forms is how MT systems generalize from limited training data.

SIL Glossary of Linguistic Terms ↗Related:root affix inflection

subjunctive

also: subjunctive mood

A verb mood for non-asserted content — wishes, doubts, hypotheticals, and clauses after certain verbs. Romance languages require it in many subordinate clauses where English uses plain forms.

Why it matters for MT: Subjunctive triggers are target-language-specific, so MT must apply grammar rules rather than translate forms.

SIL Glossary of Linguistic Terms ↗Related:mood

suffix

also: suffixes · suffixing

An affix that attaches to the end of a word stem, like -ness in 'kindness'. Suffixing is the most common affixation strategy across the world's languages.

Why it matters for MT: Stacked suffixes create long, rare word forms that MT systems must segment correctly to translate.

WALS Online, chapter 26 (Dryer) ↗Related:prefix affix

suppletion

also: suppletive

When a word's inflected forms come from completely different roots, like go/went or good/better. The grammar treats them as one word even though they share no sounds.

Why it matters for MT: Suppletive forms cannot be derived by rule, so MT must have seen each one in training data.

WALS Online, chapter 79 (Veselinova) ↗Related:paradigm inflection

syllabary

also: syllabic script · syllabaries

A script with one symbol per syllable rather than per sound — Japanese kana or Cherokee. Works best for languages with simple syllable structures.

Why it matters for MT: Syllabaries change the granularity of text: tokenizers see syllables, not consonants and vowels.

Wikipedia: Syllabary (standard reference) ↗Related:abugida syllabics script

syllabics

also: Canadian Aboriginal Syllabics · UCAS · Cans

The script family used for Cree, Inuktitut, Ojibwe and other Indigenous Canadian languages, where each character encodes a consonant and its rotation encodes the vowel. Invented in the 1840s and still in active community use.

Why it matters for MT: Many Cree/Inuktitut texts exist in both syllabics and roman orthography, so MT pipelines need reliable script conversion.

Example: Plains Cree (crk): card script is Cans (Canadian Aboriginal Syllabics) with a roman-orthography converter.
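
Before converting, a pipeline first has to know which script a line is in. The sketch below checks code points against the two main Unicode blocks for syllabics (U+1400 to U+167F and the Extended block U+18B0 to U+18FF). The function names and the ratio heuristic are assumptions for illustration, not the converter mentioned above.

```python
# Main Unicode blocks for Canadian Aboriginal Syllabics.
SYLLABICS_BLOCKS = [(0x1400, 0x167F), (0x18B0, 0x18FF)]


def is_syllabics(ch):
    """True if the character falls inside a syllabics block."""
    code_point = ord(ch)
    return any(low <= code_point <= high for low, high in SYLLABICS_BLOCKS)


def syllabics_ratio(text):
    """Share of letters in the text that are syllabics (0.0 if no letters)."""
    letters = [ch for ch in text if ch.isalpha()]
    if not letters:
        return 0.0
    return sum(is_syllabics(ch) for ch in letters) / len(letters)


# Two syllabics characters followed by two Latin letters.
print(syllabics_ratio("\u1403\u140aab"))  # 0.5
```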

Wikipedia: Canadian Aboriginal syllabics (standard reference) ↗Related:syllabary romanization script

T

T-V distinction

also: tu/vous distinction · T-V · formal/informal pronouns · tu-vous · formal you

Having two (or more) words for 'you' depending on social distance, like French tu/vous or German du/Sie. The choice signals respect, intimacy, or hierarchy and is hard to undo once made.

Why it matters for MT: English 'you' gives no cue, so MT must choose formality from context — a frequent and socially visible error.

Example: French (fra): card formality data distinguishes tu/vous usage contexts.

WALS Online, chapter 45 (Helmbrecht) ↗Related:register honorific politeness

tense

also: tenses · past tense · future tense · tense-aspect

Grammatical marking that locates an event in time — past, present, future. Some languages have no grammatical tense at all (Mandarin), while others distinguish several degrees of past remoteness.

Why it matters for MT: Tenseless source text forces MT to infer time reference; remoteness systems demand finer distinctions than the source provides.

WALS Online, chapter 66 (Dahl & Velupillai) ↗Related:aspect mood evidentiality

tokenization

also: tokenizer · subword · subword segmentation · tokenize · tokenization and alignment

Splitting text into the units (tokens) a translation model actually processes. Modern systems split rare words into subword pieces; how well those pieces line up with real morphemes varies hugely by language.

Why it matters for MT: Bad tokenization is a root cause of MT failure for morphologically rich and low-resource languages.
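
To make the mismatch concrete, the toy sketch below segments a word by greedy longest match against a fixed vocabulary. That is the lookup strategy of WordPiece-style tokenizers, not byte pair encoding itself. The vocabulary is invented for illustration and does not come from any real model.

```python
def greedy_segment(word, vocab):
    """Split a word into the longest vocabulary pieces, left to right."""
    pieces = []
    i = 0
    while i < len(word):
        # Try the longest remaining substring first, then shrink.
        for j in range(len(word), i, -1):
            if word[i:j] in vocab:
                pieces.append(word[i:j])
                i = j
                break
        else:
            # Nothing in the vocabulary matches: emit a single character.
            pieces.append(word[i])
            i += 1
    return pieces


# Hypothetical vocabulary, chosen to show a bad split.
TOY_VOCAB = {"unh", "app", "iness", "un", "ness"}

print(greedy_segment("unhappiness", TOY_VOCAB))  # ['unh', 'app', 'iness']
```

None of the three pieces matches a morpheme of 'unhappiness' (un-, happy, -ness), even though 'un' and 'ness' are both in the vocabulary.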

Wikipedia: Byte pair encoding (standard reference) ↗Related:morpheme compounding agglutinative

tone

also: tonal · tone system · tonal language · tones · lexical tone · toneSystem · contour tone

Using voice pitch to distinguish words: Mandarin mā 'mother' vs mǎ 'horse'. Simple systems contrast two levels; complex systems (many West African and Southeast Asian languages) use several levels and contours.

Why it matters for MT: Tone is usually invisible in romanized or unmarked text, collapsing distinct words into one spelling for the MT system.
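
The collapse is easy to reproduce. The sketch below uses Python's standard `unicodedata` module to decompose each character and drop its combining marks, which is roughly what careless text cleaning does. The function name is hypothetical.

```python
import unicodedata


def strip_marks(text):
    """Remove combining marks (tone marks included) after NFD decomposition."""
    decomposed = unicodedata.normalize("NFD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


# 'mother' and 'horse' become the same string once tone marks are gone.
print(strip_marks("mā"), strip_marks("mǎ"))  # ma ma
```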

WALS Online, chapter 13 (Maddieson) ↗Related:pitch accent diacritic sandhi

treebank

also: treebanks · UD treebank · Universal Dependencies

A corpus of sentences hand-annotated with grammatical structure (parse trees). The Universal Dependencies project maintains treebanks in a shared format for 150+ languages.

Why it matters for MT: Treebank existence signals serious NLP infrastructure for a language and enables syntax-aware evaluation.

Universal Dependencies ↗Related:parallel corpus tokenization

U

umlaut

also: i-mutation

A vowel change caused historically by a following vowel, surviving as grammar: German Apfel 'apple' → Äpfel 'apples'. Related to ablaut but with a different historical origin.

Why it matters for MT: Umlaut puts plural or tense marking inside the stem, invisible to affix-based segmentation.

SIL Glossary of Linguistic Terms ↗Related:ablaut inflection

uvular

also: uvulars · uvular consonants · uvular consonant

Consonants made at the very back of the mouth against the uvula, like the Arabic q or French r. Rarer than velar k/g sounds and a signature of certain language areas.

Why it matters for MT: Uvulars are often romanized inconsistently (q/k/kh), splitting one word into several text forms.

WALS Online, chapter 6 (Maddieson) ↗Related:pharyngeal phoneme

V

valency

also: valence · valency patterns · valencyPatterns

How many participants a verb requires and how it encodes them — 'sleep' takes one, 'give' takes three. Languages disagree about which participants particular verbs take and how they are marked.

Why it matters for MT: Valency mismatches make literal translations assign the wrong roles or drop required participants.

ValPaL (Valency Patterns Leipzig) ↗Related:grammatical voice agreement benefactive

vigesimal

also: base-20 · vigesimal system

A counting system based on twenty rather than ten. Maya, Yoruba, and Nahuatl count this way, and Danish and French do so in part: French 'eighty' is literally 'four twenties' (quatre-vingts).

Why it matters for MT: Number-word translation across different bases is error-prone, and numbers are high-stakes content.

Example: Surfaced on cards via numeralSystem.baseType from Numeralbank data.
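
The arithmetic behind a base-20 numeral can be sketched as repeated division. The helper below is a generic base conversion written for illustration. It is not how Numeralbank or the cards represent numerals.

```python
def to_base(n, base=20):
    """Return the digits of a non-negative integer in the given base."""
    digits = []
    while True:
        n, remainder = divmod(n, base)
        digits.append(remainder)
        if n == 0:
            break
    return digits[::-1]  # most significant digit first


print(to_base(80))      # [4, 0]  -> four twenties
print(to_base(99))      # [4, 19] -> four twenties and nineteen
print(to_base(80, 10))  # [8, 0]  -> eight tens
```

The same quantity splits differently in each base, so a numeral cannot be translated word by word.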

Numeralbank (channumerals) ↗Related:classifier

vowel harmony

also: vowel-harmony

A rule that all vowels in a word must agree in some property, such as front/back or rounded/unrounded. In Turkish, suffix vowels change shape to match the stem: ev-ler 'houses' but at-lar 'horses'.

Why it matters for MT: Each suffix has several surface forms, multiplying the token variants MT must learn for one grammatical ending.
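
For the Turkish plural shown above, the choice of surface form can be sketched as a lookup on the stem's last vowel. This is a simplification: it assumes lowercase input and ignores loanwords that break harmony. The function name is hypothetical.

```python
FRONT_VOWELS = set("eiöü")
BACK_VOWELS = set("aıou")


def turkish_plural(stem):
    """Attach -ler or -lar depending on the last vowel of a lowercase stem."""
    for ch in reversed(stem):
        if ch in FRONT_VOWELS:
            return stem + "ler"
        if ch in BACK_VOWELS:
            return stem + "lar"
    raise ValueError(f"no vowel found in {stem!r}")


print(turkish_plural("ev"))  # evler
print(turkish_plural("at"))  # atlar
```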

SIL Glossary of Linguistic Terms ↗Related:advanced tongue root suffix agglutinative

vowel length

also: long vowel · long vowels · vowel-length

Holding a vowel longer to make a different word — Finnish tuli 'fire' vs tuuli 'wind'. Scripts mark it with double letters, macrons (ā), or not at all.

Why it matters for MT: When length marking is optional (e.g. Hawaiian kahakō, Arabic vowels), the same word appears in multiple spellings.

Example: Hawaiian (haw): the kahakō (macron) marks long vowels and is often omitted in casual text.

SIL Glossary of Linguistic Terms ↗Related:mora diacritic gemination

W

word order

also: basic word order · dominant word order · constituent order · SOV · SVO · VSO · VOS · OVS · wordOrder · free word order · flexible word order

The typical arrangement of subject (S), object (O), and verb (V) in a plain statement. SOV (Japanese) and SVO (English) cover most languages; VSO (Welsh, Tagalog-type) is the third common type, and some languages have no fixed order at all.

Why it matters for MT: Word-order mismatch is the single largest driver of reordering errors between language pairs.

WALS Online, chapter 81 (Dryer) ↗Related:oblique adposition isolating

writing direction

also: right-to-left · RTL · left-to-right · LTR · writingDirection · bidirectional text

Which way the script runs: left-to-right (Latin), right-to-left (Arabic, Hebrew), or historically top-to-bottom (Mongolian, classical Chinese). Mixed-direction text needs special handling.

Why it matters for MT: RTL and bidirectional text break naive string handling — numbers, punctuation and embedded Latin words reorder unpredictably.
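
A pipeline can at least detect when a string needs bidirectional handling. The sketch below uses the bidirectional class that Python's standard `unicodedata` module reports for each character: 'R' and 'AL' for right-to-left letters, 'L' for left-to-right ones. The function name is hypothetical, and the check flags mixed text without reordering it.

```python
import unicodedata

RTL_CLASSES = {"R", "AL"}  # Hebrew-type and Arabic-type letters


def has_mixed_direction(text):
    """True if the text contains both right-to-left and left-to-right letters."""
    classes = {unicodedata.bidirectional(ch) for ch in text}
    return bool(classes & RTL_CLASSES) and "L" in classes


hebrew = "\u05e9\u05dc\u05d5\u05dd"  # Hebrew 'shalom', written with escapes
print(has_mixed_direction(hebrew))             # False
print(has_mixed_direction(hebrew + " hello"))  # True
```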

Wikipedia: Bidirectional text (standard reference) ↗Related:script orthography

These terms are marked with a dotted underline wherever they appear in card prose — hover one on any card in the Atlas or the trading-card atlas for the short version, then follow “More →” back here.