
Content Translation Resilience

Champollion's content translation pipeline (Markdown/MDX documents) uses a multi-layered resilience system to handle failures gracefully. Unlike key-value translation — where each batch is small and retries are cheap — content translation involves large prompts and long outputs that can fail for structural reasons, not just transient ones.

The Problem

Content translation has fundamentally different failure modes from key-value translation:

| Failure Mode | Key-Value | Content |
| --- | --- | --- |
| Rate limit (429) | Common, transient | Common, transient |
| Timeout | Rare (small batches) | Common (long output) |
| Empty response | Rare | Common (output limits, filters) |
| Output truncation | N/A (JSON validated) | Happens silently |
| Content filter | Extremely rare | Possible (CLI docs, security docs) |
| Model limitation | Retry fixes it | Retrying won't fix it |

The key insight: retrying the same failing request is not redundancy, it's stubbornness. A proper resilience system identifies why something failed and changes its approach accordingly.

Architecture Overview

Layer 1: Diagnostic-First Retry

Before deciding how to retry, the system inspects the API response to understand what failed.

Finish Reason Analysis

OpenAI-compatible chat completion APIs return a finish_reason alongside the generated text; other providers expose an equivalent field under a different name. Champollion uses this to make intelligent retry decisions:

| finish_reason | Meaning | Action |
| --- | --- | --- |
| stop + content | Model completed normally | ✅ Accept result |
| stop + empty | Model generated nothing | ⚠️ Retry same request (transient) |
| length | Output hit token limit | 🔶 Auto-chunk the document |
| content_filter | Safety filter blocked output | 🔴 Log and skip (retry won't help) |
| null / missing | Malformed response | ⚠️ Retry same request (transient) |

This replaces the current approach of treating every failure identically with backoff retries.
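
The decision table above can be sketched as a small pure function. This is a minimal illustration, not Champollion's actual code: the names `RetryAction`, `CompletionResult`, and `decideAction` are hypothetical, and treating an unrecognized finish_reason like a missing one is an assumption.

```typescript
// Hypothetical names; only the finish_reason values come from the table above.
type RetryAction = "accept" | "retry" | "chunk" | "skip";

interface CompletionResult {
  finishReason: string | null | undefined; // e.g. "stop", "length", "content_filter"
  content: string | null | undefined;
}

function decideAction(result: CompletionResult): RetryAction {
  const text = (result.content ?? "").trim();
  switch (result.finishReason) {
    case "stop":
      // Normal completion: accept non-empty output, treat empty output as transient.
      return text.length > 0 ? "accept" : "retry";
    case "length":
      // Output hit the token limit: the same request would truncate again.
      return "chunk";
    case "content_filter":
      // Blocked by a safety filter: log and skip, retrying won't help.
      return "skip";
    default:
      // null, missing, or (by assumption) unrecognized: treat as malformed.
      return "retry";
  }
}
```

Keeping the classification separate from the network call makes each row of the table easy to unit test.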

Retry Budget

The standard retry budget for transient failures:

| Round | Attempts | Timeout | Backoff |
| --- | --- | --- | --- |
| Standard | 4 (0→3) | 60s | 1s → 2s → 4s |
| Escalated | 4 (0→3) | 120s | 1s → 2s → 4s |
| Total | 8 | ~3.5 min worst case | |

Between rounds, a 10-second cool-down allows transient issues to resolve.
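
As a rough sketch, the schedule in the table can be expanded into an explicit list of attempts. The type `PlannedAttempt` and the function `buildRetryPlan` are hypothetical names; the timeouts, backoff values, and cool-down are taken from the text above.

```typescript
// Hypothetical helper that expands the retry budget into individual attempts.
interface PlannedAttempt {
  round: "standard" | "escalated";
  attempt: number;      // 0-based within the round
  timeoutMs: number;    // per-request timeout
  waitBeforeMs: number; // delay before this attempt starts
}

const COOL_DOWN_MS = 10000; // pause between the standard and escalated rounds

function buildRetryPlan(): PlannedAttempt[] {
  const rounds = [
    { round: "standard" as const, timeoutMs: 60000 },
    { round: "escalated" as const, timeoutMs: 120000 },
  ];
  const plan: PlannedAttempt[] = [];
  rounds.forEach((r, roundIndex) => {
    for (let attempt = 0; attempt < 4; attempt++) {
      let waitBeforeMs: number;
      if (attempt > 0) {
        waitBeforeMs = 1000 * 2 ** (attempt - 1); // 1s, 2s, 4s
      } else {
        // First attempt of a round: no wait, except the cool-down before round 2.
        waitBeforeMs = roundIndex === 0 ? 0 : COOL_DOWN_MS;
      }
      plan.push({ round: r.round, attempt, timeoutMs: r.timeoutMs, waitBeforeMs });
    }
  });
  return plan;
}
```

A caller would walk this list in order, sleeping for `waitBeforeMs` and then issuing the request with `timeoutMs`, stopping at the first accepted result.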

Layer 2: Content Chunking

When a document exceeds a size threshold — or when Layer 1 signals output truncation — the system splits the document into translation-sized chunks.

See Context Rollover for detailed chunking configuration. The key points:

Splitting Strategy

  1. Heading boundaries: ## and ### headings are natural translation unit boundaries. Each section is self-contained enough for independent translation.
  2. Paragraph fallback: if a single heading section exceeds the chunk size, split at double newlines.
  3. Hard split: last resort for blocks that are still too long (e.g., very long paragraphs or tables). Split at sentence boundaries.
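
The three-level strategy above can be sketched as a recursive splitter. This is an illustration under several assumptions: sizes are measured in characters rather than tokens, adjacent small pieces are greedily merged up to the limit, sentence detection is a naive punctuation match, and a fixed-width cut is added as a final safety net for text with no sentence boundaries. The names `Splitter`, `splitters`, and `chunkDocument` are hypothetical.

```typescript
type Splitter = (text: string) => string[];

// Ordered from coarsest to finest. Every splitter is lossless: joining its
// pieces reproduces the input exactly.
const splitters: Splitter[] = [
  (t) => t.split(/(?=^#{2,3} )/m),                      // before "## " / "### " lines
  (t) => t.split(/(?=\n\n)/),                           // before blank lines
  (t) => t.match(/[^.!?]*[.!?]+\s*|[^.!?]+$/g) ?? [t],  // naive sentence boundaries
];

function chunkDocument(text: string, maxChars: number, level = 0): string[] {
  if (maxChars < 1) throw new Error("maxChars must be at least 1");
  if (text.length <= maxChars) return [text];

  const split = splitters[level];
  if (split === undefined) {
    // Assumption: no boundary left to try, so cut at fixed width.
    const cuts: string[] = [];
    for (let i = 0; i < text.length; i += maxChars) cuts.push(text.slice(i, i + maxChars));
    return cuts;
  }

  const chunks: string[] = [];
  let current = "";
  for (const piece of split(text)) {
    // Pieces that are still too large fall through to the next, finer level.
    for (const part of chunkDocument(piece, maxChars, level + 1)) {
      if (current.length > 0 && current.length + part.length > maxChars) {
        chunks.push(current);
        current = "";
      }
      current += part;
    }
  }
  if (current.length > 0) chunks.push(current);
  return chunks;
}
```

Because every level is lossless, concatenating the returned chunks always reproduces the source document, which makes the splitter easy to verify. The heading pattern does not account for `##` lines inside fenced code blocks; a production splitter would need to.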

Context Between Chunks

Each chunk receives the last 2-3 paragraphs of the previous chunk's translation as context. This helps with:

  • Terminology consistency: the model sees what it called "tableau de bord" in the previous chunk, which prevents drift
  • Pronoun resolution: antecedents from the previous section carry forward
  • Register consistency: the tone established in chunk 1 persists through chunk N
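
Extracting that rolling context could look like the following sketch. The function name `trailingContext` is hypothetical, and treating runs of blank lines as paragraph separators is an assumption.

```typescript
// Returns the last `paragraphs` paragraphs of an already translated chunk,
// to be prepended as context to the next chunk's prompt.
function trailingContext(translatedChunk: string, paragraphs = 3): string {
  if (paragraphs <= 0) return "";
  const paras = translatedChunk
    .split(/\n{2,}/)              // paragraphs are separated by blank lines
    .map((p) => p.trim())
    .filter((p) => p.length > 0); // drop empty fragments
  return paras.slice(-paragraphs).join("\n\n");
}
```

The context is taken from the translation, not the source, so the model sees the target-language terms it has already committed to.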

Auto-Chunking Triggers

| Trigger | Behavior |
| --- | --- |
| contentChunkSize set in config | Always chunk docs exceeding that size |
| finish_reason: "length" returned | Auto-chunk as fallback (even without config) |
| Input > ~12KB (auto-detect) | Log suggestion, but don't force |
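
One way to read the trigger table is as a three-way decision. The sketch below is an interpretation, not the tool's logic: `ChunkInputs` and `chunkDecision` are hypothetical names, the 12KB threshold is taken as 12 × 1024 bytes, and limiting the suggestion to the case where contentChunkSize is unset is an assumption.

```typescript
type ChunkDecision = "chunk" | "suggest" | "none";

interface ChunkInputs {
  contentChunkSize: number | null; // from config, in tokens; null = not configured
  docTokens: number;               // estimated token count of the source document
  docBytes: number;                // size of the source document in bytes
  lastFinishReason?: string;       // finish_reason of the previous attempt, if any
}

const SUGGEST_BYTES = 12 * 1024; // "~12KB" from the table, assumed to mean KiB

function chunkDecision(input: ChunkInputs): ChunkDecision {
  // Truncation forces chunking even when chunking is not configured.
  if (input.lastFinishReason === "length") return "chunk";
  // Configured chunk size: always chunk documents that exceed it.
  if (input.contentChunkSize !== null && input.docTokens > input.contentChunkSize) return "chunk";
  // Large input without configuration: only log a suggestion.
  if (input.contentChunkSize === null && input.docBytes > SUGGEST_BYTES) return "suggest";
  return "none";
}
```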

Layer 3: Model Fallback Chain

When the configured model fails consistently — not transiently, but structurally — the system tries alternative models. Different models have different context windows, output limits, safety filters, and multilingual strengths.

Default Fallback Chain

champollion.config.json
{
  "contentFallbackChain": [
    "google/gemini-2.5-flash",
    "anthropic/claude-sonnet-4"
  ]
}

The configured model is always tried first. Fallback models are only used after all retry rounds (standard + escalated) are exhausted.
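
The ordering rule can be sketched as follows. Both function names are hypothetical, de-duplicating a model that appears twice is an assumption, and the `attempt` callback stands in for "run all retry rounds against this model". It is synchronous here only to keep the sketch short; a real implementation would be asynchronous.

```typescript
// Configured model first, then the fallback chain, without duplicates.
function modelOrder(configuredModel: string, fallbackChain: string[]): string[] {
  const seen = new Set<string>();
  const order: string[] = [];
  for (const model of [configuredModel, ...fallbackChain]) {
    if (!seen.has(model)) {
      seen.add(model);
      order.push(model);
    }
  }
  return order;
}

// Tries each model in order. `attempt` returns null once its retry rounds
// are exhausted, which moves the loop on to the next model.
function runWithFallback<T>(
  models: string[],
  attempt: (model: string) => T | null,
): { model: string; result: T; modelsTried: string[] } | null {
  const modelsTried: string[] = [];
  for (const model of models) {
    modelsTried.push(model);
    const result = attempt(model);
    if (result !== null) return { model, result, modelsTried };
  }
  return null; // every model failed; the caller records the failure
}
```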

Why Multiple Architectures

| Scenario | Primary Model Fails | Fallback Model Succeeds |
| --- | --- | --- |
| Vietnamese CLI docs | Gemini returns empty | Claude handles it fine |
| Safety-filtered content | OpenAI blocks it | Gemini has different filter thresholds |
| Long structured tables | Model A truncates | Model B has larger output window |

The value of fallback is architectural diversity — different model families have different failure modes. A failure that's structural for one model may be trivial for another.

Scope

Model fallback is content-only. Key-value batches are small and almost never fail structurally. Adding fallback complexity there would be over-engineering.

Layer 4: Failure Accounting

When failures do occur, the system tracks and reports them properly instead of silently continuing.

During Sync

  • Failed items show [FAIL] in progress output
  • Each failure logs the specific reason (timeout, empty response, content filter, truncation)
  • Completed items are saved to the manifest immediately (incremental persistence)

After Sync

A failure summary prints at the end:

┌─ Content Translation Failures ─────────────────────────────────────┐
│                                                                    │
│ 2 of 24 content translations failed:                               │
│                                                                    │
│ ✗ docs/reference/cli.md → vi                                       │
│   Reason: empty response after 8 attempts + 1 fallback model       │
│   Models tried: google/gemini-3.1-pro-preview, gemini-2.5-flash    │
│                                                                    │
│ ✗ docs/guides/troubleshooting.md → ar                              │
│   Reason: content_filter (no retry — blocked by safety filter)     │
│                                                                    │
│ Re-run: npx champollion@latest sync                                │
│ (22 completed translations are cached and won't re-run)            │
└────────────────────────────────────────────────────────────────────┘

Retry Manifest

Failed files are written to .champollion-retry.json:

{
  "failedAt": "2026-05-27T21:45:00Z",
  "files": [
    {
      "source": "docs/reference/cli.md",
      "locale": "vi",
      "reason": "empty_response",
      "attempts": 8,
      "modelsTried": ["google/gemini-3.1-pro-preview", "google/gemini-2.5-flash"]
    }
  ]
}

On the next sync run, only these files are re-processed. Completed files are preserved via the content hash manifest (.champollion-content.lock).
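
A consumer of the retry manifest might model it as below. The field names mirror the JSON example above; the interfaces and the functions `parseRetryManifest` and `needsRetry` are hypothetical, and the validation shown is a minimal assumption rather than a documented schema.

```typescript
interface RetryEntry {
  source: string;        // path of the source document
  locale: string;        // target locale that failed
  reason: string;        // e.g. "empty_response"
  attempts: number;
  modelsTried: string[];
}

interface RetryManifest {
  failedAt: string; // ISO 8601 timestamp
  files: RetryEntry[];
}

// Parses the manifest text and rejects obviously malformed input.
function parseRetryManifest(json: string): RetryManifest {
  const data = JSON.parse(json) as Partial<RetryManifest> | null;
  if (data === null || typeof data.failedAt !== "string" || !Array.isArray(data.files)) {
    throw new Error("Malformed retry manifest");
  }
  return { failedAt: data.failedAt, files: data.files };
}

// True when this source/locale pair failed on the previous run.
function needsRetry(manifest: RetryManifest, source: string, locale: string): boolean {
  return manifest.files.some((f) => f.source === source && f.locale === locale);
}
```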

Exit Codes

| Code | Meaning |
| --- | --- |
| 0 | All translations succeeded |
| 1 | Configuration error, missing API key, etc. |
| 2 | Partial failure (some content translations failed) |
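
The mapping from a sync outcome to an exit code is small enough to state directly. The `SyncOutcome` shape and the function `exitCodeFor` are hypothetical; letting a configuration error take precedence over partial failures is an assumption.

```typescript
interface SyncOutcome {
  configError: boolean;  // bad config, missing API key, etc.
  failedContent: number; // number of content translations that failed
}

function exitCodeFor(outcome: SyncOutcome): 0 | 1 | 2 {
  if (outcome.configError) return 1;        // nothing could run
  return outcome.failedContent > 0 ? 2 : 0; // partial failure vs. full success
}
```

A distinct code for partial failure lets CI pipelines tell "fix your setup" apart from "re-run to pick up the failed files".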

Configuration

champollion.config.json
{
  "contentChunkSize": 4000,
  "contentOverlap": 200,
  "contentFallbackChain": [
    "google/gemini-2.5-flash",
    "anthropic/claude-sonnet-4"
  ]
}

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| contentChunkSize | number or null | null | Max tokens per content chunk. null = no chunking (auto-chunks on truncation only) |
| contentOverlap | number | 200 | Overlap tokens between content chunks for context continuity |
| contentFallbackChain | string[] | [] | Fallback models to try when the configured model fails structurally |
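
Applying the defaults from the field table might look like this sketch. `ContentConfig`, `CONTENT_DEFAULTS`, and `resolveContentConfig` are hypothetical names, and the check that the overlap is smaller than the chunk size is an added assumption, not documented behavior.

```typescript
interface ContentConfig {
  contentChunkSize: number | null;
  contentOverlap: number;
  contentFallbackChain: string[];
}

// Defaults as listed in the field table.
const CONTENT_DEFAULTS: ContentConfig = {
  contentChunkSize: null,
  contentOverlap: 200,
  contentFallbackChain: [],
};

function resolveContentConfig(user: Partial<ContentConfig>): ContentConfig {
  const merged: ContentConfig = { ...CONTENT_DEFAULTS, ...user };
  // Assumption: an overlap at least as large as the chunk makes no progress.
  if (merged.contentChunkSize !== null && merged.contentOverlap >= merged.contentChunkSize) {
    throw new Error("contentOverlap must be smaller than contentChunkSize");
  }
  return merged;
}
```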

Implementation Status

| Feature | Status |
| --- | --- |
| Diagnostic-first retry (finish_reason parsing) | 🔲 Planned |
| Content chunking (heading/paragraph split) | 🔲 Planned |
| Context rollover between chunks | 🔲 Planned |
| Model fallback chain | 🔲 Planned |
| Failure summary report | 🔲 Planned |
| Retry manifest (.champollion-retry.json) | 🔲 Planned |
| Exit code 2 for partial failures | 🔲 Planned |
| Escalation retry (extended timeout) | ✅ Implemented (v3.3.3) |
| Attempt-numbered retry messages | ✅ Implemented (v3.3.3) |
| Loud failure on content errors | ✅ Implemented (v3.3.3) |

See Also