ADR-0010: MCP Client Context Window Defaults

Status: Proposed
Date: 2026-02-23
Decision Makers: Brandon Fox

Context

The MCPRulesClient (spec 010-mcp-client-integration) retrieves rule markdown chunks from the RAG MCP server. A single broad query can return 10–20 chunks totalling 10,000–15,000+ tokens. Agents operate within finite LLM context windows (Gemini 1.5 Flash: 1M tokens theoretical, but practical prompt budgets are much smaller for cost and latency reasons). Without a client-side cap, agents risk:

Context pollution: low-relevance chunks dilute the signal for the LLM.
Latency bloat: more tokens mean slower inference, which puts the <1.5s retrieval target at risk.
Silent truncation: if the LLM or framework silently truncates its input, critical rules may be lost unpredictably.

A default token budget must be chosen. The trade-off is between comprehensiveness (more context = more rules coverage) and precision (less context = faster, more focused reasoning).

Decision

Set the default max_context_tokens to 8,192 tokens with the following truncation strategy:

Chunks are returned in descending relevance order (highest score first).
Chunks are accumulated until adding the next chunk would exceed the budget.
No mid-chunk splitting — a chunk is either fully included or fully excluded.
If any chunks are dropped, the RulesQueryResult.truncated flag is set to true and a structlog warning is emitted.

Token estimation uses a simple heuristic: len(content) // 4 (roughly 4 characters per token for English text). This avoids a tokenizer dependency while being accurate enough for budget enforcement.
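
The truncation strategy and the estimation heuristic above can be sketched as follows. This is a minimal illustration under stated assumptions, not the client's actual implementation: the Chunk fields, the estimate_tokens and apply_budget names, and the use of the standard logging module as a stand-in for structlog are all hypothetical choices made to keep the example self-contained. Only max_context_tokens, RulesQueryResult.truncated, and the len(content) // 4 rule come from this ADR.

```python
import logging
from dataclasses import dataclass, field

# Stand-in for structlog so the sketch has no third-party dependency.
logger = logging.getLogger("mcp_rules_client")

DEFAULT_MAX_CONTEXT_TOKENS = 8192


@dataclass
class Chunk:
    # Hypothetical shape; the real chunk structure is defined in the RAG pipeline spec.
    content: str
    score: float


@dataclass
class RulesQueryResult:
    chunks: list[Chunk] = field(default_factory=list)
    truncated: bool = False


def estimate_tokens(content: str) -> int:
    """Estimate tokens as roughly 4 characters per token."""
    return len(content) // 4


def apply_budget(
    chunks: list[Chunk],
    max_context_tokens: int = DEFAULT_MAX_CONTEXT_TOKENS,
) -> RulesQueryResult:
    """Keep whole chunks, highest score first, until the next one would not fit."""
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)
    kept: list[Chunk] = []
    used = 0
    for chunk in ranked:
        cost = estimate_tokens(chunk.content)
        if used + cost > max_context_tokens:
            break  # no mid-chunk splitting: stop at the first chunk that overflows
        kept.append(chunk)
        used += cost

    truncated = len(kept) < len(ranked)
    if truncated:
        logger.warning(
            "rules context truncated: kept=%d dropped=%d budget=%d",
            len(kept), len(ranked) - len(kept), max_context_tokens,
        )
    return RulesQueryResult(chunks=kept, truncated=truncated)


if __name__ == "__main__":
    # Three chunks of about 500 estimated tokens each against a 1,200-token budget.
    sample = [Chunk("a" * 2000, 0.9), Chunk("b" * 2000, 0.5), Chunk("c" * 2000, 0.7)]
    result = apply_budget(sample, max_context_tokens=1200)
    print([c.score for c in result.chunks], result.truncated)
```

The sketch stops at the first chunk that does not fit instead of skipping it and trying smaller, lower-ranked chunks. That reading matches "accumulated until adding the next chunk would exceed the budget", but it is an interpretation.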

Rationale for 8,192

Gemini 1.5 Flash handles 8K retrieval context comfortably alongside a system prompt and conversation history.
The RAG pipeline's average chunk size is ~500 tokens, so 8,192 tokens ≈ 16 chunks — enough to cover most focused queries without overloading.
Agents with specialized needs (e.g., full-army analysis) can override to higher values explicitly.
The value is a power of two, aligning with common LLM context window conventions.
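
The chunk-capacity arithmetic behind this rationale can be checked with a few lines. The ~500-token average comes from the text above; the chunks_that_fit helper is a hypothetical name, and the comparison budgets are the alternatives weighed in this ADR.

```python
AVG_CHUNK_TOKENS = 500  # approximate average chunk size cited in the rationale
CHARS_PER_TOKEN = 4     # the heuristic's assumed characters per token


def chunks_that_fit(budget_tokens: int, avg_chunk_tokens: int = AVG_CHUNK_TOKENS) -> int:
    """Number of whole average-sized chunks that fit in the budget."""
    return budget_tokens // avg_chunk_tokens


if __name__ == "__main__":
    for budget in (4096, 8192, 16384):
        print(
            f"budget={budget} tokens, "
            f"about {budget * CHARS_PER_TOKEN} characters, "
            f"{chunks_that_fit(budget)} average chunks"
        )
```

With these inputs, 8,192 tokens hold 16 average chunks, while 4,096 hold 8 and 16,384 hold 32.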

Consequences

Positive

Predictable agent behaviour: retrieved context size is bounded and documented
Agents always receive the most relevant chunks first
No silent data loss: truncation is explicit via the truncated flag and logging
Zero external tokenizer dependency (no tiktoken, no model-specific encoders)

Negative

The len(content) // 4 heuristic is approximate: actual token counts may vary by ±15% for non-English text, code blocks, or special characters in rules markdown
8,192 may be too conservative for agents doing cross-faction comparisons that legitimately need 20+ chunks
Overriding the default requires the caller to know about and set max_context_tokens, which leaks a configuration detail into the agent layer
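
To make the ±15% figure above concrete, the following sketch computes the range of actual token counts that a fully used estimated budget could correspond to. The actual_token_range helper is hypothetical, and the calculation assumes the error applies uniformly to the whole retrieved context, which is a simplification.

```python
def actual_token_range(estimated_tokens: int, error_pct: int = 15) -> tuple[int, int]:
    """Return (low, high) actual token counts for a given estimate and error band.

    Integer arithmetic is used so the bounds are exact: the low bound is
    rounded down and the high bound is rounded up.
    """
    low = estimated_tokens * (100 - error_pct) // 100
    high = -(-estimated_tokens * (100 + error_pct) // 100)  # ceiling division
    return low, high


if __name__ == "__main__":
    low, high = actual_token_range(8192)
    print(f"actual tokens between {low} and {high}")
    print(f"worst-case overshoot: {high - 8192} tokens")
```

Under that assumption, a context estimated at 8,192 tokens could actually occupy between 6,963 and 9,421 tokens, so callers with a hard prompt limit may want to leave some headroom.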

Neutral

The default can be adjusted in MCPClientConfig without breaking changes
A future ADR could introduce model-aware tokenization if the heuristic proves insufficient
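
As an illustration of how the default and an explicit override might look, here is a minimal sketch. Only the MCPClientConfig and max_context_tokens names appear in this ADR; the dataclass shape, the validation, and the 16,384 override value are assumptions for the example, not the real configuration class.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MCPClientConfig:
    # Hypothetical shape: the real class may carry more fields.
    max_context_tokens: int = 8192

    def __post_init__(self) -> None:
        if self.max_context_tokens <= 0:
            raise ValueError("max_context_tokens must be positive")


# Most agents rely on the documented default.
default_config = MCPClientConfig()

# An agent with broader needs opts in to a larger budget explicitly.
cross_faction_config = replace(default_config, max_context_tokens=16384)

if __name__ == "__main__":
    print(default_config.max_context_tokens, cross_faction_config.max_context_tokens)
```

Keeping the value as a field with a default means the default can change later without touching callers that never set it.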

Alternatives Considered

No default cap (unlimited): Risks context pollution, unpredictable latency, and silent truncation at the LLM level. Rejected.
4,096 tokens: Too restrictive for queries spanning multiple units or complex interactions. Would frequently trigger truncation on routine queries.
16,384 tokens: Generous but risks latency degradation and wastes context budget that agents need for conversation history and system prompts.
Exact tokenizer (tiktoken / SentencePiece): Accurate but adds a heavy dependency, couples the client to a specific model family, and is overkill for a budget enforcement heuristic.

References

010-mcp-client-integration spec
005-rag-pipeline spec — defines chunk structure and scoring
Gemini 1.5 Flash context window