ADR-0010: MCP Client Context Window Defaults
Status: Proposed
Date: 2026-02-23
Decision Makers: Brandon Fox
Context
The MCPRulesClient (spec 010-mcp-client-integration) retrieves rule markdown chunks from the RAG MCP server. A single broad query can return 10–20 chunks totalling 10,000–15,000+ tokens. Agents operate within finite LLM context windows (Gemini 1.5 Flash: 1M tokens theoretical, but practical prompt budgets are much smaller for cost and latency reasons). Without a client-side cap, agents risk:
- Context pollution — low-relevance chunks dilute the signal for the LLM.
- Latency bloat — more tokens = slower inference, violating the <1.5s retrieval target.
- Silent truncation — if the LLM or framework truncates input silently, critical rules may be lost unpredictably.
A default token budget must be chosen. The trade-off is between comprehensiveness (more context = more rules coverage) and precision (less context = faster, more focused reasoning).
Decision
Set the default max_context_tokens to 8,192 tokens with the following truncation strategy:
- Chunks are returned in descending relevance order (highest score first).
- Chunks are accumulated until adding the next chunk would exceed the budget.
- No mid-chunk splitting — a chunk is either fully included or fully excluded.
- If any chunks are dropped, the RulesQueryResult.truncated flag is set to true and a structlog warning is emitted.
Token estimation uses a simple heuristic: len(content) // 4 (roughly 4 characters per token for English text). This avoids a tokenizer dependency while being accurate enough for budget enforcement.
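The truncation strategy and the estimation heuristic can be sketched together as follows. This is a minimal illustration, not the client's actual code: the Chunk class, the apply_budget function, and the field layout of RulesQueryResult are assumptions made for the example, and the standard-library logging module stands in for structlog so the block stays dependency-free. It also assumes that accumulation stops at the first chunk that does not fit, rather than skipping it and trying smaller ones.

```python
import logging
from dataclasses import dataclass, field

# Stand-in for structlog; the logger name is hypothetical.
logger = logging.getLogger("mcp_rules_client")

DEFAULT_MAX_CONTEXT_TOKENS = 8192  # default chosen by this ADR


@dataclass
class Chunk:
    """Hypothetical chunk shape: markdown content plus a relevance score."""
    content: str
    score: float


@dataclass
class RulesQueryResult:
    """Only the `truncated` flag is named in the ADR; `chunks` is assumed."""
    chunks: list[Chunk] = field(default_factory=list)
    truncated: bool = False


def estimate_tokens(content: str) -> int:
    """Heuristic from the ADR: roughly 4 characters per token."""
    return len(content) // 4


def apply_budget(
    chunks: list[Chunk],
    max_context_tokens: int = DEFAULT_MAX_CONTEXT_TOKENS,
) -> RulesQueryResult:
    # Highest relevance first.
    ranked = sorted(chunks, key=lambda c: c.score, reverse=True)

    kept: list[Chunk] = []
    used = 0
    for chunk in ranked:
        cost = estimate_tokens(chunk.content)
        # Whole chunks only: stop as soon as the next one would overflow.
        if used + cost > max_context_tokens:
            break
        kept.append(chunk)
        used += cost

    truncated = len(kept) < len(ranked)
    if truncated:
        logger.warning(
            "rules context truncated: kept %d of %d chunks (~%d tokens, budget %d)",
            len(kept), len(ranked), used, max_context_tokens,
        )
    return RulesQueryResult(chunks=kept, truncated=truncated)
```

A caller would pass the chunks returned by the server and check result.truncated before deciding whether to narrow the query or raise the budget.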
Rationale for 8,192
- Gemini 1.5 Flash handles 8K retrieval context comfortably alongside a system prompt and conversation history.
- The RAG pipeline's average chunk size is ~500 tokens, so 8,192 tokens ≈ 16 chunks — enough to cover most focused queries without overloading.
- Agents with specialized needs (e.g., full-army analysis) can override to higher values explicitly.
- The value is a power of two, aligning with common LLM context window conventions.
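The chunk-count estimate above is simple integer arithmetic, shown here for the three budgets this ADR discusses. The ~500-token average and the 4-characters-per-token ratio come from this document; the function and constant names are illustrative only.

```python
AVG_CHUNK_TOKENS = 500  # approximate average chunk size cited in this ADR
CHARS_PER_TOKEN = 4     # ratio used by the len(content) // 4 heuristic


def chunks_that_fit(budget_tokens: int, avg_chunk_tokens: int = AVG_CHUNK_TOKENS) -> int:
    """Number of whole average-sized chunks that fit in the budget."""
    return budget_tokens // avg_chunk_tokens


for budget in (4096, 8192, 16384):
    # Print the budget, the whole chunks it holds, and its size in characters.
    print(budget, chunks_that_fit(budget), budget * CHARS_PER_TOKEN)
```

Under these assumptions the 8,192-token default holds 16 average chunks, or about 32,768 characters of rules markdown.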
Consequences
Positive
- Predictable agent behaviour: retrieved context size is bounded and documented
- Agents always receive the most relevant chunks first
- No silent data loss: truncation is explicit via the truncated flag and logging
- Zero external tokenizer dependency (no tiktoken, no model-specific encoders)
Negative
- The len // 4 heuristic is approximate: actual token counts may vary ±15% for non-English text, code blocks, or special characters in rules markdown
- 8,192 may be too conservative for agents doing cross-faction comparisons that legitimately need 20+ chunks
- Overriding the default requires the caller to know and set max_context_tokens, which is a configuration detail leaked to the agent layer
Neutral
- The default can be adjusted in MCPClientConfig without breaking changes
- A future ADR could introduce model-aware tokenization if the heuristic proves insufficient
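An override of the default might look like the sketch below. Only the name MCPClientConfig, the max_context_tokens setting, and the 8,192 default come from this ADR. The frozen-dataclass layout, the validation, and the variable names are assumptions made for illustration.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class MCPClientConfig:
    # Hypothetical layout: the real config likely carries more fields.
    max_context_tokens: int = 8192

    def __post_init__(self) -> None:
        # Assumed guard: a non-positive budget would exclude every chunk.
        if self.max_context_tokens <= 0:
            raise ValueError("max_context_tokens must be positive")


# Most agents rely on the default budget.
default_config = MCPClientConfig()

# An agent with specialized needs opts in to a larger budget explicitly.
cross_faction_config = replace(default_config, max_context_tokens=16384)
```

Keeping the override explicit at the call site makes the larger budget visible in the agent's code.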
Alternatives Considered
- No default cap (unlimited): Risks context pollution, unpredictable latency, and silent truncation at the LLM level. Rejected.
- 4,096 tokens: Too restrictive for queries spanning multiple units or complex interactions. Would frequently trigger truncation on routine queries.
- 16,384 tokens: Generous but risks latency degradation and wastes context budget that agents need for conversation history and system prompts.
- Exact tokenizer (tiktoken / SentencePiece): Accurate but adds a heavy dependency, couples the client to a specific model family, and is overkill for a budget enforcement heuristic.
References
- 010-mcp-client-integration spec
- 005-rag-pipeline spec — defines chunk structure and scoring
- Gemini 1.5 Flash context window