Why your AI coding assistant gets brain-rekt by turn 7. Firmware can't afford that

Digvijay Rathore · 7 min read
ai · firmware · context-decay · llm · research

The gist

Firmware engineers keep telling me the same thing about AI coding assistants. First few turns, it is sharp. Reads the driver, finds the config change, spots the issue. By turn 5 it is re-reading files it already looked at. By turn 7 it suggests a fix that contradicts what it told you three turns ago. By turn 8 you are spending more time verifying its output than you would have spent writing the code yourself. And the worst part: it delivers the wrong register value with complete confidence, no hesitation, no hedge.

If you have used Claude Code or Copilot on any real firmware task, a clock tree trace, a DMA debug, an SPI bringup, you have seen this arc. The tool does not get confused by your domain. It gets confused by its own accumulating context.

This is now well measured. Microsoft Research found a 39% performance drop in multi-turn versus single-turn use across 200,000+ simulated conversations [1]. Levy et al. showed accuracy falling from 92% to 68% on identical reasoning tasks when context grows from 250 to 3,000 tokens [5]. And next-word prediction accuracy, the model's internal measure of how well it tracks the text, actually improves on longer inputs while reasoning gets worse. The model reads better and thinks worse at the same time.

The model knows what SPI is. It knows clock prescalers. What it cannot do is hold that knowledge at register-level precision across 10 interactions while its context fills with file reads, build output, and test dumps. Can we keep it sharp at turn 10 instead of watching it decay by turn 3? The research says yes, if we stop wasting its cognitive budget on tokens that have nothing to do with firmware.

The problem, described

Claude Code loads approximately 71,000 tokens before you type a character. A 6,200-token system prompt covering git commit protocols, CSS formatting, emoji policies. 16,500 tokens of schemas for 18 general-purpose tools, 70% of which is behavioral prose embedded inside tool descriptions. 13,200 tokens of deferred tool stubs. And 33,000 tokens of reserved compaction buffer that sits as permanent dead space.

Your firmware problem arrives at positional offset 71,000 in the token sequence. That number matters. Du et al. (EMNLP 2025) tested what happens when all distractor content is replaced with whitespace, keeping only the evidence and the question. On Llama-3.1-8B, code generation accuracy still dropped from 57.3% to 25.6% at 30,000 tokens. With full attention masking, it fell to 7.3%. The degradation is in the positional encoding itself [2]. Your problem starts at 71K, deep into that zone, before a single file is read.

Now trace a firmware session. Reading clock config, SPI driver, DMA setup, headers, board init: roughly 20,000 tokens. Git diff of the recent refactor: 2,500. Build attempt: 2,000 to 8,000 depending on warnings. Test harness output with register dumps, assertion failures, serial logs: 5,000 to 15,000 per run.
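A quick tally makes the trajectory concrete. The per-item costs below come straight from the figures above; the exact per-turn mix is an illustrative assumption, not a measured trace:

```python
# Illustrative token tally for a five-turn firmware debug session.
# Per-item costs match the article's figures; the per-turn mix is assumed.
HARNESS_OVERHEAD = 71_000  # system prompt, tool schemas, stubs, buffer

session = [
    {"file_reads": 20_000, "git_diff": 2_500},   # turn 1: orientation
    {"build": 5_000},                            # turn 2: first build
    {"test_run": 8_000},                         # turn 3: test + register dumps
    {"file_reads": 4_000, "build": 3_000},       # turn 4: patch + rebuild
    {"test_run": 6_000},                         # turn 5: re-test
]

total = HARNESS_OVERHEAD
for i, turn in enumerate(session, start=1):
    total += sum(turn.values())
    print(f"after turn {i}: ~{total:,} tokens")
# The running total lands at ~119,500, inside the 100K-120K range below.
```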

Five turns in, you sit at 100,000 to 120,000 tokens. Your problem description is now mid-context, in what Liu et al. (TACL 2024) call the attention valley: content there scores up to 22 accuracy points lower than content at the beginning or end [3]. The model over-attends to its system prompt at position 0 (attention sinks: initial tokens absorb over 50% of attention in most layers, regardless of content [4]) and to the most recent tool output. Your problem is fading between them.

This is where firmware gets punished in ways that higher-level development does not. When the model drifts, it drifts on register addresses, on bit field positions in 32-bit configuration words, on the derivation chain from PLL output through bus prescalers to peripheral clock. These errors compile. They link. They flash onto target. They pass the quick sanity check. They fail when the system is under real load, at real temperature, in the configuration that the customer actually uses.
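The derivation chain is pure arithmetic, which is exactly why a drifted value survives compilation. A minimal sketch with made-up STM32-style numbers (every frequency and divider here is illustrative, not from any datasheet):

```python
# Illustrative clock derivation: PLL output -> divider -> bus prescaler
# -> SPI kernel clock. All numbers are invented for illustration.
PLL2_VCO_HZ = 400_000_000
PLL2Q_DIV = 4          # PLL2Q = VCO / DIVQ
APB_PRESCALER = 2      # bus prescaler feeding the peripheral

pll2q = PLL2_VCO_HZ // PLL2Q_DIV        # 100 MHz
spi_kernel = pll2q // APB_PRESCALER     # 50 MHz

# If the model drifts by one on the divider field, the result still
# compiles, links, and flashes, but clocks the peripheral a third fast:
drifted = (PLL2_VCO_HZ // (PLL2Q_DIV - 1)) // APB_PRESCALER  # ~66.7 MHz
print(spi_kernel, drifted)
```

A one-off error on a 4-bit divider field is invisible to the compiler and to a quick sanity check; it only shows up as marginal timing on the wire.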

When context hits the compaction threshold (83.5% capacity), the system summarizes the conversation and drops the raw history. The summary preserves "key decisions" and "files touched." It does not preserve the value written to RCC->D2CFGR, the PLL2Q-to-SPI-kernel-clock derivation, or the errata workaround for DMA burst mode on specific silicon revisions. Gist-level compression applied to a domain that runs on exact numbers. And the summary itself was generated at 167K offset, under the same degraded attention that caused the problems. Every compaction cycle compounds the loss.

How to solve it

The research converges on a counterintuitive point: less context produces better results. Chroma Research (2025) tested 18 frontier models, including Claude Opus 4; focused 300-token prompts massively outperformed 113K-token prompts across every model tested [6]. DEPO (2025) cut token usage by 60.9% and saw a 29.3% performance improvement [7]. AgentDiet (2025) found that 40-60% of tokens in agent trajectories are useless and removable with no accuracy loss [8].

A firmware-specific harness applies these findings at the architecture level. The goal is not a smarter model. It is a leaner context that keeps the same model in its high-performance zone for more interactions.

Overhead first. A firmware system prompt needs roughly 2,000 tokens: MISRA awareness, register safety, embedded toolchain conventions. Eight domain tools (compile-for-target, flash-and-test, read-register-map, trace-clock-tree, check-MISRA, parse-datasheet, analyze-linker-map, run-hardware-tests) need about 4,500 tokens of schemas. Add a slimmer compaction reserve in place of the 33,000-token buffer, and total fixed overhead lands at roughly 21,500 tokens instead of 71,000.

This shifts the problem from position 71K to position 21K. Per the positional encoding research, that is a different performance regime entirely.
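The arithmetic behind that shift, using the article's figures. The size of the slimmer compaction reserve is my assumption, chosen to make up the stated 21,500 total:

```python
# Fixed-overhead comparison. Line items for the generic harness are the
# article's figures; the firmware compaction reserve is an assumption.
generic = {
    "system_prompt": 6_200,
    "tool_schemas": 16_500,
    "deferred_stubs": 13_200,
    "compaction_buffer": 33_000,
}
firmware = {
    "system_prompt": 2_000,
    "tool_schemas": 4_500,
    "compaction_reserve": 15_000,  # assumed slimmer reserve
}
print(sum(generic.values()))   # ~68,900: the ~71K figure
print(sum(firmware.values()))  # 21,500: your problem now starts at ~21K
```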

Fewer tools also means fewer wrong turns. Gan and Sun (2025) measured tool selection accuracy tripling from 13.6% to 43.1% when irrelevant tools are removed [9]. A general-purpose toolset exploring firmware greps with web-centric patterns, runs commands with the wrong toolchain flags, and reads files outside the HAL layer. Each wasted call dumps tokens into context, snowballing toward compaction. SWE-Effi (2025) measured the cost: failed paths consume 4x more tokens and 3x more time [10]. A tool that knows _IRQHandler suffixes, RCC-> prefixes, and linker script conventions avoids that spiral.
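What a domain tool's schema might look like, in the JSON-Schema shape most tool-calling APIs accept. The name and fields are hypothetical; the point is the small, firmware-shaped surface compared to 16,500 tokens of general-purpose schemas:

```python
# Hypothetical schema for one of the eight domain tools. Field names are
# illustrative, not any real harness's format.
TRACE_CLOCK_TREE = {
    "name": "trace_clock_tree",
    "description": "Derive a peripheral's kernel clock from the PLL "
                   "configuration, reporting each divider in the chain.",
    "parameters": {
        "type": "object",
        "properties": {
            "peripheral": {
                "type": "string",
                "description": "Peripheral instance, e.g. SPI1 or I2C2",
            },
            "register_dump": {
                "type": "string",
                "description": "Raw RCC register values as hex, if available",
            },
        },
        "required": ["peripheral"],
    },
}
print(TRACE_CLOCK_TREE["name"])
```

A schema this narrow also makes wrong selection harder: there is no plausible reading under which the model reaches for it to format a commit message.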

The compaction arithmetic shifts fundamentally. Every token of overhead you remove is a token the model can spend on actual work before compaction triggers. Cutting 49,500 tokens of overhead roughly doubles the usable working space in a session. That is the difference between zero compactions and two compactions in a typical debug cycle. Zero compactions means the model never loses register values, never compresses clock tree derivations into prose.
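The compaction math is easy to check. Assuming a 200K-token window (the 167K offset mentioned earlier is 83.5% of 200K) and an assumed ~9,000 tokens of work per turn:

```python
# When does compaction fire? Sketch assuming a 200K window and the
# article's 83.5% threshold; per-turn cost is an assumption.
WINDOW = 200_000
THRESHOLD = int(WINDOW * 0.835)  # 167,000 tokens

def turns_before_compaction(overhead: int, per_turn: int = 9_000) -> int:
    """Whole turns of work that fit before the threshold is crossed."""
    return (THRESHOLD - overhead) // per_turn

print(turns_before_compaction(71_000))   # ~10 turns of usable work
print(turns_before_compaction(21_500))   # ~16 turns of usable work
```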

When compaction does fire, domain structure determines what survives. Generic compaction produces: "fixed the APB2 prescaler and updated SPI driver." Domain-aware compaction preserves a register state table, a clock tree snapshot with frequencies and derivation chains, and a decision log with constraints. The model gets structured data to reason from, not a paragraph to reinterpret.
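What such a structured record might look like. Every field name and value below is illustrative, not a real tool's output format; the register values are invented:

```python
# Sketch of a domain-aware compaction record. All names and values are
# illustrative; the point is exact numbers instead of gist-level prose.
snapshot = {
    "registers": {                       # exact values, not a summary
        "RCC->D2CFGR": "0x00000440",     # invented example value
        "SPI1->CFG1":  "0x70000007",     # invented example value
    },
    "clock_tree": [                      # derivation chain with frequencies
        ("PLL2 VCO",        "400 MHz"),
        ("PLL2Q (/4)",      "100 MHz"),
        ("SPI1 kernel (/2)", "50 MHz"),
    ],
    "decisions": [                       # constraints, with their reasons
        "DMA burst disabled: errata workaround for this silicon revision",
    ],
}
print(len(snapshot["registers"]), len(snapshot["clock_tree"]))
```

After compaction, the model resumes from exact register values and a frequency chain it can verify, instead of re-deriving them from a lossy paragraph.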

EmbedAgent (ICSE 2025) showed domain enhancements taking embedded task performance from 55.6% to 65.1% without changing the underlying model [11]. The gain comes from the harness, not the weights. Sustained accuracy across a full debug session, not peak accuracy on turn 1, is the metric that matters. That is what a firmware harness buys: every token pulls its weight, compaction preserves what firmware needs, and the model reaches turn 10 still in its high-precision zone.

References

  1. "LLMs Get Lost In Multi-Turn Conversation," Microsoft Research/Salesforce, 2025. arXiv:2505.06120
  2. Du et al., "Context Length Alone Hurts LLM Performance Despite Perfect Retrieval," EMNLP 2025. arXiv:2510.05381
  3. Liu et al., "Lost in the Middle: How Language Models Use Long Contexts," TACL 2024. arXiv:2307.03172
  4. Xiao et al., "Efficient Streaming Language Models with Attention Sinks," ICLR 2024. arXiv:2309.17453
  5. Levy et al., "Same Task, More Tokens," 2024. arXiv:2402.14848
  6. Hong, Troynikov, Huber, "Context Rot," Chroma Research, 2025. trychroma.com/research/context-rot
  7. "DEPO: Dual-Efficiency Preference Optimization," 2025. arXiv:2511.15392
  8. "AgentDiet: Trajectory Reduction for LLM Agent Systems," 2025. arXiv:2509.23586
  9. Gan and Sun, "RAG-MCP: Mitigating Prompt Bloat in LLM Tool Selection," 2025. arXiv:2505.03275
  10. "SWE-Effi: Agent Effectiveness Under Resource Constraints," 2025. arXiv:2509.09853
  11. Xu, Cao, Wu et al., "EmbedAgent: Benchmarking LLMs in Embedded System Development," ICSE 2025/2026