I have always worked by splitting up my day into hour-long sessions.
After about 60 minutes of tackling my first todo, I start to lose focus, get thirsty, and long for a little stretch. When I get back to my desk, I will have lost some context on what I was doing, but I feel refreshed and ready to lock in again.
Working with LLMs, specifically coding Agents, is no different.
An agent is "model + system prompt + tools". Every agent has a "context window": the history of all inputs and outputs the LLM receives before generating its next output. Agents often store a catalog of "tools" in the context window by default. Tools allow models to communicate with a computer; they are the execution bridge between text-based reasoning output and the runtime.
Each time we send a new message, it is appended to the context "window" and the entire thing is sent to the model. On each "turn", the model reads the entire context window, not just the latest message. All of our interactions with a model, including tool calls, are stored as text in the context window, and all of that text is sent to the model every time we prompt it.
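To make that concrete, here is a minimal sketch of that loop against an OpenAI-style chat API (the model name and system prompt are placeholders, and a real harness does much more):

```bash
# Minimal sketch of the loop a harness runs. The transcript only ever grows.
CONTEXT='[{"role":"system","content":"You are a coding agent."}]'

send() {
  # Append the new user message to the transcript...
  CONTEXT=$(jq --arg m "$1" '. + [{"role":"user","content":$m}]' <<<"$CONTEXT")
  # ...then send the ENTIRE transcript, not just the new message.
  reply=$(curl -s https://api.openai.com/v1/chat/completions \
    -H "Authorization: Bearer $OPENAI_API_KEY" \
    -H "Content-Type: application/json" \
    -d "$(jq -n --argjson msgs "$CONTEXT" '{model: "gpt-4.1-mini", messages: $msgs}')" \
    | jq -r '.choices[0].message.content')
  # The reply is appended too, so every turn makes the next one bigger.
  CONTEXT=$(jq --arg m "$reply" '. + [{"role":"assistant","content":$m}]' <<<"$CONTEXT")
  echo "$reply"
}

send "Where is the upload component defined?"
```

Every call ships the full transcript, which is why cost and degradation compound as a session grows.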
The problem is that "agents get drunk if you feed them too many tokens", as Lewis Metcalf so aptly put it. Like a bouncer at a bar, you need to keep an eye on LLMs and cut them off before they go one prompting session too far.
How early to end a session depends on the task and the Agent "harness", but they need breaks as much as I do during the workday.
Why? It all comes back to the "context window".
Models have no state other than the context window. Everything they know about your session lives in it. With coding Agent harnesses, every session begins with the context window partially taken up by a combination of AGENTS.md and tool definitions. Without these, even the best prompt engineer would struggle to get good output when planning a new feature or exploring the codebase with the model. It would not know what commands it can run, how to run the test suite, where the business logic interfaces are, etc.
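A bare-bones AGENTS.md answering exactly those questions might look like this (the commands and paths are invented for illustration):

```markdown
# AGENTS.md — loaded into the window before your first message
- Run the test suite: `npm test`
- Lint and typecheck: `npm run lint && npm run typecheck`
- Business logic lives in src/services/, UI in src/components/
- Never edit generated files under src/gen/
```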
This is why tools are included by default. Tool calls happen when you ask the model to do something it recognizes requires information outside its training data, for example "find this file". The model reads the available tool descriptions, stored in the context window, to find and use the right tool. Most harnesses have a grep tool, which it would use in this case to "find the file". The reasoning output for picking the tool, the decision to use it, and the tool's inputs and outputs are all stored in the context window. All of this will be sent again the next time you prompt the model.
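Harness internals differ, but a tool-call turn has roughly this shape. The tool-call format below is invented and the transcript handling is simplified, extending the sketch above:

```bash
# Hypothetical harness-side handler: the model emitted a grep tool call,
# the harness executes it, and the output goes straight into the transcript.
TOOL_CALL='{"tool":"grep","args":{"pattern":"UploadComponent","path":"src/"}}'
pattern=$(jq -r '.args.pattern' <<<"$TOOL_CALL")
path=$(jq -r '.args.path' <<<"$TOOL_CALL")

# Every matching line becomes tokens the model re-reads on every future
# turn, which is why harnesses usually cap tool output.
result=$(grep -rn "$pattern" "$path" | head -50)
CONTEXT=$(jq --arg r "$result" '. + [{"role":"tool","content":$r}]' <<<"$CONTEXT")
```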
This is true across models and harnesses; it is a universal constraint of the context window. Some models have larger context windows than others. Others still use clever tricks to optimize the window, like storing tool descriptions, outputs, and even chat history in the filesystem instead of the context. But all "agent" interactions are constrained by the fact that over time the context window will fill up.
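The filesystem trick looks something like this sketch: the bulky output lands on disk, and only a short pointer enters the window.

```bash
# Offload sketch: write the full tool output to disk and put only a
# pointer (plus a one-line summary) into the context window.
out=$(mktemp /tmp/tool-output.XXXXXX)
grep -rn "UploadComponent" src/ > "$out"
echo "grep matched $(wc -l < "$out") lines; full results in $out (read on demand)"
```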
GitHub Copilot puts 30K tokens into the window by default. For me, Cursor puts around 18.8K tokens, or roughly 10% of the Claude Haiku 4.5 context window, into it before I have even sent a single message. This varies by harness, Agent, model, and even AGENTS.md. As the context fills up, model output gets worse. How much worse depends on what is in it, but everything in the context window influences the output.
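You can get a rough feel for this overhead yourself. A common rule of thumb is about four characters per token for English text and code:

```bash
# Rough heuristic: ~4 characters per token. Good enough to spot a bloated
# AGENTS.md before it eats a chunk of every session.
chars=$(wc -c < AGENTS.md)
echo "AGENTS.md ≈ $((chars / 4)) tokens"
```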
Geoffrey Huntley recently compared it to a Commodore 64. I'm too young to have used one, but I had a Windows 98 computer with 20 MB of RAM. That's the right mental model. You've got maybe 200 to 500 kilobytes to work with. When the context fills up completely, you get "compaction," which is basically death for your useful assistant. Everything gets confused and degraded.
## How to get the best out of the context
The context window's compounding impact on Agent output is the reason the "Ralph loop" has gotten so popular. Coined by Geoffrey Huntley, the Ralph loop is the concept of making a coding Agent repeatedly attempt the same task until it has reached a "done" state.
```bash
while :; do cat PROMPT.md | claude-code ; done
```
This workflow can outperform manual human-to-agent prompting sessions. Why? Every iteration of the Ralph loop starts with a fresh context window. You essentially get to re-roll the prompt, hoping for a critical hit. Even if you do not get one, the next roll still benefits because you have taken some hit points off of the monster (the PRD).
The problem with this approach is that it requires no caps on tool and token usage. Most simply cannot afford that, so what options are there for the rest of us?
### Markdown files
The simplest one is free, easy to use, and even easier to understand: markdown files.
Before you scoff and roll your eyes, I am not talking about the "this guy replaced his whole company with agents" meme. System prompts are great but they do not solve the context issue.
The power of markdown files is that they act as a poor man's memory system for your Agent.
During each session, you instruct the Agent to dump different outputs (reasoning, bash tool use, grep results, user messages, etc.) into the file system, specifically into Git-ignored .md files like upload-component-refactor-chat-01-11-26.md. Ideally, you do this with a "sub-agent": an agent spawned with its own context window, so its output does not pollute the current one.
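One way to wire this up, with illustrative names and paths:

```bash
# Keep memories out of version control but on disk for future sessions.
mkdir -p .agent-memory
echo ".agent-memory/" >> .gitignore

# What a sub-agent might append at the end of a session.
cat >> .agent-memory/upload-component-refactor-chat-01-11-26.md <<'EOF'
## Session notes
- Moved validation out of UploadForm into src/lib/validation.ts
- `npm test -- upload` passes; e2e upload spec still flaky
- Next: delete the legacy handler once all callers are migrated
EOF
```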
Then, once the context of the current session is full, you start a new session with a twist: this session's Agent can, when needed, reference the "memories" of previous sessions.
This is especially useful for long-running refactors. Each iteration can store its observations about a module, and the changes it made, in that module's markdown file. When the next Agent goes to make a change, it can read that file to quickly restore working knowledge it would otherwise need to be guided toward again.
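Restoring that knowledge at the start of the next session is then just a search away (assuming the .agent-memory/ layout sketched above):

```bash
# At session start, pull back only the memories relevant to this task.
grep -ril "upload" .agent-memory/ | while read -r f; do
  printf -- '--- %s ---\n' "$f"
  cat "$f"
done
```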
This is different from "compacting" your context window. Compacting is the worst of all worlds: it still pollutes the context window, but now with a lossy summary of your previous iterations.
As I mentioned before, some harnesses, like Cursor, now handle this well by default. Most have introduced some concept of cross-session "thread" referencing to bring specific bits of context from one session into another Agent session.
Under the hood, however, it is all really just bash + grep + text pulled into the window at the right time and not a moment before.
Personally, I love OpenCode. It does not have a built-in thread-referencing primitive (yet), so I have given the harness some tools to automate this as much as possible. I prefer this because, as good as Cursor is, I like having control over how much context is pulled in and when.
