An introductory field guide to Context Engineering for LLM users
tl;dr: it’s not just “prompts” that matter for good AI results — context is king. LLM interfaces inject a lot of context themselves, and you can augment it. If you do so well, you’ll get better answers!
As I began to onboard friends to interface0, I realized many of them could use a primer on what information actually goes into an LLM call. A strong layman’s understanding of this allows you to get better results from AI.
Recently, the management of this information has taken on the moniker “context engineering,” popularized by tweets from Tobi Lutke and Andrej Karpathy.
LLMs have enormously complex networked “brains.” As users, we send inquiries (“prompts”) to those brains, along with “context” that augments the brains’ existing intelligence.
Without context, the brain only knows what it was trained on. That means that it doesn’t know about your preferences, your company’s specific data or practices, or anything else that isn’t on the open web.
Think of an LLM as a new, genius Ph.D.-level team member or assistant — but one for whom every day at work is their first. The LLM itself doesn’t remember or know anything about you or your company, so context is required. This post is about how you and the products you use can get that context to it.
Much has been said about “prompt engineering”: making the right inquiries of the model. But as Tobi and Andrej point out, equally critical is “context engineering”: making sure the right supplemental information goes to the model.
Background
Let’s say you want to ask an LLM a health question. We can imagine three levels of framing:
- Bad prompt, bad context: is giving blood good for you?
- Better prompt, bad context: does giving blood help with elevated SHBG levels?
- Better prompt, better context: does giving blood help with elevated SHBG levels? Here are attachments of my historical blood panels, dates of giving blood, and relevant gene testing results.
As the prompt and context improve, so too will the results — dramatically. The more complex your question, the more the right context will help.
But the context that goes into LLMs is typically opaque to us as users. Even when all you see is a simple chatbox, there is a lot of context going in behind the scenes.
I’ll put the bottom line up front: here’s my mental model for a typical LLM call. (Each one is different, of course! And this leaves out some of the more technical aspects; this post is meant to be more useful than correct.) There’s a lot here — don’t worry, we’ll break it down piece-by-piece.
"You are GPT 4-1..."
"The user is using interface0 to interact with you..."
"Be clear and concise..."
"[searchWeb], [sendEmail], ..."
"searchWeb: requires term, ..."
"searchWeb_1 returned..."
"User recently asked about..."
"User is named Andy, lives in..."
"Relevant docs: [file.docx], ..."
"User uploaded a file..."
Creativity level
Budget for length of prompt
Specific term bias
Tool-choice, etc.
Yes, all of that is included in even the simplest inquiry.

Conversation history & prompts
Your experience of using an LLM frontend (ChatGPT or Claude or interface0) mostly centers around the blue section in the diagram (“Conversation history + new prompt”).
That bit is straightforward enough, although we’ll revisit it at the end of this post — there are techniques for doing even that part of the process better.
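Under the hood, most frontends keep this blue section as something like a running list of role-tagged messages, with your newest prompt appended before each call. Here is a minimal sketch; the field names are illustrative, not any particular provider’s API:

```python
# A running conversation, as a frontend might store it (illustrative field names).
conversation = [
    {"role": "user", "content": "Does giving blood help with elevated SHBG levels?"},
    {"role": "assistant", "content": "Happy to help. Do you have recent labs you can share?"},
]

def add_user_turn(history: list[dict], new_prompt: str) -> list[dict]:
    """Append the latest user prompt; the whole list is sent along with every call."""
    return history + [{"role": "user", "content": new_prompt}]

conversation = add_user_turn(conversation, "Yes, attaching my last three blood panels.")
```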
Call parameters
Then there are the call parameters (orange in the diagram), which are rarely exposed to you as a user. Typically, whatever frontend you’re using selects these for you. The key ones are:
- Creativity level (often exposed as “temperature”)
- A length budget (“max tokens”)
- Bias toward or against specific terms
- Tool choice, among a few others
Regarding “max tokens”: each model has a maximum “context length” that it can handle. If the context exceeds the maximum context length of the LLM, it will fail (this is when you get the notification “this conversation is getting too long, please start a new one”). Some frontends handle this for you by beginning to compress or summarize the conversation history or context, thus making more space (but also possibly losing fidelity or details).
Context lengths are defined in “tokens.” For reasons beyond the scope of this post, 1 token is roughly 3/4 of an average word. For example, Claude Pro’s context window is 200k tokens at the time of writing. This is around 150k words, or almost two normal books’ worth of content.
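To make those knobs concrete, here is a hedged sketch of the parameters a frontend might set on your behalf, plus the rough token math from above. The names and numbers are illustrative, not any provider’s exact API:

```python
# Illustrative call parameters a frontend might choose for you.
call_params = {
    "temperature": 0.7,     # "creativity level"
    "max_tokens": 1024,     # length budget
    "logit_bias": {},       # bias toward/against specific terms
    "tool_choice": "auto",  # let the model decide when to use tools
}

def estimate_tokens(text: str) -> int:
    """Rough heuristic: 1 token is about 3/4 of an average word."""
    return round(len(text.split()) / 0.75)

CONTEXT_LIMIT = 200_000  # e.g. roughly a 200k-token context window

full_context = "..."  # imagine everything in the diagram concatenated here
if estimate_tokens(full_context) > CONTEXT_LIMIT:
    print("Context too long: summarize the history or start a new chat.")
```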
System prompt
Now let’s talk about the “system prompt,” which in my view is often made up of three sub-components:
- The model’s core identity, set by the model provider (“You are GPT 4-1...”)
- Instructions added by the application or interface you’re using (“The user is using interface0 to interact with you...”)
- “Personas” or “custom instructions” that you set yourself (“Be clear and concise...”)
The last of these — “personas” or “custom instructions” — is typically the one you are able to edit. In ChatGPT, go to Settings, then Personalization, then Custom Instructions. In Claude, go to Settings and fill in your Personal Preferences. In interface0, you can set a variety of Personas from the chat bar and swap them in.
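As a minimal sketch of how those three sub-components might get layered together by a frontend (the strings here are stand-ins, not what any product actually sends):

```python
# Hypothetical assembly of a system prompt from its three sub-components.
provider_identity = "You are GPT 4-1, a large language model..."                 # set by the model provider
interface_instructions = "The user is using interface0 to interact with you..."  # set by the application
custom_instructions = "Be clear and concise. Prefer bulleted answers."           # the part you can edit

system_prompt = "\n\n".join([provider_identity, interface_instructions, custom_instructions])
print(system_prompt)
```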
Let’s just revisit the components of the full prompt briefly. So far we’ve gone through “Conversation history + new prompt,” “Call parameters,” and “System prompt:”
"You are GPT 4-1..."
"The user is using interface0 to interact with you..."
"Be clear and concise..."
"[searchWeb], [sendEmail], ..."
"searchWeb: requires term, ..."
"searchWeb_1 returned..."
"User recently asked about..."
"User is named Andy, lives in..."
"Relevant docs: [file.docx], ..."
"User uploaded a file..."
Creativity level
Budget for length of prompt
Specific term bias
Tool-choice, etc.
What remains is injected knowledge and tooling. We’ll cover those, then recap what you can practically do about all this, and some tips for prompting & context management.
Injected knowledge / context
Injected knowledge is the meat of what we usually think of as “context.” It includes:
- Memories that the system has preserved about you
- Data the system retrieves from some sort of connected datastore
- Anything you upload
You typically have little control over #1 on a prompt-by-prompt basis — although in some applications, you can see a list of retained memories and add/delete them globally.
One differentiator of interface0 is that it preserves memory across providers — so if you say something to OpenAI’s o3, Claude Sonnet 4 will “remember” it when you talk to it later. This is actually illustrative of the overall point here: memory (and all this other context) does not happen at the model / LLM level. Rather, it all happens at the application level.
The models o3 and Sonnet 4 are not “remembering” anything — rather, the interface you’re using to interact with them is retrieving relevant memories and including them in the context before sending them off to the big-brain LLM.
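Here is a toy sketch of what “memory at the application level” might look like. The store and the relevance check below are stand-ins; a real product would use embeddings or a search index:

```python
# A memory store kept by the application, not by the model.
saved_memories = [
    "User is named Andy, lives in the US.",
    "User recently asked about SHBG levels.",
    "User prefers concise answers.",
]

def relevant_memories(prompt: str, memories: list[str]) -> list[str]:
    """Toy relevance check: keep memories sharing a word with the prompt.
    Real systems use embeddings or a search index instead."""
    prompt_words = set(prompt.lower().split())
    return [m for m in memories if prompt_words & set(m.lower().split())]

prompt = "Does giving blood help with elevated SHBG levels?"
memory_block = "Relevant memories:\n" + "\n".join(relevant_memories(prompt, saved_memories))
# memory_block gets prepended to the context before the call, regardless of which model answers.
```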
In much the same way, you can retrieve data (often called Retrieval-Augmented Generation, or “RAG”) from databases or elsewhere before sending the inquiry to the model. I’m very excited about next-generation retrieval engines like Lucenia that can power workflows like this (disclaimer: so excited that I recently invested!).
You can think of this happening in a business context: your AI-powered customer service application runs a retrieval operation across your internal knowledge base to find relevant articles, then attaches those articles as context in the inquiry to the LLM, which ultimately produces the message sent to the user. The LLM doesn’t “search” itself — rather, the infrastructure searches and finds the context to attach before shipping it all off to the LLM.
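A sketch of that retrieve-then-attach pattern; the knowledge base and the toy keyword “search” below are hypothetical placeholders for a real search engine or vector index:

```python
# Retrieval happens in the application, before the LLM ever sees the question.
knowledge_base = {
    "refund-policy": "Refunds are available within 30 days of purchase...",
    "shipping-times": "Standard shipping takes 5 to 7 business days...",
}

def search_articles(query: str, kb: dict[str, str], top_k: int = 2) -> list[str]:
    """Toy keyword scoring; a real system would use a proper retrieval engine."""
    words = query.lower().split()
    return sorted(kb.values(), key=lambda text: -sum(w in text.lower() for w in words))[:top_k]

question = "How long do refunds take?"
articles = search_articles(question, knowledge_base)
llm_input = "Relevant articles:\n" + "\n---\n".join(articles) + "\n\nCustomer question: " + question
```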
Last, user-submitted data is obviously relevant. You can either attach files directly to a prompt, or you might use a feature like ChatGPT/Claude’s “Projects” or interface0’s “Knowledge,” where saved files and text can be attached across many related conversations.
Tools & MCP
Finally, we have “tools.” You may have heard buzz about “MCP,” or the Model Context Protocol. MCP fits into the tools category.
Tools can be anything from searching the web, to drafting emails, to writing and running code, or anything else you can imagine. But tools aren’t magical — nor do they even really have anything specifically to do with LLMs.
Anything that was programmable before LLMs — like, again, searching the web, or drafting emails — can be a “tool.”
And, in fact, the LLM itself doesn’t even use the tools (just like the LLM doesn’t “remember” things or “search”). The LLM is a floating brain-in-a-jar. Brains-in-jars don’t use tools.
Rather, there’s an “orchestrator” — a piece of backend infrastructure — that coordinates sending the ultimate inquiry to the LLM brain and receiving the results. And the LLMs have been trained in such a way that they can be given a list of tools (called “tool specifications”), and when they want to use a tool (“I think it’d be appropriate to search the web now”), they emit a certain sequence of tokens/words that indicates this.
Once they emit those tokens/words, the orchestrator takes over, and does the programmatic thing for the LLM. Once that thing is done, there is presumably some result (“success!” or “here are 10 restaurants in New York from Yelp” or whatever), and then the orchestrator passes that result back to the LLM to finish its work.
People sometimes call this type of architecture “agentic” — meaning that the LLM doesn’t always just spit back an answer immediately, but rather (managed by the non-LLM orchestrator) goes through some “loops” of trying different tools, thinking, and assessing its progress before ultimately saying “I’m done!” and returning the result back to the user.
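Here is a hedged sketch of that orchestrator loop. The shape of the model’s “tool call” output and the fake call_llm function are invented for illustration; real SDKs differ, but the control flow is the point:

```python
# The orchestrator, not the LLM, actually runs tools and loops until the model is done.

def search_web(term: str) -> str:
    """An ordinary function: the 'tool' is just pre-LLM-era code."""
    return f"Top results for {term!r}: ..."

TOOLS = {"searchWeb": search_web}

def call_llm(messages: list[dict]) -> dict:
    """Stand-in for the real model call. This fake asks for one web search,
    then answers, just to exercise the loop."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "searchWeb", "arguments": {"term": "restaurants in New York"}}}
    return {"tool_call": None, "content": "Here are some restaurants I found: ..."}

def run_agent(messages: list[dict], max_steps: int = 5) -> str:
    for _ in range(max_steps):
        reply = call_llm(messages)
        if reply["tool_call"] is None:
            return reply["content"]           # the model says "I'm done"
        name = reply["tool_call"]["name"]
        args = reply["tool_call"]["arguments"]
        result = TOOLS[name](**args)          # the orchestrator runs the tool...
        messages.append({"role": "tool", "name": name, "content": result})  # ...and feeds the result back
    return "Stopped after too many steps."

print(run_agent([{"role": "user", "content": "Find me dinner in New York"}]))
```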
There has been a lot of buzz about “MCP” technology. It is indeed cool (albeit with some risks). A full breakdown is out of scope here, but the simplest way to think about it is that “MCP” is just a standard by which tools can be implemented across many different LLMs. That is, it’s a unification of how services should respond to LLM-initiated calls and how they should expose their abilities to those LLMs and their orchestrators. “MCP” itself doesn’t really do anything — it’s a standard that other tools follow.
But LLMs can use non-MCP tools too! In fact, many LLM systems just take MCP specifications and transform them to more “traditional” tool specifications.
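As a rough illustration of that translation step, here is a sketch mapping an MCP-style tool listing onto a more “traditional” function-style spec. Treat the exact field names as approximate, since both formats evolve:

```python
# An MCP server describes a tool roughly like this: a name, a description,
# and a JSON Schema for its inputs.
mcp_tool = {
    "name": "searchWeb",
    "description": "Search the web for a term.",
    "inputSchema": {
        "type": "object",
        "properties": {"term": {"type": "string"}},
        "required": ["term"],
    },
}

def mcp_to_function_spec(tool: dict) -> dict:
    """Rename the fields into the function-style tool spec many LLM APIs expect."""
    return {
        "type": "function",
        "function": {
            "name": tool["name"],
            "description": tool["description"],
            "parameters": tool["inputSchema"],
        },
    }

print(mcp_to_function_spec(mcp_tool))
```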
Anyway. Let’s sum it all up by showing our diagram again. But this time, let’s highlight in red the parts that are actually in your control (typically):
"You are GPT 4-1..."
"The user is using interface0 to interact with you..."
"Be clear and concise..."
"[searchWeb], [sendEmail], ..."
"searchWeb: requires term, ..."
"searchWeb_1 returned..."
"User recently asked about..."
"User is named Andy, lives in..."
"Relevant docs: [file.docx], ..."
"User uploaded a file..."
Creativity level
Budget for length of prompt
Specific term bias
Tool-choice, etc.
All that explanation for just five parameters!
- Personas/custom instructions: I think this is underrated by people right now. You can ask the model to behave however you want it to! Use this! If you have stylistic or intelligence preferences, or anything else, put them in here. (I believe this is partially because most frontends make it hard to edit this on a message-by-message basis.)
- Available tools: interfaces are increasingly exposing a list of tools to you (connect to your Gmail, search the web, etc.). Intelligent usage here is key. Cursor and other AI IDEs make heavy use of tools, and that’s a big part of why they are so useful.
- RAG: you don’t directly choose the queries that lead to the retrieved documents, but, especially in enterprise settings, you can choose which datastores are connected and made queryable (as well as which search solutions you’re using). This is a key lever for performance. In consumer use cases this is less relevant.
- User-submitted data: more context is better! Attach files whenever you think it would be helpful. Again, features like Projects or interface0’s “knowledge” help make this easier, so you aren’t re-attaching all the time. Again, Cursor and other AI IDEs make it trivial to attach files that are relevant, and it makes an enormous performance difference.
- Your prompt: last of all, there’s what you ultimately send in the chatbox. This, of course, matters greatly.
Much has been written about the prompt itself, and so I’ll spare you the words. But the one thing to note is that different models perform best with different prompts. Models like o3 pro need a lot of detail to perform at their best (see this great writeup). Other models can get by with less.
Consider the model when putting together the prompt (p.s. this is why interface0’s “enhance prompt” button returns different enhancements depending on the model you’re using!).
The full picture
Here’s a rough sketch of the whole flow:
- You type in your prompt and attach files
- The application enriches that with memories or data it retrieves
- It then compiles the full context, including pulling together the various system prompts, the tools selected, the call parameters, all the injected context, and the prompt and conversation history
- It passes all of that to the orchestration / LLM services
- The orchestrator then fires off a request to the LLM brain itself, and they together may decide to call some tools, do some agentic looping, whatever
- Eventually the LLM decides it’s done, sends the results to the orchestrator, and you get your response
When you sum it all up, here’s what the full request ends up looking like. I’ve taken some creative liberties and removed some of the true technical details here, but it’s directionally accurate.
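In that same directionally-accurate spirit, here is a hedged sketch of what the assembled payload might look like; the field names and structure are illustrative, not any provider’s real API:

```python
# Everything from the diagram, bundled into one request (illustrative only).
full_request = {
    "model": "gpt-4.1",
    "call_params": {"temperature": 0.7, "max_tokens": 1024, "tool_choice": "auto"},
    "system_prompt": (
        "You are GPT 4-1...\n"
        "The user is using interface0 to interact with you...\n"
        "Be clear and concise..."
    ),
    "tools": [
        {"name": "searchWeb", "description": "Search the web for a term."},
        {"name": "sendEmail", "description": "Draft and send an email."},
    ],
    "injected_context": [
        "User is named Andy, lives in...",
        "User recently asked about...",
        "Relevant docs: [file.docx]",
        "User uploaded a file...",
    ],
    "messages": [
        {"role": "user", "content": "Does giving blood help with elevated SHBG levels?"},
    ],
}
```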
I’ve been shamelessly plugging interface0 throughout this post — but it’s because I have genuinely built it to solve my problems with making LLMs more performant. The Personas and Knowledge features (and Template Prompts) are a direct result of my frustration with the main interfaces to these models having poor context engineering capabilities.
I will mention one more feature I use a lot: Summarize Chat. Like I mentioned, models can run out of context capacity, and performance will start to degrade (or they will fail entirely). I built interface0 so that with one click, you can generate a summary of a long chat. That summary can then be turned into a “Knowledge entry,” which you can then tag in on any subsequent chat.
So I’ll often run short on context length, hit Summarize, add it as a new context, and then start a new chat, tagging in that context to get things rolling without missing a beat.
I’m looking forward to adding more context engineering features as well! If you have any ideas, I’m all ears.
Plus: any feedback on this post? Let me know.
And I'd love to hear from you directly: andy@andybromberg.com