Context Engineering

From the TaranCodes Wiki

This article is about the AI development discipline. For the narrower practice of writing model instructions, see prompt engineering.

Context Engineering

Type	Discipline / practice in artificial intelligence
Category	Large language model (LLM) application development
Origin	Mid-June 2025
Popularized by	Tobi Lütke (Shopify), Andrej Karpathy
Year introduced	2025
Key figures	Tobi Lütke, Andrej Karpathy, Harrison Chase, Phil Schmid, Drew Breunig, Walden Yan, Simon Willison
Academic foundation	"A Survey of Context Engineering for Large Language Models" (Mei et al., July 17, 2025)
Related concepts	Prompt engineering, retrieval-augmented generation (RAG), fine-tuning, Model Context Protocol (MCP)
Primary use case	Building reliable AI agents and LLM applications
Core framework	Write, Select, Compress, Isolate (LangChain / Lance Martin)
Canonical definition source	Anthropic, "Effective context engineering for AI agents" (Sep 29, 2025)
Notable tools/implementations	Claude Code, Cursor, GitHub Copilot, Devin (Cognition), LangChain/LangGraph, Anthropic memory tool, MCP
Top security risk	Prompt injection (OWASP LLM01)
Current status	Active, rapidly evolving field as of 2026

Contents

Origin of the Term
History and Development
Overview and Key Concepts
How It Works
Types and Techniques
Applications and Use Cases
Advantages and Limitations
Comparisons
Impact and Significance
Debates and Controversies
Legal, Ethical, and Security Aspects
Future Outlook
Common Misconceptions
Key Takeaways

Context Engineering is an AI discipline for curating LLM inputs, emerging in mid-2025.

Context engineering is the discipline of designing, structuring, and managing the complete set of information supplied to a large language model (LLM)—an artificial intelligence system trained to generate text—at the moment it produces a response. This information, measured in tokens (the basic units of text a model processes), includes system prompts, user input, conversation history, retrieved documents, tool definitions, and stored memory. The goal of context engineering is to assemble the smallest set of high-value tokens that allows the model to reliably complete a task. The field emerged in mid-2025 as the widely adopted successor to "prompt engineering," the earlier practice of crafting effective text instructions for AI models.

This article explains the origins of the term, its core concepts and mechanisms, the main techniques used, real-world applications, and the debates surrounding the field. Context engineering is best understood not as a replacement for prompt engineering but as a broader discipline that includes it. It is driven by a central empirical finding: simply enlarging a model's context window—the maximum amount of text it can consider at once—does not reliably improve performance and often degrades it.

The practice gained formal recognition through industry publications from companies such as Anthropic and LangChain, alongside an academic survey analyzing more than 1,400 research papers. As a result, context engineering has become a defining skill for building AI agents and other production-grade language model applications.

Origin of the Term

The term "context engineering" became widely used in mid-June 2025 through a rapid sequence of public posts by prominent figures in the artificial intelligence industry. While the underlying practice of carefully assembling model inputs had existed for a year or two among agent builders, the specific label crystallized during this period.

The earliest defined use came around June 17, 2025, when Cognition AI's Walden Yan used the concept in a blog post titled "Don't Build Multi-Agents," describing it as "the next level" of prompt engineering. On June 18, 2025, Shopify CEO Tobi Lütke posted on the platform X that he preferred the term because it better described "the art of providing all the context for the task to be plausibly solvable by the LLM." Some secondary sources cite June 19, likely due to timezone differences.

The term gained further traction on June 25, 2025, when Andrej Karpathy, a former artificial intelligence leader at OpenAI and Tesla, endorsed Lütke's post. Karpathy described context engineering as "the delicate art and science of filling the context window with just the right information for the next step." He also introduced an influential analogy comparing the LLM to a computer's central processing unit (CPU) and the context window to its random-access memory (RAM). Notably, Karpathy stated he was not trying to coin a new term but argued that the word "prompts" oversimplified a complex component. No prominent use of the exact phrase as a named discipline could be verified before 2025.

History and Development

The development of context engineering can be traced through a series of publications across mid-to-late 2025 that moved it from an informal idea to a formal discipline. According to Harrison Chase of LangChain, agent builders had performed the underlying work for some time before the label existed.

On June 23, 2025, Chase published "The rise of context engineering," defining it as "building dynamic systems to provide the right information and tools in the right format such that the LLM can plausibly accomplish the task." Days later, on June 27, software developer Simon Willison wrote that the term was gaining traction as a better alternative because "prompt engineering" had been diluted to mean simply typing into a chatbot. Around the same period, Phil Schmid of Google DeepMind contributed a widely cited definition emphasizing providing the right information and tools "in the right format, at the right time."

The field received its academic foundation on July 17, 2025, with the survey "A Survey of Context Engineering for Large Language Models" by Mei and colleagues. This work systematically analyzed over 1,400 research papers to establish a technical roadmap for the field. Building on this momentum, Anthropic published "Effective context engineering for AI agents" on September 29, 2025, alongside its Claude Sonnet 4.5 model, framing the discipline as "the natural progression of prompt engineering." Together, these publications established context engineering as a recognized area of practice within a span of roughly three months.

Overview and Key Concepts

At its core, context engineering concerns the management of "context," which is the full set of tokens an LLM sees when generating a response. The discipline reframes the work of building AI applications from "finding the right words" to answering a broader question: what configuration of context is most likely to produce the desired behavior from the model?

Across major sources including Anthropic, LangChain, and Phil Schmid, the components of context are commonly broken into several categories. The system prompt, or instructions, defines the model's behavior, rules, and examples. The user prompt represents the immediate task or question. Conversation history serves as short-term memory of the ongoing dialogue, while long-term memory holds persistent knowledge across sessions, such as user preferences. Retrieved information brings in external, up-to-date knowledge from documents or databases. Finally, tool definitions describe available functions, and structured output definitions specify the required format of responses.

Anthropic offers a canonical definition: context engineering is "the set of strategies for curating and maintaining the optimal set of tokens (information) during LLM inference." A useful conceptual distinction, attributed to Drew Breunig, holds that prompts are instructions, while context is everything the model needs to act on those instructions. This distinction clarifies why the discipline extends well beyond writing a single prompt.

How It Works

Context engineering operates within the transformer architecture, the neural network design underlying modern language models. In this architecture, every token can relate, or "attend," to every other token, producing a number of pairwise relationships that grows with the square of the token count. As the context grows longer, this attention is stretched thin, creating tension between context size and the model's focus.

Anthropic describes this constraint as an "attention budget." Every token added depletes the budget, an effect comparable to the limits of human working memory. Models also have less training experience with very long sequences, and methods used to extend context length can reduce the model's understanding of token positions. The combined result is a gradual performance decline rather than a sudden failure—described as "a performance gradient rather than a hard cliff."

In practice, a context assembly pipeline dynamically constructs the model's input at each step. Anthropic notes a shift from retrieving all information in advance toward "just-in-time" strategies, in which an agent keeps lightweight identifiers such as file paths or search queries and loads the underlying data only when needed. For example, the coding tool Claude Code uses a hybrid method, loading certain reference files up front while using search functions to retrieve other files at runtime. The guiding principle throughout is to find the smallest possible set of high-signal tokens that maximizes the chance of the desired outcome.

Types and Techniques

Several frameworks organize the techniques of context engineering. The most widely cited is LangChain's four-category taxonomy, formalized by Lance Martin, which groups strategies under the headings Write, Select, Compress, and Isolate.

A 2x2 grid infographic showing the four context engineering strategies: Write (saving notes externally, teal), Select (retrieving relevant context, amber), Compress (summarizing and pruning tokens, coral), and Isolate (splitting context across sub-agents, indigo). Each color-coded card has an outlined icon, heading, and short description, with connectors linking to a central "Context Window" hub. Flat minimalist style, organized and academic in tone. — LangChain's Write–Select–Compress–Isolate taxonomy organizes the core techniques for managing an LLM's context.

The "Write" category covers saving context outside the context window, such as using scratchpads or note files for agentic memory. "Select" involves pulling relevant context in through methods like retrieval, similarity search, and memory retrieval. "Compress" focuses on retaining only the required tokens through summarization or pruning of tool outputs. Finally, "Isolate" means splitting context across separate sub-agents or sandboxed environments to keep different concerns apart.

Anthropic separately details three techniques aimed at long-horizon tasks. Compaction involves summarizing a nearly full context window and reinitializing it with the summary. Structured note-taking persists notes externally; a demonstration called "Claude playing Pokémon" maintained tallies across thousands of game steps using this method. Sub-agent architectures use specialized agents that each return a condensed summary of roughly 1,000 to 2,000 tokens. Additional standard techniques include retrieval-augmented generation, limiting the number of available tools, context pruning, and summarization. These overlapping frameworks provide practitioners with a shared vocabulary for diagnosing and improving agent behavior.

Applications and Use Cases

Context engineering is applied across a range of real-world AI systems, with coding assistants among the most prominent. Tools such as Claude Code, Cursor, GitHub Copilot, and Cognition's Devin rely on these techniques, and context degradation is described as the primary failure mode for coding agents. Features like compaction and dedicated commands help these tools manage long working sessions.

Beyond coding, context engineering supports AI agents performing long-horizon tasks such as codebase migrations and multi-hour research projects. Multi-agent research systems, such as one built by Anthropic, use an orchestrator that delegates work to parallel sub-agents. The discipline also underpins customer support bots and enterprise systems that ground their answers in company wikis and knowledge bases through retrieval.

Memory-enabled assistants form another major application, using persistent storage to retain user preferences and project state across sessions. Anthropic's file-based memory tool, released in public beta with Claude Sonnet 4.5, is one such implementation. These varied use cases demonstrate that context engineering is central to systems requiring reliability, statefulness, and access to external information.

Advantages and Limitations

Context engineering offers several practical advantages. It grounds model outputs in verifiable, up-to-date data, which reduces hallucinations—instances where a model generates false information. It enables personalization and statefulness, improves reliability in production environments, and allows systems to accomplish tasks beyond the reach of static prompts. Practitioners often observe that most agent failures are "context failures" rather than failures of the underlying model.

Despite these benefits, long contexts can fail in several distinct ways. Drew Breunig has described a widely cited taxonomy of these failure modes. Context poisoning occurs when a hallucination or error enters the context and is then repeatedly referenced. Context distraction happens when an overly long context causes the model to over-focus on it while neglecting its training; in one example, a Gemini 2.5 Pro agent playing Pokémon began repeating past actions beyond roughly 100,000 tokens. Context confusion arises when superfluous content, such as too many tool definitions, degrades responses, with research suggesting that more than about 20 tools can confuse some models. Context clash occurs when conflicting information accumulates, which one study found caused an average performance drop of 39 percent.

Several research studies document these limitations in detail. The "lost in the middle" problem, identified by Liu and colleagues in 2024, describes a U-shaped accuracy curve in which models perform best at the beginning and end of a context but significantly worse in the middle. Chroma's 2025 "context rot" report, which evaluated 18 models, found that models do not use their context uniformly and degrade as input grows, even on trivial tasks. A Databricks Mosaic Research study found that some models begin to lose accuracy well before their stated limits—for instance, after 32,000 tokens for one model. Additional constraints include rising token costs and latency, and the accumulation of large tool outputs that can consume tens of thousands of tokens before a request is even processed.

Comparisons

Context engineering is frequently compared to three related concepts, and understanding these distinctions clarifies its scope. The clearest contrast is with prompt engineering. Prompt engineering focuses on writing and organizing instructions, often as a single block of text, while context engineering manages the entire dynamic context state across multiple turns. Both Anthropic and LangChain describe context engineering as the superset, or natural progression, of prompt engineering.

A second comparison is with fine-tuning, the process of adjusting a model's internal weights to change its parametric knowledge—the information stored within the model itself. In contrast, context engineering operates entirely at inference time on the model's input, without any retraining. This makes it a faster and more flexible way to supply a model with new or specific information.

The third comparison is with retrieval-augmented generation, commonly abbreviated as RAG, which retrieves external data to ground a model's outputs. Rather than being a competing approach, RAG is best understood as a component or technique within context engineering, corresponding to the "Select" category. Context engineering is the broader discipline that encompasses RAG along with memory, tools, and compression.

Impact and Significance

Context engineering has reshaped how developers approach the construction of AI applications. By shifting the central question from word choice to context configuration, it has provided a unifying framework for a set of practices that were previously informal and scattered. Harrison Chase of LangChain has described it as becoming one of the most important skills an AI engineer can develop.

The discipline's significance lies in its diagnostic value. By treating context as a finite resource with diminishing returns, practitioners can systematically identify whether an agent's failure stems from missing context, excessive context, or a genuine gap in model capability. This reframing has influenced the design of major tools and protocols, including Anthropic's Model Context Protocol, an open standard for connecting models to external tools and data. As AI agents grow more capable, the principles of context engineering increasingly govern how reliably those agents perform.

Debates and Controversies

Several debates surround context engineering, reflecting its status as a young and rapidly evolving field. The first concerns whether the term is merely rebranded prompt engineering. Critics argue it is "another fancy term," and even Drew Breunig has cautioned against leaning into marketing and buzzwords. Defenders, including Simon Willison, Anthropic, and Harrison Chase, counter that the rename is justified because "prompt engineering" had narrowed in meaning, and the broader, dynamic, multi-component scope warrants a distinct label.

A second debate asks whether very large context windows make retrieval-augmented generation obsolete. The "RAG is dead" position holds that million-token windows allow developers to simply include all available information. However, the consensus from research and practitioners is that RAG remains relevant. According to evaluations from Dataiku, Google DeepMind, and others, RAG is often cheaper, more accurate for citation and retrieval, and there is "no silver bullet"—the best choice depends on the model, task, and context.

A third debate weighs single-agent against multi-agent architectures. Cognition's "Don't Build Multi-Agents" argues that single agents with one coherent context are more reliable. In contrast, Anthropic reported that a multi-agent system outperformed a single-agent system by 90.2 percent on its internal research evaluation, though it noted that multi-agent systems used roughly 15 times more tokens. By 2026, this debate had partly converged on patterns using an orchestrator with isolated sub-agents.

Legal, Ethical, and Security Aspects

Security and privacy are first-order concerns in context engineering, because assembling context from untrusted sources expands the potential attack surface. The most prominent risk is prompt injection, ranked as the top LLM risk by the Open Worldwide Application Security Project (OWASP), a security organization. Direct injections, sometimes called jailbreaking, and indirect injections—malicious instructions hidden within retrieved web pages or documents—can hijack an agent to leak data or take unauthorized actions.

A related concern is sensitive information disclosure, listed second on OWASP's LLM risk list. Personal, financial, health, or confidential business data placed into context can leak. Real-world demonstrations include an attack on Slack AI by the security firm PromptArmor and a zero-click exploit known as "EchoLeak" in a production system.

The Model Context Protocol carries its own specific risks. Security analyses in April 2025 found that prompt injection and "poisoned tools" could enable data exfiltration. As a result, MCP servers are treated as a significant attack surface that requires consent flows and the treatment of all tool inputs as untrusted. These concerns make security practices an integral part of responsible context engineering rather than an afterthought.

Future Outlook

Experts anticipate that context engineering will continue to evolve as models improve. Anthropic predicts that agentic design will trend toward "letting intelligent models act intelligently, with progressively less human curation," while the principle of treating context as a finite resource remains central. Harrison Chase describes a related evolution toward "harness engineering," in which the model is given more control over its own context.

Several specific directions appear likely. These include automated and dynamic context management, longer and cheaper context windows, more sophisticated memory architectures, and agentic file-system abstractions that let models manage their own data. Some figures, including Sam Altman, envision a compact reasoning engine paired with extreme memory and universal tool access, though this vision remains aspirational. Taken together, these predictions suggest that context engineering will remain a foundational discipline even as its specific techniques change.

Common Misconceptions

Two misconceptions about context engineering are especially common. The first is the belief that a bigger context window automatically means better performance. This is false; performance often degrades well before a model reaches its advertised limit, due to phenomena such as context rot and the "lost in the middle" effect. The advertised window size represents a ceiling, not a guarantee of quality.

The second misconception is that context engineering replaces prompt engineering. In reality, it subsumes prompt engineering; writing good prompts remains an essential component of the broader discipline. Clarifying these two points helps set realistic expectations for anyone building applications with language models.

Key Takeaways

Context engineering is the discipline of curating the complete set of information given to a language model at inference time so that it can reliably accomplish a task. It emerged as a named field in mid-2025, popularized by figures including Tobi Lütke and Andrej Karpathy and formalized by Anthropic, LangChain, and an academic survey of more than 1,400 papers.

The field rests on a central insight: context is a finite resource with diminishing returns, so the goal is the smallest set of high-signal tokens rather than the largest possible context. Practitioners organize their techniques using frameworks such as Write, Select, Compress, and Isolate, and they guard against documented failure modes including context rot, the "lost in the middle" problem, and security risks like prompt injection. Understood as the natural successor to prompt engineering, context engineering has become a foundational skill for building reliable AI agents and applications.

References

Liu et al., "Lost in the Middle" (TACL 2024)
Mei et al., "A Survey of Context Engineering for Large Language Models" (arXiv)
Microsoft/Salesforce, "LLMs Get Lost in Multi-Turn Conversation" (arXiv)
Databricks Mosaic Research, long-context study (arXiv)
Effective context engineering for AI agents
The rise of context engineering" by Harrison Chase
Don't Build Multi-Agents" by Walden Yan
The New Skill in AI is Not Prompting, It's Context Engineering
OWASP's GenAI Security Project, and posts by Tobi Lütke, Andrej Karpathy, and Simon Willison.

Last updated on June 16, 2026.