Status: narrated
Building Effective AI Agents
Inspect the source, tune calibration, review outputs, and recover pipeline stages.
Source: article_url
Words: 2809
Created: 2026-03-09 19:59:41 UTC
Source overview
Canonical source details and stored content preview.
Source type: article_url
Status: narrated
Words: 2809
Created: 2026-03-09 19:59:41 UTC
URL: https://www.anthropic.com/engineering/building-effective-agents
Fetch: ready
Source preview
[Engineering at Anthropic](https://www.anthropic.com/engineering)  # Building effective agents Published Dec 19, 2024 We've worked with dozens of teams building LLM agents across industries. Consistently, the most successful implementations use simple, composable patterns rather than complex frameworks. Over the past year, we've worked with dozens of teams building large language model (LLM) agents across industries. Consistently, the most successful implementations weren't using complex frameworks or specialized libraries. Instead, they were building with simple, composable patterns. In this post, we share what we’ve learned from working with our customers and building agents ourselves, and give practical advice for developers on building effective agents. ## What are agents? "Agent" can be defined in several ways. Some customers define agents as fully autonomous systems that operate independently over extended periods, using various tools to accomplish complex tasks. Others use the term to describe more prescriptive implementations that follow predefined workflows. At Anthropic, we categorize all these variations as **agentic systems**, but draw an important architectural distinction between **workflows** and **agents**: - **Workflows** are systems where LLMs and tools are orchestrated through predefined code paths. - **Agents**, on the other hand, are systems where LLMs dynamically direct their own processes and tool usage, maintaining control over how they accomplish tasks. Below, we will explore both types of agentic systems in detail. In Appendix 1 (“Agents in Practice”), we describe two domains where customers have found particular value in using these kinds of systems. ## When (and when not) to use agents When building applications with LLMs, we recommend finding the simplest solution possible, and only increasing complexity when needed. This might mean not building agentic systems at all. 
Agentic systems often trade latency and cost for better task performance, and you should consider when this tradeoff makes sense. When more complexity is warranted, workflows offer predictability and consistency for well-defined tasks, whereas agents are the better option when flexibility and model-driven decision-making are needed at scale. For many applications, however, optimizing single LLM calls with retrieval and in-context examples is usually enough. ## When and how to use frameworks There are many frameworks that make agentic systems easier to implement, including: - The [Claude Agent SDK](https://platform.claude.com/docs/en/agent-sdk/overview); - [Strands Agents SDK by AWS](https://strandsagents.com/latest/); - [Rivet](https://rivet.ironcladapp.com/), a drag and drop GUI LLM workflow builder; and - [Vellum](https://www.vellum.ai/), another GUI tool for building and testing complex workflows. These frameworks make it easy to get started by simplifying standard low-level tasks like calling LLMs, defining and parsing tools, and chaining calls together. However, they often create extra layers of abstraction that can obscure the underlying prompts and responses, making them harder to debug. They can also make it tempting to add complexity when a simpler setup would suffice. We suggest that developers start by using LLM APIs directly: many patterns can be implemented in a few lines of code. If you do use a fra…
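The preview argues that many of these patterns can be implemented in a few lines of code against an LLM API directly. As a rough illustration, here is a minimal prompt-chaining sketch; the `llm_*` functions are hypothetical stubs standing in for real model calls, and the entity heuristic exists only so the example runs without an API key.

```python
# Minimal prompt-chaining sketch: one hard task split into three
# sequential steps, with a programmatic gate between them. Each `llm_*`
# function is a hypothetical stub for a real model or tool call.

def llm_extract_entities(query: str) -> list[str]:
    # Stub: a real call would prompt a model to list entities.
    return [w.strip("?.") for w in query.split() if w and w[0].isupper()]

def llm_search(entities: list[str]) -> list[str]:
    # Stub: a real step might call a search tool once per entity.
    return [f"result about {e}" for e in entities]

def llm_synthesize(query: str, results: list[str]) -> str:
    # Stub: a real call would ask a model to draft the final answer.
    return f"Answer to '{query}' using {len(results)} result(s)."

def answer(query: str) -> str:
    entities = llm_extract_entities(query)
    if not entities:  # gate: bail out early instead of searching blindly
        return "No entities found; please rephrase."
    return llm_synthesize(query, llm_search(entities))
```

The gate between steps is the point of the pattern: cheap programmatic checks catch bad intermediate output before it propagates to later, more expensive calls.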
Pipeline
Stage progress, recent jobs, and manual recovery actions.
Ingest source
complete
Extract themes
complete
Factual summary
complete
Executive summary
complete
Audio narrative
complete
Audio file
not started
EPUB export
not started
Manual actions
Recent jobs
generate_epub: completed | Attempts: 0/3 | Updated: 2026-03-09 22:02:09 UTC
generate_audio_file: completed | Attempts: 0/3 | Updated: 2026-03-09 22:06:05 UTC
generate_audio_narrative: completed | Attempts: 0/3 | Updated: 2026-03-09 22:02:07 UTC
generate_executive_summary: completed | Attempts: 0/3 | Updated: 2026-03-09 22:01:39 UTC
generate_epub: completed | Attempts: 0/3 | Updated: 2026-03-09 21:47:16 UTC
generate_audio_file: completed | Attempts: 0/3 | Updated: 2026-03-09 21:47:16 UTC
generate_audio_narrative: completed | Attempts: 0/3 | Updated: 2026-03-09 21:43:29 UTC
generate_epub: completed | Attempts: 0/3 | Updated: 2026-03-09 21:20:10 UTC
Calibration
Choose how much context to add for each prerequisite.
Themes detected
LLM agent design patterns
Workflows vs autonomous agents
Simple composable architectures over frameworks
Augmented LLMs with tools, retrieval, and memory
Evaluation, guardrails, and iterative refinement
Tool/interface design for reliable agent behavior
Outputs
Structured factual package, audio narrative, rendered audio, and EPUB export.
Factual package
Artifact: ready | Provider: openai | Model: gpt-5.4
Created: 2026-03-09 21:19:28 UTC
{
"sourceTitle": "Building Effective AI Agents",
"sourceType": "article_url",
"coreClaim": "Anthropic’s main claim is that effective AI agent systems usually come from starting with simple LLM-based building blocks and adding structure only when it measurably helps. In practice, most successful systems use a small set of composable patterns—like chaining, routing, parallelization, orchestrator-workers, evaluator-optimizer, and only sometimes fully autonomous agents—rather than complex agent frameworks or maximal autonomy by default.",
"whyItMatters": "For a software engineer, the article is basically a design guide for deciding when to keep an application as a single LLM call, when to turn it into a workflow, and when a true agent is justified. The practical payoff is better reliability, lower cost, easier debugging, and clearer evaluation. The article also reframes agent engineering as interface design: the hardest and most important work is often not 'making the model smarter' but designing tools, control flow, feedback loops, and tests so the model can act safely and consistently.",
"mainIdeas": [
"Anthropic distinguishes between workflows and agents. Workflows are predefined code paths that orchestrate LLM calls and tools; agents are systems where the LLM dynamically decides the process and tool usage.",
"The recommended starting point is the simplest viable system. Often, a single well-prompted LLM call with retrieval and examples is enough, and agentic complexity should be added only when it clearly improves results.",
"The fundamental building block is an 'augmented LLM': a model paired with retrieval, tools, and memory. The article assumes these capabilities are available and emphasizes tailoring them to the use case and exposing them through easy, well-documented interfaces.",
"Prompt chaining is useful when a task can be broken into fixed sequential subtasks. The point is to turn one hard task into several easier ones, optionally with programmatic gates between steps.",
"Routing is appropriate when inputs fall into distinct classes that benefit from specialized prompts, tools, or models. It is a separation-of-concerns pattern for heterogeneous workloads.",
"Parallelization has two forms: sectioning, where independent subtasks are done concurrently, and voting, where multiple attempts or perspectives are aggregated for confidence or robustness.",
"Orchestrator-workers is for cases where subtasks cannot be fully predefined. A central model decomposes the problem dynamically, delegates work, and then synthesizes results.",
"Evaluator-optimizer introduces an explicit refinement loop: one model produces, another critiques, and the system iterates. This works best when quality criteria are clear and improvement from feedback is real.",
"True agents are most suitable for open-ended tasks with uncertain step counts, where the model needs to plan, act, inspect environmental feedback, recover from mistakes, and potentially ask humans for help at checkpoints.",
"As agent autonomy rises, cost and error compounding rise too. Anthropic therefore emphasizes sandbox testing, stopping conditions, transparency of planning, and human oversight.",
"Frameworks can accelerate initial development, but they also hide prompts, tool calls, and responses behind abstractions. Anthropic recommends understanding and often directly using the underlying API primitives.",
"Tool design is treated as a first-class engineering problem. The article argues that many agent failures come from poorly designed interfaces, ambiguous tool descriptions, awkward output formats, or insufficient testing of tool usage."
],
"practicalInterpretation": "The article’s practical message is: treat agent design like systems engineering, not magic prompting. Start with a baseline single-call system. Add retrieval, tools, or memory if needed. If failures are due to task decomposition, add chaining. If different request classes need different handling, add routing. If independent checks or perspectives help, use parallelization. If task decomposition itself is input-dependent, use orchestrator-workers. If the model can improve through explicit critique, add evaluator-optimizer. Only use a true autonomous loop when the task is genuinely open-ended and you have both trustworthy tools and strong guardrails. Also, expect tool/interface quality to dominate real-world performance: clear parameter names, natural output formats, edge-case documentation, and usage testing often matter more than adding another abstract framework layer.",
"prerequisitesExplained": [
{
"topic": "LLM prompting and API usage",
"explanation": "The article assumes direct familiarity with calling LLM APIs, defining prompts, and reading structured responses. The main extension here is that the LLM is not just returning text; it may also emit tool-use requests or intermediate outputs that become inputs to later steps. So the engineering shift is from 'one prompt in, one answer out' to 'a controlled dialogue between model, code, and environment.'",
"familiarityLevel": "know_well"
},
{
"topic": "Tool use / function calling",
"explanation": "The article treats tool use as a core primitive. A tool is an external capability—search, file edit, database fetch, code execution, refund action, etc.—described to the model in a structured way. The important subtlety is that tool quality is not just about backend correctness; it is also about how understandable the interface is to the model. In other words, a tool schema is partly an API contract for software and partly a prompt for the model.",
"familiarityLevel": "know_well"
},
{
"topic": "Retrieval and memory augmentation",
"explanation": "Retrieval means fetching relevant external context at run time rather than hoping the model already 'knows' it. Memory means preserving useful prior information across turns or steps. The article bundles these with tools as augmentations that turn a plain LLM into an 'augmented LLM.' The key mental model is that these augmentations give the model ways to look things up, act, and carry forward state, making it less like a static text generator and more like a controller over external resources.",
"familiarityLevel": "know_somewhat"
},
{
"topic": "Workflow orchestration concepts",
"explanation": "This is one of the article’s central ideas. A workflow is explicit control logic around one or more LLM calls. Think of it as the application deciding the graph of execution: first classify, then call one of several prompts; or first draft, then review; or split a task into parallel branches and merge results. The workflow can be fixed ahead of time, which makes it easier to debug and test. The article contrasts this with agents, where the model decides more of the execution path at run time. The practical distinction is important: workflows are like deterministic pipelines with probabilistic components, while agents are more like closed-loop controllers that choose their own next actions based on feedback.",
"familiarityLevel": "add_background"
},
{
"topic": "Evaluation and guardrail design",
"explanation": "The article repeatedly suggests that complexity is justified only when it improves measurable outcomes, which implies having evaluations. In this context, evaluation means creating tests, criteria, or judges that check whether the system is doing the right thing—accuracy, safety, completeness, policy compliance, resolution success, and so on. Guardrails are the constraints and screening steps that reduce harmful or invalid behavior, such as filtering unsafe requests or preventing destructive actions outside a sandbox. One useful idea from the article is that guardrails can be their own parallel workflow rather than something you ask the same generation call to 'also remember.' This often works better because each model invocation has a cleaner objective.",
"familiarityLevel": "add_background"
},
{
"topic": "Basic software engineering for interfaces and testing",
"explanation": "The article assumes standard engineering instincts: clear interfaces, strong documentation, sandboxing, and iterative testing. Its twist is that the consumer of the interface is partly an LLM. So naming, examples, defaults, and shape of parameters matter even more than usual because they influence both correctness and model behavior.",
"familiarityLevel": "know_well"
}
],
"limitations": [
"This is a practitioner article, not a formal research paper. Its guidance is based on Anthropic’s experience with customers and internal systems, so it is highly practical but not backed here by controlled benchmarks across all patterns.",
"The article is intentionally high-level. It names patterns and gives examples, but it does not provide implementation details, quantitative decision thresholds, or rigorous comparisons showing when one pattern overtakes another.",
"The recommendation to prefer simplicity is sensible, but the article does not deeply discuss cases where organizational constraints, compliance requirements, or very large existing systems might justify more framework-heavy architectures.",
"The workflows are presented as common patterns rather than an exhaustive taxonomy. Real systems may blur categories or require additional components like persistent planners, state machines, schedulers, or external policy engines.",
"The discussion of agents emphasizes sandboxing and guardrails but does not fully specify how to design robust permission models, rollback mechanisms, or failure recovery for high-risk production environments.",
"Some examples are Anthropic-specific or Claude-oriented, especially around tool use, MCP, workbench, and reference implementations. The core ideas generalize, but some implementation advice assumes Anthropic’s ecosystem and API conventions.",
"The article argues that tool/interface design is crucial, but it provides only a brief appendix rather than a full methodology for systematically evaluating tool schemas, error rates, or human review burden."
],
"groundedWebContext": [],
"provenanceNotes": [
"All core claims, workflow definitions, examples, and recommendations are grounded in the provided source text from Anthropic’s article published December 19, 2024.",
"The summary preserves the article’s explicit distinction between workflows and agents, which is one of its main architectural claims.",
"Descriptions of diagrams are inferred from the surrounding text and captions in the source: each figure illustrates a control-flow pattern rather than presenting experimental data tables.",
"No external web context was added because the user supplied the source text directly and the task is to summarize that source faithfully.",
"Interpretive language in the practicalInterpretation field is a synthesis of the source’s advice for engineers, not a new claim about empirical performance beyond what the article states."
],
"audioRewriteHandoff": "Emphasize that the surprising message is anti-hype: the article says the best agent systems are often simple, composable workflows rather than elaborate autonomous architectures. Keep the tone practical and engineering-focused, with special weight on the workflows-vs-agents distinction and the idea that tool/interface design is often the real bottleneck."
}
Executive summary
Artifact: ready | Provider: openai | Model: gpt-5.4
Created: 2026-03-09 22:01:39 UTC
Feed description: This article is a practical architecture guide for AI applications that sits between single-shot prompting and fully autonomous agents. Anthropic argues that most effective systems come from starting with an augmented LLM and then adding a small set of workflow patterns—chaining, routing, parallelization, orchestrator-workers, and evaluator-optimizer—only when they clearly improve results. Its main contribution is a plainspoken design framework for deciding when to keep control flow explicit, when to let the model dynamically choose actions, and how tool and interface design often determines reliability more than agent hype does.

Larger picture: As teams rush to build "agents," this article reframes the problem as systems engineering: use the least autonomy that works, because more autonomy usually means higher cost, harder debugging, and more compounding errors. It matters now because it gives engineers a clearer way to separate workflow orchestration from true agency and to add complexity only with measurable benefit.

Main contribution: The article provides a concrete taxonomy of common LLM system patterns and ties each one to the kind of task it fits best, from fixed sequential decomposition to open-ended agent loops. For engineers, the key takeaway is to build around augmented LLM primitives and well-designed tools, with sandboxing, stopping conditions, transparency, and human oversight as autonomy increases.
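Of the patterns the summary names, routing is the easiest to make concrete: an initial classification step picks a specialized handler. This is an illustrative sketch, not code from the article; `classify` is a hypothetical stub (a keyword heuristic) where a real system would make a model call.

```python
# Sketch of the routing pattern: classify the input, then dispatch to a
# specialized prompt/handler. `classify` is a hypothetical stub standing
# in for an LLM classification call.

def classify(query: str) -> str:
    """Stand-in for an LLM classifier (keyword heuristic for demo only)."""
    q = query.lower()
    if "password" in q:
        return "account"
    if "refund" in q or "charge" in q:
        return "billing"
    return "general"

# Each handler would wrap its own specialized prompt, tool set, or model.
HANDLERS = {
    "account": lambda q: f"[account prompt] {q}",
    "billing": lambda q: f"[billing prompt] {q}",
    "general": lambda q: f"[general prompt] {q}",
}

def route(query: str) -> str:
    label = classify(query)
    # Fall back to the general handler if the classifier emits an
    # unexpected label, so a misclassification cannot crash the pipeline.
    return HANDLERS.get(label, HANDLERS["general"])(query)
```

The fallback branch matters in practice: because the classifier is probabilistic, the dispatch table should tolerate labels it has never seen.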
Audio narrative
Artifact: ready | Provider: gemini | Model: gemini-2.5-pro
Created: 2026-03-09 21:20:05 UTC
If you're building an application on top of a large language model, you've probably hit a point where a single, well-crafted prompt just isn't cutting it anymore. The task is too complex, the failure rate is too high, and you start wondering if you need to build a full-blown autonomous agent. A recent article from Anthropic offers a refreshingly pragmatic answer: probably not. Their core claim is that the most effective AI systems usually come from starting with simple, composable building blocks and adding structure only when it measurably helps. In practice, this means using a small set of reliable workflow patterns, and only reaching for a truly autonomous agent when the problem genuinely demands it.

So, what are these building blocks? The fundamental unit is what Anthropic calls an "augmented LLM." This isn't just a raw model; it's the model paired with its senses and tools. Think of it as an LLM with access to a filing cabinet for retrieval, a set of APIs to take action, and a notepad for memory. The key engineering insight is to treat these augmentations as first-class citizens. Your job isn't just to prompt the model, but to design a clean, well-documented environment of tools and information that the model can reliably use.

With that foundation, the most important architectural decision is the distinction between a workflow and an agent. A workflow is a system where your code defines the path of execution. It's like a script that orchestrates LLM calls and tool usage in a predictable sequence. An agent, on the other hand, is a system where the LLM itself dynamically decides what to do next. The article strongly suggests that you should default to workflows, because they are easier to debug, test, and control. Agents are powerful, but they introduce a lot of uncertainty.

Let's walk through the common workflow patterns Anthropic lays out. The simplest is prompt chaining. This is for any task you can break down into fixed, sequential steps.
Instead of asking a model to do one huge, complicated thing, you ask it to do three smaller, simpler things in a row. For example, first extract entities from a user query, then use those entities to search a database, and finally, synthesize the search results into a natural language answer. It's like an assembly line for reasoning, turning one error-prone task into several more reliable ones.

Next up is routing. This pattern is your friend whenever you're dealing with different kinds of inputs that need different kinds of handling. Imagine you're building a customer support bot. A simple "how do I reset my password?" query should be handled differently than a complex, angry complaint about a billing error. Routing uses an initial LLM call to act like a switch statement, classifying the input and directing it to a specialized prompt, a different tool, or even a more powerful model. It's a classic separation-of-concerns pattern applied to LLM workloads.

Then there's parallelization, which comes in two flavors. The first is sectioning, where you break a big task into independent sub-problems that can be worked on at the same time. Think of writing a report with five distinct sections; you could have five parallel LLM calls each drafting a section, and then a final call to stitch them together. The other form is voting, which is more about robustness. You can run the same prompt multiple times, or with slightly different instructions, and then aggregate the results. If three out of five runs agree on an answer, you can have much higher confidence. It's like getting a second, third, and fourth opinion before making a critical decision.

Now, what if the subtasks can't be fully defined ahead of time? That's where the orchestrator-workers pattern comes in. Here, you have a central "orchestrator" model that acts like a project manager. It analyzes the main goal, dynamically breaks it down into smaller tasks, and delegates that work to specialized "worker" models or tools.
Once the workers are done, the orchestrator synthesizes their outputs into a final result. This is a step up in complexity from simple chaining, because the plan itself is generated at runtime, but it's still more structured than a fully autonomous agent.

For tasks that demand high quality and can benefit from refinement, there's the evaluator-optimizer pattern. This sets up an explicit feedback loop. One model acts as the "producer," generating a draft, a piece of code, or a plan. A second model, the "evaluator," then critiques that output based on a set of criteria or a rubric. The initial draft and the critique are then fed back to the producer for another attempt. It's a programmatic version of a writer-editor relationship or a code review cycle, forcing the system to iterate and improve its own work.

Only after you've considered these structured workflows should you reach for a true agent. According to Anthropic, agents are best suited for open-ended tasks where the number of steps is uncertain and the environment is complex. This is where the model needs to plan, act, observe the results, and potentially recover from mistakes. But this autonomy comes at a cost. With each step the agent takes on its own, the potential for error compounding grows, and so do your token costs. This is why guardrails are non-negotiable: you need robust sandboxing to prevent destructive actions, clear stopping conditions to prevent infinite loops, and enough transparency for a human to understand the agent's plan and intervene if needed.

Finally, the article makes a crucial point about the engineering that surrounds the model. It cautions against leaning too heavily on complex frameworks that hide what's actually happening under the hood. Abstractions can be helpful, but if you can't see the exact prompts, tool calls, and model responses, debugging becomes a nightmare. Often, it's better to work directly with the API primitives.
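The evaluator-optimizer loop described above can be sketched as a bounded produce-critique cycle. This is a minimal illustration, not the article's code: `produce` and `evaluate` are hypothetical stubs standing in for two model calls with distinct prompts, and the acceptance criterion is a toy.

```python
# Sketch of the evaluator-optimizer pattern: a producer drafts, an
# evaluator critiques, and the loop repeats until the critique passes or
# the retry budget runs out. Both functions are hypothetical stubs; a
# real system would make model calls with separate prompts per role.

def produce(task: str, feedback: str = "") -> str:
    """Stub producer: marks the draft as revised when given feedback."""
    draft = f"draft for: {task}"
    return draft + " (revised)" if feedback else draft

def evaluate(draft: str) -> tuple[bool, str]:
    """Stub evaluator with a toy criterion: accept only revised drafts."""
    if "(revised)" in draft:
        return True, ""
    return False, "needs revision"

def refine(task: str, max_rounds: int = 3) -> str:
    feedback = ""
    draft = ""
    for _ in range(max_rounds):
        draft = produce(task, feedback)
        ok, feedback = evaluate(draft)
        if ok:
            break  # stopping condition: the critique passed
    return draft  # best effort if the budget is exhausted
```

Note the two guardrails baked into the loop shape itself: an explicit pass condition and a hard iteration budget, so the system can neither loop forever nor silently discard its best attempt.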
And perhaps most importantly, tool design is everything. Many so-called "agent failures" are actually interface failures. An ambiguous tool description, an awkward output format, or an untested edge case in an API can easily send a model down the wrong path. The hard work is often in designing clean, predictable, and well-documented tools that the model can use as reliably as any other piece of software.

So, to wrap it all up, here are the key takeaways. First, always start with the simplest possible system and only add complexity when it's clearly justified. Your baseline should be a single augmented LLM call. Second, think in terms of composable workflow patterns—chaining, routing, parallelization—before you even consider building a fully autonomous agent. These patterns give you structure and control. Third, treat your tool and API design as a first-class engineering discipline. The clarity and reliability of the tools you give an LLM are often more important than the cleverness of your prompts. And finally, use agents for what they're good at—open-ended, dynamic problems—but always with strong guardrails and human oversight in place.
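The point that a tool schema is "partly an API contract and partly a prompt" can be made concrete: the name, description, and parameter docs below are written for the model as much as for other engineers. The shape loosely follows common function-calling conventions; the `search_orders` tool and all of its fields are hypothetical, not from the article or any specific provider's API.

```python
# Sketch of a tool schema designed as an interface for a model: an
# unambiguous name, a description that states when to use the tool, and
# parameter docs with an example value. All names here are illustrative.

def make_tool_schema() -> dict:
    return {
        "name": "search_orders",
        "description": (
            "Look up a customer's orders by email address. "
            "Returns at most `limit` orders, newest first. "
            "Use this before issuing any refund."
        ),
        "parameters": {
            "type": "object",
            "properties": {
                "customer_email": {
                    "type": "string",
                    "description": "Exact email on the account, e.g. 'a@b.com'.",
                },
                "limit": {
                    "type": "integer",
                    "description": "Maximum number of orders to return (default 5).",
                },
            },
            "required": ["customer_email"],
        },
    }
```

The details that most often go wrong are exactly the ones spelled out here: a description that says when to call the tool, parameter names a model cannot confuse, and an explicit `required` list so omissions fail loudly instead of silently.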
Audio file
Artifact: ready | Provider: deepgram | Model: aura-2-draco-en
Created: 2026-03-09 22:06:05 UTC
No audio file available.
EPUB export
Artifact: ready
Created: 2026-03-09 22:02:09 UTC
No file available.
Feedback
Capture what worked, what was unclear, and what to revisit.
No feedback yet.