Agent architectures that actually ship

A 2026 pattern catalog grounded in the ReAct → Reflexion → SWE-agent literature.

Sal Anvarov · May 8, 2026 · 18 min read

LLM agents
ReAct
Reflexion
Tool use
SWE-agent

The premise

"Agent" has become an overloaded word. In this paper we use it narrowly: a system where a language model decides what to do next — which tools to call, in which order, with what inputs — rather than executing a human-authored script.

The 2022-2025 literature has produced four architectural primitives that production agents combine in some form: ReAct (the loop), Reflexion (the memory), multi-agent (the team), and the agent-computer interface (the tools). Recent benchmarks tell us where to spend engineering time. Recent production experience tells us where not to over-engineer.

The four primitives

1. ReAct — the loop

ReAct: Synergizing Reasoning and Acting in Language Models (Yao et al., 2022) is the foundational architectural primitive: interleave reasoning traces with actions. The agent thinks, acts (calls a tool), observes the result, and updates its plan in the same context.

The reported results set the direction for what followed. On knowledge tasks (HotpotQA, FEVER), ReAct reduced the hallucination and error propagation seen with pure chain-of-thought. On interactive environments (ALFWorld, WebShop), it beat imitation and reinforcement learning baselines by 34 and 10 absolute percentage points in success rate, respectively.

In Yao et al.'s framing: "reasoning traces help the model induce, track, and update action plans as well as handle exceptions, while actions allow it to interface with external sources... to gather additional information."

Virtually every production agent in 2026 still uses some variant of this loop. The question is which variant.
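The think, act, observe cycle can be sketched in a few lines. This is a minimal illustration, not the paper's implementation: `scripted_model` stands in for a real LLM call, and the `TOOLS` registry and the `Action: name[arg]` text format are assumptions made for the example.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> function taking one string argument.
TOOLS: dict[str, Callable[[str], str]] = {
    "lookup": lambda q: {"capital of France": "Paris"}.get(q, "no result"),
}

def scripted_model(transcript: list[str]) -> str:
    """Stand-in for an LLM call; returns a thought plus an action or a final answer."""
    if not any(entry.startswith("Observation:") for entry in transcript):
        return "Thought: I should look this up.\nAction: lookup[capital of France]"
    return "Thought: The observation answers it.\nFinal: Paris"

def react(question: str, model=scripted_model, max_turns: int = 5) -> str:
    transcript = [f"Question: {question}"]
    for _ in range(max_turns):  # bounded loop: never run forever
        step = model(transcript)
        transcript.append(step)  # reasoning and actions share one context
        last = step.splitlines()[-1]
        if last.startswith("Final:"):
            return last.removeprefix("Final:").strip()
        if last.startswith("Action:"):
            name, _, arg = last.removeprefix("Action:").strip().partition("[")
            tool = TOOLS.get(name)
            obs = tool(arg.rstrip("]")) if tool else f"unknown tool {name!r}"
            transcript.append(f"Observation: {obs}")  # fed back on the next turn
    return "stopped: turn limit reached"
```

Calling `react("What is the capital of France?")` runs one tool call and then returns the final answer. A model that never emits `Final:` hits the turn limit instead of looping.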

2. Reflexion — the memory

Reflexion: Language Agents with Verbal Reinforcement Learning (Shinn et al., 2023) added a self-critique layer on top of ReAct: after each trial, the agent reflects on what went wrong, stores those reflections in an episodic memory buffer, and consults that memory on the next attempt. No gradient updates required.

The eye-catching result: Reflexion hit 91% pass@1 on HumanEval, beating the GPT-4 baseline of 80% — without any weight updates. As the authors put it: "Reflexion agents verbally reflect on task feedback signals, then maintain their own reflective text in an episodic memory buffer to induce better decision-making in subsequent trials."

Most "agentic" production systems shipping today use some descendant of this pattern. It is the cheapest deployable improvement loop in the agent toolkit.
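The trial, reflect, retry cycle can be sketched as follows. This is a toy under stated assumptions, not the authors' code: `attempt`, `evaluate`, and `reflect` are hypothetical stand-ins for the actor LLM, the test harness, and the self-reflection LLM, and the task is a made-up `total` function.

```python
from typing import Optional

def attempt(task: str, reflections: list) -> str:
    """Stand-in actor; a real one would prompt an LLM with the task and reflections."""
    if any("empty list" in r for r in reflections):
        return "def total(xs): return sum(xs) if xs else 0"
    return "def total(xs): return sum(xs) / len(xs) * len(xs)"  # fails on []

def evaluate(code: str) -> Optional[str]:
    """Run unit tests; return None on success or a short error description."""
    scope: dict = {}
    exec(code, scope)  # toy only: never exec untrusted model output unsandboxed
    try:
        assert scope["total"]([1, 2, 3]) == 6
        assert scope["total"]([]) == 0
    except Exception as exc:
        return f"{type(exc).__name__} on test inputs"
    return None

def reflect(error: str) -> str:
    """Stand-in self-reflection; a real one would ask an LLM to explain the failure."""
    return f"Previous attempt failed with {error}; handle the empty list case."

def reflexion(task: str, max_trials: int = 3):
    memory: list = []  # episodic buffer of verbal reflections, no weight updates
    for trial in range(1, max_trials + 1):
        code = attempt(task, memory)
        error = evaluate(code)
        if error is None:
            return code, trial, memory
        memory.append(reflect(error))  # consulted on the next attempt
    return None, max_trials, memory
```

In this toy run the first trial fails on the empty list, the reflection is stored, and the second trial succeeds. The only state carried between trials is text.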

3. Multi-agent — the team

AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation (Wu et al., Microsoft, 2023) defined the dominant multi-agent design pattern: conversable, customizable agents that collaborate via structured conversation, with optional human and tool participation. Multi-agent often beats single-agent on complex workflows that genuinely benefit from specialization.

The framing in their abstract is precise: "AutoGen agents are customizable, conversable, and can operate in various modes that employ combinations of LLMs, human inputs, and tools."

The trap: most "multi-agent" systems we see are deterministic chains in costume. The agents do not collaborate; they sit in a pipeline calling each other. When multi-agent works, it works because each agent has fewer choices to make, not more.

4. The agent-computer interface — the unsung hero

The most under-discussed result of 2024 is from SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering (Yang et al., Princeton, NeurIPS 2024). They show that the interface the agent uses to manipulate its environment is itself a first-class design variable. A purpose-built "agent-computer interface" (narrower commands, clearer feedback, structured editing) lifts SWE-bench pass@1 from low single digits to 12.5% and reaches 87.7% on HumanEvalFix.

Their framing: SWE-agent's "custom agent-computer interface (ACI) significantly enhances an agent's ability to create and edit code files, navigate entire repositories, and execute tests and other programs."

For practitioners this is the biggest unlock since ReAct. The biggest production wins often come from redesigning the tool surface, not the model or the prompt.
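To make "redesigning the tool surface" concrete, here is a sketch of one narrow, guarded edit command. It illustrates the idea only and does not reproduce SWE-agent's actual commands: `edit_lines` and `EditResult` are hypothetical names, and the guardrail shown (rejecting edits that leave a Python file unparseable) is an assumption chosen for the example.

```python
import ast
from dataclasses import dataclass

@dataclass
class EditResult:
    ok: bool
    message: str  # short, structured feedback for the model
    source: str   # the original source is returned unchanged when ok is False

def edit_lines(source: str, start: int, end: int, replacement: str) -> EditResult:
    """Replace 1-indexed inclusive lines start..end; reject edits that break syntax."""
    lines = source.splitlines()
    if not (1 <= start <= end <= len(lines)):
        return EditResult(False, f"invalid range {start}-{end}; file has {len(lines)} lines", source)
    new_block = replacement.splitlines()
    new_lines = lines[: start - 1] + new_block + lines[end:]
    candidate = "\n".join(new_lines) + "\n"
    try:
        ast.parse(candidate)
    except SyntaxError as exc:
        return EditResult(False, f"edit rejected: syntax error at line {exc.lineno}: {exc.msg}", source)
    # Show a small numbered window around the edit instead of the whole file.
    lo = max(0, start - 2)
    hi = min(len(new_lines), start - 1 + len(new_block) + 1)
    window = "\n".join(f"{i + 1}: {new_lines[i]}" for i in range(lo, hi))
    return EditResult(True, f"edit applied\n{window}", candidate)
```

The design choice is that a bad edit never reaches the file, and the model gets a one-line reason it can act on, rather than discovering the breakage several turns later.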

Where agents actually fail (per AgentBench)

AgentBench: Evaluating LLMs as Agents (Liu et al., ICLR 2024) evaluated 29 models across 8 interactive environments. The dominant failure modes were not "the model doesn't know enough." They were:

Long-horizon reasoning — agents lose the plot past 8-12 steps.
Decision-making under uncertainty — agents over-commit to early hypotheses.
Instruction following — agents drift from constraints set earlier in the run.

Liu et al. put it plainly: "poor long-term reasoning, decision-making, and instruction following abilities are the main obstacles for developing usable LLM agents."

That tells you where to spend engineering effort: aggressive context curation, structured planning, and machine-readable constraints — not more elaborate prompts.
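One way to read "machine-readable constraints" is that rules set at the start of a run live in data and are checked in code on every proposed tool call, so they cannot drift out of the context window. The sketch below assumes a hypothetical action shape (`{"tool": ..., "args": {...}}`), and its tool names and fields are illustrative only.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Constraints:
    allowed_tools: frozenset
    protected_paths: tuple = ()  # path prefixes the agent must not write to

def violations(action: dict, c: Constraints) -> list:
    """Return human-readable violations for one proposed tool call."""
    found = []
    tool, args = action["tool"], action.get("args", {})
    if tool not in c.allowed_tools:
        found.append(f"tool {tool!r} is not allowed")
    # str.startswith accepts a tuple of prefixes; an empty tuple never matches.
    if tool == "write_file" and args.get("path", "").startswith(c.protected_paths):
        found.append(f"path {args['path']!r} is protected")
    return found
```

A harness would call `violations` before executing each action and return the list to the model as an observation, instead of relying on the model to remember the rule from turn one.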

The five practical patterns

Drawing the literature together, here are the five architectures we see in production. They are not mutually exclusive; most real systems compose two or three.

Pattern 1 — Deterministic chain (with one LLM step)

The simplest "agent" is not really an agent at all — a deterministic pipeline where exactly one step happens to be an LLM call. Use when the workflow is known and stable. Mislabeling this as an agent is a common failure: it adds agentic infrastructure the pipeline does not need.
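A sketch of what this looks like in code, assuming a made-up ticket-triage workflow: every step is ordinary code except `classify`, which stands in for the single LLM call. The label set and queue names are hypothetical.

```python
import re

LABELS = {"billing", "bug", "other"}  # hypothetical label set
QUEUES = {"billing": "finance", "bug": "engineering", "other": "support"}

def classify(text: str) -> str:
    """Stand-in for the one LLM call; a real one would prompt a model."""
    if "invoice" in text or "refund" in text:
        return "billing"
    if "crash" in text or "error" in text:
        return "bug"
    return "something unexpected"  # models sometimes answer off-list

def handle_ticket(raw: str, classify_fn=classify) -> dict:
    text = re.sub(r"\s+", " ", raw).strip().lower()  # step 1: plain code
    label = classify_fn(text)                         # step 2: the only LLM step
    if label not in LABELS:                           # step 3: validate the output
        label = "other"
    return {"queue": QUEUES[label], "label": label}   # step 4: plain code
```

There is no loop, no tool choice, and no memory here, so the pipeline needs output validation and little else.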

Pattern 2 — Router

A single LLM call decides which downstream chain to execute. Each chain is deterministic. Right for "AI assistant"-style products with a small set of skills. Fails by route explosion past ~25 chains; the fix is nested routers or a planner.
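A minimal router sketch, under the assumption that skills are plain functions in a registry. `pick_route` stands in for the single routing LLM call, and the skill names are invented for the example.

```python
def reset_password(msg: str) -> str:
    return "password reset link sent"

def order_status(msg: str) -> str:
    return "order is in transit"

def fallback(msg: str) -> str:
    return "handing off to a human"

# Each route is a deterministic chain; only the choice between them uses a model.
ROUTES = {"reset_password": reset_password, "order_status": order_status}

def pick_route(msg: str) -> str:
    """Stand-in for the routing LLM call; returns a route name."""
    text = msg.lower()
    if "password" in text:
        return "reset_password"
    if "order" in text:
        return "order_status"
    return "none"

def route(msg: str, picker=pick_route):
    name = picker(msg)
    if name not in ROUTES:  # unknown or hallucinated route names fall through
        return "fallback", fallback(msg)
    return name, ROUTES[name](msg)
```

The registry is also where route explosion becomes visible: once `ROUTES` grows large, the single picker call has too many options to choose between reliably.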

Pattern 3 — Planner-executor

Two or more LLM calls — one plans, one executes. Use when the space of possible plans is open and you want plan transparency. Failure mode: plan-execution drift. Bound the re-planning depth or it cascades.
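Bounding the re-planning depth can be as simple as a counter in the harness. In this sketch `demo_planner` and `demo_executor` are hypothetical stand-ins for the two LLM calls, and the step names are invented.

```python
from typing import Optional

def demo_planner(goal: str, failed_step: Optional[str] = None) -> list:
    """Stand-in planner; a real one would be an LLM call returning steps."""
    if failed_step == "parse":
        return ["parse_fallback", "report"]
    return ["fetch", "parse", "report"]

def demo_executor(step: str) -> bool:
    """Stand-in executor; pretends the 'parse' step always fails."""
    return step != "parse"

def run(goal: str, planner=demo_planner, executor=demo_executor, max_replans: int = 2) -> dict:
    steps, done, replans = list(planner(goal)), [], 0
    while steps:
        step = steps.pop(0)
        if executor(step):
            done.append(step)
            continue
        if replans >= max_replans:  # bound the depth so failures cannot cascade
            return {"status": "failed", "done": done, "replans": replans}
        replans += 1
        steps = list(planner(goal, step))  # re-plan from the failed step
    return {"status": "ok", "done": done, "replans": replans}
```

With the stand-ins above, `run("summarize the report")` re-plans once and finishes. An executor that always fails stops after `max_replans` re-plans instead of looping.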

Pattern 4 — Single-loop ReAct

One LLM call sits in a loop. Each turn calls a tool or returns a final answer. Use when the task is genuinely exploratory and the tool count is moderate (under 15). Failure mode: context collapse around turn 8-12. Aggressive context curation is the fix — summarize prior turns, drop irrelevant tool outputs, manage context as a resource.
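A sketch of context curation between turns: keep the most recent turns, collapse older ones into a single summary entry, and truncate oversized tool outputs. The turn shape (`{"tool": ..., "observation": ...}`) and the defaults are assumptions, and the summary here is a simple tool list where a real system would likely use an LLM-written summary.

```python
def curate(turns: list, keep_last: int = 4, max_obs_chars: int = 200) -> list:
    """Return a trimmed view of the history to send on the next model call."""
    split = max(0, len(turns) - keep_last)
    older, recent = turns[:split], turns[split:]
    curated = []
    if older:
        tools = ", ".join(t["tool"] for t in older)
        curated.append({
            "role": "summary",
            "content": f"{len(older)} earlier turns (tools: {tools})",
        })
    for t in recent:
        obs = t["observation"]
        if len(obs) > max_obs_chars:
            dropped = len(obs) - max_obs_chars
            obs = obs[:max_obs_chars] + f" ...[{dropped} chars dropped]"
        curated.append({"role": "turn", "tool": t["tool"], "content": obs})
    return curated
```

The full history still belongs in the replay log. This function only decides what the model sees on the next turn.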

Pattern 5 — Multi-agent

Multiple agents with narrower roles coordinate via shared state or a message bus. Use when specialization is real and you can prove it. The SWE-agent lesson carries over here: most multi-agent gains come from constraining each agent, not from parallelism.

Six things every production agent needs

Regardless of pattern, every production agent we ship has these:

A bounded turn count. No infinite loops, ever.
A cost ceiling. Per-conversation and per-tenant, enforced by the system not by prompts.
A tool-call budget. Same principle.
An eval harness that runs end-to-end traces. See our paper on eval-driven development.
A replay log. Every input, model output, tool call, and final answer is replayable.
A kill switch. Disable per-user, per-tenant or globally without a deploy.
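The first three items and the kill switch can share one guard object that the harness, not the prompt, enforces. This is a sketch under assumptions: `RunGuard`, its default limits, and the `is_killed` lookup are hypothetical, and a production version would read the kill switch from a flag store and track cost per tenant as well as per conversation.

```python
from dataclasses import dataclass, field
from typing import Callable

class BudgetExceeded(RuntimeError):
    """Raised by the harness, not the model, when a run must stop."""

@dataclass
class RunGuard:
    max_turns: int = 12
    max_tool_calls: int = 30
    max_cost_usd: float = 0.50
    # Hypothetical kill-switch lookup; swap in a flag-store read in production.
    is_killed: Callable[[], bool] = field(default=lambda: False)
    turns: int = 0
    tool_calls: int = 0
    cost_usd: float = 0.0

    def charge(self, *, turns: int = 0, tool_calls: int = 0, cost_usd: float = 0.0) -> None:
        """Record usage, then stop the run if any limit is crossed."""
        if self.is_killed():
            raise BudgetExceeded("kill switch is on")
        self.turns += turns
        self.tool_calls += tool_calls
        self.cost_usd += cost_usd
        checks = (
            ("turn", self.turns, self.max_turns),
            ("tool-call", self.tool_calls, self.max_tool_calls),
            ("cost", self.cost_usd, self.max_cost_usd),
        )
        for name, used, limit in checks:
            if used > limit:
                raise BudgetExceeded(f"{name} limit exceeded: {used} > {limit}")
```

The agent loop would call `guard.charge(turns=1, ...)` once per turn and let `BudgetExceeded` end the run, so no prompt wording can talk the system past its limits.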

Decision tree

One LLM decision: deterministic chain.

Pick between a small set of skills: router.

Plannable, plan worth showing the user: planner-executor.

Exploratory, turn count bounded: single-loop ReAct with strict context management.

Specialization is real: multi-agent.

Closing

Agents will keep getting more capable. The architectural choices in this paper will keep mattering, because the constraints are not really about model capability — they are about reliability, cost, and the operational debugging surface.

The one rule that lasts: agents should be the smallest architecture that does the job, not the most elaborate one you can justify. And when in doubt, the SWE-agent lesson is the highest-leverage place to spend the next afternoon — get the tool interface right and a mid-tier model often beats a frontier model on a sloppy interface.

Sources

Yao et al., ReAct: Synergizing Reasoning and Acting in Language Models, arXiv:2210.03629
Shinn et al., Reflexion: Language Agents with Verbal Reinforcement Learning, arXiv:2303.11366
Wu et al., AutoGen: Enabling Next-Gen LLM Applications via Multi-Agent Conversation, arXiv:2308.08155
Liu et al., AgentBench: Evaluating LLMs as Agents, arXiv:2308.03688
Yang et al., SWE-agent: Agent-Computer Interfaces Enable Automated Software Engineering, arXiv:2405.15793
