Agents & Tool Use

Building autonomous LLM agents — the ReAct pattern, function calling, planning strategies, the Model Context Protocol, and multi-agent systems.

Overview

A language model answering questions from a static context window is useful, but it is fundamentally limited. It cannot look up today's stock price, run a unit test, or send a calendar invite. An LLM agent removes that ceiling: the model is given a set of tools — search, code execution, API calls, file access — and it decides which to invoke, in what order, to complete a task.

The LLM acts as the reasoning brain. Tools are its hands. Together they turn a Q&A system into an autonomous task-solver that can act on the world rather than merely describe it.

Understanding agents requires understanding both sides: how the model reasons about what to do next, and how the surrounding application framework executes those decisions and feeds results back.

The Agent Loop

Every agent architecture reduces to the same core cycle: reason, act, observe, repeat.

The loop continues until the model decides it has enough information to answer, or until a hard stop condition — maximum iterations, timeout, or a human override — interrupts it. Without stop conditions, a buggy or confused agent can loop indefinitely, which makes iteration limits a required safety feature rather than an optional one.

The ReAct Pattern

ReAct (Reasoning + Acting) is the most widely used agent prompting strategy. The model is prompted to interleave explicit thought traces with tool calls rather than jumping straight to an action.

The structure of each step is:

  1. Thought — the model articulates its current understanding and what it needs next
  2. Action — the model specifies a tool call
  3. Observation — the application returns the tool result

A concrete multi-step trace looks like this:

Task: What is the population of Tokyo, and how many times larger is it than Oslo?

Thought: I need to find Tokyo's population. I'll search for it.
Action: search("Tokyo population 2024")
Observation: Tokyo's population is approximately 13.96 million in the city proper,
             37.4 million in the greater metropolitan area.

Thought: Now I need Oslo's population to calculate the ratio.
Action: search("Oslo population 2024")
Observation: Oslo has approximately 717,000 residents.

Thought: I'll use the city proper figures. 13,960,000 / 717,000 ≈ 19.5.
Action: calculate("13960000 / 717000")
Observation: 19.469177...

Thought: I have both figures and the calculation result. I can now answer.
Answer: Tokyo's city population (~13.96M) is approximately 19.5 times larger than Oslo's (~717,000).

The explicit reasoning step matters because it forces the model to commit to a plan before acting. Without it, the model may call the wrong tool or pass malformed arguments. ReAct was introduced by Yao et al. (2023) and showed consistent improvements over action-only and thought-only baselines.

Chain-of-Thought vs. ReAct

Chain-of-Thought (CoT) prompting produces reasoning traces but no actions — the model thinks step by step, but entirely within its own knowledge. ReAct extends CoT by alternating between internal reasoning and external tool calls. Think of CoT as reasoning in a closed room, and ReAct as reasoning while being able to step outside and check things.

Function Calling / Tool Use

Early agent systems passed tool results as plain text in the conversation. Modern LLMs support structured function calling: the model receives machine-readable tool definitions and returns structured JSON tool calls instead of free text. The host application routes those calls to the actual implementations, executes them, and returns results.

A tool definition specifies the tool's name, what it does, and the schema for its parameters:

{
  "name": "get_weather",
  "description": "Get current weather for a city",
  "parameters": {
    "type": "object",
    "properties": {
      "city": { "type": "string" }
    },
    "required": ["city"]
  }
}

The model never executes code. It produces a JSON object like {"name": "get_weather", "arguments": {"city": "Tokyo"}}. The application code handles routing, authentication, rate limits, and error handling — the model just decides what to call and with what arguments.

This separation is important for safety. The model can only invoke tools that the application explicitly exposes. Permissions, sandboxing, and audit logging all live in the host application layer, not inside the model.

Code Example

A minimal agent with two tools: weather lookup (stubbed) and arithmetic evaluation.

import json
from openai import OpenAI

client = OpenAI()

# Tool definitions
tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "City name"}
                },
                "required": ["city"],
            },
        },
    },
    {
        "type": "function",
        "function": {
            "name": "calculate",
            "description": "Evaluate a mathematical expression",
            "parameters": {
                "type": "object",
                "properties": {
                    "expression": {"type": "string", "description": "Math expression to evaluate"}
                },
                "required": ["expression"],
            },
        },
    },
]

def execute_tool(name: str, args: dict) -> str:
    if name == "get_weather":
        return f"Weather in {args['city']}: 22°C, partly cloudy"  # stub
    if name == "calculate":
        # Demo only: eval on model-supplied input is unsafe; use a
        # restricted expression parser in production.
        return str(eval(args["expression"]))  # noqa: S307
    return "Unknown tool"

def run_agent(user_message: str, max_turns: int = 10) -> str:
    messages = [{"role": "user", "content": user_message}]

    for _ in range(max_turns):  # iteration cap: a required stop condition
        response = client.chat.completions.create(
            model="gpt-4o-mini",
            messages=messages,
            tools=tools,
        )
        msg = response.choices[0].message
        messages.append(msg)

        if msg.tool_calls:
            for call in msg.tool_calls:
                result = execute_tool(call.function.name, json.loads(call.function.arguments))
                messages.append({
                    "role": "tool",
                    "tool_call_id": call.id,
                    "content": result,
                })
        else:
            return msg.content

    raise RuntimeError(f"Agent exceeded {max_turns} turns without answering")

# Example
print(run_agent("What is the weather in Tokyo, and what is 15 * 8?"))

Planning Strategies

Not all tasks suit the same planning approach. Three strategies have emerged as the main options.

ReAct interleaves thought and action in a single loop. It handles open-ended tasks well because each observation can redirect the next thought. The downside is that each step is made locally — the model does not look ahead, so it can take inefficient paths on structured tasks.

Plan-and-Execute separates planning from execution. The model first generates a complete step-by-step plan, then a separate execution pass carries out each step. This works better for tasks with a known structure — writing a report, running a test suite, filling out a form — because you can inspect and correct the plan before any irreversible actions run.
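The two phases can be sketched as follows. The `plan` and `execute_step` functions here are hard-coded stand-ins: a real system would prompt the model for the plan, surface it for inspection, and only then run the execution pass.

```python
def plan(task: str) -> list[str]:
    # Stand-in for a planning model call; a real system would prompt the
    # LLM to emit this list, then let a human inspect it before execution.
    return [
        "search: Tokyo population",
        "search: Oslo population",
        "calculate: ratio of the two figures",
    ]

def execute_step(step: str) -> str:
    # Stand-in for the execution pass (tool calls, model calls, etc.).
    return f"completed {step}"

def plan_and_execute(task: str) -> list[str]:
    steps = plan(task)                       # phase 1: full plan up front
    return [execute_step(s) for s in steps]  # phase 2: carry out each step
```

Because the full plan exists before any step runs, this structure gives you a natural checkpoint for review before irreversible actions.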

Tree of Thought explores multiple reasoning branches in parallel, scores each branch, and selects the best path forward. It is more compute-intensive but dramatically outperforms single-path strategies on puzzles, mathematics, and multi-constraint planning problems. Think of it as beam search applied to reasoning rather than token generation.
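The branch-and-score idea can be sketched as a small beam search. The `expand` and `score` callables stand in for model calls that propose candidate next thoughts and evaluate partial reasoning paths:

```python
def tree_of_thought(root, expand, score, beam_width=2, depth=3):
    """Beam search over reasoning branches: expand every path on the
    frontier, score the new candidates, keep the best beam_width."""
    frontier = [(score(root), [root])]
    for _ in range(depth):
        children = [
            (score(child), path + [child])
            for _, path in frontier
            for child in expand(path[-1])
        ]
        if not children:
            break
        frontier = sorted(children, key=lambda c: c[0], reverse=True)[:beam_width]
    return max(frontier, key=lambda c: c[0])[1]  # best-scoring path

# Toy usage: grow strings, prefer those with more "b"s. A real agent would
# make expand a model call proposing next thoughts, and score a model-based
# evaluator of partial solutions.
best = tree_of_thought("", lambda s: [s + "a", s + "b"], lambda s: s.count("b"))
```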

Memory in Agents

A stateless agent that forgets everything between turns is rarely useful for complex tasks. Agent memory falls into three tiers.

In-context memory is the simplest form — the full conversation history is appended to every prompt. It is immediate and requires no infrastructure, but it is bounded by the model's context window. For tasks spanning many turns or hours, the context fills up or becomes prohibitively expensive.

External memory uses a vector store (such as Pinecone, Chroma, or pgvector) to store and retrieve relevant past information via semantic search. The agent queries for relevant memories at the start of each turn rather than loading everything. This scales to arbitrarily long histories, but retrieval introduces latency and can miss relevant context if the query is imprecise.
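A toy version of this tier, using a bag-of-characters embedding in place of a real embedding model, shows the store-then-retrieve shape:

```python
import math

class VectorMemory:
    """Toy external memory: stores (text, embedding) pairs and retrieves
    the most similar entries by cosine similarity. embed() stands in for
    a real embedding model; any text-to-vector function works."""

    def __init__(self, embed):
        self.embed = embed
        self.entries: list[tuple[str, list[float]]] = []

    def add(self, text: str) -> None:
        self.entries.append((text, self.embed(text)))

    def search(self, query: str, k: int = 3) -> list[str]:
        q = self.embed(query)
        def cosine(v):
            dot = sum(a * b for a, b in zip(q, v))
            norm = math.sqrt(sum(a * a for a in q)) * math.sqrt(sum(b * b for b in v))
            return dot / norm if norm else 0.0
        ranked = sorted(self.entries, key=lambda e: cosine(e[1]), reverse=True)
        return [text for text, _ in ranked[:k]]
```

A production system would swap in a real embedding model and a vector database, but the agent-facing interface, `add` on each turn and `search` at the start of the next, stays the same.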

Episodic memory stores the results and outcomes of past tasks — not just raw text, but structured records of what was attempted, what succeeded, and what failed. A task-completion agent that remembers its own past mistakes can avoid repeating them across sessions.

Model Context Protocol (MCP)

Tool definitions have traditionally been bespoke: each application defines its own tool schemas, each model provider documents its own calling convention, and connecting a new data source to an agent requires custom integration work every time.

Anthropic's Model Context Protocol (MCP), released in late 2024, is an open standard designed to solve this fragmentation. The analogy is USB: before USB, every peripheral needed a proprietary connector. MCP provides a universal connector for AI tools and data sources.

An MCP server exposes capabilities — tools, resources (file contents, database rows), and prompts — over a standardized JSON-RPC protocol. An MCP client (Claude, an IDE plugin, a custom agent) discovers available capabilities and calls them without needing to know the implementation details.
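Concretely, a tool invocation travels as a JSON-RPC 2.0 request using the MCP-defined `tools/call` method. Sketched here by building the payload in Python (the tool name and arguments are illustrative):

```python
import json

# Shape of an MCP tool invocation: a standard JSON-RPC 2.0 request whose
# method and params structure are defined by the MCP specification.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "get_weather",
        "arguments": {"city": "Tokyo"},
    },
}
print(json.dumps(request, indent=2))
```

Because every server speaks this same wire format, a client discovers capabilities (via `tools/list`) and invokes them identically regardless of what the server wraps.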

The practical result is a growing ecosystem of pre-built MCP servers for common services — GitHub, Slack, databases, file systems, browser automation — that any MCP-compatible client can use without integration work.

Multi-Agent Systems

Some tasks are too large or too complex for a single agent. Multi-agent systems decompose work across specialized agents coordinated by an orchestrator.

A research task illustrates the pattern well. An orchestrator receives the research question and breaks it into parallel sub-tasks: one agent searches recent web sources, another reads and summarizes a set of documents, a third cross-checks claims against known facts. The orchestrator collects all results and synthesizes the final response.
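A sketch of that fan-out, with stub functions standing in for the three specialist agents:

```python
from concurrent.futures import ThreadPoolExecutor

def orchestrate(question: str) -> str:
    """Fan sub-tasks out to specialist agents in parallel, then combine.
    The sub-agents here are stubs standing in for real LLM-backed agents."""
    def search_agent(q):   return f"[search] recent sources on: {q}"
    def reader_agent(q):   return f"[reader] document summaries for: {q}"
    def verifier_agent(q): return f"[verifier] claims cross-checked for: {q}"

    with ThreadPoolExecutor() as pool:
        futures = [pool.submit(agent, question)
                   for agent in (search_agent, reader_agent, verifier_agent)]
        results = [f.result() for f in futures]
    # A real orchestrator would hand these results to a synthesis model call.
    return "\n".join(results)
```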

The benefits are parallelism (sub-tasks run simultaneously), specialization (each agent can be fine-tuned or prompted for its specific role), and error checking (a dedicated verification agent reduces the chance that hallucinated content reaches the user).

The challenges are coordination overhead, compounding errors when upstream agents produce incorrect output, and debugging difficulty — tracing a wrong final answer back to the sub-agent that produced the bad intermediate result requires good observability tooling.

Challenges and Safety

Agents introduce risks that prompting alone does not:

Prompt injection — malicious content embedded in tool outputs (a web page, a document, an API response) that instructs the agent to take unauthorized actions. Treat all tool output as untrusted user input.

Minimal permissions — an agent with read/write access to a file system, email, and a code executor is a large attack surface. Grant only the permissions required for the specific task. Separate agents for sensitive operations.

Loop detection — always set a maximum iteration limit. A confused agent can enter reasoning loops that consume significant API spend and produce no useful output.

Human-in-the-loop for irreversible actions — sending emails, deleting files, making purchases, and deploying code cannot be undone. Require explicit user confirmation before the agent executes these actions, regardless of how confident the model appears.
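A simple way to enforce this is a confirmation gate in front of the tool executor. The tool names and injected `confirm`/`execute` callables here are illustrative; in a real agent, `confirm` would prompt the user and `execute` would run the actual tool:

```python
# Illustrative set of tools whose effects cannot be undone.
IRREVERSIBLE = {"send_email", "delete_file", "make_purchase", "deploy"}

def gated_execute(name: str, args: dict, confirm, execute) -> str:
    """Block irreversible tool calls unless the user explicitly confirms."""
    if name in IRREVERSIBLE and not confirm(name, args):
        return f"Blocked: user declined {name}"
    return execute(name, args)
```

The gate lives in the application layer, so it applies no matter how confidently the model requests the action.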

Key Papers and Resources

"ReAct: Synergizing Reasoning and Acting in Language Models" — Yao et al., 2023 (arXiv:2210.03629). Introduced the interleaved thought-action-observation format. Showed consistent improvements over chain-of-thought and action-only baselines across question answering, fact verification, and interactive task benchmarks.

"Tree of Thoughts: Deliberate Problem Solving with Large Language Models" — Yao et al., 2023 (arXiv:2305.10601). Extended single-path reasoning to tree search, enabling deliberate exploration of multiple reasoning paths with backtracking.

Anthropic Model Context Protocol specification (modelcontextprotocol.io). The open standard for connecting LLM applications to external tools and data sources. Includes SDK documentation for building MCP servers and clients.

OpenAI Function Calling documentation (platform.openai.com/docs/guides/function-calling). Definitive reference for the structured tool-calling API supported by GPT-4 and GPT-4o models.

Connected Topics

Prompt Engineering — ReAct and Plan-and-Execute are both prompt engineering strategies applied at the agent level. Techniques like chain-of-thought, few-shot examples, and role specification all carry over directly to agent prompts.

RAG and Retrieval — retrieval is one of the most common agent tools. Understanding how vector search works and how to structure retrieval queries makes the external memory tier of an agent significantly more reliable.

Fine-tuning and RLHF — models that are fine-tuned specifically on tool use and instruction following perform substantially better as agent backbones than base models. Instruction-following fine-tuning teaches the model the discipline of producing well-formed tool calls and stopping when the task is complete.
