AI Agents — Definition · Patterns · Frameworks · Autonomy

The word "agent" is overloaded, which causes confusion: a scripted automation gets called an agent, and so does an LLM that calls tools.

1. About agents

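The standard AI textbook (Russell · Norvig, "Artificial Intelligence: A Modern Approach") defines an agent as anything that perceives its environment through sensors and acts on it through actuators. In the LLM era the term is used more narrowly, for an LLM that decides its own next step inside a loop.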

The general shape of an LLM agent:

Observation → Reasoning → Action → Observation → ...

Observation — User input · tool result · environment signal.
Reasoning — The LLM decides the next action.
Action — Tool call · code execution · message send.

The loop usually terminates when the model produces a "final answer" or a step limit is reached.
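The loop and its two exit conditions can be sketched in a few lines. This is a minimal illustration, not a reference implementation: `scripted_policy` stands in for the LLM, and the `TOOLS` registry and `run_agent` are hypothetical names chosen for this example.

```python
from typing import Callable

# Hypothetical tool registry: tool name -> function taking one string argument.
TOOLS: dict[str, Callable[[str], str]] = {
    "search": lambda query: "1378.5" if "KRW" in query else "no result",
}

def scripted_policy(observations: list[str]) -> tuple[str, str]:
    """Stand-in for the LLM: returns ("final", answer) or (tool_name, argument)."""
    if len(observations) == 1:  # only the user input has been observed so far
        return "search", "USD KRW today"
    return "final", f"Today's rate is {observations[-1]} KRW."

def run_agent(user_input: str, policy, max_steps: int = 5) -> str:
    observations = [user_input]                    # Observation
    for _ in range(max_steps):
        kind, payload = policy(observations)       # Reasoning
        if kind == "final":
            return payload                         # exit 1: final answer
        observations.append(TOOLS[kind](payload))  # Action, then a new Observation
    return "stopped: step limit reached"           # exit 2: step cap
```

Calling `run_agent("What is the USD/KRW rate?", scripted_policy)` runs one tool call and then returns the final answer; a policy that never says "final" is cut off by `max_steps`.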

2. Agent vs chatbot

Chatbot — One conversation turn is the unit of input and output. Even with tools, calls are usually simple: one or two steps.
Agent — A worker that chains multi-step tool calls · self-evaluation · re-planning.

The boundary is gradual, and real systems mix the two shapes.

3. ReAct and related patterns

Yao et al. (2022). A pattern that places "Thought → Action → Observation" explicitly inside the model output. It weaves tool use and reasoning into a single flow.

Thought: The user is asking about the exchange rate.
Action: search("USD KRW today")
Observation: 1378.5
Thought: Compose the answer.
Final Answer: Today's exchange rate is 1,378.5 KRW.
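Because the pattern lives in plain text, the application has to parse each model output to decide whether to run a tool or stop. The sketch below assumes the exact line format of the trace above (`Action: name("argument")` and `Final Answer: ...`); the function name `parse_step` and the regular expressions are illustrative, and real outputs usually need more tolerant parsing.

```python
import re

# Patterns tied to the trace format shown above; both are assumptions of this sketch.
ACTION_RE = re.compile(r'^Action:\s*(\w+)\("(.*)"\)\s*$', re.MULTILINE)
FINAL_RE = re.compile(r"^Final Answer:\s*(.+)$", re.MULTILINE)

def parse_step(model_output: str):
    """Return ("final", answer) or ("action", tool_name, argument)."""
    final = FINAL_RE.search(model_output)
    if final:
        return ("final", final.group(1).strip())
    action = ACTION_RE.search(model_output)
    if action:
        return ("action", action.group(1), action.group(2))
    raise ValueError("no Action or Final Answer line found")

def format_observation(result: str) -> str:
    """Text appended to the prompt after the application has run the tool."""
    return f"Observation: {result}"
```

The application alternates between `parse_step` on the model's text and `format_observation` on the tool result until a final answer appears.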

Reflexion (Shinn et al. 2023) — Try → evaluate the result → self-critique → retry. After failure, leave natural-language feedback to apply on the next try.
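The try, evaluate, critique, retry cycle can be sketched as a loop that carries feedback forward. This is a simplified illustration and not the paper's implementation: `attempt` and `evaluate` are hypothetical callables standing in for LLM calls, and here the evaluator itself returns the critique text.

```python
from typing import Callable

def reflexion_loop(
    attempt: Callable[[str, list[str]], str],
    evaluate: Callable[[str], tuple[bool, str]],
    task: str,
    max_tries: int = 3,
) -> tuple[str, list[str]]:
    reflections: list[str] = []  # natural-language feedback carried across tries
    answer = ""
    for _ in range(max_tries):
        answer = attempt(task, reflections)  # try, with earlier feedback visible
        ok, critique = evaluate(answer)      # evaluate the result
        if ok:
            break
        reflections.append(critique)         # keep the critique for the next try
    return answer, reflections

# Toy stand-ins so the loop can run without a model.
def toy_attempt(task: str, reflections: list[str]) -> str:
    return "4" if reflections else "5"

def toy_evaluate(answer: str) -> tuple[bool, str]:
    return answer == "4", f"'{answer}' is wrong; recheck the arithmetic."
```

With the toy functions, the first try fails, one reflection is stored, and the second try succeeds.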

Plan-and-Execute — Split the task into a step-by-step plan first, and execute each step separately. The plan can be updated mid-run (the shape of LangChain Plan-and-Execute · BabyAGI's task list).
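A rough sketch of the plan-first shape, assuming a planner that returns a list of step strings. The functions `plan`, `execute`, and `replan` are hypothetical stand-ins for LLM or tool calls; they are not the API of LangChain or BabyAGI.

```python
from collections import deque

def plan(task: str) -> list[str]:
    """Stand-in planner: a real one would ask the LLM for the steps."""
    return [f"research: {task}", f"draft: {task}", f"review: {task}"]

def execute(step: str) -> str:
    """Stand-in executor for a single step."""
    return f"done({step})"

def replan(remaining: list[str], last_result: str) -> list[str]:
    """Hook for mid-run plan updates; this toy version leaves the plan unchanged."""
    return remaining

def plan_and_execute(task: str, max_steps: int = 10) -> list[str]:
    queue = deque(plan(task))
    results: list[str] = []
    while queue and len(results) < max_steps:
        step = queue.popleft()
        results.append(execute(step))                    # run one step on its own
        queue = deque(replan(list(queue), results[-1]))  # the plan may change here
    return results
```

The `max_steps` cap matters once `replan` is allowed to add steps, since the queue could otherwise grow without bound.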

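Self-Critique · Self-Refine (Madaan et al. 2023) — The model critiques its own output and revises it, repeating until the feedback is satisfactory or an iteration limit is reached. Often combined with chain-of-thought (CoT) prompting.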

4. Tool use · Function calling

Tool calls are two-stage:

The model sees the tool signature and outputs the call arguments (JSON).
The caller (application) executes the actual function and returns the result to the model.

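OpenAI · Anthropic · Google · Mistral each define their own tool-call format. The formats are similar (a tool name plus JSON Schema style parameters) but not identical, so SDKs and frameworks convert to each provider's shape. It's common for tool calls and text to appear in the same response.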
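The two stages can be sketched without any provider SDK. The call shape below (`name` plus a JSON string in `arguments`) is illustrative only and does not match any specific provider's wire format; `get_weather` and `REGISTRY` are hypothetical.

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    """Hypothetical tool; returns canned data instead of calling a real API."""
    return {"city": city, "temperature": 21, "unit": unit}

REGISTRY = {"get_weather": get_weather}

# Stage 1 (model side): the model emits a tool name and JSON arguments.
model_tool_call = {"name": "get_weather", "arguments": '{"city": "Seoul"}'}

def execute_tool_call(call: dict) -> str:
    """Stage 2 (application side): look up the function, run it, serialize the result."""
    func = REGISTRY.get(call["name"])
    if func is None:
        return json.dumps({"error": f"unknown tool: {call['name']}"})
    try:
        args = json.loads(call["arguments"])
        return json.dumps(func(**args))
    except (json.JSONDecodeError, TypeError) as exc:
        # Malformed JSON or wrong argument names go back to the model as an error.
        return json.dumps({"error": str(exc)})
```

Returning errors as tool results, rather than raising, gives the model a chance to correct its arguments on the next step.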

5. Memory

Memory preserves information beyond a single model call. It is split into short-term (conversation history within a session) · long-term (external storage that persists across sessions). Long-term memory is usually written to a vector DB · KV store · relational DB, then retrieved · summarized as needed and put back into context.
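A toy version of the short-term and long-term split is sketched below. The class `AgentMemory` is hypothetical, and word overlap is used as a stand-in for the embedding search a vector DB would provide.

```python
from collections import deque

class AgentMemory:
    def __init__(self, short_term_turns: int = 4):
        self.short_term = deque(maxlen=short_term_turns)  # recent conversation turns
        self.long_term: list[str] = []                    # stand-in for external storage

    def add_turn(self, text: str) -> None:
        self.short_term.append(text)  # the oldest turn drops out when full

    def remember(self, fact: str) -> None:
        self.long_term.append(fact)

    def retrieve(self, query: str, k: int = 2) -> list[str]:
        """Rank stored facts by word overlap with the query (toy substitute for embeddings)."""
        words = set(query.lower().split())
        scored = [(len(words & set(fact.lower().split())), fact) for fact in self.long_term]
        scored = [item for item in scored if item[0] > 0]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [fact for _, fact in scored[:k]]

    def build_context(self, query: str) -> str:
        """Put recalled facts and recent turns back into one context string."""
        return "\n".join(["[recalled]"] + self.retrieve(query) + ["[recent]"] + list(self.short_term))
```

The short-term buffer is bounded on purpose: anything that must survive longer has to be written to long-term storage explicitly.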

6. Multi-agent frameworks

Framework	Origin · time	Trait
LangChain · LangGraph	LangChain Inc., 2022~	Most popular. LangGraph is a state graph.
AutoGen	Microsoft Research, 2023	Multi-agent conversation. Major redesign in 0.4.
CrewAI	João Moura, 2024	Role-centered collaboration.
Semantic Kernel	Microsoft, 2023	C# · Python · Java. Plugin model.
LlamaIndex	Jerry Liu, 2022~	Document indexing · RAG-centric. Includes an agent module.
Haystack	deepset, 2020~	Search · NLP pipeline.
Smolagents	Hugging Face, 2024	Small framework centered on code agents.
Google ADK	Google, 2025	Vertex AI integration · multi-agent.

7. Single agent vs multi-agent

A single agent is simple, but its context window fills fast and responsibilities blur. A multi-agent setup gains responsibility separation · parallelism · context isolation, but message design · debugging get harder.

Benefits:

Context savings — Each sub-agent focuses on its own work.
Parallelism — Independent tasks run concurrently.
Role separation — Search · writing · review divided.

Costs:

Coordination cost — Designing message structure · termination conditions.
Error propagation — One agent's mistake spreads to other agents.
Debugging — Tracing what went wrong where.
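The benefits above (parallelism, role separation, context isolation) can be illustrated with a small orchestrator. This is a sketch under stated assumptions: `sub_agent` is a hypothetical stand-in for a full agent loop, and threads are used only because real sub-agents mostly wait on network calls.

```python
from concurrent.futures import ThreadPoolExecutor

def sub_agent(role: str, task: str) -> dict:
    """Stand-in for a sub-agent; each call builds its own private context."""
    context = [f"role={role}", f"task={task}"]  # never shared with other agents
    return {"role": role, "summary": f"{role} finished '{task}'", "context_size": len(context)}

def orchestrate(assignments: dict[str, str]) -> list[str]:
    """Run independent role -> task assignments concurrently and collect summaries."""
    with ThreadPoolExecutor(max_workers=max(1, len(assignments))) as pool:
        futures = {role: pool.submit(sub_agent, role, task) for role, task in assignments.items()}
        # Only the short summaries flow back, which keeps the orchestrator's context small.
        return [futures[role].result()["summary"] for role in assignments]
```

The costs show up in the same place: the summary format is the message design, and a wrong summary from one role is passed on unchecked unless the orchestrator validates it.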

8. Levels of autonomy

Level	Description	Example
Suggestion	Suggests only to the human	IDE autocomplete.
Approval	Approval before action	Payment · file deletion approval gate.
Autonomous (limited)	Free within set tools · scope	Mail classification · labeling.
Autonomous (open)	Free to add tools · external calls	Research helper · script writing.

As autonomy rises, the impact of security incidents · billing · mistakes grows, so least privilege · tool whitelists · budget caps should be applied together.
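One way to make the four levels explicit in code is a single decision function that every tool call passes through. The enum `Autonomy`, the function `decide`, and its return strings are hypothetical names for this sketch.

```python
from enum import Enum

class Autonomy(Enum):
    SUGGESTION = 1  # suggests only to the human
    APPROVAL = 2    # approval before action
    LIMITED = 3     # free within set tools
    OPEN = 4        # free to add tools and make external calls

def decide(level: Autonomy, tool: str, allowed_tools: set[str], approved: bool = False) -> str:
    """Map an intended tool call to "suggest", "wait_for_approval", "execute", or "deny"."""
    if level is Autonomy.SUGGESTION:
        return "suggest"
    if level is Autonomy.APPROVAL:
        return "execute" if approved else "wait_for_approval"
    if level is Autonomy.LIMITED:
        return "execute" if tool in allowed_tools else "deny"
    return "execute"  # OPEN: rely on budget caps and sandboxing elsewhere
```

Keeping the level as data rather than scattering checks through the code makes it easier to lower autonomy for one deployment without touching the agent logic.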

9. Tool permission models

Read / write split — Reads auto-allowed, writes need approval.
Domain whitelist — Only allowed URLs · APIs.
Budget cap — Token · money · count caps.
Side-effect isolation — Code execution inside a container · VM.
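The first three models above can be combined in one policy object, sketched below. `ToolPolicy` and its return strings are hypothetical, the budget is counted in calls only, and side-effect isolation is left out because it belongs to the runtime (container or VM) rather than to a check like this.

```python
from dataclasses import dataclass
from urllib.parse import urlparse

@dataclass
class ToolPolicy:
    allowed_domains: set[str]  # domain whitelist
    max_calls: int             # budget cap, counted in calls for simplicity
    calls_made: int = 0

    def check(self, action: str, url: str, approved: bool = False) -> str:
        if self.calls_made >= self.max_calls:
            return "deny: budget exhausted"
        if urlparse(url).hostname not in self.allowed_domains:
            return "deny: domain not whitelisted"
        if action != "read" and not approved:
            return "needs approval"  # read / write split
        self.calls_made += 1         # only allowed calls consume budget
        return "allow"
```

For example, with `ToolPolicy({"api.example.com"}, max_calls=2)`, a read to the allowed host passes, a write waits for approval, and any call after the second allowed one is denied.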

10. The difficulty of evaluation

No ground truth — Tasks are open-ended and defining the correct answer is hard.

Reproducibility — Non-deterministic behavior is common even for the same input.

Meaning of benchmarks — Benchmarks like WebArena · SWE-bench · GAIA · OSWorld exist, but don't directly tie to performance in your domain.

Diverse failure modes — Wrong answer, wrong tool call, halted midway, infinite loop, environment damage — each carries a different meaning.

Evaluation flow:

Pair human review with LLM evaluation.
Build a small evaluation set from your own domain tasks.
Run regression tests on every change.
Save agent traces (tool calls · messages) for post-mortem analysis.
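A small domain evaluation set, a pass rate for regression checks, and saved traces fit in one short harness. Everything here is a toy under stated assumptions: `EVAL_SET`, `toy_agent`, and the trace record fields are hypothetical, and exact string match is used as the grader where a real setup would add human or LLM review.

```python
import json

# Hypothetical domain evaluation set: task plus expected final answer.
EVAL_SET = [
    {"task": "2+3", "expected": "5"},
    {"task": "10+7", "expected": "17"},
]

def toy_agent(task: str, trace: list[dict]) -> str:
    """Stand-in agent that records its tool call and final message in the trace."""
    left, right = task.split("+")
    trace.append({"type": "tool_call", "name": "add", "args": [left, right]})
    answer = str(int(left) + int(right))
    trace.append({"type": "final", "text": answer})
    return answer

def run_eval(agent, eval_set: list[dict]) -> dict:
    records = []
    for case in eval_set:
        trace: list[dict] = []
        answer = agent(case["task"], trace)
        records.append({"task": case["task"], "passed": answer == case["expected"], "trace": trace})
    passed = sum(record["passed"] for record in records)
    return {"pass_rate": passed / len(records), "records": records}

def to_jsonl(records: list[dict]) -> str:
    """One JSON object per line, a common shape for trace files."""
    return "\n".join(json.dumps(record) for record in records)
```

Running `run_eval` before and after a prompt or model change and comparing `pass_rate` gives a basic regression test; the JSONL output is what gets read during a post-mortem.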

11. Spots where you often get stuck

Infinite loops — Weak termination conditions cause the same tool to be called repeatedly. Force a step cap · budget cap.
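A step cap can be paired with a check for identical repeated calls. The class `LoopGuard` and its defaults are hypothetical choices for this sketch.

```python
import json
from collections import Counter

class LoopGuard:
    def __init__(self, max_steps: int = 20, max_repeats: int = 3):
        self.max_steps = max_steps
        self.max_repeats = max_repeats
        self.steps = 0
        self.seen: Counter = Counter()

    def allow(self, tool: str, args: dict) -> bool:
        """Return False once the step cap is hit or the same call repeats too often."""
        self.steps += 1
        key = (tool, json.dumps(args, sort_keys=True))  # same tool with same arguments
        self.seen[key] += 1
        return self.steps <= self.max_steps and self.seen[key] <= self.max_repeats
```

The agent loop calls `allow` before each tool execution and stops, or asks the model to change approach, when it returns False.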

Context blow-up — Tool results pile up and context fills fast. Summarization · compression · sub-agent split.
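The simplest form of compression is to keep recent tool results verbatim and shorten older ones. The message shape (`role` and `content` keys) and the function `compact_history` are assumptions of this sketch; truncation here stands in for a real summarization step.

```python
def compact_history(messages: list[dict], keep_recent: int = 2, max_chars: int = 40) -> list[dict]:
    """Truncate all but the most recent tool results; other messages pass through."""
    tool_indexes = [i for i, m in enumerate(messages) if m["role"] == "tool"]
    old = set(tool_indexes[:-keep_recent]) if keep_recent else set(tool_indexes)
    compacted = []
    for i, message in enumerate(messages):
        if i in old and len(message["content"]) > max_chars:
            # Copy instead of mutating so the full trace can still be logged.
            message = {**message, "content": message["content"][:max_chars] + " [truncated]"}
        compacted.append(message)
    return compacted
```

The original list is left untouched, so the full tool output remains available for trace logs even though the model sees the shortened version.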

Prompt injection — Instructions embedded in external materials (web pages · email) can shift agent behavior. Design a trust boundary: treat retrieved content as data, not as instructions.

Cost of mistakes — Keep irreversible actions such as file · data deletion · payment out of autonomous paths and behind an approval gate.

Tool signature drift — When tool definitions diverge from code, the model produces wrong arguments. Generate the definitions from the code automatically, or keep a single source of truth.
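One way to keep a single source is to derive the tool definition from the function itself. The sketch below uses the standard `inspect` and `typing` modules; the output shape is a generic JSON Schema style dictionary rather than any provider's exact format, and `tool_schema` and `search` are hypothetical names.

```python
import inspect
import typing

# Minimal mapping from Python annotations to JSON Schema type names.
PY_TO_JSON = {str: "string", int: "integer", float: "number", bool: "boolean"}

def tool_schema(func) -> dict:
    """Build a tool definition from a function's signature, type hints, and docstring."""
    hints = typing.get_type_hints(func)
    properties, required = {}, []
    for name, param in inspect.signature(func).parameters.items():
        properties[name] = {"type": PY_TO_JSON.get(hints.get(name), "string")}
        if param.default is inspect.Parameter.empty:
            required.append(name)  # no default value means the argument is required
    return {
        "name": func.__name__,
        "description": inspect.getdoc(func) or "",
        "parameters": {"type": "object", "properties": properties, "required": required},
    }

def search(query: str, limit: int = 5) -> list:
    """Search the hypothetical index."""
    return []
```

Because the definition is computed from `search` at startup, renaming a parameter in code changes what the model sees without a second file to update.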

LLM-change regression — The same prompt can behave differently on a different model or on a later version of the same model. Pin the model version and run regression tests.

Missing logs — Without traces, post-hoc analysis is nearly impossible.

User perception gap — The autonomy · responsibility users expect from the word "agent" can differ from actual behavior.

Closing thoughts

The definition of "agent" is fuzzy, but the core is "multi-step tool calls + self-evaluation + re-planning." As autonomy rises, least privilege · budget caps · human-approval gates must come together to keep operations safe. Because evaluation is hard, a small domain evaluation set + regression tests + trace logs become the standard tools of operation.


References: ReAct (2022) · Reflexion (2023) · Self-Refine (2023) · LangGraph · AutoGen · CrewAI · LlamaIndex Agents · GAIA · SWE-bench · WebArena.