Ever since AI agents showed promise, organizations have had to grapple with whether a single agent is enough, or whether they should invest in building out a wider multi-agent network that touches more points across the organization.
Orchestration framework company LangChain sought to get closer to an answer to this question. It subjected an AI agent to several experiments and found that single agents can handle only so much context and so many tools before their performance begins to degrade. These experiments could lead to a better understanding of the architecture needed to maintain agents and multi-agent systems.
In a blog post, LangChain detailed a set of experiments it performed with a single ReAct agent and benchmarked its performance. The main question LangChain hoped to answer was, “At what point does a single ReAct agent become overloaded with instructions and tools, and subsequently sees performance drop?”
LangChain chose to use the ReAct agent framework because it is “one of the most basic agentic architectures.”
Because benchmarking agentic performance can often lead to misleading results, LangChain limited the test to two easily quantifiable agent tasks: answering questions and scheduling meetings.
“There are many existing benchmarks for tool-use and tool-calling, but for the purposes of this experiment, we wanted to evaluate a practical agent that we actually use,” LangChain wrote. “This agent is our internal email assistant, which is responsible for two main domains of work — responding to and scheduling meeting requests and supporting customers with their questions.”
Parameters of LangChain’s experiment
LangChain mainly used pre-built ReAct agents through its LangGraph platform. These agents featured tool-calling large language models (LLMs) that became part of the benchmark test. These LLMs included Anthropic’s Claude 3.5 Sonnet, Meta’s Llama-3.3-70B and a trio of models from OpenAI: GPT-4o, o1 and o3-mini.
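For readers unfamiliar with the setup, a minimal sketch of what a pre-built ReAct agent in LangGraph looks like is below. The tools and model string are hypothetical stand-ins, not the internal email assistant LangChain actually tested.

```python
# A minimal sketch of a LangGraph pre-built ReAct agent with a tool-calling LLM.
# The tools and model identifier here are illustrative assumptions, not
# LangChain's internal email-assistant setup.
from langchain_anthropic import ChatAnthropic
from langchain_core.tools import tool
from langgraph.prebuilt import create_react_agent


@tool
def check_calendar(date: str) -> str:
    """Hypothetical tool: return free meeting slots for the given date."""
    return f"Free slots on {date}: 10:00, 14:30"


@tool
def search_docs(query: str) -> str:
    """Hypothetical tool: look up an answer in internal documentation."""
    return f"No matching document found for: {query}"


model = ChatAnthropic(model="claude-3-5-sonnet-20241022")
agent = create_react_agent(model, tools=[check_calendar, search_docs])

# The agent decides which tools to call, in what order, to satisfy the request.
result = agent.invoke(
    {"messages": [("user", "Can you schedule a meeting for 2025-03-01?")]}
)
print(result["messages"][-1].content)
```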
The company broke testing down to better assess the email assistant’s performance on the two tasks, creating a list of steps for it to follow. It began with the email assistant’s customer support capabilities, which look at how the agent accepts an email from a client and responds with an answer.
LangChain first evaluated the tool-calling trajectory, or the sequence of tools an agent taps. If the agent followed the correct order, it passed the test. Next, researchers asked the as …
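LangChain’s exact evaluation harness isn’t reproduced in its post, but a trajectory check of this kind can be as simple as comparing the tools the agent actually called against an expected sequence. The sketch below assumes hypothetical tool names.

```python
# Hedged sketch of a tool-calling trajectory check: the run passes only if the
# agent invoked the expected tools in the expected order. Tool names below are
# hypothetical, not taken from LangChain's email assistant.
def trajectory_matches(expected: list[str], actual: list[str]) -> bool:
    """Return True when the agent called exactly the expected tools, in order."""
    return actual == expected


expected_trajectory = ["search_docs", "draft_reply"]
actual_trajectory = ["search_docs", "draft_reply"]  # extracted from the agent's message history
assert trajectory_matches(expected_trajectory, actual_trajectory)
```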