LLM agents extend the capabilities of pre-trained language models by integrating tools like Retrieval-Augmented Generation (RAG), short-term and long-term memory, and external APIs to enhance reasoning and decision-making.
The efficiency of an LLM agent depends on selecting the right LLM. While a small self-hosted model might not be powerful enough to handle complex problems, relying on powerful third-party LLM APIs can be expensive and increase latency.
Efficient inference strategies, robust guardrails, and bias detection mechanisms are key components of successful and reliable LLM agents.
Capturing user interactions and refining prompts with few-shot examples helps LLMs adapt to evolving language and user preferences.
Large Language Models (LLMs) perform exceptionally well on various Natural Language Processing (NLP) tasks, such as text summarization, question answering, and code generation. However, these capabilities do not extend to domain-specific tasks.
A foundational model’s “knowledge” can only be as good as its training dataset. For example, GPT-3 was trained on a web crawl dataset that included data collected up to 2019. Therefore, the model does not contain information about later events or developments.
Likewise, GPT-3 cannot “know” any information that is unavailable on the open internet or not contained in the books on which it was trained. This results in curtailed performance when GPT-3 is used on a company’s proprietary data, compared to its abilities on general knowledge tasks.
There are two ways to address this issue. The first is to fine-tune the pre-trained model with domain-specific data, encoding the information in the model’s weights. Fine-tuning requires curating a dataset and is usually resource-intensive and time-consuming.
The second option is to provide the required additional information to the model during inference. One straightforward way is to create a prompt template containing the information. However, when it is not known upfront which information might be required to generate the correct response, or when solving a task involves multiple steps, we need a more sophisticated approach.
So, what is an LLM agent?
LLM agents are systems that harness LLMs’ reasoning capabilities to respond to queries, fulfill tasks, or make decisions. For example, consider a customer query: “What are the best smartwatch options for fitness tracking and heart rate monitoring under $150?” Finding an appropriate response requires knowledge of the available products, their reviews and ratings, and their current prices. It’s infeasible to include this information in an LLM’s training data or in the prompt.
An LLM agent solves this task by tapping an LLM to plan and execute a series of actions:
- Access online shops and/or price aggregators to gather information about available smartwatch models with the desired capabilities under $150.
- Retrieve and analyze product reviews for the relevant models, potentially by running generated software code.
- Compile a list of suitable options, potentially refined by considering the user’s purchase history.
By completing this series of actions in order, the LLM agent can provide a tailored, well-informed, and up-to-date response.
LLM agents can go far beyond a simple sequence of prompts. By tapping the LLM’s comprehension and reasoning abilities, agents can devise new strategies for solving a task and determine or adjust the required next steps ad-hoc. In this article, we’ll introduce the fundamental building blocks of LLM agents and then walk through the process of building an LLM agent step by step.
After reading the article, you’ll know:
- How LLM agents extend the capabilities of large language models by integrating reasoning, planning, and external tools.
- How LLM agents work: their components, including memory (short-term and long-term), planning mechanisms, and action execution.
- How to build an LLM agent from scratch: We’ll cover framework selection, memory integration, tool setup, and inference optimization step by step.
- How to optimize an LLM agent by applying techniques like Retrieval-Augmented Generation (RAG), quantization, distillation, and tensor parallelization to improve efficiency and reduce costs.
- How to address common development challenges such as scalability, security, hallucinations, and bias.
How do LLM agents work?
LLM agents came onto the scene with the NLP breakthroughs fueled by transformer models. Over time, the following blueprint for LLM agents has emerged: First, the agent determines the sequence of actions it needs to take to fulfill the request. Using the LLM’s reasoning abilities, actions are selected from a predefined set created by the developer. To perform these actions, the agent may utilize a set of so-called “tools,” such as querying a knowledge repository or storing a piece of information in a memory component. Finally, the agent uses the LLM to generate the response.
Before we dive into creating our own LLM agent, let’s take an in-depth look at the components and abilities involved.

How do LLMs guide agents?
The LLM serves as the “brain” of the LLM agent, making decisions and acting on the situation at hand to solve the given task. It is responsible for creating an execution plan, determining the series of actions, making sure the agent sticks to its assigned role, and ensuring that actions do not deviate from the given task.
LLMs can select and parameterize actions from a predefined action space without direct human intervention. They are capable of processing complex natural language tasks and have demonstrated strong abilities in structured inference and planning.
How do LLM agents plan their actions?
Planning is the process of figuring out future actions that the LLM agent needs to execute to solve a given task.
Actions could occur in a pre-defined sequence, or future actions could be determined based on the results of previous actions. The LLM has to break down complex tasks into smaller ones and decide which action to take by identifying and evaluating possible options.
For example, consider a user requesting the agent to “Create a trip plan for a visit to the Grand Canyon next month.” To solve this task, the LLM agent has to execute a series of actions such as the following:
- Fetch the weather forecast for “Grand Canyon” next month.
- Research accommodation options near “Grand Canyon.”
- Research transportation and logistics.
- Identify points of interest and list must-see attractions at the “Grand Canyon.”
- Assess the requirement for any advance booking for activities.
- Determine what kinds of outfits are suitable for the trip, search in a fashion retail catalog, and recommend outfits.
- Compile all information and synthesize a well-organized itinerary for the trip.
The LLM is responsible for creating a plan like this based on the given task. There are two categories of planning strategies:
- Static Planning: The LLM constructs a plan at the beginning of the agentic workflow, which the agent follows without any changes. The plan could be a single-path sequence of actions or consist of multiple paths represented in a hierarchy or a tree-like structure.
- ReWOO (Reasoning WithOut Observation) is a popular technique for single-path planning. Instead of interleaving reasoning steps with tool calls and their observations, the LLM drafts the complete reasoning plan upfront, leaving placeholders for the evidence that tools will later provide. Worker modules then execute the tools, and a solver combines the plan with the collected evidence into the final answer. Decoupling reasoning from observations reduces the number of LLM calls and yields structured, interpretable plans, which makes ReWOO particularly effective for tasks where a step-by-step breakdown is known upfront.
- Chain of Thought with Self-Consistency is a multi-path static planning strategy. First, the LLM is queried with prompts created using a chain-of-thought prompting strategy. Then, instead of greedily selecting the single most likely reasoning path, a “sample-and-marginalize” procedure generates a diverse set of reasoning paths. Each reasoning path might lead to a different answer, and the most consistent answer is selected by majority voting over the final answers.
- Tree of Thoughts is another popular multi-path static planning strategy. It uses Breadth-First-Search (BFS) and Depth-First-Search (DFS) algorithms to systematically determine the optimal path. It allows the LLM to perform deliberate decision-making by considering multiple reasoning paths and self-evaluating paths to decide the next course of action, as well as looking forward and backward to make global decisions.
- Dynamic Planning: The LLM creates an initial plan, executes an initial set of actions, and observes the outcome to decide the next set of actions. In contrast to static planning, where the LLM generates a static plan at the beginning of the agentic workflow, dynamic planning requires multiple calls to the LLM to iteratively update the plan based on feedback from the previously taken actions.
- Self-Refinement generates an initial plan, executes it, collects feedback from the LLM on the outcome, and refines the plan based on this self-provided feedback. The process iterates between feedback and refinement until a desired criterion is met.
- ReACT combines reasoning and acting to solve diverse reasoning and decision-making tasks. In the ReACT framework, the LLM agent takes an action based on an initial thought and observes the feedback from the environment after executing this action. It then generates the next thought based on the observation (a minimal loop sketch follows this list).
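To make the ReACT loop concrete, here is a minimal, framework-free sketch. The llm, parse_action, and run_tool helpers are hypothetical stand-ins for an actual model call, a parser for the proposed action, and tool execution; later in this article, AutoGen automates this loop for us.

# Minimal ReACT-style control loop (illustrative sketch only).
# `llm`, `parse_action`, and `run_tool` are hypothetical helpers, not real library calls.
def react_loop(question: str, max_steps: int = 5) -> str:
    transcript = f"Question: {question}\n"
    for _ in range(max_steps):
        # Ask the model for its next thought (and possibly an action) given the transcript so far.
        step = llm(transcript + "Thought:")
        transcript += f"Thought: {step}\n"
        if "Final Answer:" in step:
            return step.split("Final Answer:")[-1].strip()
        # Execute the proposed action and feed the observation back into the prompt.
        action, action_input = parse_action(step)
        observation = run_tool(action, action_input)
        transcript += f"Observation: {observation}\n"
    return "No final answer produced within the step limit."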
Why is memory so important for LLM agents?
Adding memory to an LLM agent improves its consistency, accuracy, and reliability. The use of memory in LLM agents is inspired by how humans remember events of the past to learn methods to deal with the current situation. A memory could be a structured database, a store for natural language, or a vector index that stores embeddings. A memory stores information about plans and actions generated by the LLM, responses to a query, or external knowledge.
In a conversational framework, where the LLM agent executes a series of tasks to answer a query, it must remember contexts from previous actions. Similarly, when a user interacts with the LLM agent, they may ask a series of follow-up queries in one session. As an example, one of the follow-up questions after “Create a trip plan for a visit to the Grand Canyon next month” is “recommend a hotel for the trip.” To answer this question, the LLM Agent needs to know the past queries in the session to understand the question about a hotel for the previously planned trip to the Grand Canyon.
A simple form of memory is to store the history of queries in a queue and consider a fixed number of the most recent queries when answering the current query. As the conversation becomes longer, the chat context consumes increasingly more tokens in the input prompt. Hence, to accommodate large context, a summary of the historic chat is often stored and retrieved from memory.
There are two types of memory in an LLM agent:
- Short-term memory stores immediate context, such as a retrieved weather report or past questions from the current session, and makes the relevant context available through in-context learning. It's used to improve the accuracy of the LLM agent's responses to the given task.
- Long-term memory stores historical conversations, plans, and actions, as well as external knowledge that can be retrieved through search and retrieval algorithms. It also stores self-reflections to offer consistency for future actions.
One of the most popular implementations of memory is a vector store, where information is indexed in the form of embeddings, and approximate nearest neighbor algorithms are used to retrieve the most relevant information using embedding similarity methods like cosine similarity. A memory could also be implemented as a database with the LLM generating SQL queries to retrieve the desired contextual information.
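As a quick illustration of the vector-store approach, the snippet below indexes a few past interactions in ChromaDB (which we also use later in this tutorial) and retrieves the most similar one for a new query by embedding similarity. The collection name and example texts are made up for this sketch.

import chromadb

client = chromadb.Client()  # in-memory client; use PersistentClient for durable storage
memory = client.get_or_create_collection(name="agent_memory_demo")

# Store past interactions; ChromaDB embeds the documents with its default embedding model.
memory.add(
    documents=["User planned a trip to the Grand Canyon in November.",
               "User asked for budget hotels near the South Rim."],
    ids=["mem-1", "mem-2"],
)

# Retrieve the most relevant memory for a follow-up question via embedding similarity.
results = memory.query(query_texts=["Which hotel should I book for my trip?"], n_results=1)
print(results["documents"])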
What about the tools in LLM agents?
Tools and actions enable an LLM agent to interact with external systems. While LLMs excel at understanding and generating text, they cannot perform tasks like retrieving data or executing actions.
Tools are predefined functions that LLM agents can use to perform actions. Common examples of tools are the following:
- API calls are essential for integrating real-time data. When an LLM agent encounters a query that requires external information (like the latest weather data or financial reports), it can fetch accurate, up-to-date details from an API. As an example, a tool could be a supporting function that fetches real-time weather data from OpenWeatherMap or another weather API.
- Code execution enables an LLM agent to carry out tasks like calculations, file operations, or script executions. The LLM generates code, which is then executed, and the output is returned to the LLM as part of the next prompt. A simple example is a Python function that converts temperature values from Fahrenheit to Celsius (a sketch follows this list).
- Plot generation allows an LLM agent to create graphs or visual reports when users need more than just text-based responses.
- RAG (Retrieval-Augmented Generation) helps the agent access and incorporate relevant external documents into its responses, improving the depth and accuracy of the generated content.
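To illustrate what a tool looks like in code, here is a minimal version of the temperature-conversion example mentioned above. The type annotations double as descriptions that help the LLM decide when and how to call the tool; we'll register tools in this style with AutoGen later in the article.

from typing import Annotated

def fahrenheit_to_celsius(
    fahrenheit: Annotated[float, "Temperature in degrees Fahrenheit"]
) -> Annotated[float, "Temperature in degrees Celsius"]:
    # Standard conversion: subtract 32, then scale by 5/9.
    return (fahrenheit - 32) * 5 / 9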
Building an LLM agent from scratch
In the following, we’ll build a trip-planning LLM agent from scratch. The agent’s goal is to assist the user in planning a vacation by recommending accommodations and outfits and addressing the need for advance booking for activities like hiking.
Automating trip planning is not straightforward. A human would search the web for accommodation, transport, and outfits and iteratively make choices by looking into hotel reviews, feedback in social media comments, or experiences shared by bloggers. Similarly, the LLM agent has to collect information from the external world to recommend an itinerary.
Our trip planning LLM agent will consist of two separate agents internally:
- The planning agent will use a ReACT-based strategy to plan the necessary steps.
- The research agent will have access to various tools for fetching weather data, searching the web, scraping web content, and retrieving information from a RAG system.
We will use Microsoft’s AutoGen framework to implement our LLM agent. The open-source framework offers a low-code environment to quickly build conversational LLM agents with a rich selection of tools. We’ll utilize Azure OpenAI to host our agent’s LLM privately. While AutoGen itself is free to use, deploying the agent with Azure OpenAI incurs costs based on model usage, API calls, and computational resources required for hosting.
💡 You can find the complete source code on GitHub
Step 0: Setting up the environment
Let’s set up the necessary environment, dependencies, and cloud resources for this project.
- Install Python 3.9. Check your current Python version with:
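python --version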
If you need to install or switch to Python 3.9, download it from python.org or use pyenv or uv if managing multiple versions.
- Create a virtual environment to manage the dependencies:
python -m venv autogen_env
source autogen_env/bin/activate
- Once inside the virtual environment, install the required dependencies:
pip install autogen==0.3.1 \
    openai==1.44.0 \
    "chromadb<=0.5.0" \
    markdownify==0.13.1 \
    ipython==8.18.1 \
    pypdf==5.0.1 \
    psycopg-binary==3.2.3 \
    psycopg-pool==3.2.3 \
    sentence_transformers==3.3.0 \
    python-dotenv==1.0.1 \
    geopy==2.4.1
- Set up an Azure account and set up the Azure OpenAI service:
- Navigate to Azure OpenAI service and log in (or sign up).
- Create a new OpenAI resource and a Bing Search resource under your Azure subscription.
- Deploy a model (e.g., GPT-4 or GPT-3.5-turbo).
- Note your OpenAI and Bing Search API keys, endpoint URL, deployment name, and API version.
- Configure the environment variables. To use your Azure OpenAI credentials securely, store them in a .env text file:
OPENAI_API_KEY=
OPENAI_ENDPOINT=https://<your-resource-name>.openai.azure.com
OPENAI_DEPLOYMENT_NAME=
OPENAI_API_VERSION=
BING_API_KEY=
- Next, import all the dependencies that will be used throughout the project:
import os
from autogen.agentchat.contrib.web_surfer import WebSurferAgent
from autogen.coding.func_with_reqs import with_requirements
import requests
import chromadb
from geopy.geocoders import Nominatim
from pathlib import Path
from bs4 import BeautifulSoup
from autogen.agentchat.contrib.retrieve_user_proxy_agent import RetrieveUserProxyAgent
from autogen import AssistantAgent, UserProxyAgent
from autogen import register_function
from autogen.cache import Cache
from autogen.coding import LocalCommandLineCodeExecutor, CodeBlock
from typing import Annotated, List
import typing
import logging
import autogen
from dotenv import load_dotenv, find_dotenv
import tempfile
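The code later in the article references a logger and reads the credentials from the environment, so we load the .env file and configure logging right after the imports. This is a minimal setup sketch; the original project's configuration may differ.

# Load the credentials from the .env file created above and set up a module-level logger.
load_dotenv(find_dotenv())
logging.basicConfig(level=logging.INFO)
logger = logging.getLogger(__name__)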
Step 1: Selection of the LLM
When building an LLM agent, one of the most important initial decisions is to choose the appropriate LLM model. Since the LLM serves as the central controller responsible for reasoning, planning, and orchestrating the execution of actions, the selection has to consider and balance the following criteria:
- Strong capability in reasoning and planning.
- Capability in natural language communication.
- Support for modalities beyond text input, such as image and audio support.
- Development considerations such as latency, cost, and context window.
Broadly speaking, there are two categories of LLM models we can choose from: Open-source LLMs like Falcon, Mistral, or Llama2 that we can self-host, and proprietary LLMs like OpenAI GPT-3.5-Turbo, GPT-4, GPT-4o, Google Gemini, or Anthropic Claude that are accessible via API only. Proprietary LLMs offload operations to a third party and typically include security features like filtering harmful content. Open-source LLMs require effort to serve the model but allow us to keep our data internal. We also need to set up and manage any guardrails ourselves.
Another important consideration is the context window, which is the number of tokens that an LLM can consider when generating text. When building the LLM agent, we will generate a prompt that will be used as input to the LLM to either generate a series of actions or produce a response to the request. A larger context window allows the LLM agent to execute more complex plans and consider extensive information. For example, OpenAI’s GPT-4 Turbo offers a maximum context window of 128,000 tokens. There are LLMs like Anthropic’s Claude that offer a context window of more than 200,000 tokens.
For our trip-planning LLM agent, we'll use OpenAI's GPT-4o mini, which, at the time of writing, is the most affordable model in the GPT family. It delivers excellent performance in reasoning, planning, and language understanding tasks. GPT-4o mini is available both directly from OpenAI and through Azure OpenAI, the latter being suitable for applications with regulatory concerns regarding data governance.
To use GPT-4o mini, we first need to create and deploy an Azure OpenAI resource as specified in step 0. This provides us with a deployment name, an API key, an endpoint address, and the API version. We set these as environment variables, define the LLM configuration, and load it at runtime:
config_list = [{
    "model": os.environ.get("OPENAI_DEPLOYMENT_NAME"),
    "api_key": os.environ.get("OPENAI_API_KEY"),
    "base_url": os.environ.get("OPENAI_ENDPOINT"),
    "api_version": os.environ.get("OPENAI_API_VERSION"),
    "api_type": "azure"
}]

llm_config = {
    "seed": 42,
    "config_list": config_list,
    "temperature": 0.5
}

bing_api_key = os.environ.get("BING_API_KEY")
Step 2: Adding an embedding model, a vector store, and building the RAG pipeline
Embeddings are numerical vectors that represent text in a high-dimensional vector space. In an LLM agent, embeddings can help find questions similar to historical questions in long-term memory or identify relevant examples to include in the input prompt.
In our trip-planning LLM agent, we need embeddings to identify relevant historical information. For example, if the user previously asked the agent to “Plan a trip to Philadelphia in the summer of 2025,” the LLM should consider this context when answering their follow-up question, “What are the must-visit places in Philadelphia?”. We'll also use embeddings in the Retrieval-Augmented Generation (RAG) tool to retrieve relevant context from long text documents. As the trip-planning agent searches the web and scrapes HTML content from multiple web pages, their content is split into small chunks. These chunks are stored in a vector database, which indexes data with embeddings. To find information relevant to a query, the query is embedded and used to retrieve similar chunks.
Setting up ChromaDB as the vector store
We’ll use ChromaDB as our trip-planning LLM agent’s vector store. First, we initialize ChromaDB with a persistent client (the storage path in the snippet below is an illustrative choice):
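# Persistent client so the indexed chunks survive restarts; the storage path is an illustrative choice.
chromadb_client = chromadb.PersistentClient(path="./chromadb")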
Implementing the RAG pipeline
As discussed earlier, the LLM agent might require a RAG tool to retrieve relevant sections from the web content. A RAG pipeline consists of a data ingestion block that converts raw documents from HTML, PDF, XML, or JSON format into a series of text chunks. The chunks are then converted to vectors and indexed in a vector database. During the retrieval phase, a predefined number of the most relevant chunks is retrieved from the vector database using an approximate nearest neighbor search.

We use the RetrieveUserProxyAgent to implement the RAG tool. This tool retrieves information from stored chunks. First, we set a fixed chunk length of 1000 tokens.
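The function below falls back to a temporary text file holding the scraped website content. The original listing doesn't show how this file is created; a minimal version using the tempfile module imported earlier could look like this:

# Shared temp file that the web-scraping tool writes to and the RAG tool reads from (illustrative).
temp_file = tempfile.NamedTemporaryFile(mode="w", suffix=".txt", delete=False)
temp_file_path = temp_file.name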
@with_requirements(python_packages=["typing", "requests", "autogen", "chromadb"], global_imports=["typing", "requests", "autogen", "chromadb"])
def rag_on_document(query: typing.Annotated[str, "The query to search in the index."], document: Annotated[Path, "Path to the document"]) -> str:
    logger.info(f"************ RAG on document is executed with query: {query} ************")
    # Fall back to the temp file holding the scraped web content if no document is passed.
    default_doc = temp_file_path
    doc_path = default_doc if document is None or document == "" else document
    ragproxyagent = autogen.agentchat.contrib.retrieve_user_proxy_agent.RetrieveUserProxyAgent(
        "ragproxyagent",
        human_input_mode="NEVER",
        retrieve_config={
            "task": "qa",
            "docs_path": doc_path,
            "chunk_token_size": 1000,
            "model": config_list[0]["model"],
            "client": chromadb_client,
            "collection_name": "tourist_places",
            "get_or_create": True,
            "overwrite": False
        },
        code_execution_config={"use_docker": False}
    )
    # The retrieval proxy chats with the planner agent (defined in Step 3) and returns the answer.
    res = ragproxyagent.initiate_chat(planner_agent, message=ragproxyagent.message_generator, problem=query, n_results=2, silent=True)
    return str(res.chat_history[-1]['content'])
Step 3: Implementing planning
As discussed in the earlier section, the LLM's reasoning and planning form the central controller of the LLM agent. Using AutoGen's AssistantAgent, we define a system prompt that the agent will follow throughout its interactions. This system prompt sets the rules, scope, and behavior of the agent when handling trip-planning tasks.
The AssistantAgent is instantiated with a system prompt and an LLM configuration:
planner_agent = AssistantAgent(
    "Planner_Agent",
    system_message="You are a trip planner assistant whose objective is to plan itineraries of the trip to a destination. "
                   "Use tools to fetch weather, search web using bing_search, "
                   "scrape web context for search urls using visit_website tool and "
                   "do RAG on scraped documents to find relevant section of web context to find out accommodation, "
                   "transport, outfits, adventure activities and bookings need. "
                   "Use only the tools provided, and reply TERMINATE when done. "
                   "While executing tools, print outputs and reflect exception if failed to execute a tool. "
                   "If web scraping tool is required, create a temp txt file to store scraped website contents "
                   "and use the same file for rag_on_document as input.",
    llm_config=llm_config,
    human_input_mode="NEVER"
)
By setting human_input_mode to "NEVER", we ensure that the LLM agent operates autonomously without requiring or waiting for human input during its execution. This means the agent will process tasks based solely on its predefined system prompt, without prompting the user for additional inputs.
When initiating the chat, we use a ReACT-based prompt that guides the LLM to analyze the input, take action, observe the outcome, and dynamically determine the next actions:
ReAct_prompt = """
You are a Trip Planning expert tasked with helping users make a trip itinerary.
You can analyse the query, figure out the travel destination, dates and assess the need of checking weather forecast, search accommodation, recommend outfits and suggest adventure activities like hiking, trekking opportunity and need for advance booking.
Use the following format:
Question: the input question or request
Thought: you should always think about what to do to respond to the question
Action: the action to take (if any)
Action Input: the input to the action (e.g., search query, location for weather, query for rag, url for web scraping)
Observation: the result of the action
... (this process can repeat multiple times)
Thought: I now know the final answer
Final Answer: the final answer to the original input question or request
Once you get all the answers, ask the planner agent to write code and execute to visualise the answer in a table format.
Begin!
Question: {input}
"""
def react_prompt_message(sender, recipient, context):
    return ReAct_prompt.format(input=context["question"])
Step 4: Building tools for web search, weather, and scraping
The predefined tools define the action space for the LLM agent. Now that we have planning in place, let's see how to build and register tools that allow the LLM to fetch external information.
The agents in our system follow the XxxYyyAgent naming pattern, such as RetrieveUserProxyAgent or WebSurferAgent. This convention helps maintain clarity within the LLM agent framework by differentiating agents based on their primary function. The first part of the name (Xxx) describes the high-level task the agent performs (e.g., Retrieve, Planner), while the second part (YyyAgent) signifies that it is an autonomous component managing interactions in a specific domain.
Building a code execution tool
A code execution tool enables an LLM agent to run generated code and terminate when needed. AutoGen offers the UserProxyAgent, which allows for human input and interaction in the agent-based system. When configured with a code executor such as the LocalCommandLineCodeExecutor, it can run generated code blocks and dynamically evaluate Python code.
work_dir = Path("../coding")
work_dir.mkdir(exist_ok=True)

code_executor = LocalCommandLineCodeExecutor(work_dir=work_dir)
print(
    code_executor.execute_code_blocks(
        code_blocks=[
            CodeBlock(language="python", code="print('Hello, World!');"),
        ]
    )
)

user_proxy = UserProxyAgent(
    name="user_proxy",
    is_termination_msg=lambda x: x.get("content", "") and x.get("content", "").rstrip().endswith("TERMINATE"),
    human_input_mode="NEVER",
    max_consecutive_auto_reply=10,
    code_execution_config={"executor": code_executor},
)
In this block, we define a custom termination condition: the agent checks if the message content ends with “TERMINATE” and if so, it stops further processing. This ensures that termination is signaled once the conversation is complete.
Also, to prevent infinite loops where the agent responds indefinitely, we limit the agent to 10 consecutive automatic replies before stopping (via max_consecutive_auto_reply).
Building a weather tool
To fetch the weather at the travel destination, we’ll use the Open-Meteo API:
@with_requirements(python_packages=["typing", "requests", "autogen", "chromadb"], global_imports=["typing", "requests", "autogen", "chromadb"])
def get_weather_info(destination: typing.Annotated[str, "The place of which weather information to retrieve"], start_date: typing.Annotated[str, "The date of the trip to retrieve weather data"]) -> typing.Annotated[str, "The weather data for given location"]:
    logger.info(f"************ Get weather API is executed for {destination}, {start_date} ************")
    # Coordinates of the destinations supported by this demo.
    coordinates = {"Grand Canyon": {"lat": 36.1069, "lon": -112.1129},
                   "Philadelphia": {"lat": 39.9526, "lon": -75.1652},
                   "Niagara Falls": {"lat": 43.0962, "lon": -79.0377},
                   "Goa": {"lat": 15.2993, "lon": 74.1240}}
    # Look up the destination; fall back to (None, None) if it is unknown.
    destination_coordinates = coordinates.get(destination)
    lat, lon = (destination_coordinates["lat"], destination_coordinates["lon"]) if destination_coordinates else (None, None)
    forecast_api_url = f"https://api.open-meteo.com/v1/forecast?latitude={lat}&longitude={lon}&daily=temperature_2m_max,precipitation_sum&start={start_date}&timezone=auto"
    weather_response = requests.get(forecast_api_url)
    weather_data = weather_response.json()
    return str(weather_data)
The function get_weather_info is designed to fetch weather data for a given destination and start date using the Open-Meteo API. It starts with the @with_requirements decorator, which ensures that necessary Python packages—like typing, requests, autogen, and chromadb—are installed before running the function.
typing.Annotated is used to describe both the input parameters and the return type. For instance, destination: typing.Annotated[str, “The place of which weather information to retrieve”] doesn’t just say that destination is a string but also provides a description of what it represents. This is particularly useful in workflows like this one, where descriptions can help guide LLMs to use the function correctly.
Building a web search tool
We’ll create our trip-planning agent’s web search tool using the Bing Web Search API, which requires the API key we obtained in Step 0.
Let’s look at the full code first before going through it step by step:
@with_requirements(python_packages=["typing", "requests", "autogen", "chromadb"], global_imports=["typing", "requests", "autogen", "chromadb"])
def bing_search(query: typing.Annotated[str, "The input query to search"]) -> Annotated[str, "The search results"]:
    web_surfer = WebSurferAgent(
        "bing_search",
        system_message="You are a Bing Web surfer Agent for travel planning.",
        llm_config=llm_config,
        summarizer_llm_config=llm_config,
        browser_config={"viewport_size": 4096, "bing_api_key": bing_api_key}
    )
    # Give the web surfer access to the scraping tool so it can follow up on search results.
    register_function(
        visit_website,
        caller=web_surfer,
        executor=user_proxy,
        name="visit_website",
        description="This tool is to scrape content of website using a list of urls and store the website content into a text file that can be used for rag_on_document"
    )
    search_result = user_proxy.initiate_chat(web_surfer, message=query, summary_method="reflection_with_llm", max_turns=2)
    return str(search_result.summary)
First, we define a function bing_search that takes a query and returns search results.
Inside the function, we create a WebSurferAgent named bing_search, which is responsible for searching the web using Bing. It’s configured with a system message that tells it its job is to find relevant websites for travel planning. The agent also uses bing_api_key to access Bing’s API.
Next, we initiate a chat between the user_proxy and the web_surfer agent. This lets the agent interact with Bing, retrieve the results, and summarize them using “reflection_with_llm”.
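The visit_website tool registered above is not shown in the listings in this article. Based on its description (scrape a list of URLs and store the content in a text file that rag_on_document can index), a minimal sketch could look like the following; the article's repository may implement it differently, for example with markdownify instead of BeautifulSoup, and temp_file_path is the shared temporary file assumed in Step 2.

@with_requirements(python_packages=["typing", "requests", "bs4"], global_imports=["typing", "requests"])
def visit_website(urls: typing.Annotated[List[str], "List of urls to scrape"]) -> Annotated[str, "Path to the text file with the scraped website content"]:
    logger.info(f"************ Visit website is executed for {urls} ************")
    contents = []
    for url in urls:
        response = requests.get(url, timeout=30)
        # Keep only the visible text of the page.
        soup = BeautifulSoup(response.text, "html.parser")
        contents.append(soup.get_text(separator="\n", strip=True))
    # Store the scraped content in the shared temp file so rag_on_document can index it.
    with open(temp_file_path, "w") as f:
        f.write("\n\n".join(contents))
    return temp_file_path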
Step 5: Registering functions as tools
For the LLM agent to be able to use the tools, we have to register them. Let’s see how:
register_function(
    get_weather_info,
    caller=planner_agent,
    executor=user_proxy,
    name="get_weather_info",
    description="This tool fetches weather data from an open-source API"
)

register_function(
    rag_on_document,
    caller=planner_agent,
    executor=user_proxy,
    name="rag_on_document",
    description="This tool fetches relevant information from a document"
)

register_function(
    bing_search,
    caller=planner_agent,
    executor=user_proxy,
    name="bing_search",
    description="This tool searches a query on the web and returns the results."
)

register_function(
    visit_website,
    caller=planner_agent,
    executor=user_proxy,
    name="visit_website",
    description="This tool scrapes the content of websites from a list of urls and stores the website content in a text file that can be used for rag_on_document"
)
Step 6: Adding memory
LLMs are stateless, meaning they don’t keep track of previous prompts and outputs. To build an LLM agent, we must add memory to make it stateful.
Our trip-planning LLM agent utilizes two kinds of memory: one to keep track of the conversation (short-term memory) and one to store prompts and responses in a searchable form (long-term memory).
We use LangChain's ConversationBufferWindowMemory to implement the short-term memory, keeping only the most recent interactions:
from langchain.memory import ConversationBufferWindowMemory  # requires the `langchain` package (not in the Step 0 install list)

# Keep the five most recent interactions as the short-term conversational memory.
memory = ConversationBufferWindowMemory(memory_key="chat_history", k=5, return_messages=True)

memory.chat_memory.add_user_message("Plan a trip to Grand Canyon next month on 16 Nov 2024, I will stay for 5 nights")
memory.chat_memory.add_ai_message(
    """Final Answer: Here is your trip itinerary for the Grand Canyon from 16 November 2024 for 5 nights:
### Weather:
- Temperatures range from approximately 16.9°C to 19.8°C.
- Minimal precipitation expected.
... """
)
We’ll add the content of the short-term memory to each prompt by retrieving the last five interactions from memory, appending them to the user’s new query, and then sending it to the model.
While short-term memory is very useful for remembering immediate context, it quickly grows beyond the context window. Even if the context window limit is not exhausted, a history that is too long adds noise, and the LLM might struggle to determine the relevant parts of the context.
To overcome this issue, we also need long-term memory, which acts as a semantic memory store. In this memory, we store answers to questions in a log of conversations over time and retrieve similar ones.
At this point, we could go further and add a long-term memory store; a minimal sketch follows the list below. For example, LangChain's VectorStoreRetrieverMemory enables long-term memory by:
- Storing the conversation history as embeddings in a vector database.
- Retrieving similar past queries using semantic similarity search instead of direct recall.
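Here is a minimal sketch of such a long-term memory, using LangChain's VectorStoreRetrieverMemory with a Chroma vector store. It assumes the langchain and langchain-community packages are installed; the embedding model, collection name, and storage path are illustrative choices, not part of the article's project.

from langchain.memory import VectorStoreRetrieverMemory
from langchain_community.vectorstores import Chroma
from langchain_community.embeddings import HuggingFaceEmbeddings

# Embed past exchanges and store them in a persistent Chroma collection.
embeddings = HuggingFaceEmbeddings(model_name="sentence-transformers/all-MiniLM-L6-v2")
vectorstore = Chroma(
    collection_name="trip_agent_long_term_memory",
    embedding_function=embeddings,
    persist_directory="./long_term_memory",
)

# Retrieve the three most semantically similar past exchanges for each new query.
long_term_memory = VectorStoreRetrieverMemory(
    retriever=vectorstore.as_retriever(search_kwargs={"k": 3})
)

long_term_memory.save_context(
    {"input": "Plan a trip to the Grand Canyon next month"},
    {"output": "Here is your itinerary ..."},
)
print(long_term_memory.load_memory_variables(
    {"prompt": "Which hotel did you recommend for the Grand Canyon trip?"}
))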
Step 7: Putting it all together
Now, we are finally able to use our agent to plan trips! Let’s try planning a trip to the Grand Canyon with the following instructions: “Plan a trip to the Grand Canyon next month starting on the 16th. I will stay for 5 nights”.
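The kickoff call might look like the following, passing the question into the ReACT template via the react_prompt_message generator from Step 3. This is a sketch following AutoGen's ReAct pattern; the exact call in the article's repository may differ slightly.

# Cache LLM calls on disk so repeated runs of the same query are cheap and reproducible.
with Cache.disk(cache_seed=42) as cache:
    user_proxy.initiate_chat(
        planner_agent,
        message=react_prompt_message,
        question="Plan a trip to the Grand Canyon next month starting on the 16th. I will stay for 5 nights",
        cache=cache,
    )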
After the prompt is set up and the question is sent, the agent reveals its internal thought process, identifying that it needs to gather weather, accommodation, outfit, and activity information.

Next, the agent fetches the weather forecast for the specified dates by calling get_weather_info with the destination and the start date. This is repeated for all the external information the planner agent needs: it calls bing_search to retrieve accommodation options near the Grand Canyon, as well as outfit and activity suggestions for the trip.

Finally, the agent compiles all the gathered information into a final itinerary in a table, similar to this one:

What are the challenges and limitations of developing AI agents?
Building and deploying LLM agents comes with challenges around performance, usability, and scalability. Developers must address issues like handling inaccurate responses, managing memory efficiently, reducing latency, and ensuring security.
Computational constraints
If we run an LLM in-house, inference consumes enormous computational resources. It requires hardware like GPUs or TPUs, resulting in high energy costs and financial burdens. At the same time, using API-based LLMs like OpenAI GPT-3.5-Turbo, GPT-4, GPT-4o, Google Gemini, or Anthropic Claude incurs costs proportional to the number of input and output tokens. So, when building an LLM agent, the developer aims to minimize both the number of LLM calls and the number of tokens per call.
LLMs, especially those with a large number of model parameters, may encounter latency issues during real-time interactions. To ensure a smooth user experience, an agent should be able to produce responses quickly. However, generating high-quality text on the fly from a large model can cause delays, especially when processing complex queries that necessitate multiple rounds of calls to the LLM.
Hallucinations
LLMs sometimes generate factually incorrect responses, known as hallucinations. This occurs because LLMs do not truly understand the information they generate; they rely on patterns learned from data. As a result, they may produce incorrect information, which can lead to critical errors, especially in sensitive domains like healthcare. The LLM agent architecture must ensure the model has access to the relevant context required to answer the questions, thus reducing the risk of hallucinations.
Memory
An LLM agent leverages long-term and short-term memory to store past conversations. During an ongoing conversation, similar questions are retrieved to learn from past answers. While this sounds simple, retrieving the relevant context from the memory is not straightforward. Developers face challenges such as:
- Noise in memory retrieval: Irrelevant or unrelated past responses may be retrieved, leading to incorrect or misleading answers.
- Scalability issues: As memory grows, searching through a large conversation history efficiently can become computationally expensive.
- Balancing memory size vs. performance: Storing too much history can slow down response time, while storing too little can lead to loss of relevant context.
Guardrails and content filtering
LLM agents are vulnerable to prompt injection attacks, where malicious inputs trick the model into generating unintended outputs. For example, a user could manipulate a chatbot into leaking sensitive information by crafting deceptive prompts.
Guardrails address this by employing input sanitization, blocking suspicious phrases, and setting limits on query structures to prevent misuse. Additionally, security-focused guardrails protect the system from being exploited to generate harmful content, spam, or misinformation, ensuring the agent behaves reliably even in adversarial scenarios. Content filtering suppresses inappropriate outputs, such as offensive language, misinformation, or biased responses.
Bias and fairness in the response
LLMs inherently reflect the biases present in their training data as they learn the encoded patterns, structures, and priorities. However, not all biases are harmful. For example, Grammarly is intentionally biased toward grammatically correct and well-structured sentences. This bias enhances its usefulness as a writing assistant rather than making it unfair.
In the middle, neutral biases may not actively harm users but can skew model behavior. For instance, an LLM trained on predominantly Western literature may overrepresent certain cultural perspectives, limiting diversity in the answers.
On the other end, harmful biases reinforce social inequities, such as a recruitment model favoring male candidates due to biased historical hiring data. These biases require intervention through techniques like data balancing, ethical fine-tuning, and continuous monitoring.
Enhancing LLM agent performance
While architecting an LLM agent, keep in mind opportunities to improve its performance. The following aspects are key:
Feedback loops and learnings from usage
Adding a feedback loop in the design will help capture the user’s feedback. For example, incorporating a binary feedback system (e.g., a like/dislike button or a thumbs-up/down rating) enables the collection of labeled examples. This feedback can be used to identify patterns in user dissatisfaction and fine-tune response generation. Further, storing feedback as structured examples (e.g., a user’s disliked response vs. an ideal response) can improve retrieval accuracy.
Adapting to the evolving language and usage
As with any other machine learning model, domain adaptation and continuous training are essential for adapting to emerging trends and the evolution of language. However, fine-tuning an LLM on new datasets is expensive and impractical for frequent updates.
Instead, consider collecting positive and negative examples based on the latest trends and using them as few-shot examples in the prompt to let the LLM adapt to the evolving language, as sketched below.
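A minimal sketch of this idea, prepending a handful of curated examples to the agent's system prompt; the example texts and variable names here are made up for illustration.

# Few-shot examples reflecting recent language and user preferences, refreshed without fine-tuning.
few_shot_examples = [
    {"query": "Plan a workation near Lisbon",
     "good_response": "Suggested a 7-day itinerary with coworking-friendly cafés and day trips."},
    {"query": "Find a pet-friendly stay in Denver",
     "good_response": "Recommended hotels that explicitly allow dogs and listed their pet fees."},
]

examples_text = "\n".join(
    f"Example query: {ex['query']}\nGood response: {ex['good_response']}"
    for ex in few_shot_examples
)

# Prepend the examples to the system prompt so the LLM mimics their style and level of detail.
system_prompt = (
    "You are a trip planner assistant.\n"
    "Follow the style and level of detail of these recent, well-received examples:\n"
    + examples_text
)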
Scaling and optimization
Another dimension of performance optimization is improving the inference pipeline. LLM inference latency is one of the biggest bottlenecks when deploying at scale. Some key techniques include:
- Quantization: Reducing model precision to improve inference speed and memory footprint with minimal accuracy loss (a short loading sketch follows this list).
- Distillation: Instead of using a very large and slow LLM for every request, we can train a smaller, faster model to mimic the behavior of the large model. This process transfers knowledge from the bigger model to the smaller one, allowing it to generate similar responses while running much more efficiently.
- Tensor parallelization: Distributing model computations across multiple GPUs or TPUs to speed up processing.
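As an example of the first technique, a self-hosted open-source model can be loaded with 4-bit quantization using Hugging Face Transformers and bitsandbytes. This is a sketch, not part of the article's project; it assumes the transformers, accelerate, and bitsandbytes packages are installed, a GPU is available, and the model ID shown is just an illustrative choice.

from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "mistralai/Mistral-7B-Instruct-v0.2"  # illustrative open-source model

# Load the weights in 4-bit NF4 precision to cut GPU memory use and speed up inference.
quantization_config = BitsAndBytesConfig(load_in_4bit=True, bnb_4bit_quant_type="nf4")
tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id,
    quantization_config=quantization_config,
    device_map="auto",
)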
Further ideas to explore
Great, you’ve built your first LLM agent!
Now, let’s recap a bit: In this guide, we’ve walked through the process of designing and deploying an LLM agent step by step. Along the way, we’ve discussed selecting the right LLM model and memory architecture and integrating Retrieval-Augmented Generation (RAG), external tools, and optimization techniques.
If you want to take a step further, here are a couple of ideas to explore: