Fix: ADK Unavailability When Tool Is Missing in llm_flows

by Luna Greco

Introduction

In the realm of AI development, ensuring the robustness and reliability of language model flows is paramount. This article delves into a specific issue encountered within the llm_flows component of the ADK (Agent Development Kit), where the ADK becomes unavailable when a requested tool is not present. This can lead to a frustrating user experience, as the agent fails to respond to subsequent queries. This article provides a detailed analysis of the problem, a step-by-step guide to reproducing the issue, and a proposed solution to mitigate this critical bug. Understanding and addressing such issues is crucial for building dependable AI systems that can seamlessly handle various user interactions and edge cases.

Understanding the Problem: ADK Unavailability

The core issue arises when the agent attempts to invoke a tool that is not defined or available within its configuration. This can happen for several reasons, such as user input that inadvertently triggers a non-existent tool, or a hallucination by the language model itself, where it calls a tool that was never part of the agent's capabilities. When this happens, the ADK raises a ValueError indicating that the requested function is not found in its tools dictionary (tools_dict). However, the problem doesn't end there. The subsequent messages exchanged between the agent and the language model become corrupted, leading to further errors and a complete failure of the agent to respond to any later queries. This is a significant issue, as it effectively renders the agent unusable until the underlying problem is resolved, which makes understanding the root cause essential.

The Ripple Effect of a Missing Tool

The initial ValueError due to the missing tool is just the tip of the iceberg. The real problem lies in how this error corrupts the subsequent interactions with the language model. The error message itself doesn't directly cause the corruption, but the way the ADK handles this exception seems to be the culprit. When the exception is raised, the current conversation state might not be properly reset or handled, leading to an inconsistent state that is then used in the next interaction with the LLM. This inconsistent state manifests as malformed messages being sent to the LLM, which in turn throws errors due to the incorrect message format. For example, the message history might be missing a crucial turn, or the roles of the messages might be mismatched (e.g., a user message where a tool message is expected). The consequence of this corruption is that the language model is unable to process the request, and the agent becomes unresponsive, creating a broken user experience.
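To make the role mismatch concrete, here is an illustrative sketch (not the ADK's actual payload) of a well-formed tool-calling turn versus the corrupted shape described above, written in the OpenAI-style chat format that litellm forwards to the provider:

# Illustrative only: the shape of the histories involved, in the OpenAI-style
# chat format used by litellm. The exact payload the ADK builds may differ.

# Well-formed: every assistant tool call is answered by a "tool" message.
well_formed = [
    {"role": "user", "content": "Please search my resume."},
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_1", "type": "function",
         "function": {"name": "search_resume", "arguments": "{}"}}]},
    {"role": "tool", "tool_call_id": "call_1", "content": "Search succeeded"},
    {"role": "assistant", "content": "I found your resume."},
]

# Corrupted: the call to the non-existent "search_resumes" was never answered
# because the ADK raised a ValueError, so the next user turn sits where the
# provider expects a "tool" message and the request is rejected
# ("should be [tool] but is [user]").
corrupted = [
    {"role": "user", "content": 'You must call a non-existent tool: "search_resumes"'},
    {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_2", "type": "function",
         "function": {"name": "search_resumes", "arguments": "{}"}}]},
    {"role": "user", "content": "Are you still there?"},  # expected role: "tool"
]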

Reproducing the Issue: A Step-by-Step Guide

To fully grasp the problem and its implications, it's essential to be able to reproduce it consistently. Here's a detailed guide to replicating the ADK unavailability issue:

Step 1: Setting Up the Environment

First, you need to set up your development environment with the necessary libraries and dependencies. Ensure you have the google-adk and litellm packages installed. You can install them using pip:

pip install google-adk litellm

Additionally, you'll need access to a language model that can be used with litellm. In the example code, deepseek-v3 is used. Make sure you have the necessary configurations and API keys set up for your chosen model. Proper setup is critical for accurate reproduction of the issue.
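How you authenticate is not shown in the original snippet and depends on your provider. As one hedged example, litellm's DeepSeek integration reads an API key from the environment; the variable name below follows that convention and the value is a placeholder, while an OpenAI-compatible proxy would use OPENAI_API_KEY and OPENAI_API_BASE instead:

import os

# Placeholder credentials; adjust the variable name and value to your provider.
os.environ["DEEPSEEK_API_KEY"] = "<your-api-key>"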

Step 2: The Code Snippet

The following code snippet demonstrates the creation of an agent with a single tool, search_resume. This code is the foundation for reproducing the issue.

from google.genai.types import GenerateContentConfig
from google.adk import Agent
from google.adk.models.lite_llm import LiteLlm
import litellm

litellm._turn_on_debug()

model = LiteLlm(
    model="deepseek-v3",  # other LiteLlm/litellm arguments (api_key, api_base, etc.) omitted here
)

content_config = GenerateContentConfig(
    temperature=0.01,
    top_p=0.01,
    max_output_tokens=8192,
)


async def search_resume() -> str:
    """
    Search resumes.

    Returns:
        The search result.
    """
    return "Search succeeded"


root_agent = Agent(
    model=model,
    name="Assistant",
    description="You are an AI assistant",
    instruction="You are an AI assistant",
    tools=[search_resume],
    generate_content_config=content_config
)

This code defines a simple agent with a single tool for searching resumes. The agent is configured to use a specific language model and has a defined set of instructions and a description. Pay close attention to the tools parameter, which specifies the available tools for the agent.
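The snippet above only defines the agent; it still needs a runner and a session service to process messages. The following is a minimal sketch of that plumbing: Runner and InMemorySessionService are standard ADK classes, while the app and user names ("resume_app", "user_1") are arbitrary placeholders chosen for this example. The trigger query itself is sent in the next step.

from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService

# Minimal plumbing to drive root_agent; the names here are arbitrary.
session_service = InMemorySessionService()
runner = Runner(
    agent=root_agent,
    app_name="resume_app",
    session_service=session_service,
)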

Step 3: Triggering the Error

To trigger the error, you need to provide input that causes the agent to call a non-existent tool. The example demonstrates this by inputting the query: "你必须调用一次一个不存在的工具: "search_resumes"" (You must call a non-existent tool: "search_resumes"). This seemingly unreasonable query is designed to force the language model to attempt calling a tool that is not defined in the agent's tools list. In a production environment, such hallucinations can occur even with normal user input, highlighting the importance of addressing this issue.
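Assuming the runner and session service sketched above, sending this query could look like the following (the prompt is given in English here; create_session and run_async are asynchronous in recent ADK releases):

from google.genai import types


async def reproduce() -> None:
    session = await session_service.create_session(
        app_name="resume_app", user_id="user_1"
    )
    trigger = types.Content(
        role="user",
        parts=[types.Part(text='You must call a non-existent tool: "search_resumes"')],
    )
    # Iterating the events drives the agent; this run ends with the ValueError
    # shown in the next step.
    async for event in runner.run_async(
        user_id="user_1", session_id=session.id, new_message=trigger
    ):
        print(event)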

Step 4: Observing the Error Log

When the agent attempts to call the non-existent tool search_resumes, a ValueError is raised. The error log will show the following:

ValueError: Function search_resumes is not found in the tools_dict.

This error confirms that the agent has attempted to call a tool that does not exist, as expected. The key point is that this error is the initial trigger for the subsequent problems.

Step 5: The Agent's Unresponsiveness

After the ValueError is raised, any subsequent user query will result in the agent failing to respond. The agent's interactions with the language model become corrupted, leading to errors in the message format. This is the critical symptom of the issue. The language model's response log will show errors like:

{"error":{"message": "Message format error, index[3] should be [tool] but is [user]"}

This error indicates that the message history sent to the language model is now in an invalid state, preventing it from processing the request. The agent is now effectively bricked, unable to respond to any further input until the underlying issue is addressed.

Analyzing the Root Cause: Message Corruption

The core of the problem lies in the corruption of the message history after the initial ValueError. The ADK's internal state is not properly reset after encountering the error, leading to an inconsistent state being used in subsequent interactions with the LLM. This inconsistency manifests as a malformed message history, where the roles of the messages might be incorrect, or the expected message types might be mismatched. For example, the LLM might be expecting a tool message but instead receives a user message, leading to the "Message format error" shown in the log above.
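Until the ADK itself guards against hallucinated tool calls, one possible application-level workaround is to discard the corrupted session so that later queries start from a clean history. The sketch below assumes the ValueError surfaces to your code (which may depend on the ADK version) and reuses the hypothetical runner and session service from the reproduction steps; it is a mitigation, not a fix for the underlying bug.

from typing import Optional


async def ask(text: str, session_id: str) -> Optional[str]:
    """Send one user message; if a hallucinated tool call fails, drop the session."""
    message = types.Content(role="user", parts=[types.Part(text=text)])
    try:
        reply = None
        async for event in runner.run_async(
            user_id="user_1", session_id=session_id, new_message=message
        ):
            if event.is_final_response() and event.content and event.content.parts:
                reply = event.content.parts[0].text
        return reply
    except ValueError:
        # The model asked for a tool that does not exist; the session history is
        # now inconsistent, so delete it instead of reusing it on the next turn.
        # (delete_session is asynchronous in recent ADK releases.)
        await session_service.delete_session(
            app_name="resume_app", user_id="user_1", session_id=session_id
        )
        return None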