🔴 Required Information
Describe the Bug:
When using LiteLlm in streaming mode (StreamingMode.SSE), if the model's response is truncated due to reaching max_output_tokens (or the model's natural max length), and the model was attempting to produce a tool call, the entire response is silently dropped — no LlmResponse is yielded, no error is raised, and the ADK event stream simply ends with no output.
This happens because the streaming aggregation logic in LiteLlm.generate_content_async() (lite_llm.py lines 1955-2005) only handles finish_reason == "tool_calls" or finish_reason == "stop" when deciding whether to yield the aggregated response. When finish_reason == "length" (which LiteLLM returns when MAX_TOKENS is hit), the accumulated function_calls dict is never yielded and is silently discarded.
Relevant code (v1.24.1, lite_llm.py ~line 1955):
```python
if (
    finish_reason == "tool_calls" or finish_reason == "stop"
) and function_calls:
    # ... builds and yields aggregated_llm_response_with_tool_call
elif finish_reason == "stop" and (text or reasoning_parts):
    # ... builds and yields aggregated_llm_response
```
For pure text responses this is less critical because text chunks are already yielded incrementally via _message_to_generate_content_response(..., is_partial=True). But for tool calls, chunks are only accumulated into the function_calls dict during streaming and are yielded as a single aggregated response at the end — so when finish_reason == "length", the aggregated tool call response is never emitted.
Steps to Reproduce:
- Install the latest `google-adk`
- Create an agent with a tool and set a very low `max_output_tokens` (e.g., `10`) to force truncation during tool call generation
- Run a query that triggers a tool call in streaming mode (`StreamingMode.SSE`)
- Observe that the agent produces no output — the event stream ends silently
Expected Behavior:
When the model's output is truncated due to max tokens, ADK should either:
- Raise an error or yield an `LlmResponse` with `finish_reason=MAX_TOKENS` so the caller knows the response was truncated, OR
- Yield the partial tool call data accumulated so far with the appropriate `finish_reason`, allowing the framework/caller to handle the truncation gracefully (e.g., retry with higher limits)
At a minimum, a non-silent failure is expected — either an exception or a response event indicating truncation occurred.
Observed Behavior:
- The `function_calls` dict accumulates partial tool call data during streaming
- When `finish_reason == "length"` arrives, no branch in the if/elif handles it, so `aggregated_llm_response_with_tool_call` is never assigned
- The method exits the `async for` loop and falls through to the yield guards, which find `None` values
- Result: zero events are yielded for the complete response, and the ADK runner produces no output
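The fall-through can be reproduced in isolation. The sketch below is not the ADK source, only a minimal mirror of the quoted if/elif structure, showing that a `length` finish with pending tool calls yields nothing:

```python
# Minimal mirror of the quoted if/elif structure (an illustration, not the
# actual lite_llm.py code): decide whether an aggregated response is emitted.
def aggregate(finish_reason, function_calls, text=""):
    if (finish_reason == "tool_calls" or finish_reason == "stop") and function_calls:
        return {"tool_calls": function_calls, "finish_reason": finish_reason}
    elif finish_reason == "stop" and text:
        return {"text": text, "finish_reason": finish_reason}
    # finish_reason == "length" with pending tool calls falls through to here
    return None

# A tool call truncated mid-JSON, as accumulated during streaming:
partial = {0: {"name": "add", "args": '{"a": 3, "b'}}
assert aggregate("tool_calls", partial) is not None  # normal case: yielded
assert aggregate("length", partial) is None          # truncation: silently dropped
```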
Environment Details:
- ADK Library Version: 1.24.1
- Desktop OS: Linux
- Python Version: 3.12.11

Model Information:
- Are you using LiteLLM: Yes
- LiteLLM Version: 1.79.3
- Which model is being used: Azure OpenAI (e.g., `azure/<deployment-id>`, gpt-4o / gpt-4.1 / o4-mini class models)
🟡 Optional Information
Minimal Reproduction Code:
```python
import asyncio
import os

from dotenv import load_dotenv
from google.adk.agents import Agent
from google.adk.agents.run_config import RunConfig, StreamingMode
from google.adk.models.lite_llm import LiteLlm
from google.adk.runners import Runner
from google.adk.sessions import InMemorySessionService
from google.genai import types
from google.genai.types import GenerateContentConfig

load_dotenv()


def get_litellm_model() -> LiteLlm:
    deployment_id = os.getenv("AZURE_OPENAI_DEPLOYMENT_ID")
    return LiteLlm(
        model=f"azure/{deployment_id}",
        stream=True,
    )


async def add(a: float, b: float) -> float:
    return a + b


def create_simple_agent() -> Agent:
    model = get_litellm_model()
    instructions = """
    You are a helpful mathematical assistant with access to a calculator tool.
    """
    agent = Agent(
        name="Claudia",
        model=model,
        generate_content_config=GenerateContentConfig(
            temperature=0.0, max_output_tokens=10
        ),
        instruction=instructions,
        description="A simple agent that can perform basic mathematical calculations",
        tools=[add],
    )
    return agent


async def run_agent_single_query(query: str):
    agent = create_simple_agent()
    session_service = InMemorySessionService()
    runner = Runner(
        agent=agent,
        app_name="SimpleADKApp",
        session_service=session_service,
        auto_create_session=True,
    )
    user_id = "user_123"
    session_id = "session_002"
    content = types.Content(role="user", parts=[types.Part(text=query)])
    response_text = ""
    async for event in runner.run_async(
        user_id=user_id,
        session_id=session_id,
        new_message=content,
        run_config=RunConfig(
            streaming_mode=StreamingMode.SSE,
            response_modalities=["TEXT"],
        ),
    ):
        if event.content and event.content.parts:
            response_text = event.content.parts[0].text or ""
    print(f"Agent Response:\n{response_text}\n")
    return response_text


async def main():
    await run_agent_single_query("Ciao! 3 + 9")


if __name__ == "__main__":
    asyncio.run(main())
```
How often has this issue occurred?:
- Always (100%) - when `max_output_tokens` is set low enough to truncate during tool call generation.
Proposed Fix
In generate_content_async() (streaming branch), add an explicit check for finish_reason == "length" when tool calls have been partially accumulated. Since truncated tool calls contain incomplete JSON arguments that cannot be reliably executed, the appropriate behavior is to raise an error rather than silently dropping the response:
```python
elif finish_reason == "stop" and (text or reasoning_parts):
    message_content = text if text else None
    aggregated_llm_response = _message_to_generate_content_response(
        # ... existing code ...
    )
    aggregated_llm_response.finish_reason = _map_finish_reason(
        finish_reason
    )
    text = ""
    reasoning_parts = []
elif finish_reason == "length" and function_calls:
    raise ValueError(
        "LLM response was truncated due to max output token limit "
        "while generating a tool call. The partial tool call data "
        "cannot be executed. Consider increasing `max_output_tokens` "
        "in your GenerateContentConfig, or reducing the number/complexity "
        "of available tools to allow the model to complete its response."
    )
```
Thanks @notTyche for spotting this one