fix(live): prevent zombie WebSocket session after LiveRequestQueue.close()#5226
TonyLee-AI wants to merge 2 commits into google:main
Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). View this failed invocation of the CLA check for more information. For the most up to date status, view the checks section at the bottom of the pull request.
Response from ADK Triaging Agent
Hello @TonyLee-AI, thank you for your contribution! It looks like you haven't signed the Contributor License Agreement (CLA) yet. Please make sure to sign it so we can proceed with reviewing your PR. Thanks!
…ose()

Calling `LiveRequestQueue.close()` is the documented way to shut down a live session from the client side. However, `run_live()`'s `while True:` reconnect loop had no awareness of this intentional shutdown: when the resulting `APIError(1000)` / `ConnectionClosed` event arrived it would either reconnect (if a session-resumption handle was present) or raise a spurious error (if no handle was present), in both cases creating a long-lived zombie WebSocket connection that Gemini eventually terminates after ~2 hours with a 1006 error.

Fix
---

* Add `is_closed: bool` property to `LiveRequestQueue` backed by a simple boolean flag that is set synchronously in `close()` *before* the sentinel is enqueued. The synchronous flag avoids any asyncio scheduling race: by the time any connection-close exception reaches `run_live()`'s handlers, the flag is already True.
* In `run_live()`, check `live_request_queue.is_closed` in both the `ConnectionClosed` and `APIError(1000)` exception handlers. When the queue is closed, log an info message and `return` instead of reconnecting or raising. A trailing guard at the bottom of the loop body covers the (less common) case where the receive generator returns normally without raising.

Behaviour after this fix
------------------------

| Scenario                                 | Before       | After                  |
|------------------------------------------|--------------|------------------------|
| `close()` called, no session handle      | raises error | terminates cleanly     |
| `close()` called, session handle present | reconnects   | terminates cleanly     |
| Network drop, session handle present     | reconnects   | reconnects (unchanged) |
| Network drop, no session handle          | raises       | raises (unchanged)     |

Tests
-----

* `test_is_closed_initially_false`: property starts False
* `test_is_closed_true_after_close`: property becomes True after close()
* `test_is_closed_not_affected_by_other_sends`: other sends don't set it
* `test_run_live_no_reconnect_after_queue_close_api_error_1000`: APIError(1000) after close() → terminates, connect called once
* `test_run_live_no_reconnect_after_queue_close_connection_closed`: same for ConnectionClosed variant
* `test_run_live_still_reconnects_on_unintentional_drop_with_handle`: genuine network drop without close() still reconnects (regression guard)
6c277e7 to b2024e2
Response from ADK Triaging Agent
Hello @TonyLee-AI, thank you for your contribution! I see this PR is a bug fix. As per our contribution guidelines, could you please create a GitHub issue for this bug and associate it with this PR? If an issue already exists, please link it. This helps us with tracking. Thanks!
Fixes #5228
Problem
Calling `LiveRequestQueue.close()` is the documented way to shut down a live session from the client side. However, `run_live()`'s `while True:` reconnect loop had no awareness of this intentional shutdown.

When `close()` is called, `_send_to_model` sends a WebSocket close frame (code 1000). `_receive_from_model` then surfaces this as either an `APIError(1000)` or a `ConnectionClosed` event. The loop's exception handlers responded in one of two unintended ways:
| Scenario | Behaviour before this fix |
|---|---|
| `close()` called, with session-resumption handle | reconnects, creating a zombie session |
| `close()` called, no session-resumption handle | raises a spurious error |

The zombie session stays open until Google's server forces a timeout after ~2 hours, at which point a `1006 abnormal closure` error is logged to the application, long after the user's call has ended.
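The failure mode can be illustrated with a toy model of the reconnect loop. This is a sketch, not ADK code: `ToyClose`, `toy_run_live_before_fix`, and `max_connects` are hypothetical names, and the exception merely stands in for the `APIError(1000)` / `ConnectionClosed` described above. It shows why a handler that looks only at the resumption handle cannot stop on an intentional close.

```python
import asyncio


class ToyClose(Exception):
    """Hypothetical stand-in for a code-1000 close surfaced by the receive side."""


async def toy_run_live_before_fix(has_handle: bool, max_connects: int = 3) -> int:
    """Toy pre-fix loop: the handler only consults the resumption handle.

    Returns the number of connections opened. `max_connects` exists only so
    the demo terminates; it is an assumption of this sketch, not a real bound.
    """
    connects = 0
    while connects < max_connects:
        connects += 1  # a new WebSocket session would be opened here
        try:
            # The client has already closed its queue, so the session ends
            # with a normal-closure event.
            raise ToyClose("1000 normal closure")
        except ToyClose:
            if has_handle:
                continue  # treated like a resumable drop: reconnect
            raise  # no handle: surfaced to the caller as an error
    return connects
```

With `has_handle=True` the toy loop keeps opening connections until the artificial cap is reached; with `has_handle=False` the `ToyClose` escapes to the caller even though the shutdown was requested.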
Root cause
`run_live()`'s exception handlers could not distinguish between:
* an intentional shutdown triggered by `LiveRequestQueue.close()` (code 1000)
* a genuine network drop

Both the `ConnectionClosed` and `APIError(1000)` paths lacked this distinction.
Fix
`live_request_queue.py`: add an `is_closed` property backed by a synchronous flag. Setting the flag synchronously before any `await` guarantees it is already `True` by the time any connection-close exception propagates back to `run_live()`'s handlers, so there is no asyncio scheduling race.

`base_llm_flow.py`: check `is_closed` before the reconnect logic in both exception handlers. A trailing guard at the bottom of the loop body covers the edge case where the receive generator exhausts without raising.
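The two changes can be sketched together in a small, self-contained model. Everything here is hypothetical and simplified (`ToyQueue`, `ToyConnectionClosed`, `toy_run_live_with_guard`, and the `sessions` list of callables are not ADK names); the sketch also assumes that a session ending without an exception would otherwise loop again. It is meant to show the ordering in `close()` and where the `is_closed` checks sit, not the actual patch.

```python
import asyncio
from typing import Callable, List


class ToyQueue:
    """Hypothetical stand-in for LiveRequestQueue; only close() is modelled."""

    def __init__(self) -> None:
        self._queue: asyncio.Queue = asyncio.Queue()
        self._closed = False

    @property
    def is_closed(self) -> bool:
        return self._closed

    def close(self) -> None:
        # Set the flag first, with no await in between, so any handler that
        # runs afterwards already observes is_closed == True.
        self._closed = True
        self._queue.put_nowait(None)  # None plays the role of the close sentinel


class ToyConnectionClosed(Exception):
    """Hypothetical stand-in for ConnectionClosed / APIError(1000)."""


async def toy_run_live_with_guard(
    queue: ToyQueue, sessions: List[Callable[[], None]], has_handle: bool
) -> int:
    """Toy reconnect loop with the is_closed guard; returns connections opened.

    Each callable in `sessions` simulates one connection attempt and either
    raises ToyConnectionClosed or returns normally.
    """
    connects = 0
    while True:
        session = sessions[connects]
        connects += 1
        try:
            session()
        except ToyConnectionClosed:
            if queue.is_closed:
                return connects  # intentional shutdown: stop, do not reconnect
            if has_handle:
                continue  # genuine drop with a handle: reconnect as before
            raise  # genuine drop without a handle: raise as before
        if queue.is_closed:
            return connects  # trailing guard: session ended without raising
```

In this model, a session that calls `queue.close()` and then raises ends the loop after one connection regardless of `has_handle`, while a drop without `close()` still reconnects or raises depending on the handle.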
Behaviour after this fix
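| Scenario | Before | After |
|---|---|---|
| `close()` called, no session handle | raises error | terminates cleanly |
| `close()` called, session handle present | reconnects | terminates cleanly |
| Network drop, session handle present | reconnects | reconnects (unchanged) |
| Network drop, no session handle | raises | raises (unchanged) |

Logs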
Before fix
Intentional close, session handle present → zombie reconnect:
Intentional close, no session handle → spurious error:
After fix
Intentional close (APIError 1000 path) — terminates cleanly, no reconnect:
Intentional close (ConnectionClosed path) — terminates cleanly, no reconnect:
Genuine network drop with handle — reconnection still works (regression guard):
Testing plan
`is_closed` property tests in `tests/unittests/agents/test_live_request_queue.py`:
* `test_is_closed_initially_false`: property starts as `False`
* `test_is_closed_true_after_close`: property becomes `True` after `close()`
* `test_is_closed_not_affected_by_other_sends`: other sends do not set the flag

`run_live` zombie session tests in `tests/unittests/flows/llm_flows/test_base_llm_flow.py`:
* `test_run_live_no_reconnect_after_queue_close_api_error_1000`: `APIError(1000)` after `close()` terminates without reconnecting
* `test_run_live_no_reconnect_after_queue_close_connection_closed`: `ConnectionClosed` after `close()` terminates without reconnecting
* `test_run_live_still_reconnects_on_unintentional_drop_with_handle`: genuine network drop (no `close()`) still reconnects (regression guard)

Real-world impact: Google Cloud Support case
This bug was independently reported to Google Cloud Support (January 2026) by a user building a voice agent on Vertex AI. The observed symptom was:
The ~10-minute interval matches Gemini Live's server-side idle timeout: the zombie session is terminated by the server, `while True:` immediately reconnects, and the cycle repeats forever.

Google Cloud Support confirmed (Pratik, Google Cloud Support):

The support response recommended using the `async with` pattern and referenced Discussion #4156 as a community-identified workaround. This PR provides the proper fix at the framework level so that `LiveRequestQueue.close()` behaves as documented, terminating the session cleanly without requiring callers to manage task cancellation externally.

Related