Skip to content

fix: Improve BigQuery Agent Analytics plugin reliability and code quality#4491

Closed
caohy1988 wants to merge 5 commits intogoogle:mainfrom
caohy1988:fix/loop-state-validation
Closed

fix: Improve BigQuery Agent Analytics plugin reliability and code quality#4491
caohy1988 wants to merge 5 commits intogoogle:mainfrom
caohy1988:fix/loop-state-validation

Conversation

@caohy1988
Copy link

@caohy1988 caohy1988 commented Feb 14, 2026

This CL fixes several bugs in the BigQuery Agent Analytics plugin and refactors the internal data-passing pattern for better type safety and maintainability.

  • Stale Loop State Validation: Use loop.is_closed() — a public, reliable API — to detect and clean up stale asyncio loop states in _batch_processor_prop, _get_loop_state, and flush. The previous approach used asyncio.Queue._loop which is None on Python 3.10+, causing the check to always treat states as stale.
  • Quota Project ID Fallback: Remove the or project_id fallback when setting quota_project_id on BigQueryWriteAsyncClient. This fixes Workload Identity Federation flows where the federated identity lacks serviceusage.services.use on the quota project.
  • Kwargs Passthrough: Pass **kwargs through to _log_event in all callbacks. Previously only model callbacks forwarded them, causing custom attributes (e.g. customer_id) to silently drop for agent, tool, run, and error events.
  • State Delta Logging: Replace the dead on_state_change_callback (never invoked by the framework) with on_event_callback, which is already dispatched by the runner for every event. Remove duplicate STATE_DELTA logging from after_tool_callback.
  • EventData Dataclass: Replace the **kwargs-as-data-bus pattern in _log_event with an explicit EventData dataclass. This makes the interface self-documenting, catches typos at construction time, and eliminates shared dict mutation across _resolve_span_ids, _extract_latency, and _enrich_attributes. All 12 callback call sites now construct typed EventData instances.
  • Multi-Subagent Tool Logging Tests: Add TestMultiSubagentToolLogging (6 tests) verifying that tool events are correctly attributed to subagents in multi-turn, multi-agent scenarios. Total tests: 111 (up from 60).

@google-cla
Copy link

google-cla bot commented Feb 14, 2026

Thanks for your pull request! It looks like this may be your first contribution to a Google open source project. Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA).

View this failed invocation of the CLA check for more information.

For the most up to date status, view the checks section at the bottom of the pull request.

@adk-bot adk-bot added the services [Component] This issue is related to runtime services, e.g. sessions, memory, artifacts, etc label Feb 14, 2026
@adk-bot
Copy link
Collaborator

adk-bot commented Feb 14, 2026

Response from ADK Triaging Agent

Hello @caohy1988, thank you for creating this PR!

Before we can look at your pull request, you'll need to sign a Contributor License Agreement (CLA). You can visit https://cla.developers.google.com/ to see your current agreements or to sign a new one.

Also, this PR is a bug fix. Could you please associate a GitHub issue with this PR? If there is no existing issue, could you please create one?

In addition, could you please provide logs or a screenshot after the fix is applied? This information will help reviewers to review your PR more efficiently. Thanks!

@caohy1988 caohy1988 force-pushed the fix/loop-state-validation branch 2 times, most recently from 401fb43 to d9347d7 Compare February 14, 2026 17:49
@caohy1988 caohy1988 changed the title fix: use loop.is_closed() for stale loop state validation in BigQuery plugin fix: Improve BigQuery Agent Analytics plugin reliability and code quality Feb 14, 2026
if TYPE_CHECKING:
from ..agents.invocation_context import InvocationContext


Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove unnecessary empty line

@caohy1988 caohy1988 force-pushed the fix/loop-state-validation branch from d9347d7 to 35d6028 Compare February 14, 2026 17:53
@haiyuan-eng-google haiyuan-eng-google self-assigned this Feb 14, 2026
state_delta: dict[str, Any],
invocation_context: InvocationContext,
event: "Event",
**kwargs,
Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

can you please help me understand why have **kwargs here? ADK doesn't set addtional arguments when running the callback

Copy link
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good catch — the **kwargs was added as a forward-compatibility guard in case ADK's callback signatures evolve, but you're right that the framework currently passes only the named parameters declared in BasePlugin. Removed **kwargs from all 12 callback signatures to match BasePlugin exactly. Also fixed on_model_error_callback to include the llm_request parameter that the base class declares.

…lity

This CL fixes several bugs in the BigQuery Agent Analytics plugin and refactors the internal data-passing pattern for better type safety and maintainability.

-   **Stale Loop State Validation:** Use `loop.is_closed()` — a public, reliable API — to detect and clean up stale asyncio loop states in `_batch_processor_prop`, `_get_loop_state`, and `flush`. The previous approach used `asyncio.Queue._loop` which is `None` on Python 3.10+, causing the check to always treat states as stale.
-   **Quota Project ID Fallback:** Remove the `or project_id` fallback when setting `quota_project_id` on `BigQueryWriteAsyncClient`. This fixes Workload Identity Federation flows where the federated identity lacks `serviceusage.services.use` on the quota project.
-   **Kwargs Passthrough:** Pass `**kwargs` through to `_log_event` in all callbacks. Previously only model callbacks forwarded them, causing custom attributes (e.g. `customer_id`) to silently drop for agent, tool, run, and error events.
-   **State Delta Logging:** Replace the dead `on_state_change_callback` (never invoked by the framework) with `on_event_callback`, which is already dispatched by the runner for every event. Remove duplicate `STATE_DELTA` logging from `after_tool_callback`.
-   **EventData Dataclass:** Replace the `**kwargs`-as-data-bus pattern in `_log_event` with an explicit `EventData` dataclass. This makes the interface self-documenting, catches typos at construction time, and eliminates shared dict mutation across `_resolve_span_ids`, `_extract_latency`, and `_enrich_attributes`. All 12 callback call sites now construct typed `EventData` instances.
-   **Multi-Subagent Tool Logging Tests:** Add `TestMultiSubagentToolLogging` (6 tests) verifying that tool events are correctly attributed to subagents in multi-turn, multi-agent scenarios. Total tests: 111 (up from 60).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Remove intermediate `BigQueryLoggerConfig = bigquery_agent_analytics_plugin.BigQueryLoggerConfig`
alias and use the fully qualified reference directly.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@caohy1988 caohy1988 force-pushed the fix/loop-state-validation branch from bbf648b to e8a80aa Compare February 17, 2026 15:49
@haiyuan-eng-google
Copy link
Collaborator

I've tested the latest of BigQuery Plugin in ADK, and I didn't catch how log_session_metadata works.

I added state to my session, nothing is logged. I activated and then deactivated the log_session_metadata option in the plugins; however, no changes occurred in the data logged into BigQuery.

So, I delved into the code. the SESSION object from the callback_context does not contain "metadata" attribute.

Have you confused it with the "state" attribute? Or did I miss something?

…istent metadata attribute

Session has no `metadata` attribute (Pydantic model with `extra='forbid'`),
and CallbackContext has no public `session` attribute. This made
`log_session_metadata` a complete no-op.

Fix by accessing `callback_context._invocation_context.session` and
logging session_id, app_name, user_id, and the session state dict
(which holds user-set metadata like thread_id, customer_id, etc.).

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@caohy1988
Copy link
Author

Good catch — log_session_metadata was indeed completely broken (dead code). Two issues:

  1. CallbackContext (which is Context) doesn't expose a public .session attribute, so hasattr(callback_context, "session") was always False
  2. Even if it did reach Session, there's no metadata attribute — Session is a Pydantic model with extra='forbid' and only has id, app_name, user_id, state, events, last_update_time

Fixed in commit ed1d1a7log_session_metadata now accesses callback_context._invocation_context.session and logs:

{
  "session_metadata": {
    "session_id": "...",
    "app_name": "...",
    "user_id": "...",
    "state": {"thread_id": "gchat-123", "customer_id": "cust-42"}
  }
}

The state dict is where users store metadata like gchat thread-id, customer_id, etc. — so this is the correct field to capture.

@haiyuan-eng-google
Copy link
Collaborator

content_json, content_parts, parser_truncated = await self.parser.parse(
raw_content
) 'HybridContentParser | None' has no attribute 'parse'

self.parser is typed as Optional[HybridContentParser] but was accessed
without a null check at lines 2046-2048. While _ensure_started() should
initialize it before _log_event runs, the type checker cannot prove this
and a partial _lazy_setup failure could leave parser as None. Add an
explicit guard that logs a warning and returns early.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@caohy1988
Copy link
Author

Good catch — self.parser is typed as Optional[HybridContentParser] (initialized to None in __init__) but was used without a null check in _log_event at lines 2046-2048.

In normal flow, _ensure_started()_lazy_setup() creates the parser before _log_event runs, so it wouldn't be None at runtime. But the type checker can't prove this, and if _lazy_setup partially fails (e.g. schema creation succeeds but parser creation throws), self.parser could remain None.

Fixed in commit 6e5046b — added an explicit null guard that logs a warning and returns early:

if not self.parser:
    logger.warning("Parser not initialized; skipping event %s.", event_type)
    return

@haiyuan-eng-google
Copy link
Collaborator

ERROR: Found 5 breaking change(s) in API surface.
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:1596: BigQueryAgentAnalyticsPlugin.batch_processor: Attribute value was changed: None -> unset
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:1608: BigQueryAgentAnalyticsPlugin.write_client: Attribute value was changed: None -> unset
src/google/adk/plugins/bigquery_agent_analytics_plugin.py:1619: BigQueryAgentAnalyticsPlugin.write_stream: Attribute value was changed: None -> unset
/tmp/tmp.EFKMo1rzPc/src/google/adk/plugins/bigquery_agent_analytics_plugin.py:1629: BigQueryAgentAnalyticsPlugin.getattribute: Public object was removed
/tmp/tmp.EFKMo1rzPc/src/google/adk/plugins/bigquery_agent_analytics_plugin.py:2172: BigQueryAgentAnalyticsPlugin.on_state_change_callback: Public object was removed

@caohy1988
Copy link
Author

Thanks for running the API surface check. All 5 breaking changes are now fixed in commit 0b143e0:

Items 1-4 (batch_processor, write_client, write_stream value change + __getattribute__ removal):

Our refactoring replaced the original __getattribute__ + class-level = None masking pattern with clean @property decorators. While functionally equivalent, the API checker saw the class-level attributes change from None to unset (property descriptor).

Fixed by restoring the original pattern: class-level = None attributes + __getattribute__ routing to private _*_prop properties. The stale loop cleanup logic from our fix is preserved inside _batch_processor_prop.

Item 5 (on_state_change_callback removed):

This was intentionally replaced with on_event_callback because on_state_change_callback is dead code — it's not declared in BasePlugin, not dispatched by PluginManager, and not invoked by Runner. State deltas are now captured via on_event_callback which the framework actually calls.

However, to avoid breaking any subclasses that may override it, we've added back a deprecated stub that logs a warning:

async def on_state_change_callback(self, *, callback_context, state_delta):
    """Deprecated: use on_event_callback instead."""
    logger.warning("on_state_change_callback is deprecated and never called by the framework...")

…ent, write_stream, and on_state_change_callback

Restore class-level `= None` attributes and `__getattribute__` override
for batch_processor, write_client, and write_stream to preserve the
public API surface. The properties were renamed to `_*_prop` (private)
with the `__getattribute__` routing access to them.

Add deprecated `on_state_change_callback` stub for backward compat.
This method is never invoked by the framework but retaining it prevents
breaking subclasses that may override it.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@caohy1988 caohy1988 force-pushed the fix/loop-state-validation branch from 0b143e0 to aa2c196 Compare February 17, 2026 16:12
@haiyuan-eng-google
Copy link
Collaborator


| Running: uv run contributing/samples/hello_world/main.py

Log setup complete: /tmpfs/tmp/agents_log/agent.20260217_111202.log
To access latest log: tail -F /tmpfs/tmp/agents_log/agent.latest.log
Start time: 1771344722.3483453

** User says: {'parts': [{'text': 'Hi'}], 'role': 'user'}
** hello_world_agent: Hello! I'm a hello world agent that can roll an 8-sided die and check prime numbers. How can I help you today?

** User says: {'parts': [{'text': 'Roll a die with 100 sides'}], 'role': 'user'}
Traceback (most recent call last):
File "/tmpfs/src/piper/google3/third_party/py/google/adk/open_source_workspace/contributing/samples/hello_world/main.py", line 103, in
asyncio.run(main())
File "/root/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 190, in run
return runner.run(main)
^^^^^^^^^^^^^^^^
File "/root/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/asyncio/runners.py", line 118, in run
return self._loop.run_until_complete(task)
^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
File "/root/.local/share/uv/python/cpython-3.11.14-linux-x86_64-gnu/lib/python3.11/asyncio/base_events.py", line 654, in run_until_complete
return future.result()
^^^^^^^^^^^^^^^
File "/tmpfs/src/piper/google3/third_party/py/google/adk/open_source_workspace/contributing/samples/hello_world/main.py", line 86, in main
await check_rolls_in_state(1)
File "/tmpfs/src/piper/google3/third_party/py/google/adk/open_source_workspace/contributing/samples/hello_world/main.py", line 77, in check_rolls_in_state
assert len(session.state['rolls']) == rolls_size
~~~~~~~~~~~~~^^^^^^^^^
KeyError: 'rolls'

@caohy1988
Copy link
Author

This integration test failure is not caused by our PR. Our changes only touch 2 files:

  • src/google/adk/plugins/bigquery_agent_analytics_plugin.py
  • tests/unittests/plugins/test_bigquery_agent_analytics_plugin.py

The failing test (contributing/samples/hello_world/main.py) does not use the BQ analytics plugin at all — it creates an InMemoryRunner with no plugins. The failure is:

assert len(session.state['rolls']) == rolls_size
KeyError: 'rolls'

This happens because the roll_die tool writes to tool_context.state['rolls'], but only if the LLM actually calls the tool. Since LLM behavior is non-deterministic, the model may not always invoke roll_die when prompted "Roll a die with 100 sides", leaving the rolls key absent from session state.

This appears to be a pre-existing flaky test — the main branch also shows test failures/cancellations in CI. Our PR has zero impact on the runner, session state, tool execution, or agent logic.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

services [Component] This issue is related to runtime services, e.g. sessions, memory, artifacts, etc

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants

Comments