275 changes: 84 additions & 191 deletions contributing/samples/long_running_task/REVIEW_FEEDBACK.md
# Design Document Review: Durable Session Persistence for Long-Horizon ADK Agents
# Durable Session Persistence for Long-Horizon Data Research Agents

**Reviewer:** Claude Code
**Date:** 2026-02-01
**Document:** `long_running_task_design.md`
**Authors:** Haiyuan Cao
**Date:** 2026-02-03

---

## Executive Summary
## Objective

The design document is **well-structured and comprehensive**, covering a real problem with a thorough technical approach. However, there are **critical accuracy issues** regarding ADK's current capabilities that must be addressed before the document is ready for review.

**Overall Assessment:** Good foundation, requires significant revisions to accurately reflect ADK's existing resumability features.

---

## 1. Reference Validation

### External URLs (7 total) - ALL VALID

| # | URL | Status | Notes |
|---|-----|--------|-------|
| 1 | LangGraph durable-execution | VALID | Content matches claims |
| 2 | LangGraph persistence | VALID | Checkpointing docs |
| 3 | LangGraph overview | VALID | Framework intro |
| 4 | LangGraph checkpoints reference | VALID | API docs |
| 5 | Deep Agents overview | VALID | LangChain library |
| 6 | Deep Agents long-term memory | VALID | Memory patterns |
| 7 | Anthropic harnesses article | VALID | Published 2025-11-26 |

---

## 2. CRITICAL ISSUE: ADK Already Has Resumability

### Problem Statement Inaccuracy

The document states (Section 2):
> "Current ADK sessions are optimized for synchronous 'serving' patterns... state is ephemeral... background execution is not a first-class runtime mode"

**This is inaccurate.** ADK already has an experimental resumability feature:

```python
# src/google/adk/apps/app.py lines 42-58
@experimental
class ResumabilityConfig(BaseModel):
  """The "resumability" in ADK refers to the ability to:

  1. pause an invocation upon a long-running function call.
  2. resume an invocation from the last event, if it's paused or failed
     midway through.
  """

  is_resumable: bool = False
```

### Existing ADK Capabilities Not Mentioned

| Capability | Location | Status |
|------------|----------|--------|
| `ResumabilityConfig` | `src/google/adk/apps/app.py:42-58` | Experimental |
| `should_pause_invocation()` | `src/google/adk/agents/invocation_context.py:355-389` | Implemented |
| `long_running_tool_ids` | `src/google/adk/events/event.py` | Implemented |
| Resume from last event | `src/google/adk/runners.py:1294` | Implemented |

### Required Fix

**The document must:**
1. Acknowledge existing `ResumabilityConfig` and pause/resume capability
2. Clearly articulate how this proposal **extends** existing features vs. replacing them
3. Update Section 2 (Problem Statement) to reflect actual gaps (e.g., durable cross-process persistence, BigQuery-based audit, external event triggers)
To enable a new class of **Long-Horizon Deep Research Agents** in BigQuery that can execute multi-day autonomous investigations. By implementing a **Durable Session Persistence Layer**, these agents move beyond simple Q&A to perform complex, asynchronous rollouts—cross-dataset synthesis, persistent document monitoring, and multi-step "deep dives"—that survive cloud sandbox timeouts and process restarts.

---

## 3. Technical Review
## 1. Current State & Limitations

### 3.1 SQL Schema (Appendix B) - VALID WITH MINOR ISSUES
It is critical to distinguish between ADK's existing *session delegation* capabilities and the proposed *durable execution* model.

**Strengths:**
- Proper partitioning strategy (`PARTITION BY DATE`)
- Sensible clustering choices
- JSON columns for flexibility
### 1.1 Existing Capability: `ResumabilityConfig`

**Issues:**
ADK currently supports an experimental `ResumabilityConfig` (see `src/google/adk/apps/app.py`) that allows an agent to "pause" execution when calling a long-running tool.

1. **Missing primary key constraint on checkpoints:**
```sql
-- Should add:
PRIMARY KEY (session_id, checkpoint_seq)
```
* **Mechanism:** The invocation pauses in-memory. If the user polls the API or the tool returns, the specific runner instance resumes the thread.
* **Storage Delegation:** ADK can currently delegate session storage (chat history) to BigQuery or Postgres. This saves the *conversation log* (`User: ...`, `Agent: ...`).

2. **events table lacks PRIMARY KEY:**
```sql
-- Consider adding:
PRIMARY KEY (event_id) -- or composite key
```
### 1.2 The Gap: Chat History vs. Execution State

3. **View `v_latest_checkpoint` uses ARRAY_AGG with OFFSET(0):**
- This is valid but will error if no checkpoints exist
- Consider `SAFE_OFFSET(0)` or handle NULL case
While ADK can save *what was said* (Chat History) to BigQuery, it does not currently save *what the agent is thinking* (Execution State) in a way that survives a crash.

### 3.2 Python Code Snippets - MOSTLY VALID
| Feature | Current ADK (BQ Session Delegation) | Proposed Durable Extension |
| --- | --- | --- |
| **What is Saved?** | **Conversation Log:** Messages, user inputs, final tool outputs. | **Execution Snapshot:** The "Job Ledger," partial plans, active stack frames, and retry counters. |
| **Process Death** | **Fatal:** If the runner crashes, the "thought process" is lost. You can reload chat history, but the agent forgets it was waiting for Job ID #504. | **Recoverable:** The agent wakes up, reads the checkpoint, and knows exactly where it left off in the logic loop. |
| **Resume Trigger** | **Reactive:** Requires a user or API to poke the agent. | **Proactive:** Can wake itself up via Pub/Sub events (e.g., "Job Done"). |
| **Consistency** | **Event Replay:** Replays history to rebuild context (expensive & fragile). | **Authoritative Reconciliation:** Deterministically syncs with cloud state (BQ Information Schema). |

**Section 7.1 `write_checkpoint()`:**
- Logic is sound (two-phase commit pattern)
- Consider adding error handling for partial failures
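As a concrete illustration of that error-handling suggestion, here is a minimal sketch of the two-phase pattern with partial-failure handling. The `upload_blob`/`insert_row` callables, bucket name, and table name are hypothetical stand-ins for the real GCS/BigQuery clients, not the document's actual API:

```python
import hashlib
import json


def write_checkpoint(session_id, seq, state, upload_blob, insert_row):
    """Two-phase commit sketch: Phase 1 writes bulk state to GCS (data
    plane); Phase 2 inserts the metadata row into BigQuery (control
    plane). The checkpoint only becomes "live" when Phase 2 commits, so
    a crash between phases leaves at worst an orphaned, invisible blob.
    `upload_blob` and `insert_row` are injected stand-ins for real clients.
    """
    payload = json.dumps(state, sort_keys=True).encode()
    uri = f"gs://agent-checkpoints/{session_id}/{seq}.json"  # hypothetical bucket
    try:
        upload_blob(uri, payload)  # Phase 1: data plane
    except Exception as exc:
        raise RuntimeError(f"phase 1 failed, no row written: {exc}") from exc
    row = {
        "session_id": session_id,
        "checkpoint_seq": seq,
        "gcs_state_uri": uri,
        "sha256": hashlib.sha256(payload).hexdigest(),  # integrity check
    }
    try:
        insert_row("adk_metadata.checkpoints", row)  # Phase 2: control plane
    except Exception as exc:
        # Partial failure: the blob exists but is unreferenced; it is
        # invisible to readers and can be garbage-collected later.
        raise RuntimeError(f"phase 2 failed, checkpoint not live: {exc}") from exc
    return row
```

With in-memory fakes, a successful call writes exactly one blob and one control-plane row referencing it.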

**Section 7.2 `reconcile_on_resume()`:**
- Good idempotency pattern
- Missing: what happens if `bq.get_job()` fails?
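One way to answer that question, sketched with a hypothetical `get_job` callable: retry transient lookup failures, and fall back to an `UNKNOWN` state so the ledger defers judgment to the next reconcile pass rather than mislabeling the job:

```python
def safe_job_state(get_job, job_id, retries=3):
    """If the job-status lookup fails transiently, retry; if it keeps
    failing, return "UNKNOWN" so the task stays pending in the ledger
    instead of being wrongly marked DONE or FAILED. `get_job` is a
    hypothetical stand-in returning a dict with a "state" key.
    """
    for _attempt in range(retries):
        try:
            return get_job(job_id)["state"]
        except Exception:
            continue  # real code would back off exponentially here
    return "UNKNOWN"  # defer judgment to the next reconcile pass
```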

### 3.3 Leasing Approach (Section 7.3) - REASONABLE

The BQ-based optimistic lease is correctly noted as best-effort. The suggestion to use Firestore/Spanner for stronger guarantees is appropriate.

**Suggestion:** Add a concrete example of when to use each backend (BQ vs Firestore).
**Problem Statement:** Current workflows are brittle. If a Cloud Run instance recycles during a 4-hour job, the agent effectively "dies," leaving orphaned BigQuery jobs and no record of its intent. We need a way to hibernate the *brain*, not just the *transcript*.

---

## 4. Architecture Feedback
## 2. The Solution: Two-Phase Commit Persistence

### 4.1 Strengths
We introduce a **Two-Phase Commit** mechanism to ensure every research step is durably persisted before the agent hibernates. This extends `ResumabilityConfig` to support cross-process durability.

1. **Clear separation of control plane (BQ) vs data plane (GCS)** - follows Google best practices
2. **Logical checkpointing over heap snapshots** - pragmatic and maintainable
3. **Two-phase commit pattern** - ensures atomic visibility
4. **Authoritative reconciliation** - critical for BigQuery job scenarios
5. **Good competitive analysis** (Section 14)
### 2.1 Architecture

### 4.2 Gaps / Missing Considerations
1. **Phase 1 (GCS - Data Plane):** The agent serializes its "Research Notebook" (partial drafts, URL ledgers, and reasoning state) to GCS. This handles the bulk state (>1MB).
2. **Phase 2 (BigQuery - Control Plane):** A metadata row is inserted into a new `checkpoints` table. **Crucially, a checkpoint is only considered "live" once this row commits.** This ensures atomic visibility and prevents "half-saved" states.
3. **Hibernation:** The agent releases all compute resources. Cost drops to near zero.

| Gap | Impact | Suggested Action |
|-----|--------|------------------|
| No mention of existing `ResumabilityConfig` | Misleading problem statement | Add section on existing capability |
| No cost estimates for BQ storage/queries | Budget planning | Add rough estimates |
| No mention of BQ quota limits | Operational risk | Document relevant quotas |
| Checkpoint versioning migration strategy | Future maintenance | Expand Section 16.2 |
| No monitoring/alerting design | Operability | Add observability section |
| No rollback strategy | Safety | Document how to rollback |
### 2.2 Authoritative Reconciliation

### 4.3 API Contract Review
Upon waking (via Pub/Sub), the agent does not blindly trust the event stream. It performs a **Deterministic Sync**:

The proposed `CheckpointableAgentState` interface is clean:

```python
class CheckpointableAgentState:
  def export_state(self) -> dict: ...
  def import_state(self, state: dict) -> None: ...
```
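A minimal hypothetical implementation of that interface, showing the agent's "mental model" (job ledger, remaining plan) round-tripping through a plain JSON-serializable dict; the class and field names are illustrative only:

```python
import json


class ResearchAgentState:
    """Hypothetical implementation of the proposed interface: the
    execution state serializes to a plain dict so it can be written to
    GCS and restored after a process restart."""

    def __init__(self):
        self.job_ledger = {}  # job_id -> status, e.g. {"504": "RUNNING"}
        self.plan = []        # remaining research steps

    def export_state(self) -> dict:
        return {"job_ledger": dict(self.job_ledger), "plan": list(self.plan)}

    def import_state(self, state: dict) -> None:
        self.job_ledger = dict(state.get("job_ledger", {}))
        self.plan = list(state.get("plan", []))
```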

**Suggestion:** Consider alignment with existing ADK patterns:
- Existing `BaseAgentState` in `src/google/adk/agents/base_agent.py`
- Existing state patterns in `src/google/adk/sessions/state.py`
* It loads the checkpoint from GCS.
* It queries `INFORMATION_SCHEMA.JOBS` to verify the actual status of delegated tasks.
* It updates its internal ledger (e.g., marking "RUNNING" tasks as "FAILED" if BQ reports a quota error) before generating the next step.
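The reconciliation step above might look like the following sketch. The `query` callable and the simplified state values are assumptions (in real BigQuery, failed jobs report `state = 'DONE'` with `error_result` set, which a production version would inspect):

```python
def reconcile_ledger(ledger, query):
    """Deterministic sync: instead of trusting the event stream, ask
    BigQuery which delegated jobs actually finished. `query` stands in
    for a client running the INFORMATION_SCHEMA lookup below and
    returning {job_id: state} for the given ids."""
    sql = """
        SELECT job_id, state, error_result
        FROM `region-us`.INFORMATION_SCHEMA.JOBS
        WHERE job_id IN UNNEST(@job_ids)
    """
    pending = [j for j, s in ledger.items() if s == "RUNNING"]
    actual = query(sql, job_ids=pending)
    for job_id in pending:
        # Authoritative: BQ's answer overwrites whatever the ledger believed.
        ledger[job_id] = actual.get(job_id, "UNKNOWN")
    return ledger
```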

---

## 5. Specific Line-by-Line Feedback
## 3. CUJ: Autonomous Deep Research & Trend Analysis

### Section 0 (Executive Summary)
- Line 14: "12-minute barrier" - should cite source or clarify this is environment-specific
- Line 28: Cost estimate "< $0.01/session-day paused" - show calculation
**Persona:** Strategic Market Analyst
**User Story:** *"As an analyst, I want to trigger a deep research mission. I want the agent to scan three years of financial filings, monitor news for 48 hours, and synthesize a strategy memo without me keeping a tab open."*

### Section 2 (Problem Statement)
- **Major revision needed** - must acknowledge existing resumability
### Agent-Based Journey:

### Section 4.1 (States)
- Consider: should PAUSED be a first-class `Session.status` field or remain at `InvocationContext` level?
1. **Initiation:** A Cloud Scheduler triggers the agent.
* **Mission:** "Analyze `global_financials` for R&D trends. Monitor `live_news_stream` for 24h."

### Section 8 (API Extensions)
- `checkpoint_policy` options are good, but:
- What triggers `superstep`?
- How does `manual` interact with `long_running_tool_ids`?

### Section 13 (Moltbot Alignment)
- Moltbot reference is useful context
- Consider adding link/citation if public

### Section 18 (Open Questions)
- Good list, but add: "How does this integrate with existing `ResumabilityConfig`?"
2. **Execution & Hibernation:** The agent submits 30 complex BigQuery jobs. It performs a **Two-Phase Commit** to save its state to GCS/BQ, then enters **PAUSED**.
3. **Event-Driven Resume:** A Pub/Sub message (BQ Job Complete) triggers the **Resume Service**.
4. **Reconciliation:** The agent acquires a **Lease** (preventing race conditions), loads the checkpoint, and reconciles its ledger. It finds 28 successes and 2 failures, schedules retries, and drafts the report.
5. **Outcome:** 24 hours later, a final Markdown report is written to the `executive_briefings` table.

---

## 6. Recommended Document Changes

### High Priority (Must Fix)
## 4. System Design & Schemas

### 4.1 BigQuery Schema (Control Plane)

*Additions based on Technical Review:*

```sql
-- Checkpoints Table (Control Plane)
CREATE TABLE `adk_metadata.checkpoints` (
  session_id STRING NOT NULL,
  checkpoint_seq INT64 NOT NULL,
  gcs_state_uri STRING NOT NULL,
  sha256 STRING NOT NULL,  -- Integrity check
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
  agent_state_summary JSON,  -- For querying "what was the agent thinking?"
  PRIMARY KEY (session_id, checkpoint_seq) NOT ENFORCED
);

-- Events Table (Trigger Log)
CREATE TABLE `adk_metadata.events` (
  event_id STRING NOT NULL,
  session_id STRING NOT NULL,
  event_type STRING,  -- e.g., 'BQ_JOB_DONE', 'PUB_SUB_MSG'
  payload JSON,
  processed BOOL DEFAULT FALSE,
  PRIMARY KEY (event_id) NOT ENFORCED
);
```

1. **Add Section 1.3: "Existing ADK Resumability"**
   - Document current `ResumabilityConfig` capability
   - Explain limitations this design addresses
   - Position proposal as extension, not replacement

2. **Revise Section 2 (Problem Statement)**
   - Remove/qualify claims about ADK lacking pause/resume
   - Focus on actual gaps: cross-process durability, external event triggers, enterprise audit

3. **Add explicit integration plan**
- How does `CheckpointableAgentState` relate to `BaseAgentState`?
- Migration path from current resumability to new design
### 4.2 Cost Estimation

### Medium Priority
By shifting from "Idle Compute" to "Durable Pause":

4. Add cost estimation section
5. Add monitoring/observability design
6. Add rollback/recovery procedures
7. Fix SQL schema issues (PKs)
* **Compute:** Drops from ~$1.40/day (Cloud Run idle) to **$0**.
* **Storage:** BQ Rows + GCS Blobs cost **<$0.01/day** per active session.
* **Total Savings:** >99% reduction for long-horizon tasks.
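Checking the arithmetic behind the savings claim, using the figures above:

```python
idle_compute_per_day = 1.40   # ~$/day, Cloud Run kept warm (figure from this doc)
durable_pause_per_day = 0.01  # <$/day, BQ rows + GCS blobs (figure from this doc)

# Fraction of daily cost eliminated by hibernating instead of idling.
savings = 1 - durable_pause_per_day / idle_compute_per_day
print(f"{savings:.1%}")  # → 99.3%
```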

### Low Priority
### 4.3 Rollback & Safety

8. Add Moltbot citation if available
9. Add BQ quota documentation links
10. Consider adding architecture diagram (beyond Mermaid sequence)
* **Leasing:** We use a BQ `active_lease_id` column with a 5-minute TTL to ensure only one runner owns a session.
* **Rollback:** If a runner crashes *during* a checkpoint write, the Phase 2 (BQ commit) never happens. The system simply resumes from the *previous* valid checkpoint, ensuring zero corruption.
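A sketch of the best-effort lease under those assumptions; the in-memory dict stands in for a conditional UPDATE on the sessions table's `active_lease_id` and expiry columns, so this illustrates the logic rather than the real BQ transaction:

```python
import time
import uuid


def try_acquire_lease(leases, session_id, ttl_seconds=300, now=None):
    """Best-effort optimistic lease with a 5-minute TTL, as described
    above. `leases` maps session_id -> (lease_id, expires_at); a real
    implementation would do this as a single conditional UPDATE."""
    now = time.time() if now is None else now
    current = leases.get(session_id)
    if current is not None and current[1] > now:
        return None  # another runner holds an unexpired lease
    lease_id = str(uuid.uuid4())
    leases[session_id] = (lease_id, now + ttl_seconds)
    return lease_id
```

A second runner arriving before the TTL expires gets `None` and backs off; after expiry the lease can be stolen, which is why the design calls this best-effort.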

---

## 7. Summary Table

| Category | Status | Details |
|----------|--------|---------|
| External URLs | VALID | All 7 references work |
| SQL Syntax | VALID with issues | Missing PKs, edge cases |
| Python Code | VALID | Sound patterns |
| Problem Statement | INACCURATE | Ignores existing resumability |
| Architecture | SOUND | Good Google-scale patterns |
| Completeness | GAPS | Missing cost, monitoring, rollback |

---
## Q&A

## 8. Conclusion
**Why can't we just use the existing BQ Session Storage?**
Existing BQ storage saves the *transcript*. If an agent has processed 500 documents and built a complex internal mental model, reloading the transcript and asking the LLM to "re-read" everything to rebuild that mental model is slow, expensive, and error-prone. This proposal saves the *mental model itself*.

This is a **solid technical design** for extending ADK's capabilities for long-running BigQuery workloads. The core architecture (BQ control plane, GCS data plane, two-phase commit, authoritative reconciliation) is well-reasoned.

**However, the document cannot be approved in its current form** because it misrepresents ADK's existing capabilities. Once the existing `ResumabilityConfig` is acknowledged and the document is repositioned as an extension rather than a new capability, it will be ready for technical review.

**Recommended Next Steps:**
1. Revise document to acknowledge existing resumability
2. Add cost/monitoring sections
3. Fix SQL schema issues
4. Re-submit for review

---
**What happens if the Resume Service fails?**
The architecture is idempotent. Pub/Sub will redeliver the message. The Resume Service will check the BQ `events` table; if the event is already marked `processed`, it ignores it. If not, it acquires the lease and proceeds.
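That dedup check can be sketched as follows, with a dict standing in for the BQ `events` table:

```python
def handle_event(events, event_id, process):
    """Idempotent resume handling: Pub/Sub may redeliver, so the handler
    first checks the events table's `processed` flag. `events` is a
    stand-in mapping event_id -> {"processed": bool, ...}; `process`
    would acquire the lease, load the checkpoint, and reconcile."""
    row = events.get(event_id)
    if row is None or row["processed"]:
        return False          # duplicate or unknown delivery: ignore
    process(row)
    row["processed"] = True   # redeliveries are now no-ops
    return True
```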

*Review generated by Claude Code on 2026-02-01*
**Is this only for BigQuery?**
No. While BigQuery is the control plane, this pattern works for any long-running async task (e.g., waiting for human approval, video rendering, or external API webhooks).