275 changes: 84 additions & 191 deletions contributing/samples/long_running_task/REVIEW_FEEDBACK.md
# Design Document Review: Durable Session Persistence for Long-Horizon ADK Agents
# Durable Session Persistence for Long-Horizon Data Research Agents

**Reviewer:** Claude Code
**Date:** 2026-02-01
**Document:** `long_running_task_design.md`
**Authors:** Haiyuan Cao
**Date:** 2026-02-03

---

## Executive Summary
## Objective

The design document is **well-structured and comprehensive**, covering a real problem with a thorough technical approach. However, there are **critical accuracy issues** regarding ADK's current capabilities that must be addressed before the document is ready for review.

**Overall Assessment:** Good foundation, requires significant revisions to accurately reflect ADK's existing resumability features.

---

## 1. Reference Validation

### External URLs (7 total) - ALL VALID

| # | URL | Status | Notes |
|---|-----|--------|-------|
| 1 | LangGraph durable-execution | VALID | Content matches claims |
| 2 | LangGraph persistence | VALID | Checkpointing docs |
| 3 | LangGraph overview | VALID | Framework intro |
| 4 | LangGraph checkpoints reference | VALID | API docs |
| 5 | Deep Agents overview | VALID | LangChain library |
| 6 | Deep Agents long-term memory | VALID | Memory patterns |
| 7 | Anthropic harnesses article | VALID | Published 2025-11-26 |

---

## 2. CRITICAL ISSUE: ADK Already Has Resumability

### Problem Statement Inaccuracy

The document states (Section 2):
> "Current ADK sessions are optimized for synchronous 'serving' patterns... state is ephemeral... background execution is not a first-class runtime mode"

**This is inaccurate.** ADK already has an experimental resumability feature:

```python
# src/google/adk/apps/app.py lines 42-58
@experimental
class ResumabilityConfig(BaseModel):
  """The "resumability" in ADK refers to the ability to:

  1. pause an invocation upon a long-running function call.
  2. resume an invocation from the last event, if it's paused or failed
     midway through.
  """

  is_resumable: bool = False
```

### Existing ADK Capabilities Not Mentioned

| Capability | Location | Status |
|------------|----------|--------|
| `ResumabilityConfig` | `src/google/adk/apps/app.py:42-58` | Experimental |
| `should_pause_invocation()` | `src/google/adk/agents/invocation_context.py:355-389` | Implemented |
| `long_running_tool_ids` | `src/google/adk/events/event.py` | Implemented |
| Resume from last event | `src/google/adk/runners.py:1294` | Implemented |

### Required Fix

**The document must:**
1. Acknowledge existing `ResumabilityConfig` and pause/resume capability
2. Clearly articulate how this proposal **extends** existing features vs. replacing them
3. Update Section 2 (Problem Statement) to reflect actual gaps (e.g., durable cross-process persistence, BigQuery-based audit, external event triggers)
To enable a new class of **Long-Horizon Deep Research Agents** in BigQuery that can execute multi-day autonomous investigations. By implementing a **Durable Session Persistence Layer**, these agents move beyond simple Q&A to perform complex, asynchronous rollouts—cross-dataset synthesis, persistent document monitoring, and multi-step "deep dives"—that survive cloud sandbox timeouts and process restarts.

---

## 3. Technical Review
## 1. Current State & Limitations

### 3.1 SQL Schema (Appendix B) - VALID WITH MINOR ISSUES
It is critical to distinguish between ADK's existing *session delegation* capabilities and the proposed *durable execution* model.

**Strengths:**
- Proper partitioning strategy (`PARTITION BY DATE`)
- Sensible clustering choices
- JSON columns for flexibility
### 1.1 Existing Capability: `ResumabilityConfig`

**Issues:**
ADK currently supports an experimental `ResumabilityConfig` (see `src/google/adk/apps/app.py`) that allows an agent to "pause" execution when calling a long-running tool.

1. **Missing primary key constraint on checkpoints:**
```sql
-- Should add:
PRIMARY KEY (session_id, checkpoint_seq)
```
* **Mechanism:** The invocation pauses in-memory. If the user polls the API or the tool returns, the specific runner instance resumes the thread.
* **Storage Delegation:** ADK can currently delegate session storage (chat history) to BigQuery or Postgres. This saves the *conversation log* (`User: ...`, `Agent: ...`).

2. **events table lacks PRIMARY KEY:**
```sql
-- Consider adding:
PRIMARY KEY (event_id) -- or composite key
```
### 1.2 The Gap: Chat History vs. Execution State

3. **View `v_latest_checkpoint` uses ARRAY_AGG with OFFSET(0):**
- This is valid but will error if no checkpoints exist
- Consider `SAFE_OFFSET(0)` or handle NULL case
While ADK can save *what was said* (Chat History) to BigQuery, it does not currently save *what the agent is thinking* (Execution State) in a way that survives a crash.

### 3.2 Python Code Snippets - MOSTLY VALID
| Feature | Current ADK (BQ Session Delegation) | Proposed Durable Extension |
| --- | --- | --- |
| **What is Saved?** | **Conversation Log:** Messages, user inputs, final tool outputs. | **Execution Snapshot:** The "Job Ledger," partial plans, active stack frames, and retry counters. |
| **Process Death** | **Fatal:** If the runner crashes, the "thought process" is lost. You can reload chat history, but the agent forgets it was waiting for Job ID #504. | **Recoverable:** The agent wakes up, reads the checkpoint, and knows exactly where it left off in the logic loop. |
| **Resume Trigger** | **Reactive:** Requires a user or API to poke the agent. | **Proactive:** Can wake itself up via Pub/Sub events (e.g., "Job Done"). |
| **Consistency** | **Event Replay:** Replays history to rebuild context (expensive & fragile). | **Authoritative Reconciliation:** Deterministically syncs with cloud state (BQ Information Schema). |

**Section 7.1 `write_checkpoint()`:**
- Logic is sound (two-phase commit pattern)
- Consider adding error handling for partial failures
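As a concrete illustration of that error-handling suggestion, here is a minimal sketch of the two-phase pattern with partial-failure handling. The `upload_blob`/`insert_row` callables, bucket name, and table name are hypothetical stand-ins for the real GCS/BigQuery clients, not the document's actual API:

```python
import hashlib
import json


def write_checkpoint(session_id, seq, state, upload_blob, insert_row):
    """Two-phase commit sketch: Phase 1 writes bulk state to GCS (data
    plane); Phase 2 inserts the metadata row into BigQuery (control
    plane). The checkpoint only becomes "live" when Phase 2 commits, so
    a crash between phases leaves at worst an orphaned, invisible blob.
    `upload_blob` and `insert_row` are injected stand-ins for real clients.
    """
    payload = json.dumps(state, sort_keys=True).encode()
    uri = f"gs://agent-checkpoints/{session_id}/{seq}.json"  # hypothetical bucket
    try:
        upload_blob(uri, payload)  # Phase 1: data plane
    except Exception as exc:
        raise RuntimeError(f"phase 1 failed, no row written: {exc}") from exc
    row = {
        "session_id": session_id,
        "checkpoint_seq": seq,
        "gcs_state_uri": uri,
        "sha256": hashlib.sha256(payload).hexdigest(),  # integrity check
    }
    try:
        insert_row("adk_metadata.checkpoints", row)  # Phase 2: control plane
    except Exception as exc:
        # Partial failure: the blob exists but is unreferenced; it is
        # invisible to readers and can be garbage-collected later.
        raise RuntimeError(f"phase 2 failed, checkpoint not live: {exc}") from exc
    return row
```

With in-memory fakes, a successful call writes exactly one blob and one control-plane row referencing it.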

**Section 7.2 `reconcile_on_resume()`:**
- Good idempotency pattern
- Missing: what happens if `bq.get_job()` fails?
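One way to answer that question, sketched with a hypothetical `get_job` callable: retry transient lookup failures, and fall back to an `UNKNOWN` state so the ledger defers judgment to the next reconcile pass rather than mislabeling the job:

```python
def safe_job_state(get_job, job_id, retries=3):
    """If the job-status lookup fails transiently, retry; if it keeps
    failing, return "UNKNOWN" so the task stays pending in the ledger
    instead of being wrongly marked DONE or FAILED. `get_job` is a
    hypothetical stand-in returning a dict with a "state" key.
    """
    for _attempt in range(retries):
        try:
            return get_job(job_id)["state"]
        except Exception:
            continue  # real code would back off exponentially here
    return "UNKNOWN"  # defer judgment to the next reconcile pass
```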

### 3.3 Leasing Approach (Section 7.3) - REASONABLE

The BQ-based optimistic lease is correctly noted as best-effort. The suggestion to use Firestore/Spanner for stronger guarantees is appropriate.

**Suggestion:** Add a concrete example of when to use each backend (BQ vs Firestore).
**Problem Statement:** Current workflows are brittle. If a Cloud Run instance recycles during a 4-hour job, the agent effectively "dies," leaving orphaned BigQuery jobs and no record of its intent. We need a way to hibernate the *brain*, not just the *transcript*.

---

## 4. Architecture Feedback
## 2. The Solution: Two-Phase Commit Persistence

### 4.1 Strengths
We introduce a **Two-Phase Commit** mechanism to ensure every research step is durably persisted before the agent hibernates. This extends `ResumabilityConfig` to support cross-process durability.

1. **Clear separation of control plane (BQ) vs data plane (GCS)** - follows Google best practices
2. **Logical checkpointing over heap snapshots** - pragmatic and maintainable
3. **Two-phase commit pattern** - ensures atomic visibility
4. **Authoritative reconciliation** - critical for BigQuery job scenarios
5. **Good competitive analysis** (Section 14)
### 2.1 Architecture

### 4.2 Gaps / Missing Considerations
1. **Phase 1 (GCS - Data Plane):** The agent serializes its "Research Notebook" (partial drafts, URL ledgers, and reasoning state) to GCS. This handles the bulk state (>1MB).
2. **Phase 2 (BigQuery - Control Plane):** A metadata row is inserted into a new `checkpoints` table. **Crucially, a checkpoint is only considered "live" once this row commits.** This ensures atomic visibility and prevents "half-saved" states.
3. **Hibernation:** The agent releases all compute resources. Cost drops to near zero.

| Gap | Impact | Suggested Action |
|-----|--------|------------------|
| No mention of existing `ResumabilityConfig` | Misleading problem statement | Add section on existing capability |
| No cost estimates for BQ storage/queries | Budget planning | Add rough estimates |
| No mention of BQ quota limits | Operational risk | Document relevant quotas |
| Checkpoint versioning migration strategy | Future maintenance | Expand Section 16.2 |
| No monitoring/alerting design | Operability | Add observability section |
| No rollback strategy | Safety | Document how to rollback |
### 2.2 Authoritative Reconciliation

### 4.3 API Contract Review
Upon waking (via Pub/Sub), the agent does not blindly trust the event stream. It performs a **Deterministic Sync**:

The proposed `CheckpointableAgentState` interface is clean:

```python
class CheckpointableAgentState:
  def export_state(self) -> dict: ...
  def import_state(self, state: dict) -> None: ...
```
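A minimal hypothetical implementation of that interface, showing the agent's "mental model" (job ledger, remaining plan) round-tripping through a plain JSON-serializable dict; the class and field names are illustrative only:

```python
import json


class ResearchAgentState:
    """Hypothetical implementation of the proposed interface: the
    execution state serializes to a plain dict so it can be written to
    GCS and restored after a process restart."""

    def __init__(self):
        self.job_ledger = {}  # job_id -> status, e.g. {"504": "RUNNING"}
        self.plan = []        # remaining research steps

    def export_state(self) -> dict:
        return {"job_ledger": dict(self.job_ledger), "plan": list(self.plan)}

    def import_state(self, state: dict) -> None:
        self.job_ledger = dict(state.get("job_ledger", {}))
        self.plan = list(state.get("plan", []))
```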

**Suggestion:** Consider alignment with existing ADK patterns:
- Existing `BaseAgentState` in `src/google/adk/agents/base_agent.py`
- Existing state patterns in `src/google/adk/sessions/state.py`
* It loads the checkpoint from GCS.
* It queries `INFORMATION_SCHEMA.JOBS` to verify the actual status of delegated tasks.
* It updates its internal ledger (e.g., marking "RUNNING" tasks as "FAILED" if BQ reports a quota error) before generating the next step.
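The reconciliation step above might look like the following sketch. The `query` callable and the simplified state values are assumptions (in real BigQuery, failed jobs report `state = 'DONE'` with `error_result` set, which a production version would inspect):

```python
def reconcile_ledger(ledger, query):
    """Deterministic sync: instead of trusting the event stream, ask
    BigQuery which delegated jobs actually finished. `query` stands in
    for a client running the INFORMATION_SCHEMA lookup below and
    returning {job_id: state} for the given ids."""
    sql = """
        SELECT job_id, state, error_result
        FROM `region-us`.INFORMATION_SCHEMA.JOBS
        WHERE job_id IN UNNEST(@job_ids)
    """
    pending = [j for j, s in ledger.items() if s == "RUNNING"]
    actual = query(sql, job_ids=pending)
    for job_id in pending:
        # Authoritative: BQ's answer overwrites whatever the ledger believed.
        ledger[job_id] = actual.get(job_id, "UNKNOWN")
    return ledger
```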

---

## 5. Specific Line-by-Line Feedback
## 3. CUJ: Autonomous Deep Research & Trend Analysis

### Section 0 (Executive Summary)
- Line 14: "12-minute barrier" - should cite source or clarify this is environment-specific
- Line 28: Cost estimate "< $0.01/session-day paused" - show calculation
**Persona:** Strategic Market Analyst
**User Story:** *"As an analyst, I want to trigger a deep research mission. I want the agent to scan three years of financial filings, monitor news for 48 hours, and synthesize a strategy memo without me keeping a tab open."*

### Section 2 (Problem Statement)
- **Major revision needed** - must acknowledge existing resumability
### Agent-Based Journey:

### Section 4.1 (States)
- Consider: should PAUSED be a first-class `Session.status` field or remain at `InvocationContext` level?
1. **Initiation:** A Cloud Scheduler triggers the agent.
* **Mission:** "Analyze `global_financials` for R&D trends. Monitor `live_news_stream` for 24h."

### Section 8 (API Extensions)
- `checkpoint_policy` options are good, but:
- What triggers `superstep`?
- How does `manual` interact with `long_running_tool_ids`?

### Section 13 (Moltbot Alignment)
- Moltbot reference is useful context
- Consider adding link/citation if public

### Section 18 (Open Questions)
- Good list, but add: "How does this integrate with existing `ResumabilityConfig`?"
2. **Execution & Hibernation:** The agent submits 30 complex BigQuery jobs. It performs a **Two-Phase Commit** to save its state to GCS/BQ, then enters **PAUSED**.
3. **Event-Driven Resume:** A Pub/Sub message (BQ Job Complete) triggers the **Resume Service**.
4. **Reconciliation:** The agent acquires a **Lease** (preventing race conditions), loads the checkpoint, and reconciles its ledger. It finds 28 successes and 2 failures, schedules retries, and drafts the report.
5. **Outcome:** 24 hours later, a final Markdown report is written to the `executive_briefings` table.

---

## 6. Recommended Document Changes

### High Priority (Must Fix)
## 4. System Design & Schemas

### 4.1 BigQuery Schema (Control Plane)

*Additions based on Technical Review:*

```sql
-- Checkpoints Table (Control Plane)
CREATE TABLE `adk_metadata.checkpoints` (
  session_id STRING NOT NULL,
  checkpoint_seq INT64 NOT NULL,
  gcs_state_uri STRING NOT NULL,
  sha256 STRING NOT NULL,  -- Integrity check
  created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP(),
  agent_state_summary JSON,  -- For querying "what was the agent thinking?"
  PRIMARY KEY (session_id, checkpoint_seq) NOT ENFORCED
);

-- Events Table (Trigger Log)
CREATE TABLE `adk_metadata.events` (
  event_id STRING NOT NULL,
  session_id STRING NOT NULL,
  event_type STRING,  -- e.g., 'BQ_JOB_DONE', 'PUB_SUB_MSG'
  payload JSON,
  processed BOOL DEFAULT FALSE,
  PRIMARY KEY (event_id) NOT ENFORCED
);
```

1. **Add Section 1.3: "Existing ADK Resumability"**
   - Document current `ResumabilityConfig` capability
   - Explain limitations this design addresses
   - Position proposal as extension, not replacement

2. **Revise Section 2 (Problem Statement)**
   - Remove/qualify claims about ADK lacking pause/resume
   - Focus on actual gaps: cross-process durability, external event triggers, enterprise audit

3. **Add explicit integration plan**
- How does `CheckpointableAgentState` relate to `BaseAgentState`?
- Migration path from current resumability to new design
### 4.2 Cost Estimation

### Medium Priority
By shifting from "Idle Compute" to "Durable Pause":

4. Add cost estimation section
5. Add monitoring/observability design
6. Add rollback/recovery procedures
7. Fix SQL schema issues (PKs)
* **Compute:** Drops from ~$1.40/day (Cloud Run idle) to **$0**.
* **Storage:** BQ Rows + GCS Blobs cost **<$0.01/day** per active session.
* **Total Savings:** >99% reduction for long-horizon tasks.
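Checking the arithmetic behind the savings claim, using the figures above:

```python
idle_compute_per_day = 1.40   # ~$/day, Cloud Run kept warm (figure from this doc)
durable_pause_per_day = 0.01  # <$/day, BQ rows + GCS blobs (figure from this doc)

# Fraction of daily cost eliminated by hibernating instead of idling.
savings = 1 - durable_pause_per_day / idle_compute_per_day
print(f"{savings:.1%}")  # → 99.3%
```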

### Low Priority
### 4.3 Rollback & Safety

8. Add Moltbot citation if available
9. Add BQ quota documentation links
10. Consider adding architecture diagram (beyond Mermaid sequence)
* **Leasing:** We use a BQ `active_lease_id` column with a 5-minute TTL to ensure only one runner owns a session.
* **Rollback:** If a runner crashes *during* a checkpoint write, the Phase 2 (BQ commit) never happens. The system simply resumes from the *previous* valid checkpoint, ensuring zero corruption.
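A sketch of the best-effort lease under those assumptions; the in-memory dict stands in for a conditional UPDATE on the sessions table's `active_lease_id` and expiry columns, so this illustrates the logic rather than the real BQ transaction:

```python
import time
import uuid


def try_acquire_lease(leases, session_id, ttl_seconds=300, now=None):
    """Best-effort optimistic lease with a 5-minute TTL, as described
    above. `leases` maps session_id -> (lease_id, expires_at); a real
    implementation would do this as a single conditional UPDATE."""
    now = time.time() if now is None else now
    current = leases.get(session_id)
    if current is not None and current[1] > now:
        return None  # another runner holds an unexpired lease
    lease_id = str(uuid.uuid4())
    leases[session_id] = (lease_id, now + ttl_seconds)
    return lease_id
```

A second runner arriving before the TTL expires gets `None` and backs off; after expiry the lease can be stolen, which is why the design calls this best-effort.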

---

## 7. Summary Table

| Category | Status | Details |
|----------|--------|---------|
| External URLs | VALID | All 7 references work |
| SQL Syntax | VALID with issues | Missing PKs, edge cases |
| Python Code | VALID | Sound patterns |
| Problem Statement | INACCURATE | Ignores existing resumability |
| Architecture | SOUND | Good Google-scale patterns |
| Completeness | GAPS | Missing cost, monitoring, rollback |

---
## Q&A

## 8. Conclusion
**Why can't we just use the existing BQ Session Storage?**
Existing BQ storage saves the *transcript*. If an agent has processed 500 documents and built a complex internal mental model, reloading the transcript and asking the LLM to "re-read" everything to rebuild that mental model is slow, expensive, and error-prone. This proposal saves the *mental model itself*.

This is a **solid technical design** for extending ADK's capabilities for long-running BigQuery workloads. The core architecture (BQ control plane, GCS data plane, two-phase commit, authoritative reconciliation) is well-reasoned.

**However, the document cannot be approved in its current form** because it misrepresents ADK's existing capabilities. Once the existing `ResumabilityConfig` is acknowledged and the document is repositioned as an extension rather than a new capability, it will be ready for technical review.

**Recommended Next Steps:**
1. Revise document to acknowledge existing resumability
2. Add cost/monitoring sections
3. Fix SQL schema issues
4. Re-submit for review

---
**What happens if the Resume Service fails?**
The architecture is idempotent. Pub/Sub will redeliver the message. The Resume Service will check the BQ `events` table; if the event is already marked `processed`, it ignores it. If not, it acquires the lease and proceeds.
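That dedup check can be sketched as follows, with a dict standing in for the BQ `events` table:

```python
def handle_event(events, event_id, process):
    """Idempotent resume handling: Pub/Sub may redeliver, so the handler
    first checks the events table's `processed` flag. `events` is a
    stand-in mapping event_id -> {"processed": bool, ...}; `process`
    would acquire the lease, load the checkpoint, and reconcile."""
    row = events.get(event_id)
    if row is None or row["processed"]:
        return False          # duplicate or unknown delivery: ignore
    process(row)
    row["processed"] = True   # redeliveries are now no-ops
    return True
```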

*Review generated by Claude Code on 2026-02-01*
**Is this only for BigQuery?**
No. While BigQuery is the control plane, this pattern works for any long-running async task (e.g., waiting for human approval, video rendering, or external API webhooks).