@manavgup

Summary

Add comprehensive architecture documentation for the Agentic RAG Platform. These documents
establish the design foundation for transforming RAG Modulo into a fully agentic system.

Documents Added

| Document | Lines | Description |
|---|---|---|
| agentic-ui-architecture.md | ~1,470 | React component hierarchy, state management, API integration |
| backend-architecture-diagram.md | ~510 | Backend architecture with Mermaid diagrams |
| mcp-integration-architecture.md | ~200 | MCP client/server strategy, PR comparison |
| rag-modulo-mcp-server-architecture.md | ~450 | RAG as MCP server (tools, resources, auth) |
| search-agent-hooks-architecture.md | ~410 | 3-stage agent pipeline architecture |
| system-architecture.md | ~410 | Complete system architecture overview |

Total: ~3,450 lines of documentation

Architecture Highlights

3-Stage Agent Pipeline (search-agent-hooks-architecture.md)

User Query → Pre-Search Agents → RAG Search → Post-Search Agents → Generation → Response Agents → Final Response
  • Pre-search: Query expansion, translation, intent classification
  • Post-search: Re-ranking, deduplication, enrichment
  • Response: Artifact generation (PowerPoint, PDF, charts) in parallel
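
As a rough illustration of the flow above, the three stages can be sketched as lists of callables around a search step. This is a minimal sketch, not the design in search-agent-hooks-architecture.md: `PipelineContext`, `run_pipeline`, `normalize_query`, and `deduplicate` are hypothetical names, and the thread pool is only one assumed way to run the response agents in parallel.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass, field
from typing import Callable


@dataclass
class PipelineContext:
    """State handed from stage to stage (hypothetical shape)."""

    query: str
    documents: list[str] = field(default_factory=list)
    answer: str = ""
    artifacts: list[str] = field(default_factory=list)


StageAgent = Callable[[PipelineContext], PipelineContext]
ResponseAgent = Callable[[PipelineContext], str]


def run_pipeline(
    query: str,
    pre_search: list[StageAgent],
    search: Callable[[str], list[str]],
    post_search: list[StageAgent],
    generate: Callable[[PipelineContext], str],
    response_agents: list[ResponseAgent],
) -> PipelineContext:
    ctx = PipelineContext(query=query)
    for agent in pre_search:  # stage 1: rewrite or classify the query
        ctx = agent(ctx)
    ctx.documents = search(ctx.query)
    for agent in post_search:  # stage 2: refine the retrieved documents
        ctx = agent(ctx)
    ctx.answer = generate(ctx)
    if response_agents:  # stage 3: build artifacts in parallel
        with ThreadPoolExecutor(max_workers=len(response_agents)) as pool:
            # pool.map keeps results in the same order as response_agents
            ctx.artifacts = list(pool.map(lambda agent: agent(ctx), response_agents))
    return ctx


def normalize_query(ctx: PipelineContext) -> PipelineContext:
    """Toy pre-search agent: trim and lowercase the query."""
    ctx.query = ctx.query.strip().lower()
    return ctx


def deduplicate(ctx: PipelineContext) -> PipelineContext:
    """Toy post-search agent: drop duplicate documents, keeping first occurrences."""
    ctx.documents = list(dict.fromkeys(ctx.documents))
    return ctx
```

Keeping the response agents side-effect free (each returns its artifact rather than mutating the context) is what makes the parallel step safe in this sketch.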

MCP Integration (mcp-integration-architecture.md)

  • RAG Modulo as MCP Client: Consume external tools via Context Forge
  • RAG Modulo as MCP Server: Expose rag_search, rag_ingest, etc. to Claude Desktop

Agentic UI (agentic-ui-architecture.md)

  • Agent configuration per collection
  • Artifact display in search results
  • Real-time pipeline status
  • Agent marketplace and dashboard

Implementation Roadmap

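These documents guide implementation of:

  • PR #695 (SPIFFE/SPIRE agent identity)
  • PR #671 (MCP Gateway client)
  • Issue #697 (Agent execution hooks)
  • Issue #698 (MCP Server)
  • Issue #699 (Agentic UI)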

Test Plan

  • All markdown files lint-clean (markdownlint passed)
  • Cross-references between documents are valid
  • Mermaid diagrams render correctly in GitHub
  • Team review for architectural decisions

Closes #696

🤖 Generated with Claude Code

claude and others added 8 commits November 26, 2025 20:27
Add environment variables to support SPIFFE workload identity integration
for AI agents and services. This enables cryptographic machine identity
with configurable migration phases:

- SPIFFE_ENABLED: Toggle SPIFFE integration
- SPIFFE_AUTH_MODE: Migration phases (disabled→optional→preferred→required)
- SPIFFE_ENDPOINT_SOCKET: SPIRE Agent Workload API socket
- SPIFFE_TRUST_DOMAIN: Trust domain for identity hierarchy
- SPIFFE_LEGACY_JWT_WARNING: Track legacy auth usage during migration
- SPIFFE_SVID_TTL_SECONDS: Certificate lifetime configuration
- SPIFFE_JWT_AUDIENCES: Allowed JWT-SVID audiences
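
For illustration, the variables above could be read into a typed settings object along these lines. This is a sketch under stated assumptions, not the project's SPIFFEConfig: `SpiffeSettings`, `AuthMode`, and `load_settings` are hypothetical names, every default value is an assumption, and the comma-separated format for SPIFFE_JWT_AUDIENCES is assumed rather than taken from the commit.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Mapping


class AuthMode(str, Enum):
    """Migration phases listed for SPIFFE_AUTH_MODE."""

    DISABLED = "disabled"
    OPTIONAL = "optional"
    PREFERRED = "preferred"
    REQUIRED = "required"


@dataclass(frozen=True)
class SpiffeSettings:
    enabled: bool
    auth_mode: AuthMode
    endpoint_socket: str
    trust_domain: str
    legacy_jwt_warning: bool
    svid_ttl_seconds: int
    jwt_audiences: tuple[str, ...]


def _as_bool(value: str) -> bool:
    return value.strip().lower() in {"1", "true", "yes", "on"}


def load_settings(env: Mapping[str, str]) -> SpiffeSettings:
    """Build settings from an environment-like mapping; all defaults are assumptions."""
    raw_audiences = env.get("SPIFFE_JWT_AUDIENCES", "")
    audiences = tuple(a.strip() for a in raw_audiences.split(",") if a.strip())
    return SpiffeSettings(
        enabled=_as_bool(env.get("SPIFFE_ENABLED", "false")),
        # An unknown mode raises ValueError instead of silently falling back.
        auth_mode=AuthMode(env.get("SPIFFE_AUTH_MODE", "disabled").strip().lower()),
        endpoint_socket=env.get("SPIFFE_ENDPOINT_SOCKET", "unix:///tmp/spire-agent/public/api.sock"),
        trust_domain=env.get("SPIFFE_TRUST_DOMAIN", "rag-modulo.example.com"),
        legacy_jwt_warning=_as_bool(env.get("SPIFFE_LEGACY_JWT_WARNING", "true")),
        svid_ttl_seconds=int(env.get("SPIFFE_SVID_TTL_SECONDS", "3600")),
        jwt_audiences=audiences,
    )
```

In an application this would typically be called once at startup as `load_settings(os.environ)`.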

Related to: MCP Context Forge integration (PR #684)
This architecture document outlines how to integrate SPIRE (SPIFFE Runtime
Environment) into RAG Modulo to provide cryptographic workload identities
for AI agents. This enables zero-trust agent authentication and secure
agent-to-agent (A2A) communication.

Key architectural decisions:
- JWT-SVIDs for stateless verification (vs X.509 for mTLS)
- Trust domain: spiffe://rag-modulo.example.com
- Integration with IBM MCP Context Forge (PR #684)
- Capability-based access control for agents
- 5-phase implementation plan

Agent types defined:
- search-enricher: MCP tool invocation
- cot-reasoning: Chain of Thought orchestration
- question-decomposer: Query decomposition
- source-attribution: Document source tracking
- entity-extraction: Named entity recognition
- answer-synthesis: Answer generation

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
This commit implements the SPIFFE/SPIRE integration for AI agent
authentication as designed in docs/architecture/spire-integration-architecture.md.

Key changes:
- Add py-spiffe dependency for SPIFFE JWT-SVID support
- Create core SPIFFE authentication module (spiffe_auth.py) with:
  - SPIFFEConfig for environment-based configuration
  - AgentPrincipal dataclass for authenticated agent identity
  - SPIFFEAuthenticator for JWT-SVID validation
  - AgentType and AgentCapability enums
  - Helper functions for SPIFFE ID parsing and building
- Create Agent data model with SQLAlchemy:
  - Agent model with SPIFFE ID, type, capabilities, status
  - Relationships to User (owner) and Team
  - Status management (active, suspended, revoked)
- Add Agent repository, service, and router layers:
  - Full CRUD operations for agents
  - Agent registration with SPIFFE ID generation
  - Status and capability management
  - JWT-SVID validation endpoint
- Extend AuthenticationMiddleware to detect and validate SPIFFE JWT-SVIDs
- Add SPIRE deployment configuration templates:
  - server.conf, agent.conf for SPIRE configuration
  - docker-compose.spire.yml for local development
  - README.md with deployment instructions
- Add comprehensive unit tests for all SPIFFE components
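
The SPIFFE ID parsing and building helpers mentioned above might look roughly like the following. This is an illustrative sketch only: the `/agent/<type>/<instance>` path layout and the names `ParsedSpiffeId`, `build_spiffe_id`, and `parse_spiffe_id` are assumptions, not the contents of spiffe_auth.py.

```python
from dataclasses import dataclass
from urllib.parse import urlsplit


@dataclass(frozen=True)
class ParsedSpiffeId:
    trust_domain: str
    agent_type: str
    instance_id: str


def build_spiffe_id(trust_domain: str, agent_type: str, instance_id: str) -> str:
    """Compose an ID using the assumed /agent/<type>/<instance> layout."""
    return f"spiffe://{trust_domain}/agent/{agent_type}/{instance_id}"


def parse_spiffe_id(spiffe_id: str) -> ParsedSpiffeId:
    """Split a SPIFFE ID into its parts, rejecting anything outside the assumed layout."""
    parts = urlsplit(spiffe_id)
    if parts.scheme != "spiffe" or not parts.netloc:
        raise ValueError(f"not a SPIFFE ID: {spiffe_id!r}")
    segments = [segment for segment in parts.path.split("/") if segment]
    if len(segments) != 3 or segments[0] != "agent":
        raise ValueError(f"unexpected SPIFFE ID path: {parts.path!r}")
    return ParsedSpiffeId(parts.netloc, segments[1], segments[2])
```

Round-tripping through `build_spiffe_id` and `parse_spiffe_id` gives a cheap unit test for whichever path layout the project settles on.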

Reference: PR #695

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Critical fixes:
- Add database migration for agents table (migrations/add_agents_table.sql)
- Fix signature verification security: failed validation now always rejects
  (prevents fallback bypass attack)
- Fix timezone handling: use UTC consistently for JWT timestamps

Improvements:
- Align env vars with .env.example (SPIFFE_JWT_AUDIENCES, SPIFFE_SVID_TTL_SECONDS)
- Add capability enforcement decorator (require_capabilities)
- Add OpenAPI tags metadata for agents endpoint
- Update and expand unit tests (47 tests passing)

Addresses review comments from PR #695.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
…served word

SQLAlchemy's Declarative API reserves the 'metadata' attribute name.
Renamed the field to 'agent_metadata' in the model while keeping the
database column name as 'metadata' via explicit column name mapping.

This also updates the schema to use validation_alias for proper
model_validate() from ORM objects.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
The test_validate_jwt_svid_valid test was failing because AgentPrincipal
requires a trust_domain field which was not being provided.

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Critical fixes:
- Fix timezone-naive datetime to use UTC throughout (agent.py, agent_repository.py)
- Change default agent status from ACTIVE to PENDING for approval workflow
- Add RuntimeError when SPIFFE enabled but py-spiffe library missing
- Restrict trust domain to configured value only (security fix)

High priority security fixes:
- Add capability validation per agent type (ALLOWED_CAPABILITIES_BY_TYPE)
- Add authentication requirement to SPIFFE validation endpoint
- Reject user-specified trust domains that don't match server config
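
The two validation rules above (per-type capability allow-lists and trust domain pinning) can be illustrated with a short sketch. The agent type values come from the architecture commit earlier in this PR; the capability names, the contents of the mapping, and the function names are assumptions made for the example.

```python
from enum import Enum
from typing import Iterable, Optional


class AgentType(str, Enum):
    SEARCH_ENRICHER = "search-enricher"
    COT_REASONING = "cot-reasoning"
    QUESTION_DECOMPOSER = "question-decomposer"
    SOURCE_ATTRIBUTION = "source-attribution"
    ENTITY_EXTRACTION = "entity-extraction"
    ANSWER_SYNTHESIS = "answer-synthesis"


# Hypothetical capability names; the real mapping lives in the PR's code.
ALLOWED_CAPABILITIES_BY_TYPE: dict[AgentType, frozenset[str]] = {
    AgentType.SEARCH_ENRICHER: frozenset({"search:read", "mcp:tool_invoke"}),
    AgentType.COT_REASONING: frozenset({"search:read", "llm:invoke"}),
    AgentType.ANSWER_SYNTHESIS: frozenset({"llm:invoke"}),
}


def validate_capabilities(agent_type: AgentType, requested: Iterable[str]) -> set[str]:
    """Return the requested set, or raise if any capability is not allowed for this type."""
    allowed = ALLOWED_CAPABILITIES_BY_TYPE.get(agent_type, frozenset())
    requested_set = set(requested)
    rejected = sorted(requested_set - allowed)
    if rejected:
        raise ValueError(f"{agent_type.value} may not hold: {', '.join(rejected)}")
    return requested_set


def resolve_trust_domain(requested: Optional[str], configured: str) -> str:
    """Accept only the server-configured trust domain; an omitted value defaults to it."""
    if requested is not None and requested != configured:
        raise ValueError(f"trust domain {requested!r} does not match server configuration")
    return configured
```

Types absent from the mapping get an empty allow-list here, so a newly added agent type is denied everything until someone lists its capabilities explicitly.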

Code quality improvements:
- Add OpenAPI tags metadata for agent router documentation
- Fix require_capabilities decorator type hints (ParamSpec, TypeVar)
- Add composite database indexes (owner+status, type+status, team+status)
- Update migration script with new composite indexes

Test updates:
- Update test_register_agent_with_custom_trust_domain to verify rejection
- Fix test_authenticator_creates_principal_with_fallback to mock spiffe module

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
Add comprehensive architecture documentation for the Agentic RAG Platform:

- agentic-ui-architecture.md: React component hierarchy, state management,
  and API integration for agent features
- backend-architecture-diagram.md: Overall backend architecture with
  Mermaid diagrams showing service layers and data flow
- mcp-integration-architecture.md: MCP client/server integration strategy,
  PR comparison (#671 vs #684), and Context Forge integration
- rag-modulo-mcp-server-architecture.md: Exposing RAG capabilities as MCP
  server with tools (rag_search, rag_ingest, etc.) and resources
- search-agent-hooks-architecture.md: 3-stage agent pipeline (pre-search,
  post-search, response) with database schema and execution flow
- system-architecture.md: Complete system architecture overview with
  technology stack and data flows

These documents guide implementation of:
- PR #695 (SPIFFE/SPIRE agent identity)
- PR #671 (MCP Gateway client)
- Issue #697 (Agent execution hooks)
- Issue #698 (MCP Server)
- Issue #699 (Agentic UI)

Closes #696

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>
@manavgup manavgup added documentation Improvements or additions to documentation priority:high High priority - important for release architecture Architectural decisions and design choices labels Nov 27, 2025
@github-actions

🚀 Development Environment Options

This repository supports Dev Containers for a consistent development environment.

Option 1: GitHub Codespaces (Recommended)

Create a cloud-based development environment:

  1. Click the green Code button above
  2. Select the Codespaces tab
  3. Click Create codespace on docs/agentic-architecture-696
  4. Wait 2-3 minutes for environment setup
  5. Start coding with all tools pre-configured!

Option 2: VS Code Dev Containers (Local)

Use Dev Containers on your local machine:

  1. Install Docker Desktop
  2. Install VS Code
  3. Install the Dev Containers extension
  4. Clone this PR branch locally
  5. Open in VS Code and click "Reopen in Container" when prompted

Option 3: Traditional Local Setup

Set up the development environment manually:

# Clone the repository
git clone https://github.com/manavgup/rag_modulo.git
cd rag_modulo
git checkout docs/agentic-architecture-696

# Initialize development environment
make dev-init
make dev-build
make dev-up
make dev-validate

Available Commands

Once in your development environment:

make help           # Show all available commands
make dev-validate   # Validate environment setup
make test-atomic    # Run atomic tests
make test-unit      # Run unit tests
make lint           # Run linting

Services Available

When running make dev-up:


This automated message helps reviewers quickly set up the development environment.

@github-actions

Pull Request Review: Agentic RAG Architecture Documentation & SPIFFE/SPIRE Integration

Overview

This PR introduces comprehensive architecture documentation and a complete SPIFFE/SPIRE identity integration for agent workload authentication. The implementation is ambitious, adding ~9,673 lines across 29 files including core infrastructure, database models, services, and extensive documentation.


✅ Strengths

1. Comprehensive Architecture Documentation

  • Excellent documentation suite covering all aspects of the agentic RAG system
  • Well-structured Mermaid diagrams for visual architecture representation
  • Clear cross-references between documents
  • Strong alignment with CLAUDE.md guidelines

2. Security-First Design

  • Proper SPIFFE/SPIRE integration following CNCF standards
  • Zero-trust architecture with cryptographic workload identities
  • Capability-based access control system
  • Multi-layer defense with JWT-SVID validation
  • Security-conscious trust domain restrictions in agent_service.py:84-92

3. Production-Ready Code Quality

  • Comprehensive test coverage (772 lines for spiffe_auth, 470 lines for agent_service)
  • Proper error handling with custom exceptions
  • Type hints throughout the codebase
  • Database migrations with rollback support
  • Well-documented code with docstrings

4. Database Design Excellence

  • Proper composite indexes for query optimization (agent.py:70-74)
  • GIN index for JSONB capabilities for efficient containment queries
  • Automatic updated_at trigger
  • Proper foreign key constraints with CASCADE/SET NULL

5. Clean Architecture

  • Clear separation: Models → Repository → Service → Router
  • Dependency injection pattern
  • Repository pattern for data access
  • Pydantic schemas for validation

🔍 Issues & Recommendations

CRITICAL Issues

1. Missing Dependency: py-spiffe

Location: backend/core/spiffe_auth.py:342, pyproject.toml

Issue: The code imports py-spiffe library but it's not added to pyproject.toml:

from spiffe import JwtSource, WorkloadApiClient  # type: ignore[import-not-found]

Impact: Runtime ImportError when SPIFFE is enabled

Fix Required:

poetry add py-spiffe
poetry lock

Verification: Check that poetry.lock was regenerated after adding dependency


2. Migration Script Missing psycopg2 Dependency Check

Location: migrations/apply_agents_migration.py:14

Issue: Script imports psycopg2 without try/except or dependency declaration

import psycopg2
from dotenv import load_dotenv

Impact: Migration will fail if psycopg2 not installed

Recommendation: Add error handling:

import sys

try:
    import psycopg2
except ImportError:
    # Fail fast with an actionable message instead of a bare traceback
    print("ERROR: psycopg2 is required. Install with: pip install psycopg2-binary", file=sys.stderr)
    sys.exit(1)

3. Security: Signature Validation Fallback

Location: backend/core/spiffe_auth.py:477-487

Issue: The fallback mode accepts tokens without signature validation in development:

if self.config.fallback_to_jwt:
    logger.warning(
        "SPIRE unavailable, accepting token without signature validation. "
        "This is ONLY safe in development environments."
    )

Concern: While the security note is present, this could be dangerous if accidentally enabled in production.

Recommendation: Add environment check:

if self.config.fallback_to_jwt:
    if os.getenv("ENVIRONMENT", "development") == "production":
        logger.error("SPIRE unavailable in production. Fallback disabled for security.")
        return None
    logger.warning("...")

HIGH Priority Issues

4. Authentication Middleware: Agent vs User Confusion

Location: backend/core/authentication_middleware.py:242-244

Issue: Agent authentication sets request.state.user for backward compatibility:

agent_data = {...}
request.state.user = agent_data  # For backward compatibility

Concern: This violates the principle of least surprise. Downstream code checking request.state.user might not expect an agent object.

Recommendation:

  • Add request.state.principal for both users and agents
  • Keep request.state.user only for actual users
  • Update downstream code to check principal first, then user

5. Race Condition in Agent Registration

Location: backend/rag_solution/services/agent_service.py:80-81

Issue: Agent instance ID uses UUID prefix without checking uniqueness:

agent_instance_id = str(uuid.uuid4())[:8]

Concern: While collision probability is low, there's no database uniqueness check.

Recommendation: Either:

  • Use full UUID for agent_instance_id
  • Add a uniqueness retry loop (max 3 attempts)
  • Add unique constraint in SPIFFE ID generation
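
To put numbers on the concern: an 8-character hex prefix carries 32 bits, so by the birthday approximation the chance of at least one collision among 10,000 agents is roughly 1%. The sketch below shows that estimate together with the retry-loop option. It is illustrative only: `exists` is a hypothetical lookup callback, and a check-then-insert loop like this still needs a database unique constraint to be safe under concurrent registrations.

```python
import math
import uuid
from typing import Callable


def collision_probability(n_agents: int, id_bits: int = 32) -> float:
    """Birthday-bound estimate of at least one collision among n random IDs."""
    return 1.0 - math.exp(-n_agents * (n_agents - 1) / (2 * 2**id_bits))


def generate_instance_id(exists: Callable[[str], bool], max_attempts: int = 3) -> str:
    """Draw 8-hex-character IDs until one is unused, giving up after max_attempts."""
    for _ in range(max_attempts):
        candidate = uuid.uuid4().hex[:8]  # same characters as str(uuid.uuid4())[:8]
        if not exists(candidate):
            return candidate
    raise RuntimeError(f"no unique instance ID found in {max_attempts} attempts")
```

Using the full UUID instead makes the retry loop unnecessary in practice, at the cost of longer SPIFFE IDs.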

6. Missing Index on last_seen_at

Location: migrations/add_agents_table.sql

Issue: last_seen_at is used for activity tracking but lacks an index.

Use Case: Queries like "find inactive agents" will do full table scans.

Recommendation: Add:

CREATE INDEX IF NOT EXISTS idx_agents_last_seen_at ON agents(last_seen_at DESC) 
WHERE last_seen_at IS NOT NULL;

7. SPIRE Docker Compose Not Integrated

Location: deployment/spire/docker-compose.spire.yml

Issue: This is a standalone compose file, not integrated with main docker-compose.yml.

Impact: Developers won't know how to run SPIRE with local dev

Recommendation:

  • Add docker-compose.override.yml example
  • Document integration in CLAUDE.md
  • Add make local-dev-spire command

MEDIUM Priority Issues

8. Inconsistent Enum Definitions

Location: Multiple files

Issue: AgentType and AgentCapability are each defined in two modules, and the tests import from both:

  • backend/core/spiffe_auth.py (core)
  • backend/rag_solution/schemas/agent_schema.py (API layer)

Concern: Potential inconsistency and maintenance burden

Recommendation:

  • Keep single source of truth in core/spiffe_auth.py
  • Import from there in schemas
  • OR create core/agent_types.py for shared types

9. Missing Error Handling for JWT Decode

Location: backend/core/spiffe_auth.py:433-437

Issue: JWT decode in is_spiffe_jwt_svid() has bare except Exception:

try:
    unverified = jwt.decode(token, options={"verify_signature": False})
    ...
except Exception:
    return False

Concern: Masks all errors, including programming errors

Recommendation: Be specific:

except (jwt.DecodeError, jwt.InvalidTokenError):
    return False
except Exception as e:
    logger.error(f"Unexpected error checking SPIFFE JWT-SVID: {e}")
    return False
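
The same "catch only what you expect" idea can be shown without PyJWT. The sketch below is a standard-library substitute for the unverified decode, not the code in spiffe_auth.py: it inspects only the `sub` claim, which in a JWT-SVID carries the SPIFFE ID, and it performs no signature validation, so it is suitable only for routing a token to the right validator. The name `looks_like_jwt_svid` is hypothetical.

```python
import base64
import json


def looks_like_jwt_svid(token: str) -> bool:
    """Cheap pre-check: does the unverified payload carry a spiffe:// subject?

    This does NOT validate the signature; it only decides which validator to use.
    """
    parts = token.split(".")
    if len(parts) != 3:
        return False
    payload_b64 = parts[1]
    padded = payload_b64 + "=" * (-len(payload_b64) % 4)  # restore stripped padding
    try:
        claims = json.loads(base64.urlsafe_b64decode(padded))
    except ValueError:
        # Covers bad base64 (binascii.Error), bad UTF-8, and bad JSON,
        # all of which are ValueError subclasses. Other exceptions propagate.
        return False
    if not isinstance(claims, dict):
        return False
    subject = claims.get("sub")
    return isinstance(subject, str) and subject.startswith("spiffe://")
```

Because only `ValueError` is caught, a genuine programming error such as passing `None` as the token surfaces as an exception instead of being reported as "not a SPIFFE token".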

10. Repository Error Handling - Lost Context

Location: backend/rag_solution/repository/agent_repository.py:82-85

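Issue: Generic catch-all wraps every failure, including programming errors, in a RepositoryError. The original exception is still chained via `from e`, but callers can no longer tell a database failure from a bug: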

except Exception as e:
    self.db.rollback()
    logger.error(f"Error creating agent: {e!s}")
    raise RepositoryError(f"Failed to create agent: {e!s}") from e

Recommendation: Handle specific exceptions:

except SQLAlchemyError as e:  # IntegrityError is a subclass, so it is covered here
    self.db.rollback()
    raise RepositoryError(f"Database error creating agent: {e!s}") from e

11. Missing API Documentation

Location: backend/rag_solution/router/agent_router.py

Issue: Endpoints lack OpenAPI examples in docstrings

Impact: API documentation will be less helpful

Recommendation: Add OpenAPI examples:

@router.post(
    "/register",
    response_model=AgentRegistrationResponse,
    responses={
        201: {"description": "Agent registered successfully"},
        400: {"description": "Invalid request", "model": ErrorResponse},
        409: {"description": "SPIFFE ID already exists"},
    }
)

12. Type Annotation: Self vs AgentRepository

Location: backend/rag_solution/repository/agent_repository.py:29

Issue: Using Any type hint for self:

def __init__(self: Any, db: Session) -> None:

Recommendation: Remove type hint (not needed) or use proper type:

def __init__(self, db: Session) -> None:

LOW Priority / Style Issues

13. Magic String for Trust Domain

Location: Multiple files

Issue: "rag-modulo.example.com" appears in multiple places

Recommendation: Use constant or config default


14. Logging: f-string vs Extra Context

Location: backend/rag_solution/services/agent_service.py:86

Issue: Uses f-string instead of structured logging:

logger.warning(f"Rejected trust domain '{request.trust_domain}'...")

Per CLAUDE.md Enhanced Logging guidelines:

logger.warning(
    "Rejected trust domain from user",
    extra={
        "requested_domain": request.trust_domain,
        "configured_domain": self._config.trust_domain,
        "user_id": str(owner_user_id)
    }
)

15. Test Fixture Naming

Location: tests/unit/services/test_agent_service.py:35-44

Issue: Mock fixtures lack docstrings

Recommendation: Add brief descriptions for clarity


📋 Testing Checklist

Required Before Merge:

  • Add py-spiffe to pyproject.toml and regenerate poetry.lock
  • Run full test suite (currently 947+ tests) with new changes
  • Verify migration script works on clean database
  • Test SPIFFE integration end-to-end with SPIRE containers
  • Security review of fallback authentication logic
  • Load testing for agent authentication at scale
  • Documentation review by team (per PR description test plan)

Recommended:

  • Add integration test for middleware agent authentication
  • Test migration rollback scenario
  • Verify all Mermaid diagrams render correctly on GitHub
  • Run security scanners (Bandit, Safety) on new code

🎯 Summary Assessment

Overall Rating: 8.5/10

Verdict: Approve with Required Changes

This is an exceptionally well-designed and implemented feature. The architecture is sound, the code quality is high, and the documentation is outstanding. However, there are critical dependency issues that must be resolved before merge.

Must Fix Before Merge:

  1. ✅ Add py-spiffe dependency to pyproject.toml
  2. ✅ Add psycopg2 import error handling to migration script
  3. ✅ Add production environment check to signature validation fallback
  4. ⚠️ Consider the agent/user authentication backward compatibility approach

Recommended for Follow-Up PR:

  • Refactor enum definitions to single source
  • Improve structured logging throughout
  • Add missing indexes (last_seen_at)
  • Integrate SPIRE docker-compose with main setup

📚 Alignment with Repository Standards

Follows CLAUDE.md guidelines:

  • Production-ready emphasis ✅
  • Comprehensive testing ✅
  • Clean architecture ✅
  • Type hints throughout ✅
  • Documentation excellence ✅

⚠️ Minor deviations:


🚀 Impact Assessment

Positive Impact:

Risk Assessment:

  • Medium Risk: Missing py-spiffe dependency will cause runtime failures
  • Low Risk: Migration script is well-tested and has rollback
  • Low Risk: Fallback authentication has clear security warnings

Migration Complexity: Medium

  • Requires new database table
  • Requires SPIRE infrastructure setup
  • Backward compatible with existing JWT auth

🤝 Recommendations

For Immediate Action:

  1. Add missing dependencies to pyproject.toml
  2. Test with make test-all to ensure no regressions
  3. Verify SPIRE Docker setup works with make local-dev-spire

For Documentation:

  1. Add SPIRE setup guide to CLAUDE.md under "Common Development Commands"
  2. Document agent registration workflow in getting-started docs

For Future Enhancement:

  1. Consider implementing automatic SPIRE registration via SPIRE Server API
  2. Add agent activity monitoring dashboard
  3. Implement agent credential rotation automation

Great work on this comprehensive feature! The architecture is solid and the implementation quality is high. Address the critical dependency issues and this will be ready to merge.

🤖 Generated with Claude Code

Co-Authored-By: Claude noreply@anthropic.com

@manavgup

Closing - branch included unrelated files from PR #695. Will recreate from clean main branch.

@manavgup manavgup closed this Nov 27, 2025

Labels

architecture Architectural decisions and design choices documentation Improvements or additions to documentation priority:high High priority - important for release

Development

Successfully merging this pull request may close these issues.

docs: Add agentic RAG architecture documentation

3 participants