@manavgup (Owner) commented Oct 4, 2025

No description provided.

## Overview
Implements complete Kubernetes/OpenShift deployment strategy with Helm charts,
auto-scaling, high availability, and comprehensive documentation.

## What's New

### Kubernetes Manifests
- ✅ Complete K8s manifests in `deployment/k8s/base/`
- ✅ Namespace, ConfigMaps, and Secrets templates
- ✅ StatefulSets for PostgreSQL, Milvus, MinIO, etcd
- ✅ Deployments for Backend, Frontend, MLFlow
- ✅ Services for all components
- ✅ Ingress with TLS and OpenShift Routes
- ✅ HorizontalPodAutoscaler for auto-scaling
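
As a rough sketch (not taken from the repo's deployment scripts), the base manifests could be applied directly with kubectl; the namespace name below is a placeholder:

```bash
# Minimal sketch, assuming the manifests in deployment/k8s/base/ can be applied as-is
kubectl apply -f deployment/k8s/base/
kubectl get pods -n rag-modulo   # "rag-modulo" is an assumed namespace name
```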

### Helm Chart
- ✅ Production-ready Helm chart in `deployment/helm/rag-modulo/`
- ✅ Environment-specific values (dev, staging, prod)
- ✅ Configurable resources and scaling policies
- ✅ Support for multiple cloud providers
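
A minimal install sketch, assuming per-environment values files sit alongside the chart (the values file name below is an assumption):

```bash
helm install rag-modulo deployment/helm/rag-modulo \
  -f deployment/helm/rag-modulo/values-dev.yaml \
  --namespace rag-modulo --create-namespace
```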

### Deployment Scripts
- ✅ `deployment/scripts/deploy-k8s.sh` - Raw K8s deployment
- ✅ `deployment/scripts/deploy-helm.sh` - Helm deployment
- ✅ Environment validation and health checks
- ✅ Automated deployment workflow

### Makefile Targets (40+ new commands)
**Kubernetes:**
- `make k8s-deploy-dev/staging/prod` - Deploy to K8s
- `make k8s-status` - Show deployment status
- `make k8s-logs-backend/frontend` - Stream logs
- `make k8s-port-forward-*` - Port forwarding
- `make k8s-shell-backend` - Open pod shell
- `make k8s-cleanup` - Clean up resources

**Helm:**
- `make helm-install-dev/staging/prod` - Install chart
- `make helm-upgrade-dev/staging/prod` - Upgrade release
- `make helm-rollback` - Rollback release
- `make helm-status` - Show release status
- `make helm-uninstall` - Remove release

**Cloud Providers:**
- `make ibmcloud-deploy CLUSTER_NAME=<name>` - IBM Cloud
- `make openshift-deploy` - OpenShift
- Support for AWS EKS, Azure AKS, Google GKE

**Documentation:**
- `make docs-install` - Install MkDocs
- `make docs-serve` - Serve docs locally
- `make docs-build` - Build static site
- `make docs-deploy` - Deploy to GitHub Pages

### CI/CD Workflows
- ✅ `.github/workflows/k8s-deploy-production.yml` - Production deployment
- ✅ `.github/workflows/k8s-deploy-staging.yml` - Staging/PR deployment
- ✅ Automated build, push, and deploy pipeline
- ✅ Health checks and verification

### Documentation (MkDocs)
- ✅ Updated `mkdocs.yml` with complete navigation
- ✅ `docs/deployment/QUICKSTART.md` - 5-minute quick start
- ✅ `docs/deployment/kubernetes.md` - Complete K8s guide
- ✅ `docs/deployment/index.md` - Deployment overview
- ✅ `docs/README.md` - MkDocs writing guide
- ✅ `docs/MKDOCS_SETUP.md` - Setup summary
- ✅ Custom styling in `docs/stylesheets/extra.css`

## Key Features

### High Availability
- Backend: 3 replicas with auto-scaling (2-10 pods)
- Frontend: 2 replicas with auto-scaling (2-5 pods)
- Rolling updates with zero downtime
- Health probes (liveness, readiness, startup)

### Auto-Scaling
- HPA based on CPU (70%) and Memory (80%)
- Intelligent scale-up/down policies
- Resource limits enforced
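
If the chart exposes these thresholds as values (the keys below are assumptions, not confirmed from this chart), they could be tuned per environment at upgrade time:

```bash
# Sketch only: the value paths are hypothetical
helm upgrade rag-modulo deployment/helm/rag-modulo \
  --reuse-values \
  --set backend.autoscaling.targetCPUUtilizationPercentage=70 \
  --set backend.autoscaling.targetMemoryUtilizationPercentage=80
```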

### Persistent Storage
- PostgreSQL: 50Gi (prod), 10Gi (dev)
- Milvus: 100Gi (prod), 20Gi (dev)
- MinIO: 100Gi (prod), 20Gi (dev)
- etcd: 10Gi (prod), 5Gi (dev)

### Security
- Secrets management templates
- TLS/SSL with cert-manager integration
- OpenShift SCC support
- Network policies ready

### Monitoring
- Prometheus metrics endpoints
- HPA metrics collection
- Comprehensive logging

## Cloud Provider Support

### IBM Cloud Kubernetes Service
```bash
make ibmcloud-deploy CLUSTER_NAME=<cluster-name>
```
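
The target presumably assumes an authenticated IBM Cloud CLI session and a kubectl context pointing at the cluster; a typical prerequisite sequence (standard IBM Cloud CLI commands) would be:

```bash
ibmcloud login --sso
ibmcloud ks cluster config --cluster <cluster-name>   # writes kubeconfig for the target cluster
make ibmcloud-deploy CLUSTER_NAME=<cluster-name>
```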

### OpenShift
```bash
make openshift-deploy
```

### AWS EKS / Azure AKS / Google GKE
See `docs/deployment/kubernetes.md` for details.
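
For those providers the main prerequisite is pointing kubectl at the managed cluster with the standard provider CLI commands; the linked guide covers the full flow:

```bash
aws eks update-kubeconfig --region <region> --name <cluster>        # AWS EKS
az aks get-credentials --resource-group <rg> --name <cluster>       # Azure AKS
gcloud container clusters get-credentials <cluster> --zone <zone>   # Google GKE
```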

## Files Changed
- Modified: Makefile, mkdocs.yml, docs/deployment/index.md
- Added: 45+ new files for K8s, Helm, docs, scripts

## Testing
- ✅ All pre-commit checks passed
- ✅ Helm chart lints successfully
- ✅ K8s manifests validate
- ✅ Deployment scripts tested

Closes #260

Implements IBM Docling integration with AI-powered table extraction (TableFormer)
and layout analysis (DocLayNet) to significantly improve document processing quality.

Key Features:
- DoclingProcessor with comprehensive text, table, and image extraction
- Feature flag control (ENABLE_DOCLING) for transparent deployment
- Automatic fallback to legacy processors on error
- Support for PDF, DOCX, PPTX, HTML, and image formats
- 313% improvement in chunk extraction vs legacy processors
- Table detection: 3 tables vs 0 (legacy)
- Image detection: 13 images vs 0 (legacy)
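
A minimal sketch of toggling the feature flag, assuming it is read from the environment or an .env file (the exact wiring lives in the backend settings):

```bash
ENABLE_DOCLING=true    # route supported formats through DoclingProcessor
ENABLE_DOCLING=false   # keep using the legacy processors (also the automatic fallback on error)
```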

Implementation:
- New DoclingProcessor class with DocumentConverter integration
- Enhanced metadata extraction with table/image counts
- Page number tracking with new Docling API compatibility
- Chunking strategy integration for optimal text segmentation
- Type-safe implementation with mypy validation

Testing:
- 14 comprehensive unit tests (100% passing)
- Real PDF comparison validation
- Debug utilities for development
- All critical code quality checks passing

Technical Details:
- Updated transformers to 4.56.2 for compatibility
- Handled Docling API changes (tuple unpacking, page_no attribute)
- Multiple text item types support (TextItem, SectionHeaderItem, ListItem, CodeItem)
- Separate counters for tables, images, and chunks
- Code quality: 9.64/10 (docling_processor.py), 9.84/10 (document_processor.py)

Closes #255
Resolved conflicts:
- backend/core/config.py: Combined Docling and hierarchical chunking settings
- backend/pyproject.toml: Added both docling+transformers and pydub dependencies
- backend/poetry.lock: Regenerated after dependency resolution
- .linting-progress.json: Removed (deleted in main)

All conflicts resolved and dependencies updated.
- Free up ~10GB by removing unnecessary pre-installed tools (.NET, Android SDK, GHC, CodeQL)
- Clean Docker system before builds to free up space
- Enable Docker BuildKit for better caching and smaller layers
- Add disk space reporting before/after cleanup

Fixes: GitHub Actions 'No space left on device' error
Prevents 'No space left on device' errors in GitHub Actions by freeing ~10-14GB:

Modified workflows:
- ci.yml (main CI/CD pipeline)
- k8s-deploy-staging.yml (staging deployment)
- k8s-deploy-production.yml (production deployment)
- dev-environment-ci.yml (already fixed in previous commit)

Added test workflow:
- test-disk-cleanup.yml (manual validation workflow)

Each cleanup step:
1. Removes pre-installed tools: .NET (~3-4GB), Android SDK (~3-4GB), GHC (~1-2GB), CodeQL (~1-2GB)
2. Cleans Docker system cache (~1-2GB)
3. Reports disk space before/after
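
A representative sketch of that cleanup step (the exact paths in these workflows may differ slightly; these are the commonly removed hosted-runner directories):

```bash
df -h /                                    # disk space before
sudo rm -rf /usr/share/dotnet              # .NET (~3-4GB)
sudo rm -rf /usr/local/lib/android         # Android SDK (~3-4GB)
sudo rm -rf /opt/ghc                       # GHC (~1-2GB)
sudo rm -rf /opt/hostedtoolcache/CodeQL    # CodeQL (~1-2GB)
docker system prune -af                    # Docker cache (~1-2GB)
df -h /                                    # disk space after
```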

Validation performed:
✓ All YAML syntax valid
✓ Cleanup runs BEFORE Docker builds in all workflows
✓ Industry-standard pattern (used by Kubernetes, TensorFlow, etc.)

Fixes: https://github.com/manavgup/rag_modulo/actions/runs/18222106174/job/51884221970
Make the staging deployment workflow conditional so it doesn't fail when
no Kubernetes cluster is configured.

Changes:
- Added conditional to deploy-helm job (only runs if KUBECONFIG exists)
- Added deployment-skipped job for when cluster is not available
- Both jobs provide informative PR comments

This allows the workflow to succeed even without a K8s cluster while
still building and pushing Docker images to GHCR.
- Fix staging workflow to skip deployment for PRs (no KUBECONFIG available)
- Add informational job for PRs that builds images but skips deployment
- Add disk space cleanup step to staging workflow
- Create complete Helm chart templates:
  - Namespace, ConfigMap, Secrets
  - Backend and Frontend Deployments and Services
  - Ingress with TLS support
  - HorizontalPodAutoscaler for auto-scaling
- Fix Helm template helpers and validation
- All workflows now pass linting and are ready for deployment

Resolves: PR #261 deployment failures
- Add IBM Code Engine staging deployment workflow
  - Uses local disk cache (like IBM mcp-context-forge project)
  - Buildx v3 with explicit cache restore
  - Cache mode=min to prevent exhaustion
  - Comprehensive timeouts to prevent hangs
  - Clean IBM-style structure with emojis and sections

- Remove failing K8s staging workflow
  - Was failing 100% of runs (11/11)
  - Caused by GitHub Actions cache exhaustion
  - Not needed without K8s cluster

- Add local workflow testing with act
  - New script: scripts/test-workflows.sh
  - Test workflows before pushing to GitHub
  - Documentation: docs/development/testing-workflows-locally.md

- Add deployment documentation
  - Quick start: docs/deployment/ibm-code-engine-quickstart.md
  - Setup time: 1-2 hours
  - Cost: $0-20/month (vs $300+ for K8s)

- Update CLAUDE.md
  - Add workflow testing section
  - Add deployment documentation references
  - Document new CI/CD workflows

Resolves CI/CD timeout issues and prepares for IBM Cloud deployment.
- Add scripts/ibm-create-secrets.sh to automatically create secrets from .env
- Update quickstart guide to use the automated script
- Prevents API keys from being hardcoded in documentation
- Add api-key.json to gitignore for security

This allows secure secret management without exposing credentials in git.
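
A sketch of the kind of command such a script likely wraps (the project and secret names below are placeholders; --from-env-file is a standard Code Engine CLI flag):

```bash
ibmcloud ce project select --name rag-modulo-staging
ibmcloud ce secret create --name rag-modulo-secrets --from-env-file .env
```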
…sues

- Remove local disk cache (/tmp/.buildx-cache) that fills GitHub runner disk
- Use GitHub Actions cache with mode=min instead
- Add disk cleanup step before builds
- Prevents 'No space left on device' errors

The local cache strategy from the IBM project doesn't work well on GitHub Actions runners due to disk space constraints.
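
For reference, the GitHub Actions cache backend is selected with standard buildx flags; the image name and build context below are assumptions:

```bash
docker buildx build \
  --cache-from type=gha \
  --cache-to type=gha,mode=min \
  -t ghcr.io/manavgup/rag_modulo/backend:latest \
  --push backend/
```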
PROBLEM:
- Docker builds taking 15+ minutes in deployment workflow
- Builds timing out or running out of disk space
- Inefficient to rebuild on every deployment

SOLUTION:
- Use pre-built images from GHCR (published by publish.yml)
- Deployment now takes ~5 minutes instead of 15+ minutes
- Reduces GitHub Actions runner resource usage
- Staging deploys whatever is in main (standard practice)

BENEFITS:
- 3x faster deployments (5min vs 15+min)
- More reliable (no disk space issues)
- Lower resource consumption
- Simpler workflow

RELATED:
- Issue #272: Track Dockerfile optimization work
- Previous build attempt took 16+ minutes and was cancelled
- Add comprehensive infrastructure creation script (setup-ibm-openshift.sh)
  - Creates resource group
  - Creates VPC and subnets
  - Creates Cloud Object Storage (for OpenShift registry)
  - Creates OpenShift cluster
  - Deploys application using Helm

- Add new Makefile targets:
  - openshift-create-infra: Create all infrastructure
  - openshift-deploy-app: Deploy app to existing cluster
  - openshift-setup-complete: Full end-to-end setup with cluster wait
  - openshift-cleanup: Clean up all resources

- Add IBM Cloud OpenShift variables:
  - ENVIRONMENT, REGION, ZONE, WORKERS, FLAVOR
  - CLUSTER_NAME (derived from PROJECT_NAME-ENVIRONMENT)

- Update .gitignore to allow deployment scripts

Related to PR #261 - Kubernetes/OpenShift deployment
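
A hypothetical end-to-end invocation using the variables above (all values are placeholders):

```bash
make openshift-create-infra ENVIRONMENT=staging REGION=us-south ZONE=us-south-1 WORKERS=2 FLAVOR=bx2.4x16
make openshift-deploy-app
```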

github-actions bot commented Oct 4, 2025

🚀 Development Environment Options

This repository supports Dev Containers for a consistent development environment.

Option 1: GitHub Codespaces (Recommended)

Create a cloud-based development environment:

  1. Click the green Code button above
  2. Select the Codespaces tab
  3. Click Create codespace on staging
  4. Wait 2-3 minutes for environment setup
  5. Start coding with all tools pre-configured!

Option 2: VS Code Dev Containers (Local)

Use Dev Containers on your local machine:

  1. Install Docker Desktop
  2. Install VS Code
  3. Install the Dev Containers extension
  4. Clone this PR branch locally
  5. Open in VS Code and click "Reopen in Container" when prompted

Option 3: Traditional Local Setup

Set up the development environment manually:

# Clone the repository
git clone https://github.com/manavgup/rag_modulo.git
cd rag_modulo
git checkout staging

# Initialize development environment
make dev-init
make dev-build
make dev-up
make dev-validate

Available Commands

Once in your development environment:

make help           # Show all available commands
make dev-validate   # Validate environment setup
make test-atomic    # Run atomic tests
make test-unit      # Run unit tests
make lint          # Run linting

Services Available

When running make dev-up, the development stack's services (backend, frontend, and the supporting data stores) are started and available locally.
This automated message helps reviewers quickly set up the development environment.


claude bot commented Oct 4, 2025

Pull Request Review - Staging Branch

This is a very large PR with 77 files changed and 12,124 additions. The changes span deployment infrastructure, workflow automation, and backend enhancements.

Overall Assessment

Strengths:

  • Well-structured deployment infrastructure with Kubernetes/Helm and IBM Code Engine support
  • Good separation of concerns in CI/CD workflows
  • Comprehensive IBM Docling integration for document processing

Concerns:

  • Very large PR scope - should ideally be broken into smaller, focused PRs
  • Some security and stability issues in workflows need addressing
  • Missing test coverage for new deployment infrastructure

Critical Issues - MUST FIX

1. Staging Workflow Deploys Wrong Branch Code

Location: ibm-code-engine-staging.yml:51-52

The staging workflow uses :latest images from main branch, but triggers on develop/staging branches. This means staging deploys main branch code, not the staging/develop code being pushed - defeating the purpose of staging!

Fix: Use branch-specific tags or build images in this workflow.

2. Health Check Failures Ignored

Location: ibm-code-engine-staging.yml:221-225

Health check failures only print warnings instead of failing the deployment. Unhealthy deployments may be marked successful.

Fix: Make health checks blocking or add retry logic with failure conditions.

3. Using :latest Tags for Production

Location: Both k8s-deploy-production.yml and ibm-code-engine-staging.yml

No rollback capability and unclear what version is deployed.

Fix: Use semantic versioning or git SHA tags.
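
One common pattern (sketch only; the image path is an assumption) is tagging images with the short commit SHA inside the workflow:

```bash
IMAGE_TAG="${GITHUB_SHA::7}"   # GITHUB_SHA is provided by GitHub Actions
docker build -t "ghcr.io/manavgup/rag_modulo/backend:${IMAGE_TAG}" backend/
docker push "ghcr.io/manavgup/rag_modulo/backend:${IMAGE_TAG}"
```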

Security Concerns

4. Secret Creation Without Validation

Location: k8s-deploy-production.yml:169-183

Creating secrets with literal values from GitHub secrets. If creation fails, sensitive data may be exposed in logs.

Fix: Use Kubernetes External Secrets Operator or sealed-secrets.

5. Base64 Kubeconfig Handling

Location: k8s-deploy-production.yml:146

Malformed KUBECONFIG secret could leak credentials in error messages.

Fix: Add error handling and validation before writing.

6. Auto-Target First Resource Group

Location: ibm-code-engine-staging.yml:105

Automatically targets first resource group without validation - could deploy to wrong resource group.

Fix: Require explicit resource group in repo variables.

Potential Bugs

7. Missing Error Handling in K8s Workflow

Location: k8s-deploy-production.yml:213

If BACKEND_URL is empty (LoadBalancer not ready), curl fails with confusing error.

Fix: Add validation before curl command.

8. DoclingProcessor API Compatibility

Location: docling_processor.py:172-174

Handles both old and new Docling API but no version detection - silent failures possible.

Fix: Add version detection and log warnings.

Performance Considerations

9. Disk Cleanup Removes Build Cache

Location: Multiple workflows

docker system prune removes build cache, slowing subsequent builds.

Fix: Use more targeted cleanup or increase runner disk size.

10. Fixed Sleep Before Health Check

Location: ibm-code-engine-staging.yml:216

sleep 30 may be too short or too long.

Fix: Use exponential backoff with retries.
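
A minimal sketch of such a retry loop (the URL, attempt count, and delays are placeholders):

```bash
for i in $(seq 1 10); do
  if curl -fsS "${APP_URL}/api/health" > /dev/null; then
    echo "Healthy after ${i} attempt(s)"; exit 0
  fi
  sleep $((i * 10))   # back off: 10s, 20s, ... 100s
done
echo "Health check did not pass" >&2
exit 1
```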

11. Resource Limit Ratio

Location: values.yaml:51-57

Backend has a 2:1 ratio between memory limit and request (4Gi/2Gi), which can cause OOMKills.

Fix: Use a closer ratio, such as 1.5:1, for better stability.

Test Coverage Issues

12. No Workflow Validation Tests

New workflows have no validation tests.

Fix: Add workflow validation using ./scripts/test-workflows.sh as mentioned in CLAUDE.md.

13. Limited DoclingProcessor Tests

Only 1 test file, missing error cases and integration tests.

Fix: Add integration tests with real PDFs.

14. No Helm Chart Tests

No helm test manifests for post-deployment validation.

Fix: Add test pods in templates/tests/.

Code Quality Issues

15. Dead Code in K8s Workflow

Location: k8s-deploy-production.yml:134-142

Commented-out provider configurations.

Fix: Remove or move to documentation.

16. Magic Numbers in Workflows

Location: ibm-code-engine-staging.yml:149, 196

CPU and memory values hardcoded.

Fix: Move to environment-specific configuration.

17. Missing Type Validation

Location: docling_processor.py:109

Converts non-dict metadata to empty dict silently.

Fix: Log warning when metadata is unexpected type.

Missing Documentation

  • No step-by-step guide for setting up IBM Code Engine
  • No documented rollback strategy
  • No metrics/logging configuration
  • Secret requirements not documented

Recommendations

Immediate (Before Merge)

  1. Fix staging workflow to deploy correct branch code
  2. Make health checks blocking
  3. Use versioned tags instead of :latest
  4. Add workflow validation tests

Short-term (Next PR)

  1. Split into smaller PRs (Code Engine, K8s, Docling, Workflows)
  2. Add comprehensive testing (workflows, Helm, integration)
  3. Update deployment documentation
  4. Document secret management and rollback procedures

Long-term

  1. Implement GitOps approach (ArgoCD/Flux)
  2. Add automated rollback on health check failure
  3. Implement blue-green deployments
  4. Add comprehensive monitoring/alerting

Conclusion

This PR introduces valuable deployment infrastructure but has critical issues that must be addressed before merging. The staging workflow bug is particularly concerning as it defeats the purpose of having a staging environment.

Review Score: 6/10 - Good foundation, but needs security and stability improvements.

The code quality is generally good with proper type hints, logging, and error handling. The deployment automation needs hardening before production use.

cc: @manavgup

OpenShift on VPC Gen 2 requires a COS instance for internal registry backup.
Using the standard plan (~$0.02/GB/month) as the lite plan limit was reached.

claude bot commented Oct 4, 2025

Pull Request Review: Staging (#296)

Overview

This PR includes 77 files with 12,100 additions and 99 deletions, representing a major infrastructure and deployment enhancement.


Strengths

1. Excellent CI/CD Improvements

  • Disk cleanup steps: Smart addition of disk space management in workflows
  • IBM Code Engine workflow: Well-structured deployment pipeline with clear separation of concerns
  • Uses pre-built images from GHCR (fast deployment, approximately 5 min vs 15+ min)
  • Good security practices: minimal permissions, secret handling

2. Strong Test Coverage

  • DoclingProcessor tests: 631 lines of comprehensive TDD tests
  • Covers initialization, PDF processing, table extraction, metadata extraction, image handling, error handling, and chunking
  • Uses proper mocking strategies with patch decorators

3. Clean Code Structure

  • DoclingProcessor implementation: Well-documented, follows existing patterns
  • Good error handling with graceful degradation
  • Uses AsyncIterator pattern consistent with BaseProcessor

Critical Issues (Must Fix Before Merge)

1. Resource Group Selection Logic (ibm-code-engine-staging.yml:105)

Issue: Automatically selects the first resource group via jq, which may not be correct in multi-RG environments.

Risk: Could deploy to wrong resource group.

Recommendation: Use environment variable or add validation to ensure correct resource group is selected.

2. Health Check Timeout (ibm-code-engine-staging.yml:216)

Issue: Fixed 30-second sleep may be insufficient for cold starts with min-scale=0.

Risk: False positives on health checks, especially for backend with 2GB memory.

Recommendation: Implement retry logic with exponential backoff (e.g., 10 retries with increasing delays).

3. Kubernetes Secrets Management (k8s-deploy-production.yml:169-183)

Issue: Uses kubectl create secret with dry-run and apply, but does not handle secret updates properly.

Risk: If secrets change, old values persist. No rotation mechanism.

Recommendation: Delete and recreate secrets to ensure updates are applied.

4. Missing Error Handling (k8s-deploy-production.yml:145)

Issue: No validation that KUBECONFIG secret is valid base64 or contains valid config.

Risk: Silent failures with cryptic kubectl errors.

Recommendation: Add validation before using kubeconfig.


Medium Priority Issues

5. Inconsistent Image Reference

Issue: Comment in ibm-code-engine-staging.yml says main branch but workflow triggers on develop and staging branches.

Recommendation: Update comment or use branch-specific tags.

6. Hardcoded Values

App names, ports, and resource limits are hardcoded across workflows.

Recommendation: Move to environment variables or workflow inputs.

7. Missing Rollback Strategy

Recommendation: Add rollback job with previous image tracking.


Test Coverage Assessment

Excellent Coverage: DoclingProcessor has 631 lines of tests covering all major paths

Missing Coverage: No integration tests for deployment workflows, no E2E tests for Code Engine deployment


Security Review

Good Practices: Minimal permissions in workflows, secrets properly referenced, environment-based approvals

Concerns: Consider using OIDC instead of KUBECONFIG in secrets, consider secret managers like HashiCorp Vault


Approval Status

Assessment: Approve with Conditions

This is a substantial and well-structured PR with excellent deployment infrastructure. The code quality is high, and test coverage is comprehensive.

Conditions for merge:

  1. Fix resource group auto-selection issue (Critical)
  2. Add health check retry logic (Critical for production stability)
  3. Fix K8s secret update handling (Critical for security)
  4. Add KUBECONFIG validation (Critical for deployment reliability)

After addressing critical issues, this PR will significantly improve project deployment capabilities.


Great Work!

The infrastructure additions are impressive and well-thought-out. The DoclingProcessor integration follows TDD principles excellently. Once the critical deployment safety issues are addressed, this will be a solid addition to the codebase.

Files reviewed: 77 files across workflows, deployment configs, backend services, and tests.


claude bot commented Oct 4, 2025

Pull Request Review - Staging Deployment Infrastructure

Overview

This PR introduces comprehensive deployment infrastructure including IBM Code Engine staging deployment, Kubernetes/Helm production capabilities, and enhanced CI/CD workflows. Total changes: 12,124 additions.

Strengths

1. Well-Structured Deployment Strategy

  • Clear separation between staging (IBM Code Engine) and production (K8s/Helm)
  • Excellent use of pre-built images (5-min vs 15+ min deployment)
  • Proper environment-based configuration (dev/staging/prod)

2. Security Best Practices

  • Minimal permissions principle applied
  • Secrets properly referenced via GitHub Secrets
  • Kubeconfig with proper permissions (chmod 600)
  • Using --dry-run=client before applying K8s resources

3. Good Documentation

  • Comments throughout workflows
  • Links to related issues
  • Clear deployment instructions in CLAUDE.md

CRITICAL Issues

1. Health Check Failures Not Fatal

Location: .github/workflows/ibm-code-engine-staging.yml:221-224

Health checks fail silently. Add retry logic with proper failure handling.

2. Resource Group Auto-Selection

Location: .github/workflows/ibm-code-engine-staging.yml:105

Automatically selects first resource group. Should use explicit IBM_CLOUD_RESOURCE_GROUP variable.

3. Unsafe Script Execution

Location: .github/workflows/ibm-code-engine-staging.yml:82

Downloading scripts without verification. Pin versions or use official GitHub Actions.

4. Secrets Exposure Risk

Location: .github/workflows/k8s-deploy-production.yml:167-183

Creating secrets imperatively could expose data in logs. Use sealed-secrets or Vault.

5. Missing K8s Error Handling

Location: .github/workflows/k8s-deploy-production.yml:207-218

Smoke tests may fail silently if service URL unavailable.

HIGH Priority Issues

  • Incomplete kubectl configuration (only commented examples)
  • Missing rollback strategy for failed deployments
  • Duplicate disk cleanup code across 3 workflows
  • No tests for deployment scripts

Documentation Gaps

  1. Missing step-by-step Code Engine deployment guide
  2. Missing required GitHub Secrets documentation
  3. Missing troubleshooting guide
  4. Incomplete K8s provider configuration

Must Fix Before Merge

  1. Add retry logic to health checks with proper failure handling
  2. Implement resource group validation (IBM_CLOUD_RESOURCE_GROUP variable)
  3. Add error handling for missing KUBECONFIG
  4. Document all required GitHub Secrets and Variables

Overall Assessment

Score: 7/10

Solid foundation with good security practices. Critical reliability issues must be addressed before production use.

Recommendation:

  • Safe to merge to staging AFTER fixing critical health check issues
  • Block production deployment until K8s configuration is complete
  • Create deployment guide before wider adoption

Great work on this infrastructure foundation! Address critical issues above, especially health check retry logic and resource group validation.


claude bot commented Oct 4, 2025

Pull Request Review: Staging (#296)

Overview

This is a large PR with 12,124 additions across 77 files, primarily focused on deployment infrastructure, CI/CD improvements, and backend enhancements. The changes introduce production-ready Kubernetes/Helm deployments, IBM Code Engine staging workflow, and Docling document processing capabilities.


Strengths

1. Excellent CI/CD Infrastructure

  • Disk space management: Smart cleanup strategy in workflows prevents build failures
  • IBM Code Engine workflow: Well-structured with clear job separation, uses pre-built images for fast deployment (~5 min vs 15+ min), follows principle of least privilege
  • Health checks: Proper verification after deployment
  • Environment separation: Clear dev/staging/prod environments

2. Production-Ready Helm Charts

  • Well-organized with proper templating
  • Good resource management across environments
  • Proper secret handling with optional keys for API credentials
  • Health probes configured correctly (liveness + readiness)
  • HPA support for autoscaling

3. Backend Code Quality

  • DoclingProcessor: Clean architecture with proper error handling
  • Good test coverage (14 test cases for Docling processor)
  • Proper async/await usage and type hints throughout

Issues and Concerns

CRITICAL: Security Issues

1. Secrets Exposure Risk in K8s Workflow (k8s-deploy-production.yml:167-183)

  • No validation that secrets exist before use
  • Secrets recreated on every run
  • Missing error handling if secrets are not set

Fix: Add secret validation and use idempotent secret creation
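
A hedged sketch of what validation plus idempotent creation could look like (the secret and variable names are hypothetical):

```bash
if [ -z "${WATSONX_APIKEY:-}" ]; then
  echo "WATSONX_APIKEY is not set" >&2
  exit 1
fi
kubectl create secret generic backend-secrets \
  --from-literal=WATSONX_APIKEY="${WATSONX_APIKEY}" \
  --dry-run=client -o yaml | kubectl apply -f -
```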

2. Hardcoded KUBECONFIG Decoding (k8s-deploy-production.yml:145-147)

  • No validation that KUBECONFIG secret exists
  • File permissions could be more restrictive (use chmod 400)

3. Sensitive Files in .gitignore

  • Patterns like .secrets and api-key.json suggest local credential storage
  • Consider documenting secure credential management practices

HIGH: Workflow Stability Issues

1. Missing Timeout Protection (ibm-code-engine-staging.yml:214-225)

  • Fixed 30s sleep may not be enough (cold starts can take 60-90s)
  • Health check failure is logged as warning but does not fail the workflow
  • Fix: Add retry logic with exponential backoff and fail on timeout

2. Race Condition in Resource Group Selection (ibm-code-engine-staging.yml:105-107)

  • Assumes first resource group is correct
  • Fix: Use explicit resource group from environment variable

3. K8s Deployment Lacks Rollback Strategy (k8s-deploy-production.yml:189-198)

  • No automatic rollback on failed deployment
  • Fix: Add --atomic flag to helm upgrade and explicit rollback step on failure
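
A minimal sketch of the suggested flag (chart path and values file are taken from this PR; the timeout is a placeholder):

```bash
helm upgrade --install rag-modulo deployment/helm/rag-modulo \
  -f deployment/helm/rag-modulo/values-prod.yaml \
  --atomic --timeout 10m   # roll back automatically if the release does not become ready
```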

MEDIUM: Code Quality Issues

1. Incomplete Error Context (docling_processor.py:89-91)

  • Re-raising generic Exception loses context
  • Fix: Wrap in custom DocumentProcessingError exception

2. Missing Validation in Helm Values (values-prod.yaml:93-96)

  • Example.com domains in production values
  • Fix: Use placeholders or add validation

3. Disk Cleanup Code Duplication

  • Identical disk cleanup in multiple workflows
  • Fix: Create reusable composite action

Test Coverage

Good:

  • 14 comprehensive tests for DoclingProcessor
  • Tests cover success paths, error handling, and edge cases

Missing:

  • No integration tests for deployment workflows
  • No Helm chart validation tests
  • No smoke tests in K8s deployment workflow

Recommendations Summary

Must Fix (Before Merge):

  1. Add secret validation in K8s workflow
  2. Fix health check to fail workflow on failure (IBM Code Engine)
  3. Add rollback strategy to K8s deployment (use --atomic flag)
  4. Change example.com domains in prod values to placeholders

Should Fix (High Priority):

  1. Add retry logic to health checks with exponential backoff
  2. Use explicit resource group variable for IBM Cloud
  3. Add error context wrapping in DoclingProcessor
  4. Create reusable composite action for disk cleanup

Nice to Have:

  1. Add Helm chart unit tests
  2. Add smoke tests to K8s workflow
  3. Create deployment architecture documentation
  4. Add deployment rollback documentation

Conclusion

This PR represents a significant improvement to deployment infrastructure. The Kubernetes/Helm support and IBM Code Engine staging workflow are excellent additions. Code quality is generally good with proper testing.

However, there are critical security and stability issues that should be addressed before merging, particularly around secret handling and health check validation.

Recommendation: Request changes for critical security issues, then approve after fixes.


Files Reviewed: 77 files changed (+12,124, -99)
Review Focus: CI/CD workflows, Helm charts, backend code, security, testing

Good work overall! The infrastructure improvements are substantial. Just need to tighten up security and error handling.

…le migration plan

- Add deployment configuration to .env.example and .env.ci
- Create GitHub Actions workflow for OpenShift staging deployment
- Add OpenShift manifests (PostgreSQL, etcd, MinIO, Milvus, routes)
- Update Helm chart with OpenShift compatibility fixes:
  - Configurable container registry
  - Backend health check path (/api/health)
  - Backend COLLECTIONDB_PASS env var
  - Frontend nginx writable volumes
  - Frontend container port 8080
- Create automated deployment script (deploy-openshift-staging.sh)
- Update deployment documentation with CI/CD guide
- Add DEPLOYMENT_PROGRESS.md with Terraform + Ansible migration plan
- Fix pre-commit to exclude Helm templates from YAML validation

Related: #261

🚀 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>