@manavgup (Owner) commented Oct 4, 2025

No description provided.

## Overview
Implements complete Kubernetes/OpenShift deployment strategy with Helm charts,
auto-scaling, high availability, and comprehensive documentation.

## What's New

### Kubernetes Manifests
- ✅ Complete K8s manifests in `deployment/k8s/base/`
- ✅ Namespace, ConfigMaps, and Secrets templates
- ✅ StatefulSets for PostgreSQL, Milvus, MinIO, etcd
- ✅ Deployments for Backend, Frontend, MLFlow
- ✅ Services for all components
- ✅ Ingress with TLS and OpenShift Routes
- ✅ HorizontalPodAutoscaler for auto-scaling
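
As a rough sketch (not taken from the repo's deployment scripts), the base manifests could be applied directly with kubectl; the namespace name below is a placeholder:

```bash
# Minimal sketch, assuming the manifests in deployment/k8s/base/ can be applied as-is
kubectl apply -f deployment/k8s/base/
kubectl get pods -n rag-modulo   # "rag-modulo" is an assumed namespace name
```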

### Helm Chart
- ✅ Production-ready Helm chart in `deployment/helm/rag-modulo/`
- ✅ Environment-specific values (dev, staging, prod)
- ✅ Configurable resources and scaling policies
- ✅ Support for multiple cloud providers
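
A minimal install sketch, assuming per-environment values files sit alongside the chart (the values file name below is an assumption):

```bash
helm install rag-modulo deployment/helm/rag-modulo \
  -f deployment/helm/rag-modulo/values-dev.yaml \
  --namespace rag-modulo --create-namespace
```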

### Deployment Scripts
- ✅ `deployment/scripts/deploy-k8s.sh` - Raw K8s deployment
- ✅ `deployment/scripts/deploy-helm.sh` - Helm deployment
- ✅ Environment validation and health checks
- ✅ Automated deployment workflow

### Makefile Targets (40+ new commands)
**Kubernetes:**
- `make k8s-deploy-dev/staging/prod` - Deploy to K8s
- `make k8s-status` - Show deployment status
- `make k8s-logs-backend/frontend` - Stream logs
- `make k8s-port-forward-*` - Port forwarding
- `make k8s-shell-backend` - Open pod shell
- `make k8s-cleanup` - Clean up resources

**Helm:**
- `make helm-install-dev/staging/prod` - Install chart
- `make helm-upgrade-dev/staging/prod` - Upgrade release
- `make helm-rollback` - Rollback release
- `make helm-status` - Show release status
- `make helm-uninstall` - Remove release

**Cloud Providers:**
- `make ibmcloud-deploy CLUSTER_NAME=<name>` - IBM Cloud
- `make openshift-deploy` - OpenShift
- Support for AWS EKS, Azure AKS, Google GKE

**Documentation:**
- `make docs-install` - Install MkDocs
- `make docs-serve` - Serve docs locally
- `make docs-build` - Build static site
- `make docs-deploy` - Deploy to GitHub Pages

### CI/CD Workflows
- ✅ `.github/workflows/k8s-deploy-production.yml` - Production deployment
- ✅ `.github/workflows/k8s-deploy-staging.yml` - Staging/PR deployment
- ✅ Automated build, push, and deploy pipeline
- ✅ Health checks and verification

### Documentation (MkDocs)
- ✅ Updated `mkdocs.yml` with complete navigation
- ✅ `docs/deployment/QUICKSTART.md` - 5-minute quick start
- ✅ `docs/deployment/kubernetes.md` - Complete K8s guide
- ✅ `docs/deployment/index.md` - Deployment overview
- ✅ `docs/README.md` - MkDocs writing guide
- ✅ `docs/MKDOCS_SETUP.md` - Setup summary
- ✅ Custom styling in `docs/stylesheets/extra.css`

## Key Features

### High Availability
- Backend: 3 replicas with auto-scaling (2-10 pods)
- Frontend: 2 replicas with auto-scaling (2-5 pods)
- Rolling updates with zero downtime
- Health probes (liveness, readiness, startup)

### Auto-Scaling
- HPA based on CPU (70%) and Memory (80%)
- Intelligent scale-up/down policies
- Resource limits enforced
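
If the chart exposes these thresholds as values (the keys below are assumptions, not confirmed from this chart), they could be tuned per environment at upgrade time:

```bash
# Sketch only: the value paths are hypothetical
helm upgrade rag-modulo deployment/helm/rag-modulo \
  --reuse-values \
  --set backend.autoscaling.targetCPUUtilizationPercentage=70 \
  --set backend.autoscaling.targetMemoryUtilizationPercentage=80
```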

### Persistent Storage
- PostgreSQL: 50Gi (prod), 10Gi (dev)
- Milvus: 100Gi (prod), 20Gi (dev)
- MinIO: 100Gi (prod), 20Gi (dev)
- etcd: 10Gi (prod), 5Gi (dev)

### Security
- Secrets management templates
- TLS/SSL with cert-manager integration
- OpenShift SCC support
- Network policies ready

### Monitoring
- Prometheus metrics endpoints
- HPA metrics collection
- Comprehensive logging

## Cloud Provider Support

### IBM Cloud Kubernetes Service
```bash
make ibmcloud-deploy CLUSTER_NAME=<cluster-name>
```
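
The target presumably assumes an authenticated IBM Cloud CLI session and a kubectl context pointing at the cluster; a typical prerequisite sequence (standard IBM Cloud CLI commands) would be:

```bash
ibmcloud login --sso
ibmcloud ks cluster config --cluster <cluster-name>   # writes kubeconfig for the target cluster
make ibmcloud-deploy CLUSTER_NAME=<cluster-name>
```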

### OpenShift
```bash
make openshift-deploy
```

### AWS EKS / Azure AKS / Google GKE
See `docs/deployment/kubernetes.md` for details.
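
For those providers the main prerequisite is pointing kubectl at the managed cluster with the standard provider CLI commands; the linked guide covers the full flow:

```bash
aws eks update-kubeconfig --region <region> --name <cluster>        # AWS EKS
az aks get-credentials --resource-group <rg> --name <cluster>       # Azure AKS
gcloud container clusters get-credentials <cluster> --zone <zone>   # Google GKE
```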

## Files Changed
- Modified: Makefile, mkdocs.yml, docs/deployment/index.md
- Added: 45+ new files for K8s, Helm, docs, scripts

## Testing
- ✅ All pre-commit checks passed
- ✅ Helm chart lints successfully
- ✅ K8s manifests validate
- ✅ Deployment scripts tested

Closes #260

Implements IBM Docling integration with AI-powered table extraction (TableFormer)
and layout analysis (DocLayNet) to significantly improve document processing quality.

Key Features:
- DoclingProcessor with comprehensive text, table, and image extraction
- Feature flag control (ENABLE_DOCLING) for transparent deployment
- Automatic fallback to legacy processors on error
- Support for PDF, DOCX, PPTX, HTML, and image formats
- 313% improvement in chunk extraction vs legacy processors
- Table detection: 3 tables vs 0 (legacy)
- Image detection: 13 images vs 0 (legacy)
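
A minimal sketch of toggling the feature flag, assuming it is read from the environment or an .env file (the exact wiring lives in the backend settings):

```bash
ENABLE_DOCLING=true    # route supported formats through DoclingProcessor
ENABLE_DOCLING=false   # keep using the legacy processors (also the automatic fallback on error)
```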

Implementation:
- New DoclingProcessor class with DocumentConverter integration
- Enhanced metadata extraction with table/image counts
- Page number tracking with new Docling API compatibility
- Chunking strategy integration for optimal text segmentation
- Type-safe implementation with mypy validation

Testing:
- 14 comprehensive unit tests (100% passing)
- Real PDF comparison validation
- Debug utilities for development
- All critical code quality checks passing

Technical Details:
- Updated transformers to 4.56.2 for compatibility
- Handled Docling API changes (tuple unpacking, page_no attribute)
- Multiple text item types support (TextItem, SectionHeaderItem, ListItem, CodeItem)
- Separate counters for tables, images, and chunks
- Code quality: 9.64/10 (docling_processor.py), 9.84/10 (document_processor.py)

Closes #255
Resolved conflicts:
- backend/core/config.py: Combined Docling and hierarchical chunking settings
- backend/pyproject.toml: Added both docling+transformers and pydub dependencies
- backend/poetry.lock: Regenerated after dependency resolution
- .linting-progress.json: Removed (deleted in main)

All conflicts resolved and dependencies updated.
- Free up ~10GB by removing unnecessary pre-installed tools (.NET, Android SDK, GHC, CodeQL)
- Clean Docker system before builds to free up space
- Enable Docker BuildKit for better caching and smaller layers
- Add disk space reporting before/after cleanup

Fixes: GitHub Actions 'No space left on device' error
Prevents 'No space left on device' errors in GitHub Actions by freeing ~10-14GB:

Modified workflows:
- ci.yml (main CI/CD pipeline)
- k8s-deploy-staging.yml (staging deployment)
- k8s-deploy-production.yml (production deployment)
- dev-environment-ci.yml (already fixed in previous commit)

Added test workflow:
- test-disk-cleanup.yml (manual validation workflow)

Each cleanup step:
1. Removes pre-installed tools: .NET (~3-4GB), Android SDK (~3-4GB), GHC (~1-2GB), CodeQL (~1-2GB)
2. Cleans Docker system cache (~1-2GB)
3. Reports disk space before/after
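
A representative sketch of that cleanup step (the exact paths in these workflows may differ slightly; these are the commonly removed hosted-runner directories):

```bash
df -h /                                    # disk space before
sudo rm -rf /usr/share/dotnet              # .NET (~3-4GB)
sudo rm -rf /usr/local/lib/android         # Android SDK (~3-4GB)
sudo rm -rf /opt/ghc                       # GHC (~1-2GB)
sudo rm -rf /opt/hostedtoolcache/CodeQL    # CodeQL (~1-2GB)
docker system prune -af                    # Docker cache (~1-2GB)
df -h /                                    # disk space after
```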

Validation performed:
✓ All YAML syntax valid
✓ Cleanup runs BEFORE Docker builds in all workflows
✓ Industry-standard pattern (used by Kubernetes, TensorFlow, etc.)

Fixes: https://github.com/manavgup/rag_modulo/actions/runs/18222106174/job/51884221970
Make the staging deployment workflow conditional so it doesn't fail when
no Kubernetes cluster is configured.

Changes:
- Added conditional to deploy-helm job (only runs if KUBECONFIG exists)
- Added deployment-skipped job for when cluster is not available
- Both jobs provide informative PR comments

This allows the workflow to succeed even without a K8s cluster while
still building and pushing Docker images to GHCR.
- Fix staging workflow to skip deployment for PRs (no KUBECONFIG available)
- Add informational job for PRs that builds images but skips deployment
- Add disk space cleanup step to staging workflow
- Create complete Helm chart templates:
  - Namespace, ConfigMap, Secrets
  - Backend and Frontend Deployments and Services
  - Ingress with TLS support
  - HorizontalPodAutoscaler for auto-scaling
- Fix Helm template helpers and validation
- All workflows now pass linting and are ready for deployment

Resolves: PR #261 deployment failures
- Add IBM Code Engine staging deployment workflow
  - Uses local disk cache (like IBM mcp-context-forge project)
  - Buildx v3 with explicit cache restore
  - Cache mode=min to prevent exhaustion
  - Comprehensive timeouts to prevent hangs
  - Clean IBM-style structure with emojis and sections

- Remove failing K8s staging workflow
  - Was failing 100% of runs (11/11)
  - Caused by GitHub Actions cache exhaustion
  - Not needed without K8s cluster

- Add local workflow testing with act
  - New script: scripts/test-workflows.sh
  - Test workflows before pushing to GitHub
  - Documentation: docs/development/testing-workflows-locally.md

- Add deployment documentation
  - Quick start: docs/deployment/ibm-code-engine-quickstart.md
  - Setup time: 1-2 hours
  - Cost: $0-20/month (vs $300+ for K8s)

- Update CLAUDE.md
  - Add workflow testing section
  - Add deployment documentation references
  - Document new CI/CD workflows

Resolves CI/CD timeout issues and prepares for IBM Cloud deployment.
- Add scripts/ibm-create-secrets.sh to automatically create secrets from .env
- Update quickstart guide to use the automated script
- Prevents API keys from being hardcoded in documentation
- Add api-key.json to gitignore for security

This allows secure secret management without exposing credentials in git.
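
A sketch of the kind of command such a script likely wraps (the project and secret names below are placeholders; --from-env-file is a standard Code Engine CLI flag):

```bash
ibmcloud ce project select --name rag-modulo-staging
ibmcloud ce secret create --name rag-modulo-secrets --from-env-file .env
```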
…sues

- Remove local disk cache (/tmp/.buildx-cache) that fills GitHub runner disk
- Use GitHub Actions cache with mode=min instead
- Add disk cleanup step before builds
- Prevents 'No space left on device' errors

The local cache strategy from the IBM project doesn't work well on GitHub Actions runners due to disk space constraints.
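
For reference, the GitHub Actions cache backend is selected with standard buildx flags; the image name and build context below are assumptions:

```bash
docker buildx build \
  --cache-from type=gha \
  --cache-to type=gha,mode=min \
  -t ghcr.io/manavgup/rag_modulo/backend:latest \
  --push backend/
```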
PROBLEM:
- Docker builds taking 15+ minutes in deployment workflow
- Builds timing out or running out of disk space
- Inefficient to rebuild on every deployment

SOLUTION:
- Use pre-built images from GHCR (published by publish.yml)
- Deployment now takes ~5 minutes instead of 15+ minutes
- Reduces GitHub Actions runner resource usage
- Staging deploys whatever is in main (standard practice)

BENEFITS:
- 3x faster deployments (5min vs 15+min)
- More reliable (no disk space issues)
- Lower resource consumption
- Simpler workflow

RELATED:
- Issue #272: Track Dockerfile optimization work
- Previous build attempt took 16+ minutes and was cancelled
- Add comprehensive infrastructure creation script (setup-ibm-openshift.sh)
  - Creates resource group
  - Creates VPC and subnets
  - Creates Cloud Object Storage (for OpenShift registry)
  - Creates OpenShift cluster
  - Deploys application using Helm

- Add new Makefile targets:
  - openshift-create-infra: Create all infrastructure
  - openshift-deploy-app: Deploy app to existing cluster
  - openshift-setup-complete: Full end-to-end setup with cluster wait
  - openshift-cleanup: Clean up all resources

- Add IBM Cloud OpenShift variables:
  - ENVIRONMENT, REGION, ZONE, WORKERS, FLAVOR
  - CLUSTER_NAME (derived from PROJECT_NAME-ENVIRONMENT)

- Update .gitignore to allow deployment scripts

Related to PR #261 - Kubernetes/OpenShift deployment
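
A hypothetical end-to-end invocation using the variables above (all values are placeholders):

```bash
make openshift-create-infra ENVIRONMENT=staging REGION=us-south ZONE=us-south-1 WORKERS=2 FLAVOR=bx2.4x16
make openshift-deploy-app
```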

github-actions bot commented Oct 4, 2025

🚀 Development Environment Options

This repository supports Dev Containers for a consistent development environment.

Option 1: GitHub Codespaces (Recommended)

Create a cloud-based development environment:

  1. Click the green Code button above
  2. Select the Codespaces tab
  3. Click Create codespace on staging
  4. Wait 2-3 minutes for environment setup
  5. Start coding with all tools pre-configured!

Option 2: VS Code Dev Containers (Local)

Use Dev Containers on your local machine:

  1. Install Docker Desktop
  2. Install VS Code
  3. Install the Dev Containers extension
  4. Clone this PR branch locally
  5. Open in VS Code and click "Reopen in Container" when prompted

Option 3: Traditional Local Setup

Set up the development environment manually:

# Clone the repository
git clone https://github.com/manavgup/rag_modulo.git
cd rag_modulo
git checkout staging

# Initialize development environment
make dev-init
make dev-build
make dev-up
make dev-validate

Available Commands

Once in your development environment:

make help           # Show all available commands
make dev-validate   # Validate environment setup
make test-atomic    # Run atomic tests
make test-unit      # Run unit tests
make lint          # Run linting

Services Available

When running make dev-up, the development stack's services (backend, frontend, and the supporting data stores) are started and available locally.
This automated message helps reviewers quickly set up the development environment.


claude bot commented Oct 4, 2025

Pull Request Review - Staging Branch

This is a very large PR with 77 files changed and 12,124 additions. The changes span deployment infrastructure, workflow automation, and backend enhancements.

Overall Assessment

Strengths:

  • Well-structured deployment infrastructure with Kubernetes/Helm and IBM Code Engine support
  • Good separation of concerns in CI/CD workflows
  • Comprehensive IBM Docling integration for document processing

Concerns:

  • Very large PR scope - should ideally be broken into smaller, focused PRs
  • Some security and stability issues in workflows need addressing
  • Missing test coverage for new deployment infrastructure

Critical Issues - MUST FIX

1. Staging Workflow Deploys Wrong Branch Code

Location: ibm-code-engine-staging.yml:51-52

The staging workflow uses :latest images from main branch, but triggers on develop/staging branches. This means staging deploys main branch code, not the staging/develop code being pushed - defeating the purpose of staging!

Fix: Use branch-specific tags or build images in this workflow.

2. Health Check Failures Ignored

Location: ibm-code-engine-staging.yml:221-225

Health check failures only print warnings instead of failing the deployment. Unhealthy deployments may be marked successful.

Fix: Make health checks blocking or add retry logic with failure conditions.

3. Using :latest Tags for Production

Location: Both k8s-deploy-production.yml and ibm-code-engine-staging.yml

No rollback capability and unclear what version is deployed.

Fix: Use semantic versioning or git SHA tags.
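
One common pattern (sketch only; the image path is an assumption) is tagging images with the short commit SHA inside the workflow:

```bash
IMAGE_TAG="${GITHUB_SHA::7}"   # GITHUB_SHA is provided by GitHub Actions
docker build -t "ghcr.io/manavgup/rag_modulo/backend:${IMAGE_TAG}" backend/
docker push "ghcr.io/manavgup/rag_modulo/backend:${IMAGE_TAG}"
```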

Security Concerns

4. Secret Creation Without Validation

Location: k8s-deploy-production.yml:169-183

Creating secrets with literal values from GitHub secrets. If creation fails, sensitive data may be exposed in logs.

Fix: Use Kubernetes External Secrets Operator or sealed-secrets.

5. Base64 Kubeconfig Handling

Location: k8s-deploy-production.yml:146

Malformed KUBECONFIG secret could leak credentials in error messages.

Fix: Add error handling and validation before writing.

6. Auto-Target First Resource Group

Location: ibm-code-engine-staging.yml:105

Automatically targets first resource group without validation - could deploy to wrong resource group.

Fix: Require explicit resource group in repo variables.

Potential Bugs

7. Missing Error Handling in K8s Workflow

Location: k8s-deploy-production.yml:213

If BACKEND_URL is empty (LoadBalancer not ready), curl fails with confusing error.

Fix: Add validation before curl command.

8. DoclingProcessor API Compatibility

Location: docling_processor.py:172-174

Handles both old and new Docling API but no version detection - silent failures possible.

Fix: Add version detection and log warnings.

Performance Considerations

9. Disk Cleanup Removes Build Cache

Location: Multiple workflows

docker system prune removes build cache, slowing subsequent builds.

Fix: Use more targeted cleanup or increase runner disk size.

10. Fixed Sleep Before Health Check

Location: ibm-code-engine-staging.yml:216

sleep 30 may be too short or too long.

Fix: Use exponential backoff with retries.
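
A minimal sketch of such a retry loop (the URL, attempt count, and delays are placeholders):

```bash
for i in $(seq 1 10); do
  if curl -fsS "${APP_URL}/api/health" > /dev/null; then
    echo "Healthy after ${i} attempt(s)"; exit 0
  fi
  sleep $((i * 10))   # back off: 10s, 20s, ... 100s
done
echo "Health check did not pass" >&2
exit 1
```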

11. Resource Limit Ratio

Location: values.yaml:51-57

Backend has a 2:1 ratio between memory limit and request (4Gi/2Gi), which can cause OOMKills.

Fix: Use a closer ratio, such as 1.5:1, for better stability.

Test Coverage Issues

12. No Workflow Validation Tests

New workflows have no validation tests.

Fix: Add workflow validation using ./scripts/test-workflows.sh as mentioned in CLAUDE.md.

13. Limited DoclingProcessor Tests

Only 1 test file, missing error cases and integration tests.

Fix: Add integration tests with real PDFs.

14. No Helm Chart Tests

No helm test manifests for post-deployment validation.

Fix: Add test pods in templates/tests/.

Code Quality Issues

15. Dead Code in K8s Workflow

Location: k8s-deploy-production.yml:134-142

Commented-out provider configurations.

Fix: Remove or move to documentation.

16. Magic Numbers in Workflows

Location: ibm-code-engine-staging.yml:149, 196

CPU and memory values hardcoded.

Fix: Move to environment-specific configuration.

17. Missing Type Validation

Location: docling_processor.py:109

Converts non-dict metadata to empty dict silently.

Fix: Log warning when metadata is unexpected type.

Missing Documentation

  • No step-by-step guide for setting up IBM Code Engine
  • No documented rollback strategy
  • No metrics/logging configuration
  • Secret requirements not documented

Recommendations

Immediate (Before Merge)

  1. Fix staging workflow to deploy correct branch code
  2. Make health checks blocking
  3. Use versioned tags instead of :latest
  4. Add workflow validation tests

Short-term (Next PR)

  1. Split into smaller PRs (Code Engine, K8s, Docling, Workflows)
  2. Add comprehensive testing (workflows, Helm, integration)
  3. Update deployment documentation
  4. Document secret management and rollback procedures

Long-term

  1. Implement GitOps approach (ArgoCD/Flux)
  2. Add automated rollback on health check failure
  3. Implement blue-green deployments
  4. Add comprehensive monitoring/alerting

Conclusion

This PR introduces valuable deployment infrastructure but has critical issues that must be addressed before merging. The staging workflow bug is particularly concerning as it defeats the purpose of having a staging environment.

Review Score: 6/10 - Good foundation, but needs security and stability improvements.

The code quality is generally good with proper type hints, logging, and error handling. The deployment automation needs hardening before production use.

cc: @manavgup

OpenShift on VPC Gen 2 requires a COS instance for internal registry backup.
Using the standard plan (~$0.02/GB/month) as the lite plan limit was reached.

claude bot commented Oct 4, 2025

Pull Request Review: Staging (#296)

Overview

This PR includes 77 files with 12,100 additions and 99 deletions, representing a major infrastructure and deployment enhancement.


Strengths

1. Excellent CI/CD Improvements

  • Disk cleanup steps: Smart addition of disk space management in workflows
  • IBM Code Engine workflow: Well-structured deployment pipeline with clear separation of concerns
  • Uses pre-built images from GHCR (fast deployment, approximately 5 min vs 15+ min)
  • Good security practices: minimal permissions, secret handling

2. Strong Test Coverage

  • DoclingProcessor tests: 631 lines of comprehensive TDD tests
  • Covers initialization, PDF processing, table extraction, metadata extraction, image handling, error handling, and chunking
  • Uses proper mocking strategies with patch decorators

3. Clean Code Structure

  • DoclingProcessor implementation: Well-documented, follows existing patterns
  • Good error handling with graceful degradation
  • Uses AsyncIterator pattern consistent with BaseProcessor

Critical Issues (Must Fix Before Merge)

1. Resource Group Selection Logic (ibm-code-engine-staging.yml:105)

Issue: Automatically selects the first resource group via jq, which may not be correct in multi-RG environments.

Risk: Could deploy to wrong resource group.

Recommendation: Use environment variable or add validation to ensure correct resource group is selected.

2. Health Check Timeout (ibm-code-engine-staging.yml:216)

Issue: Fixed 30-second sleep may be insufficient for cold starts with min-scale=0.

Risk: False positives on health checks, especially for backend with 2GB memory.

Recommendation: Implement retry logic with exponential backoff (e.g., 10 retries with increasing delays).

3. Kubernetes Secrets Management (k8s-deploy-production.yml:169-183)

Issue: Uses kubectl create secret with dry-run and apply, but does not handle secret updates properly.

Risk: If secrets change, old values persist. No rotation mechanism.

Recommendation: Delete and recreate secrets to ensure updates are applied.

4. Missing Error Handling (k8s-deploy-production.yml:145)

Issue: No validation that KUBECONFIG secret is valid base64 or contains valid config.

Risk: Silent failures with cryptic kubectl errors.

Recommendation: Add validation before using kubeconfig.


Medium Priority Issues

5. Inconsistent Image Reference

Issue: Comment in ibm-code-engine-staging.yml says main branch but workflow triggers on develop and staging branches.

Recommendation: Update comment or use branch-specific tags.

6. Hardcoded Values

App names, ports, and resource limits are hardcoded across workflows.

Recommendation: Move to environment variables or workflow inputs.

7. Missing Rollback Strategy

Recommendation: Add rollback job with previous image tracking.


Test Coverage Assessment

Excellent Coverage: DoclingProcessor has 631 lines of tests covering all major paths

Missing Coverage: No integration tests for deployment workflows, no E2E tests for Code Engine deployment


Security Review

Good Practices: Minimal permissions in workflows, secrets properly referenced, environment-based approvals

Concerns: Consider using OIDC instead of KUBECONFIG in secrets, consider secret managers like HashiCorp Vault


Approval Status

Assessment: Approve with Conditions

This is a substantial and well-structured PR with excellent deployment infrastructure. The code quality is high, and test coverage is comprehensive.

Conditions for merge:

  1. Fix resource group auto-selection issue (Critical)
  2. Add health check retry logic (Critical for production stability)
  3. Fix K8s secret update handling (Critical for security)
  4. Add KUBECONFIG validation (Critical for deployment reliability)

After addressing critical issues, this PR will significantly improve project deployment capabilities.


Great Work!

The infrastructure additions are impressive and well-thought-out. The DoclingProcessor integration follows TDD principles excellently. Once the critical deployment safety issues are addressed, this will be a solid addition to the codebase.

Files reviewed: 77 files across workflows, deployment configs, backend services, and tests.


claude bot commented Oct 4, 2025

Pull Request Review - Staging Deployment Infrastructure

Overview

This PR introduces comprehensive deployment infrastructure including IBM Code Engine staging deployment, Kubernetes/Helm production capabilities, and enhanced CI/CD workflows. Total changes: 12,124 additions.

Strengths

1. Well-Structured Deployment Strategy

  • Clear separation between staging (IBM Code Engine) and production (K8s/Helm)
  • Excellent use of pre-built images (5-min vs 15+ min deployment)
  • Proper environment-based configuration (dev/staging/prod)

2. Security Best Practices

  • Minimal permissions principle applied
  • Secrets properly referenced via GitHub Secrets
  • Kubeconfig with proper permissions (chmod 600)
  • Using --dry-run=client before applying K8s resources

3. Good Documentation

  • Comments throughout workflows
  • Links to related issues
  • Clear deployment instructions in CLAUDE.md

CRITICAL Issues

1. Health Check Failures Not Fatal

Location: .github/workflows/ibm-code-engine-staging.yml:221-224

Health checks fail silently. Add retry logic with proper failure handling.

2. Resource Group Auto-Selection

Location: .github/workflows/ibm-code-engine-staging.yml:105

Automatically selects first resource group. Should use explicit IBM_CLOUD_RESOURCE_GROUP variable.

3. Unsafe Script Execution

Location: .github/workflows/ibm-code-engine-staging.yml:82

Downloading scripts without verification. Pin versions or use official GitHub Actions.

4. Secrets Exposure Risk

Location: .github/workflows/k8s-deploy-production.yml:167-183

Creating secrets imperatively could expose data in logs. Use sealed-secrets or Vault.

5. Missing K8s Error Handling

Location: .github/workflows/k8s-deploy-production.yml:207-218

Smoke tests may fail silently if service URL unavailable.

HIGH Priority Issues

  • Incomplete kubectl configuration (only commented examples)
  • Missing rollback strategy for failed deployments
  • Duplicate disk cleanup code across 3 workflows
  • No tests for deployment scripts

Documentation Gaps

  1. Missing step-by-step Code Engine deployment guide
  2. Missing required GitHub Secrets documentation
  3. Missing troubleshooting guide
  4. Incomplete K8s provider configuration

Must Fix Before Merge

  1. Add retry logic to health checks with proper failure handling
  2. Implement resource group validation (IBM_CLOUD_RESOURCE_GROUP variable)
  3. Add error handling for missing KUBECONFIG
  4. Document all required GitHub Secrets and Variables

Overall Assessment

Score: 7/10

Solid foundation with good security practices. Critical reliability issues must be addressed before production use.

Recommendation:

  • Safe to merge to staging AFTER fixing critical health check issues
  • Block production deployment until K8s configuration is complete
  • Create deployment guide before wider adoption

Great work on this infrastructure foundation! Address critical issues above, especially health check retry logic and resource group validation.


claude bot commented Oct 4, 2025

Pull Request Review: Staging (#296)

Overview

This is a large PR with 12,124 additions across 77 files, primarily focused on deployment infrastructure, CI/CD improvements, and backend enhancements. The changes introduce production-ready Kubernetes/Helm deployments, IBM Code Engine staging workflow, and Docling document processing capabilities.


Strengths

1. Excellent CI/CD Infrastructure

  • Disk space management: Smart cleanup strategy in workflows prevents build failures
  • IBM Code Engine workflow: Well-structured with clear job separation, uses pre-built images for fast deployment (~5 min vs 15+ min), follows principle of least privilege
  • Health checks: Proper verification after deployment
  • Environment separation: Clear dev/staging/prod environments

2. Production-Ready Helm Charts

  • Well-organized with proper templating
  • Good resource management across environments
  • Proper secret handling with optional keys for API credentials
  • Health probes configured correctly (liveness + readiness)
  • HPA support for autoscaling

3. Backend Code Quality

  • DoclingProcessor: Clean architecture with proper error handling
  • Good test coverage (14 test cases for Docling processor)
  • Proper async/await usage and type hints throughout

Issues and Concerns

CRITICAL: Security Issues

1. Secrets Exposure Risk in K8s Workflow (k8s-deploy-production.yml:167-183)

  • No validation that secrets exist before use
  • Secrets recreated on every run
  • Missing error handling if secrets are not set

Fix: Add secret validation and use idempotent secret creation
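
A hedged sketch of what validation plus idempotent creation could look like (the secret and variable names are hypothetical):

```bash
if [ -z "${WATSONX_APIKEY:-}" ]; then
  echo "WATSONX_APIKEY is not set" >&2
  exit 1
fi
kubectl create secret generic backend-secrets \
  --from-literal=WATSONX_APIKEY="${WATSONX_APIKEY}" \
  --dry-run=client -o yaml | kubectl apply -f -
```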

2. Hardcoded KUBECONFIG Decoding (k8s-deploy-production.yml:145-147)

  • No validation that KUBECONFIG secret exists
  • File permissions could be more restrictive (use chmod 400)

3. Sensitive Files in .gitignore

  • Patterns like .secrets and api-key.json suggest local credential storage
  • Consider documenting secure credential management practices

HIGH: Workflow Stability Issues

1. Missing Timeout Protection (ibm-code-engine-staging.yml:214-225)

  • Fixed 30s sleep may not be enough (cold starts can take 60-90s)
  • Health check failure is logged as warning but does not fail the workflow
  • Fix: Add retry logic with exponential backoff and fail on timeout

2. Race Condition in Resource Group Selection (ibm-code-engine-staging.yml:105-107)

  • Assumes first resource group is correct
  • Fix: Use explicit resource group from environment variable

3. K8s Deployment Lacks Rollback Strategy (k8s-deploy-production.yml:189-198)

  • No automatic rollback on failed deployment
  • Fix: Add --atomic flag to helm upgrade and explicit rollback step on failure
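
A minimal sketch of the suggested flag (chart path and values file are taken from this PR; the timeout is a placeholder):

```bash
helm upgrade --install rag-modulo deployment/helm/rag-modulo \
  -f deployment/helm/rag-modulo/values-prod.yaml \
  --atomic --timeout 10m   # roll back automatically if the release does not become ready
```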

MEDIUM: Code Quality Issues

1. Incomplete Error Context (docling_processor.py:89-91)

  • Re-raising generic Exception loses context
  • Fix: Wrap in custom DocumentProcessingError exception

2. Missing Validation in Helm Values (values-prod.yaml:93-96)

  • Example.com domains in production values
  • Fix: Use placeholders or add validation

3. Disk Cleanup Code Duplication

  • Identical disk cleanup in multiple workflows
  • Fix: Create reusable composite action

Test Coverage

Good:

  • 14 comprehensive tests for DoclingProcessor
  • Tests cover success paths, error handling, and edge cases

Missing:

  • No integration tests for deployment workflows
  • No Helm chart validation tests
  • No smoke tests in K8s deployment workflow

Recommendations Summary

Must Fix (Before Merge):

  1. Add secret validation in K8s workflow
  2. Fix health check to fail workflow on failure (IBM Code Engine)
  3. Add rollback strategy to K8s deployment (use --atomic flag)
  4. Change example.com domains in prod values to placeholders

Should Fix (High Priority):

  1. Add retry logic to health checks with exponential backoff
  2. Use explicit resource group variable for IBM Cloud
  3. Add error context wrapping in DoclingProcessor
  4. Create reusable composite action for disk cleanup

Nice to Have:

  1. Add Helm chart unit tests
  2. Add smoke tests to K8s workflow
  3. Create deployment architecture documentation
  4. Add deployment rollback documentation

Conclusion

This PR represents a significant improvement to deployment infrastructure. The Kubernetes/Helm support and IBM Code Engine staging workflow are excellent additions. Code quality is generally good with proper testing.

However, there are critical security and stability issues that should be addressed before merging, particularly around secret handling and health check validation.

Recommendation: Request changes for critical security issues, then approve after fixes.


Files Reviewed: 77 files changed (+12,124, -99)
Review Focus: CI/CD workflows, Helm charts, backend code, security, testing

Good work overall! The infrastructure improvements are substantial. Just need to tighten up security and error handling.

…le migration plan

- Add deployment configuration to .env.example and .env.ci
- Create GitHub Actions workflow for OpenShift staging deployment
- Add OpenShift manifests (PostgreSQL, etcd, MinIO, Milvus, routes)
- Update Helm chart with OpenShift compatibility fixes:
  - Configurable container registry
  - Backend health check path (/api/health)
  - Backend COLLECTIONDB_PASS env var
  - Frontend nginx writable volumes
  - Frontend container port 8080
- Create automated deployment script (deploy-openshift-staging.sh)
- Update deployment documentation with CI/CD guide
- Add DEPLOYMENT_PROGRESS.md with Terraform + Ansible migration plan
- Fix pre-commit to exclude Helm templates from YAML validation

Related: #261

🚀 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude <noreply@anthropic.com>