Description
Problem
Our CI/CD pipeline builds Docker images multiple times per PR, causing excessive disk usage and build failures.
Identified Issues
1. Duplicate Backend Builds ✅ PARTIALLY FIXED
Problem: Backend built TWICE per PR
- ci.yml (CI/CD Pipeline) - Line 168
- dev-environment-ci.yml - Line 144
Fix Applied (PR #323): Updated dev-environment-ci.yml to only trigger on .devcontainer changes
Remaining: Full pipeline redesign for optimal performance
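A path filter along the following lines would restrict dev-environment-ci.yml to devcontainer changes. This is only a sketch of the idea, not the exact change made in PR #323; the glob pattern and the inclusion of the workflow file itself are assumptions.

```yaml
# .github/workflows/dev-environment-ci.yml (trigger section only)
on:
  pull_request:
    branches: [main]
    paths:
      # Run only when the dev container definition changes.
      - ".devcontainer/**"
      # Assumption: also re-run when this workflow file itself is edited.
      - ".github/workflows/dev-environment-ci.yml"
```

With a filter like this, an ordinary backend change no longer starts the second backend build.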
2. Limited Security Scanning
Current: Only gitleaks + trufflehog (secret scanning)
Missing:
- Dockerfile security (Hadolint)
- Container image security (Dockle)
- CVE scanning (Trivy)
- SBOM generation (Syft)
- Image signing (Cosign)
3. Poor Visibility
Current: Monolithic make lint hides which specific linter failed
Impact: Must re-run entire lint suite to debug one failure
4. No Integration Tests
Current: Only unit tests
Missing:
- Smoke tests (quick validation)
- Integration tests (full stack)
- E2E tests (user flows)
Solution: 3-Stage Pipeline (Inspired by IBM MCP Context Forge)
Reference: IBM MCP Context Forge - Production-grade CI/CD with 2.6k⭐
Design Principles
- Build Once, Test Everywhere - No duplicate builds
- Fast Feedback First - Cheap checks before expensive
- Separation of Concerns - One workflow, one purpose
- Matrix for Visibility - Parallel jobs show individual failures
- Security-First - Comprehensive scanning at every stage
- Fail-Fast: False - Show all failures, not just first
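One possible way to realize "Build Once, Test Everywhere" is to export the image as a tarball in the build job and load it in later jobs. The sketch below assumes this approach; the job names, the backend:ci tag, the artifact name, and the Dockerfile location are hypothetical.

```yaml
# Sketch: build the image once, then reuse it in a later job.
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          context: .
          file: Dockerfile.backend        # assumed location
          tags: backend:ci                # hypothetical tag
          # Export the image as a tarball instead of pushing it.
          outputs: type=docker,dest=/tmp/backend.tar
      - uses: actions/upload-artifact@v4
        with:
          name: backend-image
          path: /tmp/backend.tar

  smoke:
    needs: build                          # runs only after the single build succeeds
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: backend-image
          path: /tmp
      # Load the prebuilt image; no second docker build is needed.
      - run: docker load --input /tmp/backend.tar
      - run: docker image inspect backend:ci
```

The trade-off is artifact upload and download time against a second full build; for large images a registry push may be the better fit.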
Architecture Overview
Pull Request
│
├─ STAGE 1: Fast Feedback (2-3 min, parallel)
│  ├─ Lint Matrix (10 linters in parallel)
│  ├─ Security Scan (gitleaks, trufflehog)
│  └─ Test Isolation (atomic tests)
│
├─ STAGE 2: Build & Test (6-8 min, parallel)
│  ├─ Unit Tests (Python 3.12, 80% coverage)
│  └─ Build & Scan (matrix: backend + frontend)
│     ├─ Hadolint → SARIF
│     ├─ Build with BuildKit cache
│     ├─ Dockle → SARIF
│     ├─ Trivy → SARIF
│     └─ Syft → SBOM
│
└─ STAGE 3: Integration (5-7 min, parallel)
   ├─ Smoke Tests
   └─ Integration Tests
Result: 10-12 minutes (down from 17+), 1 build (down from 2)
Implementation Plan
Phase 1: Foundation (Week 1) - Quick Wins
Goal: Eliminate duplicate builds, improve visibility
Tasks:
- ✅ Fix dev-environment-ci triggers (DONE in PR #323)
- Create 01-lint.yml with matrix strategy
- Add BuildKit caching to build
- Set fail-fast: false everywhere
- Update makefile-testing.yml triggers
Deliverables:
- .github/workflows/01-lint.yml (new)
- Updated .github/workflows/ci.yml (BuildKit cache)
- Updated .github/workflows/makefile-testing.yml (narrow triggers)
- docs/development/ci-cd-lint-strategy.md (documentation)
Success Metrics:
- ✅ Zero duplicate builds
- ✅ CI time < 12 minutes
- ✅ Each linter visible in GitHub UI
Phase 2: Security (Week 2) - Production Readiness
Goal: Comprehensive security scanning
Tasks:
- Create 03-build-secure.yml
- Add Hadolint (Dockerfile security)
- Add Dockle (image security)
- Add Trivy (CVE scanning)
- Add Syft (SBOM generation)
- Configure SARIF uploads
- Add weekly CVE cron
Deliverables:
- .github/workflows/03-build-secure.yml (new)
- SARIF integration with GitHub Security tab
- SBOM artifacts for all images
- docs/development/ci-cd-security.md (documentation)
Success Metrics:
- ✅ All builds have SBOM
- ✅ Zero CRITICAL CVEs
- ✅ Security findings in GitHub Security tab
Phase 3: Testing (Week 3) - Quality Assurance
Goal: Comprehensive test coverage
Tasks:
- Create 02-test-unit.yml with coverage
- Increase coverage to 80%
- Create smoke test suite
- Create 04-integration.yml
- Implement integration test matrix
Deliverables:
- .github/workflows/02-test-unit.yml (new)
- .github/workflows/04-integration.yml (new)
- backend/tests/smoke/ (new test directory)
- Enhanced backend/tests/integration/
- docs/development/ci-cd-testing.md (documentation)
Success Metrics:
- ✅ 80% unit test coverage
- ✅ 100% critical paths in smoke tests
- ✅ Integration tests < 7 minutes
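The 80% coverage target could be enforced in CI rather than tracked by hand. A minimal sketch of 02-test-unit.yml follows, assuming the pytest-cov plugin is a dev dependency; the backend/tests/unit path and the install steps are assumptions, not taken from the repository.

```yaml
# Sketch of 02-test-unit.yml with a coverage gate.
name: Unit Tests
on:
  pull_request:
    branches: [main]
jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install Poetry and dependencies
        run: |
          pipx install poetry
          cd backend && poetry install
      - name: Run unit tests with coverage gate
        # --cov-fail-under makes pytest exit non-zero below 80% coverage.
        # tests/unit is an assumed path.
        run: |
          cd backend
          poetry run pytest tests/unit \
            --cov=rag_solution \
            --cov-report=xml \
            --cov-fail-under=80
```

Until coverage actually reaches 80%, the threshold could start at the current level and be raised step by step.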
Phase 4: Advanced (Week 4+) - Excellence
Goal: Advanced features
Tasks:
- Add Cosign image signing
- Add SBOM attestation
- Add E2E tests (Playwright)
- Add performance benchmarks
- Create deployment automation
Deliverables:
- .github/workflows/05-deploy.yml (new)
- Image signing on main branch
- E2E test suite (optional)
- Performance baselines
- docs/development/ci-cd-deployment.md (documentation)
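For the image signing task, one option is keyless Cosign signing via GitHub's OIDC token. The sketch below assumes images are pushed to GHCR and that the repository name is lowercase; the registry, image name, workflow name, and Dockerfile path are all hypothetical.

```yaml
# Sketch: keyless Cosign signing on pushes to main.
name: Sign Image
on:
  push:
    branches: [main]
jobs:
  sign:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      packages: write    # push to GHCR
      id-token: write    # OIDC token for keyless signing
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - id: build
        uses: docker/build-push-action@v5
        with:
          context: .
          file: Dockerfile.backend          # assumed location
          push: true
          tags: ghcr.io/${{ github.repository }}/backend:latest
      - uses: sigstore/cosign-installer@v3
      - name: Sign by digest
        # Signing the digest (not the tag) pins the signature to the exact image.
        run: cosign sign --yes "ghcr.io/${{ github.repository }}/backend@${{ steps.build.outputs.digest }}"
```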
Workflow Specifications
01-lint.yml (New - Matrix Strategy)
name: Lint & Static Analysis
on:
  pull_request:
    branches: [main]
jobs:
  lint:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        include:
          # Config files
          - {id: yamllint, cmd: "yamllint .github/"}
          - {id: jsonlint, cmd: "find . -name '*.json' -exec jq empty {} \\;"}
          - {id: toml, cmd: "tomlcheck backend/pyproject.toml"}
          # Python (backend)
          - {id: ruff-check, cmd: "cd backend && poetry run ruff check rag_solution/"}
          - {id: ruff-format, cmd: "cd backend && poetry run ruff format --check rag_solution/"}
          - {id: mypy, cmd: "cd backend && poetry run mypy rag_solution/"}
          - {id: pylint, cmd: "cd backend && poetry run pylint rag_solution/"}
          - {id: pydocstyle, cmd: "cd backend && poetry run pydocstyle rag_solution/"}
          # JavaScript (frontend)
          - {id: eslint, cmd: "cd frontend && npm run lint"}
          - {id: prettier, cmd: "cd frontend && npm run format:check"}
    steps:
      - uses: actions/checkout@v4
      - name: Setup & Run ${{ matrix.id }}
        run: ${{ matrix.cmd }}
Benefit: 10 linters run in parallel, clear failure visibility
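The matrix commands assume Poetry and npm dependencies are already installed on the runner. One way to handle that is a per-entry setup step, sketched here with only two of the ten entries; the toolchain key, the Node version, and the use of npm ci are assumptions.

```yaml
# Sketch: per-entry toolchain setup for the lint matrix.
jobs:
  lint:
    name: lint (${{ matrix.id }})         # each linter gets its own row in the UI
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        include:
          # `toolchain` is a hypothetical extra key used only by the `if:` checks below.
          - {id: ruff-check, toolchain: python, cmd: "cd backend && poetry run ruff check rag_solution/"}
          - {id: eslint, toolchain: node, cmd: "cd frontend && npm run lint"}
    steps:
      - uses: actions/checkout@v4
      - if: matrix.toolchain == 'python'
        uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - if: matrix.toolchain == 'python'
        run: pipx install poetry && cd backend && poetry install
      - if: matrix.toolchain == 'node'
        uses: actions/setup-node@v4
        with:
          node-version: "20"              # assumed version
      - if: matrix.toolchain == 'node'
        run: cd frontend && npm ci
      - name: Run ${{ matrix.id }}
        run: ${{ matrix.cmd }}
```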
03-build-secure.yml (New - Security Pipeline)
name: Secure Docker Build
on:
  pull_request:
    branches: [main]
  schedule:
    - cron: "17 18 * * 2" # Weekly CVE check
jobs:
  build-scan:
    strategy:
      matrix:
        image: [backend, frontend]
    steps:
      # 1. Hadolint (Dockerfile security)
      - run: hadolint Dockerfile.${{ matrix.image }} --format sarif
      # 2. Build with cache
      - uses: docker/build-push-action@v5
        with:
          cache-from: type=gha
          cache-to: type=gha,mode=max
      # 3. Dockle (image security)
      - run: dockle --format sarif ${{ matrix.image }}:latest
      # 4. Trivy (CVE scan)
      - uses: aquasecurity/trivy-action@master
        with:
          severity: 'CRITICAL,HIGH'
          exit-code: '1'
      # 5. Syft (SBOM)
      - uses: anchore/sbom-action@v0
        with:
          format: spdx-json
      # 6. Upload SARIF
      - uses: github/codeql-action/upload-sarif@v3
Benefit: Production-grade security, SARIF in GitHub Security tab
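For findings to reach the Security tab, each scanner has to write its SARIF output to a file that the upload step then references. A minimal sketch for the Hadolint step follows, using hadolint/hadolint-action instead of a bare hadolint call (an assumption, since the binary is not expected to be preinstalled on the runner); the file and category names are hypothetical.

```yaml
# Sketch: Hadolint -> SARIF file -> GitHub Security tab.
jobs:
  hadolint:
    runs-on: ubuntu-latest
    permissions:
      contents: read
      security-events: write              # required by upload-sarif
    strategy:
      fail-fast: false
      matrix:
        image: [backend, frontend]
    steps:
      - uses: actions/checkout@v4
      - uses: hadolint/hadolint-action@v3.1.0
        with:
          dockerfile: Dockerfile.${{ matrix.image }}
          format: sarif
          output-file: hadolint-${{ matrix.image }}.sarif
          # Keep the job green so findings surface in the Security tab instead.
          no-fail: true
      - uses: github/codeql-action/upload-sarif@v3
        with:
          sarif_file: hadolint-${{ matrix.image }}.sarif
          # A distinct category per image keeps the two result sets separate.
          category: hadolint-${{ matrix.image }}
```

The same write-then-upload pattern would apply to the Dockle and Trivy steps.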
04-integration.yml (New - Stage 3)
name: Integration & Smoke Tests
jobs:
  smoke-tests:
    # Duration: 2-3 min
    steps:
      - Start minimal stack (backend + postgres)
      - Test health endpoints
      - Test critical APIs
  integration-tests:
    # Duration: 5-7 min
    strategy:
      matrix:
        suite: [api, vectordb, storage, fullstack]
    steps:
      - Start required services
      - Run integration tests
Benefit: Validates images actually work end-to-end
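The smoke-test outline above could translate into concrete steps roughly as follows. This is a sketch under assumptions: a Compose file at the repository root, services named backend and postgres, and a health endpoint at http://localhost:8000/api/health are all hypothetical.

```yaml
# Sketch of the smoke-tests job.
jobs:
  smoke-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 5
    steps:
      - uses: actions/checkout@v4
      - name: Start minimal stack (backend + postgres)
        # --wait blocks until the services are running (and healthy, if health checks are defined).
        run: docker compose up -d --wait backend postgres
      - name: Check health endpoint
        # Hypothetical URL; retries cover the window while the backend is still starting.
        run: curl --fail --retry 10 --retry-delay 3 --retry-connrefused http://localhost:8000/api/health
      - name: Dump logs on failure
        if: failure()
        run: docker compose logs
      - name: Tear down
        if: always()
        run: docker compose down -v
```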
Expected Outcomes
Before
- Workflows: 9 run (2 duplicate builds)
- Time: 17+ minutes
- Visibility: Poor
- Security: Basic (2 tools)
- Coverage: 60%
- Failures: Common (disk space)
After
- Workflows: 6 focused
- Time: 10-12 minutes
- Visibility: Excellent (matrix)
- Security: Comprehensive (5+ tools)
- Coverage: 80%
- Failures: Rare
Improvement: roughly 30-40% faster (17+ min down to 10-12 min), about 50% less disk (1 build instead of 2), production-grade security
Documentation to Generate
As part of each phase implementation:
Phase 1:
- docs/development/ci-cd-overview.md - Architecture overview
- docs/development/ci-cd-lint-matrix.md - Linting strategy
Phase 2:
- docs/development/ci-cd-security-pipeline.md - Security scanning
- docs/development/ci-cd-sbom.md - SBOM generation
Phase 3:
- docs/development/ci-cd-testing-strategy.md - Test organization
- docs/development/ci-cd-integration-tests.md - Integration test guide
Phase 4:
- docs/development/ci-cd-deployment.md - Deployment automation
- docs/development/ci-cd-signing.md - Image signing with Cosign
Reference Architecture
Full design spec available in local file: CICD_REDESIGN_MCP_INSPIRED.md
Key inspirations from IBM MCP Context Forge:
- Separate workflows by concern
- Matrix strategy for linting (lint.yml)
- Comprehensive security (docker-image.yml)
- Multi-version testing (pytest.yml) - adapted: we use 3.12 only
- Proper permissions (least privilege)
- BuildKit caching
- SARIF integration
Success Criteria
✅ Phase 1: No duplicate builds, CI < 12 min, matrix visibility
✅ Phase 2: SBOM for all images, zero CRITICAL CVEs, SARIF integration
✅ Phase 3: 80% coverage, smoke tests, integration tests < 7min
✅ Phase 4: Signed images, E2E tests, automated deployment
Priority
High - Affects all PRs, enables production deployment
Implementation starts after this plan is approved.