CI/CD Pipeline Optimization: Eliminate Duplicate Builds and Improve Efficiency #324

@manavgup

Problem

Our CI/CD pipeline builds Docker images multiple times per PR, causing excessive disk usage and build failures.

Identified Issues

1. Duplicate Backend Builds ✅ PARTIALLY FIXED

Problem: Backend built TWICE per PR

  • ci.yml (CI/CD Pipeline) - Line 168
  • dev-environment-ci.yml - Line 144

Fix Applied (PR #323): Updated dev-environment-ci.yml to only trigger on .devcontainer changes
Remaining: Full pipeline redesign for optimal performance
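
For illustration, a trigger restricted to dev-container changes might look like the fragment below. This is a sketch, not the actual contents of PR #323: the `branches` filter and the exact glob are assumptions.

```yaml
# dev-environment-ci.yml (trigger section only; a sketch, not the merged change)
on:
  pull_request:
    branches: [main]          # assumed to match the other workflows in this issue
    paths:
      - ".devcontainer/**"    # run only when dev-container files change
```

With a `paths` filter in place, PRs that do not touch `.devcontainer/` skip this workflow entirely, so only ci.yml builds the backend.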

2. Limited Security Scanning

Current: Only gitleaks + trufflehog (secret scanning)
Missing:

  • Dockerfile security (Hadolint)
  • Container image security (Dockle)
  • CVE scanning (Trivy)
  • SBOM generation (Syft)
  • Image signing (Cosign)

3. Poor Visibility

Current: Monolithic make lint hides which specific linter failed
Impact: Must re-run entire lint suite to debug one failure

4. No Integration Tests

Current: Only unit tests
Missing:

  • Smoke tests (quick validation)
  • Integration tests (full stack)
  • E2E tests (user flows)

Solution: 3-Stage Pipeline (Inspired by IBM MCP Context Forge)

Reference: IBM MCP Context Forge - Production-grade CI/CD with 2.6k⭐

Design Principles

  1. Build Once, Test Everywhere - No duplicate builds
  2. Fast Feedback First - Cheap checks before expensive
  3. Separation of Concerns - One workflow, one purpose
  4. Matrix for Visibility - Parallel jobs show individual failures
  5. Security-First - Comprehensive scanning at every stage
  6. Fail-Fast: False - Show all failures, not just first
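
As a rough sketch of principle 1, one common way to build once and reuse the image in later jobs is to export it as a tarball and pass it along as an artifact. The tag `backend:ci`, the artifact name, the build context, and the job names are hypothetical; only `Dockerfile.backend` follows the naming used in the specs in this issue.

```yaml
jobs:
  build:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          context: .                       # assumed build context
          file: Dockerfile.backend
          tags: backend:ci                 # hypothetical local tag
          outputs: type=docker,dest=/tmp/backend.tar
          cache-from: type=gha
          cache-to: type=gha,mode=max
      - uses: actions/upload-artifact@v4
        with:
          name: backend-image              # hypothetical artifact name
          path: /tmp/backend.tar
          retention-days: 1

  smoke-tests:
    needs: build                           # reuses the image instead of rebuilding
    runs-on: ubuntu-latest
    steps:
      - uses: actions/download-artifact@v4
        with:
          name: backend-image
          path: /tmp
      - run: docker load --input /tmp/backend.tar
      - run: docker image inspect backend:ci   # confirm the shared image is present
```

The trade-off is artifact upload and download time for large images; pushing to a registry and pulling by digest is an alternative if that becomes a bottleneck.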

Architecture Overview

Pull Request
│
├─ STAGE 1: Fast Feedback (2-3 min, parallel)
│  ├─ Lint Matrix (10 linters in parallel)
│  ├─ Security Scan (gitleaks, trufflehog)
│  └─ Test Isolation (atomic tests)
│
├─ STAGE 2: Build & Test (6-8 min, parallel)
│  ├─ Unit Tests (Python 3.12, 80% coverage)
│  └─ Build & Scan (matrix: backend + frontend)
│     ├─ Hadolint → SARIF
│     ├─ Build with BuildKit cache
│     ├─ Dockle → SARIF
│     ├─ Trivy → SARIF
│     └─ Syft → SBOM
│
└─ STAGE 3: Integration (5-7 min, parallel)
   ├─ Smoke Tests
   └─ Integration Tests

Result: 10-12 minutes (down from 17+), 1 build (down from 2)
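
One possible way to enforce the stage ordering in the diagram above is a small orchestrating workflow that calls the per-concern workflows as reusable workflows and gates them with `needs`. This is only a sketch: the file name `pr-pipeline.yml` is hypothetical, and it assumes each called workflow declares `on: workflow_call` (instead of triggering on `pull_request` itself, which would run it twice).

```yaml
# .github/workflows/pr-pipeline.yml (hypothetical orchestrator)
name: PR Pipeline

on:
  pull_request:
    branches: [main]

jobs:
  # Stage 1: fast feedback
  lint:
    uses: ./.github/workflows/01-lint.yml

  # Stage 2: starts only if Stage 1 succeeds
  unit-tests:
    needs: lint
    uses: ./.github/workflows/02-test-unit.yml
  build-scan:
    needs: lint
    uses: ./.github/workflows/03-build-secure.yml

  # Stage 3: starts only if both Stage 2 jobs succeed
  integration:
    needs: [unit-tests, build-scan]
    uses: ./.github/workflows/04-integration.yml
```

Gating this way keeps cheap checks ahead of expensive ones, at the cost of running the stages strictly in sequence.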

Implementation Plan

Phase 1: Foundation (Week 1) - Quick Wins

Goal: Eliminate duplicate builds, improve visibility

Tasks:

  1. ✅ Fix dev-environment-ci triggers (DONE in PR #323)
  2. Create 01-lint.yml with matrix strategy
  3. Add BuildKit caching to build
  4. Set fail-fast: false everywhere
  5. Update makefile-testing.yml triggers

Deliverables:

  • .github/workflows/01-lint.yml (new)
  • Updated .github/workflows/ci.yml (BuildKit cache)
  • Updated .github/workflows/makefile-testing.yml (narrow triggers)
  • docs/development/ci-cd-lint-strategy.md (documentation)

Success Metrics:

  • ✅ Zero duplicate builds
  • ✅ CI time < 12 minutes
  • ✅ Each linter visible in GitHub UI

Phase 2: Security (Week 2) - Production Readiness

Goal: Comprehensive security scanning

Tasks:

  1. Create 03-build-secure.yml
  2. Add Hadolint (Dockerfile security)
  3. Add Dockle (image security)
  4. Add Trivy (CVE scanning)
  5. Add Syft (SBOM generation)
  6. Configure SARIF uploads
  7. Add weekly CVE cron

Deliverables:

  • .github/workflows/03-build-secure.yml (new)
  • SARIF integration with GitHub Security tab
  • SBOM artifacts for all images
  • docs/development/ci-cd-security.md (documentation)

Success Metrics:

  • ✅ All builds have SBOM
  • ✅ Zero CRITICAL CVEs
  • ✅ Security findings in GitHub Security tab

Phase 3: Testing (Week 3) - Quality Assurance

Goal: Comprehensive test coverage

Tasks:

  1. Create 02-test-unit.yml with coverage
  2. Increase coverage to 80%
  3. Create smoke test suite
  4. Create 04-integration.yml
  5. Implement integration test matrix

Deliverables:

  • .github/workflows/02-test-unit.yml (new)
  • .github/workflows/04-integration.yml (new)
  • backend/tests/smoke/ (new test directory)
  • Enhanced backend/tests/integration/
  • docs/development/ci-cd-testing.md (documentation)

Success Metrics:

  • ✅ 80% unit test coverage
  • ✅ 100% critical paths in smoke tests
  • ✅ Integration tests < 7 minutes
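
This issue does not include a spec for 02-test-unit.yml, so the following is only a sketch of how the 80% coverage gate could be enforced. It assumes unit tests live in `backend/tests/unit/` and that `pytest-cov` is among the Poetry dev dependencies; neither is confirmed here.

```yaml
name: Unit Tests

on:
  pull_request:
    branches: [main]

jobs:
  unit:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-python@v5
        with:
          python-version: "3.12"
      - name: Install backend dependencies
        working-directory: backend
        run: |
          pip install poetry
          poetry install
      - name: Run unit tests with coverage gate
        working-directory: backend
        # tests/unit/ is an assumed path; --cov-fail-under fails the job below 80%
        run: poetry run pytest tests/unit/ --cov=rag_solution --cov-fail-under=80
```

Since current coverage is reported as 60%, the threshold would need to start lower and be raised as coverage improves, otherwise every PR fails from day one.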

Phase 4: Advanced (Week 4+) - Excellence

Goal: Advanced features

Tasks:

  1. Add Cosign image signing
  2. Add SBOM attestation
  3. Add E2E tests (Playwright)
  4. Add performance benchmarks
  5. Create deployment automation

Deliverables:

  • .github/workflows/05-deploy.yml (new)
  • Image signing on main branch
  • E2E test suite (optional)
  • Performance baselines
  • docs/development/ci-cd-deployment.md (documentation)
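
A minimal sketch of keyless Cosign signing on the main branch, assuming images are pushed to GHCR (the registry, image path, and build context are assumptions, not decisions made in this issue). Signing by digest rather than by tag ties the signature to the exact image that was built.

```yaml
name: Sign Image

on:
  push:
    branches: [main]

permissions:
  contents: read
  packages: write   # push to GHCR (assumed registry)
  id-token: write   # OIDC token for keyless signing

jobs:
  sign:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: sigstore/cosign-installer@v3
      - uses: docker/login-action@v3
        with:
          registry: ghcr.io
          username: ${{ github.actor }}
          password: ${{ secrets.GITHUB_TOKEN }}
      - uses: docker/setup-buildx-action@v3
      - id: build
        uses: docker/build-push-action@v5
        with:
          context: .                 # assumed build context
          file: Dockerfile.backend
          push: true
          # Assumes the repository path is all lowercase, as GHCR requires
          tags: ghcr.io/${{ github.repository }}/backend:${{ github.sha }}
      - name: Sign by digest
        run: cosign sign --yes "ghcr.io/${{ github.repository }}/backend@${{ steps.build.outputs.digest }}"
```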

Workflow Specifications

01-lint.yml (New - Matrix Strategy)

name: Lint & Static Analysis

on:
  pull_request:
    branches: [main]

permissions:
  contents: read

jobs:
  lint:
    name: ${{ matrix.id }}  # each linter appears as its own check in the GitHub UI
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false  # report every failing linter, not just the first
      matrix:
        include:
          # Config files
          - {id: yamllint, cmd: "yamllint .github/"}
          - {id: jsonlint, cmd: "find . -name '*.json' -exec jq empty {} \\;"}
          - {id: toml, cmd: "tomlcheck backend/pyproject.toml"}

          # Python (backend)
          - {id: ruff-check, cmd: "cd backend && poetry run ruff check rag_solution/"}
          - {id: ruff-format, cmd: "cd backend && poetry run ruff format --check rag_solution/"}
          - {id: mypy, cmd: "cd backend && poetry run mypy rag_solution/"}
          - {id: pylint, cmd: "cd backend && poetry run pylint rag_solution/"}
          - {id: pydocstyle, cmd: "cd backend && poetry run pydocstyle rag_solution/"}

          # JavaScript (frontend)
          - {id: eslint, cmd: "cd frontend && npm run lint"}
          - {id: prettier, cmd: "cd frontend && npm run format:check"}

    steps:
      - uses: actions/checkout@v4
      # Tool installation (Poetry and backend deps, npm ci, yamllint, tomlcheck)
      # is omitted from this spec and must run before the linter.
      - name: Run ${{ matrix.id }}
        run: ${{ matrix.cmd }}

Benefit: 10 linters run in parallel, clear failure visibility
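
The spec above runs each linter command directly, so the tools must be installed first. One way to do that without installing everything in every job is an extra matrix key that selects the setup steps. In this sketch `stack` is a hypothetical key, only two of the ten entries are shown, and Node 20 plus a committed `package-lock.json` are assumptions.

```yaml
jobs:
  lint:
    name: ${{ matrix.id }}
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        include:
          # "stack" is a hypothetical extra key that selects the setup steps
          - {id: ruff-check, stack: python, cmd: "cd backend && poetry run ruff check rag_solution/"}
          - {id: eslint, stack: node, cmd: "cd frontend && npm run lint"}
    steps:
      - uses: actions/checkout@v4

      # Python toolchain, only for backend linters
      - uses: actions/setup-python@v5
        if: matrix.stack == 'python'
        with:
          python-version: "3.12"
      - name: Install backend tooling
        if: matrix.stack == 'python'
        run: pip install poetry && cd backend && poetry install

      # Node toolchain, only for frontend linters
      - uses: actions/setup-node@v4
        if: matrix.stack == 'node'
        with:
          node-version: "20"      # assumed version
      - name: Install frontend tooling
        if: matrix.stack == 'node'
        run: cd frontend && npm ci

      - name: Run ${{ matrix.id }}
        run: ${{ matrix.cmd }}
```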

03-build-secure.yml (New - Security Pipeline)

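name: Secure Docker Build

on:
  pull_request:
    branches: [main]
  schedule:
    - cron: "17 18 * * 2"  # Weekly CVE check (Tuesdays, 18:17 UTC)

permissions:
  contents: read
  security-events: write  # required for SARIF upload

jobs:
  build-scan:
    runs-on: ubuntu-latest
    strategy:
      fail-fast: false
      matrix:
        image: [backend, frontend]

    steps:
      - uses: actions/checkout@v4

      # 1. Hadolint (Dockerfile security); assumes hadolint is installed on the runner
      - run: hadolint --format sarif Dockerfile.${{ matrix.image }} > hadolint.sarif

      # 2. Build with cache (the gha cache backend requires Buildx)
      - uses: docker/setup-buildx-action@v3
      - uses: docker/build-push-action@v5
        with:
          context: .  # assumed build context
          file: Dockerfile.${{ matrix.image }}
          load: true  # keep the image in the local daemon for the scanners
          tags: ${{ matrix.image }}:latest
          cache-from: type=gha
          cache-to: type=gha,mode=max

      # 3. Dockle (image security); assumes dockle is installed on the runner
      - run: dockle --format sarif --output dockle.sarif ${{ matrix.image }}:latest

      # 4. Trivy (CVE scan); pin to a release tag or commit SHA rather than master
      - uses: aquasecurity/trivy-action@master
        with:
          image-ref: ${{ matrix.image }}:latest
          format: sarif
          output: trivy.sarif
          severity: 'CRITICAL,HIGH'
          exit-code: '1'

      # 5. Syft (SBOM); runs even if a scanner failed
      - uses: anchore/sbom-action@v0
        if: always()
        with:
          image: ${{ matrix.image }}:latest
          format: spdx-json

      # 6. Upload every SARIF file in the workspace, even if a scanner failed
      - uses: github/codeql-action/upload-sarif@v3
        if: always()
        with:
          sarif_file: .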

Benefit: Production-grade security, SARIF in GitHub Security tab

04-integration.yml (New - Stage 3)

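name: Integration & Smoke Tests

# Sketch: commands, service names, and timeouts below are assumptions;
# dependency installation is omitted.
on:
  pull_request:
    branches: [main]

jobs:
  smoke-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 5  # target duration: 2-3 min
    steps:
      - uses: actions/checkout@v4
      - name: Start minimal stack (backend + postgres)
        run: docker compose up -d --wait backend postgres
      - name: Test health endpoints and critical APIs
        run: cd backend && poetry run pytest tests/smoke/

  integration-tests:
    runs-on: ubuntu-latest
    timeout-minutes: 10  # target duration: 5-7 min
    strategy:
      fail-fast: false
      matrix:
        suite: [api, vectordb, storage, fullstack]
    steps:
      - uses: actions/checkout@v4
      - name: Start required services
        run: docker compose up -d --wait
      - name: Run ${{ matrix.suite }} integration tests
        run: cd backend && poetry run pytest tests/integration/ -k "${{ matrix.suite }}"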

Benefit: Validates images actually work end-to-end

Expected Outcomes

Before

  • Workflows: 9 run (2 duplicate builds)
  • Time: 17+ minutes
  • Visibility: Poor
  • Security: Basic (2 tools)
  • Coverage: 60%
  • Failures: Common (disk space)

After

  • Workflows: 6 focused
  • Time: 10-12 minutes
  • Visibility: Excellent (matrix)
  • Security: Comprehensive (5+ tools)
  • Coverage: 80%
  • Failures: Rare

Improvement: roughly 30-40% faster (17+ min down to 10-12 min), 50% less disk, production-grade security

Documentation to Generate

As part of each phase implementation:

Phase 1:

  • docs/development/ci-cd-overview.md - Architecture overview
  • docs/development/ci-cd-lint-matrix.md - Linting strategy

Phase 2:

  • docs/development/ci-cd-security-pipeline.md - Security scanning
  • docs/development/ci-cd-sbom.md - SBOM generation

Phase 3:

  • docs/development/ci-cd-testing-strategy.md - Test organization
  • docs/development/ci-cd-integration-tests.md - Integration test guide

Phase 4:

  • docs/development/ci-cd-deployment.md - Deployment automation
  • docs/development/ci-cd-signing.md - Image signing with Cosign

Reference Architecture

Full design spec available in local file: CICD_REDESIGN_MCP_INSPIRED.md

Key inspirations from IBM MCP Context Forge:

  • Separate workflows by concern
  • Matrix strategy for linting (lint.yml)
  • Comprehensive security (docker-image.yml)
  • Multi-version testing (pytest.yml) - adapted: we use 3.12 only
  • Proper permissions (least privilege)
  • BuildKit caching
  • SARIF integration

Success Criteria

Phase 1: No duplicate builds, CI < 12 min, matrix visibility
Phase 2: SBOM for all images, zero CRITICAL CVEs, SARIF integration
Phase 3: 80% coverage, smoke tests, integration tests < 7min
Phase 4: Signed images, E2E tests, automated deployment

Priority

High - Affects all PRs, enables production deployment


Implementation starts after this plan is approved.


Labels: enhancement, infrastructure
