@jcabrero jcabrero commented Aug 22, 2025

Summary

The main motivation for this PR is to fix a bug that exhausted the storage space on GitHub Actions runners and prevented artifacts from being uploaded to ECR. It also optimizes the CI pipeline for better performance and reliability by using self-hosted runners for image building and deployment.

Key Changes

Consolidated Build & Push Pipeline

  • Before: Images were built twice: once on the self-hosted runner for E2E tests, then again on GitHub runners for the ECR push
  • After: Images are built once on the self-hosted runner and conditionally pushed to ECR from the same environment
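
A minimal sketch of what a conditional push from the same self-hosted environment could look like. The job names follow the pipeline flow in this description; the runner label, the branch condition, the image name, the secret names, and the region are assumptions for illustration and are not taken from the actual workflow. It is a fragment: the `e2e-tests` job it depends on is not shown.

```yaml
# Sketch only: condition, labels, image names, secrets, and region are assumptions.
jobs:
  push-images:
    needs: e2e-tests                      # push only after the E2E tests succeed
    runs-on: self-hosted                  # same machine that built the images
    if: github.ref == 'refs/heads/main'   # assumed push condition
    steps:
      - name: Configure AWS credentials
        uses: aws-actions/configure-aws-credentials@v4
        with:
          aws-access-key-id: ${{ secrets.AWS_ACCESS_KEY_ID }}          # assumed secret name
          aws-secret-access-key: ${{ secrets.AWS_SECRET_ACCESS_KEY }}  # assumed secret name
          aws-region: us-east-1                                        # assumed region
      - name: Log in to Amazon ECR
        id: ecr-login
        uses: aws-actions/amazon-ecr-login@v2
      - name: Tag and push the already-built image
        env:
          REGISTRY: ${{ steps.ecr-login.outputs.registry }}
        run: |
          # No rebuild: reuse the image left in the local Docker daemon by the build job.
          docker tag nilai/api:latest "$REGISTRY/nilai/api:${{ github.sha }}"
          docker push "$REGISTRY/nilai/api:${{ github.sha }}"
```

The sketch relies on the build and push jobs sharing one Docker daemon on the self-hosted machine, which is what lets the push skip a second build.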

Parallel Image Building

  • Introduced matrix strategy with 3 concurrent runners (runners-per-machine: 3)
  • Build jobs for vllm, attestation, and api now run in parallel
  • Push jobs also parallelized for faster ECR uploads
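
The matrix strategy described above could be sketched roughly as follows. The three component names come from this description and the `build_args` value appears in the reviewed diff; which component that value belongs to, the Dockerfile paths, the image tags, and the runner label are assumptions.

```yaml
# Sketch only: Dockerfile paths, tags, and the runner label are assumptions.
jobs:
  build-images:
    needs: start-runner
    runs-on: self-hosted
    strategy:
      fail-fast: false          # one failing image does not cancel the other builds
      matrix:
        include:
          - component: vllm
            dockerfile: docker/vllm.Dockerfile          # assumed path
            build_args: ""
          - component: attestation
            dockerfile: docker/attestation.Dockerfile   # assumed path
            build_args: ""
          - component: api
            dockerfile: docker/api.Dockerfile           # assumed path
            build_args: "--target nilai --platform linux/amd64"
    steps:
      - name: Checkout
        uses: actions/checkout@v4
      - name: Build ${{ matrix.component }} image
        run: >
          docker build
          -t nilai/${{ matrix.component }}:latest
          -f ${{ matrix.dockerfile }}
          ${{ matrix.build_args }}
          .
```

Each matrix entry becomes its own job, so the three builds can land on the three concurrent runners and report status separately.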

📦 vLLM Size Compatibility

  • Updating vLLM from 0.7.3 to 0.10.1 increases the image size beyond GitHub Actions limits
  • Self-hosted runner approach resolves size constraints

Pipeline Flow

test → start-runner → [build-images × 3] → e2e-tests → [push-images × 3] → stop-runner
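
One way to express this ordering is with `needs` dependencies between jobs. The skeleton below is a sketch under assumptions: the step bodies are placeholders, the runner labels are guesses, and the `always()` condition on `stop-runner` is inferred from the reviewed diff rather than confirmed.

```yaml
# Skeleton only: step bodies are placeholders; runner labels are assumptions.
jobs:
  test:
    runs-on: ubuntu-latest
    steps:
      - run: echo "run unit tests"
  start-runner:
    needs: test
    runs-on: ubuntu-latest
    steps:
      - run: echo "start the self-hosted runner"
  build-images:
    needs: start-runner
    runs-on: self-hosted
    strategy:
      matrix:
        component: [vllm, attestation, api]
    steps:
      - run: echo "build ${{ matrix.component }}"
  e2e-tests:
    needs: build-images
    runs-on: self-hosted
    steps:
      - run: echo "run E2E tests"
  push-images:
    needs: e2e-tests
    runs-on: self-hosted
    strategy:
      matrix:
        component: [vllm, attestation, api]
    steps:
      - run: echo "push ${{ matrix.component }}"
  stop-runner:
    needs: [start-runner, push-images]
    if: ${{ always() }}   # run even if an earlier job failed, so the runner is not left up
    runs-on: ubuntu-latest
    steps:
      - run: echo "stop the self-hosted runner"
```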

Performance Improvements

  • ~3x faster builds: Parallel execution vs sequential
  • ~50% faster overall: Eliminates duplicate image builds
  • Better resource utilization
  • Cleaner logs: Matrix jobs provide individual status and failure isolation

@jcabrero jcabrero changed the title from "feat: parallelized image building and removed separate build step" to "fix: CICD pipeline to use Runner to Upload images to ECR" Aug 22, 2025
@jcabrero jcabrero requested review from blefo and Copilot August 22, 2025 14:55
Copilot AI left a comment

Pull Request Overview

This PR optimizes the CI/CD pipeline by consolidating image building and ECR pushing to self-hosted runners, eliminating duplicate builds and resolving storage constraints from larger vLLM images.

  • Introduces parallel matrix builds for vllm, attestation, and api images using 3 concurrent self-hosted runners
  • Moves image building from E2E test job to dedicated build-images job that runs before tests
  • Replaces GitHub runners with self-hosted runners for ECR push operations to handle larger image sizes

build_args: "--target nilai --platform linux/amd64"
steps:
  - name: Checkout
    uses: actions/checkout@v2
Copilot AI Aug 22, 2025

This step uses the outdated actions/checkout@v2. It should use actions/checkout@v4 for better performance and security features, and for consistency with the other jobs in the workflow.

Suggested change
- uses: actions/checkout@v2
+ uses: actions/checkout@v4

if: ${{ always() }}
steps:
  - name: Configure AWS credentials
    uses: aws-actions/configure-aws-credentials@v1
Copilot AI Aug 22, 2025

This step uses the outdated aws-actions/configure-aws-credentials@v1. It should use v4 for consistency with the push-images job and to benefit from the latest security updates and features.

Suggested change
- uses: aws-actions/configure-aws-credentials@v1
+ uses: aws-actions/configure-aws-credentials@v4

@jcabrero jcabrero merged commit aac517d into main Aug 22, 2025
8 checks passed
@jcabrero jcabrero deleted the fix/cicd_pipeline_storage branch August 27, 2025 09:38