pditommaso commented Dec 30, 2025

Summary

This PR introduces a new awsecs executor for running Nextflow tasks on AWS ECS Managed Instances.

Why ECS Managed Instances?

AWS ECS Managed Instances provides fine-grained control over EC2 instance types while offloading infrastructure management to AWS. This offers significant advantages over Fargate for scientific workflows:

Key Benefits

| Capability | ECS Managed Instances | Fargate |
|---|---|---|
| Instance Type Control | ✅ Full control - select specific instance types, GPUs, network-optimized instances | ❌ No control - AWS selects instances |
| GPU Support | ✅ Full GPU access (T4, V100, A10G, etc.) | ❌ Not supported |
| Max vCPUs | Up to instance limits | 16 vCPUs |
| Max Memory | Up to instance limits | 120 GB |
| Ephemeral Storage | 30 GiB - 16 TiB | 21 - 200 GB |
| Privileged Containers | ✅ SYS_ADMIN, NET_ADMIN, BPF | ❌ Limited |
| Cost Model | EC2 pricing + management fee (multi-task per instance) | Per-task pricing |
| Spot Support | ✅ EC2 Spot instances | ✅ Fargate Spot |

Ideal Use Cases

  • GPU-accelerated workloads (ML training, inference)
  • High-memory tasks exceeding Fargate's 120 GB limit
  • Large ephemeral storage needs (up to 16 TiB)
  • Cost optimization via instance type selection and multi-task packing
  • Privileged operations requiring Linux capabilities (e.g., FUSE mounts for Fusion filesystem)

Features

  • Compute Resources: CPUs, memory, GPUs, disk size (30 GiB - 16 TiB), optional EC2 instance type
  • Storage Strategy: AWS S3 work directory via Seqera Fusion filesystem
  • Resilience: Automatic spot instance interruption retry (configurable maxSpotAttempts)
  • Observability: CloudWatch Logs integration for task output
  • FUSE Support: Linux capabilities (SYS_ADMIN) and /dev/fuse device mapping for Fusion filesystem
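To make the FUSE Support item concrete, here is a minimal Python sketch of a container definition carrying the SYS_ADMIN capability and the /dev/fuse device mapping. The field names follow the ECS RegisterTaskDefinition request shape (`linuxParameters`, `capabilities`, `devices`). The helper name `fusion_container_definition`, its parameters, and the chosen device permissions are illustrative assumptions, not the executor's actual code.

```python
def fusion_container_definition(name, image, cpus, memory_mib, fusion_enabled=True):
    """Build a dict shaped like an ECS container definition (hypothetical helper)."""
    definition = {
        "name": name,
        "image": image,
        "cpu": cpus * 1024,    # ECS counts CPU in units, 1024 units per vCPU
        "memory": memory_mib,  # hard memory limit in MiB
        "essential": True,
    }
    if fusion_enabled:
        definition["linuxParameters"] = {
            # Capability needed to mount a FUSE filesystem inside the container
            "capabilities": {"add": ["SYS_ADMIN"]},
            # Expose the host FUSE device; the permission set is an assumption
            "devices": [
                {
                    "hostPath": "/dev/fuse",
                    "containerPath": "/dev/fuse",
                    "permissions": ["read", "write", "mknod"],
                }
            ],
        }
    return definition
```

With `fusion_enabled=False` the `linuxParameters` key is omitted entirely, so the container runs with default capabilities.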

Requirements

Pre-configured Infrastructure

  • ECS cluster with Managed Instances capacity provider
  • IAM execution role for ECS tasks (image pull, CloudWatch logs)
  • VPC with subnets and security groups (auto-discovered from default VPC if not specified)
  • S3 bucket for work directory
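The auto-discovery of subnets and security groups from the default VPC could look roughly like the Python sketch below. It assumes a boto3 EC2 client is passed in as `ec2` and uses the `describe_vpcs`, `describe_subnets`, and `describe_security_groups` calls with their documented filters. The function name, the choice of the VPC's `default` security group, and the error message are assumptions for illustration; the executor itself may discover these differently.

```python
def discover_default_network(ec2):
    """Return (subnet_ids, security_group_ids) for the region's default VPC.

    `ec2` is expected to behave like boto3.client("ec2"). Pagination is
    ignored here for brevity.
    """
    vpcs = ec2.describe_vpcs(
        Filters=[{"Name": "is-default", "Values": ["true"]}]
    )["Vpcs"]
    if not vpcs:
        # Without a default VPC nothing can be inferred; ask for explicit config
        raise RuntimeError(
            "No default VPC found; set aws.ecs.subnets and "
            "aws.ecs.securityGroups explicitly"
        )
    vpc_filter = {"Name": "vpc-id", "Values": [vpcs[0]["VpcId"]]}

    subnets = ec2.describe_subnets(Filters=[vpc_filter])["Subnets"]
    # Assumption: use the security group named "default" in that VPC
    groups = ec2.describe_security_groups(
        Filters=[vpc_filter, {"Name": "group-name", "Values": ["default"]}]
    )["SecurityGroups"]

    return (
        [s["SubnetId"] for s in subnets],
        [g["GroupId"] for g in groups],
    )
```

Taking the client as a parameter keeps the function free of credentials handling and makes it easy to exercise with a stub.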

Configuration

process {
    executor = 'awsecs'
    container = 'ubuntu:latest'
    cpus = 4
    memory = '16 GB'
    disk = '100 GB'
    // accelerator = 1  // GPU support
    // machineType = 'c6i.2xlarge'  // Optional: pin to specific instance type
}

aws {
    region = 'us-east-1'
    ecs {
        cluster = 'my-cluster'  // Required
        executionRole = 'arn:aws:iam::123456789012:role/ecsTaskExecutionRole'  // Required
        // Optional settings with defaults:
        // taskRole = 'arn:aws:iam::...'
        // subnets = ['subnet-...']  // Auto-discovered from default VPC
        // securityGroups = ['sg-...']  // Auto-discovered from default VPC
        // logsGroup = '/aws/ecs/nextflow'
        // assignPublicIp = true
        // maxSpotAttempts = 5
    }
}

fusion.enabled = true
wave.enabled = true
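As a rough illustration of how maxSpotAttempts might bound retries, the Python sketch below resubmits a task while it keeps stopping because of a spot interruption. `run_with_spot_retry` and `submit` are hypothetical names. Treating maxSpotAttempts as the total number of attempts, and detecting interruptions through the ECS `stopCode` value `SpotInterruption`, are both assumptions rather than confirmed executor behavior.

```python
def run_with_spot_retry(submit, max_spot_attempts=5):
    """Run `submit()` until it stops for a reason other than spot reclaim.

    `submit` is a hypothetical callable returning a dict with a "stopCode"
    key. At most `max_spot_attempts` attempts are made in total (assumed
    semantics). Returns the last result and the number of attempts used.
    """
    attempts = 0
    while True:
        result = submit()
        attempts += 1
        interrupted = result.get("stopCode") == "SpotInterruption"
        # Stop on any non-spot outcome, or once the attempt budget is spent
        if not interrupted or attempts >= max_spot_attempts:
            return result, attempts
```

Only spot reclaims consume the retry budget here; any other stop reason is returned to the caller immediately so that ordinary task failures follow the normal error strategy.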

Limitations

  • 14-day lifecycle: ECS Managed Instances drain after 14 days (tasks that run longer than this limit fail with a clear error)
  • Fusion required: S3 work directory via Fusion filesystem is mandatory
  • No custom AMIs: Uses AWS-managed Bottlerocket OS
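One way the 14-day limit could surface as a clear error is an up-front check of the requested run time, sketched below in Python. Whether the executor validates before submission or reports the failure at run time is not stated here, so the function name, the timing of the check, and the error wording are all assumptions.

```python
from datetime import timedelta

# Instances are drained after 14 days, so no task can outlive this window
MAX_INSTANCE_LIFETIME = timedelta(days=14)


def check_time_limit(requested):
    """Raise ValueError if the requested run time exceeds the 14-day lifecycle."""
    if requested > MAX_INSTANCE_LIFETIME:
        raise ValueError(
            f"Requested time {requested} exceeds the 14-day lifecycle "
            "of ECS Managed Instances"
        )
```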

Implementation Status

| Phase | Status | Progress |
|---|---|---|
| Phase 1: Setup | ✅ Complete | 4/4 |
| Phase 2: Foundational | ✅ Complete | 6/6 |
| Phase 3: Basic Task Execution (MVP) | ✅ Complete | 18/18 |
| Phase 4: GPU Support | ⬜ Not Started | 0/5 |
| Phase 5: Custom Storage | 🟡 In Progress | 4/5 |
| Phase 6: Instance Type Selection | ⬜ Not Started | 0/4 |
| Phase 7: Monitoring & Debugging | 🟡 In Progress | 4/15 |
| Phase 8: Polish | 🟡 In Progress | 1/6 |

Overall: 37/63 tasks (59%)

Documentation

  • Feature specification: specs/001-ecs-executor/spec.md
  • Implementation plan: specs/001-ecs-executor/plan.md
  • Task breakdown: specs/001-ecs-executor/tasks.md

🤖 Generated with Claude Code

Add feature specification for new awsecs executor to run Nextflow tasks
on AWS ECS Managed Instances with support for:
- CPU, memory, GPU, and disk resource configuration
- S3/Fusion filesystem integration
- Spot instance retry handling
- CloudWatch Logs integration

Spec includes clarifications for:
- Pre-configured cluster requirement with validation
- Spot retry behavior (similar to aws.batch.maxSpotAttempts)
- 14-day lifecycle limit handling
- Configuration namespace: aws.ecs.*
- Executor name: awsecs

🤖 Generated with [Claude Code](https://claude.ai/code)

Co-Authored-By: Claude <noreply@anthropic.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>

netlify bot commented Dec 30, 2025

Deploy Preview for nextflow-docs-staging canceled.

| Name | Link |
|---|---|
| 🔨 Latest commit | b107f3f |
| 🔍 Latest deploy log | https://app.netlify.com/projects/nextflow-docs-staging/deploys/69547112153e6c0008347061 |

pditommaso and others added 4 commits December 30, 2025 23:29
- plan.md: Implementation phases, architecture design, configuration schema
- research.md: Technical decisions for ECS API patterns, status mapping
- data-model.md: Entity relationships, state diagrams, caching strategy
- quickstart.md: User guide with minimal and full configuration examples
- contracts/ecs-api-contracts.md: AWS ECS API contracts
- spec.md: Added infrastructure config requirements (FR-022 to FR-028)
  - Minimal config: only cluster and executionRole required
  - Auto-discovery for subnets/security groups from default VPC
  - Wave containers dependency for Fusion

🤖 Generated with [Claude Code](https://claude.com/claude-code)

Co-Authored-By: Claude Opus 4.5 <noreply@anthropic.com>
Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>

- Generate comprehensive tasks.md with 63 implementation tasks
- Add error handling tasks (T052-T056) for edge cases
- Add disk type support as low-priority task (T063)
- Update FR-024/FR-025 for default VPC error handling
- Update T013, T014 for EC2 client and VPC auto-discovery

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>

Implement the AWS ECS executor for running Nextflow tasks on Amazon
Elastic Container Service. This executor requires Fusion filesystem
and supports automatic VPC discovery from the default VPC.

Key features:
- Fusion-only mode for S3-based work directories
- Automatic task definition registration with caching
- VPC auto-discovery from default VPC
- Spot interruption detection and handling
- CloudWatch logs integration

New files:
- AwsEcsExecutor: Main executor implementation
- AwsEcsTaskHandler: Task lifecycle management
- AwsEcsConfig: Configuration options
- RegisterTaskDefinitionModel/ContainerDefinitionModel: ECS API models

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>

- Add SYS_ADMIN capability and /dev/fuse device mapping to container
  definitions to enable Fusion filesystem FUSE driver operation
- Update ContainerDefinitionModel with fusionEnabled flag and Linux
  parameters configuration
- Add tests for FUSE driver support in task definitions
- Update spec.md with FR-007a requirement for Linux permissions
- Complete Phase 3 (US1 Basic Tasks) - 18/18 tasks done

Signed-off-by: Paolo Di Tommaso <paolo.ditommaso@gmail.com>
pditommaso changed the title from "Add AWS ECS Managed Instances executor" to "AWS ECS Managed Instances executor [POC]" on Dec 31, 2025