A production-ready AWS SAM solution for processing large numbers of S3 objects in parallel using AWS Step Functions Distributed Map with ECS workers on EC2 instances.
The architecture shows the complete workflow from S3 input through distributed processing to output:
- S3 Input: Objects stored in the `input/` folder trigger processing
- Step Functions: Orchestrates the entire workflow with a distributed map
- ECS Workers: Dynamically scaled EC2 instances process objects in parallel
- Activity Pattern: Workers poll a Step Functions Activity for tasks
- S3 Output: Processed objects are stored in the `processed/` folder
The solution is built around a Step Functions state machine that orchestrates the entire processing pipeline through three states:
**Provision Workers**
- Type: Lambda Task
- Purpose: Dynamically provisions ECS workers
- Input: `worker_count` parameter
- Function: Calls the ECS Provisioner Lambda to:
  - Scale the Auto Scaling Group to the desired worker count
  - Wait for the EC2 instances to be ready
  - Start ECS tasks on the instances
- Output: Provisioning result with worker details
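A minimal sketch of what this provisioning step does, assuming boto3 and illustrative resource names (the real implementation lives in `functions/ecs_provisioner/`; the ASG, cluster, and task definition names below are placeholders, not taken from `template.yaml`):

```python
import time
import boto3

autoscaling = boto3.client("autoscaling")
ecs = boto3.client("ecs")

ASG_NAME = "s3-batch-processor-asg"          # placeholder name
CLUSTER = "s3-batch-processor-cluster"       # placeholder name
TASK_DEFINITION = "s3-batch-processor-task"  # placeholder task definition family


def provision_workers(worker_count: int) -> dict:
    """Scale the ASG to worker_count and start one ECS task per instance."""
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=ASG_NAME,
        DesiredCapacity=worker_count,
        HonorCooldown=False,
    )

    # Poll until enough container instances have registered with the cluster.
    while len(ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]) < worker_count:
        time.sleep(15)

    # run_task starts at most 10 tasks per call, so launch the workers in chunks.
    task_arns = []
    remaining = worker_count
    while remaining > 0:
        batch = min(remaining, 10)
        resp = ecs.run_task(
            cluster=CLUSTER,
            taskDefinition=TASK_DEFINITION,
            count=batch,
            launchType="EC2",
        )
        task_arns.extend(t["taskArn"] for t in resp["tasks"])
        remaining -= batch

    return {"worker_count": worker_count, "task_arns": task_arns}
```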
**Process S3 Objects (Distributed Map)**
- Type: Distributed Map
- Purpose: Processes all S3 objects in parallel
- Configuration:
  - `Mode: DISTRIBUTED` - uses Step Functions Distributed Map for high concurrency
  - `ExecutionType: STANDARD` - full Step Functions features
  - `MaxConcurrency: 10` - limits parallel executions
  - `ToleratedFailurePercentage: 10` - allows up to 10% of items to fail
- Item Processor: Each S3 object becomes a separate child execution
- Task Resource: Step Functions Activity (polling-based)
- Retry Logic: 3 attempts with exponential backoff
- Timeout: 300 seconds per object
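The Activity pattern means each worker long-polls Step Functions for items instead of being invoked directly. A minimal sketch of that loop, assuming boto3 and a placeholder activity ARN (the production worker is `application/processor.py`):

```python
import json
import boto3

sfn = boto3.client("stepfunctions")
ACTIVITY_ARN = "arn:aws:states:REGION:ACCOUNT:activity/s3-batch-processor"  # placeholder


def poll_for_tasks(worker_id: str, process_s3_object) -> None:
    """Long-poll the Activity; each task input describes one S3 object."""
    while True:
        # get_activity_task long-polls for up to 60 seconds and returns an
        # empty taskToken when no work is currently available.
        task = sfn.get_activity_task(activityArn=ACTIVITY_ARN, workerName=worker_id)
        token = task.get("taskToken")
        if not token:
            continue

        item = json.loads(task["input"])  # e.g. {"Key": "input/file1.txt"}
        try:
            result = process_s3_object(item["Key"])
            sfn.send_task_success(taskToken=token, output=json.dumps(result))
        except Exception as exc:
            # Reporting failure lets the Distributed Map apply its retry policy.
            sfn.send_task_failure(taskToken=token, error="ProcessingError", cause=str(exc))
```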
**Deprovision Workers**
- Type: Lambda Task
- Purpose: Cleans up resources
- Trigger: Always runs (on success or failure, via a Catch block)
- Function: Calls the ECS Provisioner Lambda to:
  - Stop ECS tasks
  - Scale the Auto Scaling Group to 0
  - Clean up remaining resources
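The cleanup path can be sketched the same way, using the placeholder names from the provisioning example above:

```python
import boto3

autoscaling = boto3.client("autoscaling")
ecs = boto3.client("ecs")

ASG_NAME = "s3-batch-processor-asg"     # placeholder name
CLUSTER = "s3-batch-processor-cluster"  # placeholder name


def deprovision_workers() -> None:
    """Stop any running worker tasks, then scale the fleet back to zero."""
    for task_arn in ecs.list_tasks(cluster=CLUSTER)["taskArns"]:
        ecs.stop_task(cluster=CLUSTER, task=task_arn, reason="Batch run finished")

    autoscaling.set_desired_capacity(
        AutoScalingGroupName=ASG_NAME,
        DesiredCapacity=0,
        HonorCooldown=False,
    )
```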
The state machine expects an execution input of the form:

```json
{
  "objects": [
    {"Key": "input/file1.txt"},
    {"Key": "input/file2.txt"},
    {"Key": "input/file3.txt"}
  ],
  "worker_count": 5
}
```

- Retry Logic: Failed object processing is retried 3 times with exponential backoff
- Fault Tolerance: Up to 10% of objects can fail without stopping the workflow
- Guaranteed Cleanup: Deprovisioning always runs via Catch block
- Timeout Protection: 300-second timeout prevents stuck tasks
- Activity Pattern: Workers poll for tasks, enabling dynamic scaling
- Start: Workflow receives list of S3 objects and worker count
- Provision: Lambda provisions exact number of ECS workers needed
- Distribute: Distributed Map creates one execution per S3 object
- Process: Workers poll Activity for tasks and process objects in parallel
- Monitor: Step Functions tracks progress and handles failures
- Cleanup: Resources are deprovisioned regardless of success/failure
This architecture enables processing thousands of S3 objects with precise resource control and cost optimization.
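For reference, an execution with the input shape shown above could also be started directly from Python rather than through the provided scripts; this is a sketch assuming boto3 and a placeholder state machine ARN:

```python
import json
import boto3

sfn = boto3.client("stepfunctions")
STATE_MACHINE_ARN = "arn:aws:states:REGION:ACCOUNT:stateMachine:s3-batch-processor"  # placeholder

execution_input = {
    "objects": [
        {"Key": "input/file1.txt"},
        {"Key": "input/file2.txt"},
        {"Key": "input/file3.txt"},
    ],
    "worker_count": 5,
}

response = sfn.start_execution(
    stateMachineArn=STATE_MACHINE_ARN,
    input=json.dumps(execution_input),
)
print("Started execution:", response["executionArn"])
```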
- Dynamic Scaling: Worker count specified at execution time (1-100 workers)
- Fast: Parallel processing with the distributed map pattern
- Cost-Effective: Dynamic scaling; pay only while processing
- Reliable: Built-in retry logic and error handling
- Observable: Comprehensive structured JSON logging
- Containerized: Uses a Docker image built from the application code
- Portable: Dynamic VPC discovery works across accounts and regions
- Flexible: Easy to adapt to any workload size
- AWS SAM CLI installed and configured
- AWS CLI configured with appropriate region
- Docker installed and running
- AWS Permissions for ECS, Step Functions, S3, ECR, CloudFormation, Auto Scaling
- jq for JSON processing
This solution uses AWS SAM (Serverless Application Model) for infrastructure deployment and management.
```bash
# Deploy with default settings (ASG max size: 100)
./deploy.sh

# Or customize the deployment with an instance type
./deploy.sh 5 m5.large   # still works, but the worker count is now dynamic
```

The deployment script uses `sam build` and `sam deploy` to provision all AWS resources defined in the SAM template.
```bash
# Run the end-to-end test with the default 3 workers
./test.sh

# Run the test with a custom worker count
./test.sh 10

# Execute with a specific worker count
./execute.sh 5    # 5 workers
./execute.sh 20   # 20 workers
./execute.sh 50   # 50 workers

# Execute with a custom S3 prefix
./execute.sh 10 data/   # 10 workers, process the 'data/' prefix
```

Script differences:
- `test.sh`: complete end-to-end test - generates files, monitors the execution, and verifies results
- `execute.sh`: quick execution launcher - uses existing S3 files, starts the workflow, and exits
```bash
# Remove all resources
./cleanup.sh
```

```
├── README.md                      # This guide
├── template.yaml                  # AWS SAM template with dynamic scaling (max 100)
├── deploy.sh                      # SAM deployment script
├── test.sh                        # End-to-end test with dynamic worker count
├── execute.sh                     # Simple execution script for any worker count
├── cleanup.sh                     # Resource cleanup
├── build-and-push.sh              # Docker build/push script
├── application/                   # Application code
│   ├── processor.py               # Worker with activity polling & structured logging
│   ├── requirements.txt           # Python dependencies
│   └── Dockerfile                 # Container definition
├── functions/                     # Lambda functions
│   └── ecs_provisioner/           # Dynamic ECS provisioning logic
└── statemachine/                  # Step Functions workflow
    └── workflow-complete.asl.json # Accepts dynamic worker_count input
```
Example execution input:

```json
{
  "objects": [
    {"Key": "input/file1.txt"},
    {"Key": "input/file2.txt"}
  ],
  "worker_count": 10
}
```

Instance type options:
- t3.medium - Cost-effective, light workloads
- c5.large - CPU-intensive processing (default)
- m5.large - Balanced CPU/memory
- m5.xlarge - Memory-intensive workloads
| Workers | Instance Type | Objects | Processing Time | Throughput |
|---|---|---|---|---|
| 3 | c5.large | 50 | ~4 minutes | 750 obj/hr |
| 5 | c5.large | 50 | ~2.5 minutes | 1200 obj/hr |
| 10 | c5.large | 50 | ~1.5 minutes | 2000 obj/hr |
| 20 | m5.large | 100 | ~1.5 minutes | 4000 obj/hr |
```bash
# View worker logs
aws logs describe-log-streams \
  --log-group-name "/ecs/s3-batch-processor-s3-batch-processor" \
  --region ap-east-1
```

Key log messages:
- `"Processing task"` - task received from the Step Functions Activity
- `"Processing S3 object"` - object processing started
- `"S3 object processed successfully"` - processing completed
- `"Task completed"` - task finished, with the processed-object count
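These messages are emitted as structured JSON, one object per line, so they can be filtered by field in CloudWatch Logs Insights. A minimal sketch of the formatting (the actual format is defined in `application/processor.py`):

```python
import json
import logging

logger = logging.getLogger("processor")
logging.basicConfig(level=logging.INFO, format="%(message)s")


def log_event(message: str, **fields) -> None:
    """Write one JSON object per log line."""
    logger.info(json.dumps({"message": message, **fields}))


log_event("Processing S3 object", object_key="input/file1.txt", worker_id="worker-1")
log_event("S3 object processed successfully", object_key="input/file1.txt", processing_time=5.0)
```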
The Step Functions console additionally provides:
- Visual workflow execution tracking
- Distributed map performance metrics
- Error details and retry information
The core processing happens in `application/processor.py`. Modify the `process_s3_object()` method, shown here in simplified form:
```python
def process_s3_object(self, object_key: str, bucket: str = None) -> Dict[str, Any]:
    # Your custom processing logic here, for example:
    # - Image processing: resize, filter, analyze
    # - Data transformation: parse, validate, enrich
    # - ML inference: classify, predict, score
    # - File conversion: PDF to text, format conversion

    # Current implementation: 5-second processing simulation
    start = time.time()
    time.sleep(5)

    # Simplified stand-ins for values the full method derives from the object
    processed_key = object_key.replace('input/', 'processed/', 1)
    content = f'Processed {object_key}'
    processing_time = time.time() - start

    # Return a structured result
    return {
        'object_key': object_key,
        'processed_key': processed_key,
        'content': content,
        'processing_time': processing_time,
        'processed_at': datetime.utcnow().isoformat(),
        'worker_id': self.worker_id,
        'status': 'success'
    }
```

- ✅ IAM Roles: Least privilege access managed by SAM
- ✅ Dynamic VPC: Uses the default VPC, no hardcoded values
- ✅ Container Security: Non-root user execution
- ✅ S3 Access: Scoped to specific bucket/prefixes
- ✅ SAM Security: Infrastructure as Code with version control
- Zero cost when idle: ASG scales to 0 when no processing
- Dynamic worker count: Scale exactly to your workload needs
- Efficient processing: 5-second processing time per object
- Automatic cleanup: Infrastructure scales down after completion
- Small workloads (< 50 objects): 2-5 workers
- Medium workloads (50-500 objects): 5-20 workers
- Large workloads (500+ objects): 20-100 workers
- Maximum capacity: 100 workers (configurable in template.yaml)
- ✅ Dynamic worker count: 1-100 workers tested and working
- ✅ 1:1:1 ratio: 1 worker = 1 EC2 instance = 1 ECS task
- ✅ Dynamic VPC discovery: Portable across accounts
- ✅ Proper logging: Structured JSON logs with processing details
- ✅ Complete lifecycle: Provision → Process → Deprovision
- ✅ AWS SAM deployment: Infrastructure as Code with repeatable deployments
```bash
# Build the application
sam build

# Deploy with guided prompts
sam deploy --guided

# Deploy with a custom max worker count (default: 100)
sam deploy --parameter-overrides MaxWorkers=200

# View stack outputs
sam list stack-outputs

# Delete the stack
sam delete
```

```bash
# Process 10 files with 3 workers
./execute.sh 3

# Process 100 files with 25 workers for faster throughput
./execute.sh 25

# Process files from the 'data/' prefix with 10 workers
./execute.sh 10 data/

# Run a comprehensive test with 5 workers
./test.sh 5

# Multiple concurrent executions for load testing
./execute.sh 5 && ./execute.sh 8 && ./execute.sh 12
```

Ready to process thousands of S3 objects efficiently with dynamic AWS SAM scaling!

