S3 Batch Processing with Step Functions Distributed Map + ECS

A production-ready AWS SAM solution for processing large numbers of S3 objects in parallel using AWS Step Functions Distributed Map with ECS workers on EC2 instances.

πŸ—οΈ Architecture

(Diagram: S3 Batch Processing architecture)

The architecture shows the complete workflow from S3 input through distributed processing to output:

  • S3 Input: Objects stored in input/ folder trigger processing
  • Step Functions: Orchestrates the entire workflow with distributed map
  • ECS Workers: Dynamically scaled EC2 instances process objects in parallel
  • Activity Pattern: Workers poll Step Functions Activity for tasks
  • S3 Output: Processed objects stored in processed/ folder

πŸ”„ Step Functions Workflow

The solution uses a Step Functions state machine that orchestrates the entire processing pipeline:

Workflow States

(Diagram: Step Functions workflow states)

State Details

1. ProvisionECS State

  • Type: Lambda Task
  • Purpose: Dynamically provisions ECS workers
  • Input: worker_count parameter
  • Function: Calls ECS Provisioner Lambda to:
    • Scale Auto Scaling Group to desired worker count
    • Wait for EC2 instances to be ready
    • Start ECS tasks on the instances
  • Output: Provisioning result with worker details
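
The provisioner itself lives in functions/ecs_provisioner/ and is not reproduced here, but a minimal sketch of the steps listed above might look like the following (ASG_NAME, CLUSTER, and TASK_DEF are illustrative placeholders, not values from the SAM template):

import time
import boto3

autoscaling = boto3.client("autoscaling")
ecs = boto3.client("ecs")

# Placeholder names for illustration only
ASG_NAME = "s3-batch-worker-asg"
CLUSTER = "s3-batch-cluster"
TASK_DEF = "s3-batch-processor"

def provision_workers(worker_count: int) -> dict:
    # Scale the Auto Scaling Group to the requested number of workers
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=ASG_NAME,
        DesiredCapacity=worker_count,
        HonorCooldown=False,
    )

    # Wait until the EC2 instances have registered with the ECS cluster
    while len(ecs.list_container_instances(cluster=CLUSTER)["containerInstanceArns"]) < worker_count:
        time.sleep(15)

    # Start one worker task per instance (the 1:1:1 worker/instance/task ratio);
    # RunTask accepts at most 10 tasks per call, so a real implementation would batch
    tasks = ecs.run_task(
        cluster=CLUSTER,
        taskDefinition=TASK_DEF,
        count=worker_count,
        launchType="EC2",
    )
    return {"worker_count": worker_count, "tasks": [t["taskArn"] for t in tasks["tasks"]]}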

2. ProcessObjects State

  • Type: Distributed Map
  • Purpose: Processes all S3 objects in parallel
  • Configuration:
    • Mode: DISTRIBUTED - Uses Step Functions Distributed Map for high concurrency
    • ExecutionType: STANDARD - Full Step Functions features
    • MaxConcurrency: 10 - Limits parallel executions
    • ToleratedFailurePercentage: 10 - Allows up to 10% of items to fail without failing the whole run
  • Item Processor: Each S3 object becomes a separate execution
  • Task Resource: Step Functions Activity (polling-based)
  • Retry Logic: 3 attempts with exponential backoff
  • Timeout: 300 seconds per object

3. DeprovisionECS State

  • Type: Lambda Task
  • Purpose: Clean up resources
  • Trigger: Always runs (success or failure via Catch block)
  • Function: Calls ECS Provisioner Lambda to:
    • Stop ECS tasks
    • Scale Auto Scaling Group to 0
    • Clean up resources
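
A matching deprovisioning sketch, using the same illustrative names as the provisioning sketch above (the actual Lambda may differ; this only shows the two cleanup calls described here):

def deprovision_workers() -> dict:
    # Stop any running worker tasks
    task_arns = ecs.list_tasks(cluster=CLUSTER)["taskArns"]
    for arn in task_arns:
        ecs.stop_task(cluster=CLUSTER, task=arn, reason="Batch run finished")

    # Scale the Auto Scaling Group back to zero so nothing is billed while idle
    autoscaling.set_desired_capacity(
        AutoScalingGroupName=ASG_NAME,
        DesiredCapacity=0,
        HonorCooldown=False,
    )
    return {"stopped_tasks": len(task_arns)}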

Workflow Input Format

{
  "objects": [
    {"Key": "input/file1.txt"},
    {"Key": "input/file2.txt"},
    {"Key": "input/file3.txt"}
  ],
  "worker_count": 5
}
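
Outside of the provided test.sh/execute.sh scripts, this input can be assembled programmatically, for example with boto3 (the bucket name and state machine ARN below are placeholders):

import json
import boto3

s3 = boto3.client("s3")
sfn = boto3.client("stepfunctions")

BUCKET = "my-batch-bucket"                                  # placeholder
STATE_MACHINE_ARN = "arn:aws:states:...:stateMachine:..."   # placeholder

# Collect every object under the input/ prefix
objects = []
paginator = s3.get_paginator("list_objects_v2")
for page in paginator.paginate(Bucket=BUCKET, Prefix="input/"):
    objects.extend({"Key": obj["Key"]} for obj in page.get("Contents", []))

# Start the workflow with the input format shown above
sfn.start_execution(
    stateMachineArn=STATE_MACHINE_ARN,
    input=json.dumps({"objects": objects, "worker_count": 5}),
)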

Error Handling & Resilience

  • Retry Logic: Failed object processing retries 3x with exponential backoff
  • Fault Tolerance: Up to 10% of objects can fail without stopping the workflow
  • Guaranteed Cleanup: Deprovisioning always runs via Catch block
  • Timeout Protection: 300-second timeout prevents stuck tasks
  • Activity Pattern: Workers poll for tasks, enabling dynamic scaling
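
The activity pattern in application/processor.py comes down to a poll/acknowledge loop against the Step Functions Activity. A simplified sketch (the activity ARN is a placeholder, error handling is abbreviated, and process_s3_object stands in for the method described under Customizing Processing Logic below):

import json
import boto3
from botocore.config import Config

# get_activity_task long-polls for up to 60 seconds, so raise the read timeout
sfn = boto3.client("stepfunctions", config=Config(read_timeout=70))

ACTIVITY_ARN = "arn:aws:states:...:activity:..."  # placeholder

def poll_forever(worker_id: str):
    while True:
        task = sfn.get_activity_task(activityArn=ACTIVITY_ARN, workerName=worker_id)
        token = task.get("taskToken")
        if not token:
            continue  # no work arrived within the poll window; poll again

        try:
            item = json.loads(task["input"])          # one S3 object from the Distributed Map
            sfn.send_task_heartbeat(taskToken=token)  # keep the task alive while processing
            result = process_s3_object(item["Key"])
            sfn.send_task_success(taskToken=token, output=json.dumps(result))
        except Exception as exc:
            # Reported failures feed the state's Retry and ToleratedFailurePercentage logic
            sfn.send_task_failure(taskToken=token, error="ProcessingError", cause=str(exc))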

Execution Flow

  1. Start: Workflow receives list of S3 objects and worker count
  2. Provision: Lambda provisions exact number of ECS workers needed
  3. Distribute: Distributed Map creates one execution per S3 object
  4. Process: Workers poll Activity for tasks and process objects in parallel
  5. Monitor: Step Functions tracks progress and handles failures
  6. Cleanup: Resources are deprovisioned regardless of success/failure

This architecture enables processing thousands of S3 objects with precise resource control and cost optimization.

✨ Key Features

  • πŸš€ Dynamic Scaling: Worker count specified at execution time (1-100 workers)
  • ⚑ Fast: Parallel processing with distributed map pattern
  • πŸ’° Cost-Effective: Dynamic scaling, pay only when processing
  • πŸ”„ Reliable: Built-in retry logic and error handling
  • πŸ“Š Observable: Comprehensive structured JSON logging
  • 🐳 Containerized: Uses Docker image from application code
  • 🌐 Portable: Dynamic VPC discovery, works across accounts/regions
  • 🎯 Flexible: Easy adaptation to any workload size

πŸš€ Quick Start

πŸ“‹ Prerequisites

  • AWS SAM CLI installed and configured
  • AWS CLI configured with appropriate region
  • Docker installed and running
  • AWS Permissions for ECS, Step Functions, S3, ECR, CloudFormation, Auto Scaling
  • jq for JSON processing

πŸ”§ Deploy

This solution uses AWS SAM (Serverless Application Model) for infrastructure deployment and management.

# Deploy with default settings (ASG max size: 100)
./deploy.sh

# Or customize deployment with instance type
./deploy.sh 5 m5.large    # Still works, but worker count is now dynamic

The deployment script uses sam build and sam deploy commands to provision all AWS resources defined in the SAM template.

πŸ§ͺ Test & Execute

# Run end-to-end test with default 3 workers
./test.sh

# Run test with custom worker count
./test.sh 10

# Execute with specific worker count
./execute.sh 5          # 5 workers
./execute.sh 20         # 20 workers
./execute.sh 50         # 50 workers

# Execute with custom S3 prefix
./execute.sh 10 data/   # 10 workers, process 'data/' prefix

Script Differences:

  • test.sh: Complete end-to-end test - generates files, monitors execution, verifies results
  • execute.sh: Quick execution launcher - uses existing S3 files, starts workflow and exits

🧹 Cleanup

# Remove all resources
./cleanup.sh

πŸ“ Project Structure

β”œβ”€β”€ README.md                    # This guide
β”œβ”€β”€ template.yaml               # AWS SAM template with dynamic scaling (max 100)
β”œβ”€β”€ deploy.sh                   # SAM deployment script
β”œβ”€β”€ test.sh                     # End-to-end test with dynamic worker count
β”œβ”€β”€ execute.sh                  # Simple execution script for any worker count
β”œβ”€β”€ cleanup.sh                  # Resource cleanup
β”œβ”€β”€ build-and-push.sh          # Docker build/push script
β”œβ”€β”€ application/                # Application code
β”‚   β”œβ”€β”€ processor.py           # Worker with activity polling & structured logging
β”‚   β”œβ”€β”€ requirements.txt       # Python dependencies
β”‚   └── Dockerfile            # Container definition
β”œβ”€β”€ functions/                  # Lambda functions
β”‚   └── ecs_provisioner/      # Dynamic ECS provisioning logic
└── statemachine/              # Step Functions workflow
    └── workflow-complete.asl.json  # Accepts dynamic worker_count input

βš™οΈ Dynamic Worker Configuration

🎯 Execution Input Format

{
  "objects": [
    {"Key": "input/file1.txt"},
    {"Key": "input/file2.txt"}
  ],
  "worker_count": 10
}

πŸ–₯️ Instance Types

  • t3.medium - Cost-effective, light workloads
  • c5.large - CPU-intensive processing (default)
  • m5.large - Balanced CPU/memory
  • m5.xlarge - Memory-intensive workloads

πŸ“Š Expected Performance

Workers | Instance Type | Objects | Processing Time | Throughput
3       | c5.large      | 50      | ~4 minutes      | 750 obj/hr
5       | c5.large      | 50      | ~2.5 minutes    | 1200 obj/hr
10      | c5.large      | 50      | ~1.5 minutes    | 2000 obj/hr
20      | m5.large      | 100     | ~1.5 minutes    | 4000 obj/hr
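
As a rough sanity check on these figures (assuming the default 5-second simulated processing time per object): 50 objects across 3 workers is about 17 objects per worker, or roughly 85 seconds of pure processing, so most of the ~4-minute wall-clock time in the first row is provisioning and task-scheduling overhead rather than processing itself.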

πŸ” Monitoring

πŸ“‹ CloudWatch Logs

# View worker logs
aws logs describe-log-streams \
  --log-group-name "/ecs/s3-batch-processor-s3-batch-processor" \
  --region ap-east-1
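
If you prefer to query from code, the same log group can be searched with boto3 (the log group name is taken from the command above; the filter pattern matches one of the structured log messages listed below):

import boto3

logs = boto3.client("logs", region_name="ap-east-1")

# Find recent successful-processing events in the worker log group
response = logs.filter_log_events(
    logGroupName="/ecs/s3-batch-processor-s3-batch-processor",
    filterPattern='"S3 object processed successfully"',
)
for event in response["events"]:
    print(event["timestamp"], event["message"])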

πŸ“Š Key Log Messages

  • "Processing task" - Task received from Step Functions Activity
  • "Processing S3 object" - Object processing started
  • "S3 object processed successfully" - Processing completed
  • "Task completed" - Task finished with count

🎯 Step Functions Console

  • Visual workflow execution tracking
  • Distributed map performance metrics
  • Error details and retry information

🎨 Customizing Processing Logic

The core processing happens in application/processor.py. Modify the process_s3_object() method:

def process_s3_object(self, object_key: str, bucket: str = None) -> Dict[str, Any]:
    # Your custom processing logic here
    # - Image processing: resize, filter, analyze
    # - Data transformation: parse, validate, enrich
    # - ML inference: classify, predict, score
    # - File conversion: PDF to text, format conversion

    # Current implementation: 5-second processing simulation
    time.sleep(5)

    # Return structured result (processed_key, content, and processing_time
    # are set earlier in the full method body, omitted from this excerpt)
    return {
        'object_key': object_key,
        'processed_key': processed_key,
        'content': content,
        'processing_time': processing_time,
        'processed_at': datetime.utcnow().isoformat(),
        'worker_id': self.worker_id,
        'status': 'success'
    }
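
For example, a minimal replacement of the sleep with a real transformation (uppercase the file contents and write the result under the processed/ prefix, as described in the architecture overview) might look like the sketch below. It assumes the worker keeps an S3 client and bucket name on self.s3 and self.bucket, which are illustrative assumptions rather than the repo's exact attribute names:

import time
from datetime import datetime
from typing import Any, Dict

def process_s3_object(self, object_key: str, bucket: str = None) -> Dict[str, Any]:
    start = time.time()
    bucket = bucket or self.bucket                 # assumed attribute holding the bucket name
    processed_key = object_key.replace("input/", "processed/", 1)

    # Read the object, apply a trivial transformation, and write the result back
    body = self.s3.get_object(Bucket=bucket, Key=object_key)["Body"].read()
    content = body.decode("utf-8").upper()
    self.s3.put_object(Bucket=bucket, Key=processed_key, Body=content.encode("utf-8"))

    return {
        'object_key': object_key,
        'processed_key': processed_key,
        'content': content[:200],                  # truncate to keep task output small
        'processing_time': round(time.time() - start, 3),
        'processed_at': datetime.utcnow().isoformat(),
        'worker_id': self.worker_id,
        'status': 'success'
    }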

πŸ”’ Security Features

  • βœ… IAM Roles: Least privilege access managed by SAM
  • βœ… Dynamic VPC: Uses default VPC, no hardcoded values
  • βœ… Container Security: Non-root user execution
  • βœ… S3 Access: Scoped to specific bucket/prefixes
  • βœ… SAM Security: Infrastructure as Code with version control

🎯 Production Considerations

πŸ’° Cost Optimization

  • Zero cost when idle: ASG scales to 0 when no processing
  • Dynamic worker count: Scale exactly to your workload needs
  • Efficient processing: the sample workload simulates 5 seconds of processing per object
  • Automatic cleanup: Infrastructure scales down after completion

πŸ“ˆ Scaling Guidelines

  • Small workloads (< 50 objects): 2-5 workers
  • Medium workloads (50-500 objects): 5-20 workers
  • Large workloads (500+ objects): 20-100 workers
  • Maximum capacity: 100 workers (configurable in template.yaml)

πŸ”§ Tested Configuration

  • βœ… Dynamic worker count: 1-100 workers tested and working
  • βœ… 1:1:1 ratio: 1 worker = 1 EC2 instance = 1 ECS task
  • βœ… Dynamic VPC discovery: Portable across accounts
  • βœ… Proper logging: Structured JSON logs with processing details
  • βœ… Complete lifecycle: Provision β†’ Process β†’ Deprovision
  • βœ… AWS SAM deployment: Infrastructure as Code with repeatable deployments

πŸ› οΈ SAM Commands

# Build the application
sam build

# Deploy with guided prompts
sam deploy --guided

# Deploy with custom max workers (default: 100)
sam deploy --parameter-overrides MaxWorkers=200

# View stack outputs
sam list stack-outputs

# Delete the stack
sam delete

πŸš€ Usage Examples

# Process 10 files with 3 workers
./execute.sh 3

# Process 100 files with 25 workers for faster throughput
./execute.sh 25

# Process files from 'data/' prefix with 10 workers
./execute.sh 10 data/

# Run comprehensive test with 5 workers
./test.sh 5

# Multiple concurrent executions for load testing
./execute.sh 5 && ./execute.sh 8 && ./execute.sh 12

πŸŽ‰ Ready to process thousands of S3 objects efficiently with dynamic AWS SAM scaling!
