feat: add ray orchestrator module (#253)
* add ray orchestrator module

* add aws-auth identity mapping

* add modulestack

* add test cases

* yamllint

* changelog

* add training script via config map

* update manifests

* update docs

* minor housekeeping

* add sfn timeout config

* docs

* update screenshots

* proofreading

* pr feedback
kukushking authored Oct 23, 2024
1 parent edc66f1 commit 0240463
Showing 25 changed files with 1,394 additions and 2 deletions.
3 changes: 3 additions & 0 deletions CHANGELOG.md
Original file line number Diff line number Diff line change
@@ -8,7 +8,10 @@ and this project adheres to [Semantic Versioning](https://semver.org/spec/v2.0.0
## UNRELEASED

### **Added**

- added `ray-orchestrator` module
- added GitHub as alternate option for code repository support along with AWS CodeCommit for sagemaker-templates-service-catalog module

### **Changed**
- updated manifests to idf release 1.12.0

2 changes: 2 additions & 0 deletions manifests/fine-tuning-6b/deployment.yaml
@@ -14,6 +14,8 @@ groups:
    path: manifests/fine-tuning-6b/ray-operator-modules.yaml
  - name: ray-cluster
    path: manifests/fine-tuning-6b/ray-cluster-modules.yaml
  - name: ray-orchestrator
    path: manifests/fine-tuning-6b/ray-orchestrator-modules.yaml
targetAccountMappings:
  - alias: primary
    accountId:
48 changes: 48 additions & 0 deletions manifests/fine-tuning-6b/ray-orchestrator-modules.yaml
@@ -0,0 +1,48 @@
name: ray-orchestrator
path: modules/eks/ray-orchestrator
parameters:
  - name: Namespace
    valueFrom:
      parameterValue: rayNamespaceName
  - name: EksClusterAdminRoleArn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterMasterRoleArn
  - name: EksHandlerRoleArn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksHandlerRoleArn
  - name: EksClusterName
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterName
  - name: EksClusterEndpoint
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterEndpoint
  - name: EksOidcArn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksOidcArn
  - name: EksCertAuthData
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterCertAuthData
  - name: DataBucketName
    valueFrom:
      moduleMetadata:
        group: base
        name: buckets
        key: ArtifactsBucketName
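
As a rough illustration of how the two kinds of `valueFrom` reference in this manifest are meant to resolve, the sketch below models them in plain Python. It is not Seed-Farmer's implementation: the function name `resolve_parameters`, the shape of the lookup tables, and every value in the sample data are hypothetical placeholders chosen for illustration.

```python
def resolve_parameters(parameters, module_metadata, deployment_values):
    """Resolve manifest parameters to concrete values (illustrative model only).

    parameters:        list of {"name": ..., "valueFrom": {...}} entries
    module_metadata:   {(group, module_name): {output_key: value}}  (hypothetical shape)
    deployment_values: {parameter_name: value}                      (hypothetical shape)
    """
    resolved = {}
    for param in parameters:
        source = param["valueFrom"]
        if "moduleMetadata" in source:
            # Value comes from a metadata output of another deployed module.
            ref = source["moduleMetadata"]
            resolved[param["name"]] = module_metadata[(ref["group"], ref["name"])][ref["key"]]
        elif "parameterValue" in source:
            # Value comes from a named deployment-level parameter.
            resolved[param["name"]] = deployment_values[source["parameterValue"]]
        else:
            raise ValueError(f"unsupported valueFrom for {param['name']!r}")
    return resolved


# Two entries copied from the manifest; the values below are placeholders.
sample_parameters = [
    {"name": "Namespace", "valueFrom": {"parameterValue": "rayNamespaceName"}},
    {
        "name": "EksClusterName",
        "valueFrom": {
            "moduleMetadata": {"group": "core", "name": "eks", "key": "EksClusterName"}
        },
    },
]
sample_metadata = {("core", "eks"): {"EksClusterName": "example-cluster"}}
sample_deployment_values = {"rayNamespaceName": "ray"}

resolved = resolve_parameters(sample_parameters, sample_metadata, sample_deployment_values)
print(resolved)
```

Note that the manifest key and the parameter name can differ: `EksClusterAdminRoleArn` is filled from the `EksClusterMasterRoleArn` output of the `eks` module.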
39 changes: 37 additions & 2 deletions manifests/ray-on-eks/README.md
@@ -11,12 +11,13 @@ file system. Additionally, a custom Ray container image is supported.

### Architecture

![Ray on Amazon EKS Architecture](docs/ray-on-eks-architecture.jpg "Ray on Amazon EKS Architecture")
![Ray on Amazon EKS Architecture](docs/ray-on-eks-architecture.png "Ray on Amazon EKS Architecture")

### Modules Inventory

- [Ray Operator Module](modules/eks/ray-operator/README.md)
- [Ray Cluster Module](modules/eks/ray-cluster/README.md)
- [Ray Orchestrator Module](modules/eks/ray-orchestrator/README.md)
- [Ray Image Module](modules/eks/ray-image/README.md)
- [EKS Module](https://github.com/awslabs/idf-modules/tree/main/modules/compute/eks)
- [FSx for Lustre Module](https://github.com/awslabs/idf-modules/tree/main/modules/storage/fsx-lustre)
@@ -31,7 +32,41 @@ For deployment instructions, please refer to [DEPLOYMENT.MD](https://github.com/

## User Guide

### Submitting Jobs
### Submitting Jobs using AWS Step Functions

1. Navigate to AWS Step Functions and find the state machine whose name starts with `TrainingOnEks`.
2. Start a new execution of that state machine.

![Step Function Execution](docs/step-function.png "Step Function Execution")
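
The same two console steps can also be scripted. The following is a minimal sketch, assuming `boto3` is installed and credentials are configured; it uses the `list_state_machines` paginator and `start_execution` from the Step Functions client. The helper names are hypothetical, and passing an empty JSON object as the execution input is an assumption, since the input the state machine expects is not documented here.

```python
import json


def find_state_machine_arn(sfn_client, prefix="TrainingOnEks"):
    """Return the ARN of the first state machine whose name starts with prefix."""
    paginator = sfn_client.get_paginator("list_state_machines")
    for page in paginator.paginate():
        for machine in page["stateMachines"]:
            if machine["name"].startswith(prefix):
                return machine["stateMachineArn"]
    raise LookupError(f"no state machine found with prefix {prefix!r}")


def start_training(sfn_client, prefix="TrainingOnEks", payload=None):
    """Start one execution and return its execution ARN.

    Assumption: the state machine accepts an empty JSON object as input.
    """
    arn = find_state_machine_arn(sfn_client, prefix)
    response = sfn_client.start_execution(
        stateMachineArn=arn,
        input=json.dumps(payload or {}),
    )
    return response["executionArn"]


# Usage (requires boto3 and AWS credentials; the region is an example):
#   import boto3
#   client = boto3.client("stepfunctions", region_name="us-east-1")
#   print(start_training(client))
```

Taking the client as an argument keeps the helpers easy to exercise with a stub before pointing them at a real account.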

To observe the progress of the job using the Ray Dashboard:

1. Connect to the EKS cluster:
```
aws eks update-kubeconfig --region us-east-1 --name eks-cluster-xxx
```

2. Get the Ray service endpoints:

```
kubectl get endpoints -n ray
NAME ENDPOINTS AGE
kuberay-head-svc ...:8080,...:10001,...:8000 + 2 more... 98s
kuberay-operator ...:8080 6m37s
```

3. Start port forwarding:

```
kubectl port-forward -n ray --address 0.0.0.0 service/kuberay-head-svc 8265:8265
```

4. Access the Ray Dashboard at `http://localhost:8265`:

![Ray Dashboard](docs/ray-dashboard.png "Ray Dashboard")
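
With the port forward from step 3 running, job status can also be polled from a script instead of the browser. The sketch below assumes the Ray Jobs REST API is served by the dashboard at `/api/jobs/` and returns a JSON list whose entries carry `submission_id` and `status` fields; check this against the Ray version deployed in your cluster. The function name and the injectable `fetch` argument are hypothetical.

```python
import json
from urllib.request import urlopen


def list_ray_jobs(base_url="http://localhost:8265", fetch=None):
    """Return (submission_id, status) pairs reported by the Ray dashboard.

    fetch is any callable taking a URL and returning the response body;
    it defaults to a plain HTTP GET so the function can be tested offline.
    """
    if fetch is None:
        fetch = lambda url: urlopen(url, timeout=10).read()
    raw = fetch(f"{base_url}/api/jobs/")
    jobs = json.loads(raw)
    return [(job.get("submission_id"), job.get("status")) for job in jobs]


# Usage, with `kubectl port-forward` active in another terminal:
#   for submission_id, status in list_ray_jobs():
#       print(submission_id, status)
```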

### Submitting Jobs from a local machine

After deploying the manifest, follow the steps below to submit a job to the cluster.

2 changes: 2 additions & 0 deletions manifests/ray-on-eks/deployment.yaml
@@ -14,6 +14,8 @@ groups:
    path: manifests/ray-on-eks/ray-operator-modules.yaml
  - name: ray-cluster
    path: manifests/ray-on-eks/ray-cluster-modules.yaml
  - name: ray-orchestrator
    path: manifests/ray-on-eks/ray-orchestrator-modules.yaml
targetAccountMappings:
  - alias: primary
    accountId:
Binary file not shown.
Binary file added manifests/ray-on-eks/docs/step-function.png
48 changes: 48 additions & 0 deletions manifests/ray-on-eks/ray-orchestrator-modules.yaml
@@ -0,0 +1,48 @@
name: ray-orchestrator
path: modules/eks/ray-orchestrator
parameters:
  - name: Namespace
    valueFrom:
      parameterValue: rayNamespaceName
  - name: EksClusterAdminRoleArn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterMasterRoleArn
  - name: EksHandlerRoleArn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksHandlerRoleArn
  - name: EksClusterName
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterName
  - name: EksClusterEndpoint
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterEndpoint
  - name: EksOidcArn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksOidcArn
  - name: EksCertAuthData
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterCertAuthData
  - name: DataBucketName
    valueFrom:
      moduleMetadata:
        group: base
        name: buckets
        key: ArtifactsBucketName
121 changes: 121 additions & 0 deletions modules/eks/ray-orchestrator/README.md
@@ -0,0 +1,121 @@
# Ray Orchestrator

## Description

This module orchestrates submission of a training job to the Ray Cluster using AWS Step Functions.

## Inputs/Outputs

### Input Parameters

#### Required

- `namespace` - Kubernetes namespace name
- `eks_cluster_admin_role_arn` - ARN of the EKS admin role used to authenticate kubectl
- `eks_handler_role_arn` - ARN of the EKS handler role (the `EksHandlerRoleArn` output of the EKS module)
- `eks_cluster_name` - Name of the EKS cluster to deploy to
- `eks_cluster_endpoint` - EKS cluster endpoint
- `eks_oidc_arn` - ARN of the EKS OIDC provider for IAM roles
- `eks_cert_auth_data` - Certificate authority data of the EKS cluster

#### Optional

- `step_function_timeout` - Step function timeout in minutes. Defaults to `360`
- `data_bucket_name` - Name of the bucket to grant service account permissions to
- `tags` - A dictionary of additional tags to apply to all resources. Defaults to None
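
The snake_case input names listed above correspond to the CamelCase `name` fields used in the module manifest (for example `EksClusterAdminRoleArn` and `eks_cluster_admin_role_arn`). The sketch below shows one way to express that mapping. The helper names are hypothetical, and the `SEEDFARMER_PARAMETER_` environment variable prefix is an assumption about how Seed-Farmer passes parameters to a module; `settings.py` is not shown in this change, so treat it as illustrative only.

```python
import re


def manifest_name_to_input(name):
    """Convert a CamelCase manifest parameter name to its snake_case input name."""
    # Insert an underscore before every capital letter except the first one.
    return re.sub(r"(?<!^)(?=[A-Z])", "_", name).lower()


def parameter_env_var(name, prefix="SEEDFARMER_PARAMETER_"):
    """Build the environment variable name for a manifest parameter.

    Assumption: parameters reach the module as upper snake_case environment
    variables behind this prefix.
    """
    return prefix + manifest_name_to_input(name).upper()


for manifest_name in ("Namespace", "EksClusterAdminRoleArn", "EksOidcArn", "DataBucketName"):
    print(manifest_name, manifest_name_to_input(manifest_name), parameter_env_var(manifest_name))
```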

## User Guide

### Submitting Jobs using Step Functions

1. Navigate to AWS Step Functions and find the state machine whose name starts with `TrainingOnEks`.
2. Start a new execution of that state machine.

![Step Function Execution](docs/step-function.png "Step Function Execution")

To observe the progress of the job using the Ray Dashboard:

1. Connect to the EKS cluster:
```
aws eks update-kubeconfig --region us-east-1 --name eks-cluster-xxx
```

2. Get the Ray service endpoints:

```
kubectl get endpoints -n ray
NAME ENDPOINTS AGE
kuberay-head-svc ...:8080,...:10001,...:8000 + 2 more... 98s
kuberay-operator ...:8080 6m37s
```

3. Start port forwarding:

```
kubectl port-forward -n ray --address 0.0.0.0 service/kuberay-head-svc 8265:8265
```

4. Access the Ray Dashboard at `http://localhost:8265`:

![Ray Dashboard](docs/ray-dashboard.png "Ray Dashboard")

## Sample manifest declaration

```yaml
name: ray-orchestrator
path: modules/eks/ray-orchestrator
parameters:
  - name: Namespace
    valueFrom:
      parameterValue: rayNamespaceName
  - name: EksClusterAdminRoleArn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterMasterRoleArn
  - name: EksHandlerRoleArn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksHandlerRoleArn
  - name: EksClusterName
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterName
  - name: EksClusterEndpoint
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterEndpoint
  - name: EksOidcArn
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksOidcArn
  - name: EksCertAuthData
    valueFrom:
      moduleMetadata:
        group: core
        name: eks
        key: EksClusterCertAuthData
  - name: DataBucketName
    valueFrom:
      moduleMetadata:
        group: base
        name: buckets
        key: ArtifactsBucketName
```

## Module Metadata Outputs

- `EksServiceAccountName`: Service account name.
- `EksServiceAccountRoleArn`: Service account role ARN.
- `StateMachineArn`: Step Functions state machine ARN.
- `LogGroupArn`: Log group ARN.
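
The module publishes these outputs as a single JSON string in a stack output named `metadata`. A minimal sketch of consuming that string is shown below; the function name `parse_module_metadata` is hypothetical, and every value in the sample payload is a made-up placeholder rather than real deployment output.

```python
import json

EXPECTED_KEYS = (
    "EksServiceAccountName",
    "EksServiceAccountRoleArn",
    "StateMachineArn",
    "LogGroupArn",
)


def parse_module_metadata(raw):
    """Decode the metadata JSON string and check that all documented keys exist."""
    metadata = json.loads(raw)
    missing = [key for key in EXPECTED_KEYS if key not in metadata]
    if missing:
        raise KeyError(f"metadata is missing keys: {missing}")
    return metadata


# Placeholder payload for illustration only; real values come from the deployed stack.
sample_raw = json.dumps(
    {
        "EksServiceAccountName": "example-service-account",
        "EksServiceAccountRoleArn": "arn:aws:iam::111122223333:role/example-role",
        "StateMachineArn": "arn:aws:states:us-east-1:111122223333:stateMachine:TrainingOnEks-example",
        "LogGroupArn": "arn:aws:logs:us-east-1:111122223333:log-group:example",
    }
)

metadata = parse_module_metadata(sample_raw)
print(metadata["StateMachineArn"])
```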
72 changes: 72 additions & 0 deletions modules/eks/ray-orchestrator/app.py
@@ -0,0 +1,72 @@
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved.
# SPDX-License-Identifier: Apache-2.0

from aws_cdk import App, CfnOutput, Environment, Tags

from ray_orchestrator_stack import RayOrchestrator
from rbac_stack import RbacStack
from settings import ApplicationSettings

app_settings = ApplicationSettings()

app = App()
env = Environment(
    account=app_settings.default.account,
    region=app_settings.default.region,
)

rbac_stack = RbacStack(
    scope=app,
    id=f"{app_settings.settings.app_prefix}-rbac",
    project_name=app_settings.settings.project_name,
    deployment_name=app_settings.settings.deployment_name,
    module_name=app_settings.settings.module_name,
    eks_cluster_name=app_settings.parameters.eks_cluster_name,
    eks_admin_role_arn=app_settings.parameters.eks_cluster_admin_role_arn,
    eks_handler_role_arn=app_settings.parameters.eks_handler_role_arn,
    eks_oidc_arn=app_settings.parameters.eks_oidc_arn,
    namespace_name=app_settings.parameters.namespace,
    data_bucket_name=app_settings.parameters.data_bucket_name,
    env=env,
)

ray_orchestrator_stack = RayOrchestrator(
    scope=app,
    id=app_settings.settings.app_prefix,
    project_name=app_settings.settings.project_name,
    deployment_name=app_settings.settings.deployment_name,
    module_name=app_settings.settings.module_name,
    eks_cluster_name=app_settings.parameters.eks_cluster_name,
    eks_admin_role_arn=app_settings.parameters.eks_cluster_admin_role_arn,
    eks_cluster_endpoint=app_settings.parameters.eks_cluster_endpoint,
    eks_openid_connect_provider_arn=app_settings.parameters.eks_oidc_arn,
    eks_cert_auth_data=app_settings.parameters.eks_cert_auth_data,
    namespace_name=app_settings.parameters.namespace,
    step_function_timeout=app_settings.parameters.step_function_timeout,
    service_account_name=rbac_stack.service_account.service_account_name,
    service_account_role_arn=rbac_stack.service_account.role.role_arn,
    env=env,
)

if app_settings.parameters.tags:
    for tag_key, tag_value in app_settings.parameters.tags.items():
        Tags.of(app).add(tag_key, tag_value)

Tags.of(app).add("SeedFarmerDeploymentName", app_settings.settings.deployment_name)
Tags.of(app).add("SeedFarmerModuleName", app_settings.settings.module_name)
Tags.of(app).add("SeedFarmerProjectName", app_settings.settings.project_name)

CfnOutput(
    scope=ray_orchestrator_stack,
    id="metadata",
    value=ray_orchestrator_stack.to_json_string(
        {
            "EksServiceAccountName": rbac_stack.service_account.service_account_name,
            "EksServiceAccountRoleArn": rbac_stack.service_account.role.role_arn,
            "StateMachineArn": ray_orchestrator_stack.sm.state_machine_arn,
            "LogGroupArn": ray_orchestrator_stack.log_group.log_group_arn,
        }
    ),
)

app.synth(force=True)