feat: add ray orchestrator module (#253)
* add ray orchestrator module
* add aws-auth identity mapping
* add modulestack
* add test cases
* yamllint
* changelog
* add training script via config map
* update manifests
* update docs
* minor housekeeping
* add sfn timeout config
* docs
* update screenshots
* proofreading
* pr feedback
1 parent edc66f1, commit 0240463. Showing 25 changed files with 1,394 additions and 2 deletions.
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,48 @@ | ||
name: ray-orchestrator | ||
path: modules/eks/ray-orchestrator | ||
parameters: | ||
  - name: Namespace | ||
    valueFrom: | ||
      parameterValue: rayNamespaceName | ||
  - name: EksClusterAdminRoleArn | ||
    valueFrom: | ||
      moduleMetadata: | ||
        group: core | ||
        name: eks | ||
        key: EksClusterMasterRoleArn | ||
  - name: EksHandlerRoleArn | ||
    valueFrom: | ||
      moduleMetadata: | ||
        group: core | ||
        name: eks | ||
        key: EksHandlerRoleArn | ||
  - name: EksClusterName | ||
    valueFrom: | ||
      moduleMetadata: | ||
        group: core | ||
        name: eks | ||
        key: EksClusterName | ||
  - name: EksClusterEndpoint | ||
    valueFrom: | ||
      moduleMetadata: | ||
        group: core | ||
        name: eks | ||
        key: EksClusterEndpoint | ||
  - name: EksOidcArn | ||
    valueFrom: | ||
      moduleMetadata: | ||
        group: core | ||
        name: eks | ||
        key: EksOidcArn | ||
  - name: EksCertAuthData | ||
    valueFrom: | ||
      moduleMetadata: | ||
        group: core | ||
        name: eks | ||
        key: EksClusterCertAuthData | ||
  - name: DataBucketName | ||
    valueFrom: | ||
      moduleMetadata: | ||
        group: base | ||
        name: buckets | ||
        key: ArtifactsBucketName |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,121 @@ | ||
# Ray Orchestrator | ||
|
||
## Description | ||
|
||
This module orchestrates submission of a training job to the Ray Cluster using AWS Step Functions. | ||
|
||
## Inputs/Outputs | ||
|
||
### Input Parameters | ||
|
||
#### Required | ||
|
||
- `namespace` - Kubernetes namespace name | ||
- `eks_cluster_admin_role_arn` - ARN of the EKS admin role used to authenticate kubectl | ||
- `eks_handler_role_arn` - ARN of the EKS handler role used to authenticate kubectl | ||
- `eks_cluster_name` - Name of the EKS cluster to deploy to | ||
- `eks_cluster_endpoint` - EKS cluster endpoint | ||
- `eks_oidc_arn` - ARN of the EKS OIDC provider for IAM roles | ||
- `eks_cert_auth_data` - EKS cluster certificate authority data | ||
|
||
#### Optional | ||
|
||
- `step_function_timeout` - Timeout of the Step Functions state machine, in minutes. Defaults to `360` | ||
- `data_bucket_name` - Name of the bucket to grant service account permissions to | ||
- `tags` - A dictionary of additional tags to apply to all resources. Defaults to None | ||
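As a rough illustration of how the parameters above might reach the module, the sketch below reads them from environment-style key/value pairs. It assumes each parameter is exposed as `SEEDFARMER_PARAMETER_<UPPER_SNAKE_CASE>`; the module's own `settings.py` is not shown in this diff, so the `load_parameters` helper, the `ENV_PREFIX` constant and the returned dictionary are hypothetical. The `tags` parameter is left out for brevity.

```python
import os
from typing import Mapping, Optional

# Assumed prefix; the module's real settings loader is not shown in this diff.
ENV_PREFIX = "SEEDFARMER_PARAMETER_"

# Required parameters, as listed in the README.
REQUIRED = (
    "namespace",
    "eks_cluster_admin_role_arn",
    "eks_handler_role_arn",
    "eks_cluster_name",
    "eks_cluster_endpoint",
    "eks_oidc_arn",
    "eks_cert_auth_data",
)


def load_parameters(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect module parameters from environment-style key/value pairs."""
    env = os.environ if env is None else env
    params = {}
    missing = []
    for name in REQUIRED:
        value = env.get(ENV_PREFIX + name.upper())
        if value is None:
            missing.append(name)
        else:
            params[name] = value
    if missing:
        raise ValueError(f"missing required parameters: {', '.join(missing)}")
    # Optional parameters; 360 minutes is the documented default timeout.
    params["step_function_timeout"] = int(
        env.get(ENV_PREFIX + "STEP_FUNCTION_TIMEOUT", "360")
    )
    params["data_bucket_name"] = env.get(ENV_PREFIX + "DATA_BUCKET_NAME")
    return params
```

Passing a plain dictionary instead of relying on `os.environ` keeps the helper easy to exercise in a unit test.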
|
||
## User Guide | ||
|
||
### Submitting Jobs using Step Functions | ||
|
||
1. Navigate to AWS Step Functions and find the state machine whose name starts with "TrainingOnEks" | ||
2. Start a new execution of that state machine | ||
|
||
![Step Function Execution](docs/step-function.png "Step Function Execution") | ||
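The same two console steps could also be scripted. The following is a minimal sketch, assuming a boto3 Step Functions client is passed in and that the state machine accepts an empty JSON object as input (the expected input is not documented in this diff). The helper names `find_state_machine_arn` and `start_training` are hypothetical.

```python
import json
from typing import Iterable, Optional


def find_state_machine_arn(
    state_machines: Iterable[dict], prefix: str = "TrainingOnEks"
) -> Optional[str]:
    """Return the ARN of the first state machine whose name starts with prefix."""
    for machine in state_machines:
        if machine["name"].startswith(prefix):
            return machine["stateMachineArn"]
    return None


def start_training(sfn_client, execution_input: Optional[dict] = None) -> str:
    """Start one execution of the training state machine and return its ARN."""
    machines = []
    # list_state_machines is paginated, so collect every page first.
    for page in sfn_client.get_paginator("list_state_machines").paginate():
        machines.extend(page["stateMachines"])
    arn = find_state_machine_arn(machines)
    if arn is None:
        raise LookupError("no state machine with the expected name prefix")
    response = sfn_client.start_execution(
        stateMachineArn=arn,
        # Assumption: an empty JSON object is acceptable input.
        input=json.dumps(execution_input or {}),
    )
    return response["executionArn"]
```

With boto3 installed and AWS credentials configured, calling `start_training(boto3.client("stepfunctions"))` would start a single execution and return its ARN.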
|
||
To observe the progress of the job in the Ray Dashboard: | ||
|
||
1. Connect to the EKS cluster: | ||
``` | ||
aws eks update-kubeconfig --region us-east-1 --name eks-cluster-xxx | ||
``` | ||
|
||
2. Get Ray service endpoint: | ||
|
||
``` | ||
kubectl get endpoints -n ray | ||
NAME ENDPOINTS AGE | ||
kuberay-head-svc ...:8080,...:10001,...:8000 + 2 more... 98s | ||
kuberay-operator ...:8080 6m37s | ||
``` | ||
|
||
3. Start port forwarding: | ||
|
||
``` | ||
kubectl port-forward -n ray --address 0.0.0.0 service/kuberay-head-svc 8265:8265 | ||
``` | ||
|
||
4. Access the Ray Dashboard at `http://localhost:8265`: | ||
|
||
![Ray Dashboard](docs/ray-dashboard.png "Ray Dashboard") | ||
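Before opening the browser, it can help to confirm that the port-forward from step 3 is actually listening. The sketch below is one way to check this with the Python standard library; `is_port_open` is a hypothetical helper, and the defaults assume the local port `8265` used above.

```python
import socket


def is_port_open(host: str = "127.0.0.1", port: int = 8265, timeout: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False
```

A `False` result usually means the `kubectl port-forward` command is not running or is bound to a different local port.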
|
||
## Sample manifest declaration | ||
|
||
```yaml | ||
name: ray-orchestrator | ||
path: modules/eks/ray-orchestrator | ||
parameters: | ||
  - name: Namespace | ||
    valueFrom: | ||
      parameterValue: rayNamespaceName | ||
  - name: EksClusterAdminRoleArn | ||
    valueFrom: | ||
      moduleMetadata: | ||
        group: core | ||
        name: eks | ||
        key: EksClusterMasterRoleArn | ||
  - name: EksHandlerRoleArn | ||
    valueFrom: | ||
      moduleMetadata: | ||
        group: core | ||
        name: eks | ||
        key: EksHandlerRoleArn | ||
  - name: EksClusterName | ||
    valueFrom: | ||
      moduleMetadata: | ||
        group: core | ||
        name: eks | ||
        key: EksClusterName | ||
  - name: EksClusterEndpoint | ||
    valueFrom: | ||
      moduleMetadata: | ||
        group: core | ||
        name: eks | ||
        key: EksClusterEndpoint | ||
  - name: EksOidcArn | ||
    valueFrom: | ||
      moduleMetadata: | ||
        group: core | ||
        name: eks | ||
        key: EksOidcArn | ||
  - name: EksCertAuthData | ||
    valueFrom: | ||
      moduleMetadata: | ||
        group: core | ||
        name: eks | ||
        key: EksClusterCertAuthData | ||
  - name: DataBucketName | ||
    valueFrom: | ||
      moduleMetadata: | ||
        group: base | ||
        name: buckets | ||
        key: ArtifactsBucketName | ||
``` | ||
## Module Metadata Outputs | ||
- `EksServiceAccountName`: Service Account Name. | ||
- `EksServiceAccountRoleArn`: Service Account Role ARN. | ||
- `StateMachineArn`: Step Functions state machine ARN. | ||
- `LogGroupArn`: Log group ARN. |
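A downstream consumer might want to validate these outputs before using them. The sketch below assumes the metadata arrives as a single JSON string with the four keys listed above; `parse_module_metadata` and `EXPECTED_KEYS` are hypothetical names used only for illustration.

```python
import json

# Keys taken from the "Module Metadata Outputs" list.
EXPECTED_KEYS = {
    "EksServiceAccountName",
    "EksServiceAccountRoleArn",
    "StateMachineArn",
    "LogGroupArn",
}


def parse_module_metadata(raw: str) -> dict:
    """Decode the metadata JSON string and check that every expected key is present."""
    metadata = json.loads(raw)
    missing = EXPECTED_KEYS - metadata.keys()
    if missing:
        raise KeyError(f"metadata is missing keys: {sorted(missing)}")
    return metadata
```

Failing fast on a missing key keeps a misconfigured deployment from surfacing later as an unrelated lookup error.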
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,72 @@ | ||
# Copyright Amazon.com, Inc. or its affiliates. All Rights Reserved. | ||
# SPDX-License-Identifier: Apache-2.0 | ||
|
||
from aws_cdk import App, CfnOutput, Environment, Tags | ||
|
||
from ray_orchestrator_stack import RayOrchestrator | ||
from rbac_stack import RbacStack | ||
from settings import ApplicationSettings | ||
|
||
app_settings = ApplicationSettings() | ||
|
||
app = App() | ||
env = Environment( | ||
    account=app_settings.default.account, | ||
    region=app_settings.default.region, | ||
) | ||
|
||
rbac_stack = RbacStack( | ||
    scope=app, | ||
    id=f"{app_settings.settings.app_prefix}-rbac", | ||
    project_name=app_settings.settings.project_name, | ||
    deployment_name=app_settings.settings.deployment_name, | ||
    module_name=app_settings.settings.module_name, | ||
    eks_cluster_name=app_settings.parameters.eks_cluster_name, | ||
    eks_admin_role_arn=app_settings.parameters.eks_cluster_admin_role_arn, | ||
    eks_handler_role_arn=app_settings.parameters.eks_handler_role_arn, | ||
    eks_oidc_arn=app_settings.parameters.eks_oidc_arn, | ||
    namespace_name=app_settings.parameters.namespace, | ||
    data_bucket_name=app_settings.parameters.data_bucket_name, | ||
    env=env, | ||
) | ||
|
||
ray_orchestrator_stack = RayOrchestrator( | ||
    scope=app, | ||
    id=app_settings.settings.app_prefix, | ||
    project_name=app_settings.settings.project_name, | ||
    deployment_name=app_settings.settings.deployment_name, | ||
    module_name=app_settings.settings.module_name, | ||
    eks_cluster_name=app_settings.parameters.eks_cluster_name, | ||
    eks_admin_role_arn=app_settings.parameters.eks_cluster_admin_role_arn, | ||
    eks_cluster_endpoint=app_settings.parameters.eks_cluster_endpoint, | ||
    eks_openid_connect_provider_arn=app_settings.parameters.eks_oidc_arn, | ||
    eks_cert_auth_data=app_settings.parameters.eks_cert_auth_data, | ||
    namespace_name=app_settings.parameters.namespace, | ||
    step_function_timeout=app_settings.parameters.step_function_timeout, | ||
    service_account_name=rbac_stack.service_account.service_account_name, | ||
    service_account_role_arn=rbac_stack.service_account.role.role_arn, | ||
    env=env, | ||
) | ||
|
||
if app_settings.parameters.tags: | ||
    for tag_key, tag_value in app_settings.parameters.tags.items(): | ||
        Tags.of(app).add(tag_key, tag_value) | ||
|
||
Tags.of(app).add("SeedFarmerDeploymentName", app_settings.settings.deployment_name) | ||
Tags.of(app).add("SeedFarmerModuleName", app_settings.settings.module_name) | ||
Tags.of(app).add("SeedFarmerProjectName", app_settings.settings.project_name) | ||
|
||
CfnOutput( | ||
    scope=ray_orchestrator_stack, | ||
    id="metadata", | ||
    value=ray_orchestrator_stack.to_json_string( | ||
        { | ||
            "EksServiceAccountName": rbac_stack.service_account.service_account_name, | ||
            "EksServiceAccountRoleArn": rbac_stack.service_account.role.role_arn, | ||
            "StateMachineArn": ray_orchestrator_stack.sm.state_machine_arn, | ||
            "LogGroupArn": ray_orchestrator_stack.log_group.log_group_arn, | ||
        } | ||
    ), | ||
) | ||
|
||
app.synth(force=True) |