Workloads implemented for A3-Mega and A3-Ultra machines #306

Merged
9 commits merged into main from workloads_a3
Dec 28, 2024

@sharabiani (Collaborator) commented Dec 28, 2024

Fixes / Features

  • Added a workload JobSet for A3 machines
  • Implemented the workload decorator for A3Ultra
  • Implemented the workload decorator for A3Mega
  • Added TAS (Topology Aware Scheduling) support through Kueue to A3Ultra and A3Mega workloads
  • Updated the A3Ultra and A3Mega ConfigMaps to meet workload requirements
  • Updated the Kueue configuration to disable preemption for A3Ultra and A3Mega (preemption is not supported with TAS)
  • Updated the blueprints to support Kueue v0.10.0 in A3Ultra and A3Mega clusters
  • Removed the GPU workflows' dependency on MaxText
  • Fixed some bugs in the GPU workloads flow
  • Updated the documentation to cover A3-Mega and A3-Ultra clusters
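
For readers unfamiliar with the decorator idea, here is a minimal sketch of how a workload decorator could attach a TAS annotation to a JobSet manifest. It is not the code from this PR: the function name `add_tas_annotation`, the sample manifest, and the chosen topology level are assumptions for illustration. Only the annotation key `kueue.x-k8s.io/podset-preferred-topology` and the JobSet nesting (`spec.replicatedJobs[].template.spec.template`) come from the Kueue and JobSet APIs.

```python
import copy

# Kueue annotation that requests Topology Aware Scheduling for a pod set.
TAS_ANNOTATION = "kueue.x-k8s.io/podset-preferred-topology"
# Assumed topology level for illustration; the real value depends on the cluster.
ASSUMED_TOPOLOGY_LEVEL = "kubernetes.io/hostname"


def add_tas_annotation(jobset: dict, level: str = ASSUMED_TOPOLOGY_LEVEL) -> dict:
    """Return a copy of a JobSet manifest with the TAS annotation on every pod template.

    Hypothetical helper: the decorators in this PR may be structured differently.
    """
    decorated = copy.deepcopy(jobset)  # leave the caller's manifest untouched
    for replicated_job in decorated["spec"]["replicatedJobs"]:
        # JobSet -> Job template -> Job spec -> pod template
        pod_template = replicated_job["template"]["spec"]["template"]
        metadata = pod_template.setdefault("metadata", {})
        annotations = metadata.setdefault("annotations", {})
        annotations[TAS_ANNOTATION] = level
    return decorated


# Minimal, made-up manifest with one replicated job.
manifest = {
    "apiVersion": "jobset.x-k8s.io/v1alpha2",
    "kind": "JobSet",
    "metadata": {"name": "example-workload"},
    "spec": {
        "replicatedJobs": [
            {
                "name": "worker",
                "template": {"spec": {"template": {"spec": {"containers": []}}}},
            }
        ]
    },
}

decorated = add_tas_annotation(manifest)
```

Returning a decorated copy rather than mutating the input keeps the base JobSet reusable across machine types, which is one plausible reason to separate A3Ultra and A3Mega decorators.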

Testing / Documentation

Testing details.

  • [ y/n ] Tests pass
  • [ y/n ] Appropriate changes to documentation are included in the PR

@sharabiani changed the title from "A3 machines workloads implemented" to "A3-Mega and A3-Ultra workloads implemented" Dec 28, 2024
@sharabiani changed the title from "A3-Mega and A3-Ultra workloads implemented" to "Workloads implemented for A3-Mega and A3-Ultra machines" Dec 28, 2024
@sharabiani sharabiani marked this pull request as ready for review December 28, 2024 18:19
@sharabiani sharabiani merged commit 0e41cd6 into main Dec 28, 2024
6 checks passed
@sharabiani sharabiani deleted the workloads_a3 branch December 28, 2024 18:55