Test with GPU and m+t4 machines #533

Merged
merged 42 commits into from
Apr 27, 2022
2559b15
Test with GPU and m+t4 machines
0x2b3bfa0 Apr 25, 2022
accdb86
Restyled by whitespace
restyled-commits Apr 25, 2022
9939a56
Use `Standard_NC4as_T4_v3` on AKS
0x2b3bfa0 Apr 25, 2022
79ffd19
Bump default disk size on iterative_task resource
0x2b3bfa0 Apr 25, 2022
855bda2
Ditto for tests
0x2b3bfa0 Apr 25, 2022
8a850f7
Fix `az` NGC machine image version
0x2b3bfa0 Apr 25, 2022
3e33ecc
Migrate Google Cloud GPU images to official
0x2b3bfa0 Apr 25, 2022
92ac3f8
Remove redundant `grep`
0x2b3bfa0 Apr 25, 2022
8a45f74
Bump default disk to 50 GB
0x2b3bfa0 Apr 25, 2022
93443ee
Ditto
0x2b3bfa0 Apr 25, 2022
da8c7a8
Try Azure DSVM GPU images
0x2b3bfa0 Apr 25, 2022
104d065
Try which providers support unset disk size
0x2b3bfa0 Apr 25, 2022
b4c2ce1
Restyled by gofmt
restyled-commits Apr 25, 2022
3968b2d
Make disk size optional
0x2b3bfa0 Apr 25, 2022
2bbff81
Whoops!
0x2b3bfa0 Apr 25, 2022
454ba24
Restyled by gofmt
restyled-commits Apr 25, 2022
9397935
Use disk_size > 0 everywhere
0x2b3bfa0 Apr 25, 2022
521c765
Fix `gcp` derp
0x2b3bfa0 Apr 25, 2022
7c05a1d
Fix GCP GPU machines
0x2b3bfa0 Apr 25, 2022
fa1663e
Use `yes` because... why not?
0x2b3bfa0 Apr 25, 2022
86dcf87
Indent back script
0x2b3bfa0 Apr 25, 2022
a2c35b0
Simplify test script error handling
0x2b3bfa0 Apr 25, 2022
ee7046d
Remove redundant `Storage` requirement
0x2b3bfa0 Apr 25, 2022
82958c5
Avoid `mkdir` error if directory exists
0x2b3bfa0 Apr 25, 2022
df89dfc
Improve test verbosity & fail fast
0x2b3bfa0 Apr 25, 2022
ea2f6ff
Upgrade AWS DLAMI to CUDA 11.3
0x2b3bfa0 Apr 25, 2022
90a5745
Keep it simple
0x2b3bfa0 Apr 25, 2022
8b5963a
Test `m+k80` to see if `k8s` breaks
0x2b3bfa0 Apr 26, 2022
fc6cc82
Fix `k8s` storage size
0x2b3bfa0 Apr 26, 2022
5ece09d
Restyled by gofmt
restyled-commits Apr 26, 2022
84ad30b
Ahem, ahem
0x2b3bfa0 Apr 26, 2022
0f1c4f0
Use `t4` again
0x2b3bfa0 Apr 26, 2022
eb2d6a0
Remove `k8s` granular GPU selectors
0x2b3bfa0 Apr 26, 2022
6b27b54
Merge branch 'master' into test-m-t4
0x2b3bfa0 Apr 26, 2022
77a7679
Fix last `k8s` issues 🤞
0x2b3bfa0 Apr 26, 2022
f573656
Revert cluster instance change
0x2b3bfa0 Apr 26, 2022
f691c06
Delete linux_amd64
0x2b3bfa0 Apr 27, 2022
78042fc
Avoid mkdir errors
0x2b3bfa0 Apr 27, 2022
c7f3041
Update task/k8s/resources/resource_job.go
0x2b3bfa0 Apr 27, 2022
5b96c7e
docs: auto-disk_size
casperdcl Apr 27, 2022
8b9ba65
docs: nvidia images
casperdcl Apr 27, 2022
4f208b0
nvidia description
casperdcl Apr 27, 2022
2 changes: 1 addition & 1 deletion README.md
@@ -77,7 +77,7 @@ resource "iterative_task" "example" {
cloud = "aws" # or any of: gcp, az, k8s
machine = "m" # medium. Or any of: l, xl, m+k80, xl+v100, ...
spot = 0 # auto-price. Default -1 to disable, or >0 for hourly USD limit
- disk_size = 30 # GB
+ disk_size = -1 # GB. Default -1 for automatic

storage {
workdir = "." # default blank (don't upload)
2 changes: 1 addition & 1 deletion docs/guides/getting-started.md
@@ -38,7 +38,7 @@ resource "iterative_task" "example" {
cloud = "aws" # or any of: gcp, az, k8s
machine = "m" # medium. Or any of: l, xl, m+k80, xl+v100, ...
spot = 0 # auto-price. Default -1 to disable, or >0 for hourly USD limit
- disk_size = 30 # GB
+ disk_size = -1 # GB. Default -1 for automatic

storage {
workdir = "." # default blank (don't upload)
6 changes: 3 additions & 3 deletions docs/resources/task.md
@@ -15,7 +15,7 @@ resource "iterative_task" "example" {
machine = "m" # medium. Or any of: l, xl, m+k80, xl+v100, ...
image = "ubuntu" # or "nvidia", ...
region = "us-west" # or "us-east", "eu-west", ...
- disk_size = 30 # GB
+ disk_size = -1 # GB. Default -1 for automatic
spot = 0 # auto-price. Default -1 to disable, or >0 for hourly USD limit
parallelism = 1
timeout = 24*60*60 # max 24h before forced termination
@@ -56,7 +56,7 @@ resource "iterative_task" "example" {

- `region` - (Optional) [Cloud region/zone](#cloud-region) to run the task on.
- `machine` - (Optional) See [Machine Types](#machine-type) below.
- - `disk_size` - (Optional) Size of the ephemeral machine storage in GB.
+ - `disk_size` - (Optional) Size of the ephemeral machine storage in GB. `-1`: automatic based on `image`.
- `spot` - (Optional) Spot instance price. `-1`: disabled, `0`: automatic price, any other positive number: maximum bidding price in USD per hour (above which the instance is terminated until the price drops).
- `image` - (Optional) [Machine image](#machine-image) to run the task with.
- `parallelism` - (Optional) Number of machines to be launched in parallel.
@@ -169,7 +169,7 @@ In addition to generic types, it's possible to specify any machine type supporte
The Iterative Provider offers some common machine images which are roughly the same for all supported clouds.

- `ubuntu` - Official [Ubuntu LTS](https://wiki.ubuntu.com/LTS) image (currently 20.04).
- - `nvidia` - Official [NVIDIA NGC](https://docs.nvidia.com/ngc/ngc-deploy-public-cloud)-based images, typically needing `disk_size = 32` GB or more.
+ - `nvidia` - Official Ubuntu LTS with NVIDIA GPU drivers and CUDA toolkit (currently 11.3).

### Cloud-specific

2 changes: 1 addition & 1 deletion iterative/resource_task.go
@@ -55,7 +55,7 @@ func resourceTask() *schema.Resource {
Type: schema.TypeInt,
ForceNew: true,
Optional: true,
- Default: 30,
+ Default: -1,
},
"spot": {
Type: schema.TypeFloat,
2 changes: 1 addition & 1 deletion task/aws/resources/data_source_image.go
@@ -34,7 +34,7 @@ func (i *Image) Read(ctx context.Context) error {
image := i.Identifier
images := map[string]string{
"ubuntu": "ubuntu@099720109477:x86_64:*ubuntu/images/hvm-ssd/ubuntu-focal-20.04*",
- "nvidia": "ubuntu@898082745236:x86_64:Deep Learning AMI GPU CUDA 11.2.1 (Ubuntu 20.04) 20220306",
+ "nvidia": "ubuntu@898082745236:x86_64:Deep Learning AMI GPU CUDA 11.3.1 (Ubuntu 20.04) 20220303",
}
if val, ok := images[image]; ok {
image = val
5 changes: 4 additions & 1 deletion task/aws/resources/resource_launch_template.go
@@ -86,7 +86,6 @@ func (l *LaunchTemplate) Create(ctx context.Context) error {
Ebs: &types.LaunchTemplateEbsBlockDeviceRequest{
DeleteOnTermination: aws.Bool(true),
Encrypted: aws.Bool(false),
- VolumeSize: aws.Int32(int32(l.Attributes.Size.Storage)),
VolumeType: types.VolumeType("gp2"),
},
},
@@ -110,6 +109,10 @@ func (l *LaunchTemplate) Create(ctx context.Context) error {
},
}

+ if size := l.Attributes.Size.Storage; size > 0 {
+ input.LaunchTemplateData.BlockDeviceMappings[0].Ebs.VolumeSize = aws.Int32(int32(size))
+ }

if _, err := l.Client.Services.EC2.CreateLaunchTemplate(ctx, &input); err != nil {
var e smithy.APIError
if errors.As(err, &e) && e.ErrorCode() == "InvalidLaunchTemplateName.AlreadyExistsException" {
7 changes: 5 additions & 2 deletions task/az/resources/resource_virtual_machine_scale_set.go
@@ -85,7 +85,7 @@ func (v *VirtualMachineScaleSet) Create(ctx context.Context) error {
image := v.Attributes.Environment.Image
images := map[string]string{
"ubuntu": "ubuntu@Canonical:0001-com-ubuntu-server-focal:20_04-lts:latest",
- "nvidia": "ubuntu@nvidia:ngc_base_image_version_b:gen2_21-11-0:latest#plan",
+ "nvidia": "ubuntu@microsoft-dsvm:ubuntu-2004:2004-gen2:latest",
}
if val, ok := images[image]; ok {
image = val
@@ -145,7 +145,6 @@ func (v *VirtualMachineScaleSet) Create(ctx context.Context) error {
OsDisk: &compute.VirtualMachineScaleSetOSDisk{
Caching: compute.CachingTypesReadWrite,
CreateOption: compute.DiskCreateOptionTypesFromImage,
- DiskSizeGB: to.Int32Ptr(int32(v.Attributes.Size.Storage)),
ManagedDisk: &compute.VirtualMachineScaleSetManagedDiskParameters{
StorageAccountType: compute.StorageAccountTypesStandardLRS,
},
@@ -192,6 +191,10 @@ func (v *VirtualMachineScaleSet) Create(ctx context.Context) error {
},
}

+ if size := v.Attributes.Size.Storage; size > 0 {
+ settings.VirtualMachineScaleSetProperties.VirtualMachineProfile.StorageProfile.OsDisk.DiskSizeGB = to.Int32Ptr(int32(size))
+ }

if plan == "#plan" {
settings.Plan = &compute.Plan{
Publisher: to.StringPtr(publisher),
2 changes: 2 additions & 0 deletions task/common/machine/script.go
@@ -102,6 +102,8 @@ fi

rclone copy "$RCLONE_REMOTE/data" /tmp/tpi-task

+ yes | /etc/profile.d/install-driver-prompt.sh # for GCP GPU machines

sudo systemctl daemon-reload
sudo systemctl enable tpi-task.service --now
sudo systemctl disable --now apt-daily.timer
2 changes: 1 addition & 1 deletion task/gcp/resources/data_source_image.go
@@ -32,7 +32,7 @@ func (i *Image) Read(ctx context.Context) error {
image := i.Identifier
images := map[string]string{
"ubuntu": "ubuntu@ubuntu-os-cloud/ubuntu-2004-lts",
- "nvidia": "ubuntu@nvidia-ngc-public/nvidia-gpu-cloud-image-20211105",
+ "nvidia": "ubuntu@deeplearning-platform-release/common-cu113-ubuntu-2004",
}
if val, ok := images[image]; ok {
image = val
5 changes: 4 additions & 1 deletion task/gcp/resources/resource_instance_template.go
@@ -131,7 +131,6 @@ func (i *InstanceTemplate) Create(ctx context.Context) error {
Mode: "READ_WRITE",
InitializeParams: &compute.AttachedDiskInitializeParams{
SourceImage: i.Dependencies.Image.Resource.SelfLink,
- DiskSizeGb: int64(i.Attributes.Size.Storage),
DiskType: "pd-balanced",
},
},
@@ -171,6 +170,10 @@ func (i *InstanceTemplate) Create(ctx context.Context) error {
},
}

+ if size := i.Attributes.Size.Storage; size > 0 {
+ definition.Properties.Disks[0].InitializeParams.DiskSizeGb = int64(size)
+ }

insertOperation, err := i.Client.Services.Compute.InstanceTemplates.Insert(i.Client.Credentials.ProjectID, definition).Do()
if err != nil {
if strings.HasSuffix(err.Error(), "alreadyExists") {
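Note on the AWS, Azure and GCP hunks above: each one drops the unconditional size field and assigns it only when `size > 0`, so a non-positive `disk_size` leaves the field unset, which the docs change describes as automatic sizing based on `image`. A minimal sketch of that pattern, assuming a hypothetical `diskRequest` struct with a pointer field standing in for the cloud SDK request types (none of these names come from the provider or the SDKs):

```go
package main

import "fmt"

// diskRequest is a hypothetical stand-in for a cloud SDK request type.
// A nil SizeGB means "do not send a size at all".
type diskRequest struct {
	Type   string
	SizeGB *int32
}

// newDiskRequest sets SizeGB only for positive sizes, mirroring the
// `if size > 0` guards in the hunks above.
func newDiskRequest(diskType string, size int) diskRequest {
	request := diskRequest{Type: diskType}
	if size > 0 {
		value := int32(size)
		request.SizeGB = &value
	}
	return request
}

func main() {
	for _, size := range []int{-1, 0, 50} {
		request := newDiskRequest("gp2", size)
		if request.SizeGB == nil {
			fmt.Printf("disk_size=%d: size left unset\n", size)
			continue
		}
		fmt.Printf("disk_size=%d: requesting %d GB\n", size, *request.SizeGB)
	}
}
```

Using a pointer (or an omitted field) rather than a sentinel value keeps "no preference" distinguishable from any concrete size in the request itself.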
18 changes: 9 additions & 9 deletions task/k8s/resources/resource_job.go
@@ -62,13 +62,13 @@ func (j *Job) Create(ctx context.Context) error {
"m": "8-32000",
"l": "32-128000",
"xl": "64-256000",
- "m+t4": "4-16000+nvidia-tesla-t4*1",
- "m+k80": "4-64000+nvidia-tesla-k80*1",
- "l+k80": "32-512000+nvidia-tesla-k80*8",
- "xl+k80": "64-768000+nvidia-tesla-k80*16",
- "m+v100": "8-64000+nvidia-tesla-v100*1",
- "l+v100": "32-256000+nvidia-tesla-v100*4",
- "xl+v100": "64-512000+nvidia-tesla-v100*8",
+ "m+t4": "4-16000+nvidia*1",
+ "m+k80": "4-64000+nvidia*1",
+ "l+k80": "32-512000+nvidia*8",
+ "xl+k80": "64-768000+nvidia*16",
+ "m+v100": "8-64000+nvidia*1",
+ "l+v100": "32-256000+nvidia*4",
+ "xl+v100": "64-512000+nvidia*8",
}
if val, ok := sizes[size]; ok {
size = val
@@ -77,7 +77,7 @@ func (j *Job) Create(ctx context.Context) error {
image := j.Attributes.Task.Environment.Image
images := map[string]string{
"ubuntu": "ubuntu",
- "nvidia": "nvidia/cuda",
+ "nvidia": "nvidia/cuda:11.3.1-cudnn8-runtime-ubuntu20.04",
}
if val, ok := images[image]; ok {
image = val
@@ -91,7 +91,7 @@ func (j *Job) Create(ctx context.Context) error {
// Define the accelerator settings (i.e. GPU type, model, ...)
jobNodeSelector := map[string]string{}
jobAccelerator := match[3]
- jobGPUType := "kubernetes.io/gpu"
+ jobGPUType := "nvidia.com/gpu"
jobGPUCount := match[4]

// Define the dynamic resource allocation limits for the job pods.
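Note on the size table above: the generic machine names expand to strings of the form `CPU-memory` with an optional `+accelerator*count` suffix, and the hunk reads the accelerator and GPU count from `match[3]` and `match[4]`. The expression that produces `match` is not part of this diff, so the sketch below uses an assumed pattern, chosen only so that groups 3 and 4 line up with those indices; `machineSize`, `sizePattern` and `parseMachineSize` are hypothetical names.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// machineSize holds the parts of a size string such as "4-16000+nvidia*1".
type machineSize struct {
	CPU         int
	Memory      int    // same unit as in the size table; the diff does not state it
	Accelerator string // empty when no GPU is requested
	GPUCount    int
}

// sizePattern is an assumed pattern, not the one used by the provider.
var sizePattern = regexp.MustCompile(`^(\d+)-(\d+)(?:\+([^*]+)\*(\d+))?$`)

// parseMachineSize splits a size string into its numeric and GPU parts.
func parseMachineSize(value string) (machineSize, error) {
	match := sizePattern.FindStringSubmatch(value)
	if match == nil {
		return machineSize{}, fmt.Errorf("invalid machine size %q", value)
	}
	var size machineSize
	var err error
	if size.CPU, err = strconv.Atoi(match[1]); err != nil {
		return machineSize{}, err
	}
	if size.Memory, err = strconv.Atoi(match[2]); err != nil {
		return machineSize{}, err
	}
	if match[3] != "" {
		size.Accelerator = match[3]
		if size.GPUCount, err = strconv.Atoi(match[4]); err != nil {
			return machineSize{}, err
		}
	}
	return size, nil
}

func main() {
	for _, value := range []string{"8-32000", "4-16000+nvidia*1", "bogus"} {
		size, err := parseMachineSize(value)
		fmt.Printf("%q -> %+v (err: %v)\n", value, size, err)
	}
}
```

With the per-model selectors removed, every GPU entry in the table carries the same accelerator token, and only the count differs between sizes.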
11 changes: 8 additions & 3 deletions task/k8s/resources/resource_persistent_volume_claim.go
@@ -14,7 +14,7 @@ import (
"terraform-provider-iterative/task/k8s/client"
)

- func NewPersistentVolumeClaim(client *client.Client, identifier common.Identifier, storageClass string, size uint64, many bool) *PersistentVolumeClaim {
+ func NewPersistentVolumeClaim(client *client.Client, identifier common.Identifier, storageClass string, size int, many bool) *PersistentVolumeClaim {
p := new(PersistentVolumeClaim)
p.Client = client
p.Identifier = identifier.Long()
@@ -29,7 +29,7 @@ type PersistentVolumeClaim struct {
Identifier string
Attributes struct {
StorageClass string
- Size uint64
+ Size int
Many bool
}
Dependencies struct{}
@@ -42,6 +42,11 @@ func (p *PersistentVolumeClaim) Create(ctx context.Context) error {
accessMode = kubernetes_core.ReadWriteMany
}

+ size := p.Attributes.Size
+ if size <= 0 {
+ size = 1 // Most StorageClasses disregard size anyway
+ }

persistentVolumeClaimInput := kubernetes_core.PersistentVolumeClaim{
ObjectMeta: kubernetes_meta.ObjectMeta{
Name: p.Identifier,
@@ -53,7 +58,7 @@ func (p *PersistentVolumeClaim) Create(ctx context.Context) error {
AccessModes: []kubernetes_core.PersistentVolumeAccessMode{accessMode},
Resources: kubernetes_core.ResourceRequirements{
Requests: kubernetes_core.ResourceList{
- kubernetes_core.ResourceStorage: kubernetes_resource.MustParse(strconv.Itoa(int(p.Attributes.Size)) + "G"),
+ kubernetes_core.ResourceStorage: kubernetes_resource.MustParse(strconv.Itoa(size) + "G"),
},
},
},
4 changes: 2 additions & 2 deletions task/k8s/task.go
@@ -31,7 +31,7 @@ func New(ctx context.Context, cloud common.Cloud, identifier common.Identifier,
}

persistentVolumeClaimStorageClass := ""
- persistentVolumeClaimSize := uint64(task.Size.Storage)
+ persistentVolumeClaimSize := task.Size.Storage
persistentVolumeDirectory := task.Environment.Directory

match := regexp.MustCompile(`^([^:]+):(?:(\d+):)?(.+)$`).FindStringSubmatch(task.Environment.Directory)
@@ -42,7 +42,7 @@ func New(ctx context.Context, cloud common.Cloud, identifier common.Identifier,
if err != nil {
return nil, err
}
- persistentVolumeClaimSize = uint64(number)
+ persistentVolumeClaimSize = int(number)
}
persistentVolumeDirectory = match[3]
}
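Note on the directory parsing above: the expression `^([^:]+):(?:(\d+):)?(.+)$` splits `task.Environment.Directory` into up to three parts, with the middle numeric part overriding the claim size. The sketch below reuses that exact expression; reading the first group as the storage class is an inference from the variable names in the hunk, and `directorySpec` and `parseDirectory` are hypothetical names that do not exist in the provider.

```go
package main

import (
	"fmt"
	"regexp"
	"strconv"
)

// directorySpec is a hypothetical container for the parsed parts.
type directorySpec struct {
	StorageClass string // assumed meaning of the first group
	Size         int    // falls back to the caller's default when omitted
	Directory    string
}

// directoryPattern is the expression shown in the hunk above.
var directoryPattern = regexp.MustCompile(`^([^:]+):(?:(\d+):)?(.+)$`)

// parseDirectory returns the value unchanged as Directory when it does not
// match, and otherwise splits it into class, optional size, and path.
func parseDirectory(value string, defaultSize int) (directorySpec, error) {
	spec := directorySpec{Size: defaultSize, Directory: value}
	match := directoryPattern.FindStringSubmatch(value)
	if match == nil {
		return spec, nil
	}
	spec.StorageClass = match[1]
	if match[2] != "" {
		number, err := strconv.Atoi(match[2])
		if err != nil {
			return spec, err
		}
		spec.Size = number
	}
	spec.Directory = match[3]
	return spec, nil
}

func main() {
	for _, value := range []string{"fast:20:/data", "fast:/data", "/data"} {
		spec, err := parseDirectory(value, -1)
		fmt.Printf("%q -> %+v (err: %v)\n", value, spec, err)
	}
}
```

Because the size is now a plain `int`, a default of `-1` can pass through this step unchanged and be replaced later by the `size <= 0` fallback in the claim.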
20 changes: 14 additions & 6 deletions task/task_test.go
@@ -76,15 +76,14 @@ func TestTask(t *testing.T) {

task := common.Task{
Size: common.Size{
- Machine: "m",
- Storage: 30,
+ Machine: "m+t4",
},
Environment: common.Environment{
- Image: "ubuntu",
- Script: `#!/bin/bash
- mkdir cache
+ Image: "nvidia",
+ Script: `#!/bin/sh -e
+ nvidia-smi
+ mkdir --parents cache output
touch cache/file
- mkdir output
echo "$ENVIRONMENT_VARIABLE_DATA" | tee --append output/file
sleep 60
cat output/file
@@ -132,6 +131,7 @@ func TestTask(t *testing.T) {
for assert.NoError(t, newTask.Read(ctx)) {
logs, err := newTask.Logs(ctx)
require.NoError(t, err)
+ t.Log(logs)

for _, log := range logs {
if strings.Contains(log, oldData) &&
@@ -140,6 +140,14 @@ func TestTask(t *testing.T) {
}
}

+ status, err := newTask.Status(ctx)
+ require.NoError(t, err)
+ t.Log(status)
+
+ if status[common.StatusCodeFailed] > 0 {
+ break
+ }

time.Sleep(10 * time.Second)
}
