Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

GCP Batch: Support passing standard machine types to the Google backend #7545

Open
wants to merge 6 commits into
base: develop
Choose a base branch
from

Conversation

javiergaitan
Copy link

@javiergaitan javiergaitan commented Sep 17, 2024

Description

Currently, and as described in #7535, only general-purpose machine types are supported in Google Backend, which prevents running wdl workflows on many machine types available on GCP, including those provisioned with modern GPUs.

I believe the simplest and most general solution would be to pass the machine type directly from the wdl configuration to the Google Batch API. The idea is that this approach would be more resilient to machine types being added or deprecated on GCP, as users would only need to update their wdl workflows in such cases. An alternative approach of mapping machine specs (e.g.: cpu platform and gpu requirements) to standard machine types would potentially introduce an additional layer of maintenance with little benefit.

This PR adds support for a new standardMachineType key in the runtime section, which is only parsed for the Google backend.

Testing

I deployed this internally and verified I can successfully run the following wdl workflow:

version 1.0

task nvidia_smi {
    input {
        String docker_version
    }

    command <<<
        nvidia-smi

        touch .done
        echo "Finished at $(date)"
    >>>

    runtime {
        docker: <internal image>
        disks: "local-disk 50 SSD"
        memory: "32G"
        preemptible: 0
        gpuCount: 1
        gpuType: "nvidia-tesla-a100"
        standardMachineType: "a2-highgpu-1g"
    }

    output {
        File done = ".done"
    }
}

workflow nvidia_smi_wf {
    input {
        String docker_version
    }
    
    call nvidia_smi  as nvidia_smi_call {
        input:
            docker_version = docker_version
    }

    output {
        File done = ".done"
    }
}

Next steps

  • Confirm this approach is in the right direction with the cromwell team.
  • Work on proper unit tests and get this PR ready to be merged.

Release Notes Confirmation

CHANGELOG.md

  • I updated CHANGELOG.md in this PR
  • I assert that this change shouldn't be included in CHANGELOG.md because it doesn't impact community users

Terra Release Notes

  • I added a suggested release notes entry in this Jira ticket
  • I assert that this change doesn't need Jira release notes because it doesn't impact Terra users

…ng the cpuPlatform field in the RuntimeAttributes
@javiergaitan javiergaitan requested a review from a team as a code owner September 17, 2024 16:46
@javiergaitan javiergaitan force-pushed the support-standard-machine-type-google-backend branch from 366115d to 4ba50c3 Compare September 17, 2024 22:39
@javiergaitan javiergaitan force-pushed the support-standard-machine-type-google-backend branch from 4ba50c3 to 3a0ac16 Compare September 18, 2024 00:04
@javiergaitan javiergaitan changed the title GCP Batch: Support passing standard machine types to the Google backend by reusing the cpuPlatform field in the RuntimeAttributes GCP Batch: Support passing standard machine types to the Google backend Sep 18, 2024
@javiergaitan
Copy link
Author

@aednichols @mcovarr : Please let us know if you have any feedback on this PR. I would like to confirm this is in the right direction before working on tests and getting this ready for merging. Alternatively, please let me know if I should reach out to someone else for review.

@jgainerdewar
Copy link
Collaborator

Thanks for taking the time to submit this! Unfortunately the maintaining team doesn't have much bandwidth to collaborate on community contributions right now, but we'll keep it in mind for the future.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants