[BUG] - AWS instance type not properly respected when `gpu` are enabled #2782

viniciusdc · 2024-10-21T12:58:17Z

Describe the bug

Since the latest release, which integrated the changes from #2604, we have a bug caused by a mismatch between how we load our schema and perform validation and how the stage files are rendered during deploy. Basically, that PR changed how the instance AMI types (AL2_x86_64_GPU, AL2_x86_64 and CUSTOM) are forwarded to their respective Terraform variables under node_groups.

For example, consider the following config block:

amazon_web_services:
  ...
  node_groups:
    ...
    gpu-1x-t4:
      instance: g4dn.xlarge
      min_nodes: 0
      max_nodes: 50
      single_subnet: false
      gpu: true
profiles:
  jupyterlab:
    - display_name: G4 GPU Instance 1x
      description: 4 cpu / 16GB RAM / 1 Nvidia T4 GPU (16 GB GPU RAM)
      kubespawner_override:
        image: quay.io/nebari/nebari-jupyterlab-gpu:2024.9.1
        cpu_limit: 4
        cpu_guarantee: 3
        mem_limit: 16G
        mem_guarantee: 10G
        extra_pod_config:
          volumes:
            - name: "dshm"
              emptyDir:
                medium: "Memory"
                sizeLimit: "2Gi"
        extra_container_config:
          volumeMounts:
            - name: "dshm"
              mountPath: "/dev/shm"
        node_selector:
          "dedicated": "gpu-1x-t4"

The expected behavior is for a GPU instance to be spawned and assigned to the user's pod. Right now, though, the instance is correctly scaled up, but its AMI type wrongly defaults to `AL2_x86_64`. This results in the incorrect AMI being assigned to the instance, so the NVIDIA driver installation expected from the daemon never triggers.

The problem arises from this part of our code:

nebari/src/_nebari/stages/infrastructure/__init__.py

Lines 142 to 172 in ccb8b7e

    
class AWSNodeGroupInputVars(schema.Base):
    name: str
    instance_type: str
    gpu: bool = False
    min_size: int
    desired_size: int
    max_size: int
    single_subnet: bool
    permissions_boundary: Optional[str] = None
    ami_type: Optional[AWSAmiTypes] = None
    launch_template: Optional[AWSNodeLaunchTemplate] = None

    @field_validator("ami_type", mode="before")
    @classmethod
    def _infer_and_validate_ami_type(cls, value, values) -> str:
        gpu_enabled = values.get("gpu", False)

        # Auto-set ami_type if not provided
        if not value:
            if values.get("launch_template") and values["launch_template"].ami_id:
                return "CUSTOM"
            if gpu_enabled:
                return "AL2_x86_64_GPU"
            return "AL2_x86_64"

        # Explicit validation
        if value == "AL2_x86_64" and gpu_enabled:
            raise ValueError(
                "ami_type 'AL2_x86_64' cannot be used with GPU enabled (gpu=True)."
            )

        return value

I suggest that we remove the "dynamic" handling of the instance type from the Pydantic validator and instead use a custom function to handle the proper logic at run time, for example:

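from typing import Dict, Optional


def construct_aws_ami_type(
    gpu_enabled: bool, launch_template: Optional[Dict], ami_type: Optional[str] = None
) -> str:
    """Construct the AWS AMI type based on the provided parameters."""
    # An explicitly configured AMI type always wins.
    if ami_type:
        return ami_type

    # A launch template that pins its own AMI requires the CUSTOM type.
    if launch_template and launch_template.get("ami_id"):
        return "CUSTOM"

    # Otherwise pick the GPU or the standard Amazon Linux 2 AMI type.
    if gpu_enabled:
        return "AL2_x86_64_GPU"

    return "AL2_x86_64"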
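
To illustrate the suggestion, here is a minimal sketch of how such a function could be applied to node groups at run time. The `node_groups` dictionary, the `general` and `custom-ami` entries, and the AMI ID are hypothetical values made up for this example; only `gpu-1x-t4` mirrors the config block in this issue. How Nebari would actually wire this into the stage rendering is not shown here.

```python
from typing import Dict, Optional


def construct_aws_ami_type(
    gpu_enabled: bool,
    launch_template: Optional[Dict],
    ami_type: Optional[str] = None,
) -> str:
    """Same decision order as the function proposed in this issue."""
    if ami_type:
        return ami_type
    if launch_template and launch_template.get("ami_id"):
        return "CUSTOM"
    if gpu_enabled:
        return "AL2_x86_64_GPU"
    return "AL2_x86_64"


# Hypothetical node groups, shaped like the config block in this issue.
node_groups = {
    "general": {"instance": "m5.2xlarge", "gpu": False},
    "gpu-1x-t4": {"instance": "g4dn.xlarge", "gpu": True},
    "custom-ami": {
        "instance": "g4dn.xlarge",
        "gpu": True,
        "launch_template": {"ami_id": "ami-0123456789abcdef0"},  # made-up ID
    },
}

# Resolve the AMI type per node group at run time instead of in a validator.
resolved = {
    name: construct_aws_ami_type(
        gpu_enabled=group.get("gpu", False),
        launch_template=group.get("launch_template"),
        ami_type=group.get("ami_type"),
    )
    for name, group in node_groups.items()
}

for name, ami_type in resolved.items():
    print(f"{name}: {ami_type}")
```

Note that, as written, the function returns an explicitly provided `ami_type` unchanged, so the check that rejects `AL2_x86_64` together with `gpu=True` would still need to live somewhere else.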

The current Enum object also needs to change, since it is not properly serializable right now:

class AWSAmiTypes(str, enum.Enum):
    AL2_x86_64 = "AL2_x86_64"
    AL2_x86_64_GPU = "AL2_x86_64_GPU"
    CUSTOM = "CUSTOM"
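
As a rough illustration of why the `str` mixin matters, the sketch below compares a plain `enum.Enum` with the `str`-based definition above when passed to `json.dumps`. This assumes the definition above is the proposed replacement; it uses JSON only as a stand-in, since the issue does not say which serializer is involved. `PlainAmiTypes` and `try_dump` are hypothetical names introduced for the comparison.

```python
import enum
import json


class PlainAmiTypes(enum.Enum):  # hypothetical: same values, no str mixin
    AL2_x86_64_GPU = "AL2_x86_64_GPU"


class AWSAmiTypes(str, enum.Enum):
    AL2_x86_64 = "AL2_x86_64"
    AL2_x86_64_GPU = "AL2_x86_64_GPU"
    CUSTOM = "CUSTOM"


def try_dump(payload: dict) -> str:
    """Return the JSON text, or the TypeError message if encoding fails."""
    try:
        return json.dumps(payload)
    except TypeError as exc:
        return f"TypeError: {exc}"


# A plain Enum member is not a JSON-native type, so encoding fails.
plain_result = try_dump({"ami_type": PlainAmiTypes.AL2_x86_64_GPU})

# A str-mixin member is also a str, so it is encoded as its string value.
mixin_result = try_dump({"ami_type": AWSAmiTypes.AL2_x86_64_GPU})

print(plain_result)
print(mixin_result)
```

With the mixin, members also compare equal to their plain string values, which keeps checks such as `value == "AL2_x86_64"` working on enum members.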

Expected behavior

GPU instances should scale up properly, and their NVIDIA drivers should be installed as well.

OS and architecture in which you are running Nebari

Linux

How to Reproduce the problem?

Run an AWS deployment that requires a GPU profile. The bug was introduced in the latest release (2024.9.1).

Command output

No response

Versions and dependencies used.

No response

Compute environment

AWS

Integrations

No response

Anything else?

No response


viniciusdc added the "type: bug 🐛" and "needs: triage 🚦" labels Oct 21, 2024

github-project-automation bot added this to 🪴 Nebari Project Management Oct 21, 2024

github-project-automation bot moved this to New 🚦 in 🪴 Nebari Project Management Oct 21, 2024

viniciusdc mentioned this issue Oct 22, 2024

Address issue with AWS instance type schema #2787

Merged

10 tasks

viniciusdc added this to the 2024.9.2 milestone Oct 24, 2024

viniciusdc mentioned this issue Oct 28, 2024

[Release] Hot-Fix release for 2024.9.1 #2798

Closed

7 tasks

viniciusdc closed this as completed in #2787 Oct 29, 2024

github-project-automation bot moved this from New 🚦 to Done 💪🏾 in 🪴 Nebari Project Management Oct 29, 2024
