[BUG] - AWS instance type not properly respected when GPUs are enabled
#2782
Labels
area: schema
good first issue
Good for newcomers
impact: medium 🟨
This item affects some users, not critical
needs: PR 📬
This item has been scoped and needs to be worked on
provider: AWS
type: bug 🐛
Something isn't working
Describe the bug
Since the latest release, when the #2604 changes were integrated, a bug was introduced due to how we currently load our schema and perform validation versus the way the stage files are rendered during deploy. Basically, in that PR we changed how the instance_types (`AL2_x86_64_GPU`, `AL2_x86_64` and `CUSTOM`) are forwarded to their respective terraform variables under the node_groups.
Right now, when utilizing the following config block for example:
The expected behavior would be for an instance with a GPU to be spawned and assigned to the user's pod. Right now, though, the instance is correctly scaled up, but the type is wrongly defaulted to `AL2_x86_64_GPU`, which results in the incorrect AMI being assigned to the instance and the NVIDIA drivers expected to be installed by the daemon never triggering.
The problem arises from this part of our code:
nebari/src/_nebari/stages/infrastructure/__init__.py
Lines 142 to 172 in ccb8b7e
I suggest that we remove the "dynamic" handling of the instance type from the Pydantic validator and instead use a custom function to handle the proper logic at run time, for example:
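A minimal sketch of what such a run-time helper could look like, under assumptions: the function name `resolve_ami_type` and its parameters `gpu` and `custom_ami_id` are hypothetical and not taken from the Nebari code base; only the three type names come from this issue. The precedence (a custom AMI wins over the GPU flag) is also an assumption.

```python
from typing import Optional


def resolve_ami_type(gpu: bool = False, custom_ami_id: Optional[str] = None) -> str:
    """Pick the AMI type for a node group when the stage is rendered.

    Hypothetical helper: the decision is made explicitly at run time
    instead of inside a Pydantic validator.
    """
    # Assumption: a user-supplied AMI takes precedence over everything else.
    if custom_ami_id:
        return "CUSTOM"
    # GPU node groups need the GPU-enabled AMI.
    if gpu:
        return "AL2_x86_64_GPU"
    # Default for regular node groups.
    return "AL2_x86_64"
```

Calling the helper while building the terraform input variables would keep the decision independent of how and when the schema is validated.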
and there is also a need for changing the current Enum object, as it also is not properly serializable right now:
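One common way to make an Enum serialize cleanly is to mix in `str`, so that each member is also a plain string. The sketch below assumes this is the kind of change meant here; the class name `AMIType` and the member names are hypothetical, and only the three values come from this issue.

```python
import json
from enum import Enum


class AMIType(str, Enum):
    """Hypothetical enum of the AMI types named in this issue."""

    AL2_X86_64 = "AL2_x86_64"
    AL2_X86_64_GPU = "AL2_x86_64_GPU"
    CUSTOM = "CUSTOM"


# Because members are str instances, the standard JSON encoder
# writes them out as their plain string values.
payload = json.dumps({"ami_type": AMIType.AL2_X86_64_GPU})

# Round trip: the plain string maps back onto the enum member.
restored = AMIType(json.loads(payload)["ami_type"])
```

A plain `Enum` without the `str` mixin would raise a `TypeError` in `json.dumps`, which is the kind of failure the mixin avoids.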
Expected behavior
GPU instances should scale up properly, with their drivers installed correctly.
OS and architecture in which you are running Nebari
Linux
How to Reproduce the problem?
Run an AWS deployment that requires a GPU profile. The bug was introduced in the latest release version (`2024.9.1`).
Command output
No response
Versions and dependencies used.
No response
Compute environment
AWS
Integrations
No response
Anything else?
No response