Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: ConfigMap "training-config" in namespace "default" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "5b-6nodes-tp-2": current value is "5b-6nodes-tp-1"
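For readers unfamiliar with this error: Helm will only adopt a resource that already exists if its `app.kubernetes.io/managed-by` label and its `meta.helm.sh/release-name` / `meta.helm.sh/release-namespace` annotations match the release being installed. The sketch below imitates that check in plain Python to show why the second release is rejected. It is an illustration only, not Helm's actual code; the function name `ownership_errors` and the message wording are made up for this example, and the label and namespace annotation values on the sample ConfigMap are assumed (the error above only reports the release-name annotation).

```python
# Illustrative re-creation of Helm's ownership check (not Helm's real code).
MANAGED_BY_LABEL = "app.kubernetes.io/managed-by"
RELEASE_NAME_ANNOTATION = "meta.helm.sh/release-name"
RELEASE_NAMESPACE_ANNOTATION = "meta.helm.sh/release-namespace"


def ownership_errors(resource, release_name, release_namespace):
    """Return the reasons an existing resource cannot be adopted by a release.

    An empty list means the metadata matches and adoption would be allowed.
    """
    metadata = resource.get("metadata", {})
    labels = metadata.get("labels") or {}
    annotations = metadata.get("annotations") or {}

    expected = [
        ("label", MANAGED_BY_LABEL, labels.get(MANAGED_BY_LABEL), "Helm"),
        ("annotation", RELEASE_NAME_ANNOTATION,
         annotations.get(RELEASE_NAME_ANNOTATION), release_name),
        ("annotation", RELEASE_NAMESPACE_ANNOTATION,
         annotations.get(RELEASE_NAMESPACE_ANNOTATION), release_namespace),
    ]

    errors = []
    for kind, key, current, wanted in expected:
        if current != wanted:
            errors.append(
                f'{kind} "{key}" must equal "{wanted}": current value is "{current}"'
            )
    return errors


# The ConfigMap left behind by the first release. Only the release-name
# annotation is taken from the error above; the rest is assumed.
existing = {
    "kind": "ConfigMap",
    "metadata": {
        "name": "training-config",
        "namespace": "default",
        "labels": {MANAGED_BY_LABEL: "Helm"},
        "annotations": {
            RELEASE_NAME_ANNOTATION: "5b-6nodes-tp-1",
            RELEASE_NAMESPACE_ANNOTATION: "default",
        },
    },
}

conflicts = ownership_errors(existing, "5b-6nodes-tp-2", "default")
for message in conflicts:
    print(message)
```

Because the ConfigMap name is fixed in the chart template, any second release in the same namespace runs into this check.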
Hi @Syulin7. We are aware of this limitation and are working on a v2 of the k8s support that should eliminate this issue across the launcher.
For now the recommendation is to delete the first helm chart after it is done. Alternatively, you could create a second namespace and run there if you really need to run two jobs concurrently.
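A minimal sketch of what those two workarounds look like as Helm commands, assuming you run Helm by hand. `helm uninstall`, `helm install`, `--namespace`, and `--create-namespace` are standard Helm options; the chart directory `./k8s_template` and the namespace `training-2` are hypothetical placeholders, and how the launcher itself is pointed at a different namespace is not covered in this thread.

```python
import shlex


def uninstall_cmd(release, namespace="default"):
    """Command to remove a finished release so its ConfigMap is deleted too."""
    return ["helm", "uninstall", release, "--namespace", namespace]


def install_cmd(release, chart_dir, namespace):
    """Command to install a release into a separate namespace.

    --create-namespace makes Helm create the namespace if it is missing.
    """
    return [
        "helm", "install", release, chart_dir,
        "--namespace", namespace,
        "--create-namespace",
    ]


# Option 1: clean up the first job before launching the second one.
cleanup = uninstall_cmd("5b-6nodes-tp-1")

# Option 2: run the second job concurrently in its own namespace.
# "./k8s_template" and "training-2" are hypothetical placeholders.
concurrent = install_cmd("5b-6nodes-tp-2", "./k8s_template", "training-2")

# Print the commands instead of executing them, so they can be reviewed first.
print(shlex.join(cleanup))
print(shlex.join(concurrent))
```

To actually run one of these, the list can be passed to `subprocess.run(cmd, check=True)`.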
@terrykong Thanks! By the way, when using K8s, if multiple stages (such as data_preparation and training) are launched together, all tasks are created at once rather than as a pipeline. Is this behavior expected, and is it the same in other environments such as Slurm?
My expectation is that the training stage should only start after the data_preparation stage has completed.
According to this user guide: https://docs.nvidia.com/nemo-framework/user-guide/latest/launcherguide/launchertutorial/multirun.html
However, only the first Helm chart can be deployed successfully, because the ConfigMap has a conflict. https://github.com/NVIDIA/NeMo-Megatron-Launcher/blob/f336f483bd9af73c4c665d91654100fa3b0bf0a1/launcher_scripts/nemo_launcher/core/k8s_templates/training/training-config.yaml#L4