
Failed to execute a multirun with different configurations in K8S #238

Open
Syulin7 opened this issue Feb 20, 2024 · 2 comments

Comments

@Syulin7

Syulin7 commented Feb 20, 2024

According to this user guide: https://docs.nvidia.com/nemo-framework/user-guide/latest/launcherguide/launchertutorial/multirun.html

python3 main.py -m \
    stages=[training] \
    training.trainer.num_nodes=6 \
    training.run.name="5b_6nodes_tp_\${training.model.tensor_model_parallel_size}" \
    training.model.tensor_model_parallel_size=1,2,4,8

However, only the first Helm chart can be deployed successfully, because the ConfigMap name conflicts across releases: https://github.com/NVIDIA/NeMo-Megatron-Launcher/blob/f336f483bd9af73c4c665d91654100fa3b0bf0a1/launcher_scripts/nemo_launcher/core/k8s_templates/training/training-config.yaml#L4

Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: ConfigMap "training-config" in namespace "default" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "5b-6nodes-tp-2": current value is "5b-6nodes-tp-1"
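The error arises because the chart hardcodes the ConfigMap name, so every release rendered by the multirun sweep tries to own the same `training-config` object, and Helm refuses to adopt a resource owned by another release. A minimal sketch of how a template could avoid the collision, using the standard Helm `.Release.Name` built-in (this is an illustration, not the launcher's current template):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Prefixing with the release name gives each multirun job its own ConfigMap,
  # so concurrent installs in the same namespace no longer conflict.
  name: {{ .Release.Name }}-training-config
```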
@terrykong
Collaborator

Hi @Syulin7. We are aware of this limitation and are working on a v2 of the K8s support that should eliminate this issue across the launcher.

For now, the recommendation is to uninstall the first Helm release once it has finished. Alternatively, you could create a second namespace and run there if you really need to run two jobs concurrently.
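Both workarounds can be sketched with standard Helm and kubectl commands (release, chart, and namespace names below are illustrative, taken from the error message above):

```shell
# Option 1: remove the finished release so the shared
# "training-config" ConfigMap is freed for the next install
helm uninstall 5b-6nodes-tp-1 --namespace default

# Option 2: run the second job in its own namespace so the
# ConfigMaps never collide in the first place
kubectl create namespace nemo-run-2
helm install 5b-6nodes-tp-2 ./training-chart --namespace nemo-run-2
```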

@Syulin7
Author

Syulin7 commented Feb 21, 2024

@terrykong Thanks! By the way, when using K8s, if multiple stages (such as data_preparation and training) are launched together, all tasks are created at once with no pipelining between them. Is this behavior expected, and is it the same in other environments such as Slurm?

My expectation is that the training stage should only start after the data_preparation stage has completed.
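For comparison, on Slurm this kind of stage ordering is typically expressed with job dependencies, so a later job only starts after an earlier one exits successfully. A generic sbatch sketch (script names are illustrative, not launcher-specific code):

```shell
# Submit data preparation first and capture its job ID
DATA_JOB=$(sbatch --parsable data_preparation.sh)

# Training is held until data preparation completes with exit code 0
sbatch --dependency=afterok:${DATA_JOB} training.sh
```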
