
Failed to execute a multirun with different configurations in K8S #238

Open
Syulin7 opened this issue Feb 20, 2024 · 2 comments

Comments

@Syulin7

Syulin7 commented Feb 20, 2024

According to this user guide: https://docs.nvidia.com/nemo-framework/user-guide/latest/launcherguide/launchertutorial/multirun.html

python3 main.py -m \
    stages=[training] \
    training.trainer.num_nodes=6 \
    training.run.name="5b_6nodes_tp_\${training.model.tensor_model_parallel_size}" \
    training.model.tensor_model_parallel_size=1,2,4,8

However, only the first Helm chart can be deployed successfully, because the ConfigMap name conflicts across releases: https://github.com/NVIDIA/NeMo-Megatron-Launcher/blob/f336f483bd9af73c4c665d91654100fa3b0bf0a1/launcher_scripts/nemo_launcher/core/k8s_templates/training/training-config.yaml#L4

Error: INSTALLATION FAILED: rendered manifests contain a resource that already exists. Unable to continue with install: ConfigMap "training-config" in namespace "default" exists and cannot be imported into the current release: invalid ownership metadata; annotation validation error: key "meta.helm.sh/release-name" must equal "5b-6nodes-tp-2": current value is "5b-6nodes-tp-1"
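The error arises because the chart hardcodes the ConfigMap name, so every release rendered by the multirun sweep tries to own the same `training-config` object, and Helm refuses to adopt a resource owned by another release. A minimal sketch of how a template could avoid the collision, using the standard Helm `.Release.Name` built-in (this is an illustration, not the launcher's current template):

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  # Prefixing with the release name gives each multirun job its own ConfigMap,
  # so concurrent installs in the same namespace no longer conflict.
  name: {{ .Release.Name }}-training-config
```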
@terrykong
Collaborator

Hi @Syulin7. We are aware of this limitation and are working on a v2 of the K8s support that should eliminate this issue across the launcher.

For now, the recommendation is to uninstall the first Helm release once it has finished. Alternatively, you could create a second namespace and run there if you really need to run two jobs concurrently.
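Both workarounds can be sketched with standard Helm and kubectl commands (release, chart, and namespace names below are illustrative, taken from the error message above):

```shell
# Option 1: remove the finished release so the shared
# "training-config" ConfigMap is freed for the next install
helm uninstall 5b-6nodes-tp-1 --namespace default

# Option 2: run the second job in its own namespace so the
# ConfigMaps never collide in the first place
kubectl create namespace nemo-run-2
helm install 5b-6nodes-tp-2 ./training-chart --namespace nemo-run-2
```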

@Syulin7
Author

Syulin7 commented Feb 21, 2024

@terrykong Thanks! By the way, when using K8s, if multiple stages (such as data_preparation and training) are launched together, all tasks are created at once with no pipelining between them. Is this behavior expected, and is it the same in other environments such as Slurm?

My expectation is that the training stage should only start after the data_preparation stage has completed.
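For comparison, on Slurm this kind of stage ordering is typically expressed with job dependencies, so a later job only starts after an earlier one exits successfully. A generic sbatch sketch (script names are illustrative, not launcher-specific code):

```shell
# Submit data preparation first and capture its job ID
DATA_JOB=$(sbatch --parsable data_preparation.sh)

# Training is held until data preparation completes with exit code 0
sbatch --dependency=afterok:${DATA_JOB} training.sh
```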
