change mig customer-mig-parted-config configmap but mig-manager use config is not the updated data #891

Open
lengrongfu opened this issue Aug 2, 2024 · 4 comments

Comments

@lengrongfu
Contributor

lengrongfu commented Aug 2, 2024

1. Quick Debug Information

  • OS/Version(e.g. RHEL8.6, Ubuntu22.04): Ubuntu22.04
  • Kernel Version: 5.15.0-88-generic
  • Container Runtime Type/Version(e.g. Containerd, CRI-O, Docker): Containerd
  • K8s Flavor/Version(e.g. K8s, OCP, Rancher, GKE, EKS): K8s
  • GPU Operator Version: 23.9.0

2. Issue or feature description

Briefly explain the issue in terms of expected behavior and current behavior.

3. Steps to reproduce the issue

Detailed steps to reproduce the issue.

  1. Deploy gpu-operator with MIG enabled and wait for the deployment to succeed.
  2. Edit the customer-mig-parted-config configmap and add a custom config.
  3. Exec into the mig-manager pod: the mounted configmap file still has the old content.
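
For anyone trying to reproduce this, the comparison in step 3 could be sketched roughly as follows. The namespace, the pod label, and the mount path are assumptions that are not stated in this issue, and the sketch also assumes the mig-manager image ships `sh` and `cat`; adjust all of them to your deployment.

```shell
NS=gpu-operator                  # assumed operator namespace
CM=customer-mig-parted-config    # configmap name used in this issue
CONFIG_DIR=/mig-parted-config    # assumed mount path inside the mig-manager pod

# Print the configmap as currently stored in the API server.
show_configmap() {
  kubectl get configmap "$CM" -n "$NS" -o yaml
}

# Print the config files as seen from inside the first mig-manager pod.
# The label selector app=nvidia-mig-manager is an assumption.
show_mounted_config() {
  pod="$(kubectl get pods -n "$NS" -l app=nvidia-mig-manager -o name | head -n 1)"
  kubectl exec -n "$NS" "$pod" -- sh -c "cat $CONFIG_DIR/*"
}
```

Running `show_configmap` and `show_mounted_config` after step 2 and comparing the two outputs should show whether the pod still sees the old data.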

4. Information to attach (optional if deemed irrelevant)

  • kubernetes pods status: kubectl get pods -n OPERATOR_NAMESPACE
  • kubernetes daemonset status: kubectl get ds -n OPERATOR_NAMESPACE
  • If a pod/ds is in an error state or pending state kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME
  • If a pod/ds is in an error state or pending state kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers
  • Output from running nvidia-smi from the driver container: kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi
  • containerd logs journalctl -u containerd > containerd.log

Collecting full debug bundle (optional):

curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh

NOTE: please refer to the must-gather script for debug data collected.

This bundle can be submitted to us via email: operator_feedback@nvidia.com

lengrongfu changed the title from "change mig default-mig-parted-config configmap but mig-manager use config is not the updated data" to "change mig customer-mig-parted-config configmap but mig-manager use config is not the updated data" on Aug 2, 2024
@cdesiniotis
Contributor

@lengrongfu this is the limitation currently -- the mig-manager will not see updates to the configmap unless you restart the mig-manager pod.
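
As a rough sketch of that workaround, deleting the mig-manager pods lets the daemonset controller recreate them with the current configmap content. The namespace and the `app=nvidia-mig-manager` label below are assumptions, not values confirmed in this thread.

```shell
NS=gpu-operator    # assumed operator namespace

# Delete the mig-manager pods; the daemonset controller recreates them,
# and the new pods mount the current configmap content.
# The label selector is an assumption about how the pods are labeled.
restart_mig_manager() {
  kubectl delete pod -n "$NS" -l app=nvidia-mig-manager
}
```

Call `restart_mig_manager` after editing the configmap, then check the new pod's logs to confirm that it picked up the updated config.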

@cdesiniotis
Contributor

To propagate configmap updates, the clusterpolicy controller would need to watch for changes to the custom configmap and roll out a new mig-manager daemonset, e.g. by adding the hash of the configmap as an annotation.
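
The hash-annotation mechanism can be illustrated by hand with kubectl. This is only a sketch of the idea, not how the clusterpolicy controller is implemented: the annotation key is hypothetical, the namespace and daemonset name are assumptions, and the operator may overwrite manual changes to a daemonset it manages.

```shell
NS=gpu-operator                          # assumed operator namespace
CM=customer-mig-parted-config            # configmap name used in this issue
DS=nvidia-mig-manager                    # assumed daemonset name
ANNOTATION=example.com/mig-config-hash   # hypothetical annotation key

# Read stdin and print its SHA-256 digest as hex.
hash_stdin() {
  sha256sum | cut -d ' ' -f 1
}

# Hash the configmap data and write it into the pod template annotations.
# Any change to the pod template makes the daemonset roll out new pods
# (with the RollingUpdate strategy), so a changed hash triggers a restart.
annotate_with_config_hash() {
  cfg_hash="$(kubectl get configmap "$CM" -n "$NS" -o jsonpath='{.data}' | hash_stdin)"
  kubectl patch daemonset "$DS" -n "$NS" --type merge -p \
    "{\"spec\":{\"template\":{\"metadata\":{\"annotations\":{\"$ANNOTATION\":\"$cfg_hash\"}}}}}"
}
```

If the configmap is unchanged, the hash is the same, the patch is a no-op, and no rollout happens. That property is what makes a hash a convenient trigger.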

@lengrongfu
Contributor Author

lengrongfu commented Aug 6, 2024

@cdesiniotis I have two ideas:

  1. Watch the configmap from the clusterpolicy controller and restart mig-manager when it changes.
  2. Add a config-watch pod to mig-manager, similar to config-manager in device-plugin: https://github.com/NVIDIA/k8s-device-plugin/tree/main/cmd/config-manager

Or rely on mounted ConfigMaps being updated automatically: https://kubernetes.io/docs/concepts/configuration/configmap/#mounted-configmaps-are-updated-automatically
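
To make the watcher idea concrete, here is a minimal sketch that polls a mounted file's checksum and returns once it changes. Polling is a stand-in chosen for simplicity, not how config-manager in device-plugin works, and the function names are hypothetical. Note that the Kubernetes documentation says configmaps mounted via subPath do not receive automatic updates.

```shell
# Print the SHA-256 digest of a file as hex.
file_checksum() {
  sha256sum "$1" | cut -d ' ' -f 1
}

# wait_for_change FILE INTERVAL
# Block until the content of FILE differs from what it was when the
# function was called, checking every INTERVAL seconds.
wait_for_change() {
  watched_file="$1"
  interval="$2"
  last="$(file_checksum "$watched_file")"
  while :; do
    sleep "$interval"
    current="$(file_checksum "$watched_file")"
    if [ "$current" != "$last" ]; then
      return 0
    fi
  done
}
```

A watcher could then loop with something like `while wait_for_change /mig-parted-config/config.yaml 10; do echo "config changed"; done`, where the path is an assumed mount location and the `echo` stands in for whatever re-applies the MIG config.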

@lengrongfu
Contributor Author

Once NVIDIA/mig-parted#108 is merged, we can add logic so that when the migManager.config.name value is not default-mig-parted-config, the CONFIG_WATCH env var is added to the mig-manager daemonset.

@cdesiniotis What do you think?
