Describe the bug
When enabling NRI via `--set cdi.nriPluginEnabled=true`, the toolkit validation pod fails with `nvidia-smi` not found in PATH.
Looking at the code, it appears the toolkit check executes `nvidia-smi` from inside the validation container, whereas driver validation runs the check against the host system.
See `gpu-operator/cmd/nvidia-validator/main.go`, line 1132 at `e6cd031`:

```go
func (t *Toolkit) validate() error {
```
This executes `nvidia-smi` from inside the validation container, where it is not available, so the check fails. Driver validation, by contrast, validates from the host; see `gpu-operator/cmd/nvidia-validator/main.go`, line 745 at `e6cd031`:

```go
func validateHostDriver(silent bool) error {
```
A fix would be to check for `nvidia-smi` the same way driver validation does, i.e. execute `nvidia-smi` from the host rather than from inside the validation container.
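For illustration, a minimal sketch of what a host-side check could look like, assuming a chroot-based approach and a driver root such as `/run/nvidia/driver`; the helper name and paths are assumptions for illustration, not the validator's actual code:

```go
package main

import (
	"fmt"
	"os/exec"
)

// validateToolkitViaHostRoot is a hypothetical helper: instead of expecting
// nvidia-smi on the validation container's own PATH, it chroots into the
// driver/host root and runs nvidia-smi there, mirroring how the host driver
// check works conceptually.
func validateToolkitViaHostRoot(driverRoot string) error {
	// driverRoot is an assumption, e.g. /run/nvidia/driver or /host.
	cmd := exec.Command("chroot", driverRoot, "nvidia-smi")
	out, err := cmd.CombinedOutput()
	if err != nil {
		return fmt.Errorf("nvidia-smi failed under %s: %w\n%s", driverRoot, err, out)
	}
	fmt.Printf("%s", out)
	return nil
}

func main() {
	if err := validateToolkitViaHostRoot("/run/nvidia/driver"); err != nil {
		fmt.Println(err)
	}
}
```

The point is only that `nvidia-smi` gets resolved and executed against the host/driver root rather than the validation container's filesystem.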
To Reproduce
Install the GPU Operator Helm chart with `--set cdi.nriPluginEnabled=true`, as shown below.
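A minimal reproduction command, assuming the standard `nvidia/gpu-operator` chart and arbitrary release/namespace names:

```sh
helm install gpu-operator nvidia/gpu-operator \
  -n gpu-operator --create-namespace \
  --set cdi.nriPluginEnabled=true
```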
Expected behavior
The NRI plugin works and the toolkit validation pod completes successfully.
Environment (please provide the following information):
- GPU Operator Version: v26.3.1
- OS: Talos 1.13.0-rc.0
- Kernel Version: 6.18.22-talos
- Container Runtime Version: containerd 2.2.3
- Kubernetes Distro and Version: Talos v1.35.2
Information to attach (optional if deemed irrelevant)
- `kubectl get pods -n OPERATOR_NAMESPACE`
- `kubectl get ds -n OPERATOR_NAMESPACE`
- `kubectl describe pod -n OPERATOR_NAMESPACE POD_NAME`
- `kubectl logs -n OPERATOR_NAMESPACE POD_NAME --all-containers`
- `nvidia-smi` from the driver container: `kubectl exec DRIVER_POD_NAME -n OPERATOR_NAMESPACE -c nvidia-driver-ctr -- nvidia-smi`
- `journalctl -u containerd > containerd.log`
Collecting full debug bundle (optional):
curl -o must-gather.sh -L https://raw.githubusercontent.com/NVIDIA/gpu-operator/main/hack/must-gather.sh
chmod +x must-gather.sh
./must-gather.sh
NOTE: please refer to the must-gather script for debug data collected.
This bundle can be submitted to us via email: operator_feedback@nvidia.com