Overview
After installing or reinstalling the GPU Operator on a CMK cluster, DaemonSet pods (such as nvidia-driver-daemonset) may fail to be created with the following error:
failed to call webhook "mutatepod.volcano.sh": service "volcano-admission-service" not found
This occurs when Volcano — a batch scheduling system that Run:ai installs as a dependency — was previously deployed in the cluster but has since been removed, leaving behind orphaned MutatingWebhookConfiguration and ValidatingWebhookConfiguration resources at the cluster level.
These webhook configurations intercept pod creation requests and attempt to contact the Volcano admission service. Because that service no longer exists, the webhook call fails and every pod creation request it intercepts is rejected.
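To see this dependency on a live cluster, it can help to print which Service each webhook configuration calls and what its failure policy is. The following is a minimal sketch, not a required step. The helper name show_webhook_targets is hypothetical, and whether the Volcano webhooks use a Fail policy in a given cluster is an assumption to check in the output.

```shell
# Print each mutating/validating webhook configuration together with the
# Service it calls and its failure policy.
# show_webhook_targets is a hypothetical helper name, not part of kubectl.
show_webhook_targets() {
  for kind in mutatingwebhookconfigurations validatingwebhookconfigurations; do
    echo "== $kind =="
    kubectl get "$kind" \
      -o custom-columns='NAME:.metadata.name,SERVICE:.webhooks[*].clientConfig.service.name,POLICY:.webhooks[*].failurePolicy'
  done
}

# Usage: show_webhook_targets
```

A row whose SERVICE column shows volcano-admission-service, while no such Service exists in the cluster, matches the situation described above.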
Prerequisites
kubectl access to the CMK cluster with cluster-admin permissions
Steps
- Confirm the Webhook Is the Cause
- Check DaemonSet events for the failing component (e.g., nvidia-driver-daemonset) to confirm the webhook error:
kubectl describe daemonset <daemonset-name> -n <namespace>
- Look for a FailedCreate event referencing mutatepod.volcano.sh or volcano-admission-service.
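When a namespace has many events, filtering can make the webhook error easier to spot. This is a sketch under assumptions: the helper name volcano_failed_create_events is hypothetical, and the namespace in the usage line is only an example.

```shell
# Show only FailedCreate events that mention Volcano in a given namespace.
# volcano_failed_create_events is a hypothetical helper name.
volcano_failed_create_events() {
  vfce_ns="$1"
  kubectl get events -n "$vfce_ns" --field-selector reason=FailedCreate \
    | grep -i 'volcano'
}

# Usage (the namespace is an assumption; substitute the one the GPU Operator runs in):
#   volcano_failed_create_events gpu-operator
```

If the function prints nothing, no FailedCreate event in that namespace mentions Volcano, and the failure may have a different cause.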
- Identify the Orphaned Webhook Configurations
- List all mutating and validating webhook configurations in the cluster and look for any Volcano-related entries:
kubectl get mutatingwebhookconfigurations
kubectl get validatingwebhookconfigurations
- Confirm Volcano Is No Longer Running
- Verify that the Volcano admission service does not exist before deleting the webhooks:
kubectl get svc -A | grep volcano
- If no Volcano services are returned, the webhook configurations are orphaned and safe to remove.
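The listing and the service check from the two steps above can be combined into one guard. This is a minimal sketch: the helper names are hypothetical, and matching on the string "volcano" is a heuristic that assumes the webhook configurations carry Volcano in their names.

```shell
# Succeeds if any Service line in any namespace contains "volcano".
volcano_service_exists() {
  kubectl get svc -A --no-headers 2>/dev/null | grep -qi 'volcano'
}

# Lists webhook configurations whose names contain "volcano" (heuristic match).
list_volcano_webhooks() {
  kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o name \
    | grep -i 'volcano'
}

# Reports deletion candidates only when no Volcano Service is present.
report_orphaned_volcano_webhooks() {
  if volcano_service_exists; then
    echo "Volcano Service still present; webhooks are not orphaned" >&2
    return 1
  fi
  list_volcano_webhooks
}

# Usage: report_orphaned_volcano_webhooks
```

The function only reports names and deletes nothing, so the output can be reviewed before moving on.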
- Delete the Orphaned Webhook Configurations
- Delete both the mutating and validating webhook configurations for Volcano:
kubectl delete mutatingwebhookconfiguration <volcano-mutating-webhook-name>
kubectl delete validatingwebhookconfiguration <volcano-validating-webhook-name>
- Common names include volcano-admission-service-mutating-webhook-configuration and volcano-admission-service-validating-webhook-configuration, but confirm against the output from Step 2.
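Because deletion cannot be undone, one optional precaution is to save each configuration to a file first. The sketch below is an addition to the procedure, not part of the confirmed case. The helper name backup_and_delete_webhook and the backup file naming are hypothetical.

```shell
# Save a webhook configuration to a YAML file, then delete it.
# backup_and_delete_webhook is a hypothetical helper; the file name is arbitrary.
backup_and_delete_webhook() {
  bdw_kind="$1"   # mutatingwebhookconfiguration or validatingwebhookconfiguration
  bdw_name="$2"   # name taken from the listing in Step 2
  # Abort before deleting if the export fails.
  kubectl get "$bdw_kind" "$bdw_name" -o yaml > "${bdw_kind}-${bdw_name}.yaml" || return 1
  kubectl delete "$bdw_kind" "$bdw_name"
}

# Usage:
#   backup_and_delete_webhook mutatingwebhookconfiguration "$MUTATING_NAME"
#   backup_and_delete_webhook validatingwebhookconfiguration "$VALIDATING_NAME"
```

MUTATING_NAME and VALIDATING_NAME are placeholder variables standing in for the names found in Step 2.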
- Verify GPU Operator Pods Come Up
- After deleting the webhooks, confirm that the previously blocked DaemonSet pods are now being created and reach Running state:
kubectl get pods -n <gpu-operator-namespace>
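To narrow the pod listing to anything still unhealthy, the output can be filtered on the STATUS column. This is a sketch under assumptions: the helper name pods_not_running is hypothetical, and it relies on STATUS being the third column of the default kubectl get pods output.

```shell
# Print pods in a namespace whose STATUS is neither Running nor Completed.
# pods_not_running is a hypothetical helper; column 3 is STATUS in the default output.
pods_not_running() {
  pnr_ns="$1"
  kubectl get pods -n "$pnr_ns" --no-headers \
    | awk '$3 != "Running" && $3 != "Completed" { print $1, $3 }'
}

# Usage (the namespace is an assumption; substitute your own):
#   pods_not_running gpu-operator
```

Empty output suggests every pod in the namespace is Running or has completed. Any remaining lines point at components that need separate investigation.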
Resolution
The following describes how this issue was resolved in a confirmed case:
- After a clean reinstall of the GPU Operator on a CMK cluster, nvidia-driver-daemonset pods failed to be created. DaemonSet events showed FailedCreate errors referencing mutatepod.volcano.sh: service "volcano-admission-service" not found.
- Inspection of the cluster's webhook configurations revealed orphaned Volcano MutatingWebhookConfiguration and ValidatingWebhookConfiguration resources. The customer confirmed Volcano was no longer in use.
- Both webhook configurations were deleted.
- The nvidia-driver-daemonset pods were immediately created and came up Running on both nodes.
- Remaining GPU Operator components (nvidia-dcgm-exporter, nvidia-device-plugin) then progressed to their own startup. In this case they encountered a separate inotify limits issue, but the webhook blocker was fully resolved.