GPU Operator Pod Creation Blocked by Orphaned Run:ai Volcano Webhook Configurations

Matt Roark

Overview

After installing or reinstalling the GPU Operator on a CMK cluster, DaemonSet pods (such as nvidia-driver-daemonset) may fail to be created with the following error:

 
     failed to call webhook "mutatepod.volcano.sh": service "volcano-admission-service" not found

This occurs when Volcano, a batch scheduling system that Run:ai installs as a dependency, was previously deployed in the cluster but has since been removed, leaving behind orphaned cluster-scoped MutatingWebhookConfiguration and ValidatingWebhookConfiguration resources.

These webhook configurations make the API server call the Volcano admission service for every pod creation request they match. Because that service no longer exists, the call fails and the pod creation is rejected.
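To see which Service each webhook configuration points at, a minimal sketch along the following lines may help. The function names webhook_backends and volcano_rows are hypothetical helpers, not part of kubectl, and the sketch assumes kubectl is already configured for the affected cluster. The POLICY column shows each webhook's failurePolicy; a value of Fail would be consistent with the rejection described above, since the request is refused when the webhook cannot be reached.

```shell
#!/bin/sh
# Sketch: list the backing Service and failurePolicy of each admission
# webhook configuration. Assumes kubectl is configured for the cluster.

# webhook_backends is a hypothetical helper name.
# $1 is the resource kind: mutatingwebhookconfigurations or
# validatingwebhookconfigurations.
webhook_backends() {
  kubectl get "$1" -o custom-columns='NAME:.metadata.name,SERVICE:.webhooks[*].clientConfig.service.name,NAMESPACE:.webhooks[*].clientConfig.service.namespace,POLICY:.webhooks[*].failurePolicy'
}

# volcano_rows is a hypothetical helper name.
# Keeps the header row plus any row that mentions Volcano (case-insensitive).
volcano_rows() {
  awk 'NR == 1 || tolower($0) ~ /volcano/'
}

# Example:
#   webhook_backends mutatingwebhookconfigurations | volcano_rows
#   webhook_backends validatingwebhookconfigurations | volcano_rows
```

A row whose SERVICE and NAMESPACE columns name a Service that no longer exists is a candidate orphan.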


Prerequisites

  • kubectl access to the CMK cluster with cluster-admin permissions

Steps

  1. Confirm the Webhook Is the Cause
    • Check DaemonSet events for the failing component (e.g., nvidia-driver-daemonset) to confirm the webhook error:
     kubectl describe daemonset <daemonset-name> -n <namespace>
    • Look for a FailedCreate event referencing mutatepod.volcano.sh or volcano-admission-service.
  2. Identify the Orphaned Webhook Configurations
    • List all mutating and validating webhook configurations in the cluster and look for any Volcano-related entries:
     kubectl get mutatingwebhookconfigurations
     kubectl get validatingwebhookconfigurations
  3. Confirm Volcano Is No Longer Running
    • Verify that the Volcano admission service does not exist before deleting the webhooks:
     kubectl get svc -A | grep volcano
    • If no Volcano services are returned, the webhook configurations are orphaned and safe to remove.
  4. Delete the Orphaned Webhook Configurations
    • Delete both the mutating and validating webhook configurations for Volcano:
     kubectl delete mutatingwebhookconfiguration <volcano-mutating-webhook-name>
     kubectl delete validatingwebhookconfiguration <volcano-validating-webhook-name>
    • Common names include volcano-admission-service-mutating-webhook-configuration and volcano-admission-service-validating-webhook-configuration, but confirm against the output from Step 2.
  5. Verify GPU Operator Pods Come Up
    • After deleting the webhooks, confirm that the previously blocked DaemonSet pods are now being created and reach Running state:
     kubectl get pods -n <gpu-operator-namespace>
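The identify, confirm, and delete steps above could be combined into a small script like the sketch below. It is illustrative only: the function names are hypothetical, it assumes every Volcano-related resource has "volcano" in its name, and it defaults to a dry run that only prints what it would delete. Review the dry-run output against the webhook listing before passing apply.

```shell
#!/bin/sh
# Sketch of the identify / confirm / delete steps. Assumes kubectl has
# cluster-admin access and that Volcano resources contain "volcano" in
# their names. All function names here are hypothetical helpers.

# Print Volcano webhook configurations as kind/name, one per line.
list_volcano_webhooks() {
  kubectl get mutatingwebhookconfigurations,validatingwebhookconfigurations -o name \
    | grep -i volcano || true
}

# Succeed only if the Service list could be read and contains no
# Service line mentioning "volcano" in any namespace.
volcano_service_absent() {
  services=$(kubectl get svc -A --no-headers) || return 1
  if printf '%s\n' "$services" | grep -i volcano >/dev/null; then
    return 1
  fi
  return 0
}

# Delete orphaned webhook configurations.
# Pass "apply" to delete; any other value (or none) is a dry run.
cleanup_volcano_webhooks() {
  mode="${1:-dry-run}"
  if ! volcano_service_absent; then
    echo "Volcano service still present or cluster unreachable; not deleting." >&2
    return 1
  fi
  list_volcano_webhooks | while IFS= read -r resource; do
    [ -n "$resource" ] || continue
    if [ "$mode" = "apply" ]; then
      kubectl delete "$resource"
    else
      echo "would delete: $resource"
    fi
  done
}

# Example:
#   cleanup_volcano_webhooks          # dry run, prints candidates
#   cleanup_volcano_webhooks apply    # deletes the listed configurations
```

The dry-run default is a deliberate choice: webhook configurations are cluster-scoped, so a name match alone is worth confirming by eye before anything is removed.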

Resolution

The following describes how this issue was resolved in a confirmed case:

  1. After a clean reinstall of the GPU Operator on a CMK cluster, nvidia-driver-daemonset pods failed to be created. DaemonSet events showed FailedCreate errors referencing mutatepod.volcano.sh: service "volcano-admission-service" not found.
  2. Inspection of the cluster's webhook configurations revealed orphaned Volcano MutatingWebhookConfiguration and ValidatingWebhookConfiguration resources. The customer confirmed Volcano was no longer in use.
  3. Both webhook configurations were deleted.
  4. The nvidia-driver-daemonset pods were immediately created and came up Running on both nodes.
  5. The remaining GPU Operator components (nvidia-dcgm-exporter, nvidia-device-plugin) then proceeded with their own startup. In this case they encountered a separate inotify limits issue, but the webhook blocker was fully resolved.
