Vcauxbrisebo/vm gpu support #851

Draft
vince-brisebois wants to merge 2 commits into main from vcauxbrisebo/vm-gpu-support

@vince-brisebois
Collaborator

Summary

Adds VFIO GPU passthrough support to openshell-vm using cloud-hypervisor as a second VMM backend alongside libkrun. Includes a full GPU bind/unbind lifecycle with safety checks, nvidia driver deadlock hardening (subprocess isolation with timeout, pre-unbind module cleanup, post-timeout verification), and an RAII guard that restores the original driver on exit.
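For reviewers unfamiliar with the pattern, here is a minimal sketch of an RAII guard that rebinds a device to its original driver on drop. It is not the PR's `GpuBindGuard`: the type name `DriverRestoreGuard`, its fields, and the `disarm` method are hypothetical, and it writes to sysfs directly rather than through the subprocess isolation described above. The only thing taken from the kernel interface is that writing a PCI address to `<sysfs pci root>/drivers/<driver>/bind` asks that driver to claim the device.

```rust
use std::fs;
use std::io;
use std::path::PathBuf;

/// Hypothetical guard (not the PR's `GpuBindGuard`): remembers which driver
/// owned a PCI device and asks that driver to reclaim it when dropped.
pub struct DriverRestoreGuard {
    /// Root of the PCI bus in sysfs, normally `/sys/bus/pci`.
    /// Kept configurable so the guard can be exercised against a temp dir.
    sysfs_pci_root: PathBuf,
    /// PCI address of the device, e.g. "0000:01:00.0".
    pci_addr: String,
    /// Driver that owned the device before passthrough, e.g. "nvidia".
    original_driver: String,
    /// When false, drop does nothing.
    armed: bool,
}

impl DriverRestoreGuard {
    pub fn new(sysfs_pci_root: PathBuf, pci_addr: &str, original_driver: &str) -> Self {
        Self {
            sysfs_pci_root,
            pci_addr: pci_addr.to_string(),
            original_driver: original_driver.to_string(),
            armed: true,
        }
    }

    /// Skip restoration on drop, e.g. when the caller restored manually.
    pub fn disarm(&mut self) {
        self.armed = false;
    }

    /// Write the PCI address to `<root>/drivers/<driver>/bind`.
    fn restore(&self) -> io::Result<()> {
        let bind = self
            .sysfs_pci_root
            .join("drivers")
            .join(&self.original_driver)
            .join("bind");
        fs::write(bind, &self.pci_addr)
    }
}

impl Drop for DriverRestoreGuard {
    fn drop(&mut self) {
        if !self.armed {
            return;
        }
        // Drop cannot return an error, so a failed restore is only logged.
        if let Err(e) = self.restore() {
            eprintln!(
                "failed to rebind {} to {}: {e}",
                self.pci_addr, self.original_driver
            );
        }
    }
}
```

Because the restore runs in `Drop`, it fires on early returns and on panics that unwind, but not on `SIGKILL` or `std::process::abort`, so a guard like this complements explicit cleanup rather than replacing it.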

Related Issue

N/A

Changes

  • VMM backend abstraction: Extract VmBackend trait with LibkrunBackend and CloudHypervisorBackend implementations; auto-select CHV when --gpu is set, reject --backend libkrun --gpu at the CLI level
  • GPU bind lifecycle (gpu_passthrough.rs): Probe sysfs for NVIDIA GPUs, check VFIO/IOMMU readiness, fail-closed safety checks (display outputs, /dev/nvidia* handles, IOMMU groups, VFIO modules, permissions), RAII GpuBindGuard for driver restoration
  • nvidia unbind deadlock hardening: Pre-unbind prep (disable persistence mode, unload nvidia_uvm/nvidia_drm/nvidia_modeset), all sysfs writes and prep commands in subprocesses with timeout (10s/15s), drop(child) without wait() to prevent parent D-state, post-timeout verification that continues if device is actually unbound
  • Cloud-hypervisor backend: Direct kernel boot with virtiofsd, TAP networking with NAT/port forwarding, vsock exec bridge, ACPI shutdown wrapper for --exec mode
  • Kernel kconfig: Add CONFIG_VIRTIO_PCI, CONFIG_SERIAL_8250, CONFIG_SERIAL_8250_CONSOLE, CONFIG_ACPI, CONFIG_PCI, CONFIG_PCI_MSI, CONFIG_DRM, CONFIG_MODULES, CONFIG_MODULE_UNLOAD
  • Guest rootfs: NVIDIA driver install support, device plugin and runtime class manifests, init script GPU detection and module loading
  • CI: gpu-ci.yml workflow on self-hosted GPU runners with OPENSHELL_VM_GPU_E2E=1
  • Architecture docs: Update custom-vm-runtime.md for dual-backend architecture, add vm-gpu-passthrough.md, add both to architecture/README.md index
  • Pre-commit fixes: rustfmt corrections, clippy ptr_arg fix in build.rs, test race condition fix in image.rs
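The unbind hardening above can be illustrated with a rough sketch, assuming a Unix host with `sh` on the PATH. The function names (`run_with_timeout`, `write_sysfs_isolated`, `is_unbound`), the `Outcome` enum, and the 50 ms poll interval are all hypothetical, and the `kill()` on timeout is an assumption of this sketch rather than something the PR states. The idea is that the potentially blocking sysfs write happens in a child process, the parent only polls with `try_wait()`, and on timeout it drops the handle without calling `wait()` so it cannot itself block on a child stuck in uninterruptible sleep.

```rust
use std::io;
use std::path::Path;
use std::process::{Command, ExitStatus, Stdio};
use std::thread;
use std::time::{Duration, Instant};

/// Result of running a command under a deadline (hypothetical type).
pub enum Outcome {
    Exited(ExitStatus),
    TimedOut,
}

/// Spawn `cmd` and poll it until it exits or `timeout` elapses.
/// The parent never calls the blocking `wait()`.
pub fn run_with_timeout(mut cmd: Command, timeout: Duration) -> io::Result<Outcome> {
    let mut child = cmd.stdin(Stdio::null()).spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        // try_wait() returns immediately whether or not the child has exited.
        if let Some(status) = child.try_wait()? {
            return Ok(Outcome::Exited(status));
        }
        if Instant::now() >= deadline {
            // Assumption of this sketch: send a best-effort SIGKILL. A child
            // stuck in D-state will not act on it until its syscall returns.
            let _ = child.kill();
            // Dropping a `Child` neither waits for nor kills the process.
            drop(child);
            return Ok(Outcome::TimedOut);
        }
        thread::sleep(Duration::from_millis(50));
    }
}

/// Write `value` to a sysfs file from a short-lived `sh` child, so that a
/// write that hangs in the kernel hangs the child and not this process.
pub fn write_sysfs_isolated(path: &Path, value: &str, timeout: Duration) -> io::Result<Outcome> {
    let mut cmd = Command::new("sh");
    // $1 is the value and $2 is the target path; "sh" fills $0.
    cmd.arg("-c")
        .arg("printf %s \"$1\" > \"$2\"")
        .arg("sh")
        .arg(value)
        .arg(path);
    run_with_timeout(cmd, timeout)
}

/// Post-timeout check: a PCI device directory has a `driver` symlink only
/// while a driver is bound, so its absence means the unbind took effect.
pub fn is_unbound(device_dir: &Path) -> bool {
    device_dir.join("driver").symlink_metadata().is_err()
}
```

A caller would treat `Outcome::TimedOut` as a failure only if `is_unbound` on the device directory (for example `/sys/bus/pci/devices/0000:01:00.0`) is still false, which matches the "continues if device is actually unbound" behaviour described in the list, with the 10s/15s timeouts passed as `Duration::from_secs(10)` or `Duration::from_secs(15)`.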

Testing

  • mise run pre-commit passes
  • Unit tests added/updated
  • E2E tests added/updated (if applicable)

Checklist

  • Follows Conventional Commits
  • Commits are signed off (DCO)
  • Architecture docs updated (if applicable)

@copy-pr-bot

copy-pr-bot bot commented Apr 15, 2026

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>
…a unbind hardening

Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>
vince-brisebois force-pushed the vcauxbrisebo/vm-gpu-support branch from 6ab54d1 to 199a712 on April 15, 2026 18:25