Vcauxbrisebo/vm gpu support by vince-brisebois · Pull Request #851 · NVIDIA/OpenShell

vince-brisebois · 2026-04-15T18:07:54Z

Summary

Adds VFIO GPU passthrough support to openshell-vm using cloud-hypervisor as a second VMM backend alongside libkrun. Includes a full GPU bind/unbind lifecycle with safety checks, nvidia driver deadlock hardening (subprocess isolation with timeout, pre-unbind module cleanup, post-timeout verification), and an RAII guard that restores the original driver on exit.
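
The RAII guard mentioned above can be illustrated with a minimal sketch. It is not the PR's code: only the name GpuBindGuard comes from this description, while the DriverBinder trait, the RecordingBinder test double, and the acquire signature are hypothetical names introduced for illustration. The sketch assumes the guard binds the device to vfio-pci on creation and rebinds the original driver in Drop.

```rust
use std::cell::RefCell;
use std::rc::Rc;

/// Hypothetical seam over "bind this PCI device to this driver".
/// A real implementation would write to sysfs; this trait is an assumption.
trait DriverBinder {
    fn bind(&self, pci_addr: &str, driver: &str) -> Result<(), String>;
}

/// RAII guard: binds the device to vfio-pci when acquired and restores the
/// original driver when dropped, including on early return or panic unwind.
struct GpuBindGuard<B: DriverBinder> {
    binder: B,
    pci_addr: String,
    original_driver: String,
}

impl<B: DriverBinder> GpuBindGuard<B> {
    fn acquire(binder: B, pci_addr: &str, original_driver: &str) -> Result<Self, String> {
        binder.bind(pci_addr, "vfio-pci")?;
        Ok(Self {
            binder,
            pci_addr: pci_addr.to_string(),
            original_driver: original_driver.to_string(),
        })
    }
}

impl<B: DriverBinder> Drop for GpuBindGuard<B> {
    fn drop(&mut self) {
        // Drop cannot return an error, so a failed restore is only reported.
        if let Err(e) = self.binder.bind(&self.pci_addr, &self.original_driver) {
            eprintln!(
                "failed to restore {} to {}: {e}",
                self.pci_addr, self.original_driver
            );
        }
    }
}

/// Test double that records bind requests instead of touching hardware.
#[derive(Clone, Default)]
struct RecordingBinder {
    log: Rc<RefCell<Vec<String>>>,
}

impl DriverBinder for RecordingBinder {
    fn bind(&self, pci_addr: &str, driver: &str) -> Result<(), String> {
        self.log.borrow_mut().push(format!("{pci_addr} -> {driver}"));
        Ok(())
    }
}
```

Holding the guard in a local variable for the lifetime of the VM means the restore runs on every exit path that unwinds the stack. It does not run if the process is killed with SIGKILL or aborts, so a real implementation would likely need a separate recovery path for that case.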

Related Issue

N/A

Changes

VMM backend abstraction: Extract VmBackend trait with LibkrunBackend and CloudHypervisorBackend implementations; auto-select CHV when --gpu is set, reject --backend libkrun --gpu at the CLI level
GPU bind lifecycle (gpu_passthrough.rs): Probe sysfs for NVIDIA GPUs, check VFIO/IOMMU readiness, fail-closed safety checks (display outputs, /dev/nvidia* handles, IOMMU groups, VFIO modules, permissions), RAII GpuBindGuard for driver restoration
NVIDIA unbind deadlock hardening: Pre-unbind prep (disable persistence mode; unload nvidia_uvm, nvidia_drm, and nvidia_modeset); all sysfs writes and prep commands run in subprocesses with timeouts (10s/15s); on timeout, drop(child) without wait() so the parent cannot enter D-state; post-timeout verification lets the flow continue if the device is actually unbound
Cloud-hypervisor backend: Direct kernel boot with virtiofsd, TAP networking with NAT/port forwarding, vsock exec bridge, ACPI shutdown wrapper for --exec mode
Kernel kconfig: Add CONFIG_VIRTIO_PCI, CONFIG_SERIAL_8250, CONFIG_SERIAL_8250_CONSOLE, CONFIG_ACPI, CONFIG_PCI, CONFIG_PCI_MSI, CONFIG_DRM, CONFIG_MODULES, CONFIG_MODULE_UNLOAD
Guest rootfs: NVIDIA driver install support, device plugin and runtime class manifests, init script GPU detection and module loading
CI: gpu-ci.yml workflow on self-hosted GPU runners with OPENSHELL_VM_GPU_E2E=1
Architecture docs: Update custom-vm-runtime.md for dual-backend architecture, add vm-gpu-passthrough.md, add both to architecture/README.md index
Pre-commit fixes: rustfmt corrections, clippy ptr_arg fix in build.rs, test race condition fix in image.rs
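
The backend selection rule from the first item above can be sketched as follows. The names VmBackend, LibkrunBackend, and CloudHypervisorBackend come from this description. The trait methods, the BackendKind enum, the select_backend function, and the choice of libkrun as the default when --gpu is absent are assumptions made for illustration, not the PR's actual API.

```rust
/// Hypothetical CLI-level choice of backend (the value of `--backend`).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum BackendKind {
    Libkrun,
    CloudHypervisor,
}

/// Sketch of the backend trait; these two methods are assumed, not the PR's.
trait VmBackend {
    fn name(&self) -> &'static str;
    fn supports_gpu(&self) -> bool;
}

struct LibkrunBackend;
struct CloudHypervisorBackend;

impl VmBackend for LibkrunBackend {
    fn name(&self) -> &'static str {
        "libkrun"
    }
    fn supports_gpu(&self) -> bool {
        false
    }
}

impl VmBackend for CloudHypervisorBackend {
    fn name(&self) -> &'static str {
        "cloud-hypervisor"
    }
    fn supports_gpu(&self) -> bool {
        true
    }
}

/// Pick a backend from the CLI inputs:
/// - an explicit `--backend` wins when it is compatible with `--gpu`;
/// - with no explicit backend, `--gpu` selects cloud-hypervisor;
/// - otherwise fall back to libkrun (assumed default).
fn select_backend(
    requested: Option<BackendKind>,
    gpu: bool,
) -> Result<Box<dyn VmBackend>, String> {
    let kind = requested.unwrap_or(if gpu {
        BackendKind::CloudHypervisor
    } else {
        BackendKind::Libkrun
    });
    let backend: Box<dyn VmBackend> = match kind {
        BackendKind::Libkrun => Box::new(LibkrunBackend),
        BackendKind::CloudHypervisor => Box::new(CloudHypervisorBackend),
    };
    // Reject incompatible combinations before any VM or GPU state is touched.
    if gpu && !backend.supports_gpu() {
        return Err(format!(
            "--backend {} cannot be combined with --gpu",
            backend.name()
        ));
    }
    Ok(backend)
}
```

Validating the combination at argument-parsing time keeps the failure cheap: nothing has been unbound or spawned yet when the error is returned.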

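The subprocess-with-timeout pattern described in the unbind hardening item can be sketched with the standard library alone. This is an illustrative approximation, not the PR's implementation: run_with_timeout, Outcome, and driver_unbound are hypothetical names, and the 50 ms polling interval is an arbitrary choice. The sketch relies on two facts: dropping a std::process::Child neither waits for nor kills the process, and a PCI device bound to a driver exposes a `driver` symlink in its sysfs directory.

```rust
use std::path::Path;
use std::process::{Child, Command, ExitStatus, Stdio};
use std::thread::sleep;
use std::time::{Duration, Instant};

/// Result of running a helper command under a deadline.
#[derive(Debug)]
enum Outcome {
    Exited(ExitStatus),
    TimedOut,
}

/// Run `cmd` in a subprocess and poll it until it exits or `timeout` elapses.
fn run_with_timeout(mut cmd: Command, timeout: Duration) -> std::io::Result<Outcome> {
    // Detach stdio so an abandoned child cannot hold the parent's pipes open.
    cmd.stdin(Stdio::null())
        .stdout(Stdio::null())
        .stderr(Stdio::null());
    let mut child: Child = cmd.spawn()?;
    let deadline = Instant::now() + timeout;
    loop {
        // try_wait never blocks, so the parent stays responsive even if the
        // child is stuck in uninterruptible sleep inside the kernel.
        if let Some(status) = child.try_wait()? {
            return Ok(Outcome::Exited(status));
        }
        if Instant::now() >= deadline {
            // Deliberately no child.wait(): a blocking wait on a stuck child
            // would stall the parent too. Dropping the handle abandons it.
            drop(child);
            return Ok(Outcome::TimedOut);
        }
        sleep(Duration::from_millis(50));
    }
}

/// Post-timeout check: treat the device as unbound when its `driver`
/// symlink no longer exists. Callers pass the path to that symlink.
fn driver_unbound(driver_link: &Path) -> bool {
    std::fs::symlink_metadata(driver_link).is_err()
}
```

A caller would treat Outcome::TimedOut as inconclusive rather than fatal, then check driver_unbound on the device's `driver` path to decide whether the unbind actually took effect despite the hung helper.
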
Testing

mise run pre-commit passes
Unit tests added/updated
E2E tests added/updated (if applicable)

Checklist

Follows Conventional Commits
Commits are signed off (DCO)
Architecture docs updated (if applicable)

copy-pr-bot · 2026-04-15T18:07:59Z

This pull request requires additional validation before any workflows can run on NVIDIA's runners.

Pull request vetters can view their responsibilities here.

Contributors can view more details about this message here.

vince-brisebois added 2 commits April 15, 2026 18:19

GPU support design and implementation plan

5fb51ac

Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>

feat(vm): add GPU passthrough with cloud-hypervisor backend and nvidia unbind hardening

199a712

Signed-off-by: Vincent Caux-Brisebois <vcauxbrisebo@nvidia.com>

vince-brisebois force-pushed the vcauxbrisebo/vm-gpu-support branch from 6ab54d1 to 199a712 on April 15, 2026 18:25

vince-brisebois wants to merge 2 commits into main from vcauxbrisebo/vm-gpu-support
