feat(infiniband): add DOCA OFED support for Ubuntu 22.04 and 24.04#8240
- Add `InfiniBandSizes` SKU map in `gpu_components.go` for RDMA-capable VM sizes (ND-series A100/H100/H200 with `r`, HPC HB v3/v4, HC)
- Add `NeedsInfiniBand` template function in `baker.go` and `NEEDS_INFINIBAND` env var in `cse_cmd.sh`
- Add Mellanox DOCA apt repo setup (`updateAptWithMellanoxPkg`), cleanup (`removeMellanoxRepos`), and full dependency tree download (`downloadDocaOfedPackages`) in `cse_install_ubuntu.sh`
- Cache all `doca-ofed` `.deb` packages during VHD build for air-gapped installation at CSE provisioning time
- Add `installDocaOfedFromCache` to install cached packages at CSE time, unsetting `ARCH` to prevent DKMS postinstall failures
- Add `configureInfiniBand` to blacklist `ib_ipoib` kernel module
- Add `should_skip_doca_ofed` in `cse_helpers.sh` using IMDS VM tag `SkipDocaOfedInstall` to allow users to opt out of installation
- Wire up InfiniBand conditional block in `cse_main.sh` `nodePrep()` with skip-tag support

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>

Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds DOCA OFED support for InfiniBand/RDMA-capable Azure SKUs on Ubuntu by caching required packages during VHD build and conditionally installing them during node provisioning.
Changes:
- Add an InfiniBand/RDMA-capable SKU map and a `NeedsInfiniBand` template function to drive provisioning behavior.
- Add Mellanox DOCA apt repo configuration, package caching, and cached install logic for Ubuntu 22.04/24.04.
- Wire conditional InfiniBand installation into CSE flow with an IMDS tag to skip installation.
Reviewed changes
Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.
| File | Description |
|---|---|
| vhdbuilder/packer/install-dependencies.sh | Adds Mellanox repo setup, caches DOCA OFED packages during VHD build, and removes Mellanox repos afterward. |
| pkg/agent/datamodel/gpu_components.go | Introduces InfiniBandSizes and IsInfiniBandSKU helper for RDMA-capable SKUs. |
| pkg/agent/baker.go | Exposes NeedsInfiniBand template function for CSE templating. |
| parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh | Adds Mellanox repo management, DOCA OFED download/install-from-cache, and InfiniBand configuration. |
| parts/linux/cloud-init/artifacts/cse_main.sh | Conditionally installs DOCA OFED at provisioning time with skip-tag support. |
| parts/linux/cloud-init/artifacts/cse_helpers.sh | Adds IMDS-tag-based should_skip_doca_ofed helper. |
| parts/linux/cloud-init/artifacts/cse_cmd.sh | Exposes NEEDS_INFINIBAND env var for CSE runtime. |

    mkdir -p "${downloadDir}"

    # Use apt-cache to resolve the full dependency tree of doca-ofed,
    # then download all packages as .deb files for air-gapped installation at CSE time
    local pkg_list
    pkg_list=$(apt-cache depends --recurse --no-recommends --no-suggests \
        --no-conflicts --no-breaks --no-replaces --no-enhances \
        doca-ofed 2>/dev/null | grep "^\w" | sort -u)

    pushd "${downloadDir}" >/dev/null || exit
    for pkg in ${pkg_list}; do
        apt-get download "${pkg}" 2>/dev/null || true
    done
    # Also download doca-ofed meta-package itself
    apt-get download doca-ofed 2>/dev/null || exit $ERR_APT_INSTALL_TIMEOUT
    popd >/dev/null || exit

`apt-cache depends` output generally includes leading indentation before dependency package names, while relation headers like `Depends:` start at column 0. Using `grep "^\w"` will typically (a) miss the actual dependency package entries and (b) include non-package tokens like `Depends:`/`PreDepends:` in `pkg_list`, resulting in an incomplete cache and subsequent installation failures. Prefer a dependency resolution/download approach that reliably produces real package names (e.g., parse tokens after the colon and strip alternatives, or use an apt-native download-only install that pulls the full dependency closure into a specified cache directory).

Suggested change:

    - mkdir -p "${downloadDir}"
    - # Use apt-cache to resolve the full dependency tree of doca-ofed,
    - # then download all packages as .deb files for air-gapped installation at CSE time
    - local pkg_list
    - pkg_list=$(apt-cache depends --recurse --no-recommends --no-suggests \
    -     --no-conflicts --no-breaks --no-replaces --no-enhances \
    -     doca-ofed 2>/dev/null | grep "^\w" | sort -u)
    - pushd "${downloadDir}" >/dev/null || exit
    - for pkg in ${pkg_list}; do
    -     apt-get download "${pkg}" 2>/dev/null || true
    - done
    - # Also download doca-ofed meta-package itself
    - apt-get download doca-ofed 2>/dev/null || exit $ERR_APT_INSTALL_TIMEOUT
    - popd >/dev/null || exit
    + mkdir -p "${downloadDir}/partial"
    + # Use apt to resolve and download doca-ofed and its full dependency closure
    + # as .deb files for air-gapped installation at CSE time.
    + apt-get install --download-only -y --no-install-recommends \
    +     -o Dir::Cache::archives="${downloadDir}" \
    +     doca-ofed 2>/dev/null || exit $ERR_APT_INSTALL_TIMEOUT

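Whichever download approach is used, a cheap post-download check can surface an incomplete cache at VHD build time rather than at provisioning. The sketch below is one possible shape and is not code from this PR: `verify_deb_cache` is a hypothetical helper, and it assumes only the standard Debian `name_version_arch.deb` file naming.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of this PR): fail if the cache directory
# is missing, holds no .deb files, or holds none for the named meta-package.
verify_deb_cache() {
    local cache_dir="$1"
    local meta_pkg="$2"

    if [ ! -d "${cache_dir}" ]; then
        echo "Cache directory ${cache_dir} does not exist" >&2
        return 1
    fi

    # -print -quit stops at the first match; grep -q . tests for any output.
    if ! find "${cache_dir}" -maxdepth 1 -type f -name '*.deb' -print -quit | grep -q .; then
        echo "No .deb files found in ${cache_dir}" >&2
        return 1
    fi

    # Debian archives are named <package>_<version>_<arch>.deb.
    if ! find "${cache_dir}" -maxdepth 1 -type f -name "${meta_pkg}_*.deb" -print -quit | grep -q .; then
        echo "No .deb for ${meta_pkg} found in ${cache_dir}" >&2
        return 1
    fi

    return 0
}
```

It could be called right after the download step, for example `verify_deb_cache "${downloadDir}" doca-ofed || exit $ERR_APT_INSTALL_TIMEOUT`. This checks presence only; it does not prove that the dependency closure is complete.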

    if [ "$cpu_arch" = "amd64" ]; then
        repo_arch="x86_64"
    elif [ "$cpu_arch" = "arm64" ]; then
        repo_arch="aarch64"
    else
        echo "Unknown CPU architecture: ${cpu_arch}"
        return
    fi

    local mellanox_ubuntu_version=""
    if [ "${UBUNTU_RELEASE}" = "22.04" ]; then
        mellanox_ubuntu_version="ubuntu22.04"
    elif [ "${UBUNTU_RELEASE}" = "24.04" ]; then
        mellanox_ubuntu_version="ubuntu24.04"
    else
        echo "Mellanox DOCA repo setup is not supported on Ubuntu ${UBUNTU_RELEASE}"
        return
    fi

Both error paths use `return` without a non-zero status, which will typically return the status of the preceding `echo` (0). That makes failures silently look successful to callers and can lead to later steps running with no Mellanox repo configured. Return a non-zero status (e.g., `return 1`) or consistently `exit` with an appropriate error code, matching how other repo-setup functions enforce failure.
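
The exit-status behaviour described in this comment can be reproduced in isolation. The sketch below is illustrative only: `map_repo_arch_silent` and `map_repo_arch_strict` are hypothetical names, not functions from this PR, and the logic is reduced to the architecture branch quoted above (with the error message sent to stderr so the mapped value is the only stdout output).

```shell
#!/usr/bin/env bash

# Hypothetical stand-in for the quoted branch: a bare `return` hands back
# the exit status of the last command executed, here the successful `echo`.
map_repo_arch_silent() {
    local cpu_arch="$1"
    if [ "$cpu_arch" = "amd64" ]; then
        echo "x86_64"
    elif [ "$cpu_arch" = "arm64" ]; then
        echo "aarch64"
    else
        echo "Unknown CPU architecture: ${cpu_arch}" >&2
        return
    fi
}

# Same mapping, but the error path returns an explicit non-zero status
# so callers can detect the failure.
map_repo_arch_strict() {
    local cpu_arch="$1"
    if [ "$cpu_arch" = "amd64" ]; then
        echo "x86_64"
    elif [ "$cpu_arch" = "arm64" ]; then
        echo "aarch64"
    else
        echo "Unknown CPU architecture: ${cpu_arch}" >&2
        return 1
    fi
}
```

Calling `map_repo_arch_silent riscv64` leaves `$?` at 0, while `map_repo_arch_strict riscv64` leaves it at 1, so only the second form lets a caller write `map_repo_arch_strict "$cpu_arch" || exit ...`.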

    DEBIAN_FRONTEND=noninteractive dpkg -i "${downloadDir}"/*.deb 2>&1 || {
        # Fix any broken dependencies using only local packages
        apt-get install -f -y --no-install-recommends 2>&1 || {
            echo "Failed to install DOCA OFED packages"
            if [ -n "${original_arch}" ]; then
                export ARCH="${original_arch}"
            fi
            return 1
        }
    }

`apt-get install -f` may attempt to download missing packages from configured apt repositories, which undermines the stated goal of “air-gapped installation” and can cause provisioning failures in restricted networks. Also, `dpkg -i "${downloadDir}"/*.deb` can run into command-line argument length limits if the dependency set is large. Consider installing in a way that (1) only uses local `.deb` artifacts (fail fast if any are missing, without attempting network), and (2) avoids glob-arg limits (e.g., stream filenames to dpkg/apt via xargs/find or use an appropriate local-repo approach).
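
The `find`/`xargs` option mentioned above could look roughly like the following. This is a sketch under stated assumptions, not the PR's implementation: `install_local_debs` is a hypothetical name, and the optional installer-command argument exists only so the function can be exercised without root. It relies on GNU `sort -z` and `xargs -0 -r`, which are available on Ubuntu.

```shell
#!/usr/bin/env bash

# Hypothetical helper (not part of this PR): install every .deb found in a
# directory by streaming NUL-separated paths to the installer, so the file
# list never passes through a shell glob.
# Usage: install_local_debs <dir> [installer command...]   (default: dpkg -i)
install_local_debs() {
    local deb_dir="$1"
    shift
    local -a installer=("$@")
    if [ "${#installer[@]}" -eq 0 ]; then
        installer=(dpkg -i)
    fi

    # Fail fast when the cache is empty instead of invoking the installer.
    if ! find "${deb_dir}" -maxdepth 1 -type f -name '*.deb' -print -quit | grep -q .; then
        echo "No .deb files found in ${deb_dir}" >&2
        return 1
    fi

    # sort -z gives a deterministic order; xargs -r skips the run on empty input.
    find "${deb_dir}" -maxdepth 1 -type f -name '*.deb' -print0 \
        | sort -z \
        | xargs -0 -r "${installer[@]}"
}
```

One caveat: if the file list is long enough, `xargs` splits it across several installer invocations, and packages in one batch may depend on packages in a later one. `dpkg -i --recursive "${downloadDir}"` avoids both the glob and the batching by letting dpkg enumerate the directory itself. Neither variant addresses the separate concern about `apt-get install -f` reaching the network.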
What this PR does / why we need it:
Which issue(s) this PR fixes:
Fixes #