feat(infiniband): add DOCA OFED support for Ubuntu 22.04 and 24.04 #8240

Draft
surajssd wants to merge 1 commit into main from suraj/add-doca-ofed-pkg

surajssd commented Apr 3, 2026

  • Add InfiniBandSizes SKU map in gpu_components.go for RDMA-capable VM sizes (ND-series A100/H100/H200 with r, HPC HB v3/v4, HC)
  • Add NeedsInfiniBand template function in baker.go and NEEDS_INFINIBAND env var in cse_cmd.sh
  • Add Mellanox DOCA apt repo setup (updateAptWithMellanoxPkg), cleanup (removeMellanoxRepos), and full dependency tree download (downloadDocaOfedPackages) in cse_install_ubuntu.sh
  • Cache all doca-ofed .deb packages during VHD build for air-gapped installation at CSE provisioning time
  • Add installDocaOfedFromCache to install cached packages at CSE time, unsetting ARCH to prevent DKMS postinstall failures
  • Add configureInfiniBand to blacklist ib_ipoib kernel module
  • Add should_skip_doca_ofed in cse_helpers.sh using IMDS VM tag SkipDocaOfedInstall to allow users to opt out of installation
  • Wire up InfiniBand conditional block in cse_main.sh nodePrep() with skip-tag support

What this PR does / why we need it:

Which issue(s) this PR fixes:

Fixes #

Signed-off-by: Suraj Deshmukh <suraj.deshmukh@microsoft.com>
Copilot AI review requested due to automatic review settings April 3, 2026 22:51

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds DOCA OFED support for InfiniBand/RDMA-capable Azure SKUs on Ubuntu by caching required packages during VHD build and conditionally installing them during node provisioning.

Changes:

  • Add an InfiniBand/RDMA-capable SKU map and a NeedsInfiniBand template function to drive provisioning behavior.
  • Add Mellanox DOCA apt repo configuration, package caching, and cached install logic for Ubuntu 22.04/24.04.
  • Wire conditional InfiniBand installation into CSE flow with an IMDS tag to skip installation.

Reviewed changes

Copilot reviewed 7 out of 7 changed files in this pull request and generated 6 comments.

| File | Description |
| --- | --- |
| `vhdbuilder/packer/install-dependencies.sh` | Adds Mellanox repo setup, caches DOCA OFED packages during VHD build, and removes Mellanox repos afterward. |
| `pkg/agent/datamodel/gpu_components.go` | Introduces InfiniBandSizes and IsInfiniBandSKU helper for RDMA-capable SKUs. |
| `pkg/agent/baker.go` | Exposes NeedsInfiniBand template function for CSE templating. |
| `parts/linux/cloud-init/artifacts/ubuntu/cse_install_ubuntu.sh` | Adds Mellanox repo management, DOCA OFED download/install-from-cache, and InfiniBand configuration. |
| `parts/linux/cloud-init/artifacts/cse_main.sh` | Conditionally installs DOCA OFED at provisioning time with skip-tag support. |
| `parts/linux/cloud-init/artifacts/cse_helpers.sh` | Adds IMDS-tag-based should_skip_doca_ofed helper. |
| `parts/linux/cloud-init/artifacts/cse_cmd.sh` | Exposes NEEDS_INFINIBAND env var for CSE runtime. |

Comment on lines +247 to +263
mkdir -p "${downloadDir}"

# Use apt-cache to resolve the full dependency tree of doca-ofed,
# then download all packages as .deb files for air-gapped installation at CSE time
local pkg_list
pkg_list=$(apt-cache depends --recurse --no-recommends --no-suggests \
    --no-conflicts --no-breaks --no-replaces --no-enhances \
    doca-ofed 2>/dev/null | grep "^\w" | sort -u)

pushd "${downloadDir}" >/dev/null || exit
for pkg in ${pkg_list}; do
    apt-get download "${pkg}" 2>/dev/null || true
done
# Also download doca-ofed meta-package itself
apt-get download doca-ofed 2>/dev/null || exit $ERR_APT_INSTALL_TIMEOUT
popd >/dev/null || exit

Copilot AI Apr 3, 2026

`apt-cache depends` output generally includes leading indentation before dependency package names, while relation headers like `Depends:` start at column 0. Using `grep "^\w"` will typically (a) miss the actual dependency package entries and (b) include non-package tokens like `Depends:`/`PreDepends:` in `pkg_list`, resulting in an incomplete cache and subsequent installation failures. Prefer a dependency resolution/download approach that reliably produces real package names (e.g., parse tokens after the colon and strip alternatives, or use an apt-native download-only install that pulls the full dependency closure into a specified cache directory).

Suggested change
- mkdir -p "${downloadDir}"
- # Use apt-cache to resolve the full dependency tree of doca-ofed,
- # then download all packages as .deb files for air-gapped installation at CSE time
- local pkg_list
- pkg_list=$(apt-cache depends --recurse --no-recommends --no-suggests \
-     --no-conflicts --no-breaks --no-replaces --no-enhances \
-     doca-ofed 2>/dev/null | grep "^\w" | sort -u)
- pushd "${downloadDir}" >/dev/null || exit
- for pkg in ${pkg_list}; do
-     apt-get download "${pkg}" 2>/dev/null || true
- done
- # Also download doca-ofed meta-package itself
- apt-get download doca-ofed 2>/dev/null || exit $ERR_APT_INSTALL_TIMEOUT
- popd >/dev/null || exit
+ mkdir -p "${downloadDir}/partial"
+ # Use apt to resolve and download doca-ofed and its full dependency closure
+ # as .deb files for air-gapped installation at CSE time.
+ apt-get install --download-only -y --no-install-recommends \
+     -o Dir::Cache::archives="${downloadDir}" \
+     doca-ofed 2>/dev/null || exit $ERR_APT_INSTALL_TIMEOUT
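
What the `grep "^\w"` filter keeps depends on how the `apt-cache depends --recurse` output is laid out, so it may be worth exercising the filter offline before choosing between the two approaches. The sketch below runs the same pipeline over a hand-written sample. The helper name `filter_pkg_names`, the placeholder packages `pkg-a` and `pkg-b`, and the sample layout (package names flush left, relation lines indented beneath them) are all assumptions for illustration and should be compared against real output on a build host.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not part of the PR): applies the PR's filter to text
# read from stdin so it can be tried without apt or network access.
filter_pkg_names() {
    grep "^\w" | sort -u
}

# Hand-written sample. pkg-a, pkg-b and <virtual-pkg> are placeholders, and
# the layout is an assumption to verify against the real output of
# `apt-cache depends --recurse doca-ofed`.
sample_output='doca-ofed
  Depends: pkg-a
 |Depends: pkg-b
  Depends: <virtual-pkg>
pkg-a
  Depends: pkg-b
pkg-b
<virtual-pkg>'

# Prints only the lines that begin with a word character, de-duplicated.
printf '%s\n' "${sample_output}" | filter_pkg_names
```

Saving the real `apt-cache depends` output to a file during a VHD build and piping it through the same function would show which lines the filter actually retains.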

Copilot uses AI. Check for mistakes.
Comment on lines +203 to +220
if [ "$cpu_arch" = "amd64" ]; then
    repo_arch="x86_64"
elif [ "$cpu_arch" = "arm64" ]; then
    repo_arch="aarch64"
else
    echo "Unknown CPU architecture: ${cpu_arch}"
    return
fi

local mellanox_ubuntu_version=""
if [ "${UBUNTU_RELEASE}" = "22.04" ]; then
    mellanox_ubuntu_version="ubuntu22.04"
elif [ "${UBUNTU_RELEASE}" = "24.04" ]; then
    mellanox_ubuntu_version="ubuntu24.04"
else
    echo "Mellanox DOCA repo setup is not supported on Ubuntu ${UBUNTU_RELEASE}"
    return
fi

Copilot AI Apr 3, 2026

Both error paths use `return` without a non-zero status, which will typically return the status of the preceding `echo` (0). That makes failures silently look successful to callers and can lead to later steps running with no Mellanox repo configured. Return a non-zero status (e.g., `return 1`) or consistently exit with an appropriate error code, matching how other repo-setup functions enforce failure.
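
The behavior described in this comment can be reproduced in isolation. The following is a minimal sketch, assuming bash; the two function names are hypothetical and exist only to contrast a bare `return` with `return 1`.

```shell
#!/usr/bin/env bash
# Hypothetical functions for illustration; neither exists in the PR.

# A bare `return` propagates the status of the last command run (echo, so 0).
setup_repo_bare_return() {
    if [ "$1" != "amd64" ] && [ "$1" != "arm64" ]; then
        echo "Unknown CPU architecture: $1"
        return
    fi
}

# An explicit non-zero status lets callers detect the failure.
setup_repo_explicit_return() {
    if [ "$1" != "amd64" ] && [ "$1" != "arm64" ]; then
        echo "Unknown CPU architecture: $1" >&2
        return 1
    fi
}

# Capture each status without tripping `set -e` if the caller enables it.
if setup_repo_bare_return "riscv64" >/dev/null; then
    bare_status=0
else
    bare_status=$?
fi
if setup_repo_explicit_return "riscv64" 2>/dev/null; then
    explicit_status=0
else
    explicit_status=$?
fi

echo "bare=${bare_status} explicit=${explicit_status}"   # prints "bare=0 explicit=1"
```

With the explicit form, a caller can write `setup_repo_explicit_return "${cpu_arch}" || exit 1` (the exit code here is a placeholder) and stop before later steps run without a repo configured.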

Comment on lines +281 to +290
DEBIAN_FRONTEND=noninteractive dpkg -i "${downloadDir}"/*.deb 2>&1 || {
    # Fix any broken dependencies using only local packages
    apt-get install -f -y --no-install-recommends 2>&1 || {
        echo "Failed to install DOCA OFED packages"
        if [ -n "${original_arch}" ]; then
            export ARCH="${original_arch}"
        fi
        return 1
    }
}

Copilot AI Apr 3, 2026

`apt-get install -f` may attempt to download missing packages from configured apt repositories, which undermines the stated goal of “air-gapped installation” and can cause provisioning failures in restricted networks. Also, `dpkg -i "${downloadDir}"/*.deb` can run into command-line argument length limits if the dependency set is large. Consider installing in a way that (1) only uses local `.deb` artifacts (fail fast if any are missing, without attempting network), and (2) avoids glob-arg limits (e.g., stream filenames to dpkg/apt via xargs/find or use an appropriate local-repo approach).
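
One possible shape for the fail-fast part of this suggestion is a pre-install check on the cache directory. This is only a sketch: `doca_cache_has_debs` is a hypothetical helper, the temporary directories stand in for the real cache path, and `-print -quit` assumes GNU `find` as shipped on Ubuntu.

```shell
#!/usr/bin/env bash
# Hypothetical helper (not in the PR): succeeds only when the cache
# directory exists and holds at least one .deb file.
doca_cache_has_debs() {
    local dir="$1"
    [ -d "${dir}" ] || return 1
    # find stops at the first match, so no large glob is expanded.
    [ -n "$(find "${dir}" -maxdepth 1 -type f -name '*.deb' -print -quit)" ]
}

# Demonstration with throwaway directories standing in for the cache.
empty_dir="$(mktemp -d)"
full_dir="$(mktemp -d)"
touch "${full_dir}/example_1.0_amd64.deb"

if doca_cache_has_debs "${empty_dir}"; then empty_result="ok"; else empty_result="missing"; fi
if doca_cache_has_debs "${full_dir}"; then full_result="ok"; else full_result="missing"; fi
echo "empty=${empty_result} full=${full_result}"   # prints "empty=missing full=ok"

rm -rf "${empty_dir}" "${full_dir}"
```

After such a check, `dpkg -i --recursive "${downloadDir}"` is one way to let dpkg enumerate the directory itself instead of receiving every filename as an argument, and `apt-get --no-download` is an option that tells apt not to fetch packages. Whether either fits this PR's DKMS install ordering is untested here.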
