GitHub - Navi-AI-Lab/nvllm: (Experimental) A high-throughput and memory-efficient inference and serving engine for LLMs on DGX Spark / GB10 with a custom CuTe Paged Attention Backend

vLLM Fork with focus on GB10 Homelabs

About

vLLM is a fast and easy-to-use library for LLM inference and serving.

Originally developed in the Sky Computing Lab at UC Berkeley, vLLM has evolved into a community-driven project with contributions from both academia and industry.

If you use vLLM for your research, please cite their paper:

@inproceedings{kwon2023efficient,
  title={Efficient Memory Management for Large Language Model Serving with PagedAttention},
  author={Woosuk Kwon and Zhuohan Li and Siyuan Zhuang and Ying Sheng and Lianmin Zheng and Cody Hao Yu and Joseph E. Gonzalez and Hao Zhang and Ion Stoica},
  booktitle={Proceedings of the ACM SIGOPS 29th Symposium on Operating Systems Principles},
  year={2023}
}

Quick Start

Recommended: Build from source. The prebuilt images on GHCR and Docker Hub lag behind main — custom kernel work (CuTe attention, stream-K GEMM) ships here first and images are only rebuilt periodically.

Prebuilt image (convenience)

docker pull ghcr.io/navi-ai-lab/nvllm:latest

Also on Docker Hub: docker.io/naviailab/nvllm:latest

Build from source (recommended)

Required flags: --gpus all --ipc=host --network host (vLLM needs shared memory and GPU access).

Cache mounts (recommended — avoid re-downloads and JIT recompilation on restart):

~/.cache/huggingface — model weights
~/.cache/flashinfer — FlashInfer JIT kernels
~/.cache/vllm_compile → /root/.cache/vllm/torch_compile_cache — CUDA graph cache

For gated models (e.g., Gemma 4): pass HF_TOKEN via env or mount a credentials file.

Prerequisites

NVIDIA DGX Spark (GB10) or GH200
Docker with NVIDIA Container Toolkit
Hugging Face account with model access
huggingface-cli on host (pip install huggingface-hub)

1. Clone and build

git clone https://github.com/Navi-AI-Lab/nvllm.git
cd nvllm
docker build -f docker/Dockerfile.gb10 -t nvllm:gb10 .

Already cloned? Pull the latest first:

cd nvllm && git pull
docker build -f docker/Dockerfile.gb10 -t nvllm:gb10 .

2. Serve a model

./scripts/serve.sh

First run downloads the model automatically (~18 GB). API available at http://localhost:8000/v1.

All models are served as default — use "model": "default" in API calls:

curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model": "default", "messages": [{"role": "user", "content": "Hello"}]}'

Serve Scripts

Script	Model	Status	Context
`serve.sh`	Qwen3.5-27B-NVFP4	Active (default)	64K
`serve-cute.sh`	Qwen3.5-27B-NVFP4 (CuTe Paged Attention; override `HF_MODEL` env)	Active (kernel dev)	64K
`serve-nemotron.sh`	Nemotron-3-Super-120B-A12B-NVFP4	Not Ready	128K
`serve-gemma4.sh`	Gemma 4 31B IT NVFP4	Degraded (see script)	32K

Flags

Flag	Effect
`--tq`	TurboQuant KV cache — more context capacity, ~25% lower throughput (serve.sh only)
`--debug`	Eager mode, no CUDA graphs (for debugging)

Roadmap

Now — Qwen3.5-27B kernel work

CuTe DSL paged attention uber-kernel (fused attention + W_O GEMV + RMSNorm)
CUTE_WO_SPLIT={2,4,8} opt-in K-parallel W_O GEMV. Default remains 1; production soak keeps optimized splits opt-in after wo8 showed +3.3% wall improvement, -25.52 ms p95 TPOT, and GSM8K parity within 1/50. See the production soak writeup.
CUDA graph support (FULL_AND_PIECEWISE mode)
End-to-end fusion validation through Qwen3NextAttention

Next — expand model support

Gemma 4 31B IT — blocked on vLLM PR #38891 (per-layer attention backend for mixed head_dim)
Devstral 2 Large — NVFP4 quantization and serve script

SM120 Stream-K Decode Optimization

This fork includes a custom CUTLASS FP4 GEMM kernel with stream-K scheduling for small-M decode (M≤16). Stream-K distributes K-dimension work across SMs, improving utilization when the batch size is too small to fill all SMs with standard tile scheduling.

Based on CUTLASS's own sm120_bs_gemm_nvf4_nvf4_f32_f32_stream_k test kernel, adapted for vLLM's dispatch:

Tile: 128×128×256 (K doubled from default 128)
Schedule: KernelTmaWarpSpecializedCooperative
Tile scheduler: StreamKScheduler

Benchmarked on Qwen3.5-27B-NVFP4 (rate=8, max-num-seqs=4):

Metric	Baseline	Stream-K	Delta
Output tok/s	40.0	44.9	+12.2%
TPOT p50	89.2 ms	80.0 ms	-10.2%
TPOT p99	91.7 ms	82.7 ms	-9.8%

Trace — committed streamk_graphs.pt.trace.json.gz + per-kernel CSVs.

Warning: Large models (>75 GB) that leave minimal memory headroom on the GB10's 128 GB unified memory may crash during CUDA graph capture with the stream-K kernel. Use a smaller model to test first.

CuTe Paged Attention Backend (Prototype)

Custom paged attention backend using CuTe Python DSL, targeting SM120/SM121 FP8 MMA instructions. Registered as CUTE_PAGED in vLLM's attention backend registry.

Status: Experimental CuTe DSL backend; production decode path since v0.3.0. The β-coop fused kernel (attention + W_O + RMSNorm + MLP) is the default. Opt-in CUTE_WO_SPLIT=8 is production-evidenced but not the default: the controlled harness reports W_O 13754 μs → 1639 μs (8.39×), while serving soak shows wo8 as a modest wall/TPOT win with quality parity. See the harness summary and production soak.

Launch with: ./scripts/serve-cute.sh (default PIECEWISE CUDA graphs). To opt into the evidenced W_O split, run CUTE_WO_SPLIT=8 ./scripts/serve-cute.sh.

Acknowledgments

b12x by Luke Alonso — CuTe DSL paged attention with FP8 KV inline dequant, TMA plane loading, and split-KV merge. Reference implementation for the CuTe paged attention backend. Pinned at c469c66.
- docs/kernel-insights/2026-04-10-b12x-cute-attention.md — CuTe attention & disk cache patterns
- docs/kernel-insights/2026-04-11-b12x-paged-attention.md — Full paged attention kernel architecture (1165 lines, 59 pinned permalinks)
- docs/kernel-insights/2026-04-17-b12x-mlp-fusion.md — Per-slice MLP fusion structure and UE4M3 blockscale encoding referenced from b12x @ c469c66 for the Phase D fused MLP decode kernel.
CUTLASS PR #3030 by blake-snc (Second Nature Computing) — SM120 Flash Attention v2 reference for fused multi-head attention on Blackwell.
- docs/kernel-insights/2026-04-10-cutlass-pr3030-sm120-fmha.md — SM120 FMHA patterns and tile configs
CUTLASS by NVIDIA — CuTe Python DSL for SM120 kernel development. The FP4 decode GEMM kernel with stream-K scheduling is adapted from CUTLASS test kernels.
Simon Veitner's CuTe DSL / NVFP4 blog — Reference reading for NVFP4 GEMV K-parallel reduction patterns. Applied to W_O GEMV in the CUTE_WO_SPLIT opt-in prototype; see the production soak writeup.
vLLM — The upstream project this fork is based on.

Name		Name	Last commit message	Last commit date
Latest commit History 15,883 Commits
.buildkite		.buildkite
.gemini		.gemini
.github		.github
benchmarks		benchmarks
cmake		cmake
csrc		csrc
docker		docker
docs		docs
examples		examples
models/qwen3.6-mtp-tokenizer-patch		models/qwen3.6-mtp-tokenizer-patch
notebooks/nvllm		notebooks/nvllm
requirements		requirements
scripts		scripts
templates		templates
tests		tests
tools		tools
vllm		vllm
.clang-format		.clang-format
.coveragerc		.coveragerc
.dockerignore		.dockerignore
.git-blame-ignore-revs		.git-blame-ignore-revs
.gitignore		.gitignore
.markdownlint.yaml		.markdownlint.yaml
.pre-commit-config.yaml		.pre-commit-config.yaml
.readthedocs.yaml		.readthedocs.yaml
.shellcheckrc		.shellcheckrc
.yapfignore		.yapfignore
AGENTS.md		AGENTS.md
CMakeLists.txt		CMakeLists.txt
CODE_OF_CONDUCT.md		CODE_OF_CONDUCT.md
CONTRIBUTING.md		CONTRIBUTING.md
DCO		DCO
LICENSE		LICENSE
MANIFEST.in		MANIFEST.in
README.md		README.md
RELEASE.md		RELEASE.md
SECURITY.md		SECURITY.md
codecov.yml		codecov.yml
mkdocs.yaml		mkdocs.yaml
pyproject.toml		pyproject.toml
setup.py		setup.py
use_existing_torch.py		use_existing_torch.py

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

vLLM Fork with focus on GB10 Homelabs

About

Quick Start

Prebuilt image (convenience)

Build from source (recommended)

Prerequisites

1. Clone and build

2. Serve a model

Serve Scripts

Flags

Roadmap

Now — Qwen3.5-27B kernel work

Next — expand model support

SM120 Stream-K Decode Optimization

CuTe Paged Attention Backend (Prototype)

Acknowledgments

About

Uh oh!

Releases 3

Packages

Uh oh!

Contributors

Uh oh!

Languages

Folders and files

Latest commit

History

Repository files navigation

vLLM Fork with focus on GB10 Homelabs

About

Quick Start

Prebuilt image (convenience)

Build from source (recommended)

Prerequisites

1. Clone and build

2. Serve a model

Serve Scripts

Flags

Roadmap

Now — Qwen3.5-27B kernel work

Next — expand model support

SM120 Stream-K Decode Optimization

CuTe Paged Attention Backend (Prototype)

Acknowledgments

About

Topics

Resources

License

Code of conduct

Contributing

Security policy

Uh oh!

Stars

Watchers

Forks

Releases 3

Packages 0

Uh oh!

Contributors

Uh oh!

Languages

Packages