docs(asr): document VRAM-vs-audio-duration limits and add chunked inf…#377

Open
AviArora02-commits wants to merge 1 commit into microsoft:main from AviArora02-commits:docs/vram-audio-duration-guide-367

@AviArora02-commits

Closes #367

Problem

docs/vibevoice-asr.md states "60-minute single-pass processing" without saying
how much VRAM that requires. On an RTX 4090 (24 GB), the default sdpa attention
runs out of memory beyond ~30 min of audio
(empirically: 30 min → ✅ ~22 GB peak; 50 min → ❌ OOM).

Changes

  • docs/vibevoice-asr.md: Add Hardware Requirements section documenting
    the VRAM-vs-duration relationship and recommending flash_attention_2 for
    ≤24 GB GPUs.
  • demo/vibevoice_asr_chunked_inference.py: Minimal chunked inference script
    for GPUs where flash-attn is unavailable.
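
For reviewers who want a feel for the chunking step before opening the script, here is a minimal sketch of how chunk boundaries could be planned. It is not the code in demo/vibevoice_asr_chunked_inference.py: the function name `plan_chunks`, its parameters, and the optional overlap are hypothetical, and the 30-minute default only mirrors the empirical RTX 4090 figure above rather than a guaranteed safe limit.

```python
from typing import List, Tuple


def plan_chunks(
    total_seconds: float,
    max_chunk_seconds: float = 30 * 60.0,  # assumption: mirrors the ~30 min figure above
    overlap_seconds: float = 0.0,          # assumption: overlap is optional
) -> List[Tuple[float, float]]:
    """Return (start, end) times in seconds covering the whole recording."""
    if max_chunk_seconds <= 0:
        raise ValueError("max_chunk_seconds must be positive")
    if not 0 <= overlap_seconds < max_chunk_seconds:
        raise ValueError("overlap_seconds must be in [0, max_chunk_seconds)")
    if total_seconds <= 0:
        return []

    step = max_chunk_seconds - overlap_seconds
    chunks: List[Tuple[float, float]] = []
    start = 0.0
    while start < total_seconds:
        end = min(start + max_chunk_seconds, total_seconds)
        chunks.append((start, end))
        if end >= total_seconds:
            break  # the last chunk reached the end of the audio
        start += step
    return chunks


if __name__ == "__main__":
    # A 50-minute recording split into chunks of at most 30 minutes.
    for start, end in plan_chunks(50 * 60):
        print(f"{start / 60:.0f} min -> {end / 60:.0f} min")
```

Each (start, end) pair would then be sliced from the waveform and transcribed on its own, so peak memory depends on the chunk length rather than on the full recording.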

Notes

  • No model code modified.
  • Chunked inference carries a known caveat (per-chunk diarization IDs are not
    globally consistent); this is documented inline.
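
To make the diarization caveat concrete: "speaker 0" in one chunk need not be the same person as "speaker 0" in the next. One conservative way to merge results, sketched below, is to shift timestamps by each chunk's start and prefix speaker labels with the chunk index so that unrelated speakers are never silently merged. The segment schema (`start`, `end`, `speaker`, `text`) and the function name `namespace_speakers` are assumptions for illustration, not the script's actual output format.

```python
from typing import Dict, List


def namespace_speakers(
    chunk_segments: List[List[Dict]],
    chunk_starts: List[float],
) -> List[Dict]:
    """Merge per-chunk segments into one timeline with chunk-scoped speaker labels.

    Assumed (hypothetical) segment schema: start/end in seconds relative to the
    chunk, an integer speaker ID local to the chunk, and the transcribed text.
    """
    merged: List[Dict] = []
    for idx, (segments, offset) in enumerate(zip(chunk_segments, chunk_starts)):
        for seg in segments:
            merged.append(
                {
                    "start": seg["start"] + offset,  # convert to global time
                    "end": seg["end"] + offset,
                    # Prefix with the chunk index: IDs are only valid within a chunk.
                    "speaker": f"chunk{idx}_spk{seg['speaker']}",
                    "text": seg["text"],
                }
            )
    return merged


if __name__ == "__main__":
    chunks = [
        [{"start": 0.0, "end": 2.5, "speaker": 0, "text": "hello"}],
        [{"start": 1.0, "end": 3.0, "speaker": 0, "text": "world"}],
    ]
    for seg in namespace_speakers(chunks, [0.0, 1800.0]):
        print(seg)
```

This does not recover global speaker identity; it only keeps the ambiguity visible. Linking speakers across chunks would need an extra step, such as comparing speaker embeddings, which is outside the scope of this PR.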

Development

Successfully merging this pull request may close these issues.

[Docs] Document the VRAM-vs-audio-duration relationship — RTX 4090 (24GB) OOMs on >30min audio at default sdpa attention
