Fix FP8 CT export metadata for KV cache and attention #1752

Open
yiliu30 wants to merge 3 commits into main from fix/issue-1751-fp8-ct-export

Conversation

@yiliu30 (Contributor) commented on Apr 28, 2026

Summary

  • add FP8 static attention config-group export for the compressed-tensors format
  • keep FP8 KV cache metadata in the exported config (a sketch of the resulting metadata follows this list)
  • add targeted tests for FP8 KV-cache and FP8 attention CT export behavior
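
For illustration, here is a minimal sketch of the quantization_config metadata this export is meant to produce in config.json, written as a Python dict in compressed-tensors style; the attention target pattern and the exact scheme fields below are assumptions for illustration, not values copied from the PR diff:

# Hypothetical shape of the exported metadata (illustrative values).
quantization_config = {
    "quant_method": "compressed-tensors",
    "format": "float-quantized",
    "config_groups": {
        # FP8 static attention group targeting non-Linear modules;
        # the target pattern is a guess for illustration.
        "group_attention": {
            "targets": ["re:.*self_attn$"],
            "input_activations": {
                "num_bits": 8,
                "type": "float",
                "strategy": "tensor",
                "dynamic": False,
                "symmetric": True,
            },
        },
    },
    # FP8 KV-cache scheme kept in the exported config by this PR.
    "kv_cache_scheme": {
        "num_bits": 8,
        "type": "float",
        "strategy": "tensor",
        "dynamic": False,
        "symmetric": True,
    },
}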

Verification

  • /home/yiliu4/workspace/ar/bin/python -m pytest test/test_cpu/export/test_llmc_format.py -k 'static_fp8_kv_config or static_fp8_attention_config or static_fp_export_packs_serially' -q -s
    • passed
  • /home/yiliu4/workspace/ar/bin/python -m pytest test/test_cpu/export/test_llmc_format.py -q
    • the issue-specific tests passed; one unrelated mixed-precision load failure remains in the installed compressed-tensors stack (the mxfp8-quantized format is not registered)

Closes #1751 ([Feature]: Add CT export and vLLM inference support for FP8 KV cache and attention)

Copilot AI review requested due to automatic review settings April 28, 2026 11:50

Copilot AI left a comment

Pull request overview

This PR fixes and extends FP8_STATIC export to the llm-compressor / compressed-tensors config format by ensuring FP8 KV-cache metadata is preserved and by exporting an FP8 “static attention” config-group, with regression tests covering both behaviors.

Changes:

  • Add FP8 attention config_groups export (non-Linear targets) when static_attention_dtype requests FP8.
  • Preserve/emit kv_cache_scheme metadata for the FP8 KV-cache and FP8 attention export paths.
  • Add targeted CPU export tests validating the saved config.json metadata and reload behavior (a reload sketch follows this list).
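
As a sanity check of the reload path, a minimal sketch, assuming the installed compressed-tensors package exposes its pydantic QuantizationConfig model and that the export wrote a quantization_config block into config.json:

import json

from compressed_tensors.quantization import QuantizationConfig

# Re-parse the exported metadata with compressed-tensors' own config model.
with open("config.json") as f:
    qcfg = json.load(f)["quantization_config"]

parsed = QuantizationConfig.model_validate(qcfg)  # pydantic v2 API
assert parsed.kv_cache_scheme is not None  # FP8 KV metadata survived the round trip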

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.

Files changed:

  • test/test_cpu/export/test_llmc_format.py: adds regression tests for FP8 KV-cache scheme metadata and FP8 attention config-group export and reload.
  • auto_round/export/export_to_llmcompressor/export_to_static_fp.py: refactors FP8 QuantizationArgs construction, adds the attention config-group export, and adjusts when kv_cache_scheme is included.

Comment on lines +102 to +107
def _use_fp8_attention(static_attention_dtype: str | None) -> bool:
"""Return True if static attention should use FP8."""
if static_attention_dtype in ("fp8", "float8_e4m3fn"):
logger.warning_once("Exporting model with static attention in FP8 dtype.")
return True
return False

Copilot AI commented on Apr 28, 2026:

_use_fp8_attention (and similarly _use_fp8_kv) only matches string values, but the compressor config allows static_attention_dtype/static_kv_dtype to be passed as torch.dtype (e.g., torch.float8_e4m3fn). In that case this check will return False and the FP8 attention group / kv_cache_scheme metadata will be silently omitted from the exported config. Consider normalizing dtype inputs (accept torch.dtype values and/or convert to a canonical string) before the membership check so both str and torch.dtype are supported.
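
A minimal sketch of the normalization the review suggests; _normalize_dtype is a hypothetical helper, not code from this PR:

import torch

def _normalize_dtype(dtype) -> str | None:
    """Map str or torch.dtype inputs to a canonical string name (hypothetical helper)."""
    if dtype is None:
        return None
    if isinstance(dtype, torch.dtype):
        # str(torch.float8_e4m3fn) == "torch.float8_e4m3fn"
        return str(dtype).removeprefix("torch.")
    return str(dtype)

def _use_fp8_attention(static_attention_dtype) -> bool:
    """Accept both "fp8"/"float8_e4m3fn" strings and torch.float8_e4m3fn."""
    return _normalize_dtype(static_attention_dtype) in ("fp8", "float8_e4m3fn")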

@chensuyue (Contributor) commented:

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines commented:

Azure Pipelines successfully started running 1 pipeline(s).
