Fix FP8 CT export metadata for KV cache and attention #1752
Conversation
Pull request overview
This PR fixes and extends FP8_STATIC export to the llm-compressor / compressed-tensors config format by ensuring FP8 KV-cache metadata is preserved and by exporting an FP8 “static attention” config-group, with regression tests covering both behaviors.
Changes:
- Add FP8 attention `config_groups` export (non-Linear targets) when `static_attention_dtype` requests FP8.
- Preserve / emit `kv_cache_scheme` metadata for FP8 KV cache and FP8 attention export paths.
- Add targeted CPU export tests validating saved `config.json` metadata and reload behavior.
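To make the preserved metadata concrete, here is a minimal sketch of the kind of round-trip check the regression tests perform. The `config.json` field names follow common compressed-tensors conventions for `kv_cache_scheme` and are illustrative, not necessarily this PR's exact output.

```python
import json
import os
import tempfile

# Illustrative config.json fragment; field names follow compressed-tensors
# conventions and may differ from what the PR actually emits.
sample_config = {
    "quantization_config": {
        "format": "float-quantized",
        "kv_cache_scheme": {
            "num_bits": 8,
            "type": "float",
            "strategy": "tensor",
            "dynamic": False,
            "symmetric": True,
        },
    }
}

with tempfile.TemporaryDirectory() as d:
    path = os.path.join(d, "config.json")
    with open(path, "w") as f:
        json.dump(sample_config, f)
    # Reload the saved config and verify the FP8 KV-cache metadata
    # survived the export round trip.
    with open(path) as f:
        reloaded = json.load(f)
    scheme = reloaded["quantization_config"]["kv_cache_scheme"]
    assert scheme["type"] == "float" and scheme["num_bits"] == 8
```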
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 1 comment.
| File | Description |
|---|---|
| `test/test_cpu/export/test_llmc_format.py` | Adds regression tests for FP8 KV-cache scheme metadata and FP8 attention config-group export + reload. |
| `auto_round/export/export_to_llmcompressor/export_to_static_fp.py` | Refactors FP8 `QuantizationArgs` construction, adds attention config-group export, and adjusts when `kv_cache_scheme` is included. |
```python
def _use_fp8_attention(static_attention_dtype: str | None) -> bool:
    """Return True if static attention should use FP8."""
    if static_attention_dtype in ("fp8", "float8_e4m3fn"):
        logger.warning_once("Exporting model with static attention in FP8 dtype.")
        return True
    return False
```
`_use_fp8_attention` (and similarly `_use_fp8_kv`) only matches string values, but the compressor config allows `static_attention_dtype`/`static_kv_dtype` to be passed as a `torch.dtype` (e.g., `torch.float8_e4m3fn`). In that case this check will return False and the FP8 attention group / `kv_cache_scheme` metadata will be silently omitted from the exported config. Consider normalizing dtype inputs (accept `torch.dtype` values and/or convert to a canonical string) before the membership check so both `str` and `torch.dtype` are supported.
/azp run Unit-Test-CUDA-AutoRound

Azure Pipelines successfully started running 1 pipeline(s).
Summary
Verification
```
/home/yiliu4/workspace/ar/bin/python -m pytest test/test_cpu/export/test_llmc_format.py -k 'static_fp8_kv_config or static_fp8_attention_config or static_fp_export_packs_serially' -q -s
/home/yiliu4/workspace/ar/bin/python -m pytest test/test_cpu/export/test_llmc_format.py -q
```

(`mxfp8-quantized` not registered)

Closes #1751