Fix mixed-precision accuracy regression when AutoScheme runs with CPU offloading and Hadamard rotation enabled #1753
lvliang-intel wants to merge 4 commits into main from lvl/fix_mixed_acc_by_offload
Conversation
Signed-off-by: lvliang-intel <liang1.lv@intel.com>
…el/auto-round into lvl/fix_mixed_acc_by_offload
Pull request overview
Fixes a mixed-precision accuracy regression when AutoScheme runs with CPU offloading and Hadamard rotation enabled, by preserving rotation state on the root model and avoiding “clean-mode” reloads that would revert rotated weights.
Changes:
- Preserve `rotation_config` on the root module during quant-scheme cleanup and layer-config normalization.
- Switch AutoScheme low-CPU scoring to offload mode (with retained saved entries) when rotation is enabled, so rotated weights aren't overwritten by checkpoint reloads.
- Add an option to retain offloaded entries across repeated reload cycles.
Reviewed changes
Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.
| File | Description |
|---|---|
| auto_round/utils/offload.py | Adds retain_saved_entries and changes reload cleanup behavior in offload mode. |
| auto_round/compressors/utils.py | Avoids deleting root rotation_config during layer-config normalization cleanup. |
| auto_round/auto_scheme/utils.py | Avoids deleting root rotation_config when stripping quantization scheme attributes. |
| auto_round/auto_scheme/delta_loss.py | Chooses offload-mode (vs clean-mode) for rotated models during AutoScheme low-CPU scoring. |
```diff
 if self.mode == "offload":
     self._load_from_disk(name, module)
-    self._remove_saved_entry(name)
+    if not self.retain_saved_entries:
+        self._remove_saved_entry(name)
```
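As a hedged sketch of the behavior this change enables (a toy class with illustrative names, not the actual `auto_round.utils.offload.OffloadManager`), keeping the saved entry means a second reload cycle can restore the same snapshotted weights instead of failing or falling back to the checkpoint:

```python
# Toy sketch of the retain_saved_entries behavior; class and method names
# here are hypothetical stand-ins for the real offload manager.
class ToyOffloadManager:
    def __init__(self, retain_saved_entries=False):
        self.retain_saved_entries = retain_saved_entries
        self.saved = {}  # name -> weight snapshot (stands in for on-disk entries)

    def offload(self, name, weights):
        # Snapshot the current in-memory (possibly rotated) weights.
        self.saved[name] = list(weights)

    def reload(self, name):
        weights = self.saved[name]
        if not self.retain_saved_entries:
            # Old behavior: the entry is consumed, so a repeated reload
            # during AutoScheme scoring would no longer find it.
            del self.saved[name]
        return weights

mgr = ToyOffloadManager(retain_saved_entries=True)
mgr.offload("layer0", [1.0, 2.0])
first = mgr.reload("layer0")
second = mgr.reload("layer0")  # works only because the entry was retained
```

With `retain_saved_entries=False`, the second `reload` would raise `KeyError`, which mirrors why repeated reload cycles during scoring needed the new flag.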
```python
offload_mode = "clean"
offload_kwargs = {"model_dir": _model_dir}
# Rotation mutates weights in memory before AutoScheme starts. Clean-mode
# reloads from the original checkpoint and would silently discard those
# transformed weights during scoring and final restore.
if getattr(model, "rotation_config", None):
    offload_mode = "offload"
    offload_kwargs = {"offload_dir_prefix": "autoscheme", "retain_saved_entries": True}
offload_context = OffloadManager(enabled=True, mode=offload_mode, cache_numel=True, **offload_kwargs)
```
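The difference between the two modes can be illustrated with a toy example (plain lists standing in for module weights; the `rotate` helper is a hypothetical stand-in for the Hadamard transform, not the real implementation):

```python
# Toy illustration: clean mode restores from the original checkpoint and
# loses the in-memory rotation; offload mode snapshots the rotated weights.
checkpoint = {"w": [1.0, 2.0]}            # original (unrotated) weights on disk
model = {"w": list(checkpoint["w"])}      # weights loaded into memory

def rotate(weights):
    # Stand-in for a Hadamard rotation applied in memory before AutoScheme.
    return [-x for x in weights]

model["w"] = rotate(model["w"])

# Clean mode: reload from the original checkpoint -> rotation is discarded.
clean_restored = list(checkpoint["w"])

# Offload mode: snapshot the current (rotated) weights, then restore them.
offload_dir = {"w": list(model["w"])}
offload_restored = list(offload_dir["w"])
```

Here `clean_restored` holds the unrotated `[1.0, 2.0]` while `offload_restored` preserves the rotated values, which is exactly the accuracy regression the mode switch avoids for rotated models.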
```python
    model_dir: Optional[str] = None,
    offload_dir_prefix: str = "ar_offload",
    cache_numel: bool = False,
    retain_saved_entries: bool = False,
):
```
Test Result: `CUDA_VISIBLE_DEVICES=6 auto_round /mnt/disk2/lvl/Llama-3.1-8B --options "INT4,INT8" --target_bits 5 --rotation_type "hadamard" --tasks piqa --iters 1 --format fake --enable_alg_ext --output_dir ./tmp_llama_mixed`
evaluation running time=84s
/azp run Unit-Test-CUDA-AutoRound
Azure Pipelines successfully started running 1 pipeline(s).
Description
Fix mixed-precision accuracy regression when AutoScheme runs with CPU offloading and Hadamard rotation enabled.
This PR preserves the root model's `rotation_config` during scheme cleanup and layer-config normalization, and updates AutoScheme offloading to use offload mode for rotated models instead of reloading unrotated checkpoint weights via clean mode. It also keeps offloaded temporary entries reusable across repeated reloads during AutoScheme scoring.
Type of Change
Bug fix
Related Issues
#1742
Checklist Before Submitting