
Fix mixed-precision accuracy regression when AutoScheme runs with CPU offloading and Hadamard rotation enabled #1753

Open

lvliang-intel wants to merge 4 commits into main from lvl/fix_mixed_acc_by_offload

Conversation

@lvliang-intel
Contributor

Description

Fix mixed-precision accuracy regression when AutoScheme runs with CPU offloading and Hadamard rotation enabled.

This PR preserves the root model’s rotation_config during scheme cleanup and layer-config normalization, and updates AutoScheme offloading to use offload mode for rotated models instead of reloading unrotated checkpoint weights via clean mode. It also keeps offloaded temporary entries reusable across repeated reloads during AutoScheme scoring.
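
For illustration, a minimal sketch of the cleanup rule (the helper name and attribute list are hypothetical, not this PR's exact code): the root module is the only one whose `rotation_config` must survive attribute stripping.

```python
# Hypothetical sketch, not this PR's exact code: strip temporary per-module
# quant-scheme attributes after AutoScheme, but keep rotation_config on the
# root model so later stages still know the weights were Hadamard-rotated.
import torch.nn as nn

SCHEME_ATTRS = ("scheme", "bits", "group_size", "rotation_config")  # illustrative

def clear_scheme_attrs(model: nn.Module) -> None:
    for name, module in model.named_modules():
        for attr in SCHEME_ATTRS:
            # named_modules() yields the root as ("", model); its
            # rotation_config is the record that must not be deleted.
            if name == "" and attr == "rotation_config":
                continue
            if hasattr(module, attr):
                delattr(module, attr)
```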

Type of Change

Bug fix

Related Issues

#1742

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.
  • The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Copilot AI review requested due to automatic review settings April 28, 2026 13:24
@lvliang-intel lvliang-intel changed the title from "Lvl/fix mixed acc by offload" to "Fix mixed-precision accuracy regression when AutoScheme runs with CPU offloading and Hadamard rotation enabled" on Apr 28, 2026
Contributor

Copilot AI left a comment

Pull request overview

Fixes a mixed-precision accuracy regression when AutoScheme runs with CPU offloading and Hadamard rotation enabled, by preserving rotation state on the root model and avoiding “clean-mode” reloads that would revert rotated weights.

Changes:

  • Preserve rotation_config on the root module during quant-scheme cleanup and layer-config normalization.
  • Switch AutoScheme low-CPU scoring to use offload-mode (with retained saved entries) when rotation is enabled, so rotated weights aren’t overwritten by checkpoint reloads.
  • Add an option to retain offloaded entries across repeated reload cycles.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| auto_round/utils/offload.py | Adds `retain_saved_entries` and changes reload cleanup behavior in offload mode. |
| auto_round/compressors/utils.py | Avoids deleting the root `rotation_config` during layer-config normalization cleanup. |
| auto_round/auto_scheme/utils.py | Avoids deleting the root `rotation_config` when stripping quantization-scheme attributes. |
| auto_round/auto_scheme/delta_loss.py | Chooses offload mode (vs. clean mode) for rotated models during AutoScheme low-CPU scoring. |

Comment on lines 503 to +506 (auto_round/utils/offload.py):

```diff
 if self.mode == "offload":
     self._load_from_disk(name, module)
-    self._remove_saved_entry(name)
+    if not self.retain_saved_entries:
+        self._remove_saved_entry(name)
```
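
To make the behavioral change concrete, here is a self-contained toy (an assumption-level re-implementation, not the actual `OffloadManager`) showing why a retained entry matters once the same weights must be reloaded more than once during scoring:

```python
import os
import tempfile

import torch

# Toy model of the retain_saved_entries semantics (an assumption, not the
# PR's OffloadManager): offloaded tensors stay on disk so repeated reload
# cycles keep seeing the same (e.g. rotated) weights.
class ToyOffloader:
    def __init__(self, retain_saved_entries: bool = False):
        self.retain_saved_entries = retain_saved_entries
        self.dir = tempfile.mkdtemp(prefix="autoscheme_")
        self.saved = {}

    def offload(self, name: str, tensor: torch.Tensor) -> None:
        path = os.path.join(self.dir, f"{name}.pt")
        torch.save(tensor, path)
        self.saved[name] = path

    def reload(self, name: str) -> torch.Tensor:
        tensor = torch.load(self.saved[name])
        if not self.retain_saved_entries:
            # Old behavior: the entry is single-use, so a second reload
            # of the same name would fail with a KeyError.
            os.remove(self.saved.pop(name))
        return tensor

off = ToyOffloader(retain_saved_entries=True)
off.offload("layer0.weight", torch.randn(4, 4))
w1 = off.reload("layer0.weight")  # first AutoScheme scoring pass
w2 = off.reload("layer0.weight")  # entry retained, still reusable
assert torch.equal(w1, w2)
```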
Comments on lines +992 to +1000, +994 to +1000, and +993 to 1001 (auto_round/auto_scheme/delta_loss.py; overlapping range shown once):

```python
offload_mode = "clean"
offload_kwargs = {"model_dir": _model_dir}
# Rotation mutates weights in memory before AutoScheme starts. Clean-mode
# reloads from the original checkpoint and would silently discard those
# transformed weights during scoring and final restore.
if getattr(model, "rotation_config", None):
    offload_mode = "offload"
    offload_kwargs = {"offload_dir_prefix": "autoscheme", "retain_saved_entries": True}
offload_context = OffloadManager(enabled=True, mode=offload_mode, cache_numel=True, **offload_kwargs)
```
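
The trade-off here: clean mode re-materializes weights from the original checkpoint directory, which costs no extra disk space but silently assumes the on-disk weights still match the in-memory ones; offload mode round-trips the live tensors through a scratch directory (the autoscheme tempdir visible in the test log below), paying some disk I/O to keep the Hadamard-rotated weights intact.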

Comment on lines 272 to 276 (auto_round/utils/offload.py):

```python
    model_dir: Optional[str] = None,
    offload_dir_prefix: str = "ar_offload",
    cache_numel: bool = False,
    retain_saved_entries: bool = False,
):
```
@lvliang-intel
Contributor Author

Test Result:

CUDA_VISIBLE_DEVICES=6 auto_round /mnt/disk2/lvl/Llama-3.1-8B --options "INT4,INT8" --target_bits 5 --rotation_type "hadamard" --tasks piqa --iters 1 --format fake --enable_alg_ext --output_dir ./tmp_llama_mixed
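
(This command exercises exactly the configuration that regressed: Hadamard rotation via `--rotation_type "hadamard"` combined with AutoScheme mixed-precision search via `--options "INT4,INT8" --target_bits 5`. In the log below, the `OffloadManager (autoscheme): tempdir` line shows the rotated model taking the retained offload path instead of clean-mode checkpoint reloads.)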
2026-04-28 20:46:18 INFO main.py L610: start to quantize /mnt/disk2/lvl/Llama-3.1-8B
Loading weights: 100%|███████████████████████████████████████████| 291/291 [00:00<00:00, 11295.15it/s]
2026-04-28 20:46:19 INFO common.py L364: _patch_mimo_attention_forward called for LlamaForCausalLM
2026-04-28 20:46:19 INFO common.py L367: Skipping patch: not a MiMo model (class name: LlamaForCausalLM)
2026-04-28 20:46:19 INFO base.py L526: using torch.bfloat16 for quantization tuning
2026-04-28 20:46:19 WARNING base.py L1020: activation quantization is an experimental feature with limited support and a complex API. And please save the quantized model to fake format as real deployment is not supported currently
2026-04-28 20:46:19 INFO base.py L562: using algorithm extension for quantization.
2026-04-28 20:46:19 WARNING alg_ext.py L48: algorithm extension has only undergone limited validation on W2A16,INT4, MXFP4 and NVFP4; use with caution.
2026-04-28 20:46:19 INFO apply_rotation_transform.py L120: Applying Hadamard (backend=inplace, data_type=int, fuse_online_to_weight=None).
2026-04-28 20:46:19 WARNING apply_rotation_transform.py L126: this backend does not support real exporting, please export the model to fake format
Rotating: 100%|████████████████████████████████████████████████████| 32/32 [00:38<00:00, 1.19s/layer]
2026-04-28 20:47:08 WARNING modeling_utils.py L4460: loss_type=None was set in the config but it is unrecognized. Using the default loss: ForCausalLMLoss.
2026-04-28 20:47:08 INFO gen_auto_scheme.py L200: AutoScheme option INT4 -> avg_bits=4.0039
2026-04-28 20:47:08 INFO gen_auto_scheme.py L200: AutoScheme option INT8 -> avg_bits=8.0047
2026-04-28 20:47:08 INFO gen_auto_scheme.py L94: Average bits range: [4.004, 8.005], target = 5.000
2026-04-28 20:47:08 INFO offload.py L542: clearing module weights to free RAM...
2026-04-28 20:47:08 INFO offload.py L706: OffloadManager (autoscheme): tempdir = ar_work_space/offload/autoscheme_te8ecrvi
2026-04-28 20:47:20 INFO offload.py L549: module weights cleared
Generating AutoScheme: 0%| | 0/256 [00:00<?, ?it/s]2026-04-28 20:47:45 INFO delta_loss.py L638: AutoScheme: disabled requires_grad on 66 non-wrapper parameters (only wrapper.orig_layer.weight needs grad for scoring; saves ~one model-worth of grad buffer during backward).
2026-04-28 20:47:45 INFO calib_dataset.py L912: Preprocessing calibration dataset in a subprocess to avoid memory leaks...
Generating AutoScheme: 12%|█████▏ | 32/256 [01:16<01:35, 2.36it/s]/mnt/disk1/lvl/conda_envs/artest/lib/python3.11/site-packages/torch/autograd/graph.py:865: UserWarning: Flash Attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:114.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Generating AutoScheme: 50%|████████████████████ | 128/256 [01:58<00:58, 2.18it/s]2026-04-28 20:49:42 INFO calib_dataset.py L912: Preprocessing calibration dataset in a subprocess to avoid memory leaks...
Generating AutoScheme: 100%|████████████████████████████████████████| 256/256 [03:45<00:00, 2.10it/s]2026-04-28 20:51:12 INFO device.py L1764: AutoScheme complete (low_cpu_mem_usage=enabled) 'peak_ram': 31.76GB, 'peak_vram': 5.6GB
Generating AutoScheme: 100%|████████████████████████████████████████| 256/256 [03:52<00:00, 1.10it/s]
2026-04-28 20:51:13 INFO base.py L1966: start to cache block inputs
2026-04-28 20:51:13 INFO calib_dataset.py L912: Preprocessing calibration dataset in a subprocess to avoid memory leaks...
2026-04-28 20:51:44 INFO base.py L1983: caching done
Quantizing model.layers.0: 0%| | 0/32 [00:00<?, ?it/s]quantized 7/7 layers in the block, loss iter 0: 0.000003 -> iter 0: 0.000003,'peak_ram': 31.76GB, 'peak_vram': 18.95GB
Quantizing model.layers.1: 3%|█▏ | 1/32 [00:05<02:56, 5.70s/it]quantized 7/7 layers in the block, loss iter 0: 0.000014 -> iter 0: 0.000014,'peak_ram': 31.76GB, 'peak_vram': 19.37GB
Quantizing model.layers.2: 6%|██▍ | 2/32 [00:10<02:43, 5.46s/it]quantized 7/7 layers in the block, loss iter 0: 0.000044 -> iter 0: 0.000044,'peak_ram': 31.76GB, 'peak_vram': 19.76GB
Quantizing model.layers.3: 9%|███▋ | 3/32 [00:16<02:35, 5.36s/it]quantized 7/7 layers in the block, loss iter 0: 0.000080 -> iter 0: 0.000080,'peak_ram': 31.76GB, 'peak_vram': 19.76GB
Quantizing model.layers.4: 12%|████▉ | 4/32 [00:21<02:28, 5.31s/it]quantized 7/7 layers in the block, loss iter 0: 0.000120 -> iter 0: 0.000120,'peak_ram': 31.76GB, 'peak_vram': 19.95GB
Quantizing model.layers.5: 16%|██████ | 5/32 [00:26<02:22, 5.29s/it]quantized 7/7 layers in the block, loss iter 0: 0.000177 -> iter 0: 0.000177,'peak_ram': 31.76GB, 'peak_vram': 19.95GB
Quantizing model.layers.6: 19%|███████▎ | 6/32 [00:31<02:17, 5.28s/it]quantized 7/7 layers in the block, loss iter 0: 0.000299 -> iter 0: 0.000299,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.7: 22%|████████▌ | 7/32 [00:37<02:11, 5.27s/it]quantized 7/7 layers in the block, loss iter 0: 0.000458 -> iter 0: 0.000458,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.8: 25%|█████████▊ | 8/32 [00:42<02:06, 5.27s/it]quantized 7/7 layers in the block, loss iter 0: 0.000539 -> iter 0: 0.000539,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.9: 28%|██████████▉ | 9/32 [00:47<02:01, 5.28s/it]quantized 7/7 layers in the block, loss iter 0: 0.000726 -> iter 0: 0.000726,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.10: 31%|███████████▌ | 10/32 [00:53<01:56, 5.28s/it]quantized 7/7 layers in the block, loss iter 0: 0.000888 -> iter 0: 0.000888,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.11: 34%|████████████▋ | 11/32 [00:58<01:51, 5.29s/it]quantized 7/7 layers in the block, loss iter 0: 0.001166 -> iter 0: 0.001166,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.12: 38%|█████████████▉ | 12/32 [01:03<01:45, 5.29s/it]quantized 7/7 layers in the block, loss iter 0: 0.001506 -> iter 0: 0.001506,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.13: 41%|███████████████ | 13/32 [01:09<01:40, 5.31s/it]quantized 7/7 layers in the block, loss iter 0: 0.001694 -> iter 0: 0.001694,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.14: 44%|████████████████▏ | 14/32 [01:14<01:35, 5.32s/it]quantized 7/7 layers in the block, loss iter 0: 0.002088 -> iter 0: 0.002088,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.15: 47%|█████████████████▎ | 15/32 [01:19<01:30, 5.32s/it]quantized 7/7 layers in the block, loss iter 0: 0.002292 -> iter 0: 0.002292,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.16: 50%|██████████████████▌ | 16/32 [01:25<01:25, 5.33s/it]quantized 7/7 layers in the block, loss iter 0: 0.002952 -> iter 0: 0.002952,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.17: 53%|███████████████████▋ | 17/32 [01:30<01:20, 5.34s/it]quantized 7/7 layers in the block, loss iter 0: 0.003390 -> iter 0: 0.003390,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.18: 56%|████████████████████▊ | 18/32 [01:35<01:14, 5.34s/it]quantized 7/7 layers in the block, loss iter 0: 0.004035 -> iter 0: 0.004035,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.19: 59%|█████████████████████▉ | 19/32 [01:41<01:09, 5.35s/it]quantized 7/7 layers in the block, loss iter 0: 0.005250 -> iter 0: 0.005250,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.20: 62%|███████████████████████▏ | 20/32 [01:46<01:04, 5.36s/it]quantized 7/7 layers in the block, loss iter 0: 0.006056 -> iter 0: 0.006056,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.21: 66%|████████████████████████▎ | 21/32 [01:51<00:59, 5.36s/it]quantized 7/7 layers in the block, loss iter 0: 0.007800 -> iter 0: 0.007800,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.22: 69%|█████████████████████████▍ | 22/32 [01:57<00:53, 5.38s/it]quantized 7/7 layers in the block, loss iter 0: 0.009932 -> iter 0: 0.009932,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.23: 72%|██████████████████████████▌ | 23/32 [02:02<00:48, 5.38s/it]quantized 7/7 layers in the block, loss iter 0: 0.011273 -> iter 0: 0.011273,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.24: 75%|███████████████████████████▊ | 24/32 [02:08<00:43, 5.38s/it]quantized 7/7 layers in the block, loss iter 0: 0.013322 -> iter 0: 0.013322,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.25: 78%|████████████████████████████▉ | 25/32 [02:13<00:37, 5.38s/it]quantized 7/7 layers in the block, loss iter 0: 0.016669 -> iter 0: 0.016669,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.26: 81%|██████████████████████████████ | 26/32 [02:18<00:32, 5.39s/it]quantized 7/7 layers in the block, loss iter 0: 0.020617 -> iter 0: 0.020617,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.27: 84%|███████████████████████████████▏ | 27/32 [02:24<00:26, 5.40s/it]quantized 7/7 layers in the block, loss iter 0: 0.028186 -> iter 0: 0.028186,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.28: 88%|████████████████████████████████▍ | 28/32 [02:29<00:21, 5.40s/it]quantized 7/7 layers in the block, loss iter 0: 0.033418 -> iter 0: 0.033418,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.29: 91%|█████████████████████████████████▌ | 29/32 [02:35<00:16, 5.41s/it]quantized 7/7 layers in the block, loss iter 0: 0.045059 -> iter 0: 0.045059,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.30: 94%|██████████████████████████████████▋ | 30/32 [02:40<00:10, 5.40s/it]quantized 7/7 layers in the block, loss iter 0: 0.067032 -> iter 0: 0.067032,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.31: 97%|███████████████████████████████████▊ | 31/32 [02:45<00:05, 5.40s/it]quantized 7/7 layers in the block, loss iter 0: 0.118175 -> iter 0: 0.118175,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing done: 100%|████████████████████████████████████████████████| 32/32 [02:51<00:00, 5.36s/it]
2026-04-28 20:54:35 INFO device.py L1766: 'peak_ram': 31.76GB, 'peak_vram': 20.32GB
2026-04-28 20:54:35 INFO base.py L2042: quantization tuning time 171.364164352417
2026-04-28 20:54:35 INFO base.py L2061: Summary: quantized 224/225 in the model, unquantized layers: lm_head
2026-04-28 20:54:35 WARNING base.py L3584: Support for exporting activation quantization is limited. Please ensure that your configuration is supported.
Writing model shards: 100%|█████████████████████████████████████████████| 1/1 [00:24<00:00, 24.28s/it]
2026-04-28 20:54:59 INFO device.py L1766: 'peak_ram': 31.76GB, 'peak_vram': 20.32GB
2026-04-28 20:54:59 INFO evaluation.py L420: Using lm-eval version 0.4.11.dev0
2026-04-28 20:54:59 WARNING evaluation.py L424: set add_bos_token=True for llama model.
2026-04-28 20:55:02 WARNING evaluation.py L282: This API does not support auto currently, reset eval_bs to 16
pretrained model kwarg is not of type str. Many other model arguments may be ignored. Please do not launch via accelerate or use parallelize=True if passing an existing model this way.
Passed an already-initialized model through pretrained, assuming single-process call to evaluate() or custom distributed integration
100%|███████████████████████████████████████████████████████████| 1838/1838 [00:01<00:00, 1126.38it/s]
Running loglikelihood requests: 100%|█████████████████████████████| 3676/3676 [00:45<00:00, 80.76it/s]

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
| --- | --- | --- | --- | --- | --- | --- |
| piqa | 1 | none | 0 | acc | 0.7807 | ± 0.0097 |
|  |  | none | 0 | acc_norm | 0.7965 | ± 0.0094 |

evaluation running time=84s

@lvliang-intel
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
