
Fix mixed-precision accuracy regression when AutoScheme runs with CPU offloading and Hadamard rotation enabled #1753

Open

lvliang-intel wants to merge 4 commits into main from lvl/fix_mixed_acc_by_offload

Conversation

@lvliang-intel
Contributor

Description

Fix mixed-precision accuracy regression when AutoScheme runs with CPU offloading and Hadamard rotation enabled.

This PR preserves the root model’s rotation_config during scheme cleanup and layer-config normalization, and updates AutoScheme offloading to use offload mode for rotated models instead of reloading unrotated checkpoint weights via clean mode. It also keeps offloaded temporary entries reusable across repeated reloads during AutoScheme scoring.
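
For illustration, a minimal sketch of the cleanup rule (the helper name and attribute list are hypothetical, not this PR's exact code): the root module is the only one whose `rotation_config` must survive attribute stripping.

```python
# Hypothetical sketch, not this PR's exact code: strip temporary per-module
# quant-scheme attributes after AutoScheme, but keep rotation_config on the
# root model so later stages still know the weights were Hadamard-rotated.
import torch.nn as nn

SCHEME_ATTRS = ("scheme", "bits", "group_size", "rotation_config")  # illustrative

def clear_scheme_attrs(model: nn.Module) -> None:
    for name, module in model.named_modules():
        for attr in SCHEME_ATTRS:
            # named_modules() yields the root as ("", model); its
            # rotation_config is the record that must not be deleted.
            if name == "" and attr == "rotation_config":
                continue
            if hasattr(module, attr):
                delattr(module, attr)
```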

Type of Change

Bug fix

Related Issues

#1742

Checklist Before Submitting

  • My code has been tested locally.
  • Documentation has been updated as needed.
  • New or updated tests are included where applicable.
  • The CUDA CI has passed. You can trigger it by commenting /azp run Unit-Test-CUDA-AutoRound.

Signed-off-by: lvliang-intel <liang1.lv@intel.com>
Copilot AI review requested due to automatic review settings April 28, 2026 13:24
@lvliang-intel lvliang-intel changed the title from "Lvl/fix mixed acc by offload" to "Fix mixed-precision accuracy regression when AutoScheme runs with CPU offloading and Hadamard rotation enabled" on Apr 28, 2026
Contributor

Copilot AI left a comment

Pull request overview

Fixes a mixed-precision accuracy regression when AutoScheme runs with CPU offloading and Hadamard rotation enabled, by preserving rotation state on the root model and avoiding “clean-mode” reloads that would revert rotated weights.

Changes:

  • Preserve rotation_config on the root module during quant-scheme cleanup and layer-config normalization.
  • Switch AutoScheme low-CPU scoring to use offload-mode (with retained saved entries) when rotation is enabled, so rotated weights aren’t overwritten by checkpoint reloads.
  • Add an option to retain offloaded entries across repeated reload cycles.

Reviewed changes

Copilot reviewed 4 out of 4 changed files in this pull request and generated 5 comments.

| File | Description |
| --- | --- |
| auto_round/utils/offload.py | Adds `retain_saved_entries` and changes reload cleanup behavior in offload mode. |
| auto_round/compressors/utils.py | Avoids deleting the root `rotation_config` during layer-config normalization cleanup. |
| auto_round/auto_scheme/utils.py | Avoids deleting the root `rotation_config` when stripping quantization-scheme attributes. |
| auto_round/auto_scheme/delta_loss.py | Chooses offload mode (vs. clean mode) for rotated models during AutoScheme low-CPU scoring. |

Comment on lines 503 to +506 (auto_round/utils/offload.py):

```diff
 if self.mode == "offload":
     self._load_from_disk(name, module)
-    self._remove_saved_entry(name)
+    if not self.retain_saved_entries:
+        self._remove_saved_entry(name)
```
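
To make the behavioral change concrete, here is a self-contained toy (an assumption-level re-implementation, not the actual `OffloadManager`) showing why a retained entry matters once the same weights must be reloaded more than once during scoring:

```python
import os
import tempfile

import torch

# Toy model of the retain_saved_entries semantics (an assumption, not the
# PR's OffloadManager): offloaded tensors stay on disk so repeated reload
# cycles keep seeing the same (e.g. rotated) weights.
class ToyOffloader:
    def __init__(self, retain_saved_entries: bool = False):
        self.retain_saved_entries = retain_saved_entries
        self.dir = tempfile.mkdtemp(prefix="autoscheme_")
        self.saved = {}

    def offload(self, name: str, tensor: torch.Tensor) -> None:
        path = os.path.join(self.dir, f"{name}.pt")
        torch.save(tensor, path)
        self.saved[name] = path

    def reload(self, name: str) -> torch.Tensor:
        tensor = torch.load(self.saved[name])
        if not self.retain_saved_entries:
            # Old behavior: the entry is single-use, so a second reload
            # of the same name would fail with a KeyError.
            os.remove(self.saved.pop(name))
        return tensor

off = ToyOffloader(retain_saved_entries=True)
off.offload("layer0.weight", torch.randn(4, 4))
w1 = off.reload("layer0.weight")  # first AutoScheme scoring pass
w2 = off.reload("layer0.weight")  # entry retained, still reusable
assert torch.equal(w1, w2)
```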
Comments on lines +992 to +1000, +994 to +1000, and +993 to 1001 (auto_round/auto_scheme/delta_loss.py; overlapping range shown once):

```python
offload_mode = "clean"
offload_kwargs = {"model_dir": _model_dir}
# Rotation mutates weights in memory before AutoScheme starts. Clean-mode
# reloads from the original checkpoint and would silently discard those
# transformed weights during scoring and final restore.
if getattr(model, "rotation_config", None):
    offload_mode = "offload"
    offload_kwargs = {"offload_dir_prefix": "autoscheme", "retain_saved_entries": True}
offload_context = OffloadManager(enabled=True, mode=offload_mode, cache_numel=True, **offload_kwargs)
```
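
The trade-off here: clean mode re-materializes weights from the original checkpoint directory, which costs no extra disk space but silently assumes the on-disk weights still match the in-memory ones; offload mode round-trips the live tensors through a scratch directory (the autoscheme tempdir visible in the test log below), paying some disk I/O to keep the Hadamard-rotated weights intact.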

Comment on lines 272 to 276 (auto_round/utils/offload.py):

```python
    model_dir: Optional[str] = None,
    offload_dir_prefix: str = "ar_offload",
    cache_numel: bool = False,
    retain_saved_entries: bool = False,
):
```
@lvliang-intel
Contributor Author

Test Result:

CUDA_VISIBLE_DEVICES=6 auto_round /mnt/disk2/lvl/Llama-3.1-8B --options "INT4,INT8" --target_bits 5 --rotation_type "hadamard" --tasks piqa --iters 1 --format fake --enable_alg_ext --output_dir ./tmp_llama_mixed
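
(This command exercises exactly the configuration that regressed: Hadamard rotation via `--rotation_type "hadamard"` combined with AutoScheme mixed-precision search via `--options "INT4,INT8" --target_bits 5`. In the log below, the `OffloadManager (autoscheme): tempdir` line shows the rotated model taking the retained offload path instead of clean-mode checkpoint reloads.)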
2026-04-28 20:46:18 INFO main.py L610: start to quantize /mnt/disk2/lvl/Llama-3.1-8B
Loading weights: 100%|███████████████████████████████████████████| 291/291 [00:00<00:00, 11295.15it/s]
2026-04-28 20:46:19 INFO common.py L364: _patch_mimo_attention_forward called for LlamaForCausalLM
2026-04-28 20:46:19 INFO common.py L367: Skipping patch: not a MiMo model (class name: LlamaForCausalLM)
2026-04-28 20:46:19 INFO base.py L526: using torch.bfloat16 for quantization tuning
2026-04-28 20:46:19 WARNING base.py L1020: activation quantization is an experimental feature with limited support and a complex API. And please save the quantized model to fake format as real deployment is not supported currently
2026-04-28 20:46:19 INFO base.py L562: using algorithm extension for quantization.
2026-04-28 20:46:19 WARNING alg_ext.py L48: algorithm extension has only undergone limited validation on W2A16,INT4, MXFP4 and NVFP4; use with caution.
2026-04-28 20:46:19 INFO apply_rotation_transform.py L120: Applying Hadamard (backend=inplace, data_type=int, fuse_online_to_weight=None).
2026-04-28 20:46:19 WARNING apply_rotation_transform.py L126: this backend does not support real exporting, please export the model to fake format
Rotating: 100%|████████████████████████████████████████████████████| 32/32 [00:38<00:00, 1.19s/layer]
2026-04-28 20:47:08 WARNING modeling_utils.py L4460: loss_type=None was set in the config but it is unrecognized. Using the default loss: ForCausalLMLoss.
2026-04-28 20:47:08 INFO gen_auto_scheme.py L200: AutoScheme option INT4 -> avg_bits=4.0039
2026-04-28 20:47:08 INFO gen_auto_scheme.py L200: AutoScheme option INT8 -> avg_bits=8.0047
2026-04-28 20:47:08 INFO gen_auto_scheme.py L94: Average bits range: [4.004, 8.005], target = 5.000
2026-04-28 20:47:08 INFO offload.py L542: clearing module weights to free RAM...
2026-04-28 20:47:08 INFO offload.py L706: OffloadManager (autoscheme): tempdir = ar_work_space/offload/autoscheme_te8ecrvi
2026-04-28 20:47:20 INFO offload.py L549: module weights cleared
Generating AutoScheme: 0%| | 0/256 [00:00<?, ?it/s]2026-04-28 20:47:45 INFO delta_loss.py L638: AutoScheme: disabled requires_grad on 66 non-wrapper parameters (only wrapper.orig_layer.weight needs grad for scoring; saves ~one model-worth of grad buffer during backward).
2026-04-28 20:47:45 INFO calib_dataset.py L912: Preprocessing calibration dataset in a subprocess to avoid memory leaks...
Generating AutoScheme: 12%|█████▏ | 32/256 [01:16<01:35, 2.36it/s]/mnt/disk1/lvl/conda_envs/artest/lib/python3.11/site-packages/torch/autograd/graph.py:865: UserWarning: Flash Attention defaults to a non-deterministic algorithm. To explicitly enable determinism call torch.use_deterministic_algorithms(True, warn_only=False). (Triggered internally at /pytorch/aten/src/ATen/native/transformers/cuda/attention_backward.cu:114.)
return Variable._execution_engine.run_backward( # Calls into the C++ engine to run the backward pass
Generating AutoScheme: 50%|████████████████████ | 128/256 [01:58<00:58, 2.18it/s]2026-04-28 20:49:42 INFO calib_dataset.py L912: Preprocessing calibration dataset in a subprocess to avoid memory leaks...
Generating AutoScheme: 100%|████████████████████████████████████████| 256/256 [03:45<00:00, 2.10it/s]2026-04-28 20:51:12 INFO device.py L1764: AutoScheme complete (low_cpu_mem_usage=enabled) 'peak_ram': 31.76GB, 'peak_vram': 5.6GB
Generating AutoScheme: 100%|████████████████████████████████████████| 256/256 [03:52<00:00, 1.10it/s]
2026-04-28 20:51:13 INFO base.py L1966: start to cache block inputs
2026-04-28 20:51:13 INFO calib_dataset.py L912: Preprocessing calibration dataset in a subprocess to avoid memory leaks...
2026-04-28 20:51:44 INFO base.py L1983: caching done
Quantizing model.layers.0: 0%| | 0/32 [00:00<?, ?it/s]quantized 7/7 layers in the block, loss iter 0: 0.000003 -> iter 0: 0.000003,'peak_ram': 31.76GB, 'peak_vram': 18.95GB
Quantizing model.layers.1: 3%|█▏ | 1/32 [00:05<02:56, 5.70s/it]quantized 7/7 layers in the block, loss iter 0: 0.000014 -> iter 0: 0.000014,'peak_ram': 31.76GB, 'peak_vram': 19.37GB
Quantizing model.layers.2: 6%|██▍ | 2/32 [00:10<02:43, 5.46s/it]quantized 7/7 layers in the block, loss iter 0: 0.000044 -> iter 0: 0.000044,'peak_ram': 31.76GB, 'peak_vram': 19.76GB
Quantizing model.layers.3: 9%|███▋ | 3/32 [00:16<02:35, 5.36s/it]quantized 7/7 layers in the block, loss iter 0: 0.000080 -> iter 0: 0.000080,'peak_ram': 31.76GB, 'peak_vram': 19.76GB
Quantizing model.layers.4: 12%|████▉ | 4/32 [00:21<02:28, 5.31s/it]quantized 7/7 layers in the block, loss iter 0: 0.000120 -> iter 0: 0.000120,'peak_ram': 31.76GB, 'peak_vram': 19.95GB
Quantizing model.layers.5: 16%|██████ | 5/32 [00:26<02:22, 5.29s/it]quantized 7/7 layers in the block, loss iter 0: 0.000177 -> iter 0: 0.000177,'peak_ram': 31.76GB, 'peak_vram': 19.95GB
Quantizing model.layers.6: 19%|███████▎ | 6/32 [00:31<02:17, 5.28s/it]quantized 7/7 layers in the block, loss iter 0: 0.000299 -> iter 0: 0.000299,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.7: 22%|████████▌ | 7/32 [00:37<02:11, 5.27s/it]quantized 7/7 layers in the block, loss iter 0: 0.000458 -> iter 0: 0.000458,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.8: 25%|█████████▊ | 8/32 [00:42<02:06, 5.27s/it]quantized 7/7 layers in the block, loss iter 0: 0.000539 -> iter 0: 0.000539,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.9: 28%|██████████▉ | 9/32 [00:47<02:01, 5.28s/it]quantized 7/7 layers in the block, loss iter 0: 0.000726 -> iter 0: 0.000726,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.10: 31%|███████████▌ | 10/32 [00:53<01:56, 5.28s/it]quantized 7/7 layers in the block, loss iter 0: 0.000888 -> iter 0: 0.000888,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.11: 34%|████████████▋ | 11/32 [00:58<01:51, 5.29s/it]quantized 7/7 layers in the block, loss iter 0: 0.001166 -> iter 0: 0.001166,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.12: 38%|█████████████▉ | 12/32 [01:03<01:45, 5.29s/it]quantized 7/7 layers in the block, loss iter 0: 0.001506 -> iter 0: 0.001506,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.13: 41%|███████████████ | 13/32 [01:09<01:40, 5.31s/it]quantized 7/7 layers in the block, loss iter 0: 0.001694 -> iter 0: 0.001694,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.14: 44%|████████████████▏ | 14/32 [01:14<01:35, 5.32s/it]quantized 7/7 layers in the block, loss iter 0: 0.002088 -> iter 0: 0.002088,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.15: 47%|█████████████████▎ | 15/32 [01:19<01:30, 5.32s/it]quantized 7/7 layers in the block, loss iter 0: 0.002292 -> iter 0: 0.002292,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.16: 50%|██████████████████▌ | 16/32 [01:25<01:25, 5.33s/it]quantized 7/7 layers in the block, loss iter 0: 0.002952 -> iter 0: 0.002952,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.17: 53%|███████████████████▋ | 17/32 [01:30<01:20, 5.34s/it]quantized 7/7 layers in the block, loss iter 0: 0.003390 -> iter 0: 0.003390,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.18: 56%|████████████████████▊ | 18/32 [01:35<01:14, 5.34s/it]quantized 7/7 layers in the block, loss iter 0: 0.004035 -> iter 0: 0.004035,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.19: 59%|█████████████████████▉ | 19/32 [01:41<01:09, 5.35s/it]quantized 7/7 layers in the block, loss iter 0: 0.005250 -> iter 0: 0.005250,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.20: 62%|███████████████████████▏ | 20/32 [01:46<01:04, 5.36s/it]quantized 7/7 layers in the block, loss iter 0: 0.006056 -> iter 0: 0.006056,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.21: 66%|████████████████████████▎ | 21/32 [01:51<00:59, 5.36s/it]quantized 7/7 layers in the block, loss iter 0: 0.007800 -> iter 0: 0.007800,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.22: 69%|█████████████████████████▍ | 22/32 [01:57<00:53, 5.38s/it]quantized 7/7 layers in the block, loss iter 0: 0.009932 -> iter 0: 0.009932,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.23: 72%|██████████████████████████▌ | 23/32 [02:02<00:48, 5.38s/it]quantized 7/7 layers in the block, loss iter 0: 0.011273 -> iter 0: 0.011273,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.24: 75%|███████████████████████████▊ | 24/32 [02:08<00:43, 5.38s/it]quantized 7/7 layers in the block, loss iter 0: 0.013322 -> iter 0: 0.013322,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.25: 78%|████████████████████████████▉ | 25/32 [02:13<00:37, 5.38s/it]quantized 7/7 layers in the block, loss iter 0: 0.016669 -> iter 0: 0.016669,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.26: 81%|██████████████████████████████ | 26/32 [02:18<00:32, 5.39s/it]quantized 7/7 layers in the block, loss iter 0: 0.020617 -> iter 0: 0.020617,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.27: 84%|███████████████████████████████▏ | 27/32 [02:24<00:26, 5.40s/it]quantized 7/7 layers in the block, loss iter 0: 0.028186 -> iter 0: 0.028186,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.28: 88%|████████████████████████████████▍ | 28/32 [02:29<00:21, 5.40s/it]quantized 7/7 layers in the block, loss iter 0: 0.033418 -> iter 0: 0.033418,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.29: 91%|█████████████████████████████████▌ | 29/32 [02:35<00:16, 5.41s/it]quantized 7/7 layers in the block, loss iter 0: 0.045059 -> iter 0: 0.045059,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.30: 94%|██████████████████████████████████▋ | 30/32 [02:40<00:10, 5.40s/it]quantized 7/7 layers in the block, loss iter 0: 0.067032 -> iter 0: 0.067032,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing model.layers.31: 97%|███████████████████████████████████▊ | 31/32 [02:45<00:05, 5.40s/it]quantized 7/7 layers in the block, loss iter 0: 0.118175 -> iter 0: 0.118175,'peak_ram': 31.76GB, 'peak_vram': 20.32GB
Quantizing done: 100%|████████████████████████████████████████████████| 32/32 [02:51<00:00, 5.36s/it]
2026-04-28 20:54:35 INFO device.py L1766: 'peak_ram': 31.76GB, 'peak_vram': 20.32GB
2026-04-28 20:54:35 INFO base.py L2042: quantization tuning time 171.364164352417
2026-04-28 20:54:35 INFO base.py L2061: Summary: quantized 224/225 in the model, unquantized layers: lm_head
2026-04-28 20:54:35 WARNING base.py L3584: Support for exporting activation quantization is limited. Please ensure that your configuration is supported.
Writing model shards: 100%|█████████████████████████████████████████████| 1/1 [00:24<00:00, 24.28s/it]
2026-04-28 20:54:59 INFO device.py L1766: 'peak_ram': 31.76GB, 'peak_vram': 20.32GB
2026-04-28 20:54:59 INFO evaluation.py L420: Using lm-eval version 0.4.11.dev0
2026-04-28 20:54:59 WARNING evaluation.py L424: set add_bos_token=True for llama model.
2026-04-28 20:55:02 WARNING evaluation.py L282: This API does not support auto currently, reset eval_bs to 16
pretrained model kwarg is not of type str. Many other model arguments may be ignored. Please do not launch via accelerate or use parallelize=True if passing an existing model this way.
Passed an already-initialized model through pretrained, assuming single-process call to evaluate() or custom distributed integration
100%|███████████████████████████████████████████████████████████| 1838/1838 [00:01<00:00, 1126.38it/s]
Running loglikelihood requests: 100%|█████████████████████████████| 3676/3676 [00:45<00:00, 80.76it/s]

| Tasks | Version | Filter | n-shot | Metric | Value | Stderr |
| --- | --- | --- | --- | --- | --- | --- |
| piqa | 1 | none | 0 | acc | 0.7807 | ± 0.0097 |
|  |  | none | 0 | acc_norm | 0.7965 | ± 0.0094 |

evaluation running time=84s

@lvliang-intel
Contributor Author

/azp run Unit-Test-CUDA-AutoRound

@azure-pipelines

Azure Pipelines successfully started running 1 pipeline(s).
